Category: SQL & Big Data

  • Load SQL Server BCP Data to Hive

    Load SQL Server BCP Data to Hive

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    As you start learning more about Hadoop you may want to take a look at how the same data and queries work for SQL Server and for Hadoop. There are various ways to do this. For now I’ll show you something that utilizes some of your existing SQL Server skills (BCP) and some new Hadoop skills (basic Hadoop FS and Hive commands). There are other methods such as FTP, cURL, and Sqoop that I won’t cover right now. Also, if you want to load data from the Azure DataMarket you can follow these instructions.

Today we’ll walk through making a copy of the FactProductInventory table from AdventureWorksDW2012 on a SQL Server 2012 instance. The TSQL below will generate BCP output commands for a single table in the current SQL Server database context and write tab-delimited data to c:\temp. Comment out the @tablename references in the SELECT and WHERE clauses to generate the script for all tables in the database.

    USE [AdventureWorksDW2012];

    DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname

    SELECT  @servername = @@SERVERNAME

           ,@dbname = DB_NAME()

       ,@outputdir = 'c:\temp\'

       ,@tablename = 'FactProductInventory'

SELECT 'bcp ' + OBJECT_SCHEMA_NAME(object_id) + '.' + name + ' out '

       + @outputdir + OBJECT_SCHEMA_NAME(object_id) + '_' + name + '.dat -b 10000 -d '

       + @dbname + ' -T -c -S ' + @servername

           FROM sys.objects

WHERE type_desc = 'USER_TABLE'

           AND name = @tablename

           ;

    In this case the BCP code generated is (no line break):

bcp dbo.FactProductInventory out c:\temp\dbo_FactProductInventory.dat -b 10000 -d AdventureWorksDW2012 -T -c -S CGROSSBOISE\SQL2012

Paste the BCP command into a Command Prompt and run it.
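If it helps to see how the pieces of the generated command fit together, here is a minimal Python sketch (my addition, not part of the original post) that assembles the same string the TSQL produces. The server name is just the example instance from this walkthrough; substitute your own.

```python
# Illustrative sketch: assemble the bcp "out" command the TSQL generates.
# Flags: -b batch size, -d database, -T trusted connection,
# -c character mode, -S server\instance.
def build_bcp_command(schema, table, outputdir, dbname, servername):
    datafile = f"{outputdir}{schema}_{table}.dat"
    return (f"bcp {schema}.{table} out {datafile} "
            f"-b 10000 -d {dbname} -T -c -S {servername}")

cmd = build_bcp_command("dbo", "FactProductInventory", "c:\\temp\\",
                        "AdventureWorksDW2012", "CGROSSBOISE\\SQL2012")
print(cmd)
```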

    If you have not yet created an Apache™ Hadoop™-based Services for Windows Azure cluster follow these steps to do so (this is a CTP so the exact steps/screens will change over time).

    From your Hadoop cluster portal click on the “Remote Desktop” button and choose to “Save As” the RDP that is generated.

Right click on the RDP file you saved and choose “Edit”. Go to the “Local Resources” tab and click on “More…” under “Local devices and resources”. Add a check mark to “Drives” then click “OK”. Go back to the “General” tab and click on “Save”. Now choose “Connect” to open a remote desktop connection to your Hadoop head node.

    Open the “Hadoop Command Shell”. Copy/paste or type these commands (beware of some editors changing dashes or other characters to non-executable values) to create a directory and copy the data file to your head node. The /y on the copy will overwrite the file if it already exists.

Md c:\data

Copy \\tsclient\C\temp\dbo_FactProductInventory.dat c:\data /y

Dir c:\data

Now from the same prompt load the data into Hadoop HDFS. The fs indicates you are running a file system command. Generally the same commands are available from the “Interactive JavaScript” console in the portal by replacing “hadoop fs -” with “#”. For example, “hadoop fs -lsr /” from a Hadoop Command Shell and “#lsr /” from the JavaScript console both return a recursive list of all directories and files starting at the root (/). Try some variations such as “#ls” (non-recursive contents of the default directory) and “#lsr /user” (recursive list of the user directory).
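The prefix substitution described above is mechanical, so it can be sketched in a few lines. This Python snippet is illustrative only (my addition, not from the original post):

```python
# Sketch of the rule above: the Interactive JavaScript console accepts
# "hadoop fs -" commands with the prefix replaced by "#".
def to_js_console(fs_command):
    prefix = "hadoop fs -"
    if not fs_command.startswith(prefix):
        raise ValueError("expected a 'hadoop fs -' command")
    return "#" + fs_command[len(prefix):]

print(to_js_console("hadoop fs -lsr /"))  # -> #lsr /
print(to_js_console("hadoop fs -ls"))     # -> #ls
```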

hadoop fs -put c:\data\dbo_FactProductInventory.dat /demo/demo.dat

    hadoop fs -lsr /

Now launch the command line version of Hive (you can alternatively use the Interactive Hive console in the portal, but I’m showing you the automatable version) and add metadata to the HDFS data. Note that the CREATE EXTERNAL TABLE statement wraps in the window; there is no line break. Because I chose to use the EXTERNAL keyword, the data stays in its original HDFS location and will not be deleted when I drop the Hive table. Because Hive has its own set of data types, the type names are not exactly the same as they were in SQL Server. Basically I generated the script for this table from SSMS then made a few changes: I removed the constraints and indexes, changed date and money to string, and removed the brackets and the “dbo.” schema qualifier. If you don’t copy the empty line under the last command you will have to hit Enter for the last command to complete.

hive

CREATE EXTERNAL TABLE FactProductInventory(ProductKey int,DateKey int,MovementDate string,UnitCost string,UnitsIn int,UnitsOut int,UnitsBalance int) COMMENT 'Hive Demo for #24HOP' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo';

    quit;
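The manual type mapping described above (date and money become string, int stays int) can be sketched as a simple lookup. This Python snippet is illustrative only (my addition); it rebuilds the column list used in the CREATE EXTERNAL TABLE statement from the SQL Server column types:

```python
# Hedged sketch of the manual edits described above: map the SQL Server
# column types of FactProductInventory to Hive types. Hive at the time
# had no date or money types, so both become string.
SQL_TO_HIVE = {"int": "int", "date": "string", "money": "string"}

columns = [("ProductKey", "int"), ("DateKey", "int"),
           ("MovementDate", "date"), ("UnitCost", "money"),
           ("UnitsIn", "int"), ("UnitsOut", "int"), ("UnitsBalance", "int")]

hive_cols = ",".join(f"{name} {SQL_TO_HIVE[t]}" for name, t in columns)
print(hive_cols)
```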

     Now let’s run some queries. You can either start Hive again from the command line or run it from the Interactive Hive portal in the GUI.

select * from FactProductInventory where ProductKey = 230 and DateKey = '20080220';

     The output on my single data node Hadoop cluster looks like this (the line starting with 230 is the actual result set):

    Total MapReduce jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks is set to 0 since there’s no reduce operator

    Starting Job = job_201209281938_0013, Tracking URL = http://10.119.36.106:50030/jobdetails.jsp?jobid=job_201209281938_0013

Kill Command = c:\Apps\dist\bin\hadoop.cmd job  -Dmapred.job.tracker=10.119.36.106:9010 -kill job_201209281938_0013

    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

    2012-09-28 23:27:45,271 Stage-1 map = 0%,  reduce = 0%

    2012-09-28 23:27:58,301 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:27:59,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:00,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:01,332 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:02,347 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:03,363 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:04,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:05,394 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:06,410 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:07,425 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.187 sec

    MapReduce Total cumulative CPU time: 3 seconds 187 msec

    Ended Job = job_201209281938_0013

    MapReduce Jobs Launched:

    Job 0: Map: 1   Accumulative CPU: 3.187 sec   HDFS Read: 31077011 HDFS Write: 38 SUCESS

    Total MapReduce CPU Time Spent: 3 seconds 187 msec

    OK

    230     20080220        2008-02-20      20.3900 0       0       4

    Time taken: 48.465 seconds

Note that if you run this same query in SSMS you will see an instant response, but here in Hadoop it took 48 seconds of total time. No matter how many times you run this query you will see approximately the same execution time because the result set is not cached. This shows that we have used Hadoop in an anti-pattern way – we went after a single row of data. Filtered row sets are a strength of well-indexed relational systems, while querying entire, very large, unindexed data sets is a strength of Hadoop. Hive generates MapReduce code, and that MapReduce code goes through the same steps each time to find the data, distribute job tasks across the data nodes (map), and then bring the result sets back (reduce). The cumulative CPU time once it actually executes the map phase is still over 3 seconds. I chose this example both to illustrate that point and because it gives you data you are familiar with to ease you into the Hadoop and Hive worlds.

    If you want to remove the metadata from Hive and the data you’ve just loaded from Hadoop HDFS execute these steps from a Hadoop Command Shell:

    hive

    drop table FactProductInventory;

    quit;

    hadoop fs -rmr /demo/demo.dat

     Now you know one way to copy data from SQL Server to Hadoop. Keep exploring Hadoop and keep buzzing about Hive!

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • What’s all the Buzz about Hadoop and Hive?

    What’s all the Buzz about Hadoop and Hive?

    Why it Matters for SQL Server Peeps

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    On September 20, 2012 we have another 24 Hours of PASS event! This PASS Summit Preview will give you a taste of what is coming at this year’s PASS Summit. There are 190+ technical sessions this year at the Summit, and you’ll get a preview of 24 of them at the #24HOP event tomorrow! Come hear about some of the hottest topics and features in the SQL Server, BI, and data world.

    One of the big buzzwords over the last year or so is Hadoop, and the most familiar part of Big Data and Hadoop to most SQL Server professionals is Hive. Do you wonder what it is and why you should jump in now while it’s still new and growing by leaps and bounds? I have just the #24HOP session for you!

    #24HOP: What’s all the Buzz about Hadoop and Hive? – Why it Matters for SQL Server Peeps

    Everyone is buzzing about Hive and trumpeting the virtues of Hadoop. But what does it mean? Why does it matter to a SQL Server and/or BI professional? Come get a taste of the Hive honey and see why this new technology is worth buzzing about!

    During this talk I’ll give a very high level overview of Big Data, Hadoop, and Hive (for the nitty gritty details come to the Summit!). I’ll also go through why Hive matters in the SQL Server world, what a SQL Server Peep might end up doing in a Hive world, and why it is important for you as a SQL Server Peep to jump in and get your feet wet with Hive now.

    Once you’ve heard this #24HOP talk I hope you’ll be fired up about Hive and more anxious than ever to sign up for the  PASS Summit to learn even more about Hadoop, Hive, Big Data, and all things BI and SQL Server. I’ll be co-presenting at the Summit with SQL Server PM Dipti Sangani:

    SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]
    Session Category: Regular Session (75 minutes)
    Session Track: BI Platform Architecture, Development & Administration
    Speaker(s): Cindy Gross, Dipti Sangani

    Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

    After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

    UPDATE 9/28/12 – demo steps to load the AdventureWorks data to Hive are available at http://blogs.msdn.com/b/cindygross/archive/2012/05/07/load-data-from-the-azure-datamarket-to-hadoop-on-azure-small-bites-of-big-data.aspx.

    24HOPFall2012HiveBuzz.zip

  • How to Install the PowerShell Cmdlets for Apache™ Hadoop™-based Services for Windows

    How to Install the PowerShell Cmdlets for Apache™ Hadoop™-based Services for Windows

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED JUNE 2013 – The very early version of PowerShell cmdlets I discussed below have been replaced – see Managing Your HDInsight Cluster with PowerShell http://blogs.msdn.com/b/carlnol/archive/2013/06/07/managing-your-hdinsight-cluster-with-powershell.aspx

We have a cool new addition to Microsoft’s Apache™ Hadoop™-based Services for Windows – new PowerShell cmdlets! The initial readme.txt is not very detailed, so I went through the installation and initial use step by step to help get you started. Enjoy this new way to remotely administer your Hadoop cluster!

    1)      Log in to your Windows client (the location where you plan to do remote administration of your Hadoop cluster) with a local administrator account.

2)      Download the Hadoop PowerShell Cmdlets zip file from your Apache™ Hadoop™-based Services for Windows cluster: https://YourHadoopCluster.cloudapp.net/Home/Downloads (substitute your actual Hadoop cluster name for “YourHadoopCluster”).

3)      Unzip the files to C:\Windows\Microsoft.NET\Framework64\v4.0.30319 (assuming an x64 OS and an installation on C:).

    4)      Change PowerShell to use .NET 4.0 (by default it uses an older version that does not work with these cmdlets)

    • Open “Windows PowerShell” (choose the x64 version – the one that does not show (x86) after the name).
    • Type $pshome and capture the result, such as “C:\Windows\System32\WindowsPowerShell\v1.0”. This directory is for an x64 installation of PowerShell; the SysWOW64 directory is for x86/32-bit.
    • Close PowerShell.
    • Open a command prompt or Windows Explorer with “Run as administrator” and navigate to the directory you found above (i.e. C:\Windows\System32\WindowsPowerShell\v1.0)
    • If no powershell.exe.config file exists create an empty file with that name in that directory. Edit powershell.exe.config to add the following section:
      <?xml version="1.0"?>
      <configuration>
      <startup useLegacyV2RuntimeActivationPolicy="true">
        <supportedRuntime version="v4.0.30319"/>
        <supportedRuntime version="v2.0.50727"/>
      </startup>
      </configuration>

    5)      Install the SDK cmdlets

    • Open a “Command Prompt” with the “Run as administrator” option. Go to your .NET 4.0 directory, which will be something like C:\Windows\Microsoft.NET\Framework64\v4.0.30319.
    • Run:
      installutil.exe Isotope.Sdk.Cmdlets.dll
    • Review the output to verify there were no errors during the install (look for things like commit instead of rollback and no errors).

    6)      Open a new Windows PowerShell window (make sure you choose the x64 version on an x64 OS) and type 

                Add-PSSnapin IsotopeSdkPSSnapIn

    Verify it completes successfully.

    7)      Type a set of commands in the PowerShell window to verify the cmdlets work. For example set which cluster the other commands apply to (change to your actual names/password), check that the settings are correct, then get a file listing or the user directories:

    Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>
    Get-ClusterSettings
    Get-FileListFromHdfs -HdfsPath hdfs:///user/

8)      Once the cmdlets are installed and you have verified they work, you can add the snap-in to your profile so you don’t have to add it every time you open PowerShell. Details on PowerShell profiles can be found here; a summary is below.

    a.       Open PowerShell with “Run as administrator” (required if you are changing the execution policy or creating a new profile)

    b.      Allow scripts you’ve self-created to be run without being signed (or you can choose to sign the ps1 you create later):

    Set-Executionpolicy -ExecutionPolicy RemoteSigned

    c.       See if a Profile already exists: test-path $profile

    d.      If no Profile exists, make a new one that’s available to all users (this assumes x64):

new-item -path $env:windir\system32\WindowsPowerShell\v1.0\profile.ps1 -itemtype file -force

    e.      Edit the file you just created. Open Notepad with “Run as administrator”, paste in:

    Add-PSSnapin IsotopeSdkPSSnapIn

    f.        Optionally add the default Hadoop cluster (use your actual cluster name, username, password)

    Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>

g.       Save as C:\Windows\System32\WindowsPowerShell\v1.0\profile.ps1 (choose “all files” so it does not append “.txt”).

h.      The next time you open PowerShell you will not have to add the snap-in or set the default cluster you want to manage.

    The full list of cmdlets is in the readme.txt. As of this time the list includes:

    Set-ClusterSettings -ClusterDns <hadoopcluster.cloudapp.net or hadoopcluster> -UserName <username> -Password <password>

    Get-ClusterSettings

New-HiveJob -Type query -Execute "SELECT * from HiveSampleTable limit 10;" [-Define <# delimited key=value pairs>]

New-MapReduceJarJob -Jar c:\apps\dist\hadoop-examples-1.1.0-SNAPSHOT.jar -Arguments "pi 16 100" [-Define <# delimited key=value pairs>]

New-MapReduceStreamingJob -Input "/example/data/gutenberg/davinci.txt" -Output "/example/data/streamingoutput/wc.txt" -Mapper cat.exe -Reducer wc.exe -File "hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe" [-Define <# delimited key=value pairs>]

    Get-JobStatus -JobId <isotope-jobId>

    Get-FileListFromHdfs -HdfsPath hdfs:///user/

    Copy-File -LocalPath <file> -HdfsPath hdfs:///<path>

    Get-JobHistoryList -NumberOfItems 2

    Get-JobHistory -JobId <isotope-jobId>

    Get-JobHistoryCount

    Set-AsvCredentials -AccountName <accountname> -AccountKey <accountkey>

    Get-AsvCredentials

    Remove-AsvCredentials

    Get-S3Credentials

    Set-S3Credentials -AccessKeyId <accesskeyid> -AccessKey <accesskey> -IsS3N <0|1>

    Remove-S3Credentials

     

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • Hadoop Hive Error: Could not connect client socket, timed_out

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

With the Hadoop on Azure CTP, when you create a Hadoop cluster it expires after a few days to free up the resources for other CTP users. Therefore each time I do a demo or test I am likely to create a new Hadoop cluster, and there are a few settings that are easy to forget. For example, today I spun up a cluster and tried to use the Hive Pane add-in from Excel. I entered the connection information and hit the “OK” button.


    Instead of seeing the expected option to choose the Hive table I saw this error:


    Text version:

    Error connecting to Hive server. Details:

    SQL_ERROR Failed to connect to (cgross.cloudapp.net:10000): Could not connect client socket. Details: <Host: cgross.cloudapp.net Port: 10000> endpoint: 168.62.107.55, error: timed_out

    While there are very likely many possible reasons for this error, I’ve done this enough times to immediately realize I never opened the ODBC Server port on this particular cluster. I opened port 10000 on the Hadoop cluster via the “Open Ports” tile and tried again. Success! I can now query my Hive data!
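If you want to rule out a blocked or unopened port before digging further, a raw TCP connection test works from any machine with network access. Here is a minimal Python sketch (my addition, not from the original post); the commented-out call uses the example cluster name from the error above:

```python
import socket

# Return True if a TCP connection to host:port succeeds within timeout.
# A timeout or refusal here points at the port (e.g. the ODBC Server
# port 10000 not being opened) rather than at the Hive driver.
def port_open(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (the cluster from the error above):
# port_open("cgross.cloudapp.net", 10000)
```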


    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

    Other Small Bites of Big Data: http://blogs.msdn.com/b/cindygross/archive/tags/hadoop/

  • Load Data from the Azure DataMarket to Hadoop on Azure – Small Bites of Big Data

    Load Data from the Azure DataMarket to Hadoop on Azure

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced by HDInsight Preview which has a different interface and different functionality. Getting Started with Windows Azure HDInsight Service http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/

The first step in analyzing data is acquiring that data. You may have data available from logs, databases, or other internal sources. Or you may need to export data from an external source. In Hadoop on Azure we make it easy to load data from data sources such as the Azure DataMarket, Amazon S3, and Windows Azure Blob Storage (ASV).

    Preparing Hadoop On Azure for a data load

    Find your data source

    • Find data on the Marketplace that meets your needs. For example, I looked for free demographic data.

         

    • Some of the data marked as free has limits on how much is available without payment. I subscribed to the “Gender Info 2007 – United Nations Statistics Division” because it was free, had a significant amount of data, had unlimited transactions/month, and is easily understandable.

     

    • You can click through on the data link for more information on the source, columns, and other details.

     

    • When you click on the “use” button you get a screen that lets you query the data. You will use information from this screen to load the data into Hadoop.

    • Leave this browser window open and stay logged in to the DataMarket.

    Load the Data

    • Log in to your cluster on HadoopOnAzure.com and click on the Manage Cluster tile.

    • Click on the DataMarket tile

    • Go to the DataMarket page for your data.
      • For the gender data, you need to change the “Query” value under “BUILD QUERY” from “DataSeries” to “Values”.
      • Next to “Primary Account Key” click on “Show”. This will expose a key value. Cut/paste that value into the “passkey” field of “Import from Data Market” on HadoopOnAzure.
      • Copy the “URL for current expressed query” into the “Query” field on HadoopOnAzure. Erase the end of the query ($top=100) that limits the results to the first 100 rows. For the gender data the query is https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values?$top=100; change that to https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values in the query field.
      • In HadoopOnAzure enter your DataMarket user name in the “User name” field and a unique new table name in the “Hive table name” field. If you use the name of a table that already exists in this Hadoop cluster you will get an error.
      • Click on the “Import Data” button to start the data load.
      • You can now leave the screen and let the data load complete in the background.

     

    • You can check on the status of the data load by clicking on the “Job History” tile.

    • You can click through on the job to get more details.

    • Once the data load is complete, you can query the table from the Interactive Console.

    • It opens to the JavaScript console; you will need to click on the Hive button to get to the “Interactive Hive” screen.

    • Enter a SQL query in the white box at the bottom and then hit “Evaluate”. The HiveQL (HQL) language is based on standard SQL and is closer to the MySQL implementation than to SQL Server’s TSQL implementation. Most of the syntax is the same as TSQL but there are some differences. For example, execute: select * from GenderInfoValuesAll where year = 2000 order by countryname limit 200; This looks like TSQL except that you use limit instead of top.
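The query-trimming step above (removing the $top=100 row limit from the DataMarket URL) can also be done programmatically. This Python sketch is illustrative only (my addition, not part of the original post):

```python
# Sketch: drop the "$top=100" row limit from the "URL for current
# expressed query" before pasting it into the Query field.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def remove_top(url):
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "$top"]
    return urlunsplit(parts._replace(query=urlencode(query)))

url = ("https://api.datamarket.azure.com/Data.ashx/UnitedNations/"
       "GenderInfo2007/Values?$top=100")
print(remove_top(url))
```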

    Now your data is loaded and you’ve done a basic test to verify the data looks correct. You’re ready to start using Hive on Hadoop!

    Other Microsoft Hadoop data loading links:

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly. Screenshots are from the May 2012 version of http://HadoopOnAzure.com.

  • 24HOP/SQLRally – Fitting Microsoft Hadoop Into Your Enterprise BI Strategy

    24HOP/SQLRally – Fitting Microsoft Hadoop

    Into Your Enterprise BI Strategy

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    The world of #bigdata and in particular #Hadoop is going mainstream. At 24HOP 2012 I talked about how a SQL Server professional fits into this big data world.

    Hadoop generally falls into the NOSQL realm. NOSQL uses basically available, soft-state, eventual consistency (BASE) instead of requiring immediate, 100% consistency via ACID properties. This allows certain types of queries/operations to be much faster since locking and consistency rules don’t have to be followed. This works well when any given program/query looks at the entire dataset and does not try to filter the data or query a subset of it. Basically with Hadoop you take data that may not have a known, consistent structure and impose a structure on it for the duration of your MapReduce program (query). This is all based on the concept of scaling out rather than up, with new commodity (low end enterprise server) hardware being added to an existing Hadoop cluster to meet dynamic demands. With Hadoop you are spreading hardware costs out over a longer time, possibly paying less overall for hardware, and shifting your IT costs to different departments and parts of the lifecycle.

    Hive is a database which sits on top of Hadoop’s HDFS (Hadoop Distributed File System). The data stays in HDFS but Hive stores metadata about the imposed structure and may store intermediate files. HiveQL looks a lot like TSQL and like most SQL languages makes an effort to align with the SQL standard when it makes sense. HiveQL will generate (usually multiple) MapReduce jobs to take an entire large result set and allow easier use of filters, joins, and aggregates. It’s still going to be slower than a relational database for most filters and joins but it allows access to data that may be too expensive (for whatever reason) to put into a relational or multi-dimensional database at this point in the lifecycle.

You may keep your source data outside HDFS and bring it in only for the duration of a project. You can sqoop data to/from a relational database (there is a Sqoop driver for SQL Server and PDW), pull data from blob stores like Windows Azure Blob Storage and Amazon S3, use secure FTP, or query OData sources like the Windows Azure DataMarket. The Hive ODBC driver, often via the Excel Hive Add-in, and Sqoop can be used to pull data from Hadoop into other tools or systems.

So far I’ve been talking as if big data = Hadoop. However, that’s not really true. Big data is data that is too big, expensive, or complex to process using your existing environment (often traditional RDBMSs with expensive SANs and high-end hardware). While MapReduce, most commonly via Hadoop, is a very common solution it’s not the only one. There are streaming technologies like StreamInsight and HStreaming, machine learning like Mahout, massively parallel processing databases like PDW, and more. Generally big data at least starts out as unstructured or semi-structured, or perhaps of variable structure. It may flow very quickly, with the need to process large amounts of data in a very small time window where decisions have to be made in real time.

    At its core Hadoop has the file system HDFS which sits on top of the Windows or Linux file system and allows data to be mapped over many nodes in a Hadoop cluster. A head node maintains metadata about where each piece of data resides. MapReduce programs “map” the data to the many nodes then reduce the output based on the required logic, resulting in an output file that is the end result of the program/query. Other pieces of the Hadoop ecosystem may include things like Hive, HBase, HCatalog, Lucene, Mahout, Zookeeper and more.

So when would you use Hadoop? It’s good for exploring data, finding out what you don’t know. If you need to know your unknown unknowns or look for unexpected correlations, Hadoop may be what you need. It allows you to be very flexible and not lock yourself into a rigid schema until after you’ve fully explored your options. Some examples of common use cases are risk modeling, machine learning, cyber security, sentiment analysis, recommendation engines, log analysis, and sensor data. However, don’t think big data is the solution to all your problems. Often it is used to find and define the most useful bits of data or find the most useful algorithms. Then a more traditional design process may be used to create an RDBMS or multidimensional system for faster querying during day to day operations. Think of Hadoop as the foundation for a very specialized type of database that meets very specific needs, not as a replacement for relational. It’s important to note that not every VLDB is a fit for big data. “Big” is somewhat of a misnomer; size is only one of the characteristics and is relative to what your environment is prepared to handle.

    Often you ask about the 4 “Vs” when deciding whether to use Hadoop – volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don’t know what I don’t know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, or unused data.

    • Volume = More data than the current environment can handle with vertical scaling; you need to make use of data that is currently too expensive to use
    • Velocity = Small decision window compared to data change rate, ask how quickly you need to analyze and how quickly data arrives
    • Variety = Many different formats that are expensive to integrate, probably from many data sources/feeds
    • Variability = Many possible interpretations of the data

Microsoft is taking the existing Apache Hadoop code and making sure it runs on Windows. We have contributed HDFS, MapReduce, and Hive code back to the Apache open source community, and more is on the way. We are adding things like the Excel add-in for the Hive ODBC driver, JavaScript, cloud (http://HadoopOnAzure.com), C# samples, etc. Where Microsoft is often seen as adding value is in making Hadoop more available and allowing the reuse of existing skill sets. We offer self service, visualization, easier data movement, elasticity, and familiarity for faster completion of projects. This allows data scientists (people who know the business and statistics first and tech second), data teams, information workers, and anyone seeking insights through BI to do their job better and faster.

    We offer visualization through PowerPivot, Power View, and the Excel Hive ODBC Add-in. For many used to Hadoop on Linux these things are new and exciting, even though they seem familiar to many in the SQL Server world.

    In the end, big data is really all about discovering insights and then taking action. If you don’t act on what you find, why did you spend all that effort mining the data? Sometimes the action is to go back and refine your process or change the way you looked at the data, but the end goal should be to make decisions that impact your business. Big data should add value to the business proposition.

    References

My full presentation from #24HOP can be downloaded below. There are notes on many of the slides with further details. The recording from 24 Hours of PASS will be available at http://www.sqlpass.org/LearningCenter/SessionRecordings/24HoursSpring2012.aspx as soon as it is posted. Download their player then choose the session by title – “Fitting Microsoft Hadoop into your Enterprise BI Strategy (Cindy Gross)”.

    I updated the attachment to include the updated slides from my May 11, 2012 talk at #SQLRally. There are few changes other than the background.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    MSHadoopEntBIStrategy24HOPandSQLRally2012.zip

  • Using the Pi Estimator Sample on HadoopOnAzure CTP – Small Bites of Big Data

    Using the Pi Estimator Sample on HadoopOnAzure CTP

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced with HDInsight Preview. See: How to Run the HDInsight Samples http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-run-samples/

     Now that you have created your Hadoop on Azure cluster you can run the sample programs to become familiar with the interface. Click on the “Samples” tile under “Manage your account”.

    The gallery of available samples is growing rapidly.

    We’ll start with the very simple Pi Estimator sample. When you click on the Pi tile you’ll see some information about the sample (scroll down to see more). You can download the files and review them in as much detail as you want. You can review the PiEstimator.java file in notepad. The .jar file is a zip file that contains many other files. You can unzip it with many compression utilities then explore the files at your convenience. The description tells us the first parameter indicates how many maps to create (default of 16) and the second indicates how many samples per map (10 million by default). For now, we’ll take advantage of the Azure portal’s simplification of the complexities of Hadoop and click on “Deploy to your cluster”.
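    The idea behind the sample is easy to sketch. As a rough illustration only (not the sample’s actual implementation – Hadoop’s PiEstimator uses quasi-random Halton-sequence points rather than a pseudo-random generator), each map draws points in the unit square and counts how many land inside the quarter circle; since that region has area π/4, the ratio of hits to samples approximates π/4:

    ```python
    import random

    def estimate_pi(num_maps, samples_per_map, seed=42):
        """Simplified sketch of the Pi Estimator sample: draw random points
        in the unit square and count how many fall inside the quarter
        circle. pi is approximately 4 * inside / total."""
        rng = random.Random(seed)
        total = num_maps * samples_per_map
        inside = 0
        for _ in range(total):
            x, y = rng.random(), rng.random()
            if x * x + y * y <= 1.0:
                inside += 1
        return 4.0 * inside / total

    if __name__ == "__main__":
        # Mirrors the sample's defaults of 16 maps x 10 million samples,
        # scaled down so it runs in a moment.
        print(estimate_pi(16, 10_000))
    ```

    The total sample count (maps × samples per map) is what governs the precision of the estimate, which is why the 8 × 1000 run shown below produces fewer accurate digits than the 16 × 10,000,000 default.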

    This brings up a screen to create the Hadoop job. You can modify the job name and parameters if you like. The “final command”, in this case “hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000”, can be copied to a command prompt if you prefer to execute the job from the command line later.

    Click on “Execute job”.  It may run for a minute or two, or perhaps longer if the CTP system is being heavily used. As it progresses entries are added to the “Output (stdout)” and “Debug Output (stderr)” sections. Eventually you will see “Status: Completed Successfully” under “Job Info”. I highlighted the runtime (“Job Finished in 66.123 seconds”) and output (“Estimated value of Pi is 3.14159155000000000000”).

     

    You can see that there were 16 maps created because that’s the first parameter we passed in to the jar command. If I change it to 8 maps and 1000 samples per map the command is now generated as “call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 8 1000” and the output has fewer significant digits even though the runtime is nearly the same:

    Job Finished in 57.061 seconds

    Estimated value of Pi is 3.14100000000000000000

    When you go back to the main portal screen you now see the “Pi Example” tile. If the job is currently running you will see “…in progress…”. Once it finishes the tile will show “Complete”.

         

    To view job results, click on the “Job History” tile under “Manage your account”.

    If you click on any history row it will take you to the same output you see if you are looking at the job interactively.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • Open Ports for HadoopOnAzure CTP – Small Bites of Big Data

    Open Ports for HadoopOnAzure CTP

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced by HDInsight Preview. See Troubleshooting ODBC connectivity to HDInsight http://social.msdn.microsoft.com/Forums/en-US/hdinsight/thread/b4ca52ea-f7cf-420c-959d-53e09f801f7d.      

    Once you have created your Hadoop on Azure cluster you will likely be moving data in and out of the system. That means you need to open the ports in Azure. By default the two ports exposed through the Metro interface are both locked. If you try an ODBC connection to the cluster while the “ODBC Server” port is closed, the error will include the words “Could not connect client socket”.

    Click on the “Open Ports” tile to open the “Configure Ports” dialog.


    To run Hive or other ODBC queries, open the “ODBC Server” port 10000. If you plan to use FTP, open the FTPS port 2226.
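    Before digging into driver configuration, it can save time to verify the port is actually reachable. Here is a minimal sketch of a TCP reachability check; the cluster name `example.cloudapp.net` is a hypothetical placeholder for your own `<name>.cloudapp.net`:

    ```python
    import socket

    def port_is_open(host, port, timeout_seconds=5):
        """Return True if a TCP connection to host:port succeeds.
        Handy for confirming the ODBC Server (10000) or FTPS (2226)
        port is open before troubleshooting driver-level errors such as
        'Could not connect client socket'."""
        try:
            with socket.create_connection((host, port), timeout=timeout_seconds):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        # Substitute your own cluster name for this hypothetical one.
        for port in (10000, 2226):
            print(port, port_is_open("example.cloudapp.net", port))
    ```

    If this returns False for port 10000, the problem is the port configuration (or a firewall in between), not the Hive ODBC driver setup.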


    Click on the “Windows Azure” tile at the top to go back to the main portal screen. If you need to open other, less commonly used ports you can use the Remote Desktop icon to connect directly to the VM and make the changes on the server.


    I will cover how to connect via Remote Desktop in a separate post. Once there, use whatever firewall or other port-management software is installed to open the needed ports.

    I hope you’ve enjoyed this small bite of big data! Look for more big data blog posts soon.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • Creating a Cluster for HadoopOnAzure CTP – Small Bites of Big Data

    Creating a Cluster for HadoopOnAzure CTP

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced with HDInsight Preview which has a different interface: Getting Started with Windows Azure HDInsight Service http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/

    Are you ready to dive into this “Big Data” thing you keep hearing about? A good way to get started without having to understand and install all the HDFS, MapReduce, and other pieces yourself is to join the Hadoop CTP. It’s a very popular program so once you sign up you may have a short wait before you are given an account. Any cluster you create is time-bombed to free up unused resources for other eager CTP participants. This post will assume you’ve been granted a CTP account to use on https://www.hadooponazure.com/.

    1)      Sign in to your Hadoop on Azure account. Go to https://www.hadooponazure.com/ and click on the “Sign in” button.

     

    2)      If you don’t currently have a cluster allocated, you will be taken to a screen where you can request one.  Note that the saying at the top of the screen will change randomly each time you navigate to a screen.

    • Choose a unique name for your cluster; .cloudapp.net will be appended. My cluster is called cgross.cloudapp.net.
    • If you are just playing around be respectful of the fact that this is a very popular CTP and choose the “Small” cluster size (currently 4 nodes, 2 TB disk space).
    • Choose a Cluster login and password. I use a different login than my cluster name, cgross1.
    • You can choose to use SQL Azure for your Hive Metastore but I’m going to skip that for the sake of simplicity.
    • Choose “Request cluster”

     

    3)      It will take several minutes for your cluster VMs to be created and allocated.

     4)      After a few minutes you will see a message that your cluster is ready for use. In the upper left you can see how long you have before the cluster expires. When you reach about 6 hours left you can choose the “Extend now” button to keep your cluster longer. You can also choose to “Release cluster” when you are finished to free up those resources for other CTP participants.

       

    Explore the various tiles and what you can do in each. You may want to start with the “Downloads” tile and go through the “How-To and FAQ”. If you are idle for a while you will be logged out of the site, but as long as the cluster has not expired everything will still be there once you log back in. The “Job History” persists across cluster builds; it is not cleared out when a cluster is released or expires.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • SQL Server 2012 RC0 available

    Release Candidate 0 (RC0) of SQL Server 2012 is available for download! That means you can start to experience a feature complete version of the next great SQL release. 🙂 We’re still looking at the first half of calendar year 2012 for the RTM release of SQL Server 2012.

    FURTHER.  FORWARD.  FASTER.

    Downloads:

    Greater availability. Deliver the required nines and data protection with AlwaysOn, which adds functionality over CTP3: customers can experience multiple readable secondaries for distributed scale of reporting and backup workloads, plus support for FileTable and FILESTREAM, which brings first-class HA to complex data types.

    Blazing-fast performance. More effectively govern performance in multi-tenancy environments like private cloud. Resource Governor enhancements include support for 64 resource pools, greater CPU usage control, and resource pool affinity for partitioning of physical resources and predictable resource allocation.

    Rapid data exploration. Empower end users with new insights through rapid data exploration and visualization. Discover new insights at the speed of thought with more feature/functionality in Power View, the highly interactive, familiar browser-based data exploration, visualization, and presentation experience for end users.

    Credible, consistent data. In addition to CTP3 functionality delivered for Data Quality Services and Master Data Services, customers can better support heterogeneous data within Data Warehouses through new Change Data Capture (CDC) support for Oracle.

    Optimized productivity. Optimize IT and developer productivity across server and cloud with the new SQL Server Data Tools, a unified development experience for database and business intelligence projects, and cloud-ready capabilities within SQL Server Management Studio for snapshot backups to the Windows Azure Platform. Additionally, SQL Server 2012 offers a new version of Express – LocalDB. SQL Express LocalDB is a lightweight version of Express with all its programmability features, yet runs in user mode with a fast, zero-configuration installation and short list of pre-requisites. For more information and to try it out, go here.

     Licensing:

     SQL Server 2012 Wallpaper/backgrounds:

     More about SQL Server 2012:

    If you don’t already use it, now might be a good time to run the Microsoft Assessment and Planning (MAP) Toolkit. If you load the sample MAP database and go to File > Prepare New Reports and Proposals you can see what typical SQL inventory reports look like. The output includes version, service pack level, edition, whether it’s clustered, the number of processors and cores, and system memory. You would need a SQL database to store your own inventory results.

     Microsoft Assessment and Planning (MAP) Toolkit http://technet.microsoft.com/en-us/solutionaccelerators/dd537566.aspx

     Data Sheet:

    MAP provides you with actionable information from the start. In a matter of hours, MAP helps provide you with answers to key questions like:

    • Which of my PCs can run Windows 7, Microsoft® Office 2010, and Office 365?
    • Are my web application portfolios and databases ready to migrate to the Windows Azure platform? What is the TCO/ROI?
    • What next steps should I take to plan a private cloud?
    • Which of my servers are capable of migrating to Windows Server 2008 R2, or can be virtualized using Hyper-V?
    • Where can I find Microsoft® SQL Server® 2000, MySQL, Oracle, and Sybase instances for migration to Microsoft® SQL Server® 2008 R2?
    • Is our software usage in compliance with software license agreements?