Befriending Dragons

Transform Tech with Anti-bullying Cultures



Big Data – All Abuzz About Hive at #SQLPASS Summit 2012

Small Bites of Big Data

Cindy Gross, SQLCAT PM

I hope to see you at the #SQLPASS Summit 2012 this week!

There are many reasons people come to the PASS Summit – SQL friends, SQL family, networking, great content in 190 sessions, the SQL clinic, the product team, CSS, SQL CAT, MVPs, MCMs, SQL Community, and sometimes just to get away from the daily grind of work. All those are great reasons, and they all make for a great Summit.

This year I am focusing on BI and Big Data. For the Summit this year I will be introducing you to Hive. Hive is a great way to leverage your existing SQL skills and enter into the world of #BigData. Big Data is here in a big way. CIOs are pushing it, business analysts want to use it, and everyone wants to gain new insights that will help their business grow and thrive. Don’t let Big Data pass you by, leaving you wondering how you missed out. Hive is fun and interesting, and for SQL Pros it looks very familiar. Come to my talk and come by the SQL Clinic to ask questions throughout the week. Please come up and introduce yourself at any time; I love to meet new SQL Peeps!

BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
Wednesday 10:15am   |   Cindy Gross, Dipti Sangani, Ed Katibah

Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

And there’s more! Here is a sampling of blog posts about this year’s summit:

After the Summit, you can still stay involved. Follow some SQL Peeps on Twitter, sign up for a few SQL blog feeds, and buy a book or two. Attend local events like SQL Saturdays and User Group meetings. Reach out to your fellow SQL-ites and stay in touch with those you meet. Keep SQL fun and interesting and share what you learn!

See you at the #SQLPASS Summit 2012!

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

PASS2012BIA305AAllAbuzzAboutHive.pptx



Load SQL Server BCP Data to Hive


Small Bites of Big Data

Cindy Gross, SQLCAT PM

As you start learning more about Hadoop you may want to take a look at how the same data and queries work for SQL Server and for Hadoop. There are various ways to do this. For now I’ll show you something that utilizes some of your existing SQL Server skills (BCP) and some new Hadoop skills (basic Hadoop FS and Hive commands). There are other methods such as FTP, cURL, and Sqoop that I won’t cover right now. Also, if you want to load data from the Azure DataMarket you can follow these instructions.

Today we’ll walk through making a copy of the FactProductInventory table from AdventureWorksDW2012 on a SQL Server 2012 instance. The TSQL below generates BCP output commands for a single table in the current SQL Server database context and writes tab-delimited data to c:\temp. Comment out the @tablename references in the SELECT and WHERE clauses to generate the script for all tables in the database.

USE [AdventureWorksDW2012];

DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname

SELECT  @servername = @@SERVERNAME
       ,@dbname = DB_NAME()
       ,@outputdir = 'c:\temp\'
       ,@tablename = 'FactProductInventory'

SELECT 'bcp ' + OBJECT_SCHEMA_NAME(object_id) + '.' + name + ' out '
       + @outputdir + OBJECT_SCHEMA_NAME(object_id) + '_' + name + '.dat -b 10000 -d '
       + @dbname + ' -T -c -S ' + @servername
       FROM sys.objects
       WHERE type_desc = 'USER_TABLE'
       AND name = @tablename
       ;

In this case the BCP code generated is (no line break):

bcp dbo.FactProductInventory out c:\temp\dbo_FactProductInventory.dat -b 10000 -d AdventureWorksDW2012 -T -c -S CGROSSBOISE\SQL2012

Paste the BCP command to a Command Prompt and run it.

If you have not yet created an Apache™ Hadoop™-based Services for Windows Azure cluster, follow these steps to do so (this is a CTP, so the exact steps/screens will change over time).

From your Hadoop cluster portal click on the “Remote Desktop” button and choose to “Save As” the RDP that is generated.

Right-click on the RDP you saved and choose “edit”. Go to the “Local Resources” tab and click on “More…” under “Local devices and resources”. Add a check mark to “Drives” then click “OK”. Go back to the “General” tab and click on “Save”. Now choose “Connect” to open a remote desktop connection to your Hadoop head node.

Open the “Hadoop Command Shell”. Copy/paste or type these commands (beware of some editors changing dashes or other characters to non-executable values) to create a directory and copy the data file to your head node. The /y on the copy will overwrite the file if it already exists.

Md c:\data

Copy \\tsclient\C\temp\dbo_FactProductInventory.dat c:\data /y

Dir c:\data

Now from the same prompt load the data into Hadoop HDFS. The fs indicates you are running a filesystem command from a Hadoop script. Generally the same commands are available from the “Interactive JavaScript” console in the portal by replacing “hadoop fs -“ with “#”. For example, “hadoop fs -lsr /” from a Hadoop Command Prompt and “#lsr /” from the JavaScript console both return a recursive list of all directories and files starting at the root (/). Try some variations such as “#ls” (non-recursive contents of default directory) and “#lsr /user” (recursive list of the user directory).

hadoop fs -put c:\data\dbo_FactProductInventory.dat /demo/demo.dat

hadoop fs -lsr /

Now launch the command-line version of Hive (you can alternatively use the Interactive Hive console in the portal, but I’m showing you the automatable version) and add metadata to the HDFS data. Note that the CREATE EXTERNAL TABLE statement wraps in the window; there is no line break. Because I chose to use the EXTERNAL keyword, the data stays in its original HDFS location and will not be deleted when I drop the Hive table. Since Hive uses different keywords, the data type names are not exactly the same as they were in SQL Server. Basically I generated the script for this table from SSMS and then made a few changes: I removed the constraints and indexes, changed date and money to string, and removed the brackets and the “dbo.” schema qualifier. If you don’t copy the empty line under the last command you will have to hit Enter for the last command to complete. A sketch of the original SQL Server definition follows the Hive commands for comparison.

hive

CREATE EXTERNAL TABLE FactProductInventory(ProductKey int,DateKey int,MovementDate string,UnitCost string,UnitsIn int,UnitsOut int,UnitsBalance int) COMMENT 'Hive Demo for #24HOP' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo';

quit;
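
For comparison, here is a rough sketch of the SQL Server column definitions being mapped to the Hive types above. It is reconstructed from AdventureWorksDW2012 rather than copied from the demo environment, so treat it as an approximation and verify against your own database; the comments note which columns become string in Hive.

-- Approximate AdventureWorksDW2012 definition; constraints and indexes omitted as described above.
CREATE TABLE dbo.FactProductInventory(
    ProductKey    int   NOT NULL,  -- int in Hive
    DateKey       int   NOT NULL,  -- int in Hive
    MovementDate  date  NOT NULL,  -- string in Hive
    UnitCost      money NOT NULL,  -- string in Hive
    UnitsIn       int   NOT NULL,
    UnitsOut      int   NOT NULL,
    UnitsBalance  int   NOT NULL
);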

 Now let’s run some queries. You can either start Hive again from the command line or run it from the Interactive Hive portal in the GUI.

select * from FactProductInventory where ProductKey = 230 and DateKey = '20080220';

 The output on my single data node Hadoop cluster looks like this (the line starting with 230 is the actual result set):

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_201209281938_0013, Tracking URL = http://10.119.36.106:50030/jobdetails.jsp?jobid=job_201209281938_0013

Kill Command = c:\Apps\dist\bin\hadoop.cmd job  -Dmapred.job.tracker=10.119.36.106:9010 -kill job_201209281938_0013

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2012-09-28 23:27:45,271 Stage-1 map = 0%,  reduce = 0%

2012-09-28 23:27:58,301 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:27:59,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:00,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:01,332 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:02,347 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:03,363 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:04,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:05,394 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:06,410 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:07,425 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.187 sec

MapReduce Total cumulative CPU time: 3 seconds 187 msec

Ended Job = job_201209281938_0013

MapReduce Jobs Launched:

Job 0: Map: 1   Accumulative CPU: 3.187 sec   HDFS Read: 31077011 HDFS Write: 38 SUCESS

Total MapReduce CPU Time Spent: 3 seconds 187 msec

OK

230     20080220        2008-02-20      20.3900 0       0       4

Time taken: 48.465 seconds

Note that if you run this same query in SSMS you will see an instant response but here in Hadoop it took 48 seconds of total time. No matter how many times you run this query you will see approximately the same execution time because the result set is not cached. This shows that we have used Hadoop in an anti-pattern way – we went after a single row of data. Filtered row sets are a strength of well-indexed relational systems while querying entire, very large, unindexed data sets is a strength of Hadoop. Hive generates MapReduce code and that MapReduce code goes through the same steps each time to find the data, distribute job tasks across the data nodes (map), and then bring the results sets back (reduce). The cumulative CPU time once it actually executes the map phase is still over 3 seconds. I chose this example both to illustrate that point and because it gives you data you are familiar with to ease you into the Hadoop and Hive worlds.
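
To see Hive playing to its strengths instead, try an aggregate that has to scan the entire table. This is just a sketch against the table created above (your totals and timings will vary), but it is the kind of full-dataset question where the MapReduce overhead is actually buying you something:

-- Full-table aggregation - a better fit for MapReduce than a single-row lookup.
select ProductKey, sum(UnitsIn) as TotalIn, sum(UnitsOut) as TotalOut
from FactProductInventory
group by ProductKey
order by TotalIn desc
limit 10;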

If you want to remove the metadata from Hive and the data you’ve just loaded from Hadoop HDFS execute these steps from a Hadoop Command Shell:

hive

drop table FactProductInventory;

quit;

hadoop fs -rmr /demo/demo.dat

 Now you know one way to copy data from SQL Server to Hadoop. Keep exploring Hadoop and keep buzzing about Hive!

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.



What’s all the Buzz about Hadoop and Hive?


Why it Matters for SQL Server Peeps

Small Bites of Big Data

Cindy Gross, SQLCAT PM

On September 20, 2012 we have another 24 Hours of PASS event! This PASS Summit Preview will give you a taste of what is coming at this year’s PASS Summit. There are 190+ technical sessions this year at the Summit, and you’ll get a preview of 24 of them at the #24HOP event tomorrow! Come hear about some of the hottest topics and features in the SQL Server, BI, and data world.

One of the big buzzwords over the last year or so is Hadoop, and the most familiar part of Big Data and Hadoop to most SQL Server professionals is Hive. Do you wonder what it is and why you should jump in now while it’s still new and growing by leaps and bounds? I have just the #24HOP session for you!

#24HOP: What’s all the Buzz about Hadoop and Hive? – Why it Matters for SQL Server Peeps

Everyone is buzzing about Hive and trumpeting the virtues of Hadoop. But what does it mean? Why does it matter to a SQL Server and/or BI professional? Come get a taste of the Hive honey and see why this new technology is worth buzzing about!

During this talk I’ll give a very high level overview of Big Data, Hadoop, and Hive (for the nitty gritty details come to the Summit!). I’ll also go through why Hive matters in the SQL Server world, what a SQL Server Peep might end up doing in a Hive world, and why it is important for you as a SQL Server Peep to jump in and get your feet wet with Hive now.

Once you’ve heard this #24HOP talk I hope you’ll be fired up about Hive and more anxious than ever to sign up for the  PASS Summit to learn even more about Hadoop, Hive, Big Data, and all things BI and SQL Server. I’ll be co-presenting at the Summit with SQL Server PM Dipti Sangani:

SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]
Session Category: Regular Session (75 minutes)
Session Track: BI Platform Architecture, Development & Administration
Speaker(s): Cindy Gross, Dipti Sangani

Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

UPDATE 9/28/12 – demo steps to load the AdventureWorks data to Hive are available at http://blogs.msdn.com/b/cindygross/archive/2012/05/07/load-data-from-the-azure-datamarket-to-hadoop-on-azure-small-bites-of-big-data.aspx.

24HOPFall2012HiveBuzz.zip



How to Install the PowerShell Cmdlets for Apache™ Hadoop™-based Services for Windows


Small Bites of Big Data

Cindy Gross, SQLCAT PM

UPDATED JUNE 2013 – The very early version of PowerShell cmdlets I discussed below have been replaced – see Managing Your HDInsight Cluster with PowerShell http://blogs.msdn.com/b/carlnol/archive/2013/06/07/managing-your-hdinsight-cluster-with-powershell.aspx

We have a cool new addition to Microsoft’s Apache™ Hadoop™-based Services for Windows – new PowerShell cmdlets! The initial readme.txt is not very detailed, so I went through the installation and initial use step by step to help get you started. Enjoy this new way to remotely administer your Hadoop cluster!

1)      Log in to your Windows client (the location where you plan to do remote administration of your Hadoop cluster) with a local administrator account.

2)      Download the Hadoop PowerShell Cmdlets zip file from your Apache™ Hadoop™-based Services for Windows cluster: https://YourHadoopCluster.cloudapp.net/Home/Downloads (substitute your actual Hadoop cluster name for “YourHadoopCluster”).

3)      Unzip the files to C:\Windows\Microsoft.NET\Framework64\v4.0.30319 (assuming an x64 OS and an installation on C:).

4)      Change PowerShell to use .NET 4.0 (by default it uses an older version that does not work with these cmdlets)

  • Open “Windows PowerShell” (choose the x64 version – the one that does not show (x86) after the name).
  • Type $pshome and capture the result, such as “C:\Windows\System32\WindowsPowerShell\v1.0”. This directory is for an x64 installation of PowerShell; the SysWOW64 directory is for x86/32-bit.
  • Close PowerShell.
  • Open a command prompt or Windows Explorer with “Run as administrator” and navigate to the directory you found above (i.e. C:\Windows\system32\WindowsPowerShell\v1.0)
  • If no powershell.exe.config file exists, create an empty file with that name in that directory. Edit powershell.exe.config to add the following section:
    <?xml version="1.0"?>
    <configuration>
      <startup useLegacyV2RuntimeActivationPolicy="true">
        <supportedRuntime version="v4.0.30319"/>
        <supportedRuntime version="v2.0.50727"/>
      </startup>
    </configuration>

5)      Install the SDK cmdlets

  • Open a “Command Prompt” with the “Run as administrator” option. Go to your .NET 4.0 directory which will be something like C:\Windows\Microsoft.NET\Framework64\v4.0.30319.
  • Run:
    installutil.exe Isotope.Sdk.Cmdlets.dll
  • Review the output to verify there were no errors during the install (look for things like commit instead of rollback and no errors).

6)      Open a new Windows PowerShell window (make sure you choose the x64 version on an x64 OS) and type 

            Add-PSSnapin IsotopeSdkPSSnapIn

Verify it completes successfully.

7)      Type a set of commands in the PowerShell window to verify the cmdlets work. For example, set which cluster the other commands apply to (change to your actual names/password), check that the settings are correct, then get a file listing of the user directory:

Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>
Get-ClusterSettings
Get-FileListFromHdfs -HdfsPath hdfs:///user/

8)      Once the cmdlets are installed and you have verified they work, you can add the SnapIn to the profile so you don’t have to add it every time you open PowerShell. Details on PowerShell profiles can be found here; a summary is below.

a.       Open PowerShell with “Run as administrator” (required if you are changing the execution policy or creating a new profile)

b.      Allow scripts you’ve self-created to be run without being signed (or you can choose to sign the ps1 you create later):

Set-Executionpolicy -ExecutionPolicy RemoteSigned

c.       See if a Profile already exists: test-path $profile

d.      If no Profile exists, make a new one that’s available to all users (this assumes x64):

new-item -path $env:windir\system32\WindowsPowerShell\v1.0\profile.ps1 -itemtype file -force

e.      Edit the file you just created. Open Notepad with “Run as administrator”, paste in:

Add-PSSnapin IsotopeSdkPSSnapIn

f.        Optionally add the default Hadoop cluster (use your actual cluster name, username, password)

Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>

g.       Save as C:\Windows\system32\WindowsPowerShell\v1.0\profile.ps1 (choose “all files” so it does not append “.txt”).

h.      The next time you open PowerShell you will not have to add the snapin or set the default cluster you want to manage.

The full list of cmdlets is in the readme.txt. As of this time the list includes:

Set-ClusterSettings -ClusterDns <hadoopcluster.cloudapp.net or hadoopcluster> -UserName <username> -Password <password>

Get-ClusterSettings

New-HiveJob -Type query -Execute "SELECT * from HiveSampleTable limit 10;" [-Define <# delimited key=value pairs>]

New-MapReduceJarJob -Jar c:\apps\dist\hadoop-examples-1.1.0-SNAPSHOT.jar -Arguments "pi 16 100" [-Define <# delimited key=value pairs>]

New-MapReduceStreamingJob -Input "/example/data/gutenberg/davinci.txt" -Output "/example/data/streamingoutput/wc.txt" -Mapper cat.exe -Reducer wc.exe -File "hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe" [-Define <# delimited key=value pairs>]

Get-JobStatus -JobId <isotope-jobId>

Get-FileListFromHdfs -HdfsPath hdfs:///user/

Copy-File -LocalPath <file> -HdfsPath hdfs:///<path>

Get-JobHistoryList -NumberOfItems 2

Get-JobHistory -JobId <isotope-jobId>

Get-JobHistoryCount

Set-AsvCredentials -AccountName <accountname> -AccountKey <accountkey>

Get-AsvCredentials

Remove-AsvCredentials

Get-S3Credentials

Set-S3Credentials -AccessKeyId <accesskeyid> -AccessKey <accesskey> -IsS3N <0|1>

Remove-S3Credentials

 

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.



Hadoop Hive Error: Could not connect client socket, timed_out

Small Bites of Big Data

Cindy Gross, SQLCAT PM

With the Hadoop on Azure CTP, when you create a Hadoop cluster it expires after a few days to free up the resources for other CTP users. Therefore each time I do a demo or test I am likely to create a new Hadoop cluster. There are a few settings that are easy to forget about. For example, today I spun up a cluster and tried to use the Hive Pane add-in from Excel. I entered the connection information and hit the “OK” button.


Instead of seeing the expected option to choose the Hive table I saw this error:


Text version:

Error connecting to Hive server. Details:

SQL_ERROR Failed to connect to (cgross.cloudapp.net:10000): Could not connect client socket. Details: <Host: cgross.cloudapp.net Port: 10000> endpoint: 168.62.107.55, error: timed_out

While there are very likely many possible reasons for this error, I’ve done this enough times to immediately realize I never opened the ODBC Server port on this particular cluster. I opened port 10000 on the Hadoop cluster via the “Open Ports” tile and tried again. Success! I can now query my Hive data!


I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

Other Small Bites of Big Data: http://blogs.msdn.com/b/cindygross/archive/tags/hadoop/



Load Data from the Azure DataMarket to Hadoop on Azure – Small Bites of Big Data


Small Bites of Big Data

Cindy Gross, SQLCAT PM

UPDATED Jun 2013: HadoopOnAzure CTP has been replaced by HDInsight Preview which has a different interface and different functionality. Getting Started with Windows Azure HDInsight Service http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/

The first step in analyzing data is acquiring that data. You may have data available from logs, databases, or other internal sources. Or you may need to export data from an external source. In Hadoop on Azure we make it easy to load data from data sources such as the Azure DataMarket, Amazon S3, and Windows Azure Blob Storage (ASV).

Preparing Hadoop On Azure for a data load

Find your data source

  • Find data on the Marketplace that meets your needs. For example, I looked for free demographic data.

     

  • Some of the data marked as free has limits on how much is available without payment. I subscribed to the “Gender Info 2007 – United Nations Statistics Division” because it was free, had a significant amount of data, had unlimited transactions/month, and is easily understandable.

 

  • You can click through on the data link for more information on the source, columns, and other details.

 

  • When you click on the “use” button you get a screen that lets you query the data. You will use information from this screen to load the data into Hadoop.

  • Leave this browser window open and stay logged in to the DataMarket.

Load the Data

  • Log in to your cluster on HadoopOnAzure.com and click on the Manage Cluster tile.

  • Click on the DataMarket tile

  • Go to the DataMarket page for your data.
    • For the gender data, you need to change the “Query” value under “BUILD QUERY” from “DataSeries” to “Values”.
    • Next to “Primary Account Key” click on “Show”. This will expose a key value. Cut/paste that value into the “passkey” field of “Import from Data Market” on HadoopOnAzure.
    • Copy the “URL for current expressed query” into the “Query” field on HadoopOnAzure. Erase the end of the query ($top=100) that limits the results to the first 100 rows. For the gender data the query is https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values?$top=100, change that to https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values in the query field.
    • In HadoopOnAzure enter your DataMarket user name in the “User name” field and a unique new table name in the “Hive table name” field. If you use the name of a table that already exists in this Hadoop cluster you will get an error.
    • Click on the “Import Data” button to start the data load.
    • You can now leave the screen and let the data load complete in the background.

 

  • You can check on the status of the data load by clicking on the “Job History” tile.

  • You can click through on the job to get more details.

  • Once the data load is complete, you can query the table from the Interactive Console.

  • It opens to the JavaScript console; click on the Hive button to get to the “Interactive Hive” screen.

  • Enter a SQL query in the white box at the bottom and then hit “Evaluate”. The HiveQL or HQL language is based on standard SQL and is closer to the MySQL implementation than to the SQL Server TSQL implementation. Most of the syntax is the same as TSQL but there are some differences. For example, execute: select * from GenderInfoValuesAll where year = 2000 order by countryname limit 200; This looks like TSQL except you use limit instead of top (see the additional sketch after this list).
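
Here is the additional sketch mentioned above – one more HiveQL query to try once the load completes. It assumes the table name and the countryname/year columns used in the example above; adjust the names if your imported table differs.

-- Rows imported per country for one year (table and column names assumed from the example above).
select countryname, count(*) as cnt
from GenderInfoValuesAll
where year = 2000
group by countryname
order by cnt desc
limit 20;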

Now your data is loaded and you’ve done a basic test to verify the data looks correct. You’re ready to start using Hive on Hadoop!

Other Microsoft Hadoop data loading links:

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly. Screenshots are from the May 2012 version of http://HadoopOnAzure.com.



24HOP/SQLRally – Fitting Microsoft Hadoop Into Your Enterprise BI Strategy


Small Bites of Big Data

Cindy Gross, SQLCAT PM

The world of #bigdata and in particular #Hadoop is going mainstream. At 24HOP 2012 I talked about how a SQL Server professional fits into this big data world.

Hadoop generally falls into the NOSQL realm. NOSQL uses basically available, soft-state, eventual consistency (BASE) instead of requiring immediate, 100% consistency via ACID properties. This allows certain types of queries/operations to be much faster since locking and consistency rules don’t have to be followed. This works well when any given program/query looks at the entire dataset and does not try to filter the data or query a subset of it. Basically with Hadoop you take data that may not have a known, consistent structure and impose a structure on it for the duration of your MapReduce program (query). This is all based on the concept of scaling out rather than up, with new commodity (low end enterprise server) hardware being added to an existing Hadoop cluster to meet dynamic demands. With Hadoop you are spreading hardware costs out over a longer time, possibly paying less overall for hardware, and shifting your IT costs to different departments and parts of the lifecycle.

Hive is a database which sits on top of Hadoop’s HDFS (Hadoop Distributed File System). The data stays in HDFS but Hive stores metadata about the imposed structure and may store intermediate files. HiveQL looks a lot like TSQL and like most SQL languages makes an effort to align with the SQL standard when it makes sense. HiveQL will generate (usually multiple) MapReduce jobs to take an entire large result set and allow easier use of filters, joins, and aggregates. It’s still going to be slower than a relational database for most filters and joins but it allows access to data that may be too expensive (for whatever reason) to put into a relational or multi-dimensional database at this point in the lifecycle.
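
As a purely hypothetical sketch (the table and column names below are made up for illustration), a HiveQL statement like this one – a join plus an aggregate over files sitting in HDFS – is compiled into one or more MapReduce jobs behind the scenes, much like TSQL is compiled into a query plan:

-- Hypothetical tables: weblog (large clickstream files in HDFS) and product (a small lookup table).
select p.category, count(*) as hits
from weblog w
join product p
  on w.productid = p.productid
where w.logdate >= '2012-01-01'
group by p.category;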

You may keep your source data outside HDFS and bring it in only for the duration of a project. You can sqoop data to/from a relational database (there is a sqoop driver for SQL Server and PDW), pull data from blob stores like Windows Azure and Amazon S3, use secure FTP, or query OData sources like the Windows Azure DataMarket. The Hive ODBC driver, often via the Excel Hive Add-in, and sqoop can be used to pull data from Hadoop into other tools or systems.

So far I’ve been talking as if big data = Hadoop. However, that’s not really true. Big data is data that is too big, expensive, or complex to process using your existing environment (often traditional RDBMSs with expensive SANs and high end hardware). While MapReduce, most commonly via Hadoop, is a very common solution, it’s not the only one. There are streaming technologies like StreamInsight and HStreaming, machine learning like Mahout, massively parallel processing databases like PDW, and more. Generally big data at least starts out as unstructured, semi-structured, or perhaps of variable structure. It may flow very quickly with the need to process large amounts of data in a very small time window where decisions have to be made in real time.

At its core Hadoop has the file system HDFS which sits on top of the Windows or Linux file system and allows data to be mapped over many nodes in a Hadoop cluster. A head node maintains metadata about where each piece of data resides. MapReduce programs “map” the data to the many nodes then reduce the output based on the required logic, resulting in an output file that is the end result of the program/query. Other pieces of the Hadoop ecosystem may include things like Hive, HBase, HCatalog, Lucene, Mahout, Zookeeper and more.

So when would you use Hadoop? It’s good for exploring data, finding out what you don’t know. If you need to know your unknown unknowns or look for unexpected correlations, Hadoop may be what you need. It allows you to be very flexible and not lock yourself into a rigid schema until after you’ve fully explored your options. Some examples of common use cases are risk modeling, machine learning, cyber security, sentiment analysis, recommendation engines, log analysis, and sensor data. However, don’t think big data is the solution to all your problems. Often it is used to find and define the most useful bits of data or find the most useful algorithms. Then a more traditional design process may be used to create an RDBMS or multidimensional system for faster querying during day to day operations. Think of Hadoop as the foundation for a very specialized type of database that meets very specific needs, not as a replacement for relational. It’s important to note that not every VLDB is a fit for big data. “Big” is somewhat of a misnomer; size is only one of the characteristics and is relative to what your environment is prepared to handle.

Often you ask about the 4 “Vs” when deciding whether to use Hadoop – volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don’t know what I don’t know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, or unused data.

  • Volume = More data than the current environment can handle with vertical scaling, need to make use of data that is currently too expensive to use
  • Velocity = Small decision window compared to data change rate, ask how quickly you need to analyze and how quickly data arrives
  • Variety = Many different formats that are expensive to integrate, probably from many data sources/feeds
  • Variability = Many possible interpretations of the data

Microsoft is taking the existing Apache Hadoop code and making sure it runs on Windows. We have contributed HDFS, MapReduce, and Hive code back to the Apache open source community. More is on the way. We are adding things like the Excel add-in for the Hive ODBC driver, JavaScript, cloud (http://HadoopOnAzure.com), C# samples, etc. Where Microsoft is often seen as adding value is making Hadoop more available and allowing the reuse of existing skill sets. We offer self service, visualization, easier data movement, elasticity, and familiarity for faster completion of projects. This allows data scientists (people who know the business and statistics first and tech second), data teams, information workers, and anyone seeking insights through BI to do their job better and faster.

We offer visualization through PowerPivot, Power View, and the Excel Hive ODBC Add-in. For many used to Hadoop on Linux these things are new and exciting, even though they seem familiar to many in the SQL Server world.

In the end, big data is really all about discovering insights and then taking action. If you don’t act on what you find, why did you spend all that effort mining the data? Sometimes the action is to go back and refine your process or change the way you looked at the data, but the end goal should be to make decisions that impact your business. Big data should add value to the business proposition.

References

My full presentation from #24HOP can be downloaded below. There are notes on many of the slides with further details. The recording from 24 Hours of PASS will be available at http://www.sqlpass.org/LearningCenter/SessionRecordings/24HoursSpring2012.aspx as soon as it is posted. Download their player then choose the session by title – “Fitting Microsoft Hadoop into your Enterprise BI Strategy (Cindy Gross)”.

I updated the attachment to include the updated slides from my May 11, 2012 talk at #SQLRally. There are few changes other than the background.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

MSHadoopEntBIStrategy24HOPandSQLRally2012.zip



Using the Pi Estimator Sample on HadoopOnAzure CTP – Small Bites of Big Data


Small Bites of Big Data

Cindy Gross, SQLCAT PM

UPDATED Jun 2013: HadoopOnAzure CTP has been replaced with HDInsight Preview. See: How to Run the HDInsight Samples http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-run-samples/

 Now that you have created your Hadoop on Azure cluster you can run the sample programs to become familiar with the interface. Click on the “Samples” tile under “Manage your account”.

The gallery of available samples is growing rapidly.

We’ll start with the very simple Pi Estimator sample. When you click on the Pi tile you’ll see some information about the sample (scroll down to see more). You can download the files and review them in as much detail as you want. You can review the PiEstimator.java file in notepad. The .jar file is a zip file that contains many other files. You can unzip it with many compression utilities then explore the files at your convenience. The description tells us the first parameter indicates how many maps to create (default of 16) and the second indicates how many samples per map (10 million by default). For now, we’ll take advantage of the Azure portal’s simplification of the complexities of Hadoop and click on “Deploy to your cluster”.

This brings up a screen to create the Hadoop job. You can modify the job name and parameters if you like. The “final command”, in this case “Hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000”, can be copied to the command line if you choose to execute the job there later.

Click on “Execute job”.  It may run for a minute or two, or perhaps longer if the CTP system is being heavily used. As it progresses entries are added to the “Output (stdout)” and “Debug Output (stderr)” sections. Eventually you will see “Status: Completed Successfully” under “Job Info”. I highlighted the runtime (“Job Finished in 66.123 seconds”) and output (“Estimated value of Pi is 3.14159155000000000000”).

 

You can see that there were 16 maps created because that’s the first parameter we passed in to the jar command. If I change it to 8 maps and 1000 samples per map the command is now generated as “call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 8 1000” and the output has fewer significant digits even though the runtime is nearly the same:

Job Finished in 57.061 seconds

Estimated value of Pi is 3.14100000000000000000

When you go back to the main portal screen you now see the “Pi Example” tile. If the job is currently running you will see “…in progress…” Once it finishes the tile will show “Complete”.

     

To view job results, click on the “Job History” tile under “Manage your account”.

If you click on any history row it will take you to the same output you see if you are looking at the job interactively.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.