Tag: Big Data

  • HDInsight: Hive Internal and External Tables Intro

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    HDInsight is Microsoft’s distribution, in partnership with Hortonworks, of Hadoop. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

    Use EXTERNAL tables when:

    • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
    • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
    • You want to use a custom location such as ASV.
    • Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
    • You are not creating table based on existing table (AS SELECT).

    Use INTERNAL tables when:

    • The data is temporary.
    • You want Hive to completely manage the lifecycle of the table and data.

    We’ll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively you can simply open notepad and create your own files with a series of single column rows. If you create your own files make sure you have a carriage return/line feed at the end of all rows including the last one. The files should be in a Windows directory called c:data on the HDInsight Head Node. For HDInsight Server (on-premises) that’s the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop connection (RDP) to the head node from the Hadoop portal.

    Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

    Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services as of February 2013 you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward I will show you how to do all the steps through the Hive CLI (command line interface).

    Open a Hadoop Command Prompt:

    HadoopCmdPrompt

    Change to the Hive directory (necessary in early preview builds of Hive):

    cd %hive_home%bin

    Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

    hadoop fs -put c:databacon.txt /user/demo/food/bacon.txt

    hadoop fs -lsr /user/demo/food

    The put command doesn’t return a result, the list command returns one row per file or subdirectory/file:

    -rw-r–r–   1 cgross supergroup        124 2013-02-05 22:41 /user/demo/food/bacon.txt

    Enter the Hive CLI (command line interface):

    hive

    Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator, no result is returned from this set command):

    Set hive.cli.print.header=true;

    Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

    CREATE INTERNAL TABLE internal1 (col1 string) LOCATION ‘/user/demo/food’;

    Oops… that failed because INTERNAL isn’t a keyword, the absence of EXTERNAL makes it a managed, or internal, table.

    FAILED: Parse Error: line 1:7 Failed to recognize predicate ‘INTERNAL’.

    So let’s create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

    CREATE TABLE internal1 (col1 string) LOCATION ‘/user/demo/food’;

    That will return the time taken but no other result. Now let’s look at the schema that was created:. Note that the table type is MANAGED_TABLE.

    DESCRIBE FORMATTED internal1;

    col_name        data_type       comment
    # col_name              data_type               comment

    col1                    string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 22:45:57 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             MANAGED_TABLE
    Table Parameters:
    transient_lastDdlTime   1360133157

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    And now look at some rows:

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon

    What happens if we don’t specify a directory for an INTERNAL table?

    CREATE TABLE internaldefault (col1 string);

    It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):

    dfs -lsr /hive/warehouse;

    We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory:
    drwxr-xr-x   – cgross supergroup          0 2013-02-05 22:52 /hive/warehouse/internaldefault

    However, we won’t use this table for the rest of the demo so let’s drop it to avoid confusion. The drop also removes the subdirectory.

    DROP TABLE internaldefault;

    dfs -lsr /hive/warehouse;

    Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let’s add a 2nd file to the first internal table and check that it exists:

    dfs -put c:databacon2.txt /user/demo/food/bacon2.txt;

    dfs -lsr /user/demo/food;

    -rw-r–r–   1 cgross supergroup        124 2013-02-05 23:04 /user/demo/food/bacon.txt
    -rw-r–r–   1 cgross supergroup         31 2013-02-05 23:03 /user/demo/food/bacon2.txt

    Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output – there is no row value of col1 in the data as headers are not generally included in Hadoop data):

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    Now let’s create an EXTERNAL table that points to the same directory and look at the schema:

    CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION ‘/user/demo/food’;

    DESCRIBE FORMATTED external1;

    col_name        data_type       comment
    # col_name              data_type               comment

    cole1                   string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 23:07:12 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             EXTERNAL_TABLE

    Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1360134432

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

    Now look at the data:

    SELECT * FROM external1;

    The result set is a combination of the two bacon files:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    That table returns the same data as the first table – we have two tables pointing at the same data set! We can add another one if we want:

    CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION ‘/user/demo/food’;

    DESCRIBE FORMATTED external2;

    SELECT * FROM external2;

    You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

    Add another data file to the same directory and see how it’s visible to all the tables that point to that directory:

    dfs -put c:dataveggies.txt /user/demo/food/veggies.txt;

    SELECT * FROM internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    Each table will return the same results:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn
    SQL_Apple
    NoSQL_Pear
    SQLFamily_Kiwi
    Summit_Mango
    HDInsight_Watermelon
    SQLSat_Strawberries
    Raspberrylimelemonorangecherryblueberry 123 456

    Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

    DROP TABLE internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    dfs -lsr /user/demo/food;

    Result: lsr: Cannot access /user/demo/food: No such file or directory.

    Because the INTERNAL (managed) table is under Hive’s control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

    Clean up the demo tables and directory:

    DROP TABLE external1;

    DROP TABLE external2;

    exit;

    This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

    bacon.zip

  • Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot, Power View

    Small Bites of Big Data

    Authors: Cindy Gross Microsoft SQLCAT PM, Ed Katibah Microsoft SQLCAT PM

    Tech Reviewers: Bob Beauchemin Developer Skills Partner at SQLSkills, Jeannine Nelson-Takaki Microsoft Technical Writer, John Sirmon Microsoft SQLCAT PM, Lara Rubbelke Microsoft Technical Architect, Murshed Zaman Microsoft SQLCAT PM

    For my #SQLPASS Summit 2012 talk SQLCAT: Big Data – All Abuzz About Hive (slides available to all | recording available to PASS Summit 2012 attendees) I showed a mash-up of Hive, SQL Server, and Excel data that had been imported to PowerPivot and then displayed via Power View in Excel 2013 (using the new SharePoint-free self-service option). PowerPivot brings together the new world of unstructured data from Hadoop with structured data from more traditional relational and multi-dimensional sources to gain new business insights and break down data silos. We were able to take very recent data from Hurricane Sandy, which occurred the week before the PASS Summit, and quickly build a report to pinpoint some initial areas of interest. The report provides a sample foundation for exploring to find additional insights. If you need more background on Big Data, Hadoop, and Hive please see my previous blogs and talks.

    I will walk you through the steps to create the report including loading population demographics (census), weather (NOAA), and lookup table (state abbreviations) data into Hive, SQL Server, Excel, and PowerPivot then creating visualizations in Power View to gain additional insights. Our initial goal is to see if there are particular geographic areas in the path of Hurricane Sandy that might need extra assistance with evacuation. One hypothesis is that people above a given age might be more likely to need assistance, so we want to compare age data with the projected rainfall patterns related to the path of the hurricane. Once you see this basic demonstration you can envision all sorts of additional data sets that could add value to the model, along with different questions that could be asked given the existing data sets. Data from the CDC, pet ownership figures, housing details, job statistics, zombie predictions, and public utility data could be added to Hive or pulled directly from existing sources and added to the report to gain additional insights. Those insights might, for example, help first responders during future storms, assist your business to understand various ways it can help after a storm or major cleanup effort, or aid future research into reducing the damage done by natural disasters.

    Prerequisites

    • Excel 2013 (a 30 day trial is available)
    • SQL Server 2008 R2 or later
    • HDInsight Server or HDInsight Service (Service access will take a few days so plan in advance or use Server)
    • Hive ODBC Driver (available with HDInsight)
    • NOAA Data Set restored to SQL Server 2008 R2 or later (available from SkyDrive)
    • Census Data Set (attached at bottom of blog)

    SQL Server

    Relational Data

    One of the data sets for the demo is NOAA weather data that includes spatial characteristics. Since SQL Server has a rich spatial engine and the data set is well known and highly structured that data was a good fit for SQL Server. Spatial_Ed has a wiki post on how to create the finished data set from raw NOAA data. Ed also provides a backup of the completed data set for our use. Take the SQL Server database backup (download here) made available by Ed in his wiki post and restore it to your SQL Server 2008 R2 or later instance as a database called NOAA.

    USE [master];
    RESTORE DATABASE [NOAA] FROM  DISK = N'C:DemoBigDataSandyNOAA.bak' 
    WITH MOVE N'NOAA2' TO N'C:DATANOAA2.mdf',  
    MOVE N'NOAA2_log' TO N'C:DATANOAA2_log.LDF';
    GO
    
    
    

    Since this data will be used by business users, add a view to the NOAA database with a friendlier name for the rainfall/flashflood data:

    USE NOAA;
    GO
    CREATE VIEW flashflood AS SELECT * FROM [dbo].[nws_ffg7];
    

    Take a look at a few rows of the data. For more information on what information is available in the census data, see the U.S. Census Bureau website.

    SELECT TOP 10 * FROM flashflood;

    Hive

    Hive is a part of the Hadoop ecosystem that allows you to create tables and impose structure on Hadoop data. It is available in HDInsight which is Microsoft’s distribution of Hadoop, sometimes referred to as Hadoop on Azure or Hadoop on Windows. HDInsight is currently available for preview in both Azure (HDInsight Service) and on-premises (HDInsight Server) versions. The HDInsight Server version is lightweight and simple to install – you can even put it on your laptop as it is a single node installation for now. Or you can request access to an Azure Hadoop cluster via HDInsight Service, though this takes a few days in the CTP phase. Hive is automatically installed as part of Hadoop in the preview versions, though the Hive ODBC driver requires a separate setup step. Hive can be described in many ways, but for your purposes within this article the key point is that it provides metadata and structure to Hadoop data that allows the data to be treated as “just another data source” to an ODBC compliant application such as Excel. HiveQL, or HQL, looks very similar to other SQL languages and has similar functionality.

    ODBC

    Once you have access to an HDInsight cluster, install the Hive ODBC driver to your local box (where you have Excel 2013). Make sure you install the ODBC driver platform (32-bit or 64-bit) that matches the platform of Excel. We recommend the 64-bit version of Excel 2013 and of the Hive ODBC driver since PowerPivot is able to take advantage of the larger memory available in x64. The Hive ODBC driver is available from the Downloads tile in the HDInsight Service (Azure) or HDInsight Server (on-premises) portal. Click on the appropriate installer (32-bit or 64-bit) and click through the Microsoft ODBC Driver for Hive Setup. You will end up with version .09 of the Microsoft Hive ODBC driver.

    Make sure the Hive ODBC port 10000 is open on your Hadoop cluster (instructions here).

    Create a Hive ODBC system DSN pointing to your Hadoop cluster. For this example I used the ODBC Data Source Administrator to create a system DSN called CGrossHOAx64 pointing to my http://HadoopOnAzure.com instance cgross.cloudapp.net with port = 10000 and an account called cgross1. For the on-premise version you can use localhost for the Host value and you will not specify an account or password.

    ODBCAdminHive    SystemDSNHive

    Note: With Windows 2008 R2 and earlier or Windows 7 and earlier if you must use 32-bit Excel and the 32-bit Hive ODBC driver (not recommended) on 64-bit Windows, to create the system DSN you have to use the 32-bit version of the ODBC Data Source Administrator. It is in a location similar to C:WindowsSysWOW64odbcad32.exe. In Windows 2012 or later or Windows 8 or later there is a single ODBC Data Source Administrator for both 32-bit and 64-bit drivers.

    Key Hadoop Pointers
    • The Hadoop Command Prompt is installed with HDInsight. There will be an icon added to the desktop on your head node. When you open it you will see that it is a basic Windows Command Prompt but with some specific settings applied.
      HadoopCmdPrompt  HadoopCmdPromptOpen
    • In early versions of HDInsight the Hive directory is not in the system path so you have to manually change to the Hive directory (cd %hive_home%bin) to run Hive commands from the Hadoop Command Prompt. This issue will be fixed in future versions.
    • HDFS is the scale-out storage technology that comes as part of core Hadoop. How it works behind the scenes is covered in great detail in many other places and is not relevant to this discussion. This is the default storage in HDInsight Server (on-prem) and the CTP of HDInsight Service (Azure).
    • ASV, or Azure Storage Vault, is the Windows Azure implementation of HDFS that allows HDInsight to use Azure Blob storage. This will be the default for the production HDInsight Service.
    Census Data

    Census data has a format that can vary quite a bit. The data is collected in different ways in different countries, it may vary over time, and many companies add value to the data and make available extended data sets in various formats. It could contain large amounts of data kept for very long periods of time. In our example the pieces of census data we find useful and how we look at that data (what structures we impose) may change quite a bit as we explore the data. The variety of structures and the flexibility of the ways to look at the data make it a candidate for Hadoop data. Our need to explore the data from common BI tools such as Excel, which expect rows and columns with metadata, leads us to make it into a Hive table.

    Census data was chosen because it includes age ranges and this fits with the initial scenario we are building to look at how many older individuals are in various high danger areas in the path of the hurricane. I have attached a demo-sized variation of this data set called Census2010.dat at the bottom of the blog. Download that tab-delimited U.S. census data set to the c:data folder on the head node for your HDInsight cluster. The head node is the machine you installed your single node HDInsight Server (on-prem) on, such as your laptop, or the machine configured for Remote Desktop access from the portal in your HDInsight Service (Azure) cluster. If you wish to explore the raw census data used in this demo, take a look at this site: http://www.census.gov/popest/research/modified.html.

    Next, load the census data to Hadoop’s HDFS storage. Unlike a relational system where the structure has to be well-known and pre-defined before the data is loaded, with Hadoop we have the option to load this data into HDFS before we even create the Hive table!

    Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

    I will show you three of the many options for loading the data into an external Hive table (in the real world you would have multiple very large files in this directory, but for the purposes of the demo we have one small file), choose any one of the following three options. Note that the directories are created for you automatically as part of the load process. There are many ways to do the load, including cURL, SFTP, PIG, etc. but the steps below are good for illustration purposes.

    1) To load the data into HDFS via the Hadoop Command Prompt, open the Hadoop Command Prompt and type:

    hadoop fs -put c:datacensus2010.dat /user/demo/census/census.dat

    Let’s break down the command into its component pieces. The fs tells Hadoop you have a file system command, put is a data load command, c:datacensus2010.dat is the location of the file within NTFS on Windows, and /user/demo/census/census.dat is the location where we want to put the file within HDFS. Notice that I chose to change the name of the file during the load, mostly just for convenience as the HDFS name I chose is shorter and more generic. By default HDInsight currently defaults to using HDFS so I don’t have to specify that in the command, but if your system has a different default or you just want to be very specific you could specify HDFS in the command by adding hdfs:// before the location.

    hadoop fs -put c:datacensus2010.dat hdfs:///user/demo/census/census.dat

    Nerd point: You can specify the system name (instead of letting it default to localhost) by appending hdfs://localhost (or a specific name, often remote) to the location. This is rarely done as it is longer and makes the code less portable.

    hadoop fs -put c:datacensus2010.dat hdfs://localhost/user/demo/census/census.dat

    2) If you are using HDInsight Service (Azure), you can choose to easily store the data in an Azure Blob Store via ASV. If you haven’t already established a connection from your HDInsight cluster to your Azure Blob Store then go into your HDInsight cluster settings and enter your Azure Storage Account name and key. Then instead of letting the data load default to HDFS specify in the last parameter that the data will be loaded to Azure Storage Vault (ASV).

    hadoop fs -put c:datacensus2010.dat asv://user/demo/census/census.dat

    3) From the JavaScript interactive console in the Hadoop web portal (http://localhost:8085/Cluster/InteractiveJS from the head node) you can use fs.put() and specify either an HDFS or ASV location as your destination (use any ONE of these, not all of them):

    /user/demo/census/census.dat
    asv://user/demo/census/census.dat
    hdfs:///user/demo/census/census.dat

    Check from a Hadoop Command Prompt that the file has loaded:

    hadoop fs -lsr /user/demo/census

    Results will vary a bit, but the output should include the census.dat file: /user/demo/census/census.dat. For example the Hadoop Command Prompt in HDInsight Server should return this row with a different date and time, possibly with a different userid:

    -rw-r–r– 1 hadoop supergroup 360058 2013-01-31 17:17 /user/demo/census/census.dat

    If you want to run Hadoop file system (fs) commands from the JavaScript console, replace “hadoop fs -” with “#”.

    #lsr /user/demo/census

    This tells Hadoop that we want to execute a file system command and recursively list the subdirectories and files under the specified directory. If you chose to use ASV then add that prefix (I will assume for the rest of this blog that you know to add asv:/ or hdfs:// if necessary):

    hadoop fs -lsr asv://user/demo/census

    If you need to remove a file or directory for some reason, you can do so with one of the remove commands (rm = remove a single file or directory, rmr = remove recursively the directory, all subdirectories, and all files therein):

    hadoop fs –rm /user/demo/census/census.dat
    
    hadoop fs –rmr /user/demo/census
    
    

    For more details on rm or other commands:

    hadoop fs –help
    
    hadoop fs –help rm
    
    hadoop fs –help rmr
    
    Hive Table

    Create an external Hive table pointing to the Hadoop data set you just loaded. To do this from a command line open your Hadoop Command Prompt on your HDInsight head node. On an Azure cluster, look for the Remote Desktop tile from the HDInsight Service portal. This allows you to connect to the head node and you can optionally choose to save the connection as an RDP file for later use. In the Hadoop Command Prompt, type in the keyword hive to open the interactive Hive command line interface console (CLI). In some early versions of HDInsight the Hive directory isn’t in the system path so you may have to first change to the Hive bin directory in the Hadoop Command Prompt:

    cd %hive_home%bin

    Now enter the Hive CLI from the Hadoop Command Prompt:

    hive

    Copy and Paste the below create statement into the Hive CLI window. Or you can paste the CREATE statement into the web portal via the Hive interactive screen. If you are using the ASV option, change the STORED AS line to include asv:/ before the rest of the string (asv://user/demo/census). Take note of the fact that you are specifying the folder (and implicitly all files in it) and not a specific file.

    CREATE EXTERNAL TABLE census
    (State_FIPS int,
    County_FIPS int,
    Population bigint,
    Pop_Age_Over_69 bigint,
    Total_Households bigint,
    Median_Household_Income bigint,
    KeyID string) 
    COMMENT 'US Census Data' 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY 't' 
    STORED AS TEXTFILE LOCATION '/user/demo/census';

    This CREATE command tells Hive to create an EXTERNAL table as opposed to an “internal” or managed table. This means you are in full control of the data loading, data location, and other parameters. Also, the data will NOT be deleted if you drop this table. Next in the command is the column list, each column uses one of the limited number of primitive data types available. The COMMENT is optional, you can also add a COMMENT for one or more of the columns if you like. ROW FORMAT DELIMITED means we have a delimiter between each column and TERMINATED BY ‘t’ means the column delimiter is a tab. The only allowed row delimiter as of Hive 0.9 is a new line and that is the default so we don’t need to specify it. Then we explicitly say this is a collection of TEXTFILEs stored at the LOCATION /user/demo/census (notice that we are not specifying a specific file, but rather the specific directory). We could specify either hdfs:/// or asv:// in front of the location; by default the current default storage location is HDFS so it is an optional prefix (the default can change!).

    Now check that the table has been created:

    SHOW TABLES;

    Check the definition of the table:

    DESCRIBE FORMATTED census;

    And look at a sampling of the data:

    SELECT * FROM census LIMIT 10;

    You can now choose to leave the Hive CLI (even the exit requires the semicolon to terminate the command):

    EXIT;

    Excel 2013 Visualizations

    Preparation

    In Excel 2013 you don’t have to download PowerPivot or Power View, you just need to make sure they are enabled as COM add-ins. All of the below steps assume you are using Excel 2013.

    In Excel 2013 make sure the “Hive for Excel”, “Microsoft Office PowerPivot for Excel 2013”, and “Power View” add-in are enabled. To do this click on the File tab in the ribbon, then choose Options from the menu on the left. This brings up the Excel Options dialog. Choose Add-ins from the menu on the left. At the very bottom of the box you’ll see a section to Manage the Add-ins. Choose “COM Add-ins” and click on the Go… button. This brings up the COM Add-Ins dialog where you can double check that you have checkmarks next to the Hive, PowerPivot, and Power View add-ins. Click OK.

    ExcelManageCOMAddins  ExcelCOMAddins

    PowerPivot Data

    Open a new blank workbook in Excel 2013. We are going to pull some data into PowerPivot, which is an in-memory tabular model engine within Excel. Click on the PowerPivot tab in the ribbon then on the Manage button in the Data Model group on the far left. Click on “Get External Data” in the PowerPivot for Excel dialog.

    PowerPivotExternalData

    We’re going to mash up data from multiple sources – SQL, Hive, and Excel.

    SQL Data

    First, let’s get the SQL Server data from the NOAA database. Choose “From Database” then “From SQL Server”. In the Table Import Wizard dialog enter the server and database information:

    Friendly connection name = NOAA
    Server name = [YourInstanceName such as MyLaptopSQL2012]
    Log on to the server = “Use Windows Authentication”
    Database name = NOAA

    ExternalSQLNOAA

    Choose “Next >” then “Select from a list of tables and views to choose the data to import”. Put a checkmark next to the flashflood view we created earlier. Click “Finish”. This will import the flash flood data from SQL Server into the in-memory xVelocity storage used by PowerPivot.

    ImportFlashFlood

    Hive Data

    Now let’s load the Hive data. Click on “Get External Data” again. This time choose “From Other Sources” then “Others (OLEDB/ODBC)” and click on the “Next >” button. Enter Census for the friendly name then click on the “Build” button.

    Friendly name for this connection = Census
    Connection String = “Build”
    Provider tab:
    OLE DATABASE Provider(s) = “Microsoft OLE DATABASE Provider for ODBC Drivers”
    Connection tab:
    Use data source name = CGrossHOAx64 [the Hive ODBC DSN created above]
    User name = [login for your Hadoop cluster, leave blank for on-premises Nov 2012 preview]
    Password = [password for your Hadoop cluster, leave blank for on-premises Nov 2012 preview]
    Allow saving password = true (checkmark)

    DSNODBCProvider  DSNHiveAzure

    This creates a connection string like this:
    Provider=MSDASQL.1;Persist Security Info=True;User ID=cgross1;DSN=CGrossHOAx64;Password=**********

    ExternalHiveCensus2

    Choose “Next >” then “Select from a list of tables and views to choose the data to import”. Choose the census table then click on “Finish”. This loads a copy of the Hive data into PowerPivot. Since Hive is a batch mode process which has the overhead of MapReduce job setup and/or HDFS streaming for every query, the in-memory PowerPivot queries against the copy of the data will be faster than queries that go directly against Hive. If you want an updated copy of the data loaded into PowerPivot you can click on the Refresh icon at the top of the PowerPivot window.

    Save the Excel file.

    Excel Data

    Our last data set will be a set of state abbreviations copied into Excel then accessed from PowerPivot. Go back to your Excel workbook (NOT the PowerPivot for Excel window). Rename the first sheet using the tab at the bottom to ShortState (just for clarity, this isn’t used anywhere).

    Open https://www.usps.com/send/official-abbreviations.htm in your web browser. Copy the entire “States” table including the headers. Paste the States data into the ShortState worksheet in Excel. Change the headers from State/Possession to State and from Abbreviation to ShortState. Highlight the columns and rows you just added (not the entire column or you will end up with duplicate values). Go to the Insert tab in the ribbon and click on Table in the Tables group on the far left. It will pop up a “Create Table” window, make sure the cells show “=$A$1:$B$60” and that “My table has headers” is checked.

    ExternalAbbreviations

    Now you will have a Table Tools ribbon at the top on the far right. Click on the Design tab then in the Properties group in the upper left change the Table Name to ShortState. The fact that the worksheet tab, the table, and one column have the same name is irrelevant, that just seemed less confusing to at the time. Now you have an Excel based table called ShortState that is available to be used in the tabular model.

    ExternalAbbreviations2

    Data Model

    Click on the PowerPivot tab and choose Add to Data Model at the top in the Tables group. This will change the focus to the PowerPivot for Excel window. Go to the Home tab in the PowerPivot for Excel window and click on Diagram View in the View group in the upper right. You can resize and rearrange the tables if that makes it easier to visualize the relationships. Add the following relationships by clicking on the relevant column in the first table then dragging the line to the other column in the second table:

    Table Flashflood, column State to table ShortState, column State

    Table Census, column keyid to table flashflood, column KEYID

    PPModelCensus

    Click on the flashflood table and then on the Create Hierarchy button in the upper right of the table (as circled in the image above). Name the hierarchy Geography and drag State then County under it.

    PPHierarchyCensus

    Save the Excel file. You may need to click on “enable content” if you close/reopen the file.

    Calculations

    In the PowerPivot for Excel window, we’re going to add some calculated columns for later use. Click on the Data View button in the View group at the top. You should now see three tabs at the bottom, one for each of the tables in the model. Click on the census table and then right-click on “Add Column” at the far right of the column headers. Choose to insert column. This will create a new column named CalculatedColumn1. In the formula bar that opened above the column headers paste in “=[Pop_Age_Over_69]/[Population]” (without the quotes). This will populate the new column with values such as 0.0838892880358548. Right-click on the column and choose “Rename Column” – name it PctPopOver69 since it gives us the percentage of the total population that is over 69 years old. That’s the first formula below, repeat those steps for the 2nd and 3rd formulas. PctBuckets bucketizes the output into more meaningful groupings while SortBucketsBy gives us a logical ordering of the buckets. Since most of the numbers fall into a range of less than 20% the buckets are skewed towards that lower range that actually has a variety of data.

    • PctPopOver69=[Pop_Age_Over_69]/[Population]
    • PctBuckets=IF([PctPopOver69]<=.05,”<5%”,IF([PctPopOver69]<.1,”5-10%”, IF([PctPopOver69]<.15,”10-15%”,IF([PctPopOver69]<.2,”15-20%”,”>20%”))))
    • SortBucketsBy=IF([PctPopOver69]<=.05,1,IF([PctPopOver69]<.1,2, IF([PctPopOver69]<.15,3,IF([PctPopOver69]<.2,4,5))))

    PPCalculationsCensus

    Now let’s use those new calculated columns. Highlight the new PctBuckets column (the 2nd one you just added) and click on the Sort by Column button in the Sort and Filter group. Pick Sort By Column and choose to sort PctBuckets by SortBucketsBy. This affects the order of data in the visualizations we will create later.

    PPSortByCensus

    Since some of the columns are purely for calculations or IT-type use and would confuse end users as they create visualizations and clutter the field picker, we’re going to hide them. In the Census table right-click on the PctPopOver69 column and choose “Hide from Client Tools”. Repeat for census.SortBucketsBy, census.keyid, census.State_FIPS, census.County_FIPS, flashflood.State_FIPS, flashflood.County_FIPS, and flashflood.KEYID. Our PowerPivot data is now complete so we can close the PowerPivot for Excel window. Note that the Excel workbook only shows the ShortState tab. That’s expected as this is the Excel source data, not the in-memory PowerPivot tables that we’ve created from our three data sources. All of our data is now in the spreadsheet and ready to use in a report.

    Save the Excel file.

    Power View

    Now let’s use Power View to create some visualizations. In the main Excel workbook, click on the Insert tab in the ribbon and then under the Reports group (near the middle of the ribbon) click on Power View.

    PVButton

    This creates a new tab in your workbook for your report. Rename the tab to SandyEvacuation. In the Power View Fields task pane on the right, make sure you choose All and not Active so that every table in the model is visible. Uncheck any fields that have already been added to the report.

    Where it says “Click here to add a title” enter a single space (we’ll create our own title in a different location later). In the Power View tab click on the Themes button in the Themes group. Choose the Solstice theme (7 from the end). This theme shows up well on the dark, stormy background we will add later and it has colors that provide good indicators for our numbers.

    Now let’s add some data to the report. In the Power View Fields task pane, expand the flashflood table and add a checkbox to State then to County. This will create a new table on the left. Add HR1, HR3, HR6, HR12, and HR24 from the flashflood table in that order (you can change the order later at the bottom of the task pane). Add Latitude and Longitude from flashflood. Expand census and click on PctBuckets. Now we have the data we need to build a map-based visualization that will show rainfall across a geographic area.

    Note: If you get a message at any point asking you to “enable content”, click on the button in the message to do so.

    Map

    In the Power View Fields task pane go to the Fields section at the bottom. Click on the arrow to the right of Latitude and Longitude and change both from sum to average.

    PVCensusFields

    With the focus on the table in the main window, click on the Design tab then Map within the Switch Visualization group on the left. This creates one map per state and changes the choices available in the task pane. Change the following fields at the bottom of the screen to get a map that allows us to see the path of Hurricane Sandy as it crosses New England.

    TILE BY = [blank]
    SIZE = HR6 (drag it from the field list to replace HR1, set it to minimum)
    LOCATIONS = Remove the existing field and drag down the Geography Hierarchy from flashflood
    LONGITUDE = Longitude
    LATITUDE = Latitude
    COLOR = PctBuckets
    VERTICAL MULTIPLES = [remove field so it’s blank]
    HORIZONTAL MULTIPLES = [blank]

    PVCensusMapFields

    With the focus still on the map, click on the LAYOUT tab in the ribbon. Set the following values for the buttons in the Labels group, they will apply to this particular visualization (the map):

    Title = none
    Legend = Show Legend at Right
    Data Labels = Center
    Map Background = Road Map Background

    PVMapViz

    Go to the Power View tab, make sure the focus is on the map. Click on Map in the Filters section to the right of the report. Click on the icon to the far right of State. This adds another icon (a right arrow) for “Advanced filter mode”, click on that. Choose “is not blank” from the first drop down. Click on “apply filter” at the bottom. Repeat for County. This eliminates irrelevant data with no matching rows.

    PVCensusFilter

    Make it Pretty

    Go to the Power View tab so we can add some text. In the Insert group click on Text Box. It will insert a box somewhere on the page, paste in “Evacuation Targets – % of Population Over 69 by County” (without the quotes). Size the box so that it is a long rectangle with all the text on one line, then move the box above the map. Repeat for a text box that says “Hurricane Sandy 2012” and move that above the previous text box. Make the text size of the last box = 24 in the TEXT tab (resize the box if needed).

    Hint: To move an object on the report, hover over the object until a hand appears then drag the object.

    PVTitles

    Download some Hurricane Sandy pictures. In the Power View tab go to the Background Image section. Choose Set Image and insert one of your hurricane images. Set Transparency to 70% or whatever looks best without overwhelming the report. You may need to change the Image Position to stretch. In the Themes group of the Power View tab set Background to “Light1 Center Gradient” (gray fade in to center). Move the two titles and the map as far into the left corner as you can to leave room for the other objects. If you like, reduce the size of the map a bit (hover near a corner until you get a double-headed arrow) add a couple more pictures on the right side (Insert group, Picture button).

    PVBackgroundMap

    Save the Excel file.

    Slicers

    Now let’s create some slicers and add another chart to make it more visually appealing and useful.

    Click in the empty area under the map, we are going to create a new object. Click on flashflood.State in the task pane to add that column to the new table. Click on the scroll bar on the right side of the table to make sure the focus is on the table. Click on Slicer in the Design tab. Stay in the Design tab and under the Text section click on the “A with a down arrow” button to reduce the font by one size. If necessary, do some minor resizing of the table so it doesn’t fall off the page or cut off a state name in the middle but still shows a handful of rows. Expand Filters on the right side of the screen and drag flashflood.State to the filters section. Choose the Advanced button, “is not blank”, and “apply filter”. Repeat these steps to create a separate slicer for County. Now when you click on a particular county, for example, in the slicer everything on the report will be filtered to that one county.

    PVSlicers

    Save the Excel file.

    Rainfall Chart

    Click in the empty space to the right of the new County slicer. Add ShortState.ShortState and 5 separate flashflood columns, one for each time increment collected: HR1, HR3, HR6, HR12, HR24. In the Design tab choose “Other Chart” then Line. In the “Power View Fields” task pane make sure TILE BY is empty, VALUES for HR1-24 = AVERAGE, and AXIS = ShortState.

    PVRainfallFields

    Reduce the font size by 1 in the Text group of the Design tab then resize to make sure all values from CT to WV show up in the chart (stretch it to the right). In the Power View tab click on Insert.Text Box, move the box to be above the new chart, and type in the text “Average Rainfall by State”.

    Make sure the focus is on the rainfall chart and go to the LAYOUT tab:

    Title = None
    Legend = Show Legend at Right
    Data Labels = Above

    PVHurricaneSandy

    Save the Excel file.

    That’s the completed report using the same steps as the report I demo’d at PASS Summit 2012 (download that report from the bottom of the blog). Try changing the SIZE field to HR1 (rainfall projection for the next hour), HR6, etc. Click through on the state map values to see the more detailed county level values. Click on a slicer value to see how that affects the map.

    What you see in this report is a combination of SQL Server (structured) data and Hive (originally unstructured) data all mashed up into a single report so the end user doesn’t even have to know about the data sources. You can add more types of data, change the granularity of the data, or change how you visualize the data as you explore. The Hive data in particular is very flexible, you can change the structure you’ve imposed without reloading the data and add new data sources without lots of pre-planning. Using BI client tools to access all the additional rich data companies are starting to archive and mine in Big Data technologies such as HDInsight is very powerful and I hope you are starting to see the possibilities of how to use Hive in your own work.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly

    DemoSandy2012.zip

  • Big Data – All Abuzz About Hive at #SQLPASS Summit 2012

    Big Data – All Abuzz About Hive
    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    I hope to see you at the #SQLPASS Summit 2012 this week!

    There are many reasons people come to the PASS Summit – SQL friends, SQL family, networking, great content in 190 sessions, the SQL clinic, the product team, CSS, SQL CAT, MVPs, MCMs, SQL Community, and sometimes just to get away from the daily grind of work. All those are great reasons, and they all make for a great Summit.

    This year I am focusing on BI and Big Data. For the Summit this year I will be introducing you to Hive. Hive is a great way to leverage your existing SQL skills and enter into the world of #BigData. Big Data is here in a big way. CIOs are pushing it, business analysts want to use it, and everyone wants to gain new insights that will help their business grow and thrive. Don’t let Big Data pass you by, leaving you wondering how you missed out. Hive is fun and interesting, and for SQL Pros it looks very familiar. Come to my talk and come by the SQL Clinic to ask questions throughout the week. Please come up and introduce yourself at any time, I love to meet new SQL Peeps!

    BIA-305-A SQLCAT: Big Data – All Abuzz About Hive
    Wednesday 1015am   |   Cindy Gross, Dipti Sangani, Ed Katibah

    Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

    After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

    And there’s more! Here is a sampling of blog posts about this year’s summit:

    After the Summit, you can still stay involved. Follow some SQL Peeps on Twitter, sign up for a few SQL blog feeds, and buy a book or two. Attend local events like SQL Saturdays and User Group meetings. Reach out to your fellow SQL-ites and stay in touch with those you meet. Keep SQL fun and interesting and share what you learn!

    See you at the #SQLPASS Summit 2012!

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

    PASS2012BIA305AAllAbuzzAboutHive.pptx

  • Load SQL Server BCP Data to Hive

    Load SQL Server BCP Data to Hive

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    As you start learning more about Hadoop you may want to take a look at how the same data and queries work for SQL Server and for Hadoop. There are various ways to do this. For now I’ll show you something that utilizes some of your existing SQL Server skills (BCP) and some new Hadoop skills (basic Hadoop FS and Hive commands). There are other methods such as FTP, cURL, and Sqoop that I won’t cover right now. Also, if you want to load data from the Azure DataMarket you can follow these instructions.

    Today we’ll walk through making a copy of the FactProductInventory table from AdventureWorksDW2012 on a SQL Server 2012 instance. The below TSQL will generate BCP output commands for a single table in the current SQL Server database context and write tab delimited data to c:temp. Comment out the @tablename references in the SELECT and WHERE clauses to generate the script for all tables in the database.

    USE [AdventureWorksDW2012];

    DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname

    SELECT  @servername = @@SERVERNAME

           ,@dbname = DB_NAME()

           ,@outputdir = ‘c:temp’

           ,@tablename = ‘FactProductInventory’

    SELECT ‘bcp ‘ + OBJECT_SCHEMA_NAME(object_id) + ‘.’ + name + ‘ out ‘

           + @outputdir + OBJECT_SCHEMA_NAME(object_id) + ‘_’ + name + ‘.dat -b 10000 -d ‘

           + @dbname + ‘ -T -c -S ‘ + @servername

           FROM sys.objects

           WHERE type_desc = ‘USER_TABLE’

           AND name = @tablename

           ;

    In this case the BCP code generated is (no line break):

    bcp dbo.FactProductInventory out c:tempdbo_FactProductInventory.dat -b 10000 -d AdventureWorksDW2012 -T -c -S CGROSSBOISESQL2012

    Paste the BCP command to a Command Prompt and run it.

    If you have not yet created an Apache™ Hadoop™-based Services for Windows Azure cluster follow these steps to do so (this is a CTP so the exact steps/screens will change over time).

    From your Hadoop cluster portal click on the “Remote Desktop” button and choose to “Save As” the RDP that is generated.

    Right click on the RDP you saved and choose “edit”. Go to the “Local Resources” tab click on “More…” under “Local devices and resources”. Add a check mark to “Drives” then click “OK”. Go back to the “General” tab and click on “Save”. Now choose “Connect” to open a remote desktop connection to your Hadoop head node.

    Open the “Hadoop Command Shell”. Copy/paste or type these commands (beware of some editors changing dashes or other characters to non-executable values) to create a directory and copy the data file to your head node. The /y on the copy will overwrite the file if it already exists.

    Md c:data

    Copy \tsclientCtempdbo_FactProductInventory.dat c:data /y

    Dir c:data

     Now from the same prompt load the data into Hadoop HDFS. The fs indicates you are running a filesystem command from a Hadoop script. Generally the same commands are available from the “Interactive JavaScript” console in the portal by replacing “hadoop fs –“ with “#”. For example, “hadoop fs –lsr /” from a Hadoop Command Prompt and “#lsr /” from the JavaScript console both return a recursive list of all directories and files starting at the root (/). Try some variations such as “#ls” (non-recursive contents of default directory) and “#lsr /user” (recursive list of the user directory).

    hadoop fs -put c:datadbo_FactProductInventory.dat /demo/demo.dat

    hadoop fs -lsr /

     Now launch the command line version of Hive (you can alternatively use the Interactive Hive console in the portal, but I’m showing you the automatable version) and add metadata to the HDFS data. Note that the CREATE EXTERNAL TABLE statement wraps in the window, there is no line break. Because I choose to use the EXTERNAL keyword the data stays in its original HDFS location and will not be deleted when I drop the Hive table. Since the Hive keywords are different the data type names are not exactly the same as they were in SQL Server. Basically I generated the script for this table from SSMS then made a few changes. I removed the constraints and indexes then changed date and money to string. I also removed the brackets and the “dbo.” schema qualifier. If you don’t copy the empty line under the last command you will have to hit enter for the last command to complete.

    hive;

    CREATE EXTERNAL TABLE FactProductInventory(ProductKey int,DateKey int,MovementDate string,UnitCost string,UnitsIn int,UnitsOut int,UnitsBalance int) COMMENT ‘Hive Demo for #24HOP’ ROW FORMAT DELIMITED FIELDS TERMINATED by ‘t’ STORED AS TEXTFILE LOCATION ‘/demo’;

    quit;

     Now let’s run some queries. You can either start Hive again from the command line or run it from the Interactive Hive portal in the GUI.

    select * from FactProductInventory where ProductKey = 230 and DateKey = ‘20080220’;

     The output on my single data node Hadoop cluster looks like this (the line starting with 230 is the actual result set):

    Total MapReduce jobs = 1

    Launching Job 1 out of 1

    Number of reduce tasks is set to 0 since there’s no reduce operator

    Starting Job = job_201209281938_0013, Tracking URL = http://10.119.36.106:50030/jobdetails.jsp?jobid=job_201209281938_0013

    Kill Command = c:Appsdistbinhadoop.cmd job  -Dmapred.job.tracker=10.119.36.106:9010 -kill job_201209281938_0013

    Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

    2012-09-28 23:27:45,271 Stage-1 map = 0%,  reduce = 0%

    2012-09-28 23:27:58,301 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:27:59,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:00,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:01,332 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:02,347 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:03,363 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:04,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:05,394 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:06,410 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

    2012-09-28 23:28:07,425 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.187 sec

    MapReduce Total cumulative CPU time: 3 seconds 187 msec

    Ended Job = job_201209281938_0013

    MapReduce Jobs Launched:

    Job 0: Map: 1   Accumulative CPU: 3.187 sec   HDFS Read: 31077011 HDFS Write: 38 SUCESS

    Total MapReduce CPU Time Spent: 3 seconds 187 msec

    OK

    230     20080220        2008-02-20      20.3900 0       0       4

    Time taken: 48.465 seconds

    Note that if you run this same query in SSMS you will see an instant response but here in Hadoop it took 48 seconds of total time. No matter how many times you run this query you will see approximately the same execution time because the result set is not cached. This shows that we have used Hadoop in an anti-pattern way – we went after a single row of data. Filtered row sets are a strength of well-indexed relational systems while querying entire, very large, unindexed data sets is a strength of Hadoop. Hive generates MapReduce code and that MapReduce code goes through the same steps each time to find the data, distribute job tasks across the data nodes (map), and then bring the results sets back (reduce). The cumulative CPU time once it actually executes the map phase is still over 3 seconds. I chose this example both to illustrate that point and because it gives you data you are familiar with to ease you into the Hadoop and Hive worlds.

    If you want to remove the metadata from Hive and the data you’ve just loaded from Hadoop HDFS execute these steps from a Hadoop Command Shell:

    hive

    drop table FactProductInventory;

    quit;

    hadoop fs -rmr /demo/demo.dat

     Now you know one way to copy data from SQL Server to Hadoop. Keep exploring Hadoop and keep buzzing about Hive!

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • What’s all the Buzz about Hadoop and Hive?

    What’s all the Buzz about Hadoop and Hive?

    Why it Matters for SQL Server Peeps

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    On September 20, 2012 we have another 24 Hours of PASS event! This PASS Summit Preview will give you a taste of what is coming at this year’s PASS Summit. There are 190+ technical sessions this year at the Summit, and you’ll get a preview of 24 of them at the #24HOP event tomorrow! Come hear about some of the hottest topics and features in the SQL Server, BI, and data world.

    One of the big buzzwords over the last year or so is Hadoop, and the most familiar part of Big Data and Hadoop to most SQL Server professionals is Hive. Do you wonder what it is and why you should jump in now while it’s still new and growing by leaps and bounds? I have just the #24HOP session for you!

    #24HOP: What’s all the Buzz about Hadoop and Hive? – Why it Matters for SQL Server Peeps

    Everyone is buzzing about Hive and trumpeting the virtues of Hadoop. But what does it mean? Why does it matter to a SQL Server and/or BI professional? Come get a taste of the Hive honey and see why this new technology is worth buzzing about!

    During this talk I’ll give a very high level overview of Big Data, Hadoop, and Hive (for the nitty gritty details come to the Summit!). I’ll also go through why Hive matters in the SQL Server world, what a SQL Server Peep might end up doing in a Hive world, and why it is important for you as a SQL Server Peep to jump in and get your feet wet with Hive now.

    Once you’ve heard this #24HOP talk I hope you’ll be fired up about Hive and more anxious than ever to sign up for the  PASS Summit to learn even more about Hadoop, Hive, Big Data, and all things BI and SQL Server. I’ll be co-presenting at the Summit with SQL Server PM Dipti Sangani:

    SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]
    Session Category: Regular Session (75 minutes)
    Session Track: BI Platform Architecture, Development & Administration
    Speaker(s): Cindy Gross, Dipti Sangani

    Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

    After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

    UPDATE 9/28/12 – demo steps to load the AdventureWorks data to Hive are available at http://blogs.msdn.com/b/cindygross/archive/2012/05/07/load-data-from-the-azure-datamarket-to-hadoop-on-azure-small-bites-of-big-data.aspx.

    24HOPFall2012HiveBuzz.zip

  • How to Install the PowerShell Cmdlets for Apache™ Hadoop™-based Services for Windows

    How to Install the PowerShell Cmdlets for Apache™ Hadoop™-based Services for Windows

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED JUNE 2013 – The very early version of PowerShell cmdlets I discussed below have been replaced – see Managing Your HDInsight Cluster with PowerShell http://blogs.msdn.com/b/carlnol/archive/2013/06/07/managing-your-hdinsight-cluster-with-powershell.aspx

    We have a cool new addition to Microsoft’s Apache™ Hadoop™-based Services for Windows – new PowerShell cmdlets! The initial readme.txt is not very detailed so I went through the installation and initial use step by step to help get your started. Enjoy this new way to remotely administer your Hadoop cluster!

    1)      Log in to your Windows client (the location where you plan to do remote administration of your Hadoop cluster) with a local administrator account.

    2)      Download the Hadoop PowerShell Cmdlets zip file from your Apache™ Hadoop™-based Services for Windows cluster: https://YourHadoopCluster.cloudapp.net/Home/Downloads(substitute your actual Hadoop Cluster name for “YourHadoopCluster”).

    3)      Unzip the files to C:WindowsMicrosoft.NETFramework64v4.0.30319 (assuming an x64 OS and an installation on C:).

    4)      Change PowerShell to use .NET 4.0 (by default it uses an older version that does not work with these cmdlets)

    • Open “Windows PowerShell” (choose the x64 version – the one that does not show (x86) after the name).
    • Type $pshome and capture the result such as “C:WindowsSystem32WindowsPowerShellv1.0.”  This directory is for an x64 installation of PowerShell, the SysWOW64 directory is for x86/32bit.
    • Close PowerShell.
    • Open a dos-prompt or Windows Explorer with “Run as administrator” and navigate to the directory you found above (i.e. C:Windowssystem32WindowsPowerShellv1.0)
    • If no powershell.exe.config file exists create an empty file with that name in that directory. Edit powershell.exe.config to add the following section:
      <?xml version=”1.0″?>
      <configuration>
      <startup useLegacyV2RuntimeActivationPolicy=”true”>
        <supportedRuntime version=”v4.0.30319″/>
        <supportedRuntime version=”v2.0.50727″/>
      </startup>
      <
      /configuration>

    5)      Install the SDK cmdlets

    • Open a “Command Prompt” with the “Run as administrator” option. Go to your .NET 4.0 directory which will be something like C:WindowsMicrosoft.NETFramework64v4.0.30319.
    • Run:
      installutil.exe Isotope.Sdk.Cmdlets.dll
    • Review the output to verify there were no errors during the install (look for things like commit instead of rollback and no errors).

    6)      Open a new Windows PowerShell window (make sure you choose the x64 version on an x64 OS) and type 

                Add-PSSnapin IsotopeSdkPSSnapIn

    Verify it completes successfully.

    7)      Type a set of commands in the PowerShell window to verify the cmdlets work. For example set which cluster the other commands apply to (change to your actual names/password), check that the settings are correct, then get a file listing or the user directories:

    Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>
    Get-ClusterSettings
    Get-FileListFromHdfs -HdfsPath hdfs:///user/

    8)      Once the cmdlets are installed and you have verified it is working you can add the SnapIn to the profile so you don’t have to add it every time you open PowerShell. Details on PowerShell profiles can be found here, a summary is below.

    a.       Open PowerShell with “Run as administrator” (required if you are changing the execution policy or creating a new profile)

    b.      Allow scripts you’ve self-created to be run without being signed (or you can choose to sign the ps1 you create later):

    Set-Executionpolicy -ExecutionPolicy RemoteSigned

    c.       See if a Profile already exists: test-path $profile

    d.      If no Profile exists, make a new one that’s available to all users (this assumes x64):

    new-item -path $env:windirsystem32WindowsPowerShellv1.0profile.ps1 -itemtype file -force

    e.      Edit the file you just created. Open Notepad with “Run as administrator”, paste in:

    Add-PSSnapin IsotopeSdkPSSnapIn

    f.        Optionally add the default Hadoop cluster (use your actual cluster name, username, password)

    Set-ClusterSettings -ClusterDns <cluster name> -UserName <username> -Password <password>

    g.       Save as C:Windowssystem32WindowsPowerShellv1.0profile.ps1 (choose “all files” so it does not append “.txt”).

    h.      The next time you open PowerShell you will not have to add the snapin or set your default cluster  you want to manage.

    The full list of cmdlets is in the readme.txt. As of this time the list includes:

    Set-ClusterSettings -ClusterDns <hadoopcluster.cloudapp.net or hadoopcluster> -UserName <username> -Password <password>

    Get-ClusterSettings

    New-HiveJob -Type query -Execute “SELECT * from HiveSampleTable limit 10;” [-Define <# delimited key=value pairs>]

    New-MapReduceJarJob -Jar c:appsdisthadoop-examples-1.1.0-SNAPSHOT.jar -Arguments “pi 16 100” [-Define <# delimited key=value pairs>]

    New-MapReduceStreamingJob -Input “/example/data/gutenberg/davinci.txt” -Output “/example/data/streamingoutput/wc.txt” -Mapper cat.exe -Reducer wc.exe -File “hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe” [-Define <# delimited key=value pairs>]

    Get-JobStatus -JobId <isotope-jobId>

    Get-FileListFromHdfs -HdfsPath hdfs:///user/

    Copy-File -LocalPath <file> -HdfsPath hdfs:///<path>

    Get-JobHistoryList -NumberOfItems 2

    Get-JobHistory -JobId <isotope-jobId>

    Get-JobHistoryCount

    Set-AsvCredentials -AccountName <accountname> -AccountKey <accountkey>

    Get-AsvCredentials

    Remove-AsvCredentials

    Get-S3Credentials

    Set-S3Credentials -AccessKeyId <accesskeyid> -AccessKey <accesskey> -IsS3N <0|1>

    Remove-S3Credentials

     

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • Load Data from the Azure DataMarket to Hadoop on Azure – Small Bites of Big Data

    Load Data from the Azure DataMarket to Hadoop on Azure

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced by HDInsight Preview which has a different interface and different functionality. Getting Started with Windows Azure HDInsight Service http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/

     The first step in analyzing data is acquiring that data. You may have data available from logs, databases, or other internal sources. Or you may need to export data from an external source. In Hadoop on Azure we make it easy to load data from data sources such as the Azure DataMarket, Amazon S3, and the Windows Azure Blog Storage (ASV).

    Preparing Hadoop On Azure for a data load

    Find your data source

    • Find data on the Marketplace that meets your needs. For example, I looked for free demographic data.

         

    • Some of the data marked as free has limits on how much is available without payment. I subscribed to the “Gender Info 2007 – United Nations Statistics Division” because it was free, had a significant amount of data, had unlimited transactions/month, and is easily understandable.

     

    • You can click through on the data link for more information on the source, columns, and other details.

     

    • When you click on the “use” button you get a screen that lets you query the data. You will use information from this screen to load the data into Hadoop.

    • Leave this browser window open and stay logged in to the DataMarket.

    Load the Data

    • Log in to your cluster onHadoopOnAzure.com and click on the Manage Cluster tile.

    • Click on the DataMarket tile

    • Go to the DataMarket page for your data.
      • For the gender data, you need to change the “Query” value under “BUILD QUERY” from “DataSeries” to “Values”.
      • Next to “Primary Account Key” click on “Show”. This will expose a key value. Cut/paste that value into the “passkey” field of “Import from Data Market” on HadoopOnAzure.
      • Copy the “URL for current expressed query” into the “Query” field on HadoopOnAzure. Erase the end of the query ($top=100) that limits the results to the first 100 rows. For the gender data the query is https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values?$top=100, change that to https://api.datamarket.azure.com/Data.ashx/UnitedNations/GenderInfo2007/Values in the query field.
      • In HadoopOnAzure enter your DataMarket user name in the “User name” field and a unique new table name in the “Hive table name” field. If you the name of a table that already exists in this Hadoop cluster you will get an error.
      • Click on the “Import Data” button to start the data load.
      • You can now leave the screen and let the data load complete in the background.

     

    • You can check on the status of the data load by clicking on the “Job History” tile.

    • You can click through on the job to get more details.

    • Once the data load is complete, you can query the table from the Interactive Console.

    • It opens to the JavaScript console, you will need to click on the Hive button to get to the “Interactive Hive” screen.

    • Enter a SQL query in the white box at the bottom and then hit “Evaluate”. The HiveQL or HQL language is based on standard SQL and is closer to the MySQL implementation than to the SQL Server TSQL implementation. Most of the syntax is the same as TSQL but there are some differences. For example, execute: select * from GenderInfoValuesAll where year = 2000 order by countryname limit 200;. This looks like TSQL except you use limit instead of top.

    Now your data is loaded and you’ve done a basic test to verify the data looks correct. You’re ready to start using Hive on Hadoop!

    Other Microsoft Hadoop data loading links:

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly. Screenshots are from the May 2012 version of http://HadoopOnAzure.com.

  • 24HOP/SQLRally – Fitting Microsoft Hadoop Into Your Enterprise BI Strategy

    24HOP/SQLRally – Fitting Microsoft Hadoop

    Into Your Enterprise BI Strategy

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    The world of #bigdata and in particular #Hadoop is going mainstream. At 24HOP 2012 I talked about how a SQL Server professional fits into this big data world.

    Hadoop generally falls into the NOSQL realm. NOSQL uses basically available, soft-state, eventual consistency (BASE) instead of requiring immediate, 100% consistency via ACID properties. This allows certain types of queries/operations to be much faster since locking and consistency rules don’t have to be followed. This works well when any given program/query looks at the entire dataset and does not try to filter the data or query a subset of it. Basically with Hadoop you take data that may not have a known, consistent structure and impose a structure on it for the duration of your MapReduce program (query). This is all based on the concept of scaling out rather than up, with new commodity (low end enterprise server) hardware being added to an existing Hadoop cluster to meet dynamic demands. With Hadoop you are spreading hardware costs out over a longer time, possibly paying less overall for hardware, and shifting your IT costs to different departments and parts of the lifecycle.

    Hive is a database which sits on top of Hadoop’s HDFS (Hadoop Distributed File System). The data stays in HDFS but Hive stores metadata about the imposed structure and may store intermediate files. HiveQL looks a lot like TSQL and like most SQL languages makes an effort to align with the SQL standard when it makes sense. HiveQL will generate (usually multiple) MapReduce jobs to take an entire large result set and allow easier use of filters, joins, and aggregates. It’s still going to be slower than a relational database for most filters and joins but it allows access to data that may be too expensive (for whatever reason) to put into a relational or multi-dimensional database at this point in the lifecycle.

    You may keep your source data outside HDFS and bring it in only for the duration of a project. You can sqoop data to/from a relational database (there is a sqoop driver for SQL Server and PDW) or pull data from blob stores like Windows Azure and Amazon S3, use secure FTP, query OData sources like the Windows Azure DataMarket. The Hive ODBC driver, often via the Excel Hive Add-in, and sqoop can be used to pull data from Hadoop into other tools or systems.

    So far I’ve been talking as if big data = Hadoop. However, that’s not really true. Big data is data that is too big, expensive, or complex to process using your existing environment (often traditional RDBMSs with expensive SANs and high end hardware). While MapReduce, most commonly via Hadoop, is a very common solution it’s not the only one. There are streaming technologies like StreamInsight and HStreaming, machine learning like Mahout, massively parallel processing databases like PDW, and more. Generally big data at least starts out as unstructured or semi-structure, or perhaps of variable structure. It may flow very quickly with the need to process large amounts of data in a very small time window where decisions have to be made in real time.

    At its core Hadoop has the file system HDFS which sits on top of the Windows or Linux file system and allows data to be mapped over many nodes in a Hadoop cluster. A head node maintains metadata about where each piece of data resides. MapReduce programs “map” the data to the many nodes then reduce the output based on the required logic, resulting in an output file that is the end result of the program/query. Other pieces of the Hadoop ecosystem may include things like Hive, HBase, HCatalog, Lucene, Mahout, Zookeeper and more.

    So when would you use Hadoop? It’s good for exploring data, finding out what you don’t know. If you need to know your unknown unknowns or look for unexpected correlations, Hadoop may be what you need. It allows you to be very flexible and not lock yourself into a rigid schema until after you’ve fully explored your options. Some examples of common use cases are risk modeling, machine learning, cyber security, sentiment analysis, recommendation engines, log analysis, and sensor data. However, don’t think big data is the solution to all your problems. Often it is used to find and define the most useful bits of data or find the most useful algorithms. Then a more traditional design process may be used to create an RDBMS or multiple dimensional system for faster querying during day to day operations. Think of Hadoop as the foundation for a very specialized type of database that meets very specific needs, not as a replacement for relational. It’s important to note that every VLDB is not a fit for big data. “Big” is somewhat of a misnomer, size is only one of the characteristics and is relative to what your environment is prepared to handle.

    Often you ask about the 4 “Vs” when deciding whether to use Hadoop – volume, velocity, variety, variability. Think about big data when you describe a problem with terms like tame the chaos, reduce the complexity, explore, I don’t know what I don’t know, unknown unknowns, unstructured, changing quickly, too much for what my environment can handle now, or unused data.

    • Volume = More data than the current environment can handle with vertical scaling, need to make sure of data that it is currently too expensive to use
    • Velocity = Small decision window compared to data change rate, ask how quickly you need to analyze and how quickly data arrives
    • Variety = Many different formats that are expensive to integrate, probably from many data sources/feeds
    • Variability = Many possible interpretations of the data

    Microsoft is taking the existing Apache Hadoop code and making sure it runs on Windows. We have checked back in HDFS, MapReduce, and Hive code to the Apache open source community. More is on the way. We are adding things like the Excel add-in for the Hive ODBC driver, JavaScript, cloud (http://HadoopOnAzure.com), C# samples, etc. Where Microsoft is often seen as adding value is making Hadoop more available and allowing the reuse of existing skill sets. We offer self service, visualization, easier data movement, elasticity, and familiarity for faster completion of projects. This allows data scientists (people who know the business and statistics first and tech 2nd), data teams, information workers, and anyone seeking insights through BI to do their job better and faster.

    We offer visualization through PowerPivot, Power View, and the Excel Hive ODBC Add-in. For many used to Hadoop on Linux these things are new and exciting, even though they seem familiar to many in the SQL Server world.

    In the end, big data is really all about discovering insights and then taking action. If you don’t act on what you find, why did you spend all that effort mining the data? Sometimes the action is to go back and refine your process or change the way you looked at the data, but the end goal should be to make decisions that impact your business. Big data should add value to the business proposition.

    References

    My full presentation from #24HOP can be downloaded below. There are notes on many of the slides with further details. The recording from 24 Hours of PASS as soon as it is available at http://www.sqlpass.org/LearningCenter/SessionRecordings/24HoursSpring2012.aspx. Download their player then choose the session by title – “Fitting Microsoft Hadoop into your Enterprise BI
    Strategy
    (Cindy Gross)”.

    I updated the attachment to include the updated slides from my May 11, 2012 talk at #SQLRally. There are few changes other than the background.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    MSHadoopEntBIStrategy24HOPandSQLRally2012.zip

  • Using the Pi Estimator Sample on HadoopOnAzure CTP – Small Bites of Big Data

    Using the Pi Estimator Sample on HadoopOnAzure CTP

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced with HDInsight Preview. See: How to Run the HDInsight Samples http://www.windowsazure.com/en-us/manage/services/hdinsight/howto-run-samples/

     Now that you have created your Hadoop on Azure cluster you can run the sample programs to become familiar with the interface. Click on the “Samples” tile under “Manage your account”.

    The gallery of available samples is growing rapidly.

    We’ll start with the very simple Pi Estimator sample. When you click on the Pi tile you’ll see some information about the sample (scroll down to see more). You can download the files and review them in as much detail as you want. You can review the PiEstimator.java file in notepad. The .jar file is a zip file that contains many other files. You can unzip it with many compression utilities then explore the files at your convenience. The description tells us the first parameter indicates how many maps to create (default of 16) and the second indicates how many samples per map (10 million by default). For now, we’ll take advantage of the Azure portal’s simplification of the complexities of Hadoop and click on “Deploy to your cluster”.

    This brings up a screen to create the Hadoop job. You can modify the job name and parameters if you like. The “final command”, in this case “Hadoop jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 16 10000000”, can be copied to the command line if you choose to go execute the job at the command line later.

    Click on “Execute job”.  It may run for a minute or two, or perhaps longer if the CTP system is being heavily used. As it progresses entries are added to the “Output (stdout)” and “Debug Output (stderr)” sections. Eventually you will see “Status: Completed Successfully” under “Job Info”. I highlighted the runtime (“Job Finished in 66.123 seconds”) and output (“Estimated value of Pi is 3.14159155000000000000”).

     

    You can see that there were 16 maps created because that’s the first parameter we passed in to the jar command. If I change it to 8 maps and 1000 samples per map the command is now generated as “call hadoop.cmd jar hadoop-examples-0.20.203.1-SNAPSHOT.jar pi 8 1000” and the output has fewer significant digits even though the runtime is nearly the same:

    Job Finished in 57.061 seconds

    Estimated value of Pi is 3.14100000000000000000

    When you go back to the main portal screen you now see the “Pi Example” tile. If the job is currently running you will see “…in progress…” Once it finishes the tile will show “Complete”.

         

    To view job results, click on the “Job History” tile under “Manage your account”.

    If you click on any history row it will take you to the same output you see if you are looking at the job interactively.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

  • Open Ports for HadoopOnAzure CTP – Small Bites of Big Data

    Open Ports for HadoopOnAzure CTP

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    UPDATED Jun 2013: HadoopOnAzure CTP has been replaced by HDInsight Preview. See Troubleshooting ODBC connectivity to HDInsight http://social.msdn.microsoft.com/Forums/en-US/hdinsight/thread/b4ca52ea-f7cf-420c-959d-53e09f801f7d.      

    Once you have created your Hadoop on Azure cluster you will likely be moving data in and out of the system. That means you need to open the ports in Azure. By default the two ports exposed through the Metro interface are both locked. The error when you try to use an ODBC connection to the cluster when the ODBC Server port is closed will include the words “Could not connect client socket”.

    Click on the “Open Ports” tile to open the “Configure Ports” dialog.


    To do Hive or other ODBC queries, open the “ODBC Server” port 10000. If you plan to use FTP open the FTPS port 2226.


    Click on the “Windows Azure” tile at the top to go back to the main portal screen. If you need to open other, less commonly used ports you can use the Remote Desktop icon to connect directly to the VM and make the changes on the server.


    I will cover how to connect via Remote Desktop in a separate post. Once there use whatever firewall or other port software is installed to open the needed ports.

    I hope you’ve enjoyed this small bite of big data! Look for more big data blog posts soon.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.