
  • Get HDInsight Properties with PowerShell

    Small Bites of Big Data from AzureCAT

    You’ve created your HDInsight Hadoop clusters and now you want to know exactly what you have out there in Azure. Maybe you want to pull the key information into a repository periodically as a reference for future troubleshooting, comparisons, or billing. Maybe you just need to get a quick look at your overall HDInsight usage. This is something you can easily automate with PowerShell.

    Environment

    First, open Windows Azure PowerShell or powershell_ise.exe.

    Set some values for your environment:

    $SubName = "YourSubscriptionName"
    Select-AzureSubscription -SubscriptionName $SubName
    Get-AzureSubscription -Current
    $ClusterName = "HDInsightClusterName" #HDInsight cluster name

    HDInsight Usage for the Subscription

    Take a look at your overall HDInsight usage for this subscription:

    Get-AzureHDInsightProperties

    Get-AzureHDInsightProperties returns the number of clusters for this subscription, the total HDInsight cores used and available (for head nodes and data nodes), the Azure regions where HDInsight clusters can be created, and the HDInsight versions available for new clusters:

    ClusterCount    : 2
    CoresAvailable  : 122
    CoresUsed       : 48
    Locations       : {East US, North Europe, Southeast Asia, West Europe...}
    MaxCoresAllowed : 170
    Versions        : {1.6, 2.1, 3.0}

    You can also pick out specific pieces of information and write them to a file, store them as variables, or use them elsewhere. This example simply outputs the values to the screen.

    write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
    write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
    write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed
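    If you want to capture these values in a repository over time, as mentioned in the introduction, you can append a dated snapshot to a file instead of writing to the screen. A minimal sketch, assuming an example output path (requires PowerShell 3.0+ for Export-Csv -Append):

    ```powershell
    # Hypothetical example: append a dated snapshot of subscription-level HDInsight usage to a CSV
    $props = Get-AzureHDInsightProperties
    $snapshot = New-Object PSObject -Property @{
        Date            = Get-Date
        ClusterCount    = $props.ClusterCount
        CoresUsed       = $props.CoresUsed
        CoresAvailable  = $props.CoresAvailable
        MaxCoresAllowed = $props.MaxCoresAllowed
    }
    # C:\HDInsightUsage.csv is an example path; point this at your own repository location
    $snapshot | Export-Csv -Path 'C:\HDInsightUsage.csv' -Append -NoTypeInformation
    ```

    Run this on a schedule and you have a simple usage history to compare against later.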

    HDInsight Cluster Information

    Get-AzureHDInsightCluster provides information about all existing HDInsight clusters for this subscription:

    Get-AzureHDInsightCluster

    As you can see, this cmdlet tells you the size, connection information, and version of each cluster:
    ClusterSizeInNodes    : 4
    ConnectionUrl         : https://BigCAT.azurehdinsight.net
    CreateDate            : 4/5/2014 3:37:23 PM
    DefaultStorageAccount : sqlcatwomanwestus.blob.core.windows.net
    HttpUserName          : Admin
    Location              : West US
    Name                  : BigCAT30
    State                 : Running
    StorageAccounts       : {}
    SubscriptionId        : {YourSubID}
    UserName              : Admin
    Version               : 3.0.0.0.661685
    VersionStatus         : Compatible

    ClusterSizeInNodes    : 4
    ConnectionUrl         : https://cgrosstest.azurehdinsight.net
    CreateDate            : 5/5/2014 6:09:58 PM
    DefaultStorageAccount : cgrosstest.blob.core.windows.net
    HttpUserName          : Admin
    Location              : West US
    Name                  : cgrosstest
    State                 : Running
    StorageAccounts       : {sqlcatwomanwestus.blob.core.windows.net}
    SubscriptionId        : {YourSubID}
    UserName              : Admin
    Version               : 3.0.2.0.727283
    VersionStatus         : Compatible

    You can also get information about just one HDInsight cluster at a time:

    Get-AzureHDInsightCluster -Name $ClusterName

    Or you can get very granular and look at specific properties, even some that aren’t in the default values:

    write-host '== Default Storage Account:     ' `
    (Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
    write-host '== Default Container:           ' `
    (Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName

    This information is a valuable reference for tracking past configurations, current usage, and capacity planning. Enjoy your Hadooping!
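    For the periodic repository idea from the introduction, you can also loop over every cluster in the subscription and persist its key properties in one pass. A hedged sketch, using the properties shown in the output above (the CSV path is an example; Export-Csv -Append requires PowerShell 3.0+):

    ```powershell
    # Hypothetical example: snapshot key properties of every HDInsight cluster to a CSV
    Get-AzureHDInsightCluster | ForEach-Object {
        New-Object PSObject -Property @{
            SnapshotDate  = Get-Date
            Name          = $_.Name
            State         = $_.State
            Version       = $_.Version
            Location      = $_.Location
            NodeCount     = $_.ClusterSizeInNodes
            ConnectionUrl = $_.ConnectionUrl
        }
    } | Export-Csv -Path 'C:\HDInsightClusters.csv' -Append -NoTypeInformation
    ```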

    Sample Script

    # Cindy Gross 2014
    # Get HDInsight properties
    $SubName = "YourSubscriptionName"
    Select-AzureSubscription -SubscriptionName $SubName
    Get-AzureSubscription -Current
    $ClusterName = "YourHDInsightClusterName" #HDInsight cluster name

    Get-AzureHDInsightProperties
    Get-AzureHDInsightCluster
    Get-AzureHDInsightCluster -Name $ClusterName
    write-host '== Default Storage Account:     ' `
    (Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
    write-host '== Default Container:           ' `
    (Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName
    write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
    write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
    write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed
  • Sample PowerShell Script: HDInsight Custom Create

    This is a working script I use to create various HDInsight clusters. For a truly reproducible, automated environment you would want to put this into a .ps1 script that accepts parameters (see here for an example). However, you may find the method below good for learning and experimenting. Replace all the “YOURxyz” sections with your actual information. Beware of oddities introduced by cut/paste, such as spaces being replaced by line breaks or quotes being replaced by smart quotes. The # marks a comment; some commands that you rarely run are commented out, so remove the # if you need to run them.

    # This PowerShell script is meant to be a cut/paste of specific parts, it is NOT designed to be run as a whole.

    # Do once after you install the cmdlets
    #Get-AzurePublishSettingsFile
    #Import-AzurePublishSettingsFile C:\Users\YOURDirectory\Downloads\YOURName-credentials.publishsettings

    # Use if you admin more than one subscription
    #Get-AzureAccount # This may be needed to log in to Azure
    Select-AzureSubscription -SubscriptionName YOURSubscription
    Get-AzureSubscription -Current

    # Many things are easier in the ISE
    ise

    ###############################################
    ### create clusters ###
    ###############################################

    # Add your specific information here
    # Previous failures may make a name unavailable for a while – check to see if previous cluster was partially created
    $ClusterName = "YOURNewHDInsightClusterName" #the name you will give to your cluster
    $Location = "YOURDataCenter" #cluster data center must be East US, West US, or North Europe (as of December 2013)
    $NumOfNodes = 1 #start small
    $StorageAcct1 = "YOURExistingStorageAccountName" #currently must be in same data center as the cluster
    $DefaultContainer = "YOURExistingContainerName" #already exists on the storage account

    # These variables are automatically set for you
    $FullStorage1 = "${StorageAcct1}.blob.core.windows.net"
    $Key1 = Get-AzureStorageKey $StorageAcct1 | %{ $_.Primary }
    $SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
    $SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
    $Cert = Get-AzureSubscription -Current | %{ $_.Certificate }
    $Creds = Get-Credential -Message "New admin account to be created for your HDInsight cluster" #this prompts you

    ###############################################
    # Sample quick create
    ###############################################
    # Equivalent of quick create
    # The ` specifies that the cmd continues on the next line; beware of artificial line breaks added during cut/paste from the blog
    New-AzureHDInsightCluster -Name $ClusterName -ClusterSizeInNodes $NumOfNodes -Subscription $SubID -Location "$Location" `
    -DefaultStorageAccountName $FullStorage1 -DefaultStorageAccountKey $Key1 -DefaultStorageContainerName $DefaultContainer -Credential $Creds

    ###############################################
    # Sample custom create
    ###############################################
    #https://hadoopsdk.codeplex.com/wikipage?title=PowerShell%20Cmdlets%20for%20Cluster%20Management
    # Most params are the same as quick create, use a new cluster name
    # Pass in a 2nd storage account, a SQLAzure db for the metastore (assume same db for Oozie and Hive), add Avro library, some config values
    # Execute all the variable settings from above

    # This value is set for you, don't change!
    $configvalues = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration'

    # Add your specific information here
    $ClusterName = "YOURNewHDInsightClusterName"
    $StorageAcct2 = "YOURExistingStorageAccountName2"
    $MetastoreAzureSQLDBName = "YOURExistingSQLAzureDBName"
    $MetastoreAzureServerName = "YOURExistingSQLAzureServer.database.windows.net" #gives a DNS error if you don't use the full name
    $configvalues.Configuration = @{ "hive.exec.compress.output"="true" }  #this is an example of a config value you may pass in

    # These variables are automatically set for you
    $FullStorage2 = "${StorageAcct2}.blob.core.windows.net"
    $Key2 = Get-AzureStorageKey $StorageAcct2 | %{ $_.Primary }
    $MetastoreCreds = Get-Credential -Message "existing id/password for your SQL Azure DB (metastore)" #This prompts for the existing id and password of your existing SQL Azure DB

    # Add a config file value
    # Add AVRO SerDe libraries for Hive (on storage 1)
    $configvalues.AdditionalLibraries = new-object 'Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount'
    $configvalues.AdditionalLibraries.StorageAccountName = $FullStorage1
    $configvalues.AdditionalLibraries.StorageAccountKey = $Key1
    $configvalues.AdditionalLibraries.StorageContainerName = "hivelibs" #container called hivelibs must exist on specified storage account
    # Create custom cluster
    New-AzureHDInsightClusterConfig -ClusterSizeInNodes $NumOfNodes `
    | Set-AzureHDInsightDefaultStorage -StorageAccountName $FullStorage1 -StorageAccountKey $Key1 -StorageContainerName $DefaultContainer `
    | Add-AzureHDInsightStorage -StorageAccountName $FullStorage2 -StorageAccountKey $Key2 `
    | Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType OozieMetastore `
    | Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType HiveMetastore `
    | Add-AzureHDInsightConfigValues -Hive $configvalues `
    | New-AzureHDInsightCluster -Subscription $SubID -Location "$Location" -Name $ClusterName -Credential $Creds

    ###############################################
    # get status, properties, etc.
    ###############################################
    #$SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
    Get-AzureHDInsightProperties -Subscription $SubName
    Get-AzureHDInsightCluster -Subscription $SubName
    Get-AzureHDInsightCluster -Subscription $SubName -name YOURClusterName

    ###############################################
    # remove cluster
    ###############################################
    #Remove-AzureHDInsightCluster -Name $ClusterName -Subscription $SubName

  • HDInsight: Hive Internal and External Tables Intro

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    HDInsight is Microsoft’s distribution of Hadoop, in partnership with Hortonworks. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

    Use EXTERNAL tables when:

    • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
    • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
    • You want to use a custom location such as ASV.
    • Hive should not own the data or control settings, directories, etc.; another program or process will handle those things.
    • You are not creating a table based on an existing table (CREATE TABLE ... AS SELECT).

    Use INTERNAL tables when:

    • The data is temporary.
    • You want Hive to completely manage the lifecycle of the table and data.

    We’ll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively, you can simply open Notepad and create your own files with a series of single-column rows. If you create your own files, make sure you have a carriage return/line feed at the end of every row, including the last one. The files should be in a Windows directory called c:\data on the HDInsight head node. For HDInsight Server (on-premises) that’s the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop (RDP) connection to the head node from the Hadoop portal.
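    If you’d rather script the demo files than type them in Notepad, something like the following sketch (run on the head node; the row values are a subset of the attached demo data) writes each file with the required carriage return/line feed after every row, including the last:

    ```powershell
    # Hypothetical example: create the c:\data demo files;
    # Set-Content writes each array element as its own CR/LF-terminated line
    New-Item -ItemType Directory -Path 'c:\data' -Force | Out-Null
    Set-Content -Path 'c:\data\bacon.txt'  -Value @('HDInsight_Bacon','SQL_Bacon','PASS_bacon')
    Set-Content -Path 'c:\data\bacon2.txt' -Value @('More_BaCoN','AndEvenMore_bAcOn')
    ```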

    Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

    Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services, as of February 2013, you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward, I will show you how to do all the steps through the Hive CLI (command line interface).

    Open a Hadoop Command Prompt.

    Change to the Hive directory (necessary in early preview builds of Hive):

    cd %hive_home%\bin

    Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

    hadoop fs -put c:\data\bacon.txt /user/demo/food/bacon.txt

    hadoop fs -lsr /user/demo/food

    The put command doesn’t return a result; the list command returns one row per file or subdirectory/file:

    -rw-r--r--   1 cgross supergroup        124 2013-02-05 22:41 /user/demo/food/bacon.txt

    Enter the Hive CLI (command line interface):

    hive

    Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator, no result is returned from this set command):

    set hive.cli.print.header=true;

    Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

    CREATE INTERNAL TABLE internal1 (col1 string) LOCATION '/user/demo/food';

    Oops… that failed because INTERNAL isn’t a keyword; it is the absence of EXTERNAL that makes a table managed, or internal.

    FAILED: Parse Error: line 1:7 Failed to recognize predicate 'INTERNAL'.

    So let’s create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

    CREATE TABLE internal1 (col1 string) LOCATION '/user/demo/food';

    That will return the time taken but no other result. Now let’s look at the schema that was created. Note that the table type is MANAGED_TABLE.

    DESCRIBE FORMATTED internal1;

    col_name        data_type       comment
    # col_name              data_type               comment

    col1                    string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 22:45:57 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             MANAGED_TABLE
    Table Parameters:
    transient_lastDdlTime   1360133157

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    And now look at some rows:

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon

    What happens if we don’t specify a directory for an INTERNAL table?

    CREATE TABLE internaldefault (col1 string);

    It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):

    dfs -lsr /hive/warehouse;

    We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory:
    drwxr-xr-x   - cgross supergroup          0 2013-02-05 22:52 /hive/warehouse/internaldefault

    However, we won’t use this table for the rest of the demo so let’s drop it to avoid confusion. The drop also removes the subdirectory.

    DROP TABLE internaldefault;

    dfs -lsr /hive/warehouse;

    Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let’s add a 2nd file to the first internal table and check that it exists:

    dfs -put c:\data\bacon2.txt /user/demo/food/bacon2.txt;

    dfs -lsr /user/demo/food;

    -rw-r--r--   1 cgross supergroup        124 2013-02-05 23:04 /user/demo/food/bacon.txt
    -rw-r--r--   1 cgross supergroup         31 2013-02-05 23:03 /user/demo/food/bacon2.txt

    Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output – there is no row value of col1 in the data as headers are not generally included in Hadoop data):

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    Now let’s create an EXTERNAL table that points to the same directory and look at the schema:

    CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION '/user/demo/food';

    DESCRIBE FORMATTED external1;

    col_name        data_type       comment
    # col_name              data_type               comment

    cole1                   string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 23:07:12 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             EXTERNAL_TABLE

    Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1360134432

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

    Now look at the data:

    SELECT * FROM external1;

    The result set is a combination of the two bacon files:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    That table returns the same data as the first table – we have two tables pointing at the same data set! We can add another one if we want:

    CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION '/user/demo/food';

    DESCRIBE FORMATTED external2;

    SELECT * FROM external2;

    You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

    Add another data file to the same directory and see how it’s visible to all the tables that point to that directory:

    dfs -put c:\data\veggies.txt /user/demo/food/veggies.txt;

    SELECT * FROM internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    Each table will return the same results:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn
    SQL_Apple
    NoSQL_Pear
    SQLFamily_Kiwi
    Summit_Mango
    HDInsight_Watermelon
    SQLSat_Strawberries
    Raspberrylimelemonorangecherryblueberry 123 456

    Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

    DROP TABLE internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    dfs -lsr /user/demo/food;

    Result: lsr: Cannot access /user/demo/food: No such file or directory.

    Because the INTERNAL (managed) table is under Hive’s control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

    Clean up the demo tables and directory:

    DROP TABLE external1;

    DROP TABLE external2;

    exit;

    This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

    bacon.zip