Befriending Dragons

Transform Tech with Anti-bullying Cultures


Why WASB Makes Hadoop on Azure So Very Cool

Rescue dogData. It’s all about the data. We want to make more data driven decisions. We want to keep more data so we can make better decisions. We want that data stored cheaply, easily accessible, and quickly ingested. Hadoop promises to help with all those things. However, when you deal with Hadoop on-premises you have a multi-step process to load the data. Azure and WASB to the rescue!

With a typical Hadoop installation you load your data to a staging location then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back to a non-HDFS system (a SAN, a file share, a website).

What’s different in the cloud? With Azure you have Azure Blob Storage Accounts. Data can be stored there as blobs in any format. That data can be accessed by various applications – including Hadoop without first doing a separate load into HDFS! This is made possible because Microsoft used the public extensions available with HDFS to create the Windows Azure Storage Blobs (WASB) interface between Hadoop and the Azure blob storage. This WASB code is available for any distributor of Hadoop in the Apache source code and it is the default storage system in HDInsight – Microsoft’s Hadoop on Azure PaaS offering. It is also available for Hortonworks HDP on Azure VMs or Cloudera EDH/CDH on Azure VMs with some manual configuration steps.

With WASB you load your data to Azure blobs at any time – whether Hadoop clusters currently exist or not. That way you aren’t paying for Hadoop compute time simply to load data. You spin up one or more clusters, point them at the data sets (yes, multiple clusters pointing to same data!), and run your Hadoop jobs. When you don’t need the system for a while you take down your Hadoop cluster(s) and the data is still there. At any point, whether one or more Hadoop clusters are accessing the data or not, other applications can still access and manipulate the data. For example, you could have data sitting on an Azure storage account that is being added to by a SQL Server Integration Services (SSIS) job. At the same time someone is using Power Query to load that data into PowerPivot while a website inserts new data to the same location. Meanwhile your R&D department can be running highly intensive jobs that require a large cluster up for many days or weeks at a time, and your sales team can have a separate, smaller cluster that’s up for a few hours a day – all pointing at the same data!

With this separation of storage and compute you have simplified your data accessibility, reduced data movement and copies, and reduced the time it takes to have your data available! That all adds up to lower costs and a faster, more data-driven time to insight.

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | |


Azure Maximums and Resource Usage from PowerShell

Technorati Tags: ,

Have you ever struggled to find out how many VM cores, HDInsight cores, storage accounts, or other Azure resources your subscription is set to allow or how many you actually use? Maybe you want to use this information in your automation scripts to avoid trying to create components for which you don’t have resources.

quizzical owl

PowerShell to the rescue!

First a couple of key points. There are various maximums in Azure. Today we are talking about finding the currently configured maximums allowed for a specified subscription. There are default maximums (default limit) which you can increase for a given subscription by opening a billing support ticket. There are also hard maximums (maximum limit). However, with some products, such as HDInsight (Hadoop), you can get past some per-subscription maximums for dependent services by combining resources (storage accounts) from multiple subscriptions for a single HDInsight cluster. All the samples below find the current billing quota limitation and actual usage for the current subscription.

Let’s take a look at the information available on the subscription level cmdlet.

Start by checking which subscription is in focus / current for the PowerShell session.

(Get-AzureSubscription -Current).SubscriptionName

(Get-AzureSubscription -Current).CurrentStorageAccountName

If you need information on a different subscription either pass the subscription name (as defined on your client) for the cmdlets that support this or change the focus to a different subscription.

$SubName = “sqlcatwoman”

Select-AzureSubscription -SubscriptionName $SubName

Now we will look at the cores available for Azure virtual machines (VMs / IaaS). Note that HDInsight cores are tracked separately. Be careful with unexpected line wraps that may paste into your PowerShell window (or ISE) incorrectly. The below snippet is 1 comment line and 4 lines of code.

# How many cores are available to create new VMs (or increase size of existing VMs) for the current subscription?

[int]$maxVMCores     = (Get-AzureSubscription -current -ExtendedDetails).maxcorecount

[int]$currentVMCores = (Get-AzureSubscription -current -ExtendedDetails).currentcorecount

[int]$availableCores = $maxVMCores $currentVMCores

Write-Host “Cores available for VMs:” $availableCores

We can get similar information about cloud services:

#how many cloud (hosted) services are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxHostedServices

[int]$currentUsed    = (Get-AzureSubscription -current -ExtendedDetails).CurrentHostedServices

[int]$availableNow   = $maxAvl $currentUsed

Write-Host “Cloud services available:” $availableNow

Some limits and usage are available on cmdlets specific to a particular technology. For example, the HDInsight usage and maximums are available from the Get-AzureHDInsightProperties cmdlet. You can find details and samples on Get HDInsight Properties with PowerShell.

Other times we have to look at different cmdlets for different pieces of the information, such as for storage accounts:

#how many storage accounts are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxStorageAccounts

[int]$currentUsed    = (Get-AzureStorageAccount).Count

[int]$availableNow   = $maxAvl $currentUsed

Write-Host “Storage Accounts available:” $availableNow

We can look at all the extended properties available for a subscription:happy owl

Get-AzureSubscription -currentExtendedDetails

If you know you have a particular component created and this cmdlet shows the “Current” value is zero, take a look at the Get-Azure… cmdlet for that particular type of resource and look for a “Current” value.

Another handy thing to look at is the overall information about what Azure regions exist and what services are available in each region:


And you can pull off specific information:

Get-AzureLocation  | Select DisplayName

I hope these small bites of PowerShell help save the day for you in some way!

Leave a comment

Get HDInsight Properties with PowerShell

Small Bites of Big Data from AzureCAT

You’ve created your HDInsight Hadoop clusters and now you want to know exactly what you have out there in Azure. Maybe you want to pull the key information into a repository periodically as a reference for future troubleshooting, comparisons, or billing. Maybe you just need to get a quick look at your overall HDInsight usage. This is something you can easily automate with PowerShell.


First, open Windows Azure PowerShell or powershell_ise.exe.

Set some values for your environment:

$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName = "HDInsightClusterName" #HDInsight cluster name

HDInsight Usage for the Subscription

Take a look at your overall HDInsight usage for this subscription:


Get-AzureHDInsightProperties returns the number of clusters for this subscription, the total HDInsight cores used and available (for head nodes and data nodes), the Azure regions where HDInsight clusters can be created, and the HDInsight versions available for new clusters:

ClusterCount    : 2
CoresAvailable  : 122
CoresUsed       : 48
Locations       : {East US, North Europe, Southeast Asia, West Europe...}
MaxCoresAllowed : 170
Versions        : {1.6, 2.1, 3.0}

You can also pick out specific pieces of information and write them to a file, store them as variables, or use them elsewhere. This example simply outputs the values to the screen.

write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed

HDInsight Cluster Information

Get-AzureHDInsightCluster provides information about all existing HDInsight clusters for this subscription:

As you can see this cmdlet tells you the size, connection information, and version.
ClusterSizeInNodes    : 4
ConnectionUrl         :
CreateDate            : 4/5/2014 3:37:23 PM
DefaultStorageAccount :
HttpUserName          : Admin
Location              : West US
Name                  : BigCAT30
State                 : Running
StorageAccounts       : {}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               :
VersionStatus         : Compatible

ClusterSizeInNodes    : 4
ConnectionUrl         :
CreateDate            : 5/5/2014 6:09:58 PM
DefaultStorageAccount :
HttpUserName          : Admin
Location              : West US
Name                  : cgrosstest
State                 : Running
StorageAccounts       : {}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               :
VersionStatus         : Compatible

You can also get information about just one HDInsight cluster at a time:

Get-AzureHDInsightCluster  -name $ClusterName

Or you can get very granular and look at specific properties, even some that aren’t in the default values:

write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName

This information will be a valuable source of information for tracking past configurations, current usage, and planning. Enjoy your Hadooping!

Sample Script

# Cindy Gross 2014
# Get HDInsight properties
$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName        = "YourHDInsightClusterName" #HDInsight cluster name

Get-AzureHDInsightCluster  -name $ClusterName
write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName
write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed

Leave a comment

Use Additional Storage Accounts with HDInsight Hive

When you create an HDInsight Hadoop cluster you pass in one or more storage accounts and their associated keys. This allows you to access the files on all associated storage accounts from the cluster. If you want to use public storage that isn’t passed in at create time that’s easy – simply supply the storage account name each time you run a job. But how do you access data on private storage accounts that need an access key?

The steps are laid out in this wiki by Eric Hanson: Using an HDInsight Cluster with Alternate Storage Accounts and Metastores

I am providing a variable based variation of the PowerShell sample for Hive. To set up PowerShell for use with Azure see Getting Started with Azure PowerShell Cmdlets–Subscription Management.

First you will set some values for your environment. If you use your default subscription you don’t need to pass in the subscription name and select it. However, you will always need to specify the HDInsight cluster name. In this example $undefinedStorageAccount is the name of an account that you want to access from a cluster but you didn’t define it when you created the cluster. You always need to specify which container to use for any given reference so you also need to define $undefinedContainer. If the storage account belongs to the current subscription you can simply ask Azure to return the key (#commented out in the example below) or you can paste in the key that someone has given you.

$subscriptionName = "LocalAzureSubscriptionName"
$clusterName = "HDInsightClusterName"
$undefinedStorageAccount = "AdditionalStorageAccount"
$undefinedContainer = "ContainerOnAdditionalStorageAccount"
#$undefinedStorageKey = Get-AzureStorageKey $undefinedStorageAccount | %{ $_.Primary }
$undefinedStorageKey = "YourActualAccessKeyFromAzurePortal"

Now choose which of your locally defined subscriptions to use:

Select-AzureSubscription -SubscriptionName $subscriptionName

Set the context of the cluster you want to use:

Use-AzureHDInsightCluster $clusterName

Now let’s check your HDInsight cluster properties.

$defaultStorageAccount  = (Get-AzureHDInsightCluster -Name $clusterName).DefaultStorageAccount.StorageAccountName #default/only storage account
$defaultContainerName   = (Get-AzureHDInsightCluster -Subscription $SubID -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName
$definedStorageAccounts = (Get-AzureHDInsightCluster -Name $clusterName).StorageAccounts #no 2nd account is associated, no value is returned

Let’s check the values and verify that the storage account you want to use is not listed as either the DefaultStorageAccount (every cluster has one) or as one of the additional known storage accounts configured during provisioning (you may have zero, one, or many).

write-host "===Default storage account"
write-host "===Default container name"
write-host "===Other defined storage accounts for this cluster"

Next we’ll get a non-recursive listing of the files in the default location:

invoke-hive "dfs -ls wasb://$defaultContainerName@$defaultStorageAccount/;" #default storage

And then try to get a listing for the private storage account that we have not associated with the cluster:

invoke-hive "dfs -ls wasb://$undefinedContainer@$undefinedStorageAccount/;" #not associated, errors

Because the storage account access key is not yet known you will see an error similar to this one:

Logging initialized using configuration in file:/C:/apps/dist/hive-
ls: Unable to access container xyz in account abc using anonymous credentials, 
and no credentials found for them  in the configuration.
Command failed with exit code = 1

But we can fix this! From PowerShell we can pass in “defines” statements to change configuration values, add libraries, etc.

$defines = @{}
$defines.Add("$", $undefinedStorageKey)
Invoke-Hive -Defines $defines -Query "dfs -ls wasb://$undefinedContainer@$;"

The access key is only available to this Hive query, but now that I have the variables set I can pass it in to other queries as well. Happy Hiving!

I hope you enjoyed this small bite of Big Data!


Getting Started with Azure PowerShell Cmdlets–Subscription Management

I’ve started using the Azure PowerShell cmdlets more often to manage virtual machines and HDInsight in Azure. Once you connect to a subscription everything just works. However, the initial steps to get one or more subscriptions configured to be used from your machine or understanding how to change subscription information on your machine can be confusing. Some of the docs are contradictory, outdated, or incomplete. Often they assume you are only a co-admin of one subscription. The below steps should get you going with Azure cmdlets whether you admin one or many subscriptions.

You need to enable your machine to talk to one or more Azure subscriptions. The first step is creating a certificate. Do NOT do this if you already used the PublishSettings commands unless you first use Remove-AzureSubscription (which removes the locally stored information about the specified subscription). Makecert is more secure than PublishSettings, especially if you (a given email address) have multiple co-administrators per subscription and/or you (a given email address) are a co-administrator of multiple subscriptions.

The steps to get going are documented in Shep’s blog “Cloud Spelunking, Managing Azure form your Desktop via PowerShell (the Setup)” I’ll go a bit deeper and fill in a few additional details on what Shep calls the “hard” option.

Create a Certificate

If you have IIS, Visual Studio, or the Windows SDK you will have some variation of a “Developer Command Prompt” (or VS201x or Visual Studio Command Prompt). Open that command prompt with the “run as administrator” option. Replace YourCertName with a meaningful name and run the below command. The cert always goes to the cert store on your local machine – the last parameter is an optional file based copy of that certificate that we will need for the next step. If you don’t specify the location it goes to %windir%system32. Be very protective of the .cer file – delete it once you have uploaded it. You can always generate another file if you need it.

makecert -sky exchange -r -n “CN=<YourCertName>” -pe -a sha1 -len 2048 -ss My “c:temp<YourCertName>.cer”

This certificate is yours – do not share it with others. If you want to reuse the certificate on other machines that you control, you can copy the .cer file to those machines and import them into the local certificate store on each machine. The .cer is just a copy, the actual certificate was loaded into your local certificate store (Manage Computer Certificates) by makecert.

Upload Certificate to Azure Subscription(s)

Generally you will not want to share certificates with others. Any certificate you use must be in your local certificate store (Manage Computer Certificates). The same certificate must also be uploaded to the portal and associated with each subscription you wish to manage from your machine.

From your local machine where you created the certificate in the above step:

  • Log in to the Azure Portal with an email address that is associated with the subscription you want to use from your own machine.
  • Scroll to the bottom of the left pane and choose “SETTINGS”




  • Click on the “UPLOAD” button in the middle of the bar at the bottom of the screen.


  • In the “Upload a management certificate” dialog navigate to the location specified in the last parameter above or %windir%system32 if you didn’t specify a location. Choose the .cer file you just created with makecert (or export a certificate from the local certificate store – just make sure it has the right properties). If you have multiple subscriptions there is a 2nd drop down box where you need to choose the subscription that the certificate will be associated with.


  • Repeat for any additional subscriptions that you want to manage with the same certificate (or create one certificate per subscription for additional security granularity).

Install and Configure the Azure PowerShell Cmdlets

Follow the steps here to install the Azure Cmdlets. Basically you are selecting “Azure PowerShell” from the Web Platform Installer. You can also check in the Web Platform Installer for updated versions of the cmdlets.

A very common setting that many admins set is the RemoteSigned Execution Policy. This is less secure than AllSigned or Restricted but allows you to use most downloaded scripts.

Open Windows Azure PowerShell with the “run as admin” option and run:

Set-ExecutionPolicy RemoteSigned –Force
Get-ExecutionPolicy –list

If you see errors when setting the execution policy, search on your specific error or start with this blog: Set-ExecutionPolicy : Windows PowerShell updated your execution policy successfully, but the setting is overridden by a policy defined at a more specific scope!!! You may need to open “Edit Group Policy” (in Windows 8 that opens the Local Group Policy Editor) and make a change.  Sometimes you may need to set each individual scope, but process scope settings go back to the default when the process is closed:

Set-ExecutionPolicy RemoteSigned -Scope Process -Force

Then import the Azure cmdlets:

Import-Module Azure

You can close the PowerShell window, you no longer need to “run as admin”.

Enable PowerShell to use a Subscription via a Certificate

Repeat this section on each machine that will be used to execute PowerShell code. Also repeat for additional subscriptions on each machine.

Open Windows Azure PowerShell. Optionally type ISE to open the Integrated Scripting Environment where you can edit, save, and run collections of cmdlets.

First, set some variables. You will need to copy some basic settings from the Azure Management Portal. On the far left side of the portal, scroll all the way to the bottom and choose “SETTINGS” and “MANAGEMENT CERTIFICATES” (see the “Upload Certificate to Azure Subscription(s)” section of this blog for more details – you are copying from the same place where you uploaded the certificate). Choose the certificate you just uploaded. Don’t worry if the numbers are cut off on the screen, if you highlight and copy it will get the whole value, even the part that doesn’t show on the screen. Replace the $subID and $thumbprint below – do not update $myCert as that is done based on your other variables. Execute the code in the PowerShell window.

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "11111111-2222-3333-4444-555555555555"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "1234567891234567891234567891234567891234"
$myCert = Get-Item cert:\CurrentUserMy$thumbprint  

Now set the subscription name you will use to refer to this subscription from this machine. In most cases you will choose the NAME of the subscription from the portal but that is not required. The matching between your machine’s knowledge of the subscription and the subscription on Azure is done via the SUBSCRIPTION ID. Update $localSubName below and execute the code in the PowerShell window. Note that the local subscription name is case-sensitive.

#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
$localSubName = "MyFavSub"

Now that you have set the values for your own environment, run the code to actually update your machine’s knowledge of the subscription. Note that I used the back tick “`” to specify that the command continues on a new line.

Set-AzureSubscription –SubscriptionName $localSubName `
–SubscriptionId $subID -Certificate $myCert

Some operations rely on a default storage account, you may want to set the default storage account you want to use for each subscription.

#optionally set "current" storage account for this sub
$defaultStorageAccount = 'MyFavStorageAccount'
Set-AzureSubscription -SubscriptionName $localSubName `
-CurrentStorageAccount $defaultStorageAccount

Next you can set the default subscription that you will start with when you open PowerShell on this machine (note that we’ve changed from the Set cmdlet to the Select one):

Select-AzureSubscription –Default $localSubName

You can change which of the configured subscriptions is the current one:

Select-AzureSubscription –Current $localSubName

Check to see which subscription you are currently using:

Get-AzureSubscription –Current
(Get-AzureSubscription -Current).SubscriptionName

Verify that you can connect and list the services associated with the current subscription:

Get-AzureService | select ServiceName

Look at the Local Configuration

Now let’s look at what got updated on the local machine.

Open File Explorer and go to %appdata%Windows Azure PowerShell. Open WindowsAzureProfile.xml in Notepad or your favorite editor. Here are a few of the key values for each subscription you have mapped on your machine:

IsDefault tells you which one is the default subscription for your machine


The thumbprint id is stored as the ManagementCertificate:


The local name you chose for the subscription is stored in Name (to avoid confusion chose the name used in the portal):


The subscription id is stored in SubscriptionId:


Remove Subscription

If you need to remove a subscription from your machine, whether because you no longer have access to it or because you want to change one of the properties such as the name or which certificate you use, you can use Remove-AzureSubscription. This updates your local %appdata%Windows Azure PowerShell.

#Remove my machine's knowledge of a subscription 
#Removes info from %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
Remove-AzureSubscription -SubscriptionName MyFavSub

Sample Script

Here is a handy dandy cut/paste version of the above PowerShell code to add a subscription and make it your default and current subscription:

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "YourOwnSubID"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "YourCertThumbprint"
$myCert = Get-Item cert:\CurrentUserMy$thumbprint  
#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%Windows Azure PowerShellWindowsAzureProfile.xml
$localSubName = "YourSubcriptionName"
#optionally set "current" storage account for this sub
$defaultStorageAccount = 'OptionalDefaultStorage'
Set-AzureSubscription –SubscriptionName $localSubName `
    –SubscriptionId $subID -Certificate $myCert
Set-AzureSubscription -SubscriptionName $localSubName `
    -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription –Default $localSubName
Select-AzureSubscription –Current $localSubName
Get-AzureSubscription –Current
(Get-AzureSubscription -Current).SubscriptionName

You are Ready for PowerShell Gooey Goodness!

Woohoo! Now you can access your Azure subscriptions from your machine without entering ids and passwords. You can automate, simplify, and standardize any Azure activity that has an associated cmdlet! Happy PowerShelling!



Sample PowerShell Script: HDInsight Custom Create

This is a working script I use to create various HDInsight clusters. For a really reproducible, automated environment you would want to put this into a .ps1 script that accepts parameters (see here for an example). However, you may find the method below good for learning and experimenting. Replace all the “YOURxyz” sections with your actual information. Beware of oddities introduced by cut/paste such as spaces being replaced by line breaks or quotes being replaced by smart quotes. The # is a comment, some commands that you rarely run are commented out so remove the # to run them if you need them.

# This PowerShell script is meant to be a cut/paste of specific parts, it is NOT designed to be run as a whole.

# Do once after you install the cmdlets
#Import-AzurePublishSettingsFile C:UsersYOURDirectoryDownloadsYOURName-credentials.publishsettings

# Use if you admin more than one subscription
#Get-AzureAccount # This may be needed to log in to Azure
Select-AzureSubscription –SubscriptionName YOURSubscription
Get-AzureSubscription -Current

# Many things are easier in the ISE

### create clusters ###

# Add your specific information here
# Previous failures may make a name unavailable for a while – check to see if previous cluster was partially created
$ClusterName = “YOURNewHDInsightClusterName” #the name you will give to your cluster
$Location = “YOURDataCenter” #cluster data center must be East US, West US, or North Europe (as of December 2013)
$NumOfNodes = 1 #start small
$StorageAcct1 = “YOURExistingStorageAccountName” #currently must be in same data center as the cluster
$DefaultContainer = “YOURExistingContainerName” #already exists on the storage account

# These variables are automatically set for you
$FullStorage1 = “${StorageAcct1}”
$Key1 = Get-AzureStorageKey $StorageAcct1 | %{ $_.Primary }
$SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionId }
$SubName = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
$Cert = Get-AzureSubscription -Current | %{ $_.Certificate }
$Creds = Get-Credential -Message “New admin account to be created for your HDInsight cluster” #this prompts you

# Sample quick create
# Equivalent of quick create
# The ` specifies that the cmd continues on the next line, beware of artifical line breaks added during cut/paste from the blog
New-AzureHDInsightCluster -Name $ClusterName -ClusterSizeInNodes $NumOfNodes -Subscription $SubID -Location “$Location” `
-DefaultStorageAccountName $FullStorage1 -DefaultStorageAccountKey $Key1 -DefaultStorageContainerName $DefaultContainer -Credential $Creds

# Sample custom create
# Most params are the same as quick create, use a new cluster name
# Pass in a 2nd storage account, a SQLAzure db for the metastore (assume same db for Oozie and Hive), add Avro library, some config values
# Execute all the variable settings from above

# This value is set for you, don’t change!
$configvalues = new-object ‘Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightHiveConfiguration’

# Add your specific information here
$ClusterName = “YOURNewHDInsightClusterName
$StorageAcct2 = “YOURExistingStorageAccountName2
$MetastoreAzureSQLDBName = “YOURExistingSQLAzureDBName
$MetastoreAzureServerName = “” #gives a DNS error if you don’t use the full name
$configvalues.Configuration = @{ “hive.exec.compress.output”=”true” }  #this is an example of a config value you may pass in

# These variables are automatically set for you
$FullStorage2 = “${StorageAcct2}”
$Key2 = Get-AzureStorageKey $StorageAcct2 | %{ $_.Primary }
$MetastoreCreds = Get-Credential -Message “existing id/password for your SQL Azure DB (metastore)” #This prompts for the existing id and password of your existing SQL Azure DB

# Add a config file value
# Add AVRO SerDe libraries for Hive (on storage 1)
$configvalues.AdditionalLibraries = new-object ‘Microsoft.WindowsAzure.Management.HDInsight.Cmdlet.DataObjects.AzureHDInsightDefaultStorageAccount’
$configvalues.AdditionalLibraries.StorageAccountName = $FullStorage1
$configvalues.AdditionalLibraries.StorageAccountKey = $Key1
$configvalues.AdditionalLibraries.StorageContainerName = “hivelibs” #container called hivelibs must exist on specified storage account
# Create custom cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes $NumOfNodes `
| Set-AzureHDInsightDefaultStorage -StorageAccountName $FullStorage1 -StorageAccountKey $Key1 -StorageContainerName $DefaultContainer `
| Add-AzureHDInsightStorage -StorageAccountName $FullStorage2 -StorageAccountKey $Key2 `
| Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType OozieMetastore `
| Add-AzureHDInsightMetastore -SqlAzureServerName $MetastoreAzureServerName -DatabaseName $MetastoreAzureSQLDBName -Credential $MetastoreCreds -MetastoreType HiveMetastore `
| Add-AzureHDInsightConfigValues -Hive $configvalues `
| New-AzureHDInsightCluster -Subscription $SubID -Location “$Location” -Name $ClusterName -Credential $Creds

# get status, properties, etc.
#$SubName = $SubID = Get-AzureSubscription -Current | %{ $_.SubscriptionName }
Get-AzureHDInsightProperties -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName
Get-AzureHDInsightCluster -Subscription $SubName -name YOURClusterName

# remove cluster
#Remove-AzureHDInsightCluster -Name $ClusterName -Subscription $SubName

Leave a comment

Your First HDInsight Cluster–Step by Step

Small Bites of Big Data from AZURECAT
Big Data Tech Training Series #1
Cindy Gross | Murshed Zaman

Sometimes it is just hard to get started. Have you been putting off your first foray into Hadoop? Are you not sure where to begin? Let’s get really basic.


Log on to the Windows Azure Portal

Go to storage Create a storage account in a location that is available to HDInsight (as of November 2013 that’s East US, West US, and North Europe). Do NOT choose an affinity group. If you choose to “Enable Geo-Replication” there will be an extra charge – it’s probably not necessary for a demo/test account as you have a limited amount of credit in the trial subscription. In the portal choose the STORAGE icon on the left. Then click on +NEW at the bottom. That opens a QUICK CREATE window. Enter a unique name for your storage, such as sqlcatwomanrules. It only allows lower case letters and numbers.

StorageNov2013 NewStorageAccountNov2013

Now click on the HDInsight icon just below the storage icon storage. Choose QUICK CREATE. Enter a unique name for your HDInsight cluster. For a demo choose 4 data nodes. Enter a password that contains upper and lower case letters, a number, and a special character. Choose the storage account you created above. Once you click on “CREATE HDINSIGHT CLUSTER” it will take several minutes for the cluster to be deployed.

StorageNov2013_HDI QuickCreateNov2013

Once it completes you are ready to use your cluster!


If you won’t be using the cluster right away, go ahead and delete it (look for the icon at the bottom of the portal) to save compute time and money. You can easily recreate it when you need it.


Look for more blogs soon on customizing your cluster with CUSTOM CREATE or PowerShell and on automating deployment and jobs with PowerShell. In the meantime see if you can get Invoke-Hive working from PowerShell for some simple Hive commands such as:

Invoke-Hive “select * from hivesampletable limit 10”

Big Data Technical Series:

Your First HDInsight Cluster–Step by Step

Automating HDInsight cluster creation with PowerShell

1 Comment

PowerShell for Azure cmdlets: Subscription was all Wacky

I was working on some HDInsight scripts in PowerShell and doing lots of experimenting. I’m not sure what exactly I did but all of a sudden everything stopped working. With lots of interruptions from meetings and chats and lunch…. I couldn’t retrace my steps. Everything seemed to fail on the Azure subscription information so I tried to get really basic – what did Get-AzureSubscription|%{$_.SubscriptionName} return? As it turns out, wacky garbage:

set-azuresubscription ?

What I expected to see was my single subscription:


So what happened? The Azure portal only shows one subscription. Obviously those other lines are not valid subscriptions – they look like the output of a help command or an error. Reinstalling the cmdlets, rebooting, and reimporting certificates didn’t help. I turned to my AzureCAT coworkers for help and @elcid98 pointed out this blog post that talks about how subscriptions are used in PowerShell:

Azure Subscriptions in PowerShell demystified

This caught my attention: “The second file – DefaultSubscriptionData.xml – also lists the available subscriptions and the associated certificates’ thumbprints“. Ok, where is that file? A search finds it in

C:Users%username%AppDataRoamingWindows Azure Powershell

I checked and sure enough, where I would expect just one entry I see multiple – and they’re named the same thing as the garbage in my output! I cleared out all but one entry to end up with this:

<?xml version=”1.0″ encoding=”utf-8″?>
< Subscriptions xmlns:xsd=”” xmlns:xsi=”” version=”0″ xmlns=”urn:Microsoft.WindowsAzure.Management:WaPSCmdlets”>
< Subscription name=”sqlcatwoman”>
< SubscriptionId>You don’t get to see the real info!</SubscriptionId>
<Thumbprint>Not here either!</Thumbprint>
< ServiceEndpoint></ServiceEndpoint>
< /Subscription>
< /Subscriptions>

Hmmmm…. I still got an error from Get-AzureSubscription. Back to C:Users%username%AppDataRoamingWindows Azure PowerShell. What’s this? WindowsAzureProfile.xml also has all the same junk! I cleared out all the extras to end up with this:

<?xml version=”1.0″ encoding=”utf-8″?>
< ProfileData xmlns:i=”” xmlns=””>
< DefaultEnvironmentName>AzureCloud</DefaultEnvironmentName>
<Environments />
< AzureSubscriptionData>
< ActiveDirectoryEndpoint></ActiveDirectoryEndpoint>
< ActiveDirectoryTenantId>More secrets!</ActiveDirectoryTenantId>
< ActiveDirectoryUserId></ActiveDirectoryUserId>
< CloudStorageAccount i:nil=”true” />
< IsDefault>true</IsDefault>
< LoginType i:nil=”true” />
< ManagementCertificate>Hiding this one too!</ManagementCertificate>
< ManagementEndpoint></ManagementEndpoint>
<RegisteredResourceProviders xmlns:d4p1=”” />
< SubscriptionId>And more secrets</SubscriptionId>
< /ProfileData>

Success! Get-AzureSubscription now returns just my single, valid subscription. All my other Azure cmdlets magically started working again. I don’t know how it got that way, but at least now I know where the subscription information is stored. I hope this helps someone else with their Azure subscription PowerShell scripting!


HDInsight Big Data Talks from #SQLPASS

SQL PASS Summit 2013 was another great data geek week! I chatted with many of you about Big Data, Hadoop, HDInsight, architecting solutions, SQL Server, data, BI, analytics, and general geekiness – great fun! This time around I delivered two talks on Hadoop and HDInsight – the slides from both are attached.

Zero to 60 with HDInsight takes you from an overview of Big Data and why it matters (zero) all the way through an end to end solution (60). We discussed how to create an HDInsight cluster with the Azure portal or PowerShell and talked through the architecture of the data and analysis behind the release of Halo 4. We talked about how you could use the same architectural pattern for many projects and walked through Hive and Pig script examples. We finished up with how to use Power Map (codename GeoFlow) over that data to gain new insights and improve the game experience for the end user.

The next session I co-presented with HDInsight PM Dipti Sangani: CAT: From Question to Insight with HDInsight and BI. We went deeper this time. Not only did we present an end to end story with how our own internal Windows Azure SQL Database team uses telemetry to improve your experience with SQL Server in Azure PaaS but we also went deeper with demos of Hive, Pig, and Oozie. We also gave another archetypical design scenario that will apply to many of your own scenarios and talked about how HDInsight fits with SQL Server and your other existing infrastructure. The deck covers your cloud and on-premises options for Hadoop on Windows including HDInsight Service, Hortonworks HDP for Windows, OneBox, and PDW with Polybase.

Please let me know if you have any questions from the talks or just general HDInsight questions!


Big Data Twitter Demo

Real-time. Social Sentiment Analysis. Twitter. Cloud. Insights. We have your Big Data buzzwords here!

Everyone seems to want to incorporate social sentiment into their business analysis. Well we have the demo for you! Use it for a quick demonstration of what can be done and when the excitement goes through the roof, use it to inspire your own design!

Real-Time Processing – Instant Insights!

First, use Event Driven Processing (EDP) to show the data on a dashboard. In the demo you’ll use StreamInsight, also referred to as Complex Event Processing (CEP), though you may want to use Microsoft Open Technologies’s RX / Reactive Extensions in your own project. Use the dashboard to make real-time decisions and take immediate action. For example, configure your EDP to “pop” only on terms related to your company and your marketing analyst can watch how the volume of tweets and the sentiment (measured positive/neutral/negative in this example) change in response to your Super Bowl ad or a mention of your company on a the news show. She can respond instantly with changes to your website, your own tweets, sales/promotions, or whatever is appropriate to your business.

Data Storage – Enable Insights!

EDP reads the data as it is pushed through a query, there is no inherent storage involved. You could just discard the data and never store it. In this example we chose to store the data for later trending and historical analysis. We take the tweet id, the date/time the tweet was captured, the sentiment score calculated during the real-time processing, and the keyword that caught our attention and store it in SQL Azure. This data is available to other applications that need to join it with existing data and requires fast responses to individual queries. The remaining data including the raw tweet, any geographic data the tweeter chose to share, and other data is dropped in an Azure Blob Store.

Trends, Patterns, and Historical Insights!

Now point HDInsight (Hadoop) to the Azure Blob Storage using HDInsight’s ASV extension to HDFS. You can spin up a Hadoop cluster in Azure, pay for as many nodes as you need for as long as you need them then spin them down to save money. The data remains in the blob store – available for future Hadoop clusters, other applications, archival, or whatever you need it for. Add structure to the JSON data with Hive and now you have rows and columns that can be accessed by BI tools!

Visualization for Powerful Insights!

Now create a PowerPivot tabular model (self-service BI / client side) or an Analysis Service tabular model (corporate BI) to store the relevant data in a highly compressed, in-memory format for fast access. Add in a few Power View visualizations mashing up data from multiple structured and unstructured sources and you can show your business decision makers easily digestible and understandable data in a format they just get and love! Make some decisions, take some actions, and you’ve just shown how to turn free Twitter data into a valuable resource that can have a direct impact on your company!

How to Get These Insights

Follow the instructions on the CodePlex site for the Big Data Twitter Demo project. Set it up, run through the demo, get excited, and go improve your business!

Demo Created By:

Vu Le

Andrew Moll

Aviad Ezra @aviade_pro

Brad Sarsfield @Bradoop

Later Additions By:

Lara Rubbelke @SQLGal |

Robert Bruckner

Cindy Gross @SQLCindy |