Befriending Dragons

Transform Tech with Anti-bullying Cultures



Get HDInsight Properties with PowerShell

Small Bites of Big Data from AzureCAT

You’ve created your HDInsight Hadoop clusters and now you want to know exactly what you have out there in Azure. Maybe you want to pull the key information into a repository periodically as a reference for future troubleshooting, comparisons, or billing. Maybe you just need to get a quick look at your overall HDInsight usage. This is something you can easily automate with PowerShell.

Environment

First, open Windows Azure PowerShell or powershell_ise.exe.

Set some values for your environment:

$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName = "HDInsightClusterName" #HDInsight cluster name

HDInsight Usage for the Subscription

Take a look at your overall HDInsight usage for this subscription:

Get-AzureHDInsightProperties

Get-AzureHDInsightProperties returns the number of clusters for this subscription, the total HDInsight cores used and available (for head nodes and data nodes), the Azure regions where HDInsight clusters can be created, and the HDInsight versions available for new clusters:

ClusterCount    : 2
CoresAvailable  : 122
CoresUsed       : 48
Locations       : {East US, North Europe, Southeast Asia, West Europe...}
MaxCoresAllowed : 170
Versions        : {1.6, 2.1, 3.0}

You can also pick out specific pieces of information and write them to a file, store them as variables, or use them elsewhere. This example simply outputs the values to the screen.

write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed
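
If you want to capture these values in a file for the kind of repository mentioned above, here is a minimal sketch that appends a timestamped row to a CSV. The C:\HDInsightAudit path and file name are just examples; adjust them for your environment.

$props = Get-AzureHDInsightProperties
$row = New-Object PSObject -Property @{
    CapturedOn      = Get-Date
    ClusterCount    = $props.ClusterCount
    CoresUsed       = $props.CoresUsed
    CoresAvailable  = $props.CoresAvailable
    MaxCoresAllowed = $props.MaxCoresAllowed
}
#Append the snapshot to a CSV (requires PowerShell 3.0 or later for -Append)
$row | Export-Csv -Path "C:\HDInsightAudit\HDInsightProperties.csv" -Append -NoTypeInformation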

HDInsight Cluster Information

Get-AzureHDInsightCluster provides information about all existing HDInsight clusters for this subscription:

Get-AzureHDInsightCluster

As you can see, this cmdlet tells you the size, connection information, and version.

ClusterSizeInNodes    : 4
ConnectionUrl         : https://BigCAT.azurehdinsight.net
CreateDate            : 4/5/2014 3:37:23 PM
DefaultStorageAccount : sqlcatwomanwestus.blob.core.windows.net
HttpUserName          : Admin
Location              : West US
Name                  : BigCAT30
State                 : Running
StorageAccounts       : {}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               : 3.0.0.0.661685
VersionStatus         : Compatible

ClusterSizeInNodes    : 4
ConnectionUrl         : https://cgrosstest.azurehdinsight.net
CreateDate            : 5/5/2014 6:09:58 PM
DefaultStorageAccount : cgrosstest.blob.core.windows.net
HttpUserName          : Admin
Location              : West US
Name                  : cgrosstest
State                 : Running
StorageAccounts       : {sqlcatwomanwestus.blob.core.windows.net}
SubscriptionId        : {YourSubID}
UserName              : Admin
Version               : 3.0.2.0.727283
VersionStatus         : Compatible

You can also get information about just one HDInsight cluster at a time:

Get-AzureHDInsightCluster  -name $ClusterName

Or you can get very granular and look at specific properties, even some that aren’t in the default values:

write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName

This information will be a valuable reference for tracking past configurations, current usage, and planning. Enjoy your Hadooping!
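
If you want to automate that tracking, here is a hedged sketch that collects a few key properties for every cluster in the subscription and writes them to a dated CSV (the output folder is an example only).

#Snapshot key properties for all clusters in the subscription
Get-AzureHDInsightCluster |
    Select-Object Name, Version, ClusterSizeInNodes, Location, ConnectionUrl, CreateDate, State |
    Export-Csv -Path ("C:\HDInsightAudit\Clusters_{0:yyyyMMdd}.csv" -f (Get-Date)) -NoTypeInformation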

Sample Script

# Cindy Gross 2014
# Get HDInsight properties
$SubName = "YourSubscriptionName"
Select-AzureSubscription -SubscriptionName $SubName
Get-AzureSubscription -Current
$ClusterName        = "YourHDInsightClusterName" #HDInsight cluster name

Get-AzureHDInsightProperties 
Get-AzureHDInsightCluster 
Get-AzureHDInsightCluster  -name $ClusterName
write-host '== Default Storage Account:     ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageAccountName.split(".")[0]
write-host '== Default Container:           ' `
(Get-AzureHDInsightCluster -Cluster $ClusterName).DefaultStorageAccount.StorageContainerName
write-host '== Max HDInsight Cores for Sub: ' (Get-AzureHDInsightProperties).MaxCoresAllowed
write-host '== Cores Available:             ' (Get-AzureHDInsightProperties).CoresAvailable
write-host '== Cores Used:                  ' (Get-AzureHDInsightProperties).CoresUsed



Getting Started with Azure PowerShell Cmdlets–Subscription Management

I’ve started using the Azure PowerShell cmdlets more often to manage virtual machines and HDInsight in Azure. Once you connect to a subscription everything just works. However, the initial steps to get one or more subscriptions configured to be used from your machine, or understanding how to change subscription information on your machine, can be confusing. Some of the docs are contradictory, outdated, or incomplete. Often they assume you are only a co-admin of one subscription. The steps below should get you going with the Azure cmdlets whether you administer one subscription or many.

You need to enable your machine to talk to one or more Azure subscriptions. The first step is creating a certificate. Do NOT do this if you already used the PublishSettings commands unless you first use Remove-AzureSubscription (which removes the locally stored information about the specified subscription). Makecert is more secure than PublishSettings, especially if you (a given email address) have multiple co-administrators per subscription and/or you (a given email address) are a co-administrator of multiple subscriptions.

The steps to get going are documented in Shep’s blog “Cloud Spelunking, Managing Azure form your Desktop via PowerShell (the Setup)” http://blogs.msdn.com/b/sql_shep/archive/2013/03/29/cloud-spelunking-managing-azure-form-your-desktop-via-powershell.aspx. I’ll go a bit deeper and fill in a few additional details on what Shep calls the “hard” option.

Create a Certificate

If you have IIS, Visual Studio, or the Windows SDK you will have some variation of a “Developer Command Prompt” (or VS201x or Visual Studio Command Prompt). Open that command prompt with the “run as administrator” option. Replace YourCertName with a meaningful name and run the below command. The cert always goes to the cert store on your local machine – the last parameter is an optional file based copy of that certificate that we will need for the next step. If you don’t specify the location it goes to %windir%\system32. Be very protective of the .cer file – delete it once you have uploaded it. You can always generate another file if you need it.

makecert -sky exchange -r -n "CN=<YourCertName>" -pe -a sha1 -len 2048 -ss My "c:\temp\<YourCertName>.cer"

This certificate is yours – do not share it with others. If you want to reuse the certificate on other machines that you control, you can copy the .cer file to those machines and import them into the local certificate store on each machine. The .cer is just a copy, the actual certificate was loaded into your local certificate store (Manage Computer Certificates) by makecert.
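
On Windows 8 / Windows Server 2012 and later you can script that import with the PKI module's Import-Certificate cmdlet. This is only a hedged sketch (on older versions certutil or the Certificates MMC snap-in works as well), and the file path is an example.

#Import the copied .cer into the current user's personal certificate store
Import-Certificate -FilePath "C:\temp\YourCertName.cer" -CertStoreLocation Cert:\CurrentUser\My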

Upload Certificate to Azure Subscription(s)

Generally you will not want to share certificates with others. Any certificate you use must be in your local certificate store (Manage Computer Certificates). The same certificate must also be uploaded to the portal and associated with each subscription you wish to manage from your machine.

From your local machine where you created the certificate in the above step:

  • Log in to the Azure Portal with an email address that is associated with the subscription you want to use from your own machine.
  • Scroll to the bottom of the left pane and choose “SETTINGS”


  • Choose “MANAGEMENT CERTIFICATES”


  • Click on the “UPLOAD” button in the middle of the bar at the bottom of the screen.


  • In the “Upload a management certificate” dialog navigate to the location specified in the last parameter above or %windir%\system32 if you didn’t specify a location. Choose the .cer file you just created with makecert (or export a certificate from the local certificate store – just make sure it has the right properties). If you have multiple subscriptions there is a 2nd drop down box where you need to choose the subscription that the certificate will be associated with.


  • Repeat for any additional subscriptions that you want to manage with the same certificate (or create one certificate per subscription for additional security granularity).

Install and Configure the Azure PowerShell Cmdlets

Follow the steps here to install the Azure Cmdlets. Basically you are selecting “Azure PowerShell” from the Web Platform Installer. You can also check in the Web Platform Installer for updated versions of the cmdlets.

A very common setting that many admins set is the RemoteSigned Execution Policy. This is less secure than AllSigned or Restricted but allows you to use most downloaded scripts.

Open Windows Azure PowerShell with the “run as admin” option and run:

Set-ExecutionPolicy RemoteSigned -Force
Get-ExecutionPolicy -List

If you see errors when setting the execution policy, search on your specific error or start with this blog: Set-ExecutionPolicy : Windows PowerShell updated your execution policy successfully, but the setting is overridden by a policy defined at a more specific scope!!! You may need to open “Edit Group Policy” (in Windows 8 that opens the Local Group Policy Editor) and make a change.  Sometimes you may need to set each individual scope, but process scope settings go back to the default when the process is closed:

Set-ExecutionPolicy RemoteSigned -Scope Process -Force

Then import the Azure cmdlets:

Import-Module Azure

You can close the PowerShell window; you no longer need to “run as admin”.

Enable PowerShell to use a Subscription via a Certificate

Repeat this section on each machine that will be used to execute PowerShell code. Also repeat for additional subscriptions on each machine.

Open Windows Azure PowerShell. Optionally type ISE to open the Integrated Scripting Environment where you can edit, save, and run collections of cmdlets.

First, set some variables. You will need to copy some basic settings from the Azure Management Portal. On the far left side of the portal, scroll all the way to the bottom and choose “SETTINGS” and “MANAGEMENT CERTIFICATES” (see the “Upload Certificate to Azure Subscription(s)” section of this blog for more details – you are copying from the same place where you uploaded the certificate). Choose the certificate you just uploaded. Don’t worry if the numbers are cut off on the screen; if you highlight and copy, you will get the whole value, even the part that doesn’t show on the screen. Replace the $subID and $thumbprint below – do not update $myCert as that is done based on your other variables. Execute the code in the PowerShell window.

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "11111111-2222-3333-4444-555555555555"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "1234567891234567891234567891234567891234"
$myCert = Get-Item cert:\CurrentUser\My\$thumbprint
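
If you would rather look up the thumbprint from your local certificate store than copy it from the portal, a quick sketch (assuming the certificate was created in the CurrentUser\My store as described above) is:

#List personal certificates so you can copy the right thumbprint
Get-ChildItem Cert:\CurrentUser\My | Format-Table Thumbprint, Subject -AutoSize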

Now set the subscription name you will use to refer to this subscription from this machine. In most cases you will choose the NAME of the subscription from the portal but that is not required. The matching between your machine’s knowledge of the subscription and the subscription on Azure is done via the SUBSCRIPTION ID. Update $localSubName below and execute the code in the PowerShell window. Note that the local subscription name is case-sensitive.

#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
$localSubName = "MyFavSub"

Now that you have set the values for your own environment, run the code to actually update your machine’s knowledge of the subscription. Note that I used the back tick “`” to specify that the command continues on a new line.

Set-AzureSubscription -SubscriptionName $localSubName `
-SubscriptionId $subID -Certificate $myCert

Some operations rely on a default storage account, so you may want to set the default storage account you want to use for each subscription.

#optionally set "current" storage account for this sub
$defaultStorageAccount = 'MyFavStorageAccount'
Set-AzureSubscription -SubscriptionName $localSubName `
-CurrentStorageAccount $defaultStorageAccount

Next you can set the default subscription that you will start with when you open PowerShell on this machine (note that we’ve changed from the Set cmdlet to the Select one):

Select-AzureSubscription -Default $localSubName

You can change which of the configured subscriptions is the current one:

Select-AzureSubscription -Current $localSubName

Check to see which subscription you are currently using:

Get-AzureSubscription -Current
(Get-AzureSubscription -Current).SubscriptionName

Verify that you can connect and list the services associated with the current subscription:

Get-AzureService | select ServiceName

Look at the Local Configuration

Now let’s look at what got updated on the local machine.

Open File Explorer and go to %appdata%\Windows Azure PowerShell. Open WindowsAzureProfile.xml in Notepad or your favorite editor. Here are a few of the key values for each subscription you have mapped on your machine:

IsDefault tells you which one is the default subscription for your machine:

<IsDefault>true</IsDefault>

The thumbprint id is stored as the ManagementCertificate:

<ManagementCertificate>1234567891234567891234567891234567891234</ManagementCertificate>

The local name you chose for the subscription is stored in Name (to avoid confusion choose the name used in the portal):

<Name>MyFavSub</Name>

The subscription id is stored in SubscriptionId:

<SubscriptionId>11111111-2222-3333-4444-555555555555</SubscriptionId>
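
To list those values for every subscription configured on your machine in one shot, you can read the profile file directly. This is just a hedged sketch: it assumes the default profile location shown above, and the ProfileData/Subscriptions/AzureSubscriptionData element path is what I see in my profile file, so verify it against yours.

#List the subscriptions this machine knows about
[xml]$azureProfile = Get-Content "$env:APPDATA\Windows Azure PowerShell\WindowsAzureProfile.xml"
$azureProfile.ProfileData.Subscriptions.AzureSubscriptionData |
    Select-Object Name, SubscriptionId, IsDefault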

Remove Subscription

If you need to remove a subscription from your machine, whether because you no longer have access to it or because you want to change one of the properties such as the name or which certificate you use, you can use Remove-AzureSubscription. This updates your local %appdata%\Windows Azure PowerShell.

#RemoveSub
#Remove my machine's knowledge of a subscription 
#Removes info from %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
Remove-AzureSubscription -SubscriptionName MyFavSub

Sample Script

Here is a handy dandy cut/paste version of the above PowerShell code to add a subscription and make it your default and current subscription:

#copy SUBSCRIPTION ID from portal 
#lower left, settings, management certificates
$subID = "YourOwnSubID"
#copy THUMBPRINT from portal 
#lower left, settings, management certificates
$thumbprint = "YourCertThumbprint"
$myCert = Get-Item cert:\CurrentUser\My\$thumbprint
#subname to be used locally
#usually you will choose the actual subscription name
#stored in %appdata%\Windows Azure PowerShell\WindowsAzureProfile.xml
$localSubName = "YourSubscriptionName"
#optionally set "current" storage account for this sub
$defaultStorageAccount = 'OptionalDefaultStorage'
Set-AzureSubscription -SubscriptionName $localSubName `
    -SubscriptionId $subID -Certificate $myCert
Set-AzureSubscription -SubscriptionName $localSubName `
    -CurrentStorageAccount $defaultStorageAccount
Select-AzureSubscription -Default $localSubName
Select-AzureSubscription -Current $localSubName
Get-AzureSubscription -Current
(Get-AzureSubscription -Current).SubscriptionName

You are Ready for PowerShell Gooey Goodness!

Woohoo! Now you can access your Azure subscriptions from your machine without entering ids and passwords. You can automate, simplify, and standardize any Azure activity that has an associated cmdlet! Happy PowerShelling!





PowerShell for Azure cmdlets: Subscription was all Wacky

I was working on some HDInsight scripts in PowerShell and doing lots of experimenting. I’m not sure what exactly I did but all of a sudden everything stopped working. With lots of interruptions from meetings and chats and lunch…. I couldn’t retrace my steps. Everything seemed to fail on the Azure subscription information so I tried to get really basic – what did Get-AzureSubscription|%{$_.SubscriptionName} return? As it turns out, wacky garbage:

set-azuresubscription ?
!?
!
help
sqlcatwoman
?
Sqlcatwomanwestus

What I expected to see was my single subscription:

sqlcatwoman

So what happened? The Azure portal only shows one subscription. Obviously those other lines are not valid subscriptions – they look like the output of a help command or an error. Reinstalling the cmdlets, rebooting, and reimporting certificates didn’t help. I turned to my AzureCAT coworkers for help and @elcid98 pointed out this blog post that talks about how subscriptions are used in PowerShell:

Azure Subscriptions in PowerShell demystified http://yossidahan.wordpress.com/2012/12/12/azure-subscriptions-in-powershell-demystified/

This caught my attention: “The second file – DefaultSubscriptionData.xml – also lists the available subscriptions and the associated certificates’ thumbprints”. Ok, where is that file? A search finds it in

C:\Users\%username%\AppData\Roaming\Windows Azure PowerShell

I checked and sure enough, where I would expect just one entry I see multiple – and they’re named the same thing as the garbage in my output! I cleared out all but one entry to end up with this:

<?xml version="1.0" encoding="utf-8"?>
<Subscriptions xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0" xmlns="urn:Microsoft.WindowsAzure.Management:WaPSCmdlets">
  <Subscription name="sqlcatwoman">
    <SubscriptionId>You don't get to see the real info!</SubscriptionId>
    <Thumbprint>Not here either!</Thumbprint>
    <ServiceEndpoint>https://management.core.windows.net</ServiceEndpoint>
  </Subscription>
</Subscriptions>

Hmmmm…. I still got an error from Get-AzureSubscription. Back to C:\Users\%username%\AppData\Roaming\Windows Azure PowerShell. What’s this? WindowsAzureProfile.xml also has all the same junk! I cleared out all the extras to end up with this:

<?xml version="1.0" encoding="utf-8"?>
<ProfileData xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Microsoft.WindowsAzure.Commands.Utilities.Common">
  <DefaultEnvironmentName>AzureCloud</DefaultEnvironmentName>
  <Environments />
  <Subscriptions>
    <AzureSubscriptionData>
      <ActiveDirectoryEndpoint>https://login.windows.net/</ActiveDirectoryEndpoint>
      <ActiveDirectoryTenantId>More secrets!</ActiveDirectoryTenantId>
      <ActiveDirectoryUserId>cgross@microsoft.com</ActiveDirectoryUserId>
      <CloudStorageAccount i:nil="true" />
      <IsDefault>true</IsDefault>
      <LoginType i:nil="true" />
      <ManagementCertificate>Hiding this one too!</ManagementCertificate>
      <ManagementEndpoint>https://management.core.windows.net/</ManagementEndpoint>
      <Name>sqlcatwoman</Name>
      <RegisteredResourceProviders xmlns:d4p1="http://schemas.microsoft.com/2003/10/Serialization/Arrays" />
      <SubscriptionId>And more secrets</SubscriptionId>
    </AzureSubscriptionData>
  </Subscriptions>
</ProfileData>

Success! Get-AzureSubscription now returns just my single, valid subscription. All my other Azure cmdlets magically started working again. I don’t know how it got that way, but at least now I know where the subscription information is stored. I hope this helps someone else with their Azure subscription PowerShell scripting!



Big Data Twitter Demo

Real-time. Social Sentiment Analysis. Twitter. Cloud. Insights. We have your Big Data buzzwords here!

Everyone seems to want to incorporate social sentiment into their business analysis. Well we have the demo for you! Use it for a quick demonstration of what can be done and when the excitement goes through the roof, use it to inspire your own design!

Real-Time Processing – Instant Insights!

First, use Event Driven Processing (EDP) to show the data on a dashboard. In the demo you’ll use StreamInsight, also referred to as Complex Event Processing (CEP), though you may want to use Microsoft Open Technologies’ Rx (Reactive Extensions) in your own project. Use the dashboard to make real-time decisions and take immediate action. For example, configure your EDP to “pop” only on terms related to your company, and your marketing analyst can watch how the volume of tweets and the sentiment (measured positive/neutral/negative in this example) change in response to your Super Bowl ad or a mention of your company on a news show. She can respond instantly with changes to your website, your own tweets, sales/promotions, or whatever is appropriate to your business.

Data Storage – Enable Insights!

EDP reads the data as it is pushed through a query; there is no inherent storage involved. You could just discard the data and never store it. In this example we chose to store the data for later trending and historical analysis. We take the tweet id, the date/time the tweet was captured, the sentiment score calculated during the real-time processing, and the keyword that caught our attention and store it in SQL Azure. This data is available to other applications that need to join it with existing data and require fast responses to individual queries. The remaining data, including the raw tweet, any geographic data the tweeter chose to share, and other fields, is dropped in an Azure Blob Store.

Trends, Patterns, and Historical Insights!

Now point HDInsight (Hadoop) to the Azure Blob Storage using HDInsight’s ASV extension to HDFS. You can spin up a Hadoop cluster in Azure, pay for as many nodes as you need for as long as you need them, then spin them down to save money. The data remains in the blob store – available for future Hadoop clusters, other applications, archival, or whatever you need it for. Add structure to the JSON data with Hive and now you have rows and columns that can be accessed by BI tools!

Visualization for Powerful Insights!

Now create a PowerPivot tabular model (self-service BI / client side) or an Analysis Service tabular model (corporate BI) to store the relevant data in a highly compressed, in-memory format for fast access. Add in a few Power View visualizations mashing up data from multiple structured and unstructured sources and you can show your business decision makers easily digestible and understandable data in a format they just get and love! Make some decisions, take some actions, and you’ve just shown how to turn free Twitter data into a valuable resource that can have a direct impact on your company!

How to Get These Insights

Follow the instructions on the CodePlex site for the Big Data Twitter Demo project. Set it up, run through the demo, get excited, and go improve your business!

Demo Created By:

Vu Le

Andrew Moll

Aviad Ezra @aviade_pro

Brad Sarsfield @Bradoop

Later Additions By:

Lara Rubbelke @SQLGal | http://sqlblog.com/blogs/lara_rubbelke/

Robert Bruckner http://blogs.msdn.com/b/robertbruckner/

Cindy Gross @SQLCindy | http://blogs.msdn.com/cindygross



Access Azure Blob Stores from HDInsight

Small Bites of Big Data

Edit Mar 6, 2014: This is no longer necessary for HDInsight – you specify the storage accounts when you create the cluster and the rest happens auto-magically. See http://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.

One of the great enhancements in Microsoft’s HDInsight distribution of Hadoop is the ability to store and access Hadoop data on an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). This allows you to persist data even after you spin down an HDInsight cluster and to make that data available across multiple programs or clusters from persistent storage. Blob stores can be replicated for redundancy and are highly available. When you need to access the data from Hadoop you point your cluster at the existing data and the data persists even after the cluster is spun down.

Azure Blob Storage

Let’s start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: http://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.

You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.
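
If you prefer scripting the container creation, here is a hedged sketch using the Azure PowerShell cmdlets; the account and container names are placeholders, and it assumes the cmdlets are already configured for your subscription.

#Create a private blob container in an existing storage account
$storageAccount = "YOURStorageAccount"
$storageKey     = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$context        = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
New-AzureStorageContainer -Name "yourcontainer" -Permission Off -Context $context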


HDInsight Service Preview

When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn’t allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/.


You can then add additional storage accounts to the cluster by updating C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the “Edit container metadata” dialog on the Azure portal.


The key storage properties in the default core-site.xml on HDInsight Service Preview are:

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>

<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>

To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick storage from the left menu. Click on the account you want to use then at the very bottom click on the “MANAGE KEYS” button. Cut and paste the PRIMARY ACCESS KEY (you can use the secondary if you prefer) into the new property values we’ll discuss below.
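
If you already have the Azure PowerShell cmdlets configured for the subscription, a hedged alternative to copying the key from the portal is:

#Retrieve the primary access key for a storage account
(Get-AzureStorageKey -StorageAccountName "YOURStorageAccount").Primary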


Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node via the “Local Resources” tab under the “More” button in the “Local devices and resources” section. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved RDP).


On the head node make a copy of C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.

Add your 2nd Azure storage account by adding another property.

<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Repeat for each storage account you need to access from this cluster.

HDInsight Server Preview

If you have downloaded the on-premises HDInsight Server preview from http://microsoft.com/bigdata that gives you a single node “OneBox” install to test basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance as it is limited to a single node. The steps are slightly different for on-premises as the installation directory and default properties in core-site.xml are different.

Make a backup copy of C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml from your local installation (local could be on a VM).

Edit core-site.xml:

1) Change the default file system from local HDFS to remote ASV

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>hdfs://localhost:8020</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

to:

<property>
<name>fs.default.name</name>
<!-- cluster variant -->
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

2) Add the namenode property (do not change any values)

<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>

3) Add the information that associates the key value with your default storage account

<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>

4) Add any additional storage accounts you plan to access

<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Files

Upload one or more files to your container(s). You can use many methods for loading the data including Hadoop file system commands such as copyFromLocal or put, 3rd party tools like CloudXPlorer, JavaScript, or whatever method you find fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to C: which is local to the head node) using the Hadoop put command:

hadoop fs -put c:\data asv://data@sqlcatwomanblog.blob.core.windows.net/

Or upload a single file:

hadoop fs -put c:\data\bacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt

To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):

hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/

Found 1 items
-rwxrwxrwx   1        124 2013-04-24 20:12 /bacon.txt

In this case the container data on the private storage account sqlcatwomanblog has one file called bacon.txt.

For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.

hadoop fs -ls asv:///bacon.txt
hadoop fs -ls /bacon.txt

More Information

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.

Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.




Self-Service BI Works!

When I talk to people about adding self-service BI to their company’s environment I generally get a list of reasons why it won’t work. Some things I commonly hear:

  • I can’t get anyone in IT or on the business side to even try it.
  • The business side doesn’t know how to use the technology.
  • This threatens my job.
  • I just don’t know where to start either politically/culturally or with the technology.
  • I have too many other things to do.
  • How can it possibly be secure, allow standardization, or result in quality data and decisions?
  • That’s not the way we do things.
  • I don’t really know what self-service BI means.

#PASSBAC 2013 Cindy and Eduardo 

So what is a forward-thinking BI implementer to do? Well, Intel just went out and did it, blowing through the supposed obstacles. Eduardo Gamez of Intel’s Technology Manufacturing Engineering (TME) group interviewed business folks to find those who were motivated for change, found a great pilot project with committed employees, and drove the process forward. They put a “sandbox” environment up for the business to use and came up with a plan for monitoring the sandbox activity to find models and reports worth adding to their priority queue for enterprise BI projects. The business creates their own data models and their own reports for both high and low priority items. IT provides the infrastructure and training including products like Analysis Services, PowerPivot, Power View, SharePoint, Excel, SQL Server, and various data sources. The self-service models and reports are useful to the business – they reduce manual efforts, give them the reports they want much faster, and ultimately drive better, more agile business decisions. If a model isn’t quite right after the first try, they can quickly modify it. The same models and reports are useful to IT – they are very refined and complete requirements docs that shorten the time to higher quality enterprise models and reports; they free up IT resources to build a more robust infrastructure and allow IT to concentrate on projects that require specialized IT knowledge. Everyone wins with a shorter time to decision, higher quality decisions, and a significant impact on the bottom line.

Learn more about how Intel TME is implementing self-service BI:

Eduardo (eduardo.m.gamez@intel.com) and I (cgross@microsoft.com or @SQLCindy) are happy to talk to you about Self-Service BI – let us know what you need to know!


How_Intel__Integrates_Self-Service_BI_with_IT_for_Better_Business_Results_[DAV-208-M].zip



HDInsight: Jiving about Hadoop and Hive with CAT

Tomorrow I will be talking about Hive as part of Pragmatic Works’ Women in Technology (WIT) month of webcasts. I am proud to be part of this lineup with all these stellar WITs! I encourage my fellow WITs to get more involved in your data community and, if you don’t already do so, start tweeting, blogging, and speaking. I am happy to coach you through your first speaking engagement if you are interested. Get out there and start showing the world what you can do!

Thursday’s talk is going to be HDInsight: Jiving about Hadoop and Hive with CAT. Let’s break that title down.

HDInsight is Microsoft’s distribution of Hadoop. As part of the HDInsight project we have checked code back into the core Apache Hadoop source code to make the core code run great on Windows. We are also adding functionality and features such as JavaScript and Azure Storage Vault that make the product more robust and enterprise friendly. This week the HDInsight Service Preview on Azure became available to those with an Azure subscription.

Hadoop is a scale out methodology that allows businesses to quickly consume and analyze data in ways they haven’t been able to before. This can lead to faster, better business insights and business actions.

Hive is a way to impose metadata and structure on the loosely structured (unstructured, multi-structured, semi-structured) data that resides in Hadoop’s HDFS file system. With Hive and the Hive ODBC driver you can make Hadoop data look like any other data source to your familiar BI tools such as Excel. PowerPivot can connect to Hive data, mash that data up with existing data sources such as SQL Azure, SQL Server, and OData, and allow you to visualize it with Power View. I have an end to end demo of this: Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View.

CAT is my team at Microsoft. The Customer Advisory Team (CAT) works with customers who are doing new, unusual, and interesting things that push the boundaries of technology. We share what we learn with the community so you can do your jobs better and we take what we learn from you to the product team to help improve the product.

My slides are attached at the bottom of this post. I believe a recording of the talk will be posted by Pragmatic Works on their site.

I look forward to “seeing” you all at my talk tomorrow and would love to see your tweets or hear directly from you afterwards.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.


PragmaticWorksHDInsightJivingAboutHadoopAndHiveWithCATMar202013.pptx



PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

Are you excited about the upcoming PASS Business Analytics Conference? You should be! This conference will offer a wide range of sessions about Microsoft’s End to End Business Intelligence (including Self-Service BI), Analytics, Big Data, Architecture, Reporting, Information Delivery, Data Management, and Visualization solutions. Whether you are an implementer, a planner, or a decision maker there is something here for you!

PASS_BAC_Horizontal_Banner

What makes this conference different? Why should you put in the effort to attend this conference in particular? We are seeing a paradigm shift focused on shorter time to decision, more data available than ever before, and the need for self-service BI. There are exciting technology solutions being presented to deal with these needs and new architectural skills are needed to implement them properly. Self-Service BI and Big Data are very different in many ways but also responding to the same problem – the need for additional insights and less time spent getting to those insights and the resulting impactful decisions. Self-Service BI via PowerPivot, Power View, Excel, and existing and new data sources including HDInsight/Hadoop (usually via Hive) offers fast time to decision, but you still sometimes need Enterprise BI to add additional value via services such as data curation, data stewardship, collaboration tools, additional security, training, and automation. Add in the powerful new data sources available with Big Data technologies such as HDInsight/Hadoop that can also reduce time to decision and open up all sorts of new opportunities for insight and you have many powerful new areas to explore. Not to mention that Dr. Steven Levitt, author of Freakonomics and SuperFreakonomics, is one of the keynote speakers!

Read more about my thoughts on Self-Service BI and Big Data in this #PASSBAC guest blog published today: PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

And sign up for the session I am co-presenting at #PASSBAC with Eduardo Gamez of Intel: How Intel Integrates Self-Service BI with IT for Better Business Results

Take a look at all the information tagged with #PASSBAC and tweeted by @PASSBAC; there are some good blogs, preview sessions, and tidbits being posted. Get your own Twibbon for Twitter, Facebook, or however you want to use it; the Twibbon site will add a ribbon to the picture of your choice.


If you’re going to be in Chicago anyway, you might as well stay a few extra days for two nearby SQL Saturdays. The weekend before the conference take a short hop over to Madison, WI for #SQLSAT206 on April 6, 2013 at the Madison Area Tech College. Then head over to the bacon, uhhh, PASS BA CONference April 10-12. Stay one more day in Chicago (technically Addison, IL) for the #SQLSAT211 sessions at Devry. This is a great opportunity for even more SQL Server immersion and networking!

See you at #PASSBAC in Chicago in April!

@SQLCindy

Small Bites of Big Data



HDInsight: Hive Internal and External Tables Intro

Small Bites of Big Data

Cindy Gross, SQLCAT PM

HDInsight is Microsoft’s distribution, in partnership with Hortonworks, of Hadoop. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

Use EXTERNAL tables when:

  • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
  • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
  • You want to use a custom location such as ASV.
  • Hive should not own the data and control settings, dirs, etc.; you have another program or process that will do those things.
  • You are not creating the table based on an existing table (AS SELECT).

Use INTERNAL tables when:

  • The data is temporary.
  • You want Hive to completely manage the lifecycle of the table and data.

We’ll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively you can simply open notepad and create your own files with a series of single column rows. If you create your own files make sure you have a carriage return/line feed at the end of all rows including the last one. The files should be in a Windows directory called c:\data on the HDInsight Head Node. For HDInsight Server (on-premises) that’s the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop connection (RDP) to the head node from the Hadoop portal.
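
If you prefer to script the sample file rather than create it in Notepad, here is a minimal PowerShell sketch; the row values are just examples that match the output shown later in this post, and Set-Content writes the trailing carriage return/line feed for you.

#Create the c:\data directory and a small single-column demo file
New-Item -ItemType Directory -Path C:\data -Force | Out-Null
"HDInsight_Bacon", "SQL_Bacon", "PASS_bacon", "Summit_BACON" |
    Set-Content -Path C:\data\bacon.txt -Encoding Ascii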

Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services as of February 2013 you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward I will show you how to do all the steps through the Hive CLI (command line interface).

Open a Hadoop Command Prompt:


Change to the Hive directory (necessary in early preview builds of Hive):

cd %hive_home%\bin

Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

hadoop fs -put c:\data\bacon.txt /user/demo/food/bacon.txt

hadoop fs -lsr /user/demo/food

The put command doesn’t return a result; the list command returns one row per file or subdirectory/file:

-rw-r--r--   1 cgross supergroup        124 2013-02-05 22:41 /user/demo/food/bacon.txt

Enter the Hive CLI (command line interface):

hive

Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator; no result is returned from this set command):

Set hive.cli.print.header=true;

Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

CREATE INTERNAL TABLE internal1 (col1 string) LOCATION '/user/demo/food';

Oops… that failed because INTERNAL isn’t a keyword; the absence of EXTERNAL makes it a managed, or internal, table.

FAILED: Parse Error: line 1:7 Failed to recognize predicate 'INTERNAL'.

So let’s create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

CREATE TABLE internal1 (col1 string) LOCATION '/user/demo/food';

That will return the time taken but no other result. Now let’s look at the schema that was created. Note that the table type is MANAGED_TABLE.

DESCRIBE FORMATTED internal1;

col_name        data_type       comment
# col_name              data_type               comment

col1                    string                  None

# Detailed Table Information
Database:               default
Owner:                  cgross
CreateTime:             Tue Feb 05 22:45:57 PST 2013
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:8020/user/demo/food
Table Type:             MANAGED_TABLE
Table Parameters:
transient_lastDdlTime   1360133157

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
serialization.format    1

And now look at some rows:

SELECT * FROM internal1;

col1
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon

What happens if we don’t specify a directory for an INTERNAL table?

CREATE TABLE internaldefault (col1 string);

It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):

dfs -lsr /hive/warehouse;

We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory:

drwxr-xr-x   - cgross supergroup          0 2013-02-05 22:52 /hive/warehouse/internaldefault

However, we won’t use this table for the rest of the demo so let’s drop it to avoid confusion. The drop also removes the subdirectory.

DROP TABLE internaldefault;

dfs -lsr /hive/warehouse;

Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let’s add a 2nd file to the first internal table and check that it exists:

dfs -put c:\data\bacon2.txt /user/demo/food/bacon2.txt;

dfs -lsr /user/demo/food;

-rw-r--r--   1 cgross supergroup        124 2013-02-05 23:04 /user/demo/food/bacon.txt
-rw-r--r--   1 cgross supergroup         31 2013-02-05 23:03 /user/demo/food/bacon2.txt

Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output – there is no row value of col1 in the data as headers are not generally included in Hadoop data):

SELECT * FROM internal1;

col1
HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn

Now let’s create an EXTERNAL table that points to the same directory and look at the schema:

CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION '/user/demo/food';

DESCRIBE FORMATTED external1;

col_name        data_type       comment
# col_name              data_type               comment

cole1                   string                  None

# Detailed Table Information
Database:               default
Owner:                  cgross
CreateTime:             Tue Feb 05 23:07:12 PST 2013
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               hdfs://localhost:8020/user/demo/food
Table Type:             EXTERNAL_TABLE

Table Parameters:
EXTERNAL                TRUE
transient_lastDdlTime   1360134432

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
serialization.format    1

This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

Now look at the data:

SELECT * FROM external1;

The result set is a combination of the two bacon files:

HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn

That table returns the same data as the first table – we have two tables pointing at the same data set! We can add another one if we want:

CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION '/user/demo/food';

DESCRIBE FORMATTED external2;

SELECT * FROM external2;

You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

Add another data file to the same directory and see how it’s visible to all the tables that point to that directory:

dfs -put c:\data\veggies.txt /user/demo/food/veggies.txt;

SELECT * FROM internal1;

SELECT * FROM external1;

SELECT * FROM external2;

Each table will return the same results:

HDInsight_Bacon
SQL_Bacon
PASS_bacon
Summit_BACON
Database_Bacon
NoSQL_Bacon
BigData_Bacon
Hadoop_Bacon
Hive_Bacon
More_BaCoN
AndEvenMore_bAcOn
SQL_Apple
NoSQL_Pear
SQLFamily_Kiwi
Summit_Mango
HDInsight_Watermelon
SQLSat_Strawberries
Raspberrylimelemonorangecherryblueberry 123 456

Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

DROP TABLE internal1;

SELECT * FROM external1;

SELECT * FROM external2;

dfs -lsr /user/demo/food;

Result: lsr: Cannot access /user/demo/food: No such file or directory.

Because the INTERNAL (managed) table is under Hive’s control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

Clean up the demo tables and directory:

DROP TABLE external1;

DROP TABLE external2;

exit;

This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

bacon.zip



Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot, Power View

Small Bites of Big Data

Authors: Cindy Gross Microsoft SQLCAT PM, Ed Katibah Microsoft SQLCAT PM

Tech Reviewers: Bob Beauchemin Developer Skills Partner at SQLSkills, Jeannine Nelson-Takaki Microsoft Technical Writer, John Sirmon Microsoft SQLCAT PM, Lara Rubbelke Microsoft Technical Architect, Murshed Zaman Microsoft SQLCAT PM

For my #SQLPASS Summit 2012 talk SQLCAT: Big Data – All Abuzz About Hive (slides available to all | recording available to PASS Summit 2012 attendees) I showed a mash-up of Hive, SQL Server, and Excel data that had been imported to PowerPivot and then displayed via Power View in Excel 2013 (using the new SharePoint-free self-service option). PowerPivot brings together the new world of unstructured data from Hadoop with structured data from more traditional relational and multi-dimensional sources to gain new business insights and break down data silos. We were able to take very recent data from Hurricane Sandy, which occurred the week before the PASS Summit, and quickly build a report to pinpoint some initial areas of interest. The report provides a sample foundation for exploring to find additional insights. If you need more background on Big Data, Hadoop, and Hive please see my previous blogs and talks.

I will walk you through the steps to create the report including loading population demographics (census), weather (NOAA), and lookup table (state abbreviations) data into Hive, SQL Server, Excel, and PowerPivot then creating visualizations in Power View to gain additional insights. Our initial goal is to see if there are particular geographic areas in the path of Hurricane Sandy that might need extra assistance with evacuation. One hypothesis is that people above a given age might be more likely to need assistance, so we want to compare age data with the projected rainfall patterns related to the path of the hurricane. Once you see this basic demonstration you can envision all sorts of additional data sets that could add value to the model, along with different questions that could be asked given the existing data sets. Data from the CDC, pet ownership figures, housing details, job statistics, zombie predictions, and public utility data could be added to Hive or pulled directly from existing sources and added to the report to gain additional insights. Those insights might, for example, help first responders during future storms, assist your business to understand various ways it can help after a storm or major cleanup effort, or aid future research into reducing the damage done by natural disasters.

Prerequisites

  • Excel 2013 (a 30 day trial is available)
  • SQL Server 2008 R2 or later
  • HDInsight Server or HDInsight Service (Service access will take a few days so plan in advance or use Server)
  • Hive ODBC Driver (available with HDInsight)
  • NOAA Data Set restored to SQL Server 2008 R2 or later (available from SkyDrive)
  • Census Data Set (attached at bottom of blog)

SQL Server

Relational Data

One of the data sets for the demo is NOAA weather data that includes spatial characteristics. Since SQL Server has a rich spatial engine and the data set is well known and highly structured that data was a good fit for SQL Server. Spatial_Ed has a wiki post on how to create the finished data set from raw NOAA data. Ed also provides a backup of the completed data set for our use. Take the SQL Server database backup (download here) made available by Ed in his wiki post and restore it to your SQL Server 2008 R2 or later instance as a database called NOAA.

USE [master];
RESTORE DATABASE [NOAA] FROM DISK = N'C:\Demo\BigData\Sandy\NOAA.bak'
WITH MOVE N'NOAA2' TO N'C:\DATA\NOAA2.mdf',
MOVE N'NOAA2_log' TO N'C:\DATA\NOAA2_log.LDF';
GO

Since this data will be used by business users, add a view to the NOAA database with a friendlier name for the rainfall/flashflood data:

USE NOAA;
GO
CREATE VIEW flashflood AS SELECT * FROM [dbo].[nws_ffg7];

Take a look at a few rows of the data. For more information on what is available in the census data, see the U.S. Census Bureau website.

SELECT TOP 10 * FROM flashflood;

Hive

Hive is a part of the Hadoop ecosystem that allows you to create tables and impose structure on Hadoop data. It is available in HDInsight which is Microsoft’s distribution of Hadoop, sometimes referred to as Hadoop on Azure or Hadoop on Windows. HDInsight is currently available for preview in both Azure (HDInsight Service) and on-premises (HDInsight Server) versions. The HDInsight Server version is lightweight and simple to install – you can even put it on your laptop as it is a single node installation for now. Or you can request access to an Azure Hadoop cluster via HDInsight Service, though this takes a few days in the CTP phase. Hive is automatically installed as part of Hadoop in the preview versions, though the Hive ODBC driver requires a separate setup step. Hive can be described in many ways, but for your purposes within this article the key point is that it provides metadata and structure to Hadoop data that allows the data to be treated as “just another data source” to an ODBC compliant application such as Excel. HiveQL, or HQL, looks very similar to other SQL languages and has similar functionality.

ODBC

Once you have access to an HDInsight cluster, install the Hive ODBC driver to your local box (where you have Excel 2013). Make sure you install the ODBC driver platform (32-bit or 64-bit) that matches the platform of Excel. We recommend the 64-bit version of Excel 2013 and of the Hive ODBC driver since PowerPivot is able to take advantage of the larger memory available in x64. The Hive ODBC driver is available from the Downloads tile in the HDInsight Service (Azure) or HDInsight Server (on-premises) portal. Click on the appropriate installer (32-bit or 64-bit) and click through the Microsoft ODBC Driver for Hive Setup. You will end up with version .09 of the Microsoft Hive ODBC driver.

Make sure the Hive ODBC port 10000 is open on your Hadoop cluster (instructions here).

Create a Hive ODBC system DSN pointing to your Hadoop cluster. For this example I used the ODBC Data Source Administrator to create a system DSN called CGrossHOAx64 pointing to my http://HadoopOnAzure.com instance cgross.cloudapp.net with port = 10000 and an account called cgross1. For the on-premises version you can use localhost for the Host value, and you will not specify an account or password.

[Screenshots: ODBC Data Source Administrator and the Hive system DSN settings]

Note: On Windows Server 2008 R2/Windows 7 and earlier, if you must use 32-bit Excel and the 32-bit Hive ODBC driver (not recommended) on 64-bit Windows, you have to create the system DSN with the 32-bit version of the ODBC Data Source Administrator, located at C:\Windows\SysWOW64\odbcad32.exe. In Windows Server 2012/Windows 8 and later there is a single ODBC Data Source Administrator for both 32-bit and 64-bit drivers.

Key Hadoop Pointers
  • The Hadoop Command Prompt is installed with HDInsight. There will be an icon added to the desktop on your head node. When you open it you will see that it is a basic Windows Command Prompt but with some specific settings applied.
    [Screenshots: the Hadoop Command Prompt desktop icon and the open command window]
  • In early versions of HDInsight the Hive directory is not in the system path so you have to manually change to the Hive directory (cd %hive_home%\bin) to run Hive commands from the Hadoop Command Prompt. This issue will be fixed in future versions.
  • HDFS is the scale-out storage technology that comes as part of core Hadoop. How it works behind the scenes is covered in great detail in many other places and is not relevant to this discussion. This is the default storage in HDInsight Server (on-prem) and the CTP of HDInsight Service (Azure).
  • ASV, or Azure Storage Vault, is the Windows Azure implementation of HDFS that allows HDInsight to use Azure Blob storage. This will be the default for the production HDInsight Service.
Census Data

Census data has a format that can vary quite a bit. The data is collected in different ways in different countries, it may vary over time, and many companies add value to the data and make available extended data sets in various formats. It could contain large amounts of data kept for very long periods of time. In our example the pieces of census data we find useful and how we look at that data (what structures we impose) may change quite a bit as we explore the data. The variety of structures and the flexibility of the ways to look at the data make it a candidate for Hadoop data. Our need to explore the data from common BI tools such as Excel, which expect rows and columns with metadata, leads us to make it into a Hive table.

Census data was chosen because it includes age ranges and this fits with the initial scenario we are building to look at how many older individuals are in various high danger areas in the path of the hurricane. I have attached a demo-sized variation of this data set called Census2010.dat at the bottom of the blog. Download that tab-delimited U.S. census data set to the c:\data folder on the head node for your HDInsight cluster. The head node is the machine you installed your single node HDInsight Server (on-prem) on, such as your laptop, or the machine configured for Remote Desktop access from the portal in your HDInsight Service (Azure) cluster. If you wish to explore the raw census data used in this demo, take a look at this site: http://www.census.gov/popest/research/modified.html.

Next, load the census data to Hadoop’s HDFS storage. Unlike a relational system where the structure has to be well-known and pre-defined before the data is loaded, with Hadoop we have the option to load this data into HDFS before we even create the Hive table!

Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

I will show you three of the many options for loading the data into an external Hive table; choose any one of them. (In the real world you would have multiple very large files in this directory, but for the purposes of the demo we have one small file.) Note that the directories are created for you automatically as part of the load process. There are many other ways to do the load, including cURL, SFTP, Pig, etc., but the steps below are good for illustration purposes.

1) To load the data into HDFS via the Hadoop Command Prompt, open the Hadoop Command Prompt and type:

hadoop fs -put c:\data\census2010.dat /user/demo/census/census.dat

Let’s break down the command into its component pieces. The fs tells Hadoop you have a file system command, put is a data load command, c:\data\census2010.dat is the location of the file within NTFS on Windows, and /user/demo/census/census.dat is the location where we want to put the file within HDFS. Notice that I chose to change the name of the file during the load, mostly for convenience since the HDFS name I chose is shorter and more generic. HDInsight currently defaults to HDFS so I don’t have to specify that in the command, but if your system has a different default or you just want to be very specific you can specify HDFS in the command by adding hdfs:// before the location.

hadoop fs -put c:\data\census2010.dat hdfs:///user/demo/census/census.dat

Nerd point: You can specify the system name (instead of letting it default to localhost) by prefixing the location with hdfs://localhost (or a specific, often remote, server name). This is rarely done as it is longer and makes the code less portable.

hadoop fs -put c:\data\census2010.dat hdfs://localhost/user/demo/census/census.dat

2) If you are using HDInsight Service (Azure), you can choose to easily store the data in an Azure Blob Store via ASV. If you haven’t already established a connection from your HDInsight cluster to your Azure Blob Store then go into your HDInsight cluster settings and enter your Azure Storage Account name and key. Then instead of letting the data load default to HDFS specify in the last parameter that the data will be loaded to Azure Storage Vault (ASV).

hadoop fs -put c:\data\census2010.dat asv://user/demo/census/census.dat

3) From the JavaScript interactive console in the Hadoop web portal (http://localhost:8085/Cluster/InteractiveJS from the head node) you can use fs.put() and specify either an HDFS or ASV location as your destination (use any ONE of these, not all of them):

/user/demo/census/census.dat
asv://user/demo/census/census.dat
hdfs:///user/demo/census/census.dat

Check from a Hadoop Command Prompt that the file has loaded:

hadoop fs -lsr /user/demo/census

Results will vary a bit, but the output should include the census.dat file: /user/demo/census/census.dat. For example the Hadoop Command Prompt in HDInsight Server should return this row with a different date and time, possibly with a different userid:

-rw-r--r-- 1 hadoop supergroup 360058 2013-01-31 17:17 /user/demo/census/census.dat

If you want to run Hadoop file system (fs) commands from the JavaScript console, replace “hadoop fs -” with “#”.

#lsr /user/demo/census

This tells Hadoop that we want to execute a file system command and recursively list the subdirectories and files under the specified directory. If you chose to use ASV then add that prefix (I will assume for the rest of this blog that you know to add asv:/ or hdfs:// if necessary):

hadoop fs -lsr asv://user/demo/census

If you need to remove a file or directory for some reason, you can do so with one of the remove commands (rm = remove a single file, rmr = recursively remove a directory, its subdirectories, and all files in them):

hadoop fs -rm /user/demo/census/census.dat

hadoop fs -rmr /user/demo/census

For more details on rm or other commands:

hadoop fs -help

hadoop fs -help rm

hadoop fs -help rmr
Hive Table

Create an external Hive table pointing to the Hadoop data set you just loaded. To do this from a command line open your Hadoop Command Prompt on your HDInsight head node. On an Azure cluster, look for the Remote Desktop tile from the HDInsight Service portal. This allows you to connect to the head node and you can optionally choose to save the connection as an RDP file for later use. In the Hadoop Command Prompt, type in the keyword hive to open the interactive Hive command line interface console (CLI). In some early versions of HDInsight the Hive directory isn’t in the system path so you may have to first change to the Hive bin directory in the Hadoop Command Prompt:

cd %hive_home%\bin

Now enter the Hive CLI from the Hadoop Command Prompt:

hive

Copy and paste the CREATE statement below into the Hive CLI window, or paste it into the web portal via the Hive interactive screen. If you are using the ASV option, change the STORED AS line to include asv:/ before the rest of the string (asv://user/demo/census). Take note of the fact that you are specifying the folder (and implicitly all files in it) and not a specific file.

CREATE EXTERNAL TABLE census
(State_FIPS int,
County_FIPS int,
Population bigint,
Pop_Age_Over_69 bigint,
Total_Households bigint,
Median_Household_Income bigint,
KeyID string) 
COMMENT 'US Census Data' 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE LOCATION '/user/demo/census';

This CREATE command tells Hive to create an EXTERNAL table as opposed to an "internal" or managed table. This means you are in full control of the data loading, data location, and other parameters. Also, the data will NOT be deleted if you drop this table. Next in the command is the column list; each column uses one of the limited number of primitive data types available. The COMMENT is optional, and you can also add a COMMENT for one or more of the columns if you like. ROW FORMAT DELIMITED means we have a delimiter between each column and TERMINATED BY '\t' means the column delimiter is a tab. The only allowed row delimiter as of Hive 0.9 is a new line, and that is the default, so we don't need to specify it. Then we explicitly say this is a collection of TEXTFILEs stored at the LOCATION /user/demo/census (notice that we are not specifying a specific file, but rather the directory). We could specify either hdfs:/// or asv:// in front of the location; the current default storage location is HDFS, so the prefix is optional (but keep in mind the default can change!).

Now check that the table has been created:

SHOW TABLES;

Check the definition of the table:

DESCRIBE FORMATTED census;

And look at a sampling of the data:

SELECT * FROM census LIMIT 10;
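If you want to go a step beyond a raw sample, a hedged aggregate like the one below can confirm that the tab-delimited columns parsed as expected; the column names match the CREATE statement above, and the exact numbers will depend on your copy of the data:

-- Still inside the Hive CLI: roll the census rows up by state as a sanity check.
SELECT state_fips,
       COUNT(*) AS county_rows,
       SUM(population) AS total_pop,
       SUM(pop_age_over_69) AS pop_over_69
FROM census
GROUP BY state_fips
LIMIT 10;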

You can now choose to leave the Hive CLI (even the EXIT command requires a semicolon to terminate it):

EXIT;

Excel 2013 Visualizations

Preparation

In Excel 2013 you don’t have to download PowerPivot or Power View, you just need to make sure they are enabled as COM add-ins. All of the below steps assume you are using Excel 2013.

In Excel 2013 make sure the "Hive for Excel", "Microsoft Office PowerPivot for Excel 2013", and "Power View" add-ins are enabled. To do this click on the File tab in the ribbon, then choose Options from the menu on the left. This brings up the Excel Options dialog. Choose Add-ins from the menu on the left. At the very bottom of the box you'll see a section to manage the add-ins. Choose "COM Add-ins" and click on the Go… button. This brings up the COM Add-Ins dialog where you can double check that you have checkmarks next to the Hive, PowerPivot, and Power View add-ins. Click OK.

[Screenshots: Excel Options and the COM Add-Ins dialog]

PowerPivot Data

Open a new blank workbook in Excel 2013. We are going to pull some data into PowerPivot, which is an in-memory tabular model engine within Excel. Click on the PowerPivot tab in the ribbon then on the Manage button in the Data Model group on the far left. Click on “Get External Data” in the PowerPivot for Excel dialog.

[Screenshot: Get External Data in the PowerPivot for Excel window]

We’re going to mash up data from multiple sources – SQL, Hive, and Excel.

SQL Data

First, let’s get the SQL Server data from the NOAA database. Choose “From Database” then “From SQL Server”. In the Table Import Wizard dialog enter the server and database information:

Friendly connection name = NOAA
Server name = [YourInstanceName such as MyLaptopSQL2012]
Log on to the server = “Use Windows Authentication”
Database name = NOAA

[Screenshot: Table Import Wizard connection settings for the NOAA SQL Server database]

Choose “Next >” then “Select from a list of tables and views to choose the data to import”. Put a checkmark next to the flashflood view we created earlier. Click “Finish”. This will import the flash flood data from SQL Server into the in-memory xVelocity storage used by PowerPivot.

[Screenshot: selecting the flashflood view in the Table Import Wizard]

Hive Data

Now let’s load the Hive data. Click on “Get External Data” again. This time choose “From Other Sources” then “Others (OLEDB/ODBC)” and click on the “Next >” button. Enter Census for the friendly name then click on the “Build” button.

Friendly name for this connection = Census
Connection String = “Build”
Provider tab:
OLE DB Provider(s) = “Microsoft OLE DB Provider for ODBC Drivers”
Connection tab:
Use data source name = CGrossHOAx64 [the Hive ODBC DSN created above]
User name = [login for your Hadoop cluster, leave blank for on-premises Nov 2012 preview]
Password = [password for your Hadoop cluster, leave blank for on-premises Nov 2012 preview]
Allow saving password = true (checkmark)

[Screenshots: provider selection and the Hive DSN connection settings]

This creates a connection string like this:
Provider=MSDASQL.1;Persist Security Info=True;User ID=cgross1;DSN=CGrossHOAx64;Password=**********

[Screenshot: Table Import Wizard connection settings for the Census Hive data]

Choose “Next >” then “Select from a list of tables and views to choose the data to import”. Choose the census table then click on “Finish”. This loads a copy of the Hive data into PowerPivot. Since Hive is a batch mode process which has the overhead of MapReduce job setup and/or HDFS streaming for every query, the in-memory PowerPivot queries against the copy of the data will be faster than queries that go directly against Hive. If you want an updated copy of the data loaded into PowerPivot you can click on the Refresh icon at the top of the PowerPivot window.

Save the Excel file.

Excel Data

Our last data set will be a set of state abbreviations copied into Excel then accessed from PowerPivot. Go back to your Excel workbook (NOT the PowerPivot for Excel window). Rename the first sheet using the tab at the bottom to ShortState (just for clarity, this isn’t used anywhere).

Open https://www.usps.com/send/official-abbreviations.htm in your web browser. Copy the entire “States” table including the headers. Paste the States data into the ShortState worksheet in Excel. Change the headers from State/Possession to State and from Abbreviation to ShortState. Highlight the columns and rows you just added (not the entire column or you will end up with duplicate values). Go to the Insert tab in the ribbon and click on Table in the Tables group on the far left. It will pop up a “Create Table” window; make sure the cells show “=$A$1:$B$60” and that “My table has headers” is checked.

[Screenshot: Create Table dialog for the state abbreviations]

Now you will have a Table Tools ribbon at the top on the far right. Click on the Design tab then in the Properties group in the upper left change the Table Name to ShortState. The fact that the worksheet tab, the table, and one column have the same name is irrelevant; that just seemed less confusing to me at the time. Now you have an Excel based table called ShortState that is available to be used in the tabular model.

[Screenshot: ShortState table in the Excel worksheet]

Data Model

Click on the PowerPivot tab and choose Add to Data Model at the top in the Tables group. This will change the focus to the PowerPivot for Excel window. Go to the Home tab in the PowerPivot for Excel window and click on Diagram View in the View group in the upper right. You can resize and rearrange the tables if that makes it easier to visualize the relationships. Add the following relationships by clicking on the relevant column in the first table then dragging the line to the other column in the second table:

Table Flashflood, column State to table ShortState, column State

Table Census, column keyid to table flashflood, column KEYID

[Screenshot: PowerPivot diagram view showing the relationships]

Click on the flashflood table and then on the Create Hierarchy button in the upper right of the table (as circled in the image above). Name the hierarchy Geography and drag State then County under it.

[Screenshot: Geography hierarchy in the flashflood table]

Save the Excel file. You may need to click on “enable content” if you close/reopen the file.

Calculations

In the PowerPivot for Excel window, we’re going to add some calculated columns for later use. Click on the Data View button in the View group at the top. You should now see three tabs at the bottom, one for each of the tables in the model. Click on the census table, right-click on “Add Column” at the far right of the column headers, and choose Insert Column. This will create a new column named CalculatedColumn1. In the formula bar that opened above the column headers paste in “=[Pop_Age_Over_69]/[Population]” (without the quotes). This will populate the new column with values such as 0.0838892880358548. Right-click on the column and choose “Rename Column” – name it PctPopOver69 since it gives us the percentage of the total population that is over 69 years old. That’s the first formula below; repeat those steps for the 2nd and 3rd formulas. PctBuckets bucketizes the output into more meaningful groupings while SortBucketsBy gives us a logical ordering of the buckets. Since most of the numbers fall into a range of less than 20%, the buckets are skewed towards that lower range that actually has a variety of data.

  • PctPopOver69=[Pop_Age_Over_69]/[Population]
  • PctBuckets=IF([PctPopOver69]<=.05,"<5%",IF([PctPopOver69]<.1,"5-10%",IF([PctPopOver69]<.15,"10-15%",IF([PctPopOver69]<.2,"15-20%",">20%"))))
  • SortBucketsBy=IF([PctPopOver69]<=.05,1,IF([PctPopOver69]<.1,2,IF([PctPopOver69]<.15,3,IF([PctPopOver69]<.2,4,5))))

[Screenshot: calculated columns in the census table]

Now let’s use those new calculated columns. Highlight the new PctBuckets column (the 2nd one you just added) and click on the Sort by Column button in the Sort and Filter group. Pick Sort By Column and choose to sort PctBuckets by SortBucketsBy. This affects the order of data in the visualizations we will create later.

[Screenshot: Sort by Column settings for PctBuckets]

Since some of the columns are purely for calculations or IT-type use and would confuse end users as they create visualizations and clutter the field picker, we’re going to hide them. In the Census table right-click on the PctPopOver69 column and choose “Hide from Client Tools”. Repeat for census.SortBucketsBy, census.keyid, census.State_FIPS, census.County_FIPS, flashflood.State_FIPS, flashflood.County_FIPS, and flashflood.KEYID. Our PowerPivot data is now complete so we can close the PowerPivot for Excel window. Note that the Excel workbook only shows the ShortState tab. That’s expected as this is the Excel source data, not the in-memory PowerPivot tables that we’ve created from our three data sources. All of our data is now in the spreadsheet and ready to use in a report.

Save the Excel file.

Power View

Now let’s use Power View to create some visualizations. In the main Excel workbook, click on the Insert tab in the ribbon and then under the Reports group (near the middle of the ribbon) click on Power View.

[Screenshot: Power View button on the Insert tab]

This creates a new tab in your workbook for your report. Rename the tab to SandyEvacuation. In the Power View Fields task pane on the right, make sure you choose All and not Active so that every table in the model is visible. Uncheck any fields that have already been added to the report.

Where it says “Click here to add a title” enter a single space (we’ll create our own title in a different location later). In the Power View tab click on the Themes button in the Themes group. Choose the Solstice theme (7 from the end). This theme shows up well on the dark, stormy background we will add later and it has colors that provide good indicators for our numbers.

Now let’s add some data to the report. In the Power View Fields task pane, expand the flashflood table and add a checkbox to State then to County. This will create a new table on the left. Add HR1, HR3, HR6, HR12, and HR24 from the flashflood table in that order (you can change the order later at the bottom of the task pane). Add Latitude and Longitude from flashflood. Expand census and click on PctBuckets. Now we have the data we need to build a map-based visualization that will show rainfall across a geographic area.

Note: If you get a message at any point asking you to “enable content”, click on the button in the message to do so.

Map

In the Power View Fields task pane go to the Fields section at the bottom. Click on the arrow to the right of Latitude and Longitude and change both from sum to average.

[Screenshot: Latitude and Longitude set to average in the Fields section]

With the focus on the table in the main window, click on the Design tab then Map within the Switch Visualization group on the left. This creates one map per state and changes the choices available in the task pane. Change the following fields at the bottom of the screen to get a map that allows us to see the path of Hurricane Sandy as it crosses New England.

TILE BY = [blank]
SIZE = HR6 (drag it from the field list to replace HR1, set it to minimum)
LOCATIONS = Remove the existing field and drag down the Geography Hierarchy from flashflood
LONGITUDE = Longitude
LATITUDE = Latitude
COLOR = PctBuckets
VERTICAL MULTIPLES = [remove field so it’s blank]
HORIZONTAL MULTIPLES = [blank]

[Screenshot: field settings for the map visualization]

With the focus still on the map, click on the LAYOUT tab in the ribbon. Set the following values for the buttons in the Labels group; they will apply to this particular visualization (the map):

Title = none
Legend = Show Legend at Right
Data Labels = Center
Map Background = Road Map Background

[Screenshot: the finished map visualization]

Go to the Power View tab and make sure the focus is on the map. Click on Map in the Filters section to the right of the report. Click on the icon to the far right of State. This adds another icon (a right arrow) for “Advanced filter mode”; click on that. Choose “is not blank” from the first drop down. Click on “apply filter” at the bottom. Repeat for County. This eliminates irrelevant data with no matching rows.

[Screenshot: advanced filter settings for State and County]

Make it Pretty

Go to the Power View tab so we can add some text. In the Insert group click on Text Box. It will insert a box somewhere on the page; paste in “Evacuation Targets – % of Population Over 69 by County” (without the quotes). Size the box so that it is a long rectangle with all the text on one line, then move the box above the map. Repeat for a text box that says “Hurricane Sandy 2012” and move that above the previous text box. Make the text size of the last box = 24 in the TEXT tab (resize the box if needed).

Hint: To move an object on the report, hover over the object until a hand appears then drag the object.

[Screenshot: report title text boxes]

Download some Hurricane Sandy pictures. In the Power View tab go to the Background Image section. Choose Set Image and insert one of your hurricane images. Set Transparency to 70% or whatever looks best without overwhelming the report. You may need to change the Image Position to stretch. In the Themes group of the Power View tab set Background to “Light1 Center Gradient” (gray fade in to center). Move the two titles and the map as far into the left corner as you can to leave room for the other objects. If you like, reduce the size of the map a bit (hover near a corner until you get a double-headed arrow) and add a couple more pictures on the right side (Insert group, Picture button).

[Screenshot: report with the hurricane background image]

Save the Excel file.

Slicers

Now let’s create some slicers and add another chart to make it more visually appealing and useful.

Click in the empty area under the map; we are going to create a new object. Click on flashflood.State in the task pane to add that column to the new table. Click on the scroll bar on the right side of the table to make sure the focus is on the table. Click on Slicer in the Design tab. Stay in the Design tab and under the Text section click on the “A with a down arrow” button to reduce the font by one size. If necessary, do some minor resizing of the table so it doesn’t fall off the page or cut off a state name in the middle but still shows a handful of rows. Expand Filters on the right side of the screen and drag flashflood.State to the filters section. Choose the Advanced button, “is not blank”, and “apply filter”. Repeat these steps to create a separate slicer for County. Now when you click on a particular county in the slicer, for example, everything on the report will be filtered to that one county.

[Screenshot: State and County slicers]

Save the Excel file.

Rainfall Chart

Click in the empty space to the right of the new County slicer. Add ShortState.ShortState and 5 separate flashflood columns, one for each time increment collected: HR1, HR3, HR6, HR12, HR24. In the Design tab choose “Other Chart” then Line. In the “Power View Fields” task pane make sure TILE BY is empty, VALUES for HR1-24 = AVERAGE, and AXIS = ShortState.

[Screenshot: field settings for the rainfall line chart]

Reduce the font size by 1 in the Text group of the Design tab, then resize to make sure all values from CT to WV show up in the chart (stretch it to the right). In the Power View tab click on Text Box in the Insert group, move the box to be above the new chart, and type in the text “Average Rainfall by State”.

Make sure the focus is on the rainfall chart and go to the LAYOUT tab:

Title = None
Legend = Show Legend at Right
Data Labels = Above

[Screenshot: the completed Hurricane Sandy report]

Save the Excel file.

That’s the completed report using the same steps as the report I demo’d at PASS Summit 2012 (download that report from the bottom of the blog). Try changing the SIZE field to HR1 (rainfall projection for the next hour), HR6, etc. Click through on the state map values to see the more detailed county level values. Click on a slicer value to see how that affects the map.

What you see in this report is a combination of SQL Server (structured) data and Hive (originally unstructured) data all mashed up into a single report, so the end user doesn’t even have to know about the data sources. You can add more types of data, change the granularity of the data, or change how you visualize the data as you explore. The Hive data in particular is very flexible: you can change the structure you’ve imposed without reloading the data and add new data sources without lots of pre-planning. Using BI client tools to access all the additional rich data companies are starting to archive and mine in Big Data technologies such as HDInsight is very powerful, and I hope you are starting to see the possibilities of how to use Hive in your own work.
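For example, because census is an EXTERNAL table you could drop it and impose a narrower structure over the very same files, without reloading anything. This is only a hedged sketch (the reduced column list is hypothetical); try it in the Hive CLI once you are done with the walkthrough:

-- Dropping an EXTERNAL table removes only the metadata; the files stay in /user/demo/census.
DROP TABLE census;
-- Re-impose a narrower structure over the same tab-delimited files; extra trailing fields are ignored.
CREATE EXTERNAL TABLE census
(State_FIPS int,
County_FIPS int,
Population bigint)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/user/demo/census';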

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: The CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

DemoSandy2012.zip