Tag: HDInsight

  • Your First HDInsight Cluster–Step by Step

    Small Bites of Big Data from AZURECAT
    Big Data Tech Training Series #1
    Cindy Gross | Murshed Zaman

    Sometimes it is just hard to get started. Have you been putting off your first foray into Hadoop? Are you not sure where to begin? Let’s get really basic.

    Prerequisites:

    Log on to the Windows Azure Portal https://manage.windowsazure.com

    Go to storage https://manage.windowsazure.com/#Workspaces/StorageExtension/storage. Create a storage account in a location that is available to HDInsight (as of November 2013 that’s East US, West US, and North Europe). Do NOT choose an affinity group. If you choose to “Enable Geo-Replication” there will be an extra charge – it’s probably not necessary for a demo/test account as you have a limited amount of credit in the trial subscription. In the portal choose the STORAGE icon on the left. Then click on +NEW at the bottom. That opens a QUICK CREATE window. Enter a unique name for your storage, such as sqlcatwomanrules. It only allows lower case letters and numbers.

    StorageNov2013 NewStorageAccountNov2013

    Now click on the HDInsight icon just below the storage icon storage. Choose QUICK CREATE. Enter a unique name for your HDInsight cluster. For a demo choose 4 data nodes. Enter a password that contains upper and lower case letters, a number, and a special character. Choose the storage account you created above. Once you click on “CREATE HDINSIGHT CLUSTER” it will take several minutes for the cluster to be deployed.

    StorageNov2013_HDI QuickCreateNov2013

    Once it completes you are ready to use your cluster!

    ClusterQuickStart

    If you won’t be using the cluster right away, go ahead and delete it (look for the icon at the bottom of the portal) to save compute time and money. You can easily recreate it when you need it.

    image

    Look for more blogs soon on customizing your cluster with CUSTOM CREATE or PowerShell and on automating deployment and jobs with PowerShell. In the meantime see if you can get Invoke-Hive working from PowerShell for some simple Hive commands such as:

    Invoke-Hive “select * from hivesampletable limit 10”

    Big Data Technical Series:

    Your First HDInsight Cluster–Step by Step

    Automating HDInsight cluster creation with PowerShell

  • PowerShell for Azure cmdlets: Subscription was all Wacky

    I was working on some HDInsight scripts in PowerShell and doing lots of experimenting. I’m not sure what exactly I did but all of a sudden everything stopped working. With lots of interruptions from meetings and chats and lunch…. I couldn’t retrace my steps. Everything seemed to fail on the Azure subscription information so I tried to get really basic – what did Get-AzureSubscription|%{$_.SubscriptionName} return? As it turns out, wacky garbage:

    set-azuresubscription ?
    !?
    !
    help
    sqlcatwoman
    ?
    Sqlcatwomanwestus

    What I expected to see was my single subscription:

    sqlcatwoman

    So what happened? The Azure portal only shows one subscription. Obviously those other lines are not valid subscriptions – they look like the output of a help command or an error. Reinstalling the cmdlets, rebooting, and reimporting certificates didn’t help. I turned to my AzureCAT coworkers for help and @elcid98 pointed out this blog post that talks about how subscriptions are used in PowerShell:

    Azure Subscriptions in PowerShell demystified http://yossidahan.wordpress.com/2012/12/12/azure-subscriptions-in-powershell-demystified/

    This caught my attention: “The second file – DefaultSubscriptionData.xml – also lists the available subscriptions and the associated certificates’ thumbprints“. Ok, where is that file? A search finds it in

    C:Users%username%AppDataRoamingWindows Azure Powershell

    I checked and sure enough, where I would expect just one entry I see multiple – and they’re named the same thing as the garbage in my output! I cleared out all but one entry to end up with this:

    <?xml version=”1.0″ encoding=”utf-8″?>
    < Subscriptions xmlns:xsd=”http://www.w3.org/2001/XMLSchema” xmlns:xsi=”http://www.w3.org/2001/XMLSchema-instance” version=”0″ xmlns=”urn:Microsoft.WindowsAzure.Management:WaPSCmdlets”>
    < Subscription name=”sqlcatwoman”>
    < SubscriptionId>You don’t get to see the real info!</SubscriptionId>
    <Thumbprint>Not here either!</Thumbprint>
    < ServiceEndpoint>https://management.core.windows.net</ServiceEndpoint>
    < /Subscription>
    < /Subscriptions>

    Hmmmm…. I still got an error from Get-AzureSubscription. Back to C:Users%username%AppDataRoamingWindows Azure PowerShell. What’s this? WindowsAzureProfile.xml also has all the same junk! I cleared out all the extras to end up with this:

    <?xml version=”1.0″ encoding=”utf-8″?>
    < ProfileData xmlns:i=”http://www.w3.org/2001/XMLSchema-instance” xmlns=”http://schemas.datacontract.org/2004/07/Microsoft.WindowsAzure.Commands.Utilities.Common”>
    < DefaultEnvironmentName>AzureCloud</DefaultEnvironmentName>
    <Environments />
    <Subscriptions>
    < AzureSubscriptionData>
    < ActiveDirectoryEndpoint>https://login.windows.net/</ActiveDirectoryEndpoint>
    < ActiveDirectoryTenantId>More secrets!</ActiveDirectoryTenantId>
    < ActiveDirectoryUserId>cgross@microsoft.com</ActiveDirectoryUserId>
    < CloudStorageAccount i:nil=”true” />
    < IsDefault>true</IsDefault>
    < LoginType i:nil=”true” />
    < ManagementCertificate>Hiding this one too!</ManagementCertificate>
    < ManagementEndpoint>https://management.core.windows.net/</ManagementEndpoint>
    <Name>sqlcatwoman</Name>
    <RegisteredResourceProviders xmlns:d4p1=”http://schemas.microsoft.com/2003/10/Serialization/Arrays” />
    < SubscriptionId>And more secrets</SubscriptionId>
    </AzureSubscriptionData>
    </Subscriptions>
    < /ProfileData>

    Success! Get-AzureSubscription now returns just my single, valid subscription. All my other Azure cmdlets magically started working again. I don’t know how it got that way, but at least now I know where the subscription information is stored. I hope this helps someone else with their Azure subscription PowerShell scripting!

  • HDInsight Big Data Talks from #SQLPASS

    SQL PASS Summit 2013 was another great data geek week! I chatted with many of you about Big Data, Hadoop, HDInsight, architecting solutions, SQL Server, data, BI, analytics, and general geekiness – great fun! This time around I delivered two talks on Hadoop and HDInsight – the slides from both are attached.

    Zero to 60 with HDInsight takes you from an overview of Big Data and why it matters (zero) all the way through an end to end solution (60). We discussed how to create an HDInsight cluster with the Azure portal or PowerShell and talked through the architecture of the data and analysis behind the release of Halo 4. We talked about how you could use the same architectural pattern for many projects and walked through Hive and Pig script examples. We finished up with how to use Power Map (codename GeoFlow) over that data to gain new insights and improve the game experience for the end user.

    The next session I co-presented with HDInsight PM Dipti Sangani: CAT: From Question to Insight with HDInsight and BI. We went deeper this time. Not only did we present an end to end story with how our own internal Windows Azure SQL Database team uses telemetry to improve your experience with SQL Server in Azure PaaS but we also went deeper with demos of Hive, Pig, and Oozie. We also gave another archetypical design scenario that will apply to many of your own scenarios and talked about how HDInsight fits with SQL Server and your other existing infrastructure. The deck covers your cloud and on-premises options for Hadoop on Windows including HDInsight Service, Hortonworks HDP for Windows, OneBox, and PDW with Polybase.

    Please let me know if you have any questions from the talks or just general HDInsight questions!

    PASSSummit2013BigData.zip

  • Big Data Twitter Demo

    Real-time. Social Sentiment Analysis. Twitter. Cloud. Insights. We have your Big Data buzzwords here!

    Everyone seems to want to incorporate social sentiment into their business analysis. Well we have the demo for you! Use it for a quick demonstration of what can be done and when the excitement goes through the roof, use it to inspire your own design!

    Real-Time Processing – Instant Insights!

    First, use Event Driven Processing (EDP) to show the data on a dashboard. In the demo you’ll use StreamInsight, also referred to as Complex Event Processing (CEP), though you may want to use Microsoft Open Technologies’s RX / Reactive Extensions in your own project. Use the dashboard to make real-time decisions and take immediate action. For example, configure your EDP to “pop” only on terms related to your company and your marketing analyst can watch how the volume of tweets and the sentiment (measured positive/neutral/negative in this example) change in response to your Super Bowl ad or a mention of your company on a the news show. She can respond instantly with changes to your website, your own tweets, sales/promotions, or whatever is appropriate to your business.

    Data Storage – Enable Insights!

    EDP reads the data as it is pushed through a query, there is no inherent storage involved. You could just discard the data and never store it. In this example we chose to store the data for later trending and historical analysis. We take the tweet id, the date/time the tweet was captured, the sentiment score calculated during the real-time processing, and the keyword that caught our attention and store it in SQL Azure. This data is available to other applications that need to join it with existing data and requires fast responses to individual queries. The remaining data including the raw tweet, any geographic data the tweeter chose to share, and other data is dropped in an Azure Blob Store.

    Trends, Patterns, and Historical Insights!

    Now point HDInsight (Hadoop) to the Azure Blob Storage using HDInsight’s ASV extension to HDFS. You can spin up a Hadoop cluster in Azure, pay for as many nodes as you need for as long as you need them then spin them down to save money. The data remains in the blob store – available for future Hadoop clusters, other applications, archival, or whatever you need it for. Add structure to the JSON data with Hive and now you have rows and columns that can be accessed by BI tools!

    Visualization for Powerful Insights!

    Now create a PowerPivot tabular model (self-service BI / client side) or an Analysis Service tabular model (corporate BI) to store the relevant data in a highly compressed, in-memory format for fast access. Add in a few Power View visualizations mashing up data from multiple structured and unstructured sources and you can show your business decision makers easily digestible and understandable data in a format they just get and love! Make some decisions, take some actions, and you’ve just shown how to turn free Twitter data into a valuable resource that can have a direct impact on your company!

    How to Get These Insights

    Follow the instructions on the CodePlex site for the Big Data Twitter Demo project. Set it up, run through the demo, get excited, and go improve your business!

    Demo Created By:

    Vu Le

    Andrew Moll

    Aviad Ezra @aviade_pro

    Brad Sarsfield @Bradoop

    Later Additions By:

    Lara Rubbelke @SQLGal | http://sqlblog.com/blogs/lara_rubbelke/

    Robert Bruckner http://blogs.msdn.com/b/robertbruckner/

    Cindy Gross @SQLCindy | http://blogs.msdn.com/cindygross

  • Access Azure Blob Stores from HDInsight

    Small Bites of Big Data

    Edit Mar 6, 2014: This is no longer necessary for HDInsight – you specify the storage accounts when you create the cluster and the rest happens auto-magically. See http://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.

    One of the great enhancements in Microsoft’s HDInsight distribution of Hadoop is the ability to store and access Hadoop data on an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). This allows you to persist data even after you spin down an HDInsight cluster and to make that data available across multiple programs or clusters from persistent storage. Blob stores can be replicated for redundancy and are highly available. When you need to access the data from Hadoop you point your cluster at the existing data and the data persists even after the cluster is spun down.

    Azure Blob Storage

    Let’s start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: http://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.

    You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.

    AzureStorageContainers_thumb AzureStorageManageContainers_thumb AzureStorageNewContainer_thumb

    HDInsight Service Preview

    When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn’t allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/.

    Quick Create: image

    Custom Create: image

    You can then add additional storage accounts to the cluster by updating C:appsdisthadoop-1.1.0-SNAPSHOTconfcore-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the “Edit container metadata” dialog on the Azure portal.

    image StorageContainerEdit

    The key storage properties in the default core-site.xml on HDInsight Service Preview are:

    <property>
    <name>fs.default.name</name>
    <!– cluster variant –>
    <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
    <description>The name of the default file system.  Either the
    literal string “local” or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    <property>
    <name>dfs.namenode.rpc-address</name>
    <value>hdfs://namenodehost:9000</value>
    </property>

    <property>
    <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
    <value>YOURActualStorageKeyValue</value>
    </property>

    To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick storage from the left menu. Click on the account you want to use then at the very bottom click on the “MANAGE KEYS” button. Cut and paste the PRIMARY ACCESS KEY (you can use the secondary if you prefer) into the new property values we’ll discuss below.

    AzureStorageAccounts_thumb AzureStorageContainerKeys_thumb1 AzureStorageMgKeys_thumb1

    Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node via the “Local Resources” tab under the “More” button in the “Local devices and resources” section. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved RDP).

    image image

    On the head node make a copy of C:appsdisthadoop-1.1.0-SNAPSHOTconfcore-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.

    Add your 2nd Azure storage account by adding another property.

    <property>
    <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
    <value>YOUR_SECOND_ActualStorageKeyValue</value>
    </property>

    Save core-site.xml.

    Repeat for each storage account you need to access from this cluster.

    HDInsight Server Preview

    If you have downloaded the on-premises HDInsight Server preview from http://microsoft.com/bigdata that gives you a single node “OneBox” install to test basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance as it is limited to a single node. The steps are slightly different for on-premises as the installation directory and default properties in core-site.xml are different.

    Make a backup copy of C:Hadoophadoop-1.1.0-SNAPSHOTconfcore-site.xml from your local installation (local could be on a VM).

    Edit core-site.xml:

    1) Change the default file system from local HDFS to remote ASV

    <property>
    <name>fs.default.name</name>
    <!– cluster variant –>
    <value>hdfs://localhost:8020</value>
    <description>The name of the default file system.  Either the
    literal string “local” or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    to:

    <property>
    <name>fs.default.name</name>
    <!– cluster variant –>
    <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
    <description>The name of the default file system.  Either the
    literal string “local” or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    2) Add the namenode property (do not change any values)

    <property>
    <name>dfs.namenode.rpc-address</name>
    <value>hdfs://namenodehost:9000</value>
    </property>

    3) Add the information that associates the key value with your default storage account

    <property>
    <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
    <value>YOURActualStorageKeyValue</value>
    </property>

    4) Add any additional storage accounts you plan to access

    <property>
    <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
    <value>YOUR_SECOND_ActualStorageKeyValue</value>
    </property>

    Save core-site.xml.

    Files

    Upload one or more files to your container(s). You can use many methods for loading the data including Hadoop file system commands such as copyFromLocal or put, 3rd party tools like CloudXPlorer, JavaScript, or whatever method you find fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to C: which is local to the head node) using the Hadoop put command:

    hadoop fs -put c:data asv://data@sqlcatwomanblog.blob.core.windows.net/

    Or upload a single file:

    hadoop fs -put c:databacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt

    To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):

    hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/

    Found 1 items
    -rwxrwxrwx   1        124 2013-04-24 20:12 /bacon.txt

    In this case the container data on the private storage account sqlcatwomanblog has one file called bacon.txt.

    For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.

    hadoop fs -ls asv:///bacon.txt
    hadoop fs -ls /bacon.txt

    More Information

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.

    Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.

    Digg This
  • HDInsight: Jiving about Hadoop and Hive with CAT

    Tomorrow I will be talking about Hive as part of Pragmatic Work’s Women in Technology (WIT) month of webcasts. I am proud to be part of this lineup with all these stellar WITs! I encourage my fellow WITs to get more involved in your data community and if you don’t already do so start tweeting, blogging, and speaking. I am happy to coach you through your first speaking engagement if you are interested. Get out there and start showing the world what you can do!

    Thursday’s talk is going to be HDInsight: Jiving about Hadoop and Hive with CAT. Let’s break that title down.

    HDInsight is Microsoft’s distribution of Hadoop. As part of the HDInsight project we have checked code back into the core Apache Hadoop source code to make the core code runs great on Windows. We are also adding functionality and features such as JavaScript and Azure Storage Vault that make the product more robust and enterprise friendly. This week the HDInsight Service Preview on Azure became available to those with an Azure subscription.

    Hadoop is a scale out methodology that allows businesses to quickly consume and analyze data in ways they haven’t been able to before. This can lead to faster, better business insights and business actions.

    Hive is a way to impose metadata and structure on the loosely structured (unstructured, multi-structured, semi-structured) data that resides in Hadoop’s HDFS file system. With Hive and the Hive ODBC driver you can make Hadoop data look like any other data source to your familiar BI tools such as Excel. PowerPivot can connect to Hive data, mash that data up with existing data sources such as SQL Azure, SQL Server, and OData, and allow you to visualize it with Power View. I have an end to end demo of this: Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View.

    CAT is my team at Microsoft. The Customer Advisory Team (CAT) works with customers who are doing new, unusual, and interesting things that push the boundaries of technology. We share what we learn with the community so you can do your jobs better and we take what we learn from you to the product team to help improve the product.

    My slides are attached at the bottom of this post. I believe a recording of the talk will be posted by Pragmatic Works on their site.

    I look forward to “seeing” you all at my talk tomorrow and would love to see your tweets or hear directly from you afterwards.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

    Digg This

    PragmaticWorksHDInsightJivingAboutHadoopAndHiveWithCATMar202013.pptx

  • HDInsight: Hive Internal and External Tables Intro

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    HDInsight is Microsoft’s distribution, in partnership with Hortonworks, of Hadoop. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

    Use EXTERNAL tables when:

    • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
    • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
    • You want to use a custom location such as ASV.
    • Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
    • You are not creating table based on existing table (AS SELECT).

    Use INTERNAL tables when:

    • The data is temporary.
    • You want Hive to completely manage the lifecycle of the table and data.

    We’ll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively you can simply open notepad and create your own files with a series of single column rows. If you create your own files make sure you have a carriage return/line feed at the end of all rows including the last one. The files should be in a Windows directory called c:data on the HDInsight Head Node. For HDInsight Server (on-premises) that’s the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop connection (RDP) to the head node from the Hadoop portal.

    Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

    Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services as of February 2013 you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward I will show you how to do all the steps through the Hive CLI (command line interface).

    Open a Hadoop Command Prompt:

    HadoopCmdPrompt

    Change to the Hive directory (necessary in early preview builds of Hive):

    cd %hive_home%bin

    Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

    hadoop fs -put c:databacon.txt /user/demo/food/bacon.txt

    hadoop fs -lsr /user/demo/food

    The put command doesn’t return a result, the list command returns one row per file or subdirectory/file:

    -rw-r–r–   1 cgross supergroup        124 2013-02-05 22:41 /user/demo/food/bacon.txt

    Enter the Hive CLI (command line interface):

    hive

    Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator, no result is returned from this set command):

    Set hive.cli.print.header=true;

    Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

    CREATE INTERNAL TABLE internal1 (col1 string) LOCATION ‘/user/demo/food’;

    Oops… that failed because INTERNAL isn’t a keyword, the absence of EXTERNAL makes it a managed, or internal, table.

    FAILED: Parse Error: line 1:7 Failed to recognize predicate ‘INTERNAL’.

    So let’s create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

    CREATE TABLE internal1 (col1 string) LOCATION ‘/user/demo/food’;

    That will return the time taken but no other result. Now let’s look at the schema that was created:. Note that the table type is MANAGED_TABLE.

    DESCRIBE FORMATTED internal1;

    col_name        data_type       comment
    # col_name              data_type               comment

    col1                    string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 22:45:57 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             MANAGED_TABLE
    Table Parameters:
    transient_lastDdlTime   1360133157

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    And now look at some rows:

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon

    What happens if we don’t specify a directory for an INTERNAL table?

    CREATE TABLE internaldefault (col1 string);

    It is created in the default Hive directory, which by default is in /hive/warehouse (dfs shells back out to Hadoop fs):

    dfs -lsr /hive/warehouse;

    We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory:
    drwxr-xr-x   – cgross supergroup          0 2013-02-05 22:52 /hive/warehouse/internaldefault

    However, we won’t use this table for the rest of the demo so let’s drop it to avoid confusion. The drop also removes the subdirectory.

    DROP TABLE internaldefault;

    dfs -lsr /hive/warehouse;

    Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let’s add a 2nd file to the first internal table and check that it exists:

    dfs -put c:databacon2.txt /user/demo/food/bacon2.txt;

    dfs -lsr /user/demo/food;

    -rw-r–r–   1 cgross supergroup        124 2013-02-05 23:04 /user/demo/food/bacon.txt
    -rw-r–r–   1 cgross supergroup         31 2013-02-05 23:03 /user/demo/food/bacon2.txt

    Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output – there is no row value of col1 in the data as headers are not generally included in Hadoop data):

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    Now let’s create an EXTERNAL table that points to the same directory and look at the schema:

    CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION ‘/user/demo/food’;

    DESCRIBE FORMATTED external1;

    col_name        data_type       comment
    # col_name              data_type               comment

    cole1                   string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 23:07:12 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             EXTERNAL_TABLE

    Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1360134432

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

    Now look at the data:

    SELECT * FROM external1;

    The result set is a combination of the two bacon files:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    That table returns the same data as the first table – we have two tables pointing at the same data set! We can add another one if we want:

    CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION ‘/user/demo/food’;

    DESCRIBE FORMATTED external2;

    SELECT * FROM external2;

    You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

    Add another data file to the same directory and see how it’s visible to all the tables that point to that directory:

    dfs -put c:dataveggies.txt /user/demo/food/veggies.txt;

    SELECT * FROM internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    Each table will return the same results:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn
    SQL_Apple
    NoSQL_Pear
    SQLFamily_Kiwi
    Summit_Mango
    HDInsight_Watermelon
    SQLSat_Strawberries
    Raspberrylimelemonorangecherryblueberry 123 456

    Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

    DROP TABLE internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    dfs -lsr /user/demo/food;

    Result: lsr: Cannot access /user/demo/food: No such file or directory.

    Because the INTERNAL (managed) table is under Hive’s control, when the INTERNAL table was dropped it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

    Clean up the demo tables and directory:

    DROP TABLE external1;

    DROP TABLE external2;

    exit;

    This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

    bacon.zip