Befriending Dragons

Transform Tech with Anti-bullying Cultures



Load SQL Server BCP Data to Hive


Small Bites of Big Data

Cindy Gross, SQLCAT PM

As you start learning more about Hadoop you may want to take a look at how the same data and queries work for SQL Server and for Hadoop. There are various ways to do this. For now I’ll show you something that utilizes some of your existing SQL Server skills (BCP) and some new Hadoop skills (basic Hadoop FS and Hive commands). There are other methods such as FTP, cURL, and Sqoop that I won’t cover right now. Also, if you want to load data from the Azure DataMarket you can follow these instructions.

Today we’ll walk through making a copy of the FactProductInventory table from AdventureWorksDW2012 on a SQL Server 2012 instance. The TSQL below generates a BCP output command for a single table in the current SQL Server database context; the generated command writes tab-delimited data to c:\temp. Comment out the @tablename references in the SELECT and WHERE clauses to generate the script for all tables in the database.

USE [AdventureWorksDW2012];

DECLARE @servername sysname, @dbname sysname, @tablename sysname, @outputdir sysname

SELECT @servername = @@SERVERNAME
      ,@dbname = DB_NAME()
      ,@outputdir = 'c:\temp\'
      ,@tablename = 'FactProductInventory'

SELECT 'bcp ' + OBJECT_SCHEMA_NAME(object_id) + '.' + name + ' out '
      + @outputdir + OBJECT_SCHEMA_NAME(object_id) + '_' + name + '.dat -b 10000 -d '
      + @dbname + ' -T -c -S ' + @servername
FROM sys.objects
WHERE type_desc = 'USER_TABLE'
AND name = @tablename
;

In this case the BCP code generated is (no line break):

bcp dbo.FactProductInventory out c:\temp\dbo_FactProductInventory.dat -b 10000 -d AdventureWorksDW2012 -T -c -S CGROSSBOISE\SQL2012

Paste the BCP command into a Command Prompt and run it.
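If you want a quick sanity check on the export, compare the "rows copied" value that BCP prints with a row count from SSMS; something like this works:

-- Optional check: this count should match the "rows copied" value reported by BCP
USE [AdventureWorksDW2012];
SELECT COUNT(*) AS [Rows] FROM dbo.FactProductInventory;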

If you have not yet created an Apache™ Hadoop™-based Services for Windows Azure cluster, follow these steps to do so (this is a CTP, so the exact steps/screens will change over time).

From your Hadoop cluster portal, click the "Remote Desktop" button and choose "Save As" to save the RDP file that is generated.

Right-click the RDP file you saved and choose "Edit". On the "Local Resources" tab, click "More…" under "Local devices and resources". Add a check mark to "Drives", then click "OK". Go back to the "General" tab and click "Save". Now choose "Connect" to open a remote desktop connection to your Hadoop head node.

Open the “Hadoop Command Shell”. Copy/paste or type these commands (beware of some editors changing dashes or other characters to non-executable values) to create a directory and copy the data file to your head node. The /y on the copy will overwrite the file if it already exists.

md c:\data

copy \\tsclient\C\temp\dbo_FactProductInventory.dat c:\data /y

dir c:\data

Now from the same prompt load the data into Hadoop HDFS. The fs indicates you are running a filesystem command from a Hadoop script. Generally the same commands are available from the "Interactive JavaScript" console in the portal by replacing "hadoop fs -" with "#". For example, "hadoop fs -lsr /" from a Hadoop Command Prompt and "#lsr /" from the JavaScript console both return a recursive list of all directories and files starting at the root (/). Try some variations such as "#ls" (non-recursive contents of default directory) and "#lsr /user" (recursive list of the user directory).

hadoop fs -put c:\data\dbo_FactProductInventory.dat /demo/demo.dat

hadoop fs -lsr /

Now launch the command line version of Hive (you can alternatively use the Interactive Hive console in the portal, but I’m showing you the automatable version) and add metadata to the HDFS data. Note that the CREATE EXTERNAL TABLE statement wraps in the window; there is no line break. Because I chose to use the EXTERNAL keyword, the data stays in its original HDFS location and will not be deleted when I drop the Hive table. Because Hive uses different keywords, the data type names are not exactly the same as they were in SQL Server. Basically I generated the script for this table from SSMS then made a few changes (see the sketch of the original SQL Server definition after the Hive commands below). I removed the constraints and indexes then changed date and money to string. I also removed the brackets and the "dbo." schema qualifier. If you don’t copy the empty line under the last command you will have to hit enter for the last command to complete.

hive

CREATE EXTERNAL TABLE FactProductInventory(ProductKey int,DateKey int,MovementDate string,UnitCost string,UnitsIn int,UnitsOut int,UnitsBalance int) COMMENT 'Hive Demo for #24HOP' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/demo';

quit;
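For reference, the SQL Server table definition that the Hive statement above was adapted from looks roughly like this (an abbreviated sketch with the constraints, indexes, and brackets already stripped out; it is not the full SSMS-generated script):

-- Abbreviated SQL Server shape of the table; date and money become string in Hive
CREATE TABLE dbo.FactProductInventory (
    ProductKey    int   NOT NULL,
    DateKey       int   NOT NULL,
    MovementDate  date  NOT NULL,
    UnitCost      money NOT NULL,
    UnitsIn       int   NOT NULL,
    UnitsOut      int   NOT NULL,
    UnitsBalance  int   NOT NULL
);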

 Now let’s run some queries. You can either start Hive again from the command line or run it from the Interactive Hive portal in the GUI.

select * from FactProductInventory where ProductKey = 230 and DateKey = '20080220';

 The output on my single data node Hadoop cluster looks like this (the line starting with 230 is the actual result set):

Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks is set to 0 since there’s no reduce operator

Starting Job = job_201209281938_0013, Tracking URL = http://10.119.36.106:50030/jobdetails.jsp?jobid=job_201209281938_0013

Kill Command = c:\Apps\dist\bin\hadoop.cmd job  -Dmapred.job.tracker=10.119.36.106:9010 -kill job_201209281938_0013

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2012-09-28 23:27:45,271 Stage-1 map = 0%,  reduce = 0%

2012-09-28 23:27:58,301 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:27:59,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:00,316 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:01,332 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:02,347 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:03,363 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:04,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:05,394 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:06,410 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.187 sec

2012-09-28 23:28:07,425 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.187 sec

MapReduce Total cumulative CPU time: 3 seconds 187 msec

Ended Job = job_201209281938_0013

MapReduce Jobs Launched:

Job 0: Map: 1   Accumulative CPU: 3.187 sec   HDFS Read: 31077011 HDFS Write: 38 SUCESS

Total MapReduce CPU Time Spent: 3 seconds 187 msec

OK

230     20080220        2008-02-20      20.3900 0       0       4

Time taken: 48.465 seconds

Note that if you run this same query in SSMS you will see an instant response, but here in Hadoop it took 48 seconds of total time. No matter how many times you run this query you will see approximately the same execution time because the result set is not cached. This shows that we have used Hadoop in an anti-pattern way – we went after a single row of data. Filtered row sets are a strength of well-indexed relational systems, while querying entire, very large, unindexed data sets is a strength of Hadoop. Hive generates MapReduce code, and that MapReduce code goes through the same steps each time to find the data, distribute job tasks across the data nodes (map), and then bring the result sets back (reduce). The cumulative CPU time once it actually executes the map phase is still over 3 seconds. I chose this example both to illustrate that point and because it gives you data you are familiar with to ease you into the Hadoop and Hive worlds.
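For contrast, a query that plays to Hive's strengths scans and aggregates the whole table instead of fetching one row. Something like this (a sketch of the pattern rather than a query from the demo) puts the full scan to better use:

-- Full-table scan and aggregate: the kind of work MapReduce parallelizes well
SELECT ProductKey,
       SUM(UnitsIn)      AS TotalUnitsIn,
       SUM(UnitsOut)     AS TotalUnitsOut,
       MAX(UnitsBalance) AS MaxBalance
FROM FactProductInventory
GROUP BY ProductKey;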

If you want to remove the metadata from Hive and the data you’ve just loaded from Hadoop HDFS execute these steps from a Hadoop Command Shell:

hive

drop table FactProductInventory;

quit;

hadoop fs -rmr /demo/demo.dat

 Now you know one way to copy data from SQL Server to Hadoop. Keep exploring Hadoop and keep buzzing about Hive!

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.



What’s all the Buzz about Hadoop and Hive?


Why it Matters for SQL Server Peeps

Small Bites of Big Data

Cindy Gross, SQLCAT PM

On September 20, 2012 we have another 24 Hours of PASS event! This PASS Summit Preview will give you a taste of what is coming at this year’s PASS Summit. There are 190+ technical sessions this year at the Summit, and you’ll get a preview of 24 of them at the #24HOP event tomorrow! Come hear about some of the hottest topics and features in the SQL Server, BI, and data world.

One of the big buzzwords over the last year or so is Hadoop, and the most familiar part of Big Data and Hadoop to most SQL Server professionals is Hive. Do you wonder what it is and why you should jump in now while it’s still new and growing by leaps and bounds? I have just the #24HOP session for you!

#24HOP: What’s all the Buzz about Hadoop and Hive? – Why it Matters for SQL Server Peeps

Everyone is buzzing about Hive and trumpeting the virtues of Hadoop. But what does it mean? Why does it matter to a SQL Server and/or BI professional? Come get a taste of the Hive honey and see why this new technology is worth buzzing about!

During this talk I’ll give a very high level overview of Big Data, Hadoop, and Hive (for the nitty gritty details come to the Summit!). I’ll also go through why Hive matters in the SQL Server world, what a SQL Server Peep might end up doing in a Hive world, and why it is important for you as a SQL Server Peep to jump in and get your feet wet with Hive now.

Once you’ve heard this #24HOP talk I hope you’ll be fired up about Hive and more anxious than ever to sign up for the  PASS Summit to learn even more about Hadoop, Hive, Big Data, and all things BI and SQL Server. I’ll be co-presenting at the Summit with SQL Server PM Dipti Sangani:

SQLCAT: Big Data – All Abuzz About Hive [BIA-305-A]
Session Category: Regular Session (75 minutes)
Session Track: BI Platform Architecture, Development & Administration
Speaker(s): Cindy Gross, Dipti Sangani

Got a bee in your bonnet about simplifying access to Hadoop data? Want to cross-pollinate your existing SQL skills into the world of Big Data? Join this session to see how to become the Queen Bee of your Hadoop world with Hive and gain Business Intelligence insights with HiveQL filters and joins of HDFS datasets. We’ll navigate through the honeycomb to see how HiveQL generates MapReduce code and outputs files to answer your questions about your Big Data.

After this session, you’ll be able to democratize access to Big Data using familiar tools such as Excel and a SQL-like language without having to write MapReduce jobs. You’ll also understand Hive basics, uses, strengths, and limitations and be able to determine if/when to use Hive in combination with Hadoop.

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.

UPDATE 9/28/12 – demo steps to load the AdventureWorks data to Hive are available at http://blogs.msdn.com/b/cindygross/archive/2012/05/07/load-data-from-the-azure-datamarket-to-hadoop-on-azure-small-bites-of-big-data.aspx.

Attachment: 24HOPFall2012HiveBuzz.zip



SQL Server 2012 RC0 available

Release Candidate 0 (RC0) of SQL Server 2012 is available for download! That means you can start to experience a feature complete version of the next great SQL release. 🙂 We’re still looking at the first half of calendar year 2012 for the RTM release of SQL Server 2012.

FURTHER.  FORWARD.  FASTER.

Downloads:

Greater availability. Deliver the required 9s and data protection with AlwaysOn, which adds functionality over CTP3: customers can experience multiple readable secondaries for distributed scale of reporting and backup workloads, plus support for FileTable and FILESTREAM, which brings first-class HA to complex data types.

Blazing-fast performance. More effectively govern performance in multi-tenancy environments like private cloud. Resource Governor enhancements include support for 64 resource pools, greater CPU usage control, and resource pool affinity for partitioning of physical resources and predictable resource allocation (a small T-SQL sketch follows this list).

Rapid data exploration. Empower end users with new insights through rapid data exploration and visualization. Discover new insights at the speed of thought with more feature/functionality in Power View, the highly interactive, familiar browser-based data exploration, visualization, and presentation experience for end users.

Credible, consistent data. In addition to CTP3 functionality delivered for Data Quality Services and Master Data Services, customers can better support heterogeneous data within Data Warehouses through new Change Data Capture (CDC) support for Oracle.

Optimized productivity. Optimize IT and developer productivity across server and cloud with the new SQL Server Data Tools, a unified development experience for database and business intelligence projects, and cloud-ready capabilities within SQL Server Management Studio for snapshot backups to the Windows Azure Platform. Additionally, SQL Server 2012 offers a new version of Express – LocalDB. SQL Express LocalDB is a lightweight version of Express with all its programmability features, yet runs in user mode with a fast, zero-configuration installation and short list of pre-requisites. For more information and to try it out, go here.
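To make the Resource Governor item above more concrete, the new 2012 options can be exercised with T-SQL roughly like this (a hedged sketch; DemoPool is a made-up pool name, not part of any official sample):

-- Sketch of the SQL Server 2012 Resource Governor options
CREATE RESOURCE POOL DemoPool              -- DemoPool is a hypothetical name
WITH (
    MIN_CPU_PERCENT = 0,
    MAX_CPU_PERCENT = 50,
    CAP_CPU_PERCENT = 60,                  -- new in 2012: hard cap on CPU for the pool
    AFFINITY SCHEDULER = (0 TO 3)          -- new in 2012: bind the pool to schedulers 0-3
);
ALTER RESOURCE GOVERNOR RECONFIGURE;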

 Licensing:

 SQL Server 2012 Wallpaper/backgrounds:

 More about SQL Server 2012:

If you don’t already use it, now might be a good time to run the Microsoft Assessment and Planning (MAP) Toolkit. If you load the sample MAP database and go to File > Prepare New Reports and Proposals, you can see what typical SQL inventory reports look like. The output includes version, service pack level, edition, whether it’s clustered, # of procs and cores, and system memory. You would need to have a SQL database to store your own inventory results.

 Microsoft Assessment and Planning (MAP) Toolkit http://technet.microsoft.com/en-us/solutionaccelerators/dd537566.aspx

 Data Sheet:

MAP provides you with actionable information from the start. In a matter of hours, MAP helps provide you with answers to key questions like:

  • Which of my PCs can run Windows 7, Microsoft® Office 2010, and Office 365?
  • Are my web application portfolios and databases ready to migrate to the Windows Azure platform? What is the TCO-ROI?
  • What next steps should I take to plan a private cloud?
  • Which of my servers are capable of migrating to Windows Server 2008 R2, or can be virtualized using Hyper-V?
  • Where can I find Microsoft® SQL Server® 2000, MySQL, Oracle, and Sybase instances for migration to Microsoft® SQL Server® 2008 R2?
  • Is our software usage in compliance with software license agreements?