Category: Career

  • Interview with Julie Strauss – Microsoft BI WIT

    Julie Strauss is a very accomplished and respected Senior PM at Microsoft. Her current role is technical assistant for Microsoft Data Platform Group (DPG) Corporate Vice President Quentin Clark. She has been the public face of Microsoft BI at conferences and helps deliver great technical content and data stories to the public. Julie loves to help others, so she has shared some background on herself and some business advice that could help anyone seeking to improve their success.

    Julie saw a job posting for the support team in Microsoft Norway (at the time Great Plains) looking for an individual willing to learn the ins and outs of the Microsoft BI products. She was excited that the posting indicated a willingness to learn was more important than previous knowledge of the particular Microsoft product. This was how and why Julie came here – she loves the technology and the data-driven parts of the business and finds them fascinating.

    Julie has a notable role with a wide range of responsibilities. The majority of her time is spent working on strategic projects to meet the goals of the team at the DPG Vice President level. Projects can vary in nature and cover everything from exploratory and technical projects to organizational projects. She gets to work with many areas of the business and enjoys interactions across the org. In addition to these internal-facing responsibilities, Julie also manages a set of customer and partner engagements for the business. Overall this role has provided Julie with an amazing learning opportunity. She gets to widen her scope while maintaining her data and BI focus and also use her years of experience from responsibilities ranging through sales, marketing, support, engineering, program management, and people management. She merged these experiences into a role as technical assistant that utilizes some aspects of all those areas. Throughout her career she has chosen new jobs that allowed her to stretch and grow with a significant amount of change. But throughout it all she kept one core thing the same – her focus on BI and data. This mix of old and new in each role helps her cultivate new skills while leveraging what she already knows and expanding her influence. Within Microsoft there are many opportunities, something Julie feels is unique in the corporate world, and we can all find a way to shine and grow here.

    Julie has an extensive network she finds invaluable in navigating all that opportunity. Her network lets her know about new opportunities, and its members also influence decision makers. She emphasizes that your reputation is everything – your network carries that reputation to others. In a strong network everyone is contributing to each other’s success. She has a large network, though at any given point in time she is only actively interacting with a few people.

    In addition to a network of contacts, Julie has closer relationships with a smaller group of people as both a mentor and a mentee. When Julie made the decision to move from marketing to engineering she leveraged her close mentoring relationship with Donald Farmer. Donald knew Julie and her work ethic and was willing to take a chance on Julie’s ability to succeed even though on paper it wasn’t an obvious fit. She stresses the importance of having semi-formal mentoring relationships with people at various levels. She asks various mentors for advice on experiences, projects, and specific interactions. Julie contributes back as a mentor to others – this keeps her coaching skills active. Julie observed that while she doesn’t treat her mentees differently based on their gender they tend to bucket themselves. More often than not women ask how to handle a specific situation or how to become more efficient or appear more confident. On the other hand men are more likely to ask task oriented questions such as how to make a specific change or how to write a better spec. She enjoys helping with both types of questions. Some of her mentees and mentors are people she already knew and some are people she grew to know only after the mentor-mentee relationship started.

    I asked Julie what advice she feels is most important to her success that would be helpful to others in the organization. In addition to networking and mentors, she offered these pearls of wisdom:

    • Be willing to take risks and take on new challenges. She has few regrets because she goes after what she wants. She does wonder if having no regrets at all means she didn’t stretch enough. You have to find your own balance.
    • Be true to who you are – how people see you, your brand, should reflect the real you. For Julie it has been very important to never compromise on being true to herself. Julie’s brand is “Give me a challenge and I will work my butt off to get it done, being creative as needed, bringing in people who will make it work.”
    • Never be a victim. Women are strong.
    • Pick something concrete to improve upon and just do it. For example, Julie was ranked as the lowest presenter at a conference. She decided to become a top 10 presenter – she achieved that goal and grew to truly enjoy presenting along the way.
    • Find work you love. Julie finds data fascinating because it is very tangible and with BI you control how it leads to insights, learnings, and possibilities. She loves how data and BI let you use your own imagination and set your own boundaries.
    • State your needs and get buy-in. For example you might tell your manager that you want a promotion and lay out your plan to get there. Then you ask “Is this realistically going to get me to my goal?” Make sure your manager understands your value and gives you feedback, then follow through on the actions with appropriately timed check-ins on whether you are still on track.

    Over the years Julie has lived in Denmark, Norway, the UK, and the US. She is always looking for new challenges whether it’s how to succeed in a new country or job or taking on a demanding project. Whatever she does she is working hard and getting things done. Follow her advice – build your network, find a mentor or two, be clear on expectations, and always be true to who you are.

    I want to thank Julie for sharing herself and her ideas with us – it can be tough to open up but Julie did a stellar job!

  • HDInsight Big Data Talks from #SQLPASS

    SQL PASS Summit 2013 was another great data geek week! I chatted with many of you about Big Data, Hadoop, HDInsight, architecting solutions, SQL Server, data, BI, analytics, and general geekiness – great fun! This time around I delivered two talks on Hadoop and HDInsight – the slides from both are attached.

    Zero to 60 with HDInsight takes you from an overview of Big Data and why it matters (zero) all the way through an end to end solution (60). We discussed how to create an HDInsight cluster with the Azure portal or PowerShell and talked through the architecture of the data and analysis behind the release of Halo 4. We talked about how you could use the same architectural pattern for many projects and walked through Hive and Pig script examples. We finished up with how to use Power Map (codename GeoFlow) over that data to gain new insights and improve the game experience for the end user.

    The next session I co-presented with HDInsight PM Dipti Sangani: CAT: From Question to Insight with HDInsight and BI. We went deeper this time. Not only did we present an end to end story with how our own internal Windows Azure SQL Database team uses telemetry to improve your experience with SQL Server in Azure PaaS, but we also gave demos of Hive, Pig, and Oozie. We also gave another archetypal design scenario that will apply to many of your own scenarios and talked about how HDInsight fits with SQL Server and your other existing infrastructure. The deck covers your cloud and on-premises options for Hadoop on Windows including HDInsight Service, Hortonworks HDP for Windows, OneBox, and PDW with PolyBase.

    Please let me know if you have any questions from the talks or just general HDInsight questions!

    PASSSummit2013BigData.zip

  • Jo Ann Morris is Igniting Women with Courage

    Ignite: Inspiring Courageous Leaders – A Book of Thought-Provoking Wisdom and a Manual for Action

    Go Lead Idaho sponsored a “meet the author” talk by Jo Ann Morris this week at the Boise WaterCooler. Jo Ann is the author of Ignite: Inspiring Courageous Leaders – A Book of Thought-Provoking Wisdom and a Manual for Action and co-founder of White Men as Full Diversity Partners. She describes herself as a proud radical feminist – I wish more people, men and women, had the courage to say that!

    Jo Ann’s book, Ignite, helps you take your own courageous actions. It has a series of “thought exercises” that each start with a powerful quote. She suggests questions to ask yourself about each quote, and there is room to write down the thoughts and feelings the quote evokes. The exercises make you think and help you get in the habit of looking beneath the surface and digging deep. Then you can use your new insights to take action. Thoughts need to be followed by action to be powerful.

    Jo Ann talked about taking charge in many ways. We are all responsible for ourselves. And we all need to help those around us.

    • Don’t spend time being nice – nice is overrated. This doesn’t mean to be deliberately mean, but don’t prioritize being nice or being polite above getting things done or getting what you need.
    • To be successful we need to take risks.
    • Don’t wait – step up and offer your ideas and actions.
    • Demand what you’re worth.
    • Be comfortable being uncomfortable.
    • Choose courage.
    • Be vulnerable to be courageous.
    Courage!

    Courage encompasses four things; it is manifested when you do one or more of them:

    • See and speak the truth.
    • Champion an unpopular or risky vision.
    • Persevere.
    • Collaborate with AND rely on others. If you don’t rely on those you collaborate with you aren’t truly collaborating or being truly courageous.

    In life we need truth, courage, and risk – they can’t really be separated. Women have the power to change the world. Don’t be “honorary men” – lead the way to a world that has a great combination of “feminine” and “masculine” ways of doing things. Have the courage to be the change!

    Step up now – in your every-day life, in relationships, at work – and take charge of your own life. Be courageous, be uncomfortable, and be vulnerable. Stand up for yourself, help others, and be a proud radical feminist!

  • Go Lead Idaho – Get in the Game

    This past week I attended another great Go Lead Idaho event – A Legacy of Leading. Go Lead Idaho helps women build leadership skills and engage in politics, public advocacy, and public planning. The speakers this week, Marilyn Monroe Fordham and Rose Bowman, are two veterans of being “first”. Sometimes it’s easy to forget just how recently and how severely women’s work and political options were limited.

    Both speakers talked about being strongly discouraged in the 70s and 80s from choosing challenging, non-secretarial degrees in college and from applying for jobs that were at the time typically reserved for men. Marilyn talked about staying in a banking job for years trying to break through the glass ceiling of “no women can be bank officers”. She eventually left to start her own business as promotion after promotion passed her by. They didn’t even hide why they wouldn’t consider her – they flatly stated it was because she was a woman. The powers that be also cited the possibility that she might someday get pregnant as a roadblock to many roles – those were the days when women were expected to quit working as soon as they “showed” their pregnancy. While today few people would come out and say so, and many may think they’re being totally fair when evaluating people, there are countless subtle perceptions and reactions that still keep women from being completely successful.

    This doesn’t mean we give up or sit around complaining – we need to stand up for ourselves. Don’t get discouraged, keep things positive, and stay focused on the goal. How other people perceive you matters – but don’t let it define you. And don’t try to do it alone. Ask for help and give help to others. Step up to help with projects – you will learn a lot, make new contacts, and show people what you can do. Even if you’re volunteering or doing something outside the scope of your core job you’re still showing people your skills and giving them a reason to remember you the next time an opportunity arises. Always be ready to help others, especially women who may be looking for a female-based network. Help others feel confident and build their own circles.

    When you are choosing new projects and opportunities challenge yourself. Don’t compare yourself to others and what they could do with the job, project, or role – think of what you can contribute and be creative about it. Others don’t really know more about how to do it than you do – and what you do know how to do could be exactly what is needed whether it’s typical or not. Stretch yourself and don’t focus at first on the practicalities. Figure out what needs to be done then come up with a plan that combines your needs with the needs of the job or project. Many times the schedules and specifics are much more flexible than they seem at first – ask for what you need.

    Sometimes the best way to solve a problem is to redefine it. Marilyn recounted how she sat on boards with a mix of men and women and there were often 1-2 people who tried to “win” and dominate discussions. However, as she joined boards that were composed of all women she saw a lot more focus on solving the problem and collaborating. Over the years the boards she was on became more efficient as they spent less time “playing golf” and instead focused on getting the work done sooner so they could get back to their responsibilities such as families and full time jobs. This wasn’t because of inherent gender differences but because the women had different goals in mind and focused on them. They stated their needs, got things done, and made the job better.

    When Rose ran for US Senate in Idaho in 1972 she was the first woman to do so. People were less likely to give money to a woman and she was running in the primary against the husbands of friends. She got out, made contacts, networked, but still lost the primary. But she was out there, she showed everyone that a woman could run, and she leveraged the contacts she made into appointments to multiple statewide offices. She made a difference. So what came next – which women have run for US Senate in Idaho since then? None. Women haven’t stepped up. We all have our excuses – we’re too busy, we don’t feel we have the skills, or it just seems like too much work. But really – why hasn’t any woman run again in the last 40 years?

    You don’t have to start out with a national political office – but start somewhere. Do something new, extend your comfort zone, grow your network, and get in the game – any game! Go lead Idaho!

  • Big Data Twitter Demo

    Real-time. Social Sentiment Analysis. Twitter. Cloud. Insights. We have your Big Data buzzwords here!

    Everyone seems to want to incorporate social sentiment into their business analysis. Well we have the demo for you! Use it for a quick demonstration of what can be done and when the excitement goes through the roof, use it to inspire your own design!

    Real-Time Processing – Instant Insights!

    First, use Event Driven Processing (EDP) to show the data on a dashboard. In the demo you’ll use StreamInsight, also referred to as Complex Event Processing (CEP), though you may want to use Microsoft Open Technologies’ Rx (Reactive Extensions) in your own project. Use the dashboard to make real-time decisions and take immediate action. For example, configure your EDP to “pop” only on terms related to your company, and your marketing analyst can watch how the volume of tweets and the sentiment (measured positive/neutral/negative in this example) change in response to your Super Bowl ad or a mention of your company on the news. She can respond instantly with changes to your website, your own tweets, sales/promotions, or whatever is appropriate to your business.

    Data Storage – Enable Insights!

    EDP reads the data as it is pushed through a query; there is no inherent storage involved. You could just discard the data and never store it. In this example we chose to store the data for later trending and historical analysis. We take the tweet id, the date/time the tweet was captured, the sentiment score calculated during the real-time processing, and the keyword that caught our attention and store it in SQL Azure. This data is available to other applications that need to join it with existing data and require fast responses to individual queries. The remaining data including the raw tweet, any geographic data the tweeter chose to share, and other data is dropped in an Azure Blob Store.
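
    A hypothetical SQL Azure table for that hot slice of the data might look like the following minimal sketch (the table and column names are illustrative, not the demo’s actual schema):

    CREATE TABLE dbo.TweetSentiment (
        TweetId        BIGINT        NOT NULL PRIMARY KEY, -- tweet id from Twitter
        CapturedAt     DATETIME2     NOT NULL,             -- when the tweet was captured
        SentimentScore TINYINT       NOT NULL,             -- positive/neutral/negative score
        Keyword        NVARCHAR(100) NOT NULL              -- the term that caught our attention
    );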

    Trends, Patterns, and Historical Insights!

    Now point HDInsight (Hadoop) to the Azure Blob Storage using HDInsight’s ASV extension to HDFS. You can spin up a Hadoop cluster in Azure, pay for as many nodes as you need for as long as you need them, then spin them down to save money. The data remains in the blob store – available for future Hadoop clusters, other applications, archival, or whatever you need it for. Add structure to the JSON data with Hive and now you have rows and columns that can be accessed by BI tools!
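
    As a rough sketch of that last step, you can lay a Hive table over the raw JSON files in the blob store and pull out fields with get_json_object (the container, account, and field names below are placeholders, not the demo’s exact layout):

    CREATE EXTERNAL TABLE tweets_raw (json string)
    LOCATION 'asv://tweets@YOURStorageAccount.blob.core.windows.net/';

    SELECT get_json_object(json, '$.id') AS tweet_id,
           get_json_object(json, '$.text') AS tweet_text
    FROM tweets_raw
    LIMIT 10;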

    Visualization for Powerful Insights!

    Now create a PowerPivot tabular model (self-service BI / client side) or an Analysis Services tabular model (corporate BI) to store the relevant data in a highly compressed, in-memory format for fast access. Add in a few Power View visualizations mashing up data from multiple structured and unstructured sources and you can show your business decision makers easily digestible and understandable data in a format they just get and love! Make some decisions, take some actions, and you’ve just shown how to turn free Twitter data into a valuable resource that can have a direct impact on your company!

    How to Get These Insights

    Follow the instructions on the CodePlex site for the Big Data Twitter Demo project. Set it up, run through the demo, get excited, and go improve your business!

    Demo Created By:

    Vu Le

    Andrew Moll

    Aviad Ezra @aviade_pro

    Brad Sarsfield @Bradoop

    Later Additions By:

    Lara Rubbelke @SQLGal | http://sqlblog.com/blogs/lara_rubbelke/

    Robert Bruckner http://blogs.msdn.com/b/robertbruckner/

    Cindy Gross @SQLCindy | http://blogs.msdn.com/cindygross

  • Access Azure Blob Stores from HDInsight

    Small Bites of Big Data

    Edit Mar 6, 2014: This is no longer necessary for HDInsight – you specify the storage accounts when you create the cluster and the rest happens auto-magically. See http://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.

    One of the great enhancements in Microsoft’s HDInsight distribution of Hadoop is the ability to store and access Hadoop data on an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). This allows you to persist data even after you spin down an HDInsight cluster and to make that data available across multiple programs or clusters from persistent storage. Blob stores can be replicated for redundancy and are highly available. When you need to access the data from Hadoop you simply point your cluster at the existing storage.
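
    Every ASV location in this post follows the same URI pattern, where the container and storage account are whatever you created in the Azure portal:

    asv://<container>@<storageaccount>.blob.core.windows.net/<path>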

    Azure Blob Storage

    Let’s start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: http://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.

    You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.


    HDInsight Service Preview

    When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn’t allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/.

    [Screenshots: the Quick Create and Custom Create dialogs]

    You can then add additional storage accounts to the cluster by updating C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the “Edit container metadata” dialog on the Azure portal.

    [Screenshot: the Edit container metadata dialog]

    The key storage properties in the default core-site.xml on HDInsight Service Preview are:

    <property>
    <name>fs.default.name</name>
    <!-- cluster variant -->
    <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
    <description>The name of the default file system.  Either the
    literal string "local" or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    <property>
    <name>dfs.namenode.rpc-address</name>
    <value>hdfs://namenodehost:9000</value>
    </property>

    <property>
    <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
    <value>YOURActualStorageKeyValue</value>
    </property>

    To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick storage from the left menu. Click on the account you want to use then at the very bottom click on the “MANAGE KEYS” button. Cut and paste the PRIMARY ACCESS KEY (you can use the secondary if you prefer) into the new property values we’ll discuss below.


    Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node via the “Local Resources” tab under the “More” button in the “Local devices and resources” section. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved RDP).


    On the head node make a copy of C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.

    Add your 2nd Azure storage account by adding another property.

    <property>
    <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
    <value>YOUR_SECOND_ActualStorageKeyValue</value>
    </property>

    Save core-site.xml.

    Repeat for each storage account you need to access from this cluster.

    HDInsight Server Preview

    If you have downloaded the on-premises HDInsight Server preview from http://microsoft.com/bigdata, you get a single node “OneBox” install to test basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance as it is limited to a single node. The steps are slightly different for on-premises as the installation directory and default properties in core-site.xml are different.

    Make a backup copy of C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml from your local installation (local could be on a VM).

    Edit core-site.xml:

    1) Change the default file system from local HDFS to remote ASV

    <property>
    <name>fs.default.name</name>
    <!-- cluster variant -->
    <value>hdfs://localhost:8020</value>
    <description>The name of the default file system.  Either the
    literal string "local" or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    to:

    <property>
    <name>fs.default.name</name>
    <!-- cluster variant -->
    <value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
    <description>The name of the default file system.  Either the
    literal string "local" or a host:port for NDFS.</description>
    <final>true</final>
    </property>

    2) Add the namenode property (do not change any values)

    <property>
    <name>dfs.namenode.rpc-address</name>
    <value>hdfs://namenodehost:9000</value>
    </property>

    3) Add the information that associates the key value with your default storage account

    <property>
    <name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
    <value>YOURActualStorageKeyValue</value>
    </property>

    4) Add any additional storage accounts you plan to access

    <property>
    <name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
    <value>YOUR_SECOND_ActualStorageKeyValue</value>
    </property>

    Save core-site.xml.

    Files

    Upload one or more files to your container(s). You can use many methods for loading the data including Hadoop file system commands such as copyFromLocal or put, 3rd party tools like CloudXplorer, JavaScript, or whatever method fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to c:\data, which is local to the head node) using the Hadoop put command:

    hadoop fs -put c:\data asv://data@sqlcatwomanblog.blob.core.windows.net/

    Or upload a single file:

    hadoop fs -put c:\data\bacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt

    To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):

    hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/

    Found 1 items
    -rwxrwxrwx   1        124 2013-04-24 20:12 /bacon.txt

    In this case the container data on the private storage account sqlcatwomanblog has one file called bacon.txt.

    For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.

    hadoop fs -ls asv:///bacon.txt
    hadoop fs -ls /bacon.txt

    More Information

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.

    Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.

  • Self-Service BI Works!

    When I talk to people about adding self-service BI to their company’s environment I generally get a list of reasons why it won’t work. Some things I commonly hear:

    • I can’t get anyone in IT or on the business side to even try it.
    • The business side doesn’t know how to use the technology.
    • This threatens my job.
    • I just don’t know where to start either politically/culturally or with the technology.
    • I have too many other things to do.
    • How can it possibly be secure, allow standardization, or result in quality data and decisions?
    • That’s not the way we do things.
    • I don’t really know what self-service BI means.

    [Photo: Cindy and Eduardo at #PASSBAC 2013]

    So what is a forward thinking BI implementer to do? Well, Intel just went out and did it, blowing through the supposed obstacles. Eduardo Gamez of Intel’s Technology Manufacturing Engineering (TME) group interviewed business folks to find those who were motivated for change, found a great pilot project with committed employees, and drove the process forward. They put a “sandbox” environment up for the business to use and came up with a plan for monitoring the sandbox activity to find models and reports worth adding to their priority queue for enterprise BI projects. The business creates their own data models and their own reports for both high and low priority items. IT provides the infrastructure and training including products like Analysis Services, PowerPivot, Power View, SharePoint, Excel, SQL Server, and various data sources. The self-service models and reports are useful to the business – they reduce manual efforts, give them the reports they want much faster, and ultimately drive better, more agile business decisions. If a model isn’t quite right after the first try, they can quickly modify it. The same models and reports are useful to IT – they are very refined and complete requirements docs that shorten the time to higher quality enterprise models and reports, and they free up IT resources to build a more robust infrastructure and concentrate on projects that require specialized IT knowledge. Everyone wins with a shorter time to decision, higher quality decisions, and a significant impact on the bottom line.

    Learn more about how Intel TME is implementing self-service BI:

    Eduardo (eduardo.m.gamez@intel.com) and I (cgross@microsoft.com or @SQLCindy) are happy to talk to you about Self-Service BI – let us know what you need to know!


    How_Intel__Integrates_Self-Service_BI_with_IT_for_Better_Business_Results_[DAV-208-M].zip

  • HDInsight: Jiving about Hadoop and Hive with CAT

    Tomorrow I will be talking about Hive as part of Pragmatic Works’ Women in Technology (WIT) month of webcasts. I am proud to be part of this lineup with all these stellar WITs! I encourage my fellow WITs to get more involved in your data community and if you don’t already do so start tweeting, blogging, and speaking. I am happy to coach you through your first speaking engagement if you are interested. Get out there and start showing the world what you can do!

    Thursday’s talk is going to be HDInsight: Jiving about Hadoop and Hive with CAT. Let’s break that title down.

    HDInsight is Microsoft’s distribution of Hadoop. As part of the HDInsight project we have checked code back into the core Apache Hadoop source code to make the core code run great on Windows. We are also adding functionality and features such as JavaScript and Azure Storage Vault that make the product more robust and enterprise friendly. This week the HDInsight Service Preview on Azure became available to those with an Azure subscription.

    Hadoop is a scale-out methodology that allows businesses to quickly consume and analyze data in ways they haven’t been able to before. This can lead to faster, better business insights and business actions.

    Hive is a way to impose metadata and structure on the loosely structured (unstructured, multi-structured, semi-structured) data that resides in Hadoop’s HDFS file system. With Hive and the Hive ODBC driver you can make Hadoop data look like any other data source to your familiar BI tools such as Excel. PowerPivot can connect to Hive data, mash that data up with existing data sources such as SQL Azure, SQL Server, and OData, and allow you to visualize it with Power View. I have an end to end demo of this: Hurricane Sandy Mash-Up: Hive, SQL Server, PowerPivot & Power View.
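
    For example, once Hive has imposed a schema, even a simple aggregate (the table and column names here are hypothetical) comes back as ordinary rows and columns that the Hive ODBC driver can hand to Excel or PowerPivot:

    SELECT keyword, COUNT(*) AS tweet_count
    FROM tweets
    GROUP BY keyword;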

    CAT is my team at Microsoft. The Customer Advisory Team (CAT) works with customers who are doing new, unusual, and interesting things that push the boundaries of technology. We share what we learn with the community so you can do your jobs better and we take what we learn from you to the product team to help improve the product.

    My slides are attached at the bottom of this post. I believe a recording of the talk will be posted by Pragmatic Works on their site.

    I look forward to “seeing” you all at my talk tomorrow and would love to see your tweets or hear directly from you afterwards.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the CTP may change rapidly.


    PragmaticWorksHDInsightJivingAboutHadoopAndHiveWithCATMar202013.pptx

  • PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

    Are you excited about the upcoming PASS Business Analytics Conference? You should be! This conference will offer a wide range of sessions about Microsoft’s End to End Business Intelligence (including Self-Service BI), Analytics, Big Data, Architecture, Reporting, Information Delivery, Data Management, and Visualization solutions. Whether you are an implementer, a planner, or a decision maker there is something here for you!

    PASS_BAC_Horizontal_Banner

    What makes this conference different? Why should you put in the effort to attend this conference in particular? We are seeing a paradigm shift focused on shorter time to decision, more data available than ever before, and the need for self-service BI. There are exciting technology solutions being presented to deal with these needs and new architectural skills are needed to implement them properly. Self-Service BI and Big Data are very different in many ways but both respond to the same problem – the need for additional insights and less time spent getting to those insights and the resulting impactful decisions. Self-Service BI via PowerPivot, Power View, Excel, and existing and new data sources including HDInsight/Hadoop (usually via Hive) offers fast time to decision, but you still sometimes need Enterprise BI to add additional value via services such as data curation, data stewardship, collaboration tools, additional security, training, and automation. Add in the powerful new data sources available with Big Data technologies such as HDInsight/Hadoop – which can also reduce time to decision and open up all sorts of new opportunities for insight – and you have many powerful new areas to explore. Not to mention that Dr. Steven Levitt, author of Freakonomics and SuperFreakonomics, is one of the keynote speakers!

    Read more about my thoughts on Self-Service BI and Big Data in this #PASSBAC guest blog published today: PASS BAC PREVIEW SERIES: SQL Professionals and the World of Self-service BI and Big Data

    And sign up for the session I am co-presenting at #PASSBAC with Eduardo Gamez of Intel: How Intel Integrates Self-Service BI with IT for Better Business Results

    Take a look at all the information tagged with #PASSBAC and tweeted by @PASSBAC – there are some good blogs, preview sessions, and tidbits being posted. Get your own Twibbon for Twitter, Facebook, or however you want to use it; the Twibbon site will add a ribbon to the picture of your choice.


    If you’re going to be in Chicago anyway, you might as well stay a few extra days for two nearby SQL Saturdays. The weekend before the conference take a short hop over to Madison, WI for #SQLSAT206 on April 6, 2013 at the Madison Area Tech College. Then head over to the bacon, uhhh, PASS BA CONference April 10-12. Stay one more day in Chicago (technically Addison, IL) for the #SQLSAT211 sessions at DeVry. This is a great opportunity for even more SQL Server immersion and networking!

    See you at #PASSBAC in Chicago in April!

    @SQLCindy

    Small Bites of Big Data

  • HDInsight: Hive Internal and External Tables Intro

    Small Bites of Big Data

    Cindy Gross, SQLCAT PM

    HDInsight is Microsoft’s distribution, in partnership with Hortonworks, of Hadoop. Hive is the component of the Hadoop ecosystem that imposes structure on Hadoop data in a way that makes it usable from BI tools that expect rows and columns with defined data types. Hive tables can be created as EXTERNAL or INTERNAL. This is a choice that affects how data is loaded, controlled, and managed.

    Use EXTERNAL tables when:

    • The data is also used outside of Hive. For example, the data files are read and processed by an existing program that doesn’t lock the files.
    • Data needs to remain in the underlying location even after a DROP TABLE. This can apply if you are pointing multiple schemas (tables or views) at a single data set or if you are iterating through various possible schemas.
    • You want to use a custom location such as ASV (see the sketch after these lists).
    • Hive should not own data and control settings, dirs, etc., you have another program or process that will do those things.
    • You are not creating a table based on an existing table (AS SELECT).

    Use INTERNAL tables when:

    • The data is temporary.
    • You want Hive to completely manage the lifecycle of the table and data.
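
    As a minimal sketch of the custom ASV location case, an EXTERNAL table can point directly at blob storage (the container and account names below are placeholders, and the cluster must already know the storage account key as covered in the “Access Azure Blob Stores from HDInsight” post):

    CREATE EXTERNAL TABLE food_asv (col1 string)
    LOCATION 'asv://data@YOURStorageAccount.blob.core.windows.net/demo/food';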

    We’ll walk through creating basic tables with a few rows of data so you can see some of the differences between EXTERNAL and INTERNAL tables. The demo data files are attached at the bottom of the blog. Alternatively you can simply open notepad and create your own files with a series of single column rows. If you create your own files make sure you have a carriage return/line feed at the end of all rows including the last one. The files should be in a Windows directory called c:\data on the HDInsight Head Node. For HDInsight Server (on-premises) that’s the machine where you ran setup. For HDInsight Services (Azure) you can create a Remote Desktop connection (RDP) to the head node from the Hadoop portal.

    Note: Your client tool editor or the website may change the dashes or other characters in the following commands to “smart” characters. If you get syntax errors from a direct cut/paste, try pasting into notepad first or deleting then retyping the dash (or other special characters).

    Create an HDInsight cluster. You can do this on your own Windows machine by installing HDInsight Server or by signing up for HDInsight Services on Azure. For the CTP of HDInsight Services as of February 2013 you fill out a form to request access and receive access within a few days. Soon the service will be available from the Azure portal via your Azure subscription. Since the portal interface will be changing soon and all the commands are straightforward I will show you how to do all the steps through the Hive CLI (command line interface).

    Open a Hadoop Command Prompt.

    Change to the Hive directory (necessary in early preview builds of Hive):

    cd %hive_home%\bin

    Load some data (hadoop file system put) and then verify it loaded (hadoop file system list recursively):

    hadoop fs -put c:\data\bacon.txt /user/demo/food/bacon.txt

    hadoop fs -lsr /user/demo/food

    The put command doesn’t return a result, the list command returns one row per file or subdirectory/file:

    -rw-r--r--   1 cgross supergroup        124 2013-02-05 22:41 /user/demo/food/bacon.txt

    Enter the Hive CLI (command line interface):

    hive

    Tell Hive to show the column names above the results (all Hive commands require a semi-colon as a terminator; no result is returned from this set command):

    Set hive.cli.print.header=true;

    Create an INTERNAL table in Hive and point it to the directory with the bacon.txt file:

    CREATE INTERNAL TABLE internal1 (col1 string) LOCATION '/user/demo/food';

    Oops… that failed because INTERNAL isn’t a keyword; the absence of EXTERNAL is what makes a table managed, or internal.

    FAILED: Parse Error: line 1:7 Failed to recognize predicate 'INTERNAL'.

    So let’s create it without the invalid INTERNAL keyword. Normally we would let an INTERNAL table default to the default location of /hive/warehouse but it is possible to specify a particular directory:

    CREATE TABLE internal1 (col1 string) LOCATION '/user/demo/food';

    That will return the time taken but no other result. Now let’s look at the schema that was created; note that the table type is MANAGED_TABLE.

    DESCRIBE FORMATTED internal1;

    col_name        data_type       comment
    # col_name              data_type               comment

    col1                    string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 22:45:57 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             MANAGED_TABLE
    Table Parameters:
    transient_lastDdlTime   1360133157

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    And now look at some rows:

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon

    What happens if we don’t specify a directory for an INTERNAL table?

    CREATE TABLE internaldefault (col1 string);

    It is created in Hive’s default warehouse location, which is /hive/warehouse unless configured otherwise (dfs shells back out to hadoop fs):

    dfs -lsr /hive/warehouse;

    We can see that Hive has created a subdirectory with the same name as the table. If we were to load data into the table Hive would put it in this directory:
    drwxr-xr-x   – cgross supergroup          0 2013-02-05 22:52 /hive/warehouse/internaldefault

    However, we won’t use this table for the rest of the demo so let’s drop it to avoid confusion. The drop also removes the subdirectory.

    DROP TABLE internaldefault;

    dfs -lsr /hive/warehouse;

    Once we dropped the internaldefault table the directory that Hive created was automatically cleaned up. Now let’s add a 2nd file to the first internal table and check that it exists:

    dfs -put c:\data\bacon2.txt /user/demo/food/bacon2.txt;

    dfs -lsr /user/demo/food;

    -rw-r--r--   1 cgross supergroup        124 2013-02-05 23:04 /user/demo/food/bacon.txt
    -rw-r--r--   1 cgross supergroup         31 2013-02-05 23:03 /user/demo/food/bacon2.txt

    Since the CREATE TABLE statement points to a directory rather than a single file any new files added to the directory are immediately visible (remember that the column name col1 is only showing up because we enabled showing headers in the output – there is no row value of col1 in the data as headers are not generally included in Hadoop data):

    SELECT * FROM internal1;

    col1
    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    Now let’s create an EXTERNAL table that points to the same directory and look at the schema:

    CREATE EXTERNAL TABLE external1 (colE1 string) LOCATION '/user/demo/food';

    DESCRIBE FORMATTED external1;

    col_name        data_type       comment
    # col_name              data_type               comment

    cole1                   string                  None

    # Detailed Table Information
    Database:               default
    Owner:                  cgross
    CreateTime:             Tue Feb 05 23:07:12 PST 2013
    LastAccessTime:         UNKNOWN
    Protect Mode:           None
    Retention:              0
    Location:               hdfs://localhost:8020/user/demo/food
    Table Type:             EXTERNAL_TABLE

    Table Parameters:
    EXTERNAL                TRUE
    transient_lastDdlTime   1360134432

    # Storage Information
    SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    InputFormat:            org.apache.hadoop.mapred.TextInputFormat
    OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
    Compressed:             No
    Num Buckets:            -1
    Bucket Columns:         []
    Sort Columns:           []
    Storage Desc Params:
    serialization.format    1

    This time the table type is EXTERNAL_TABLE. You can see that the location was expanded to include the default settings which in this case are the localhost machine using the default HDFS (as opposed to ASV or Azure Storage Vault).

    Now look at the data:

    SELECT * FROM external1;

    The result set is a combination of the two bacon files:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn

    That table returns the same data as the first table – we have two tables pointing at the same data set! We can add another one if we want:

    CREATE EXTERNAL TABLE external2 (colE2 string) LOCATION '/user/demo/food';

    DESCRIBE FORMATTED external2;

    SELECT * FROM external2;

    You may create multiple tables for the same data set if you are experimenting with various structures/schemas.

    Add another data file to the same directory and see how it’s visible to all the tables that point to that directory:

    dfs -put c:\data\veggies.txt /user/demo/food/veggies.txt;

    SELECT * FROM internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    Each table will return the same results:

    HDInsight_Bacon
    SQL_Bacon
    PASS_bacon
    Summit_BACON
    Database_Bacon
    NoSQL_Bacon
    BigData_Bacon
    Hadoop_Bacon
    Hive_Bacon
    More_BaCoN
    AndEvenMore_bAcOn
    SQL_Apple
    NoSQL_Pear
    SQLFamily_Kiwi
    Summit_Mango
    HDInsight_Watermelon
    SQLSat_Strawberries
    Raspberrylimelemonorangecherryblueberry 123 456

    Now drop the INTERNAL table and then look at the data from the EXTERNAL tables which now return only the column name:

    DROP TABLE internal1;

    SELECT * FROM external1;

    SELECT * FROM external2;

    dfs -lsr /user/demo/food;

    Result: lsr: Cannot access /user/demo/food: No such file or directory.

    Because the INTERNAL (managed) table is under Hive’s control, dropping it removed the underlying data. The other tables that point to that same data now return no rows even though they still exist!

    Clean up the demo tables and directory:

    DROP TABLE external1;

    DROP TABLE external2;

    exit;

    This should give you a very introductory level understanding of some of the key differences between INTERNAL and EXTERNAL Hive tables. If you want full control of the data loading and management process, use the EXTERNAL keyword when you create the table.

    I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon on the samples and other activities.

    Note: the CTP and TAP programs are available for a limited time. Details of the usage and the availability of the TAP and CTP builds may change rapidly.

    bacon.zip