We’re good people. As good people we don’t want to think we do things that have negative consequences for others. But sometimes our subconscious can fool us. What we intend isn’t always what happens. We think we’re making a totally rational decision based on our conscious values – but subtle, unconscious bias creeps in. Yes, even for good people. For 20+ years folks at Harvard have been using something called the Implicit Association Test (IAT) to help us identify our biases.
The human brain works with patterns and categories. It’s how we make it through the day. We are bombarded with hundreds of thousands of data points every day – we can’t possibly think through each one every time. We unconsciously assign data points, including our perception of people, into buckets. Those buckets have values and characteristics assigned to them that may or may not reflect the individual person we put in that bucket.
This automatic assignment is called intuitive thinking or system 1 thinking. It’s easy and takes little effort. It serves us well and lets us take on many tasks every day. However, it also sometimes leads us down the path of thinking we’ve chosen the “best” person when we’ve really hired someone who meets some set of assumptions.
Sometimes we use slow thinking, or system 2 thinking. It’s rarely a conscious decision; something just makes us take some extra time, and we usually don’t even realize it. That’s when we stop to question what we’re doing – maybe we adjust which categories we put someone in or we adjust the category or the values and judgments associated with it. We’re good people but system 2 thinking is tiring and we just can’t do it all the time.
Why does diversity matter at work? Personally, when we’re on a diverse team we tend to have higher personal and job satisfaction. Diverse teams are interesting and we often learn more. People who don’t feel like they’re the “only one” of something (gender, sexual orientation, race, introvert/extravert, etc.) relax, contribute more, and are more productive. And study after study shows that more diverse teams lead to better products and a better bottom line.
Companies with women on their boards have higher ROIs, more diverse companies tend to perform above average, and let’s face it – we don’t have enough STEM graduates to fill needed jobs if we don’t encourage a more diverse group of people to enter the field.
But we’re good people and we don’t make these snap judgments. We are rational and we always know why we made a decision. Or do we?
Optical illusions fool us all the time. Even knowing those lines are all the same length, did you have to measure them just to be sure? The same thing happens in our interactions with people. What’s the first thing that comes to mind for single parent, introvert, doctor, CEO, or programmer? That first thing hints at your categories – the categories built up by a lifetime of media saturation filled with type-cast actors.
Back to the science of bias. Let’s think about resumes. In one study, resumes were handed out to academics who were asked to rate the job candidates for competency, hireability, mentoring, and suggested salary. Some resumes were for John and some for Jennifer. Professors of all genders rated Jennifer 25% less competent and less likely to be hired. They rated John as worth about $4,000 more. When asked why they gave those ratings, their justifications sounded rational, but… four industry publications was awesome for John while four was just not enough for Jennifer. They are good people, but they (we!) are at the mercy of their subconscious and years of societal conditioning.
We’re good people so what do we do?
Take the IATs – there are many, take at least a couple and understand your unexpected biases. Talk about this with others so we all become comfortable talking about our subtle biases. Work to consciously update your mental categories – seek out images and reminders of people who are different and successful. Now that you know your own categories a bit better, be more mindful about switching to system 2 thinking. Reach out to one person and mentor them. Spend time with someone who makes you uncomfortable. Pay attention to the “firsts” (the first autistic character on Sesame Street, the first black President, the first whatever) and see if that helps you update your mental categories.
Increase the pipeline. Participate in groups that help kids learn to code. Recruit beyond your normal network, post jobs on diversity sites, and consider non-traditional backgrounds. Join diverse groups that don’t match your own diversity.
Be careful with words. Is someone bossy or exhibiting leadership? Is someone aggressive or a go-getter? Are they emotional or passionate? You may be surprised how you assign different words for the same behavior in unexpected ways.
When you post a job, only list something as “required” if it truly is. Women, for example, tend to apply only if they meet almost all the requirements; men tend to apply if they meet a few. Do you really require Java experience or do you need a good coder who is willing to learn new things? Don’t ask for a specific type of leader; look for someone who can lead in any of many productive ways. Explicitly state that you value a diverse team. And beware of subtle stereotypes – words like “best”, “rock star”, and “action-oriented” paint a particular picture but may not represent what you’re really looking for.
When reviewing resumes, have HR take off names, cities, and years. Before you pick up a resume decide on your priorities – does experience or willingness to learn matter more for example? Look for people who fill gaps rather than trying to replicate people you already have. And remember, system 2 thinking is tiring so do this when you’re alert and can take the time to think about what you’re doing.
For the interviews, have a diverse group participate. Simply looking at pictures of or talking about diverse people before starting interviews increases the chance you hire with diversity in mind. Don’t confuse either confidence or “geek cred” with competence. Keep an open mind about different ways of approaching problems – it’s the result that matters.
SQL Server is a great technology – I’ve been using it since 1993 when the user interface consisted of a query window with the options to save and execute and not much else. With every release there’s something new and exciting and there’s always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.
So what is a SQL geek to do in the face of all the new options out there – many under the umbrella of Big Data (distributed processing)? Why just jump right on in and learn it! No one can know all the pieces because it’s a big, fluid, messy collection of “things”. But don’t worry about that, start with one thing and build from there. Even if you never plan to implement a production Big Data system you need to learn about it – because if you don’t have some hands-on experience with it then someone who does have that experience will be influencing the decision makers without you. For a SQL Pro I suggest Hive as that easy entry point. At some point maybe Spark SQL will jump into that gap, but for now Hive is the easiest entry point for most SQL pros.
For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below, the file is attached.
Look, it’s SQL!
SELECT score, fun
FROM toDo  -- table name assumed from the CREATE TABLE example below
WHERE type = 'they pay me for this?';
Here’s how that code looks from Visual Studio along with the links to how you find the output and logs:
And yet it’s more!
CREATE EXTERNAL TABLE IF NOT EXISTS toDo (
rank INT COMMENT 'rank the greatness',
score INT,   -- score, fun, and type columns reconstructed from the queries below
fun STRING,
type STRING)
COMMENT 'two tables walk into a bar....'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
A mix of old and new
-- read some data
SELECT 'you cannot make me ', score, fun, type
FROM toDo
WHERE score <= 0
ORDER BY score;

SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
DISTRIBUTE BY score SORT BY score;
That’s Hive folks!
Big Data in the cloud, on Azure!
Data exploration, analytics and reporting, new data-driven actionable insights
Data driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival
Replace system whose pain points don’t align with Hadoop’s strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system
Azure has so much more
Go straight to the business code
Scale storage and compute separately
Managed and unmanaged services
On-demand and 24×7 options
It’s a Polyglot
Stream your data into a lake
Pick the best compute for each task
On June 1 I will be moving into a new role on AzureCAT. I tried the small business consulting world with Neal Analytics and it just wasn’t a good fit for me and my passions. So here I go, on to new challenges at Microsoft! I’ll be making the world a better place with the help of Big Data.
And while I’m making changes, I’ll also be moving from Boise, ID to the Redmond, WA area. It’s new adventures all around for me. I’ll miss Boise – my friends, my political battles, the greenbelt and hiking trails, sitting on the patios downtown. And I’m also excited about all the new opportunities I’ll have in my new, blue state.
One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?
One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.
Too small is anything smaller than the HDFS block size (chunk size); realistically, a file is small unless it’s at least several times larger than the chunk size. A very, very rough rule of thumb is that files should be at least 1GB each and there should be no more than maybe around 10,000-ish files per table. These numbers, especially the maximum total number of files per table, vary depending on many factors. However, they give you a reference point. The 1GB is based on multiples of the chunk size while the second is honestly a bit of a guess based on a typical small cluster.
Why Is It Important?
One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and where the internal chunks of the individual files are. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affects how memory is used on each node.
Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.
With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track the logical file plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.
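That arithmetic can be sketched in a few lines (assuming the 256MB chunk size from the example and ignoring replication, as above):

```python
def chunk_count(file_size_mb, chunk_size_mb=256):
    """Number of chunks HDFS splits a file into (ceiling division)."""
    return -(-file_size_mb // chunk_size_mb)

def namenode_entries(file_sizes_mb, chunk_size_mb=256):
    """Rough count of objects the name node tracks:
    one logical file entry plus one entry per physical chunk."""
    return sum(1 + chunk_count(s, chunk_size_mb) for s in file_sizes_mb)

# One 1GB (1024MB) file: 1 logical file + 4 chunks = 5 entries
print(namenode_entries([1024]))      # 5
# 1000 files of 1MB each: 1000 logical files + 1000 chunks = 2000 entries
print(namenode_entries([1] * 1000))  # 2000
```

The same ~1GB of data costs the name node 400x the bookkeeping when it arrives as 1000 tiny files, which is exactly why the gap widens so quickly at TB and PB scale.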
The other problem comes at the time that the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data it results in more work and slower execution time for each task within each job. Sometimes you will see hard errors when operating system limits are hit related to the number of open files. There is also more internal work involved in reading the larger number of files and combining the data.
There are several options for stitching files together.
Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
INSERT into new Hive tables (directories) which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
Use a combiner in Pig to load the many small files into bigger splits.
The Small Files Problem http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ “Reading through files in a HAR is no more efficient than reading through files in HDFS, and in fact may be slower since each HAR file access requires two index file reads as well as the data file read (see diagram). And although HAR files can be used as input to MapReduce, there is no special magic that allows maps to operate over all the files in the HAR co-resident on a HDFS block.”
The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.
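As a rough illustration of the first option (combining files as they land), here is a sketch that stitches small local text files into larger ones before upload. The directory layout, file names, and target size are all hypothetical; real landing code would likely stream straight to WASB/HDFS instead:

```python
import glob
import os

def stitch_files(src_dir, dest_dir, target_bytes=1024**3):
    """Concatenate many small text files into fewer files of roughly
    target_bytes each, before handing them to Hadoop."""
    os.makedirs(dest_dir, exist_ok=True)
    out, out_size, out_index = None, 0, 0
    for path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        # Start a new output file once the current one is big enough.
        if out is None or out_size >= target_bytes:
            if out:
                out.close()
            out_index += 1
            out = open(os.path.join(dest_dir, f"stitched_{out_index}.txt"), "wb")
            out_size = 0
        with open(path, "rb") as f:
            data = f.read()
            out.write(data)
            out_size += len(data)
    if out:
        out.close()
```

With a 1GB target this turns thousands of IoT-sized files into a handful of chunk-friendly ones, at the cost of losing the one-file-per-source granularity.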
You can use the new Azure portal to create or edit Azure Data Factory components. Once you are done you may automate the process of creating future Data Factory components from PowerShell. In that case you can use the JSON files you edited in the portal GUI as configuration files for the PowerShell cmdlets. For example, you may try to create a new linked service using settings from C:CoolerHDInsight.JSON as specified below:
If you check the JSON file that you exported from the portal and referenced in the PowerShell script, you will see it ends with something like this:
The hubName is currently automatically generated based on the name of the Data Factory and should not be present in the JSON files used by PowerShell. Remove the comma on the line above the hubName and the entire line starting with hubName.
That will leave the end of the file looking something like this:
Check out all your other JSON files you are using for Data Factory components and do the same editing for any that have a hubName.
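If you have many such files, the edit can be scripted rather than done by hand. This is a hypothetical helper, not an official tool: it loads the exported JSON, drops any hubName property (checking both the top level and the properties object, since the exact location can vary by component), and rewrites the file. Round-tripping through the json module also sidesteps the dangling-comma problem:

```python
import json

def strip_hub_name(path):
    """Remove the auto-generated hubName property from an exported
    Azure Data Factory JSON file so the PowerShell cmdlets accept it."""
    with open(path) as f:
        doc = json.load(f)
    doc.pop("hubName", None)  # top-level hubName, if present
    if isinstance(doc.get("properties"), dict):
        doc["properties"].pop("hubName", None)
    with open(path, "w") as f:
        json.dump(doc, f, indent=2)
```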
NOTE: This applies to Azure Data Factory as of April 2015. At some point the hubName should become a viable parameter usable by PowerShell.
This year’s Boise Code Camp has an exciting new panel session: “Where are the Women in Technology?” We have a great lineup of women who will help us all understand why women in tech matter to your bottom line. The panel invites everyone of any gender to attend and learn more about how we can all help make our businesses more productive, our work culture more pleasant, and our employees and coworkers happier. The session is sponsored by “Girl Develop It! Boise”.
Saturday, March 21 at 1:40pm at the BSU Special Events Center
How does it impact you and your business when women are underrepresented in your tech department? Women are interested, why aren’t they working in IT? Why is the number of women in IT decreasing? What can you personally do about it? Why should you care? We can all help make our businesses more productive, our work culture more pleasant, and our employees and coworkers happier. Sponsored by “Girl Develop It – Boise”.
We frequently hear how Idaho wants to be seen as a top tech destination. We want to attract new startups as well as existing businesses. Women are an untapped resource that can help make that happen. When women join boards and hold decision making positions in businesses, those businesses tend to outperform businesses with fewer women. Having a pipeline of women techies coming out of our local universities and a workforce with a high percentage of techie women can help attract high tech businesses. Tech businesses historically have a big focus on diverse workforces since over the years they’ve seen the benefits of having a multitude of perspectives when developing new products and services. Women are leading adopters of new products and services, it just makes sense to have women adding a female viewpoint to the decision making process as those products and services are chosen and built.
While tech businesses value women, they still see women exiting the field at much higher rates than men. Take game development as an example. Sexism, misogyny, and sexual harassment are worse in the areas of tech that traditionally are even more overwhelmingly male than most – like video game development. While nearly half of the people who play games are women, many games are marketed with ads that emphasize women as sex objects and few games have female avatars to choose from. When female characters or avatars are present in a game they generally have a much more limited range of body types with an emphasis on highly sexualized looks. Again, that is usually attributed to the fact that most of the people who write and market games are men – they never even question what they are doing, it’s just accepted. There is an unproven assumption that women will play “boy games” but men won’t play “girl games”. Female gamers like panelist and IT veteran Jane Miceli are asked “are you really a girl?” when they play well and score high in an online game. When more women are hired into gaming companies/divisions the products change. More non-sexualized female characters appear. Games begin to have more layers to them, more ways of interacting. And then more people buy those games. That’s good for business.
When women are mentored, encouraged, and valued at work they help drive better business decisions.
Our panelists include three women representing a range of IT career stages. Kelsey Suyehira is a BSU senior with a math degree who has returned to school to get a degree in computer science. She is currently president of the Association for Computing Machinery Women’s group, a club that supports the recruitment and retention of women in computer science on campus. Suyehira is joined on the panel by Marianna Budnikova, a professional hacker at the locally owned MetaGeek. In addition to her passion for machine learning and genetic algorithms, which were the basis of her Master’s thesis, Budnikova loves to develop iOS and Android apps. Rounding out the techies on the panel is Jane Miceli. Miceli is a 15-year veteran of the IT industry and a SCRUM Master with a Master’s degree in computer science. She is an avid supporter of tech education, with ongoing volunteering on the Idaho Technology Council’s Education Committee, Boise School District’s FIRST Robotics Team 2122, and the annual Boise Code Camp conference. Miceli is also managing director for Girls In Tech Boise. All three women show their leadership skills in the Boise chapter of Girl Develop It – a nonprofit that offers affordable technology classes to women, with introductory classes on topics such as Python, HTML, and Hadoop. These female tech leaders will talk about their personal experiences with being a woman in tech.
The panel is rounded out by two BSU gender studies lecturers. Carrie Semmelroth lectures on early and special education and gender topics at BSU. Semmelroth spends much of her time on data analysis in her department which gives her an interesting perspective on the intersection of gender and technology. Representative Melissa Wintrow is an educator, trainer, and leadership consultant who lectures on gender studies. Wintrow has been elected to represent District 19 in the Idaho legislature. Semmelroth and Wintrow will bring their perspectives as leaders and educators to help us understand how we can each have a positive impact at work.
Panel moderator Cindy Gross is a long time IT veteran with a passion for changing the way businesses do business, whether that’s through adoption of new technologies such as Hadoop or by increasing diversity in the workplace. Gross looks forward to a great hour of discussion on a range of Women in Tech topics that take us from a view of today’s reality to concrete, real-world actions that each of us can take to attract and retain more women in technology. The panel invites live tweeters to use the hashtag #BoiseWIT when discussing the session.
We Need More Women in Tech: The Data Prove It http://www.theatlantic.com/technology/archive/2013/10/we-need-more-women-in-tech-the-data-prove-it/280964/ “In 2010, although 57 percent of undergraduate degree recipients were female, but only 14 percent of the computer science degrees at major research universities. Incredibly, this number has actually fallen in recent years: In 1985, 37 percent of undergraduates degree recipients in computer science were women. By 2010, that fell to 18 percent, and at major research universities, the number was 14 percent.”…”[W]omen-operated, venture-backed high tech companies average 12 percent higher annual revenues. They also use on average one-third less capital than male counterparts’ startups.”
Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the “how to”. I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of “Create HDInsight Cluster in Azure Portal” on my YouTube channel Small Bites of Big Data.
Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.
Option: Add more Microsoft Accounts as admins of the Azure Subscription.
1. Choose SETTINGS at the very bottom on the left.
2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.
3. Enter a Microsoft Account or federated enterprise account that will be an admin.
Recording of getting started with an Azure subscription on YouTube
Azure Storage – WASB
I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialogue gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.
Create a Storage Account
1. Click on STORAGE in the left menu then NEW.
2. URL: Choose a lower-case storage account name that is unique within *.core.windows.net.
3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.
4. REPLICATION: Locally redundant stores fewer copies and costs less.
Repeat if you need additional storage.
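The portal enforces the storage account naming rules (3 to 24 characters, lowercase letters and digits only), so it can save a round trip to check candidate names up front. A minimal sketch; note that global uniqueness within *.core.windows.net can only be verified by Azure itself:

```python
import re

def valid_storage_account_name(name):
    """Check the basic Azure storage account naming rules:
    3-24 characters, lowercase letters and digits only."""
    return bool(re.fullmatch(r"[a-z0-9]{3,24}", name))

print(valid_storage_account_name("mystorage01"))  # True
print(valid_storage_account_name("My-Storage"))   # False: upper case and hyphen
```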
Create a Container
1. Click on your storage account in the left menu then CONTAINERS on the top.
2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.
3. Enter a lower-case NAME for the container, unique within that storage account.
4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.
This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
Recording of creating an Azure storage account and container on YouTube.
In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.
Create a Metastore aka Azure SQL DB
Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it may work to share a single metastore across multiple concurrent clusters, that is not officially tested or supported.
1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.
2. Choose a NAME unique to your server.
3. Click on the “?” to help you decide what TIER of database to create.
4. Use the default database COLLATION.
5. If you choose an existing SERVER you will share sysadmin access with other databases.
You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.
In order to refer to the metastore from automated cluster creation scripts such as PowerShell your workstation must be added to the firewall rules.
1. Click on MANAGE then choose YES.
2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.
Now that we have the pre-requisites done we can move on to creating the cluster.
Quick Create through the Azure portal is the fastest way to get started with all the default settings.
The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
A cluster will take several minutes to create; the type and size of the cluster have little impact on the creation time.
Quick Create Option
For your first cluster choose a Quick Create.
1. Click on HDINSIGHT in the left menu, then NEW.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).
3. Choose a NAME unique in the azurehdinsight.net domain.
4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.
5. Choose the admin PASSWORD.
6. The location of the STORAGE ACCOUNT determines the location of the cluster.
Custom Create Option
You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.
1. Click on HDINSIGHT in the left menu, then NEW in the lower left.
2. Choose CUSTOM CREATE.
1. Choose a NAME unique in the azurehdinsight.net domain.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).
3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.
4. In most cases you will want the default VERSION.
Size and Location
1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.
2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.
3. The storage account(s), metastore, and cluster will all be in the same REGION.
1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.
2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.
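The password guidance above can be sketched as a quick pre-check. Azure’s actual validation, especially which special characters it rejects, isn’t modeled here:

```python
import re

def meets_hdinsight_password_rules(pw):
    """Check the cluster admin password guidance: at least 10 characters
    with upper and lower case letters, a digit, and a special character."""
    return (len(pw) >= 10
            and re.search(r"[a-z]", pw) is not None
            and re.search(r"[A-Z]", pw) is not None
            and re.search(r"[0-9]", pw) is not None
            and re.search(r"[^A-Za-z0-9]", pw) is not None)

print(meets_hdinsight_password_rules("Str0ng!Pass"))   # True
print(meets_hdinsight_password_rules("weakpassword"))  # False
```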
On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog).
1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.
2. Choose the SQL Azure database you created earlier as the METASTORE.
3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.
Default Storage Account
Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.
1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.
2. Specify the NAME of the existing storage account.
3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.
4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.
Additional Storage Accounts
If you specified there will be additional accounts you will see this screen.
1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.
2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.
You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.
1. Click “add script action” to show the remaining parameters.
2. Enter a unique NAME for your action.
3. The SCRIPT URI points to code for your custom action.
4. Choose the NODE TYPE for deployment.
Create is Done!
Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.
Recording of HDInsight quick and custom create on YouTube
Query with Hive
For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as “just another data source”. That includes Azure Machine Learning, Power BI, Excel, and Tableau.
The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL).
The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.
1. Optionally edit or replace the default query: Select * from hivesampletable LIMIT 10;
2. Optionally name your query to make it easier to find in the job history.
3. Click Submit.
Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
View Hive Results
1. Click on the Query you just submitted in the Job Session. This opens a new tab.
2. You can see the text of the Job Query that was submitted. You can Download it.
3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.
4. The Job Log has details including errors if there are any.
5. Additional information about the job is available in the upper right.
View Hive Data in Excel Workbook
At this point HDInsight is “just another data source” for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. In Excel, on the DATA tab choose From Other Sources, then From Data Connection Wizard.
View Hive Data in PowerPivot
At this point HDInsight is “just another data source” for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.
4. Choose Get External Data then Others (OLEDB/ODBC).
Now you can combine the Hive data with other data inside the tabular PowerPivot data model.
In the cloud you don’t have to load data into Hadoop; you can load it into an Azure Storage Account instead. Then you point your HDInsight or other WASB-compliant Hadoop cluster at the existing data. There are many ways to load data; for this demo we’ll use CloudXplorer.
You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.
In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
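If you prefer scripting to a GUI tool, the same upload can be sketched with the classic Azure storage cmdlets. The account, container, and file names below match the example above; substitute your own key:

```powershell
# Upload a local file into the "demo" container under the "bacon" directory prefix.
# Hive tables would then point at the bacon prefix, not at individual blobs.
$ctx = New-AzureStorageContext -StorageAccountName "nealhadoop" `
    -StorageAccountKey "<your storage key>"
Set-AzureStorageBlobContent -File "C:\data\bacon1.txt" -Container "demo" `
    -Blob "bacon/bacon1.txt" -Context $ctx
```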
Once you have created the HDInsight cluster you can use it, play with it, and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can reuse that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for its usage, so make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it when you’re done. Check the website for the most recent information.
Automate with PowerShell
With PowerShell, .NET, or the Cross-Platform cmd line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.
Automation allows you to standardize your deployments, and with version control you can track your configuration changes over time.
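A minimal sketch of that create/drop cycle with the classic HDInsight cmdlets — names and parameters have changed across module versions, so treat this as illustrative, with all account and cluster names as placeholders:

```powershell
# Spin up a cluster with a known, repeatable configuration.
$cred = Get-Credential   # admin user/password for the new cluster
New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorage.blob.core.windows.net" `
        -StorageAccountKey "<your storage key>" -StorageContainerName "mycontainer" |
    New-AzureHDInsightCluster -Name "mycluster" -Location "West US" -Credential $cred

# ... run your jobs ...

# Drop the cluster when you are done; the storage account and metastore survive.
Remove-AzureHDInsightCluster -Name "mycluster"
```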
Hadoop is the hot buzzword of the Big Data world, and many IT people are being told “go create a Hadoop cluster and do some magic”. It’s hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let’s look at this from a business perspective.
For the average business just getting into Hadoop for the first time, you are most likely to be successful if you choose a project related to data exploration, analytics and reporting, and/or the search for new data-driven actionable insights. In many ways Hadoop is “just another data source.” Most businesses will not start by replacing existing, high-functioning OLTP implementations. Instead you will likely see the highest initial return on investment (ROI) from adding on to those existing systems. Pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. The goal should remain clearly focused on how to take action based on the new data-driven insights you uncover.
Below are some characteristics that are often present in a successful Hadoop implementation. You don’t need to have all of them to be successful; use the list to brainstorm new ideas.
Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of “what-if” questions cheaply, to “fail fast” so you can try out many potential hypotheses, and look for that one cool thing everyone else has missed that can really impact your business.
New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB option to make data ingestion even easier and faster.
You are looking for the “Unknown Unknowns”. There are always lurking things that haven’t come to your attention before but which may be sparks for new actions. You know you don’t know what you want or what to ask for and will use that to spur innovation.
Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise that it is infinitely scalable – you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
You are looking to gain some competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points: you are using Hadoop to look for something new that can change your business or help you be first to market with something.
There is a low number of direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be, and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop it can be sent to systems that are less batch-oriented and more available and familiar to the average power user or data steward.
Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.
Quite often I hear people proposing Hadoop for projects that are not an ideal fit, at least not while you are learning it and looking for quick successes to bolster confidence in the new technology. The characteristics below are generally indicators that you do NOT want to use Hadoop for a project.
You plan to replace an existing system whose pain points don’t align with Hadoop’s strengths.
There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, some components of Hadoop can meet OLTP requirements, and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast, interactive response times, it is possible Hadoop could be a fit, but it’s usually not a good first project for learning Hadoop and truly using its strengths.
Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
Your first project and learning is on a mission critical system or application. Learn on something new, something that makes Hadoop’s strengths really apparent and easy to see.
And in Conclusion
Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success characteristics, and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way, and whether there are better ways to do certain things. Hadoop is becoming ubiquitous; avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!
This week I helped a client move files from AWS S3 to Azure Storage blobs. Sounds simple, right? Here’s the tricky part… While there are both Azure and AWS cmdlets for PowerShell, they don’t cooperate. Neither has a cmdlet that accepts credentials from the other and neither accepts arbitrary URLs from outside their own cloud. And AzCopy also doesn’t accept S3 URLs. None of the S3 tools seem to recognize Azure. So what’s a girl to do?
The Search and The Discovery
After hours of trying to get creative with PowerShell or AzCopy I resorted to Bing searches. When what to my wondering eyes should appear, but a miniature sleigh…. uh, a fully fledged, well-written tool to move data between Azure and S3. But there’s more! This tool, known as Rudolph… I mean AzureCopy, can move data between Azure, S3, OneDrive, SharePoint online, Dropbox, and local file systems! Ken Faulkner has written a wonderful, holly jolly tool! After a few hiccups as I learned how to use the tool and learned about how S3 URLs are (and at first mostly are not) formed I quickly had all my data moved from S3 to Azure! Simple. Easy. It flew like the down of a thistle (whatever that means). So, what was required after installing the tool?
Open a DOS prompt (cmd.exe) and go to the directory where you installed AzureCopy. Instead of using a config file I set the values at the command line (use your own real values for the directory and after each equal sign):
set AWSRegion=us-west-2
Then I got a listing of my files on S3 – this took longer than it should have because I had trouble getting the S3 URL correct. That was a problem with my newness to S3, not a problem with the tool. If you’re in the default region you use mybucket.s3.amazonaws.com. Otherwise you use mybucket.s3-region.amazonaws.com. See Amazon’s docs on S3 buckets for more details on the URL.
Also, I didn’t need all the keys passed in on both commands; it was just easier to write and copy the code that way while I was getting it all working.
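Putting it together, the session looked roughly like this. The environment-variable and flag names are from memory and may not match the current AzureCopy release, so check the project’s README before relying on them; the bucket, account, and container names are placeholders:

```
REM Hypothetical sketch of an AzureCopy session (flag names may differ by version).
cd C:\tools\azurecopy
set AzureAccountKey=<your azure storage key>
set AWSAccessKeyID=<your aws access id>
set AWSSecretAccessKeyID=<your aws secret key>
set AWSRegion=us-west-2

REM List the S3 bucket, then copy everything into the Azure container.
azurecopy -list https://mybucket.s3-us-west-2.amazonaws.com/
azurecopy -i https://mybucket.s3-us-west-2.amazonaws.com/ -o https://myaccount.blob.core.windows.net/mycontainer/ -blobcopy
```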
Next I listed out the files in Azure. At this point the container was empty but the command at least verified my access worked. I uploaded a small test file and verified I could see it with AzureCopy, then deleted the test file.
And just like that, within a couple of minutes, the list command for AzureCopy showed all the files in Azure! I double-checked with my Azure and AWS PowerShell cmdlets that yes, this was really true! This tool saved me SO MUCH TIME! And now you know: the built-in tools from the major cloud vendors lock you into their own cloud. But with AzureCopy you too can free your data!
More and more we have to work with data in many different locations. This week I got to work with S3 files that were moving to Azure blob storage. I was surprised to find that Amazon has published AWS cmdlets for PowerShell. It took me a little while to figure out the format and terminology so I’ll try to explain that and compare/contrast how we interact with storage in AWS and Azure. Today we will cover viewing the files.
Well first, let’s get things set up. Install the Azure and AWS cmdlets for PowerShell. These examples will pass keys for everything so there’s no need to configure PowerShell with certificates to access the clouds.
The first time (depending on your PowerShell version) you use PowerShell after installing AWS cmdlets you may need to run these cmdlets:
We’ll start with AWS S3. Each connection to S3 storage requires an AWS region (unless you use the default “US Standard”), an access id (unique identifier), a secret key, and a bucket. You are storing data within a specific region, on an access point, in a managed grouping called a bucket. The access id in S3 is equivalent to a storage account name in Azure. A bucket in S3 is roughly equivalent to a container in Azure.
Next let’s use those values to make a new client connection to S3. You define a configuration object that points to the full URL for the region. Then you pass that configuration object, the access id, and the secret key to a function that creates a client connection to S3. This sets the context for the entire session, and the context does not have to be passed to the individual commands. Note that the URL changes depending on the region; for example, https://s3-us-west-2.amazonaws.com.
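The steps above can be sketched like this. The class and factory names reflect the AWS .NET SDK of that era and may have changed in later versions; all key and bucket values are placeholders:

```powershell
# Connect to S3 and list the files in a bucket (values are placeholders).
$accessId  = "<your access id>"
$secretKey = "<your secret key>"
$bucket    = "mybucket"

# Configuration object pointing at the region's full URL.
$config = New-Object Amazon.S3.AmazonS3Config
$config.ServiceURL = "https://s3-us-west-2.amazonaws.com"

# Client connection built from the config, access id, and secret key.
$client = [Amazon.AWSClientFactory]::CreateAmazonS3Client($accessId, $secretKey, $config)

# List the objects (files) in the bucket.
$request = New-Object Amazon.S3.Model.ListObjectsRequest
$request.BucketName = $bucket
$client.ListObjects($request).S3Objects | Select-Object Key, Size
```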
Let’s compare that to how we list files in Azure blob storage. First you specify the location and credentials. The region is implied because the storage account name is unique across all regions. The Azure container and storage key are similar in meaning to the S3 bucket and secret key.
Then you define the storage context, which bundles the location and credentials into a single object. Alternatively you could set the default storage context for the session or for a particular profile’s connection to a given subscription.
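The equivalent Azure-side sketch with the classic storage cmdlets — the account, key, and container names are placeholders:

```powershell
# Location and credentials become a storage context, passed to each command.
$ctx = New-AzureStorageContext -StorageAccountName "mystorageaccount" `
    -StorageAccountKey "<your storage key>"

# List the blobs (files) in the container.
Get-AzureStorageBlob -Container "mycontainer" -Context $ctx |
    Select-Object Name, Length
```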
This intro to PowerShell for S3 opens up the door to many possibilities – data migrations, multi-cloud hybrid solutions, and whatever your imagination can conjure up! Today we reviewed how to view files, I’ll cover more in future posts. Happy PowerShelling!
When you open “Microsoft Azure PowerShell”, type ISE in the window to launch the interactive PowerShell ISE. It has IntelliSense, multiple script windows, and a cmdlet viewer.