I have so many thoughts and ideas about where my passion will lead me next. I haven’t yet settled on any one thing for a new career, so I went back to the basics. Listen. Listen to my community. I envision my community as marginalized people in tech. So I have started a meetup group where we can get together and talk. Where we can listen to each other. Where we help each other. Join me and let’s go on this journey to our futures together.
A gathering place for people forging new paths after harassment at work.
This is a safe space – no hate speech, bullying, harassment, or discrimination is tolerated. We value input from a variety of identities and will center the views, needs, and decisions of those who are not cishet white men.
I’m a 50 year old white woman leaving the tech world. As I talk about the harassment, bullying, and discrimination I’ve faced over the years other women open up about their own experiences. So many of us have no place to talk to others with the same experiences. Let’s share our stories, our growth, our pain and joy. This is a place to talk about surviving and thriving, about careers, family, friends, life, work, play, and about disrupting the white patriarchy to nurture a new way of doing things.
#Words4Justice
Befriending Dragons – Life After Workplace Harassment
I no longer feel safe, comfortable, or valued working in tech. Going forward I'll be working to actively disrupt tech culture and systems to reduce harassment and discrimination. Keep an eye on #Words4Justice.
Be kind. Be brave. Go beyond ally to accomplice to actively disrupt bullying and discrimination.
After all these years soaring through the data world, from SQL Server 1.11 all the way through today's modern Big Data technologies, I am making a flight adjustment. My next adventure will be in the land of the Windows Hypervisor: Hyper-V. Last week I started working with my new team and I am already learning to corral and wield a whole new world of acronyms, technologies, and scenarios. As a software engineer on the quality team I'll help define and implement test scenarios that lead to better customer experiences across multiple products.
I won’t be leaving data behind! This new role has a lot of data aspects and of course the hypervisor underlies many of the world’s data systems! It’s been great working with the #SQLFamily over the years and I look forward to continuing to work with you all!
We're good people. As good people we don't want to think we do things that have negative consequences for others. But sometimes our subconscious can fool us. What we intend isn't always what happens. We think we're making a totally rational decision based on our conscious values – but subtle, unconscious bias creeps in. Yes, even for good people. For 20+ years folks at Harvard have been using something called the Implicit Association Test (IAT) to help us identify our biases.
The human brain works with patterns and categories. It's how we make it through the day. We are bombarded with hundreds of thousands of data points every day – we can't possibly think through each one every time. We unconsciously assign data points, including our perception of people, into buckets. Those buckets have values and characteristics assigned to them that may or may not reflect the individual person we put in that bucket.
This automatic assignment is called intuitive thinking or system 1 thinking. It's easy and takes little effort. It serves us well and lets us take on many tasks every day. However, it also sometimes leads us down the path of thinking we've chosen the "best" person when we've really hired someone who meets some set of assumptions.
Sometimes we use slow thinking, or system 2 thinking. It's rarely a conscious decision; something just makes us take some extra time and we usually don't even realize it. That's when we stop to question what we're doing – maybe we adjust which categories we put someone in, or we adjust the category or the values and judgments associated with it. We're good people, but system 2 thinking is tiring and we just can't do it all the time.
Diversity Matters
Why does diversity matter at work? Personally, when we're on a diverse team we tend to have higher personal and job satisfaction. Diverse teams are interesting and we often learn more. People who don't feel like they're the "only one" of something (gender, sexual orientation, race, introvert/extravert, etc.) relax, contribute more, and are more productive. And study after study shows that more diverse teams lead to better products and a better bottom line.
Companies with women on their boards have higher ROIs, more diverse companies tend to perform above average, and let's face it – we don't have enough STEM graduates to fill needed jobs if we don't encourage a more diverse group of people to enter the field.
Mind Tricks
But we're good people and we don't make these snap judgments. We are rational and we always know why we made a decision. Or do we?
Optical illusions fool us all the time. Even knowing those lines are all the same length, did you have to measure them just to be sure? The same thing happens in our interactions with people. What's the first thing that comes to mind for single parent, introvert, doctor, CEO, or programmer? That first thing hints at your categories – the categories built up by a lifetime of media saturation filled with type-cast actors.
Back to the science of bias. Let's think about resumes. In one study, resumes were handed out to academics who were asked to rate the job candidates for competency, hireability, mentoring, and suggested salary. Some resumes were for John and some for Jennifer. Professors of all genders rated Jennifer 25% less competent and less likely to be hired. They rated John worth about $4000 more. When asked why they gave those ratings, their justifications sounded rational, but… 4 industry publications was awesome for John and 4 was just not enough for Jennifer. They are good people, but they (we!) are at the mercy of their subconscious and years of societal conditioning.
Moving On
We're good people, so what do we do?
Take the IATs – there are many; take at least a couple and understand your unexpected biases. Talk about this with others so we all become comfortable talking about our subtle biases. Work to consciously update your mental categories – seek out images and reminders of people who are different and successful. Now that you know your own categories a bit better, be more mindful about switching to system 2 thinking. Reach out to one person and mentor them. Spend time with someone who makes you uncomfortable. Pay attention to the "firsts" (the first autistic character on Sesame Street, the first black President, the first whatever) and see if that helps you update your mental categories.
Increase the pipeline. Participate in groups that help kids learn to code. Recruit beyond your normal network, post jobs on diversity sites, and consider non-traditional backgrounds. Join diverse groups that donât match your own diversity.
Be careful with words. Is someone bossy or exhibiting leadership? Is someone aggressive or a go-getter? Are they emotional or passionate? You may be surprised how you assign different words for the same behavior in unexpected ways.
When you post a job, only list something as "required" if it truly is. Women, for example, tend to apply only if they meet almost all the requirements; men tend to apply if they meet a few. Do you really require Java experience, or do you need a good coder who is willing to learn new things? Don't ask for a specific type of leader; look for someone who can lead in any of many productive ways. Explicitly state that you value a diverse team. And beware of subtle stereotypes – words like best, rock star, and action-oriented paint a particular picture but may not represent what you're really looking for.
When reviewing resumes, have HR take off names, cities, and years. Before you pick up a resume, decide on your priorities – does experience or willingness to learn matter more, for example? Look for people who fill gaps rather than trying to replicate people you already have. And remember, system 2 thinking is tiring, so do this when you're alert and can take the time to think about what you're doing.
For the interviews, have a diverse group participate. Simply looking at pictures of or talking about diverse people before starting interviews increases the chance you hire with diversity in mind. Don't confuse either confidence or "geek cred" with competence. Keep an open mind about different ways of approaching problems – it's the result that matters.
Many flowers make a beautiful bouquet – @IsisAnchalee
SQL Server is a great technology – I've been using it since 1993, when the user interface consisted of a query window with the options to save and execute and not much else. With every release there's something new and exciting, and there's always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.
So what is a SQL geek to do in the face of all the new options out there – many under the umbrella of Big Data (distributed processing)? Why, just jump right on in and learn it! No one can know all the pieces because it's a big, fluid, messy collection of "things". But don't worry about that; start with one thing and build from there. Even if you never plan to implement a production Big Data system you need to learn about it – because if you don't have some hands-on experience with it, then someone who does have that experience will be influencing the decision makers without you. For a SQL pro I suggest Hive as that easy entry point. At some point maybe Spark SQL will jump into that gap, but for now Hive is the easiest entry point for most SQL pros.
For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below, the file is attached.
Look, it's SQL!
SELECT score, fun
FROM toDo
WHERE type = 'they pay me for this?';
Hereâs how that code looks from Visual Studio along with the links to how you find the output and logs:
And yet it's more!
CREATE EXTERNAL TABLE IF NOT EXISTS toDo
(fun STRING,
score INT COMMENT 'rank the greatness',
type STRING)
COMMENT 'two tables walk into a bar...'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/demo/';
A mix of old and new
— read some data
SELECT 'you cannot make me ', score, fun, type
FROM toDo
WHERE score <= 0
ORDER BY score;
SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
DISTRIBUTE BY score SORT BY score;
That's Hive, folks!
Hive
on Hadoop
on HDInsight
on Azure
Big Data in the cloud!
Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iterating
Unknown unknowns
Flexible scaling
Data-driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival
Hadoop Anti-Patterns…
Replacing a system whose pain points don't align with Hadoop's strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system
Azure has so much more
Go straight to the business code
Scale storage and compute separately
Open Source
Linux
Managed and unmanaged services
Hybrid
On-demand and 24×7 options
SQL Server
Itâs a Polyglot
Stream your data into a lake
Pick the best compute for each task
One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?
The Problem
One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition "too much". Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.
"Too small" is smaller than the HDFS block size (chunk size); realistically, small is anything less than several times the chunk size. A very, very rough rule of thumb is that files should be at least 1GB each, with no more than roughly 10,000 files per table. These numbers, especially the maximum total number of files per table, vary depending on many factors, but they give you a reference point. The 1GB is based on multiples of the chunk size; the second number is honestly a bit of a guess based on a typical small cluster.
Why Is It Important?
One reason for this recommendation is that Hadoop's name node service keeps track of all the files and of where the internal chunks of the individual files are. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affect how memory is used on each node.
Let's say your chunk size is 256MB. That's the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file, it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I'm ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB), then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn't physically chunked up when initially stored but rather chunked up at the time a job runs.
With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track the logical file plus 1000 physical chunks and their physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split out any Hadoop job into tasks across the many nodes. When we're talking about systems that often have TBs or PBs of data, the difference between small and large files can add up quickly.
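The arithmetic above can be sketched in a few lines of Python. This is a simplified model, not how the name node is implemented: it ignores replication and counts one metadata object per logical file plus one per physical chunk.

```python
import math

def namenode_objects(file_sizes_mb, chunk_mb=256):
    """Rough count of objects the name node tracks:
    one entry per logical file plus one per physical chunk."""
    chunks = sum(math.ceil(size / chunk_mb) for size in file_sizes_mb)
    logical_files = len(file_sizes_mb)
    return logical_files + chunks

# One 1GB (1024MB) file: 1 logical file + 4 chunks = 5 objects.
print(namenode_objects([1024]))      # 5
# The same ~1GB as 1000 files of 1MB each: 1000 files + 1000 chunks.
print(namenode_objects([1] * 1000))  # 2000
```

Same total data, roughly 400x the metadata to track and plan around.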
The other problem comes at the time that the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to be read for the same amount of data it results in more work and slower execution time for each task within each job. Sometimes you will see hard errors when operating system limits are hit related to the number of open files. There is also more internal work involved in reading the larger number of files and combining the data.
Stitching
There are several options for stitching files together.
1. Combine the files as they land, using the code that moves the files. This is the most performant and efficient method in most cases.
2. INSERT into new Hive tables (directories), which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
3. Use a combiner in Pig to load the many small files into bigger splits.
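Option 1 above can be sketched in Python. This is a minimal, hypothetical example (the directory layout and file names are assumptions) that concatenates small text drops into one larger file before it is uploaded to HDFS/WASB:

```python
from pathlib import Path

def stitch_files(source_dir, target_path, pattern="*.txt"):
    """Concatenate many small text files into one larger file,
    suitable for staging before upload to HDFS/WASB."""
    with open(target_path, "w") as out:
        for small_file in sorted(Path(source_dir).glob(pattern)):
            out.write(small_file.read_text())

# Example: stitch all the small IoT drops in ./landing into one big file.
# stitch_files("landing", "staged/batch_0001.txt")
```

In a real pipeline you would batch by size or time window so each output file lands near a multiple of the chunk size.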
Dealing with Hadoop's small files problem http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/ – "aggregating with the small files first reduced total processing time from 2 hours 57 minutes to just 9 minutes – of which 3 minutes was the aggregation, and 4 minutes was running our actual Enrichment process. That's a speedup of 1,867%."
The Small Files Problem http://blog.cloudera.com/blog/2009/02/the-small-files-problem/ – "Reading through files in a HAR is no more efficient than reading through files in HDFS, and in fact may be slower since each HAR file access requires two index file reads as well as the data file read (see diagram). And although HAR files can be used as input to MapReduce, there is no special magic that allows maps to operate over all the files in the HAR co-resident on a HDFS block."
The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.
You can use the new Azure portal to create or edit Azure Data Factory components. Once you are done you may automate the process of creating future Data Factory components from PowerShell. In that case you can use the JSON files you edited in the portal GUI as configuration files for the PowerShell cmdlets. For example, you may try to create a new linked service using settings from C:CoolerHDInsight.JSON.
In that case you may see something like this error:
New-AzureDataFactoryLinkedService : Hub: {SomeName_hub} not found.
CategoryInfo          : CloseError: (:) [New-AzureDataFactoryLinkedService], ProvisioningFailedException
FullyQualifiedErrorId : Microsoft.Azure.Commands.DataFactories.NewAzureDataFactoryLinkedServiceCommand
If you check the JSON file that you exported from the portal and referenced in the PowerShell script, you will see it ends with something like this:
        "isPaused": false,
        "hubName": "SomeName_hub"
    }
}
The hubName is currently automatically generated based on the name of the Data Factory and should not be present in the JSON files used by PowerShell. Remove the comma on the line above the hubName and the entire line starting with hubName.
        ,
        "hubName": "SomeName_hub"
That will leave the end of the file looking something like this:
        "isPaused": false
    }
}
Check out all your other JSON files you are using for Data Factory components and do the same editing for any that have a hubName.
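If you have many exported files, the manual edit can be scripted. Here is a hedged sketch in Python (the directory name is hypothetical, and it assumes hubName appears either at the top level or under "properties" as in the exported files shown above):

```python
import json
from pathlib import Path

def strip_hubname(json_dir):
    """Remove the auto-generated hubName property from exported
    Data Factory JSON files so the PowerShell cmdlets accept them."""
    for path in Path(json_dir).glob("*.json"):
        doc = json.loads(path.read_text())
        doc.pop("hubName", None)
        if isinstance(doc.get("properties"), dict):
            doc["properties"].pop("hubName", None)
        path.write_text(json.dumps(doc, indent=4))

# strip_hubname(r"C:\DataFactoryJson")  # hypothetical folder of exported files
```

Re-serializing with json.dumps also takes care of the dangling comma, since the file is rewritten as valid JSON.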
NOTE: This applies to Azure Data Factory as of April 2015. At some point the hubName should become a viable parameter usable by PowerShell.
Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the “how to”. I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of “Create HDInsight Cluster in Azure Portal” on my YouTube channel Small Bites of Big Data.
Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.
Add Accounts
Option: Add more Microsoft Accounts as admins of the Azure Subscription.
1. Choose SETTINGS at the very bottom on the left.
2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.
3. Enter a Microsoft Account or federated enterprise account that will be an admin.
Recording of getting started with an Azure subscription on YouTube
Azure Storage – WASB
I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialogue gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.
Create a Storage Account
1. Click on STORAGE in the left menu then NEW.
2. URL: Choose a lower-case storage account name that is unique within *.core.windows.net.
3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.
4. REPLICATION: Locally redundant stores fewer copies and costs less.
Repeat if you need additional storage.
Create a Container
1. Click on your storage account in the left menu then CONTAINERS on the top.
2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.
3. Enter a lower-case NAME for the container, unique within that storage account.
4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.
This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
WASB
Additional information about storage, including details on Windows Azure Storage Blobs (WASB) is on http://SmallBitesOfBigData.com.
Recording of creating an Azure storage account and container on YouTube.
Metastore (HCatalog)
In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.
Create a Metastore aka Azure SQL DB
Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances it is not officially tested or supported.
1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.
2. Choose a NAME unique to your server.
3. Click on the "?" to help you decide what TIER of database to create.
4. Use the default database COLLATION.
5. If you choose an existing SERVER you will share sysadmin access with other databases.
You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested, so the exact permissions needed are vague. You may see odd errors if you don't grant the appropriate permissions.
Firewall Rules
In order to refer to the metastore from automated cluster creation scripts such as PowerShell your workstation must be added to the firewall rules.
1. Click on MANAGE then choose YES.
2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.
Now that we have the pre-requisites done we can move on to creating the cluster.
Quick Create through the Azure portal is the fastest way to get started with all the default settings.
The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
A cluster will take several minutes to create; the type and size of the cluster have little impact on the creation time.
Quick Create Option
For your first cluster choose a Quick Create.
1. Click on HDINSIGHT in the left menu, then NEW.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key value pairs (HBase) or alerting (Storm).
3. Choose a NAME unique in the azurehdinsight.net domain.
4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.
5. Choose the admin PASSWORD.
6. The location of the STORAGE ACCOUNT determines the location of the cluster.
Custom Create Option
You can also customize your size, admin account, storage, metastore, and more through the portal. Weâll walk through a basic Hadoop cluster.
New
1. Click on HDINSIGHT in the left menu, then NEW in the lower left.
2. Choose CUSTOM CREATE.
Basic Info
1. Choose a NAME unique in the azurehdinsight.net domain.
2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).
3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.
4. In most cases you will want the default VERSION.
Size and Location
1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created, and they all use HDInsight cores. For information on how many cores are used by each node see the "Pricing details" link.
2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose "Create a support ticket." If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit, you will receive a message: "This cluster requires X cores, but only Y cores are available for this subscription". Note that the messages are in cores while your configuration is specified in nodes.
3. The storage account(s), metastore, and cluster will all be in the same REGION.
Cluster Admin
1. Choose an administrator USER NAME. It is more secure to avoid "admin" and to choose a relatively obscure name. This account will be added to the cluster and doesn't have to match any existing external accounts.
2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.
Metastore (HCatalog)
On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (HCatalog).
1. Click on the blue box to the right of "Enter the Hive/Oozie Metastore". This makes more fields available.
2. Choose the SQL Azure database you created earlier as the METASTORE.
3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.
Default Storage Account
Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.
1. To access existing data on an existing STORAGE ACCOUNT, choose "Use Existing Storage".
2. Specify the NAME of the existing storage account.
3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.
4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.
Additional Storage Accounts
If you specified there will be additional accounts you will see this screen.
1. If you choose "Use Existing Storage" you simply enter the NAME of the storage account.
2. If you choose "Use Storage From Another Subscription" you specify the NAME and the GUID KEY for that storage account.
Script Actions
You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.
1. Click âadd script actionâ to show the remaining parameters.
2. Enter a unique NAME for your action.
3. The SCRIPT URI points to code for your custom action.
4. Choose the NODE TYPE for deployment.
Create is Done!
Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.
Recording of HDInsight quick and custom create on YouTube
Query with Hive
For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as “just another data source”. That includes Azure Machine Learning, Power BI, Excel, and Tableau.
Hive Console
The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL).
1. Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net//
2. Click on Hive Editor.
Query Hive
The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.
1. Optionally edit or replace the default query: Select * from hivesampletable LIMIT 10;
2. Optionally name your query to make it easier to find in the job history.
3. Click Submit.
Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.
View Hive Results
1. Click on the Query you just submitted in the Job Session. This opens a new tab.
2. You can see the text of the Job Query that was submitted. You can Download it.
3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.
4. The Job Log has details including errors if there are any.
5. Additional information about the job is available in the upper right.
View Hive Data in Excel Workbook
At this point HDInsight is "just another data source" for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. From DATA choose From Other Sources and From Data Connection Wizard.
View Hive Data in PowerPivot
At this point HDInsight is "just another data source" for any application that supports ODBC.
1. Install the Microsoft Hive ODBC driver.
2. Define an ODBC data source pointing to your HDInsight instance.
3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.
4. Choose Get External Data then Others (OLEDB/ODBC).
Now you can combine the Hive data with other data inside the tabular PowerPivot data model.
In the cloud you don't have to load data to Hadoop; you can load data to an Azure Storage Account. Then you point your HDInsight or other WASB-compliant Hadoop cluster to the existing data source. There are many ways to load data; for the demo we'll use CloudXplorer.
You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.
In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a "directory". The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
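For reference, the WASB URI such a "directory" resolves to follows the pattern wasb://container@account.blob.core.windows.net/path. A tiny Python helper illustrates this with the example names above:

```python
def wasb_uri(account, container, path):
    """Build the WASB URI for a directory in an Azure storage account."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path}"

print(wasb_uri("nealhadoop", "demo", "bacon"))
# wasb://demo@nealhadoop.blob.core.windows.net/bacon
```

That URI is what a Hive table's LOCATION clause would point at when the data lives in this container rather than in the cluster's default storage.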
Once you have created the HDInsight cluster you can use it and play with it and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can use that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for the usage of it. So make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it. Check the website for the most recent information.
Pricing
Automate with PowerShell
With PowerShell, .NET, or the Cross-Platform cmd line tools you can specify even more configuration settings that arenât available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.
Automation allows you to standardize and with version control lets you track your configurations over time.
Sample PowerShell Script: HDInsight Custom Create http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx. If your HDInsight and/or Azure cmdlets don't match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of "Microsoft Azure PowerShell with Microsoft Azure SDK" or "Microsoft Azure PowerShell (standalone)."
Recording of Pricing, Automation, and Wrapup on YouTube
Summary
HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux
Easy, cost-effective, changeable scale-out data processing for a lower TCO: easily add, remove, or scale nodes
Separation of storage and compute allows data to exist across clusters via WASB
Metastore (HCatalog) exists independently across clusters via SQL DB
#, size, type of clusters are flexible and can all access the same data
Instantly access data born in the cloud; easily, cheaply load, share, and merge public or private data
Load data now, add schema later (write once, read many)
Fail fast: iterate through many questions to find the right question
Faster time from question to insight
Hadoop is "just another data source" for BI, Analytics, Machine Learning
I hope you enjoyed this Small Bite of Big Data! Happy Hadooping!
Hadoop is the hot buzzword of the Big Data world, and many IT people are being told “go create a Hadoop cluster and do some magic”. It’s hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let’s look at this from a business perspective.
Data-Driven Insights
For the average business getting into Hadoop for the first time, you are most likely to be successful if you choose a project related to data exploration, analytics and reporting, and/or looking for new data-driven actionable insights. In many ways Hadoop is "just another data source." Most businesses will not start by replacing existing, high-functioning OLTP implementations. Instead you will likely see the highest initial return on investment (ROI) from adding on to those existing systems. Pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. The goal should remain clearly focused on how to use the data to take action based on the new data-driven insights you will uncover.
Success
Below are some characteristics that are often present in a successful Hadoop implementation. You don't need all of them to be successful; use the list to brainstorm new ideas.
Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of “what-if” questions cheaply, to “fail fast” so you can try out many potential hypotheses, and look for that one cool thing everyone else has missed that can really impact your business.
New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB option to make data ingestion even easier and faster.
You are looking for the “Unknown Unknowns”. There are always lurking things that haven’t come to your attention before but which may be sparks for new actions. You know you don’t know what you want or what to ask for and will use that to spur innovation.
Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise that it is infinitely scalable – you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
You are looking to gain some competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points: you are using Hadoop to look for something new that can change your business or help you be first to market with something.
There are a low number of direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop it can be sent to systems that are less-batch oriented and more available and familiar to the average power user or data steward.
Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.
Anti-Patterns
Quite often I hear people proposing Hadoop for projects that are not an ideal use for Hadoop, at least not as you are learning it and looking for quick successes to bolster confidence in the new technology. The below characteristics are generally indicators that you do NOT want to use Hadoop in a project.
You plan to replace an existing system whose pain points don’t align with Hadoop’s strengths.
There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, there are some components of Hadoop that can meet OLTP requirements, and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast interactive response times, it is possible Hadoop could be a fit, but it's usually not a good first project for learning Hadoop and truly using Hadoop's strengths.
Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
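That index-versus-scan tradeoff can be made concrete with a back-of-the-envelope model. This is a sketch for intuition only, not a benchmark; the "cost units" are abstract and the numbers are illustrative:

```python
import math

def indexed_lookup_cost(rows):
    """Roughly log2(n) page reads for a B-tree index seek in a
    relational system with a supporting index."""
    return math.ceil(math.log2(rows))

def scaled_scan_cost(rows, nodes):
    """A Hadoop-style full scan touches every row, with the work
    divided across the cluster's worker nodes."""
    return math.ceil(rows / nodes)

rows = 1_000_000_000  # a billion-row table
print(indexed_lookup_cost(rows))    # ~30 units: the index wins for known queries
print(scaled_scan_cost(rows, 100))  # 10,000,000 units even across 100 nodes
```

The point of the sketch: for well-known, repeated queries with supporting indexes, the relational system's seek cost barely grows with data size, while a scan only gets cheaper by adding nodes.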
A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
Your first project and learning is on a mission critical system or application. Learn on something new, something that makes Hadoop’s strengths really apparent and easy to see.
And in Conclusion
Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success criteria, and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way and whether there are better ways to do certain things. Hadoop is becoming ubiquitous; avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!
This week I helped a client move files from AWS S3 to Azure Storage blobs. Sounds simple, right? Here’s the tricky part… While there are both Azure and AWS cmdlets for PowerShell, they don’t cooperate. Neither has a cmdlet that accepts credentials from the other and neither accepts arbitrary URLs from outside their own cloud. And AzCopy also doesn’t accept S3 URLs. None of the S3 tools seem to recognize Azure. So what’s a girl to do?
The Search and The Discovery
After hours of trying to get creative with PowerShell or AzCopy I resorted to Bing searches. When what to my wondering eyes should appear, but a miniature sleigh…. uh, a fully fledged, well-written tool to move data between Azure and S3. But there’s more! This tool, known as Rudolph… I mean AzureCopy, can move data between Azure, S3, OneDrive, SharePoint online, Dropbox, and local file systems! Ken Faulkner has written a wonderful, holly jolly tool! After a few hiccups as I learned how to use the tool and learned about how S3 URLs are (and at first mostly are not) formed I quickly had all my data moved from S3 to Azure! Simple. Easy. It flew like the down of a thistle (whatever that means). So, what was required after installing the tool?
Open a command prompt and go to the directory where you installed AzureCopy. Instead of using a config file I set the values at the command line (use your own real values for the directory and after each equal sign):
cd C:\installs\azurecopy
set AzureAccountKey=MyAzureStorageAccountKey
set AWSAccessKeyID=MyS3AccessId
set AWSSecretAccessKeyID=MyS3SecretKey
set AWSRegion=us-west-2
Then I got a listing of my files on S3. This took longer than it should have because I had trouble getting the S3 URL correct; that was a problem with my newness to S3, not a problem with the tool. If you're in the default region you use mybucket.s3.amazonaws.com. Otherwise you use mybucket.s3-region.amazonaws.com. See Amazon's docs on S3 buckets for more details on the URL.
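The URL rule that tripped me up fits in a few lines of code. A sketch of the virtual-hosted-style URLs as Amazon documented them at the time (the scheme has evolved since, so check Amazon's current docs):

```python
def s3_bucket_url(bucket, region=None):
    """Build a virtual-hosted-style S3 bucket URL.

    The default (US Standard) region omits the region segment;
    any other region embeds it as s3-<region>.
    """
    if region is None or region == "us-east-1":
        return f"http://{bucket}.s3.amazonaws.com"
    return f"http://{bucket}.s3-{region}.amazonaws.com"

print(s3_bucket_url("mybucket"))               # default region
print(s3_bucket_url("mybucket", "us-west-2"))  # regional endpoint
```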
Also, I didn’t need all the keys passed in on both commands, it was just easier to write and copy the code that way as I tried to get it all working.
Next I listed out the files in Azure. At this point the container was empty but the command at least verified my access worked. I uploaded a small test file and verified I could see it with AzureCopy, then deleted the test file.
And just like that, within a couple of minutes, the list command for AzureCopy showed all the files in Azure! I double-checked with my Azure and AWS PowerShell cmdlets that yes, this was really true! This tool saved me SO MUCH TIME! And now you know: the built-in tools from the major cloud vendors lock you into their own cloud, but with AzureCopy you too can free your data!