Befriending Dragons

Transform Tech with Anti-bullying Cultures



Hadoop Likes Big Files

One of the frequently overlooked yet essential best practices for Hadoop is to prefer fewer, bigger files over more, smaller files. How small is too small and how many is too many? How do you stitch together all those small Internet of Things files into files “big enough” for Hadoop to process efficiently?

The Problem

One performance best practice for Hadoop is to have fewer large files as opposed to large numbers of small files. A related best practice is to not partition “too much”. Part of the reason for not over-partitioning is that it generally leads to larger numbers of smaller files.

Too small is anything smaller than the HDFS block size (chunk size); realistically, anything less than several multiples of the chunk size is still small. A very rough rule of thumb is that files should be at least 1GB each and that a table should have no more than roughly 10,000 files. These numbers, especially the maximum number of files per table, vary depending on many factors, but they give you a reference point. The 1GB figure is based on multiples of the chunk size, while the file count is honestly a bit of a guess based on a typical small cluster.

Why Is It Important?

One reason for this recommendation is that Hadoop’s name node service keeps track of all the files and of where each file’s internal chunks are located. The more files it has to track, the more memory it needs on the head node and the longer it takes to build a job execution plan. The number and size of files also affect how memory is used on each node.

Let’s say your chunk size is 256MB. That’s the maximum size of each piece of the file that Hadoop will store per node. So if you have 10 nodes and a single 1GB file it would be split into 4 chunks of 256MB each and stored on 4 of those nodes (I’m ignoring the replication factor for this discussion). If you have 1000 files that are 1MB each (still a total data size of ~1GB) then every one of those files is a separate chunk and 1000 chunks are spread across those 10 nodes. NOTE: In Azure and WASB this happens somewhat differently behind the scenes – the data isn’t physically chunked up when initially stored but rather chunked up at the time a job runs.

With the single 1GB file the name node has 5 things to keep track of – the logical file plus the 4 physical chunks and their associated physical locations. With 1000 smaller files the name node has to track the 1000 logical files plus their 1000 physical chunks and physical locations. That uses more memory and results in more work when the head node service uses the file location information to build out the plan for how it will split any Hadoop job into tasks across the many nodes. When we’re talking about systems that often have TBs or PBs of data the difference between small and large files can add up quickly.
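To make the bookkeeping concrete, here is a quick back-of-the-envelope calculation in PowerShell using the same example numbers (256MB chunks, one 1GB file vs. 1000 1MB files):

# Rough illustration: name node items = logical files + physical chunks
$chunkSizeMB = 256
$bigFileChunks = [math]::Ceiling(1024 / $chunkSizeMB)           # one 1GB file -> 4 chunks
"Single 1GB file: $(1 + $bigFileChunks) items to track"         # 1 logical file + 4 chunks = 5
$smallFileCount  = 1000                                         # 1000 x 1MB files, ~1GB total
$smallFileChunks = $smallFileCount * 1                          # each small file is its own chunk
"1000 x 1MB files: $($smallFileCount + $smallFileChunks) items to track"   # 2000 items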

The other problem comes at the time the data is read by a Hadoop job. When the job runs on each node it loads the files the task tracker identified for it to work with into memory on that local node (in WASB the chunking is done at this point). When there are more files to read for the same amount of data, each task within each job does more work and runs more slowly. Sometimes you will see hard errors when operating system limits on the number of open files are hit. There is also more internal work involved in reading the larger number of files and combining the data.

Stitching

There are several options for stitching files together.

  • Combine the files as they land using the code that moves the files. This is the most performant and efficient method in most cases.
  • INSERT into new Hive tables (directories) which creates larger files under the covers. The output file size can be controlled with settings like hive.merge.smallfiles.avgsize and hive.merge.size.per.task (see the sketch below).
  • Use a combiner in Pig to load the many small files into bigger splits.
  • Use the HDFS FileSystem Concat API http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat.
  • Write custom stitching code and make it a JAR.
  • Enable the Hadoop Archive (HAR). This is not very efficient for this scenario but I am including it for completeness.

There are several writeups out there that address the details of each of these methods so I won’t repeat them.
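To illustrate the Hive option (second bullet above), here is a minimal sketch of compacting a table of many small files into a new table, submitted from Azure PowerShell. The cluster name, table names, and setting values are assumptions for illustration, not tested recommendations; tune the sizes to your own chunk size.

# Point the session at your cluster (name is a placeholder)
Use-AzureHDInsightCluster -Name "mycluster"

$hql = @"
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=268435456;   -- ~256MB: merge outputs smaller than this
SET hive.merge.size.per.task=1073741824;       -- aim for ~1GB output files
INSERT OVERWRITE TABLE sensor_compacted
SELECT * FROM sensor_raw;                      -- sensor_raw points at the many small files
"@

Invoke-AzureHDInsightHiveJob -Query $hql       # alias: Invoke-Hive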

The key here is to work with fewer, larger files as much as possible in Hadoop. The exact steps to get there will vary depending on your specific scenario.

I hope you enjoyed this small bite of big data!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com




Create HDInsight Cluster in Azure Portal

Creating an HDInsight cluster from the Azure portal is very easy. However, sometimes you want all the choices and best practices explained as well as the “how to”. I have created a series of slides with audio recordings to walk you through the process and choices. They are available as sessions 1-8 of “Create HDInsight Cluster in Azure Portal” on my YouTube channel Small Bites of Big Data.

Playlist Getting Started with HDInsight: https://www.youtube.com/playlist?list=PLAD2dOpGM3s1R2L5HgPMX4MkTGvSza7gv

  1. Why HDInsight: https://youtu.be/J9KzIShLeD8
  2. Azure Subscription: https://youtu.be/lSxMtmRE114
  3. Azure Storage – WASB: https://youtu.be/6OdDDmdaVVE
  4. Metastore: https://youtu.be/1Og_eftYVpA
  5. Create HDInsight: https://youtu.be/SysIo3LwONk
  6. Hive Query: https://youtu.be/DRAuOXsuec0
  7. Load Demo Data: https://youtu.be/XyiOpRPjfUs
  8. Pricing, Automation, and Wrapup: https://youtu.be/78YowrOnNGM

PowerPoint deck: http://www.slideshare.net/cindygross1/create-hd-insightfeb2015


Why HDInsight?

HDInsight is Hadoop on Azure as a service.

  • Easy, cost effective, changeable scale out data processing
  • Lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters
  • Hortonworks HDP is one of the 3 major Hadoop distributions, the most purely open source
  • HDInsight *IS* Hortonworks HDP as a service in Azure (cloud)
  • Metastore (Hcatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Hive is a Hadoop component that makes data look like rows/columns for data warehouse type activities

It offers the standard advantages of Hadoop:

  • Scale-out
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

In addition you have the advantages of Hadoop in the cloud:

  • Instantly access data born in the cloud
  • Easily, cheaply load, share, and merge public or private data
  • Data exists independently across clusters (separation of storage and compute) via WASB on Azure storage accounts

Recording of why HDInsight on YouTube

Azure Subscription

You have many options to obtain a Microsoft Azure subscription:

Login to Azure Subscription

1. Log in to the Azure Portal https://manage.windowsazure.com

2. Use a Microsoft Account http://www.microsoft.com/en-us/account/default.aspx
Note: Some companies have federated their accounts and can use company accounts.


Choose Subscription

Most accounts will only have one Azure subscription associated with them. But if you seem to have unexpected resources, check to make sure you are in the expected subscription. The Subscriptions button is on the upper right of the Azure portal.


Add Accounts

Option: Add more Microsoft Accounts as admins of the Azure Subscription.

1. Choose SETTINGS at the very bottom on the left.

2. Then choose ADMINISTRATORS at the top. Click on the ADD button at the very bottom.

3. Enter a Microsoft Account or federated enterprise account that will be an admin.


Recording of getting started with an Azure subscription on YouTube

Azure Storage – WASB

I recommend you manually create at least one Azure storage account and container ahead of time. While the HDInsight creation dialogue gives the option of creating the storage account and container for you, that only works if you don’t plan to reuse data across clusters.

Create a Storage Account

1. Click on STORAGE in the left menu then NEW.

2. URL: Choose a lower-case storage account name that is unique within *.core.windows.net.

3. LOCATION: Choose the same location for the SQL Azure metastore database, the storage account(s), and HDInsight.

4. REPLICATION: Locally redundant stores fewer copies and costs less.


Repeat if you need additional storage.

Create a Container

1. Click on your storage account in the left menu then CONTAINERS on the top.

2. Choose CREATE A CONTAINER or choose the NEW button at the bottom.

3. Enter a lower-case NAME for the container, unique within that storage account.

4. Choose either Private or Public ACCESS. If there is any chance of sensitive or PII data being loaded to this container choose Private. Private access requires a key. HDInsight can be configured with that key during creation or keys can be passed in for individual jobs.

This will be the default container for the cluster. If you want to manage your data separately you may want to create additional containers.
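If you prefer to script these steps, a sketch with the Azure PowerShell service management cmdlets is below; the account, container, and location names are placeholders.

$storageAccount = "mystorageacct"     # lower-case, unique within *.core.windows.net
$container      = "mycontainer"       # lower-case, unique within the storage account
$location       = "West US"           # keep storage, metastore, and HDInsight in one region

# Replication can be set with the -Type parameter on newer Azure PowerShell versions (e.g. Standard_LRS)
New-AzureStorageAccount -StorageAccountName $storageAccount -Label $storageAccount -Location $location

$key = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$ctx = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $key
New-AzureStorageContainer -Name $container -Permission Off -Context $ctx   # Off = private access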


WASB

Additional information about storage, including details on Windows Azure Storage Blobs (WASB) is on http://SmallBitesOfBigData.com.


Recording of creating an Azure storage account and container on YouTube.

Metastore (HCatalog)

In Azure you have the option to create a metastore for Hive and/or Oozie that exists independently of your HDInsight clusters. This allows you to reuse your Hive schemas and Oozie workflows as you drop and recreate your cluster(s). I highly recommend using this option for a production environment or anything that involves repeated access to the same, standard schemas and/or workflows.

Create a Metastore aka Azure SQL DB

Persist your Hive and Oozie metadata across cluster instances, even if no cluster exists, with an HCatalog metastore in an Azure SQL Database. This database should not be used for anything else. While it works to share a single metastore across multiple instances it is not officially tested or supported.

1. Click on SQL DATABASES then NEW and choose CUSTOM CREATE.

2. Choose a NAME unique to your server.

3. Click on the “?” to help you decide what TIER of database to create.

4. Use the default database COLLATION.

5. If you choose an existing SERVER you will share sysadmin access with other databases.


You can make the system more secure if you create a custom login on the Azure server. Add that login as a user in the database you just created. Grant it minimal read/write permissions in the database. This is not well documented or tested so the exact permissions needed for this are vague. You may see odd errors if you don’t grant the appropriate permissions.

Firewall Rules

In order to refer to the metastore from automated cluster creation scripts such as PowerShell your workstation must be added to the firewall rules.

1. Click on MANAGE then choose YES.

2. You can also use the MANAGE button to connect to the SQL Azure database and manage logins and permissions.

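These steps can also be scripted. The sketch below uses the service management cmdlets; the location, database name, edition, and IP address are placeholders, so substitute your own values.

$metastoreCred = Get-Credential   # admin login and password for the new SQL Database server

$server = New-AzureSqlDatabaseServer -Location "West US" `
    -AdministratorLogin $metastoreCred.UserName `
    -AdministratorLoginPassword $metastoreCred.GetNetworkCredential().Password

New-AzureSqlDatabase -ServerName $server.ServerName -DatabaseName "HiveOozieMetastore" -Edition "Standard"

# Let your workstation through the server firewall so scripts can reference the metastore
New-AzureSqlDatabaseServerFirewallRule -ServerName $server.ServerName -RuleName "MyWorkstation" `
    -StartIpAddress "203.0.113.10" -EndIpAddress "203.0.113.10"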

Recording of creating the metastore on YouTube.

Create the HDInsight Cluster

Now that we have the pre-requisites done we can move on to creating the cluster.

  • Quick Create through the Azure portal is the fastest way to get started with all the default settings.
  • The Azure portal Custom Create allows you to customize size, storage, and other configuration options.
  • You can customize and automate through code including .NET and PowerShell. This increases standardization and lets you automate the creation and deletion of clusters over time.
  • For all the examples here we will create a basic Hadoop cluster with Hive, Pig, and MapReduce.
  • A cluster takes several minutes to create; the type and size of the cluster have little impact on the creation time.

Quick Create Option

For your first cluster choose a Quick Create.

1. Click on HDINSIGHT in the left menu, then NEW.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key value pairs (HBase) or alerting (Storm).

3. Choose a NAME unique in the azurehdinsight.net domain.

4. Start with a small CLUSTER SIZE, often 2 or 4 nodes.

5. Choose the admin PASSWORD.

6. The location of the STORAGE ACCOUNT determines the location of the cluster.

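For reference, a Quick Create maps roughly to the single cmdlet below. This is a sketch; the cluster name, storage names, container, and node count are placeholders.

$storageAccount = "mystorageacct"
$storageKey     = (Get-AzureStorageKey -StorageAccountName $storageAccount).Primary
$clusterCred    = Get-Credential      # becomes the cluster admin account

New-AzureHDInsightCluster -Name "mycluster" -Location "West US" `
    -DefaultStorageAccountName "$storageAccount.blob.core.windows.net" `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName "mycontainer" `
    -ClusterSizeInNodes 4 `
    -Credential $clusterCred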

Custom Create Option

You can also customize your size, admin account, storage, metastore, and more through the portal. We’ll walk through a basic Hadoop cluster.

New

1. Click on HDINSIGHT in the left menu, then NEW in the lower left.

2. Choose CUSTOM CREATE.


Basic Info

1. Choose a NAME unique in the azurehdinsight.net domain.

2. Choose Hadoop. HBase and Storm also include the features of a basic Hadoop cluster but are optimized for in-memory key-value pairs (HBase) or alerting (Storm).

3. Choose Windows or Linux as the OPERATING SYSTEM. Linux is only available if you have signed up for the preview.

4. In most cases you will want the default VERSION.


Size and Location

1. Choose the number of DATA NODES for this cluster. Head nodes and gateway nodes will also be created and they all use HDInsight cores. For information on how many cores are used by each node see the “Pricing details” link.

2. Each subscription has a billing limit set for the maximum number of HDInsight cores available to that subscription. To change the number available to your subscription choose “Create a support ticket.” If the total of all HDInsight cores in use plus the number needed for the cluster you are creating exceeds the billing limit you will receive a message: “This cluster requires X cores, but only Y cores are available for this subscription”. Note that the messages are in cores and your configuration is specified in nodes.

3. The storage account(s), metastore, and cluster will all be in the same REGION.


Cluster Admin

1. Choose an administrator USER NAME. It is more secure to avoid “admin” and to choose a relatively obscure name. This account will be added to the cluster and doesn’t have to match any existing external accounts.

2. Choose a strong PASSWORD of at least 10 characters with upper/lower case letters, a number, and a special character. Some special characters may not be accepted.


Metastore (HCatalog)

On the same page as the Hadoop cluster admin account you can optionally choose to use a common metastore (Hcatalog).

1. Click on the blue box to the right of “Enter the Hive/Oozie Metastore”. This makes more fields available.

2. Choose the SQL Azure database you created earlier as the METASTORE.

3. Enter a login (DATABASE USER) and PASSWORD that allow you to access the METASTORE database. If you encounter errors, try logging in to the database manually from the portal. You may need to open firewall ports or change permissions.


Default Storage Account

Every cluster has a default storage account. You can optionally specify additional storage accounts at cluster create time or at run time.

1. To access existing data on an existing STORAGE ACCOUNT, choose “Use Existing Storage”.

2. Specify the NAME of the existing storage account.

3. Choose a DEFAULT CONTAINER on the default storage account. Other containers (units of data management) can be used as long as the storage account is known to the cluster.

4. To add ADDITIONAL STORAGE ACCOUNTS that will be accessible without the user providing the storage account key, specify that here.


Additional Storage Accounts

If you specified that there will be additional storage accounts you will see this screen.

1. If you choose “Use Existing Storage” you simply enter the NAME of the storage account.

2. If you choose “Use Storage From Another Subscription” you specify the NAME and the GUID KEY for that storage account.


Script Actions

You can add additional components or configure existing components as the cluster is deployed. This is beyond the scope of this demo.

1. Click “add script action” to show the remaining parameters.

2. Enter a unique NAME for your action.

3. The SCRIPT URI points to code for your custom action.

4. Choose the NODE TYPE for deployment.


Create is Done!

Once you click on the final checkmark Azure goes to work and creates the cluster. This takes several minutes. When the cluster is ready you can view it in the portal.


Recording of HDInsight quick and custom create on YouTube

Query with Hive

For most people the easiest, fastest way to learn Hadoop is through Hive. Hive is also the most widely used component of Hadoop. When you use the Hive ODBC driver any ODBC-compliant app can access the Hive data as “just another data source”. That includes Azure Machine Learning, Power BI, Excel, and Tableau.

Hive Console

The simplest, most relatable way for most people to use Hadoop is via the SQL-like, Database-like Hive and HiveQL (HQL).

1.  Put focus on your HDInsight cluster and choose QUERY CONSOLE to open a new tab in your browser. In my case it opens: https://dragondemo1.azurehdinsight.net//

2.  Click on Hive Editor.


Query Hive

The query console defaults to selecting the first 10 rows from the pre-loaded sample table. This table is created when the cluster is created.

1. Optionally edit or replace the default query:
Select * from hivesampletable LIMIT 10;

2. Optionally name your query to make it easier to find in the job history.

3. Click Submit.
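The same query can also be submitted from Azure PowerShell instead of the console (the cluster name is a placeholder):

Use-AzureHDInsightCluster -Name "mycluster"
Invoke-AzureHDInsightHiveJob -Query "SELECT * FROM hivesampletable LIMIT 10;"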

Hive is a batch system optimized for processing huge amounts of data. It spends several seconds up front splitting the job across the nodes and this overhead exists even for small result sets. If you are doing the equivalent of a table scan in SQL Server and have enough nodes in Hadoop, Hadoop will probably be faster than SQL Server. If your query uses indexes in SQL Server, then SQL Server will likely be faster than Hive.


View Hive Results

1. Click on the Query you just submitted in the Job Session. This opens a new tab.


2. You can see the text of the Job Query that was submitted. You can Download it.

3. The first few lines of the Job Output (query result) are available. To see the full output choose Download File.

4. The Job Log has details including errors if there are any.

5. Additional information about the job is available in the upper right.


View Hive Data in Excel Workbook

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. From DATA choose From Other Sources and From Data Connection Wizard.


View Hive Data in PowerPivot

At this point HDInsight is “just another data source” for any application that supports ODBC.

1. Install the Microsoft Hive ODBC driver.

2. Define an ODBC data source pointing to your HDInsight instance.

3. Click on POWERPIVOT then choose Manage. This opens a new PowerPivot for Excel window.

4. Choose Get External Data then Others (OLEDB/ODBC).

Now you can combine the Hive data with other data inside the tabular PowerPivot data model.


Recording of querying Hive on YouTube

Load Demo Data

In the cloud you don’t have to load data into Hadoop; you can load data to an Azure storage account and then point your HDInsight (or other WASB-compliant Hadoop) cluster at the existing data. There are many ways to load data; for the demo we’ll use CloudXplorer.

You use the Accounts button to add Azure, S3, or other data/storage accounts you want to manage.

In this example nealhadoop is the Azure storage account, demo is the container, and bacon is a “directory”. The files are bacon1.txt and bacon2.txt. Any Hive tables would point to the bacon directory, not to individual files. Drag and drop files from Windows Explorer to CloudXplorer.
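If you would rather script the upload than use a GUI tool, a sketch with the Azure storage cmdlets is below; the local file paths are placeholders, while the account, container, and blob names mirror the example above.

$key = (Get-AzureStorageKey -StorageAccountName "nealhadoop").Primary
$ctx = New-AzureStorageContext -StorageAccountName "nealhadoop" -StorageAccountKey $key

# "Directories" in WASB are simply part of the blob name
Set-AzureStorageBlobContent -File "C:\data\bacon1.txt" -Container "demo" -Blob "bacon/bacon1.txt" -Context $ctx
Set-AzureStorageBlobContent -File "C:\data\bacon2.txt" -Container "demo" -Blob "bacon/bacon2.txt" -Context $ctx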

Windows Azure Storage Explorers (2014)


Recording of loading demo data on YouTube

WrapUp

Once you have created the HDInsight cluster you can use it and play with it and try many things. When you are done, simply remove the cluster. If you created an independent metastore in SQL Azure you can use that same metastore and the same Azure storage account(s) the next time you create a cluster. You are charged for the existence of the cluster, not for the usage of it. So make sure you drop the cluster when you aren’t using it. You can use automation, such as PowerShell, to spin up a cluster that is configured the same every time and to drop it. Check the website for the most recent information.

Pricing


Automate with PowerShell

With PowerShell, .NET, or the cross-platform command-line tools you can specify even more configuration settings that aren’t available in the portal. This includes node size, a library store, and changing default configuration settings such as Tez and compression.

Automation allows you to standardize and with version control lets you track your configurations over time.

Sample PowerShell Script: HDInsight Custom Create http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx. If your HDInsight and/or Azure cmdlets don’t match the current documentation or return unexpected errors, run Web Platform Installer and check for a new version of “Microsoft Azure PowerShell with Microsoft Azure SDK” or “Microsoft Azure PowerShell (standalone).”
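As a rough sketch (not the linked sample script), a custom create with a metastore and an additional storage account might look like the pipeline below; every name, key, and credential is a placeholder.

$metastoreCred = Get-Credential   # login for the metastore SQL database
$clusterCred   = Get-Credential   # cluster admin account
$defaultKey    = (Get-AzureStorageKey -StorageAccountName "mystorageacct").Primary
$otherKey      = (Get-AzureStorageKey -StorageAccountName "otheracct").Primary

New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4 |
    Set-AzureHDInsightDefaultStorage -StorageAccountName "mystorageacct.blob.core.windows.net" `
        -StorageAccountKey $defaultKey -StorageContainerName "mycontainer" |
    Add-AzureHDInsightStorage -StorageAccountName "otheracct.blob.core.windows.net" -StorageAccountKey $otherKey |
    Add-AzureHDInsightMetastore -SqlAzureServerName "myserver.database.windows.net" `
        -DatabaseName "HiveOozieMetastore" -Credential $metastoreCred -MetastoreType HiveMetastore |
    New-AzureHDInsightCluster -Name "mycluster" -Location "West US" -Credential $clusterCred

# Drop the cluster when you are done; the storage accounts and metastore remain for next time
Remove-AzureHDInsightCluster -Name "mycluster"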


Recording of Pricing, Automation, and Wrapup on YouTube

Summary

  • HDInsight is Hadoop on Azure as a service, specifically Hortonworks HDP on either Windows or Linux
  • Easy, cost effective, changeable scale out data processing for a lower TCO – easily add/remove/scale
  • Separation of storage and compute allows data to exist across clusters via WASB
  • Metastore (Hcatalog) exists independently across clusters via SQL DB
  • #, size, type of clusters are flexible and can all access the same data
  • Instantly access data born in the cloud; Easily, cheaply load, share, and merge public or private data
  • Load data now, add schema later (write once, read many)
  • Fail fast – iterate through many questions to find the right question
  • Faster time from question to insight
  • Hadoop is “just another data source” for BI, Analytics, Machine Learning

I hope you enjoyed this Small Bite of Big Data! Happy Hadooping!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com



Master Choosing the Right Project for Hadoop

Hadoop is the hot buzzword of the Big Data world, and many IT people are being told “go create a Hadoop cluster and do some magic”. It’s hard to know where to start or which projects are a good fit. The information available online is sparse, often conflicting, and usually focused on how to solve a technical problem rather than a business problem. So let’s look at this from a business perspective.

Data-Driven Insights

For the average business just getting into Hadoop for the first time, you are most likely to be successful if you choose a project related to data exploration, analytics and reporting, and/or looking for new data-driven actionable insights. In many ways Hadoop is “just another data source.” Most businesses will not start by replacing existing, high-functioning OLTP implementations. Instead you will likely see the highest initial return on investment (ROI) from adding on to those existing systems. Pull some of the existing data into Hadoop, add new data, and look for new ways to use that data. The goal should remain clearly focused on how to use the data to take action based on the new data-driven insights you will uncover.

Success

Below are some characteristics that are often present in a successful Hadoop implementation. You don’t need all of them to be successful; use the list to brainstorm new ideas.

  • Goals include innovation, exploration, iteration, and experimentation. Hadoop allows you to ask lots of “what-if” questions cheaply, to “fail fast” so you can try out many potential hypotheses, and look for that one cool thing everyone else has missed that can really impact your business.
  • New data or data variations will be explored. Some of it may be loosely structured. Hadoop, especially in the cloud, allows you to import and experiment with data much more quickly and cheaply than with traditional systems. Hadoop on Azure in particular has the WASB option to make data ingestion even easier and faster.
  • You are looking for the “Unknown Unknowns”. There are always lurking things that haven’t come to your attention before but which may be sparks for new actions. You know you don’t know what you want or what to ask for and will use that to spur innovation.
  • Flexible, fast scaling without the need to change your code is important. Hadoop is built on the premise that it is infinitely scalable – you simply add more nodes when you need more processing power. In the cloud you can also scale your storage and compute separately and more easily scale down during slow periods.
  • You are looking to gain some competitive advantage faster than your competition based on data-driven actions. This goes back to the previous points, you are using Hadoop to look for something new that can change your business or help you be first to market with something.
  • There are a low number of direct, concurrent users of the Hadoop system itself. The more jobs you have running at the same time, the more robust and expensive your head node(s) must be and often the larger your cluster must be. This changes the cost/benefit ratio quickly. Once data is processed and curated in Hadoop it can be sent to systems that are less-batch oriented and more available and familiar to the average power user or data steward.
  • Archiving data in a low-cost manner is important. Often historical data is kept in Hadoop while more interactive data is kept in a relational system.

Anti-Patterns

Quite often I hear people proposing Hadoop for projects that are not an ideal use for Hadoop, at least not while you are learning it and looking for quick successes to bolster confidence in the new technology. The characteristics below are generally indicators that you do NOT want to use Hadoop in a project.

  • You plan to replace an existing system whose pain points don’t align with Hadoop’s strengths.
  • There are OLTP business requirements, especially if they are adequately met by an existing system. Yes, there are some components of Hadoop that can meet OLTP requirements and those features are growing and expanding rapidly. If you have an OLTP scenario that requires ACID properties and fast interactive response time it is possible Hadoop could be a fit but it’s usually not a good first project for you to learn Hadoop and truly use Hadoop’s strengths.
  • Data is well-known and the schema is static. Generally speaking, though the tipping point is changing rapidly, when you can use an index for a query it will likely be faster in a relational system. When you do the equivalent of a table scan across a large volume of data and provide enough scaled-out nodes it is likely faster on a Big Data system such as Hadoop. Well-known, well-structured data is highly likely to have well-known, repeated queries that have supporting indexes.
  • A large number of users will need to directly access the system and they have interactive response time requirements (response within seconds).
  • Your first project and learning is on a mission critical system or application. Learn on something new, something that makes Hadoop’s strengths really apparent and easy to see.

And in Conclusion

Choosing the right first project for your dive into Hadoop is crucial. Make it bite-sized, clearly outline your goals, make sure it has some of the above success criteria, and avoid the anti-patterns. Make learning Hadoop a key goal of the project. Budget time for everyone to really learn not only how things work but why they work that way and whether there are better ways to do certain things. Hadoop is becoming ubiquitous; avoiding it completely is not an option. Jump in, but do so with your eyes wide open and make some good up-front decisions. Happy Big Data-ing!



Azure Maximums and Resource Usage from PowerShell


Have you ever struggled to find out how many VM cores, HDInsight cores, storage accounts, or other Azure resources your subscription is set to allow or how many you actually use? Maybe you want to use this information in your automation scripts to avoid trying to create components for which you don’t have resources.


PowerShell to the rescue!

First a couple of key points. There are various maximums in Azure. Today we are talking about finding the currently configured maximums allowed for a specified subscription. There are default maximums (default limit) which you can increase for a given subscription by opening a billing support ticket. There are also hard maximums (maximum limit). However, with some products, such as HDInsight (Hadoop), you can get past some per-subscription maximums for dependent services by combining resources (storage accounts) from multiple subscriptions for a single HDInsight cluster. All the samples below find the current billing quota limitation and actual usage for the current subscription.

Let’s take a look at the information available on the subscription level cmdlet.

Start by checking which subscription is in focus / current for the PowerShell session.

(Get-AzureSubscription -Current).SubscriptionName

(Get-AzureSubscription -Current).CurrentStorageAccountName

If you need information on a different subscription either pass the subscription name (as defined on your client) for the cmdlets that support this or change the focus to a different subscription.

$SubName = "sqlcatwoman"

Select-AzureSubscription -SubscriptionName $SubName

Now we will look at the cores available for Azure virtual machines (VMs / IaaS). Note that HDInsight cores are tracked separately. Be careful with unexpected line wraps that may paste into your PowerShell window (or ISE) incorrectly. The below snippet is 1 comment line and 4 lines of code.

# How many cores are available to create new VMs (or increase size of existing VMs) for the current subscription?

[int]$maxVMCores     = (Get-AzureSubscription -current -ExtendedDetails).maxcorecount

[int]$currentVMCores = (Get-AzureSubscription -current -ExtendedDetails).currentcorecount

[int]$availableCores = $maxVMCores - $currentVMCores

Write-Host "Cores available for VMs:" $availableCores

We can get similar information about cloud services:

#how many cloud (hosted) services are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxHostedServices

[int]$currentUsed    = (Get-AzureSubscription -current -ExtendedDetails).CurrentHostedServices

[int]$availableNow   = $maxAvl - $currentUsed

Write-Host "Cloud services available:" $availableNow

Some limits and usage are available on cmdlets specific to a particular technology. For example, the HDInsight usage and maximums are available from the Get-AzureHDInsightProperties cmdlet. You can find details and samples on Get HDInsight Properties with PowerShell.

Other times we have to look at different cmdlets for different pieces of the information, such as for storage accounts:

#how many storage accounts are available on this subscription

[int]$maxAvl         = (Get-AzureSubscription -current -ExtendedDetails).MaxStorageAccounts

[int]$currentUsed    = (Get-AzureStorageAccount).Count

[int]$availableNow   = $maxAvl - $currentUsed

Write-Host "Storage Accounts available:" $availableNow

We can look at all the extended properties available for a subscription:

Get-AzureSubscription -Current -ExtendedDetails

If you know you have a particular component created and this cmdlet shows the “Current” value is zero, take a look at the Get-Azure… cmdlet for that particular type of resource and look for a “Current” value.

Another handy thing to look at is the overall information about what Azure regions exist and what services are available in each region:

Get-AzureLocation 

And you can pull off specific information:

Get-AzureLocation  | Select DisplayName

I hope these small bites of PowerShell help save the day for you in some way!



SQL PASS: All the Magic Knobs – Tools


In my All the Magic Knobs talk at #SQLPASS 2011 I discussed some easy ways to determine if you’re using some of the performance magic for SQL Server. When you have many consolidated, non-tier 1 databases you don’t have a lot of control over, the best way to tune is to provide a solid, performant infrastructure through low effort, high impact choices. The same steps help in your tier 1 environments as well. The quickest way to see how close you are to that standard is to run one of our automated health checks. They check the SQL instance itself and some of the most important Windows settings that help SQL Server operate optimally.

SQL Best Practices Analyzer (BPA) is available for SQL Server 2000, 2005, and 2008/2008 R2. It is an add-in to the Microsoft Baseline Configuration Analyzer (MBCA). Both the MBCA and the SQL BPA are free. You can run the BPA locally or remotely and you can find plenty of sample scripts to run it against multiple instances. You choose your schedule for execution and you can either review the output after each execution manually or write your own program to alert you to what you consider the most serious items.

The System Center Advisor (SCA) is at this point still in pre-release. Licensing details will be available after release, for now you can download a free trial. It works for SQL Server 2008 and newer on Windows 2008 and newer. SCA runs on a schedule and sends alerts when a registered instance is not configured as advised. What it checks can change dynamically as PSS finds new important items.

Several companies, including Microsoft through our Premier Field Engineering (PFE) team, offer various health checks that include knowledge sharing and additional advice to help you decide if, how, when, and where to implement the recommendations.

Of course, you have to actually implement the recommendations to get the benefit; the tools listed above don’t do any remediation on their own. While that should go without saying, in my experience known recommendations often go unimplemented until some problem they would have prevented pops up.

For more SQL Server best practices see some of my other blogs: http://blogs.msdn.com/b/cindygross/archive/tags/best+practices/  

 



SQL PASS: All the Magic Knobs

SQL PASS 2011 DBA-319-C #SQLPASS

All the Magic Knobs – Low Effort, High Return Tuning

Key points covered:

  • Power Savings = High Performance
  • Smart Virtualization
  • Enough Hardware
  • Control other apps, filter drivers
  • Optimize for ad hoc workloads = ON
  • Compression = ON
  • Set LPIM + Max Server Memory
  • Pre-size files, avoid shrink and autogrow
  • Fast Tempdb
  • Proper Maintenance

My presentation from 10/13/11 is attached.

AllTheMagicKnobs.pptx



Taming the Tempdb Tempest – WI SQL Server Virtual User Group, 22 Apr 2011

Thanks to the Wisconsin Virtual SQL Server User Group for letting me talk about tempdb today! The slides and demo queries are attached. Once the recording is available I will update this blog with a link to it.

Taming the Tempdb Tempest

Summary:

·         Multiple data files of the same size, one log file

·         Enough data files to avoid contention, not so many to cause problems.

·         Presize for peak periods of next X months, re-evaluate

·         Set autogrow to be rare but “big enough”

·         Instant File Initialization on (small security risk)

·         Fast IO subsystem

·         Change size/settings if you add new features that use tempdb

·         Monitor for approaching full, change in activity/size

·         Performance tune user databases and applications

·         Limit use of versioning or temp objects

 

The demo queries are:

·         sys.dm_db_file_space_usage.sql: How space is used inside tempdb

·         sys.dm_db_file_space_usage_companion1.sql: Show how different activities cause space to be used in tempdb

·         Autogrow.sql: Find autogrow settings for all dbs on an instance

·         TempdbContention.sql: Find contention on tempdb metadata

 

Also see my previous blog post with the same basic data in a different format: Compilation of SQL Server TempDB IO Best Practices http://blogs.msdn.com/b/cindygross/archive/2009/11/20/compilation-of-sql-server-tempdb-io-best-practices.aspx

I also delivered the talk to the Boise SQL Server User Group on 13 Sep 2011 and the updated queries and slide deck are attached below.

TempdbBoiseSQLSep2011.zip



General Hardware/OS/Network Guidelines for a SQL Box

I have put together some general guidelines for how you want a server to be delivered to the DBA team for a new SQL Server install. You won’t necessarily use all of them, but consider it a starting point for your SQL Server install standards. Places where it may be common to change the statements are in [red]. Always run the SQL Server Best Practices Analyzer or an equivalent tool to check for the latest recommendations before releasing the system to production. I’m sure some of you will disagree with some of the points for various reasons, but I’ve found them to be a good baseline for discussion and a great starting point for standards documents. I’m ok with that because I am very fond of saying “it depends”. 🙂

The below is specific to SQL Server 2008/2008 R2 on Windows 2008/2008 R2.

OS Specifications (things often controlled by a Windows team)

·         Power saving features: For a SQL Server box if you want consistent, predictable, high performance you either need to fine tune the power setting parameters for each individual workload and/or for different times of day or just set the power options to high performance. Databases are harder to fit into the normal power saving profile so they don’t fit as well into the default power saving settings. [If your environment requires that you favor power savings over performance change this statement and be aware of the impact.]
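A quick way to verify the active power plan on a server is shown below (the PowerShell alternative uses the standard Win32_PowerPlan WMI class):

powercfg /getactivescheme     # expect the High performance plan on a dedicated SQL Server box

Get-WmiObject -Namespace root\cimv2\power -Class Win32_PowerPlan |
    Where-Object { $_.IsActive } | Select-Object ElementName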

·         You should double check that your virus scanner is certified for Windows 2008 R2. Older scanners use TDI and you need WFP models to work properly on the newer OSs. The older type of anti-virus scanners can cause serious IO problems.
981889 A Windows Filtering Platform (WFP) driver hotfix rollup package is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;en-US;981889
979278 Using two Windows Filtering Platform (WFP) drivers causes a computer to crash when the computer is running Windows Vista, Windows 7, or Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979278
979223 A nonpaged pool memory leak occurs when you use a WFP callout driver in Windows Vista, Windows 7, Windows Server 2008, or in Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979223
976759 WFP drivers may cause a failure to disconnect the RDP connection to a multiprocessor computer that is running Windows Vista, Windows Server 2008, windows 7 or Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;976759
Windows Filtering Platform
http://www.microsoft.com/whdc/device/network/WFP.mspx

·         Virus scanners and spyware detection should not scan SQL Server data and log files (usually mdf/ldf/ndf extensions) and other SQL related files because the scanning significantly degrades performance. [Note that this is a tradeoff with security and you must decide on performance vs. security based on your own security guidelines.]
REASON: Performance, smoother setup. See 309422 Guidelines for choosing antivirus software to run on the computers that are running SQL Server
http://support.microsoft.com/default.aspx?scid=kb;EN-US;309422

·         Firmware, BIOS, network adapter drivers, storport drivers, etc. will be at their most recent, stable versions before the server is released to the DBAs.
REASON: There are common SQL Server performance, usability, and supportability problems caused by older firmware, BIOS, network adapter drivers, etc.

·         For Windows 2008 and Windows 2008 R2 you can download a Windows storport enhancement (packaged as a hotfix). This enhancement can lead to faster root cause analysis for slow IO issues. Once you apply this Windows hotfix you can use Event Tracing for Windows (ETW) via perfmon or xperf to capture more detailed IO information that you can share with your storage team.  http://blogs.msdn.com/b/cindygross/archive/2010/07/13/windows-storport-enhancement-to-help-troubleshoot-io-issues.aspx

·         Do not install SQL Server on a domain controller (DC).
REASON:  A busy DC can take resources away from SQL Server. There are also negative security implications from installing SQL Server on a DC.

·         Grant SE_MANAGE_VOLUME_NAME to the SQL Server group to allow instant file initialization of data (but not log) files. There is a small security risk associated with this but it can greatly improve the performance of CREATE/ALTER data (but not log) files. [Decide as a company whether this performance enhancement is worth the small risk] http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx

·         Critical updates for Windows will be tested and applied ASAP after their release.
REASON: Security that affects Windows often affects SQL Server as well.

·         Resource intensive screensavers will be disabled and replaced with low resource consumption security to lock the consoles.
REASON: Performance – Resource intensive screen savers can steal resources from SQL Server.

·         Files will be secured: All copies of the data and log files as well as all copies of the backup files will be secured with access given only to those documented in the SQL Server Disaster Recovery plan.
REASON: Data and log files can be copied and attached to another instance of SQL Server, thereby exposing the information to the sysadmins of the new instance. Therefore access to these files must be very limited. However enough access must be granted to allow for recovery.

·         EFS: SQL Server will not be installed on disk that is encrypted with EFS.
REASON: 922121 You may experience decreased performance in some features of SQL Server 2005 when you use EFS to encrypt database files
http://support.microsoft.com/default.aspx?scid=kb;EN-US;922121

Storage Specifications (often configured by a Windows and/or storage team)

·         Battery backup must be enabled for all controllers or storage media which do write caching.
REASON: This is required by the WAL protocol to ensure stable media for SQL Server. See
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqliobasics.mspx#EYCAE

·         For SQL Server disks, performance is more important than conserving space. This means there may be what would be considered “wasted space” on a file server and that the overall cost per stored MB will be higher for a database system than for a file server. [This is a general guideline, if your environment prefers costs savings and space usage maximization over performance change this statement.]
REASON: High performance is generally a major requirement of a database system, and is much more important than on most file systems. Higher performance requires that disk be laid out, configured, and managed in particular ways.

  • Disk alignment must be done to a multiple of 64KB. Some vendors may express a preference for a particular value, but most mainstream hardware vendors have agreed that 1024KB is acceptable. That is the default for Windows 2008+. If you use dynamic disks it is difficult to see the alignment from Windows.
    REASON: If the disk is not aligned, performance can suffer as much as 30-40% because some read/write activity may be to/from two blocks instead of one. See 929491 Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000 http://support.microsoft.com/default.aspx?scid=kb;EN-US;929491 and Disk Partition Alignment Best Practices for SQL Server http://msdn.microsoft.com/en-us/library/dd758814.aspx
  • Disk allocation unit should be 64KB for SQL Server boxes.
    REASON: See
    Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx

NTFS Allocation Unit Size

When formatting the partition that will be used for SQL Server data files, it is recommended that you use a 64-KB allocation unit size for data, logs, and tempdb. Be aware however, that using allocation unit sizes greater than 4 KB results in the inability to use NTFS compression on the volume. SQL Server, although it is not recommended that you use this, does support read-only data on compressed volumes.
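Both settings can be sanity-checked from PowerShell before the server is released (standard WMI classes; SAN virtualization layers such as thin provisioning may hide the true physical values):

# Partition starting offsets should be a multiple of 64KB (65536 bytes)
Get-WmiObject Win32_DiskPartition | Select-Object Name, Index, StartingOffset

# NTFS allocation unit (cluster) size per volume; 65536 = 64KB
Get-WmiObject Win32_Volume -Filter "FileSystem='NTFS'" | Select-Object Name, BlockSize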

·         Drive Compression: Drives will not be compressed.
REASON: Compression has a big negative performance impact on SQL Server.

·         NTFS file system will be used instead of FAT or Raw partitions.
REASON: NTFS allows features such as database snapshots, online DBCC checks, instant file initialization, mount points, and additional security. It has larger file size limits (16 exabytes) than FAT (4 GBs). Raw partitions limit your recoverability options.

·         Often you will need one or more of these to achieve optimal performance for a database [Decide which of these you will deploy for each tier of storage and whether each can be requested by a DBA at server configuration time.]

1.       HBA queue depth for SQL Server is often best at 64 or 128; testing will determine the optimal value.
REASON: See Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx
HBA Queue Depth Settings

When configuring HBAs on the host, ensure that the Queue Depth is set to an optimal value. SQL Server applications are generally I/O-intensive, with many concurrent outstanding I/O requests. As a result, the default values for Queue Depth on HBAs are usually not high enough to support optimal performance. Currently the default value for the major HBA vendors is in the range of 8 to 32.

In our SQL Server testing, we have seen substantial gains in I/O performance when increasing this to 64 or even higher. It is worth noting that in these tests SQL Server was usually the only application using the storage array. It is important to discuss with your storage administrator the appropriate values for this setting, as the setting may affect other applications in sharing the same storage environment. When Queue Depth is set too low, a common symptom is increasing latency and less-than-expected throughput given the bandwidth between host/storage and the number of spindles in a particular configuration.

2.       RAID 10 or its equivalent will be used for the highest performance and best recoverability. Read-only data (no updates from users, replication, batch jobs, or anything else) can see acceptable performance on RAID 5. RAID 5 systems will have slower write performance and less recoverability but might be allowed for lower tiered systems with a signoff that high performance is not guaranteed.
REASON: RAID 10 is the fastest disk for SQL Server data and logs. It also provides the best recoverability options.

o   See Physical Database Storage Design http://www.microsoft.com/technet/prodtechnol/sql/2005/physdbstor.mspx
“For excellent performance and high reliability of both read and write data patterns, use RAID10.”

o   “RAID10 (stripe of mirrors): RAID10 is essentially many sets of RAID1 or mirrored drives in a RAID0 configuration. This configuration combines the best attributes of striping and mirroring: high performance and good fault tolerance. For these reasons, we recommend using this RAID level. However, the high performance and reliability level is the trade-off for storage capacity.”

o   RAID 10 is recommended for “Data requiring high performance for both read and write and excellent reliability while trading off storage efficiency and cost.“

3.       Follow hardware vendor recommendations for configuring the storage for a database, often this is very different than configuring for other non-database systems.

4.       Keep the physical disks no more than about 80% full (avoid full stroking, get closer to short stroking). Some SAN configurations may make this difficult to determine from Windows with concepts such as thin provisioning.

5.       Use multiple HBA controllers with a high throughput capacity. The same applies for other components such as switch ports, NICs, Fibre Channel array ports, storage array service processors, etc.

6.       Favor (battery backed) write cache over read cache for an OLTP system. Often 80/20 or 90/10 in favor of writes is beneficial. It is relatively easy for a busy SQL Server to flood the cache.

1.       Log writes have the lowest allowable latency of any SQL activity on an OLTP system.

2.       Write cache can help absorb checkpoint bursts as they write data to the disks.

3.       Maintenance operations can be write intensive and long running.

4.       SQL Server’s internal data organization rarely matches the physical layout of the data on disk so IO subsystem level read ahead through the read cache is rarely effective for a database.

7.       Performance will be more predictable, IO troubleshooting will be easier, and in many cases overall performance can be higher if SQL Server is isolated to an IO path not shared with other systems. If it does share with others (which is very common), it is better to share with other databases than with file servers or other systems that have different needs.

·         Data and log files will not go on the same drive (they can have the same drive letter if on different mount points). SANs often hide the physical layer behind the drive letter/mount point by mixing data on the back end but it is still important to keep them separate in case you later move to totally separate IO paths for each.

o   Where cost/benefit analysis allows, each database’s transaction log file(s) will get a separate drive.  For maximum performance this would be a truly separate IO path.
REASON: Since writes to the transaction log are sequential (even if there are multiple log files for a given database, only one is written to at a time), if multiple databases’ log files, or a mix of data and log files, share one drive the read head moves between them and performance decreases.

o   Where cost/benefit analysis allows it, multiple, separate IO paths may be requested for each data file.
REASON: The more drives you can use for data files, the more the IO can be spread out to increase performance. For certain types of activities guaranteeing that the IO does not overlap for two different data files can be advantageous.

·         Before putting each SQL box into production, run SQLIO under various loads, gather data to see if the IO subsystem is likely to be capable of the expected load. If you have numbers for a typical and current peak load, we can use those numbers and then add some to it (perhaps test things like 150% of expected peak batch cycle).

·         Before putting each SQL box into production, run SQLIOSim to test for data correctness.

·         Disk IO Subsystem Performance Requirements:

o   Log writes will not exceed [put your values here] ms per write under the expected peak load (typical is 1-5ms for log writes)

o   Non-log writes will not exceed [put your values here] ms per read under the expected peak load (typical is 1-20ms for OLTP and 25-30ms for warehouses)

o   Reads will not exceed [put your values here] ms per read under the expected peak load (typical is 5-20ms for OLTP and 25-30ms for warehouses)

o   These are general standards; some systems may require faster IO subsystems and some may be ok with slower IO subsystems.

o   Measures of read and write latency come from the Performance Monitor counters Avg. Disk sec/Read and Avg. Disk sec/Write and/or from SQLIO, as these are the speeds seen by SQL Server.
REASON: Databases require fast disk access in order to deliver high performance. Many DBA and application developer hours can be spent troubleshooting performance problems that are later tracked to slow IO.
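A quick PowerShell spot check of those latencies during a representative load (counter values are in seconds, so 0.005 = 5ms):

Get-Counter -Counter '\LogicalDisk(*)\Avg. Disk sec/Read','\LogicalDisk(*)\Avg. Disk sec/Write' `
    -SampleInterval 5 -MaxSamples 12    # roughly one minute of samples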

 

Network Specifications

·         SNP/TCP Chimney settings will depend on whether your NIC vendor supports it. If you have NICs/drivers that support it, turn it on. Otherwise disable it to avoid known problems with SQL Server and other products.
REASON: Performance and usability. When TCP Chimney is enabled on a NIC that doesn’t support it, you will often see failed connectivity to SQL Server and/or dropped packets and connections that affect SQL Server. See Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008 http://support.microsoft.com/kb/951037  and 942861 Error message when an application connects to SQL Server on a server that is running Windows Server 2003: “General Network error,” “Communication link failure,” or “A transport-level error” http://support.microsoft.com/default.aspx?scid=kb;EN-US;942861
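On Windows Server 2008/2008 R2 the setting can be inspected and changed from an elevated prompt (disable shown; use chimney=enabled instead if your NICs and drivers fully support it):

netsh int tcp show global                    # view the current offload settings
netsh int tcp set global chimney=disabled    # or chimney=enabled for NICs/drivers that support it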

·         Put a firewall between your server and the internet. Block TCP port 1433, UDP port 1434, and static ports used by named instances on the perimeter firewalls but not necessarily on individual internal servers. Be careful which ports you block in the other firewalls, SQL Server will use various ports to communicate on.
REASON: Security – Hackers may try to use well known ports to attack SQL Server.

·         Open ports used by SQL Server in the individual server firewalls. [If you define a specific port range for all instances include it here.]
REASON: The DBAs have defined this range of ports as what each SQL Server uses for incoming connections.
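As an illustration, a single rule for the default instance port can be added on an individual server like this (the rule name and port are placeholders; use your company’s defined port range):

netsh advfirewall firewall add rule name="SQL Server default instance" dir=in action=allow protocol=TCP localport=1433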

Windows Cluster

·         Identical hardware: While Windows 2008 clusters are not required to be identical, to have a greater chance of predictability no matter which node owns each SQL Server group it is recommended that they be configured as close to the same as possible.

·         Windows policies and rights: Windows policies and rights should be the same on all nodes.
REASON: The behavior of SQL Server must be the same on all nodes. Policies can change SQL Server behavior.

·         Mount points: Do not install SQL Server 2000 on any Windows cluster with mount points.  The mount points must have an associated drive letter and must be cluster resources in the group where SQL Server will reside. SQL Server must “depend on” all mount points that it uses.
REASON: SQL Server 2005+ supports mount points but SQL Server 2000 setup, including service packs and hotfixes, will fail when it tries to enumerate the mount points, even if they are not in the SQL Server 2000 group. For 2005+ instances, the mount points must be in the SQL Server group in order for SQL Server to access them.

·         The Cluster service account must be a login in the SQL Server instance and a simple user in the master database, but should NOT be a sysadmin role member.
REASON: Avoid elevated privileges.

·         MS DTC: Choose a consistent standard for how you configure DTC. On Windows 2008+ clusters you can have more than one DTC per cluster and there are pros/cons to various configurations of how SQL Server uses one or more of those DTCs. http://blogs.msdn.com/cindygross/archive/2009/02/22/how-to-configure-dtc-for-sql-server-in-a-windows-2008-cluster.aspx DTC must be enabled for network access. If you choose to put DTC in a group with SQL Server and choose to have a DTC failure cause the group to fail, you should be aware that DTC can cause a SQL Server failover. This may be appropriate in some environments.

·         Cluster Group Names will not contain any special characters such as <, >, ‘, “, &
REASON: Special characters in any group name may cause SQL Server setup to fail.

·         NIC names will not have any special characters or trailing spaces.
REASON: Special characters in any network name may cause SQL Server setup to fail.

·         Auto start must NOT be on for any clustered resource in a Windows cluster.
REASON: The cluster administrator needs to bring the resources online (start the services). If Windows startup has already started the service, the cluster service cannot bring it online, which results in errors.

·         Use Service SIDs in a Windows cluster. If you choose not to use Service SIDs, create unique domain groups: each service installed with SQL Server needs its own domain group unless you accept the default of Service SIDs. The following naming standard will be used: [put your company standard here. Examples: myserver1_instance1_SQLServer, myserver1_instance1_SQLAgent]

http://technet.microsoft.com/en-us/library/ms143504.aspx

Best practices that you can use to set up domain groups and solutions to problems that may occur when you set up a domain group when you install a SQL Server 2005 failover cluster http://support.microsoft.com/default.aspx?scid=kb;EN-US;915846
REASON: On a cluster domain groups are used to manage permissions for the SQL Server components. Each service needs a unique group to reduce the attack surface.
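
Tying back to the Cluster service account item above, here is a minimal T-SQL sketch, assuming a hypothetical account name CONTOSO\ClusterSvc (substitute your own domain account and naming standard):

-- Create the login and a plain user in master; deliberately no server role
-- membership is granted because the account should NOT be a sysadmin.
USE [master];
GO
CREATE LOGIN [CONTOSO\ClusterSvc] FROM WINDOWS;
GO
CREATE USER [ClusterSvc] FOR LOGIN [CONTOSO\ClusterSvc];
GO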

Remote Control

·         Remote control to the server will only be done when absolutely necessary. No SQL Server tools will be opened on the production server itself unless there is no other way to access the server.
REASON: It adds overhead to the server and can cause performance problems. Most access will be done from client tools installed on desktops; the overhead of the GUI interfaces is not acceptable on a production server. Some SQL Server client tools are not available on 64-bit systems, and all servers will be x64 going forward.

Hardware Specifications

·         Expected Life: Servers are spec’d with an expectation of being in service for up to [put your policy here, 3-4 is common] years unless otherwise stated.
REASON: We need to know in advance how long the hardware is expected to stay in service so we can predict the resources needed for that time period. An uncertainty factor will be considered as well, so systems with high uncertainty in the predictions may need more expansion opportunity (have the ability to add more hardware resources as needed).

·         Architecture: All new servers will be based on x64 hardware with an x64 version/edition of Windows.
REASON: 64-bit systems allow much more memory to be used than 32-bit systems.

Cindy Gross, Microsoft Dedicated Support Engineer for SQL Server and Microsoft Certified Master: SQL Server 2008


4 Comments

Compilation of SQL Server TempDB IO Best Practices

It is important to optimize TempDB for good performance. In particular, I am focusing on how to allocate files.

 

TempDB is a unique database in several ways. The ones most relevant to this discussion are:

·         It is often one of the busiest databases on an instance, which means the performance of TempDB is critical to your instance’s overall performance (a quick space-usage check follows this list).

·         It is recreated as a copy of model each time SQL Server starts, taking all the properties of model except for the location, number, and size of its data and log files.

·         TempDB has a very high rate of create/drop object activity. This means the system metadata related to object creation/deletion is heavily used.

·         It has slightly different logging and latching behavior than user databases.
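
One hedged way to see how that activity translates into space usage is to sample the file space DMV for TempDB; the split between user objects, internal objects, and the version store also helps when deciding how to pre-size the files discussed below:

-- Approximate TempDB space usage in MB (8 KB pages * 8 / 1024).
SELECT SUM(user_object_reserved_page_count)     * 8 / 1024 AS user_objects_mb,
       SUM(internal_object_reserved_page_count) * 8 / 1024 AS internal_objects_mb,
       SUM(version_store_reserved_page_count)   * 8 / 1024 AS version_store_mb,
       SUM(unallocated_extent_page_count)       * 8 / 1024 AS free_mb
FROM tempdb.sys.dm_db_file_space_usage;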

 

General recommendations:

·         Pre-size TempDB appropriately (see the sizing sketch after this list). Leave autogrow on with instant file initialization enabled, but try to configure the database so that it never hits an autogrow event, and make sure the autogrow increment is appropriate.

·         Follow general IO recommendations for fast IO.

·         If your TempDB experiences allocation/metadata contention (waitresource = 2:1:1, the PFS page, or 2:1:3, the SGAM page), split the data onto multiple files (see the contention check after this list). Generally you will want somewhere between 1/4 and 1 file per physical core. If you don’t want to wait to see whether contention occurs, start with around 1/4 to 1/2 as many data files as CPUs, up to about 8 files; if you think you need more than 8 files, test first to see what the impact is. For example, if you have 8 physical CPUs you might start with 2-4 data files and monitor for contention.

·         All TempDB data files should be of equal size.

·         As with any database, your TempDB performance may improve if you spread it out over multiple drives. This only helps if each drive or mount point is truly a separate IO path. Whether each TempDB will have a measurable improvement from using multiple drives depends on the specific system.

·         In general you only need one log file. If you need to have multiple log files because you don’t have enough disk space on one drive that is fine, but there is no direct benefit from having the log on multiple files or drives.

·         On SQL Server 2000, and more rarely on SQL Server 2005 or later, you may want to enable trace flag 1118 (startup parameter -T1118).

·         Avoid shrinking TempDB (or any database) files unless you are very certain you will never need the space again.
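
To check for the allocation/metadata contention mentioned above, one rough approach is to sample the waiting tasks DMV periodically while the workload is running; repeated PAGELATCH waits on 2:1:1 (PFS) or 2:1:3 (SGAM) suggest adding data files:

-- Tasks currently waiting on the TempDB PFS or SGAM allocation pages
-- (database_id 2, file 1, page 1 or 3). Run repeatedly under load.
SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
  AND resource_description IN ('2:1:1', '2:1:3');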

 
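Pulling several of these recommendations together, here is an illustrative sketch for pre-sizing TempDB and adding a second equal-size data file. The file names, sizes, growth increments, and the T:\TempDB path are placeholders, not recommendations; choose values based on your cores, contention, and storage layout:

-- Pre-size the existing data and log files and set a sensible growth increment;
-- autogrow stays on as a safety net but should rarely fire.
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 512MB);

ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, SIZE = 2048MB, FILEGROWTH = 512MB);

-- Add a second data file of the same size; repeat as needed up to the file
-- counts discussed above.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf',
          SIZE = 4096MB, FILEGROWTH = 512MB);

-- If allocation contention persists, trace flag 1118 can be enabled with the
-- -T1118 startup parameter or DBCC TRACEON(1118, -1); see the references below.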

References:

·         Working with tempdb in SQL Server 2005 http://technet.microsoft.com/en-us/library/cc966545.aspx

o   “Divide tempdb into multiple data files of equal size. These multiple files don’t necessarily be on different disks/spindles unless you are also encountering I/O bottlenecks as well. The general recommendation is to have one file per CPU because only one thread is active per CPU at one time.”

o   “Having too many files increases the cost of file switching, requires more IAM pages, and increases the manageability overhead.”

·         How many files should a database have? – Part 1: OLAP workloads http://sqlcat.com/technicalnotes/archive/2008/03/07/How-many-files-should-a-database-have-part-1-olap-workloads.aspx

o   If you have too many files you can end up with smaller IO block sizes and decreased performance under extremely heavy load.

o   If you have too few files you can end up with decreased performance due to GAM/SGAM contention (generally the problem you see in TempDB) or PFS contention (extremely heavy inserts).

o   The more files you have per database the longer it takes to do database recovery (bringing a database online, such as during SQL Server startup). This can become a problem with hundreds of files.

·         SQL Server Urban Legends Discussed http://blogs.msdn.com/psssql/archive/2007/02/21/sql-server-urban-legends-discussed.aspx

o   ” SQL Server uses asynchronous I/O allowing any worker to issue an I/O requests regardless of the number and size of the database files or what scheduler is involved.”

o   ” Tempdb is the database with the highest level of create and drop actions and under high stress the allocation pages, syscolumns and sysobjects can become bottlenecks.   SQL Server 2005 reduces contention with the ‘cached temp table’ feature and allocation contention skip ahead actions.”

·         Concurrency enhancements for the tempdb database http://support.microsoft.com/kb/328551

o   Note that this was originally written for SQL Server 2000 (the “Applies to” section lists only 2000), and there are some tweaks/considerations for later versions that are not completely covered in this article. For example, -T1118 is not only much less necessary on SQL Server 2005+, it can in some cases cause problems.

·         FIX: Blocking and performance problems may occur when you enable trace flag 1118 in SQL Server 2005 if the temporary table creation workload is high http://support.microsoft.com/default.aspx?scid=kb;EN-US;936185

o   If you have SP2 based CU2 or later you will not see the problems described in this article. Also, on SP2 based CU2 or higher you are much less likely to even need -T1118 on a heavily used TempDB.

o   ” This hotfix significantly reduces the need to force uniform allocations by using trace flag 1118. If you apply the fix and are still encountering TEMPDB contention, consider also turning on trace flag 1118.”

·         Misconceptions around TF 1118 http://sqlskills.com/BLOGS/PAUL/post/Misconceptions-around-TF-1118.aspx

o   ” turn on TF1118, which makes the first 8 data pages in the temp table come from a dedicated extent “

o   “Instead of a 1-1 mapping between processor cores and tempdb data files (*IF* there’s latch contention), now you don’t need so many – so the recommendation from the SQL team is the number of data files should be 1/4 to 1/2 the number of processor cores (again, only *IF* you have latch contention). The SQL CAT team has also found that in 2005 and 2008, there’s usually no gain from having more than 8 tempdb data files, even for systems with larger numbers of processor cores. Warning: generalization – your mileage may vary – don’t post a comment saying this is wrong because your system benefits from 12 data files. It’s a generalization, to which there are always exceptions.”

·         Storage Top 10 Best Practices http://sqlcat.com/top10lists/archive/2007/11/21/storage-top-10-best-practices.aspx  

o   “Make sure to move TEMPDB to adequate storage and pre-size after installing SQL Server. “

o   “Performance may benefit if TEMPDB is placed on RAID 1+0 (dependent on TEMPDB usage). “

o   “This is especially true for TEMPDB where the recommendation is 1 data file per CPU. “

o   “Dual core counts as 2 CPUs; logical procs (hyperthreading) do not. “

o   “Data files should be of equal size – SQL Server uses a proportional fill algorithm that favors allocations in files with more free space.”

o   “Pre-size data and log files. “

o   “Do not rely on AUTOGROW, instead manage the growth of these files manually. You may leave AUTOGROW ON for safety reasons, but you should proactively manage the growth of the data files. “

·         Optimizing tempdb Performance http://msdn.microsoft.com/en-us/library/ms175527.aspx