Befriending Dragons

Transform Tech with Anti-bullying Cultures



PowerShell works for Amazon AWS S3 too!

More and more we have to work with data in many different locations. This week I got to work with S3 files that were moving to Azure blob storage. I was surprised to find that Amazon has published AWS cmdlets for PowerShell. It took me a little while to figure out the format and terminology so I’ll try to explain that and compare/contrast how we interact with storage in AWS and Azure. Today we will cover viewing the files.

Configure PowerShell

Well first, let’s get things set up. Install the Azure and AWS cmdlets for PowerShell. These examples will pass keys for everything so there’s no need to configure PowerShell with certificates to access the clouds.

The first time you use PowerShell after installing the AWS cmdlets (depending on your PowerShell version) you may need to run these commands:

Add-Type -Path "C:\Program Files (x86)\AWS SDK for .NET\bin\Net45\AWSSDK.dll"
Import-Module "C:\Program Files (x86)\AWS Tools\PowerShell\AWSPowerShell\AWSPowerShell.psd1"

Connecting to Storage

S3

We'll start with AWS S3. Each connection to S3 storage requires an AWS region (unless you use the default "US Standard"), an access id (unique identifier), a secret key, and a bucket. You are storing data within a specific region on an access point in a managed grouping called a bucket. The access id in S3 is equivalent to a storage account name in Azure. A bucket in S3 is roughly equivalent to a container in Azure.

$S3Bucket = "MyBucket"
$S3Key = "SecretKeyValue"
$S3AccessID = "AccessKey"
$AWSregion = "us-west-2"

Next let’s use those values to make a new client connection to S3. You define a configuration object that points to the full URL for the region. Then you pass that configuration object, the access id, and the secret key to a function that creates a client connection to S3. This sets the context for the entire session and the context does not have to be passed to the individual commands. Note that the URL changes depending on the region, for example https://s3-us-west-2.amazonaws.com

Set-DefaultAWSRegion $AWSregion # auto-stored to $StoredAWSRegion
$AWSserviceURL = "https://s3-$AWSregion.amazonaws.com"
$config = New-Object Amazon.S3.AmazonS3Config
$config.ServiceURL = $AWSserviceURL
$S3Client = [Amazon.AWSClientFactory]::CreateAmazonS3Client($S3AccessID, $S3Key, $config)

Azure

Let’s compare that to how we list files in Azure blob storage. First you specify the location and credentials. The region is implied because the storage account name is unique across all regions. The container and secret key value are similar in meaning.

$storageAccountName = "MyStorageAccountName"
$storageaccountkey = "SecretKeyValue"
$containerName = "MyBucket"

Then you define the storage context, which holds the location and credentials for the storage account. Alternatively, you could set a default storage context for the session or for a particular profile's connection to a given subscription.

$AzureContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountkey
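
If you prefer the subscription-level default mentioned above, here is a minimal sketch with the classic Azure PowerShell cmdlets (the subscription name below is a placeholder, not a value from this post):

# Placeholder subscription name; pick one returned by Get-AzureSubscription
Select-AzureSubscription -SubscriptionName "MySubscription"
Set-AzureSubscription -SubscriptionName "MySubscription" -CurrentStorageAccountName $storageAccountName
# Storage cmdlets can then omit -Context and use the subscription's current storage account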

View the Files

S3

Now you can get basic metadata about the S3 bucket:
Get-S3Bucket $S3Bucket
Get-S3BucketLocation $S3Bucket

Next let’s list the files in that bucket.

Get-S3Object -BucketName $S3Bucket

You can populate an array with the list; in this example I pulled out just the name (key) of each file:
$S3FileList = (Get-S3Object -BucketName $S3Bucket).key

And you can filter the result set:
$S3FileList = (Get-S3Object -BucketName $S3Bucket | Where-Object {$_.LastModified -lt "2/17/2015"}).Key
$S3FileList = (Get-S3Object -BucketName $S3Bucket | Where-Object {$_.Key -like "*42*"}).Key

Azure

For Azure we can do similar operations to view the files. This example lists all files in the container:

Get-AzureStorageBlob -Context $AzureContext -Container $containerName

You can also populate an array with the list:

$AzureList = Get-AzureStorageBlob -Context $AzureContext -Container $containerName

Or pull out just a single property:

(Get-AzureStorageBlob -Context $AzureContext -Container $containerName).Name

Or list just blobs that match a wildcard value:

Get-AzureStorageBlob -Context $AzureContext -Container $containerName -Blob *42*
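
And, mirroring the S3 date filter above, you can filter on each blob's LastModified property. A minimal sketch (LastModified is a DateTimeOffset, so this compares its DateTime value):

$AzureFileList = (Get-AzureStorageBlob -Context $AzureContext -Container $containerName |
    Where-Object {$_.LastModified.DateTime -lt [datetime]"2015-02-17"}).Name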

My Work Here is Done

This intro to PowerShell for S3 opens up the door to many possibilities – data migrations, multi-cloud hybrid solutions, and whatever your imagination can conjure up! Today we reviewed how to view files; I'll cover more in future posts. Happy PowerShelling!
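
As a taste of the migration scenario, here is a rough sketch (using only the variables built above) that compares the S3 keys to the Azure blob names to see what still needs to be copied:

# Anything that appears only on the S3 side (<=) has not landed in the Azure container yet
$AzureNameList = (Get-AzureStorageBlob -Context $AzureContext -Container $containerName).Name
Compare-Object -ReferenceObject $S3FileList -DifferenceObject $AzureNameList |
    Where-Object {$_.SideIndicator -eq "<="} |
    Select-Object -ExpandProperty InputObject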

Tip

When you open "Microsoft Azure PowerShell", type ISE in the window to launch the PowerShell Integrated Scripting Environment. It has IntelliSense, multiple script windows, and a cmdlet viewer.



Understanding WASB and Hadoop Storage in Azure

Yesterday we learned Why WASB Makes Hadoop on Azure So Very Cool. Now let’s dive deeper into Windows Azure storage and WASB. I’ll answer some of the common questions I get when people first try to understand how WASB is the same as and different from HDFS.

What is HDFS?

The Hadoop Distributed File System (HDFS) is one of the core Hadoop components; it is how Hadoop manages data and storage. At a high level, when you load a file into Hadoop the “name node” uses HDFS to chunk the file into blocks and it spreads those blocks of data across the worker nodes within the cluster. Each chunk of data is stored on multiple nodes (assuming the replication factor is set to > 1) for higher availability. The name node knows where each chunk of data is stored and that information is used by the job manager to allocate tasks and resources appropriately across nodes.

What is WASB?

Windows Azure Storage Blob (WASB) is an extension built on top of the HDFS APIs. The WASBS variation uses SSL certificates for improved security. It in many ways “is” HDFS. However, WASB creates a layer of abstraction that enables separation of storage. This separation is what enables your data to persist even when no clusters currently exist and enables multiple clusters plus other applications to access a single piece of data all at the same time. This increases functionality and flexibility while reducing costs and reducing the time from question to insight.

What is an Azure blob store, an Azure storage account, and an Azure container? For that matter, what is Azure again?

Azure is Microsoft’s cloud solution. A cloud is essentially a collection of host data centers that you don’t have to directly manage. You can request services from that cloud. For example, you can request virtual machines and storage, data services such as SQL Azure Database or HDInsight, or services such as Websites or Service Bus. In Azure you store blobs on containers within Azure storage accounts. You grant access to a storage account, you create collections at the container level, and you place blobs (files of any format) inside the containers. This illustration from Microsoft’s documentation helps to show the structure:

[Illustration: blobs are stored in containers, which live within an Azure storage account]

How do I manage and configure block/chunk size and the replication factor with WASB?

You don't. It's not generally necessary. The data is stored in the Azure storage accounts, remaining accessible to many applications at once. Each blob (file) is replicated 3x within the data center. If you choose to use geo-replication on your account you also get 3 copies of the data in another data center within the same region. The data is chunked and distributed to nodes when a job is run. If you need to change the chunk size for memory-related performance at run time, that is still an option. You can pass in any Hadoop configuration parameter setting when you create the cluster, or you can use the SET command for a given job.
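
As a rough sketch of the cluster-creation option, using the classic Azure HDInsight cmdlets of that era (the cluster name, size, location, and the fs.azure.block.size override below are illustrative assumptions, not values from this post):

# Build a cluster config, attach default storage, and pass a core-site override at create time
$clusterConfig = New-AzureHDInsightClusterConfig -ClusterSizeInNodes 4
$clusterConfig = Set-AzureHDInsightDefaultStorage -Config $clusterConfig -StorageAccountName "YOURStorageAccount.blob.core.windows.net" -StorageAccountKey "YOURActualStorageKeyValue" -StorageContainerName "YOURDefaultContainer"
$clusterConfig = Add-AzureHDInsightConfigValues -Config $clusterConfig -Core @{ "fs.azure.block.size" = "268435456" }
New-AzureHDInsightCluster -Config $clusterConfig -Name "MyCluster" -Location "West US" -Credential (Get-Credential)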

Isn’t one of the selling points of Hadoop that the data sits with the compute? How does that work with WASB?

Just like with any Hadoop system the data is loaded into memory on the individual nodes at compute time (when the job runs). The difference with WASB is that the data is loaded from the storage accounts instead of from local disks. Given the way Azure data center backbones are built the performance is generally the same or better than if you used disks locally attached to the VMs.

How do I load data to Hadoop on Azure?

You use any of the many Azure data loading methods. There isn’t really anything special about loading data that will be used for Hadoop. As with data used by any other application there are some guidelines around directory structures, optimal numbers of files, and internal format but that is independent of data loading. Some common examples are AZCopy, CloudXplorer and other storage explorers, and SQL Server Integration Services (SSIS).

And yes, I will blog about those guidelines but not here. 🙂

Can I have multiple Hadoop clusters pointing to one storage account?

Yes.

Can I have one Hadoop cluster pointing to multiple storage accounts?

Yes. Check!

See: Use Additional Storage Accounts with HDInsight Hive.

Can I have many Hadoop clusters pointing to multiple storage accounts?

Why, yes. Yes you can. Check!

Do I get to keep my data even if no Hadoop cluster currently exists?

What a fun day to say Yes. Check!

For a caveat see HDInsight: Hive Internal and External Tables Intro.

Is WASB available for any distribution of Hadoop other than HDInsight?

It is my pleasure to answer that with a resounding Yes. Check!

WASB is built into HDInsight (Microsoft’s Hadoop on Azure service) and is the default file system. WASB is also available in the Apache source code for Hadoop. Therefore when you install Hadoop, such as Hortonworks HDP or Cloudera EDH/CDH, on Azure VMs you can use WASB with some configuration changes to the cluster.

How do I manage files and directories?

Hive is the most common entry point for Hadoop jobs and with Hive you never point to a single file, you always point to a directory. If you are a stickler for details and want to point out that Azure doesn’t have directories, that’s technically true. However, Hadoop recognizes that a slash “/” is an indication of a directory. Therefore Hadoop treats the below Azure blob file as if it were AFile.txt in a directory structure of: SomeDirectory/ASubDirectory. But since you don’t access individual files in Hive you will reference either SomeDirectory or SomeDirectory/ASubDirectory.

Blob: wasb://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net/SomeDirectory/ASubDirectory/AFile.txt

You can add, remove, and modify files in the Azure blob store without regard to whether a Hadoop cluster exists. Each time a job runs it reads the data that currently exists in the directory(s) it references. Hadoop itself can also write to files.
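
You can see that prefix-as-directory behavior from PowerShell as well. A minimal sketch using the placeholder names from the blob URI above:

# A "directory" is just a shared name prefix on the blobs
$ctx = New-AzureStorageContext -StorageAccountName "YOURStorageAccount" -StorageAccountKey "YOURActualStorageKeyValue"
Get-AzureStorageBlob -Context $ctx -Container "YOURDefaultContainer" -Prefix "SomeDirectory/ASubDirectory/" |
    Select-Object Name, Length, LastModified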

What about ORCFile, Parquet, and AVRO?

They are specialized formats often used within Hadoop but rarely used outside of Hadoop. There are performance advantages to using those formats for “write once, read many” data inside Hadoop, but chances are high that you won't then be able to access the data without going through one of your Hadoop clusters.

Should I have lots of small files?

NO! No!  

The full why is too long to answer here. The short answer is to use files that are many multiples of the in-memory chunk size, in the GB or TB size range. Whenever possible use fewer, larger files instead of many small files. If necessary stitch the files together.
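
One quick way to spot a small-file problem is to look at blob counts and sizes for a directory prefix. A rough sketch with the Azure storage cmdlets (placeholder names again):

$ctx = New-AzureStorageContext -StorageAccountName "YOURStorageAccount" -StorageAccountKey "YOURActualStorageKeyValue"
Get-AzureStorageBlob -Context $ctx -Container "YOURDefaultContainer" -Prefix "SomeDirectory/" |
    Measure-Object -Property Length -Sum -Average -Minimum -Maximum
# Many blobs with a small average Length is the pattern to avoid; favor fewer, larger files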

That’s your storage lesson for today – please put your additional Hadoop on Azure storage questions in the comments or send me a tweet! Thanks for stopping by!

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com

http://blogs.msdn.com/b/cindygross/archive/2015/02/04/understanding-wasb-and-hadoop-storage-in-azure.aspx

http://www.nealanalytics.com/understanding-wasb-and-hadoop-storage-in-azure/




Why WASB Makes Hadoop on Azure So Very Cool

Data. It's all about the data. We want to make more data driven decisions. We want to keep more data so we can make better decisions. We want that data stored cheaply, easily accessible, and quickly ingested. Hadoop promises to help with all those things. However, when you deal with Hadoop on-premises you have a multi-step process to load the data. Azure and WASB to the rescue!

With a typical Hadoop installation you load your data to a staging location then you import it into the Hadoop Distributed File System (HDFS) within a single Hadoop cluster. That data is manipulated, massaged, and transformed. Then you may export some or all of the data back to a non-HDFS system (a SAN, a file share, a website).

What’s different in the cloud? With Azure you have Azure Blob Storage Accounts. Data can be stored there as blobs in any format. That data can be accessed by various applications – including Hadoop without first doing a separate load into HDFS! This is made possible because Microsoft used the public extensions available with HDFS to create the Windows Azure Storage Blobs (WASB) interface between Hadoop and the Azure blob storage. This WASB code is available for any distributor of Hadoop in the Apache source code and it is the default storage system in HDInsight – Microsoft’s Hadoop on Azure PaaS offering. It is also available for Hortonworks HDP on Azure VMs or Cloudera EDH/CDH on Azure VMs with some manual configuration steps.

With WASB you load your data to Azure blobs at any time – whether Hadoop clusters currently exist or not. That way you aren't paying for Hadoop compute time simply to load data. You spin up one or more clusters, point them at the data sets (yes, multiple clusters pointing to the same data!), and run your Hadoop jobs. When you don't need the system for a while you take down your Hadoop cluster(s) and the data is still there. At any point, whether one or more Hadoop clusters are accessing the data or not, other applications can still access and manipulate the data. For example, you could have data sitting on an Azure storage account that is being added to by a SQL Server Integration Services (SSIS) job. At the same time someone is using Power Query to load that data into PowerPivot while a website inserts new data to the same location. Meanwhile your R&D department can be running highly intensive jobs that require a large cluster up for many days or weeks at a time, and your sales team can have a separate, smaller cluster that's up for a few hours a day – all pointing at the same data!

With this separation of storage and compute you have simplified your data accessibility, reduced data movement and copies, and reduced the time it takes to have your data available! That all adds up to lower costs and a faster, more data-driven time to insight.

Cindy Gross – Neal Analytics: Big Data and Cloud Technical Fellow  
@SQLCindy | @NealAnalytics | CindyG@NealAnalytics.com | http://smallbitesofbigdata.com

http://www.nealanalytics.com/why-wasb-makes-hadoop-on-azure-so-very-cool/

http://blogs.msdn.com/b/cindygross/archive/2015/02/03/why-wasb-makes-hadoop-on-azure-so-very-cool.aspx



Access Azure Blob Stores from HDInsight

Small Bites of Big Data

Edit Mar 6, 2014: This is no longer necessary for HDInsight – you specify the storage accounts when you create the cluster and the rest happens auto-magically. See http://blogs.msdn.com/b/cindygross/archive/2013/11/25/your-first-hdinsight-cluster-step-by-step.aspx or http://blogs.msdn.com/b/cindygross/archive/2013/12/06/sample-powershell-script-hdinsight-custom-create.aspx.

One of the great enhancements in Microsoft’s HDInsight distribution of Hadoop is the ability to store and access Hadoop data on an Azure Blob Store. We do this via the HDFS API extension called Azure Storage Vault (ASV). This allows you to persist data even after you spin down an HDInsight cluster and to make that data available across multiple programs or clusters from persistent storage. Blob stores can be replicated for redundancy and are highly available. When you need to access the data from Hadoop you point your cluster at the existing data and the data persists even after the cluster is spun down.

Azure Blob Storage

Let’s start with how your data is stored. A storage account is created in the Azure portal and has access keys associated with it. All access to your Azure blob data is done via storage accounts. Within a storage account you need to create at least one container, though you can have many. Files (blobs) are put in the container(s). For more information on how to create and use storage accounts and containers see: http://www.windowsazure.com/en-us/develop/net/how-to-guides/blob-storage/. Any storage accounts associated with HDInsight should be in the same data center as the cluster and must not be in an affinity group.

You can create a container from the Azure portal or from any of the many Azure storage utilities available such as CloudXplorer. In the Azure portal you click on the Storage Account then go to the CONTAINERS tab. Next click on ADD CONTAINER at the very bottom of the screen. Enter a name for your container, choose the ACCESS property, and click on the checkmark.
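
If you prefer scripting to the portal, here is a small sketch of the same container creation with the Azure PowerShell storage cmdlets (account name, key, and container name are placeholders):

$ctx = New-AzureStorageContext -StorageAccountName "YOURStorageAccount" -StorageAccountKey "YOURActualStorageKeyValue"
# -Permission Off keeps the container private; Blob or Container makes it public
New-AzureStorageContainer -Name "data" -Permission Off -Context $ctx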


HDInsight Service Preview

When you create your HDInsight Service cluster on Azure you associate your cluster with an existing Azure storage account in the same data center. In the current interface the QUICK CREATE doesn’t allow you to choose a default container on that storage account so it creates a container with the same name as the cluster. If you choose CUSTOM CREATE you have the option to choose the default container from existing containers associated with the storage account you choose. This is all done in the Azure management portal: https://manage.windowsazure.com/.


You can then add additional storage accounts to the cluster by updating C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml on the head node. This is only necessary if those additional accounts have private containers (this is a property set in the Azure portal for each container within a storage account). Public containers and public blobs can be accessed without the id/key being stored in the configuration file. You choose the public/private setting when you create the container and can later edit it in the “Edit container metadata” dialog on the Azure portal.


The key storage properties in the default core-site.xml on HDInsight Service Preview are:

<property>
<name>fs.default.name</name>
<!– cluster variant –>
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>

<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>

To add another storage account you will need the Windows Azure storage account information from https://manage.windowsazure.com. Log in to your Azure subscription and pick storage from the left menu. Click on the account you want to use then at the very bottom click on the “MANAGE KEYS” button. Cut and paste the PRIMARY ACCESS KEY (you can use the secondary if you prefer) into the new property values we’ll discuss below.


Create a Remote Desktop (RDP) connection to the head node of your HDInsight Service cluster. You can do this by clicking on the CONNECT button at the bottom of the screen when your HDInsight Preview cluster is highlighted. You can choose to save the .RDP file and edit it before you connect (right click on the .RDP file in Explorer and choose Edit). You may want to enable access to your local drives from the head node via the “Local Resources” tab under the “More” button in the “Local devices and resources” section. Then go back to the General tab and save the settings. Connect to the head node (either choose Open after you click CONNECT or use the saved RDP).


On the head node make a copy of C:\apps\dist\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml in case you have to revert back to the original. Next open core-site.xml in Notepad or your favorite editor.

Add your 2nd Azure storage account by adding another property.

<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Repeat for each storage account you need to access from this cluster.

HDInsight Server Preview

If you have downloaded the on-premises HDInsight Server preview from http://microsoft.com/bigdata that gives you a single node “OneBox” install to test basic functionality. You can put it on your local machine, on a Hyper-V virtual machine, or in a Windows Azure IaaS virtual machine. You can also point this OneBox install to ASV. Using an IaaS VM in the same data center as your storage account will give you better performance, though the OneBox preview is meant purely for basic functional testing and not for high performance as it is limited to a single node. The steps are slightly different for on-premises as the installation directory and default properties in core-site.xml are different.

Make a backup copy of C:\Hadoop\hadoop-1.1.0-SNAPSHOT\conf\core-site.xml from your local installation (local could be on a VM).

Edit core-site.xml:

1) Change the default file system from local HDFS to remote ASV

<property>
<name>fs.default.name</name>
<!– cluster variant –>
<value>hdfs://localhost:8020</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

to:

<property>
<name>fs.default.name</name>
<!– cluster variant –>
<value>asv://YOURDefaultContainer@YOURStorageAccount.blob.core.windows.net</value>
<description>The name of the default file system.  Either the
literal string "local" or a host:port for NDFS.</description>
<final>true</final>
</property>

2) Add the namenode property (do not change any values)

<property>
<name>dfs.namenode.rpc-address</name>
<value>hdfs://namenodehost:9000</value>
</property>

3) Add the information that associates the key value with your default storage account

<property>
<name>fs.azure.account.key.YOURStorageAccount.blob.core.windows.net</name>
<value>YOURActualStorageKeyValue</value>
</property>

4) Add any additional storage accounts you plan to access

<property>
<name>fs.azure.account.key.YOUR_SECOND_StorageAccount.blob.core.windows.net</name>
<value>YOUR_SECOND_ActualStorageKeyValue</value>
</property>

Save core-site.xml.

Files

Upload one or more files to your container(s). You can use many methods for loading the data including Hadoop file system commands such as copyFromLocal or put, 3rd party tools like CloudXPlorer, JavaScript, or whatever method you find fits your needs. For example, I can upload all files in a data directory (for simplicity this sample refers to C: which is local to the head node) using the Hadoop put command:

hadoop fs -put c:\data asv://data@sqlcatwomanblog.blob.core.windows.net/

Or upload a single file:

hadoop fs -put c:\data\bacon.txt asv://data@sqlcatwomanblog.blob.core.windows.net/bacon.txt

To view the files in a linked non-default container or a public container use this syntax from a Hadoop Command Line prompt (fs=file system, ls=list):

hadoop fs -ls asv://data@sqlcatwomanblog.blob.core.windows.net/

Found 1 items
-rwxrwxrwx   1        124 2013-04-24 20:12 /bacon.txt

In this case the container data on the private storage account sqlcatwomanblog has one file called bacon.txt.

For the default container the syntax does not require the container and account information. Since the default storage is ASV rather than HDFS (even for HDInsight Server in this case because we changed it in core-site.xml) you can even leave out the ASV reference.

hadoop fs -ls asv:///bacon.txt
hadoop fs -ls /bacon.txt
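
Because ASV is just Azure blob storage, the same files can also be loaded and listed from outside the cluster with the Azure PowerShell storage cmdlets. A sketch using the account and container from the example above (the key value is a placeholder):

$ctx = New-AzureStorageContext -StorageAccountName "sqlcatwomanblog" -StorageAccountKey "YOURActualStorageKeyValue"
Set-AzureStorageBlobContent -File "c:\data\bacon.txt" -Container "data" -Blob "bacon.txt" -Context $ctx
Get-AzureStorageBlob -Context $ctx -Container "data"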

More Information

I hope you’ve enjoyed this small bite of big data! Look for more blog posts soon.

Note: the Preview, CTP, and TAP programs are available for a limited time. Details of the usage and the availability of the pre-release versions may change rapidly.




General Hardware/OS/Network Guidelines for a SQL Box

I have put together some general guidelines for how you want a server to be delivered to the DBA team for a new SQL Server install. You won't necessarily use all of them, but consider it a starting point for your SQL Server install standards. Places where it may be common to change the statements are in [brackets]. Always run the SQL Server Best Practices Analyzer or an equivalent tool to check for the latest recommendations before releasing the system to production. I'm sure some of you will disagree with some of the points for various reasons, but I've found them to be a good baseline for discussion and a great starting point for standards documents. I'm ok with that because I am very fond of saying “it depends”. 🙂

The below is specific to SQL Server 2008/2008 R2 on Windows Server 2008/2008 R2.

OS Specifications (things often controlled by a Windows team)

·         Power saving features: For a SQL Server box, if you want consistent, predictable, high performance you either need to fine-tune the power settings for each individual workload and/or time of day, or simply set the power plan to High performance. Databases are harder to fit into the normal power saving profile, so they don't do well with the default power saving settings. [If your environment requires that you favor power savings over performance change this statement and be aware of the impact.]
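
A quick way to set and verify this from an elevated command prompt or PowerShell session (SCHEME_MIN is the built-in alias for the High performance plan):

powercfg -setactive SCHEME_MIN   # switch to the High performance power plan
powercfg -getactivescheme        # confirm which plan is now active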

·         You should double check that your virus scanner is certified for Windows 2008 R2. Older scanners use the TDI model; you need WFP-based scanners to work properly on the newer OSes. The older type of anti-virus scanner can cause serious IO problems.
981889 A Windows Filtering Platform (WFP) driver hotfix rollup package is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;en-US;981889
979278 Using two Windows Filtering Platform (WFP) drivers causes a computer to crash when the computer is running Windows Vista, Windows 7, or Windows Server 2008
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979278
979223 A nonpaged pool memory leak occurs when you use a WFP callout driver in Windows Vista, Windows 7, Windows Server 2008, or in Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;979223
976759 WFP drivers may cause a failure to disconnect the RDP connection to a multiprocessor computer that is running Windows Vista, Windows Server 2008, windows 7 or Windows Server 2008 R2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;976759
Windows Filtering Platform
http://www.microsoft.com/whdc/device/network/WFP.mspx

·         Virus scanners and spyware detection should not scan SQL Server data and log files (usually mdf/ldf/ndf extensions) and other SQL related files because the scanning significantly degrades performance. [Note that this is a tradeoff with security and you must decide on performance vs. security based on your own security guidelines.]
REASON: Performance, smoother setup. See 309422 Guidelines for choosing antivirus software to run on the computers that are running SQL Server
http://support.microsoft.com/default.aspx?scid=kb;EN-US;309422

·         Firmware, BIOS, network adapter drivers, storport drivers, etc. will be at their most recent, stable versions before the server is released to the DBAs.
REASON: There are common SQL Server performance, usability, and supportability problems caused by older firmware, BIOS, network adapter drivers, etc.

·         For Windows 2008 and Windows 2008 R2 you can download a Windows storport enhancement (packaged as a hotfix). This enhancement can lead to faster root cause analysis for slow IO issues. Once you apply this Windows hotfix you can use Event Tracing for Windows (ETW) via perfmon or xperf to capture more detailed IO information that you can share with your storage team.  http://blogs.msdn.com/b/cindygross/archive/2010/07/13/windows-storport-enhancement-to-help-troubleshoot-io-issues.aspx

·         Do not install SQL Server on a domain controller (DC).
REASON:  A busy DC can take resources away from SQL Server. There are also negative security implications from installing SQL Server on a DC.

·         Grant SE_MANAGE_VOLUME_NAME to the SQL Server group to allow instant file initialization of data (but not log) files. There is a small security risk associated with this but it can greatly improve the performance of CREATE/ALTER data (but not log) files. [Decide as a company whether this performance enhancement is worth the small risk] http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx

·         Critical updates for Windows will be tested and applied ASAP after their release.
REASON: Security that affects Windows often affects SQL Server as well.

·         Resource intensive screen savers will be disabled and replaced with a low-resource method of locking the console.
REASON: Performance – Resource intensive screen savers can steal resources from SQL Server.

·         Files will be secured: All copies of the data and log files as well as all copies of the backup files will be secured with access given only to those documented in the SQL Server Disaster Recovery plan.
REASON: Data and log files can be copied and attached to another instance of SQL Server, thereby exposing the information to the sysadmins of the new instance. Therefore access to these files must be very limited. However enough access must be granted to allow for recovery.

·         EFS: SQL Server will not be installed on disk that is encrypted with EFS.
REASON: 922121 You may experience decreased performance in some features of SQL Server 2005 when you use EFS to encrypt database files
http://support.microsoft.com/default.aspx?scid=kb;EN-US;922121

Storage Specifications (often configured by a Windows and/or storage team)

·         Battery backup must be enabled for all controllers or storage media which do write caching.
REASON: This is required by the WAL protocol to ensure stable media for SQL Server. See
http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqliobasics.mspx#EYCAE

·         For SQL Server disks, performance is more important than conserving space. This means there may be what would be considered “wasted space” on a file server and that the overall cost per stored MB will be higher for a database system than for a file server. [This is a general guideline, if your environment prefers costs savings and space usage maximization over performance change this statement.]
REASON: High performance is generally a major requirement of a database system, and is much more important than on most file systems. Higher performance requires that disk be laid out, configured, and managed in particular ways.

  • Disk alignment must be done to a multiple of 64KB. Some vendors may express a preference for a particular value, but most mainstream hardware vendors have agreed that 1024KB is acceptable. That is the default for Windows 2008+. If you use dynamic disks it is difficult to see the alignment from Windows.
    REASON: If the disk is not aligned, performance can suffer as much as 30-40% because some read/write activity may be to/from two blocks instead of one. See 929491 Disk performance may be slower than expected when you use multiple disks in Windows Server 2003, in Windows XP, and in Windows 2000 http://support.microsoft.com/default.aspx?scid=kb;EN-US;929491 and Disk Partition Alignment Best Practices for SQL Server http://msdn.microsoft.com/en-us/library/dd758814.aspx
  • Disk allocation unit should be 64KB for SQL Server boxes.
    REASON: See
    Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx

NTFS Allocation Unit Size

When formatting the partition that will be used for SQL Server data files, it is recommended that you use a 64-KB allocation unit size for data, logs, and tempdb. Be aware however, that using allocation unit sizes greater than 4 KB results in the inability to use NTFS compression on the volume. Although it is not recommended, SQL Server does support read-only data on compressed volumes.
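
To verify the alignment and allocation unit size on a delivered server, here is a small sketch (run elevated; the drive letter is a placeholder and 65536 bytes = 64KB):

# Partition starting offsets should be evenly divisible by 65536 (64KB)
Get-WmiObject Win32_DiskPartition | Select-Object Name, Index, StartingOffset
# "Bytes Per Cluster" in the output is the NTFS allocation unit size (65536 expected for SQL Server volumes)
fsutil fsinfo ntfsinfo E: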

·         Drive Compression: Drives will not be compressed.
REASON: Compression has a big negative performance impact on SQL Server.

·         NTFS file system will be used instead of FAT or Raw partitions.
REASON: NTFS allows features such as database snapshots, online DBCC checks, instant file initialization, mount points, and additional security. It has larger file size limits (16 exabytes) than FAT (4 GBs). Raw partitions limit your recoverability options.

·         Often you will need one or more of these to achieve optimal performance for a database [Decide which of these you will deploy for each tier of storage and whether each can be requested by a DBA at server configuration time.]

1.       HBA queue depth for SQL Server is often best at 64 or 128; testing will determine the optimal value.
REASON: See Predeployment I/O Best Practices http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx
HBA Queue Depth Settings

When configuring HBAs on the host, ensure that the Queue Depth is set to an optimal value. SQL Server applications are generally I/O-intensive, with many concurrent outstanding I/O requests. As a result, the default values for Queue Depth on HBAs are usually not high enough to support optimal performance. Currently the default value for the major HBA vendors is in the range of 8 to 32.

In our SQL Server testing, we have seen substantial gains in I/O performance when increasing this to 64 or even higher. It is worth noting that in these tests SQL Server was usually the only application using the storage array. It is important to discuss with your storage administrator the appropriate values for this setting, as the setting may affect other applications in sharing the same storage environment. When Queue Depth is set too low, a common symptom is increasing latency and less-than-expected throughput given the bandwidth between host/storage and the number of spindles in a particular configuration.

2.       RAID 10 or its equivalent will be used for the highest performance and best recoverability. Read-only data (no updates from users, replication, batch jobs, or anything else) can see acceptable performance on RAID 5. RAID 5 systems will have slower write performance and less recoverability but might be allowed for lower tiered systems with a signoff that high performance is not guaranteed.
REASON: RAID 10 is the fastest disk for SQL Server data and logs. It also provides the best recoverability options.

o   See Physical Database Storage Design http://www.microsoft.com/technet/prodtechnol/sql/2005/physdbstor.mspx
“For excellent performance and high reliability of both read and write data patterns, use RAID10.”

o   “RAID10 (stripe of mirrors): RAID10 is essentially many sets of RAID1 or mirrored drives in a RAID0 configuration. This configuration combines the best attributes of striping and mirroring: high performance and good fault tolerance. For these reasons, we recommend using this RAID level. However, the high performance and reliability level is the trade-off for storage capacity.”

o   RAID 10 is recommended for “Data requiring high performance for both read and write and excellent reliability while trading off storage efficiency and cost.”

3.       Follow hardware vendor recommendations for configuring the storage for a database, often this is very different than configuring for other non-database systems.

4.       Keep the physical disks no more than about 80% full (avoid full stroking, get closer to short stroking). Some SAN configurations may make this difficult to determine from Windows with concepts such as thin provisioning.

5.       Use multiple HBA controllers with a high throughput capacity. The same applies for other components such as switch ports, NICs, Fibre Channel array ports, storage array service processors, etc.

6.       Favor (battery backed) write cache over read cache for an OLTP system. Often 80/20 or 90/10 in favor of writes is beneficial. It is relatively easy for a busy SQL Server to flood the cache.

1.       Log writes have the lowest allowable latency of any SQL activity on an OLTP system.

2.       Write cache can help absorb checkpoint bursts as they write data to the disks.

3.       Maintenance operations can be write intensive and long running.

4.       SQL Server’s internal data organization rarely matches the physical layout of the data on disk so IO subsystem level read ahead through the read cache is rarely effective for a database.

7.       Performance will be more predictable, IO troubleshooting will be easier, and in many cases overall performance can be higher if SQL Server is isolated to an IO path not shared with other systems. If it does share with others (which is very common), it is better to share with other databases than with file servers or other systems that have different needs.

·         Data and log files will not go on the same drive (they can have the same drive letter if on different mount points). SANs often hide the physical layer behind the drive letter/mount point by mixing data on the back end but it is still important to keep them separate in case you later move to totally separate IO paths for each.

o   Where cost/benefit analysis allows, each database’s transaction log file(s) will get a separate drive.  For maximum performance this would be a truly separate IO path.
REASON: Since writes to the transaction log are sequential (even if there are multiple log files for a given database, only one is written to at a time), if there are multiple databases' log files, or a mix of data and log files, on one drive the disk head is moving between them and decreasing performance.

o   Where cost/benefit analysis allows it, multiple, separate IO paths may be requested for each data file.
REASON: The more drives you can use for data files, the more the IO can be spread out to increase performance. For certain types of activities guaranteeing that the IO does not overlap for two different data files can be advantageous.

·         Before putting each SQL box into production, run SQLIO under various loads and gather data to see whether the IO subsystem is likely to be capable of the expected load. If you have numbers for a typical and current peak load, we can use those numbers and then add some headroom (perhaps test things like 150% of the expected peak batch cycle).

·         Before putting each SQL box into production, run SQLIOSim to test for data correctness.

·         Disk IO Subsystem Performance Requirements:

o   Log writes will not exceed [put your values here] ms per write under the expected peak load (typical is 1-5ms for log writes)

o   Non-log writes will not exceed [put your values here] ms per write under the expected peak load (typical is 1-20ms for OLTP and 25-30ms for warehouses)

o   Reads will not exceed [put your values here] ms per read under the expected peak load (typical is 5-20ms for OLTP and 25-30ms for warehouses)

o   These are general standards; some systems may require faster IO subsystems and some may be ok with slower IO subsystems.

o   Measures of read and write speed are from the Performance Monitor counters Avg. Disk sec/Read and Avg. Disk sec/Write and/or from SQLIO, as these are the speeds seen by SQL Server.
REASON: Databases require fast disk access in order to deliver high performance. Many DBA and application developer hours can be spent troubleshooting performance problems that are later tracked to slow IO.

 

Network Specifications

·         SNP/TCP Chimney settings will depend on whether your NIC vendor supports it. If you have NICs/drivers that support it, turn it on. Otherwise disable it to avoid known problems with SQL Server and other products.
REASON: Performance and usability. When TCP Chimney is enabled on a NIC that doesn't support it, you will often see failed connectivity to SQL Server and/or dropped packets and connections that affect SQL Server. See Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008 http://support.microsoft.com/kb/951037  and 942861 Error message when an application connects to SQL Server on a server that is running Windows Server 2003: “General Network error,” “Communication link failure,” or “A transport-level error” http://support.microsoft.com/default.aspx?scid=kb;EN-US;942861

·         Put a firewall between your server and the internet. Block TCP port 1433, UDP port 1434, and static ports used by named instances on the perimeter firewalls but not necessarily on individual internal servers. Be careful which ports you block in the other firewalls; SQL Server will use various ports to communicate on.
REASON: Security – Hackers may try to use well known ports to attack SQL Server.

·         Open ports used by SQL Server in the individual server firewalls. [If you define a specific port range for all instances include it here.]
REASON: The DBAs have defined this range of ports as what each SQL Server uses for incoming connections.
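
As one example of opening the port for a default instance in an individual server's Windows Firewall (adjust the port and rule name for named instances and your defined port range):

netsh advfirewall firewall add rule name="SQL Server default instance" dir=in action=allow protocol=TCP localport=1433
# For named instances, open their static ports and/or the SQL Browser port (UDP 1434)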

Windows Cluster

·         Identical hardware: While Windows 2008 clusters are not required to be identical, to have a greater chance of predictability no matter which node owns each SQL Server group it is recommended that they be configured as close to the same as possible.

·         Windows policies and rights: Windows policies and rights should be the same on all nodes.
REASON: The behavior of SQL Server must be the same on all nodes. Policies can change SQL Server behavior.

·         Mount points: Do not install SQL Server 2000 on any Windows cluster with mount points.  The mount points must have an associated drive letter and must be cluster resources in the group where SQL Server will reside. SQL Server must “depend on” all mount points that it uses.
REASON: SQL Server 2005+ supports mount points but SQL Server 2000 setup, including service packs and hotfixes, will fail when it tries to enumerate the mount points, even if they are not in the SQL Server 2000 group. For 2005+ instances, the mount points must be in the SQL Server group in order for SQL Server to access them.

·         The Cluster service account must be a login in the SQL Server instance and a simple user in the master database, but should NOT be a sysadmin role member.
REASON: Avoid elevated privileges.

·         MS DTC: Choose a consistent standard for how you configure DTC. On Windows 2008+ clusters you can have more than one DTC per cluster and there are pros/cons to the various configurations of how SQL Server uses one or more of those DTCs. http://blogs.msdn.com/cindygross/archive/2009/02/22/how-to-configure-dtc-for-sql-server-in-a-windows-2008-cluster.aspx DTC must be enabled for network access. If you choose to put DTC in a group with SQL Server and choose to have a DTC failure cause the group to fail, you should be aware that DTC can cause a SQL Server failover. This may be appropriate in some environments.

·         Cluster Group Names will not contain any special characters such as <, >, ', ", &
REASON: Special characters in any group name may cause SQL Server setup to fail.

·         NIC names will not have any special characters or trailing spaces.
REASON: Special characters in any network name may cause SQL Server setup to fail.

·         Auto start must NOT be on for any clustered resource in a Windows cluster.
REASON: The cluster administrator needs to bring the resources online (start the services). If Windows startup has already started the service the cluster service cannot bring it online which results in errors.

·         Use service SIDs in a Windows cluster. If you choose not to use service SIDs, create unique domain groups: each individual service installed with SQL Server needs a unique domain group created for it. The following naming standard will be used: [put your company standard here:  Examples: myserver1_instance1_SQLServer, myserver1_instance1_SQLAgent  ]

http://technet.microsoft.com/en-us/library/ms143504.aspx

Best practices that you can use to set up domain groups and solutions to problems that may occur when you set up a domain group when you install a SQL Server 2005 failover cluster http://support.microsoft.com/default.aspx?scid=kb;EN-US;915846
REASON: On a cluster domain groups are used to manage permissions for the SQL Server components. Each service needs a unique group to reduce the attack surface.

Remote Control

·         Remote control to the server will only be done when absolutely necessary. No SQL Server tools will be opened on the production server itself unless there is no other way to access the server.
REASON: It adds overhead to the server and can cause performance problems. Most access will be done from client tools installed on desktops. The overhead of the GUI interfaces is not acceptable on a production server. Some SQL Server client tools are not available on 64-bit systems and all servers will be x64 going forward.

Hardware Specifications

·         Expected Life: Servers are spec’d with an expectation of being in service for up to [put your policy here, 3-4 is common] years unless otherwise stated.
REASON: We need to know in advance how long the hardware is expected to stay in service so we can predict the resources needed for that time period. An uncertainty factor will be considered as well, so systems with high uncertainty in the predictions may need more expansion opportunity (have the ability to add more hardware resources as needed).

·         Architecture: All new servers will be based on x64 hardware with an x64 version/edition of Windows.
REASON: 64-bit systems allow much more memory to be used than 32-bit systems.

 Cindy Gross, Microsoft Dedicated Support Engineer for SQL Server and Microsoft Certified Master : SQL Server 2008