patternsqlMajor
Patterns for setting up sharding for SQL Server 2008 R2, in order to handle large datasets?
Viewed 0 times
order2008handlesqlpatternsdatasetssettinglargeforserver
Problem
I want to handle a large dataset (> 1 billion rows) in SQL Server 2008 R2. I heard that if you set up "sharding" or "horizontal partitioning", it makes it quicker to handle large datasets as it breaks the tables up into multiple files.
Have you used sharding or horizontal partitioning before? If so, could you suggest any patterns that, in your experience, achieves better results compared to a single massive table?
Have you used sharding or horizontal partitioning before? If so, could you suggest any patterns that, in your experience, achieves better results compared to a single massive table?
Solution
First, note that 1 billion rows can be handled effectively with a partitioned architecture on ordinary commodity server hardware. Exotic shared nothing architectures will not be necessary for this data volume, however, you will probably get significant benefits from table partitioning.
Sharding is something different from horizontal partitioning, and implies a 'shared nothing' architecture, which is not supported by most versions of SQL Server1
SQL Server can support horizontal partitioning, and a shared disk architecture will be adequate for ~1 billon rows.
In SQL Server, you create a partition function selects a partition based on values or ranges of values in a column on a table, e.g.
Then create one or more filegroups to allocate the partitions to. For a large data set, these file groups could be set up on different physical volumes. Note that direct attach storage will be much faster than a SAN for this in almost all cases. In the example below, we would have created 6 filegroups called PartVol1-PartVol6.
One or more partition schemes can be created to allocate table partitions to filegroups based on the value of the partition function, e.g.
This scheme is designed to partition on an accounting period. Dates are also frequently used for this, although any key could be used.
You can create a table on the partition scheme as if it was a filegroup, e.g.
Note that table is created on the partition scheme instead of a specified file group, and the clause specifies the column to be used as the partition key. Based on the partition key, rows in the table will be allocated to one of the filegroups in the partition scheme.
Note: A rule of thumb for designing a partitioning scheme is that each partition should have a row count in the low 10's of millions, say between 10 and 50 million depending on the width of the rows. The disk volume that the partition sits on should be fast enough to do a scan of at least a single partition in a few seconds.
Partitioning, Sharding and Shared Nothing systems
A bit of terminology seems in order here to disambiguate some of the discussion on this topic.
-
A 'shared nothing' system is a parallel system where the nodes do not have shared SAN storage, but use storage local to the node. The classic example of this type of architecture is Teradata. Shared nothing systems scale well to very large data sets as they have no central I/O bottlenecks. I/O throughput scales with the number of nodes in the system.
-
A 'shared disk' system is one where one or more database servers share a single disk storage subsystem. The database can be a single server with either local storage or attached to a SAN, or a cluster of servers attached to a shared SAN. Systems of this type are constrained by the throughput available from the storage subsystem.
-
'Sharding' is a term used to describe splitting a database amongst multiple physical servers in a shared nothing architecture. Various platforms will have greater or lesser support for sharded databases. In Teradata circles, the term is not used because Teradata presents a transparent single system image to the clients, even though the physical architecture is a shared nothing type.
Older versions of SQL Server have limited support for sharding throu
Sharding is something different from horizontal partitioning, and implies a 'shared nothing' architecture, which is not supported by most versions of SQL Server1
SQL Server can support horizontal partitioning, and a shared disk architecture will be adequate for ~1 billon rows.
In SQL Server, you create a partition function selects a partition based on values or ranges of values in a column on a table, e.g.
create partition function F_AccPrdPart (int)
as range left for values (
180001 -- Dummy value
,199012 ,199112 ,199212 ,199312, 199412 ,199512
,199612 ,199712 ,199812 ,199912 ,200012 ,200112
,200212 ,200312 ,200412 ,200512 ,200612 ,200712
,200812
,200901, 200902, 200903, 200904, 200905, 200906
,200907, 200908, 200909, 200910, 200911, 200912
,201001, 201002, 201003, 201004, 201005, 201006
,201007, 201008, 201009, 201010, 201011, 201012
,201101, 201102, 201103, 201104, 201105, 201106
,201107, 201108, 201109, 201110, 201111, 201112
,202012
,939999 -- Dummy value
)
goThen create one or more filegroups to allocate the partitions to. For a large data set, these file groups could be set up on different physical volumes. Note that direct attach storage will be much faster than a SAN for this in almost all cases. In the example below, we would have created 6 filegroups called PartVol1-PartVol6.
One or more partition schemes can be created to allocate table partitions to filegroups based on the value of the partition function, e.g.
create partition S_AccPrdPart as partition F_AccPrdPart
TO ([PRIMARY]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2]
,[PRIMARY]
,[PRIMARY])
goThis scheme is designed to partition on an accounting period. Dates are also frequently used for this, although any key could be used.
You can create a table on the partition scheme as if it was a filegroup, e.g.
Create table FooTrans (
FooTransID int identity (1,1) not null
,AccPeriod int not null
,[...]
) on S_AccPrdPart (AccPeriod)
goNote that table is created on the partition scheme instead of a specified file group, and the clause specifies the column to be used as the partition key. Based on the partition key, rows in the table will be allocated to one of the filegroups in the partition scheme.
Note: A rule of thumb for designing a partitioning scheme is that each partition should have a row count in the low 10's of millions, say between 10 and 50 million depending on the width of the rows. The disk volume that the partition sits on should be fast enough to do a scan of at least a single partition in a few seconds.
Partitioning, Sharding and Shared Nothing systems
A bit of terminology seems in order here to disambiguate some of the discussion on this topic.
-
A 'shared nothing' system is a parallel system where the nodes do not have shared SAN storage, but use storage local to the node. The classic example of this type of architecture is Teradata. Shared nothing systems scale well to very large data sets as they have no central I/O bottlenecks. I/O throughput scales with the number of nodes in the system.
-
A 'shared disk' system is one where one or more database servers share a single disk storage subsystem. The database can be a single server with either local storage or attached to a SAN, or a cluster of servers attached to a shared SAN. Systems of this type are constrained by the throughput available from the storage subsystem.
-
'Sharding' is a term used to describe splitting a database amongst multiple physical servers in a shared nothing architecture. Various platforms will have greater or lesser support for sharded databases. In Teradata circles, the term is not used because Teradata presents a transparent single system image to the clients, even though the physical architecture is a shared nothing type.
Older versions of SQL Server have limited support for sharding throu
Code Snippets
create partition function F_AccPrdPart (int)
as range left for values (
180001 -- Dummy value
,199012 ,199112 ,199212 ,199312, 199412 ,199512
,199612 ,199712 ,199812 ,199912 ,200012 ,200112
,200212 ,200312 ,200412 ,200512 ,200612 ,200712
,200812
,200901, 200902, 200903, 200904, 200905, 200906
,200907, 200908, 200909, 200910, 200911, 200912
,201001, 201002, 201003, 201004, 201005, 201006
,201007, 201008, 201009, 201010, 201011, 201012
,201101, 201102, 201103, 201104, 201105, 201106
,201107, 201108, 201109, 201110, 201111, 201112
,202012
,939999 -- Dummy value
)
gocreate partition S_AccPrdPart as partition F_AccPrdPart
TO ([PRIMARY]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1], [PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6]
,[PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2], [PartVol3], [PartVol4], [PartVol5], [PartVol6], [PartVol1]
,[PartVol2]
,[PRIMARY]
,[PRIMARY])
goCreate table FooTrans (
FooTransID int identity (1,1) not null
,AccPeriod int not null
,[...]
) on S_AccPrdPart (AccPeriod)
goContext
StackExchange Database Administrators Q#12789, answer score: 23
Revisions (0)
No revisions yet.