patternsqlModerate
Data Warehouse design for reporting against data for many time zones
Viewed 0 times
reportingwarehousedesigntimeagainstzonesformanydata
Problem
We are trying to optimize a data warehouse design that will support reporting against data for many time zones. For example, we might have a report for a month's worth of activity (millions of rows) that needs to show activity grouped by the hour of the day. And of course that hour of the day has to be the "local" hour for the given time zone.
We had a design that worked well when we just supported UTC and one local time. The standard design of Date and Time dimensions for UTC and local time, id's on the Fact tables. However, that approach does not seem to scale if we have to support reporting for 100+ time zones.
Our Fact tables would get very wide. Also, we'd have to solve the syntax issue in SQL of specifying which date and time id's to use for grouping on any given run of the report. Perhaps a very large CASE statement?
I've seen some suggestions to get all the data by the UTC time range you are covering, then return it to the presentation layer to convert to local and aggregate there, but limited testing with SSRS suggests that will be extremely slow.
I've consulted some books on the subject as well, and they all seem to say just have UTC and convert on display or have UTC and one local. Would appreciate any thoughts and suggestions.
Note: This question is similar to: Handling time zones in data mart/warehouse, but I cannot comment on that question, so felt this deserved its own question.
Update: I selected Aaron's answer after he made some significant updates and posted sample code and diagrams. My earlier comments on his answer won't make much sense anymore as they referred to the original edit of the answer. I'll try to come back and update this again if warranted
We had a design that worked well when we just supported UTC and one local time. The standard design of Date and Time dimensions for UTC and local time, id's on the Fact tables. However, that approach does not seem to scale if we have to support reporting for 100+ time zones.
Our Fact tables would get very wide. Also, we'd have to solve the syntax issue in SQL of specifying which date and time id's to use for grouping on any given run of the report. Perhaps a very large CASE statement?
I've seen some suggestions to get all the data by the UTC time range you are covering, then return it to the presentation layer to convert to local and aggregate there, but limited testing with SSRS suggests that will be extremely slow.
I've consulted some books on the subject as well, and they all seem to say just have UTC and convert on display or have UTC and one local. Would appreciate any thoughts and suggestions.
Note: This question is similar to: Handling time zones in data mart/warehouse, but I cannot comment on that question, so felt this deserved its own question.
Update: I selected Aaron's answer after he made some significant updates and posted sample code and diagrams. My earlier comments on his answer won't make much sense anymore as they referred to the original edit of the answer. I'll try to come back and update this again if warranted
Solution
I've solved this by having a very simple calendar table - each year has one row per supported time zone, with the standard offset and the start datetime / end datetime of DST and its offset (if that time zone supports it). Then an inline, schema-bound, table-valued function that takes the source time (in UTC of course) and adds/subtracts the offset.
This will obviously never perform extremely well if you are reporting against a large portion of data; partitioning might seem to help, but you will still have cases where the last few hours in one year or the first few hours in the next year actually belong to a different year when converted to a specific time zone - so you can never get true partition isolation, except when your reporting range does not include December 31 or January 1.
There are a couple of weird edge cases you need to consider:
-
2014-11-02 05:30 UTC and 2014-11-02 06:30 UTC both convert to 01:30 AM in the Eastern time zone, for example (one for the first time 01:30 was hit locally, and then one for the second time when the clocks rolled back from 2:00 AM to 1:00 AM, and another half hour elapsed). So you need to decide how to handle that hour of reporting - according to UTC, you should see double the traffic or volume of whatever you're measuring once those two hours get mapped to a single hour in a time zone that observes DST. This can also play fun games with sequencing of events, since something that logically had to happen after something else could appear to happen before it once the timing is adjusted to a single hour instead of two. An extreme example is a page view that happened at 05:59 UTC, then a click that happened at 06:00 UTC. In UTC time these happened a minute apart, but when converted to Eastern time, the view happened at 1:59 AM, and the click happened an hour earlier.
-
2014-03-09 02:30 never happens in the USA. This is because at 2:00 AM we roll the clocks forward to 3:00 AM. So likely you will want to raise an error if the user enters such a time and asks you to convert that to UTC, or design your form so that users can't pick such a time.
Even with those edge cases in mind, I still think you have the right approach: store the data in UTC. Much easier to map data to other time zones from UTC than from some time zone to some other time zone, especially when different time zones start / end DST on different dates, and even the same time zone can switch using different rules in different years (for example the U.S. changed the rules 6 years ago or so).
You will want to use a calendar table for all of this, not some gargantuan
A real live example, in the meantime
Let's say you have a very simple fact table. The only fact I care about in this case is the event time, but I'll add a meaningless GUID just to make the table wide enough to care about. Again, to be explicit, the fact table stores events in UTC time and UTC time only. I've even suffixed the column with
Now, let's load our fact table with 10,000,000 rows - representing every 3 seconds (1,200 rows per hour) from 2013-12-30 at midnight UTC until sometime after 5 AM UTC on 2014-12-12. This ensures that the data straddles a year boundary, as well as DST forward and back for multiple time zones. This looks really scary, but took ~9 seconds on my system. Table should end up being about 325 MB.
And just to show what a typical seek query will look like against this 10MM row table, if I run this query:
I get this plan, and it returns in 25 milliseconds*, doing 358 reads, to return 72 hourly totals:
* Duration as measured by the free SentryOne Plan Explorer, which discards results, so this does not include network transfer time of the data, rendering, etc.
It takes a little longer, obviously, if I make my range too larg
This will obviously never perform extremely well if you are reporting against a large portion of data; partitioning might seem to help, but you will still have cases where the last few hours in one year or the first few hours in the next year actually belong to a different year when converted to a specific time zone - so you can never get true partition isolation, except when your reporting range does not include December 31 or January 1.
There are a couple of weird edge cases you need to consider:
-
2014-11-02 05:30 UTC and 2014-11-02 06:30 UTC both convert to 01:30 AM in the Eastern time zone, for example (one for the first time 01:30 was hit locally, and then one for the second time when the clocks rolled back from 2:00 AM to 1:00 AM, and another half hour elapsed). So you need to decide how to handle that hour of reporting - according to UTC, you should see double the traffic or volume of whatever you're measuring once those two hours get mapped to a single hour in a time zone that observes DST. This can also play fun games with sequencing of events, since something that logically had to happen after something else could appear to happen before it once the timing is adjusted to a single hour instead of two. An extreme example is a page view that happened at 05:59 UTC, then a click that happened at 06:00 UTC. In UTC time these happened a minute apart, but when converted to Eastern time, the view happened at 1:59 AM, and the click happened an hour earlier.
-
2014-03-09 02:30 never happens in the USA. This is because at 2:00 AM we roll the clocks forward to 3:00 AM. So likely you will want to raise an error if the user enters such a time and asks you to convert that to UTC, or design your form so that users can't pick such a time.
Even with those edge cases in mind, I still think you have the right approach: store the data in UTC. Much easier to map data to other time zones from UTC than from some time zone to some other time zone, especially when different time zones start / end DST on different dates, and even the same time zone can switch using different rules in different years (for example the U.S. changed the rules 6 years ago or so).
You will want to use a calendar table for all of this, not some gargantuan
CASE expression (not statement). I just wrote a three-part series for MSSQLTips.com on this; I think the 3rd part will be the most useful for you:- http://www.mssqltips.com/sqlservertip/3173/handle-conversion-between-time-zones-in-sql-server--part-1/
- http://www.mssqltips.com/sqlservertip/3174/handle-conversion-between-time-zones-in-sql-server--part-2/
- http://www.mssqltips.com/sqlservertip/3175/handle-conversion-between-time-zones-in-sql-server--part-3/
A real live example, in the meantime
Let's say you have a very simple fact table. The only fact I care about in this case is the event time, but I'll add a meaningless GUID just to make the table wide enough to care about. Again, to be explicit, the fact table stores events in UTC time and UTC time only. I've even suffixed the column with
_UTC so there is no confusion.CREATE TABLE dbo.Fact
(
EventTime_UTC DATETIME NOT NULL,
Filler UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID()
);
GO
CREATE CLUSTERED INDEX x ON dbo.Fact(EventTime_UTC);
GONow, let's load our fact table with 10,000,000 rows - representing every 3 seconds (1,200 rows per hour) from 2013-12-30 at midnight UTC until sometime after 5 AM UTC on 2014-12-12. This ensures that the data straddles a year boundary, as well as DST forward and back for multiple time zones. This looks really scary, but took ~9 seconds on my system. Table should end up being about 325 MB.
;WITH x(c) AS
(
SELECT TOP (10000000) DATEADD(SECOND,
3*(ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1),
'20131230')
FROM sys.all_columns AS s1
CROSS JOIN sys.all_columns AS s2
ORDER BY s1.[object_id]
)
INSERT dbo.Fact WITH (TABLOCKX) (EventTime_UTC)
SELECT c FROM x;And just to show what a typical seek query will look like against this 10MM row table, if I run this query:
SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0),
COUNT(*)
FROM dbo.Fact
WHERE EventTime_UTC >= '20140308'
AND EventTime_UTC < '20140311'
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0);I get this plan, and it returns in 25 milliseconds*, doing 358 reads, to return 72 hourly totals:
* Duration as measured by the free SentryOne Plan Explorer, which discards results, so this does not include network transfer time of the data, rendering, etc.
It takes a little longer, obviously, if I make my range too larg
Code Snippets
CREATE TABLE dbo.Fact
(
EventTime_UTC DATETIME NOT NULL,
Filler UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID()
);
GO
CREATE CLUSTERED INDEX x ON dbo.Fact(EventTime_UTC);
GO;WITH x(c) AS
(
SELECT TOP (10000000) DATEADD(SECOND,
3*(ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1),
'20131230')
FROM sys.all_columns AS s1
CROSS JOIN sys.all_columns AS s2
ORDER BY s1.[object_id]
)
INSERT dbo.Fact WITH (TABLOCKX) (EventTime_UTC)
SELECT c FROM x;SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0),
COUNT(*)
FROM dbo.Fact
WHERE EventTime_UTC >= '20140308'
AND EventTime_UTC < '20140311'
GROUP BY DATEADD(HOUR, DATEDIFF(HOUR, 0, EventTime_UTC), 0);CREATE TABLE dbo.TimeZones
(
TimeZoneID TINYINT NOT NULL PRIMARY KEY,
Name VARCHAR(9) NOT NULL,
Offset SMALLINT NOT NULL, -- minutes
DSTName VARCHAR(9) NOT NULL,
DSTOffset SMALLINT NOT NULL -- minutes
);INSERT dbo.TimeZones VALUES
(1, 'UTC', 0, 'UTC', 0),
(2, 'GMT', 0, 'BST', 60),
-- London = UTC in winter, +1 in summer
(3, 'EST', -300, 'EDT', -240),
-- East coast US (-5 h in winter, -4 in summer)
(4, 'ACDT', 630, 'ACST', 570),
-- Adelaide (Australia) +10.5 h Oct - Apr, +9.5 Apr - Oct
(5, 'ACST', 570, 'ACST', 570);
-- Darwin (Australia) +9.5 h year roundContext
StackExchange Database Administrators Q#58762, answer score: 18
Revisions (0)
No revisions yet.