Recent Entries 10
- pattern minor 112d agoWhat's the best way to concatenate a multi-column foreign key?We are expanding our data warehouse to include a new source system. In the new system, we want a single column for a primary key, but in the old system it's two columns. For exmaple: ``` create table #new_case (case_id varchar(255)) create table #old_case (patient_id int, care_event_counter int) insert into #old_case (patient_id, Care_event_counter) values (55, 1), (55, 2), (55, 3), (56, 1), (57, 1), (57, 2) insert into #new_case (case_id) select cast(patient_id as varchar(55)) + '_' + cast(care_event_counter as varchar(55)) from #old_case select * from #new_case ``` problem: This has got to be super slow on a large scale, and I'd prefer to use int across the board. Question: How do you merge two integer columns intelligently? is there a math trick you could use to derive the values that would be faster?
- pattern moderate 112d agoWhat is the best option for High availability on a data warehouse?I have a customer who have an existing 1.5Tb data warehouse and are currently planning for a complete refresh of the DW on a new environment. Their infrastructure manager has organised 2 servers with SQL 2017 standard on each and has now asked me to plan a HA/DR plan for the new DW database/instance. I immediately thought of using AlwaysOn Availability Groups, although I have never used them before, and none of the articles I read talk about typical data warehouse workloads - it's all OLTP applications. With a large daily ETL process and a smaller intra-day ETL process running on their current DW, will that have an impact on how we approach this? Thanks - any help to point me in the right direction here would be beneficial!
- pattern minor 112d agoDimensional Modeling of a Loan Approval Process and Fact Grain(s)Context: I'm modeling for a lending company and I'm modeling our loan approval process. I have some basic dimensions relative to our business figured out: - Time(day) - Account - Opportunity - Channel And I and getting to the modeling of fact table(s) around the loan approval and credit decisioning process. An opportunity will get one or more "Credit Reviews" which have a series of steps they go through before arriving at a "Credit Decision". My instinct is to model the fact grain at the credit review step level or maybe the credit review level (or have one fact for each grain). I'm working with some analysts who are used to using a “canned report”, one flattened table extracted from the source system that is at the opportunity level, and finds a "best credit review", and do a lot of their reporting from that. I'm trying to explain my credit approval fact and how it would work and they think it's "overly complicated" and seem to want a fact table that is one-to-one with the opportunity, which is a dimension. And contains information on only the "best" credit review. In my mind that seems odd. I haven't run across a fact table that is at the same granularity (one-to-one) with a dimension. Have you ever heard of this? It feels like they are just trying to reproduce what they are used to, a non-dimensionally modeled data mart table at the opportunity level. I'm trying to explain the utility of having a fact table that is at a higher granularity than they are used to in order to support many questions about the data rather than one, but I seem to be running into a wall getting this concept across. I’m not sure who is right here. My question is this: Is it every appropriate, say in this situation, to model a fact that has the same cardinality as a dimension (one-to-one) with a core dimension? What would you recommend for this situation?
- snippet minor 112d agoHow to handle schema changes in source systems when running an ETL?How do people typically handle schema changes in source systems when running an ETL as part of an EDW? For example when a column you were operating on disappears and your ETL breaks.
- gotcha moderate 112d agoDifference Between a Staging and the Production DWHI have a general question on a Data Warehouse Architecture. I have heard a lot of a Staging and a Production, so that data should be loaded via SSIS first into a staging dwh and then via SSIS to the Production environment. But is there any difference? I mean is it only a "copy" of the production data warehouse (to do some quality checks on data before pushing to production?) or of the source data in a different model or (different schema)? From software development perspective i know that both are identical and the only purpose is to check the software version from all stakeholders and get the release to deploy on prod. Normally in the software case the staging is same (maybe only hardware sizing is different) to production.
- pattern moderate 112d agoHow many rows can SQL Server process in a single INSERT statement into a table?To illustrate my question, the following is a query detecting order's id which have not been inserted in a data warehouse and inserts them: ``` With NewOrders As ( Select OrderID From Orders Except Select OrderID From FactOrders ) Insert Into FactOrders(OrderID, OrderDate, CustomerId) Select OrderID, OrderDate, CustomerId From Orders Where OrderID in (Select OrderID from NewOrders); ``` Say the query is run for the first time and `Orders` contains 400 million rows or more: Can SQL Server handle this number of rows in one single `INSERT` statement? If not, how should I proceed? Should I limit the number of rows fetched in the `INSERT` statement? How many rows can the engine handle in a single `INSERT` statement?
- gotcha minor 112d agoWhat is the difference between Federated and Decentralized Data Warehouse?I can not find any clear definitions or explanations of any of these. Both seem decentralized. It appears that in Federated DWH, the data is distributed and not integrated into a single repository and accessed from distributed sources. While in Decentralized DWH implementation the data is integrated into one central repository. Please explain the difference between these two implementations.
- pattern minor 112d agoSQL Server Data Warehouse stored procedure locationWe are building a Kimball data warehouse. We have a Staging Database, and Final Datawarehouse. We have a stored procedure which creates the table CustomerTransaction. Where is the best location to place the stored procedure? In the Staging Database or the Datawarehouse? Or what is the positives/negatives of each location? I just started creating data warehouses and trying to learn. ``` create procedure dbo.FactCustomerTransactionImport -- should this be in StagingDB or datawarehouse DB? as insert into DWFinal.dbo.FactCustomerTransaction ( CustomerId, Quantity, Price, Amount ) select CustomerId, Quantity, Price, Quantity * Price as Amount from StagingDB.dbo.CustomerTransaction ```
- pattern minor 112d agoBest practices for transferring data across long, wide tablesAs part of my data warehouse project, I'll be transferring data from our OLTPs to the Data Warehouse. Some of the tables are long and wide so obviously, I'll only be transferring required columns. To reduce the overhead, I'm considering data tracking on these tables so we only look at changed values rather than scanning the complete table. Unfortunately our source system runs on SQL Server 2014 Standard Edition and CDC is an option we can not take advantage of. Is there an alternative method that would be better than change tracking?
- snippet minor 112d agoHow to Develop Union Reports from Single Tenant Databases?We have a database for each customer, for over 700 databases. All the schemas are the same. Now we want to conduct Reporting for : All Databases Customer Analysis. Many db professionals prefer single tenant databases. All database reside on the same server instance. Single Tenant Database Discussion ``` CREATE TABLE CustomerOne.[dbo].[CustomerTransaction] ( [CustomerTransactionid] [int] identity(1,1) primary key NOT NULL, [Customerid] [int] NOT NULL default(1), [QuantityBought] [int] NULL, ) ``` Does anyone know of any other ideas not listed below, which would work? Our idea is to either create large Views, however, View performance is slow. The second option is stored procedures. (1) ``` CREATE VIEW Reportingdb.dbo.CustomerTransaction as SELECT [Customertransactionid] ,[Customerid] ,[Quantitybought] FROM [CustomerOne].[dbo].[customertransaction] UNION ALL SELECT [Customertransactionid] ,[Customerid] ,[Quantitybought] FROM [CustomerTwo].[dbo].[customertransaction] ``` 2) We can also create a stored procedure. Stored Procedure which will insert into this large table. However there will be a time lag between runs, we want instantaneous data. ``` insert into ReportingDB.[dbo].[customertransaction] select * from CustomerOne.[dbo].[customertransaction] UNION ALL select * from CustomerTwo.[dbo].[customertransaction] ``` SQL How to Convert Single Tenant Databases to a Multitenant Database