Tutorial: Structure within the staging area of a data warehouse



Question:

We are working on a data warehouse for a bank and have pretty much followed the standard Kimball model of staging tables, a star schema and an ETL to pull the data through the process.

Kimball talks about using the staging area for import, cleaning, processing and everything else until you are ready to put the data into the star schema. In practice this typically means uploading data from the sources into a set of tables with little or no modification, then optionally taking the data through intermediate tables until it is ready to go into the star schema. That's a lot of work for a single entity; there is no single responsibility here.

Previous systems I have worked on have made a distinction between the different sets of tables, to the extent of having:

  • Upload tables: raw source system data, unmodified
  • Staging tables: intermediate processing, typed and cleansed
  • Warehouse tables

You can stick these in separate schemas and then apply differing policies for archive/backup/security etc. One of the other guys has worked on a warehouse where there is a StagingInput and a StagingOutput, which is a similar story. The team as a whole has a lot of experience, both in data warehousing and otherwise.
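For illustration, here is a minimal sketch of the upload/staging split described above. It uses Python's built-in sqlite3 module, with attached in-memory databases standing in for separate schemas. The table name account, its columns and the cleansing rule are all made up for the example; a real warehouse would do this in its own RDBMS and ETL tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Attached databases stand in for separate schemas (hypothetical names).
for layer in ("upload", "staging"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {layer}")

# Upload layer: raw source data, everything kept as text, unmodified.
conn.execute(
    "CREATE TABLE upload.account (account_id TEXT, balance TEXT, opened TEXT)"
)
conn.executemany(
    "INSERT INTO upload.account VALUES (?, ?, ?)",
    [("1001", " 250.00", "2020-01-15"), ("1002", "n/a", "2021-03-02")],
)

# Staging layer: typed and cleansed copy of the same rows.
conn.execute(
    "CREATE TABLE staging.account "
    "(account_id INTEGER PRIMARY KEY, balance REAL, opened TEXT)"
)
conn.execute(
    """
    INSERT INTO staging.account
    SELECT CAST(account_id AS INTEGER),
           -- Assumed rule: balances without any digit become NULL.
           CASE WHEN TRIM(balance) GLOB '*[0-9]*'
                THEN CAST(TRIM(balance) AS REAL) END,
           opened
    FROM upload.account
    """
)

staged = conn.execute(
    "SELECT account_id, balance FROM staging.account ORDER BY account_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM upload.account").fetchone()[0]
```

Because each layer lives in its own schema, the raw rows stay untouched while the typed copy is rebuilt, and each schema can be given its own backup or security policy.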

However, despite all this, looking through Kimball and the web there seems to be absolutely nothing in writing about giving any kind of structure to the staging database. One would be forgiven for believing that Mr Kimball would have us all work with staging as this big deep dark unstructured pool of data.

Whilst it is of course pretty obvious how to add more structure to the staging area if we want to, it seems very odd that nothing appears to have been written about it.

So, what is everyone else out there doing? Is staging just this big unstructured mess or do folk have some interesting designs on it?


Solution:1

I have experienced the same problem. We have a large HR data warehouse and I am pulling data from systems all over the enterprise. I've got a nice collection of Fact and Dimension tables, but the staging area is a mess. I don't know of any standards for the design of this. I would follow the same path you are on and come up with a standard set of names to keep things in order. Your suggestion for the naming is pretty good. I'd keep working with that.


Solution:2

Just a note, there is a book called "The Data Warehouse ETL Toolkit" by Ralph Kimball and Joe Caserta, so Mr. Kimball did put some effort into this. :)


Solution:3

We are working on a large insurance DWH project at the moment. It's slightly complicated, but each source system's tables are put into a separate schema in a STAGING database. We then have ETL that moves, cleanses and conforms (MDM) the data from the STAGING database into a STAGINGCLEAN database, and further ETL that moves the data into a Kimball DWH.

We find the separation of the Staging and StagingClean databases very helpful in diagnosing issues, particularly with data quality, as we have the dirty staged data as well as the cleaned version before it is transformed into the DWH proper.
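As a rough sketch of the kind of diagnosis this separation allows, the snippet below lists staged rows that never reached the clean copy. It uses Python's sqlite3 with attached in-memory databases in place of the two real databases; the policy table, its columns and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS stagingclean")

# Hypothetical table: dirty staged rows next to the cleansed version.
conn.execute("CREATE TABLE staging.policy (policy_no TEXT, premium TEXT)")
conn.execute("CREATE TABLE stagingclean.policy (policy_no TEXT, premium REAL)")

conn.executemany(
    "INSERT INTO staging.policy VALUES (?, ?)",
    [("P-1", "120.50"), ("P-2", "abc"), ("P-3", "")],
)
# Assume the cleansing ETL only let the valid row through.
conn.executemany(
    "INSERT INTO stagingclean.policy VALUES (?, ?)",
    [("P-1", 120.5)],
)

# Staged rows with no clean counterpart, with their original dirty values.
rejected = conn.execute(
    """
    SELECT s.policy_no, s.premium
    FROM staging.policy AS s
    LEFT JOIN stagingclean.policy AS c ON c.policy_no = s.policy_no
    WHERE c.policy_no IS NULL
    ORDER BY s.policy_no
    """
).fetchall()
```

Since the dirty values are still sitting in staging, a query like this shows exactly what the source sent for each row that was dropped.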


Solution:4

There can be sub-areas in staging, called staging1 and staging2, for example.

Staging1 can be a direct pull from the data sources with no transformation. Staging1 only keeps the latest data.

Staging2 keeps data that has been transformed and is ready to go to the warehouse. Staging2 keeps all historical data.
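A minimal sketch of that pattern, assuming table-name prefixes rather than separate schemas: staging1 is emptied and reloaded on every batch, while staging2 is only ever appended to. The customer table, the load_date column and the UPPER/TRIM transformation are placeholders for whatever the real ETL does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging1_customer (customer_id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE staging2_customer "
    "(customer_id INTEGER, name TEXT, load_date TEXT)"
)


def load_batch(rows, load_date):
    """Replace staging1 with the new batch, then append to staging2."""
    with conn:  # one transaction per batch
        # staging1: latest data only, no transformation.
        conn.execute("DELETE FROM staging1_customer")
        conn.executemany("INSERT INTO staging1_customer VALUES (?, ?)", rows)
        # staging2: transformed rows, kept for every load.
        conn.execute(
            "INSERT INTO staging2_customer "
            "SELECT customer_id, UPPER(TRIM(name)), ? FROM staging1_customer",
            (load_date,),
        )


load_batch([(1, " alice ")], "2024-01-01")
load_batch([(1, "alice b"), (2, "bob")], "2024-01-02")

latest_count = conn.execute(
    "SELECT COUNT(*) FROM staging1_customer"
).fetchone()[0]
history_count = conn.execute(
    "SELECT COUNT(*) FROM staging2_customer"
).fetchone()[0]
```

After the two loads, staging1 holds only the second batch, while staging2 holds the rows from both.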


Solution:5

Have a look at this post here. It gives a good overview of the responsibilities of a staging area within a DW.


Solution:6

What a great question.

In the past we have used a _MIRR (for mirror) suffix for untransformed data landed into the database, i.e. it mirrors the source. Then we use _STG for the transformed data from the source, then _DW for the star schema.

The staging tables here would be in 3NF. I think this is the key point. Data is landed untransformed and kept separate from the next step, where we fully normalize the data, before then flattening it all out into our star schema for reporting.
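To make the three steps concrete, here is a small sketch using Python's sqlite3. Only the _MIRR, _STG and _DW suffixes come from the description above; the product and category tables, their columns and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Landed as-is from the source: category name repeated on every row.
    CREATE TABLE PRODUCT_MIRR (
        product_code TEXT, product_name TEXT, category_name TEXT);

    -- Normalized staging: category split out into its own table.
    CREATE TABLE CATEGORY_STG (
        category_id INTEGER PRIMARY KEY, category_name TEXT UNIQUE);
    CREATE TABLE PRODUCT_STG (
        product_code TEXT PRIMARY KEY, product_name TEXT,
        category_id INTEGER REFERENCES CATEGORY_STG(category_id));

    -- Flattened dimension for the star schema.
    CREATE TABLE PRODUCT_DIM_DW (
        product_key INTEGER PRIMARY KEY, product_code TEXT,
        product_name TEXT, category_name TEXT);
    """
)

conn.executemany(
    "INSERT INTO PRODUCT_MIRR VALUES (?, ?, ?)",
    [
        ("A1", "Savings Plus", "Deposits"),
        ("A2", "Fixed Term", "Deposits"),
        ("B1", "Home Loan", "Lending"),
    ],
)

conn.executescript(
    """
    -- Mirror -> normalized staging.
    INSERT INTO CATEGORY_STG (category_name)
        SELECT DISTINCT category_name FROM PRODUCT_MIRR
        ORDER BY category_name;
    INSERT INTO PRODUCT_STG
        SELECT m.product_code, m.product_name, c.category_id
        FROM PRODUCT_MIRR AS m
        JOIN CATEGORY_STG AS c USING (category_name);

    -- Normalized staging -> flattened dimension.
    INSERT INTO PRODUCT_DIM_DW (product_code, product_name, category_name)
        SELECT p.product_code, p.product_name, c.category_name
        FROM PRODUCT_STG AS p
        JOIN CATEGORY_STG AS c USING (category_id)
        ORDER BY p.product_code;
    """
)

category_count = conn.execute("SELECT COUNT(*) FROM CATEGORY_STG").fetchone()[0]
dim_rows = conn.execute(
    "SELECT product_code, category_name FROM PRODUCT_DIM_DW "
    "ORDER BY product_code"
).fetchall()
```

The repeated category names collapse to one row each in the _STG layer and are joined back in only when the _DW dimension is built.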


Solution:7

Personally, I don't go looking for trouble, in Kimball, or elsewhere.

What kind of "structure" are you looking for? What kind of "structure" do you feel is needed? What problems are you seeing from the lack of "structure" you have today?

I may be leaving you with the impression that I don't think much of Kimball. Not so - I haven't read Kimball. I just don't think much of changing things for no reason beyond fitting some pattern. Change to solve some real-world problem would be fine. For instance, if you find you're backing up staging tables because a lack of structure caused the staging and warehouse tables to be treated the same, then this would be a reason to change the structure. But if that's the sort of thing you had in mind, then you should edit your question to indicate it.

