Wednesday, 27 August 2014

Normalized Vs Dimensional Approach in DWH

The dimensional approach, made popular by in Ralph Kimball (website), states that the data warehouse should be modeled using a Dimensional Model (star schema or snowflake).  The normalized approach, also called the 3NF model, made popular by Bill Inmon (website), states that the data warehouse should be modeled using an E-R model/normalized model.

In a dimensional approach, data is partitioned into either “facts”, which are generally numeric transaction data, or “dimensions“, which are the reference information that gives context to the facts.  A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use.  Also, the retrieval of data from the data warehouse tends to operate very quickly.  The main disadvantage of the dimensional approach is that In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated.  Plus, if you are used to working with a normalized approach, it can take a while to fully understand the dimensional approach and to become efficient in building one.

In the normalized approach, the data in the data warehouse are stored following database normalization rules.  Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.).  The normalized structure divides data into entities, which creates several tables in a relational database.  When applied in large enterprises the result is dozens of tables that are linked together by a web of joins.  Furthermore, each of the created entities is converted into separate physical tables when the database is implemented.  The main advantage of this approach is that it is straightforward to add information into the database.  A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and then access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

Top-Down vs Bottom-Up Approach

Inmon’s top-down approach


Inmon defines data warehouse as a centralized repository for the entire enterprise. Data warehouse stores the ‘atomic’ data at the lowest level of detail. Dimensional data marts are created only after the complete data warehouse has been created. Thus, data warehouse is at the center of the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence.
Inmon defines the data warehouse in the following terms:
  1. Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together
  2. Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time
  3. Non-volatile: Data in the data warehouse is never over-written or deleted -- once committed, the data is static, read-only, and retained for future reporting
  4. Integrated: The database contains data from most or all of an organization's operational applications, and that this data is made consistent

Kimball’s bottom-up approach


Keeping in mind the most important business aspects or departments, data marts are created first. These provide a thin view into the organizational data, and as and when required these can be combined into a larger data warehouse. Kimball defines data warehouse as “A copy of transaction data specifically structured for query and analysis”.
Kimball’s data warehousing architecture is also known as Data Warehouse Bus (BUS). Dimensional modeling focuses on ease of end user accessibility and provides a high level of performance to the data warehouse.

Inmon vs. Kimball: Similar or different?



"You can catch all the minnows in the ocean and stack them together and they still do not make a whale." ~Inmon
“The data warehouse is nothing more than the union of all the data marts" ~Kimball


Pros and cons of both the approaches



Courtesy:Sansu George is a business analyst at ABIBA Systems
http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-is-suitable-for-your-data-warehouse

Difference Between an ODS and a Datawarehouse

Answer1:
An ODS is a database that is subject-oriented, integrated, volatile and current. It is usually used by business managers, analysts or customer service representatives to monitor, manage and improve daily business processes and customer service. An ODS is often loaded daily or multiple times a day with data that represents the current state of operational systems. 

An EDW is a database that is subject-oriented, integrated, non-volatile (read-only) and time-variant (historical). It is usually used by financial analysts for historical trend analysis reporting, data mining and other activities that need historical data. An EDW keeps growing as you add more historical snapshots, either daily, weekly or monthly. Because an EDW has historical data (and the ODS usually does not), some companies use the EDW as a hub for loading their data marts.


Answer2:
To a business user/information consumer, an ODS is best described as a means of moving  the reporting function from the transactional systems so that users can report on current activity without impacting the performance of the transactional system. That is why the concept of an ODS was created. It is not a duplicate/copy of the OLTP system, but is designed to readily load transactions (DB logs) as well as facilitate creation of reports. The ODS may carry (store) data a little longer than the OLTP system, but generally it is defined as current/short term data. An ODS is normally tied to one OLTP system, so in a larger organization you may see multiple ODS instances.

ODS system generally works as a complementary system to EDW to achieve near real time analysis of important business facts. e.g. by end of every six hours, for a online marketing portal, it might be important to calculate total sales $ irrespective to whatever views (geography, marketing channels etc) sales has been done.

The DW on the other hand is the longer term repository of transactional data, and is designed around subject areas, rather than transactional applications. It normally has multiple systems sending data to it, and some of those systems can be ODS. It is designed to store data at the lowest level of detail (atomic) available from the data sources. The DW (or EDW) is the principle resource for providing normalized data to the business intelligence (BI) layer, either indirectly through a data mart/application or (infrequently) through direct query. Whether the DW design is 3NF or dimensional (federated data marts) probably wouldn't be of much concern to the end user. They just want a conformed, confident source of information without having to go to multiple sources to get it.

Characteristics of a Data warehouse

Data Warehouse characteristics set forth by Inmon:

Subject Oriented:Suppose you need to focus on your company's sales.So you better learn about company's sales data.Using this warehouse you can figure out "Who was the best customer for this year" or "Which item has been sold mostly this year"..Over all,Data warehouse in this case is defined by the matter of Subject.In this case,it is sales.

Integrated:Data warehouse is built from different source systems.Data should be maintained in a consistent format by eliminating all the discrepancies.Then it can be said as Integrated.

Non-Volatile:Once the data entered to the warehouse cannot be modified.Because that's what here to analyze the purpose.

Time Integrated:Analysts need a huge amount of data to identify the data patterns,business relationships.So it has to focus on change over time.

In general, fast query performance with high data throughput is the key to a successful data warehouse.