Database and Warehouse Concepts

Sunday, 14 June 2015

In Depth:Unconnected LookUp in Informatica.

What is the absolute necessity of Unconnected lookup, if same functionality can be achieved by Connected Lookup Transformation?

The main advantage of using an unconnected lookup is the fact that you can use it in an expression/any other transformation like a “function” in other programming languages. The Lookup Cache is calculated only once as part of the session initialization and hence more efficient, since all subsequent look-ups will re-use the Cache.

a)When you are looking up against a Static dimension (or any table) that is rarely ever updated

The most common example for this is the Date Dimension. Usually loaded once when your Data Warehouse goes to Production and very rarely updated.What makes it even more appealing to use unconnected lookup for Date Dimension is the Date Dimension Role Playing.
Example, An online order can have Order Date, Ship Date, Cancelled Date, Recieved Date and so on and the same Date Dimension table plays multiple roles.
Without an unconnected Lookup, here’s how your mapping would look..Also note that the Lookup Is being done on the same table internally (DATE_DIM), but the cache is being calculated one for each lookup.

Using an unconnected Lookup For Date Dimension, this is how it would be transformed..

As you can see, this promotes greater reuse ,a less complex mapping and is more efficient becuase of lesser Caching.

b) When you are looking up against a Base/Conformed Dimension that is loaded before any of the Facts are loaded.

Another example (one which is not static Data) is looking up against any customer Master Data/Conformed Dimension. Your DataWarehouse can have many Confirmed Dimensions, which are loaded before any of the Fact Loads.
One such example is the Employee Dimension. You can lookup the employee key multiple times in a given mapping and this makes a great candidate for Unconnected Lookup.

c) The logic used in the Lookup-override is required at a lot of other places.
This is more of a good coding practise than an Informatica specific “tip”. If your lookup involves a lookup override and you calculate the actual fact using a formula instead of using a direct column value, it would make sense to get the result using an unconnected lookup and use it at multiple places using a :LKP expression.

One such example is when you want to Lookup Revenue based on the Status of the Order instead of a direct Lookup, and you have a look-up override like the one below.

1

2

3

4

5

6

Select order_id,

       (case when order_status = 'Pending' then revenue*0.7

            when order_status = 'COnfirmed' then revenue

            when order_status = 'Cancelled' then -1*revenue

        end) revenue           

  from order_revenue

So if your lookup data is constant during your load and you have a good case for reuse, Unconnected Lookup can be very useful, both in terms of maintainability and efficiency.

Courtesy:ETL Developer

Unconnected Lookup Scenario

Lets say you have 50 million records coming from the source out of which you only want to look up for some 1 million records based on some condition then at this kind of scenario go for unconnected lookup

In unconnected lookup's, its not necessary that we call it for every row. Because its unconnected, and not connected to the data flow pipeline, we can choose to call when it meets certain conditions.

This way we can use the unconnected lookup more efficiently

Active Lookup Transformation

Active Lookup Transformation Restrictions:

We cannot return multiple rows from an unconnected Lookup transformation
We cannot enable dynamic cache for a Active Lookup transformation.
Active Lookup Transformation that returns multiple rows cannot share a cache with a similar Passive Lookup Transformation that returns one matching row for each input row.

What is Degenerate Dimension ?

A Degenerate dimension is a Dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table.The data items that are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions. Degenerate Dimensions are used when fact tables represent transaction data.They can be used as primary key for the fact table but they cannot act as foreign keys.

Lookup Cache Types in Informatica

To boost the performance of Lookup Transformation , we mention to cache the lookup data , so it can directly look into cache file instead of connecting to source and then reading data from there for each row -

Lookup can generate a cache file which persists for the duration of the session, or even created as a permanent named cache. This means that the data required for the lookup is read from its source only once. Informatica creates an index on the lookup cache minimizing the processing time for calls to the cache.

Static Lookup :

In case of static cache ,the Integration Service does not update the cache while it processes the transformation. This is why it is called as Static. In Static Cache when the Lookup condition is true it return value from lookup table else returns Null or Default value. In Static Cache the important thing is that you cannot insert or update the cache.

Static Cache is normally used while referring to reference Data or when Data source is not a target for the mapping.

Example : Referring to Dept table to get Dept Name based on the DEPT no for Each employee.

Dynamic Lookup:

In Dynamic Cache we can insert or update rows in the cache when we pass the rows. Normally while using Target as the lookup source to identify if new record is already there in Target or not we use this Dynamic cache.When a new record comes and it is not present in Cache (means missing in Target as well) , then it will insert this new record in Cache as well . So the dynamic cache is synchronized with the target with each processed row.

Example : If there are 2 or more entries of the same customer on the same day in Source systems then while loading to the target we want the Lookup Cache to be refreshed dynamically(Insert/Update) and see the latest data in Cache.

Non Persistent Cache:

By default, the Informatica Server uses a non-persistent cache when you enable caching in a Lookup transformation. The Informatica Server deletes the cache files at the end of a session. The next time you run the session, the Informatica Server builds the memory cache from the database

Persistent Cache:

Persistent lookups are used when there is a need to save and reuse the existing cache files. Persistent cache is used when a lookup table does not change between session runs. The first time the Informatica Server runs a session using a persistent lookup cache; it saves the cache files to disk instead of deleting them on session completion. The next time the Informatica Server runs the session which calls the same persistent lookup, it builds the memory cache from the cache files, eliminating the time required to read the lookup table. If the lookup table changes occasionally, you can override session properties to recache the lookup from the database.

Advantages:

Informatica Server uses existing cache files for subsequent lookups, eliminating the time required to built the cache again
Cache file can be shared
Persistent lookup cache can be used for same multiple lookup call with in the same mapping / different mappings

Disadvantages:

Cache needs to be rebuilt if lookup table changes occasionally
Cache needs to be rebuilt if lookup transformation is changed or data movement code is changed or database connection information is changed
Lookup SQL override, has to be same in all lookups using same persistent cache file, else the mapping will fail

Note: In case there are multiple interdependent jobs in a workflow, which uses same persistent lookup cache, it is recommended to have dummy jobs for persistent cache file generation in the beginning of workflow. This eases the support and maintenance and jobs execution time. As a practice, please ensure to delete the existing cache file prior to re-building the cache. Informatica internally handles deletion of existing cache files; it is still preferred to have the UNIX script for deletion. (For e.g.: In case session for building lookup cache is disabled or is not executed, then the subsequent session will use the same cache file available on the server. The addition of UNIX scripts job will ensure files are deleted; this can be done through command task

Courtesy:Techtricks

Saturday, 16 May 2015

Separating the warehouse tables from the Relational database

Having them in different schema clearly identifies their purpose.

In most cases separating the warehouse tables from the relational database is considered a best practice. Some of the reasons for doing this are:

Conceptual separation. In short relational databases and data warehouses are different types of SQL server databases and should be treated that way.

Security Separation: While implementing security is possible at a schema level, it is much easier to set up security at the database level.

Backups: This might be the biggest reason for wanting to separate the data warehouse from another database. Typically a data warehouse is backed up using a Simple Recovery Model where the transaction logs are not backed up. This is done because data warehouses are very large, but aren’t written to very often. A relational database is typically backed up using the Full Recovery Model, where the transaction logs are maintained and backed up regularly. The database can be recreated using the last back up and the transaction logs.

Wednesday, 10 December 2014

COALESCE operator

Assuming that you are calculating total salary by summing up salary and comm columns in emp table.

SELECT EMPNO,ENAME,SAL,COMM,

SAL+COMM AS TOTAL_SAL,

SAL+COALESCE(COMM,0) AS TOTAL_SAL

FROM EMP

COALESCE function substitutes the value "0" when the comm is null.

It is similar to NVL fucntion.

LIKE operator

I need to display the employee details whose name contains exactly one 'A'

SELECT * FROM EMP WHERE ENAME LIKE '%A%' AND ENAME NOT LIKE '%A%A%'

NOT IN vs NOT EXISTS with NULL values

NOT IN and NOT EXISTS are not same

EXISTS vs IN operator

IN and EXISTS are same everywhere

Delete based on DATE conditions

Source:

I have a table which contain duplicates on empno and ename on different dates.I need to preserve oldest value and delete rest of them.

Query:

DELETE FROM DEL_EMP
WHERE (EMPNO,LOAD_DT)
NOT IN
(
SELECT EMPNO,MIN(LOAD_DT) FROM DEL_EMP GROUP BY EMPNO
)

Output:

Eliminating Duplicates from the table

Source:

ROWID is the pseudo column,which is unique for every row.Based on this rowid the record can be eliminated.

Query:

DELETE FROM DUP_DEL
WHERE ROWID
NOT IN
(
SELECT MAX(ROWID) FROM DUP_DEL GROUP BY EMPNO
);

This calculates the maximum ROWID and deletes the rows which are not in the maximum ROWID.

After Delete:

Friday, 5 September 2014

Factless Fact table

When a fact table does not contain any facts, it is called a factless fact table.

There are two types of factless fact tables: those that describe events, and those that describe conditions. Both may play important roles in your dimensional models.

Factless fact table for event or activity

When designing dimensional model, you often find that you want to track events or activities that occurs in your business process but you can’t find measures to track. In these situations, you can create a transaction-grained fact table that has no facts to describe that events or activities. Even though there are no facts storing in the fact table, the event can be counted to produce very meaningful process measurements.

Factless fact table for event or activity example

For example, you may want to track employee leaves. How often and why your employee leaves are very important for you to plan your daily activities and resources.

At the center of diagram below is the FACT_LEAVE table that has no facts at all. However theFACT_LEAVE table is used to measure employee leave event when it occurs.

Query to retrieve the leave details of an employee:

SELECT ENAME,COUNT(LEAVE_ID)

FROM FACT_LEAVE JOIN DIM_EMPLOYEE

ON FACT_LEAVE.EMPNO=DIM_EMPLOYEE.EMPNO

GROUP BY ENAME;

Factless fact table for condition, eligibility or coverage

Factless fact table can be also used in these situations:

Tracking salesperson assigned to each prospect or customer
Logging the eligibility of employees for a compensation program
Capturing the promotion campaigns that are active at specific times such as holidays.

Those examples above describe conditions, eligibility or coverage. The factless fact table can be used to model conditions, eligibility or coverage. Typically information is captured by this star will not be studied alone but used with other business processes to produce meaningful information.

Let’s take a look at the sale star below. By looking only at the star, we don’t know what product has promotion that did not sell.

Sales Star Schema

In order to track this kind of information, we can create a star that has factless fact table which is known as coverage table (according to Kimball).

Factless Fact Table – Example 2

In order to answer the question: what product that has promotion did not sell, we need to do as follows:

Look at the second star to find out products that have promotions.
Look at the first star to find out products that have promotion that sell.
The difference between is the list of products that have promotion but did not sell.

Factless fact table is crucial in many complex business processes. By applying concepts and techniques about factless fact table in this tutorial, you can design a dimensional model that has no clear facts to produce more meaningful information for your business processes.

Courtesy:zentut

Wednesday, 27 August 2014

Normalized Vs Dimensional Approach in DWH

The dimensional approach, made popular by in Ralph Kimball (website), states that the data warehouse should be modeled using a Dimensional Model (star schema or snowflake). The normalized approach, also called the 3NF model, made popular by Bill Inmon (website), states that the data warehouse should be modeled using an E-R model/normalized model.

In a dimensional approach, data is partitioned into either “facts”, which are generally numeric transaction data, or “dimensions“, which are the reference information that gives context to the facts. A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that In order to maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is complicated. Plus, if you are used to working with a normalized approach, it can take a while to fully understand the dimensional approach and to become efficient in building one.

In the normalized approach, the data in the data warehouse are stored following database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical tables when the database is implemented. The main advantage of this approach is that it is straightforward to add information into the database. A disadvantage of this approach is that, because of the number of tables involved, it can be difficult for users both to join data from different sources into meaningful information and then access the information without a precise understanding of the sources of data and of the data structure of the data warehouse.

Top-Down vs Bottom-Up Approach

Inmon’s top-down approach

Inmon defines data warehouse as a centralized repository for the entire enterprise. Data warehouse stores the ‘atomic’ data at the lowest level of detail. Dimensional data marts are created only after the complete data warehouse has been created. Thus, data warehouse is at the center of the Corporate Information Factory (CIF), which provides a logical framework for delivering business intelligence.

Inmon defines the data warehouse in the following terms:

Subject-oriented: The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together
Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time
Non-volatile: Data in the data warehouse is never over-written or deleted -- once committed, the data is static, read-only, and retained for future reporting
Integrated: The database contains data from most or all of an organization's operational applications, and that this data is made consistent

Kimball’s bottom-up approach

Keeping in mind the most important business aspects or departments, data marts are created first. These provide a thin view into the organizational data, and as and when required these can be combined into a larger data warehouse. Kimball defines data warehouse as “A copy of transaction data specifically structured for query and analysis”.

Kimball’s data warehousing architecture is also known as Data Warehouse Bus (BUS). Dimensional modeling focuses on ease of end user accessibility and provides a high level of performance to the data warehouse.

Inmon vs. Kimball: Similar or different?

"You can catch all the minnows in the ocean and stack them together and they still do not make a whale." ~Inmon

“The data warehouse is nothing more than the union of all the data marts" ~Kimball

Pros and cons of both the approaches

Courtesy:Sansu George is a business analyst at ABIBA Systems

http://searchbusinessintelligence.techtarget.in/tip/Inmon-vs-Kimball-Which-approach-is-suitable-for-your-data-warehouse

Difference Between an ODS and a Datawarehouse

Answer1:
An ODS is a database that is subject-oriented, integrated, volatile and current. It is usually used by business managers, analysts or customer service representatives to monitor, manage and improve daily business processes and customer service. An ODS is often loaded daily or multiple times a day with data that represents the current state of operational systems.

An EDW is a database that is subject-oriented, integrated, non-volatile (read-only) and time-variant (historical). It is usually used by financial analysts for historical trend analysis reporting, data mining and other activities that need historical data. An EDW keeps growing as you add more historical snapshots, either daily, weekly or monthly. Because an EDW has historical data (and the ODS usually does not), some companies use the EDW as a hub for loading their data marts.

Answer2:
To a business user/information consumer, an ODS is best described as a means of moving the reporting function from the transactional systems so that users can report on current activity without impacting the performance of the transactional system. That is why the concept of an ODS was created. It is not a duplicate/copy of the OLTP system, but is designed to readily load transactions (DB logs) as well as facilitate creation of reports. The ODS may carry (store) data a little longer than the OLTP system, but generally it is defined as current/short term data. An ODS is normally tied to one OLTP system, so in a larger organization you may see multiple ODS instances.

ODS system generally works as a complementary system to EDW to achieve near real time analysis of important business facts. e.g. by end of every six hours, for a online marketing portal, it might be important to calculate total sales $ irrespective to whatever views (geography, marketing channels etc) sales has been done.

The DW on the other hand is the longer term repository of transactional data, and is designed around subject areas, rather than transactional applications. It normally has multiple systems sending data to it, and some of those systems can be ODS. It is designed to store data at the lowest level of detail (atomic) available from the data sources. The DW (or EDW) is the principle resource for providing normalized data to the business intelligence (BI) layer, either indirectly through a data mart/application or (infrequently) through direct query. Whether the DW design is 3NF or dimensional (federated data marts) probably wouldn't be of much concern to the end user. They just want a conformed, confident source of information without having to go to multiple sources to get it.

Characteristics of a Data warehouse

Data Warehouse characteristics set forth by Inmon:

Subject Oriented:Suppose you need to focus on your company's sales.So you better learn about company's sales data.Using this warehouse you can figure out "Who was the best customer for this year" or "Which item has been sold mostly this year"..Over all,Data warehouse in this case is defined by the matter of Subject.In this case,it is sales.

Integrated:Data warehouse is built from different source systems.Data should be maintained in a consistent format by eliminating all the discrepancies.Then it can be said as Integrated.

Non-Volatile:Once the data entered to the warehouse cannot be modified.Because that's what here to analyze the purpose.

Time Integrated:Analysts need a huge amount of data to identify the data patterns,business relationships.So it has to focus on change over time.

In general, fast query performance with high data throughput is the key to a successful data warehouse.

Pages