Database and Warehouse Concepts

Sunday, 14 June 2015

In Depth:Unconnected LookUp in Informatica.

What is the absolute necessity of Unconnected lookup, if same functionality can be achieved by Connected Lookup Transformation?

The main advantage of using an unconnected lookup is the fact that you can use it in an expression/any other transformation like a “function” in other programming languages. The Lookup Cache is calculated only once as part of the session initialization and hence more efficient, since all subsequent look-ups will re-use the Cache.

a)When you are looking up against a Static dimension (or any table) that is rarely ever updated

The most common example for this is the Date Dimension. Usually loaded once when your Data Warehouse goes to Production and very rarely updated.What makes it even more appealing to use unconnected lookup for Date Dimension is the Date Dimension Role Playing.
Example, An online order can have Order Date, Ship Date, Cancelled Date, Recieved Date and so on and the same Date Dimension table plays multiple roles.
Without an unconnected Lookup, here’s how your mapping would look..Also note that the Lookup Is being done on the same table internally (DATE_DIM), but the cache is being calculated one for each lookup.

Using an unconnected Lookup For Date Dimension, this is how it would be transformed..

As you can see, this promotes greater reuse ,a less complex mapping and is more efficient becuase of lesser Caching.

b) When you are looking up against a Base/Conformed Dimension that is loaded before any of the Facts are loaded.

Another example (one which is not static Data) is looking up against any customer Master Data/Conformed Dimension. Your DataWarehouse can have many Confirmed Dimensions, which are loaded before any of the Fact Loads.
One such example is the Employee Dimension. You can lookup the employee key multiple times in a given mapping and this makes a great candidate for Unconnected Lookup.

c) The logic used in the Lookup-override is required at a lot of other places.
This is more of a good coding practise than an Informatica specific “tip”. If your lookup involves a lookup override and you calculate the actual fact using a formula instead of using a direct column value, it would make sense to get the result using an unconnected lookup and use it at multiple places using a :LKP expression.

One such example is when you want to Lookup Revenue based on the Status of the Order instead of a direct Lookup, and you have a look-up override like the one below.

1

2

3

4

5

6

Select order_id,

       (case when order_status = 'Pending' then revenue*0.7

            when order_status = 'COnfirmed' then revenue

            when order_status = 'Cancelled' then -1*revenue

        end) revenue           

  from order_revenue

So if your lookup data is constant during your load and you have a good case for reuse, Unconnected Lookup can be very useful, both in terms of maintainability and efficiency.

Courtesy:ETL Developer

Unconnected Lookup Scenario

Lets say you have 50 million records coming from the source out of which you only want to look up for some 1 million records based on some condition then at this kind of scenario go for unconnected lookup

In unconnected lookup's, its not necessary that we call it for every row. Because its unconnected, and not connected to the data flow pipeline, we can choose to call when it meets certain conditions.

This way we can use the unconnected lookup more efficiently

Active Lookup Transformation

Active Lookup Transformation Restrictions:

We cannot return multiple rows from an unconnected Lookup transformation
We cannot enable dynamic cache for a Active Lookup transformation.
Active Lookup Transformation that returns multiple rows cannot share a cache with a similar Passive Lookup Transformation that returns one matching row for each input row.

What is Degenerate Dimension ?

A Degenerate dimension is a Dimension which has only a single attribute. This dimension is typically represented as a single field in a fact table.The data items that are not facts and data items that do not fit into the existing dimensions are termed as Degenerate Dimensions. Degenerate Dimensions are used when fact tables represent transaction data.They can be used as primary key for the fact table but they cannot act as foreign keys.

Lookup Cache Types in Informatica

To boost the performance of Lookup Transformation , we mention to cache the lookup data , so it can directly look into cache file instead of connecting to source and then reading data from there for each row -

Lookup can generate a cache file which persists for the duration of the session, or even created as a permanent named cache. This means that the data required for the lookup is read from its source only once. Informatica creates an index on the lookup cache minimizing the processing time for calls to the cache.

Static Lookup :

In case of static cache ,the Integration Service does not update the cache while it processes the transformation. This is why it is called as Static. In Static Cache when the Lookup condition is true it return value from lookup table else returns Null or Default value. In Static Cache the important thing is that you cannot insert or update the cache.

Static Cache is normally used while referring to reference Data or when Data source is not a target for the mapping.

Example : Referring to Dept table to get Dept Name based on the DEPT no for Each employee.

Dynamic Lookup:

In Dynamic Cache we can insert or update rows in the cache when we pass the rows. Normally while using Target as the lookup source to identify if new record is already there in Target or not we use this Dynamic cache.When a new record comes and it is not present in Cache (means missing in Target as well) , then it will insert this new record in Cache as well . So the dynamic cache is synchronized with the target with each processed row.

Example : If there are 2 or more entries of the same customer on the same day in Source systems then while loading to the target we want the Lookup Cache to be refreshed dynamically(Insert/Update) and see the latest data in Cache.

Non Persistent Cache:

By default, the Informatica Server uses a non-persistent cache when you enable caching in a Lookup transformation. The Informatica Server deletes the cache files at the end of a session. The next time you run the session, the Informatica Server builds the memory cache from the database

Persistent Cache:

Persistent lookups are used when there is a need to save and reuse the existing cache files. Persistent cache is used when a lookup table does not change between session runs. The first time the Informatica Server runs a session using a persistent lookup cache; it saves the cache files to disk instead of deleting them on session completion. The next time the Informatica Server runs the session which calls the same persistent lookup, it builds the memory cache from the cache files, eliminating the time required to read the lookup table. If the lookup table changes occasionally, you can override session properties to recache the lookup from the database.

Advantages:

Informatica Server uses existing cache files for subsequent lookups, eliminating the time required to built the cache again
Cache file can be shared
Persistent lookup cache can be used for same multiple lookup call with in the same mapping / different mappings

Disadvantages:

Cache needs to be rebuilt if lookup table changes occasionally
Cache needs to be rebuilt if lookup transformation is changed or data movement code is changed or database connection information is changed
Lookup SQL override, has to be same in all lookups using same persistent cache file, else the mapping will fail

Note: In case there are multiple interdependent jobs in a workflow, which uses same persistent lookup cache, it is recommended to have dummy jobs for persistent cache file generation in the beginning of workflow. This eases the support and maintenance and jobs execution time. As a practice, please ensure to delete the existing cache file prior to re-building the cache. Informatica internally handles deletion of existing cache files; it is still preferred to have the UNIX script for deletion. (For e.g.: In case session for building lookup cache is disabled or is not executed, then the subsequent session will use the same cache file available on the server. The addition of UNIX scripts job will ensure files are deleted; this can be done through command task

Courtesy:Techtricks

Saturday, 16 May 2015

Separating the warehouse tables from the Relational database

Having them in different schema clearly identifies their purpose.

In most cases separating the warehouse tables from the relational database is considered a best practice. Some of the reasons for doing this are:

Conceptual separation. In short relational databases and data warehouses are different types of SQL server databases and should be treated that way.

Security Separation: While implementing security is possible at a schema level, it is much easier to set up security at the database level.

Backups: This might be the biggest reason for wanting to separate the data warehouse from another database. Typically a data warehouse is backed up using a Simple Recovery Model where the transaction logs are not backed up. This is done because data warehouses are very large, but aren’t written to very often. A relational database is typically backed up using the Full Recovery Model, where the transaction logs are maintained and backed up regularly. The database can be recreated using the last back up and the transaction logs.

Wednesday, 10 December 2014

COALESCE operator

Assuming that you are calculating total salary by summing up salary and comm columns in emp table.

SELECT EMPNO,ENAME,SAL,COMM,

SAL+COMM AS TOTAL_SAL,

SAL+COALESCE(COMM,0) AS TOTAL_SAL

FROM EMP

COALESCE function substitutes the value "0" when the comm is null.

It is similar to NVL fucntion.

LIKE operator

I need to display the employee details whose name contains exactly one 'A'

SELECT * FROM EMP WHERE ENAME LIKE '%A%' AND ENAME NOT LIKE '%A%A%'

NOT IN vs NOT EXISTS with NULL values

NOT IN and NOT EXISTS are not same

EXISTS vs IN operator

IN and EXISTS are same everywhere

Delete based on DATE conditions

Source:

I have a table which contain duplicates on empno and ename on different dates.I need to preserve oldest value and delete rest of them.

Query:

DELETE FROM DEL_EMP
WHERE (EMPNO,LOAD_DT)
NOT IN
(
SELECT EMPNO,MIN(LOAD_DT) FROM DEL_EMP GROUP BY EMPNO
)

Output:

Eliminating Duplicates from the table

Source:

ROWID is the pseudo column,which is unique for every row.Based on this rowid the record can be eliminated.

Query:

DELETE FROM DUP_DEL
WHERE ROWID
NOT IN
(
SELECT MAX(ROWID) FROM DUP_DEL GROUP BY EMPNO
);

This calculates the maximum ROWID and deletes the rows which are not in the maximum ROWID.

After Delete:

Friday, 5 September 2014

Factless Fact table

When a fact table does not contain any facts, it is called a factless fact table.

There are two types of factless fact tables: those that describe events, and those that describe conditions. Both may play important roles in your dimensional models.

Factless fact table for event or activity

When designing dimensional model, you often find that you want to track events or activities that occurs in your business process but you can’t find measures to track. In these situations, you can create a transaction-grained fact table that has no facts to describe that events or activities. Even though there are no facts storing in the fact table, the event can be counted to produce very meaningful process measurements.

Factless fact table for event or activity example

For example, you may want to track employee leaves. How often and why your employee leaves are very important for you to plan your daily activities and resources.

At the center of diagram below is the FACT_LEAVE table that has no facts at all. However theFACT_LEAVE table is used to measure employee leave event when it occurs.

Query to retrieve the leave details of an employee:

SELECT ENAME,COUNT(LEAVE_ID)

FROM FACT_LEAVE JOIN DIM_EMPLOYEE

ON FACT_LEAVE.EMPNO=DIM_EMPLOYEE.EMPNO

GROUP BY ENAME;

Factless fact table for condition, eligibility or coverage

Factless fact table can be also used in these situations:

Tracking salesperson assigned to each prospect or customer
Logging the eligibility of employees for a compensation program
Capturing the promotion campaigns that are active at specific times such as holidays.

Those examples above describe conditions, eligibility or coverage. The factless fact table can be used to model conditions, eligibility or coverage. Typically information is captured by this star will not be studied alone but used with other business processes to produce meaningful information.

Let’s take a look at the sale star below. By looking only at the star, we don’t know what product has promotion that did not sell.

Sales Star Schema

In order to track this kind of information, we can create a star that has factless fact table which is known as coverage table (according to Kimball).

Factless Fact Table – Example 2

In order to answer the question: what product that has promotion did not sell, we need to do as follows:

Look at the second star to find out products that have promotions.
Look at the first star to find out products that have promotion that sell.
The difference between is the list of products that have promotion but did not sell.

Factless fact table is crucial in many complex business processes. By applying concepts and techniques about factless fact table in this tutorial, you can design a dimensional model that has no clear facts to produce more meaningful information for your business processes.

Courtesy:zentut

Pages