
Wednesday 29 July 2015

Performance Tuning steps for Informatica Mapping

1. Reduce the number of transformations. There is always overhead involved in moving data between transformations.
2. Consider allocating more shared memory when a mapping has a large number of transformations. Session shared memory between 12 MB and 40 MB should suffice.
3. Calculate once, use many times.

  • Avoid calculating or testing the same value over and over. Calculate it once in an expression, and set a True/False flag.
  • Within an expression, use variable ports to calculate a value that can be used multiple times within that transformation.
  • Delete unnecessary links between transformations to minimize the amount of data moved, particularly in the Source Qualifier.
  • This is also helpful for maintenance. If a transformation needs to be reconnected, it is best to only have necessary ports set as input and output to reconnect.
  • In Lookup transformations, change unused ports to be neither input nor output. This makes the transformations cleaner looking. It also keeps the generated SQL override as small as possible, which cuts down on the amount of cache necessary and thereby improves performance.
  • The engine automatically converts compatible types.
  • Sometimes data conversion is excessive. Data types are automatically converted when types are different between connected ports. Minimize data type changes between transformations by planning data flow prior to developing the mapping.
  • Plan for reusable transformations upfront.
  • Use variables. Use both mapping variables and variable ports. Variable ports are especially beneficial when they can be used to calculate a complex expression or perform an unconnected lookup call only once instead of multiple times.
  • Use mapplets to encapsulate multiple reusable transformations.
  • Use mapplets to leverage the work of critical developers and minimize mistakes when performing similar functions.
  • Reduce the number of non-essential records that are passed through the entire mapping.
  • Use active transformations that reduce the number of records as early in the mapping as possible (e.g., place Filters and Aggregators as close to the source as possible).
  • Select the appropriate driving/master table when using joins. The table with the smaller number of rows should be the driving/master table for a faster join.
  • Redesign mappings to utilize one Source Qualifier to populate multiple targets. This way the server reads the source only once. If you have different Source Qualifiers for the same source (e.g., one for delete and one for update/insert), the server reads the source once per Source Qualifier.
  • Remove or reduce field-level stored procedures.
  • If you use field-level stored procedures, the PowerCenter server has to make a call to that stored procedure for every row, slowing performance.

4. Only connect what is used.
5. Watch the data types.
6. Facilitate reuse.
7. Only manipulate data that needs to be moved and transformed.
8. Utilize single-pass reads.
9. Sort the input data before passing it to Joiner and Aggregator transformations.
10. In Lookups, use a customized (override) query instead of the default query. (Append '--' to comment out the lookup's default ORDER BY clause; see the sketch after this list.)
11. Avoid selecting unnecessary columns/ports in the SQL query.
12. Filter unnecessary data as close to the Source Qualifier as possible. (For relational sources, include the filter condition in the SQL query.)
13. In a Joiner, make the source with the smaller volume of data the master.
14. In case of mapping partitioning, place the Aggregator transformation before the partition point.
15. Use Router instead of having multiple Filter transformations.
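A minimal sketch of the lookup override mentioned in point 10, assuming a hypothetical CUSTOMER_DIM lookup with CUST_ID as the lookup condition port: only the required ports are selected, the ORDER BY covers the condition port, and the trailing '--' comments out the ORDER BY clause that the Integration Service appends by default.

-- Hypothetical lookup SQL override (table and column names are illustrative)
-- The ORDER BY must list the lookup condition port(s); the trailing '--'
-- comments out the default ORDER BY the Integration Service adds automatically
SELECT CUST_ID, CUST_NAME
FROM   CUSTOMER_DIM
WHERE  CUST_STATUS = 'ACTIVE'
ORDER BY CUST_ID --

Restricting the query to the needed ports and rows also keeps the lookup cache as small as possible (point 11).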

Wednesday 8 July 2015

Sorted Input: Aggregator

Aggregator transformations often slow performance because they must group data before processing it. Aggregator transformations need additional memory to hold intermediate group results.

Using Sorted Input
To increase session performance, sort data for the Aggregator transformation. Use the Sorted Input option to sort data.

The Sorted Input option decreases the use of aggregate caches. When you use the Sorted Input option, the Integration Service assumes all data is sorted by group. As the Integration Service reads rows for a group, it performs aggregate calculations. When necessary, it stores group information in memory.

The Sorted Input option reduces the amount of data cached during the session and improves performance. Use this option with the Source Qualifier Number of Sorted Ports option or a Sorter transformation to pass sorted data to the Aggregator transformation.
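As a sketch, assume an Aggregator that groups by CUST_ID reading from a hypothetical ORDERS source: the Source Qualifier SQL override can do the sorting so that Sorted Input can be enabled on the Aggregator.

-- Illustrative Source Qualifier override; table and column names are assumptions
-- The ORDER BY lists the Aggregator's group-by ports, in the same order
SELECT CUST_ID, ORDER_DATE, ORDER_AMT
FROM   ORDERS
ORDER BY CUST_ID

Alternatively, place a Sorter transformation sorting on the group-by ports immediately before the Aggregator.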

Sunday 5 July 2015

Parallel/Concurrent execution in workflows

I have a project requirement which goes like this:

Table_stage has around 20 million rows.
It has to load the data into a Temp table in batches of 1 million rows, after doing transformations.
The Temp table will in turn load each 1 million rows into one Main table and then get truncated.

So the flow is:

Table_stage -> Temp -> Main, repeated for each batch of 1 million rows.

One of the approaches would be:

Concurrent execution of workflows is available from version 8.6. You have to pass a parameter to the Source Qualifier query so that it changes for each instance of the workflow (it is actually one workflow, run 20 times in parallel at the same time). Enable the concurrent execution option in the workflow properties, create 20 workflow instances in the instance list, and create the respective parameter files (each with a different parameter value to pick its batch of records), so that every instance picks its own 1 million rows, as sketched below.
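A minimal sketch of such a parameterized Source Qualifier query, assuming the staging table has a numeric sequence column (called ROW_SEQ here for illustration) and that $$BATCH_START and $$BATCH_END are mapping parameters fed from each instance's parameter file:

-- Each concurrent workflow instance reads only its own 1-million-row slice;
-- ROW_SEQ, $$BATCH_START and $$BATCH_END are illustrative names
SELECT *
FROM   TABLE_STAGE
WHERE  ROW_SEQ BETWEEN $$BATCH_START AND $$BATCH_END

Instance 1's parameter file would set $$BATCH_START=1 and $$BATCH_END=1000000, instance 2's would set 1000001 and 2000000, and so on for all 20 instances.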

Saturday 4 July 2015

Domain, Node, Master Gateway



A Domain in Informatica is the unified administrative unit of the Informatica server and sits at the head of the whole installation.

When you install Informatica (talking about the server, not the client) on your machine (in practice, the server machine), the first thing to be created is the Domain, where you give the Domain name and related details. It is an umbrella under which we have all the other essential parts of Informatica, such as nodes, the Integration Service, the Repository Service, etc.

Suppose our server machine has a 1 TB hard disk and 250 GB of RAM; physically, that machine is your Domain. Logically, the Domain runs on this entire machine.

The next thing is the Node. It is a logical representation/partition of the physical machine available (the Domain). During installation you need to create a node, after you have configured the Domain with its name, machine (IP), host address and so on. You can create as many nodes as you want.

If you create 2 nodes, each node shares 500 GB of hard disk and 125 GB of RAM; i.e., the resources allocated to the Domain are distributed equally among the nodes created.

As I said, the Domain is the entire server machine and the node(s) are the partition(s) of the Domain. So all the services, like the IS and RS, run on the server (Domain) and can be accessed by the clients through a server-client structure (by giving the domain name/repository name/logon credentials).

In the back end, upon credential verification, the client sends the request to the server (Domain) via the TCP/IP protocol. The request initially hits one particular node on the server machine: that is the gateway node. This node acts as an intermediary between client and server; it receives all the requests from the clients and returns all the answers from the server.

i.e., Client <---> Master Gateway node <---> Server services (IS and RS)
                   |--------------------- SERVER ---------------------|

Also, if we have only one node in our domain, that node will be the master gateway node and will also be responsible for running all the server services (IS and RS).

Thursday 18 June 2015

Tracing Levels


Workflow and Task Status





Transformations that support Sorted Input

1) Aggregator
2) Joiner
3) Lookup

Stop vs Abort

Stop:
- The Stop command immediately stops the reader process; rows already read continue to be processed and written.
- It does not have a timeout period.

Abort:

- It gives the Informatica server a timeout period of 60 seconds to finish processing.
- If processing does not complete within that time, it kills the DTM process.

List of output files during a session run

1. Session log
2. Bad file (reject file)
3. Error log
4. Workflow log 

Mapping, Session, Worklet, Workflow, Mapplet

Mapping : Represents the flow of data from source to destination

Session : A set of instructions that describes the data movement from source to destination

Worklet : A reusable object that groups a set of tasks

Workflow : A set of instructions that specifies how and when Informatica should execute the tasks

Mapplet : A reusable object used to create and configure a group of transformations

Wednesday 17 June 2015

Load Duplicates and Unique values in separate targets


If the source contains duplicates, then we need to separate the duplicate records and the unique records into different targets, as shown below.

Source:
col1  col2  col3
a     b     c
a     b     c
d     e     f
m     n     o
m     n     o

TGT_DUP:
col1  col2  col3
a     b     c
m     n     o

TGT_UNIQUE:
col1  col2  col3
d     e     f
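For reference, the intended split expressed in SQL (in the mapping this is typically done by counting rows per key, for example with a Sorter followed by an Aggregator, and then routing on the count; SRC is a placeholder for the source table):

-- Rows that occur more than once go to TGT_DUP
SELECT col1, col2, col3
FROM   SRC
GROUP BY col1, col2, col3
HAVING COUNT(*) > 1;

-- Rows that occur exactly once go to TGT_UNIQUE
SELECT col1, col2, col3
FROM   SRC
GROUP BY col1, col2, col3
HAVING COUNT(*) = 1;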


Sunday 14 June 2015

Target Update Override

One can perform updates on the target in a mapping using two approaches:

  1. Using an UPDATE STRATEGY transformation
  2. Configuring the session and setting the property to "Treat all rows as UPDATE"

But both of these approaches work only when a primary key is defined on the target table. The session will fail when no key is defined in the target.

Alternatively, if you want to perform target updates with conditions on columns that are not part of the primary key of the target table, it will not be feasible without changing the keys on the target table; i.e., the primary key is defined on COLUMN_A in TABLE_A and you want to update TABLE_A with conditions on (or including) COLUMN_B.


In such cases the UPDATE_OVERRIDE property of TARGET Object can be useful.

  • How to access the UPDATE_OVERRIDE property?

  1. Go to the Mapping
  2. Double-click the concerned TARGET, or edit the target
  3. Click on the Properties tab
  4. The second transformation attribute, Update Override, is the property we are looking for

  • Syntax for the UPDATE_OVERRIDE SQL:

UPDATE  <TARGET TABLE>
SET     <COLUMN NAME to be updated> = :TU.<TARGET COLUMN PORT NAME (as in Designer)>
        [, other columns to be updated]
WHERE   <COLUMN NAME to be treated as KEY> = :TU.<corresponding TARGET COLUMN PORT NAME (as in Designer)>
        [AND other conditions]

  • Example:

UPDATE  EMPL_POST_HIST
SET     POST = :TU.POST,
        UPDATE_DATE = :TU.UPDATE_DATE
WHERE   EMPL = :TU.EMPL


Following is an example showing the problem statement and step-by-step instructions for two approaches to the suggested solution.

Problem Scenario:

  1. The target table has no primary key defined.


  2. The Informatica target object has no keys - since the database table has no keys, the imported object in Informatica will not have any key in the target definition.



  3. Original mapping: the mapping created using this very target definition.



  4. Defining the update logic: configuring the session for UPDATE logic.
  5. Configure the session to treat all source rows as "UPDATE".



  6. Configure the session target property to "Update as Update" - optional.




  7. The error you will get while executing - the session will fail.




Solution:


Approach 1: Using Update in the Target

  1. Go to the Properties tab of the target in the mapping:



  2. Edit the "Update Override" attribute - generate the SQL and add a WHERE clause with the column name on which you want to update:




  3. Please note that you need to remove the column used in the WHERE clause from the SET clause list (see the sketch after this list).
  4. Go to the Workflow Manager, refresh the changes in the mapping and save them.
  5. Execute the workflow.
  6. Now it has successfully updated the rows:
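Continuing the EMPL_POST_HIST example above as an illustrative sketch (the actual table and columns in your mapping may differ): with no keys in the target definition, the generated SQL puts every column in the SET list and has no usable WHERE clause, so it is edited to move the chosen "key" column into the WHERE clause and drop it from SET.

-- Generated SQL (illustrative): no key defined, so no usable WHERE clause
UPDATE EMPL_POST_HIST
SET    EMPL = :TU.EMPL,
       POST = :TU.POST,
       UPDATE_DATE = :TU.UPDATE_DATE

-- Edited override: EMPL moved out of SET and used as the update condition
UPDATE EMPL_POST_HIST
SET    POST = :TU.POST,
       UPDATE_DATE = :TU.UPDATE_DATE
WHERE  EMPL = :TU.EMPL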



Approach 2: Forcing Keys in the Target object definition in designer

  1. Edit the target in the Target Designer:




  2. Define your primary key column. Please note that in this case the database table need not have keys defined:



  3. Save the changes.
  4. Execute the workflow.
  5. The workflow will update the records in the target table:

PMCMD Command Usage in Informatica



Informatica provides four built-in command line programs, or utilities, to interact with Informatica features. They are:
  • infacmd
  • infasetup
  • pmcmd
  • pmrep
This article covers only the pmcmd command. pmcmd is a command line utility provided by Informatica to perform the following tasks:
  • Start workflows.
  • Start workflow from a specific task.
  • Stop and abort workflows and sessions.
  • Schedule the workflows.
How to use PMCMD Command in Informatica

1. Scheduling the workflow 

The pmcmd command syntax for scheduling the workflow is shown below: 
pmcmd scheduleworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

You cannot specify the scheduling options here. This command just schedules the workflow for the next run. 

2. Start workflow 

The following pmcmd command starts the specified workflow: 
pmcmd startworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

3. Stop workflow 

The pmcmd command to stop the Informatica workflow is shown below:

pmcmd stopworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

4. Start workflow from a task 

You can start the workflow from a specified task. This is shown below: 
pmcmd starttask -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name -startfrom task-name

5. Stopping a task. 

The following pmcmd command stops the specified task instance: 
pmcmd stoptask -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name task-name
Courtesy: folkstalk

6. Aborting workflow and task. 

The following pmcmd commands are used to abort workflow and task in a workflow: 
pmcmd abortworkflow -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name

pmcmd aborttask -service informatica-integration-Service -d domain-name -u user-name -p password -f folder-name -w workflow-name task-name

Informatica Workflow Successful: No Data in Target!

Stop On Errors:
Indicates how many non-fatal errors the Integration Service can encounter before it stops the session. Non-fatal errors include reader, writer, and DTM errors. Enter the number of non-fatal errors you want to allow before stopping the session. The Integration Service maintains an independent error count for each source, target, and transformation. If you specify 0, non-fatal errors do not cause the session to stop.
Optionally use the $PMSessionErrorThreshold service variable to stop on the configured number of errors for the Integration Service.
With the default of 0, this behaves like the infamous "WHEN OTHERS THEN NULL" in Oracle: errors are swallowed silently and the workflow still shows as successful even though no data reached the target.


The solution to this problem in Informatica is to set a limit on the number of allowed errors for a given session using one of the following methods:
a) Have "1" in your default session config: fail the session on the first non-fatal error.
b) Override the session configuration details and set "Stop On Errors" to 1 (or another fixed number).
c) Use the $PMSessionErrorThreshold variable and set it at the Integration Service level. You can always override the variable in the parameter file.

In Depth: Unconnected Lookup in Informatica

What is the absolute necessity of an unconnected Lookup, if the same functionality can be achieved by a connected Lookup transformation?

The main advantage of using an unconnected Lookup is that you can call it from an expression (or any other transformation) like a "function" in other programming languages. The lookup cache is calculated only once, as part of session initialization, and is hence more efficient, since all subsequent lookups reuse the cache.

a) When you are looking up against a static dimension (or any table) that is rarely, if ever, updated.
The most common example is the Date Dimension, usually loaded once when your data warehouse goes to production and very rarely updated. What makes the unconnected lookup even more appealing for the Date Dimension is date dimension role playing.
For example, an online order can have an Order Date, Ship Date, Cancelled Date, Received Date and so on, and the same Date Dimension table plays all of these roles.
Without an unconnected lookup, the mapping needs a separate connected Lookup transformation for each role; the lookup is done on the same table (DATE_DIM) every time, but the cache is calculated once for each lookup.
With an unconnected lookup for the Date Dimension, a single lookup is defined and called from an expression for each role.

As you can see, this promotes greater reuse, a less complex mapping, and is more efficient because of less caching.
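As a sketch (DATE_DIM column names are assumed), the single unconnected lookup would be built on a query like the one below, and each role port - Order Date key, Ship Date key, Cancelled Date key, Received Date key - would fetch its surrogate key by calling the same lookup through a :LKP expression, so the DATE_DIM cache is built only once.

-- Illustrative query behind the shared, unconnected DATE_DIM lookup
SELECT DATE_KEY, CALENDAR_DATE
FROM   DATE_DIM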

b) When you are looking up against a base/conformed dimension that is loaded before any of the facts are loaded.
Another example (one where the data is not static) is looking up against customer master data or a conformed dimension. Your data warehouse can have many conformed dimensions, which are loaded before any of the fact loads.
One such example is the Employee Dimension. You may look up the employee key multiple times in a given mapping, which makes it a great candidate for an unconnected lookup.

c) When the logic used in the lookup override is required in a lot of other places.
This is more a good coding practice than an Informatica-specific "tip". If your lookup involves a lookup override and you calculate the actual fact using a formula instead of a direct column value, it makes sense to get the result using an unconnected lookup and use it in multiple places via a :LKP expression.
One such example is when you want to look up revenue based on the status of the order instead of doing a direct lookup, and you have a lookup override like the one below.
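A sketch of the kind of lookup override meant here (table and column names are assumptions): the revenue figure is derived from the order status inside the lookup itself, so the formula lives in one place and every mapping that needs it fetches the result with a :LKP call.

-- Illustrative lookup override: revenue depends on the status of the order
SELECT ORDER_ID,
       CASE WHEN ORDER_STATUS = 'CANCELLED' THEN 0
            ELSE ORDER_AMT - DISCOUNT_AMT
       END AS REVENUE
FROM   ORDERS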
So if your lookup data is constant during your load and you have a good case for reuse, an unconnected Lookup can be very useful, both in terms of maintainability and efficiency.

Courtesy: ETL Developer