Monday 29 June 2015

Pivoting a Result Set into One Row – Rows to Columns

Problem
You wish to take values from groups of rows and turn those values into columns in a single row per group. For example, you have a result set displaying the number of employees in each department:

DEPTNO  CNT
------  ---
    10    3
    20    5
    30    6

You would like to reformat the output such that the result set looks as follows:

DEPTNO_10  DEPTNO_20  DEPTNO_30
---------  ---------  ---------
        3          5          6
Solution
Transpose the result set using a CASE expression and the aggregate function SUM:

select sum(case when deptno=10 then 1 else 0 end) as deptno_10,
       sum(case when deptno=20 then 1 else 0 end) as deptno_20,
       sum(case when deptno=30 then 1 else 0 end) as deptno_30
  from emp
Discussion
This example is an excellent introduction to pivoting. The concept is simple: for each row returned by the unpivoted query, use a CASE expression to separate the rows into columns. Then, because this particular problem is to count the number of employees per department, use the aggregate function SUM to count the occurrences of each DEPTNO. If you’re having trouble understanding exactly how this works, execute the query without the aggregate function SUM and include DEPTNO for readability:
select deptno,
       case when deptno=10 then 1 else 0 end as deptno_10,
       case when deptno=20 then 1 else 0 end as deptno_20,
       case when deptno=30 then 1 else 0 end as deptno_30
  from emp
 order by 1


DEPTNO  DEPTNO_10  DEPTNO_20  DEPTNO_30
------  ---------  ---------  ---------
    10          1          0          0
    10          1          0          0
    10          1          0          0
    20          0          1          0
    20          0          1          0
    20          0          1          0
    20          0          1          0
    20          0          1          0
    30          0          0          1
    30          0          0          1
    30          0          0          1
    30          0          0          1
    30          0          0          1
    30          0          0          1
You can think of each CASE expression as a flag to determine which DEPTNO a row belongs to. At this point, the “rows to columns” transformation is already done; the next step is to simply sum the values returned by DEPTNO_10, DEPTNO_20, and DEPTNO_30, and then to group by DEPTNO. Following are the results:
select deptno,
       sum(case when deptno=10 then 1 else 0 end) as deptno_10,
       sum(case when deptno=20 then 1 else 0 end) as deptno_20,
       sum(case when deptno=30 then 1 else 0 end) as deptno_30
  from emp
 group by deptno
DEPTNO  DEPTNO_10  DEPTNO_20  DEPTNO_30
------  ---------  ---------  ---------
    10          3          0          0
    20          0          5          0
    30          0          0          6
If you eyeball this result set, you see that logically the output makes sense; for example, DEPTNO 10 has 3 employees in DEPTNO_10 and zero in the other departments. Since the goal is to return one row, the last step is to lose the DEPTNO and GROUP BY, and simply sum the CASE expressions:
select sum(case when deptno=10 then 1 else 0 end) as deptno_10,
       sum(case when deptno=20 then 1 else 0 end) as deptno_20,
       sum(case when deptno=30 then 1 else 0 end) as deptno_30
  from emp

DEPTNO_10  DEPTNO_20  DEPTNO_30
---------  ---------  ---------
        3          5          6
Following is another approach that you may sometimes see applied to this same sort of problem:
select max(case when deptno=10 then empcount else null end) as deptno_10,
       max(case when deptno=20 then empcount else null end) as deptno_20,
       max(case when deptno=30 then empcount else null end) as deptno_30
  from (
select deptno, count(*) as empcount
  from emp
 group by deptno
       ) x
This approach uses an inline view to generate the employee counts per department. CASE expressions in the main query translate rows to columns, getting you to the following results:
DEPTNO_10  DEPTNO_20  DEPTNO_30
---------  ---------  ---------
        3       NULL       NULL
     NULL          5       NULL
     NULL       NULL          6
Then the MAX function collapses the columns into one row:
DEPTNO_10  DEPTNO_20  DEPTNO_30
---------  ---------  ---------
        3          5          6
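The MAX/inline-view variant can be checked the same way in SQLite; again the minimal EMP stand-in is an assumption for illustration, not the real EMP table:

```python
import sqlite3

# Hypothetical minimal EMP stand-in, same head-counts as the example.
conn = sqlite3.connect(":memory:")
conn.execute("create table emp (empno integer, deptno integer)")
conn.executemany("insert into emp values (?, ?)",
                 [(7000 + i, d) for i, d in enumerate([10] * 3 + [20] * 5 + [30] * 6)])

# The inline view counts employees per department; MAX then collapses
# the sparse three-row matrix into a single row (MAX ignores NULLs).
result = conn.execute("""
    select max(case when deptno=10 then empcount else null end) as deptno_10,
           max(case when deptno=20 then empcount else null end) as deptno_20,
           max(case when deptno=30 then empcount else null end) as deptno_30
      from (select deptno, count(*) as empcount
              from emp
             group by deptno) x
""").fetchone()
print(result)  # (3, 5, 6)
```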


6750 H Model

Today, the Teradata Active Enterprise Data Warehouse platform is the 6750H model, which contains 12-core 2.7 GHz Xeon processors and a combination of solid-state and hard disk drives, with a TPerf rating of 240 per node.

DBC Queries

To find the number of nodes:

Select count(distinct nodeid) from dbc.resusagescpu;

To find the number of AMPs per node:

Select nodeid, count(distinct Vproc)
from dbc.ResCpuUsageByAmp
group by nodeid;

To find the total number of AMPs (HASHAMP returns the highest AMP number, and AMP numbering starts at zero):

SELECT HASHAMP()+1;

Friday 26 June 2015

FastLoad Has Some Limits


There are more reasons why FastLoad is so fast. Many of these take the form of restrictions, which keep anything from slowing it down. For instance, can you imagine a sprinter wearing cowboy boots in a race? Of course not! Because of its speed, FastLoad, too, must travel light! This means that it has limitations that may or may not apply to other load utilities. Remembering this short list will save you much frustration from failed loads and angry colleagues. It may even foster your reputation as a smooth operator!

Rule #1: No Secondary Indexes are allowed on the Target Table. For high performance, FastLoad utilizes only Primary Indexes when loading. The reason for this is that Primary Indexes (UPI and NUPI) are used in Teradata to distribute the rows evenly across the AMPs, and they build only data rows. A secondary index is stored in a subtable block, often on a different AMP from the data row. Maintaining it would slow FastLoad down, and they would have to call it, get ready now, HalfFastLoad. Therefore, FastLoad does not support secondary indexes. If secondary indexes already exist, just drop them; you may easily recreate them after completing the load.

Rule #2: No Referential Integrity is allowed. FastLoad cannot load data into tables that are defined with Referential Integrity (RI), because verifying referential constraints against another table would require too much system checking, and FastLoad works on only one table. In short, RI constraints will need to be dropped from the target table prior to the use of FastLoad.

Rule #3: No Triggers are allowed at load time. FastLoad is much too focused on speed to pay attention to the needs of other tables, which is what Triggers are all about. Additionally, these require more than one AMP and more than one table. FastLoad does one table only. Simply ALTER the Triggers to the DISABLED status prior to using FastLoad.

Rule #4: Duplicate Rows (in MULTISET Tables) are not supported. Multiset tables are tables that allow duplicate rows, that is, rows in which the values in every column are identical. While FastLoad can load data into a multiset table, it discards any duplicate rows it finds, so duplicates never reach the target table.

Rule #5: No AMPs may go down (i.e., go offline) while FastLoad is processing. The down AMP must be repaired before the load process can be restarted. Other than this, FastLoad can recover from system glitches and perform restarts.

Rule #6: No more than one data type conversion is allowed per column during a FastLoad.

Why just one? Data type conversion is a highly resource-intensive job on the system, requiring a "search and replace" effort, and that takes more time. Enough said!

How FastLoad Works


What makes FastLoad perform so well when it is loading millions or even billions of rows? FastLoad assembles data into 64K blocks to load it and can use multiple sessions simultaneously, taking further advantage of Teradata's parallel processing.
This is different from BTEQ and TPump, which load data at the row level. FastLoad takes full advantage of Teradata's parallel architecture. In fact, FastLoad will create a Teradata session for each AMP (Access Module Processor, the software processor in Teradata responsible for reading and writing data to the disks) in order to maximize parallel processing. This advantage is passed along to the FastLoad user in terms of awesome performance. Teradata is the only data warehouse that loads data, processes data, and backs up data in parallel.

Why it is Called "FAST" Load



FastLoad is known for its lightning-like speed in loading vast amounts of data from flat files on a host into empty tables in Teradata. Part of this speed is achieved because it does not use the Transient Journal. You will see more of the reasons enumerated below. But regardless of the reasons it is fast, know that FastLoad was developed to load millions of rows into a table.

The way FastLoad works can be illustrated by home construction, of all things! Let's look at three scenarios from the construction industry to provide an amazing picture of how the data gets loaded.

Scenario One: Builders prefer to start with an empty lot and construct a house on it, from the
foundation right on up to the roof. There is no pre-existing construction, just a smooth, graded lot. The fewer barriers there are to deal with, the quicker the new construction can progress. Building custom or spec houses this way is the fastest way to build them. Similarly, FastLoad likes to start with an empty table, like an empty lot, and then populate it with rows of data from another source. Because the target table is empty, this method is typically the fastest way to load data. FastLoad will never attempt to insert rows into a table that already holds data.

Scenario Two: The second scenario in this analogy is when someone buys the perfect piece of
land on which to build a home, but the lot already has a house on it. In this case, the person may determine that it is quicker and more advantageous just to demolish the old house and start fresh from the ground up, allowing for brand new construction. FastLoad also likes this approach to loading data. It can just 1) drop the existing table, which deletes the rows, 2) recreate its structure, and then 3) populate it with the latest and greatest data. When dealing with huge volumes of new rows, this process will run much quicker than using MultiLoad to populate the existing table. Another option is to DELETE all the data rows from a populated target table and reload it. This requires less updating of the Data Dictionary than dropping and recreating a table. In either case, the result is the perfectly empty target table that FastLoad requires!


Scenario Three: Sometimes, a customer has a good house already but wants to remodel a portion of it or to add an additional room. This kind of work takes more time than the work described in Scenario One. Such work requires some tearing out of existing construction in order to build the new section. Besides, the builder never knows what he will encounter beneath the surface of the existing home. So you can easily see that remodeling or additions can take more time than new construction. In the same way, existing tables with data may need to be updated by adding new rows of data. To load populated tables quickly with large amounts of data while maintaining the data currently held in those tables, you would choose MultiLoad instead of FastLoad. MultiLoad is designed for this task but, like renovating or adding onto an existing house, it may take more time.

Space

Space (whether permanent, spool, or temporary) is measured in BYTES. There are three types of database space:

Permanent (PERM)
Permanent space allocated to a user or database is a uniquely defined, logical repository for database objects, for example, tables, indexes, and journals.
When an object is created or data rows are inserted, the system allocates space as needed from the PERM space of the immediate owner. The space is returned automatically when no longer needed.
Note: PERM space allocated to a user or database that is currently unused is available for use as spool or temporary space.
A database or user with no PERM space can still own views, macros, and triggers, but cannot have objects that require space, such as tables, UDFs, stored procedures, HIs, JIs, or journals.

Spool
Spool space holds intermediate query results, formatted answer sets to queries, and volatile tables. The system can use unassigned PERM space as spool space.
When a user creates new users or databases, the amount of spool space for those new objects must not exceed the spool space limit of their immediate owner. If you do not specify a spool space limit, a new user automatically inherits the spool limit of its parent.
To more easily manage spool space, define the value for spool in a profile. You can then assign the profile to users, and all of those users will inherit the spool space definition.

Temporary (TEMP)
TEMP space defines the number of bytes the system will use to store data for global temporary tables. The value you specify must not exceed the value of the immediate parent at the time of creation. If you do not specify a value, the maximum value defaults to that of the parent.
Note: Global temporary tables require both PERM and TEMP space. A minimum of 512 bytes of PERM space per AMP is required for the table header of the base global temporary table definition. TEMP space is required for storing the actual rows.
To more easily manage TEMP space, define the value for it in a profile. You can then assign a group of users to this profile, and all of the users will inherit the TEMP space definition.
Problem Statement:
There are 3 users created under a database that has 50 GB of PERM space. Each user is allocated 5 GB of PERM space and 20 GB of spool space.
Now the PERM space left in the database is 35 GB.
I then created another user, to whom I allocated 20 GB of PERM space.
Now the PERM space left in the database is 15 GB, all of it unused.
Even though I allocated 20 GB of spool space to each user, if I submit an SQL query that needs more than 15 GB of spool, can that spool be allocated, given that the unused space is only 15 GB?

Solution:
The spool available to a user's query comes from the unused PERM space on the system, up to the user's assigned spool limit.
SPOOL is the currently unused PERM space at the system level, not at the user/database level.
The spool setting on a database or user is only a limit, and it is also used for inheritance:
when no spool is assigned to a new user, the user inherits the owner's spool size,
and when spool is assigned, it may not exceed the owner's size.

Let us say only 20 GB of unused PERM space is left at the system level, and there are 2 users on the system. If I run queries as both users simultaneously and each is expected to take more than 15 GB of spool space, then in this scenario one query has to be aborted, right?

The system will run out of spool and return a 2507 error: "Out of spool space on disk", this probably affects all active queries.
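As a sanity check on the numbers in the problem statement, here is the arithmetic sketched in Python. The min() rule is my reading of the behavior described: a query's usable spool is capped by both the user's spool limit and the system-wide unused PERM space.

```python
# Illustrative arithmetic only; sizes in GB, numbers from the problem statement.
system_perm = 50
allocated_perm = 3 * 5 + 20                  # three 5 GB users plus one 20 GB user
unused_perm = system_perm - allocated_perm   # 15 GB left, usable as spool
user_spool_limit = 20                        # per-user spool assignment

# A query can use at most the smaller of the two limits: its user's spool
# limit and the unused PERM space available system-wide.
usable_spool = min(user_spool_limit, unused_perm)
print(unused_perm, usable_spool)  # 15 15
```

So even with a 20 GB spool limit, only 15 GB is actually obtainable, which is why the query in the scenario fails with the out-of-spool error.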


Thursday 18 June 2015

Tracing Levels


Workflow and Task Status





Transformations that support Sorted Input

1) Aggregator
2) Joiner
3) Lookup

Stop vs Abort

Stop:
- The Stop command kills the reading process immediately.
- It does not have a timeout period.

Abort:

- The Abort command gives the Informatica server a timeout period of 60 seconds to finish processing.
- If processing does not finish within that time, it kills the DTM process.

List of output files during a session run

1. Session log
2. Bad (reject) file
3. Error log
4. Workflow log

Mapping, Session, Worklet, Workflow, Mapplet

Mapping : Represents the flow of data from source to destination

Session : A set of instructions that describes the data movement from source to destination

Worklet : A set of tasks

Workflow : A set of instructions that tells Informatica how to execute the tasks

Mapplet : Used to create and configure a reusable group of transformations

factless fact table


A fact table that carries statistical (event) information but no numeric measures.

It can be used to count the number of events that occurred.

Example :
- The number of accidents that occurred in a month.
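To make the idea concrete, here is a small SQLite sketch of a factless fact table; the table and column names are made up for illustration. Each row records only that an accident event occurred, and the analysis is a plain row count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A factless fact table: each row records that an accident event occurred,
# carrying only dimension keys (date, location) and no measure column.
conn.execute("create table accident_fact (accident_date text, location_id integer)")
conn.executemany("insert into accident_fact values (?, ?)", [
    ("2015-06-01", 1), ("2015-06-15", 2), ("2015-06-20", 1),
    ("2015-07-05", 3),
])

# The "measure" is derived by counting event rows per month.
rows = conn.execute("""
    select substr(accident_date, 1, 7) as month, count(*) as accidents
      from accident_fact
     group by month
     order by month
""").fetchall()
print(rows)  # [('2015-06', 3), ('2015-07', 1)]
```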