Sunday 5 July 2015

Normalization

Normalization is the process of organizing data to remove repeating groups, minimize redundancy, eliminate partial dependencies on composite keys, and separate non-key attributes into their own tables.


In simple terms: "Each attribute (column) must be a fact about the key, the whole key, and nothing but the key." Said another way, each table should describe only one type of entity (information).
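As a small illustration of this rule (the table and column names here are made up for the sketch), an order table that repeats customer details carries a fact about only part of its information, and normalization splits it:

```sql
-- Unnormalized: CUST_NAME is a fact about CUST_ID, not about ORDER_ID,
-- so customer details repeat on every order row
CREATE TABLE ORDERS_FLAT (
  ORDER_ID  NUMBER PRIMARY KEY,
  CUST_ID   NUMBER,
  CUST_NAME VARCHAR2(50),
  ORDER_DT  DATE
);

-- Normalized: each table describes one type of entity
CREATE TABLE CUSTOMER (
  CUST_ID   NUMBER PRIMARY KEY,
  CUST_NAME VARCHAR2(50)
);

CREATE TABLE ORDERS (
  ORDER_ID NUMBER PRIMARY KEY,
  CUST_ID  NUMBER REFERENCES CUSTOMER(CUST_ID),
  ORDER_DT DATE
);
```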


Identifying and Non Identifying relationships

Relationships between two entities may be classified as either "identifying" or "non-identifying". An identifying relationship exists when the primary key of the parent entity is included in the primary key of the child entity. A non-identifying relationship exists when the primary key of the parent entity is included in the child entity, but not as part of the child entity's primary key.

Non-identifying relationships may be further classified as either "mandatory" or "non-mandatory". A mandatory non-identifying relationship exists when the foreign key value in the child table cannot be null; a non-mandatory non-identifying relationship exists when that value can be null.
Here's a simple example of an identifying relationship:
Parent
------
ID (PK)
Name

Child
-----
ID (PK)
ParentID (PK, FK to Parent.ID) -- notice PK
Name
Here's a corresponding non-identifying relationship:
Parent
------
ID (PK)
Name

Child
-----
ID (PK)
ParentID (FK to Parent.ID) -- notice no PK
Name
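The two schemas above can be sketched as Oracle DDL (the names are illustrative; `Child2` stands in for the non-identifying version of the child table):

```sql
-- Identifying: Parent.ID participates in the child's primary key
CREATE TABLE Parent (
  ID   NUMBER PRIMARY KEY,
  Name VARCHAR2(50)
);

CREATE TABLE Child (
  ID       NUMBER,
  ParentID NUMBER NOT NULL REFERENCES Parent(ID),
  Name     VARCHAR2(50),
  PRIMARY KEY (ID, ParentID)   -- composite PK makes the relationship identifying
);

-- Non-identifying: ParentID is a plain FK outside the PK.
-- NOT NULL would make it mandatory; leaving it nullable makes it non-mandatory.
CREATE TABLE Child2 (
  ID       NUMBER PRIMARY KEY,
  ParentID NUMBER REFERENCES Parent(ID),
  Name     VARCHAR2(50)
);
```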

PL/SQL Interaction with SQL

Example 1:

DECLARE
  V_EMPNO NUMBER(4);
BEGIN
  SELECT EMPNO INTO V_EMPNO FROM EMP WHERE ENAME = 'KING';
  DBMS_OUTPUT.PUT_LINE(V_EMPNO);
END;

Example 2:

DECLARE
  V_EMPNO NUMBER(4);
  V_ENAME VARCHAR2(20);
  V_JOB   VARCHAR2(15);
BEGIN
  SELECT EMPNO, ENAME, JOB INTO V_EMPNO, V_ENAME, V_JOB FROM EMP WHERE ENAME = 'KING';
  DBMS_OUTPUT.PUT_LINE(V_EMPNO || CHR(9) || V_ENAME || CHR(9) || V_JOB);
END;

Example 3:

DECLARE
  V_EMPNO NUMBER(4);
  V_ENAME VARCHAR2(10);
  V_JOB   VARCHAR2(9);
BEGIN
  -- No WHERE clause: the query returns every row in EMP,
  -- so SELECT ... INTO raises TOO_MANY_ROWS
  SELECT EMPNO, ENAME, JOB INTO V_EMPNO, V_ENAME, V_JOB FROM EMP;
  DBMS_OUTPUT.PUT_LINE(V_EMPNO || CHR(9) || V_ENAME || CHR(9) || V_JOB);
EXCEPTION
  WHEN TOO_MANY_ROWS THEN
    DBMS_OUTPUT.PUT_LINE('TOO MANY ROWS');
END;

Example 4:

DECLARE
  -- %TYPE anchors each variable to the datatype of the corresponding column
  V_EMPNO    EMP.EMPNO%TYPE;
  V_ENAME    EMP.ENAME%TYPE;
  V_JOB      EMP.JOB%TYPE;
  V_MGR      EMP.MGR%TYPE;
  V_HIREDATE EMP.HIREDATE%TYPE;
  V_SAL      EMP.SAL%TYPE;
  V_COMM     EMP.COMM%TYPE;
  V_DEPTNO   EMP.DEPTNO%TYPE;
BEGIN
  SELECT EMPNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM, DEPTNO
    INTO V_EMPNO, V_ENAME, V_JOB, V_MGR, V_HIREDATE, V_SAL, V_COMM, V_DEPTNO
    FROM EMP WHERE ENAME = 'KING';
  DBMS_OUTPUT.PUT_LINE(V_EMPNO || CHR(9) || V_ENAME || CHR(9) || V_JOB);
END;

Example 5:

DECLARE
  EMP_REC EMP%ROWTYPE;  -- record with one field per column of EMP
BEGIN
  SELECT * INTO EMP_REC FROM EMP WHERE ENAME = 'KING';
  DBMS_OUTPUT.PUT_LINE(EMP_REC.EMPNO || CHR(9) || EMP_REC.ENAME || CHR(9) || EMP_REC.JOB);
END;


Parallel/Concurrent execution in workflows

I have a project requirement which goes like this:

Table_stage has around 20 million rows.
Data has to be loaded into a Temp table in batches of 1 million, after transformations.
The Temp table in turn loads each 1 million into one main table and then gets truncated.

So the flow is

Table_stage -> Temp -> Main, repeated for each batch of 1 million.

One of the approaches would be:

Concurrent execution of workflows is available from version 8.6 onwards. Enable the concurrent-execution option in the workflow properties and create a list of 20 run instances of the (single) workflow, each with its own parameter file. Pass a parameter into the Source Qualifier query so that it changes per instance; that way each of the 20 parallel runs picks its own batch of 1 million records.
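The same staged batching can also be sketched directly in PL/SQL (the table names TABLE_STAGE, TEMP and MAIN come from the flow above; the BATCH_ID column is a hypothetical batching key, not part of the original requirement):

```sql
-- Hypothetical batch loop: move TABLE_STAGE to MAIN via TEMP,
-- one assumed BATCH_ID (~1 million rows) at a time
BEGIN
  FOR b IN 1 .. 20 LOOP
    INSERT INTO TEMP
      SELECT * FROM TABLE_STAGE
       WHERE BATCH_ID = b;             -- transformations would go here

    INSERT INTO MAIN SELECT * FROM TEMP;
    COMMIT;

    -- TRUNCATE is DDL, so it must be issued dynamically inside PL/SQL
    EXECUTE IMMEDIATE 'TRUNCATE TABLE TEMP';
  END LOOP;
END;
/
```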

Saturday 4 July 2015

PL/SQL Basics

Basic Blocks

DECLARE
  v_width  INTEGER;
  v_height INTEGER := 2;
  v_area   INTEGER;
BEGIN
  v_area := 6;
  v_width := v_area / v_height;
  DBMS_OUTPUT.PUT_LINE('width = ' || v_width);
EXCEPTION
  WHEN ZERO_DIVIDE THEN
    DBMS_OUTPUT.PUT_LINE('Division by zero');
END;

IF clause:

DECLARE
  v_A NUMBER := 50;
  v_B NUMBER;
BEGIN
  IF v_A > 40 THEN
    v_B := v_A - 40;
    DBMS_OUTPUT.PUT_LINE('V_B = ' || v_B);
  ELSE
    v_B := 0;
    DBMS_OUTPUT.PUT_LINE('V_B = ' || v_B);
  END IF;
END;

/* IF-THEN-ELSE ladder */

DECLARE
  v_Score       NUMBER := 85;  -- percentage
  v_LetterGrade CHAR(1);
BEGIN
  IF v_Score >= 90 THEN
    v_LetterGrade := 'A';
  ELSIF v_Score >= 80 THEN
    v_LetterGrade := 'B';
  ELSIF v_Score >= 70 THEN
    v_LetterGrade := 'C';
  ELSIF v_Score >= 60 THEN
    v_LetterGrade := 'D';
  ELSE
    v_LetterGrade := 'E';
  END IF;
  DBMS_OUTPUT.PUT_LINE('Your Letter Grade is: ' || v_LetterGrade);
END;

declare
  v NUMBER;  -- not initialized, so v is NULL
begin
  -- every comparison with NULL (even v = v) evaluates to unknown,
  -- so control falls through to the ELSE branch
  if v = 1 then
    DBMS_OUTPUT.put_line('Equal to 1');
  elsif v != 1 then
    DBMS_OUTPUT.put_line('Not equal to 1');
  elsif v = v then
    DBMS_OUTPUT.put_line('Equal to itself');
  else
    DBMS_OUTPUT.put_line('Undefined result');
  end if;
  v := v + 1;  -- NULL + 1 is still NULL
  DBMS_OUTPUT.put_line('New value: <' || v || '>');
end;

LOOPS

DECLARE
 hundreds_counter  NUMBER(1,-2);
BEGIN
hundreds_counter := 100;
LOOP
DBMS_OUTPUT.PUT_LINE(hundreds_counter);
hundreds_counter := hundreds_counter + 100;
END LOOP;
EXCEPTION
WHEN OTHERS THEN
DBMS_OUTPUT.PUT_LINE('Done.');
END;

DECLARE
  v_Radius NUMBER := 2;
  MYPI     NUMBER := 3.14;
BEGIN
  LOOP
    DBMS_OUTPUT.PUT_LINE('The AREA of the circle is ' || v_Radius * v_Radius * MYPI);
    v_Radius := v_Radius + 2;
    EXIT WHEN v_Radius > 10;
  END LOOP;
END;
  /

BEGIN
  FOR just_a_num IN 1..10 LOOP
    dbms_output.put_line(just_a_num);
  END LOOP;
END;
 /

DECLARE
  V_LOW  INTEGER := 1;
  V_HIGH INTEGER := 10;
BEGIN
  FOR JUST_A_NUM IN V_LOW..V_HIGH LOOP
    DBMS_OUTPUT.PUT_LINE(JUST_A_NUM);
  END LOOP;
END;


BEGIN
  FOR v_loopcounter IN 1..5 LOOP
    DBMS_OUTPUT.PUT_LINE('Loop counter is ' || v_loopcounter);
  END LOOP;
END;
 /


DECLARE
  v_Start INTEGER := 1;
BEGIN
  FOR v_loopcounter IN REVERSE v_Start..5 LOOP
    DBMS_OUTPUT.PUT_LINE('Loop counter is ' || v_loopcounter);
  END LOOP;
END;
 /
DECLARE
  MYPI NUMBER := 3.14;
BEGIN
  FOR v_loopcounter IN 1..20 LOOP
    IF MOD(v_loopcounter, 2) = 0 THEN
      DBMS_OUTPUT.PUT_LINE('The AREA of the circle is ' || v_loopcounter * v_loopcounter * MYPI);
    END IF;
    IF v_loopcounter = 10 THEN
      EXIT;  -- leaves the loop early, at 10 instead of 20
    END IF;
  END LOOP;
END;
/



BEGIN
  <<outerloop>>
  FOR v_outerloopcounter IN 1..2 LOOP
    <<innerloop>>
    FOR v_innerloopcounter IN 1..4 LOOP
      DBMS_OUTPUT.PUT_LINE('Outer Loop counter is ' || v_outerloopcounter ||
                           ' Inner Loop counter is ' || v_innerloopcounter);
    END LOOP innerloop;
  END LOOP outerloop;
END;
 /


DECLARE
  v_Status NUMBER := 1;
BEGIN
  IF v_Status = 1 THEN
    GOTO mybranch;
  ELSE
    v_Status := 1;
  END IF;
  <<mybranch>>
  NULL;  -- a label must be followed by an executable statement
END;
/


DECLARE
  v_Radius NUMBER := 2;
  MYPI     NUMBER := 3.14;
BEGIN
  WHILE v_Radius <= 10 LOOP
    DBMS_OUTPUT.PUT_LINE('The Area is ' || MYPI * v_Radius * v_Radius);
    v_Radius := v_Radius + 2;
  END LOOP;
END;
/


DECLARE
  v_Radius NUMBER := 2;
  MYPI     NUMBER := 3.14;
BEGIN
  WHILE TRUE LOOP
    DBMS_OUTPUT.PUT_LINE('The Area is ' || MYPI * v_Radius * v_Radius);
    IF v_Radius = 10 THEN
      EXIT;
    END IF;
    v_Radius := v_Radius + 2;
  END LOOP;
END;
 /

Domain, Node, Master Gateway



When we say Domain in Informatica, we mean the unified administrative unit of the Informatica server; it sits at the head of the whole installation.

When you install Informatica (the server, not the client) on a machine (in practice, the server machine), the first thing to be created is the Domain, where you supply the domain name and related details. It is an umbrella under which all the other essential parts of Informatica live: nodes, the Integration Service, the Repository Service, and so on.

Suppose our server machine has a 1 TB hard disk and 250 GB of RAM; physically, that machine is your Domain. Logically, the Domain runs across this entire machine.

The next thing is the Node. It is a logical representation/partition of the physical machine (the Domain). During installation you create a node after configuring the Domain with its name, machine (IP), host address and so on. You can create as many nodes as you want.

If you create 2 nodes, each node shares 500 GB of disk and 125 GB of RAM; that is, the resources allocated to the domain are distributed equally among the nodes created.

As I said, the Domain is the entire server machine and the node(s) are the partition(s) of the Domain. All the services, such as the Integration Service (IS) and the Repository Service (RS), run on the server (Domain) and are accessed by the clients in a server-client structure (by giving the domain name, repository name and logon credentials).

In the back end, after credential verification, the client sends the request to the server (Domain) via the TCP/IP protocol. The request initially hits one particular node on the server machine: that is the gateway node. This node acts as an intermediary between client and server; it receives all the requests from the clients and returns all the answers from the server.

i.e., Client <---> Master Gateway node <---> Server services (IS and RS)
                  |------------------ SERVER ------------------|

Also, if we have only one node in our domain, that node will be the master gateway node and will also be responsible for running all the server services (IS and RS).

Friday 3 July 2015

Conceptual Data Modeling



  • Done in the initial phase of planning, taking a top-down approach.
  • Gather the business requirements from various sources such as business documents, functional teams, business analysts, SMEs and report users.
  • It is a high-level graphical representation of the business of the organization.
  • Start with the main subject area and identify its relationships with the other subject areas as well.
  • Define the entities in a subject area and their relationships, which are represented by cardinality (1:1, 1:M, M:N).
  • It is sent to the functional team for review.
Eg: A bank has different subject areas such as Savings, Credit Cards, Investments and Retirement Plan Services.

Identifying Change Records and Voids in a partitioned table

Let's assume I am comparing two partitions. I want to capture changes and voids while comparing consecutive partitions.

In this example, I have taken 2 consecutive partitions as two tables.

1) Employees 1 and 2 are present on Jan 1st but voided on Jan 2nd.
2) Employees 3 and 4 are present on both days but have no change in their records.
3) Employee 5 has its first record on Jan 1st and a change on Jan 2nd.
4) Employee 6 has updates on both Jan 1st and Jan 2nd.
5) Employee 7 had an update some time back but no change across these consecutive days.

Now I need to fetch the records which were changed or voided.
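To make the comparison reproducible, the two partitions can be mocked up as tables. The column names (EMPNO, ACTN_IND, SAL) are taken from the comparison query; the ACTN_IND values ('C' for change, 'V' for void) and the sample rows are assumptions for illustration only:

```sql
-- Hypothetical mock-up of the two consecutive partitions
CREATE TABLE PAR1 (EMPNO NUMBER, ACTN_IND CHAR(1), SAL NUMBER);
CREATE TABLE PAR2 (EMPNO NUMBER, ACTN_IND CHAR(1), SAL NUMBER);

-- e.g. employee 5: first record in one partition, changed in the next
INSERT INTO PAR2 VALUES (5, 'C', 3000);
INSERT INTO PAR1 VALUES (5, 'C', 3500);
```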

SQL> SELECT * FROM PAR1;

(PAR1 result set was shown as a screenshot, omitted here)

SQL> SELECT * FROM PAR2;

(PAR2 result set was shown as a screenshot, omitted here)
OUTPUT:

-- ORA_HASH(expr) hashes a single expression; its optional second argument is
-- max_bucket, so the compared columns are concatenated into one expression
-- instead of being passed as separate arguments
SELECT PAR1.*
FROM PAR2 RIGHT OUTER JOIN PAR1
  ON PAR1.EMPNO = PAR2.EMPNO
WHERE COALESCE(PAR1.ACTN_IND, '$') <> COALESCE(PAR2.ACTN_IND, '$')
   OR (PAR1.ACTN_IND = 'C'
       AND ORA_HASH(PAR1.EMPNO || '|' || PAR1.SAL) <> ORA_HASH(PAR2.EMPNO || '|' || PAR2.SAL));



Wednesday 1 July 2015

DATA , INFORMATION and KNOWLEDGE

You run a local departmental store and you log all the details of your customers in the store database. You know the names of your customers and what items they buy each day.
For example, Alex, Jessica and Paul visit your store every Sunday and buy candles. You store this information in your store database. This is data. Any time you want to know which visitors buy candles, you can query your database and get the answer. This is information. If you want to know how many candles are sold on each day of the week from your store, you can again query your database and you'd get the answer – that's also information.
But suppose there are 1,000 other customers who also buy candles from you every Sunday (mostly – with some percentage of variation) and all of them are Christian by religion. So you might conclude that Alex, Jessica and Paul are also Christian.
Now, the religion of Alex, Jessica and Paul was not given to you as data, and it could not be retrieved from the database as information. But you learnt this piece of information indirectly. This is the "knowledge" that you discovered, and the discovery was done through a process called "data mining".
Now there is a chance that you are wrong about Alex, Jessica and Paul. But there is a fair chance that you are actually right. That is why it is very important to "evaluate" the result of the KDD process.
I gave you this example because I wanted to make a clear distinction between knowledge and information in the context of data mining. This is important for understanding our first question – why retrieving information from deep down in your database is not the same as data mining. No matter how complex the information retrieval process is, no matter how deep the information is located, it's still not data mining.
As long as you are not dealing with predictive analysis or discovering "new" patterns from existing data, you are not doing data mining.

Normal DELETE vs MLOAD DELETE


MLOAD deletion is done in terms of blocks, whereas a normal DELETE is done row by row.

Teradata utilities do not use the transient journal, whereas a normal DELETE does. So an MLOAD delete is faster than a normal DELETE.
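For reference, a minimal MultiLoad delete-task script looks roughly like this (the database, table, column and logon details are placeholders, not from the text above):

```sql
.LOGTABLE mydb.emp_restartlog;     /* restart log table (placeholder name) */
.LOGON    tdpid/username,password; /* placeholder credentials */

.BEGIN DELETE MLOAD TABLES mydb.emp;

/* Block-level delete; the transient journal is not used */
DELETE FROM mydb.emp WHERE dept_no = 100;

.END MLOAD;
.LOGOFF;
```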