Reasons for PL/SQL Performance Problems
When a PL/SQL-based application performs poorly, it is often due to
- badly written SQL statements,
- poor programming practices,
- inattention to PL/SQL basics,
- misuse of shared memory.
Badly Written SQL Statements in a PL/SQL Program
PL/SQL programs look relatively simple because their complexity is hidden in SQL statements, which do most of the work. That is why badly written SQL statements are the main reason for slow execution. If a program contains many badly written SQL statements, no amount of well-written PL/SQL code will help. If badly written SQL statements are slowing down your program, analyze their execution plans and performance using the methods listed below. Then, rewrite the SQL statements. For example, hints to the query optimizer can eliminate problems such as unnecessary full-table scans.
- EXPLAIN PLAN statement
- SQL Trace facility with TKPROF utility
- Oracle Trace facility
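For instance, here is a minimal sketch of using EXPLAIN PLAN together with the DBMS_XPLAN package to view an execution plan; the table name sales and the filter are illustrative only, not part of this application:
EXPLAIN PLAN FOR
  SELECT * FROM sales WHERE channel_id = 3;   -- hypothetical query to analyze

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);      -- show the plan stored in PLAN_TABLE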
Avoid unnecessary calls to subprograms, and optimize their code:
1. If a function is called within a SQL query, you can cache the function value for each row by creating a function-based index on the table in the query. The CREATE INDEX statement might take a while, but queries can be much faster.
2. If a column is passed to a function within an SQL query, the query cannot use regular indexes on that column, and the function might be called for every row in a (potentially very large) table. Consider nesting the query so that the inner query filters the results to a small number of rows, and the outer query calls the function only a few times, as shown in the example below.
BEGIN
  -- Inefficient: calls the function for every row
  FOR item IN (SELECT DISTINCT(SQRT(department_id)) col_alias FROM employees)
  LOOP
    DBMS_OUTPUT.PUT_LINE(item.col_alias);
  END LOOP;
  -- Efficient: calls the function only once for each distinct value
  FOR item IN (SELECT SQRT(department_id) col_alias
               FROM (SELECT DISTINCT department_id FROM employees))
  LOOP
    DBMS_OUTPUT.PUT_LINE(item.col_alias);
  END LOOP;
END;
/
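For the first point above, a minimal sketch of a function-based index is shown here; it assumes a hypothetical lookup by UPPER(last_name) on the employees table, so the index name and queries are illustrative only:
-- Cache the function result for each row in an index (hypothetical example)
CREATE INDEX emp_upper_name_idx ON employees (UPPER(last_name));

-- Queries that apply the same expression can now use the index
SELECT employee_id FROM employees WHERE UPPER(last_name) = 'SMITH';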
Poor Programming Practices
Often, poor programming practices are a side effect of schedule crunches. In such circumstances, even experienced programmers might write code that hampers performance. No matter how suitable a programming language is for a given task, badly written subprograms (for example, a slow sort or search function) can ruin performance.
Duplication of Built-in Functions
PL/SQL provides many highly optimized functions such as REPLACE, TRANSLATE, SUBSTR, INSTR, RPAD, and LTRIM. Do not hand-code your own versions. Built-in functions are more efficient. Even when a built-in function has more power than you need, use it rather than hand-coding a subset of its functionality.
Inefficient Conditional Control Statements
When evaluating a logical expression, PL/SQL uses short-circuit evaluation. That is, PL/SQL stops evaluating the expression as soon as the result can be determined. Now, consider the following AND expression:
IF credit_ok(cust_id) AND (loan < limit) THEN
...
END IF;
The Boolean function credit_ok is always called. However, if you switch the operands of AND as follows
IF (loan < limit) AND credit_ok(cust_id) THEN
...
END IF;
the function is called only when the expression loan < limit is true (because AND returns TRUE only if both its operands are true). The same idea applies to EXIT WHEN statements.
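Here is a minimal, self-contained sketch of the same idea; the limit value and the stub credit_ok function are hypothetical and stand in for an expensive check:
DECLARE
  loan    NUMBER := 900;
  v_limit NUMBER := 1000;
  cust_id NUMBER := 42;

  FUNCTION credit_ok (id NUMBER) RETURN BOOLEAN IS
  BEGIN
    -- stand-in for an expensive lookup
    RETURN TRUE;
  END;
BEGIN
  -- cheap comparison first; credit_ok is called only when loan < v_limit
  IF (loan < v_limit) AND credit_ok(cust_id) THEN
    DBMS_OUTPUT.PUT_LINE('Loan approved');
  END IF;
END;
/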
Implicit Datatype Conversions
At run time, PL/SQL converts between structurally different datatypes implicitly. For instance, assigning a PLS_INTEGER variable to a NUMBER variable results in a conversion because their internal representations are different. Avoiding implicit conversions can improve performance. Look at the example below. The integer literal 15 is represented internally as a signed 4-byte quantity, so PL/SQL must convert it to an Oracle number before the addition. However, the floating-point literal 15.0 is represented as a 22-byte Oracle number, so no conversion is necessary.
DECLARE
n NUMBER;
c CHAR(5);
BEGIN
n := n + 15; -- converted
n := n + 15.0; -- not converted
...
END;
Here is another example:
DECLARE
c CHAR(5);
BEGIN
c := 25; -- converted
c := '25'; -- not converted
...
END;
Inappropriate Declarations for Numeric Datatypes
The datatype NUMBER and its subtypes are stored in a special 22-byte internal format, designed for portability and arbitrary scale and precision, not performance. Operations on NUMBER or INTEGER variables require calls to library routines. When you need to declare an integer variable, use the datatype PLS_INTEGER, which is the most efficient numeric type: PLS_INTEGER values require less storage than INTEGER or NUMBER values, and PLS_INTEGER operations use machine arithmetic, so they are faster than BINARY_INTEGER, INTEGER, or NUMBER operations, which use library arithmetic. Furthermore, INTEGER, NATURAL, NATURALN, POSITIVE, POSITIVEN, and SIGNTYPE are constrained subtypes, so variables of these types require precision checking at run time, which can affect performance. The BINARY_FLOAT and BINARY_DOUBLE types can use native hardware arithmetic instructions and are more efficient for number-crunching applications such as scientific processing; they also require less space in the database. However, these types do not always represent fractional values precisely and handle rounding differently than the NUMBER types, so they are less suitable for financial code where accuracy is critical.
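As an illustration, the following sketch compares a PLS_INTEGER counter with a NUMBER counter in a tight loop. The iteration count is arbitrary and timings vary by system, so treat it as a way to measure the difference rather than a guaranteed result:
DECLARE
  n_num NUMBER      := 0;
  n_pls PLS_INTEGER := 0;
  t0    PLS_INTEGER;
BEGIN
  t0 := DBMS_UTILITY.GET_TIME;                 -- elapsed time in hundredths of a second
  FOR i IN 1 .. 10000000 LOOP
    n_num := n_num + 1;                        -- library arithmetic on NUMBER
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('NUMBER loop:      ' || (DBMS_UTILITY.GET_TIME - t0));

  t0 := DBMS_UTILITY.GET_TIME;
  FOR i IN 1 .. 10000000 LOOP
    n_pls := n_pls + 1;                        -- machine arithmetic on PLS_INTEGER
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('PLS_INTEGER loop: ' || (DBMS_UTILITY.GET_TIME - t0));
END;
/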
Unnecessary NOT NULL Constraints
In PL/SQL, using the NOT NULL constraint incurs a performance cost. Consider the following example:
PROCEDURE calc_m IS
m NUMBER NOT NULL := 0;
a NUMBER;
b NUMBER;
BEGIN
...
m := a + b;
...
END;
Because m is constrained by NOT NULL, the value of the expression a + b is assigned to a temporary variable, which is then tested for nullity. If the variable is not null, its value is assigned to m. Otherwise, an exception is raised. However, if m were not constrained, the value would be assigned to m directly. A more efficient way to write the last example follows:
PROCEDURE calc_m IS
m NUMBER; -- no constraint
a NUMBER;
b NUMBER;
BEGIN
...
m := a + b;
IF m IS NULL THEN -- enforce constraint programmatically
...
END IF;
END;
Note that the subtypes NATURALN and POSITIVEN are defined as NOT NULL, so using them incurs the same performance cost.
Size Declarations for VARCHAR2 Variables
The VARCHAR2 datatype involves a trade-off between memory use and efficiency. For a VARCHAR2 variable declared with a size of 2000 bytes or more, PL/SQL dynamically allocates only enough memory to hold the actual value. However, for a VARCHAR2 variable declared with a size smaller than 2000 bytes, PL/SQL preallocates enough memory to hold a maximum-size value. So, for example, if you assign the same 500-byte value to a VARCHAR2(2000) variable and to a VARCHAR2(1999) variable, the latter uses 1499 bytes more memory.
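A minimal declaration sketch of this trade-off follows; the sizes and values are illustrative:
DECLARE
  big   VARCHAR2(2000);   -- >= 2000 bytes: memory is allocated dynamically for the actual value
  small VARCHAR2(1999);   -- <  2000 bytes: the full declared size is preallocated
BEGIN
  big   := RPAD('x', 500, 'x');   -- uses roughly 500 bytes
  small := RPAD('x', 500, 'x');   -- still reserves 1999 bytes
END;
/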
Misuse of Shared Memory in a PL/SQL Program
When you call a packaged subprogram for the first time, the whole package is loaded into the shared memory pool. So, subsequent calls to related subprograms in the package require no disk I/O, and your code executes faster. However, if the package is aged out of memory, it must be reloaded if you reference it again. You can improve performance by sizing the shared memory pool correctly. Make sure it is large enough to hold all frequently used packages but not so large that memory is wasted.
Pinned Packages
Another way to improve performance is to pin frequently used packages in the shared memory pool. When a package is pinned, it is not aged out by the least recently used (LRU) algorithm that Oracle normally uses. The package remains in memory no matter how full the pool gets or how frequently you access the package.
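Pinning is typically done with the Oracle-supplied DBMS_SHARED_POOL package (it may need to be installed by running dbmspool.sql as SYS); the package name APP_CORE_PKG below is hypothetical:
BEGIN
  -- Keep a frequently used package in the shared pool ('P' = package/procedure/function)
  DBMS_SHARED_POOL.KEEP('SCOTT.APP_CORE_PKG', 'P');
END;
/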
Serially Reusable Packages
To help you manage the use of memory, PL/SQL provides the pragma SERIALLY_REUSABLE, which lets you mark some packages as serially reusable. You can so mark a package if its state is needed only for the duration of one call to the server (for example, an OCI call to the server or a server-to-server RPC).
The global memory for such packages is pooled in the System Global Area (SGA), not allocated to individual users in the User Global Area (UGA). That way, the package work area can be reused. When the call to the server ends, the memory is returned to the pool. Each time the package is reused, its public variables are initialized to their default values or to NULL. The maximum number of work areas needed for a package is the number of concurrent users of that package, which is usually much smaller than the number of logged-on users.
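Here is a minimal sketch of marking a package serially reusable; the package and its variable are illustrative. Note that the pragma must appear in both the specification and the body:
CREATE OR REPLACE PACKAGE work_area_pkg IS
  PRAGMA SERIALLY_REUSABLE;
  counter NUMBER := 0;        -- reinitialized at the start of each server call
  PROCEDURE bump;
END work_area_pkg;
/
CREATE OR REPLACE PACKAGE BODY work_area_pkg IS
  PRAGMA SERIALLY_REUSABLE;
  PROCEDURE bump IS
  BEGIN
    counter := counter + 1;
  END;
END work_area_pkg;
/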
Dynamic SQL
Some programs (a general-purpose report writer for example) must build and process a variety of SQL statements at run time. So, their full text is unknown until then. Such statements can, and probably will, change from execution to execution. So, they are called dynamic SQL statements. Formerly, to execute dynamic SQL statements, you had to use the supplied package DBMS_SQL. Now, within PL/SQL, you can execute any kind of dynamic SQL statement using an interface called native dynamic SQL. Native dynamic SQL is easier to use and much faster than the DBMS_SQL package. Except for multi-row queries, the dynamic string can contain any SQL statement (without the terminator) or any PL/SQL block (with the terminator). The string can also contain placeholders for bind arguments.
Improving Performance: In the example below, Oracle opens a different cursor for each distinct value of emp_id. This can lead to resource contention and poor performance.
CREATE PROCEDURE fire_employee (emp_id NUMBER) AS
BEGIN
EXECUTE IMMEDIATE
'DELETE FROM emp WHERE empno = ' || TO_CHAR(emp_id);
END;
You can improve performance by using a bind variable, as shown below. This allows Oracle to reuse the same cursor for different values of emp_id.
CREATE PROCEDURE fire_employee (emp_id NUMBER) AS
BEGIN
EXECUTE IMMEDIATE
'DELETE FROM emp WHERE empno = :num' USING emp_id;
END;
Limitation: Suppose you need a procedure that accepts the name of any database table, then drops that table from your schema. Using dynamic SQL, you might write the following standalone procedure:
CREATE PROCEDURE drop_table (table_name IN VARCHAR2) AS
BEGIN
EXECUTE IMMEDIATE 'DROP TABLE :tab' USING table_name;
END;
However, at run time, this procedure fails with an invalid table name error, because you cannot use bind arguments to pass the names of schema objects to a dynamic SQL statement. Instead, you must build the schema object names into the dynamic string itself. To fix the last example, revise the EXECUTE IMMEDIATE statement: instead of using a placeholder and bind argument, concatenate the parameter table_name into the dynamic string, as follows:
CREATE PROCEDURE drop_table (table_name IN VARCHAR2) AS
BEGIN
EXECUTE IMMEDIATE 'DROP TABLE ' || table_name;
END;
Bulk Binds
When SQL statements execute inside a loop using collection elements as bind variables, context switching between the PL/SQL and SQL engines can slow down execution. If the SQL statement affects four or more database rows, the use of bulk binds can improve performance considerably. To maximize performance, rewrite your programs as follows:
- If an INSERT, UPDATE, or DELETE statement executes inside a loop and references collection elements, move it into a FORALL statement.
- If a SELECT INTO, FETCH INTO, or RETURNING INTO clause references a collection, incorporate the BULK COLLECT clause. To speed up SELECT statements, include the BULK COLLECT INTO clause in the SELECT statement instead of using INTO.
- If possible, use host arrays to pass collections back and forth between your programs and the database server.
- If the failure of a DML operation on a particular row is not a serious problem, include the keywords SAVE EXCEPTIONS in the FORALL statement and report or clean up the errors in a subsequent loop using the %BULK_EXCEPTIONS attribute.
The assigning of values to PL/SQL variables in SQL statements is called binding. PL/SQL binding operations fall into three categories:
- In-bind: when a PL/SQL variable or host variable is stored in the database by an INSERT or UPDATE statement.
- Out-bind: when a database value is assigned to a PL/SQL variable or a host variable by the RETURNING clause of an INSERT, UPDATE, or DELETE statement.
- Define: when a database value is assigned to a PL/SQL variable or a host variable by a SELECT or FETCH statement.
Bulk SQL uses PL/SQL collections, such as varrays or nested tables, to pass large amounts of data back and forth in a single operation. This process is known as bulk binding. If the collection has 20 elements, bulk binding lets you perform the equivalent of 20 SELECT, INSERT, UPDATE, or DELETE statements using a single operation. Queries can pass back any number of results, without requiring a FETCH statement for each row.
The following example shows the use of bulk binds:
declare
subtype emp_id_stype is DW_CUST_SALE.CUSTOMER_ID%TYPE;
type emp_id_type is table of emp_id_stype;
emp_id_tab emp_id_type;
subtype prod_id_stype is DW_CUST_SALE.PRODUCT_ID%TYPE;
type prod_id_type is table of prod_id_stype;
prod_id_tab prod_id_type;
subtype sale_amt_stype is DW_CUST_SALE.SALE_AMT%TYPE;
type sale_amt_type is table of sale_amt_stype;
sale_amt_tab sale_amt_type;
subtype sale_qty_stype is DW_CUST_SALE.SALE_QTY%TYPE;
type sale_qty_type is table of sale_qty_stype;
sale_qty_tab sale_qty_type;
subtype ord_amt_stype is DW_CUST_SALE.ORD_AMT%TYPE;
type ord_amt_type is table of ord_amt_stype;
ord_amt_tab ord_amt_type;
subtype ord_qty_stype is DW_CUST_SALE.ORD_QTY%TYPE;
type ord_qty_type is table of ord_qty_stype;
ord_qty_tab ord_qty_type;
CURSOR c1 IS
SELECT a.customer_id,
b.product_id,
sum(c.sale_amt) sale_amt,
sum(c.sale_qty) sale_qty,
0 ord_amt,
0 ord_qty
FROM DW_CUSTOMER a, DW_PRODUCT b, DW_SRC_SALE c
WHERE a.customer=c.customer AND
b.product=c.product
GROUP BY customer_id, product_id
UNION ALL
SELECT a.customer_id,
b.product_id,
0 sale_amt,
0 sale_qty,
sum(c.PRICE*c.quantity) ord_amt,
sum(c.quantity) ord_qty
FROM DW_CUSTOMER a, DW_PRODUCT b, DW_SRC_ORDERDTL c
WHERE a.customer=c.customer AND
b.product=c.product
GROUP BY customer_id, product_id;
row_limit PLS_INTEGER:=10;
BEGIN
  OPEN c1;
  LOOP
    FETCH c1 BULK COLLECT INTO emp_id_tab,
                               prod_id_tab,
                               sale_amt_tab,
                               sale_qty_tab,
                               ord_amt_tab,
                               ord_qty_tab LIMIT row_limit;
    EXIT WHEN emp_id_tab.COUNT = 0;
    FORALL i IN 1..emp_id_tab.COUNT
      INSERT INTO DW_CUST_SALE (customer_id, product_id, sale_amt, sale_qty, ord_amt, ord_qty)
      VALUES (emp_id_tab(i), prod_id_tab(i), sale_amt_tab(i), sale_qty_tab(i), ord_amt_tab(i), ord_qty_tab(i));
    dbms_output.put_line('Records affected: ' || SQL%ROWCOUNT);
  END LOOP;
  CLOSE c1;
  COMMIT;
END;
/
Handling Exceptions
DECLARE
   TYPE empid_tab IS TABLE OF employees.employee_id%TYPE;
   emp_sr empid_tab;
   -- create an exception handler for ORA-24381
   errors NUMBER;
   dml_errors EXCEPTION;
   PRAGMA EXCEPTION_INIT(dml_errors, -24381);
BEGIN
   SELECT employee_id BULK COLLECT INTO emp_sr FROM emp_temp
      WHERE hire_date < '30-DEC-94';
   -- add '_SR' to the job_id of the most senior employees
   FORALL i IN emp_sr.FIRST..emp_sr.LAST SAVE EXCEPTIONS
      UPDATE emp_temp SET job_id = job_id || '_SR'
         WHERE emp_sr(i) = emp_temp.employee_id;
   -- If any errors occurred during the FORALL SAVE EXCEPTIONS,
   -- a single exception is raised when the statement completes.
EXCEPTION
   WHEN dml_errors THEN
      -- Now we figure out what failed and why.
      errors := SQL%BULK_EXCEPTIONS.COUNT;
      DBMS_OUTPUT.PUT_LINE('Number of statements that failed: ' || errors);
      FOR i IN 1..errors LOOP
         DBMS_OUTPUT.PUT_LINE('Error #' || i || ' occurred during ' ||
            'iteration #' || SQL%BULK_EXCEPTIONS(i).ERROR_INDEX);
         DBMS_OUTPUT.PUT_LINE('Error message is ' ||
            SQLERRM(-SQL%BULK_EXCEPTIONS(i).ERROR_CODE));
      END LOOP;
END;
/
NOCOPY Compiler Hint
By default, OUT and IN OUT parameters are passed by value. That is, the value of an IN OUT actual parameter is copied into the corresponding formal parameter. Then, if the subprogram exits normally, the values assigned to OUT and IN OUT formal parameters are copied into the corresponding actual parameters. When the parameters hold large data structures such as collections, records, and instances of object types, all this copying slows down execution and uses up memory. To prevent that, you can specify the NOCOPY hint, which allows the PL/SQL compiler to pass OUT and IN OUT parameters by reference. In the following example, you ask the compiler to pass IN OUT parameter my_unit by reference instead of by value:
DECLARE
TYPE Platoon IS VARRAY(200) OF Soldier;
PROCEDURE reorganize (my_unit IN OUT NOCOPY Platoon) IS ...
BEGIN
...
END;
RETURNING Clause
Often, applications need information about the row affected by a SQL operation, for example, to generate a report or take a subsequent action. The INSERT, UPDATE, and DELETE statements can include a RETURNING clause, which returns column values from the affected row into PL/SQL variables or host variables. This eliminates the need to SELECT the row after an insert or update, or before a delete.
As a result, fewer network round trips, less server CPU time, fewer cursors, and less server memory are required. In the following example, you update the salary of an employee and at the same time retrieve the employee's name and new salary into PL/SQL variables.
PROCEDURE update_salary (emp_id NUMBER) IS
name VARCHAR2(15);
new_sal NUMBER;
BEGIN
...
UPDATE emp SET sal = sal * 1.1
WHERE empno = emp_id
RETURNING ename, sal INTO name, new_sal;
-- Now do computations involving name and new_sal
...
END;
External Routines
PL/SQL is specialized for SQL transaction processing. So, some tasks are more quickly done in a lower-level language such as C, which is very efficient at machine-precision calculations. PL/SQL extends the functionality of the Oracle server by providing an interface for calling routines written in other languages. Standard libraries already written and available in other languages can be called from PL/SQL programs. For example, you can write methods in C, store them in a dynamic link library (DLL), register the library with PL/SQL, then call it from your applications. At run time, the library loads dynamically and, for safety, runs in a separate address space (implemented as a separate process).
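A hedged sketch of the usual registration steps follows; the library path, the C routine, and all names are purely hypothetical:
-- Register a dynamic link library containing the C routine (path is hypothetical)
CREATE OR REPLACE LIBRARY c_math_lib AS '/u01/app/oracle/lib/libcmath.so';
/
-- Publish the C routine as a PL/SQL function (names are hypothetical)
CREATE OR REPLACE FUNCTION fast_hash (p_input IN VARCHAR2) RETURN PLS_INTEGER
AS LANGUAGE C
   LIBRARY c_math_lib
   NAME "fast_hash";
/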
Pipelined Table Functions
Table functions are used in situations such as data warehousing to apply multiple transformations to data. Pipelined table functions are functions that produce a collection of rows (either a nested table or a varray) that can be queried like a physical database table or assigned to a PL/SQL collection variable. You can use a table function in place of the name of a database table in the FROM clause of a query or in place of a column name in the SELECT list of a query.
A table function can take a collection of rows as input. An input collection parameter can be either a collection type (such as a VARRAY or a PL/SQL table) or a REF CURSOR.
Execution of a table function can be parallelized, and returned rows can be streamed directly to the next process without intermediate staging. Rows from a collection returned by a table function can also be pipelined, that is, iteratively returned as they are produced instead of in a batch after all processing of the table function's input is completed.
Streaming, pipelining, and parallel execution of table functions can improve performance:
- By enabling multi-threaded, concurrent execution of table functions
- By eliminating intermediate staging between processes
- By improving query response time: With non-pipelined table functions, the entire collection returned by a table function must be constructed and returned to the server before the query can return a single result row. Pipelining enables rows to be returned iteratively, as they are produced. This also reduces the memory that a table function requires, as the object cache does not need to materialize the entire collection.
- By iteratively providing result rows from the collection returned by a table function as the rows are produced instead of waiting until the entire collection is staged in tables or memory and then returning the entire collection.
Example 1:
-- Define the ref cursor types and function
CREATE OR REPLACE PACKAGE refcur_pkg IS
  TYPE refcur_t IS REF CURSOR RETURN employees%ROWTYPE;
  TYPE outrec_typ IS RECORD (
    var_num   NUMBER(6),
    var_char1 VARCHAR2(30),
    var_char2 VARCHAR2(30));
  TYPE outrecset IS TABLE OF outrec_typ;
  FUNCTION f_trans(p refcur_t) RETURN outrecset PIPELINED;
END refcur_pkg;
/
CREATE OR REPLACE PACKAGE BODY refcur_pkg IS
  FUNCTION f_trans(p refcur_t) RETURN outrecset PIPELINED IS
    out_rec outrec_typ;
    in_rec  p%ROWTYPE;
  BEGIN
    LOOP
      FETCH p INTO in_rec;
      EXIT WHEN p%NOTFOUND;
      -- first row
      out_rec.var_num   := in_rec.employee_id;
      out_rec.var_char1 := in_rec.first_name;
      out_rec.var_char2 := in_rec.last_name;
      PIPE ROW(out_rec);
      -- second row
      out_rec.var_char1 := in_rec.email;
      out_rec.var_char2 := in_rec.phone_number;
      PIPE ROW(out_rec);
    END LOOP;
    CLOSE p;
    RETURN;
  END;
END refcur_pkg;
/
-- SELECT query using the f_trans table function
SELECT * FROM TABLE(
  refcur_pkg.f_trans(CURSOR(SELECT * FROM employees WHERE department_id = 60)));
Example 2: Using a Pipelined Table Function as an Aggregate Function
CREATE TABLE gradereport (student VARCHAR2(30), subject VARCHAR2(30),
  weight NUMBER, grade NUMBER);
INSERT INTO gradereport VALUES('Mark', 'Physics', 4, 4);
INSERT INTO gradereport VALUES('Mark', 'Chemistry', 4, 3);
INSERT INTO gradereport VALUES('Mark', 'Maths', 3, 3);
INSERT INTO gradereport VALUES('Mark', 'Economics', 3, 4);

CREATE PACKAGE pkg_gpa IS
  TYPE gpa IS TABLE OF NUMBER;
  FUNCTION weighted_average(input_values SYS_REFCURSOR) RETURN gpa PIPELINED;
END pkg_gpa;
/
CREATE PACKAGE BODY pkg_gpa IS
  FUNCTION weighted_average(input_values SYS_REFCURSOR) RETURN gpa PIPELINED IS
    grade        NUMBER;
    total        NUMBER := 0;
    total_weight NUMBER := 0;
    weight       NUMBER := 0;
  BEGIN
    -- The function accepts a ref cursor and loops through all the input rows
    LOOP
      FETCH input_values INTO weight, grade;
      EXIT WHEN input_values%NOTFOUND;
      -- Accumulate the weighted average
      total_weight := total_weight + weight;
      total := total + grade*weight;
    END LOOP;
    PIPE ROW (total / total_weight);
    RETURN; -- the function returns a single result
  END;
END pkg_gpa;
/
-- the query result comes back as a nested table with a single row
-- COLUMN_VALUE is a keyword that returns the contents of a nested table
SELECT w.column_value "weighted result"
FROM TABLE(
  pkg_gpa.weighted_average(CURSOR(SELECT weight, grade FROM gradereport))) w;

CREATE TABLE ... AS SELECT and INSERT /*+ APPEND */ ... AS SELECT
The CREATE TABLE ... AS SELECT statement (CTAS) is a powerful tool for manipulating large sets of data. As shown in the following example, many data transformations can be expressed in standard SQL, and CTAS provides a mechanism for efficiently executing a SQL query and storing the results of that query in a new database table. The INSERT /*+ APPEND */ ... AS SELECT statement offers the same capabilities with existing database tables. In a data warehouse environment, CTAS is typically run in parallel using NOLOGGING mode for best performance.
A simple and common type of data transformation is data substitution, in which some or all of the values of a single column are modified. For example, our sales table has a channel_id column that indicates whether a given sales transaction was made by the company's own sales force (a direct sale) or by a distributor (an indirect sale). You may receive data from multiple source systems for your data warehouse. Suppose that one of those source systems processes only direct sales and thus does not know about indirect sales channels. When the data warehouse initially receives sales data from this system, all sales records have a NULL value for the sales.channel_id field. These NULL values must be set to the proper key value; this can be done efficiently with a SQL function as part of the INSERT into the target sales table. The structure of the source table sales_activity_direct is as follows:
DESC sales_activity_direct
Name Null? Type
------------ ----- ----------------
SALES_DATE DATE
PRODUCT_ID NUMBER
CUSTOMER_ID NUMBER
PROMOTION_ID NUMBER
AMOUNT NUMBER
QUANTITY NUMBER
The following SQL statement inserts data from sales_activity_direct into the sales table of the sample schema, using a SQL function to truncate the sales date values to midnight and assigning the fixed channel ID of 3 for direct sales:
INSERT /*+ APPEND NOLOGGING PARALLEL */
INTO sales SELECT product_id, customer_id, TRUNC(sales_date), 3,
promotion_id, quantity, amount
FROM sales_activity_direct;
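The same transformation could also be written as a CTAS into a new staging table. A minimal sketch is shown below, assuming a hypothetical staging table name sales_direct_stage:
CREATE TABLE sales_direct_stage
  PARALLEL NOLOGGING
AS
SELECT product_id, customer_id, TRUNC(sales_date) AS time_id,
       3 AS channel_id, promotion_id, quantity, amount
FROM   sales_activity_direct;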
Transforming Data Using UPDATE
Another technique for implementing a data substitution is to use an UPDATE statement to modify the sales.channel_id column. An UPDATE will provide the correct result. However, if the data substitution transformation requires that a very large percentage of the rows (or all of the rows) be modified, then it may be more efficient to use a CTAS statement than an UPDATE.
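For completeness, here is a minimal sketch of the UPDATE approach for the same substitution, assuming the NULL channel_id convention described earlier:
UPDATE sales
   SET channel_id = 3          -- key value for the direct-sales channel
 WHERE channel_id IS NULL;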
Transforming Data Using MERGE
Oracle Database's merge functionality extends SQL with the MERGE keyword, providing the ability to conditionally update or insert rows into a table (or into updatable single-table views). Conditions are specified in the ON clause. Besides pure bulk loading, this is one of the most common operations in data warehouse synchronization.
Merge Examples
The following discusses various implementations of a merge. The examples assume that new data for the dimension table products is propagated to the data warehouse and has to be either inserted or updated. The table products_delta has the same structure as products.
MERGE INTO products t USING products_delta s
ON (t.prod_id=s.prod_id)
WHEN MATCHED THEN UPDATE SET
t.prod_list_price=s.prod_list_price, t.prod_min_price=s.prod_min_price
WHEN NOT MATCHED THEN INSERT (prod_id, prod_name, prod_desc, prod_subcategory,
prod_subcategory_desc, prod_category, prod_category_desc, prod_status,
prod_list_price, prod_min_price)
VALUES (s.prod_id, s.prod_name, s.prod_desc, s.prod_subcategory,
s.prod_subcategory_desc, s.prod_category, s.prod_category_desc,
s.prod_status, s.prod_list_price, s.prod_min_price);
Transforming Data Using Multitable INSERT
Many times, external data sources have to be segregated based on logical attributes for insertion into different target objects. It is also common in data warehouse environments to fan out the same source data into several target objects. Multitable inserts provide a SQL statement for these kinds of transformations, where data can end up in exactly one target or in several targets, depending on the business transformation rules. The insertion can be done conditionally based on business rules or unconditionally.
Multitable insert offers the benefits of the INSERT ... SELECT statement when multiple tables are involved as targets, while avoiding the drawbacks of the two obvious alternatives: n independent INSERT ... SELECT statements, which process the same source data n times and increase the transformation workload n times; or a procedural approach with a per-row determination of how to handle the insertion, which lacks direct access to the high-speed access paths available in SQL. As with the existing INSERT ... SELECT statement, the new statement can be parallelized and used with the direct-load mechanism for faster performance. The following statement aggregates the transactional sales information stored in sales_activity_direct on a daily basis and inserts into both the sales and the costs fact tables for the current day.
INSERT ALL
INTO sales VALUES (product_id, customer_id, today, 3, promotion_id,
quantity_per_day, amount_per_day)
INTO costs VALUES (product_id, today, promotion_id, 3,
product_cost, product_price)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id,
s.promotion_id, SUM(s.amount) AS amount_per_day, SUM(s.quantity)
quantity_per_day, p.prod_min_price*0.8 AS product_cost, p.prod_list_price
AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id AND TRUNC(sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(sales_date), s.product_id, s.customer_id, s.promotion_id, p.prod_min_price*0.8, p.prod_list_price;
The following statement inserts a row into the sales and costs tables for all sales transactions with a valid promotion and stores the information about multiple identical orders of a customer in a separate table cum_sales_activity. It is possible two rows will be inserted for some sales transactions, and none for others.
INSERT ALL
WHEN promotion_id IN (SELECT promo_id FROM promotions) THEN
INTO sales VALUES (product_id, customer_id, today, 3, promotion_id,
quantity_per_day, amount_per_day)
INTO costs VALUES (product_id, today, promotion_id, 3,
product_cost, product_price)
WHEN num_of_orders > 1 THEN
INTO cum_sales_activity VALUES (today, product_id, customer_id,
promotion_id, quantity_per_day, amount_per_day, num_of_orders)
SELECT TRUNC(s.sales_date) AS today, s.product_id, s.customer_id,
s.promotion_id, SUM(s.amount) AS amount_per_day, SUM(s.quantity)
quantity_per_day, COUNT(*) num_of_orders, p.prod_min_price*0.8
AS product_cost, p.prod_list_price AS product_price
FROM sales_activity_direct s, products p
WHERE s.product_id = p.prod_id
AND TRUNC(sales_date) = TRUNC(SYSDATE)
GROUP BY TRUNC(sales_date), s.product_id, s.customer_id,
s.promotion_id, p.prod_min_price*0.8, p.prod_list_price;
The following statement inserts into an appropriate shipping manifest according to the total quantity and the weight of a product order. An exception is made for high value orders, which are also sent by express, unless their weight classification is not too high. All incorrect orders, in this simple example represented as orders without a quantity, are stored in a separate table. It assumes the existence of appropriate tables
large_freight_shipping, express_shipping, default_shipping, and
incorrect_sales_order.
INSERT FIRST WHEN (sum_quantity_sold > 10 AND prod_weight_class < 5 AND
sum_quantity_sold >= 1) OR (sum_quantity_sold > 5 AND prod_weight_class > 5) THEN
INTO large_freight_shipping VALUES
(time_id, cust_id, prod_id, prod_weight_class, sum_quantity_sold)
WHEN sum_amount_sold > 1000 AND sum_quantity_sold >=1 THEN
INTO express_shipping VALUES
(time_id, cust_id, prod_id, prod_weight_class,
sum_amount_sold, sum_quantity_sold)
WHEN (sum_quantity_sold >=1) THEN INTO default_shipping VALUES
(time_id, cust_id, prod_id, sum_quantity_sold)
ELSE INTO incorrect_sales_order VALUES (time_id, cust_id, prod_id)
SELECT s.time_id, s.cust_id, s.prod_id, p.prod_weight_class,
SUM(amount_sold) AS sum_amount_sold,
SUM(quantity_sold) AS sum_quantity_sold
FROM sales s, products p
WHERE s.prod_id = p.prod_id AND s.time_id = TRUNC(SYSDATE)
GROUP BY s.time_id, s.cust_id, s.prod_id, p.prod_weight_class;
The following example inserts new customers into the customers table and stores all new customers with cust_credit_limit higher than 4500 in an additional, separate table for further promotions.
INSERT FIRST WHEN cust_credit_limit >= 4500 THEN INTO customers
INTO customers_special VALUES (cust_id, cust_credit_limit)
ELSE INTO customers
SELECT * FROM customers_new;
BEST PRACTICES:
- In the XXX Retail Company project, we already have sufficient test data in the development environment. So, when you embed SQL queries as cursors in your PL/SQL code and the SQL is complex, run EXPLAIN PLAN for that SQL statement and analyze whether it can be rewritten more efficiently.
- Don't duplicate the Oracle built-in functions. In other words, if a programming task can be achieved using Oracle-provided functions, use them rather than hand-coding the algorithm.
- When writing logical expressions in your code, order the conditions so that the PL/SQL engine's short-circuit evaluation can be leveraged as much as possible.
- Implicit datatype conversion within code always has overhead. So, when declaring variables, be careful to assign the right datatype so that no conversion is required by the PL/SQL engine later, while the code is executing.
- When you need to declare a numeric variable in your code, use PLS_INTEGER as much as possible, since it is the most efficient numeric datatype: use PLS_INTEGER for integer arithmetic and BINARY_FLOAT or BINARY_DOUBLE for floating-point arithmetic.
- Instead of declaring a variable as NOT NULL, check for the null condition in an IF clause in your code, which is more efficient.
- Be generous when declaring VARCHAR2 variables. Declare them with a size of 2000 bytes or more so that the PL/SQL engine allocates only as much memory as the actual value needs, instead of preallocating enough memory for a maximum-size value.
- During integration testing, analyze the sequence of execution of packages and the reusability of packaged functions and procedures. Decide which packages are candidates for pinning in memory and which should be marked SERIALLY_REUSABLE.
- When you use dynamic SQL, prefer native dynamic SQL (EXECUTE IMMEDIATE) over the DBMS_SQL built-in package wherever possible.
- When you feel the need for intermediate staging for various transformations in your PL/SQL ETL code, explore the possibility of using pipelined table functions with Oracle's parallelism features instead of intermediate staging tables.
- When you fire many DML statements within one or more loops, use BULK COLLECT, FORALL, and LIMIT so that you can fetch a set of records and INSERT, UPDATE, or DELETE a set of records in a single operation. Using these features also lets you leverage Oracle's parallelism features.
- If you need to INSERT or UPDATE a table depending on whether the record already exists in the table, consider using the MERGE statement.
- If two or more tables need the same data, use a multitable insert so that you can insert the data into all the tables in a single operation. This eliminates the extra queries needed to insert into each additional table.
- For procedures that return a huge number of values or records, consider using the NOCOPY hint for OUT or IN OUT parameters.
- If you do need intermediate staging for transformations, use the CREATE TABLE ... AS SELECT statement to create the intermediate staging tables, and try to do as much of the transformation as possible within the SELECT statement.
- Use the hints /*+ APPEND */ or /*+ APPEND NOLOGGING PARALLEL */ in INSERT statements that use bulk binds. This enables direct-path bulk loading, and NOLOGGING PARALLEL enables parallelism.
- If, after applying changes to a set of records or inserting a set of records, you want to perform further work on those same records, you would normally need to re-query to bring in all the affected rows. In such cases, use the RETURNING clause with the INSERT or UPDATE statement so that you do not need to re-query after applying the inserts and updates.
CASE STUDY :
We occasionally come across people who are highly experienced in loading huge volumes of data from source to target using ETL tools like DataStage, Informatica, Ab Initio, and others on a weekly and monthly basis, but how often do we find people within the IT industry who have managed to load billions of records from source to target within hours, daily? The Database and Modeling team of a retail chain in
Case Study
System : IBM p-Series, 64 GB RAM and 14 CPUs.
Source : Oracle 10g, using 64 MB uniform extent size.
Target : Oracle 10g, using 64 MB uniform extent size. Star schema model using surrogate keys.
ETL Tool : PL/SQL
Daily Transaction Data : 26 million.
Master Data : 350 million.
Business Requirement : 308 million daily transaction data, up to 40 days of update, and
Target Time for ETL : 1 hr for the daily load, 1 hr 30 mins for the 40-day update, and 1 hr for aggregation along the time dimension (weekly, monthly, fiscal quarter, fiscal half year, and fiscal year).
Challenges:
It was presumed that simple Oracle DML statements like INSERT, UPDATE, and DELETE would not work on the above volume of data. Hence, the general strategy, based on previous experience with more or less similar volumes of data, was to use:
- Range Partition the STAR SCHEMA Fact Table on Date.
- Bulk Collect From source.
- Bulk Insert for Daily and History Load.
- Bulk Update for 40 Days Update.
- Use the best tuning approaches.
The above strategies met the functional business logic but could not meet the performance criteria of the ETL loads. The ETL programs got stuck at the start of UAT on the actual volume of data and could not be completed even after several trials and much tweaking of system parameters.
After these major hurdles with Oracle bulk DML, the only feasible option was to get rid of Oracle DML while inserting and updating records and to simplify the queries. This might look somewhat strange, but this is how it is done:
Steps for replacing INSERT
- Hash partition the huge source table based on key combinations. The optimal number of partitions depends on the volume of data (1024 in this case).
- Break nested loops into Intermediate work tables.
- Reduce Oracle Extents to multiples of average record length of target table.
- Create the target fact table using range partitioning on date and 64 hash sub-partitions on other key combinations.
- Create hash partitioned work tables using CREATE TABLE AS SELECT (CTAS).
- Use Parallel degree while creating intermediate work tables.
- Use Parallel Hint in queries involving the work tables for leveraging the parallel processes.
- Create the final work table exactly like the target table: the same columns, data types, partitions, sub-partitions, and primary indexes.
- Exchange the work table with the date partition of the target fact table based on the transaction date, updating the indexes. WITHOUT VALIDATION may be used for a faster exchange (see the sketch after this list).
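A hedged sketch of the exchange step follows; all table and partition names are illustrative, not taken from the actual project:
-- Swap a fully loaded work table into one date partition of the target fact table
ALTER TABLE dw_fact_sales
  EXCHANGE PARTITION p_20061201 WITH TABLE wrk_fact_sales
  INCLUDING INDEXES
  WITHOUT VALIDATION
  UPDATE GLOBAL INDEXES;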
Steps for replacing Update
- Create a hash partitioned work table with the set of records to be updated, using CTAS. This will cover various dates for which records will be updated.
- Create a hash partitioned work table forming the delta set of (incremented/decremented) records by combining the records from the target table and the updated source records.
- Select the distinct dates that need to be updated and start a loop:
- Create another hash partitioned table by selecting one date partition from the target table and the same date partition from the delta set, combining them with UNION ALL.
- Exchange the date partition of the target fact table with the hash partitioned table corresponding to the selected date.
- End loop.
Steps for Aggregation
- Create hash partitioned table by selecting one partition from Daily Fact table.
- Create hash partitioned table by selecting one weekly partition from Aggregate Fact.
- Create the final hash partitioned work table by combining the above two sets with UNION ALL.
- Exchange the partition between the work table and the aggregate fact table.
Hence, it can be concluded that the conventional DML operations, with which all of us are very familiar, can be avoided when loading huge volumes of data into a data warehouse using PL/SQL, specifically where performance is a major criterion.
Learning Objective
The case study described in this abstract is a very common scenario in major ETL projects, where people spend lots of valuable time tuning the system to meet performance standards using conventional and traditional approaches, without thinking outside the box to come up with a better way of doing things. The abstract also sheds light on some practices that ETL tools adopt while loading data from source to target; understanding these concepts will enhance individual programming skills with those tools. Now we can gladly accept projects involving high volumes of data where the customer prefers PL/SQL over ETL tools for building the ETL.
The Trigger
The Case study is taken from the Performance Tuning approaches adopted in the project of a retail Chain in