Sunday, September 6, 2009

Basic Concepts of Data Warehousing (Questions and Answers)

What is a data warehouse?
A data warehouse is a collection of data marts representing historical data from different operations in the company. This data is stored in a structure optimized for querying and data analysis. Table design, dimensions and organization should be consistent throughout a data warehouse so that reports and queries across it are consistent. A data warehouse can also be viewed as a database for historical data from different functions within a company.

What is a data mart?
A data mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. Data marts are sometimes complete individual data warehouses which are usually smaller than the corporate data warehouse.

What are the benefits of data warehousing?
Data warehouses are designed to perform well with aggregate queries running on large amounts of data.

The structure of data warehouses is easier for end users to navigate, understand and query against, unlike relational databases, which are primarily designed to handle lots of transactions.

Data warehouses enable queries that cut across different segments of a company's operation. E.g. production data could be compared against inventory data even if they were originally stored in different databases with different structures.

Queries that would be complex in highly normalized databases can be easier to build and maintain in data warehouses, decreasing the workload on transaction systems.

Data warehousing is an efficient way to manage and report on data that is from a variety of sources, non-uniform and scattered throughout a company.

Data warehousing is an efficient way to manage demand for lots of information from lots of users.

Data warehousing provides the capability to analyze large amounts of historical data for nuggets of wisdom that can provide an organization with competitive advantage.

What is OLAP?
OLAP stands for Online Analytical Processing.
It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or enable them to analyze their company's historical data for trends and patterns.

What is OLTP?
OLTP stands for Online Transaction Processing.
OLTP uses normalized tables to quickly record large amounts of transactions while making sure that these updates of data occur in as few places as possible. Consequently, OLTP databases are designed for recording the daily operations and transactions of a business. E.g. a timecard system that supports a large production environment must successfully record a large number of updates during critical periods like lunch hour, breaks, startup and close of work.

What are dimensions?
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension (profit by country, state, city), Product dimension (profit for product1, product2).

What are fact tables?
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart index composed of foreign keys from the primary keys of related dimension tables.

What are measures?
Measures are numeric data based on columns in a fact table. They are the primary data which end users are interested in. E.g. a sales fact table may contain a profit measure which represents profit on each sale.

What are aggregations?
Aggregations are precalculated numeric data. By calculating and storing the answers to a query before users ask for it, the query processing time can be reduced. This is key in providing fast query performance in OLAP.
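For illustration, here is a minimal Python sketch (using pandas, with made-up column names) of what an aggregation amounts to: an answer computed once and stored, so later queries read the small stored result instead of re-scanning the fact rows.

    import pandas as pd

    # Hypothetical sales fact rows.
    facts = pd.DataFrame({
        "year":   [2008, 2008, 2009, 2009],
        "region": ["East", "West", "East", "West"],
        "profit": [100.0, 150.0, 120.0, 170.0],
    })

    # Precalculate profit by year once and store the small result...
    profit_by_year = facts.groupby("year", as_index=False)["profit"].sum()

    # ...so later "profit by year" queries read the stored aggregate
    # instead of re-aggregating the full fact table.
    print(profit_by_year)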

What are cubes?
Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multidimensional views of data, querying and analytical capabilities to clients.

What is the PivotTable® Service?
This is the primary component that connects clients to the Microsoft® SQL Server™ 2000 Analysis Server. It also provides the capability for clients to create local offline cubes, using it as an OLAP server. PivotTable® Service does not have a user interface; clients using its services must provide their own user interface.



What are offline OLAP cubes?
These are OLAP cubes created by clients, end users or third-party applications accessing a data warehouse, relational database or OLAP cube through the Microsoft® PivotTable® Service. E.g. Microsoft® Excel™ is very popular as a client for creating local offline OLAP cubes from relational databases for multidimensional analysis. These cubes have to be maintained and managed by the end users, who must manually refresh their data.

What are virtual cubes?
These are combinations of one or more real cubes and require no disk space to store them. They store only the definitions and not the data of the referenced source cubes. They are similar to views in relational databases.

What are MOLAP cubes?
MOLAP Cubes: stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the Analysis Server computer. It is best when extra storage space is available on the Analysis Server computer and the best query performance is desired. MOLAP local cubes contain all the necessary data for calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time and performance but require additional storage space for the extra copy of data from the fact table.

What are ROLAP cubes?
ROLAP Cubes: stands for Relational OLAP. In ROLAP cubes a copy of data from the fact table is not made, and the data aggregates are stored in tables in the source relational database. A ROLAP cube is best when there is limited space on the Analysis Server and query performance is not very important. ROLAP local cubes contain the dimensions and cube definitions, but aggregates are calculated when they are needed. ROLAP cubes require less storage space than MOLAP and HOLAP cubes.

What are HOLAP cubes?
HOLAP Cubes: stands for Hybrid OLAP. A HOLAP cube has a combination of ROLAP and MOLAP cube characteristics. It does not create a copy of the source data; however, data aggregations are stored in a multidimensional structure on the Analysis Server computer. HOLAP cubes are best when storage space is limited but faster query responses are needed.

What is the approximate size of a data warehouse?

You can estimate the approximate size of a data warehouse made up of only fact and dimension tables by estimating the approximate size of the fact tables and ignoring the sizes of the dimension tables.

To estimate the size of the fact table in bytes, multiply the size of a row by the number of rows in the fact table. A more exact estimate would include the data types, indexes, page sizes, etc. An estimate of the number of rows in the fact table is obtained by multiplying the number of transactions per hour by the number of hours in a typical work day and then multiplying the result by the number of days in a year and finally multiply this result by the number of years of transactions involved. Divide this result by 1024 to convert to kilobytes and by 1024 again to convert to megabytes.

E.g. A data warehouse will store facts about the help provided by a company’s product support representatives. The fact table is made up of a composite key of 7 indexes (int data type) including the primary key. The fact table also contains 1 measure of time (datetime data type) and another measure of duration (int data type). 2000 product incidents are recorded each hour in a relational database. A typical work day is 8 hours and support is provided every day of the year. What will be the approximate size of this data warehouse in 5 years?

First calculate the approximate size of a row in bytes (int data type = 4 bytes, datetime data type = 8 bytes):

Size of a row = size of all composite indexes (add the size of all indexes) + size of all measures (add the size of all measures).

Size of a row (bytes) = (4 * 7) + (8 + 4) = 40 bytes.

Number of rows in fact table = (transactions per hour) * (hours per work day) * (days in a year)

Number of rows in fact table = 2000 * 8 * 365 = 5,840,000 rows per year.

Size of fact table (bytes per year) = 5,840,000 * 40 = 233,600,000 bytes.

Size of fact table (bytes for 5 years) = 233,600,000 * 5 = 1,168,000,000 bytes.

Size of fact table (megabytes for 5 years) = 1,168,000,000 / (1024 * 1024) ≈ 1,113.89 MB.

Size of fact table (gigabytes) = 1,113.89 / 1024 ≈ 1.09 GB.
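The same arithmetic as a short Python sketch, using the figures from the example:

    # Size of one fact row: 7 int keys (4 bytes each) + datetime (8 bytes) + int (4 bytes).
    row_bytes = (4 * 7) + (8 + 4)              # 40 bytes

    # Rows per year: 2000 incidents/hour * 8 hours/day * 365 days/year.
    rows_per_year = 2000 * 8 * 365             # 5,840,000 rows

    bytes_5_years = rows_per_year * row_bytes * 5
    megabytes = bytes_5_years / (1024 * 1024)  # ~1113.89 MB
    gigabytes = megabytes / 1024               # ~1.09 GB
    print(f"{megabytes:.2f} MB = {gigabytes:.2f} GB")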

Sunday, August 23, 2009

Best Practices for Cognos 8 BI

Table of Contents

1. Purpose

2. Scope

3. Modeling – Framework Manager

4. Report Design

5. Performance Tuning

1. Purpose

This document details the various design approaches to be used for generating Cognos 8 reports. It is a technical paper that explains different report complexities and the design options available to achieve them using Cognos 8 Report Studio. The document explains each option and weighs its advantages and disadvantages. It is a compilation of the options considered feasible given the current environment and the reporting requirements.

2. Scope

The design approaches considered in this document include

• Report Design – Cognos 8 BI

• Performance Tuning of the Reports – Cognos 8 BI

Note: The practices and approaches recommended for a particular functionality, operational concept or process may trade off part of the functionality, but they will be optimal for performance.

3. Modeling – Framework Manager

I. Verify Relationships And Cardinality:

Detect Cardinality from the Data Source

When importing from a relational data source, cardinality is detected based on a set of rules specified by the modeler.

The available options are:

• Use primary and foreign keys

• Use matching query item names that represent uniquely indexed columns

• Use matching query item names

The most common situation is to use the first two options: primary and foreign keys as well as query items that represent uniquely indexed columns. When you use these options to import your metadata and generate relationships, remember that you are adding this information to your model (this is important to note for later sections of this paper).

Note: Optional relationships, full outer joins, and many-to-many relationships can be imported from your database and are supported for query execution. Recursive relationships can be imported from your database and will appear in the model, but they are not supported for query execution.

Usage of Cardinality in Query Generation

Framework Manager Modeling Guidelines:

The general rule applied to cardinality is as follows:

0-to-n or 1-to-n cardinality implies FACT

0-to-1 or 1-to-1 cardinality implies DIMENSION

These rules are applied in the scope of what is included in any given query. Framework Manager supports both minimum/maximum cardinality as well as mandatory/optional cardinality.

In cases where minimum/maximum cardinality is defined (as recommended by Cognos’s best practices), it will be applied to identify facts and dimensions.

The basic rules that apply to the usage of cardinality are:

• Cardinality is applied in the context of a query -- only the cardinalities of items explicitly included in the query are evaluated.

• Query subjects defined with 1-to-many or 0-to-many cardinalities are identified as facts.

• Query subjects defined with 1-to-1 or 0-to-1 cardinalities are identified as dimensions.

• It is possible to have a query subject behave as a dimension in one query and as a fact in another.

Note: When only mandatory/optional cardinality is employed, the generated SQL will not be able to compensate for double counting that can occur when dealing with hierarchical relationships and different levels of granularity.

Identifying Facts and Dimensions:

Within the context of a query, a query subject that has only maximum (n) cardinality on all its relationships to other query subjects can be regarded as a fact. The implication is that there are more rows of data in the fact query subject than in the related query subject on the minimum (1) side of the relationship. Any query subject having at least one relationship to another query subject with minimum cardinality (1) will be treated as a dimension.

Analyze a Schema for Facts and Dimensions:

Example 1: In this example, all four query subjects are included in a query. The diagram shows that the query subjects having only maximum (n) cardinalities are treated as facts.

Example 2: In this example, only three query subjects are included in the query. The diagram shows that the Order Header query subject is now treated as a fact.

Example 3: In this example, query subjects whose cardinality indicates that they are always facts are circled in blue. Areas where the behavior is dependent on the context of the query are circled in red. All query subjects that are not circled behave as dimensions in all cases.



II. Simplify the Model with Dimensional Concepts

Framework Manager Modeling Guidelines:

Use the cardinality rules from the previous section to identify areas of the model that ambiguously identify dimensions or facts.

Collapse groups of query subjects with hierarchical relationships into a single query subject per business concept (typically applies to dimensions).

Collapse groups of query subjects with master/detail relationships into a single query subject per business concept (typically applies to facts).

The end result of simplifying the model should be a layer of query subjects that clearly represent the data in terms of business concepts and ensure predictable query generation.

Collapse Hierarchical Relationships:

There are often cases in normalized or snowflake data sources where several tables exist to describe a single business concept. Product is a very common example of this. For example, a normalized representation of Product includes three tables related by one-to-many relationships. Each Product Line has one or more Product Types. Each Product Type has one or more Products.

An end user may or may not know the relationships between the individual levels of product, and the need to expand and select a query item from each query subject in the model means more clicks for the end user. A modeler can decide to create a model query subject for Product that not only simplifies using Product for ad hoc query and reporting, but also presents the levels of the hierarchy in order to visually cue the end user on the relationship between the levels. If desired, the modeler can use query item folders to clarify which attributes apply to which level. In this example, further clarification is not required.


Collapse Master/Detail Relationships:

Another common scenario in data sources is the existence of master/detail tables containing facts. A good example of this is order header and order detail. For the purposes of inserting and updating data, this structure is beneficial. For reporting purposes, however, the concept of Orders exists separately from the physical representation of the data and, as discussed in the previous section, this scenario can also lead to unpredictability in query generation.

In order to simplify the model in this case, it is better to create a model query subject that combines the foreign keys of both Order Header and Order Details and includes all measures at the Order Detail level.

III. Resolve Ambiguous Relationships

Framework Manager Modeling Guidelines

Cases of multiple relationships or recursive relationships generally imply that the data represented by a query subject can be viewed in more than one context or role and should be modeled accordingly.

Two common examples of ambiguous relationships are:

Multiple Valid Relationships – typically occur between dimensions and facts. Create a shortcut or model query subject for each role, with a single relationship specific to the role.

Recursive Relationships – typically imply two or more levels of granularity. As a minimum, create a shortcut or model query subject for each of the parent and child roles. For a deep recursive hierarchy, we recommend that the hierarchy be flattened in the database and that you model the flattened hierarchy in a single query subject.

An ambiguous relationship occurs where there are multiple relationship paths between one or more query subjects, leaving multiple options for how to write a query. This is a situation that can be responded to by the modeler or, in the database, by the DBA. If appropriate, the DBA can choose to remove extra relationships (although this can potentially result in another situation that must then be resolved in the model).




IV. Define Dimensional Information

Framework Manager Modeling Guidelines

The primary purpose of dimensional information in Framework Manager is to prevent double counting. By specifying the internal cardinality or granularity of the data in a query subject, it is possible to enhance the model and allow the query engine to anticipate situations in which double-counting could occur.

When specifying dimensional information consider the following:

  • Identify the levels that are relevant for reporting purposes.
  • Identify the key(s) for those levels and whether those keys uniquely identify the members of their respective levels.
  • Identify any attributes associated to levels.
  • Identify the hierarchy of the levels – the relationship between levels.
  • Multiple hierarchies may exist; it is recommended that these be explicitly modeled with additional query subjects.
  • Star-style dimension tables are perfect candidates for dimensional information because they typically have more than one level and they include attributes that are associated to more than one level.

When and How to Use Dimensional Information:

Use dimensional information to specify the relationship between levels in a multi-level dimension. For example, there is a time dimension table that contains keys for days, months, quarters, and years. The day key is the primary key of the table; however, there are also keys for month, quarter and current_year, each of which can uniquely identify the data at its respective level and could relate directly to facts in the database.

In order to define the expected behavior when querying at one or more levels of time, dimensional information is used. Levels are defined for years, quarters, months, and days. A key is defined for each level and, in this example, that key is sufficient to uniquely identify the level. A hierarchy is defined to specify the relationships between the levels: days roll up to months, months to quarters, and quarters to years. Attributes and facts can also be defined for each level.

Any attribute or fact that is not explicitly associated to a level will be treated by default as if it were associated to the lowest level (day).


When performing multi-fact queries between facts at different levels of granularity, dimension information on conformed dimensions ensures correct behavior when a multi-fact query is submitted and each fact is recorded for a different time period (daily versus monthly, for example). The query engine will use its understanding of the relationship between days and months from the defined dimensional information to ensure that data is rolled up correctly.

How to Avoid Double-Counting

There are several common scenarios that can lead to double-counting:

• multi-fact, multi-grain queries

• aggregate tables

• multiple hierarchies

Multi-Fact/Multi-Grain Queries

Multi-fact queries where de-normalized dimensions are related to fact tables on different keys (and therefore levels). A de-normalized dimension table typically has internal cardinality – distinct levels of attribute data with keys that have a hierarchical relationship to each other.

Example: Multi-Fact/Multi-Grain Query with Dimensional Information

A multi-fact/multi-grain query is typically the result of multiple star schemas sharing conformed dimensions but using keys for different levels of the same hierarchy. There is potential for double counting, since an aggregation is required to bring quantitative items from both fact tables to the same level of granularity for reporting.




The example above shows two fact tables sharing a time dimension and a product dimension. The Time dimension has four levels: Year, Quarter, Month, and Day. It has a relationship to Inventory Levels Fact on the Month key and to Sales Fact on the Day key. The Product dimension has three levels: Product Line, Product Type, and Product. It has relationships to both fact tables on the Product key. A report on these four tables will, by default, be aggregated to retrieve records from each fact table at the lowest common level of granularity, in this case the month and product level. Without specifying dimensional information on the time dimension, incorrect aggregation could occur – for example, monthly fact values could be rolled up based on the lower time level of days. By using dimensional information to specify that days roll up to months, months to quarters, and quarters to years, any request to aggregate based on time will ensure that days are rolled up to months.
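To make the rollup concrete, here is a minimal pandas sketch (table and column names are made up) of what the query engine effectively does with the dimensional information: daily sales are rolled up to the month level first and only then combined with the monthly inventory figures; joining at the day level instead would repeat each monthly value once per day and double-count it.

    import pandas as pd

    # Hypothetical daily sales fact and monthly inventory-levels fact.
    sales = pd.DataFrame({
        "day_key":   ["2009-01-01", "2009-01-02", "2009-02-01"],
        "product":   ["P1", "P1", "P1"],
        "sales_qty": [10, 5, 7],
    })
    inventory = pd.DataFrame({
        "month_key": ["2009-01", "2009-02"],
        "product":   ["P1", "P1"],
        "stock_qty": [100, 90],
    })

    # Roll daily sales up to the month level first...
    sales["month_key"] = sales["day_key"].str[:7]
    monthly_sales = sales.groupby(["month_key", "product"], as_index=False)["sales_qty"].sum()

    # ...then join at the common (month, product) grain.
    report = monthly_sales.merge(inventory, on=["month_key", "product"])
    print(report)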

4. Report Design

§ Approach 1 cannot be used in the following scenarios:

If the query includes summaries, multiplication or replication of records could cause incorrect summary calculations in either of the child elements of the tabular set, and hence the INTERSECT/EXCEPT construct will fail to mimic the EXISTS/NOT EXISTS functionality.

§ Approach 2 cannot be used in the following scenarios:

1. If the query includes an EXISTS clause, then approach 2 cannot be used, since the functionality of the EXISTS clause cannot be implemented using an OUTER JOIN.

2. If the query has a NOT EXISTS clause and the metadata created to implement the NOT EXISTS functionality is not generic (i.e. it has report-specific user-input filters in the metadata), then approach 2 will not be used.

§ Approach 4 will be used instead of approach 3 if and only if the SQL can be completely reused for another report of the same application (without the need to have any additional joins in the metadata).

(iii) Reports with queries having UNION/UNION ALL clause

Approach 1: Tabular Set

Report Studio Tabular Sets can be used to create reports which have UNION/UNION ALL queries. A tabular set is basically a relational set expression that combines two or more tabular structures, such as tabular models, tabular SQL and so on. A tabular set produces a single result set, which is the UNION/UNION ALL of the tabular structures.
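As an analogy only (not Report Studio syntax), the difference between UNION and UNION ALL in a minimal pandas sketch: UNION ALL simply stacks the result sets, while UNION also removes duplicate rows.

    import pandas as pd

    a = pd.DataFrame({"region": ["East", "West"], "qty": [1, 2]})
    b = pd.DataFrame({"region": ["East", "North"], "qty": [1, 3]})

    union_all = pd.concat([a, b], ignore_index=True)  # keeps duplicates, like UNION ALL
    union = union_all.drop_duplicates()               # removes duplicates, like UNION
    print(union_all)
    print(union)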

Approach 2: Tabular SQL

The Report Studio Tabular SQL feature can be used when there is a need to create reports having queries with UNION/UNION ALL. The SQL statement may be expressed by using syntax recognized by the native database, or by using Cognos SQL syntax.

Approach 3: Data source Query Subjects in Framework Manager

Data source query subject in Framework Manager can be used when there is a need to create reports having queries with UNION/UNION ALL clause. The SQL statement may be expressed by using syntax recognized by the native database, or by using Cognos SQL syntax. This query subject can be used in report studio to create reports.

§ Approach 1 cannot be used in the following scenarios:

1. If the individual SQL of the query cannot be implemented using either Tabular Sets or Tabular Models, then approach 1 will not be used.

§ Approach 3 will be used instead of approach 2, if and only if the SQL can be completely reused for another report of the same application (without the need to have any additional joins in the metadata).

(iv) Reports with SQL having Inline Queries

When a SELECT query is nested inside another SELECT, it is an inline query.

Approach 1: Tabular SQL

Report Studio Tabular SQL feature can be used when there is a need to create reports having inline Query. The SQL statement may be expressed by using syntax recognized by the native database, or by using Cognos SQL syntax.
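For illustration, here is a hypothetical inline query of the kind meant here, held in a Python string (the table and column names are made up): the inner SELECT is nested inside the outer SELECT's FROM clause.

    # The inner SELECT computes per-order totals; the outer SELECT
    # averages those totals per region.
    inline_sql = """
    SELECT region, AVG(order_total) AS avg_order_total
    FROM (
        SELECT region, order_no, SUM(amount) AS order_total
        FROM order_detail
        GROUP BY region, order_no
    ) AS per_order
    GROUP BY region
    """
    print(inline_sql)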

(v) Reports where the query has user inputs in an outer join

Approach 1: #prompt# in Tabular SQL

Macro function #prompt# can be used to define a parameter in the outer join definition in the Tabular SQL. The value for this prompt is set from the prompt page, thereby mimicking the functionality of passing user inputs to the outer join (for example, a join condition such as A.status_code = #prompt('p_status', 'string')#, where the parameter name is made up).

Approach 2: #prompt# in Data source Query Subjects in Framework Manager

Macro function #prompt# can be used to define a parameter in the outer join definition in the FM Metadata. The value to this prompt is set from the prompt page thereby mimicking the functionality of passing user inputs to the outer join.

Rules:

Approach 2 will be used instead of approach 1, if and only if the metadata can be completely reused for another report of the same application (without the need to have any additional joins in the metadata).

(vi) Reports having complex logic (Unions with multiple save data, Complex Calculations, Loops, nested Conditions, etc.).

Approach 1: Summary Table

A summary table can be created and joined with the required tables in order to build the report in Report Studio. The data in the table needs to be refreshed periodically (a minimal sketch of such a refresh appears after the note below).

Note: Summary tables should be used only in the following scenarios:

If data retention/volume is less than 1 million records.

Data in the report is not real-time / not near real-time.

User volume is high.
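A minimal sketch of the periodic refresh behind Approach 1, using Python with sqlite3 purely for illustration (the table names and grain are hypothetical):

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # Hypothetical detail table with a few rows, purely for illustration.
    conn.executescript("""
        CREATE TABLE sales_detail (product_key INT, sale_date TEXT, amount REAL);
        INSERT INTO sales_detail VALUES
            (1, '2009-01-05', 100.0), (1, '2009-01-20', 50.0), (2, '2009-02-02', 75.0);
    """)

    # The periodic refresh: rebuild the summary at the (product, month) grain
    # so reports join against the small summary instead of the detail table.
    conn.executescript("""
        DROP TABLE IF EXISTS sales_summary;
        CREATE TABLE sales_summary AS
        SELECT product_key, substr(sale_date, 1, 7) AS month_key, SUM(amount) AS total_amount
        FROM sales_detail
        GROUP BY product_key, month_key;
    """)
    print(conn.execute("SELECT * FROM sales_summary").fetchall())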

Approach 2: Stored Procedure

If the data in the report needs to be real-time or near real-time, or if the data retention is more than 1 million records, then the stored procedure (SP) approach would be considered. The stored procedure is defined in the FM metadata and the report uses this stored procedure query subject to pull the data. In some cases table variables might need to be used.

(vii) Report Query needs inputs from online applications/batch process

Approach 1:

Cognos 8 SDK can be used to pass parameters to the reports.

5. Performance Tuning

(I) Try to set the Processing property to Database Only. It runs faster than Limited Local.

(II) Try to use fixed column widths where possible. We have found that dynamic sizing sometimes runs queries twice--once to find the largest data content, then again to dynamically size the columns based on the largest data content.

(III) Try to move as many calculations as possible into the model so your query doesn't have to do them for every row of data pulled.

(IV) Try to use as many filters as possible up front to minimize the data you must pull in and process.

(V) Simple CASE statements run faster than searched CASE statements. A simple CASE compares one expression against a list of values (CASE status WHEN 1 THEN ... END), while a searched CASE evaluates a full condition in each branch (CASE WHEN status = 1 THEN ... END).

(VI) Place calculations strategically. Rather than include a calculation in a CASE statement in multiple places, do it once in a separate field and then refer to it in multiple places. This minimizes processing time.

(VII) Create logical and useful index fields and logical table structures to make searching and pulling the data as efficient as possible.

(VIII) When sorting a field, sort it either at the tabular model level OR the queries level but NOT both.

(IX) Where possible, use database functions for calculations instead of their equivalent Cognos functions.

(X) When using facts directly in a report, do not create shortcuts; use the facts directly in the query.

(XI) In the Cognos configuration the default font is Times New Roman; best practice is to change the font to Arial, as it will use less memory.