Wednesday, December 4, 2013

Data Warehouse

In the last blog, I discussed about Big Data and the basics of Hadoop which is used to store and process big data. Now, let’s focus on the structured data produced by day to day operations in any organization. This operational data, when used properly to harness information, can reveal some interesting insights into the operations of an organization and help in informed decision making and driving the company towards its goal.
In order to analyze this structured historical data, it has to be stored in a way that it is easy and less time consuming to retrieve data. And that is achieved by a Data Warehouse. Traditional database systems are used to store transactional data in highly normalized relationships designed for recording transactions or data and occupy less space. But it is not optimized for reading and requires complex queries to retrieve data which impacts its performance. The difference between database and data warehouse can be better understood by comparing the data processing required to record the data and read the data.
  • OLTP (On-line Transaction Processing): It is an approach used to facilitate and manage day-to-day business applications, and involves large number of short on-line transactions (insert, update, delete). OLTP systems are designed to maintain data integrity in multi-access environment, and the relational data is stored in highly normalized form, usually in 3NF.
  • OLAP (On-line Analytical Processing): It is an approach used for low volume of transactions, and was conceived for Decision Support Systems and Management Information Systems. It facilitates use of complex and aggregated queries to retrieve data from multi-dimensional schema, and faster response time. OLAP systems are used for data warehousing and data mining where data has to be combined from multiple sources for easy querying.

Data Warehouse: Data Warehouse is a concept where database is used to store data in multi-dimensional schema rather than relational schema, and is used for reporting and data analysis. It can be seen as a central repository of current and historical data which is created by integrating data from various disparate sources.

Image Source: The Data Warehouse Toolkit - Ralph Kimball

The Ralph Kimball’s book ‘The Data Warehouse Toolkit’ describes the four distinct components of a data warehouse as: Operational Source Systems, Data Staging Area, Data Presentation Area, and Data Access Tools. Though the Operational source system is part of the model, Kimball states that “The source systems should be thought of as outside the data warehouse because presumably we have little to no control over the content and format of the data in these operational legacy systems.”
Operational Source Systems
Operational Source Systems are the databases which hold all the transactional data within an organization.  The main priorities of the source systems are processing, performance, and availability. These provide the necessary data which will be used to create the data warehouse.
Data Staging
The Data Staging Area is a temporary location where data from source systems is stored to perform the ETL (extraction-transformation-load) process on the data. A staging area is mainly required in a Data Warehousing Architecture to collect all required data at one place before the data can be integrated into the Data Warehouse.
Extraction is the first step in retrieving the data required to be loaded into data warehouse. The data is read from the source systems and filtered according to the requirements and loaded onto the data staging area.
After the data is extracted into the data staging area, Transformation step comes into effect where several transformations are applied such as cleansing the data (i.e. correct misspellings, dealing with missing data, converting to standard formats, etc.), combining data from disparate sources and assign the data warehouse keys which is analogous to a primary key in relational database.
The final step is the Loading of data where the new data is presented as dimensional tables and merged into the existing data marts. Each of the data marts then indexes this new data for query performance.
Data Presentation
The data presentation area is considered to be a set of integrated data marts. A data mart is a subset of the data warehouse and represents select data regarding a specific business function. An organization can have multiple data marts, each one relevant to the department for which it was designed. For example, the English department may have a data mart reflecting historical student data including demographics, placement scores, academic performance, and class schedules. This area is seen as the actual data warehouse which facilitates direct querying by users, report writers, and other analytical applications. The data contained in the data presentation area must be detailed and logically organized.
Data Access Tools
Data Access Tools are really all the software that can query the data in the data warehouse’s presentation area. Few examples of Data Access Tools are Cognos 10, SAP BeX, MicroStrategy, and Roambi. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application.

No comments:

Post a Comment