In the last blog, I discussed about Big Data and the basics of Hadoop
which is used to store and process big data. Now, let’s focus on the structured
data produced by day to day operations in any organization. This operational data,
when used properly to harness information, can reveal some interesting insights
into the operations of an organization and help in informed decision making and
driving the company towards its goal.
In order to analyze this structured historical data, it has to be
stored in a way that it is easy and less time consuming to retrieve data. And
that is achieved by a Data Warehouse. Traditional database systems are used to
store transactional data in highly normalized relationships designed for
recording transactions or data and occupy less space. But it is not optimized
for reading and requires complex queries to retrieve data which impacts its
performance. The difference between database and data warehouse can be better
understood by comparing the data processing required to record the data and
read the data.
- OLTP (On-line Transaction Processing): It is an approach used to facilitate and manage day-to-day business applications, and involves large number of short on-line transactions (insert, update, delete). OLTP systems are designed to maintain data integrity in multi-access environment, and the relational data is stored in highly normalized form, usually in 3NF.
- OLAP (On-line Analytical Processing): It is an approach used for low volume of transactions, and was conceived for Decision Support Systems and Management Information Systems. It facilitates use of complex and aggregated queries to retrieve data from multi-dimensional schema, and faster response time. OLAP systems are used for data warehousing and data mining where data has to be combined from multiple sources for easy querying.
Data Warehouse: Data Warehouse is a concept where database is used to store
data in multi-dimensional schema rather than relational schema, and is used for
reporting and data analysis. It can be seen as a central repository of current
and historical data which is created by integrating data from various disparate
sources.
Image Source: The Data Warehouse Toolkit - Ralph Kimball
The Ralph Kimball’s book ‘The Data Warehouse Toolkit’
describes the four distinct components of a data warehouse as: Operational
Source Systems, Data Staging Area, Data Presentation Area, and Data Access
Tools. Though the Operational source system is part of the model, Kimball
states that “The source systems should be thought of as outside the data
warehouse because presumably we have little to no control over the content and
format of the data in these operational legacy systems.”
Operational Source
Systems
Operational Source Systems are the databases which hold all the
transactional data within an organization. The main priorities of the source systems are
processing, performance, and availability. These provide the necessary data
which will be used to create the data warehouse.
Data Staging
The Data Staging Area is a temporary location where data
from source systems is stored to perform the ETL (extraction-transformation-load)
process on the data. A staging area is mainly required in a Data Warehousing
Architecture to collect all required data at one place before the data can be
integrated into the Data Warehouse.
Extraction is the first step in retrieving the data required
to be loaded into data warehouse. The data is read from the source systems and
filtered according to the requirements and loaded onto the data staging area.
After the data is extracted into the data staging area, Transformation
step comes into effect where several transformations are applied such as
cleansing the data (i.e. correct misspellings, dealing with missing data,
converting to standard formats, etc.), combining data from disparate sources
and assign the data warehouse keys which is analogous to a primary key in
relational database.
The final step is the Loading of data where the new data is
presented as dimensional tables and merged into the existing data marts. Each
of the data marts then indexes this new data for query performance.
Data Presentation
The data presentation area is considered to be a set of
integrated data marts. A data mart is a subset of the data warehouse and
represents select data regarding a specific business function. An organization
can have multiple data marts, each one relevant to the department for which it
was designed. For example, the English department may have a data mart
reflecting historical student data including demographics, placement scores,
academic performance, and class schedules. This area is seen as the actual data
warehouse which facilitates direct querying by users, report writers, and other
analytical applications. The data contained in the data presentation area must
be detailed and logically organized.
Data Access Tools
Data Access Tools are really all the software that can query
the data in the data warehouse’s presentation area. Few examples of Data Access
Tools are Cognos 10, SAP BeX, MicroStrategy, and Roambi. A data access tool can
be as simple as an ad hoc query tool or as complex as a sophisticated data
mining or modeling application.
No comments:
Post a Comment