Tuesday, December 10, 2013

Summarizing MIS 587

The Business Intelligence course (MIS 587), which I took in the final semester of my Master's degree, has really helped me understand data and how it can be used to analyze an organization's performance and users' social media activity.
The lectures given by Dr. Sudha Ram were very interesting and awakened the data enthusiast in me. The class discussions started with understanding data and data processing systems. I learned how a Balanced Scorecard is created and its role in monitoring and managing business KPIs. We then moved on to Big Data and analytics, which covered OLTP, OLAP, data warehousing, and reporting. After dealing with organizational data, I studied how to collect and handle web data. The social media and web analytics projects gave me a great deal of hands-on experience collecting and analyzing data. I also learned to use Google Analytics to study user behavior on a website and how it can be leveraged to improve the site's design and attract users. Towards the end, I learned network analysis and data visualization, which is an important aspect of Business Intelligence. The main purpose of collecting and storing data is to present it in a way that reveals patterns, and visualization makes those patterns visible. I also got an opportunity to study my LinkedIn and Facebook networks, and I was amazed to see how much information one can deduce just by looking at them.
This course has been very informative and fruitful, and hopefully it will now help me direct my career towards data analysis and business intelligence!

Data Visualization

In past blogs, we have seen several kinds of data, such as unstructured, semi-structured, and structured data, along with the platforms used to store them. This data can be the operational data of an organization or a collection of messages on social media such as Facebook or Twitter. Once we have collected and stored this data, we can move ahead with studying it to uncover useful information. There are several ways in which the data can be studied, such as statistical analysis, predictive analysis, and so on.
One such method is visualizing the data. There are many tools available for data visualization, such as Tableau, Spotfire, Gephi, Raphaël, and many JavaScript (JS) libraries that are free to use. But in order to visualize the available data, you first have to understand it and recognize what patterns you are looking for. Without that understanding, you can easily get lost while studying the data and creating the visualization. For example, suppose you want to see the Twitter activity of a user over time. You would first need to define the time dimension, whether in weeks, days, or hours; in other words, the granularity of the attribute. Then you can proceed to create a two-dimensional graph showing the pattern of tweets posted by the user, as in the sketch below.
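To make that concrete, here is a minimal sketch in Python. The file tweets.csv and its created_at column are placeholder assumptions for illustration, not any real Twitter export format:

```python
# A minimal sketch: bin tweet timestamps at a chosen granularity and
# plot the counts. "tweets.csv" and "created_at" are assumed placeholders.
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

# Pick the granularity: "h" for hours, "D" for days, "W" for weeks.
granularity = "D"
counts = tweets.set_index("created_at").resample(granularity).size()

# A simple two-dimensional graph: time on the x-axis, tweets on the y-axis.
counts.plot(title="Tweets over time")
plt.xlabel("Time")
plt.ylabel("Number of tweets")
plt.show()
```

Changing one string switches the granularity, which is exactly the decision described above.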
A visualization can be as simple as a graph, and it can grow complex depending on the number of attributes you want to include. You will also have to research the available tools and decide which one fulfills your requirements: Tableau is a good option for charts such as bar graphs and heat maps, while Gephi is good for network visualization.
Data visualization is a method of presenting your data that reveals visually appealing patterns, which can help create stories in the context of the data. Returning to the Twitter example, one might find that the user's tweet activity is high during the day, which would be normal; or you might observe that the activity increases in the evening and peaks after midnight, which would suggest that the user is active at night and sleeps or works during the daytime. If you want to dig deeper, you might look at what kind of tweets the user posts during those peak hours and consider direct marketing based on the interests deduced from the tweet activity and followings.
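As a rough sketch of that "digging deeper" step, one could filter the same hypothetical tweets.csv to the peak window and count hashtags as a crude interest signal. The hour range and the text column are assumptions:

```python
# Keep only tweets from an assumed peak window (10 PM to 2 AM) and count
# hashtags as a rough hint at the user's interests.
import re
from collections import Counter
import pandas as pd

tweets = pd.read_csv("tweets.csv", parse_dates=["created_at"])

hours = tweets["created_at"].dt.hour
peak = tweets[(hours >= 22) | (hours < 2)]

hashtags = Counter(
    tag.lower()
    for text in peak["text"].dropna()
    for tag in re.findall(r"#\w+", text)
)
print(hashtags.most_common(10))
```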
Depending on the visualization you intend to create, you might also have to prepare the data for loading into the tools you will be using. Some BI and data warehousing tools, such as OBIEE, provide built-in capabilities for visually presenting the selected data. These visualizations are used most effectively in dashboards and reports that display several types of data, helping someone, most likely a manager, monitor the operational performance of an organization and make appropriate decisions.
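Preparation often just means reshaping raw rows into the form a tool expects. A minimal sketch, again with invented file and column names, aggregates tweets into one row per date and hour so the result loads straight into a heat-map sheet in a tool like Tableau:

```python
# Reshape raw tweet rows into (date, hour, tweet_count) and export to CSV.
import pandas as pd

raw = pd.read_csv("tweets.csv", parse_dates=["created_at"])

prepared = (
    raw.assign(date=raw["created_at"].dt.date,
               hour=raw["created_at"].dt.hour)
       .groupby(["date", "hour"])
       .size()
       .reset_index(name="tweet_count")
)
prepared.to_csv("tweets_by_hour.csv", index=False)
```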
To conclude, I would like to remind you of the phrase "A picture is worth a thousand words," which holds true for explaining data through visualization rather than through tables of raw values.

Wednesday, December 4, 2013

Data Warehouse

In the last blog, I discussed Big Data and the basics of Hadoop, which is used to store and process big data. Now, let’s focus on the structured data produced by the day-to-day operations of any organization. This operational data, when properly harnessed for information, can reveal interesting insights into an organization's operations, support informed decision making, and help drive the company towards its goals.
In order to analyze this structured historical data, it has to be stored in a way that makes retrieval easy and fast, and that is achieved by a Data Warehouse. Traditional database systems store transactional data in highly normalized relationships, which are designed for recording transactions and occupy less space; but they are not optimized for reading and require complex queries to retrieve data, which hurts performance. The difference between a database and a data warehouse can be better understood by comparing the two styles of data processing: one for recording data and one for reading it.
  • OLTP (On-line Transaction Processing): An approach used to facilitate and manage day-to-day business applications, involving a large number of short on-line transactions (inserts, updates, deletes). OLTP systems are designed to maintain data integrity in multi-access environments, and their relational data is stored in highly normalized form, usually 3NF.
  • OLAP (On-line Analytical Processing): An approach used for a low volume of transactions, conceived for Decision Support Systems and Management Information Systems. It facilitates complex, aggregated queries against a multi-dimensional schema and delivers fast response times. OLAP systems are used for data warehousing and data mining, where data has to be combined from multiple sources for easy querying. The contrast is sketched in the example after this list.
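Here is a toy contrast between the two workloads, using SQLite from Python purely for illustration; the table, columns, and rows are invented:

```python
# OLTP vs. OLAP in miniature: many short writes vs. one scanning aggregate.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, product TEXT, "
            "region TEXT, amount REAL, sold_on TEXT)")

# OLTP-style work: short, frequent transactions touching a row at a time.
cur.execute("INSERT INTO sales (product, region, amount, sold_on) "
            "VALUES (?, ?, ?, ?)", ("widget", "West", 19.99, "2013-12-01"))
con.commit()

# OLAP-style work: few queries, but each one aggregates over many rows.
cur.execute("SELECT region, SUM(amount) AS total_sales "
            "FROM sales GROUP BY region ORDER BY total_sales DESC")
print(cur.fetchall())
```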

Data Warehouse: a Data Warehouse is a database that stores data in a multi-dimensional schema rather than a normalized relational schema, and it is used for reporting and data analysis. It can be seen as a central repository of current and historical data, created by integrating data from various disparate sources.
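To show what "multi-dimensional" means in practice, here is a minimal sketch of a star schema, with one fact table keyed to several dimension tables; all table and column names are invented for illustration:

```python
# A star schema in miniature: dimensions describe the who/what/where/when,
# and the fact table holds the measurements keyed by those dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_key INTEGER PRIMARY KEY, city TEXT, region TEXT);

CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
```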

Image Source: The Data Warehouse Toolkit - Ralph Kimball

Ralph Kimball’s book ‘The Data Warehouse Toolkit’ describes the four distinct components of a data warehouse: Operational Source Systems, Data Staging Area, Data Presentation Area, and Data Access Tools. Though the operational source systems are part of the model, Kimball states that “The source systems should be thought of as outside the data warehouse because presumably we have little to no control over the content and format of the data in these operational legacy systems.”
Operational Source Systems
Operational Source Systems are the databases that hold all the transactional data within an organization. The main priorities of the source systems are processing performance and availability. These systems provide the data from which the data warehouse is created.
Data Staging
The Data Staging Area is a temporary location where data from the source systems is stored so that the ETL (extract-transform-load) process can be performed on it. A staging area is mainly required in a data warehousing architecture to collect all required data in one place before it can be integrated into the Data Warehouse.
Extraction is the first step in retrieving the data to be loaded into the data warehouse: the data is read from the source systems, filtered according to the requirements, and placed in the data staging area.
After the data is extracted into the staging area, the Transformation step comes into effect, applying operations such as cleansing the data (correcting misspellings, dealing with missing values, converting to standard formats, and so on), combining data from disparate sources, and assigning data warehouse keys, which are analogous to primary keys in a relational database.
The final step is the Loading of the data, where the new data is presented as dimensional tables and merged into the existing data marts. Each data mart then indexes the new data for query performance.
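A compressed sketch of the three steps in Python, with pandas standing in for a real ETL tool; the source files, columns, and the one spelling fix are all invented placeholders:

```python
# Extract -> Transform -> Load, in miniature.
import pandas as pd

# 1. Extract: read from a source system and filter to the rows we need.
orders = pd.read_csv("source_orders.csv", parse_dates=["order_date"])
orders = orders[orders["status"] == "COMPLETE"].copy()

# 2. Transform: cleanse, standardize, and assign surrogate warehouse keys.
orders["city"] = orders["city"].replace({"Tuscon": "Tucson"})        # fix misspellings
orders["amount"] = orders["amount"].fillna(0.0)                      # handle missing data
orders["order_date"] = orders["order_date"].dt.strftime("%Y-%m-%d")  # standard format
orders["order_key"] = range(1, len(orders) + 1)                      # warehouse key

# 3. Load: merge the new rows into the existing dimensional table.
existing = pd.read_csv("fact_orders.csv")  # assumed to already exist in the mart
pd.concat([existing, orders]).to_csv("fact_orders.csv", index=False)
```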
Data Presentation
The data presentation area is considered to be a set of integrated data marts. A data mart is a subset of the data warehouse, representing selected data about a specific business function. An organization can have multiple data marts, each relevant to the department for which it was designed. For example, the English department may have a data mart reflecting historical student data, including demographics, placement scores, academic performance, and class schedules. This area is what users see as the actual data warehouse: it supports direct querying by users, report writers, and other analytical applications. The data contained in the data presentation area must be detailed and logically organized.
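As an illustration, here is the kind of ad hoc query a report writer might run against a small student data mart; the table and rows are entirely made up:

```python
# A toy department-style mart and an aggregate query against it.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE mart_students (term TEXT, major TEXT, "
            "placement_score INTEGER, gpa REAL)")
cur.executemany(
    "INSERT INTO mart_students VALUES (?, ?, ?, ?)",
    [("Fall 2013", "English", 88, 3.4),
     ("Fall 2013", "English", 75, 3.1),
     ("Fall 2013", "History", 80, 3.6)],
)

# Average placement score and GPA per major for the term.
cur.execute("SELECT major, AVG(placement_score), AVG(gpa) "
            "FROM mart_students WHERE term = 'Fall 2013' GROUP BY major")
print(cur.fetchall())
```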
Data Access Tools
Data Access Tools are simply all the software that can query the data in the data warehouse’s presentation area. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application. A few examples are Cognos 10, SAP BEx, MicroStrategy, and Roambi.