Tuesday, December 10, 2013

Summarizing MIS 587

The Business Intelligence (MIS 587) course I took in the final semester of my Master's degree has really helped me understand data and how it can be used to analyze an organization's performance and users' social media activity.
The lectures given by Dr. Sudha Ram were very interesting and awakened the data enthusiast in me. The class discussions started with understanding data and data processing systems. I learnt how a Balanced Scorecard is created and its role in monitoring and managing business KPIs. We then moved on to Big Data and analytics, which covered OLTP, OLAP, data warehousing, and reporting. After dealing with organizational data, I studied how to collect and handle web data. The social media and web analytics projects gave me a great deal of hands-on experience with collecting and analyzing data. I also learned to use Google Analytics to study user behaviour on a website and how it can be leveraged to improve the website design and attract users.
Towards the end, I learned about network analysis and data visualization, which is an important aspect of Business Intelligence. The main purpose of collecting and storing data is to present it in a way that reveals patterns, and visualization makes those patterns visible. I also got an opportunity to study my LinkedIn and Facebook networks, and I was amazed to see how much information one can deduce by looking at them.
This course has been very informative and fruitful, and hopefully it will help me direct my career towards data analysis and business intelligence!

Data Visualization

In past blogs, we have seen the storage of several kinds of data, such as unstructured, semi-structured, and structured data, as well as the platforms used to store it. This data can be the operational data of an organization or a collection of messages on social media such as Facebook or Twitter. Once we have collected and stored this data, we can move ahead with studying it to uncover useful information. There are several ways in which the data can be studied, such as statistical analysis, predictive analysis, and so on.
One such method is visualizing the data. There are many tools available for data visualization, such as Tableau, Spotfire, Gephi, Raphaël, and many freely available JavaScript (JS) libraries. But in order to visualize the available data, you first have to understand it and recognize what patterns you are looking for. Without understanding the data, you can easily get lost while studying it and creating the visualization. For example, let's say you want to see the Twitter activity of a user over time. You need to decide on the unit of time, whether weeks, days, or hours, which is basically the granularity of that attribute, and then create a two-dimensional graph to view the pattern of tweets posted by the user.
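As a minimal sketch of that example (the timestamps below are made up, and I'm assuming the pandas and matplotlib libraries are available), bucketing a user's tweets by hour of day and plotting the counts could look something like this:

# A minimal sketch: bucket hypothetical tweet timestamps by hour of day
# and plot the counts as a simple bar chart.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical tweet timestamps for one user
timestamps = pd.to_datetime([
    "2013-12-09 23:15", "2013-12-10 00:05", "2013-12-10 00:40",
    "2013-12-10 01:10", "2013-12-10 13:30", "2013-12-10 19:45",
])

# Choose the granularity: here we bucket by hour of day
tweets = pd.Series(1, index=timestamps)
hourly_counts = tweets.groupby(tweets.index.hour).sum()

# A two-dimensional view: hour of day vs. number of tweets
hourly_counts.plot(kind="bar", xlabel="Hour of day", ylabel="Tweets")
plt.title("Tweet activity by hour (sample data)")
plt.show()
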
A visualization can be as simple as a graph, and can get complex depending on the number of attributes you want to consider while visualizing. You will also have to research the available tools and decide which one fulfills your requirements. Tableau is a good option for charts such as bar charts and heat maps, while Gephi is good for network visualization.
Data visualization is a method of presenting your data that reveals visually appealing patterns, which can help create stories in the context of the data. Considering the previous Twitter example, you might find that the user's tweet activity is high during the day, which would be normal, or you might observe that it increases in the evening and peaks after midnight. This would clearly indicate that the user is active at night and sleeps or works during the daytime. If you want to dig deeper, you might look at what kind of tweets the user posts during those peak hours and consider direct marketing based on the interests deduced from their tweet activity and followings.
Depending on the visualization you intend to do, you might also have to prepare the data for loading into the tools you will be using. Some BI or data warehousing tools, such as OBIEE, provide a built-in capability to visually present the selected data. Such visualizations are most effectively used in dashboards and reports that display several types of data, helping someone, typically a manager, monitor the operational performance of an organization and make appropriate decisions.
To conclude, I would like to remind you of the phrase "A picture is worth a thousand words", which holds true for explaining data through visualization rather than through tables of raw data values.

Wednesday, December 4, 2013

Data Warehouse

In the last blog, I discussed Big Data and the basics of Hadoop, which is used to store and process big data. Now, let's focus on the structured data produced by the day-to-day operations of any organization. This operational data, when used properly to harness information, can reveal some interesting insights into the operations of an organization, help in informed decision making, and drive the company towards its goals.
In order to analyze this structured historical data, it has to be stored in a way that makes retrieval easy and fast, and that is what a Data Warehouse achieves. Traditional database systems store transactional data in highly normalized relations designed for recording transactions while occupying less space. However, such a design is not optimized for reading, and retrieving data requires complex queries that hurt performance. The difference between a database and a data warehouse can be better understood by comparing the data processing approaches used to record data and to read data:
  • OLTP (On-line Transaction Processing): An approach used to facilitate and manage day-to-day business applications, involving a large number of short on-line transactions (insert, update, delete). OLTP systems are designed to maintain data integrity in multi-access environments, and the relational data is stored in a highly normalized form, usually 3NF.
  • OLAP (On-line Analytical Processing): An approach used for a low volume of transactions, originally conceived for Decision Support Systems and Management Information Systems. It facilitates complex, aggregated queries over a multi-dimensional schema with faster response times. OLAP systems are used for data warehousing and data mining, where data has to be combined from multiple sources for easy querying (a small sketch contrasting the two follows below).
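As a rough, hypothetical illustration of this difference in workload (the table and column names are invented, and sqlite3 is used only for convenience), OLTP issues many short writes against normalized tables, while OLAP issues aggregate reads against a dimensional schema:

# OLTP vs. OLAP in miniature, using invented tables and sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# OLTP-style: a normalized transactional table, written one short row at a time
cur.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INT, "
            "product_id INT, quantity INT, order_date TEXT)")
cur.execute("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", (1, 101, 5, 2, "2013-12-10"))

# OLAP-style: a denormalized fact table read with aggregate queries
cur.execute("CREATE TABLE sales_fact (date_key TEXT, product_key INT, store_key INT, revenue REAL)")
cur.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", [
    ("2013-11", 5, 1, 120.0), ("2013-11", 5, 2, 80.0), ("2013-12", 5, 1, 200.0),
])
cur.execute("SELECT date_key, SUM(revenue) FROM sales_fact GROUP BY date_key")
print(cur.fetchall())  # [('2013-11', 200.0), ('2013-12', 200.0)]
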

Data Warehouse: A Data Warehouse is a concept where a database stores data in a multi-dimensional (dimensional) schema rather than a normalized relational schema, and is used for reporting and data analysis. It can be seen as a central repository of current and historical data, created by integrating data from various disparate sources.

Image Source: The Data Warehouse Toolkit - Ralph Kimball

Ralph Kimball's book 'The Data Warehouse Toolkit' describes the four distinct components of a data warehouse as: Operational Source Systems, Data Staging Area, Data Presentation Area, and Data Access Tools. Though the operational source systems are part of the model, Kimball states that “The source systems should be thought of as outside the data warehouse because presumably we have little to no control over the content and format of the data in these operational legacy systems.”
Operational Source Systems
Operational Source Systems are the databases which hold all the transactional data within an organization.  The main priorities of the source systems are processing, performance, and availability. These provide the necessary data which will be used to create the data warehouse.
Data Staging
The Data Staging Area is a temporary location where data from the source systems is stored so that the ETL (extraction-transformation-load) process can be performed on it. A staging area is mainly required in a data warehousing architecture to collect all the required data in one place before it can be integrated into the Data Warehouse.
Extraction is the first step, retrieving the data that needs to be loaded into the data warehouse. The data is read from the source systems, filtered according to the requirements, and copied into the data staging area.
After the data is extracted into the staging area, the Transformation step applies several transformations, such as cleansing the data (correcting misspellings, dealing with missing data, converting to standard formats, etc.), combining data from disparate sources, and assigning data warehouse (surrogate) keys, which are analogous to primary keys in a relational database.
The final step is the Loading of the data, where the new data is presented as dimensional tables and merged into the existing data marts. Each data mart then indexes the new data for query performance.
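As a minimal sketch of this ETL flow (the file name, field names, and helper functions below are all hypothetical), the three steps might look like this:

# A minimal ETL sketch: extract rows from a source file, transform them
# (cleansing, standard formats, surrogate keys), and load them into a
# dimension table in the presentation area.
import csv

def extract(path):
    # Read raw customer rows from a hypothetical source extract (CSV)
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Cleanse values, standardize formats, and assign surrogate keys
    cleaned = []
    for key, row in enumerate(rows, start=1):
        cleaned.append({
            "customer_key": key,                                  # data warehouse (surrogate) key
            "name": row["name"].strip().title(),                  # fix casing and whitespace
            "state": row.get("state", "").upper() or "UNKNOWN",   # handle missing data
        })
    return cleaned

def load(rows, dimension_table):
    # Merge the transformed rows into the target dimension table
    dimension_table.extend(rows)
    return dimension_table

customer_dim = []
load(transform(extract("customers_extract.csv")), customer_dim)
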
Data Presentation
The data presentation area is considered to be a set of integrated data marts. A data mart is a subset of the data warehouse and represents select data regarding a specific business function. An organization can have multiple data marts, each one relevant to the department for which it was designed. For example, the English department may have a data mart reflecting historical student data including demographics, placement scores, academic performance, and class schedules. This area is seen as the actual data warehouse which facilitates direct querying by users, report writers, and other analytical applications. The data contained in the data presentation area must be detailed and logically organized.
Data Access Tools
Data Access Tools are essentially all the software that can query the data in the data warehouse’s presentation area. A few examples are Cognos 10, SAP BEx, MicroStrategy, and Roambi. A data access tool can be as simple as an ad hoc query tool or as complex as a sophisticated data mining or modeling application.

Sunday, November 17, 2013

Big Data!

BIG DATA! A term heard almost every day in the business and technology world. But many still don't know what it is or how to handle it!
Big Data can be referred to as data, data, and lots of data. Gartner defines Big Data as high-Volume (huge amounts of data), high-Velocity (the speed at which data is generated), and/or high-Variety (the types and sources of data) information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization. Some also identify a fourth dimension, Veracity (the truthfulness of data). In today's world, data is generated at such a high rate that capturing and storing it was a challenge in itself, but the real challenge is analyzing this data.
In order to analyze the data, it has to be stored somewhere for retrieval as and when required. The data can be structured, e.g., transactions that take place in a supermarket; or unstructured, e.g., an image posted on Facebook along with a comment. Deciding where to store the data depends on what kind of data you want to store. Generally, structured data is stored in relational tables in a database such as SQL Server, Oracle, or MySQL. Unstructured data, on the other hand, is stored as key-value pairs, or converted into a somewhat structured form, in NoSQL databases such as MongoDB and Cassandra. Now, if you have heard of Big Data, you must have also heard of Hadoop, but don't confuse the two.
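As a toy contrast (both records below are invented), the same idea in code: a structured record has fixed columns, while an unstructured social media post is more naturally captured as a flexible, document-style set of key-value pairs:

# A structured supermarket transaction: fixed columns (date, store, product, qty, total)
structured_row = ("2013-11-17", "store_42", "milk", 2, 3.98)

# A document-style social media post: fields can vary from record to record
document = {
    "user": "jane_doe",
    "posted_at": "2013-11-17T21:04:00Z",
    "text": "Look at this sunset!",
    "image_url": "http://example.com/sunset.jpg",
    "likes": 42,
}
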

Apache Hadoop is a distributed data storage system that can be used to store structured or unstructured data as required, and it has its own file system called the Hadoop Distributed File System (HDFS), derived from the Google File System (GFS). HDFS is designed to hold huge amounts of data (terabytes to petabytes), and it provides redundancy across multiple machines to ensure durability against failures and high availability to applications.
HDFS breaks individual files into blocks of a fixed size, 64 MB by default. These data blocks are stored across a cluster of several machines with storage capacity. Each individual machine in this cluster is referred to as a data node. Thus, multiple data nodes are accessed to read a single file, and the failure of one machine could lead to data loss. To avoid this problem, each block of data is stored on multiple nodes (three by default) to maintain availability.
The writing/storage of blocks onto the data nodes is handled by a single machine called the name node, which stores all the metadata for the file system. A redundant name node is also maintained to preserve the metadata and make it available in case the operational name node crashes. To read a file, an application client requests the name node and retrieves the list of data nodes containing the individual blocks that comprise the file. Once the nodes and the blocks on each node are identified, the client reads the file directly from the data nodes, avoiding any overhead on the name node for this bulk transfer. The system is designed this way to support write-once, read-many access.
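As a back-of-the-envelope sketch (the file size is hypothetical; the block size and replication factor are the defaults mentioned above), this is how the block count and raw storage add up:

# How HDFS block size and replication affect storage for a hypothetical 1 GB file
import math

FILE_SIZE_MB = 1024   # hypothetical file of 1 GB
BLOCK_SIZE_MB = 64    # default HDFS block size
REPLICATION = 3       # default replication factor

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
raw_storage_mb = FILE_SIZE_MB * REPLICATION

print(f"{num_blocks} blocks of {BLOCK_SIZE_MB} MB")                 # 16 blocks
print(f"~{raw_storage_mb} MB of raw storage across the cluster")    # ~3072 MB
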

MapReduce: MapReduce is a programming model that runs in the Hadoop environment. It is designed to process huge volumes of data in parallel by dividing the work into a set of independent tasks and executing them across several machines. Basically, a MapReduce program processes a list of input data elements to produce output data elements. It can be seen as two-phase processing: Map and Reduce. In the first phase, mapping, a map function transforms or processes each data element as defined by that function. The second phase, reduce, aggregates the mapped output into a single output value per key.
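As a simplified, single-machine sketch of these two phases (real Hadoop jobs are typically written in Java or run via Hadoop Streaming, and the input documents here are made up), the classic word-count example looks like this:

# Word count as a miniature Map/Reduce: map emits (key, value) pairs,
# the pairs are grouped by key, and reduce aggregates each group.
from itertools import groupby
from operator import itemgetter

documents = ["big data big insights", "data drives decisions"]

# Map phase: emit (word, 1) for every word in every document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate the values for each key into a single output value
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'big': 2, 'data': 2, 'decisions': 1, 'drives': 1, 'insights': 1}
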
Hadoop also serves as a data storage and processing platform for other tools and systems that have been developed to run on top of it for additional capabilities. Some of these systems are Apache Hive, Apache Pig, and Apache HBase.

Click here to learn more about the Hadoop architecture.

Sunday, September 22, 2013

Introducing BI

I first heard the term "Business Intelligence" in my final year of undergraduate studies. I was working on a team project, and a team member jokingly said that we could add a business intelligence capability to our project. Well, I thought he was joking, but he was serious! After briefly reading about business intelligence, I thought, yes, we could do it, but was it really worth it? What we were suggesting would not be "BI" in the real sense. Of course, we did not get enough time to implement this additional functionality, but I got to know about something called "Business Intelligence".
That was when I passively started reading about BI: what it means, how it works, why it is needed, and why it is gaining so much popularity. After learning about BI, I started to sense the importance of this term in today's world of immense data. Later on, I realized that we couldn't have come close to implementing BI with the naive knowledge we had at the time. Alright, I'll leave my story here.

Let's talk more about Business Intelligence!
What is Business Intelligence? To find out, just google it! That's what we do nowadays, right? Want to know about something, just google it, and you'll get some answer. Sometimes you'll get exactly what you were looking for, and sometimes you have to browse through several search results to get a clear picture. Similarly, think about gaining insight into your business: how would you measure your performance, what could you do to improve it, what factors are affecting your business, what decisions should you make, how do you achieve operational excellence, and basically, how do you make more money with lower costs and streamlined processes? Google cannot answer this for you, but Business Intelligence can! Actually, BI can do much more than that if used effectively.

In a nutshell, Business Intelligence can be referred to as a collection of software applications used to analyse an organization's data, involving activities such as data mining, data warehousing, online analytical processing, querying, and reporting. It also involves a set of practices and methodologies for studying business processes and effectively applying BI concepts to associate numbers with each of these processes to track performance. These numbers are called Key Performance Indicators (KPIs). KPIs can be defined at the organizational level, for business units, or at a specific operational level, and are usually defined and measured to track a strategic goal or act as a driver for monitoring performance.
One of the most common frameworks for defining KPIs is the Balanced Scorecard (BSC). A balanced scorecard is designed for a specific strategic theme, e.g., achieving operational excellence. It is built around four perspectives used to identify metrics, collect data, and analyze it relative to each perspective:

  • Financial: Identify the relevant financial measures that answer 'How should we appear to our shareholders?'
  • Customer: Identify measures that answer 'How do customers see us?'
  • Internal Business Processes: Identify measures that answer 'What processes should we excel at?'
  • Learning and Growth: Identify measures that answer 'How do we continually improve and create value?'

Once you have formed a BSC, you can define the KPIs and associate relevant metrics to evaluate each KPI.
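As a toy sketch (the KPIs, targets, and figures below are all invented), one way to think of it is that each KPI is a formula over underlying metrics, compared against a target:

# Invented balanced-scorecard KPIs computed from underlying metrics
kpis = {
    # perspective: (KPI name, target, formula over raw metrics)
    "Financial": ("Operating margin %", 15.0,
                  lambda m: 100 * (m["revenue"] - m["costs"]) / m["revenue"]),
    "Customer": ("Avg. satisfaction score", 4.2,
                 lambda m: sum(m["survey_scores"]) / len(m["survey_scores"])),
}

metrics = {"revenue": 1200000, "costs": 1050000, "survey_scores": [4.5, 3.9, 4.4]}

for perspective, (name, target, formula) in kpis.items():
    actual = formula(metrics)
    status = "on track" if actual >= target else "needs attention"
    print(f"{perspective}: {name} = {actual:.2f} (target {target}) -> {status}")
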
Defining, measuring, monitoring, and managing KPIs is just one part of BI. There is another very important ingredient for BI, and that is 'data', or may I say Big Data!

I'll talk more about Business Intelligence and Big Data in my future posts. Till then, go ask google about big data.


“In God we trust. All others must bring data.” – W. Edwards Deming


Click here to learn more about the Balanced Scorecard.