BIG DATA! A term heard almost every day in the business and technology world. Yet many still don't know what it is or how to handle it!
Big Data can be summed up as data, data and lots of data. Gartner defines Big Data as high-Volume (huge amounts of data), high-Velocity (the speed at which data is generated), and/or high-Variety (the types and sources of data) information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. Some also identify a fourth dimension: Veracity, the truthfulness of data. In today's world, data is generated at such a rate that capturing and storing it was once the challenge, but the real challenge now is analyzing it.
In order to analyze data, it has to be stored somewhere so it can be retrieved as and when required. The data can be structured, e.g., the transactions that take place in a supermarket, or unstructured, e.g., an image posted on Facebook along with a comment. Where to store the data depends on what kind of data you have. Generally, structured data is stored in relational tables in a database such as SQL Server, Oracle or MySQL, whereas unstructured data is stored as key-value pairs, or converted into a semi-structured form, in NoSQL databases such as MongoDB and Cassandra. Now, if you have heard of Big Data, you must also have heard of Hadoop, but don't confuse the two.
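For instance, here is a minimal sketch of storing such an unstructured post using the MongoDB Java driver. The connection string, database and collection names, and field names are all hypothetical, chosen only for illustration:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.Arrays;

public class UnstructuredStore {
    public static void main(String[] args) {
        // Hypothetical connection string; adjust for your deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("social");
            MongoCollection<Document> posts = db.getCollection("posts");

            // An unstructured post: an image reference plus a free-form comment.
            // No fixed schema is required; other posts may have different fields.
            Document post = new Document("imageUrl", "https://example.com/photo.jpg")
                    .append("comment", "Great day at the beach!")
                    .append("tags", Arrays.asList("beach", "summer"));
            posts.insertOne(post);
        }
    }
}
```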
Apache Hadoop is a distributed data storage and processing system that can store structured or unstructured data as required. It has its own file system, the Hadoop Distributed File System (HDFS), derived from the Google File System (GFS). HDFS is designed to hold huge amounts of data (terabytes to petabytes) and provides redundancy across multiple machines to ensure durability against failures and high availability to applications.
HDFS breaks individual files into blocks of a fixed size, 64 MB by default (128 MB in newer Hadoop releases). These blocks are stored across a cluster of several machines with data storage capability; each individual machine in the cluster is referred to as a DataNode. Thus, multiple DataNodes are accessed to read one file, and the failure of one machine could mean the loss of data. To avoid this problem, each block is replicated on multiple nodes (three, by default) to maintain availability.
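The block size and replication factor can be set per client. Here is a minimal sketch of writing a file to HDFS with the Hadoop Java API; the NameNode address and file path are hypothetical, and the property name dfs.blocksize is the Hadoop 2+ form:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        // Ask for 3 replicas per block and a 128 MB block size for this file.
        conf.set("dfs.replication", "3");
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/events.log"))) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```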
The writing of blocks to the DataNodes is coordinated by a single machine called the NameNode, which stores all of the metadata for the file system. A standby NameNode is also maintained to preserve the metadata and make it available should the operational NameNode crash. To read a file, an application client asks the NameNode for the list of DataNodes holding the individual blocks that comprise the file. Once the node and the block on each node are identified, the client reads the file directly from the DataNodes, sparing the NameNode any overhead from the bulk transfer. The system is designed this way to support a write-once, read-many access pattern.
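This metadata lookup is visible in the client API. The following sketch asks the NameNode where each block of a file lives (same hypothetical NameNode address and path as above); the actual block contents would then be streamed directly from those DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/events.log");
            FileStatus status = fs.getFileStatus(file);

            // The NameNode answers this metadata query; the block data itself
            // is later read directly from the DataNodes listed here.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset " + block.getOffset()
                        + " -> hosts " + String.join(",", block.getHosts()));
            }
        }
    }
}
```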
MapReduce: MapReduce is a programming model that runs in the Hadoop environment. It is designed to process huge volumes of data in parallel by dividing the job into a set of independent tasks and executing them across several machines. A MapReduce program processes a list of input data elements to produce output data elements, in two phases: Map and Reduce. In the first phase, mapping, a mapper function transforms or processes each input element as the function defines. The second phase, reduce, aggregates the mapping output into a single output value per key.
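The canonical illustration is word counting: the mapper emits a (word, 1) pair for every word it sees, and the reducer sums the counts for each word. Here is a minimal sketch using the Hadoop MapReduce Java API, with the input and output paths supplied on the command line:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts for each word into a single output value.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // pre-aggregate on each mapper
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```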
Hadoop also serves as a data storage and processing platform for other tools and systems that have been developed to run on top of it for additional capabilities. Some of these systems are Apache Hive, Apache Pig and Apache HBase.
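As a taste of one of these, here is a minimal sketch of writing a row with the HBase Java client; the table name "users", the column family "profile", and the row data are all hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWrite {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath for cluster settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Hypothetical row key "user123" in column family "profile".
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("profile"),
                          Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);
        }
    }
}
```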