Introduction
Data Science is the study of extracting meaningful insights from data using various tools and techniques to drive business growth. Although the field dates back to the early days of computing, the recent hype is a result of the huge amount of unstructured data being generated and the unprecedented computational capacity that modern computers possess.
However, there is a lot of misconception about the true meaning of this field, with many people believing it is only about predicting future outcomes from data. Though predictive analytics is a part of Data Science, it is certainly not all of what Data Science stands for. In an analytics project, the first and foremost task is to build the pipeline and gather the relevant data so that predictive analytics can be performed later on. The professional responsible for building such ETL pipelines and creating the systems for flawless data flow is the Data Engineer, and this field is known as Data Engineering.
Over the years the role of Data Engineers has evolved a lot. Previously it was about building Relational Database Management Systems using Structured Query Language or running ETL jobs. These days, the plethora of unstructured data from a multitude of sources has resulted in the advent of Big Data – various forms of voluminous data that carry a lot of information if mined properly.
Now, the biggest challenge professionals face is analysing these terabytes of data, which traditional file storage systems are incapable of handling. This problem was resolved by Hadoop, an open-source Apache framework built to process large datasets across clusters of machines. Hadoop has several components that take care of the data, and one such component is known as Map Reduce.
What is Hadoop?
Created by Doug Cutting and Mike Cafarella in 2006, Hadoop facilitates distributed storage and processing of huge data sets across clusters of machines working in parallel. HDFS, or Hadoop Distributed File System, is the storage component of Hadoop, where files in different formats can be stored and later processed using the Map Reduce programming model, which we cover later in this article.
HDFS runs on large clusters and follows a master/slave architecture. The metadata of a file, i.e., information about which blocks of the file reside on which nodes, is managed by the NameNode, which is the master; the actual data is stored across several DataNodes, which are the slaves. Some of the other components of Hadoop are –
- YARN – It manages cluster resources and performs job scheduling.
- Hive – It allows users to write SQL-like queries to analyse the data.
- Sqoop – Used for transferring structured data back and forth between the Hadoop Distributed File System and relational database management systems.
- Flume – Similar to Sqoop, but it facilitates the transfer of unstructured and semi-structured data from various sources into HDFS.
- Kafka – A distributed messaging platform commonly used alongside Hadoop.
- Mahout – It is used to run Machine Learning algorithms on big data.
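To make the NameNode/DataNode interaction described above a little more concrete, here is a minimal sketch of a client reading a file through the HDFS Java API. It assumes a cluster whose address is configured in `core-site.xml`, and the path `/user/demo/input.txt` is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from the cluster configuration.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file path, used only for illustration.
    Path file = new Path("/user/demo/input.txt");

    // open() consults the NameNode for the block locations of the file;
    // the returned stream then reads those blocks from the DataNodes.
    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```

The key point is that the client never needs to know which DataNode holds which block – the NameNode's metadata lookup and the block reads are handled behind the FileSystem abstraction.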
Hadoop is a vast topic and a detailed explanation of each component is beyond the scope of this blog. However, we will dive into one of its components – Map Reduce – and understand how it works.
What is Map Reduce Programming?
Map Reduce is the programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. In other words, if you write a job using the Map Reduce framework and a thousand machines are available, that job can potentially run on all thousand of them.
Big Data is not stored in HDFS in the traditional way. The data is divided into small blocks which are stored on the respective DataNodes. Since no complete copy of the data is present in one centralized location, a native client application cannot process the information right away. A framework is therefore needed that can handle data living as blocks on the respective DataNodes, move the processing to where that data resides, and bring back the result. In a nutshell, the data is processed in parallel, which makes processing faster.
The idea of parallelization was developed to improve performance and efficiency. The process is automated and executed concurrently. The fragmented instructions can run on a single machine or on different CPUs. To gain direct disk access, multiple computers use a SAN, or Storage Area Network, which is a common building block of clustered file systems, unlike distributed file systems, which send the data over the network.
One term that is common in this master/slave architecture of data processing is load balancing, where tasks are spread among the processors to avoid overloading any DataNode. Dynamic balancers provide more flexibility than static balancers.
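Within Map Reduce, one concrete piece of this load-balancing story is the partitioner, which decides which reduce task each intermediate key is sent to. The sketch below mirrors the behaviour of Hadoop's default hash partitioner; the class name and key/value types are illustrative assumptions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads intermediate keys evenly across the reduce tasks so that no
// single reducer is overloaded (the same idea as Hadoop's default
// HashPartitioner, shown here purely for illustration).
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is always a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```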
The Map Reduce algorithm operates in three phases – the Map phase, the Shuffle and Sort phase, and the Reduce phase. It was designed to give engineers at Google a simple abstraction for expressing basic computations while hiding the details of fault tolerance, parallelization, and load balancing. A word-count sketch illustrating all three phases follows the list below.
- Map Phase – In this stage, the input data is mapped into intermediate key-value pairs on all the mappers assigned to the data.
- Shuffle and Sort Phase – This phase acts as a bridge between the Map and the Reduce phases to decrease the computation time. The data here is shuffled and sorted simultaneously based on the keys, i.e., all intermediate values from the Map phase are grouped together with respect to their keys and passed on to the Reduce function.
- Reduce Phase – The sorted data is the input to the Reducer, which aggregates the values corresponding to each key and produces the desired output.
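To make the three phases concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API; the class names are our own and this is an illustrative sketch rather than a production job. The mapper emits <word, 1> pairs, the framework shuffles and sorts them by key, and the reducer sums the values for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountPhases {

  // Map phase: each input line is split into words and emitted
  // as intermediate <word, 1> key-value pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework has already shuffled and sorted the
  // intermediate pairs by key, so each call receives one word together
  // with all of its counts, which are simply summed.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```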
How Map Reduce works
- The Map invocations are distributed across multiple machines, and the input data is automatically partitioned into M pieces of roughly sixteen to sixty-four megabytes each. Many copies of the program are then started up on a cluster of machines.
- Among the copies, one is the master while the rest are the slaves. The master assigns the M map tasks and the R reduce tasks to the slaves; any idle worker is assigned a task by the master.
- A worker assigned a map task reads the contents of its input split and passes key-value pairs to the user-defined Map function. The intermediate key-value pairs it produces are buffered in memory.
- The buffered pairs are periodically written to the local disk, where the partitioning function splits them into R regions. The master then forwards the locations of these buffered key-value pairs to the reduce workers.
- The reduce workers read the buffered data after getting its location from the master. Once read, the data is sorted by the intermediate keys so that all occurrences of the same key are grouped together.
- The Reduce function defined by the user receives the set of intermediate values corresponding to each unique intermediate key it encounters. The final output file consists of the appended output from the Reduce function.
- Once all the Map and Reduce tasks are completed, the master wakes up the user program. The output of a successful MapReduce execution can be found in the R output files (a minimal driver sketch that submits such a job follows this list).
- The master checks whether every worker is alive during execution by sending periodic pings. If a worker does not respond to the pings for a certain amount of time, it is marked as failed and its work is reset.
- In case of such failures, completed map tasks are re-executed because their output resides on the failed worker's local disk and becomes inaccessible. Completed reduce tasks, whose output is stored in the global file system, do not need to be re-executed.
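The workflow above is what a driver program kicks off when it submits a job to the cluster. Below is a minimal, hedged sketch of such a driver using the Hadoop Java API, wired to the illustrative WordCountPhases classes shown earlier; the input and output paths come from the command line and are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // The mapper and reducer sketched earlier in this article.
    job.setMapperClass(WordCountPhases.TokenizerMapper.class);
    // A combiner is optional; it pre-aggregates map output locally.
    job.setCombinerClass(WordCountPhases.IntSumReducer.class);
    job.setReducerClass(WordCountPhases.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations in HDFS, passed on the command line.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Submits the job to the cluster and waits for it to finish;
    // the output appears as one file per reduce task (part-r-00000, ...).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```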
Some examples of Map Reduce programming are –
- Map Reduce programming can count the frequency of URL accesses. The map function processes logs of web page requests and emits pairs such as <URL, 1>, which the Reduce function then processes by adding up all the values for the same URL and outputting its total count (a sketch of such a mapper follows this list).
- Map Reduce programming could also be used to parse documents and count the number of words corresponding to each document.
- For a given target URL, the list of all source URLs that link to it can be obtained with the help of Map Reduce (a reverse web-link graph).
- Map Reduce programming can be used to calculate a per-host term vector. The Map function creates a <hostname, term vector> pair for each document, and the Reduce function merges the vectors for each host, removes the less frequent terms, and emits a final <hostname, term vector> pair.
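As a sketch of the first example, here is an illustrative mapper for the URL access-count job. The log format (a whitespace-separated line whose first field is the URL) is an assumption, and the counting reducer could be the same sum-by-key reducer sketched earlier.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <URL, 1> for each request line in the access log. The shuffle
// phase groups the pairs by URL, and a sum-by-key reducer (like the
// IntSumReducer above) adds them up to produce <URL, total count>.
public class UrlAccessMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed log format: the URL is the first whitespace-separated field.
    String[] fields = value.toString().trim().split("\\s+");
    if (fields.length > 0 && !fields[0].isEmpty()) {
      url.set(fields[0]);
      context.write(url, ONE);
    }
  }
}
```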
Conclusion
Data Engineering is a key step in any Data Science project, and Map Reduce is undoubtedly an essential part of it. In this article we built a brief intuition about Big Data, provided an overview of Hadoop, explained Map Reduce programming and its workflow, and covered a few real-life applications of Map Reduce programming as well.