If you have studied the temporal parallelism used to speed up CPU execution, you will have come across instruction pipelines, also known as pipeline processing, in which many instructions are in different stages of execution at the same time. The term “data pipeline” is something of a misnomer: it actually refers to a high-bandwidth communication channel used to transport data from a source system to a destination, which in certain cases is called a sink. A pipeline, by definition, allows a fluid to flow automatically from one end to the other once one end is connected to a source. The image of data flowing through a communication channel led people to think of it as a pipeline, and the term “data pipeline” emerged. The source system can be an e-commerce or travel website, or a social media platform. The destination of a data pipeline can be a data warehouse, a data lake, a visualization dashboard, or another application such as a web-scraping tool or a recommender system.
With social media available on ubiquitous computing devices, almost everyone in the world has become a data entry operator, and IoT devices have become another source of continuous data. Consider a single comment on a social media site. The entry of this comment could generate data that feeds a real-time report counting social media mentions, as well as a sentiment analysis application that outputs a positive, negative, or neutral result. Though the data comes from the same source in all cases, each of these applications is built on a unique data pipeline that must complete smoothly before the end user sees the result. Data pipelines are therefore one-to-many connections, with the fan-out depending on the number of applications consuming the data. Common processing steps in data pipelines include transformation, augmentation, enrichment, filtering, grouping, aggregation, and running machine learning algorithms against the data.
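To make the one-to-many idea concrete, here is a minimal Python sketch of a single comment fanning out to two independent consumers: a mention counter and a toy sentiment tagger. All names, brands, and keyword lists here are made up for illustration; real pipelines would of course use proper models and sinks.

```python
from collections import Counter

# Hypothetical downstream "application" 1: a real-time mention counter.
mention_counts = Counter()

def count_mentions(comment):
    # Count mentions of each tracked brand (brand names are assumptions).
    for brand in ("acme", "globex"):
        if brand in comment["text"].lower():
            mention_counts[brand] += 1

# Hypothetical downstream "application" 2: a keyword-based sentiment tagger.
def tag_sentiment(comment):
    text = comment["text"].lower()
    if any(w in text for w in ("love", "great", "awesome")):
        return "positive"
    if any(w in text for w in ("hate", "terrible", "awful")):
        return "negative"
    return "neutral"

def fan_out(comment):
    # One source event, many pipelines: every consumer sees the same comment.
    count_mentions(comment)
    return tag_sentiment(comment)

print(fan_out({"user": "alice", "text": "I love the new Acme phone"}))
print(mention_counts)
```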
Data pipelines have become a necessity for today’s data-driven enterprises handling big data. Volume, Variety, and Velocity are the key attributes of big data, and big data pipelines are built to accommodate one or more of these attributes efficiently. Take the first attribute, volume: it is handled differently in pipelines for real-time streaming data and in pipelines for batch data, and it requires pipelines that can scale in capacity, since data volume varies over time; a big data pipeline must be able to handle significant volumes of data concurrently. The velocity of big data makes it necessary to build real-time streaming pipelines in which data can be captured and processed in real time, enabling quick decisions in solutions such as recommender systems. The variety attribute requires that big data pipelines be able to recognize and process data in many different formats: structured, semi-structured, and unstructured.
Data pipelines are useful for businesses that rely on large volumes of data arriving from multiple sources. Depending on how the data is used, data pipelines are broadly classified as real-time, batch, or cloud-native. Sometimes data must be processed in real time, for systems in which sub-second decision making is required. Batch mode addresses the volume attribute of big data; these pipelines are used when large volumes of data are to be processed at regular intervals. The batch data can be stored in data tanks until it is processed and the tanks are emptied, and a batch-mode pipeline may implement multiple data tanks. Cloud-native data pipelines are designed to work with cloud-based data by creating complex data processing workloads; for example, AWS Data Pipeline is a web service that automates the movement and transformation of data.
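As a rough sketch of the batch idea, the snippet below buffers incoming records in a “data tank” and only processes them once the tank is full, emptying it afterwards. The capacity, record fields, and the processing step are all assumptions; a real batch pipeline might flush on a schedule and load the results into a warehouse instead.

```python
TANK_CAPACITY = 1000  # assumed threshold; real pipelines often flush on a time schedule

data_tank = []

def process_batch(batch):
    # Placeholder batch job: aggregate the records (stand-in for a warehouse load).
    total = sum(r["amount"] for r in batch)
    print(f"Processed {len(batch)} records, total amount = {total}")

def ingest(record):
    # Accumulate records until the tank is full, then process and empty it.
    data_tank.append(record)
    if len(data_tank) >= TANK_CAPACITY:
        process_batch(data_tank)
        data_tank.clear()

# Simulated inflow of records into the tank.
for i in range(2500):
    ingest({"order_id": i, "amount": 10})
```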
In real-time data pipelines, the data flows as and when it arrives; this type of pipeline addresses the velocity attribute of big data. There can be a mismatch between the rate at which data arrives and the rate at which it is consumed, so queuing and buffering systems need to be implemented in the pipeline. A commonly used tool is Apache Kafka, an event streaming platform built around message queues. Kafka works on a publish-subscribe model and ensures, with high reliability, that messages are queued in the order in which they arrive and delivered in that same order within a partition. Kafka persists messages to disk for durability while relying on the operating system’s page cache to serve recent messages from memory for quick delivery.
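Here is a minimal sketch of the publish-subscribe pattern using the kafka-python client. The broker address, topic name, and message contents are assumptions for the example; a production setup would also configure consumer groups, partitions, and error handling.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"     # assumed broker address
TOPIC = "social-comments"     # assumed topic name

# Producer side: publish each comment as it arrives at the source.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"user": "alice", "text": "I love the new phone"})
producer.flush()

# Consumer side: read messages in the order they were written to the partition,
# at the consumer's own pace, so the arrival/consumption rate mismatch is absorbed.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:      # blocks waiting for new messages
    print(message.value)
```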
Another term closely related to real-time data pipelines is stream computing. Stream computing means pulling in streams of data in a single flow and using software algorithms that analyze the data in real time as it streams in, improving speed and accuracy. A simple example of stream computing is graphics processing, implemented using GPUs to render images on your computer screen. Other examples are data from streaming sources such as financial markets, or telemetry from connected devices. While stream computing processes data in real time, ETL tools have traditionally been used for batch processing workloads. With the evolution of data pipelines, a new breed of streaming ETL tools is emerging for real-time transformation of data.
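To illustrate the streaming-ETL idea without committing to any particular tool, here is a plain-Python sketch in which records are extracted, transformed, and loaded one at a time as they stream in, using generators. The field names, the conversion factor, and the “load to stdout” step are assumptions made for the example.

```python
import time

def extract(source):
    # Extract: yield raw events one by one as they arrive from the source.
    for raw in source:
        yield raw

def transform(events):
    # Transform: filter out malformed events and enrich the rest on the fly.
    for event in events:
        if "price" not in event:
            continue                                  # drop bad records
        event["price_usd"] = round(event["price"] * 1.08, 2)  # assumed conversion rate
        event["processed_at"] = time.time()
        yield event

def load(events):
    # Load: push each transformed event to its destination (stdout here).
    for event in events:
        print(event)

ticks = [{"symbol": "ABC", "price": 10.0}, {"symbol": "XYZ"}, {"symbol": "ABC", "price": 10.5}]
load(transform(extract(ticks)))
```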
Lambda architecture is a data-processing architecture that evolved to meet the requirements of big data processing. It is designed to take advantage of both batch and stream-processing methods: it attempts to reduce latency by using batch processing to provide accurate views of batch data while simultaneously using real-time stream processing to provide views of online data. Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer that responds to queries. The batch layer pre-computes results from archived data using distributed processing frameworks such as Hadoop MapReduce and Apache Spark, which can handle very large quantities of data; it aims at perfect accuracy by processing all available data when generating its views. The speed layer processes data streams in real time, sacrificing throughput to minimize latency and providing real-time views into the most recent data. The speed layer is responsible for filling the gap caused by the batch layer’s delay in producing views based on the latest data. Its views may not be as accurate or complete as the ones eventually produced by the batch layer, but they are available almost immediately after data is received and can be replaced once the batch layer’s views for the same data become available. The two view outputs are joined, through a T junction in the data pipeline, and fed to the presentation layer in which the reports are generated. In Lambda architecture, real-time data pipelines thus merge with batch processing to support decision making based on the latest data.
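The following sketch shows how a serving layer might merge the two views when answering a query. Here the batch view is a precomputed dictionary and the speed view a running counter; both are stand-ins (with invented numbers and brand names) for the outputs of real Hadoop/Spark jobs and stream processors.

```python
from collections import Counter

# Batch view: accurate counts precomputed from archived data (e.g., by a nightly Spark job).
batch_view = {"acme": 12000, "globex": 8000}   # assumed precomputed results

# Speed view: approximate counts for events that arrived after the batch view was built.
speed_view = Counter()

def on_new_event(brand):
    # Speed layer: update the real-time view as each event streams in.
    speed_view[brand] += 1

def query_mentions(brand):
    # Serving layer: merge the batch view with the real-time delta at query time.
    return batch_view.get(brand, 0) + speed_view.get(brand, 0)

on_new_event("acme")
on_new_event("acme")
print(query_mentions("acme"))  # 12002: the batch result plus the two most recent events
```

When the next batch run completes, its refreshed view replaces the old one and the corresponding speed-layer counts are discarded, which is exactly the hand-off described above.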
The design of a data pipeline architecture requires many considerations based on the usage scenario. For example, does your pipeline need to handle streaming data? If so, what data rate do you expect? How much processing, and what types of processing, need to happen in the data pipeline? Is the data being generated in the cloud or on-premises, and where does it need to go? Do you plan to build the pipeline with microservices? Are there specific technologies you can leverage for the implementation?
See you next time ……..
Janardhanan PS
Machine Learning Evangelist