Introduction to Apache Spark Streaming
A data stream is an unbounded sequence of data arriving continuously. Streaming divides this continuously flowing input data into discrete units for further processing. Stream processing is the low-latency processing and analysis of streaming data.
Spark Streaming was added to Apache Spark in 2013 as an extension of the core Spark API that provides scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources, such as Kafka, Apache Flume, Amazon Kinesis or TCP sockets, and processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, the processed data can be pushed out to filesystems, databases and live dashboards.
Internally, it works as follows. Spark Streaming receives live input data streams and divides them into batches. The Spark engine then processes these batches to generate the final stream of results, also in batches.
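The batching step can be illustrated with a small plain-Python model. This is only a sketch of the idea, not the Spark API: the function names `discretize` and `process`, the sample records and the one-second batch interval are all assumptions made for illustration, and intervals without data are simply skipped here.

```python
from collections import defaultdict

def discretize(records, batch_interval):
    """Group (timestamp, value) records into batches by arrival time.

    Batch k holds the records whose timestamp t satisfies
    k * batch_interval <= t < (k + 1) * batch_interval.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order; empty intervals are omitted in this sketch.
    return [batches[k] for k in sorted(batches)]

def process(batch):
    # Stand-in for the work the engine does on one batch (here: a word count).
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)

# Hypothetical input: (arrival time in seconds, word)
stream = [(0.1, "a"), (0.4, "b"), (0.9, "a"), (1.2, "c"), (1.7, "a"), (2.5, "b")]

# One result per batch: the output is itself a stream of batches.
results = [process(batch) for batch in discretize(stream, batch_interval=1.0)]
print(results)
```

Each element of `results` corresponds to one batch interval, which mirrors how results are produced batch by batch.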
Its key abstraction is Apache Spark Discretized Stream or, in short, a Spark DStream, which represents a stream of data divided into small batches. DStreams are built on Spark RDDs, Spark’s core data abstraction. This allows Streaming in Spark to seamlessly integrate with any other Apache Spark components like Spark MLlib and Spark SQL.
1. Need for Streaming in Apache Spark
To process the data, most traditional stream processing systems are designed with a continuous operator model, which works as follows:
- Streaming data is received from data sources (e.g. live logs, system telemetry data, IoT device data, etc.) into some data ingestion system like Apache Kafka, Amazon Kinesis, etc.
- The data is then processed in parallel on a cluster.
- Results are given to downstream systems like HBase, Cassandra, Kafka, etc.
There is a set of worker nodes, each of which runs one or more continuous operators. Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline.
Data is received from ingestion systems via source operators and passed to downstream systems via sink operators.
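As a rough illustration of the continuous operator model, the sketch below chains Python generators so that each record flows through the pipeline one at a time. The operator names (`source`, `parse`, `only_errors`, `sink`) and the log lines are hypothetical and only stand in for real operators running on worker nodes.

```python
def source(records):
    # Source operator: hands records to the pipeline one at a time.
    for record in records:
        yield record

def parse(lines):
    # Continuous operator: splits each log line into (level, message).
    for line in lines:
        level, _, message = line.partition(" ")
        yield level, message

def only_errors(events):
    # Continuous operator: forwards only the error messages downstream.
    for level, message in events:
        if level == "ERROR":
            yield message

def sink(records, out):
    # Sink operator: delivers each record to the downstream system (a list here).
    for record in records:
        out.append(record)

# Hypothetical live log lines
logs = ["INFO started", "ERROR disk full", "INFO heartbeat", "ERROR timeout"]
delivered = []
sink(only_errors(parse(source(logs))), delivered)
print(delivered)
```

Because the generators are chained, a record passes through every operator before the next record is read, which is the record-at-a-time behaviour described above.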
Continuous operators are a simple and natural model. However, with today’s trend towards larger scale and more complex real-time analytics, this traditional architecture faces several challenges:
a. Fast failure and straggler recovery
The system must be able to recover quickly and automatically from failures and stragglers in order to provide results in real time. This is challenging in traditional systems because of the static allocation of continuous operators to worker nodes.
b. Load balancing
In a continuous operator system, uneven allocation of the processing load between the workers can cause bottlenecks. The system needs to be able to dynamically adapt the resource allocation based on the workload.
c. Unification of streaming, batch and interactive workloads
In many use cases, it is also attractive to query the streaming data interactively, or to combine it with static datasets (e.g. pre-computed models). This is hard in continuous operator systems, which are not designed to add new operators dynamically for ad-hoc queries. What is needed is a single engine that can combine batch, streaming and interactive queries.
d. Advanced analytics with machine learning and SQL queries
Complex workloads require continuously learning and updating data models, or even querying the streaming data with SQL queries. Having a common abstraction across these analytic tasks makes the developer’s job much easier.
2. Why Streaming in Spark?
Batch processing systems like Apache Hadoop have high latency, which makes them unsuitable for near-real-time processing requirements. Storm guarantees that a record will be processed if it has not been processed yet, but this can lead to inconsistency because a record may be processed more than once. The state is lost if a node running Storm goes down. In most environments, Hadoop is used for batch processing while Storm is used for stream processing; maintaining two systems increases code size, the number of bugs to fix and the development effort, introduces a learning curve, and causes other issues. This is one area where Apache Spark differs from Hadoop.
Spark Streaming helps to fix these issues and provides a scalable, efficient and resilient system that is integrated with batch processing. Spark provides a unified engine that natively supports both batch and streaming workloads. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over traditional streaming systems. These benefits set Spark Streaming apart from Storm.
A key reason behind Spark Streaming’s rapid adoption is the unification of disparate data processing capabilities. This makes it very easy for developers to use a single framework to satisfy all their processing needs. Furthermore, data from streaming sources can be combined with a very large range of static data sources available through Apache Spark SQL.
To address the problems of traditional stream processing engines, Spark Streaming uses a new architecture called Discretized Streams that directly leverages the rich libraries and fault tolerance of the Spark engine.
3. Spark Streaming Architecture and advantages
Instead of processing the streaming data one record at a time, Spark Streaming discretizes the data into tiny, sub-second micro-batches. First, Spark Streaming receivers accept data in parallel and buffer it in the memory of Spark’s worker nodes. Then the latency-optimized Spark engine runs short tasks to process the batches and output the results to other systems.
Unlike the traditional continuous operator model, where the computation is statically allocated to a node, Spark tasks are assigned to the workers dynamically on the basis of data locality and available resources. This enables better load balancing and faster fault recovery.
Each batch of data is a Resilient Distributed Dataset (RDD) in Spark, which is the basic abstraction of a fault-tolerant dataset in Spark. This allows the streaming data to be processed using any Spark code or library.
This architecture allows Spark Streaming to achieve the following goals:
a. Dynamic load balancing
Dividing the data into small micro-batches allows for fine-grained allocation of computations to resources. Consider a simple workload where the input data stream needs to be partitioned by a key and processed. In the traditional record-at-a-time approach, if one of the partitions is more computationally intensive than the others, the node to which that partition is assigned becomes a bottleneck and slows down the pipeline. In Spark Streaming, the job’s tasks are naturally load balanced across the workers: some workers process a few longer tasks while others process more of the shorter tasks.
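The effect can be approximated with a small scheduling model in plain Python. This is a sketch under stated assumptions, not how Spark’s scheduler is implemented: the pinning rule (partition i goes to worker i % n), the greedy "next free worker" rule and the task costs are all hypothetical.

```python
import heapq

def pinned_time(task_costs, num_workers):
    """Static allocation: partition i is always handled by worker i % num_workers."""
    loads = [0] * num_workers
    for partition, cost in enumerate(task_costs):
        loads[partition % num_workers] += cost
    return max(loads)  # the busiest worker determines the completion time

def dynamic_time(task_costs, num_workers):
    """Dynamic allocation: each task goes to the worker that becomes free first."""
    loads = [0] * num_workers  # a list of zeros is already a valid heap
    for cost in task_costs:
        earliest_free = heapq.heappop(loads)
        heapq.heappush(loads, earliest_free + cost)
    return max(loads)

# Hypothetical costs: partitions 0 and 2 are computationally heavier than 1 and 3.
costs = [6, 1, 6, 1]
print(pinned_time(costs, num_workers=2))   # both heavy partitions land on worker 0
print(dynamic_time(costs, num_workers=2))  # heavy and light tasks are mixed
```

With these numbers the pinned assignment finishes after 12 time units, while the dynamic one finishes after 7, because each worker ends up with one long and one short task.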
b. Fast failure and straggler recovery
In case of a node failure, traditional systems have to restart the failed operator on another node to recompute the lost information. Only one node handles the recomputation, so the pipeline cannot proceed until the new node has caught up after the replay. In Spark, the computation is discretized into small tasks that can run anywhere without affecting correctness. Failed tasks can therefore be distributed evenly across all the other nodes in the cluster, which perform the recomputation in parallel and recover from the failure faster than the traditional approach.
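A minimal sketch of this recovery idea, assuming deterministic tasks and input data that is still available for replay: threads stand in for the surviving worker nodes, and `compute_partition`, the partition contents and the choice of lost partitions are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_partition(records):
    # Deterministic task: the same input always yields the same output,
    # so it does not matter which worker reruns it.
    return sum(x * x for x in records)

# Hypothetical input partitions of one batch, still available for replay.
inputs = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8]}
results = {pid: compute_partition(recs) for pid, recs in inputs.items()}

# Simulate losing the node that held the results of partitions 1 and 3.
lost = [1, 3]
expected = {pid: results.pop(pid) for pid in lost}

# Recompute only the lost partitions, spread over the surviving workers.
with ThreadPoolExecutor(max_workers=3) as survivors:
    recomputed = dict(
        zip(lost, survivors.map(lambda pid: compute_partition(inputs[pid]), lost))
    )

results.update(recomputed)
print(results)
```

The recomputed values match the lost ones regardless of which worker runs each task, which is what allows the recovery work to be spread across the cluster.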
c. Unification of batch, streaming and interactive analytics
A DStream in Spark is just a series of RDDs, which allows batch and streaming workloads to interoperate seamlessly. Arbitrary Apache Spark functions can be applied to each batch of streaming data. Since the batches of streaming data are stored in the memory of Spark’s workers, they can be interactively queried on demand.
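The following plain-Python sketch illustrates the idea of reusing one piece of batch logic on both a static dataset and every micro-batch of a stream. It does not use the Spark API; the lookup table `user_country`, the function `spend_by_country` and all sample data are assumptions made for the example.

```python
# Static dataset, e.g. loaded once by a batch job (hypothetical data).
user_country = {"u1": "DE", "u2": "US"}

def spend_by_country(events):
    """Ordinary batch logic: join events with the static table and aggregate."""
    totals = {}
    for user, amount in events:
        country = user_country.get(user, "unknown")
        totals[country] = totals.get(country, 0) + amount
    return totals

historical_events = [("u1", 10), ("u2", 5), ("u1", 1)]   # a static dataset
micro_batches = [[("u1", 3), ("u3", 2)], [("u2", 7)]]    # a stream as a series of batches

# The same function serves the batch workload and each batch of the stream.
batch_view = spend_by_country(historical_events)
stream_view = [spend_by_country(batch) for batch in micro_batches]
print(batch_view, stream_view)
```

Because a stream is modelled as a list of ordinary batches, no separate streaming version of the logic is needed.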
d. Advanced analytics like machine learning and interactive SQL
Spark interoperability extends to rich libraries like MLlib (machine learning), SQL, DataFrames, and GraphX. RDDs generated by DStreams can be converted to DataFrames and queried with SQL.
Machine learning models generated offline with MLlib can be applied to streaming data.
e. Performance
Spark Streaming’s ability to batch data and leverage the Spark engine leads to throughput that is comparable to or higher than that of other streaming systems. Spark Streaming can achieve latencies as low as a few hundred milliseconds.
4. How Does Spark Streaming Work?
In Spark Streaming, the data stream is divided into batches and represented as a DStream, which internally is a sequence of RDDs. The RDDs are then processed using Spark APIs, and the results are returned in batches.
Spark Streaming provides an API in Scala, Java, and Python. The Python API was introduced in Spark 1.2 and, as of that release, still lacked many features.
Spark Streaming can maintain state based on the data coming in a stream; computations that use this state are called stateful computations. It also supports window operations, in which the developer specifies a time frame and the operation is applied to the data that arrives within that window. A window also has a sliding interval, which is the interval at which the window is updated.
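Both ideas can be sketched in a few lines of plain Python. This is a conceptual model rather than the Spark API: `windowed` and `update_state` are hypothetical helper names, and the window length and sliding interval are counted in batches, reflecting that in Spark Streaming both must be multiples of the batch interval.

```python
def windowed(batches, window_length, slide_interval):
    """Yield the records covered by each window.

    Both parameters are expressed as a number of batches. A new window is
    produced every `slide_interval` batches and spans the last
    `window_length` batches (fewer at the very start of the stream).
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield [record for batch in batches[start:end] for record in batch]

def update_state(state, batch):
    """Fold one batch into the running per-key counts (the state)."""
    new_state = dict(state)
    for key in batch:
        new_state[key] = new_state.get(key, 0) + 1
    return new_state

# Hypothetical stream of four batches
batches = [["a"], ["b", "a"], ["c"], ["a", "a"]]

# Window operation: count the records in a 3-batch window that slides every 2 batches.
window_counts = [len(w) for w in windowed(batches, window_length=3, slide_interval=2)]

# Stateful computation: the counts carry over from one batch to the next.
state = {}
for batch in batches:
    state = update_state(state, batch)

print(window_counts, state)
```

The window result depends only on the batches inside the window, whereas the state accumulates over the whole stream.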
5. Spark Streaming Sources
Every input DStream (except file stream) is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing.
There are two categories of built-in streaming sources:
a. Basic sources
These are the sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
b. Advanced sources
Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies.
There are two types of receivers, based on their reliability:
- Reliable Receiver — A reliable receiver is the one that correctly sends acknowledgment to a source when the data has been received and stored in Spark with replication.
- Unreliable Receiver — An unreliable receiver does not send acknowledgment to a source. This can be used for sources when one does not want or need to go into the complexity of acknowledgment.
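The difference between the two receiver types can be modelled as follows. This is a simplified sketch and not Spark’s Receiver API: the `Source` class, its `ack` method and the `receive` function are hypothetical, and lists stand in for replicated storage.

```python
class Source:
    """Hypothetical source that tracks which records have not been acknowledged."""

    def __init__(self, records):
        self.pending = list(records)

    def ack(self, record):
        # The source may forget a record once it has been acknowledged.
        self.pending.remove(record)

def receive(source, stores, reliable):
    """Store each record on every replica; acknowledge only when reliable."""
    for record in list(source.pending):
        for store in stores:       # replication: one copy per store
            store.append(record)
        if reliable:
            source.ack(record)     # ack only after the record is stored and replicated

reliable_source = Source(["r1", "r2"])
replicas = [[], []]
receive(reliable_source, replicas, reliable=True)

unreliable_source = Source(["r1", "r2"])
receive(unreliable_source, [[], []], reliable=False)

print(reliable_source.pending, unreliable_source.pending)
```

In the reliable case the source learns that its data is safely stored; in the unreliable case it receives no such confirmation, which keeps the receiver simpler.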
6. Spark Streaming Operations
Spark Streaming supports two types of operations:
a. Transformation Operations in Spark
As with Spark RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations that are available on normal Spark RDDs. Some of the common ones are as follows.
map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), window()
b. Output Operations in Apache Spark
A DStream’s data is pushed out to external systems, such as a database or a file system, using output operations. Since the output operations allow external systems to consume the transformed data, they trigger the actual execution of all the DStream transformations. Currently, the following output operations are defined:
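print(), saveAsTextFiles(prefix, [suffix]), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), foreachRDD(func). For the saveAs operations, the file name at each batch interval is generated from the prefix and suffix as “prefix-TIME_IN_MS[.suffix]”.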
Like RDDs, DStreams are executed lazily, and it is the output operations that trigger execution. Specifically, the RDD actions inside the DStream output operations force the processing of the received data. By default, output operations are executed one at a time, in the order in which they are defined in the Spark application.
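Lazy execution can be illustrated with a toy model in plain Python. It is a sketch of the principle only and not the DStream implementation: the class `LazyStream` and its method `foreach_batch` (loosely playing the role of an output operation such as foreachRDD) are hypothetical.

```python
class LazyStream:
    """Toy model of a lazily evaluated stream: transformations are only recorded."""

    def __init__(self, batches, steps=()):
        self.batches = batches
        self.steps = steps

    def map(self, func):
        # Record the step and return a new stream; nothing is computed yet.
        return LazyStream(self.batches, self.steps + (("map", func),))

    def filter(self, predicate):
        return LazyStream(self.batches, self.steps + (("filter", predicate),))

    def foreach_batch(self, func):
        """Output operation: only here do the recorded steps actually run."""
        for batch in self.batches:
            for kind, f in self.steps:
                if kind == "map":
                    batch = [f(x) for x in batch]
                else:
                    batch = [x for x in batch if f(x)]
            func(batch)

calls = []

def traced_double(x):
    calls.append(x)  # remember that the function really ran
    return 2 * x

stream = LazyStream([[1, 2, 3], [4, 5]]).map(traced_double).filter(lambda x: x > 4)
calls_before_output = len(calls)  # still 0: no output operation has been called

outputs = []
stream.foreach_batch(outputs.append)
print(calls_before_output, outputs)
```

Defining `map` and `filter` performs no work; the batches are processed, one after another, only once the output operation is invoked.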