The modern enterprise is insight-driven, or at least aims to be. Historically, those insights were found in a data warehouse or data lake, populated by scheduled feeds, with analysts working feverishly over them. Those feeds had plenty of bandwidth, but high latency. Think of an 18-wheeler loaded with hard drives, driving from London to Birmingham.
Nowadays, insights need to be immediate, and to deliver them the real-time analytics space is evolving: not just the historic vendors, but also newer, more nimble providers moving into the space.
A reference architecture
There are four major sets of requirements to realize a data architecture:
- Capture
- Transportation
- Transformation
- Exploitation
Each of these creates different challenges when the data has to flow in real time.
Capture
The requirement is now to capture and process streams of data, or to receive and process messages continuously. But how does the data get into streams in the first place?
Change Data Capture (CDC) enables the sourcing of data for real-time analytics by one of three methods:
- log scanning
- queries
- triggers
The least invasive method is log scanning. The other options have real problems to work around, most notably contention in the operational systems. Query-based capture in particular struggles to pick up deletes and is, fundamentally, still a batch process.
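To make the query-based pattern concrete, here is a minimal polling sketch in Python. The table and column names (orders, updated_at) and the publish() hand-off are illustrative, not taken from any vendor, and the sketch shows exactly the weaknesses noted above: each poll is a small batch, and hard deletes leave no row behind to find.

```python
# A minimal sketch of query-based CDC: poll a table for rows changed since the
# last high-water mark. Table/column names are illustrative only.
import sqlite3
import time

conn = sqlite3.connect("operational.db")
last_seen = "1970-01-01T00:00:00"  # high-water mark; persisted somewhere durable in real use


def publish(event):
    print(event)  # placeholder: in practice this would write to Kafka or another queue


def poll_changes():
    global last_seen
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    for id_, status, updated_at in rows:
        publish({"id": id_, "status": status, "updated_at": updated_at})
        last_seen = updated_at  # advance the high-water mark as we go
    return len(rows)


while True:
    poll_changes()
    time.sleep(5)  # the polling interval is the latency floor of this approach
```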
Outside the major vendors, CDC providers include:
- Debezium, an open-source offering built on Kafka Connect, connecting to the usual RDBMSs, MongoDB and Vitess. It uses log scanning to publish events onto Kafka for transportation (see the consumer sketch after this list).
- Flow, from Estuary. Also open source and log-based, it offers real-time, natively written connectors for “high scale” technology platforms, supporting destinations like Firebolt, Snowflake and BigQuery, as well as queues like Kafka. It’s a viable option for sourcing data.
- HVR, from Fivetran, a managed, paid-for service. It also uses log-based CDC, falling back to its own replication technology where log access is not allowed.
- Striim, a complete cloud SaaS integration platform, with pricing to match. It uses CDC to stream data to any sink on any cloud platform, combining log-based CDC with an incremental batch pattern via JDBC using incrementing values and timestamps.
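As a rough sketch of what the Debezium route looks like downstream, the snippet below reads change events from Kafka with the kafka-python client. The topic name follows Debezium’s server.schema.table convention but is an assumption here, as is the default JSON envelope with its op, before and after fields.

```python
# A rough sketch of consuming Debezium change events from Kafka with kafka-python.
# Topic name and broker address are illustrative; the envelope fields are
# Debezium's defaults (op: c=create, u=update, d=delete, r=snapshot read).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:                      # tombstone record following a delete
        continue
    payload = event.get("payload", event)  # envelope may or may not carry a schema wrapper
    op = payload["op"]
    if op in ("c", "r"):
        print("insert:", payload["after"])
    elif op == "u":
        print("update:", payload["before"], "->", payload["after"])
    elif op == "d":
        print("delete:", payload["before"])
```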
The major vendors offer CDC via the following:
- AWS’ Data Migration Service
- GCP’s DataStream
- Azure’s Data Factory
Transportation
Following on from capture, and looking beyond the major cloud platforms, other options for moving real-time data include:
- Apache Kafka, the standard. Open source, highly available and fault tolerant, it has high administration requirements and needs careful planning to configure. It does, however, run well once set up properly. It is used by major platforms like Twitter, Spotify and Netflix, along with large financial institutions (a minimal producer sketch follows this list).
- Redpanda, a specialized, single-binary event streaming platform. It has a Kafka-compatible API, but doesn’t require the infrastructure Kafka does, such as ZooKeeper or a JVM. A fault-tolerant event log, it separates event producers from consumers and natively supports schema registries and HTTP proxies, where Kafka requires multiple binaries for similar capability. It also boasts remarkable performance, and can be deployed onto “bare metal” if required, run as a managed service, or run in your own cloud or on premises.
- Confluent wraps around Kafka to create a serverless, scalable, highly available streaming platform. Using Kafka Connect to move data from source to destination, it can scale out with minimal maintenance.
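For a feel of what producing onto this transport layer involves, here is a minimal sketch using the kafka-python client. The broker address and topic name are illustrative, and because Redpanda exposes a Kafka-compatible API, the same code can point at a Redpanda broker unchanged.

```python
# A minimal Kafka producer sketch with kafka-python. Broker and topic are
# placeholders; swap in a Redpanda broker address and it works the same way.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",  # wait for replication before acknowledging the write
)

producer.send("orders", {"id": 42, "status": "SHIPPED"})
producer.flush()  # block until outstanding messages have been delivered
```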
Major vendors manage transportation via the following:
- AWS’s Kinesis
- GCP’s Cloud PubSub
- Azure’s Synapse Pipelines
Transformation
This is the T in ETL and ELT, as a capability. Most of the tools already mentioned allow for some transformation: CDC platforms such as HVR and Striim let transformations be built into their pipelines, and the Kafkas and Redpandas of the world allow for simple transformations. Specialist options include:
- Orbital (formerly Vyne) integrates with streaming and batch sources and uses its open-source taxonomy language to transform data from source to sink, giving full visibility at run time of which data was used, where, and when.
- Apache Beam, an open-source programming model for data pipeline development. Pipelines written with Beam can run on Apache Flink, Spark and Google’s Dataflow, and it supports Python, Java and Go. Beam is built for parallel processing: it splits data into smaller bundles which are processed independently (a small pipeline sketch follows this list).
- StreamSets, a multi-cloud, multi-platform data transformation and integration platform. It works across clouds, both on and off premises, for streaming. It uses Kafka, but integrates with most platforms.
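To show what the Beam programming model looks like, here is a small Python pipeline. The data and element shapes are made up; the same code runs locally on the DirectRunner by default, and can be submitted to Flink, Spark or Dataflow by changing the pipeline’s runner options.

```python
# A small Apache Beam pipeline: key events by user and sum amounts per key.
# Runs on the local DirectRunner unless other pipeline options are supplied.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read events" >> beam.Create([
            {"user": "a", "amount": 10},
            {"user": "b", "amount": 5},
            {"user": "a", "amount": 7},
        ])
        | "Key by user" >> beam.Map(lambda e: (e["user"], e["amount"]))
        | "Sum per user" >> beam.CombinePerKey(sum)
        | "Print results" >> beam.Map(print)
    )
```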
Major vendors manage transformation via:
- AWS’s Glue
- GCP’s Dataflow
- Azure’s Synapse Pipelines, HDInsight and Data Factory
Exploitation
Once data is captured, moved and transformed, what now?
Major analytics platforms that work with real-time data include the Apache offerings, such as Druid and Spark; the new emergent leaders, Databricks and Snowflake; and others, including streaming-specialist databases like Materialize, ClickHouse and Firebolt. Each of these boasts impressive scalability claims, and most offer multi-cloud serverless implementations. Focusing on the specialists:
- Materialize recognizes streaming data as asynchronous, building views as more information arrives. It keeps the results of queries run before and increments those results as new data comes in, rather than re-running from scratch each time a query is called. This allows for sub-second answers to complex queries (a toy sketch of this idea follows this list).
- ClickHouse, a column-oriented, cloud-based OLAP database. Its column-oriented design means queries read only the columns they need, so data can be read quickly and inexpensively from a performance perspective.
- Firebolt, another scalable, cloud-based data lake built on top of Amazon S3. It transforms various file formats into its own F3 format, built for speed of retrieval. Boasting an impressive number of comparative case studies and a healthy ecosystem of connectors, it’s extremely customisable, tunable and manageable.
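The incremental idea behind Materialize can be illustrated with a toy Python sketch: keep a running aggregate per key and update it as each event arrives, so answering a query is a lookup rather than a re-scan of all history. This is a conceptual illustration only, not how Materialize is actually implemented.

```python
# A toy illustration of incremental view maintenance: maintain a running total
# per key as events arrive, instead of recomputing from scratch at query time.
from collections import defaultdict


class RunningTotals:
    def __init__(self):
        self.totals = defaultdict(int)   # the maintained "view": total per key

    def on_event(self, key, amount):
        self.totals[key] += amount       # incremental update: O(1) per event

    def query(self, key):
        return self.totals[key]          # answering is a lookup, not a re-scan


view = RunningTotals()
for key, amount in [("a", 10), ("b", 5), ("a", 7)]:
    view.on_event(key, amount)

print(view.query("a"))  # 17, available without recomputing over the full history
```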
Finally, the major cloud vendors’ own high-speed analytics databases include:
- AWS Redshift
- GCP’s BigQuery
- Azure’s Cosmos DB
The big vendors are often a good place to start, offering a complete end-to-end capability for real-time analytics. However, for those looking further afield for specialist offerings, there are plenty of options that are just as scalable, performant and compelling as the big players for delivering your real-time analytics capability.
With thanks to David Yaffe at Estuary, Marty Pitt at Orbital, and Ottalei Martin.