Kafka started as open-source software installed on a server. It is complex and highly configurable, and early adopters learned first-hand how difficult, time-consuming, and expensive managing Kafka clusters could be. Those staying with on-premises Kafka are adopting solutions such as data observability platforms that give them automated visibility and control over their environments.
Other companies are turning to the cloud, where options abound, each offering a different level of concierge service for Kafka. To oversimplify things, let’s divide the Kafka services into two basic categories:
Kafka-as-a-service, in which your Kafka clusters are lifted and shifted to a hosting provider such as AWS, Cloudera, Red Hat (IBM), or Azure. The provider handles much of the infrastructure management, including provisioning, configuring, and maintaining servers. For security, each customer’s Kafka instances are hosted on their own physical servers in a single-tenant architecture. Though in the cloud, customers still retain most of the control over their Kafka environment, which means they retain most of the responsibility for managing it, too.
Fully-managed Kafka, pioneered by Confluent Cloud. Confluent Cloud takes nearly all the operational hassle out of running Kafka while providing the instant scalability and no-fuss reliability that developers love. As Confluent Cloud evangelist Kai Waehner outlines in an admittedly flattering way, if the Kafka software is a car engine, then hosted Kafka or Kafka-as-a-service is a car, making Confluent Cloud the equivalent of a self-driving car.
How the Cloud Does — and Doesn’t — Cut Your Costs
Despite being the standard bearer for fully-managed Kafka, Confluent does offer Kafka in any flavor customers want, including on-premises, hybrid, and hosted/as-a-service. It recognizes that many customers are not ready to jump from one extreme, the full manual control and customizability of self-managed Kafka, to the other: little control over Kafka, and even less visibility, in serverless Confluent Cloud.
That flexibility, combined with the technical credibility of its founders, who helped create Kafka when it was an internal project at LinkedIn, has helped Confluent gain market traction.
Since Kafka costs, apart from hardware, come largely from managing the environment and developing applications, companies still have plenty of opportunities to streamline their Kafka environments and optimize their price performance.
In the case of single-tenant Kafka-as-a-service, operational complexity remains high for customers. Though hosting providers will automate tasks such as provisioning a new Kafka cluster, there are still many dashboards to monitor, deployment decisions to make, bottlenecks and data errors to fix, storage to manage, and so on. To alleviate this operational burden and optimize the price performance of a dynamic Kafka environment, hosted Kafka users benefit significantly from the same continuous data observability as their on-premises and hybrid counterparts.
Do fully-managed Kafka users face the same operational costs? Confluent would argue no, saying that Confluent Cloud’s back-end economies of scale, near-zero administration requirements, and instant, automatic elasticity translate into huge Total Cost of Ownership (TCO) savings: $2.6 million over three years versus deploying and self-managing Kafka, according to a 2022 Forrester Total Economic Impact study commissioned by Confluent.
A Low Ops Dream to Some — A (Fiscal) Nightmare to Others
Confluent’s promise mirrors the one made by Snowflake, down to a Forrester study claiming a $21 million ROI over three years. The low-ops, highly-scalable cloud data warehouse has been embraced by developers and data-driven companies. Enthralled by their new-found agility and time-to-market, many users took “low ops” as an excuse to do “no ops.” They ignored basic principles of value engineering and cloud FinOps and neglected operational oversight, such as monitoring costs and setting up cost guardrails.
That came back to bite them in the form of billing “horror stories.” In a recent blog post about optimizing Snowflake usage, I shared a famous example of a company that misconfigured an hours-long code test that should have cost about $7, resulting in a $72,000 charge from Snowflake instead. More prosaically, other Snowflake users found that cost optimization was not automatic and still required substantial effort and oversight.
Snowflake users were forced to try out various solutions, from Snowflake’s built-in Resource Monitors to third-party visual dashboards and reports and a host of other tools. However, even when cobbled together, these solutions cannot provide users with continuous visibility, forecasting, and control over their costs, not to mention other key areas such as data reliability and data performance.
Similarly, optimizing costs in Confluent Cloud’s dynamic pay-as-you-go charging model is neither simple nor automatic. Streaming data volumes can spike to as much as 10 GB per second in an instant, and monitoring and preventing the resulting cost overruns is not easy. While the Confluent Cloud Console can show primary usage in real time, actual customer costs lag by 6 to 24 hours.
And while Confluent Cloud enables users to create real-time triggers and alerts around data performance problems, there are none for cost overruns. That can be a problem if a developer forgets to turn off a high-volume test streaming data pipeline or if a conservative archiving policy leads to a buildup in storage charges. Confluent Cloud users may not notice until the hefty monthly bill arrives.
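Because there are no built-in spend alerts, some teams script a basic cost guardrail themselves while they evaluate more complete tooling. Below is a minimal, illustrative sketch that polls the Confluent Cloud Metrics API for a cluster’s ingress volume over the past hour and flags anything above a self-imposed budget. The endpoint and metric name follow the public Metrics API v2 documentation; the cluster ID, threshold, and alert hook are placeholders to adapt, and the request fields should be checked against the current API reference.

```python
# Illustrative cost guardrail: query hourly ingress bytes from the Confluent Cloud
# Metrics API and warn when it exceeds a self-imposed budget. Verify field names
# against the current Metrics API v2 reference before relying on this.
import os
from datetime import datetime, timedelta, timezone

import requests

METRICS_URL = "https://api.telemetry.confluent.cloud/v2/metrics/cloud/query"
CLUSTER_ID = "lkc-XXXXX"                    # placeholder: your Kafka cluster ID
BUDGET_BYTES_PER_HOUR = 500 * 1024 ** 3     # example guardrail: 500 GB per hour


def hourly_ingress_bytes() -> float:
    """Return bytes received by the cluster over the previous full hour."""
    end = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    payload = {
        "aggregations": [{"metric": "io.confluent.kafka.server/received_bytes"}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": CLUSTER_ID},
        "granularity": "PT1H",
        "intervals": [f"{start.strftime(fmt)}/{end.strftime(fmt)}"],
    }
    resp = requests.post(
        METRICS_URL,
        json=payload,
        auth=(os.environ["CONFLUENT_API_KEY"], os.environ["CONFLUENT_API_SECRET"]),
        timeout=30,
    )
    resp.raise_for_status()
    return sum(point["value"] for point in resp.json().get("data", []))


if __name__ == "__main__":
    ingress = hourly_ingress_bytes()
    if ingress > BUDGET_BYTES_PER_HOUR:
        # Wire this into your paging or alerting tool of choice.
        print(f"ALERT: {ingress / 1024 ** 3:.1f} GB ingested in the last hour exceeds budget")
```

A data observability platform effectively runs this kind of check continuously, across every cluster and metric, and ties the results to forecasting and alert routing rather than a one-off script.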
Moreover, users are starting to state publicly that, whatever Confluent Cloud’s web calculator may say, the security blanket of a fully-managed service comes at a commensurate price.
How Data Observability Helps
Confluent actually encourages customers who are serious about monitoring and managing their Confluent Cloud environments to look to a third-party provider.
That’s where an enterprise data observability platform can come in. The right platform can ingest cost and performance metrics via Confluent Cloud’s API, generate additional analytics through its own monitoring, and then combine the two to create further insights, alerts, and recommendations. In particular, here are five ways that an effective data observability platform helps users prevent cost overruns and optimize costs:
1) Ongoing visibility and alerts around the performance and usage of all Confluent Cloud data pipelines. With real-time streaming, the volume of events processed, sent, and stored can spike wildly, especially with the instant, multi-GB scalability of Confluent Cloud. With compute observability, data teams can monitor for data spikes that create bottlenecks or crash processes. An effective data observability platform also provides a real-time view that helps you select the right number of Partitions and Topics to optimize your cost performance (a rough partition-sizing heuristic is sketched after this list).
2) Producer-topic-lineage visibility. Acceldata’s Kapxy tool, for example, lets Confluent Cloud customers drill down further into the three key Kafka components (the Producer, Topic, and Consumer) so you can track your data at a more granular level from end to end. By providing an in-depth view of how data flows by pipeline, application, or department, it lets customers accurately calculate usage and cost. This enables accurate cost chargebacks and ROI calculations and supports data pipeline reuse and other value engineering efforts.
3) Monitor and prevent consumer lag. One of the biggest potential problems in Kafka is the gap between ingested data and data received by the downstream application or Consumer. If the gap grows too large, data stored in the Kafka broker may automatically expire before it is successfully transmitted. With high-level visibility over your entire Confluent Cloud data pipeline, you can easily identify potential concerns such as offline and under-replicated Partitions, the most significant and most skewed Topics, whether the number of out-of-sync replicas is growing, and the Consumer groups with the most lag. You can drill into those Consumer groups or view individual events. This visibility helps you catch Consumer Lag before it becomes an outright problem that forces you to add costly compute or storage to compensate (a minimal lag-and-retention check is sketched after this list).
4) Preventing data loss. As mentioned above, Consumer Lag and other bottlenecks not only directly increase your processing costs, but they can also lead to data loss. That’s because users control how long data is stored by Kafka Brokers (servers). If bottlenecks or delays go on too long, data may be purged by the Broker before it is successfully transmitted onward to the Consumer applications. Data observability monitoring dashboards help you diagnose the causes of data loss, while tools such as Kapxy pinpoint exactly which data was lost and where.
5) Clean, validate and transform streaming data. Acceldata works with Kafka and Confluent Cloud pipelines to ingest, validate and transform events in real time, improving your data quality and reliability. That reduces the need to search out and fix data errors and troubleshoot the offending data pipelines and applications, which lowers your operational costs and boosts your ROI.
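On the partition-sizing point from the first item above, a commonly cited rule of thumb is to provision at least max(target throughput / per-partition producer throughput, target throughput / per-partition consumer throughput) partitions. The tiny sketch below simply encodes that arithmetic; the per-partition figures are illustrative placeholders you would replace with measurements from your own producers and consumers.

```python
# Rough partition-count heuristic often cited in Kafka sizing guides:
# partitions >= max(target / producer_throughput_per_partition,
#                   target / consumer_throughput_per_partition).
import math


def estimate_partitions(target_mb_s: float,
                        producer_mb_s_per_partition: float,
                        consumer_mb_s_per_partition: float) -> int:
    """Return the minimum partition count covering both produce and consume throughput."""
    return math.ceil(max(target_mb_s / producer_mb_s_per_partition,
                         target_mb_s / consumer_mb_s_per_partition))


# Example with placeholder numbers: 1 GB/s peak ingest, producers pushing ~20 MB/s
# per partition, consumers draining ~40 MB/s per partition -> at least 50 partitions.
print(estimate_partitions(target_mb_s=1000,
                          producer_mb_s_per_partition=20,
                          consumer_mb_s_per_partition=40))
```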
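And for the lag-and-retention concern in items 3 and 4, here is a minimal sketch of the underlying check: compare each partition’s committed consumer offset against the broker’s latest offset, alongside the topic’s retention.ms setting, so growing lag can be caught before the broker expires unread data. It uses the confluent-kafka Python client; the bootstrap server, group ID, and topic name are placeholders, and a real Confluent Cloud cluster would also need SASL credentials in the client configuration.

```python
# Illustrative lag-and-retention check with the confluent-kafka Python client.
from confluent_kafka import Consumer, TopicPartition
from confluent_kafka.admin import AdminClient, ConfigResource

BOOTSTRAP = "pkc-XXXXX.us-east-1.aws.confluent.cloud:9092"  # placeholder
GROUP_ID = "orders-consumer"                                 # placeholder
TOPIC = "orders"                                             # placeholder

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": GROUP_ID,
    "enable.auto.commit": False,
    # add security.protocol / sasl.* settings for a Confluent Cloud cluster
})
admin = AdminClient({"bootstrap.servers": BOOTSTRAP})

# Look up the topic's partitions and its retention.ms setting.
metadata = consumer.list_topics(TOPIC, timeout=10)
partitions = [TopicPartition(TOPIC, p) for p in metadata.topics[TOPIC].partitions]
futures = admin.describe_configs([ConfigResource(ConfigResource.Type.TOPIC, TOPIC)])
configs = next(iter(futures.values())).result()
retention_ms = int(configs["retention.ms"].value)  # -1 means unlimited retention

# Lag = latest broker offset minus the group's committed offset, per partition.
for tp in consumer.committed(partitions, timeout=10):
    _, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else 0
    print(f"partition {tp.partition}: lag={high - committed}, retention.ms={retention_ms}")

consumer.close()
```

An observability platform performs the same comparison continuously across every topic and Consumer group, and raises an alert while there is still time to scale Consumers or extend retention instead of losing data.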
All five of these benefits also help ease the initial migration of your old Kafka environment into Confluent Cloud. As a multi-tenant, highly-managed architecture with custom features not available in Kafka, the best known being ksqlDB, Confluent Cloud differs enough from single-tenant Kafka clusters, whether on-premises or hosted, that any migration will be a total refactoring of your environment from the ground up, not a simple lift-and-shift.