DWH update with data streaming

A commentary by Kai Weiner, Confluent
More and more companies are developing digital business models. However, outdated data warehouses and messaging architectures are often a hindrance. A new, stream-driven approach helps increase the flexibility, agility and scalability of the infrastructure.
Classic data warehouses (DWHs) and traditional messaging systems have been among the well-established corporate IT technologies for many years. With the spread of the (hybrid) cloud as the leading infrastructure model, however, these approaches are no longer sufficient.
Because of the ETL effort involved (extraction, transformation, loading), classic data warehouse systems take too long to combine data streams from different sources into a data warehouse with a fixed data format. For example, insights from the current day's data are not available until the next day at the earliest. Traditional messaging platforms such as Tibco, IBM MQ and others are often integrated with ESBs (Enterprise Service Buses), backend processing or offline APIs. The resulting systems are tightly and inelastically coupled. Conversions or expansions of the messaging infrastructure are therefore very time consuming, which drives up costs.
What is needed are infrastructures that scale better, are more agile, cheaper and faster, and that set data in motion instead of keeping it static (data in motion). Messaging platforms should be able to process different data streams in real time without distinction, whether sensor data, image data, usage data from platforms or classic application data. No data may be lost, the order must be unambiguous, and any tampering with the original data must be traceable for compliance reasons.
The solution: Cloud DWH plus data streaming
Cloud DWHs linked to modern streaming platforms provide a solution. A cloud DWH can process classic data formats from applications as well as a wide variety of unstructured data. Data streaming platforms specifically designed for the cloud ensure that data from other sources, other clouds or the on-premises data center is transmitted quickly and seamlessly to the cloud DWH and analyzed there. This provides companies with timely insights, which in turn can be translated into prompt decisions for improving the business.
This enables applications such as predictive maintenance, capacity savings in near real time, faster customer recommendations, detection of multiple accounts, or exact resumption of a data stream at the point of interruption, for example when watching movies.
Apache Kafka can do that
The most important tool for building data-streaming infrastructures is Apache Kafka; Confluent's platform is also based on it. So here is a quick look at the basic concepts.
Kafka uses resources that are distributed and clustered across the data center, while traditional message brokers usually assume a central unit with clients connected in a star topology. Kafka has no central queue, but rather flexible, event-driven producer/consumer structures and topics. The latter are broken down into smaller units ("topic partitions") and distributed across different cluster nodes. Every client node has access to all metadata, which points the way to the data of interest, no matter where it resides. This increases read and write performance. In addition, the cluster can be expanded dynamically with new nodes.
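A minimal producer sketch in Java illustrates the partitioning idea; broker addresses, the topic name "sensor-readings" and the record key are purely illustrative assumptions, not part of any specific installation:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SensorProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical cluster endpoints; any broker can serve the metadata
        // that tells the client where each topic partition lives.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key always land in the same topic partition,
            // which preserves per-key ordering across the distributed cluster.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("sensor-readings", "machine-42", "{\"temp\": 71.3}");
            RecordMetadata meta = producer.send(record).get();
            System.out.printf("Written to partition %d at offset %d%n",
                meta.partition(), meta.offset());
        }
    }
}
```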
Flexible producer/consumer architectures allow data to be consumed by multiple clients in parallel and in the correct order. Many consumer instances combined into a consumer group can thus access the same topic at the same time. They still receive their messages independently of one another and, if desired, with identity-based access protection.
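The consumer side can be sketched just as briefly; the group name "analytics-service" and the endpoint are again assumptions for illustration. Every instance started with the same group.id takes over a share of the topic's partitions, while a second group with a different id receives all messages again:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SensorConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical endpoint
        // Instances sharing this group.id divide the partitions among themselves
        // and therefore consume in parallel without duplicating work.
        props.put("group.id", "analytics-service");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("sensor-readings"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```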
One source of truth
Kafka relies on the so-called commit log, a data source of guaranteed integrity for the entire system (a "single source of truth") in which all messages are stored durably in the form and order in which they were sent. Storing the sequentially written log on hard drives is much cheaper than is often assumed. Slow data consumers can no longer slow down the system because they are not directly coupled to data producers.
Topic partitions can be replicated across multiple nodes for fault tolerance. The number of copies can be set individually for each topic. This enables extensions, repairs and restarts of individual nodes during continuous operation.
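Because the log retains messages in write order, a consumer can also rewind and re-read history, for example after an outage or for a late-starting analysis job. A minimal sketch, again with assumed topic and endpoint names:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromLog {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical endpoint
        props.put("group.id", "replay-job");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign one partition explicitly and rewind to the oldest retained
            // offset: replaying the commit log does not touch the producers at all.
            TopicPartition tp = new TopicPartition("sensor-readings", 0);
            consumer.assign(List.of(tp));
            consumer.seekToBeginning(List.of(tp));
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(2))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```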
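The per-topic replication setting is simply a parameter at topic creation. The following sketch uses the standard admin client with assumed names and sizes (six partitions, three replicas):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical endpoint

        try (AdminClient admin = AdminClient.create(props)) {
            // Six partitions spread the load across the cluster; replication factor 3
            // keeps a copy of every partition on three different brokers, so a node
            // can be repaired or restarted while the topic stays available.
            NewTopic topic = new NewTopic("sensor-readings", 6, (short) 3);
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```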
Enabling cloud-native architectures
Kafka-based data streaming infrastructures such as Confluent are an ideal foundation for moving to a cloud DWH. Confluent provides cloud-native clusters whose throughput can be scaled from zero to gigabyte scale at the push of a button, with on-demand billing based on throughput and a service level agreement of 99.95 percent. Data integration and messaging are combined on a single platform.
ksqlDB processes data streams in real time at millions of messages per second. Around 120 connectors, including ones for classic message brokers, allow data from many applications to be integrated. Existing messaging patterns are also supported. Data resources at different locations can be combined via multi-cluster linking.
Updating the legacy infrastructure is done in several phases. In the first, Confluent replaces the data integration layer. The existing messaging solution is retained, but data is fed into the desired cloud applications (such as Snowflake) via Confluent connectors. This creates a hybrid cloud infrastructure.
In the second phase, JMS (Java Message Service)-based consumer and producer applications are redirected to point to Confluent, gradually replacing the previous middleware. The data now lives in Kafka topics, but the application connections continue to use JMS.
Phase 3 is the transition to cloud-native messaging. With ksqlDB, streaming applications are built around event-driven patterns such as event notification, event sourcing and materialized views.
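To make these patterns concrete, here is a minimal sketch using Kafka Streams, the Java library mentioned in the case study below, rather than ksqlDB itself; topic names, the application id and the aggregation are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ViewingEventsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "viewing-events-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // hypothetical
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Event notification: forward every raw playback event to a downstream topic.
        KStream<String, String> events = builder.stream("playback-events");
        events.to("playback-notifications");

        // Materialized view: a continuously updated count of events per account,
        // kept in a queryable read-only state store.
        KTable<String, Long> perAccount = events
            .groupByKey()
            .count(Materialized.as("events-per-account"));
        perAccount.toStream().to("events-per-account-changelog",
            Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```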
Successful migration projects
Finally, some examples of successful migrations to data-streaming-based storage and messaging infrastructures:
- Thousands of physical parcels and about one terabyte of data packets every day: Deutsche Post must be able to rely one hundred percent on the performance and scalability of its IT systems, because parcel shipping and other services depend on the seamless connectivity of systems and the provision of real-time data. Due to the rapidly growing volume of data packets, Deutsche Post had to update its system landscape. The company now relies on Apache Kafka and Confluent, which allows large amounts of data to be processed and correlated quickly, all in real time.
- At the online travel company Expedia, a flexible, regionally distributed conversations platform has been built with Confluent. It is fed with millions of messages per second from separate systems via a clustered, event-driven architecture. ksqlDB enriches the data flows between systems, and data storage is separated from data processing. This is how Expedia was able to handle the flood of inquiries during the coronavirus pandemic.
- The integration specialist Data Reply has modernized the data warehouse of a large live TV provider and video streaming portal. The new architecture combines data ingestion via Kafka producers on AWS Fargate, data integration from databases via Kafka Connect in Confluent Cloud, stream processing with Kafka Streams, Kafka with Confluent Cloud as the event broker, Apache Druid for real-time analysis, and a cloud DWH from Snowflake. With this solution, revenue can be distributed to content providers with millisecond accuracy, misuse through multiple accounts can be prevented, exact resumption points can be created, and real-time data can be evaluated instantly.
Classic DWH and messaging infrastructures no longer meet the requirements of digital business models. What is needed is data in motion. The examples show that modern stream-based processing with Kafka, or with products and services built on it, forms a good foundation for digital business models based on capturing and storing data. It provides the necessary flexibility, security and scalability.