With data flowing and updating from ever more sources, businesses are looking to quickly translate this data into information to make key decisions, usually in an automated way.
Real-time data streams have become more popular thanks to the Internet of Things (IoT), sensors in everyday devices and, of course, the rise of social media. These platforms provide ever-changing states. Analysing them even a day later can give misleading or outright false information.
Streaming allows continuous data ingestion. In the past, a few flaws prevented real-time streaming from becoming ubiquitous.
One major problem was data loss when there was a fault: if the stream was down, no data was being collected.
Another was that real-time tools weren’t able to keep up with the velocity: the amount of data that some social networks produce every minute is staggering.
Streaming was an inevitable answer to this, as you can take data more frequently as it’s produced. Think of it as quicker, smaller increments that are easier to manage than a huge daily bulk load of data.
As a bonus, because data is being ingested as it’s being produced, if you can find a way to process it at the same time you get faster insights.
Apache Kafka is an open-source stream processing platform, originally developed at LinkedIn to solve the problems discussed above.
Kafka is a publish-subscribe message system. It provides isolation between data producers and data consumers. It also allows intermediate storage and buffering of data to help with large velocities.
This means that the thing that creates data is decoupled from the thing that reads it, so one won’t break the other. It also solves the velocity problem mentioned above, as data can be buffered so the consumer doesn’t get blocked or miss data.
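To make the buffering idea concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not Kafka's API: the `Topic` class and its `publish` and `read` methods are names made up for this illustration. The producer writes a burst of messages without ever waiting, and a slower consumer drains them in small batches without missing any.

```python
class Topic:
    """Toy in-memory stand-in for a topic: an append-only list of messages."""

    def __init__(self):
        self._log = []

    def publish(self, message):
        # The producer only appends; it never waits for a consumer.
        self._log.append(message)

    def read(self, position, max_records):
        # Return up to max_records messages starting at the given position.
        return self._log[position:position + max_records]


topic = Topic()

# A fast producer writes a burst of ten readings.
for i in range(10):
    topic.publish(f"reading-{i}")

# A slow consumer drains the buffer three messages at a time.
received = []
position = 0
while True:
    batch = topic.read(position, max_records=3)
    if not batch:
        break
    received.extend(batch)
    position += len(batch)

print(len(received), received[0], received[-1])
```

The producer and consumer never call each other here; the only thing they share is the buffer in the middle, which is the decoupling described above.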
When choosing a streaming platform, there are three main things to consider: reliability, speed and flexibility.
Kafka is capable of providing each of these in its own way.
To ensure reliability, Kafka uses replication. Each partition is copied to several brokers. One replica is the designated leader, and the others follow it and fetch data from it. If the broker holding the leader goes down, one of the followers takes over as leader, so there is still a copy to read from and write to.
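The leader and follower idea can be sketched in a few lines of Python. This is a simplified toy and not Kafka's real replication protocol: the `Partition` class, its method names and the rule for picking a new leader are all assumptions made for illustration.

```python
class Partition:
    """Toy model of one replicated partition (not Kafka's real protocol)."""

    def __init__(self, broker_ids):
        # One copy of the log per broker; the first broker starts as leader.
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.alive = set(broker_ids)
        self.leader = broker_ids[0]

    def write(self, message):
        # Writes go to the current leader only.
        self.replicas[self.leader].append(message)

    def sync_followers(self):
        # Each live follower fetches whatever it is missing from the leader.
        leader_log = self.replicas[self.leader]
        for broker_id in self.alive - {self.leader}:
            follower_log = self.replicas[broker_id]
            follower_log.extend(leader_log[len(follower_log):])

    def fail(self, broker_id):
        # Mark a broker as down; if it led the partition, promote a survivor.
        self.alive.discard(broker_id)
        if broker_id == self.leader:
            if not self.alive:
                raise RuntimeError("no live replicas left")
            # Assumed rule: the live replica with the longest log wins,
            # with ties going to the lowest broker id.
            self.leader = max(sorted(self.alive),
                              key=lambda b: len(self.replicas[b]))


partition = Partition(["broker-1", "broker-2", "broker-3"])
partition.write("a")
partition.write("b")
partition.sync_followers()   # followers catch up with the leader
partition.fail("broker-1")   # the leader's broker goes down
partition.write("c")         # writes carry on against the new leader

print(partition.leader, partition.replicas[partition.leader])
```

Because the followers had already fetched the first two messages, nothing is lost when the original leader disappears.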
Kafka is also fast. This article explains how to benchmark Kafka. Its authors found it can perform 2 million writes per second on cheap hardware. One of the main reasons is that it calls directly into the OS kernel to move data. It also avoids random disk access by writing sequential, immutable data (called a commit log). Finally, it can scale horizontally, meaning it can write to thousands of partitions spread over many machines.
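As a rough illustration of how writes can be spread over partitions, the sketch below hashes each message key to pick a partition. The `choose_partition` function is hypothetical, and CRC32 is only a stand-in chosen because it ships with Python; Kafka's own clients use their own partitioning logic.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    # Stable hash of the key, reduced to a partition index.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


num_partitions = 4
# Each partition is its own append-only list (a tiny commit log).
partitions = [[] for _ in range(num_partitions)]

events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-3", "login"),
    ("user-1", "logout"),
]

for key, value in events:
    # Records are only ever appended to the end of the chosen partition.
    partitions[choose_partition(key, num_partitions)].append((key, value))

# Every event for one key lands in the same partition, in the order written.
user1_partition = choose_partition("user-1", num_partitions)
user1_events = [v for k, v in partitions[user1_partition] if k == "user-1"]
print(user1_events)
```

Since each partition is independent, partitions can live on different machines and be written to in parallel, which is where the horizontal scaling comes from.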
Its flexibility comes from the decoupling of producers and consumers. Consumers keep track of what they have consumed rather than producers dictating the stream. Because the data producers send is persisted to disk by Kafka, you can pause consumers, and once restarted they continue consuming from where they left off. This decoupling also allows streaming from the source once and consuming many times by many applications.
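Here is a small sketch of consumers tracking their own position. It is a toy model rather than a real Kafka client: the `Log` and `Consumer` classes and the `poll` method are invented for this example, and the offset lives in the Python object instead of being stored durably.

```python
class Log:
    """Toy append-only log shared by every consumer."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class Consumer:
    """Toy consumer that remembers how far it has read."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.offset = 0   # position of the next record to read
        self.seen = []

    def poll(self, max_records=None):
        # Read from our own offset; other consumers are unaffected.
        if max_records is None:
            end = len(self.log.records)
        else:
            end = self.offset + max_records
        batch = self.log.records[self.offset:end]
        self.seen.extend(batch)
        self.offset += len(batch)
        return batch


log = Log()
dashboard = Consumer("dashboard", log)
archiver = Consumer("archiver", log)

for i in range(3):
    log.append(f"event-{i}")

dashboard.poll()               # reads everything available
archiver.poll(max_records=1)   # reads one record, then is paused

for i in range(3, 6):
    log.append(f"event-{i}")   # produced while the archiver is paused

dashboard.poll()
resumed = archiver.poll()      # picks up where it left off

print(resumed)
```

Both consumers end up with the full stream even though it was produced once and one of them was paused part of the way through.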
Real time streaming has been made easy. There are many ways to consume and visualise this data in real time. I’ll be looking into processing data in real time and producing insights and reporting too.