Note: Most of these blogs are for my personal reference and at a given time, some of those might just be unpolished drafts.
Message Queuing (Kafka and Zookeeper) for Microservices and ML Solutions Pipelines
Microservice architecture is a philosophy of decoupling an otherwise large monolithic application into independent modules (applications) that talk to one another, and to external data sources, through APIs. Message queuing comes into play to handle this communication between microservices, and between microservices and external sources, whether it is an API call or intensive data processing. Doing such work in a synchronous model would block threads and could render the entire application unresponsive.
Apache Kafka is one such platform. Officially, it’s known as a distributed stream processing platform with high resilience and fault tolerance. It achieves this by running a cluster of computing nodes as a distributed system, with coordination among the nodes maintained by another Apache platform called Zookeeper.
Here is the basic flow:
- Start Zookeeper (in order to coordinate).
- Start the Kafka server (which is basically a broker, in this case, mediating messages between a publisher [producer] and subscribers [consumers]).
- Producer API (in order to create messages and queue them into Kafka).
- Consumer API (in order to consume messages from the Kafka queue). (For this purpose, I will be using a console producer and consumer for easy demonstration and understanding).
For Kafka, use this mirror (the Digital Ocean mirror, as of 27/3/2018, is not a valid resource).
Run the following script: (Run all the scripts inside
sh zookeeper-server-start.sh ../config/zookeeper.properties
Now the Zookeeper monitor is ready and we can start our Kafka broker, which is the message intermediary.
sh kafka-server-start.sh ../config/server.properties
The Kafka broker listens on port 9092 by default.
All the messages pushed into Kafka are logically categorized into “topics.” If a topic does not exist, Kafka is configured by default to create it automatically, but it’s wiser to create topics explicitly so we can logically distinguish our messages. This is how the consumer will query Kafka: “Give me a message from this topic.”
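To make the idea of topics concrete, here is a minimal in-memory sketch. It is not Kafka's API: `ToyBroker`, `create_topic`, `publish` and `read` are hypothetical names made up for illustration, and the `auto_create_topics` flag only mimics the auto-creation behaviour described above.

```python
class ToyBroker:
    """In-memory stand-in for a broker: one append-only list per topic."""

    def __init__(self, auto_create_topics=True):
        self.auto_create_topics = auto_create_topics
        self.topics = {}

    def create_topic(self, name):
        # Creating a topic that already exists is a no-op in this toy model.
        self.topics.setdefault(name, [])

    def publish(self, topic, message):
        if topic not in self.topics:
            if not self.auto_create_topics:
                raise KeyError(f"unknown topic: {topic}")
            self.create_topic(topic)
        self.topics[topic].append(message)

    def read(self, topic):
        # "Give me the messages from this topic."
        return list(self.topics.get(topic, []))


broker = ToyBroker()
broker.create_topic("orders")            # explicit creation
broker.publish("orders", "order-1")
broker.publish("payments", "payment-1")  # created implicitly on first publish
print(broker.read("orders"))
print(sorted(broker.topics))
```

With `auto_create_topics=False`, a typo in a topic name fails loudly instead of silently creating a new topic, which is one reason to create topics explicitly.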
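(Note: All basic essential tasks like creating, viewing, and deleting topics, console consumer-producer actions, etc. are pre-bundled in the Kafka binary and we can use them by giving appropriate parameters. The commands here match the Kafka release current at the time of writing; newer releases replace the --zookeeper flag of kafka-topics.sh with --bootstrap-server.)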
sh kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_name
Partitions and Replication Factor
The Zookeeper monitor runs on port 2181 by default. Kafka topics are divided into partitions. Partitions enable parallelization by splitting the data in a particular topic across multiple brokers, which could be present on different computing nodes. The “replication-factor” flag determines how many copies of each topic partition are made. This is how fault tolerance is achieved: one broker could go down and consumers can still consume messages from a live replica. Meanwhile, Zookeeper detects the down broker so that corrective action, such as electing a new leader for its partitions, can be taken.
Now that we have Zookeeper and a broker running with a topic, we are ready to send messages to Kafka and consume them. In real-world software, this is achieved using the producer and consumer APIs, but for demonstration purposes, we will use a console-based producer and consumer.
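The effect of the replication factor can be sketched with a few lines of Python. This is a simplified round-robin placement written for illustration, not Kafka's actual assignment algorithm; `assign_replicas`, `live_replicas` and the broker names are all hypothetical.

```python
def assign_replicas(num_partitions, replication_factor, brokers):
    """Place each partition's replicas on consecutive brokers, round-robin."""
    if replication_factor > len(brokers):
        # Each replica of a partition must live on a different broker.
        raise ValueError("replication factor cannot exceed the number of brokers")
    return {
        p: [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        for p in range(num_partitions)
    }


def live_replicas(assignment, down):
    """Replicas still reachable for each partition after some brokers fail."""
    return {
        p: [b for b in replicas if b not in down]
        for p, replicas in assignment.items()
    }


brokers = ["broker-0", "broker-1", "broker-2"]
assignment = assign_replicas(num_partitions=3, replication_factor=2, brokers=brokers)
print(assignment)

# Take one broker down: every partition should still have a live copy.
print(live_replicas(assignment, down={"broker-1"}))
```

With a replication factor of 1, as in the create command above, the same experiment leaves one partition with no live replica, which is why a single-broker setup is fine for a demo but not fault tolerant.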
sh kafka-console-producer.sh --broker-list localhost:9092 --topic topic_name
sh kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_name
Now, whatever input is fed into the console producer is relayed to the console consumer via Kafka.
We can also view the existing messages:
sh kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic topic_name --from-beginning
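The difference between the two consumer commands comes down to where the consumer starts reading. A toy model, assuming a single partition and using hypothetical `ToyLog` and `ToyConsumer` classes (not Kafka client classes), might look like this:

```python
class ToyLog:
    """Append-only log for one partition; a message's index is its offset."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the message just written

    def end_offset(self):
        return len(self.messages)


class ToyConsumer:
    def __init__(self, log, from_beginning=False):
        self.log = log
        # Without from_beginning, start at the end and see only new messages.
        self.position = 0 if from_beginning else log.end_offset()

    def poll(self):
        batch = self.log.messages[self.position:]
        self.position = self.log.end_offset()
        return batch


log = ToyLog()
log.append("a")
log.append("b")

late = ToyConsumer(log)                         # like the plain console consumer
replay = ToyConsumer(log, from_beginning=True)  # like --from-beginning
log.append("c")

late_batch = late.poll()
replay_batch = replay.poll()
print(late_batch)
print(replay_batch)
```

The consumer that joined late only sees what arrived after it started, while the replaying consumer reads the whole log, because the messages stay in the log after being consumed.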
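The number of partitions for automatically created topics comes from the broker’s num.partitions setting (1 in the default server.properties), so a particular partition count isn’t guaranteed. In case we want to change the number of partitions of an existing topic, we can issue the following command (the count can only be increased, not decreased):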
sh kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_name --partitions 3
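One thing to keep in mind when altering partitions: for messages that carry a key, the partition is chosen by hashing the key modulo the partition count, so changing the count changes where keys land. The sketch below illustrates this with `zlib.crc32` as a stand-in hash (an assumption made to stay stdlib-only, not the hash Kafka uses) and made-up keys.

```python
import zlib


def partition_for(key, num_partitions):
    # Stand-in hash (crc32), chosen only to keep the sketch dependency-free.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"user-{i}" for i in range(1000)]  # hypothetical message keys

before = {k: partition_for(k, 1) for k in keys}  # topic created with 1 partition
after = {k: partition_for(k, 3) for k in keys}   # after --partitions 3

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys now map to a different partition")
```

Messages already written stay where they are, so after the change, older and newer messages for the same key may sit in different partitions.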
To see the existing partitions of a topic, along with the latest offset of each, we can issue:
sh kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic topic_name --time -1
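The command prints one `topic:partition:offset` line per partition, so counting the lines gives the partition count. A small parser for that output could look like the following; `parse_offsets` is a hypothetical helper and the sample text is illustrative, not captured from a real broker.

```python
def parse_offsets(output):
    """Parse 'topic:partition:offset' lines into {partition: offset}."""
    offsets = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _topic, partition, offset = line.rsplit(":", 2)
        offsets[int(partition)] = int(offset)
    return offsets


# Illustrative sample only; real numbers depend on what was produced.
sample = """
topic_name:0:12
topic_name:1:7
topic_name:2:9
"""

offsets = parse_offsets(sample)
print(f"partitions: {len(offsets)}")
print(f"sum of latest offsets: {sum(offsets.values())}")
```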
To sum up, Kafka is a distributed messaging platform for stream processing. It’s more like a glorified enterprise messaging system set up in a distributed computing environment, with connectors to external data sources.
Kafka Tool is a GUI app for viewing and managing topics and partitions in a Kafka broker.
The need for queuing
Microservice architecture inherently demands some sort of message queuing system. When we split a big monolithic application into smaller, loosely-coupled microservices, the number of REST API calls amongst those microservices increases, and so does the number of connections to external data sources. Keeping such a hugely interconnected system synchronous is not desirable, as it can render the entire application unresponsive and can defeat the whole purpose of splitting into microservices in the first place. So, having a message queuing system on a distributed platform like Kafka, which is highly fault-tolerant and has constant monitoring of broker nodes through services like Zookeeper, makes the whole data flow easier.
Message Queuing in ML Solutions Pipeline
Another use case for a queuing system like Kafka is the ML solution pipeline. In a very simplified way, this is how ML solutions are built:
A user interface on the client side (mobile, web) → An API server and the database → Machine learning (blackbox).
The machine learning blackbox can be very compute-heavy and it’s not practical to serve those requests in a blocking, synchronous mode. In this scenario, all the requests can be queued and a consumer configured to take those requests one by one and feed them into the ML blackbox. This pipeline can easily handle compute-intensive tasks - like recognizing objects from thousands of images, which would take considerable time - without missing any requests.
Microservices deployed in containers, talking to each other through fault-tolerant distributed clusters of broker nodes that are monitored by a tool like Zookeeper, look like the new way of enterprise software development.
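The queue-then-consume pattern can be sketched with Python's standard `queue` and `threading` modules standing in for Kafka. Everything here is an assumption for illustration: `ml_blackbox` is a placeholder that just sleeps, and the image ids are made up.

```python
import queue
import threading
import time


def ml_blackbox(image_id):
    # Placeholder for a slow model call.
    time.sleep(0.01)
    return f"labels-for-{image_id}"


def worker(pending, results):
    """Take requests one by one and feed them to the model."""
    while True:
        image_id = pending.get()
        if image_id is None:  # sentinel: no more work
            break
        results[image_id] = ml_blackbox(image_id)


pending = queue.Queue()
results = {}
consumer = threading.Thread(target=worker, args=(pending, results))
consumer.start()

# The API server only enqueues; it never waits on the model.
for i in range(20):
    pending.put(f"img-{i}")

pending.put(None)  # tell the worker to stop once the queue is drained
consumer.join()
print(f"processed {len(results)} of 20 requests")
```

The enqueue loop finishes almost immediately, while the worker drains the queue at the model's pace. With Kafka in place of the in-process queue, the requests would also survive a restart of the consumer.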