Dear Friends,
As I transition from Database Developer to Data Scientist, I would like to take this opportunity to help friends who are on a similar path.
What is Kafka?
Kafka is an open-source project created at LinkedIn by Jay Kreps, Neha Narkhede and Jun Rao, engineers who later co-founded Confluent. It was released on GitHub in 2010. It was built to be a scalable messaging system that could meet the needs of both monitoring and tracking systems.
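The name ‘Kafka’ comes from the writer Franz Kafka. Jay Kreps had taken many literature classes in college and liked Kafka’s work, and he felt that a writer’s name suited a system optimized for writing.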
Why Kafka?
It is described as a “distributed commit log” or, more recently, as a “distributed streaming platform”.
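To make the “commit log” idea concrete, here is a minimal in-memory sketch in Python. It is not Kafka’s API: the class name CommitLog and its methods are made up for illustration. It only shows the core idea that records are appended at the end, each record gets an increasing offset, and reading does not remove anything.

```python
class CommitLog:
    """Toy append-only log (hypothetical, not part of Kafka)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; nothing is deleted."""
        return self._records[offset:offset + max_records]


log = CommitLog()
first = log.append("page_view:/home")
second = log.append("page_view:/cart")

print(first, second)  # 0 1
print(log.read(0))    # ['page_view:/home', 'page_view:/cart']
print(log.read(1))    # ['page_view:/cart']
```

For a database developer, this is close in spirit to a redo log: an ordered record of what happened, which readers can replay from any position.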
Terminology
Kafka Term | Meaning | Similar Term in Databases |
Message | Unit of data | Row or record |
Schema | Structure for message content | JSON, XMLTYPE or XSD |
Topic | Category into which messages are grouped | Table, or folder in a filesystem |
Partition | Topics are further broken down into partitions | Table partitions/table sub-partitions |
Stream | Data from a single topic, moving from producers to consumers | Similar to Oracle Streams, used in Change Data Capture for replication |
Producer/Writer/Source/Publisher | Creates messages and publishes them to a topic | Think of the source in an ETL job or the source in UTL_FILE |
Consumer/Reader/Sink/Subscriber | Subscribes to a topic and reads its messages | Think of the target in an ETL job or the target in UTL_FILE |
Broker | A single Kafka server that receives messages from producers, stores them and serves them to consumers | An instance (set of memory structures and processes) in Oracle, which acts as an intermediary between datafiles/control files/logfiles and users/clients; one instance in Oracle RAC (Oracle Real Application Clusters) |
Cluster | Group of brokers. One broker acts as the cluster controller; for each partition, one broker is the leader and the brokers holding its replicas are followers | Similar to Oracle RAC (Oracle Real Application Clusters) with multiple instances |
ZooKeeper | Stores configuration information about the cluster and, in older Kafka versions, consumer client details | Similar to global dictionary views (gv$) like gv$session, which contain connection metadata, or to the interconnect in Oracle RAC used to communicate/sync between different instances |
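The relationship between topics, partitions and message keys can be sketched with a small toy model in Python. This is an illustration only, not Kafka code: the Topic class, its method names and the topic name "orders" are hypothetical, and zlib.crc32 is used as a stand-in hash (Kafka's own partitioner uses a different hash function). The idea it shows is that messages with the same key land in the same partition, so their order is preserved there.

```python
import zlib


class Topic:
    """Toy topic made of several partitions (hypothetical, not Kafka's API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is its own ordered list of (key, value) messages.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        """Pick a partition from the key using a stable hash (assumed: crc32)."""
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Append the message and return (partition, offset within partition)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1


orders = Topic("orders", num_partitions=3)
a = orders.send("customer-1", "order created")
b = orders.send("customer-1", "order paid")

# Same key, so both messages go to the same partition, one after the other.
print(a[0] == b[0], b[1] == a[1] + 1)  # True True
```

This is similar to hash partitioning a table on a customer ID column: all rows for one customer end up in the same partition.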
What makes Kafka Popular?
Multiple Producers: Kafka seamlessly handles many producers, whether they all write to the same topic or to different topics.
Multiple Consumers: Multiple consumers can read the same messages simultaneously without interfering with each other.
Disk-based retention: Messages are stored durably on disk, so they need not always be consumed in real time.
Scalable: Kafka can be scaled up and scaled out, which makes it easy to handle huge amounts of data.
High Performance: Together, the features above give Kafka excellent performance.
Open Source: It offers all the benefits of Open-Source software – cost benefit, transparency, flexibility, and security.
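The “Multiple Consumers” and “Disk-based retention” points above can be sketched as follows. This is a toy in-memory model in Python, not Kafka's client API: the Consumer class and the names billing and analytics are made up for illustration. Each consumer keeps its own offset into the same stored log, so one reader's progress does not affect another, and a consumer that starts late can still read older messages. Loosely, the two consumers here behave like consumers in two different consumer groups.

```python
class Consumer:
    """Toy consumer that tracks its own read position (hypothetical)."""

    def __init__(self, name, log):
        self.name = name
        self.log = log    # shared, retained list of messages
        self.offset = 0   # this consumer's private position

    def poll(self, max_records=2):
        """Return the next batch and advance only this consumer's offset."""
        batch = self.log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch


log = ["m0", "m1", "m2", "m3"]  # messages already written by producers
billing = Consumer("billing", log)
analytics = Consumer("analytics", log)

print(billing.poll())    # ['m0', 'm1']
print(billing.poll())    # ['m2', 'm3']
print(analytics.poll())  # ['m0', 'm1']  (unaffected by billing's progress)
```

Because the messages stay in the log after being read, analytics could be started hours after billing and still receive everything from the beginning.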
What are the popular Event Streaming Platforms?
Amazon Kinesis, Apache Spark, Apache Flink, Apache Kafka, Apache Storm
Conclusion:
This was just an introduction and a brief insight into the amazing world of Kafka and the rich features it offers in a world where data is growing exponentially and data formats are becoming more complex.
Cheers!