This document summarizes Shuhsi Lin's presentation about Apache Kafka. The presentation introduced Kafka as a distributed streaming platform and message broker. It covered Kafka's core concepts: topics, partitions, producers, consumers, and brokers. It also discussed different Python clients for Kafka, such as PyKafka, kafka-python, and confluent-kafka, and their usage in applications such as log aggregation, metrics collection, and stream processing.
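The topic/partition concept covered in the talk can be illustrated without a broker: Kafka clients assign each keyed message to a partition deterministically, typically by hashing the key. The sketch below is illustrative only (real clients use murmur2 or CRC variants rather than MD5, and the function name is my own):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition deterministically, so all
    messages with the same key land on the same partition (and
    therefore stay ordered relative to each other)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# The same key always maps to the same partition.
print(partition_for(b"user-42", 6), partition_for(b"user-42", 6))
print(partition_for(b"user-43", 6))
```

Because ordering in Kafka is only guaranteed within a partition, this key-to-partition stickiness is what lets per-key event streams stay ordered.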
Zebra is an open-source routing stack and a successor to the GNU Zebra and Quagga projects. Together with openconfigd, it works as a data-plane-agnostic network operations stack composed of pluggable protocol and functional modules.
Advanced Kurento Real Time Media Stream Processing (FIWARE)
Advanced Kurento Real Time Media Stream Processing presentation, by Juan Ángel Fuentes.
Stream Oriented GE. How-to sessions. 1st FIWARE Summit, Málaga, Dec. 13-15, 2016.
WPEWebKit, the WebKit port for embedded platforms (Linaro Connect San Diego 2019) (Igalia)
By Philippe Normand.
WPEWebKit[1] is a WebKit flavor (also known as a port) specially crafted for embedded platforms and use cases. During this talk I will present WPEWebKit's architecture, with special emphasis on its multimedia backend, which is based on GStreamer[2] and implements support for the MSE[3], EME[4], and MediaCapabilities specifications. I will also present a case study on how to successfully integrate WPEWebKit on i.MX6 and i.MX8M platforms, either with the Cog[5] standalone reference web-app container or within existing Qt5 applications, using the
WPEQt QML plugin.
[1] https://wpewebkit.org
[2] https://gstreamer.freedesktop.org
[3] https://www.w3.org/TR/media-source/
[4] https://www.w3.org/TR/encrypted-media/
[5] https://github.com/Igalia/cog
Linaro Connect San Diego 2019
September 23-27, 2019
https://connect.linaro.org/resources/san19/
Presentation at OpenStack Summit Boston. This talk covers various lessons from IPv6 Neutron deployments, such as address allocation, address configuration, router considerations, and more.
Embedded Recipes 2019 - Remote update adventures with RAUC, Yocto and Barebox (Anne Nicolas)
Different upgrade and update strategies exist for embedded Linux systems. If none of them was chosen at development time, adding one afterwards can be a tedious task.
It gets even harder when the system is already deployed in the field and only accessible via a 3G connection.
This talk is a developer's account of putting exactly that in place: a report of experience on one way of doing it on a system running Barebox and a Yocto-based distribution.
Patrick Boettcher
From Big to Fast Data. How #kafka and #kafka-connect can redefine your ETL and... (Landoop Ltd)
Presentation on "Big Data and Kafka, Kafka-Connect and the modern days of stream processing" For @Argos - @Accenture Development Technology Conference - London Science Museum (IMAX)
Landoop presentation at the Athens Big Data meetup about streaming technologies on Apache Kafka: an introduction to the Lenses SQL engine, the Lenses platform, and our open-source projects.
Landoop presents how to simplify your ETL process using Kafka Connect for the (E) and (L). Introducing KCQL, the Kafka Connect Query Language, and how it can simplify fast-data (ingress & egress) pipelines. How KCQL can be used to set up Kafka connectors for popular in-memory and analytical systems, with live demos with Hazelcast, Redis and InfluxDB. How to get started with a fast-data Docker Kafka development environment. Enhance your existing Cloudera (Hadoop) clusters with fast-data capabilities.
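For flavour, KCQL statements read like SQL over topics. A rough, hypothetical example (topic, target, and field names are invented here, and exact syntax varies per connector, so consult the connector documentation):

```sql
-- sink selected fields of a Kafka topic into a target table/cache
INSERT INTO customer_cache
SELECT id, name, last_seen
FROM customer-topic
```

A statement like this is passed to a Kafka Connect sink connector as configuration, replacing what would otherwise be many connector-specific mapping properties.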
Kafka Tutorial: Streaming Data Architecture (Jean-Paul Azar)
Kafka tutorial covers Java examples for Producers and Consumers. Also covers why Kafka is important and what Kafka is. Takes a look at the whole ecosystem around Kafka. Discusses low-level details about Kafka needed for successful deploys and performance tuning like batching, compression, partitioning, and replication.
This tutorial covers advanced consumer topics: custom deserializers, using a ConsumerRebalanceListener to rewind to a certain offset, manual assignment of partitions to implement a "priority queue", Java consumer examples for "at least once", "at most once", and "exactly once" message delivery semantics, and a lot more.
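The delivery-semantics distinction above comes down to when the consumer commits offsets relative to processing. A broker-free sketch of the two orderings (the offset store and crash point are stand-ins for a real Kafka consumer and its committed offsets; names are my own):

```python
def consume(num_records, committed, commit_first, crash_at=None):
    """Replay records from the last committed offset; return
    (processed offsets, new committed offset). crash_at simulates
    a crash in the middle of handling that offset."""
    processed = []
    for offset in range(committed, num_records):
        if commit_first:
            committed = offset + 1              # at-most-once: commit, then process
            if offset == crash_at:
                return processed, committed     # crashed after commit -> record lost
            processed.append(offset)
        else:
            processed.append(offset)            # at-least-once: process, then commit
            if offset == crash_at:
                return processed, committed     # crashed before commit -> reprocessed
            committed = offset + 1
    return processed, committed

# At-most-once: commit before processing; a crash while handling offset 1 loses it.
run1, pos = consume(3, 0, commit_first=True, crash_at=1)
run2, _ = consume(3, pos, commit_first=True)
print(run1 + run2)   # [0, 2] -- offset 1 was never processed

# At-least-once: commit after processing; the same crash duplicates offset 1.
run1, pos = consume(3, 0, commit_first=False, crash_at=1)
run2, _ = consume(3, pos, commit_first=False)
print(run1 + run2)   # [0, 1, 1, 2] -- offset 1 processed twice
```

"Exactly once" then requires making the processing and the commit atomic (e.g. storing both in one transaction), which is why it is the hardest of the three.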
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scaling, resilient and performant data-processing powerhouse. We will go through the basics of Akka, Kafka and Mesos and then deep-dive into putting them together in an end-to-end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. On the backend, you will see how Apache Cassandra and Spark can be combined to add the incredible scaling storage and data analysis needed for fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is the default.
In this slide deck we show how to implement a custom Kafka serializer for a producer. We then show how failover works when configuring the broker/topic setting min.insync.replicas and the producer setting acks (0, 1, -1 / none, leader, all).
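A Kafka serializer is, independent of client library, just a function from an in-memory object to bytes, with the deserializer as its inverse on the consumer side. A minimal JSON-based sketch (names are my own; the deck's own examples are in Java):

```python
import json

def serialize(value: dict) -> bytes:
    """Producer side: object -> bytes on the wire. Sorted keys and
    compact separators make the encoding deterministic."""
    return json.dumps(value, separators=(",", ":"), sort_keys=True).encode("utf-8")

def deserialize(payload: bytes) -> dict:
    """Consumer side: bytes -> object; must mirror the serializer."""
    return json.loads(payload.decode("utf-8"))

event = {"user": "alice", "amount": 3}
wire = serialize(event)
print(wire)                          # b'{"amount":3,"user":"alice"}'
print(deserialize(wire) == event)    # True: the round trip is lossless
```

In a real client you would register these as the producer's value serializer and the consumer's value deserializer; the round-trip property is the invariant to test.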
The tutorial then shows how to implement Kafka producer batching and compression, and uses the producer metrics API to see how batching and compression improve throughput. It then covers using retries and timeouts, and tests that they work. It explains how the max in-flight messages and retry backoff settings work, and when to use (and not use) in-flight messaging.
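The throughput win from combining batching with compression is easy to see offline: similar records compress far better as one batch than one at a time, because the codec can exploit redundancy across records. Here gzip stands in for Kafka's per-batch codecs, and the record shape is invented for illustration:

```python
import gzip
import json

# 200 similar JSON records, as a producer might batch them
records = [json.dumps({"sensor": i % 4, "temp": 20.0 + i % 7}).encode()
           for i in range(200)]

# compress each record alone vs. compress the whole batch once
one_at_a_time = sum(len(gzip.compress(r)) for r in records)
batched = len(gzip.compress(b"\n".join(records)))

print(one_at_a_time, batched)  # the batch compresses several times smaller
```

This is why Kafka applies compression at the batch level: larger batches give the codec more cross-record redundancy to exploit, at the cost of a little latency while the batch fills.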
It goes on to show how to implement a ProducerInterceptor. Lastly, it shows how to implement a custom Kafka partitioner to implement a priority queue for important records. Through many of the step-by-step examples, this tutorial shows how to use some of the Kafka tools to do replication verification and to inspect topic partition leadership status.
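The "priority queue via a custom partitioner" trick works by reserving one partition for important records and hashing everything else across the rest; a dedicated consumer can then drain the reserved partition first. A client-agnostic sketch (the partition layout, hash choice, and names are my own):

```python
import hashlib

PRIORITY_PARTITION = 0  # reserved for important records

def priority_partitioner(key: bytes, is_important: bool, num_partitions: int) -> int:
    """Route important records to the reserved partition; spread
    ordinary records over the remaining partitions by key hash."""
    if is_important:
        return PRIORITY_PARTITION
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return 1 + h % (num_partitions - 1)

print(priority_partitioner(b"order-1", True, 6))   # always 0
print(priority_partitioner(b"order-2", False, 6))  # somewhere in 1..5
```

In the Java client this logic would live in a class implementing the Partitioner interface, configured on the producer; the point is that partition choice is entirely under the producer's control.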
Building Event-Driven Systems with Apache Kafka (Brian Ritchie)
Event-driven systems provide simplified integration, easy notifications, inherent scalability and improved fault tolerance. In this session we'll cover the basics of building event-driven systems and then dive into utilizing Apache Kafka for the infrastructure. Kafka is a fast, scalable, fault-tolerant publish/subscribe messaging system developed at LinkedIn. We will cover the architecture of Kafka and demonstrate code that utilizes this infrastructure, including C#, Spark, ELK and more.
Sample code: https://github.com/dotnetpowered/StreamProcessingSample
Today, many companies are faced with a huge quantity of data and a wide variety of tools with which to process it. This potentially allows for great opportunities to satisfy customers’ needs and bring user experience to the next level. However, in order to achieve this and provide a competitive solution, sophisticated and complex data processing is needed. Such processing can rarely be done with one tool or framework — a number of tools are often involved, each having prowess in a particular field of the processing pipeline.
In this session, we will see the latest endeavors of Apache Ignite to integrate with other big data platforms and provide its in-memory computing strengths for data processing pipelines. In particular we will have a closer look at how it can be integrated and used with Apache Kafka and/or Flume, and outline several use scenarios.
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks (Slim Baltagi)
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Big Data Streams Architectures. Why? What? How? (Anton Nazaruk)
With the current zoo of technologies and the different ways they interact, it is a big challenge to architect a system (or adapt an existing one) that conforms to low-latency Big Data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are attracting more and more attention relative to the classic Hadoop-centric technology stack. The new Consumer API gave a significant boost in this direction. Microservices-based stream processing and the new Kafka Streams are proving to be a synergy in the Big Data world.
Apache Kafka - Scalable Message-Processing and more! (Guido Schmutz)
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka, presenting its role in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Stream Processing with Apache Kafka and .NET (Confluent)
Presentation from South Bay.NET meetup on 3/30.
Speaker: Matt Howlett, Software Engineer at Confluent
Apache Kafka is a scalable streaming platform that forms a key part of the infrastructure at many companies including Uber, Netflix, Walmart, Airbnb, Goldman Sachs and LinkedIn. In this talk Matt will give a technical overview of Kafka, discuss some typical use cases (from surge pricing to fraud detection to web analytics) and show you how to use Kafka from within your C#/.NET applications.
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka (Timothy Spann)
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show some best practices I have discovered over the last seven years of building data streaming applications, including IoT, CDC, logs, and more.
In my modern approach, we utilize several open-source frameworks to get the best features of each. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL, and we stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
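The Kafka-to-Flink-SQL step of the pipeline above can be sketched as a table declaration plus a continuous query. The table, field names, topic, and broker address below are placeholders; the connector options follow the Flink SQL Kafka connector's documented form:

```sql
-- declare a Kafka topic as a dynamic table (Flink SQL Kafka connector)
CREATE TABLE sensor_events (
  sensor_id STRING,
  temp      DOUBLE,
  ts        TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'sensor-events',
  'properties.bootstrap.servers' = 'broker:9092',
  'format' = 'json'
);

-- a continuous streaming ETL query over the topic
SELECT sensor_id, AVG(temp) AS avg_temp
FROM sensor_events
GROUP BY sensor_id;
```

The query runs continuously, updating its aggregates as new records arrive on the topic; a sink table (e.g. Iceberg) would be declared the same way to complete the pipeline.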
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Event Hub (i.e. Kafka) in Modern Data Architecture (Guido Schmutz)
Today's modern data architectures and their implementations contain an Event Hub. What are the benefits of placing an Event Hub in a Modern Data (Analytics) Architecture? What exactly is an Event Hub, and what capabilities should it provide? Why is Apache Kafka the most popular realization of an Event Hub?
These and many other questions will be answered in this session. The talk will start with a vendor-neutral definition of the capabilities of an Event Hub.
Then the session will highlight the different architecture styles that can be supported using an Event Hub (Kafka), such as Streaming Data Integration, Stream Analytics and Decoupled Event-Driven Applications, and how these can be combined into a unified architecture, making the Event Hub the central nervous system of an enterprise architecture. We will end with an overview of the Kafka ecosystem and a placement of the various components onto the Modern Data (Analytics) Architecture.
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra (Joe Stein)
Slides for the solution we developed using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all developed in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening across your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build, and offer these solutions as a service.
OSSNA Building Modern Data Streaming Apps (Timothy Spann)
OSSNA
Building Modern Data Streaming Apps
https://ossna2023.sched.com/event/1Jt05/virtual-building-modern-data-streaming-apps-with-open-source-timothy-spann-streamnative
Timothy Spann
Cloudera
Principal Developer Advocate
Data in Motion
In my session, I will show you some best practices I have discovered over the last seven years in building data streaming applications, including IoT, CDC, logs, and more. In my modern approach, we utilize several open-source frameworks to maximize all the best features. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there, we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We make continuous queries against our topics with Flink SQL. We will stream data into various open-source data stores, including Apache Iceberg, Apache Pinot, and others. We use the best streaming tools for the current applications with the open source stack - FLiPN. https://www.flipn.app/
Updates: This will be in-person with live coding based on feedback from the crowd. It will also include new data stores, new sources, and data relevant to and from the Vancouver area, as well as updates to the platforms and the inclusion of Apache Iceberg, Apache Pinot, and some other new tech.
https://github.com/tspannhw/SpeakerProfile Tim Spann is a Principal Developer Advocate for Cloudera. He works with Apache Kafka, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Timothy J Spann
Cloudera
Principal Developer Advocate
Hightstown, NJ
Website: https://datainmotion.dev/
AI&BigData Lab 2016. Viktor Sarapin: Size Matters: On-Demand Analys... (GeeksLab Odessa)
4.6.16 AI&BigData Lab
Upcoming events: goo.gl/I2gJ4H
How to set up data analysis of 40 million people over 5 years so that it looks almost real-time.
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ... (DataStax Academy)
We will be talking about the solution we developed using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all developed in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening across your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build, and offer these solutions as a service.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo... (Guido Schmutz)
Independent of the source of data, the integration and analysis of event streams gets more important in the world of sensors, social media streams and Internet of Things. Events have to be accepted quickly and reliably, they have to be distributed and analyzed, often with many consumers or systems interested in all or part of the events. In this session we compare two popular Streaming Analytics solutions: Spark Streaming and Kafka Streams.
Spark is a fast and general engine for large-scale data processing that was designed to provide a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala.
Kafka Streams is the stream processing solution that is part of Kafka. It is provided as a Java library and can therefore be easily integrated with any Java application.
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo (Joe Stein)
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a decoupled, real-time distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result will be in line with the initiating call. The architecture gains are immense: they allow the requesting system to receive a response without direct integration with the data pipeline(s) that messages must go through. By utilizing Apache Kafka and Apache Accumulo, these gains are sustained at scale and allow complex operations on different messages to be applied to each response in real time.
Providing Globus Services to Users of JASMIN for Environmental Data Analysis (Globus)
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
A Comprehensive Look at Generative AI in Retail App Testing (kalichargn70th171)
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Quarkus Hidden and Forbidden Extensions (Max Andersen)
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
How to Position Your Globus Data Portal for Success: Ten Good Practices (Globus)
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
SOCRadar Research Team: Latest Activities of IntelBroker (SOCRadar)
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntekBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Why React Native as a Strategic Advantage for Startup Innovation.pdfayushiqss
Do you know that React Native is being increasingly adopted by startups as well as big companies in the mobile app development industry? Big names like Facebook, Instagram, and Pinterest have already integrated this robust open-source framework.
In fact, according to a report by Statista, the number of React Native developers has been steadily increasing over the years, reaching an estimated 1.9 million by the end of 2024. This means that the demand for this framework in the job market has been growing making it a valuable skill.
But what makes React Native so popular for mobile application development? It offers excellent cross-platform capabilities among other benefits. This way, with React Native, developers can write code once and run it on both iOS and Android devices thus saving time and resources leading to shorter development cycles hence faster time-to-market for your app.
Let’s take the example of a startup, which wanted to release their app on both iOS and Android at once. Through the use of React Native they managed to create an app and bring it into the market within a very short period. This helped them gain an advantage over their competitors because they had access to a large user base who were able to generate revenue quickly for them.
Advanced Flow Concepts Every Developer Should KnowPeter Caitens
Tim Combridge from Sensible Giraffe and Salesforce Ben presents some important tips that all developers should know when dealing with Flows in Salesforce.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the various challenges will be demonstrated: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach allowed to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
top nidhi software solution freedownloadvrstrong314
This presentation emphasizes the importance of data security and legal compliance for Nidhi companies in India. It highlights how online Nidhi software solutions, like Vector Nidhi Software, offer advanced features tailored to these needs. Key aspects include encryption, access controls, and audit trails to ensure data security. The software complies with regulatory guidelines from the MCA and RBI and adheres to Nidhi Rules, 2014. With customizable, user-friendly interfaces and real-time features, these Nidhi software solutions enhance efficiency, support growth, and provide exceptional member services. The presentation concludes with contact information for further inquiries.
2. About Me
Data Software Engineer of EAD
at the manufacturer Micron
- Currently working with data and people
- Lurking in PyHug, Taipei.py, and various meetups
Shuhsi Lin
sucitw gmail.com
5. Agenda
» Pipeline to streaming
» What is Apache Kafka
⋄ Overview
⋄ Architecture
⋄ Use cases
» Kafka API
⋄ Python clients
» Conclusion and More about Kafka
6. What we will not focus on
» Reliability and durability
⋄ Scaling, replication, guarantee
⋄ Zookeeper
» Log compaction
» Administration, Configuration, Operations
» Kafka connect
» Kafka Stream
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ,
ZeroMQ, Redis, and ....
12. What is stream processing?
» Data arises as streams of events
(orders, sales, clicks, or trades)
» Databases are event streams
⋄ creating a backup or standby copy of a database
is, in effect,
⋄ publishing the stream of database changes
16. The name, “Kafka”, came from?
https://www.quora.com/What-is-the-relation-between-Kafka-the-writer-and-Apache-Kafka-the-distributed-messaging-system
http://slideplayer.com/slide/4221536/
https://en.wikipedia.org/wiki/Franz_Kafka
17. What is Apache Kafka?
Apache Kafka is a distributed system designed for streams. It is built to be
fault-tolerant, high-throughput, horizontally scalable, and allows geographically
distributing data streams and processing.
https://kafka.apache.org
22. What a streaming data platform can provide
» “Data integration” (ETL)
⋄ How to transport data between systems
⋄ Captures streams of events or data changes and
feeds these to other data systems
» “Stream processing” (messaging)
⋄ Continuous, real-time processing and
transformation of these streams and makes the
results available system-wide.
various systems in LinkedIn
https://www.confluent.io/blog/stream-data-platform-1/
Analytical data processing with very low latency
24. What Kafka Does
Publish & subscribe
● to streams of data like a messaging system
Process
● streams of data efficiently and in real time
Store
● streams of data safely in a distributed replicated cluster
https://kafka.apache.org/
27. A modern stream-centric data architecture built around Apache Kafka
https://www.confluent.io/blog/stream-data-platform-1/
500 billion events per day
28. The key abstraction in Kafka is a
structured commit log of updates
» Producers append records to this log, which can hold TBs of data
» Each data consumer has its own position in the log
and advances independently
» This allows a reliable, ordered stream of updates
to be distributed to each consumer
» Parallel, ordered consumption is important to a
change-capture system for database updates
» The log can be sharded and spread over a cluster of machines,
and each shard is replicated for fault-tolerance
https://www.confluent.io/blog/stream-data-platform-1/
29. Topics and Partitions
» Topics are split into partitions
» Partitions are strongly ordered & immutable
» Partitions can exist on different servers
» Partitions enable scalability
» Producers assign each message to a partition within the topic
⋄ either round-robin (simply to balance load)
⋄ or according to the message key
https://kafka.apache.org/documentation/#gettingStarted
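The two assignment strategies above can be sketched in plain Python. This is an illustration only: real clients hash keys with murmur2 and learn the partition count from the broker; `assign_partition` and `NUM_PARTITIONS` are made-up names.

```python
from itertools import count

NUM_PARTITIONS = 3  # hypothetical topic with 3 partitions

_round_robin = count()

def assign_partition(key=None):
    """Pick a partition the way a producer might: keyed messages hash to a
    stable partition; unkeyed messages rotate round-robin.
    (Simplified sketch -- real clients use murmur2, not a byte sum.)"""
    if key is None:
        return next(_round_robin) % NUM_PARTITIONS
    # stable hash, so the same key always lands on the same partition
    return sum(key.encode('utf-8')) % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved
assert assign_partition('user-42') == assign_partition('user-42')

# Unkeyed messages spread evenly across partitions
spread = {assign_partition() for _ in range(NUM_PARTITIONS)}
print(sorted(spread))  # -> [0, 1, 2]
```

Keyed assignment is what gives Kafka per-key ordering: all messages for one key stay in one strongly ordered partition.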
30. Offsets
» Messages are assigned a sequential offset within their partition
» Consumers track their position as (topic, partition, offset)
https://kafka.apache.org/documentation/#gettingStarted
A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups
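The offset model can be illustrated with a toy in-memory partition log (not the Kafka client API; `produce` and `fetch` are invented names): each consumer is just an integer position, and reading never removes records.

```python
# Toy single-partition commit log: an append-only list where a record's
# offset is simply its index. Consumers track their own offsets and
# advance independently; reads never destroy records.
log = []

def produce(record):
    log.append(record)

def fetch(offset, max_records=10):
    return log[offset:offset + max_records]

for r in ['a', 'b', 'c', 'd']:
    produce(r)

slow_consumer = 1   # this consumer's committed (topic, partition, offset)
fast_consumer = 4

assert fetch(slow_consumer) == ['b', 'c', 'd']  # still catching up
assert fetch(fast_consumer) == []               # fully caught up
assert fetch(0) == ['a', 'b', 'c', 'd']         # re-consume from the start
```

Because offsets are per consumer, an offline consumer can catch up later and any consumer can rewind and re-consume history, exactly as slide 32 notes.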
31. Consumers and Partitions
» The consumers in a group divide a topic's partitions among themselves
» Within a group, a partition is always sent to the same consumer instance
https://kafka.apache.org/documentation/#gettingStarted
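The division of partitions over a group can be sketched as follows (a round-robin-style toy; real clients use pluggable assignors negotiated via the group protocol, and `assign_partitions` is a made-up helper):

```python
def assign_partitions(partitions, consumers):
    """Spread a topic's partitions over the consumers of one group so that
    each partition goes to exactly one consumer instance."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions (P0-P3), 2 consumers in the group -- as in the cluster diagram
a = assign_partitions(['P0', 'P1', 'P2', 'P3'], ['C1', 'C2'])
print(a)  # -> {'C1': ['P0', 'P2'], 'C2': ['P1', 'P3']}
```

Since each partition has exactly one owner per group, ordering within a partition is preserved while the group as a whole consumes in parallel.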
32. Consumer
● Messages are available to consumers only when they have been
committed
● Kafka does not push
○ Unlike JMS
● Reads do not destroy messages
○ Unlike JMS Topic
● (some) History available
○ Offline consumers can catch up
○ Consumers can re-consume from the past
● Delivery Guarantees
○ Ordering maintained
○ At-least-once (per consumer) by default; at-most-once and exactly-once can be
implemented
P11 at https://www.slideshare.net/lucasjellema/amis-sig-introducing-apache-kafka-scalable-reliable-event-bus-message-queue
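The at-least-once default above can be illustrated with a toy poll loop that commits the offset only after processing (pure-Python sketch; the names are invented, not a client API): a crash between processing and commit causes redelivery, so duplicates are possible but nothing is lost.

```python
# At-least-once sketch: process first, commit the offset second. If we
# crash in between, the uncommitted record is delivered again.
log = ['a', 'b', 'c']
committed = 0        # last committed offset for this consumer
processed = []       # side effects observed downstream

def poll_and_process(crash_before_commit=False):
    global committed
    record = log[committed]
    processed.append(record)        # side effect happens first
    if crash_before_commit:
        return                      # offset never committed -> redelivery
    committed += 1

poll_and_process()                          # 'a' processed and committed
poll_and_process(crash_before_commit=True)  # 'b' processed, commit lost
poll_and_process()                          # 'b' delivered a second time
print(processed)  # -> ['a', 'b', 'b'] : duplicate, but no loss
```

Committing *before* processing flips the guarantee to at-most-once (a crash loses the record instead of duplicating it).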
33. ZooKeeper: the coordination interface
between the Kafka broker and consumers
https://hortonworks.com/hadoop-tutorial/realtime-event-processing-nifi-kafka-storm/#section_3
» Stores configuration data for distributed services
» Used primarily by brokers
» Used by consumers in 0.8 but not 0.9
35. Apache Kafka timeline
» 2010: creation in LinkedIn
» 2011-Nov: enters the Apache Software Foundation incubator
» 2013-Nov: v0.8 (New Producer, reassign-partitions)
» 2014: Confluent founded
» 2015-Nov: v0.9 (Kafka Connect, Security, New Consumer)
» 2016-May: v0.10 (Kafka Stream, rack awareness)
» Next version: v0.10.2 (Single Message Transforms for Kafka Connect)
36. TLS connection
SSL is supported only for the new Kafka Producer and Consumer (Kafka versions 0.9.0 and higher)
http://kafka.apache.org/documentation.html#security_ssl
http://docs.confluent.io/current/kafka/ssl.html
http://maximilianchrist.com/blog/connect-to-apache-kafka-from-python-using-ssl
https://github.com/edenhill/librdkafka/wiki/Using-SSL-with-librdkafka
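For reference, here is a sketch of the SSL-related settings a kafka-python client accepts (parameter names from the kafka-python docs; the broker address and file paths are placeholders, and the broker must expose an SSL listener, Kafka 0.9+):

```python
# SSL configuration sketch for kafka-python (paths are placeholders).
ssl_config = dict(
    bootstrap_servers='broker:9093',          # the SSL listener, not PLAINTEXT
    security_protocol='SSL',
    ssl_cafile='/path/to/ca-cert.pem',        # CA that signed the broker cert
    ssl_certfile='/path/to/client-cert.pem',  # only if client auth is required
    ssl_keyfile='/path/to/client-key.pem',
)
# Would be passed to a client as: KafkaProducer(**ssl_config)
```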
37. Apache Kafka is considered as:
Stream data platform
» Commit log service
» Messaging system
» Circular buffer
38. Cons of Apache Kafka
» Consumer Complexity (smart, but poor client)
» Lack of tooling/monitoring (3rd party)
» Still pre 1.0 release
» Operationally, it’s more manual than desired
» Requires ZooKeeper
Sep 26, 2015http://www.slideshare.net/jimplush/introduction-to-apache-kafka-53225326
39. Use Cases
» Website Activity Tracking
» Log Aggregation
» Stream Processing
» Event Sourcing
» Commit logs
» Metrics (Performance index streaming)
⋄ CPU/IO/Memory usage
⋄ Application Specific:
⋄ Time taken to load a web-page
⋄ Time taken to build a web-page
⋄ No. of requests
⋄ No. of hits on a particular page/url
40. Event-driven Applications
» How Kafka is first adopted, and how its role
evolves over time in an architecture
https://aws.amazon.com/tw/kafka/
42. Conceptual Reference Architecture
for Real-Time Processing in HDP 2.2
https://hortonworks.com/blog/storm-kafka-together-real-time-data-refinery/ February 12, 2015
43. Event delivery system design in Spotify
43
https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/
44. Case: Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spark Streaming
http://helenaedelson.com/?p=1186 (2016/03)
62. Producer API - Kafka-Python
Class kafka.KafkaProducer(**configs)
● close(timeout=None)
● flush(timeout=None)
● partitions_for(topic)
● send(topic, value=None, key=None,
partition=None, timestamp_ms=None)
value must be of type bytes, or be serializable
to bytes via the configured value_serializer
from kafka import KafkaProducer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = KafkaProducer(bootstrap_servers=BOOTSTRAP_SERVERS)
p.send(TOPICS, MSG.encode('utf-8'))
p.flush()
https://kafka-python.readthedocs.io/en/master/_modules/kafka/producer/kafka.html#KafkaProducer
http://kafka-python.readthedocs.io/en/master/apidoc/KafkaProducer.html
63. Producer API - Confluent-Kafka-Python
from confluent_kafka import Producer
from settings import BOOTSTRAP_SERVERS, TOPICS, MSG
p = Producer({'bootstrap.servers': BOOTSTRAP_SERVERS})
p.produce(TOPICS, MSG.encode('utf-8'))
p.flush()
Class confluent_kafka.Producer(*kwargs)
● len()
● flush([timeout])
● poll([timeout])
● produce(topic[, value][, key][, partition][,
on_delivery][, timestamp])
http://docs.confluent.io/current/clients/confluent-kafka-python/#producer
67. Create a Kafka Topic
» Let's create a topic named "test" with a single partition and
only one replica:
⋄ kafka-topics.sh --create --zookeeper zhost:2181
--replication-factor 1 --partitions 1 --topic test
» See that topic
⋄ bin/kafka-topics.sh --list --zookeeper zhost:2181
bin/kafka-topics.sh
» Create, delete, describe, or change a topic.
74. More about Kafka
» Reliability and durability
⋄ Scaling, replication, guarantee, Zookeeper
» Log compaction
» Administration, Configuration, Operations, Monitoring
» Kafka connect
» Kafka Stream
» Schema Registry
» REST Proxy
» Apache Kafka vs XXX
⋄ RabbitMQ, AWS Kinesis, GCP Pub/Sub, ActiveMQ, ZeroMQ, Redis,
and ....
75. The Other Two APIs
» Connect API
○ JDBC, HDFS, S3, ...
» Streams API
○ map, filter, aggregate, join
76. More references
1. The Log: What every software engineer should know about real-time data's unifying abstraction,
Jay Kreps, 2013
2. Pykafka and Kafka-python? https://github.com/Parsely/pykafka/issues/559
3. Why I am not a fan of Apache Kafka (2015-2016 Sep)
4. Kafka vs RabbitMQ
a. What are the differences between Apache Kafka and RabbitMQ?
b. Understanding When to use RabbitMQ or Apache Kafka
5. Kafka summit (2016~)
6. Future features of Kafka (Kafka Improvement Proposals)
7. Kafka- The Definitive Guide