Stream Processing made simple with Kafka

Kafka Streams
Stream processing Made Simple with Kafka
1
Guozhang Wang
Hadoop Summit, June 28, 2016

2
What is NOT Stream Processing?

3
Stream Processing isn’t (necessarily)
• Transient, approximate, lossy…
• .. such that you must have batch processing as a safety net

8
Stream Processing
• A different programming paradigm
• .. that brings computation to unbounded data
• .. with tradeoffs between latency / cost / correctness

9
Why Kafka in Stream Processing?

10
Kafka: Real-time Platforms
• Persistent Buffering
• Logical Ordering
• Highly Scalable “source-of-truth”

11
Stream Processing with Kafka

12
• Option I: Do It Yourself!

13
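while (isRunning) {
    // read some messages from Kafka (wait up to 100 ms for new ones)
    ConsumerRecords<String, String> inputMessages = consumer.poll(100);
    for (ConsumerRecord<String, String> message : inputMessages) {
        // do some processing… (placeholder: pass the value through unchanged)
        String output = message.value();
        // send output messages back to Kafka ("output-topic" is a placeholder name)
        producer.send(new ProducerRecord<>("output-topic", message.key(), output));
    }
}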

14
DIY Stream Processing is Hard
• Ordering
• Partitioning & Scalability
• Fault tolerance
• State Management
• Time, Window & Out-of-order Data
• Re-processing

15
• Option II: full-fledged stream processing system
• Storm, Spark, Flink, Samza, ..

16
MapReduce Heritage?
• Config Management
• Resource Management 
• Configuration 
• etc..

18
MapReduce Heritage?
• Deployment 
• etc..
Can I just use my own?!

19
• Option II: full-fledged stream processing system
• Option III: lightweight stream processing library

Kafka Streams
• In Apache Kafka since v0.10, May 2016
• Powerful yet easy-to-use stream processing library
• Event-at-a-time, Stateful
• Windowing with out-of-order handling
• Highly scalable, distributed, fault tolerant
• and more..
20

21
Anywhere, anytime

22
Anywhere, anytime
<dependency> 
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-streams</artifactId>
<version>0.10.0.0</version>
</dependency>

23
Anywhere, anytime
War File
Rsync
Puppet/Chef
YARN
Mesos
Docker
Kubernetes
Very Uncool Very Cool

Kafka Streams DSL
25
public static void main(String[] args) {
    // specify the processing topology by first reading in a stream from a topic
    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, String> words = builder.stream("topic1");
    // count the words in this stream as an aggregated table
    KTable<String, Long> counts = words.countByKey("Counts");
    // write the result table to a new topic
    counts.to("topic2");
    // create a stream processing instance and start running it
    // (config is a StreamsConfig, built as on the Native Kafka Integration slide)
    KafkaStreams streams = new KafkaStreams(builder, config);
    streams.start();
}

31
Native Kafka Integration
Properties cfg = new Properties();
cfg.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");
cfg.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
cfg.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
cfg.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
cfg.put(KafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "registry:8081");
StreamsConfig config = new StreamsConfig(cfg);
…

34
API, coding
“Full stack” evaluation
Operations, debugging, …
Simple is Beautiful

35
Key Idea:
Outsource hard problems to Kafka!

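Kafka Concepts: the Log
[Figure: a log of Messages at consecutive offsets (3, 4, 5, ... 12, ...). The Producer writes at the end of the log; Consumer1 reads at offset 7 and Consumer2 reads at offset 10.]

Kafka Concepts: the Log
[Figure: Producers write to, and Consumers read from, the Partitions of Topic 1 and Topic 2, which are hosted on Brokers.]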
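
The log idea on this slide can be sketched without any Kafka dependency. The following is only an illustration, assuming a single in-memory partition; SimpleLog and its methods are hypothetical names, not part of the Kafka API. It shows that appending assigns consecutive offsets and that each consumer reads independently from its own offset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for one Kafka partition (not the Kafka API).
final class SimpleLog {
    private final List<String> messages = new ArrayList<>();

    // Appends at the end of the log and returns the offset of the new message.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns all messages from the given offset (inclusive) to the end of the log.
    // Reading does not remove anything, so several consumers can read at their own pace.
    List<String> readFrom(long offset) {
        int from = (int) Math.max(0, Math.min(offset, messages.size()));
        return new ArrayList<>(messages.subList(from, messages.size()));
    }
}
```

A consumer at offset 7 and another at offset 10 would call readFrom(7) and readFrom(10) on the same log and see overlapping suffixes of it.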

38
Kafka Streams: Key Concepts

Stream and Records
39
[Figure: a Stream is a sequence of Records; each Record is a Key and Value pair.]

Processor Topology
41
Stream
Processor

Processor Topology
42
KStream<..> stream1 = builder.stream("topic1");
KStream<..> stream2 = builder.stream(...);
KStream<..> joined = stream1.leftJoin(stream2, ...);
KTable<..> aggregated = joined.aggregateByKey(...);
aggregated.to("topic3");

Processor Topology
47
Source Processor: KStream<..> stream1 = builder.stream(...) and KStream<..> stream2 = builder.stream(...)
Sink Processor: aggregated.to(...)

Processor Topology
48
[Figure: the topology, labeled Kafka Streams and Kafka.]

Data Parallelism
49
[Figure: Kafka Topic A and Kafka Topic B, with Task1 and Task2 running in the application instances MyApp.1 and MyApp.2.]
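
One way to picture the data parallelism on this slide is that records are spread over partitions by key, and each task works on one partition. The sketch below is a simplification under stated assumptions: it uses hashCode() modulo the partition count, whereas Kafka's default partitioner hashes the serialized key differently, and PartitionSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of key-based partitioning (not Kafka's actual partitioner).
final class PartitionSketch {
    // Maps a key to a partition index in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Groups keys by partition; each partition would be processed by one task.
    static Map<Integer, List<String>> assign(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String key : keys) {
            byPartition
                .computeIfAbsent(partitionFor(key, numPartitions), p -> new ArrayList<>())
                .add(key);
        }
        return byPartition;
    }
}
```

Because the mapping depends only on the key, all records with the same key land in the same partition and are therefore seen by the same task.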

50
Stream Processing Hard Parts
• Ordering
• Partitioning & Scalability
• Fault tolerance
• Re-processing

States in Stream Processing
51
Stateless
• filter
• map
Stateful
• join
• aggregate
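
The difference between the two groups can be sketched in plain Java. This is an illustration only, with hypothetical names (StateSketch, keep, normalize, count) rather than Kafka Streams API calls: the stateless operations look at one record at a time, while the aggregation has to remember what it has seen.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of stateless versus stateful operations.
final class StateSketch {
    // Stateless "filter": the decision depends only on the current record.
    static boolean keep(String word) {
        return !word.isEmpty();
    }

    // Stateless "map": the output depends only on the current record.
    static String normalize(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    // Stateful "aggregate": the running count per key must be kept between records.
    private final Map<String, Long> counts = new HashMap<>();

    // Returns the updated count for the given word.
    long count(String word) {
        return counts.merge(word, 1L, Long::sum);
    }
}
```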

53
State

54
[Figure: Kafka Topic A and Kafka Topic B, with Task1 and Task2 each keeping their own State.]

It’s all about Time
• Event-time (when an event is created)
• Processing-time (when an event is processed)
55

56
Out-of-Order
Event-time   Processing-time   Title
1            1999              The Phantom Menace
2            2002              Attack of the Clones
3            2005              Revenge of the Sith
4            1977              A New Hope
5            1980              The Empire Strikes Back
6            1983              Return of the Jedi
7            2015              The Force Awakens
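
The Star Wars example can be replayed in a few lines of Java. In this sketch the episode number plays the role of event-time and the release year plays the role of processing-time (A New Hope taken as 1977); OutOfOrderSketch and arrivalOrder are hypothetical names. Sorting by processing-time shows the order in which a processor would actually see the events.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of out-of-order arrival.
final class OutOfOrderSketch {
    // Returns the event-times in the order the records arrive,
    // that is, sorted by their processing-time.
    static List<Integer> arrivalOrder(int[] eventTimes, int[] processingTimes) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < eventTimes.length; i++) {
            indices.add(i);
        }
        indices.sort(Comparator.comparingInt(i -> processingTimes[i]));
        List<Integer> arrived = new ArrayList<>();
        for (int i : indices) {
            arrived.add(eventTimes[i]);
        }
        return arrived;
    }
}
```

Called with event-times 1 to 7 and the release years 1999, 2002, 2005, 1977, 1980, 1983, 2015, it returns 4, 5, 6, 1, 2, 3, 7: the events arrive out of event-time order.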

Timestamp Extractor
57
// processing-time: the wall-clock time at which the record is processed
public long extract(ConsumerRecord<Object, Object> record) {
    return System.currentTimeMillis();
}
// event-time: the timestamp carried by the record itself
public long extract(ConsumerRecord<Object, Object> record) {
    return record.timestamp();
}
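
Why the choice of timestamp matters can be sketched with a simple windowed count. The code below is an assumption-laden illustration, not the Kafka Streams windowing implementation: it assigns each record to a fixed-size window based on its event-time, and EventTimeWindowSketch and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of event-time windowing.
final class EventTimeWindowSketch {
    // Start of the fixed-size window that contains the given timestamp.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Counts records per window, keyed by window start.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMs, long windowSizeMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long ts : eventTimesMs) {
            counts.merge(windowStart(ts, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }
}
```

Since the window is derived from the event-time alone, the counts come out the same regardless of the order in which the records arrive.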

Stream vs. Table?
68
State

The Stream-Table Duality
• A stream is a changelog of a table
• A table is a materialized view of a stream at a point in time
• Example: change data capture (CDC) of databases
73
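
The duality can be illustrated with plain Java collections. This is a sketch under stated assumptions, not the Kafka Streams implementation: DualitySketch, materialize, and toChangelog are hypothetical names, and the changelog is just a list of key-value pairs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the stream-table duality.
final class DualitySketch {
    // Stream to table: replaying the changelog yields the table as of the last record;
    // a later record for a key overwrites the earlier one.
    static Map<String, String> materialize(List<Map.Entry<String, String>> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (Map.Entry<String, String> change : changelog) {
            table.put(change.getKey(), change.getValue());
        }
        return table;
    }

    // Table to stream: every row can be emitted as a change record.
    static List<Map.Entry<String, String>> toChangelog(Map<String, String> table) {
        List<Map.Entry<String, String>> changelog = new ArrayList<>();
        for (Map.Entry<String, String> row : table.entrySet()) {
            changelog.add(new SimpleEntry<>(row.getKey(), row.getValue()));
        }
        return changelog;
    }
}
```

Replaying the changes (alice, lnkd), (bob, googl), (alice, msft) gives a table in which alice maps to msft and bob maps to googl, which is the shape a CDC consumer would rebuild from a database's change stream.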

KStream = interprets data as record stream
~ think: “append-only”
KTable = data as changelog stream
~ continuously updated materialized view
74

75
KStream: User purchase history
(alice, eggs) (bob, lettuce) (alice, milk)
KTable: User employment profile
(alice, lnkd) (bob, googl) (alice, msft)

76
time
KStream: “Alice bought eggs.”
KTable: “Alice is now at LinkedIn.”

77
time
KStream: “Alice bought eggs and milk.”
KTable: “Alice is now at Microsoft.” (no longer at LinkedIn)

78
(alice, 2) (bob, 10) (alice, 3)
time
KStream.aggregate(): (key: Alice, value: 2)
KTable.aggregate(): (key: Alice, value: 2)

79
(alice, 2) (bob, 10) (alice, 3)
time
KStream.aggregate(): (key: Alice, value: 2+3)
KTable.aggregate(): (key: Alice, value: 3, replacing 2)
80
KStream to KTable: reduce(), aggregate(), …
KTable to KStream: toStream()
KStream to KStream: map(), filter(), join(), …
KTable to KTable: map(), filter(), join(), …

81
Updates Propagation in KTable
[Figure: KStream stream1 and KStream stream2 feed KStream joined, which feeds KTable aggregated, backed by State.]

87
Fault Tolerance
[Figure: inside Kafka Streams, three Process units each keep their own State, with Kafka on either side; a Kafka Changelog backs the State.]

88
Fault Tolerance
[Figure: the same picture, with a Protocol label added.]

89
Fault Tolerance
[Figure: the same picture, with a fourth Process and State added.]
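
The role of the Kafka Changelog in these figures can be sketched as follows: every state update is also appended to a changelog, so a replacement task can rebuild the state by replaying it. The code is a minimal in-memory illustration, assuming a shared list stands in for the changelog topic; ChangelogStore is a hypothetical name, not a Kafka Streams class.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical state store whose updates are mirrored to a changelog.
final class ChangelogStore {
    private final Map<String, Long> state = new HashMap<>();
    // Stands in for a Kafka changelog topic that outlives any single task.
    private final List<Map.Entry<String, Long>> changelog;

    // Restores local state by replaying whatever is already in the changelog.
    ChangelogStore(List<Map.Entry<String, Long>> changelog) {
        this.changelog = changelog;
        for (Map.Entry<String, Long> change : changelog) {
            state.put(change.getKey(), change.getValue());
        }
    }

    // Updates local state and records the change so it can be replayed later.
    void put(String key, long value) {
        state.put(key, value);
        changelog.add(new SimpleEntry<String, Long>(key, value));
    }

    // Returns the current value for the key, or null if the key is unknown.
    Long get(String key) {
        return state.get(key);
    }
}
```

If the task holding the first store fails, a new store constructed over the same changelog starts with the same contents, which is the recovery path the figure suggests.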

95
• Ordering
• Partitioning & Scalability
• Fault tolerance
• Re-processing
Simple is Beautiful

96
But how to get data in / out of Kafka?

Take-aways
• Stream Processing: a new programming paradigm
101

Take-aways
• Kafka Streams: stream processing made easy
102

THANKS!
Guozhang Wang | guozhang@confluent.io | @guozhangwang
Visit Confluent at the Syncsort Booth (#1303), live demos @ 29th
Download Kafka Streams: www.confluent.io/product
