Architecture of a Kafka Camus infrastructure

© 2013 Impetus Technologies - Confidential
Kafka/Camus Project
Phase I
Mountain View, CA
March 2013
(photos courtesy of LinkedIn)

Agenda
• Objective
• What tool to use?
• Kafka & Camus overview
• Infrastructure
• Architecture
• Performance benchmarks

Objective
• Customer has events (data, UI) that happen in real time and need to be analyzed
• Immediate need for a batch-oriented mechanism
• Events need to be ETL'ed and analyzed in Hadoop
• Future need for more real-time stream
analysis
• Potential bursts of streaming data

What tool to use?
• JMS:
• just an API
• Not cross language
• Painful
• Doesn’t scale
• ActiveMQ
• Didn't work for LinkedIn:
• http://sites.computer.org/debull/A12june/pipeline.pdf
• Apache Flume

Kafka overview
• Distributed Scalable Pub/Sub system for big
data
• Producer -> Broker -> Consumer of message
topics
• Can have multiple clients consuming at
different velocities
(synchronous/asynchronous)
• Notion of consumer group to parallelize
consumption of messages
• Persists messages, so consumers can rewind and re-read them
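
The offset and rewind ideas above can be illustrated with a small in-memory model. This is only a sketch, not the Kafka client API: the `ToyTopic` class, its methods, and the group names are hypothetical, and it models a single partition. Each consumer group keeps its own position in a retained log, so groups can read at different speeds and any group can move its position back.

```python
from collections import defaultdict


class ToyTopic:
    """Hypothetical in-memory stand-in for one topic partition."""

    def __init__(self):
        self.log = []                    # retained messages; list index == offset
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, message):
        """Append a message and return the offset it was stored at."""
        self.log.append(message)
        return len(self.log) - 1

    def consume(self, group, max_messages=1):
        """Return up to max_messages for this group and advance its offset."""
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

    def rewind(self, group, offset):
        """Move a group's position back (or forward) to a retained offset."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.offsets[group] = offset


topic = ToyTopic()
for i in range(5):
    topic.produce(f"event-{i}")

fast = topic.consume("realtime", max_messages=5)   # reads everything at once
slow = topic.consume("batch", max_messages=2)      # reads at its own pace
topic.rewind("realtime", 3)                        # messages are still retained
replayed = topic.consume("realtime", max_messages=5)
```

Because the log is retained independently of who has read it, the "batch" group is unaffected when the "realtime" group rewinds.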

Camus overview
• Pipeline out of Kafka to HDFS
• Automatic discovery of topics and partitions
• Finds latest offsets from Kafka nodes
• Uses Avro by default; option to use your own
Decoder
• Allocates topic pulls among a fixed number of Hadoop job tasks
• Moves data files to HDFS directories according to timestamp
• Remembers the last offset per topic
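
Two of the steps above, spreading topic pulls over a fixed number of tasks and filing data by timestamp, can be sketched as follows. This is an illustration under assumptions, not Camus code: the round-robin strategy, the `base/topic/YYYY/MM/DD/HH` layout, the function names, and the topic names are all hypothetical, since the slides do not specify how Camus does either step.

```python
from datetime import datetime, timezone


def assign_partitions(partitions, num_tasks):
    """Round-robin (topic, partition) pairs over a fixed number of tasks.

    Round-robin is an assumption made for illustration only.
    """
    tasks = [[] for _ in range(num_tasks)]
    for i, topic_partition in enumerate(sorted(partitions)):
        tasks[i % num_tasks].append(topic_partition)
    return tasks


def hourly_dir(base, topic, epoch_ms):
    """Build an hourly output directory from a message timestamp (UTC).

    The directory layout is hypothetical.
    """
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"{base}/{topic}/{ts:%Y/%m/%d/%H}"


partitions = [("ui_events", 0), ("ui_events", 1), ("data_events", 0)]
tasks = assign_partitions(partitions, num_tasks=2)
path = hourly_dir("/data/camus", "ui_events", 1362139200000)
```

Bucketing by hour matches the hourly consumption schedule described later in the deck, but any time granularity would work the same way.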

Infrastructure
• Kafka 0.7.2
• 3 nodes
• Benchmark tool that takes message size, number of threads, number of messages, topic name, and data encoding
• CDH 4.2
• 1 NameNode (NN), 1 Secondary NameNode (SNN), 3 slaves for Hadoop
• Camus
• JSON or Avro decoder
• Zookeeper
• Hive

Infrastructure
• 8 Amazon EC2 large instances
• Dual core 2.0 GHz
• 1 7200 rpm SATA drive
• 8 GB memory
• 200-byte messages
• 1 producer, 1 consumer

Customer architecture
• Gaming, Shopping, Invite friends
• Kafka topic: Data events (e.g. user profile registrations)
• Kafka topic: UI events (e.g. game interaction)
• Consume topics via Camus every hour
• Use Hive to analyze the data

Performance summary
• Producer:
• Avg 20,000 messages / sec
• 3.81 MB per sec
• Consumer:
• 16,600 messages / sec
• 3.17 MB per sec (about 190 MB/min, or roughly 11 GB/hr)
• Customer Goal: “want to scale to 5000 events
per second at peak.”
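
The throughput figures can be checked from the 200-byte message size used in the benchmark. The sketch below assumes "MB" means 2^20 bytes, which is what reproduces the 3.81 and 3.17 figures; the function and variable names are illustrative. At 3.17 MB per second the consumer moves about 190 MB per minute, or roughly 11 GB per hour.

```python
MESSAGE_BYTES = 200       # message size from the benchmark setup
MIB = 1024 * 1024         # assumption: "MB" on the slide means 2**20 bytes


def mib_per_sec(messages_per_sec, message_bytes=MESSAGE_BYTES):
    """Convert a message rate into MiB per second."""
    return messages_per_sec * message_bytes / MIB


producer = mib_per_sec(20_000)                 # producer rate in MiB/s
consumer = mib_per_sec(16_600)                 # consumer rate in MiB/s
consumer_per_min = consumer * 60               # MiB per minute
consumer_gib_per_hour = consumer * 3600 / 1024
headroom = 16_600 / 5_000                      # measured rate vs. customer goal

print(f"producer: {producer:.2f} MiB/s")
print(f"consumer: {consumer:.2f} MiB/s, {consumer_per_min:.0f} MiB/min, "
      f"{consumer_gib_per_hour:.1f} GiB/hr")
print(f"headroom over 5,000 events/sec goal: {headroom:.2f}x")
```

On these numbers the measured consumer rate is a little over three times the customer's stated peak goal, on a single producer and a single consumer.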

Performance benchmark

| Input size | Data type | Storage size on HDFS (bytes) | Hive count (sec) | Hive max (sec) | Camus run time | Kafka |
|---|---|---|---|---|---|---|
| 500,000 records | JSON text data | 103779151 | 38.3 | 59 | 46 seconds | 34.2 |
| 500,000 records | JSON Serde | 103779151 | 46.3 | 48.2 | 46 seconds | 34.2 |
| 500,000 records | Avro data | 60962022 | 25.2 | 29.3 | 54 seconds | 15.9 |
| 1 Million records | JSON text data | 416556931 | 27.582 | 50.889 | 1 minute | 40.56 |
| 1 Million records | JSON Serde | 416556931 | 39.428 | 32.305 | | 40.56 |
| 1 Million records | Avro data | 122041553 | 35.806 | 26.328 | 1 minute | 22.36 |
| 7 Million records | JSON text data | 1456636071 | 57.895 | 111.598 | 3 minutes 50 seconds | 388 |
| 7 Million records | JSON Serde | 1456636071 | 83.225 | 83.776 | 3 minutes 50 seconds | 388 |
| 7 Million records | Avro data | 866962131 | 60.63 | 62.896 | 4 minutes 50 seconds | 181 |
| 10 Million records | JSON Serde | 1919381181 | 103.4 | 110 | 5 minutes 1 second | 558 |
| 10 Million records | Avro data | 1239446765 | 87.042 | 90.958 | 7 minutes 23 seconds | 230 |
| 15 Million records | JSON Serde | 3157886975 | 141.345 | 153.365 | | 851 |
| 20 Million records | JSON text data | | | | | 1159 |
| 20 Million records | JSON Serde | | | | | 1159 |
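
One way to read the storage column is as the ratio of Avro bytes to JSON bytes for the same record count. The sketch below uses the HDFS sizes reported for 500,000, 7 million, and 10 million records; the dictionary layout and function name are illustrative, and the 1 million row is left out only to keep the example short.

```python
# HDFS storage sizes in bytes, copied from the benchmark results.
storage_bytes = {
    "500K": {"json": 103779151, "avro": 60962022},
    "7M": {"json": 1456636071, "avro": 866962131},
    "10M": {"json": 1919381181, "avro": 1239446765},
}


def avro_fraction(sizes):
    """Avro size as a fraction of the JSON size for the same records."""
    return sizes["avro"] / sizes["json"]


ratios = {label: avro_fraction(sizes) for label, sizes in storage_bytes.items()}

for label, ratio in ratios.items():
    print(f"{label}: Avro is {ratio:.0%} of the JSON size")
```

For these three data sets Avro takes roughly 59% to 65% of the space that the JSON encoding does.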

Kafka Speed Performance benchmark

| Kafka | 500,000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
|---|---|---|---|---|---|---|
| JSON text data | 34.2 | 40.56 | 388 | 558 | 851 | 1159 |
| JSON Serde | 34.2 | 40.56 | 388 | 558 | 851 | 1159 |
| Avro data | 15.9 | 22.36 | 181 | 230 | 377 | 534 |

Chart: Kafka comparison (JSON text data, JSON Serde, Avro data)

Camus Speed Performance benchmark

| Camus (sec) | 500,000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
|---|---|---|---|---|---|---|
| JSON text data | 46 | 60 | 230 | 301 | 384 | |
| JSON Serde | 46 | 60 | 230 | 301 | 384 | |
| Avro data | 54 | 85 | 290 | 443 | 506 | 662 |

Chart: Camus comparison

Count Speed Performance

| Count (sec) | 500,000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
|---|---|---|---|---|---|---|
| JSON text data | 38.3 | 27.58 | 57.89 | 78.337 | 107.325 | |
| JSON Serde | 46.3 | 39.42 | 83.2 | 103.4 | 141.345 | |
| Avro data | 25.2 | 35.8 | 60.6 | 87.042 | 96.9 | 133.606 |

Chart: Select Count(*) comparison

Max Speed Performance

| Max (sec) | 500,000 records | 1 Million records | 7 Million records | 10 Million records | 15 Million records | 20 Million records |
|---|---|---|---|---|---|---|
| JSON text data | 59 | 50.889 | 111.598 | 144.667 | 201.125 | |
| JSON Serde | 48.2 | 32.305 | 83.776 | 110 | 153.365 | |
| Avro data | 29.3 | 26.328 | 62.896 | 90.958 | 98.9 | 153.464 |

Chart: Max(field) comparison

Q&A
Thank You
