Building a company-wide data pipeline on Apache Kafka - engineering for 150 billion messages per day
1. Building a company-wide data
pipeline upon Apache Kafka -
engineering for 150 billion
messages per day
Yuto Kawamura
LINE Corp
2. Speaker introduction
• Yuto Kawamura
• Senior software engineer, LINE server development
• Works at the Tokyo office
• Apache Kafka contributor
• Joined: April 2015 (about 3 years ago)
3. About LINE
•Messaging service
•Over 200 million global monthly active users [1] in countries with top market share like Japan, Taiwan and Thailand
•Many family services
•News
•Music
•LIVE (Video streaming)
[1] As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia.
5. LINE Server Engineering is
about …
• Scalability
• Many users, many requests, a lot of data
• Reliability
• LINE is already a communication infrastructure in these countries
9. LEGY
• LINE Event Delivery Gateway
• API Gateway/Reverse Proxy
• Written in Erlang
• Features focused on the needs of implementing a messaging service
• e.g., zero-latency hot code swapping without closing client connections
10. talk-server
• Java-based web application server
• Implements most of messaging functionality + some other
features
• Java8 + Spring + Thrift RPC + Tomcat8
12. Message Delivery
[Diagram: Alice → LEGY → talk-server → Storage; talk-server → LEGY → Bob]
• Step 1: Find nearest LEGY
• Step 2: sendMessage("Bob", "Hello!")
• Step 3: Proxy request
• Step 4: Write to storage
• Step 5: Find the LEGY Bob is connected to, notify message arrival
• Step X: fetchOps()
• Step 6: Proxy request
• Step 7: Read message
• Step 8: Return fetchOps() with message
13. There is a lot of internal communication when processing a user's request
[Diagram: a Request reaches talk-server, which communicates with the Threat detection system, Timeline Server, Data Analysis, and Background Task processing]
14. Communication between
internal systems
• Communication for querying, transactional
updates:
• Query authentication/permission
• Synchronous updates
• Communication for data synchronization, update
notification:
• Notify other systems of a user's relationship update
• Synchronize a data update with another service
[Diagram: talk-server communicates with Auth, Analytics, and Another Service over HTTP/REST/RPC]
15. Apache Kafka
• A distributed streaming platform
• (in a narrow sense) A distributed persistent message queue that supports the Pub-Sub model
• Built-in load distribution
• Built-in fail-over on both server (broker) and client
16. How it works
[Diagram: Producers write to Topics on the Brokers; Consumers read from those Topics]
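import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer side: build an event and send it to the "events" topic, keyed by user ID.
long userId = 123L;
AuthEvent event = AuthEvent.newBuilder()
    .setUserId(userId)
    .setEventType(AuthEventType.REGISTER)
    .build();
producer.send(new ProducerRecord<>("events", userId, event));

// Consumer side: join consumer group "group-A" (bootstrap.servers and deserializers omitted).
Properties props = new Properties();
props.put("group.id", "group-A");
KafkaConsumer<Long, AuthEvent> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("events"));
ConsumerRecords<Long, AuthEvent> records = consumer.poll(100);
// => Record(key=123, value=...)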
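The producer example passes userId as the record key, and the consumer joins a group. As a minimal sketch of why both matter (this is not Kafka's actual implementation: the real default partitioner hashes the serialized key with murmur2, and the class and method names below are hypothetical), records with the same key always map to the same partition, and the partitions of a topic are divided among the consumers of one group:

```java
import java.util.ArrayList;
import java.util.List;

class PartitioningSketch {
    // Hypothetical stand-in for a partitioner: the same key always yields the same partition.
    static int partitionFor(long key, int numPartitions) {
        return Math.floorMod(Long.hashCode(key), numPartitions);
    }

    // Hypothetical round-robin assignment of partitions to the members of one consumer group.
    static List<List<Integer>> assign(int numPartitions, int numConsumers) {
        List<List<Integer>> assignment = new ArrayList<>();
        for (int c = 0; c < numConsumers; c++) {
            assignment.add(new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(p % numConsumers).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(123L, 8)); // partition for key 123 with 8 partitions
        System.out.println(assign(8, 3));          // 8 partitions spread over 3 consumers
    }
}
```

Adding a consumer to the group changes the assignment so that each member handles fewer partitions, which is the load distribution the earlier slide refers to.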
19. Scale metric: Events
produced into Kafka
[Diagram: many services producing events into Kafka]
150 billion
msgs / day
(3 million msgs / sec)
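As a quick sanity check on the two figures (a worked calculation, not from the slides), spreading 150 billion messages evenly over one day gives:

```latex
\frac{150 \times 10^{9}\ \text{msgs}}{86{,}400\ \text{s}} \approx 1.74 \times 10^{6}\ \text{msgs/sec}
```

So the 3 million msgs / sec figure presumably refers to a peak rate rather than the daily average.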
20. Our Kafka needs to be highly performant
• Some use cases are sensitive to delivery latency
• Broker latency impacts throughput as well
• because a Kafka topic is a queue
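To illustrate why broker latency limits throughput: a producer keeps only a bounded number of requests in flight per broker connection (the `max.in.flight.requests.per.connection` producer setting, default 5), so one connection can complete at most in-flight / latency requests per second. A rough sketch, with a hypothetical class name and example numbers that are not measurements from LINE:

```java
class ThroughputBound {
    // Upper bound on completed requests per second for one connection,
    // given a cap on in-flight requests and the per-request latency.
    static double maxRequestsPerSecond(int maxInFlight, double latencyMs) {
        return maxInFlight * 1000.0 / latencyMs;
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(5, 10.0));  // 500.0
        System.out.println(maxRequestsPerSecond(5, 200.0)); // 25.0
    }
}
```

Under these assumptions, a latency increase from 10 ms to 200 ms cuts the achievable request rate of a connection by 20x, and records queue up behind the slow requests.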
21. … wasn’t a built-in property
• KAFKA-4614 Long GC pause harming broker performance
which is caused by mmap objects created for OffsetIndex
• 99th %ile latency of Produce request: 150 ~ 200ms => 10ms
(x15 ~ x20 faster)
• KAFKA-6051 ReplicaFetcherThread should close the
ReplicaFetcherBlockingSend earlier on shutdown
• Eliminated ~x1000 slower responses while restarting a broker
• (not yet published) Kafka broker performance degradation when a consumer requests to fetch old data
• x10 ~ x15 speedup for 99th %ile response
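The improvements above are stated as 99th percentile latencies. As a small illustration of that metric (a sketch using the nearest-rank definition, with made-up sample values and a hypothetical class name; the slides do not say how LINE computes it), a few slow requests dominate the 99th percentile even when the median is unaffected:

```java
import java.util.Arrays;

class LatencyPercentile {
    // Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // 98 fast requests (5 ms) and 2 slow ones, e.g. requests stalled by a long GC pause.
        double[] latenciesMs = new double[100];
        Arrays.fill(latenciesMs, 5.0);
        latenciesMs[98] = 180.0;
        latenciesMs[99] = 200.0;
        System.out.println(percentile(latenciesMs, 50)); // 5.0
        System.out.println(percentile(latenciesMs, 99)); // 180.0
    }
}
```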
22. Performance Engineering
Kafka
• Application Level:
• Read and understand code
• Patch it to eliminate
bottleneck
• JVM Level:
• JVM profiling
• GC log analysis
• JVM parameter tuning
• OS Level:
• Linux perf
• Delay Accounting
• SystemTap
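The slides do not show how these measurements were taken. As one small illustration of JVM-level observation (an assumption about approach, not LINE's tooling), the standard `java.lang.management` API exposes per-collector GC counts and accumulated collection time, which can serve as a first check before digging into GC logs. The class name below is hypothetical:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcStats {
    // Sum of the accumulated collection time (ms) reported by all collectors in this JVM.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                + ": count=" + gc.getCollectionCount()
                + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC time ms: " + totalGcTimeMs());
    }
}
```

The reported time is an accumulated elapsed time per collector and is only an approximation of stop-the-world pauses, so GC logs remain the more precise source for pause analysis.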
26. More interested?
• Kafka Summit SF 2017
• One Day, One Data Hub, 100 Billion Messages: Kafka at LINE
• https://youtu.be/X1zwbmLYPZg
• Google “kafka summit line”
27. Summary
• Large scale + high reliability = difficult and exciting engineering!
• LINE's architecture will keep evolving with OSS
• … and there are more challenges
• Multi-IDC deployment
• More and more performance and reliability improvements