Yuto Kawamura
Building a company-wide data pipeline on Apache Kafka - engineering for 150 billion messages per day
Summary:
LINE is a messaging service with 200 million monthly active users. Our service architecture evolves daily, with many collaborating components. I'll give an overview of LINE's messaging service architecture, focusing mainly on our company-wide data pipeline infrastructure built on Apache Kafka, which accepts more than 150 billion messages every day, making it one of the largest in the world. In this talk I will explain how we handle that scale while keeping the pipeline reliable enough to serve as infrastructure for building services.
1. Building a company-wide data pipeline on Apache Kafka - engineering for 150 billion messages per day
Yuto Kawamura
LINE Corp
2. Speaker introduction
• Yuto Kawamura
• Senior software engineer of LINE server development
• Apache Kafka contributor
• Joined: Apr, 2015 (about 3 years)
3. About LINE
• Messaging service
• More than 200 million active users [1] in countries with top market share like Japan, Taiwan and Thailand
• Many family services
• News
• Music
• LIVE (Video streaming)
[1] As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia.
5. LINE Server Engineering is about …
• Scalability
• Many users, many requests, much data
• Reliability
• LINE is already communication infrastructure in several countries
9. LEGY
• LINE Event Delivery Gateway
• API Gateway/Reverse Proxy
• Written in Erlang
• Deployed to many data centers all over the world
• Features focused on needs of implementing a messaging service
• Zero latency code hot swapping w/o closing client connections
• Durability thanks to Erlang process and message passing
• A single instance holds 100K+ connections => a single instance failure has a huge impact
10. talk-server
• Java based web application server
• Implements most of the messaging functionality + some other features
• Java8 + Spring + Thrift RPC + Tomcat8
11. Datastore with Redis and HBase
• LINE’s hybrid datastore = Redis (in-memory DB, home-brew clustering) + HBase (persistent distributed key-value store)
• Cascading failure handling
• Async write in app
• Async write from background task processor
• Data correction batch
[Figure: talk-server performs a dual write to two stores, one labeled "Cache/Primary" and the other "Primary/Backup"]
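The write path listed on this slide could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not LINE's implementation: the class `DualWriteStore`, the two in-memory maps standing in for Redis and HBase, the `retryQueue` standing in for the background task processor, and the choice of which store is written synchronously are all hypothetical. The data correction batch is not modeled.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: in-memory maps stand in for the two datastores.
class DualWriteStore {
    final Map<String, String> cacheStore = new ConcurrentHashMap<>();      // stand-in for Redis
    final Map<String, String> persistentStore = new ConcurrentHashMap<>(); // stand-in for HBase
    final Queue<String[]> retryQueue = new ConcurrentLinkedQueue<>();      // stand-in for the background task processor
    volatile boolean persistentStoreHealthy = true;                        // toggled to simulate an outage

    // Write synchronously to the first store and asynchronously to the second.
    CompletableFuture<Void> put(String key, String value) {
        cacheStore.put(key, value);
        return CompletableFuture.runAsync(() -> writePersistent(key, value))
                // If the async write fails, queue it for background retry
                // instead of failing the caller's request.
                .exceptionally(e -> {
                    retryQueue.add(new String[] {key, value});
                    return null;
                });
    }

    private void writePersistent(String key, String value) {
        if (!persistentStoreHealthy) {
            throw new IllegalStateException("persistent store unavailable");
        }
        persistentStore.put(key, value);
    }

    // Background task: replay queued writes while the store is reachable.
    int drainRetries() {
        int replayed = 0;
        String[] kv;
        while (persistentStoreHealthy && (kv = retryQueue.poll()) != null) {
            persistentStore.put(kv[0], kv[1]);
            replayed++;
        }
        return replayed;
    }
}
```

In this sketch an outage of the second store slows nothing down on the request path: the failed write is parked in the queue and replayed later, which is one way to keep a failure in one store from cascading into the application.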
13. There is a lot of internal communication when processing a user’s request
[Figure: a Request arrives at talk-server, which communicates with the Threat detection system, Timeline Server, Data Analysis, and Background Task processing]
14. Communication between internal systems
• Communication for querying, transactional updates:
• Query authentication/permission
• Synchronous updates
• Communication for data synchronization, update notification:
• Notify user’s relationship update
• Synchronize data update with another service
[Figure: talk-server communicates with Auth, Analytics, and Another Service over HTTP/REST/RPC]
15. Apache Kafka
• A distributed streaming platform
• (narrow sense) A distributed persistent message queue which supports the Pub-Sub model
• Built-in load distribution
• Built-in fail-over on both server (broker) and client
16. How it works
[Figure: Producers write to Topics hosted on Brokers; multiple Consumers read from the Topics]
// Producer: publish an event to the "events" topic, keyed by user ID
AuthEvent event = AuthEvent.newBuilder()
    .setUserId(123)
    .setEventType(AuthEventType.REGISTER)
    .build();
producer.send(new ProducerRecord("events", userId, event));

// Consumer (pseudocode): join consumer group "group-A" and read from "events"
consumer = new KafkaConsumer("group.id" -> "group-A");
consumer.subscribe("events");
consumer.poll(100)…
// => Record(key=123, value=...)
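The "persistent queue with Pub-Sub" idea can be illustrated with a toy in-memory model. This is a sketch only and not how Kafka is implemented: `ToyTopic` and its methods are hypothetical names, there is a single partition and no broker, and nothing is written to disk. What it does show is that records stay in the log after being read and that each consumer group tracks its own read position, so every group sees every record.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: a single-partition, in-memory "topic".
class ToyTopic {
    private final List<String> log = new ArrayList<>();                // append-only record log
    private final Map<String, Integer> groupOffsets = new HashMap<>(); // next offset to read, per consumer group

    // Producer side: append a record and return its offset.
    synchronized int send(String record) {
        log.add(record);
        return log.size() - 1;
    }

    // Consumer side: return up to maxRecords unread records for this group
    // and advance only that group's offset. Records are never removed.
    synchronized List<String> poll(String groupId, int maxRecords) {
        int from = groupOffsets.getOrDefault(groupId, 0);
        int to = Math.min(log.size(), from + maxRecords);
        groupOffsets.put(groupId, to);
        return new ArrayList<>(log.subList(from, to));
    }
}
```

Polling as "group-A" and then as "group-B" returns the same records to both, because reading does not consume them. Real Kafka additionally splits a topic into partitions and spreads them over brokers and over the consumers within a group, which this sketch leaves out.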
19. Scale metric: Events produced into Kafka
[Figure: many Services producing events into Kafka]
150 billion msgs / day
(3 million msgs / sec)
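As a quick sanity check on these two figures, the daily total can be converted to an average per-second rate. The class and method names below are hypothetical. The average comes out lower than 3 million msgs / sec, so the per-second figure on the slide presumably describes peak traffic rather than the average; the slides do not say which.

```java
// Back-of-the-envelope check of the slide's numbers.
class ScaleMath {
    static final long MESSAGES_PER_DAY = 150_000_000_000L; // from the slide
    static final long SECONDS_PER_DAY = 24L * 60 * 60;      // 86,400

    // Average rate if traffic were spread evenly over the whole day.
    // 150,000,000,000 / 86,400 = 1,736,111 (integer division).
    static long averagePerSecond() {
        return MESSAGES_PER_DAY / SECONDS_PER_DAY;
    }
}
```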
20. Our Kafka needs to be highly performant
• Some use cases are sensitive to delivery latency
• Broker latency impacts throughput as well
• because a Kafka topic is a queue
21. … wasn’t a built-in property
• KAFKA-4614 Long GC pause harming broker performance which is caused by mmap objects created for OffsetIndex
• // TODO fill-in
22. Performance Engineering Kafka
• Application Level:
• Read and understand code
• Patch it to eliminate bottlenecks
• JVM Level:
• JVM profiling
• GC log analysis
• JVM parameters tuning
• OS Level:
• Linux perf
• Delay Accounting
• SystemTap
23. e.g., Investigating slow sendfile(2)
• Observe the sendfile syscall’s duration
• => found that sendfile was blocking Kafka’s event loop
• => patched Kafka to eliminate the blocking sendfile
stap -e '
...
probe syscall.sendfile {
  d[tid()] = gettimeofday_us()
}
probe syscall.sendfile.return {
  if (d[tid()]) {
    st <<< gettimeofday_us() - d[tid()]
    delete d[tid()]
  }
}
probe end {
  print(@hist_log(st))
}
'
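The script above records each sendfile call's duration in microseconds and prints a logarithmic histogram at exit. The bucketing idea can be sketched in Java as below. This is an illustration of power-of-two bucketing in general; `Log2Histogram` is a hypothetical class and SystemTap's exact bucket layout may differ in detail.

```java
// Hypothetical sketch of power-of-two bucketing, similar in spirit to @hist_log.
class Log2Histogram {
    // counts[i] holds values in [2^i, 2^(i+1)); values below 1 are counted in bucket 0.
    private final long[] counts = new long[64];

    // Bucket index = floor(log2(value)) for value >= 1.
    static int bucketOf(long micros) {
        if (micros < 1) {
            return 0;
        }
        return 63 - Long.numberOfLeadingZeros(micros);
    }

    // Add one observed duration to the histogram.
    void record(long micros) {
        counts[bucketOf(micros)]++;
    }

    long countInBucket(int bucket) {
        return counts[bucket];
    }
}
```

Because bucket widths double each step, a handful of calls that take milliseconds stand out clearly from the bulk that take microseconds, which is the kind of tail a blocking sendfile would produce.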
25. More interested?
• Kafka Summit SF 2017
• One Day, One Data Hub, 100 Billion Messages: Kafka at LINE
• https://youtu.be/X1zwbmLYPZg
• Google “kafka summit line”
26. Summary
• Large scale + high reliability = difficult and exciting engineering!
• LINE’s architecture will keep evolving with OSS
• … and there are more challenges
• Multi-IDC deployment
• More and more performance and reliability improvements