Building a company-wide data
pipeline upon Apache Kafka -
engineering for 150 billion
messages per day
Yuto Kawamura

LINE Corp
Speaker introduction
• Yuto Kawamura

• Senior software engineer of
LINE server development

• Work at Tokyo office

• Apache Kafka contributor

• Joined: Apr 2015 (about 3 years ago)
About LINE
•Messaging service 

•Over 200 million global monthly active users¹ in countries with top market share like Japan, Taiwan and Thailand

•Many family services

•News 

•Music

•LIVE (Video streaming) 

¹ As of June 2017. Sum of 4 countries: Japan, Taiwan, Thailand and Indonesia. 

Agenda
• Introducing LINE server

• Data pipeline w/ Apache Kafka
LINE Server Engineering is
about …
• Scalability

• Many users, many requests, a lot of data

• Reliability

• LINE is already a communication infrastructure in many countries

Scale metrics: message
delivery
LINE Server
25 billion / day
(API call: 80 billion / day)
Scale metric: Accumulated
data (for analysis)
40PB
Messaging System
Architecture Overview
LINE Apps
LEGY JP
LEGY DE
LEGY SG
Thrift RPC/HTTP
talk-server
Distributed Data Store
Distributed async
task processing
LEGY
• LINE Event Delivery Gateway

• API Gateway/Reverse Proxy

• Written in Erlang

• Features focused on needs of implementing a messaging
service

• e.g., Zero-latency code hot swapping w/o closing client
connections
talk-server
• Java-based web application server

• Implements most of messaging functionality + some other
features

• Java8 + Spring + Thrift RPC + Tomcat8
Datastore with Redis and
HBase
• LINE’s hybrid datastore = Redis (in-memory DB, home-brew clustering) + HBase (persistent distributed key-value store)

• Cascading failure handling

• Async write from background
task processor

• Data correction batch
Primary/
Backup
talk-server
Cache/
Primary
Dual write
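The dual-write pattern on this slide can be sketched as follows. This is a hypothetical illustration only: the Redis and HBase clients are stood in for by plain in-memory maps, and `DualWriteStore` and all names are invented for the example, not LINE's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the slide's dual-write pattern:
// Redis (cache/primary) and HBase (persistent backup) replaced by maps.
public class DualWriteStore {
    final Map<String, String> redis = new HashMap<>(); // cache / primary
    final Map<String, String> hbase = new HashMap<>(); // persistent backup

    // A write goes to both stores ("dual write").
    void put(String key, String value) {
        redis.put(key, value);
        hbase.put(key, value);
    }

    // A read hits the cache first and falls back to the backup store,
    // repopulating the cache on a miss.
    String get(String key) {
        String v = redis.get(key);
        if (v == null) {
            v = hbase.get(key);
            if (v != null) redis.put(key, v);
        }
        return v;
    }
}
```

Per the slide, the real system additionally relies on async writes from a background task processor and a data correction batch to reconcile the two stores when a dual write partially fails.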
Message Delivery
LEGY
LEGY
talk-server
Storage
1. Find nearest LEGY
2. sendMessage("Bob", "Hello!")
3. Proxy request
4. Write to storage
5. Find the LEGY Bob is connected to, notify message arrival
6. Proxy request
7. Read message
8. Return fetchOps() with message
X. fetchOps() (from Bob's client)
Alice
Bob
There’s a lot of internal communication involved
in processing a user’s request
talk-server
Threat
detection
system
Timeline
Server
Data Analysis
Background
Task
processing
Request
Communication between
internal systems
• Communication for querying, transactional
updates:

• Query authentication/permission

• Synchronous updates
• Communication for data synchronization, update
notification:

• Notify user’s relationship update

• Synchronize data update with another service
talk-server
Auth
Analytics
Another
Service
HTTP/REST/RPC
Apache Kafka
• A distributed streaming platform

• (narrow sense) A distributed persistent message queue
which supports Pub-Sub model

• Built-in load distribution

• Built-in fail-over on both server (broker) and client
How it works
Producer
Brokers
Consumer
Topic
Topic
Consumer
Consumer
Producer
AuthEvent event = AuthEvent.newBuilder()
    .setUserId(123)
    .setEventType(AuthEventType.REGISTER)
    .build();
producer.send(new ProducerRecord<>("events", userId, event));

consumer = new KafkaConsumer<>(Map.of("group.id", "group-A"));
consumer.subscribe(List.of("events"));
consumer.poll(100); // => Record(key=123, value=...)
Consumer GroupA
Pub-Sub
Brokers
Consumer
Topic
Topic
Consumer
Consumer GroupB
Consumer
Consumer
Records[A, B, C…]
Records[A, B, C…]
• Multiple consumer “groups” can
independently consume a single topic
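The group semantics can be illustrated with a toy in-memory model (hypothetical code, not Kafka's API): each consumer group keeps its own offset into the topic log, so group A's consumption never affects what group B sees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of Kafka pub-sub: one log, per-group offsets.
// Illustrative only; real Kafka partitions the log across brokers.
public class MiniTopic {
    final List<String> log = new ArrayList<>();
    final Map<String, Integer> groupOffsets = new HashMap<>();

    void produce(String record) {
        log.add(record);
    }

    // Return every record this group has not yet seen, then advance
    // the group's own offset; other groups are unaffected.
    List<String> poll(String groupId) {
        int offset = groupOffsets.getOrDefault(groupId, 0);
        List<String> records = new ArrayList<>(log.subList(offset, log.size()));
        groupOffsets.put(groupId, log.size());
        return records;
    }
}
```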
Example: UserActivityEvent
Scale metric: Events
produced into Kafka
Service Service
Service
Service
Service
Service
150 billion
msgs / day
(3 million msgs / sec)
Our Kafka needs to be
highly performant
• Use cases are sensitive to delivery latency

• Broker latency impacts throughput as well

• because a Kafka topic is a queue
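A back-of-envelope illustration (hypothetical numbers) of why broker latency bounds throughput: a partition is consumed in order, so a consumer effectively issues one fetch at a time, and per-fetch latency caps how many records per second it can drain.

```java
// Hypothetical model: one consumer, one partition, synchronous fetches.
// Throughput is bounded by batch size divided by per-fetch latency.
public class QueueThroughput {
    // Max records/sec when each fetch returns `batchSize` records
    // and the broker takes `latencyMs` to answer.
    static double maxRecordsPerSec(int batchSize, double latencyMs) {
        return batchSize * (1000.0 / latencyMs);
    }
}
```

For example, 1000-record fetches at 10ms per request allow at most 100,000 records/s from one partition, while at 150ms the same loop tops out 15x lower — so shaving broker latency raises throughput as well.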
… wasn’t a built-in property
• KAFKA-4614 Long GC pause harming broker performance
which is caused by mmap objects created for OffsetIndex

• 99th %ile latency of Produce request: 150 ~ 200ms => 10ms
(x15 ~ x20 faster)

• KAFKA-6051 ReplicaFetcherThread should close the
ReplicaFetcherBlockingSend earlier on shutdown

• Eliminated ~x1000 slower responses during broker restarts

• (not yet published) Kafka broker performance degradation when
consumers fetch old data

• x10 ~ x15 speedup for 99th %ile response
Performance Engineering
Kafka
• Application Level:

• Read and understand code

• Patch it to eliminate
bottleneck

• JVM Level:

• JVM profiling

• GC log analysis

• JVM parameters tuning
• OS Level:

• Linux perf

• Delay Accounting

• SystemTap
e.g., Investigating slow
sendfile(2)
• SystemTap: A kernel dynamic tracing tool

• Inject script to probe in-kernel behavior
stap -e '
...
probe syscall.sendfile {
  d[tid()] = gettimeofday_us()            # record entry time for this thread
}
probe syscall.sendfile.return {
  if (d[tid()]) {
    st <<< gettimeofday_us() - d[tid()]   # accumulate latency in microseconds
    delete d[tid()]
  }
}
probe end {
  print(@hist_log(st))                    # dump a log-scale latency histogram
}
'
e.g., Investigating slow
sendfile(2)
• Found that slow sendfile is blocking Kafka’s event-loop

• => patch Kafka to eliminate blocking sendfile
stap -e '...'
value |---------------------------------------- count
0 | 0
1 | 71
2 |@@@ 6171
16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 29472
32 |@@@ 3418
2048 | 0
4096 | 1
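One way to keep sendfile from blocking the event loop — a hedged sketch of the general idea, not the actual patch — is to pre-read the requested log-segment range into the page cache on a request-handler thread (here by transferring it to a discarding sink), so the later zero-copy sendfile on the event loop finds the pages resident and returns without waiting on disk. All names here are invented for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: warm the page cache for a file range by
// copying it to a sink that discards every byte.
public class SendfileWarmup {
    // Sink that counts and discards all bytes written to it.
    static final class NullChannel implements WritableByteChannel {
        public int write(ByteBuffer src) {
            int n = src.remaining();
            src.position(src.limit()); // consume without copying anywhere
            return n;
        }
        public boolean isOpen() { return true; }
        public void close() {}
    }

    // Read `count` bytes starting at `offset` through the page cache;
    // returns how many bytes were actually available.
    static long warm(Path segment, long offset, long count) {
        try (FileChannel ch = FileChannel.open(segment, StandardOpenOption.READ)) {
            NullChannel sink = new NullChannel();
            long done = 0;
            while (done < count) {
                long n = ch.transferTo(offset + done, count - done, sink);
                if (n <= 0) break; // reached end of file
                done += n;
            }
            return done;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```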
and we contributed it back
More interested?
• Kafka Summit SF 2017

• One Day, One Data Hub, 100
Billion Messages: Kafka at
LINE

• https://youtu.be/X1zwbmLYPZg

• Google “kafka summit line”
Summary
• Large scale + high reliability = difficult and exciting
engineering!

• LINE’s architecture will keep evolving with OSS

• … and there are more challenges

• Multi-IDC deployment

• More performance and reliability improvements
End of presentation.
Any questions?
