Data processing at LinkedIn with Apache Kafka
Jeff Weiner
Chief Executive Officer
Joel Koshy
Sr. Staff Software Engineer
Kartik Paramasivam
Director, Software Engineering
Outline
Kafka growth at LinkedIn
Canonical use cases
Search, analytics and storage platforms
Data pipelines
Stream processing
Conclusion
Q&A
Kafka at LinkedIn over the years
Canonical use cases
Data movement
• Tracking: who did what, when?
• Metrics/logs: monitoring and alerting
• Queuing: ad hoc messaging
• Data deployment: offline → online bridge
Search, analytics and storage platforms
Distributed near real-time OLAP datastore with SQL query interface
Pinot
• 100B documents
• 1B documents ingested per day
• 100M queries per day
• 10’s of ms latency
Pinot
SELECT weeksSinceEpochSunday,
distinctCount(viewerId)
FROM profileViewEvents
WHERE vieweeId=myMID
AND daysSinceEpoch >= 16624
AND daysSinceEpoch <= 16714
GROUP BY weeksSinceEpochSunday
TOP 20
(Galene)
Search-as-a-service
• People search
• Job search
• Internal code search
• … and more
Galene
• Base index generated weekly (offline)
• Live updater pulls from Kafka and Brooklin (DB changes)
• Periodically combine the incremental snapshot and the live update buffer
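The live-update path can be pictured as a small loop that tails the Kafka/Brooklin feed into an in-memory buffer and periodically folds that buffer into the latest incremental snapshot. A minimal sketch, assuming a plain Kafka consumer; LiveUpdateBuffer, IndexSnapshot and the "index-updates" topic are hypothetical stand-ins for Galene's internals, not its actual code:

```java
// Illustrative only: LiveUpdateBuffer and IndexSnapshot are hypothetical
// stand-ins for Galene's live-update buffer and incremental snapshot.
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LiveUpdaterSketch {

  /** Hypothetical in-memory buffer of pending index updates. */
  static class LiveUpdateBuffer {
    private final Map<String, String> updates = new HashMap<>();
    void add(String docId, String doc) { updates.put(docId, doc); }
    int size() { return updates.size(); }
    Map<String, String> drain() {
      Map<String, String> out = new HashMap<>(updates);
      updates.clear();
      return out;
    }
  }

  /** Hypothetical incremental snapshot that the buffer is folded into. */
  static class IndexSnapshot {
    void merge(Map<String, String> updates) { /* merge/rebuild index segments here */ }
  }

  public static void main(String[] args) {
    Map<String, Object> cfg = new HashMap<>();
    cfg.put("bootstrap.servers", "localhost:9092");
    cfg.put("group.id", "galene-live-updater");
    cfg.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    cfg.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    LiveUpdateBuffer buffer = new LiveUpdateBuffer();
    IndexSnapshot snapshot = new IndexSnapshot();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cfg)) {
      consumer.subscribe(Collections.singletonList("index-updates")); // hypothetical topic
      while (true) {
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
          buffer.add(record.key(), record.value());   // tail live updates into the buffer
        }
        if (buffer.size() > 10_000) {                  // the "periodically combine" step
          snapshot.merge(buffer.drain());
        }
      }
    }
  }
}
```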
Distributed replicated NoSQL store (Espresso)
[Architecture diagram: HTTP clients send requests to routers, which consult a routing table to reach storage nodes (API server + MySQL); Apache Helix and ZooKeeper handle the control path]
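To make the data path concrete, here is a hypothetical sketch of what a router does with a request: hash the key to a partition, look up that partition's current master in the routing table, and forward the call over HTTP. The class, partition count, port and URL layout are illustrative, not Espresso's actual code:

```java
// Hypothetical sketch of the router's data path; not Espresso's actual code.
import java.util.HashMap;
import java.util.Map;

public class RouterSketch {
  private static final int NUM_PARTITIONS = 12;                      // illustrative partition count
  private final Map<Integer, String> routingTable = new HashMap<>(); // partition -> current master node

  RouterSketch() {
    // Placeholder entries; in practice this view is maintained by Apache Helix.
    for (int p = 0; p < NUM_PARTITIONS; p++) {
      routingTable.put(p, "storage-node-" + (p % 3 + 1) + ":8080");
    }
  }

  /** Map a record key to the API server of the partition's current master. */
  String route(String key) {
    int partition = Math.abs(key.hashCode()) % NUM_PARTITIONS;
    String masterNode = routingTable.get(partition);
    // Forward the HTTP request, e.g. GET http://<masterNode>/<db>/<table>/<key>
    return "http://" + masterNode + "/MemberDB/Profiles/" + key;
  }

  public static void main(String[] args) {
    System.out.println(new RouterSketch().route("member:12345"));
  }
}
```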
Distributed replicated NoSQL store
• Member profiles
• InMail
• Ad platforms
• Invites, endorsements, etc.
Espresso replication (before)
• MySQL (per-instance) replication
• Partitions unnecessarily share fate
• Poor resource utilization
• Cluster expansions are hard
[Diagram: two three-node groups, each node hosting partitions P1–P3 or P4–P6 as master, slave, or offline replicas, replicated per MySQL instance]
Espresso 1.0 Kafka-based replication
[Diagram: Helix metadata (LIVEINSTANCES, EXTERNALVIEW, e.g. P4 → master: Node 1, slave: Node 3) drives partition placement; partitions P1–P12 are spread across Nodes 1–3 as masters and slaves, with replication flowing through Kafka]
Espresso replication how-to

Durability
• RF = 3
• min.insync.replicas = 2
• Disable unclean leader election
• Rack awareness
• acks = "all"
• block.on.buffer.full = true
• retries = Integer.MAX_VALUE

Low latency
• Bump up num.replica.fetchers
• max.block.ms = 0
• Reduce linger.ms
• Use LZ4 compression

Ordering
• max.in.flight.requests.per.connection = 1
• close(0) in the producer callback on send failure

Nice-to-haves
• Large message support
• JBOD (RF = 3 is costly with RAID-10)
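Translated into producer configuration, the client-side knobs above look roughly like the sketch below, using the standard Java producer. Broker/topic settings (min.insync.replicas, unclean leader election, num.replica.fetchers, rack awareness) are set on the cluster, and block.on.buffer.full belongs to the older producer of that era; the broker address is a placeholder.

```java
// Sketch of a producer configured along the lines of the durability,
// latency and ordering settings above (client-side knobs only).
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class EspressoStyleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

    // Durability: wait for all in-sync replicas and retry indefinitely
    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);

    // Ordering: only one in-flight request so retries cannot reorder writes
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 1);

    // Low latency: minimal linger plus cheap compression
    props.put(ProducerConfig.LINGER_MS_CONFIG, 0);
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

    try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
      // send() with a callback; on failure, the recipe above calls for close(0)
      // so buffered records cannot be redelivered out of order.
    }
  }
}
```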
Data pipelines
Tee'ing change-capture from the replication stream
Streaming data pipeline
Brooklin
Continuous data movement between various sources and destinations
Brooklin architecture
Brooklin client options (at LinkedIn)
Stream processing
Stream processing technologies
Yes, it is crowded!
Distributed stream processing framework
Samza
• Top-level Apache project since 2014
• In use at LinkedIn, Uber, Metamarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely, etc.
• Increase in production usage at LinkedIn – from ~20 to ~350 applications in two years
Stateless processing – message in, message out
• Schema translation
• Data transformation (e.g., ID obfuscation)
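As a concrete (hypothetical) illustration of the message-in/message-out pattern, a Samza StreamTask that obfuscates the member ID before re-emitting the event could look like this; the output stream name, event shape, and hashing scheme are made up for the example:

```java
// Minimal sketch of a stateless message-in/message-out task: read an event,
// obfuscate its member ID, emit it downstream. Stream names are illustrative.
import java.util.Map;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class IdObfuscationTask implements StreamTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "page-views-obfuscated");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Map<String, Object> event = (Map<String, Object>) envelope.getMessage();
    Object memberId = event.get("memberId");
    // Replace the raw ID with a one-way hash (toy scheme, for illustration only)
    event.put("memberId", Integer.toHexString(String.valueOf(memberId).hashCode()));
    collector.send(new OutgoingMessageEnvelope(OUTPUT, event));
  }
}
```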
Stateless processing – accessing adjunct data
Key issues:
• Accidental DoS of the member DB
• Dealing with spikes
• Remote I/O slows down processing
Stateless processing – locally accessible adjunct data

Pros
• Awesome performance at low cost (100x faster)
• No issues with accidental DoS
• No need to over-provision the remote database

Cons
• Does not work when the adjunct data is large and cannot be co-partitioned with the input stream
• Auto-scaling the processor gets trickier
• Repartitioning the input Kafka topic can mess up local state
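When the adjunct data can be co-partitioned with the input, it can be served out of Samza's local key-value store instead of the remote DB. A rough sketch follows; the store name ("member-profiles"), key/value types, and the join logic are illustrative, and the store itself would be kept in sync from a change-capture stream declared in the job config:

```java
// Sketch of enriching events from a co-partitioned local store instead of a
// remote DB. The store name ("member-profiles") and types are illustrative.
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class LocalAdjunctJoinTask implements StreamTask, InitableTask {
  private KeyValueStore<String, String> profiles;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // RocksDB-backed store declared in the job config; populated via change capture
    profiles = (KeyValueStore<String, String>) context.getStore("member-profiles");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String memberId = String.valueOf(envelope.getKey());
    String profile = profiles.get(memberId);  // local lookup: no remote I/O, no accidental DoS
    // ... join the profile with the incoming event and emit downstream ...
  }
}
```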
Stateless processing – async data access

Synchronous API (existing):

// executed on multiple threads
public interface StreamTask {
  // process the message synchronously
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}

Asynchronous API:

// callback-based
public interface AsyncStreamTask {
  // process the message with asynchronous calls;
  // fire the callback upon completion
  void processAsync(IncomingMessageEnvelope envelope,
                    MessageCollector collector,
                    TaskCoordinator coordinator,
                    TaskCallback callback);
}
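A minimal sketch of implementing the async API, assuming a hypothetical non-blocking lookup client (the stub below keeps the example self-contained); the TaskCallback complete()/failure() contract is what tells Samza the message is done:

```java
// Sketch of an AsyncStreamTask; ProfileClient is a hypothetical async client.
import java.util.concurrent.CompletableFuture;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.AsyncStreamTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.TaskCallback;
import org.apache.samza.task.TaskCoordinator;

public class AdjunctLookupTask implements AsyncStreamTask {

  /** Hypothetical non-blocking lookup client; a stub so the sketch compiles. */
  static final class ProfileClient {
    CompletableFuture<String> lookupTitle(String memberId) {
      return CompletableFuture.completedFuture("unknown");
    }
  }

  private final ProfileClient client = new ProfileClient();

  @Override
  public void processAsync(IncomingMessageEnvelope envelope,
                           MessageCollector collector,
                           TaskCoordinator coordinator,
                           TaskCallback callback) {
    String memberId = String.valueOf(envelope.getKey());
    client.lookupTitle(memberId)                 // non-blocking remote call
        .whenComplete((title, error) -> {
          if (error != null) {
            callback.failure(error);             // tell Samza the message failed
          } else {
            // ... enrich the event with 'title' and emit via the collector ...
            callback.complete();                 // tell Samza the message is done
          }
        });
  }
}
```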
Stateful processing
Aggregations, windowed joins, etc.
Managing state
● Full state checkpointing
  ● Simply does not scale for non-trivial application state
  ● … but makes it easier to achieve “repeatable results” when recovering from failure
● Incremental state checkpointing
  ● Scales to any type of application state
  ● Achieving repeatable results requires additional techniques (e.g., variants of de-dup or transaction support)
Managing local state
• Durably store the “host-to-task” mapping
• Minimize reseeding during failures and when adding/removing capacity
Samza processing pipeline
• Natural back-pressure
• Per-stage checkpointing instead of global checkpointing
• Cost considerations – new Kafka feature (KIP-107: deleteDataBefore)
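KIP-107 (proposed as deleteDataBefore, surfaced in later Kafka client versions as AdminClient#deleteRecords) lets a pipeline trim intermediate topics up to the checkpointed offset instead of paying for full retention. A hedged sketch, assuming that later API; the topic name, partition, offset, and broker address are illustrative:

```java
// Sketch of trimming an intermediate topic once a stage has checkpointed past
// an offset; topic name, partition, and offset are illustrative.
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.RecordsToDelete;
import org.apache.kafka.common.TopicPartition;

public class IntermediateTopicTrimmer {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");        // placeholder broker

    try (AdminClient admin = AdminClient.create(props)) {
      TopicPartition tp = new TopicPartition("repartition-topic", 0);
      long checkpointedOffset = 123_456L;                   // offset the downstream stage has fully processed
      Map<TopicPartition, RecordsToDelete> request =
          Collections.singletonMap(tp, RecordsToDelete.beforeOffset(checkpointedOffset));
      admin.deleteRecords(request).all().get();             // data before this offset becomes eligible for deletion
    }
  }
}
```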
Stream processing vs. batch processing
It is all just data processing
Scenario: title standardization
Re-evaluate titles for all LinkedIn members with a new ML model

Dealing with changes in ML models
Batch processing in Samza
Samza HDFS support
(REPROCESSING, EXPERIMENTATION, LAMBDA ARCH, ETC.)
Samza HDFS benchmark
Profile count, group-by country
500 files, 250 GB
Samza: a common API for data processing
● Application code does not change
● Stream processing
● Batch data processing
● Configurable input sources and sinks (e.g., Kafka, Kinesis, Event Hubs, HDFS, etc.)
Fluent API (0.13 release)

public class PageViewCounterExample implements StreamApplication {

  @Override
  public void init(StreamGraph graph, Config config) {
    MessageStream<PageViewEvent> pageViewEvents = graph.createInputStream("myinput");
    MessageStream<MyStreamOutput> outputStream = graph.createOutputStream("myoutput");

    pageViewEvents
        .partitionBy(m -> m.getMessage().memberId)
        .window(Windows.<PageViewEvent, String, Integer>keyedTumblingWindow(
            m -> m.getMessage().memberId, Duration.ofSeconds(10), (m, c) -> c + 1))
        .map(MyStreamOutput::new)
        .sendTo(outputStream);
  }

  public static void main(String[] args) throws Exception {
    CommandLine cmdLine = new CommandLine();
    Config config = cmdLine.loadConfig(cmdLine.parser().parse(args));
    ApplicationRunner localRunner = ApplicationRunner.getLocalRunner(config);
    localRunner.run(new PageViewCounterExample());
  }
}
Deployment options

Standalone
• Full control over the application lifecycle
• Can be part of a bigger application
• ZK-based coordination

YARN-based
• Dashboard
• Management service
• Monitoring/alerts
• Long-running service in YARN
Conclusion