Open Source LinkedIn Analytics Pipeline - BOSS 2016 (VLDB)

©2014 LinkedIn Corporation. All Rights Reserved.
Open Source Analytics Pipeline at LinkedIn
Issac Buenrostro
Jean-François Im
BOSS Workshop, 2016
Outline
1 Overview of analytics at LinkedIn
2 Gobblin
3 Pinot
4 Demo
5 Operating an analytics pipeline in production
LinkedIn in Numbers
Members: 450m+
Number of datasets: 10k+
Data volume generated per day: 100TB+
Total accumulated data: 20PB+
Multiple datacenters
Thousands of nodes per Hadoop cluster
Analytics at LinkedIn
[Architecture diagram: tracking events, external data, and databases feed Kafka; Gobblin ingests them into HDFS; Pinot serves the data to visualization tools, apps, and reports.]
In This Workshop
[Diagram: in this workshop, data from Kafka and a REST source is ingested into the file system and queried from an app.]
What is Gobblin?
Universal data ingestion framework
Gobblin Architecture
[Architecture diagram]
Sample Use Cases
1 Stream dumps (e.g. Kafka -> HDFS)
2 Snapshot dumps (e.g. Oracle, Salesforce -> HDFS)
3 Stream loading (e.g. HDFS -> Kafka)
4 Data cleaning (HDFS -> HDFS purging)
5 File download/copy (x-cluster replication, FTP/SFTP download)
Features
1. Pluggable sources, converters, quality checkers, and writers.
2. Runs on a single node, a Gobblin-managed cluster, AWS, or YARN (as an MR job or a standalone YARN app).
3. A single Gobblin instance can serve multiple sources and sinks.
4. Quick start using templates for the most common jobs.
5. Other tools in the Gobblin suite: metrics, retention, configuration management, data compaction.
Gobblin at LinkedIn
1 In production since 2014
2 ~20 different sources: Kafka, OLTP, HDFS, SFTP, Salesforce, MySQL, etc.
3 Processes >100 TB per day
4 Processes 10,000+ different datasets with custom configurations
5 Configuration, retention, metrics, and compaction handled by the Gobblin suite
Pinot
What is Pinot?
• Distributed near-realtime OLAP datastore
• Horizontally scalable for larger data volumes and query rates
• Offers a SQL query interface
• Can index and combine data pushed from offline data sources (e.g. Hadoop) and realtime data sources (e.g. Kafka)
• Fault tolerant, with no single point of failure
Pinot at LinkedIn
1 Over 50 different use cases (e.g. “Who viewed my profile?”)
2 Several thousand queries per second over billions of rows across multiple data centers
3 Operates 24x7 with no downtime for maintenance
4 The de facto data store for site-facing analytics at LinkedIn
Pinot at LinkedIn: Who Viewed My Profile?
Pinot Design Limitations
1. Pinot is designed for analytical workloads (OLAP), not transactional ones (OLTP)
2. Data in Pinot is immutable (e.g. no UPDATE statement), though it can be overwritten in bulk
3. Realtime data is append-only (can only load new rows)
4. There is no support for JOINs or subselects
5. There are no UDFs for aggregation (work in progress)
Demo
How to run the demos back home
• Since we cover a lot of material during these demos, we’ll make the VM used for them available after the tutorial. This way you can focus on understanding what is demonstrated instead of trying to follow exactly what the presenters are typing.
• You can grab a copy of the VM at https://jean-francois.im/vldb/vldb-2016-gobblin-pinot-demo-vm.tar.gz, or in person after the tutorial if you want to avoid downloading over the hotel Wi-Fi.
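For example, to fetch and unpack the VM from a shell (a minimal sketch, assuming wget and a tar that understands gzip):
wget https://jean-francois.im/vldb/vldb-2016-gobblin-pinot-demo-vm.tar.gz
tar -xzf vldb-2016-gobblin-pinot-demo-vm.tar.gz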
Gobblin Demo Outline
1. Setting up Gobblin
2. Kafka to file system ingest
3. Wikipedia to Kafka ingest from scratch
4. Metrics and events
5. Other running modes
Gobblin Setup
Download a binary release:
https://github.com/linkedin/gobblin/releases
Or download the sources and build:
./gradlew assemble
Find the tarball under build/gobblin-distribution/distributions
Untarring it generates a gobblin-dist directory
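Put together, a from-source build might look like this (a sketch; the exact tarball filename varies by version):
git clone https://github.com/linkedin/gobblin.git
cd gobblin
./gradlew assemble
# the distribution tarball lands under build/gobblin-distribution/distributions;
# the filename is version-dependent, hence the glob
tar -xzf build/gobblin-distribution/distributions/gobblin-dist*.tar.gz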
Gobblin Startup
cd gobblin-dist
export JAVA_HOME=<java-home>
mkdir $HOME/gobblin-jobs
mkdir $HOME/gobblin-workspace
bin/gobblin-standalone-v2.sh --conf $HOME/gobblin-jobs/ --workdir $HOME/gobblin-workspace/ start
Gobblin Directory Layout
gobblin-dist/ Gobblin binaries and scripts
|--- bin/ Startup scripts
|--- conf/ Global configuration files
|--- lib/ Classpath jars
|--- logs/ Execution log files
gobblin-workspace/ Workspace for Gobblin
|--- locks/ Locks for each job
|--- state-store/ Stores watermarks and failed work units
|--- task-output/ Staging area for job output
gobblin-jobs/ Place job configuration files here
|--- job.pull A job configuration
Running a job
1. Place a *.pull file in gobblin-jobs/
2. New and modified files are automatically found and start executing.
3. Can provide a cron-style schedule; if absent, the job runs once (per Gobblin instance).
Kafka Puller Job
gobblin-jobs/kafka-puller.pull
# Template to use
job.template=templates/gobblin-kafka.template
# Schedule in cron format
job.schedule=0 0/15 * * * ? # every 15 minutes
# Job configuration
job.name=KafkaPull
topics=test
# Can override brokers
# kafka.brokers="localhost:9092"
Pulls records from a Kafka topic (default at localhost) and writes them to gobblin-jobs/job-output in plain text.
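To give the job something to pull, one way is to produce a few test messages with the console producer that ships with Kafka (a sketch, assuming a broker on localhost:9092):
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
# type a few lines, then Ctrl-C; they should show up in gobblin-jobs/job-output after the next scheduled run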
Kafka Puller Job – Json to Avro
gobblin-jobs/kafka-puller-jsontoavro.pull
job.template=templates/gobblin-kafka.template
job.schedule=0 0/1 * * * ?
job.name=KafkaPullAvro
topics=jsonDate
converter.classes=gobblin.converter.SchemaInjector,gobblin.converter.json.JsonStringToJsonIntermediateConverter,gobblin.converter.avro.JsonIntermediateToAvroConverter
gobblin.converter.schemaInjector.schema=<schema>
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
writer.output.format=AVRO
# Uncomment for partitioning by date
# writer.partition.columns=timestamp
# writer.partitioner.class=gobblin.writer.partitioner.TimeBasedAvroWriterPartitioner
# writer.partition.pattern=yyyy/MM/dd/HH
Kafka Pusher Job
Push changes from Wikipedia to a Kafka topic.
https://gist.github.com/ibuenros/3cb4c9293edc7f43ab41c0d0d59cb586
Gobblin Metrics and Events
Gobblin emits operational metrics and events.
Write metrics to a file:
metrics.enabled=true
metrics.reporting.file.enabled=true
metrics.log.dir=/home/gobblin/metrics
Write metrics to Kafka:
metrics.enabled=true
metrics.reporting.kafka.enabled=true
metrics.reporting.kafka.brokers=localhost:9092
metrics.reporting.kafka.topic.metrics=GobblinMetrics
metrics.reporting.kafka.topic.events=GobblinEvents
metrics.reporting.kafka.format=avro
metrics.reporting.kafka.schemaVersionWriterType=NOOP
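To sanity-check the Kafka reporter, you can watch the metrics topic with Kafka's console consumer (a sketch; the Avro payloads won't be human-readable without decoding):
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic GobblinMetrics
# newer Kafka versions take --bootstrap-server localhost:9092 instead of --zookeeper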
Gobblin Metric Flattening for Pinot
gobblin-jobs/gobblin-metrics-flattener.pull
job.template=templates/kafka-to-kafka.template
job.schedule=0 0/5 * * * ?
job.name=MetricsFlattener
inputTopics=GobblinMetrics
outputTopic=FlatMetrics
gobblin.source.kafka.extractorType=AVRO_FIXED_SCHEMA
gobblin.source.kafka.fixedSchema.GobblinMetrics=<schema>
converter.classes=gobblin.converter.GobblinMetricsFlattenerConverter,gobblin.converter.avro.AvroToJsonStringConverter,gobblin.converter.string.StringToBytesConverter
Distributed Gobblin
Hadoop / YARN
Azkaban Mode
• AzkabanGobblinDaemon (multi-job)
• AzkabanJobLauncher (single job)
MR mode
• bin/gobblin-mapreduce.sh (single job)
YARN mode
• GobblinYarnAppLauncher (experimental)
AWS
Set up Gobblin cluster on AWS nodes.
In development:
Distributed job running for standalone Gobblin
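For the MR mode listed above, a single-job launch might look like this (an assumption-laden sketch; the flag names can vary across versions, so check the options bin/gobblin-mapreduce.sh accepts):
bin/gobblin-mapreduce.sh --conf gobblin-jobs/kafka-puller.pull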
Pinot Demo Outline
1. Set up Pinot and create a table
2. Load offline data into the table
3. Query Pinot
4. Configure realtime (streaming) data ingestion
Pinot Setup
git clone the latest version
mvn -DskipTests install
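Concretely, that might look like this (a sketch; at the time the sources lived under the LinkedIn GitHub organization):
git clone https://github.com/linkedin/pinot.git
cd pinot
mvn -DskipTests install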
Pinot Startup
Run these after Zookeeper and Kafka have started:
cd pinot-distribution/target/pinot-0.016-pkg
bin/start-controller.sh -dataDir /data/pinot/controller-data &
bin/start-broker.sh &
bin/start-server.sh -dataDir /data/pinot/server-data &
This will:
• Start a controller listening on localhost:9000
• Start a broker listening on localhost:8099
• Start a server, although clients don’t connect to it directly
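If Zookeeper and Kafka aren't already running, the scripts bundled with a Kafka distribution are one way to bring them up (a sketch, assuming the default config files):
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &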
Pinot architecture
[Architecture diagram]
Creating a table
bin/pinot-admin.sh AddTable -filePath flights/flights-definition.json -exec
• Tables in Pinot are created using a JSON-based configuration format
• This configuration defines several parameters, such as the retention period, the time column, and which columns to create inverted indices for
{
  "tableName": "airlineStats",
  "tableType": "OFFLINE",
  "tableIndexConfig": {
    "invertedIndexColumns": [],
    "loadMode": "MMAP",
    "lazyLoad": "false"
  },
  "tenants": { "server": "airline", "broker": "airline_broker" },
  "metadata": {},
  "segmentsConfig": {
    "retentionTimeValue": "700",
    "retentionTimeUnit": "DAYS",
    "segmentPushFrequency": "daily",
    "replication": 1,
    "timeColumnName": "DaysSinceEpoch",
    "timeType": "DAYS",
    "segmentPushType": "APPEND",
    "schemaName": "airlineStats",
    "segmentAssignmentStrategy": "BalanceNumSegmentAssignmentStrategy"
  }
}
Loading data into Pinot
• Data in Pinot is stored in segments, which are pre-indexed units of data
• To load our Avro-formatted data into Pinot, we’ll run a segment conversion (which can be run either locally or on Hadoop) to turn our data into segments
• We’ll then upload our segments into Pinot
Converting data into segments
• For this demo, we’ll do this locally:
bin/pinot-admin.sh CreateSegment -dataDir flights -outDir converted-segments -tableName flights -segmentName flights
• In a production environment, you’ll want to do this on Hadoop:
hadoop jar pinot-hadoop-0.016.jar SegmentCreation job.properties
• See https://github.com/linkedin/pinot/wiki/How-To-Use-Pinot for Hadoop configuration
Uploading segments to Pinot
Uploading segments to Pinot is done through a standard HTTP file upload; we also provide a job to do it from Hadoop.
Locally:
bin/pinot-admin.sh UploadSegment -segmentDir converted-segments
On Hadoop:
hadoop jar pinot-hadoop-0.016.jar SegmentTarPush job.properties
Querying Pinot
• Pinot offers a REST API to send queries, which return a JSON-formatted query response
• There is also a Java client, which provides a JDBC-like API to send queries
• For debugging purposes, it’s also possible to send queries to the controller through a web interface, which forwards the query to the appropriate broker
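As a sketch of the REST path, one might POST a query directly to the broker started earlier (assuming the /query endpoint and the PQL dialect Pinot used at the time):
curl -X POST -d '{"pql": "select count(*) from flights"}' http://localhost:8099/query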
Querying Pinot
bin/pinot-admin.sh PostQuery -query "select count(*) from flights"
{
  "numDocsScanned": 844482,
  "aggregationResults": [{"function": "count_star", "value": "844482"}],
  "timeUsedMs": 16,
  "segmentStatistics": [],
  "exceptions": [],
  "totalDocs": 844482
}
Adding realtime ingestion
• We could make our data fresher by running an offline push job more often, but there’s a limit to how often we can do that
• In Pinot, there are two types of tables: offline and realtime (e.g. streaming from Kafka)
• Pinot supports merging offline and realtime tables at runtime
Realtime table and offline table
[Diagram: realtime and offline tables]
SELECT SUM(foo) rewrite
[Diagram: how a query such as SELECT SUM(foo) is rewritten and split between the offline and realtime tables on the time column]
Configuring realtime ingestion
• Pinot supports pluggable decoders to interpret messages fetched from Kafka; there is one for JSON and one for Avro
• Pinot also requires a schema, which defines which columns to index, along with their type and purpose (dimension, metric, or time column)
• Realtime tables require a time column, so that query splitting can work properly
Configuring realtime ingestion
{
  "schemaName": "flights",
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "timeType": "DAYS", "dataType": "INT", "name": "DaysSinceEpoch"
    }
  },
  "metricFieldSpecs": [
    { "name": "Delayed", "dataType": "INT", "singleValueField": true },
    ...
  ],
  "dimensionFieldSpecs": [
    { "name": "Year", "dataType": "INT", "singleValueField": true },
    { "name": "DivAirports", "dataType": "STRING", "singleValueField": false },
    ...
  ]
}
Operating in Production
Pipeline in production
1. Fault tolerance
2. Performance
3. Retention
4. Metrics
5. Offline and realtime
6. Indexing and sorting
Pipeline in production: Fault tolerance
Gobblin:
• Retries work units on failure
• Commit policies for isolating failures
• Requires an external tool to handle daemon failures (cron, Azkaban)
Pinot:
• Supports data replication, for fault tolerance and read scaling
• By design, no single point of failure; at LinkedIn there are multiple controllers, servers, and brokers, and any one can fail without impacting availability
Pipeline in production: Performance
Gobblin:
• Run in distributed mode.
• One or more tasks per container; supports bin packing of tasks.
• Bottleneck at the job driver (fix in progress).
Pinot:
• Offline clusters can be resized at runtime without service interruption: just add more nodes and rebalance the cluster.
• Realtime clusters can also be resized, although new replicas need to reconsume the contents of the Kafka topic (this limitation should be gone in Q4 2016).
Pipeline in production: Retention
Gobblin:
• Data retention job available in the Gobblin suite.
• Supports common policies (time, newest K) as well as custom policies.
Pinot:
• Configurable retention: data is expired and removed automatically, without user intervention.
• Configurable independently for realtime and offline tables: for example, one might keep 90 days of offline data and 7 days of realtime data.
Pipeline in production: Metrics
Gobblin:
• Metrics and events emitted by all jobs to any sink: timings, records processed per stage, etc.
• Can add custom instrumentation to the pipeline.
Pinot:
• Emits metrics that can be used to monitor the system and make sure everything is running correctly.
• Key metrics: per-table query latency and rate, GC rate, and number of available replicas.
• For debugging, it’s also possible to drill down into latency metrics for the various phases of a query.
Pipeline in production: Offline and real time
Gobblin:
• Mostly an offline job; can run frequently with small batches.
• More real-time processing in progress.
Pinot:
• For hybrid clusters (combined offline and real time), the overlap between both parts means fewer production issues:
• If the Hadoop data push job fails, data is served from the real-time part; retention can be increased for extended push-job failures.
• If the real-time part has issues, offline data takes precedence over real-time data, ensuring that data can be replaced; only the latest data points will be unavailable.
Pipeline in production: Indexing and sorting
• Pinot supports per-table indexes, created at load time, so there is no runtime performance hit for re-indexing.
• Pinot optimizes queries where data is sorted on at least one of the filter predicates; for example, “Who viewed my profile?” data is sorted on viewerId.
• Pinot supports sorting realtime-ingested data when writing it to disk.
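For instance, using the table configuration format shown earlier, requesting an inverted index is a one-line change in the tableIndexConfig section (illustrative; viewerId stands in for whatever column your queries filter on):
"tableIndexConfig": {
  "invertedIndexColumns": ["viewerId"],
  "loadMode": "MMAP"
}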
Conclusions
1 Analytics pipeline collecting data from a variety of sources
2 Gobblin provides universal data ingestion and easy extensibility
3 Pinot provides offline and real time analytics querying
4 Easy, flexible setup of analytics pipeline
5 Production considerations around scale, fault tolerance, etc.
Who is using this?
Development Teams
Find out more:
Gobblin
https://github.com/linkedin/gobblin
http://gobblin.readthedocs.io/
gobblin-users@googlegroups.com
Pinot
https://github.com/linkedin/pinot
pinot-users@googlegroups.com
https://engineering.linkedin.com/
©2015 LinkedIn Corporation. All Rights Reserved.

Editor's Notes

  1. https://jean-francois.im/vldb/vldb-2016-gobblin-pinot-demo-vm.tar.gz
  2. Detail what happens to an event at LinkedIn. This presentation will mostly focus on Gobblin and Pinot for the purpose of analytics.