1. Pluggable sources, converters, quality checkers, and writers.
2. Run on a single node, a Gobblin-managed cluster,
AWS, or YARN (as MR or a standalone YARN app).
3. Single Gobblin instance for multiple sources / sinks.
4. Quick start using templates for most common jobs.
5. Other Gobblin suite tools: metrics, retention,
configuration management, data compaction.
Gobblin at LinkedIn
1 In production since 2014
2 ~20 different sources: Kafka, OLTP, HDFS, SFTP, Salesforce, and more
3 Process >100 TB per day
4 Process 10,000+ different datasets with custom configurations
5 Configuration, retention, metrics, compaction handled by the Gobblin suite
What is Pinot?
• Distributed near-realtime OLAP datastore
• Horizontally scalable for larger data volumes and query rates
• Offers a SQL query interface
• Can index and combine data pushed from offline
data sources (e.g. Hadoop) and realtime data
sources (e.g. Kafka)
• Fault tolerant, no single point of failure
Pinot at LinkedIn
1 Over 50 different use cases (e.g. “Who viewed my profile?”)
2 Several thousand queries per second over billions of rows,
across multiple data centers
3 Operates 24x7 with no downtime for maintenance
4 The de facto data store for site-facing analytics at Linkedin
Pinot Design Limitations
1. Pinot is designed for analytical workloads (OLAP),
not transactional ones (OLTP)
2. Data in Pinot is immutable (e.g. no UPDATE
statement), though it can be overwritten in bulk
3. Realtime data is append-only (can only load new records)
4. There is no support for JOINs or subselects
5. There are no UDFs for aggregation (work in progress)
How to run the demos back home
• Since we cover a lot of material during these
demos, we’ll make the VM used for them available
after the tutorial. This way you can focus on
understanding what is demonstrated instead of
trying to follow exactly what is being typed by the presenters.
• You can grab a copy of the VM at
pinot-demo-vm.tar.gz, or in person after the tutorial if
you want to avoid downloading it over the hotel Wi-Fi
Gobblin Demo Outline
1. Setting up Gobblin
2. Kafka to file system ingest
3. Wikipedia to Kafka ingest from scratch
4. Metrics and events
5. Other running modes
Or download the sources and build:
Find the tarball in build/gobblin-distribution/distributions
Untar it; this generates a directory gobblin-dist
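A minimal sketch of those steps, assuming Gobblin’s GitHub repository and the Gradle wrapper it ships with:

git clone https://github.com/linkedin/gobblin.git
cd gobblin
./gradlew clean build -x test     # skip tests for a faster build
tar -xzf build/gobblin-distribution/distributions/gobblin-distribution-*.tar.gz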
Gobblin Directory Layout
gobblin-dist/ Gobblin binaries and scripts
|--- bin/ Startup scripts
|--- conf/ Global configuration files
|--- lib/ Classpath jars
|--- logs/ Execution log files
gobblin-workspace/ Workspace for Gobblin
|--- locks/ Locks for each job
|--- state-store/ Stores watermarks and failed work units
|--- task-output/ Staging area for job output
gobblin-jobs/ Place job configuration files here
|--- job.pull A job configuration
Running a job
1. Place *.pull file in gobblin-jobs/
2. New and modified files are automatically found and will be run
3. Can provide a cron-style schedule; if absent, the job
will run once (per Gobblin instance)
Kafka Puller Job
# Template to use
job.template=<template URI>
# Schedule in cron format
job.schedule=0 0/15 * * * ?  # every 15 minutes
# Job configuration (topic, output location, ...)
# Can override brokers
# kafka.brokers=localhost:9092
Pulls records from a Kafka topic (default broker at localhost)
and writes them to gobblin-jobs/job-output in plain text.
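Without templates, an equivalent self-contained .pull file might look as follows; this is a sketch based on Gobblin’s Kafka quick-start example (class and property names come from the Gobblin documentation; the topic name is hypothetical):

job.name=PullFromKafka
job.group=GobblinDemo
job.schedule=0 0/15 * * * ?

kafka.brokers=localhost:9092
topic.whitelist=test_topic                # hypothetical topic name

source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
extract.namespace=gobblin.extract.kafka

writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
writer.file.path.type=tablename
writer.destination.type=HDFS              # also covers the local file system
writer.output.format=txt

data.publisher.type=gobblin.publisher.BaseDataPublisher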
Kafka Puller Job – JSON to Avro
job.schedule=0 0/1 * * * ?
# Uncomment for partitioning by date
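The conversion and the optional date partitioning are driven by converter and writer properties; a sketch using class names from gobblin-core (the exact values are illustrative, and the time column name is hypothetical):

converter.classes=gobblin.converter.json.JsonStringToJsonIntermediateConverter,gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
# Uncomment for partitioning by date:
# writer.partitioner.class=gobblin.writer.partitioner.TimeBasedAvroWriterPartitioner
# writer.partition.columns=timestamp      # hypothetical time column
# writer.partition.pattern=yyyy/MM/dd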
Kafka Pusher Job
Push changes from Wikipedia to a Kafka topic.
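Gobblin ships a Wikipedia example source in its gobblin-example module; a sketch of such a job follows (the Kafka writer class name and its properties are assumptions based on Gobblin’s Kafka writer module, and the topic name is hypothetical):

job.name=PullFromWikipediaToKafka
job.group=GobblinDemo

source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia     # pages to poll; illustrative
source.revisions.cnt=5

writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder   # assumed class name
writer.kafka.topic=WikipediaExample       # hypothetical topic
writer.kafka.producerConfig.bootstrap.servers=localhost:9092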
Gobblin Metrics and Events
Gobblin emits operational metrics and events.
Write metrics to file
Write metrics to Kafka
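Both sinks are enabled through job configuration; a sketch using Gobblin’s metrics reporting properties (the directory, brokers, and topic are illustrative):

metrics.enabled=true
# Write metrics to file
metrics.reporting.file.enabled=true
metrics.log.dir=/tmp/gobblin-metrics
# Write metrics to Kafka
metrics.reporting.kafka.enabled=true
metrics.reporting.kafka.brokers=localhost:9092
metrics.reporting.kafka.topic.metrics=GobblinMetrics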
Hadoop / YARN
• AzkabanGobblinDaemon (multi-job)
• AzkabanJobLauncher (single job)
• bin/gobblin-mapreduce.sh (single job)
• GobblinYarnAppLauncher (experimental)
AWS
• Set up a Gobblin cluster on AWS nodes
• Distributed job execution for standalone Gobblin
Pinot Demo Outline
1. Set up Pinot and create a table
2. Load offline data into the table
3. Query Pinot
4. Configure realtime (streaming) data ingestion
git clone the latest version, then build:
mvn -DskipTests install

Once Zookeeper and Kafka have been started:
bin/start-controller.sh -dataDir /data/pinot/controller-data &
bin/start-broker.sh &   # assumed script name; a broker must also be started
bin/start-server.sh -dataDir /data/pinot/server-data &
• Start a controller listening on localhost:9000
• Start a broker listening on localhost:8099
• Start a server, although clients don’t connect to it directly
Creating a table
bin/pinot-admin.sh AddTable -filePath <table config JSON>
• Tables in Pinot are created using a JSON-based
configuration file
• This configuration defines several parameters, such
as the retention period, the time column, and the
columns for which to create inverted indices
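A sketch of such a configuration for the flights table used in this demo; the structure follows Pinot’s table config format, while the column names and values are illustrative:

{
  "tableName": "flights",
  "tableType": "OFFLINE",
  "segmentsConfig": {
    "timeColumnName": "DaysSinceEpoch",
    "timeType": "DAYS",
    "retentionTimeUnit": "DAYS",
    "retentionTimeValue": "90",
    "replication": "1"
  },
  "tableIndexConfig": {
    "loadMode": "MMAP",
    "invertedIndexColumns": ["Carrier", "Origin"]
  },
  "tenants": {},
  "metadata": {}
}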
Loading data into Pinot
• Data in Pinot is stored in segments, which are pre-indexed units of data
• To load our Avro-formatted data into Pinot, we’ll run
a segment conversion (which can either be run
locally or on Hadoop) to turn our data into segments
• We’ll then upload our segments into Pinot
Converting data into segments
• For this demo, we’ll do this locally:
• In a production environment, you’ll want to do this on Hadoop
• See https://github.com/linkedin/pinot/wiki/How-To-Use-Pinot
for the Hadoop configuration
bin/pinot-admin.sh CreateSegment -dataDir flights -outDir converted-segments -tableName flights
hadoop jar pinot-hadoop-0.016.jar SegmentCreation
Uploading segments to Pinot
Uploading segments to Pinot is done through
a standard HTTP file upload; we also provide
a job to do it from Hadoop.
bin/pinot-admin.sh UploadSegment -segmentDir converted-segments
hadoop jar pinot-hadoop-0.016.jar SegmentTarPush
Querying Pinot
• Pinot offers a REST API to send queries, which then
return a JSON-formatted query response
• There is also a Java client, which provides a JDBC-like API to send queries
• For debugging purposes, it’s also possible to send
queries to the controller through a web interface,
which forwards the query to the appropriate broker
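For example, a query can be sent straight to the broker with an HTTP POST; a sketch assuming the broker from our setup on localhost:8099 and the PQL endpoint of this Pinot version:

curl -X POST -d '{"pql":"SELECT count(*) FROM flights"}' http://localhost:8099/query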
Adding realtime ingestion
• We could make our data fresher by running an
offline push job more often, but there’s a limit as to
how often we can do that
• In Pinot, there are two types of tables: offline and
realtime (e.g. streaming from Kafka)
• Pinot supports merging offline and realtime tables at query time
Configuring realtime ingestion
• Pinot supports pluggable decoders to interpret
messages fetched from Kafka; there is one for
JSON and one for Avro
• Pinot also requires a schema, which defines which
columns to index, their type and purpose
(dimension, metric or time column)
• Realtime tables require having a time column, so
that query splitting can work properly
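A sketch of a schema for the flights table; the layout follows Pinot’s schema format, and the column names and types are illustrative:

{
  "schemaName": "flights",
  "dimensionFieldSpecs": [
    { "name": "Carrier", "dataType": "STRING" },
    { "name": "Origin", "dataType": "STRING" }
  ],
  "metricFieldSpecs": [
    { "name": "ArrDelay", "dataType": "INT" }
  ],
  "timeFieldSpec": {
    "incomingGranularitySpec": {
      "name": "DaysSinceEpoch",
      "dataType": "INT",
      "timeType": "DAYS"
    }
  }
}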
Pipeline in production
1. Fault tolerance
2. Performance
3. Retention
4. Metrics
5. Offline and realtime
6. Indexing and sorting
Pipeline in production: Fault tolerance
• Retry work units on failure
• Commit policies for isolating failures
• Requires an external tool to handle daemon failures (cron, Azkaban)
• Supports data replication: fault tolerance and read scaling
• By design, no single point of failure; at LinkedIn, multiple
controllers, servers and brokers run, and any one of them can
fail without causing an outage
Pipeline in production: Performance
• Run in distributed mode.
• 1 or more tasks per container. Supports bin packing of tasks.
• Bottleneck at job driver (fix in progress).
• Offline clusters can be resized at runtime without service
interruption: just add more nodes and rebalance the cluster.
• Realtime clusters can also be resized, although new replicas
need to reconsume the contents of the Kafka topic (this
limitation should be gone in Q4 2016).
Pipeline in production: Retention
• Data retention job available in Gobblin suite.
• Supports common policies (time, newest K) as well as custom ones
• Configurable retention feature: data expired and removed
automatically without user intervention.
• Configurable independently for realtime and offline tables: for
example, one might have 90 days of retention for offline data
and 7 days of retention for realtime data.
Pipeline in production: Metrics
• Metrics and events emitted by all jobs to any sink: timings,
records processed per stage, etc.
• Can add custom instrumentation to pipeline.
• Emits metrics that can be used to monitor the system to make
sure everything is running correctly.
• Key metrics: per table query latency and rate, GC rate, and
number of available replicas.
• For debugging, it’s also possible to drill down into latency
metrics for the various phases of the query.
Pipeline in production: Offline and real time
• Mostly offline job. Can run frequently with small batches.
• More real time processing in progress.
• For hybrid clusters (combined offline and real time), overlap between both
parts means fewer production issues:
• If the Hadoop data push job fails, data is served from the real time part;
retention can be increased to cover an extended offline push outage
• If the real time part has issues, offline data has precedence over real
time data, thus ensuring that data can be replaced; only the latest data
points will be unavailable
Pipeline in production: Indexing and sorting
• Pinot supports per-table indexes; created at load time so there
is no performance hit at runtime for re-indexing.
• Pinot optimizes queries where data is sorted on at least one of
the filter predicates; for example “Who viewed my profile” data
is sorted on viewerId.
• Pinot supports sorting data ingested from realtime when
writing to disk.
Summary
1 Analytics pipeline collecting data from a variety of sources
2 Gobblin provides universal data ingestion and easy extensibility
3 Pinot provides offline and real time analytics querying
4 Easy, flexible setup of analytics pipeline
5 Production considerations around scale, fault tolerance, etc.