Gobblin: Unifying Data Ingestion for Hadoop

(18)
GOBBLIN: UNIFYING DATA
INGESTION FOR HADOOP
Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha
Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil
Surlaker, Shirshanka Das, Chavdar Botev
DataAnalytics Infrastructure @ LinkedIn

(18)
Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Details
• Gobblin in Production @ LinkedIn
• FutureWork
• Q&A
1

(18)
Data Ingestion Challenges @ LinkedIn
2
BIG engineering and operational COST!
Data Sources DataTypes Operational Pain

(18)
Pre-Gobblin Era
3
OLTP
Tracking
Snapshot
and delta
file dumps
Kafka
Databus
Changes
Pipeline #1
External
Partner Data
Pipeline #2
REST
JDBC
SOAP
...
Pipeline #3
Pipeline #4
Pipeline #5
Pipeline #n
Databases
(Oracle/Espresso)

(18)
The Gobblin Era
4
OLTP
Tracking
Snapshot
and delta
file dumps
Kafka
Databus
Changes
External
Partner Data
REST
JDBC
SOAP
...
Databases
(Oracle/Espresso)

(18)
Requirements
5
Multi-platform and
Scalability Support
Rich Source
Integration
Centralized State
Management
OperabilityExtensibility Self Service

(18)
Architecture Overview
6
Constructs for Building Ingestion Flows
WorkUnit /Task
Execution Runtime
Deployment Mode
state store
compaction
retention
mgmt.
monitoring
Standalone Hadoop MR Yarn
Source Extractor Converter
Qlty. Chker. Writer Publisher
Task Executor Task StateTracker
Job Launcher Job Scheduler

(18)
Case Study: Kafka Ingestion
7
KafkaAvroSource
KafkaAvroExtracto
r
WorkUnit 1
(Topic 1, Partition 1)
KafkaConverter
TimePartitioned
AvroWriter
Avro
/kafka/topic/hourly/yyyy/mm/dd/hh/*.avro
Compaction
/kafka/topic/daily/yyyy/mm/dd/*.avro
AuditCount
QualityChecker
KafkaAvroExtracto
r
WorkUnit 2
(Topic 1, Partition 2)
KafkaConverter
TimePartitioned
AvroWriter
Avro
AuditCount
QualityChecker
KafkaAvroExtracto
r
WorkUnit 3
(Topic 1, Partitions 1 & 2)
KafkaConverter
TimePartitioned
AvroWriter
Avro
AuditCount
QualityChecker
TimePartitioned
DataPublisher

(18)
Case Study: Database Ingestion
8
JdbcSource
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
/database/table/incremental/snapshot-ts/*.avro
Compaction
/database/table/full/snapshot-ts/*.avro
SchemaCompatibiliy
& Count Qlty. Chker
SnapshotDataPublisher
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
SchemaCompatibiliy
& Count Qlty. Chker
JdbcExtractor
WorkUnit 1
[2015090512, 2015090514)
ToAvroConverter
SnapshotAvroWriter
Row
SchemaCompatibiliy
& Count Qlty. Chker

(18)
Case Study – Filtering Sensitive Data
9
Has Sensitive
Data?
no
Source
Extractor
WorkUnit
Converter and
Quality Checker
Fork and Branching
Writer
DataPublisher
Writer
Sensitive Data
Filtering Converter
yes

(18)
Data Quality Checking
10
Record-level
Policies
Writer
Task-level
Policies
Publisher
Quarantine
FailTask
Quality Checkers
- Per record or per task.
- Policy driven
- Composable
~ Schema compatibility
~ Audit check
~ Sensitive fields
~ Required fields
~ Unique key

(18)
State and Metadata Mgmt.
11
State Store
- Stores runtime metadata, e.g., checkpoints
(a.k.a. watermarks)
~ Carried over between job runs
- Default impl: serializes job/task states into
files, one per run.
- Allows other implementations that conform
to the interface to be plugged in.
State Store
job run #2
job run #3job run #1
SEP
2
SEP
3
SEP
2 SEP
3
EXAMPLE

(18)
Metrics / Events and Alerting
12
Kafka
MetricContext
Topic 1
MetricContext
Topic 2
MetricContext
Partition 1
MetricContext
Partition 2
MetricContext
20
12 8
6 6
Metric
Reporter
Event
ReporterMetrics / Events
Collection and Reporting
- Metrics for ingestion progress
~ supports tagging
~ real-time aggregation
- Events for major milestones
~ “fire-and-forget”
- Various built-in metric / event
reporters

(18)
Running Modes
13
Standalone
Runs in a single
JVM; tasks run in a
thread pool.
Scale-out with
MapReduce
Each job run launches
a MR job, using
mappers as containers
to run tasks.
Scale-out with
General
Distributed
Resource Manager
Supports long-running
continuous ingestion,
with better resource
utilization and SLA
guarantees.
YARN
*in progress

(18)
Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
– Internal sources  HDFS
• Kafka, MySQL, Dropbox, etc.
– External sources  HDFS
• Salesforce, GoogleAnalytics, S3, etc.
– HDFS  HDFS
• Closed member data purging
– Egress from HDFS (future work)
• Data volume
– Over a dozen data sources,
– thousands of datasets,
– tens ofTBs,
… daily.
14

(18)
FutureWork
• Gobblin onYarn (alpha-release)
• Real-time elastic ingestion
• Integration with
– Apache Sqoop: using Sqoop connectors
– Logstash: log ingestion
– Morphlines: using Morphline transformation
– Apache Spark
15

(18)
Conclusions
16
Pain of
maintaining
multiple
ingestion
pipelines
Gobblin to the
rescue!
Data quality
assurance and
centralized state
management
Gobblin in
production for a
wide range of
data sources
Continuous real-
time ingestion

(18)17
ACKNOWLEDGEMENT
Pradhan Cadabam
Shrikanth Shankar
Suvodeep Pyne
Ray Ortigas
Henry Cai
Kenneth Goodhope
Erik Krogen

(18)
Thanks.
18
Github https://github.com/linkedin/gobblin
Documentation https://github.com/linkedin/gobblin/wiki
User Group https://groups.google.com/forum/#!forum/gobblin-users

Gobblin: Unifying Data Ingestion for Hadoop

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Gobblin: Unifying Data Ingestion for Hadoop

Similar to Gobblin: Unifying Data Ingestion for Hadoop (20)

Recently uploaded

Recently uploaded (20)

Gobblin: Unifying Data Ingestion for Hadoop