GOBBLIN: UNIFYING DATA
INGESTION FOR HADOOP
Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha
Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil
Surlaker, Shirshanka Das, Chavdar Botev
Data Analytics Infrastructure @ LinkedIn
Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Detail
• Gobblin in Production @ LinkedIn
• Future Work
• Q&A
Data Ingestion Challenges @ LinkedIn
BIG engineering and operational COST!
Data Sources, Data Types, Operational Pain
Pre-Gobblin Era
[Diagram: every source and channel feeds its own ingestion pipeline, Pipeline #1 through Pipeline #n.]
• OLTP: Databases (Oracle/Espresso), snapshot and delta file dumps, Databus changes
• Tracking: Kafka
• External Partner Data: REST, JDBC, SOAP, ...
The Gobblin Era
[Diagram: the same sources and channels, now without the separate per-source pipelines.]
• OLTP: Databases (Oracle/Espresso), snapshot and delta file dumps, Databus changes
• Tracking: Kafka
• External Partner Data: REST, JDBC, SOAP, ...
Requirements
• Multi-platform and Scalability Support
• Rich Source Integration
• Centralized State Management
• Operability
• Extensibility
• Self Service
Architecture Overview
• Constructs for building ingestion flows: Source, Extractor, Converter, Quality Checker, Writer, Publisher
• WorkUnit / Task
• Execution runtime: Task Executor, Task State Tracker, Job Launcher, Job Scheduler
• Deployment modes: Standalone, Hadoop MR, YARN
• Also shown: state store, compaction, retention management, monitoring
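As a rough illustration of how the constructs named on this slide fit together for a single WorkUnit, here is a minimal Python sketch. The function and variable names (`run_task`, `staged`, `published`) are hypothetical stand-ins, not Gobblin's actual Java interfaces, and dropping a record that fails the check is an assumption made to keep the example short.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the slide's constructs; not Gobblin's real API.
def run_task(extract: Callable[[], Iterable[dict]],
             convert: Callable[[dict], dict],
             check: Callable[[dict], bool],
             write: Callable[[dict], None]) -> int:
    """Run one WorkUnit: extract -> convert -> quality check -> write."""
    written = 0
    for record in extract():
        converted = convert(record)
        if check(converted):      # assumption: failing records are skipped
            write(converted)
            written += 1
    return written

staged: List[dict] = []       # writer output, not yet visible to consumers
published: List[dict] = []    # what consumers see after the publish step

count = run_task(
    extract=lambda: [{"id": 1, "name": "a"}, {"id": None, "name": "b"}],
    convert=lambda r: {**r, "name": r["name"].upper()},
    check=lambda r: r["id"] is not None,
    write=staged.append,
)
published.extend(staged)      # publish step: expose the staged records
print(count, published)
```

Keeping the write and publish steps separate means a task that fails partway leaves nothing visible to downstream readers.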
Case Study: Kafka Ingestion
• KafkaAvroSource creates the WorkUnits:
  - WorkUnit 1 (Topic 1, Partition 1)
  - WorkUnit 2 (Topic 1, Partition 2)
  - WorkUnit 3 (Topic 1, Partitions 1 & 2)
• Each WorkUnit runs KafkaAvroExtractor (emits Avro records) -> KafkaConverter -> AuditCountQualityChecker -> TimePartitionedAvroWriter
• TimePartitionedDataPublisher publishes to /kafka/topic/hourly/yyyy/mm/dd/hh/*.avro
• Compaction produces /kafka/topic/daily/yyyy/mm/dd/*.avro
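The hourly and daily layouts on this slide can be derived from a record timestamp. The sketch below is only an illustration: the helper names `hourly_dir` and `daily_dir` are hypothetical, and it assumes timestamps are epoch milliseconds partitioned in UTC, which the slide does not state.

```python
from datetime import datetime, timezone

def hourly_dir(topic: str, epoch_ms: int) -> str:
    """Map a record timestamp (epoch milliseconds) to its hourly directory."""
    # Assumption: partitions are cut in UTC; the slide names no time zone.
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"/kafka/{topic}/hourly/{ts:%Y/%m/%d/%H}"

def daily_dir(topic: str, epoch_ms: int) -> str:
    """Directory that a daily compaction of the hourly data would target."""
    ts = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return f"/kafka/{topic}/daily/{ts:%Y/%m/%d}"

event_time = datetime(2015, 9, 5, 12, 30, tzinfo=timezone.utc)
epoch_ms = int(event_time.timestamp() * 1000)
print(hourly_dir("topic", epoch_ms))  # /kafka/topic/hourly/2015/09/05/12
print(daily_dir("topic", epoch_ms))   # /kafka/topic/daily/2015/09/05
```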
Case Study: Database Ingestion
• JdbcSource creates the WorkUnits; each of the three shown is labeled WorkUnit 1 [2015090512, 2015090514)
• Each WorkUnit runs JdbcExtractor (emits rows) -> ToAvroConverter -> SchemaCompatibility & Count Quality Checker -> SnapshotAvroWriter
• SnapshotDataPublisher publishes to /database/table/incremental/snapshot-ts/*.avro
• Compaction produces /database/table/full/snapshot-ts/*.avro
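The WorkUnit label [2015090512, 2015090514) reads as a half-open time window in yyyyMMddHH form. Under that assumption, a larger range could be cut into such windows roughly as follows; the function name `partition` and the two-hour step are illustrative choices, not taken from the slides.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%Y%m%d%H"  # assumption: window bounds use yyyyMMddHH, as on the slide

def partition(low: str, high: str, hours: int) -> List[Tuple[str, str]]:
    """Split [low, high) into consecutive half-open windows of `hours` hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    start, end = datetime.strptime(low, FMT), datetime.strptime(high, FMT)
    step = timedelta(hours=hours)
    windows = []
    while start < end:
        stop = min(start + step, end)   # the last window may be shorter
        windows.append((start.strftime(FMT), stop.strftime(FMT)))
        start = stop
    return windows

windows = partition("2015090512", "2015090518", 2)
for low, high in windows:
    print(f"[{low}, {high})")
```

Because each window excludes its upper bound, a row stamped exactly on a boundary belongs to one window only.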
Case Study: Filtering Sensitive Data
• Source -> WorkUnit -> Extractor -> Converter and Quality Checker -> Fork and Branching
• Has sensitive data?
  - no: Writer
  - yes: Sensitive Data Filtering Converter -> Writer
• DataPublisher
Data Quality Checking
[Diagram: Record-level Policies -> Writer -> Task-level Policies -> Publisher, with Quarantine and Fail Task as the failure outcomes.]
Quality Checkers
- Per record or per task.
- Policy driven
- Composable
~ Schema compatibility
~ Audit check
~ Sensitive fields
~ Required fields
~ Unique key
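A small sketch of how record-level and task-level policies might compose. Everything here is hypothetical (the `run_checks` helper, the `member_id` field, the 90% threshold); it only illustrates the idea that failing records are quarantined while a failing task-level policy fails the task.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
RecordPolicy = Callable[[Record], bool]
TaskPolicy = Callable[[List[Record], List[Record]], bool]

def run_checks(records: List[Record],
               record_policies: List[RecordPolicy],
               task_policies: List[TaskPolicy]) -> Tuple[List[Record], List[Record], bool]:
    """Apply every record-level policy, then every task-level policy."""
    passed: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        ok = all(policy(record) for policy in record_policies)
        (passed if ok else quarantined).append(record)
    task_ok = all(policy(passed, quarantined) for policy in task_policies)
    return passed, quarantined, task_ok

# Hypothetical policies: a required field, and a minimum pass ratio.
required_field = lambda r: r.get("member_id") is not None
min_pass_ratio = lambda ok, bad: len(ok) >= 0.9 * (len(ok) + len(bad))

passed, quarantined, task_ok = run_checks(
    [{"member_id": 1}, {"member_id": None}, {"member_id": 3}],
    [required_field],
    [min_pass_ratio],
)
print(len(passed), len(quarantined), task_ok)  # 2 1 False
```

With `task_ok` false, the task's output would not be handed to the publisher.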
State and Metadata Management
State Store
- Stores runtime metadata, e.g., checkpoints
(a.k.a. watermarks)
~ Carried over between job runs
- Default impl: serializes job/task states into
files, one per run.
- Allows other implementations that conform
to the interface to be plugged in.
[Example diagram: job runs #1, #2, and #3 exchange watermarks (SEP 2, SEP 3) with the State Store.]
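To make the file-per-run idea concrete, here is a toy state store in Python. The class `FileStateStore`, the JSON layout, and the job name are all assumptions for illustration; Gobblin's default implementation serializes its own job and task state objects.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

class FileStateStore:
    """Toy state store: one JSON file per job run (hypothetical layout)."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, job: str, run_id: int, state: dict) -> None:
        job_dir = self.root / job
        job_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded run ids keep file names in run order when sorted.
        (job_dir / f"{run_id:06d}.json").write_text(json.dumps(state))

    def latest(self, job: str) -> Optional[dict]:
        job_dir = self.root / job
        if not job_dir.exists():
            return None
        files = sorted(job_dir.glob("*.json"))
        return json.loads(files[-1].read_text()) if files else None

def run_job(store: FileStateStore, job: str, run_id: int, new_high: str) -> Optional[str]:
    """Read the previous watermark, 'ingest', then checkpoint the new one."""
    previous = store.latest(job)
    low = previous["watermark"] if previous else None
    # ... a real job would pull the data between low and new_high here ...
    store.put(job, run_id, {"watermark": new_high})
    return low

with tempfile.TemporaryDirectory() as tmp:
    store = FileStateStore(Path(tmp))
    lows = [run_job(store, "kafka-ingest", 1, "2015-09-02"),
            run_job(store, "kafka-ingest", 2, "2015-09-03")]
    final = store.latest("kafka-ingest")
print(lows, final)
```

The second run starts from the watermark the first run left behind, which is the carry-over the slide describes.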
Metrics / Events and Alerting
[Diagram: a tree of MetricContexts (Kafka; Topic 1, Topic 2; Partition 1, Partition 2) with example values 20; 12, 8; 6, 6, feeding a Metric Reporter and an Event Reporter.]
Metrics / Events Collection and Reporting
- Metrics for ingestion progress
~ supports tagging
~ real-time aggregation
- Events for major milestones
~ “fire-and-forget”
- Various built-in metric / event
reporters
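The example values on this slide (20; 12, 8; 6, 6) are consistent with counters that roll up a tree of contexts. The sketch below reproduces those numbers with a toy class; `ToyMetricContext` is hypothetical, and placing both partitions under Topic 1 is an assumption inferred from 6 + 6 = 12.

```python
from typing import Dict, Optional

class ToyMetricContext:
    """Toy hierarchical metric context; not Gobblin's MetricContext class."""

    def __init__(self, name: str, parent: Optional["ToyMetricContext"] = None,
                 tags: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.parent = parent
        # Children inherit the parent's tags and may add their own.
        self.tags: Dict[str, str] = dict(parent.tags) if parent else {}
        self.tags.update(tags or {})
        self.counters: Dict[str, int] = {}

    def inc(self, metric: str, n: int = 1) -> None:
        """Update this context and every ancestor as the value arrives."""
        ctx: Optional[ToyMetricContext] = self
        while ctx is not None:
            ctx.counters[metric] = ctx.counters.get(metric, 0) + n
            ctx = ctx.parent

kafka = ToyMetricContext("Kafka", tags={"source": "kafka"})
topic1 = ToyMetricContext("Topic 1", kafka, {"topic": "1"})
topic2 = ToyMetricContext("Topic 2", kafka, {"topic": "2"})
part1 = ToyMetricContext("Partition 1", topic1, {"partition": "1"})
part2 = ToyMetricContext("Partition 2", topic1, {"partition": "2"})

part1.inc("records", 6)
part2.inc("records", 6)
topic2.inc("records", 8)
print(kafka.counters["records"], topic1.counters["records"], topic2.counters["records"])
# 20 12 8
```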
Running Modes
• Standalone: runs in a single JVM; tasks run in a thread pool.
• Scale-out with MapReduce: each job run launches an MR job, using mappers as containers to run tasks.
• Scale-out with a general distributed resource manager (YARN, *in progress): supports long-running continuous ingestion, with better resource utilization and SLA guarantees.
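The standalone mode amounts to running tasks in a thread pool inside one process. A minimal Python analogue, with hypothetical names (`run_standalone`, `fake_task`) and made-up WorkUnit labels, might look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

def run_standalone(work_units: Sequence[str],
                   task: Callable[[str], int],
                   threads: int = 4) -> List[int]:
    """Run one task per WorkUnit in a thread pool inside a single process."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() returns results in the order the WorkUnits were submitted.
        return list(pool.map(task, work_units))

def fake_task(work_unit: str) -> int:
    # Placeholder for extract/convert/check/write; returns a dummy count.
    return len(work_unit)

results = run_standalone(["topic1-p1", "topic1-p2", "topic2-p1"], fake_task)
print(results)  # [9, 9, 9]
```

The MapReduce mode swaps the thread pool for mappers while the per-task logic stays the same.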
Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
– Internal sources -> HDFS
• Kafka, MySQL, Dropbox, etc.
– External sources -> HDFS
• Salesforce, Google Analytics, S3, etc.
– HDFS -> HDFS
• Closed member data purging
– Egress from HDFS (future work)
• Data volume
– Over a dozen data sources,
– thousands of datasets,
– tens of TBs,
… daily.
Future Work
• Gobblin on YARN (alpha release)
• Real-time elastic ingestion
• Integration with
– Apache Sqoop: using Sqoop connectors
– Logstash: log ingestion
– Morphlines: using Morphline transformation
– Apache Spark
Conclusions
• Pain of maintaining multiple ingestion pipelines
• Gobblin to the rescue!
• Data quality assurance and centralized state management
• Gobblin in production for a wide range of data sources
• Continuous real-time ingestion
ACKNOWLEDGEMENT
Pradhan Cadabam
Shrikanth Shankar
Suvodeep Pyne
Ray Ortigas
Henry Cai
Kenneth Goodhope
Erik Krogen
Thanks.
GitHub https://github.com/linkedin/gobblin
Documentation https://github.com/linkedin/gobblin/wiki
User Group https://groups.google.com/forum/#!forum/gobblin-users
