Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1wQh534.
Lin Qiao discusses the architecture of Gobblin, LinkedIn's framework for addressing the need for high-quality, high-velocity data ingestion. Filmed at qconsf.com.
Lin Qiao leads LinkedIn's data lifecycle management for analytics, covering data ingestion, data quality, and workflow management.
Gobblin: A Framework for Solving Big Data Ingestion Problem
1. ©2014 LinkedIn Corporation. All Rights Reserved.
Gobblin: Big Data with Ease
Lin Qiao
Data Analytics Infra @ LinkedIn
2. InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/gobblin-linkedin
3. Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
4. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
6. Perception
[Diagram labels: Primary Data Sources; Ingest Framework; Analytics Platform; Load; Validation; Transformations; Business Facing Insights; Member Facing Insights and Data Products]
7. Reality
[Diagram labels: Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Profile Data; Databus; Changes; Change dump on filer; External Partner Data; Ingest utilities; Ingest Framework; Camus; Lumos; Hadoop; Teradata; DWH ETL (fact tables); Lassen (facts and dimensions); Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Read store (Voldemort); Site (Member Facing Products); Product, Sciences, Enterprise Analytics; Enterprise Products]
8. Challenges @ LinkedIn
• Large variety of data sources
• Multi-paradigm: streaming data, batch data
• Different types of data: facts, dimensions, logs, snapshots, increments, changelog
• Operational complexity of multiple pipelines
• Data quality
• Data availability and predictability
• Engineering cost
9. Open source solutions
• sqoop
• flume
• morphline
• RDBMS vendor-specific connectors
• aegisthus
• logstash
• Camus
10. Goals
• Unified and Structured Data Ingestion Flow
  - RDBMS -> Hadoop
  - Event Streams -> Hadoop
• Higher level abstractions
  - Facts, Dimensions
  - Snapshots, increments, changelog
• ELT oriented
  - Minimize transformation in the ingest pipeline
11. Central Ingestion Pipeline
[Diagram labels: Kafka; Tracking; R/W store (Oracle/Espresso); OLTP Data; Databus; Changes; Change dump on filer; External Partner Data; REST; JDBC; SOAP; Custom; Gobblin; Compaction; Hadoop; Teradata; DWH ETL (fact tables); Core Data Set (Tracking, Database, External); Derived Data Set; Site (Member Facing Products); Product, Sciences, Enterprise Analytics; Enterprise Products]
12. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
13. Gobblin Usage @ LinkedIn
• Business Analytics
  - Source data for sales analysis, product sentiment analysis, etc.
• Engineering
  - Source data for issue tracking, monitoring, product release, security compliance, A/B testing
• Consumer product
  - Source data for acquisition integration
  - Performance analysis for email campaigns, ad campaigns, etc.
14. Key Features
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
15. Scalable and Robust Framework
• Scalable: Jobs are partitioned into tasks that run concurrently
• Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Fault Tolerant: Framework gracefully deals with machine and job failures
• Quality Assurance: Baked-in quality checking throughout the flow
16. Unified computation paradigm
• Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines
• Shared infra components: Shared job state management, job metrics store, metadata management
17. Turn Key Solution
• Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP)
• Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, and internal dropbox
• Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets
• Policy driven flow execution & tuning: Flow owners just need to specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
18. Customize Your Own Ingestion Pipeline
• Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API
• Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code
19. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Lookahead
21. Computation Model
• Gobblin standalone
  - Single process, multi-threading
  - Testing, small data, sampling
• Gobblin on Map/Reduce
  - Large datasets, horizontally scalable
• Gobblin on Yarn
  - Better resource utilization
  - More scheduling flexibility
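As a rough illustration of the standalone mode (a single process that runs tasks on several threads), here is a minimal Python sketch. The `run_task` function and the work-unit dictionaries are hypothetical stand-ins chosen for this example, not Gobblin's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(work_unit):
    # Hypothetical task body: pretend to ingest the rows of one work unit.
    return {"id": work_unit["id"], "rows": len(work_unit["rows"])}

def run_standalone(work_units, max_threads=4):
    # One process, several threads: each work unit becomes one task.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(run_task, work_units))

work_units = [{"id": i, "rows": list(range(i * 10))} for i in range(3)]
results = run_standalone(work_units)
```

`pool.map` returns results in the order of the input work units, which keeps the sketch deterministic even though the tasks run concurrently.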
22. Scalable Ingestion Flow
[Diagram: a Source creates Work Units; each Work Unit runs as a Task made up of an Extractor, Converter, Quality Checker, and Writer; a Data Publisher handles the output of the Tasks]
23. Sources
• Determines how to partition work
  - Partitioning algorithm can leverage source sharding
  - Group partitions intelligently for performance
• Creates work-units to be scheduled
[Pipeline diagram: Source, Work Unit, Extractor, Converter, Quality Checker, Writer, Publisher]
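One way to picture how a source might group shards into work units is a greedy balancing pass. The sketch below is an assumption made for illustration: the `create_work_units` function, the shard names, and the row-count estimates are all hypothetical, and the slides do not say which grouping algorithm Gobblin uses.

```python
def create_work_units(shards, max_units):
    """Group source shards into at most max_units work units.

    shards: mapping of shard name -> estimated row count (assumed input).
    Greedy balancing: place the largest remaining shard into the lightest unit.
    """
    units = [{"shards": [], "rows": 0} for _ in range(min(max_units, len(shards)))]
    for name, rows in sorted(shards.items(), key=lambda kv: kv[1], reverse=True):
        lightest = min(units, key=lambda u: u["rows"])
        lightest["shards"].append(name)
        lightest["rows"] += rows
    return units

shards = {"db0": 900, "db1": 500, "db2": 400, "db3": 100}
units = create_work_units(shards, max_units=2)
```

Each returned unit lists the shards it covers, so a scheduler could hand one unit to one task.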
24. Job Management
• Job execution states
  - Watermark
  - Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
[Diagram: Job run 1, Job run 2, and Job run 3 share a State Store]
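The idea of a state store that carries a watermark from one job run to the next can be sketched as follows. This is a minimal example under stated assumptions: the `StateStore` class, the JSON file layout, and the `ts` field are hypothetical and do not reflect Gobblin's actual state store format.

```python
import json
import os
import tempfile

class StateStore:
    """Hypothetical file-backed store holding the last committed job state."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {"watermark": 0}  # first run: nothing ingested yet
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        with open(self.path, "w") as f:
            json.dump(state, f)

def run_job(store, source_rows):
    state = store.load()
    # Pull only rows newer than the watermark left by the previous run.
    new_rows = [r for r in source_rows if r["ts"] > state["watermark"]]
    if new_rows:
        state["watermark"] = max(r["ts"] for r in new_rows)
    store.save(state)
    return new_rows

store = StateStore(os.path.join(tempfile.mkdtemp(), "job_state.json"))
rows = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
first = run_job(store, rows)                 # pulls all three rows
second = run_job(store, rows + [{"ts": 4}])  # pulls only the row with ts 4
```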
25. Gobblin Operator Flow
[Diagram: Extract Schema -> Convert Schema -> Extract Record -> Convert Record -> Check Record Data Quality -> Write Record -> Check Task Data Quality -> Commit Task Data]
26. Extractors
• Specifies how to get the schema and pull data from the source
• Returns a ResultSet iterator
• Tracks the high watermark
• Tracks extraction metrics
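The extractor responsibilities listed above (schema, record iteration, high watermark, metrics) could look roughly like this in Python. The `ListExtractor` class, its in-memory records, and the `ts` field are hypothetical and serve only to illustrate the idea; they are not Gobblin's Extractor interface.

```python
class ListExtractor:
    """Hypothetical extractor over an in-memory list of dict records."""

    def __init__(self, records, low_watermark):
        self.records = records
        self.low_watermark = low_watermark
        self.high_watermark = low_watermark
        self.records_read = 0  # a simple extraction metric

    def get_schema(self):
        # The schema here is just the sorted field names of the first record.
        return sorted(self.records[0]) if self.records else []

    def __iter__(self):
        for record in self.records:
            if record["ts"] <= self.low_watermark:
                continue  # already ingested by an earlier run
            self.records_read += 1
            self.high_watermark = max(self.high_watermark, record["ts"])
            yield record

extractor = ListExtractor([{"id": 1, "ts": 5}, {"id": 2, "ts": 9}], low_watermark=5)
pulled = list(extractor)
```

After iteration, `extractor.high_watermark` holds the newest timestamp seen, which a later run could use as its low watermark.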
27. Converters
• Allow for schema and data transformation
  - Filtering
  - Projection
  - Type conversion
  - Structural change
• Composable: can specify a list of converters to be applied in the given order
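Composable converters applied in a given order can be sketched as a list of small functions. The converter names below (`keep_active`, `project`, `cast`) and the convention that `None` means "drop the record" are assumptions made for this example.

```python
def keep_active(record):
    # Filtering: returning None drops the record.
    return record if record.get("active") else None

def project(fields):
    # Projection: keep only the named fields.
    def convert(record):
        return {k: record[k] for k in fields}
    return convert

def cast(field, to_type):
    # Type conversion for a single field.
    def convert(record):
        return {**record, field: to_type(record[field])}
    return convert

def apply_converters(record, converters):
    # Converters run in the given order; a None result short-circuits.
    for converter in converters:
        record = converter(record)
        if record is None:
            return None
    return record

chain = [keep_active, project(["id", "score"]), cast("score", float)]
out = apply_converters({"id": 7, "score": "3.5", "active": True, "note": "x"}, chain)
dropped = apply_converters({"id": 8, "score": "1", "active": False}, chain)
```

Order matters here: the filter has to run before the projection, because the projection removes the `active` field that the filter reads.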
28. Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per record, per task, or per job basis
• Can specify a list of quality checkers to be applied
  - Schema compatibility
  - Audit check
  - Sensitive fields
  - Unique key
• Policy driven
  - FAIL: if the check fails, then so does the job
  - OPTIONAL: if the check fails, the job continues
  - ERR_FILE: the offending row is written to an error file
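The three policies can be illustrated with a record-level unique-key check. This is a sketch under assumptions: the function names, the in-memory `error_rows` list standing in for an error file, and the choice to keep a failing record under OPTIONAL are all hypothetical.

```python
class QualityCheckError(Exception):
    pass

def has_unique_key(record, seen):
    # Record-level check: the "id" field must not repeat.
    if record["id"] in seen:
        return False
    seen.add(record["id"])
    return True

def run_checks(records, policy):
    """Apply the unique-key check under FAIL, OPTIONAL, or ERR_FILE."""
    seen, passed, error_rows = set(), [], []
    for record in records:
        if has_unique_key(record, seen):
            passed.append(record)
        elif policy == "FAIL":
            raise QualityCheckError(f"duplicate key {record['id']}")
        elif policy == "ERR_FILE":
            error_rows.append(record)  # stand-in for writing to an error file
        else:
            # OPTIONAL: the failure is ignored and the record is kept (an assumption).
            passed.append(record)
    return passed, error_rows

records = [{"id": 1}, {"id": 2}, {"id": 1}]
passed, errors = run_checks(records, "ERR_FILE")
```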
29. Writers
• Writing data in Avro format onto HDFS
  - One writer per task
• Flexibility
  - Configurable compression codec (Deflate, Snappy)
  - Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
30. Publishers
• Determines job success based on policy
  - COMMIT_ON_FULL_SUCCESS
  - COMMIT_ON_PARTIAL_SUCCESS
• Commits data to final directories based on job success
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 to a Tmp Dir; the files are then committed to the Final Dir]
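The commit step can be sketched as moving files from a temporary directory to a final one, gated by the policy. The interpretation below is an assumption based on the policy names: full success commits only when every task succeeded, and partial success commits the output of whichever tasks succeeded. The `publish` function and the file names are hypothetical.

```python
import os
import shutil
import tempfile

def publish(task_results, tmp_dir, final_dir, policy):
    """Move task output files from tmp_dir to final_dir according to policy.

    task_results: mapping of file name -> True if its task succeeded.
    """
    succeeded = [name for name, ok in task_results.items() if ok]
    if policy == "COMMIT_ON_FULL_SUCCESS" and len(succeeded) < len(task_results):
        return []  # at least one task failed, so nothing is committed
    os.makedirs(final_dir, exist_ok=True)
    for name in succeeded:
        shutil.move(os.path.join(tmp_dir, name), os.path.join(final_dir, name))
    return succeeded

root = tempfile.mkdtemp()
tmp_dir, final_dir = os.path.join(root, "tmp"), os.path.join(root, "final")
os.makedirs(tmp_dir)
for name in ("file1", "file2", "file3"):
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("data")

results = {"file1": True, "file2": True, "file3": False}
full = publish(results, tmp_dir, final_dir, "COMMIT_ON_FULL_SUCCESS")
partial = publish(results, tmp_dir, final_dir, "COMMIT_ON_PARTIAL_SUCCESS")
```

Writing to a temporary directory first means readers of the final directory never see the output of a job that was not committed.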
31. Gobblin Compaction
• Dimensions:
  - Initial full dump followed by incremental extracts in Gobblin
  - Maintain a consistent snapshot by doing regularly scheduled compaction
• Facts:
  - Merge small files
[Diagram labels: Ingestion; HDFS; Compaction]
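For dimension data, compaction can be thought of as merging incremental extracts into the last snapshot and keeping the newest row per key. The sketch below assumes each row carries a key field (`id`) and a version field (`ts`); both names, and the `compact` function itself, are hypothetical.

```python
def compact(snapshot, increments, key="id", version="ts"):
    """Merge incremental extracts into a snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

snapshot = [{"id": 1, "ts": 1, "title": "a"}, {"id": 2, "ts": 1, "title": "b"}]
increments = [{"id": 2, "ts": 2, "title": "b2"}, {"id": 3, "ts": 2, "title": "c"}]
compacted = compact(snapshot, increments)
```

Running this on a regular schedule would keep the snapshot consistent without repeating the initial full dump.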
32. Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
33. Gobblin in Production
Data Volume
• > 350 datasets
• ~ 60 TB per day
Production Instances
• Salesforce
• Responsys
• RightNow
• Timeforce
• Slideshare
• Newsle
• A/B testing
• LinkedIn JIRA
• Data retention
34. Lessons Learned
• Data quality still needs a lot more work
• The small data problem is not small
• Performance optimization opportunities
• Operational traits
35. Gobblin Roadmap
• Gobblin on Yarn
• Streaming Sources
• Gobblin Workbench with ingestion DSL
• Data Profiling for richer quality checking
• Open source in Q4'14