©2014 LinkedIn Corporation. All Rights Reserved.
Gobblin’ Big Data with Ease
Lin Qiao
Data Analytics Infra @ LinkedIn
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/gobblin-linkedin
Presented at QCon San Francisco
www.qconsf.com
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
Perception
[Diagram labels: Primary Data Sources, Ingest Framework, Analytics Platform, Transformations, Load, Validation, Business Facing Insights, Member Facing Insights and Data Products]
Reality
[Diagram labels: Kafka, Activity (tracking) Data, R/W store (Oracle/Espresso), Profile Data, Databus Changes, External Partner Data, Change dump on filer, Ingest Framework, Ingest utilities, Camus, Lumos, Lassen (facts and dimensions), Hadoop, Teradata, DWH ETL (fact tables), Core Data Set (Tracking, Database, External), Derived Data Set, Computed Results for Member Facing Products, Read store (Voldemort), Site (Member Facing Products), Product, Sciences, Enterprise Analytics, Enterprise Products]
Challenges @ LinkedIn
• Large variety of data sources
• Multi-paradigm: streaming data, batch data
• Different types of data: facts, dimensions, logs, snapshots, increments, changelog
• Operational complexity of multiple pipelines
• Data quality
• Data availability and predictability
• Engineering cost
Open source solutions
• Sqoop
• Flume
• Morphline
• RDBMS vendor-specific connectors
• Aegisthus
• Logstash
• Camus
Goals
• Unified and Structured Data Ingestion Flow
– RDBMS -> Hadoop
– Event Streams -> Hadoop
• Higher level abstractions
– Facts, Dimensions
– Snapshots, increments, changelog
• ELT oriented
– Minimize transformation in the ingest pipeline
Central Ingestion Pipeline
[Diagram labels: Kafka, Tracking, R/W store (Oracle/Espresso), OLTP Data, Databus Changes, External Partner Data, Change dump on filer, REST, JDBC, SOAP, Custom, Gobblin, Compaction, Hadoop, Teradata, DWH ETL (fact tables), Core Data Set (Tracking, Database, External), Derived Data Set, Site (Member Facing Products), Product, Sciences, Enterprise Analytics, Enterprise Products]
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
Gobblin Usage @ LinkedIn
• Business Analytics
– Source data for sales analysis, product sentiment analysis, etc.
• Engineering
– Source data for issue tracking, monitoring, product release, security compliance, A/B testing
• Consumer product
– Source data for acquisition integration
– Performance analysis for email campaigns, ad campaigns, etc.
Key Features
• Horizontally scalable and robust framework
• Unified computation paradigm
• Turn-key solution
• Customize your own ingestion
Scalable and Robust Framework
• Scalable: Jobs are partitioned into tasks that run concurrently.
• Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc.
• Fault Tolerant: The framework gracefully deals with machine and job failures.
• Quality Assurance: Quality checking is baked in throughout the flow.
Unified computation paradigm
• Common execution flow: Batch ingestion and streaming ingestion pipelines share a common execution flow.
• Shared infra components: Shared job state management, job metrics store, and metadata management.
Turn Key Solution
• Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP).
• Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQL Server, Oracle, Salesforce, HDFS, filer, and internal dropbox.
• Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets.
• Policy-driven flow execution & tuning: Flow owners just need to specify pre-defined policies for handling job failure, degree of parallelism, what data to publish, etc.
Customize Your Own Ingestion Pipeline
• Extendable Operators: Operators for extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API.
• Configurable Operator Flow: Configuration provides multiple plugin points for adding customized logic and code.
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Lookahead
Under the Hood
Computation Model
• Gobblin standalone
– Single process, multi-threading
– Testing, small data, sampling
• Gobblin on Map/Reduce
– Large datasets, horizontally scalable
• Gobblin on YARN
– Better resource utilization
– More scheduling flexibility
Scalable Ingestion Flow
[Diagram: a Source creates Work Units; each Work Unit is run by a Task made up of an Extractor, Converter, Quality Checker, and Writer; a Data Publisher follows the tasks]
Sources
• Determine how to partition work
– Partitioning algorithm can leverage source sharding
– Group partitions intelligently for performance
• Create work units to be scheduled
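The partitioning step could be sketched roughly as follows. This is a minimal illustration, not Gobblin's API: the class `WorkUnitPartitioner`, the `WorkUnit` type, and the idea of splitting a numeric range evenly are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not Gobblin's actual API.
final class WorkUnitPartitioner {

    /** One slice of the source, identified by a half-open range [low, high). */
    static final class WorkUnit {
        final long low;
        final long high;
        WorkUnit(long low, long high) { this.low = low; this.high = high; }
        @Override public String toString() { return "[" + low + ", " + high + ")"; }
    }

    /** Splits [low, high) into at most maxUnits contiguous, non-overlapping work units. */
    static List<WorkUnit> partition(long low, long high, int maxUnits) {
        if (high < low || maxUnits < 1) {
            throw new IllegalArgumentException("need low <= high and maxUnits >= 1");
        }
        List<WorkUnit> units = new ArrayList<>();
        long total = high - low;
        if (total == 0) {
            return units; // nothing to pull
        }
        long size = (total + maxUnits - 1) / maxUnits; // ceiling division
        for (long start = low; start < high; start += size) {
            units.add(new WorkUnit(start, Math.min(start + size, high)));
        }
        return units;
    }

    public static void main(String[] args) {
        // Each printed range could be handed to one task.
        System.out.println(partition(0, 10, 3));
    }
}
```

A real source would likely partition along its own shards or tables rather than an evenly divided range; the even split is only the simplest case to show.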
Job Management
• Job execution states
– Watermark
– Task state, job state, quality checker output, error code
• Job synchronization
• Job failure handling: policy driven
[Diagram: a State Store shared across Job run 1, Job run 2, and Job run 3]
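To illustrate how a watermark carried in a state store lets each job run pick up where the previous one stopped, here is a small sketch. The in-memory `StateStore`, the `runJob` helper, and the use of timestamps as watermarks are assumptions for the example, not the framework's actual classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a real state store would be persistent.
final class WatermarkStateSketch {

    /** Stand-in for a state store; keeps the last committed watermark per job. */
    static final class StateStore {
        private final Map<String, Long> watermarks = new HashMap<>();
        long lastWatermark(String job) { return watermarks.getOrDefault(job, Long.MIN_VALUE); }
        void commit(String job, long watermark) { watermarks.put(job, watermark); }
    }

    /** Runs one job: counts the records newer than the previous watermark, then advances it. */
    static int runJob(String job, List<Long> sourceTimestamps, StateStore store) {
        long low = store.lastWatermark(job);
        long high = low;
        int pulled = 0;
        for (long ts : sourceTimestamps) {
            if (ts > low) {
                pulled++;
                high = Math.max(high, ts);
            }
        }
        store.commit(job, high);
        return pulled;
    }

    public static void main(String[] args) {
        StateStore store = new StateStore();
        System.out.println(runJob("demo", List.of(1L, 2L, 3L), store));     // first run pulls all three
        System.out.println(runJob("demo", List.of(1L, 2L, 3L, 4L), store)); // second run pulls only ts=4
    }
}
```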
Gobblin Operator Flow
[Diagram steps: Extract Schema; Convert Schema; Extract Record; Convert Record; Check Record Data Quality; Write Record; Check Task Data Quality; Commit Task Data]
Extractors
• Specify how to get the schema and pull data from the source
• Return a ResultSet iterator
• Track the high watermark
• Track extraction metrics
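The extractor responsibilities listed above might look like this in miniature. The `ListExtractor` class, its `Row` type, and the schema string are invented for illustration; an in-memory list stands in for a real source such as a JDBC result set.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of an extractor; not Gobblin's actual interface.
final class ExtractorSketch {

    /** A record with a timestamp used as the watermark column (illustrative). */
    static final class Row {
        final long updatedAt;
        final String payload;
        Row(long updatedAt, String payload) { this.updatedAt = updatedAt; this.payload = payload; }
    }

    /** Wraps a row iterator and tracks the high watermark and a record count as rows are read. */
    static final class ListExtractor implements Iterator<Row> {
        private final Iterator<Row> rows;
        private long highWatermark = Long.MIN_VALUE;
        private long recordsRead = 0;

        ListExtractor(List<Row> source) { this.rows = source.iterator(); }

        /** Schema of the extracted records, here just a descriptive string. */
        String schema() { return "updatedAt:long,payload:string"; }

        @Override public boolean hasNext() { return rows.hasNext(); }

        @Override public Row next() {
            Row row = rows.next();
            highWatermark = Math.max(highWatermark, row.updatedAt);
            recordsRead++;
            return row;
        }

        long highWatermark() { return highWatermark; }
        long recordsRead() { return recordsRead; }
    }

    public static void main(String[] args) {
        ListExtractor ex = new ListExtractor(
                List.of(new Row(5, "a"), new Row(9, "b"), new Row(7, "c")));
        while (ex.hasNext()) {
            ex.next();
        }
        System.out.println(ex.schema() + " watermark=" + ex.highWatermark()
                + " records=" + ex.recordsRead());
    }
}
```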
Converters
• Allow for schema and data transformation
– Filtering
– Projection
– Type conversion
– Structural change
• Composable: can specify a list of converters to be applied in the given order
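Composability can be sketched as a list of small functions applied in order. Everything here is an assumption for illustration: records are plain maps, and the `Converter` interface and the `keepCountry`, `project`, and `toLong` helpers are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of a converter chain; not Gobblin's actual API.
final class ConverterChainSketch {

    /** A converter maps a record to a new record, or to empty to filter it out. */
    interface Converter extends Function<Map<String, Object>, Optional<Map<String, Object>>> {}

    /** Filtering: drops records whose "country" field is not the given value. */
    static Converter keepCountry(String country) {
        return r -> country.equals(r.get("country")) ? Optional.of(r) : Optional.empty();
    }

    /** Projection: keeps only the named fields. */
    static Converter project(String... fields) {
        return r -> {
            Map<String, Object> out = new HashMap<>();
            for (String f : fields) {
                if (r.containsKey(f)) {
                    out.put(f, r.get(f));
                }
            }
            return Optional.of(out);
        };
    }

    /** Type conversion: parses a string field into a Long. */
    static Converter toLong(String field) {
        return r -> {
            Map<String, Object> out = new HashMap<>(r);
            out.put(field, Long.parseLong(String.valueOf(r.get(field))));
            return Optional.of(out);
        };
    }

    /** Applies the converters in list order, stopping as soon as one filters the record out. */
    static Optional<Map<String, Object>> applyAll(List<Converter> chain, Map<String, Object> record) {
        Optional<Map<String, Object>> current = Optional.of(record);
        for (Converter c : chain) {
            if (!current.isPresent()) {
                break;
            }
            current = c.apply(current.get());
        }
        return current;
    }

    public static void main(String[] args) {
        List<Converter> chain = List.of(keepCountry("US"), project("id"), toLong("id"));
        Map<String, Object> record = Map.of("id", "42", "country", "US", "name", "x");
        System.out.println(applyAll(chain, record));
    }
}
```

Order matters in this sketch: the filter runs before the projection because the projection drops the `country` field the filter depends on.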
Quality Checkers
• Ensure quality of any data produced by Gobblin
• Can be run on a per record, per task, or per job basis
• Can specify a list of quality checkers to be applied
– Schema compatibility
– Audit check
– Sensitive fields
– Unique key
• Policy driven
– FAIL – if the check fails, then so does the job
– OPTIONAL – if the check fails, the job continues
– ERR_FILE – the offending row is written to an error file
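The three policies can be illustrated with a row-level check. This is a sketch under assumptions: the `check` helper, the `Result` type, and the choice to keep the row under OPTIONAL are not taken from the slides, which only say that the job continues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of policy-driven row-level checking; not Gobblin's actual API.
final class QualityPolicySketch {

    enum Policy { FAIL, OPTIONAL, ERR_FILE }

    /** Outcome of a run: rows that may be written, and rows diverted to the error file. */
    static final class Result {
        final List<String> accepted = new ArrayList<>();
        final List<String> errorFile = new ArrayList<>();
    }

    /** Applies one row-level check to every row and handles failures according to the policy. */
    static Result check(List<String> rows, Predicate<String> check, Policy policy) {
        Result result = new Result();
        for (String row : rows) {
            if (check.test(row)) {
                result.accepted.add(row);
                continue;
            }
            switch (policy) {
                case FAIL:
                    // A failed check fails the whole job.
                    throw new IllegalStateException("quality check failed for row: " + row);
                case OPTIONAL:
                    // Assumption: the failure is tolerated and the row is kept.
                    result.accepted.add(row);
                    break;
                case ERR_FILE:
                    // The offending row is set aside, standing in for an error file.
                    result.errorFile.add(row);
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Predicate<String> nonEmptyKey = row -> !row.startsWith(",");
        Result r = check(List.of("1,a", ",b", "3,c"), nonEmptyKey, Policy.ERR_FILE);
        System.out.println("accepted=" + r.accepted + " errors=" + r.errorFile);
    }
}
```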
Writers
• Writing data in Avro format onto HDFS
– One writer per task
• Flexibility
– Configurable compression codec (Deflate, Snappy)
– Configurable buffer size
• Plan to support other data formats (Parquet, ORC)
Publishers
• Determine job success based on policy
– COMMIT_ON_FULL_SUCCESS
– COMMIT_ON_PARTIAL_SUCCESS
• Commit data to final directories based on job success
[Diagram: Task 1, Task 2, and Task 3 write File 1, File 2, and File 3 to the Tmp Dir; the files are then moved to the Final Dir]
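The commit step can be sketched with local directories standing in for HDFS. The `publish` helper and its task-status map are hypothetical, and the reading of COMMIT_ON_PARTIAL_SUCCESS as "publish the output of the tasks that succeeded" is an assumption, since the slides name the policy without defining it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

// Hypothetical sketch of a publisher; local files stand in for HDFS.
final class PublisherSketch {

    enum CommitPolicy { COMMIT_ON_FULL_SUCCESS, COMMIT_ON_PARTIAL_SUCCESS }

    /**
     * Moves task output from tmpDir to finalDir and returns the number of files published.
     * taskSucceeded maps each task's output file name to whether that task succeeded.
     */
    static int publish(Path tmpDir, Path finalDir, Map<String, Boolean> taskSucceeded,
                       CommitPolicy policy) throws IOException {
        boolean allSucceeded = !taskSucceeded.containsValue(false);
        if (policy == CommitPolicy.COMMIT_ON_FULL_SUCCESS && !allSucceeded) {
            return 0; // nothing is published when any task failed
        }
        Files.createDirectories(finalDir);
        int published = 0;
        for (Map.Entry<String, Boolean> task : taskSucceeded.entrySet()) {
            if (task.getValue()) {
                Files.move(tmpDir.resolve(task.getKey()), finalDir.resolve(task.getKey()),
                        StandardCopyOption.REPLACE_EXISTING);
                published++;
            }
        }
        return published;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("gobblin-tmp");
        Path fin = Files.createTempDirectory("gobblin-final");
        Files.write(tmp.resolve("file1"), "a".getBytes());
        Files.write(tmp.resolve("file2"), "b".getBytes());
        Map<String, Boolean> tasks = Map.of("file1", true, "file2", false);
        System.out.println(publish(tmp, fin, tasks, CommitPolicy.COMMIT_ON_FULL_SUCCESS));
        System.out.println(publish(tmp, fin, tasks, CommitPolicy.COMMIT_ON_PARTIAL_SUCCESS));
    }
}
```

Writing to a temporary directory first means readers of the final directory never see the output of a job that did not meet its commit policy.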
Gobblin Compaction
• Dimensions:
– Initial full dump followed by incremental extracts in Gobblin
– Maintain a consistent snapshot by doing regularly scheduled compaction
• Facts:
– Merge small files
[Diagram: Ingestion, HDFS, Compaction]
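For dimension data, compaction amounts to folding incremental extracts into the last full snapshot. The sketch below assumes each row carries a primary key and a modification time and that the newest version of a key wins; the `Row` type and `compact` method are hypothetical, and deletes are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of snapshot compaction; not Gobblin's actual implementation.
final class CompactionSketch {

    /** A dimension row version: primary key, modification time, and value (illustrative). */
    static final class Row {
        final String key;
        final long modifiedAt;
        final String value;
        Row(String key, long modifiedAt, String value) {
            this.key = key;
            this.modifiedAt = modifiedAt;
            this.value = value;
        }
    }

    /** Merges a full snapshot with incremental extracts, keeping the newest version of each key. */
    static List<Row> compact(List<Row> snapshot, List<Row> increments) {
        Map<String, Row> latest = new TreeMap<>(); // sorted by key for stable output
        List<Row> all = new ArrayList<>(snapshot);
        all.addAll(increments);
        for (Row row : all) {
            Row seen = latest.get(row.key);
            if (seen == null || row.modifiedAt >= seen.modifiedAt) {
                latest.put(row.key, row);
            }
        }
        return new ArrayList<>(latest.values());
    }

    public static void main(String[] args) {
        List<Row> snapshot = List.of(new Row("1", 10, "alice"), new Row("2", 10, "bob"));
        List<Row> increments = List.of(new Row("2", 20, "bobby"), new Row("3", 20, "carol"));
        for (Row r : compact(snapshot, increments)) {
            System.out.println(r.key + " -> " + r.value);
        }
    }
}
```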
Overview
• Challenges
• What does Gobblin provide?
• How does Gobblin work?
• Retrospective and lookahead
Gobblin in Production
Production Instances
• Salesforce
• Responsys
• RightNow
• Timeforce
• Slideshare
• Newsle
• A/B testing
• LinkedIn JIRA
• Data retention
Data Volume
• > 350 datasets
• ~ 60 TB per day
Lessons Learned
• Data quality still needs a lot more work
• The small data problem is not small
• Performance optimization opportunities
• Operational traits
Gobblin Roadmap
• Gobblin on YARN
• Streaming Sources
• Gobblin Workbench with ingestion DSL
• Data Profiling for richer quality checking
• Open source in Q4’14
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
ย 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
ย 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online  โ˜‚๏ธ
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Kakori Lucknow best sexual service Online โ˜‚๏ธ
ย 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
ย 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
ย 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlanโ€™s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlanโ€™s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlanโ€™s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlanโ€™s ...
ย 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
ย 

Gobblin: A Framework for Solving Big Data Ingestion Problem

  • 1. ©2014 LinkedIn Corporation. All Rights Reserved. Gobblin’ Big Data with Ease. Lin Qiao, Data Analytics Infra @ LinkedIn
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/gobblin-linkedin
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 5. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 6. Perception (diagram labels: Analytics Platform; Ingest Framework; Primary Data Sources; Transformations; Business Facing Insights; Member Facing Insights and Data Products; Load; Validation)
  • 7. Reality (diagram labels: Hadoop; Camus; Lumos; Teradata; External Partner Data; Ingest Framework; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Profile Data; Databus; Changes; Derived Data Set; Core Data Set (Tracking, Database, External); Computed Results for Member Facing Products; Enterprise Products; Change dump on filer; Ingest utilities; Lassen (facts and dimensions); Read store (Voldemort))
  • 8. Challenges @ LinkedIn • Large variety of data sources • Multi-paradigm: streaming data, batch data • Different types of data: facts, dimensions, logs, snapshots, increments, changelog • Operational complexity of multiple pipelines • Data quality • Data availability and predictability • Engineering cost
  • 9. Open source solutions: Sqoop, Flume, Morphline, RDBMS vendor-specific connectors, Aegisthus, Logstash, Camus
  • 10. Goals • Unified and Structured Data Ingestion Flow – RDBMS -> Hadoop – Event Streams -> Hadoop • Higher level abstractions – Facts, Dimensions – Snapshots, increments, changelog • ELT oriented – Minimize transformation in the ingest pipeline
  • 11. Central Ingestion Pipeline (diagram labels: Hadoop; Teradata; External Partner Data; Gobblin; DWH ETL (fact tables); Product, Sciences, Enterprise Analytics; Site (Member Facing Products); Kafka; Tracking; R/W store (Oracle/Espresso); OLTP Data; Databus; Changes; Derived Data Set; Core Data Set (Tracking, Database, External); Enterprise Products; Change dump on filer; REST; JDBC; SOAP; Custom; Compaction)
  • 12. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 13. Gobblin Usage @ LinkedIn • Business Analytics – Source data for sales analysis, product sentiment analysis, etc. • Engineering – Source data for issue tracking, monitoring, product release, security compliance, A/B testing • Consumer product – Source data for acquisition integration – Performance analysis for email campaign, ads campaign, etc.
  • 14. Key Features • Horizontally scalable and robust framework • Unified computation paradigm • Turn-key solution • Customize your own ingestion
  • 15. Scalable and Robust Framework • Scalable: Jobs are partitioned into tasks that run concurrently. • Centralized State Management: State is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc. • Fault Tolerant: Framework gracefully deals with machine and job failures. • Quality Assurance: Baked-in quality checking throughout the flow.
  • 16. Unified computation paradigm • Common execution flow: Common execution flow between batch ingestion and streaming ingestion pipelines. • Shared infra components: Shared job state management, job metrics store, metadata management.
  • 17. Turn Key Solution • Built-in Exchange Protocols: Existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.) • Built-in Source Integration: Fully integrated with commonly used sources, including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, and internal dropbox. • Built-in Data Ingestion Semantics: Covers full dump and incremental ingestion for fact and dimension datasets. • Policy driven flow execution & tuning: Flow owners just need to specify pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
  • 18. Customize Your Own Ingestion Pipeline • Extendable Operators: Operators for doing extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API. • Configurable Operator Flow: Configuration allows for multiple plugin points to add in customized logic and code.
  • 19. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Lookahead
  • 20. Under the Hood
  • 21. Computation Model • Gobblin standalone – Single process, multi-threading – Testing, small data, sampling • Gobblin on Map/Reduce – Large datasets, horizontally scalable • Gobblin on Yarn – Better resource utilization – More scheduling flexibility
  • 22. Scalable Ingestion Flow (diagram: a Source creates three Work Units; each Work Unit runs as a Task made of Extractor, Converter, Quality Checker and Writer; a Data Publisher follows the tasks)
  • 23. Sources • Determines how to partition work – Partitioning algorithm can leverage source sharding – Group partitions intelligently for performance • Creates work-units to be scheduled
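The partitioning step on the Sources slide can be illustrated with a small sketch. It is not Gobblin's API: `WorkUnit` and `create_work_units` are hypothetical names, and it assumes the source can be addressed by an integer watermark range that is split into contiguous, near-equal slices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical work unit: a half-open watermark range [low, high)."""
    low: int
    high: int


def create_work_units(low: int, high: int, max_units: int) -> List[WorkUnit]:
    """Split [low, high) into at most max_units contiguous, near-equal ranges."""
    if max_units < 1:
        raise ValueError("max_units must be at least 1")
    total = high - low
    if total <= 0:
        return []  # nothing new to pull
    units = min(max_units, total)
    base, extra = divmod(total, units)
    result = []
    start = low
    for i in range(units):
        # The first `extra` units take one additional element each.
        size = base + (1 if i < extra else 0)
        result.append(WorkUnit(start, start + size))
        start += size
    return result


work_units = create_work_units(0, 10, 3)
```

Each resulting range could then be handed to its own task, which is one way to get the concurrency the slide describes.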
  • 24. Job Management • Job execution states – Watermark – Task state, job state, quality checker output, error code • Job synchronization • Job failure handling: policy driven (diagram: Job run 1, Job run 2 and Job run 3 share a State Store)
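To make the idea of state carried between job runs concrete, here is a minimal sketch of a watermark kept in a state store. The `StateStore` class, the JSON file layout and the `id` column are assumptions made for illustration; the slides do not say how Gobblin persists its state.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class StateStore:
    """Hypothetical file-backed store keeping one state dict per job name."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _read_all(self) -> Dict[str, dict]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def load(self, job_name: str) -> dict:
        return self._read_all().get(job_name, {})

    def save(self, job_name: str, state: dict) -> None:
        all_states = self._read_all()
        all_states[job_name] = state
        self.path.write_text(json.dumps(all_states))


def run_job(store: StateStore, job_name: str, source_rows: List[dict]) -> List[dict]:
    """Pull only rows above the previous run's watermark, then advance it."""
    previous = store.load(job_name)
    low_watermark = previous.get("watermark", -1)
    new_rows = [row for row in source_rows if row["id"] > low_watermark]
    if new_rows:
        store.save(job_name, {"watermark": max(row["id"] for row in new_rows)})
    return new_rows


with tempfile.TemporaryDirectory() as tmp:
    store = StateStore(Path(tmp) / "state.json")
    rows = [{"id": 1}, {"id": 2}]
    first_run = run_job(store, "demo", rows)   # pulls ids 1 and 2
    rows.append({"id": 3})
    second_run = run_job(store, "demo", rows)  # pulls only id 3
    third_run = run_job(store, "demo", rows)   # nothing new
```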
  • 25. Gobblin Operator Flow (diagram labels: Extract Schema; Extract Record; Convert Record; Check Record Data Quality; Write Record; Convert Schema; Check Task Data Quality; Commit Task Data)
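The operator flow can be read as a per-task loop. The sketch below assumes one plausible ordering of the diagram's steps (schema first, then a per-record loop, then a task-level check before commit); the function `run_task` and its callback parameters are hypothetical and not taken from Gobblin.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


def run_task(
    schema: List[str],
    records: Iterable[Record],
    convert_schema: Callable[[List[str]], List[str]],
    convert_record: Callable[[Record], Record],
    check_record: Callable[[Record], bool],
    check_task: Callable[[List[Record]], bool],
) -> Tuple[List[str], List[Record], bool]:
    """Run one task and report (output schema, staged records, commit decision)."""
    out_schema = convert_schema(schema)      # Extract Schema, Convert Schema
    staged: List[Record] = []
    for record in records:                   # Extract Record
        converted = convert_record(record)   # Convert Record
        if check_record(converted):          # Check Record Data Quality
            staged.append(converted)         # Write Record (to staging)
    committed = check_task(staged)           # Check Task Data Quality
    return out_schema, staged, committed     # caller commits only if True


out_schema, staged, committed = run_task(
    schema=["id", "name"],
    records=[{"id": 1, "name": "ann"}, {"id": None, "name": "bob"}],
    convert_schema=lambda cols: [c.upper() for c in cols],
    convert_record=lambda r: {k.upper(): v for k, v in r.items()},
    check_record=lambda r: r["ID"] is not None,
    check_task=lambda rows: len(rows) > 0,
)
```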
  • 26. Extractors • Specifies how to get the schema and pull data from the source • Returns a ResultSet iterator • Tracks the high watermark • Tracks extraction metrics
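An extractor with these four responsibilities might look like the following sketch. `ListExtractor` is a hypothetical in-memory stand-in for a real source, and the only metric it tracks is a record count.

```python
from typing import Dict, Iterator, List, Optional


class ListExtractor:
    """Hypothetical extractor over an in-memory list of rows."""

    def __init__(self, schema: List[str], rows: List[Dict[str, int]],
                 watermark_column: str) -> None:
        self._schema = schema
        self._rows = rows
        self._watermark_column = watermark_column
        self.high_watermark: Optional[int] = None  # highest value seen so far
        self.records_read = 0                      # simple extraction metric

    def get_schema(self) -> List[str]:
        return list(self._schema)

    def __iter__(self) -> Iterator[Dict[str, int]]:
        for row in self._rows:
            value = row[self._watermark_column]
            if self.high_watermark is None or value > self.high_watermark:
                self.high_watermark = value
            self.records_read += 1
            yield row


extractor = ListExtractor(
    schema=["id", "ts"],
    rows=[{"id": 1, "ts": 10}, {"id": 2, "ts": 30}, {"id": 3, "ts": 20}],
    watermark_column="ts",
)
pulled = list(extractor)
```

After a full pass, `extractor.high_watermark` is the value a later run could start from.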
  • 27. Converters • Allow for schema and data transformation – Filtering – Projection – Type conversion – Structural change • Composable: can specify a list of converters to be applied in the given order
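Composability can be shown with plain functions applied in list order. This is a sketch, not Gobblin's converter interface: it assumes a converter takes a record and returns either a new record or `None` to filter it out, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
Converter = Callable[[Record], Optional[Record]]


def keep_active(record: Record) -> Optional[Record]:
    """Filtering: drop records that are not active."""
    return record if record.get("active") else None


def project_id_name(record: Record) -> Optional[Record]:
    """Projection: keep only the id and name fields."""
    return {"id": record["id"], "name": record["name"]}


def cast_id(record: Record) -> Optional[Record]:
    """Type conversion: turn the id string into an int."""
    return {**record, "id": int(str(record["id"]))}


def apply_converters(record: Record, converters: List[Converter]) -> Optional[Record]:
    """Apply converters in the given order; stop as soon as one filters the record."""
    current: Optional[Record] = record
    for convert in converters:
        current = convert(current)
        if current is None:
            return None
    return current


chain = [keep_active, project_id_name, cast_id]
source = [
    {"id": "1", "name": "a", "active": True, "extra": "x"},
    {"id": "2", "name": "b", "active": False, "extra": "y"},
]
converted = [r for r in (apply_converters(row, chain) for row in source) if r is not None]
```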
  • 28. Quality Checkers • Ensure quality of any data produced by Gobblin • Can be run on a per record, per task, or per job basis • Can specify a list of quality checkers to be applied – Schema compatibility – Audit check – Sensitive fields – Unique key • Policy driven – FAIL – if the check fails then so does the job – OPTIONAL – if the check fails the job continues – ERR_FILE – the offending row is written to an error file
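The three policies can be sketched for row-level checks as follows. The policy names come from the slide; everything else is assumed for illustration: the `check_rows` function, the in-memory list standing in for the error file, and the choice to keep an ERR_FILE row out of the normal output.

```python
import enum
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


class Policy(enum.Enum):
    FAIL = "FAIL"
    OPTIONAL = "OPTIONAL"
    ERR_FILE = "ERR_FILE"


class QualityCheckFailed(Exception):
    """Raised when a check with the FAIL policy does not pass."""


Check = Tuple[str, Callable[[Row], bool], Policy]


def check_rows(rows: List[Row], checks: List[Check]) -> Tuple[List[Row], List[Row], List[str]]:
    """Return (passed rows, rows for the error file, names of failed OPTIONAL checks)."""
    passed: List[Row] = []
    error_rows: List[Row] = []
    warnings: List[str] = []
    for row in rows:
        keep = True
        for name, predicate, policy in checks:
            if predicate(row):
                continue
            if policy is Policy.FAIL:
                raise QualityCheckFailed(name)  # the whole job fails
            if policy is Policy.ERR_FILE:
                error_rows.append(row)          # assumption: row is diverted
                keep = False
                break
            warnings.append(name)               # OPTIONAL: note it and continue
        if keep:
            passed.append(row)
    return passed, error_rows, warnings


checks = [
    ("id_present", lambda r: r["id"] is not None, Policy.ERR_FILE),
    ("email_present", lambda r: bool(r["email"]), Policy.OPTIONAL),
]
rows = [
    {"id": 1, "email": "a@x"},
    {"id": None, "email": "b@x"},
    {"id": 3, "email": ""},
]
passed, error_rows, warnings = check_rows(rows, checks)
```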
  • 29. Writers • Writing data in Avro format onto HDFS – One writer per task • Flexibility – Configurable compression codec (Deflate, Snappy) – Configurable buffer size • Plan to support other data formats (Parquet, ORC)
  • 30. Publishers • Determines job success based on policy – COMMIT_ON_FULL_SUCCESS – COMMIT_ON_PARTIAL_SUCCESS • Commits data to final directories based on job success (diagram: Task 1, Task 2 and Task 3 write File 1, File 2 and File 3 to a Tmp Dir; the files are then moved to the Final Dir)
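A publisher's commit step could be sketched as a move from a temporary directory to a final one, gated by the policy. The two policy names are from the slide; the `publish` function, the local-filesystem directories and the per-file success flags are assumptions for illustration (the deck writes to HDFS).

```python
import shutil
import tempfile
from pathlib import Path
from typing import Dict, List

POLICIES = ("COMMIT_ON_FULL_SUCCESS", "COMMIT_ON_PARTIAL_SUCCESS")


def publish(task_files: Dict[str, bool], tmp_dir: Path, final_dir: Path,
            policy: str) -> List[str]:
    """Move files of successful tasks to final_dir; return the names committed."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    succeeded = sorted(name for name, ok in task_files.items() if ok)
    if policy == "COMMIT_ON_FULL_SUCCESS" and len(succeeded) < len(task_files):
        return []  # at least one task failed, so commit nothing
    final_dir.mkdir(parents=True, exist_ok=True)
    for name in succeeded:
        shutil.move(str(tmp_dir / name), str(final_dir / name))
    return succeeded


with tempfile.TemporaryDirectory() as root:
    tmp_dir, final_dir = Path(root) / "tmp", Path(root) / "final"
    tmp_dir.mkdir()
    outcomes = {"file1": True, "file2": True, "file3": False}  # task 3 failed
    for name in outcomes:
        (tmp_dir / name).write_text("data")
    full = publish(outcomes, tmp_dir, final_dir, "COMMIT_ON_FULL_SUCCESS")
    partial = publish(outcomes, tmp_dir, final_dir, "COMMIT_ON_PARTIAL_SUCCESS")
    final_files = sorted(p.name for p in final_dir.iterdir())
    left_in_tmp = sorted(p.name for p in tmp_dir.iterdir())
```

Writing to a staging area first means readers of the final directory never see output from a job that the policy rejected.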
  • 31. Gobblin Compaction • Dimensions: – Initial full dump followed by incremental extracts in Gobblin – Maintain a consistent snapshot by doing regularly scheduled compaction • Facts: – Merge small files (diagram: Ingestion, HDFS, Compaction)
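For dimension data, compaction of a full dump with later incremental extracts can be sketched as a keyed merge where the newest version of each row wins. The `id` and `modified` field names and the `compact` function are hypothetical, and the sketch ignores deletes.

```python
from typing import Dict, List

Row = Dict[str, object]


def compact(snapshot: List[Row], increments: List[Row],
            key: str = "id", version: str = "modified") -> List[Row]:
    """Merge increments into a snapshot, keeping the newest row per key."""
    latest: Dict[object, Row] = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        # On equal versions the increment wins, since it arrived later.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


snapshot = [
    {"id": 1, "name": "a", "modified": 1},
    {"id": 2, "name": "b", "modified": 1},
]
increments = [
    {"id": 2, "name": "b2", "modified": 2},    # update to an existing row
    {"id": 3, "name": "c", "modified": 2},     # new row
    {"id": 1, "name": "stale", "modified": 0}, # older than the snapshot, ignored
]
compacted = compact(snapshot, increments)
```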
  • 32. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  • 33. Gobblin in Production • Data Volume: > 350 datasets, ~ 60 TB per day • Production Instances: Salesforce, Responsys, RightNow, Timeforce, Slideshare, Newsle, A/B testing, LinkedIn JIRA, Data retention
  • 34. Lessons Learned • Data quality has a lot more work to do • Small data problem is not small • Performance optimization opportunities • Operational traits
  • 35. Gobblin Roadmap • Gobblin on Yarn • Streaming Sources • Gobblin Workbench with ingestion DSL • Data Profiling for richer quality checking • Open source in Q4’14
  • 37. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/gobblin-linkedin