Gobblin: A Framework for Solving Big Data Ingestion Problem

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1wQh534.

Lin Qiao discusses the architecture of Gobblin, LinkedIn’s framework for high-quality, high-velocity data ingestion. Filmed at qconsf.com.

Lin Qiao leads LinkedIn’s data lifecycle management for analytics, covering data ingestion, data quality and workflow management.

Published in: Software, Technology

  1. 1. ©2014 LinkedIn Corporation. All Rights Reserved. Gobblin’ Big Data with Ease. Lin Qiao, Data Analytics Infra @ LinkedIn
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/gobblin-linkedin
  3. 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  4. 4. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  6. 6. Perception [diagram labels: Primary Data Sources; Ingest Framework; Analytics Platform; Load; Validation; Transformations; Business Facing Insights; Member Facing Insights and Data Products]
  7. 7. Reality [diagram labels: Kafka; Activity (tracking) Data; R/W store (Oracle/Espresso); Profile Data; Databus; Changes; Change dump on filer; External Partner Data; Ingest Framework; Ingest utilities; Camus; Lumos; Hadoop; Teradata; DWH ETL (fact tables); Lassen (facts and dimensions); Core Data Set (Tracking, Database, External); Derived Data Set; Computed Results for Member Facing Products; Read store (Voldemort); Product, Sciences, Enterprise Analytics; Enterprise Products; Site (Member Facing Products)]
  8. 8. Challenges @ LinkedIn • Large variety of data sources • Multi-paradigm: streaming data, batch data • Different types of data: facts, dimensions, logs, snapshots, increments, changelog • Operational complexity of multiple pipelines • Data quality • Data availability and predictability • Engineering cost
  9. 9. Open source solutions: Sqoop, Flume, Morphline, RDBMS vendor-specific connectors, Aegisthus, Logstash, Camus
  10. 10. Goals • Unified and Structured Data Ingestion Flow – RDBMS -> Hadoop – Event Streams -> Hadoop • Higher level abstractions – Facts, Dimensions – Snapshots, increments, changelog • ELT oriented – Minimize transformation in the ingest pipeline
  11. 11. Central Ingestion Pipeline [diagram labels: Kafka; Tracking; R/W store (Oracle/Espresso); OLTP Data; Databus; Changes; Change dump on filer; External Partner Data; REST; JDBC; SOAP; Custom; Gobblin; Compaction; Hadoop; Teradata; DWH ETL (fact tables); Core Data Set (Tracking, Database, External); Derived Data Set; Product, Sciences, Enterprise Analytics; Enterprise Products; Site (Member Facing Products)]
  12. 12. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  13. 13. Gobblin Usage @ LinkedIn • Business Analytics – Source data for sales analysis, product sentiment analysis, etc. • Engineering – Source data for issue tracking, monitoring, product release, security compliance, A/B testing • Consumer product – Source data for acquisition integration – Performance analysis for email campaigns, ad campaigns, etc.
  14. 14. Key Features • Horizontally scalable and robust framework • Unified computation paradigm • Turn-key solution • Customize your own ingestion
  15. 15. Scalable and Robust Framework • Scalable: jobs are partitioned into tasks that run concurrently • Centralized State Management: state is carried over between jobs automatically, so metadata can be used to track offsets, checkpoints, watermarks, etc. • Fault Tolerant: the framework gracefully deals with machine and job failures • Quality Assurance: baked-in quality checking throughout the flow
  16. 16. Unified computation paradigm • Common execution flow: common execution flow between batch ingestion and streaming ingestion pipelines • Shared infra components: shared job state management, job metrics store, metadata management
  17. 17. Turn Key Solution • Built-in Exchange Protocols: existing adapters can easily be re-used for sources with common protocols (e.g. JDBC, REST, SFTP, SOAP, etc.) • Built-in Source Integration: fully integrated with commonly used sources (including MySQL, SQLServer, Oracle, SalesForce, HDFS, filer, internal dropbox) • Built-in Data Ingestion Semantics: covers full dump and incremental ingestion for fact and dimension datasets • Policy driven flow execution & tuning: flow owners just need to specify a pre-defined policy for handling job failure, degree of parallelism, what data to publish, etc.
  18. 18. Customize Your Own Ingestion Pipeline • Extendable Operators: operators for extraction, conversion, quality checking, data persistence, etc., can be implemented or extended against a common API • Configurable Operator Flow: configuration allows for multiple plugin points to add customized logic and code
  19. 19. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Lookahead
  20. 20. Under the Hood
  21. 21. Computation Model • Gobblin standalone – Single process, multi-threading – Testing, small data, sampling • Gobblin on Map/Reduce – Large datasets, horizontally scalable • Gobblin on Yarn – Better resource utilization – More scheduling flexibility
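
The standalone mode (single process, multi-threading) can be pictured with a small thread-pool sketch. This is only an illustration in Python, not Gobblin code: `run_task`, `run_standalone` and the shape of a work unit are hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(work_unit):
    """Hypothetical task: pretend to ingest the rows of one work unit."""
    return {"work_unit": work_unit["id"], "records": len(work_unit["rows"])}


def run_standalone(work_units, max_threads=4):
    """Run every task of a job inside one process, on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() preserves input order, so results line up with work_units.
        return list(pool.map(run_task, work_units))


# Three made-up work units holding 10, 20 and 30 rows.
work_units = [{"id": i, "rows": list(range(i * 10))} for i in range(1, 4)]
results = run_standalone(work_units)
print(results)
```

The same task function could in principle be handed to a different launcher for larger datasets, which is the point of separating the computation model from the task logic.
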
  22. 22. Scalable Ingestion Flow [diagram: a Source creates Work Units; each Work Unit runs as a Task made up of Extractor, Converter, Quality Checker and Writer; a Data Publisher follows the tasks]
  23. 23. Sources • Determines how to partition work – Partitioning algorithm can leverage source sharding – Group partitions intelligently for performance • Creates work-units to be scheduled
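
As a rough sketch of what "partition work and create work-units" might look like, the Python below splits a range of watermark values into contiguous work units. The `WorkUnit` class and `create_work_units` function are hypothetical and assume integer watermarks; the slides do not specify the actual partitioning algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    low_watermark: int   # inclusive (assumed convention)
    high_watermark: int  # exclusive (assumed convention)


def create_work_units(low, high, max_units):
    """Split [low, high) into at most max_units contiguous ranges."""
    if high <= low or max_units < 1:
        return []
    size = -(-(high - low) // max_units)  # ceiling division
    return [WorkUnit(start, min(start + size, high))
            for start in range(low, high, size)]


units = create_work_units(0, 10, 3)
print(units)
```

A source that knows about sharding could replace the even split with one range per shard; the scheduler only needs the resulting list.
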
  24. 24. Job Management • Job execution states – Watermark – Task state, job state, quality checker output, error code • Job synchronization • Job failure handling: policy driven [diagram: Job run 1, Job run 2 and Job run 3 connected to a State Store]
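
The idea of carrying a watermark from one job run to the next through a state store can be sketched as follows. This is an assumption-laden illustration: `FileStateStore`, `run_job` and the JSON-file layout are invented for the example and are not Gobblin's state store.

```python
import json
import os
import tempfile


class FileStateStore:
    """Hypothetical state store: one JSON file per job name."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, job_name):
        return os.path.join(self.directory, job_name + ".json")

    def load(self, job_name):
        try:
            with open(self._path(job_name)) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"watermark": 0}  # first run: nothing pulled yet

    def save(self, job_name, state):
        with open(self._path(job_name), "w") as f:
            json.dump(state, f)


def run_job(store, job_name, source_rows):
    """Pull only rows newer than the watermark left by the previous run."""
    state = store.load(job_name)
    new_rows = [r for r in source_rows if r["id"] > state["watermark"]]
    if new_rows:
        state["watermark"] = max(r["id"] for r in new_rows)
    store.save(job_name, state)
    return new_rows


store = FileStateStore(tempfile.mkdtemp())
rows = [{"id": 1}, {"id": 2}]
first = run_job(store, "demo", rows)    # pulls ids 1 and 2
rows.append({"id": 3})
second = run_job(store, "demo", rows)   # pulls only id 3
```
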
  25. 25. Gobblin Operator Flow: Extract Schema → Convert Schema → Extract Record → Convert Record → Check Record Data Quality → Write Record → Check Task Data Quality → Commit Task Data
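
A minimal sketch of one task stepping through the record-level part of this flow is shown below. It is written in Python for illustration only; `run_task`, its parameters, and the "count must match" task-level check are assumptions, not the framework's API.

```python
def run_task(records, convert, check_record, expected_count):
    """Run one task: convert, check and stage each record, then decide
    whether the staged output may be committed."""
    staged = []
    for record in records:               # extract record
        converted = convert(record)      # convert record
        if check_record(converted):      # record-level quality check
            staged.append(converted)     # write record to staging
    # Task-level quality check (assumed rule): every pulled record was kept.
    task_ok = len(staged) == expected_count
    return staged if task_ok else None   # commit only if the check passes


rows = [{"id": "1"}, {"id": "2"}]
to_int = lambda r: {"id": int(r["id"])}
out = run_task(rows, to_int, lambda r: r["id"] > 0, expected_count=2)
rejected = run_task(rows, to_int, lambda r: r["id"] > 1, expected_count=2)
```

Here `out` holds both converted records, while `rejected` is `None` because one record failed the record-level check and the task-level count no longer matches.
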
  26. 26. Extractors • Specifies how to get the schema and pull data from the source • Returns a ResultSet iterator • Tracks the high watermark • Tracks extraction metrics
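
The extractor responsibilities listed here (schema, record iterator, high watermark, metrics) might be sketched like this. `ListExtractor` is a hypothetical class over an in-memory list, used only to make the bullet points concrete.

```python
class ListExtractor:
    """Hypothetical extractor over an in-memory list of rows."""

    def __init__(self, rows, watermark_field):
        self.rows = rows
        self.watermark_field = watermark_field
        self.high_watermark = None   # highest watermark value seen so far
        self.records_read = 0        # a simple extraction metric

    def get_schema(self):
        # Derive a trivial schema from the first row's field names.
        return sorted(self.rows[0]) if self.rows else []

    def __iter__(self):
        for row in self.rows:
            value = row[self.watermark_field]
            if self.high_watermark is None or value > self.high_watermark:
                self.high_watermark = value
            self.records_read += 1
            yield row


extractor = ListExtractor([{"id": 7, "ts": 100}, {"id": 8, "ts": 250}], "ts")
pulled = list(extractor)
print(extractor.get_schema(), extractor.high_watermark, extractor.records_read)
```
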
  27. 27. Converters • Allow for schema and data transformation – Filtering – Projection – Type conversion – Structural change • Composable: can specify a list of converters to be applied in the given order
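
Composability can be illustrated with a list of small converter functions applied in order. The helper names below (`project`, `cast`, `drop_if`, `apply_converters`) are hypothetical, and treating `None` as "record filtered out" is an assumption of this sketch.

```python
from functools import reduce


def project(fields):
    """Converter factory: keep only the named fields (projection)."""
    return lambda record: {k: record[k] for k in fields}


def cast(field, to_type):
    """Converter factory: convert one field to another type."""
    return lambda record: {**record, field: to_type(record[field])}


def drop_if(predicate):
    """Converter factory: filter out records matching the predicate."""
    return lambda record: None if predicate(record) else record


def apply_converters(record, converters):
    """Apply converters in the given order; a None result drops the record."""
    return reduce(lambda rec, conv: None if rec is None else conv(rec),
                  converters, record)


chain = [project(["id", "age"]), cast("age", int),
         drop_if(lambda r: r["age"] < 18)]
adult = apply_converters({"id": 1, "age": "42", "note": "x"}, chain)
minor = apply_converters({"id": 2, "age": "7", "note": "y"}, chain)
```

Order matters: the filter in this chain relies on `age` already being an integer, so it has to come after the cast.
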
  28. 28. Quality Checkers • Ensure quality of any data produced by Gobblin • Can be run on a per record, per task, or per job basis • Can specify a list of quality checkers to be applied – Schema compatibility – Audit check – Sensitive fields – Unique key • Policy driven – FAIL: if the check fails, so does the job – OPTIONAL: if the check fails, the job continues – ERR_FILE: the offending row is written to an error file
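
The three policies could be sketched as a single row-level check function. The policy names come from the slide; everything else (`check_rows`, `JobFailed`, and the choice to keep the offending row under OPTIONAL) is an assumption made for illustration.

```python
class JobFailed(Exception):
    """Raised when a check fails under the FAIL policy."""


def check_rows(rows, check, policy):
    """Apply one row-level check under a policy.
    Returns (passed_rows, error_rows)."""
    passed, errors = [], []
    for row in rows:
        if check(row):
            passed.append(row)
        elif policy == "FAIL":
            raise JobFailed(f"quality check failed for {row!r}")
        elif policy == "OPTIONAL":
            passed.append(row)   # assumed: the row is kept and the job continues
        elif policy == "ERR_FILE":
            errors.append(row)   # would be written to an error file
        else:
            raise ValueError(f"unknown policy: {policy}")
    return passed, errors


rows = [{"id": 1}, {"id": None}]
has_id = lambda r: r["id"] is not None
kept, bad = check_rows(rows, has_id, "ERR_FILE")
```
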
  29. 29. Writers • Writing data in Avro format onto HDFS – One writer per task • Flexibility – Configurable compression codec (Deflate, Snappy) – Configurable buffer size • Plan to support other data formats (Parquet, ORC)
  30. 30. Publishers • Determines job success based on policy – COMMIT_ON_FULL_SUCCESS – COMMIT_ON_PARTIAL_SUCCESS • Commits data to final directories based on job success [diagram labels: Task 1, Task 2, Task 3; File 1, File 2, File 3; Tmp Dir; Final Dir]
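
A rough sketch of policy-driven publishing from a temporary directory to a final directory follows. The two policy names are from the slide; the `publish` function, the one-file-per-task layout, and the reading of partial success as "move only the files of successful tasks" are assumptions.

```python
import os
import shutil
import tempfile


def publish(task_results, tmp_dir, final_dir, policy):
    """Move task output files from tmp_dir to final_dir according to policy.
    task_results maps file name -> True if the task that wrote it succeeded."""
    all_ok = all(task_results.values())
    if policy == "COMMIT_ON_FULL_SUCCESS" and not all_ok:
        return []  # any failed task means nothing is published
    os.makedirs(final_dir, exist_ok=True)
    published = []
    for name, ok in sorted(task_results.items()):
        if ok:  # assumed: partial success publishes successful tasks only
            shutil.move(os.path.join(tmp_dir, name),
                        os.path.join(final_dir, name))
            published.append(name)
    return published


tmp_dir, final_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
for name in ("file1", "file2", "file3"):
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write(name)

results = {"file1": True, "file2": False, "file3": True}
strict = publish(results, tmp_dir, final_dir, "COMMIT_ON_FULL_SUCCESS")
partial = publish(results, tmp_dir, final_dir, "COMMIT_ON_PARTIAL_SUCCESS")
```

Writing to a staging location first means readers of the final directory never see the output of a job that was judged unsuccessful.
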
  31. 31. Gobblin Compaction • Dimensions: – Initial full dump followed by incremental extracts in Gobblin – Maintain a consistent snapshot by doing regularly scheduled compaction • Facts: – Merge small files [diagram labels: Ingestion; HDFS; Compaction]
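
For dimension data, compaction amounts to folding incremental extracts into the last full snapshot. The sketch below assumes each row has a key and a modification version, and that the newest version wins; `compact` and the field names are hypothetical.

```python
def compact(snapshot, increments, key="id", version="modified"):
    """Merge incremental rows into a snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increments:
        current = latest.get(row[key])
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


snapshot = [{"id": 1, "title": "Engineer", "modified": 10},
            {"id": 2, "title": "Analyst", "modified": 10}]
increments = [{"id": 2, "title": "Manager", "modified": 20},
              {"id": 3, "title": "Designer", "modified": 15}]
merged = compact(snapshot, increments)
print(merged)
```
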
  32. 32. Overview • Challenges • What does Gobblin provide? • How does Gobblin work? • Retrospective and lookahead
  33. 33. Gobblin in Production • Data Volume: > 350 datasets, ~ 60 TB per day • Production Instances: Salesforce, Responsys, RightNow, Timeforce, Slideshare, Newsle, A/B testing, LinkedIn JIRA, Data retention
  34. 34. Lessons Learned • Data quality still needs a lot more work • Small data problem is not small • Performance optimization opportunities • Operational traits
  35. 35. Gobblin Roadmap • Gobblin on Yarn • Streaming Sources • Gobblin Workbench with ingestion DSL • Data Profiling for richer quality checking • Open source in Q4’14