
Apache Gobblin

This presentation gives an overview of the Apache Gobblin project. It explains Apache Gobblin in terms of its architecture, its data sources and sinks, and its work-unit processing.

Links for further information and connecting

http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/

https://nz.linkedin.com/pub/mike-frampton/20/630/385

https://open-source-systems.blogspot.com/


  1. What Is Apache Gobblin?
     ● A big data integration framework
     ● To simplify integration issues like
       – Data ingestion
       – Replication
       – Organization
       – Lifecycle management
     ● For streaming and batch
     ● An Apache incubator project
  2. Gobblin Execution Modes
     ● Gobblin has a number of execution modes
     ● Standalone
       – Run on a single box / JVM / embedded mode
     ● MapReduce
       – Run as a MapReduce application
     ● Yarn / Mesos (proposed?)
       – Run on a cluster via a scheduler, supports HA
     ● Cloud
       – Run on AWS / Azure, supports HA
  3. Gobblin Sinks/Writers
     ● Gobblin supports the following sinks
       – Avro HDFS
       – Parquet HDFS
       – HDFS byte array
       – Console (StdOut)
       – Couchbase
       – HTTP
       – JDBC
       – Kafka
  4. Gobblin Sources
     ● Gobblin supports the following sources
       – Avro files
       – File copy
       – Query based
       – REST API
       – Google Analytics
       – Google Drive
       – Google Webmaster
       – Hadoop text input
       – Hive Avro to ORC
       – Hive compliance purging
       – JSON
       – Kafka
       – MySQL
       – Oracle
       – Salesforce
       – FTP / SFTP
       – SQL Server
       – Teradata
       – Wikipedia
  5. Gobblin Architecture (diagram)
  6. Gobblin Architecture
     ● A Gobblin job is built on a set of pluggable constructs
     ● Which are extensible
     ● A job is a set of tasks created from work units
     ● The work unit serves as a container at runtime
     ● Tasks are executed by the Gobblin runtime
       – On the chosen deployment, e.g. MapReduce
     ● The runtime handles scheduling, error handling etc.
     ● Utilities handle metadata, state, metrics etc.
  7. Gobblin Job (diagram)
  8. Gobblin Job
     ● Optionally acquire a lock (to stop the next job instance)
     ● Create a source instance
     ● Create tasks from the source's work units
     ● Launch and run the tasks
     ● Publish the data if it is OK to do so
     ● Persist the job/task states into the state store
     ● Clean up temporary work data
     ● Release the job lock (optional)
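The job steps above can be sketched as a simple driver loop. This is an illustrative skeleton only, not Gobblin's real API: the class, record, and method names below are all invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a Gobblin job's lifecycle (invented names,
// not the real Gobblin API): lock, source, tasks, publish, persist,
// clean up, unlock.
public class GobblinJobSketch {

    // A work unit describes one slice of data for a task to pull.
    record WorkUnit(String description) {}

    static List<String> runJob(List<WorkUnit> workUnits) {
        List<String> log = new ArrayList<>();
        log.add("acquire job lock");            // optional: stops an overlapping run
        log.add("create source instance");
        for (WorkUnit wu : workUnits) {         // one task per work unit
            log.add("run task for " + wu.description());
        }
        log.add("publish data");                // only if the tasks succeeded
        log.add("persist job/task state");      // into the state store
        log.add("clean up temporary work data");
        log.add("release job lock");            // optional, mirrors the acquire
        return log;
    }

    public static void main(String[] args) {
        runJob(List.of(new WorkUnit("partition-0"),
                       new WorkUnit("partition-1")))
                .forEach(System.out::println);
    }
}
```

The per-work-unit loop is the part that Gobblin actually parallelises on the chosen deployment (e.g. as mappers in MapReduce mode); the sketch runs it serially for clarity.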
  9. Gobblin Constructs (diagram)
  10. Gobblin Constructs
     ● The source partitions data into work units
     ● The source creates work unit data extractors
     ● The converter converts schema and data records
     ● The quality checker checks data at row and task level
     ● The fork operator allows control to flow into multiple streams
     ● The writer sends data records to the sink
     ● The publisher publishes the job records
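The record flow through these constructs can be illustrated with a toy pipeline. Again this uses invented types rather than Gobblin's real interfaces: extracted records are converted, quality-checked row by row, then forked so that each branch's "writer" receives a copy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model of the construct chain (invented types, not Gobblin's API):
// extractor output -> converter -> row-level quality check -> fork -> writers.
public class ConstructsSketch {

    static List<List<String>> process(List<String> extracted,
                                      Function<String, String> converter,
                                      Predicate<String> rowQualityCheck,
                                      int forkBranches) {
        // One output list per fork branch, standing in for a writer per stream.
        List<List<String>> sinks = new ArrayList<>();
        for (int i = 0; i < forkBranches; i++) sinks.add(new ArrayList<>());

        for (String record : extracted) {
            String converted = converter.apply(record);       // schema/data conversion
            if (!rowQualityCheck.test(converted)) continue;   // drop rows failing the check
            for (List<String> sink : sinks) sink.add(converted); // fork into each stream
        }
        return sinks;
    }

    public static void main(String[] args) {
        List<List<String>> out = process(
                List.of("a", "", "b"),   // extractor output, one empty (bad) row
                String::toUpperCase,     // stand-in converter
                r -> !r.isEmpty(),       // row-level quality check
                2);                      // fork into two streams
        System.out.println(out);         // prints [[A, B], [A, B]]
    }
}
```

In real Gobblin the converter and quality checker also operate at schema and task level respectively; this sketch keeps only the row-level path to show the shape of the flow.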
  11. Gobblin Job Configuration
     ● Gobblin jobs are configured via configuration files
     ● May be named .pull / .job plus .properties
     ● The source properties file defines
       – Connection / converter / quality / publisher
     ● The job file defines
       – Name / group / description / schedule
       – Extraction properties
       – Source properties
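A job file of this kind looks roughly like the following, adapted from the Wikipedia example that ships with Gobblin. Exact property keys and class names vary between Gobblin releases, so treat this as a sketch rather than a copy-paste configuration.

```properties
# Sketch of a Gobblin .pull job file (keys/classes may differ per release)
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=Pull revisions of a few Wikipedia pages

# Source and converter constructs for this job
source.class=org.apache.gobblin.example.wikipedia.WikipediaSource
converter.classes=org.apache.gobblin.example.wikipedia.WikipediaConverter
extract.namespace=org.apache.gobblin.example.wikipedia

# Writer (sink) settings
writer.destination.type=HDFS
writer.output.format=AVRO
writer.builder.class=org.apache.gobblin.writer.AvroDataWriterBuilder

# Publisher moves task output from the working directory to its final location
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
```

Shared settings (working directories, state store location, etc.) normally live in a common .properties file, with per-job .pull/.job files like this one layered on top.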
  12. Gobblin Users (diagram)
  13. Available Books
     ● See “Big Data Made Easy” – Apress, Jan 2015
     ● See “Mastering Apache Spark” – Packt, Oct 2015
     ● See “Complete Guide to Open Source Big Data Stack” – Apress, Jan 2018
     ● Find the author on Amazon
       – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
     ● Connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
  14. Connect
     ● Feel free to connect on LinkedIn
       – www.linkedin.com/in/mike-frampton-38563020
     ● See my open source blog at
       – open-source-systems.blogspot.com/
     ● I am always interested in
       – New technology
       – Opportunities
       – Technology-based issues
       – Big data integration