What Is Apache Gobblin?
● A big data integration framework
● Simplifies common data integration tasks such as
– Data ingestion
– Replication
– Organization
– Lifecycle management
● For streaming and batch
● An Apache Incubator project
Gobblin Execution Modes
● Gobblin has a number of execution modes
● Standalone
– Run on a single box / JVM / embedded mode (see the sketch after this list)
● Map Reduce
– Run as a map reduce application
● YARN / Mesos (proposed?)
– Run on a cluster via a scheduler, supports HA
● Cloud
– Run on AWS / Azure, supports HA
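To make the standalone / embedded mode concrete, here is a minimal sketch that launches a job inside the current JVM, assuming the EmbeddedGobblin builder from the gobblin-runtime module; the builder methods and the example source/writer class names are taken from memory and may differ between Gobblin versions.

```java
// Minimal sketch of embedded/standalone execution, assuming the
// EmbeddedGobblin builder from gobblin-runtime. Method and class names
// are from memory and may differ between Gobblin versions.
import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedLauncher {
    public static void main(String[] args) throws Exception {
        JobExecutionResult result = new EmbeddedGobblin("wikipedia-pull")
                // illustrative source and writer classes from the Gobblin examples
                .setConfiguration("source.class",
                        "org.apache.gobblin.example.wikipedia.WikipediaSource")
                .setConfiguration("writer.builder.class",
                        "org.apache.gobblin.writer.SimpleDataWriterBuilder")
                .run();                       // runs the whole job in this JVM
        System.out.println("Succeeded: " + result.isSuccessful());
    }
}
```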
Gobblin Sinks/Writers
● Gobblin supports the following sinks
– Avro HDFS
– Parquet HDFS
– HDFS byte array
– Console (StdOut)
– Couchbase
– HTTP
– JDBC
– Kafka
Gobblin Sources
Gobblin supports the following sources
● Avro files
● File copy
● Query based
● Rest API
● Google Analytics
● Google drive
● Google webmaster
● Hadoop text input
● Hive Avro to ORC
● Hive compliance purging
● JSON
● Kafka
● MySQL
● Oracle
● Salesforce
● FTP / SFTP
● SQL Server
● Teradata
● Wikipedia
Gobblin Architecture
● A Gobblin job is built on a set of pluggable constructs
● Which are extensible (see the sketch after this list)
● A job is a set of tasks created from work units
● A work unit serves as the runtime container for a task
● Tasks are executed by the Gobblin runtime
– On the chosen deployment, e.g. MapReduce
● The runtime handles scheduling, error handling, etc.
● Utilities handle metadata, state, metrics, etc.
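As a rough illustration of these pluggable constructs, the sketch below paraphrases the Source / WorkUnit / Extractor relationship using plain JDK types; the real interfaces live under org.apache.gobblin.source and carry extra state arguments (SourceState, WorkUnitState), so the simplified signatures here are assumptions, not Gobblin's exact API.

```java
// Simplified paraphrase of Gobblin's core pluggable constructs using plain
// JDK types; the real interfaces take SourceState / WorkUnitState arguments.
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** One partition of the source data; each work unit becomes one task. */
class WorkUnit {
    final String partition;
    WorkUnit(String partition) { this.partition = partition; }
}

/** Pulls the schema and records for a single work unit. */
interface Extractor<S, D> extends Closeable {
    S getSchema() throws IOException;   // schema for this work unit
    D readRecord() throws IOException;  // next record, or null when exhausted
}

/** Partitions the data set and hands out one extractor per work unit. */
interface Source<S, D> {
    List<WorkUnit> getWorkunits();                                // partition data into work units
    Extractor<S, D> getExtractor(WorkUnit workUnit) throws IOException;
    void shutdown();                                              // release source-level resources
}
```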
Gobblin Job
● Optionally acquire a job lock (to stop the next job instance from starting)
● Create source instance
● From source work units create tasks
● Launch and run tasks
● Publish data if OK to do so
● Persist the job/task states into the state store
● Clean up temporary work data
● Release the job lock (optional)
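The steps above can be read as one control flow. The sketch below is a hypothetical condensation of it; names such as TaskResult, runTask and persistState are illustrative stand-ins rather than Gobblin's runtime classes, and the work units are assumed to have already been produced by the Source.

```java
// Hypothetical condensation of the job flow listed above; names are
// illustrative stand-ins, not Gobblin's real runtime classes.
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class JobFlowSketch {

    record WorkUnit(String partition) {}
    record TaskResult(WorkUnit workUnit, boolean succeeded) {}

    private final ReentrantLock jobLock = new ReentrantLock(); // stand-in for the optional job lock

    public void runJob(List<WorkUnit> workUnits) {
        if (!jobLock.tryLock()) {
            return;                                    // optional lock: skip if a previous instance still runs
        }
        try {
            List<TaskResult> results = workUnits.stream()
                    .map(this::runTask)                // one task per work unit
                    .toList();
            if (results.stream().allMatch(TaskResult::succeeded)) {
                publish(results);                      // publish data only if it is OK to do so
            }
            persistState(results);                     // persist job/task state (e.g. watermarks)
        } finally {
            cleanUpStagingData();                      // clean up temporary work data
            jobLock.unlock();                          // release the (optional) job lock
        }
    }

    private TaskResult runTask(WorkUnit wu) { return new TaskResult(wu, true); }
    private void publish(List<TaskResult> results) { System.out.println("published " + results.size() + " tasks"); }
    private void persistState(List<TaskResult> results) { /* write task states to a state store */ }
    private void cleanUpStagingData() { /* delete staging/temp output directories */ }
}
```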
Gobblin Constructs
● Source partitions data into work units
● Source creates work unit data extractors
● Converter converts schema and data records (see the sketch after this list)
● Quality checker checks row and task level data
● Fork operator allows the flow to branch into multiple output streams
● Writer sends data records to the sink
● Publisher publishes the job's output data
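A rough sketch of the Converter construct follows. The real base class (org.apache.gobblin.converter.Converter) also converts schemas and receives per-work-unit state, so the single-method interface and the CSV example here are simplified assumptions.

```java
// Simplified paraphrase of the Converter construct; the real Gobblin class
// also converts schemas and may emit zero or many output records per input.
import java.util.Collections;

interface RecordConverter<DI, DO> {
    Iterable<DO> convertRecord(DI inputRecord);   // 0..n output records per input record
}

// Example: split a CSV line into fields. A row-level quality checker could
// then reject records with an unexpected field count before they reach the writer.
class CsvLineConverter implements RecordConverter<String, String[]> {
    @Override
    public Iterable<String[]> convertRecord(String line) {
        if (line == null || line.isBlank()) {
            return Collections.emptyList();       // drop empty rows
        }
        return Collections.singletonList(line.split(","));
    }
}
```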
Gobblin Job Configuration
● Gobblin jobs are configured via configuration files
● These may be named .pull or .job, plus .properties files
● Source properties file defines
– Connection / converter / quality / publisher
● Job file defines
– Name / group / description / schedule
– Extraction properties
– Source properties
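For illustration, the sketch below expresses the kind of key/value pairs such a .pull/.job file carries as a java.util.Properties object; the key names follow Gobblin's getting-started Wikipedia example but are assumptions that should be checked against the documentation for your version.

```java
// Hypothetical sketch of a job configuration (normally kept in a .pull/.job
// file), expressed as java.util.Properties. Key names follow the Gobblin
// getting-started example and may differ by version.
import java.util.Properties;

public class SampleJobConfig {
    public static Properties wikipediaPull() {
        Properties p = new Properties();
        // Job identity (a cron-style job.schedule could be added for scheduled runs)
        p.setProperty("job.name", "PullFromWikipedia");
        p.setProperty("job.group", "Wikipedia");
        p.setProperty("job.description", "Pull revisions from the Wikipedia API");

        // Source: the class that partitions data into work units and creates extractors
        p.setProperty("source.class", "org.apache.gobblin.example.wikipedia.WikipediaSource");

        // Extract properties: namespace and table type used to name the extracted data set
        p.setProperty("extract.namespace", "org.apache.gobblin.example.wikipedia");
        p.setProperty("extract.table.type", "snapshot_only");

        // Converter, writer (sink) and publisher
        p.setProperty("converter.classes", "org.apache.gobblin.example.wikipedia.WikipediaConverter");
        p.setProperty("writer.builder.class", "org.apache.gobblin.writer.AvroDataWriterBuilder");
        p.setProperty("data.publisher.type", "org.apache.gobblin.publisher.BaseDataPublisher");
        return p;
    }
}
```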
Gobblin Users
Available Books
● See “Big Data Made Easy”
– Apress Jan 2015
● See “Mastering Apache Spark”
– Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration
