What Is Apache Gobblin?
● A big data integration framework
● Simplifies common data integration tasks such as
– Data ingestion
– Replication
– Organization
– Lifecycle management
● For streaming and batch
● An Apache Incubator project
Gobblin Execution Modes
● Gobblin has a number of execution modes
● Standalone
– Run on a single box / JVM / embedded mode (see the sketch after this list)
● Map Reduce
– Run as a map reduce application
● YARN / Mesos (proposed?)
– Run on a cluster via a scheduler, supports HA
● Cloud
– Run on AWS / Azure, supports HA
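To make the standalone / embedded mode concrete, here is a minimal sketch that launches a job inside the current JVM, assuming the EmbeddedGobblin builder from the gobblin-runtime module; the builder methods and the example source/writer class names are taken from memory and may differ between Gobblin versions.

```java
// Minimal sketch of embedded/standalone execution, assuming the
// EmbeddedGobblin builder from gobblin-runtime. Method and class names
// are from memory and may differ between Gobblin versions.
import org.apache.gobblin.runtime.api.JobExecutionResult;
import org.apache.gobblin.runtime.embedded.EmbeddedGobblin;

public class EmbeddedLauncher {
    public static void main(String[] args) throws Exception {
        JobExecutionResult result = new EmbeddedGobblin("wikipedia-pull")
                // illustrative source and writer classes from the Gobblin examples
                .setConfiguration("source.class",
                        "org.apache.gobblin.example.wikipedia.WikipediaSource")
                .setConfiguration("writer.builder.class",
                        "org.apache.gobblin.writer.SimpleDataWriterBuilder")
                .run();                       // runs the whole job in this JVM
        System.out.println("Succeeded: " + result.isSuccessful());
    }
}
```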
Gobblin Sinks/Writers
● Gobblin supports the following sinks
– Avro HDFS
– Parquet HDFS
– HDFS byte array
– Console (StdOut)
– Couchbase
– HTTP
– JDBC
– Kafka
Gobblin Sources
Gobblin supports the following sources
● Avro files
● File copy
● Query based
● Rest API
● Google Analytics
● Google drive
● Google webmaster
● Hadoop text input
● Hive Avro to ORC
● Hive compliance purging
● JSON
● Kafka
● MySQL
● Oracle
● Salesforce
● FTP / SFTP
● SQL Server
● Teradata
● Wikipedia
Gobblin Architecture
● A Gobblin job is built on a set of pluggable constructs
● Which are extensible (see the sketch after this list)
● A job is a set of tasks created from work units
● A work unit serves as the runtime container for a task
● Tasks are executed by the Gobblin runtime
– On the chosen deployment, e.g. MapReduce
● The runtime handles scheduling, error handling, etc.
● Utilities handle metadata, state, metrics, etc.
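As a rough illustration of these pluggable constructs, the sketch below paraphrases the Source / WorkUnit / Extractor relationship using plain JDK types; the real interfaces live under org.apache.gobblin.source and carry extra state arguments (SourceState, WorkUnitState), so the simplified signatures here are assumptions, not Gobblin's exact API.

```java
// Simplified paraphrase of Gobblin's core pluggable constructs using plain
// JDK types; the real interfaces take SourceState / WorkUnitState arguments.
import java.io.Closeable;
import java.io.IOException;
import java.util.List;

/** One partition of the source data; each work unit becomes one task. */
class WorkUnit {
    final String partition;
    WorkUnit(String partition) { this.partition = partition; }
}

/** Pulls the schema and records for a single work unit. */
interface Extractor<S, D> extends Closeable {
    S getSchema() throws IOException;   // schema for this work unit
    D readRecord() throws IOException;  // next record, or null when exhausted
}

/** Partitions the data set and hands out one extractor per work unit. */
interface Source<S, D> {
    List<WorkUnit> getWorkunits();                                // partition data into work units
    Extractor<S, D> getExtractor(WorkUnit workUnit) throws IOException;
    void shutdown();                                              // release source-level resources
}
```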
Gobblin Job
● Optionally acquire a job lock (to stop the next job instance from starting)
● Create source instance
● From source work units create tasks
● Launch and run tasks
● Publish data if OK to do so
● Persist the job/task states into the state store
● Clean up temporary work data
● Release the job lock (optional)
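The steps above can be read as one control flow. The sketch below is a hypothetical condensation of it; names such as TaskResult, runTask and persistState are illustrative stand-ins rather than Gobblin's runtime classes, and the work units are assumed to have already been produced by the Source.

```java
// Hypothetical condensation of the job flow listed above; names are
// illustrative stand-ins, not Gobblin's real runtime classes.
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class JobFlowSketch {

    record WorkUnit(String partition) {}
    record TaskResult(WorkUnit workUnit, boolean succeeded) {}

    private final ReentrantLock jobLock = new ReentrantLock(); // stand-in for the optional job lock

    public void runJob(List<WorkUnit> workUnits) {
        if (!jobLock.tryLock()) {
            return;                                    // optional lock: skip if a previous instance still runs
        }
        try {
            List<TaskResult> results = workUnits.stream()
                    .map(this::runTask)                // one task per work unit
                    .toList();
            if (results.stream().allMatch(TaskResult::succeeded)) {
                publish(results);                      // publish data only if it is OK to do so
            }
            persistState(results);                     // persist job/task state (e.g. watermarks)
        } finally {
            cleanUpStagingData();                      // clean up temporary work data
            jobLock.unlock();                          // release the (optional) job lock
        }
    }

    private TaskResult runTask(WorkUnit wu) { return new TaskResult(wu, true); }
    private void publish(List<TaskResult> results) { System.out.println("published " + results.size() + " tasks"); }
    private void persistState(List<TaskResult> results) { /* write task states to a state store */ }
    private void cleanUpStagingData() { /* delete staging/temp output directories */ }
}
```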
Gobblin Constructs
● Source partitions data into work units
● Source creates work unit data extractors
● Converter converts schema and data records (see the sketch after this list)
● Quality checker checks row and task level data
● Fork operator allows the flow to branch into multiple output streams
● Writer sends data records to the sink
● Publisher publishes the job's output data
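A rough sketch of the Converter construct follows. The real base class (org.apache.gobblin.converter.Converter) also converts schemas and receives per-work-unit state, so the single-method interface and the CSV example here are simplified assumptions.

```java
// Simplified paraphrase of the Converter construct; the real Gobblin class
// also converts schemas and may emit zero or many output records per input.
import java.util.Collections;

interface RecordConverter<DI, DO> {
    Iterable<DO> convertRecord(DI inputRecord);   // 0..n output records per input record
}

// Example: split a CSV line into fields. A row-level quality checker could
// then reject records with an unexpected field count before they reach the writer.
class CsvLineConverter implements RecordConverter<String, String[]> {
    @Override
    public Iterable<String[]> convertRecord(String line) {
        if (line == null || line.isBlank()) {
            return Collections.emptyList();       // drop empty rows
        }
        return Collections.singletonList(line.split(","));
    }
}
```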
Gobblin Job Configuration
● Gobblin jobs are configured via configuration files
● These may be named .pull or .job, plus .properties files
● Source properties file defines
– Connection / converter / quality / publisher
● Job file defines
– Name / group / description / schedule
– Extraction properties
– Source properties
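For illustration, the sketch below expresses the kind of key/value pairs such a .pull/.job file carries as a java.util.Properties object; the key names follow Gobblin's getting-started Wikipedia example but are assumptions that should be checked against the documentation for your version.

```java
// Hypothetical sketch of a job configuration (normally kept in a .pull/.job
// file), expressed as java.util.Properties. Key names follow the Gobblin
// getting-started example and may differ by version.
import java.util.Properties;

public class SampleJobConfig {
    public static Properties wikipediaPull() {
        Properties p = new Properties();
        // Job identity (a cron-style job.schedule could be added for scheduled runs)
        p.setProperty("job.name", "PullFromWikipedia");
        p.setProperty("job.group", "Wikipedia");
        p.setProperty("job.description", "Pull revisions from the Wikipedia API");

        // Source: the class that partitions data into work units and creates extractors
        p.setProperty("source.class", "org.apache.gobblin.example.wikipedia.WikipediaSource");

        // Extract properties: namespace and table type used to name the extracted data set
        p.setProperty("extract.namespace", "org.apache.gobblin.example.wikipedia");
        p.setProperty("extract.table.type", "snapshot_only");

        // Converter, writer (sink) and publisher
        p.setProperty("converter.classes", "org.apache.gobblin.example.wikipedia.WikipediaConverter");
        p.setProperty("writer.builder.class", "org.apache.gobblin.writer.AvroDataWriterBuilder");
        p.setProperty("data.publisher.type", "org.apache.gobblin.publisher.BaseDataPublisher");
        return p;
    }
}
```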
Gobblin Users
Available Books
● See “Big Data Made Easy”
– Apress Jan 2015
● See “Mastering Apache Spark”
– Packt Oct 2015
● See “Complete Guide to Open Source Big Data Stack”
– Apress Jan 2018
● Find the author on Amazon
– www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology based issues
– Big data integration
