
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017


This talk describes the motivations behind Apache Gobblin (incubating), its architecture, the latest innovations in supporting both batch and streaming data pipelines, and the future roadmap.


  1. The Data Driven Network. Kapil Surlaker, Director of Engineering. Bridging Batch and Streaming Data Integration with Gobblin. Shirshanka Das, Gobblin team. 26th Apr, 2017, Big Data Meetup. github.com/linkedin/gobblin | @ApacheGobblin | gitter.im/gobblin
  2. Data Integration: key requirements. Source and sink diversity; batch + streaming; data quality. So, we built Gobblin.
  3. Simplifying Data Integration @LinkedIn. Hundreds of TB per day; thousands of datasets; ~30 different source systems; 80%+ of data ingest. Open source @ github.com/linkedin/gobblin. Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more… Apache incubation under way. Diagram labels: SFTP, JDBC, REST, Azure Storage.
  4. Other Open Source Systems in this Space: Sqoop, Flume, Falcon, NiFi, Kafka Connect; Flink, Spark, Samza, Apex. Similar in pieces, dissimilar in aggregate. Most are tied to a specific execution model (batch / stream). Most are tied to a specific implementation or ecosystem (Kafka, Hadoop etc.).
  5. Gobblin: Under the Hood
  6. Gobblin: The Logical Pipeline
  7. WorkUnit: a logical unit of work, typically bounded but not necessarily. Examples: Kafka topic: LoginEvent, partition: 10, offsets: 10-200; HDFS folder: /data/Login, file: part-0.avro; Hive dataset: Tracking.Login, date-partition=mm-dd-yy-hh.
  8. Source: a provider of WorkUnits (typically a system like Kafka, HDFS etc.).
  9. Task: a unit of execution that operates on a WorkUnit. Extracts records from the source and writes them to the destination. Ends when the WorkUnit is exhausted of records. (Assigned to a thread in a ThreadPool, a mapper in Map-Reduce etc.)
  10. Extractor: a provider of records given a WorkUnit. Connects to the data source; deserializes records.
  11. Converter: a 1:N mapper of input records to output records. Multiple converters can be chained (e.g. Avro <-> JSON, schema projection, encryption).
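The 1:N converter contract and chaining described on the slide above can be sketched in a few lines of Python. This is only an illustration, not Gobblin's Java Converter API: the names `project`, `drop_if_missing`, `explode` and `chain` are hypothetical, and records are modelled as plain dicts.

```python
from typing import Callable, Iterable, Iterator, List

# A converter maps one input record to zero or more output records (1:N).
Converter = Callable[[dict], Iterable[dict]]


def project(fields: List[str]) -> Converter:
    """Keep only the named fields (a stand-in for schema projection)."""
    def convert(record: dict) -> Iterator[dict]:
        yield {k: record[k] for k in fields if k in record}
    return convert


def drop_if_missing(field: str) -> Converter:
    """Emit nothing when the field is absent: the N = 0 case."""
    def convert(record: dict) -> Iterator[dict]:
        if field in record:
            yield record
    return convert


def explode(field: str) -> Converter:
    """Emit one record per element of a list-valued field: the N > 1 case."""
    def convert(record: dict) -> Iterator[dict]:
        for value in record[field]:
            yield {**record, field: value}
    return convert


def chain(converters: List[Converter]) -> Converter:
    """Compose converters so each one consumes the previous one's output."""
    def convert(record: dict) -> List[dict]:
        records = [record]
        for converter in converters:
            records = [out for r in records for out in converter(r)]
        return records
    return convert


pipeline = chain([drop_if_missing("tags"), explode("tags"), project(["id", "tags"])])
```

Calling `pipeline` on a record with two tags yields two projected records, while a record without a `tags` field yields none.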
  12. Quality Checker: checks whether the quality of the output is satisfactory. Row-level (e.g. time value check); task-level (e.g. audit check, schema compatibility).
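The difference between a row-level and a task-level check might look like the following sketch. It assumes records are dicts with a hypothetical `timestamp_ms` field and that the audit check compares records written against records read; neither detail comes from the slides, and none of these names are Gobblin classes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TaskStats:
    records_read: int = 0
    records_written: int = 0


def time_value_check(record: dict, now_ms: int) -> bool:
    """Row-level: the timestamp must be a positive integer, not in the future."""
    ts = record.get("timestamp_ms")
    return isinstance(ts, int) and 0 < ts <= now_ms


def audit_check(stats: TaskStats, min_ratio: float) -> bool:
    """Task-level: a large enough share of the records read was written."""
    if stats.records_read == 0:
        return True
    return stats.records_written / stats.records_read >= min_ratio


def run_task(records: Iterable[dict], now_ms: int, min_ratio: float) -> Tuple[List[dict], bool]:
    """Apply the row-level check per record, then the task-level check once."""
    stats = TaskStats()
    written = []
    for record in records:
        stats.records_read += 1
        if time_value_check(record, now_ms):
            written.append(record)
            stats.records_written += 1
    return written, audit_check(stats, min_ratio)
```

The row-level check decides the fate of each record as it passes through; the task-level check runs once at the end and decides whether the task's output as a whole is acceptable.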
  13. Writer: writes to the destination. Connection to the destination; serializer of records; sync / async. E.g. FsWriter, KafkaWriter, CouchbaseWriter.
  14. Publisher: finalizes / commits the data. Used for destinations that support atomicity (e.g. move tmp staging directory to final output directory on HDFS).
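The staging-then-move commit that the Publisher slide mentions can be illustrated on a local filesystem. This sketch uses `os.rename` as a stand-in for an HDFS move; the function names and the `part-0.txt` file name are made up for the example, and HDFS rename semantics are not modelled.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable


def write_to_staging(staging_dir: Path, records: Iterable[str]) -> None:
    """Writer side: records land in a staging directory readers never look at."""
    staging_dir.mkdir(parents=True)
    with open(staging_dir / "part-0.txt", "w") as f:
        for record in records:
            f.write(record + "\n")


def publish(staging_dir: Path, final_dir: Path) -> None:
    """Publisher side: one rename makes the whole output visible at once.

    Assumes both directories are on the same filesystem and that final_dir
    does not exist yet.
    """
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    os.rename(staging_dir, final_dir)
```

Until `publish` runs, nothing exists at the final path, so a task that fails midway leaves only staging data behind.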
  15. Gobblin: The Logical Pipeline
  16. Gobblin: The Stateful Logical Pipeline. State Store (HDFS, S3, MySQL, ZK, …): load config and previous watermarks at the start; save watermarks at the end.
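The load-previous-watermarks / save-watermarks cycle around the pipeline can be sketched with a toy state store. `FileStateStore` and `run_once` are hypothetical names, the store is a JSON file on local disk rather than HDFS, S3, MySQL or ZooKeeper, and the "source" is just a Python list indexed by offset.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


class FileStateStore:
    """Hypothetical state store: one JSON file of watermarks per job."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def load(self, job_name: str) -> Dict[str, int]:
        path = self.root / f"{job_name}.json"
        return json.loads(path.read_text()) if path.exists() else {}

    def save(self, job_name: str, watermarks: Dict[str, int]) -> None:
        self.root.mkdir(parents=True, exist_ok=True)
        tmp = self.root / f"{job_name}.json.tmp"
        tmp.write_text(json.dumps(watermarks))
        tmp.replace(self.root / f"{job_name}.json")  # swap in the new state


def run_once(store: FileStateStore, job_name: str, partition: str,
             log: List[str], batch_size: int) -> List[str]:
    """Read up to batch_size records past the previous watermark, then save."""
    watermarks = store.load(job_name)
    start = watermarks.get(partition, 0)
    batch = log[start:start + batch_size]
    watermarks[partition] = start + len(batch)
    store.save(job_name, watermarks)
    return batch
```

Each call to `run_once` resumes where the previous one stopped, which is what makes repeated runs of the same job incremental.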
  17. Gobblin: Pipeline Specification
  18-20. Gobblin: Pipeline Specification. Parts highlighted, in order: pipeline name and description; source + configuration; converter; writer + configuration; publisher.
      job.name=PullFromWikipedia
      job.group=Wikipedia
      job.description=A getting started example for Gobblin
      source.class=gobblin.example.wikipedia.WikipediaSource
      source.page.titles=LinkedIn,Wikipedia:Sandbox
      source.revisions.cnt=5
      wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
      wikipedia.avro.schema={"namespace": "example.wikipedia.avro" ,…"null"]}]}
      gobblin.wikipediaSource.maxRevisionsPerPage=10
      converter.classes=gobblin.example.wikipedia.WikipediaConverter
      extract.namespace=gobblin.example.wikipedia
      writer.destination.type=HDFS
      writer.output.format=AVRO
      writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner
      data.publisher.type=gobblin.publisher.BaseDataPublisher
  21. Gobblin: Pipeline Deployment. One spec, multiple environments. Diagram labels: Pipeline Specification; Small, Medium, Large; One Box, Static Cluster, Elastic Cluster; Standalone: Single Instance, Standalone Cluster, Hadoop (YARN / MR), AWS (EC2); Bare Metal / AWS / Azure / VM.
  22. Execution Model: Batch versus Streaming. Batch: determine work, acquire slots, run, checkpoint, repeat. + Cost-efficient, deterministic, repeatable. - Higher latency. - Setup and checkpoint costs dominate if "micro-batching".
  23. Execution Model: Batch versus Streaming. Streaming: determine work streams, run continuously, checkpoint periodically. + Low latency. - Higher cost because it is harder to provision accurately. - More sophistication needed to deal with change.
  24. Execution Model Scorecard (diagram marking each pipeline type as Batch or Streaming): JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis.
  25. Can we run in both models using the same system?
  26. Gobblin: The Logical Pipeline
  27. Pipeline Stages: Start. Batch: determine work. Streaming: determine work (unbounded WorkUnit).
  28. Pipeline Stages: Run. Batch: acquire slots, run. Streaming: run continuously, checkpoint periodically, shut down gracefully. Diagram labels: Watermark Manager, State Storage, notify, ack, shutdown.
  29. Pipeline Stages: End. Batch: checkpoint, commit. Streaming: do nothing (NoOpPublisher).
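The three stages above suggest that one task loop can serve both execution models, differing only in when it checkpoints and whether it publishes. The sketch below is an interpretation of the slides, not Gobblin code: `run_task` and its callbacks are hypothetical, checkpoints are triggered by record count rather than by a timer, and the final checkpoint on streaming shutdown is an assumption about what "shut down gracefully" implies.

```python
import itertools
from typing import Callable, Iterable


def run_task(records: Iterable[int],
             write: Callable[[int], None],
             checkpoint: Callable[[int], None],
             publish: Callable[[], None],
             streaming: bool,
             checkpoint_every: int = 100,
             should_stop: Callable[[], bool] = lambda: False) -> int:
    """Run one task over a bounded (batch) or unbounded (streaming) WorkUnit."""
    count = 0
    for record in records:
        write(record)
        count += 1
        if streaming and count % checkpoint_every == 0:
            checkpoint(count)  # periodic checkpoint while running continuously
        if should_stop():
            break  # graceful shutdown requested
    if streaming:
        if count % checkpoint_every != 0:
            checkpoint(count)  # assumption: flush the last watermark on shutdown
        # End stage: nothing to publish in streaming mode
    else:
        checkpoint(count)  # End stage in batch mode: checkpoint, then commit
        publish()
    return count
```

In batch mode the loop ends when the bounded input is exhausted, for example `run_task(range(5), ...)`; in streaming mode the input can be `itertools.count()` and the loop ends only when `should_stop` returns true.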
  30. Enabling Streaming mode: task.executionMode = streaming. Deployment modes shown: Standalone: Single Instance, Standalone Cluster, AWS, Hadoop (YARN / MR).
  31-35. A Streaming Pipeline Spec: Kafka 2 Kafka. Parts highlighted, in order: pipeline name and description; source and configuration; converter and configuration; writer and configuration; publisher; execution mode and watermark storage configuration.
      # A sample pull file that copies an input Kafka topic and
      # produces to an output Kafka topic with sampling
      job.name=Kafka2KafkaStreaming
      job.group=Kafka
      job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
      job.lock.enabled=false
      source.class=gobblin.source….KafkaSimpleStreamingSource
      gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
      gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
      gobblin.streaming.kafka.topic.singleton=test
      kafka.brokers=localhost:9092
      # Sample 10% of the records
      converter.classes=gobblin.converter.SamplingConverter
      converter.sample.ratio=0.10
      writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
      writer.kafka.topic=test_copied
      writer.kafka.producerConfig.bootstrap.servers=localhost:9092
      writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
      data.publisher.type=gobblin.publisher.NoopPublisher
      task.executionMode=STREAMING
      # Configure watermark storage for streaming
      #streaming.watermarkStateStore.type=zk
      #streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
      # Configure watermark commit settings for streaming
      #streaming.watermark.commitIntervalMillis=2000
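The spec above samples 10% of the records with `converter.sample.ratio=0.10`. One plausible reading of that setting is independent per-record sampling, sketched below. How `gobblin.converter.SamplingConverter` is actually implemented is not shown in the slides, so treat this as an assumption; `sampling_converter` is a hypothetical name.

```python
import random
from typing import Callable, Iterator


def sampling_converter(ratio: float, rng: random.Random) -> Callable[[object], Iterator[object]]:
    """Return a converter that passes each record through with probability `ratio`.

    Assumption: every record is kept or dropped independently of the others.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")

    def convert(record: object) -> Iterator[object]:
        if rng.random() < ratio:
            yield record  # kept: one output record
        # dropped: zero output records

    return convert
```

With this reading, the number of records kept is only approximately 10% of the input for any finite stream, and it is a converter in the 1:N sense with N being 0 or 1.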
  36. Gobblin Streaming: Cluster view. A cluster of processes; Apache Helix handles work-unit assignment, fault-tolerance and reassignment. Diagram labels: Stream Source; Cluster Master; Helix; Worker 1, Worker 2, Worker 3; Sink (Kafka, HDFS, …).
  37. Active Workstreams in Gobblin. Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications; logical flow specifications compile down to physical pipeline specs. Global Throttling: throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network b/w, Hadoop namenode etc.); generic, can be used outside Gobblin. Metadata driven: integration with Metadata Service (cf. WhereHows); policy-driven replication, permissions, encryption etc.
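The slides do not say how Gobblin's throttling works. As a generic illustration of enforcing a quota such as "N API calls per second", here is a token bucket, a common rate-limiting technique that is swapped in for the purpose of the example. It accounts for permits inside one process only; respecting a quota globally, as the slide describes, would additionally need a shared service or shared state, which is not modelled. The class and parameter names are hypothetical.

```python
class TokenBucket:
    """Token-bucket limiter with an explicit clock so it is easy to test."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float) -> None:
        self.rate = rate_per_sec   # tokens added per second
        self.capacity = capacity   # maximum burst size
        self.tokens = capacity     # start with a full bucket
        self.last = now            # time of the last refill

    def try_acquire(self, now: float, permits: float = 1) -> bool:
        """Refill for the elapsed time, then take `permits` tokens if available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= permits:
            self.tokens -= permits
            return True
        return False
```

A caller that gets `False` would wait or back off before retrying, for example before issuing the next API call or namenode request.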
  38. Roadmap. Final LinkedIn Gobblin 0.10.0 release. Apache Incubator code donation and release. More streaming runtimes: integration with Apache Samza, LinkedIn Brooklin. GDPR compliance: data purge for Hadoop and other systems. Security improvements: credential storage, secure specs.
  39. Gobblin Team @ LinkedIn
