
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn Apr 2017

This talk describes the motivations behind Apache Gobblin (incubating), its architecture, the latest innovations in supporting both batch and streaming data pipelines, and the future roadmap.

Published in: Data & Analytics


  1. Bridging Batch and Streaming Data Integration with Gobblin. Shirshanka Das, Gobblin team; Kapil Surlaker, Director of Engineering, The Data Driven Network. Big Data Meetup, 26th Apr 2017. github.com/linkedin/gobblin, @ApacheGobblin, gitter.im/gobblin
  2. Data Integration: key requirements. Source and sink diversity; batch + streaming; data quality. So, we built Gobblin.
  3. Simplifying Data Integration @LinkedIn: hundreds of TB per day, thousands of datasets, ~30 different source systems (SFTP, JDBC, REST, Azure Storage, …), 80%+ of LinkedIn's data ingest. Open source @ github.com/linkedin/gobblin. Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more… Apache incubation under way.
  4. Other Open Source Systems in this Space: Sqoop, Flume, Falcon, Nifi, Kafka Connect; Flink, Spark, Samza, Apex. Similar in pieces, dissimilar in aggregate: most are tied to a specific execution model (batch / stream), and most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.).
  5. Gobblin: Under the Hood
  6. Gobblin: The Logical Pipeline
  7. WorkUnit: a logical unit of work, typically but not necessarily bounded. Examples: Kafka topic LoginEvent, partition 10, offsets 10-200; HDFS folder /data/Login, file part-0.avro; Hive dataset Tracking.Login, date-partition=mm-dd-yy-hh.
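The WorkUnit examples above can be sketched as a simple property bag; the class and property keys below are illustrative, not the actual Gobblin API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a WorkUnit: a bag of properties that describes one
// slice of a source (here, a Kafka partition plus an offset range).
class WorkUnit {
    private final Map<String, String> props = new HashMap<>();

    void setProp(String key, String value) { props.put(key, value); }
    String getProp(String key) { return props.get(key); }

    public static void main(String[] args) {
        WorkUnit wu = new WorkUnit();
        wu.setProp("topic", "LoginEvent");   // illustrative keys, not real Gobblin config
        wu.setProp("partition", "10");
        wu.setProp("offset.low", "10");
        wu.setProp("offset.high", "200");
        System.out.println(wu.getProp("topic") + "/" + wu.getProp("partition")
                + " offsets " + wu.getProp("offset.low") + "-" + wu.getProp("offset.high"));
    }
}
```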
  8. Source: a provider of WorkUnits (typically a system like Kafka, HDFS, etc.)
  9. Task: a unit of execution that operates on a WorkUnit. It extracts records from the source and writes them to the destination, and ends when the WorkUnit is exhausted of records. (Assigned to a thread in a thread pool, a mapper in Map-Reduce, etc.)
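The task loop above can be sketched as: pull from an extractor until the WorkUnit runs out of records, handing each record to a writer. A minimal sketch with stand-in names, not the Gobblin Task API:

```java
import java.util.Arrays;
import java.util.Iterator;

// Minimal sketch of a Task: drain the extractor, write each record,
// and finish when the WorkUnit is exhausted of records.
class TaskDemo {
    public static void main(String[] args) {
        // Stand-in extractor over a bounded WorkUnit of three records
        Iterator<String> extractor = Arrays.asList("r1", "r2", "r3").iterator();
        int written = 0;
        while (extractor.hasNext()) {       // task ends when records run out
            String rec = extractor.next();
            written++;                      // stand-in for writer.write(rec)
        }
        System.out.println("records written: " + written);
    }
}
```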
  10. Extractor: a provider of records given a WorkUnit. Connects to the data source and deserializes records.
  11. Converter: a 1:N mapper of input records to output records. Multiple converters can be chained (e.g. Avro <-> JSON, schema projection, encryption).
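The 1:N mapping and chaining idea can be sketched with a tiny interface. This is illustrative only; real Gobblin converters also handle schema conversion and are wired up via the pipeline spec.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical 1:N converter: each input record maps to zero or more outputs.
interface Converter<I, O> {
    List<O> convert(I record);
}

class ConverterChainDemo {
    public static void main(String[] args) {
        // Converter 1: split a comma-separated line into fields (1 -> N)
        Converter<String, String> split = line -> Arrays.asList(line.split(","));
        // Converter 2: upper-case each field (1 -> 1), chained after the first
        Converter<String, String> upper = s -> Arrays.asList(s.toUpperCase());

        List<String> out = new ArrayList<>();
        for (String field : split.convert("login,logout")) {
            out.addAll(upper.convert(field));
        }
        System.out.println(out);  // [LOGIN, LOGOUT]
    }
}
```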
  12. Quality Checker: checks whether the quality of the output is satisfactory. Row-level (e.g. time-value check) or task-level (e.g. audit check, schema compatibility).
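A row-level time-value check like the one mentioned above might look like this; the method name and skew threshold are assumptions for illustration, not Gobblin's policy API.

```java
// Hypothetical row-level quality check: a record passes if its timestamp is
// within an allowed skew of "now"; failing rows can be dropped or quarantined.
class TimeValueCheck {
    static boolean passes(long recordTsMillis, long nowMillis, long maxSkewMillis) {
        return Math.abs(nowMillis - recordTsMillis) <= maxSkewMillis;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        System.out.println(passes(now - 500, now, 1_000));   // recent record: true
        System.out.println(passes(now - 5_000, now, 1_000)); // stale record: false
    }
}
```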
  13. Writer: writes to the destination. Manages the connection to the destination and serializes records; sync or async. E.g. FsWriter, KafkaWriter, CouchbaseWriter.
  14. Publisher: finalizes / commits the data. Used for destinations that support atomicity (e.g. move a tmp staging directory to the final output directory on HDFS).
  15. Gobblin: The Logical Pipeline
  16. Gobblin: The Logical Pipeline, now stateful. A State Store (HDFS, S3, MySQL, ZK, …) is used to load config and previous watermarks at startup and to save watermarks as the pipeline runs.
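The load-watermark / save-watermark cycle can be sketched as below; the in-memory map stands in for a real state store (HDFS, MySQL, ZK, …) and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stateful pipeline: load the previous run's watermark,
// pull only records past it, then save the new high watermark.
class WatermarkDemo {
    static final Map<String, Long> stateStore = new HashMap<>();

    public static void main(String[] args) {
        String key = "LoginEvent-partition-10";
        long previous = stateStore.getOrDefault(key, 0L); // first run: start at 0
        long newHighWatermark = previous + 200;           // pretend we pulled 200 offsets
        stateStore.put(key, newHighWatermark);            // checkpoint for the next run
        System.out.println("watermark: " + previous + " -> " + stateStore.get(key));
    }
}
```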
  17. Gobblin: Pipeline Specification
  18. Gobblin: Pipeline Specification (built up over slides 18-20). Pipeline name and description:
      job.name=PullFromWikipedia
      job.group=Wikipedia
      job.description=A getting started example for Gobblin
      Source + configuration:
      source.class=gobblin.example.wikipedia.WikipediaSource
      source.page.titles=LinkedIn,Wikipedia:Sandbox
      source.revisions.cnt=5
      wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
      wikipedia.avro.schema={"namespace": "example.wikipedia.avro" ,…"null"]}]}
      gobblin.wikipediaSource.maxRevisionsPerPage=10
      Converter:
      converter.classes=gobblin.example.wikipedia.WikipediaConverter
      extract.namespace=gobblin.example.wikipedia
      Writer + configuration:
      writer.destination.type=HDFS
      writer.output.format=AVRO
      writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner
      Publisher:
      data.publisher.type=gobblin.publisher.BaseDataPublisher
  21. Gobblin: Pipeline Deployment. One spec, multiple environments: Standalone single instance (one box), Standalone Cluster (static cluster), Hadoop (YARN / MR), AWS EC2 (elastic cluster); on bare metal / AWS / Azure / VM; sized small, medium, or large.
  22. Execution Model: Batch versus Streaming. Batch: determine work, acquire slots, run, checkpoint, repeat. Pros: cost-efficient, deterministic, repeatable. Cons: higher latency; setup and checkpoint costs dominate if "micro-batching".
  23. Execution Model: Batch versus Streaming. Streaming: determine work streams, run continuously, checkpoint periodically. Pros: low latency. Cons: higher cost because it is harder to provision capacity accurately; more sophistication needed to deal with change.
  24. Execution Model Scorecard: common use cases (JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis) each lean toward batch, streaming, or a mix of both.
  25. Can we run in both models using the same system?
  26. Gobblin: The Logical Pipeline
  27. Pipeline Stages: Start. Batch: determine work. Streaming: determine work, with unbounded WorkUnits.
  28. Pipeline Stages: Run. Batch: acquire slots, run. Streaming: run continuously, checkpoint periodically, shut down gracefully. A Watermark Manager handles notify/ack against State Storage and the shutdown signal.
  29. Pipeline Stages: End. Batch: checkpoint, commit. Streaming: do nothing (NoOpPublisher).
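Since a streaming task never reaches a natural end, progress is committed on a timer rather than at job completion. A minimal simulation of a periodic watermark commit, assuming a 2-second interval like the streaming.watermark.commitIntervalMillis setting in the Kafka-to-Kafka spec:

```java
// Sketch of periodic checkpointing in streaming mode: a simulated clock
// advances, and a commit fires whenever the interval has elapsed.
class PeriodicCommitDemo {
    public static void main(String[] args) {
        long commitIntervalMillis = 2_000;  // cf. streaming.watermark.commitIntervalMillis
        long lastCommit = 0;
        long commits = 0;
        for (long now = 0; now <= 10_000; now += 500) {  // simulated 10s of wall clock
            if (now - lastCommit >= commitIntervalMillis) {
                commits++;                  // stand-in for saving watermarks to the store
                lastCommit = now;
            }
        }
        System.out.println("commits: " + commits);
    }
}
```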
  30. Enabling streaming mode: task.executionMode = streaming. Available across deployment modes: Standalone (single instance), Standalone Cluster, AWS, Hadoop (YARN / MR).
  31. A Streaming Pipeline Spec: Kafka 2 Kafka (built up over slides 31-35). Pipeline name and description:
      # A sample pull file that copies an input Kafka topic and
      # produces to an output Kafka topic with sampling
      job.name=Kafka2KafkaStreaming
      job.group=Kafka
      job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
      job.lock.enabled=false
      Source, configuration:
      source.class=gobblin.source….KafkaSimpleStreamingSource
      gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
      gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
      gobblin.streaming.kafka.topic.singleton=test
      kafka.brokers=localhost:9092
      Converter, configuration:
      # Sample 10% of the records
      converter.classes=gobblin.converter.SamplingConverter
      converter.sample.ratio=0.10
      Writer, configuration:
      writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
      writer.kafka.topic=test_copied
      writer.kafka.producerConfig.bootstrap.servers=localhost:9092
      writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
      Publisher, execution mode, watermark storage:
      data.publisher.type=gobblin.publisher.NoopPublisher
      task.executionMode=STREAMING
      # Configure watermark storage for streaming
      #streaming.watermarkStateStore.type=zk
      #streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
      # Configure watermark commit settings for streaming
      #streaming.watermark.commitIntervalMillis=2000
  36. Gobblin Streaming: Cluster view. A cluster of processes: a Cluster Master plus Helix workers (Worker 1, 2, 3) reading from a stream source and writing to a sink (Kafka, HDFS, …). Apache Helix handles work-unit assignment, fault tolerance, and reassignment.
  37. Active Workstreams in Gobblin. Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications, which compile down to physical pipeline specs. Global Throttling: ensures Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode load); generic, so it can be used outside Gobblin. Metadata driven: integration with a metadata service (c.f. WhereHows); policy-driven replication, permissions, encryption, etc.
  38. Roadmap: final LinkedIn Gobblin 0.10.0 release; Apache Incubator code donation and release; more streaming runtimes (integration with Apache Samza, LinkedIn Brooklin); GDPR compliance (data purge for Hadoop and other systems); security improvements (credential storage, secure specs).
  39. Gobblin Team @ LinkedIn
