Stream processing for the masses with Beam, Python and Flink


  1. Stream processing for the masses with Beam, Python and Flink. Sept 12th, 2019. Enrico Canzonieri (@EnricoC89)
  2. Yelp’s Mission: Connecting people with great local businesses
  3. Evolving data processing: latency ~ hours/days
  4. Evolving data processing: latency ~ hours/days, then latency ~ seconds/minutes
  5. Streams power Yelp. Powered by streaming:
     - Notifications
     - Real-time visit detection
     - User Search
     - Indexing pipeline
     - User personalization
     - Purchase flows
     - ML feature ETL
     - Ads
     - Product development
     - Transactions
     - Realtime campaign shut-off
     - Experimentation infrastructure and guardrail metrics
  6.-7. Scribe, 2015-2019: tooling innovation leads to more data
  8. Data Pipeline:
     - Strong data schematization and documentation
     - Standardized wire protocol (Avro)
     - Contract between data producers and consumers
     - Centralized schema registry
     - Decoupled data ETL
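To make the schematization point concrete: an Avro schema is a JSON document registered centrally so producers and consumers share one contract. The record below is a purely illustrative sketch; the field names and namespace are assumptions, not Yelp's actual registry contents.

```python
import json

# Hypothetical Avro record schema for a review stream; field names
# and namespace are illustrative only, not Yelp's real schema.
REVIEW_SCHEMA = json.dumps({
    "type": "record",
    "name": "Review",
    "namespace": "example.datapipeline",
    "fields": [
        {"name": "business_id", "type": "string"},
        {"name": "rating", "type": "double"},
        # A union with "null" plus a default makes the field optional.
        {"name": "text", "type": ["null", "string"], "default": None},
    ],
})

parsed = json.loads(REVIEW_SCHEMA)
```

Because the schema travels as JSON, the registry can validate and version it independently of any producer or consumer deployment.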
  9. Processor: Paastorm
     - Paastorm was Yelp’s answer to the lack of good open source Python stream processors
     - Paastorm provides a thin wrapper around the Kafka producer/consumer
     - Good fit to perform map/flatmap transformations
  10. Processor: Paastorm adoption
     - The Paastorm API is a class called Spolt
     - Users extend the Spolt and implement process_message(self, message)
     - Over 150 production Paastorm applications
  11. Processor: Paastorm code

      class GreatReviewsSpolt(Spolt):
          def process_message(self, message):
              payload = message.payload_data
              if payload['rating'] >= 4.0:
                  yield message

      if __name__ == '__main__':
          Paastorm(GreatReviewsSpolt()).run()
  12. Need shiny new tools to leverage the value of real-time data
  13. Apache Flink (2017) unlocked real-time data processing at scale:
      - Stateful processing
      - Powerful streaming oriented (DataStream) API
      - Event time processing
  14. Use cases (2017):
      - Stream SQL: Flink SQL wrapper to run arbitrary queries
      - Joinery: unwindowed streaming join of table change streams
      - Aggregator: unwindowed aggregation of table change streams
      - Sessionizer: create sessions from event logs
      - Connectors: cassandra, elasticsearch, redshift, etc.
  15. Yelp’s data pipeline stack
  16. Limitations:
      - Tightly coupled to Kafka
      - No high level primitives: groupBy, windowing, filter, etc.
      - No stateful processing support
      - High cost to implement and maintain new features
  17. Challenges:
      - Hundreds of Python libraries implementing business logic
      - Flink SQL is good mostly for simple SQL-like transformations
      - High barrier to entry to a JVM language
  18. Backend: finding the next new shiny tool
  19. Beam:
      - Driver program: responsible for defining the pipeline in the Beam SDK
      - Pipeline: represents the logical data processing tasks
      - Execution: run on any supported distributed processing framework
      - Programming model, and more ...
  20. Beam pipeline: IO/Create -> PTransform -> PCollection -> PTransform -> IO/Write
      - PCollections: distributed bounded or unbounded data sets
      - PTransform: a processing step that transforms PCollections
      - PCollection elements have an associated timestamp
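As a rough stdlib-only analogy (not the Beam API itself): a PCollection can be modeled as a plain collection of elements and a PTransform as a function from one collection to another, chained from a source to a sink. The helper names below are invented for illustration.

```python
def pardo(fn):
    """Toy PTransform: apply fn to each element, flattening the results.

    Returning a list per element lets one step act as filter (empty
    list), map (one-element list), or flatmap (many elements).
    """
    return lambda pcoll: [out for elem in pcoll for out in fn(elem)]

def run_pipeline(source, *transforms):
    """Thread a source collection through a chain of transforms."""
    pcoll = list(source)
    for transform in transforms:
        pcoll = transform(pcoll)
    return pcoll

# Filter ratings >= 4.0, then format them: a Filter + Map chain.
high = run_pipeline(
    [3.5, 4.0, 4.5],
    pardo(lambda r: [r] if r >= 4.0 else []),
    pardo(lambda r: [f"{r} stars"]),
)
# high == ["4.0 stars", "4.5 stars"]
```

The real Beam model differs in important ways (lazy construction, distributed execution, per-element timestamps), but the shape of "source through transforms to sink" carries over.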
  21. Beam Python SDK:
      - High level API: ParDo, Map, Flatmap, Filter, GroupByKey, CoGroupByKey
      - Support for windows and triggers: Fixed, Sliding and Session windows and a variety of triggers
      - Side input and tagged output: ParDo with two or more inputs and two or more outputs
      - State and timers: can be combined to build complex stateful applications
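To make the windowing idea concrete, here is a stdlib-only sketch (not the Beam implementation) of assigning timestamped elements to fixed windows, the simplest of the window types listed above:

```python
def assign_fixed_windows(events, size_s):
    """Bucket (timestamp_s, value) pairs into fixed windows of size_s seconds.

    Returns a dict mapping each window's start time to its values,
    loosely mirroring what WindowInto + GroupByKey would produce.
    """
    windows = {}
    for ts, value in events:
        start = ts - (ts % size_s)  # window containing this timestamp
        windows.setdefault(start, []).append(value)
    return windows

events = [(3, "a"), (12, "b"), (14, "c"), (21, "d")]
by_window = assign_fixed_windows(events, 10)
# by_window == {0: ["a"], 10: ["b", "c"], 20: ["d"]}
```

Sliding windows would assign each element to several overlapping buckets, and session windows would merge buckets separated by less than a gap; triggers then decide when each bucket's contents are emitted.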
  22. Execution: Beam Portability API
      - Portability API: defines the protocols used by the runner (e.g. Flink) to translate and run the pipeline
      - Python SDK: makes use of a containerized SDK harness to run language specific UDFs
      - Fn API: relies on gRPC for runner - SDK worker communication
      (Diagram: pipeline construction in Beam Java, Beam Python, or other languages; Fn Runners execute the Beam model on Apache Flink, Apache Spark, or Cloud Dataflow)
  23. Execution: the Flink Runner
  24. Integration: Data Pipeline source and sink
      - Yelp specific implementation to discover Kafka clusters
      - Team expertise around the Flink Consumer/Producer
      - Customize the portable translation to “attach” existing Flink components to a Beam pipeline
      (Diagram: Flink DPSource -> Coder deserializer -> WindowedValue<byte[]> -> Beam -> WindowedValue<byte[]> -> Coder serializer -> Flink DPSink)
  25. Integration: invoking Flink code
      - PBegin for the source and PDone for the sink
      - A Beam urn identifies a PTransform during the translation
      - Can use JSON to pass parameters from Python Beam to Java Flink

      class FlinkYelpDatapipelineSource(PTransform):
          def expand(self, pbegin):
              return pvalue.PCollection(pbegin.pipeline)

          def infer_output_type(self, unused_input_type):
              return Message

          def to_runner_api_parameter(self, context):
              api_parameters = ('yelp:flinkYelpDatapipelineSource', json.dumps({...}))
              return api_parameters

          @staticmethod
          @PTransform.register_urn('yelp:flinkYelpDatapipelineSource', None)
          def from_runner_api_parameter(spec_parameter, _unused_context):
              instance = FlinkYelpDatapipelineSource()
              params = json.loads(spec_parameter)
              ....
              return instance
  26. Integration: translate to a Flink operator
      - Fork and add your urn and translation function to the translatorMap
      - The result of the translation is a “chunk” of the Flink pipeline
      - The output/input Flink DataStream is of type WindowedValue<byte[]>
  27. Integration: message envelope
      - Bytes from Kafka are unpacked into a Message: UUID, Message Type, Timestamp Micros, Metadata / Headers, Payload (Field 1 ... Field n), and the Kafka Position (Cluster, Topic, Partition, Offset)
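A hedged sketch of how such an envelope could be packed and unpacked. The layout here (8-byte big-endian timestamp, 16 raw UUID bytes, then the payload) is an assumption for illustration, not Yelp's actual wire format, which also carries the message type, headers, and Kafka position.

```python
import struct
import time
import uuid

def pack_envelope(payload: bytes) -> bytes:
    # Assumed layout: 8-byte big-endian timestamp in microseconds,
    # 16 raw UUID bytes, then the schema-encoded payload.
    ts_micros = int(time.time() * 1_000_000)
    return struct.pack(">Q", ts_micros) + uuid.uuid4().bytes + payload

def unpack_envelope(data: bytes):
    # Invert pack_envelope: fixed-size header fields, then the payload.
    ts_micros, = struct.unpack(">Q", data[:8])
    message_uuid = uuid.UUID(bytes=data[8:24])
    return ts_micros, message_uuid, data[24:]

envelope = pack_envelope(b"avro-bytes")
ts, msg_id, payload = unpack_envelope(envelope)
# payload == b"avro-bytes"
```

Fixed-size header fields up front keep decoding cheap: a consumer can read the timestamp and UUID without touching the (possibly large) payload.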
  28. Serialization: Beam Coder
      - Data needs to be properly serialized between Flink and the Beam SDK worker
      - Extend the Beam Coder class to implement a custom coder for the Message class
      - Register the coder when the source/sink is being used

      class DataPipelineCoder(Coder):
          def encode(self, value: Message):
              envelope = Envelope()
              return envelope.pack(value)

          def decode(self, value: bytes) -> Message:
              return create_from_kafka_message_value(value)

      registry.register_coder(Message, DataPipelineCoder)
  29. Serialization: Beam typehints
      - Critical to make sure that the proper Coder is being used
      - Every PTransform that returns a Message must use the typehint annotation @typehints.with_output_types(Message)
  30. Development: Beam application
      - Yelp specific integration into the yelp-beam wrapper
      - Makefile to download and start Flink and the Job Server locally
      - Run the SDK worker on the host instead of in Docker
  31. Development: acceptance testing
  32. Flink: practical differences
      - Processing time using GlobalWindow
      - No access to the time characteristic
      - Powerful but possibly complex trigger composition
  33. Flink: practical differences

      Flink:
      dataStream
          .keyBy()
          .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
          .process(new SomeCount())

      Beam:
      (pcoll
          | beam.Map(lambda message: (key, message))
          | beam.WindowInto(
              window.GlobalWindows(),
              trigger=Repeatedly(AfterProcessingTime(10000)),
              accumulation_mode=AccumulationMode.DISCARDING,
          )
          | beam.GroupByKey()
          | beam.ParDo(SomeCount()))
  34. Deployment: running Beam
      - Run on Kubernetes
      - One Flink cluster per Beam service
      - One base Docker image extended with service specific deps
      - Run the SDK worker in the same Task Manager container
  35. Deployment: Yelp’s Flink operator
  36. Deployment: Flink Supervisor
      - Long running Python process
      - Controls Beam/Flink job startup
      - Checkpoint and savepoint management
      - Handles job failures and restarts
      - Monitoring and alerting
  37. Deployment: job launching
  38. Deployment: can we do better?
      - BEAM-7966: write portable Beam application jar
      - BEAM-7980: external environment with containerized worker pool (Beam 2.16)
      - Pluggable custom portable translations
  39. Adoption: Paastorm on Beam

      class GreatReviewsSpolt(Spolt):
          def process_message(self, message):
              payload = message.payload_data
              if payload['rating'] >= 4.0:
                  yield message

      if __name__ == '__main__':
          Paastorm(GreatReviewsSpolt()).run()
  40. Adoption: Paastorm on Beam

      @typehints.with_output_types(Message)
      class Spolt(beam.DoFn):
          def process_message(self, message):
              raise NotImplementedError()

          def process(self, element):
              return self.process_message(element)

      class Paastorm:
          def __init__(self, paastorm_fn):
              self.paastorm_fn = paastorm_fn

          def run(self):
              p = beam.Pipeline(options=options)
              messages = (p
                  | DataPipelineSource()
                  | beam.Map(lambda message: (kafka_partition, message))
                  | beam.ParDo(self.paastorm_fn)
                  | DataPipelineSink())
  41. Takeaways: the future is now
      - Run all of our stream processing on one engine: Flink
      - Legacy Paastorm easily migrated
      - Feature parity across languages
      - New applications use native Beam
      - Yelp’s Indexing pipeline as first use case
  42. @YelpEngineering
  43. Questions/Suggestions?