This document discusses cloud-native Apache Kafka and Kubernetes integrations. It introduces Apache Kafka as a proven technology for real-time data processing and describes how it is used for digital experiences, microservices applications, streaming ETL, and real-time analytics. It then makes the case for using a cloud-native, managed Kafka platform like Red Hat OpenShift Streams, which reduces complexity by providing a managed Apache Kafka cluster. Finally, it provides an overview of Red Hat OpenShift Streams and demonstrates how to use Kamelets to easily integrate and develop stream-based applications on Kubernetes.
New Features in Confluent Platform 6.0 / Apache Kafka 2.6 | Kai Wähner
New Features in Confluent Platform 6.0 / Apache Kafka 2.6, including REST Proxy and API, Tiered Storage for AWS S3 and GCP GCS, Cluster Linking (On-Premise, Edge, Hybrid, Multi-Cloud), Self-Balancing Clusters, and ksqlDB.
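As a rough illustration of the REST Proxy mentioned above, the sketch below posts a JSON record to a topic over HTTP. The proxy address, topic name, and payload are illustrative assumptions; the v2 JSON content type is the one the proxy's produce endpoint expects.

```python
# Minimal sketch: produce a JSON record through the Kafka REST Proxy.
# Assumes a REST Proxy listening on localhost:8082 and a topic named "orders".
import json
import requests

REST_PROXY = "http://localhost:8082"   # assumed proxy address
TOPIC = "orders"                        # hypothetical topic

payload = {"records": [{"value": {"order_id": 42, "amount": 9.99}}]}
resp = requests.post(
    f"{REST_PROXY}/topics/{TOPIC}",
    headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
    data=json.dumps(payload),
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # partition/offset information for the written record
```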
Concepts and Patterns for Streaming Services with Kafka | QAware GmbH
Cloud Native Night March 2020, Mainz: Talk by Perry Krol (@perkrol, Confluent)
Abstract: Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but they provide a more holistic and compelling approach when applied together. In this session Confluent will provide insights into how service-based architectures and stream processing tools such as Apache Kafka® can help you build business-critical systems. You will learn why streaming beats request-response based architectures in complex, contemporary use cases, and why replayable logs such as Kafka provide a backbone for both service communication and shared datasets.
Based on these principles, we will explore how event collaboration and event sourcing patterns increase safety and recoverability with functional, event-driven approaches, how to apply patterns including Event Sourcing and CQRS, and how to build multi-team systems with microservices and SOA using patterns such as “inside-out databases” and “event streams as a source of truth”.
Now You See Me, Now You Compute: Building Event-Driven Architectures with Apa... | Michael Noll
Talk URL: https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/77360
Abstract: Would you cross the street with traffic information that’s a minute old? Certainly not. Modern businesses have the same needs nowadays, whether it’s due to competitive pressure or because their customers have much higher expectations of how they want to interact with a product or service. At the heart of this movement are events: in today’s digital age, events are everywhere. Every digital action, from online purchases to ride-sharing requests to bank deposits, creates a set of events around transaction amount, transaction time, user location, account balance, and much more. The technology that allows businesses to read, write, store, and process these events in real time is the event-streaming platform, and tens of thousands of companies like Netflix, Audi, PayPal, Airbnb, Uber, and Pinterest have picked Apache Kafka as the de facto choice to implement event-driven architectures and reshape their industries.
Michael Noll explores why and how you can use Apache Kafka and its growing ecosystem to build event-driven architectures that are elastic, scalable, robust, and fault tolerant, whether it’s on-premises, in the cloud, on bare metal machines, or in Kubernetes with Docker containers. Specifically, you’ll look at Kafka as the storage and publish and subscribe layer; Kafka’s Connect framework for integrating external data systems such as MySQL, Elastic, or S3 with Kafka; and Kafka’s Streams API and KSQL as the compute layer to implement event-driven applications and microservices in Java and Scala and streaming SQL, respectively, that process the events flowing through Kafka in real time. Michael provides an overview of the most relevant functionality, both current and upcoming, and shares best practices and typical use cases so you can tie it all together for your own needs.
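Kafka Streams and KSQL themselves are JVM- and SQL-based, so purely as a rough, language-neutral illustration of the consume-transform-produce idea behind that compute layer, here is a sketch using the confluent-kafka Python client; the broker address, topic names, and the enrichment rule are assumptions.

```python
# Sketch of the "compute layer" idea with a plain Python client:
# consume events, apply a transformation, and produce the result to another topic.
# Broker address and topic names are assumptions; requires the confluent-kafka package.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])  # hypothetical input topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        event["amount_eur"] = round(event["amount_usd"] * 0.92, 2)  # toy enrichment rule
        producer.produce("payments-enriched", json.dumps(event).encode("utf-8"))
        producer.poll(0)  # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```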
Event streaming: A paradigm shift in enterprise software architecture | Sina Sojoodi
This talk helps developers and architects understand the benefits, opportunities and challenges in moving from traditional point-to-point integration in application architecture to one with event streaming. Apache Kafka and Spring provide a solid foundation for enterprise and large organizations to implement event streaming solutions. Examples and common patterns are covered towards the end.
Many thanks to James Watters and all the original content authors, editors and aggregators referenced in the slides.
Build a Bridge to Cloud with Apache Kafka® for Data Analytics Cloud Services | confluent
Build a Bridge to Cloud with Apache Kafka® for Data Analytics Cloud Services, Perry Krol, Head of Systems Engineering, CEMEA, Confluent
https://www.meetup.com/Frankfurt-Apache-Kafka-Meetup-by-Confluent/events/269751169/
Battle-tested event-driven patterns for your microservices architecture - Sca... | Natan Silnitsky
During the past couple of years I have implemented, or witnessed implementations of, several key event-driven messaging patterns on top of Kafka. These patterns have helped us create a robust distributed microservices system at Wix that can easily handle increasing traffic and storage needs across many different use cases.
In this talk I will share these patterns with you, including:
* Consume and Project (data decoupling)
* End-to-end Events (Kafka+websockets)
* In memory KV stores (consume and query with 0-latency)
* Event transactions (Exactly Once Delivery)
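To make the first pattern above concrete, here is a rough Consume and Project sketch: a consumer keeps only the fields a read side needs and maintains them in an in-memory key-value view. The broker address, topic, and field names are illustrative assumptions, and the snippet uses the confluent-kafka client.

```python
# Consume and Project (sketch): build a narrow, query-optimized view from a wide event stream.
# Broker, topic, and field names are illustrative assumptions; uses confluent-kafka.
import json
from confluent_kafka import Consumer

view = {}  # in-memory KV store: user_id -> projected fields, queryable with no extra hop

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "profile-view",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Project only what the read side needs, ignoring the rest of the payload.
    view[event["user_id"]] = {
        "display_name": event.get("display_name"),
        "plan": event.get("plan"),
    }
```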
What is Apache Kafka and What is an Event Streaming Platform? | confluent
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the... | Natan Silnitsky
Kafka is the bedrock of Wix's distributed microservices system. For the last 5 years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1500 microservices.
We’ve managed to achieve higher decoupling and independence for our various services and dev teams that have very different use-cases while maintaining a single uniform infrastructure in place.
In these slides you will learn about 8 key decisions and steps you can take in order to safely scale up your Kafka-based system. These include:
* How to increase dev velocity of event-driven style code.
* How to optimize working with Kafka in a polyglot setting.
* How to support a growing amount of traffic and a growing number of developers.
1) Sam Vanhoutte discusses using Azure services like IoT Edge, IoT Hub, Stream Analytics, and Azure Databricks for real-time data analytics in IoT from edge to cloud.
2) A traffic camera scenario is presented where IoT Edge is used at the edge for tasks like license plate recognition while the cloud is used for analytics like detecting speeding tickets and suspicious vehicles.
3) Stream Analytics is used both at the edge and in the cloud to process streaming data in real-time while Azure Databricks is used for structured streaming and continuous aggregations using Apache Spark.
Real-time data has traditionally been analyzed using batch processing in DWH/Hadoop environments. Common use cases involve data lakes, data science, and machine learning (ML). Creating serverless data-driven architectures and serverless streaming solutions with services like Amazon Kinesis, AWS Lambda, and Amazon Athena can solve real-time ingestion, storage, and analytics challenges, and help you focus on application logic without managing infrastructure. Learn design patterns and best practices for serverless stream processing.
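To make the serverless stream-processing idea concrete, here is a minimal sketch of an AWS Lambda handler for Kinesis records. The records arrive base64-encoded in the standard Kinesis event shape; the field names and alerting rule are invented for illustration.

```python
# Sketch of a serverless stream processor: an AWS Lambda handler for Kinesis records.
# The event structure is the standard Kinesis -> Lambda shape; field names and the
# threshold are illustrative assumptions.
import base64
import json

def handler(event, context):
    alerts = 0
    records = event.get("Records", [])
    for record in records:
        payload = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(payload)
        if reading.get("temperature_c", 0) > 90:  # toy real-time rule
            alerts += 1
            print(f"ALERT sensor={reading.get('sensor_id')} temp={reading['temperature_c']}")
    return {"processed": len(records), "alerts": alerts}
```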
GCP for Apache Kafka® Users: Stream Ingestion and Processing | confluent
Watch this talk here: https://www.confluent.io/online-talks/gcp-for-apache-kafka-users-stream-ingestion-processing
In private and public clouds, stream analytics commonly means stateless processing systems organized around Apache Kafka® or a similar distributed log service. GCP took a somewhat different tack, with Cloud Pub/Sub, Dataflow, and BigQuery, distributing the responsibility for processing among ingestion, processing and database technologies.
We compare the two approaches to data integration and show how Dataflow allows you to join, transform, and deliver data streams among on-prem and cloud Apache Kafka clusters, Cloud Pub/Sub topics, and a variety of databases. The session will have a mix of architectural discussions and practical code reviews of Dataflow-based pipelines.
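As a rough sketch of the Dataflow-centric approach described above, the following Apache Beam (Python SDK) pipeline reads from Cloud Pub/Sub, reshapes each message, and writes to BigQuery; the project, subscription, table, and schema names are placeholders.

```python
# Sketch of a Dataflow-style pipeline with the Apache Beam Python SDK:
# Pub/Sub in, light transformation, BigQuery out. Resource names are placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # add --runner=DataflowRunner etc. when deploying

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadPubSub" >> beam.io.ReadFromPubSub(
              subscription="projects/my-project/subscriptions/clicks-sub")
        | "Parse" >> beam.Map(json.loads)
        | "Shape" >> beam.Map(lambda e: {"user": e["user"], "url": e["url"]})
        | "WriteBQ" >> beam.io.WriteToBigQuery(
              "my-project:analytics.clicks",
              schema="user:STRING,url:STRING",
              write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```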
How to build 1000 microservices with Kafka and thrive | Natan Silnitsky
This talk is about the Wix ecosystem for event driven architecture on top of Kafka.
I share the best practices, SDKs and tools we have created in order to be able to scale our distributed system to more than 1000 microservices.
Should we manage events like APIs? | Kim Clark, IBM | Hosted by Confluent
APIs have become ubiquitous as a way of exposing the capabilities of the enterprise both internally and externally. However, are APIs alone enough? There is a strong resurgence in interest in asynchronous communication and event-driven architecture. Applications want to receive events immediately so they can respond in real time, and furthermore they also want the benefit of being decoupled from the availability and performance characteristics of the systems providing that data. However, whilst the way that APIs are socialized, exposed, and versioned is well matured in the form of API management technology, we are only now on the cusp of seeing first-class support for event endpoint management to provide the same sophistication for discovering, exposing, and consuming events.
Top 5 Event Streaming Use Cases for 2021 with Apache Kafka | Kai Wähner
Apache Kafka and Event Streaming are two of the most relevant buzzwords in tech these days. Ever wonder what the predicted TOP 5 Event Streaming Architectures and Use Cases for 2021 are? Check out the following presentation. Learn about edge deployments, hybrid and multi-cloud architectures, service mesh-based microservices, streaming machine learning, and cybersecurity.
On-demand video recording: https://videos.confluent.io/watch/XAjxV3j8hzwCcEKoZVErUJ
Apache Kafka as Event Streaming Platform for Microservice Architectures | Kai Wähner
This session introduces Apache Kafka, an event-driven open source streaming platform. Apache Kafka goes far beyond scalable, high volume messaging. In addition, you can leverage Kafka Connect for integration and the Kafka Streams API for building lightweight stream processing microservices in autonomous teams. The Confluent Platform adds further components such as a Schema Registry, REST Proxy, KSQL, Clients for different programming languages and Connectors for different technologies.
The session discusses how tech giants like LinkedIn, eBay, or Airbnb leverage Apache Kafka as an event streaming platform to solve various business problems and how to create a scalable, flexible microservice architecture. A live demo shows how you can easily process and analyze streams of events using Apache Kafka and KSQL.
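To show what Kafka Connect based integration can look like in practice, the sketch below registers a hypothetical JDBC source connector through the Connect REST API; the worker address, connector class, and connection settings are assumptions that depend on which plugins are installed.

```python
# Sketch: register a source connector via the Kafka Connect REST API.
# The Connect worker address is assumed to be localhost:8083; the connector class
# and connection settings are illustrative and depend on the installed plugins.
import requests

connector = {
    "name": "orders-jdbc-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://db:3306/shop",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "mysql-",
        "tasks.max": "1",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json()["name"], "registered")
```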
Technical Deep Dive: Using Apache Kafka to Optimize Real-Time Analytics in Fi... | confluent
Watch this talk here: https://www.confluent.io/online-talks/using-apache-kafka-to-optimize-real-time-analytics-financial-services-iot-applications
When it comes to the fast-paced nature of capital markets and IoT, the ability to analyze data in real time is critical to gaining an edge. It’s not just about the quantity of data you can analyze at once, it’s about the speed, scale, and quality of the data you have at your fingertips.
Modern streaming data technologies like Apache Kafka and the broader Confluent platform can help detect opportunities and threats in real time. They can improve profitability, yield, and performance. Combining Kafka with Panopticon visual analytics provides a powerful foundation for optimizing your operations.
Use cases in capital markets include transaction cost analysis (TCA), risk monitoring, surveillance of trading and trader activity, compliance, and optimizing profitability of electronic trading operations. Use cases in IoT include monitoring manufacturing processes, logistics, and connected vehicle telemetry and geospatial data.
This online talk will include in-depth practical demonstrations of how Confluent and Panopticon together support several key applications. You will learn:
-Why Apache Kafka is widely used to improve performance of complex operational systems
-How Confluent and Panopticon open new opportunities to analyze operational data in real time
-How to quickly identify and react immediately to fast-emerging trends, clusters, and anomalies
-How to scale data ingestion and data processing
-How to build new analytics dashboards in minutes
Streamsheets and Apache Kafka – Interactively build real-time Dashboards and ... | confluent
A powerful stream processing platform and an end-user-friendly spreadsheet interface: if this combination rings a bell, you should definitely attend our “Streamsheets and Apache Kafka” webinar. While development is interactive with a web user interface, Streamsheets applications can run as mission-critical applications. They directly consume and produce event streams in Apache Kafka. One popular option is to run everything in the cloud, leveraging the fully managed Confluent Cloud service on AWS, GCP, or Azure. Without any coding or scripting, end users leverage their existing spreadsheet skills to build customized streaming apps for analysis, dashboarding, condition monitoring, or any kind of real-time pre- and post-processing of Kafka or ksqlDB streams and tables.
Hear Kai Waehner of Confluent and Kristian Raue of Cedalo on these topics:
• Where Apache Kafka and Streamsheets fit in the data ecosystem (Industrial IoT, Smart Energy, Clinical Applications, Finance Applications)
• Customer Story: How the Freiburg University Hospital uses Kafka and Streamsheets for dashboarding the utilization of clinical assets
• 15-Minute Live Demonstration: Building a financial fraud detection dashboard based on Confluent Cloud, ksqlDB and Cedalo Cloud Streamsheets, just using spreadsheet formulas.
Speaker:
Kai Waehner, Technology Evangelist, Confluent
Kristian Raue, Founder & Chief Technologist, cedalo
Battle Tested Event-Driven Patterns for your Microservices Architecture - Dev... | Natan Silnitsky
During the past couple of years I have implemented, or witnessed implementations of, several key event-driven messaging patterns on top of Kafka. These patterns have helped us create a robust distributed microservices system at Wix that can easily handle increasing traffic and storage needs across many different use cases.
In this talk I will share these patterns with you, including:
* Consume and Project (data decoupling)
* End-to-end Events (Kafka+websockets)
* In memory KV stores (consume and query with 0-latency)
* Event transactions (Exactly Once Delivery)
Serverless London 2019: FaaS composition using Kafka and CloudEvents | Neil Avery
FaaS composition using Kafka and Cloud-Events
LOCATION: Burton & Redgrave, DATE: November 7, 2019, TIME: 2:30 pm - 3:15 pm
https://serverlesscomputing.london/sessions/faas-composition-using-kafka-and-cloud-events/
Serverless functions, or FaaS, are all the rage. By leveraging well-established event-driven microservice design principles and applying them to serverless functions, we can build a homogeneous ecosystem to run FaaS applications.
Kafka’s natural ability to store and replay events means serverless functions can not only be replayed, but can also be used to choreograph call chains or be driven using orchestration. Kafka also means we can democratize and organize FaaS environments in a way that scales across the enterprise.
Underpinning this mantra is the use of Cloud Events by the CNCF serverless working group (of which Confluent is an active member).
Objective of the talk
You will leave the talk with an understanding of what the future of cloud holds, a methodology for embracing serverless functions and how they become part of your journey to a cloud-native, event-driven architecture.
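As a rough sketch of the CloudEvents-on-Kafka idea behind this talk, the snippet below builds a CloudEvent with the Python cloudevents SDK, serializes it in structured mode, and publishes it with confluent-kafka; the event type, source, topic, and broker address are invented, and the exact import paths may vary between SDK versions.

```python
# Sketch: publish a CloudEvent to Kafka in structured mode.
# Uses the cloudevents and confluent-kafka packages; type, source, topic, and broker
# address are illustrative assumptions.
from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured
from confluent_kafka import Producer

event = CloudEvent(
    {"type": "com.example.order.created", "source": "orders-service"},
    {"order_id": 42, "amount": 9.99},
)
headers, body = to_structured(event)  # JSON body plus content-type header

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce(
    "order-events",
    value=body,
    headers=list(headers.items()),
)
producer.flush()
```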
Why Cloud-Native Kafka Matters: 4 Reasons to Stop Managing it Yourself | DATAVERSITY
The document discusses 4 reasons to use a cloud-native Kafka service like Confluent Cloud instead of managing Kafka yourself. It notes that managing Kafka requires significant investment of time and resources for tasks like architecture planning, cluster sizing, software upgrades, and more. A cloud-native service handles all operational overhead automatically so you can focus on your core business. Confluent Cloud specifically offers elastic scaling, infinite data retention, global access across clouds, and integrations to make it a complete data streaming platform.
Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka | Kai Wähner
Spoilt for Choice – Kafka Streams vs. KSQL for Stream Processing on top of Apache Kafka:
Apache Kafka is a de facto standard streaming data processing platform. It is widely deployed as an event streaming platform. Part of Kafka is its stream processing API, “Kafka Streams”. In addition, the Kafka ecosystem now offers KSQL, a declarative, SQL-like stream processing language that lets you define powerful stream-processing applications easily. What once took some moderately sophisticated Java code can now be done at the command line with a familiar and eminently approachable syntax.
This session discusses and demos the pros and cons of Kafka Streams and KSQL to understand when to use which stream processing alternative for continuous stream processing natively on Apache Kafka infrastructures. The end of the session compares the trade-offs of Kafka Streams and KSQL to separate stream processing frameworks such as Apache Flink or Spark Streaming.
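KSQL itself is just SQL text submitted to a server, so as a small sketch of that side of the comparison, the snippet below sends a stream definition and a filtering query to a ksqlDB server's REST endpoint from Python; the server address, stream, column, and topic names are assumptions.

```python
# Sketch: submit KSQL statements to a ksqlDB server over its REST API.
# Server address, stream names, columns, and topic are illustrative assumptions.
import requests

KSQL = """
CREATE STREAM payments (user_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='payments', VALUE_FORMAT='JSON');

CREATE STREAM large_payments AS
  SELECT user_id, amount FROM payments WHERE amount > 1000 EMIT CHANGES;
"""

resp = requests.post(
    "http://localhost:8088/ksql",
    headers={"Content-Type": "application/vnd.ksql.v1+json"},
    json={"ksql": KSQL, "streamsProperties": {}},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```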
Bridge Your Kafka Streams to Azure Webinar | confluent
With a fully managed Apache Kafka(R) as-a-service on Microsoft Azure, businesses can focus on building applications and not managing clusters. Build a persistent bridge from on-premises data systems to the cloud with a hybrid Kafka service or stream across public clouds for multi-cloud data pipelines.
In this session for business and technical data leaders, you can learn about powering business applications with the managed Kafka service that streams data into Azure SQL Data Warehouse, Cosmos DB, Azure Data Lake Storage and Azure Blob Storage.
More info: https://cnfl.io/cloud-native-experience-for-kafka-in-cloud | Neha Narkhede is co-founder and CTO at Confluent, a company backing the popular Apache Kafka messaging system. Prior to founding Confluent, Neha led streams infrastructure at LinkedIn, where she was responsible for LinkedIn’s streaming infrastructure built on top of Apache Kafka and Apache Samza. She is one of the initial authors of Apache Kafka and a committer and PMC member on the project.
Talk Python To Me: Stream Processing in your favourite Language with Beam on ... | Aljoscha Krettek
Flink is a great stream processor, Python is a great programming language, Apache Beam is a great programming model and portability layer. Using all three together is a great idea! We will demo and discuss writing Beam Python pipelines and running them on Flink. We will cover Beam's portability vision that led here, what you need to know about how Beam Python pipelines are executed on Flink, and where Beam's portability framework is headed next (hint: Python pipelines reading from non-Python connectors)
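A minimal sketch of what running a Beam Python pipeline on Flink can look like, assuming a locally reachable Flink cluster and the portable runner setup the talk describes; the master address and the toy word-count data are placeholders.

```python
# Sketch: a tiny Beam Python pipeline configured for the (portable) Flink runner.
# The Flink master address is an assumption; point it at your own cluster.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",
    "--environment_type=LOOPBACK",  # run Python user code in-process for local testing
])

with beam.Pipeline(options=options) as p:
    (
        p
        | beam.Create(["kafka", "flink", "beam", "beam"])
        | beam.Map(lambda w: (w, 1))
        | beam.CombinePerKey(sum)
        | beam.Map(print)
    )
```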
What you need to know about .NET Core 3.0 and beyond | Jon Galloway
The document provides an overview of .NET Core 3.0 including its top features, upcoming release schedule, and what is coming next. It discusses the key features in .NET Core 3.0 such as Windows desktop apps, microservices, gRPC, and machine learning. It also outlines the future of .NET with .NET 5 which will unify the different .NET implementations into a single platform.
Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce... | Flink Forward
Flink is a great stream processor, Python is a great programming language, Apache Beam is a great programming model and portability layer. Using all three together is a great idea! We will demo and discuss writing Beam Python pipelines and running them on Flink. We will cover Beam's portability vision that led here, what you need to know about how Beam Python pipelines are executed on Flink, and where Beam's portability framework is headed next (hint: Python pipelines reading from non-Python connectors)
Apache Pulsar Development 101 with Python | Timothy Spann
Apache Pulsar Development 101 with Python PS2022_Ecosystem_v0.0
There is always the fear that a speaker cannot make it, so, as the MC for the ecosystem track, I put together a backup talk just in case.
Here it is: never seen or presented.
Flink Forward Berlin 2018: Thomas Weise & Aljoscha Krettek - "Python Streamin... | Flink Forward
Python is popular amongst data scientists and engineers for data processing tasks. The big data ecosystem has traditionally been rather JVM centric. Often Java (or Scala) is the only viable option to implement data processing pipelines. That sometimes poses an adoption barrier for organizations that have already invested in other language ecosystems. The Apache Beam project provides a unified programming model for data processing and its ongoing portability effort aims to enable multiple language SDKs (currently Java, Python and Go) on a common set of runners. The combination of Python streaming on the Apache Flink runner is one example. Let’s take a look at how the Flink runner translates the Beam model into the native DataStream (or DataSet) API, how the runner is changing to support portable pipelines, how Python user code execution is coordinated with gRPC based services and how a sample pipeline runs on Flink.
Workshop híbrido: Stream Processing con Flink | confluent
Stream processing is a prerequisite of the data streaming stack, powering real-time applications and pipelines.
It enables greater data portability, optimized resource utilization, and a better customer experience by processing data streams in real time.
In our hands-on hybrid workshop, you will learn how to easily filter, join, and enrich real-time data within Confluent Cloud using our serverless Flink service.
Serverless Event Streaming Applications as Functions on K8 | DoKC
We will walk through how to build serverless event streaming applications as functions running in a function mesh on kubernetes with cloud native messaging via Apache Pulsar.
In this talk, you will deploy ML functions to transform real-time data on Kubernetes.
This talk was given by Timothy Spann for DoK Day Europe @ KubeCon 2022.
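For a feel of what a Pulsar Function looks like in Python, here is a minimal transformation function in the style of the Pulsar Functions SDK; the class name and the transformation itself are invented, and deployment details (function mesh configuration, CLI flags) are left out.

```python
# Sketch of a Pulsar Function in Python: transform each incoming message and
# return the result, which Pulsar publishes to the function's configured output topic.
# Class name and message format are illustrative assumptions.
from pulsar import Function

class Exclaim(Function):
    def process(self, input, context):
        # Log the incoming payload, then return the transformed value.
        context.get_logger().info("received: %s", input)
        return input + "!"
```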
Ballerina: A Cloud Native Programming Language | WSO2
This slide deck explores Ballerina features and development - most of the talk will be a live demo together with a discussion on the motivation for creating a new programming language and the design inspirations.
SuperConnectivity: One company’s heroic mission to deliver on the promises of... | 4DK Technologies, Inc.
A high level deck illustrating 4DK's SuperConnectivity product suite. Suitable for product managers in the wireless industry, including network and device executives.
CocoaConf: The Language of Mobile Software is APIs | Tim Burks
We’re all excited about using the same language to write our mobile apps and cloud services, but as we do, we’ll still need to work with a few things that aren’t written with Swift. Fortunately, there are some great patterns that we can use for doing that. In this session we’ll talk about two technologies that you can use to make your app speak with APIs written in any language: OpenAPI and Protocol Buffers, and then we’ll see how to use them from clients and servers that are written in Swift.
Presented Friday November 4, 2016 in San Jose.
Serverless Event Streaming Applications as Functions on K8 | Timothy Spann
This document discusses Apache Pulsar, a cloud-native messaging and event streaming platform. It provides an overview of key Pulsar concepts including messaging vs streaming, the Pulsar cluster architecture using brokers and bookies, and Pulsar Functions which allow processing data streams using multiple programming languages. Examples of using Pulsar Functions with Java, Python and deploying on Kubernetes are also presented. Benefits of using Pulsar for building microservices, asynchronous communication, real-time applications and tiered storage are highlighted.
Serverless on Google Cloud covers a lot: compute, Cloud Functions, Cloud Run, App Engine, containers, Kubernetes, Firebase, and much more. We'll also cover storage, containers vs. apps vs. functions, and ML and AI.
Portable batch and streaming pipelines with Apache Beam (Big Data Application... | Malo Denielou
Apache Beam is a top-level Apache project which aims at providing a unified API for efficient and portable data processing pipelines. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, Apache Apex, ...) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, describe the main concepts of the programming model and talk about the current state of the project (new Python support, first stable version). We'll illustrate the concepts with a use case running on several runners.
Flink Forward Berlin 2018: Robert Bradshaw & Maximilian Michels - "Universal ... | Flink Forward
This document introduces Apache Beam, a unified model for batch and stream processing, and discusses its portability across languages and backends. It also introduces TFX, a TensorFlow tool for building end-to-end machine learning pipelines that addresses data collection, preprocessing, analysis, serving, and monitoring using components like TensorFlow Transform and TensorFlow Model Analysis. A demo of TFX's model analysis capabilities on a Chicago taxi dataset is provided.
Creating microservices architectures using node.js and Kubernetes | Paul Goldbaum
This document discusses creating microservices architectures using Node.js and Kubernetes. It covers the author's migration from a monolithic PHP stack to a microservices architecture using Node.js, Express, MongoDB, Kubernetes, and other tools. Key points include choosing programming languages, communicating between microservices, deploying services to Kubernetes, monitoring services, and creating a custom microservices framework to make developing services easier.
What I learned about APIs in my first year at Google | Tim Burks
Tim Burks spent a decade building Electronic Design Automation systems and another building mobile apps. Now he's focused on the thing that holds them all together: APIs. In 2016 he joined Google where he works on open source software that helps developers use gRPC and OpenAPI.
Rpc Case Studies (Distributed computing) | Sri Prasanna
The document provides an overview of various remote procedure call (RPC) systems including Sun RPC, DCE RPC, DCOM, CORBA, and Java RMI. It summarizes the key aspects of each system such as how interfaces are defined, how clients locate and invoke remote objects, how data is marshaled and transported, and improvements made in newer systems over older ones.
Amit Kumar is a technical professional with 3+ years of experience in Spark, Scala, Java, Hadoop and AWS. He has experience developing data ingestion frameworks using these technologies. His current project involves ingesting data from multiple sources into AWS S3 and creating a golden record for each customer. He is responsible for data quality checks, creating jobs to ingest and process the data, and automating the workflow using AWS Lambda and EMR. Previously he has worked on projects involving data migration from Teradata to Hadoop, converting graphs to XML/Java code to replicate workflows, and developing software for aircraft cabin systems.
Similar to Stream processing for the masses with beam, python and flink (20)
Artificia Intellicence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Takashi Kobayashi and Hironori Washizaki, "SWEBOK Guide and Future of SE Education," First International Symposium on the Future of Software Engineering (FUSE), June 3-6, 2024, Okinawa, Japan
A Study of Variable-Role-based Feature Enrichment in Neural Models of CodeAftab Hussain
Understanding variable roles in code has been found to be helpful by students
in learning programming -- could variable roles help deep neural models in
performing coding tasks? We do an exploratory study.
- These are slides of the talk given at InteNSE'23: The 1st International Workshop on Interpretability and Robustness in Neural Software Engineering, co-located with the 45th International Conference on Software Engineering, ICSE 2023, Melbourne Australia
Flutter is a popular open source, cross-platform framework developed by Google. In this webinar we'll explore Flutter and its architecture, delve into the Flutter Embedder and Flutter’s Dart language, discover how to leverage Flutter for embedded device development, learn about Automotive Grade Linux (AGL) and its consortium and understand the rationale behind AGL's choice of Flutter for next-gen IVI systems. Don’t miss this opportunity to discover whether Flutter is right for your project.
Odoo ERP software
Odoo ERP software, a leading open-source software for Enterprise Resource Planning (ERP) and business management, has recently launched its latest version, Odoo 17 Community Edition. This update introduces a range of new features and enhancements designed to streamline business operations and support growth.
The Odoo Community serves as a cost-free edition within the Odoo suite of ERP systems. Tailored to accommodate the standard needs of business operations, it provides a robust platform suitable for organisations of different sizes and business sectors. Within the Odoo Community Edition, users can access a variety of essential features and services essential for managing day-to-day tasks efficiently.
This blog presents a detailed overview of the features available within the Odoo 17 Community edition, and the differences between Odoo 17 community and enterprise editions, aiming to equip you with the necessary information to make an informed decision about its suitability for your business.
SOCRadar's Aviation Industry Q1 Incident Report is out now!
The aviation industry has always been a prime target for cybercriminals due to its critical infrastructure and high stakes. In the first quarter of 2024, the sector faced an alarming surge in cybersecurity threats, revealing its vulnerabilities and the relentless sophistication of cyber attackers.
SOCRadar’s Aviation Industry, Quarterly Incident Report, provides an in-depth analysis of these threats, detected and examined through our extensive monitoring of hacker forums, Telegram channels, and dark web platforms.
Hand Rolled Applicative User ValidationCode KataPhilip Schwarz
Could you use a simple piece of Scala validation code (granted, a very simplistic one too!) that you can rewrite, now and again, to refresh your basic understanding of Applicative operators <*>, <*, *>?
The goal is not to write perfect code showcasing validation, but rather, to provide a small, rough-and ready exercise to reinforce your muscle-memory.
Despite its grandiose-sounding title, this deck consists of just three slides showing the Scala 3 code to be rewritten whenever the details of the operators begin to fade away.
The code is my rough and ready translation of a Haskell user-validation program found in a book called Finding Success (and Failure) in Haskell - Fall in love with applicative functors.
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsPeter Muessig
The UI5 tooling is the development and build tooling of UI5. It is built in a modular and extensible way so that it can be easily extended by your needs. This session will showcase various tooling extensions which can boost your development experience by far so that you can really work offline, transpile your code in your project to use even newer versions of EcmaScript (than 2022 which is supported right now by the UI5 tooling), consume any npm package of your choice in your project, using different kind of proxies, and even stitching UI5 projects during development together to mimic your target environment.
DDS Security Version 1.2 was adopted in 2024. This revision strengthens support for long runnings systems adding new cryptographic algorithms, certificate revocation, and hardness against DoS attacks.
WhatsApp offers simple, reliable, and private messaging and calling services for free worldwide. With end-to-end encryption, your personal messages and calls are secure, ensuring only you and the recipient can access them. Enjoy voice and video calls to stay connected with loved ones or colleagues. Express yourself using stickers, GIFs, or by sharing moments on Status. WhatsApp Business enables global customer outreach, facilitating sales growth and relationship building through showcasing products and services. Stay connected effortlessly with group chats for planning outings with friends or staying updated on family conversations.
5. STREAMS POWER YELP
Powered by streaming:
Notifications
Real-time visit detection
User Search
Indexing pipeline
User personalization
Purchase flows
ML feature ETL
Ads
Product development
Transactions
Real-time campaign shut-off
Experimentation infrastructure and guardrail metrics
8. DATA PIPELINE
Strong data schematization and documentation
Standardized wire protocol (Avro)
Contract between data producers and consumers
Centralized schema registry
Decouple data ETL
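To make the producer/consumer contract concrete, here is a minimal sketch of the kind of Avro record schema that would live in the centralized schema registry; the record and field names are illustrative, not Yelp's actual schemas.

review_schema = {
    "type": "record",
    "name": "ReviewMessage",            # illustrative name, not a real Yelp schema
    "namespace": "datapipeline.example",
    "fields": [
        {"name": "business_id", "type": "string"},
        {"name": "rating", "type": "double"},
        {"name": "timestamp_micros", "type": "long"},
    ],
}

Once a schema like this is registered, producers and consumers resolve it from the registry, so the wire format stays compact and the contract between teams is enforced in one place.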
9. PROCESSOR: Paastorm
Paastorm was Yelp's answer to the lack of good open source Python stream processors
Paastorm provides a thin wrapper around the Kafka producer/consumer
Good fit to perform map/flatmap transformations
10. PROCESSOR: Paastorm Adoption
The Paastorm API is a class called Spolt
Users extend the Spolt and implement process_message(self, message)
Over 150 production Paastorm applications
11. PROCESSOR: Paastorm Code

class GreatReviewsSpolt(Spolt):
    def process_message(self, message):
        payload = message.payload_data
        if payload['rating'] >= 4.0:
            yield message

if __name__ == '__main__':
    Paastorm(GreatReviewsSpolt()).run()
12. Need shiny new tools to leverage the value of real-time data
13. 2017: Apache Flink
Unlocked real-time data processing at scale
Stateful processing
Powerful streaming-oriented (DataStream) API
Event-time processing
16. LIMITATIONS
Tightly coupled to Kafka
No high-level primitives: groupBy, windowing, filter, etc.
No stateful processing support
High cost to implement and maintain new features
17. CHALLENGES
Hundreds of Python libraries implementing business logic
Flink SQL is good mostly for simple SQL-like transformations
High barrier to entry for JVM languages
19. BEAM
Programming model
Driver program: responsible for defining the pipeline in the Beam SDK
Pipeline: represents the logical data processing tasks
Execution: runs on any supported distributed processing framework, and more ...
21. BEAM Python SDK
High-level API: ParDo, Map, FlatMap, Filter, GroupByKey, CoGroupByKey
Support for Windows and Triggers: fixed, sliding and session windows and a variety of triggers
Side inputs and tagged outputs: ParDo with two or more inputs and two or more outputs
State and Timers: can be combined to build complex stateful applications
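As a concrete illustration of these primitives (not code from the deck), the following minimal Beam Python pipeline combines Create, Map, Filter, WindowInto and GroupByKey; the data, timestamps and step names are made up.

import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | "Reviews" >> beam.Create([("biz-1", 5.0, 10), ("biz-1", 3.0, 30), ("biz-2", 4.5, 200)])
     | "Stamp" >> beam.Map(lambda r: window.TimestampedValue((r[0], r[1]), r[2]))  # assign event timestamps
     | "OnlyGreat" >> beam.Filter(lambda kv: kv[1] >= 4.0)                          # keep ratings >= 4.0
     | "Window" >> beam.WindowInto(window.FixedWindows(60))                         # 60-second fixed windows
     | "Group" >> beam.GroupByKey()
     | "Count" >> beam.Map(lambda kv: (kv[0], len(list(kv[1]))))                    # great reviews per business per window
     | "Print" >> beam.Map(print))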
22. EXECUTION: Beam Portability API
Portability API: defines the protocols used by the runner (e.g. Flink) to translate and run the pipeline
Python SDK: makes use of a containerized SDK harness to run language-specific UDFs
Fn API: relies on gRPC for runner - SDK worker communication
[Diagram: pipelines are constructed against the Beam Model in a language SDK (Beam Java, Beam Python, other languages), translated by the Beam Model Fn Runners, and executed on Apache Flink, Apache Spark, or Cloud Dataflow]
24. INTEGRATION: Data Pipeline Source and Sink
Yelp-specific implementation to discover Kafka clusters
Team expertise around the Flink Consumer/Producer
Customize the portable translation to "attach" existing Flink components to a Beam pipeline
[Diagram: Flink DPSource -> Coder deserializer -> Beam pipeline -> Coder serializer -> Flink DPSink, with WindowedValue<byte[]> crossing each boundary]
25. INTEGRATION: Invoking Flink code
PBegin for the source and PDone for the sink
Can use JSON to pass parameters from Python Beam to Java Flink
A Beam urn identifies a PTransform during the translation

class FlinkYelpDatapipelineSource(PTransform):
    def expand(self, pbegin):
        return pvalue.PCollection(pbegin.pipeline)

    def infer_output_type(self, unused_input_type):
        return Message

    def to_runner_api_parameter(self, context):
        api_parameters = ('yelp:flinkYelpDatapipelineSource', json.dumps({...}))
        return api_parameters

    @staticmethod
    @PTransform.register_urn('yelp:flinkYelpDatapipelineSource', None)
    def from_runner_api_parameter(spec_parameter, _unused_context):
        instance = FlinkYelpDatapipelineSource()
        params = json.loads(spec_parameter)
        ....
        return instance
26. INTEGRATION: Translate to a Flink operator
Fork FlinkStreamingPortablePipelineTranslator.java
Add your urn and translation function to the translatorMap
The result of the translation is a "chunk" of Flink pipeline
The output/input Flink DataStream is of type WindowedValue<byte[]>
27. INTEGRATION: Message Envelope
[Diagram: the Envelope unpacks bytes from Kafka into a Message carrying Timestamp Micros, UUID, Metadata / Headers, Message Type, a Payload with Field 1 ... Field n, and a Kafka Position (Cluster, Topic, Partition, Offset)]
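For readers who prefer code to diagrams, here is a rough sketch of what the Message described above could look like as a Python dataclass; the class and field names are an approximation drawn from the diagram, not Yelp's actual data pipeline clientlib.

from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class KafkaPosition:
    cluster: str
    topic: str
    partition: int
    offset: int

@dataclass
class Message:
    message_type: str
    timestamp_micros: int
    uuid: str
    metadata: Dict[str, Any] = field(default_factory=dict)   # metadata / headers
    payload: Dict[str, Any] = field(default_factory=dict)    # field 1 ... field n
    kafka_position: Optional[KafkaPosition] = None           # populated once read from Kafka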
28. SERIALIZATION: Beam Coder
Data needs to be properly serialized between Flink and the Beam SDK worker
Extend the Beam Coder class to implement a custom coder for the Message class
Register the coder when the source/sink is being used

class DataPipelineCoder(Coder):
    def encode(self, value: Message):
        envelope = Envelope()
        return envelope.pack(value)

    def decode(self, value: bytes) -> Message:
        return create_from_kafka_message_value(value)

registry.register_coder(Message, DataPipelineCoder)
29. SERIALIZATION: Beam typehints
Critical to make sure that the proper Coder is being used
Every PTransform that returns a Message must use the typehint annotation @typehints.with_output_types(Message)
34. DEPLOYMENT: Running Beam
Run on Kubernetes
One Flink cluster per Beam service
One base Docker image extended with service-specific deps
Run the SDK Worker in the same Task Manager container
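As an illustration of how such a service might be launched (the endpoint and image name are hypothetical, and the exact options depend on the Beam version in use), a portable Beam Python pipeline targeting a Flink job server typically uses options along these lines:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=flink-jobserver:8099",            # Flink job server (hypothetical address)
    "--environment_type=DOCKER",                      # containerized SDK harness for Python UDFs
    "--environment_config=example/beam-python-sdk",   # hypothetical base image with service deps
    "--streaming",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)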
38. DEPLOYMENT: Can do better?
BEAM-7966 Write portable Beam application jar
BEAM-7980 External environment with containerized worker pool (Beam 2.16)
Pluggable custom portable translations
40. ADOPTION: Paastorm on Beam

@typehints.with_output_types(Message)
class Spolt(beam.DoFn):
    def process_message(self, message):
        raise NotImplementedError()

    def process(self, element):
        return self.process_message(element)

class Paastorm:
    def __init__(self, paastorm_fn):
        self.paastorm_fn = paastorm_fn

    def run(self):
        p = beam.Pipeline(options=options)
        messages = (p
                    | DataPipelineSource()
                    # key each message by its Kafka partition (attribute access illustrative)
                    | beam.Map(lambda message: (message.kafka_position.partition, message))
                    | beam.ParDo(self.paastorm_fn)
                    | DataPipelineSink())
        p.run()
41. TAKEAWAYS: The future is now
Run all of our stream processing on one engine: Flink
Legacy Paastorm easily migrated
Feature parity across languages
New applications use native Beam
Yelp's Indexing pipeline as the first use case