Building a Streaming Microservice Architecture: With Apache Spark Structured Streaming and Friends
Jun. 29, 2020
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads
2. Building a Streaming Microservice Architecture: With Spark Structured Streaming and Friends
Scott Haines
Senior Principal Software Engineer
3. Introductions
About Me
Scott Haines: Senior Principal Software Engineer @newfront
▪ I work at Twilio
▪ Over 10 years working on Streaming Architectures
▪ Helped bring a Streaming-First Spark Architecture to Voice & Voice Insights
▪ Leads Spark Office Hours @ Twilio
▪ Loves Distributed Systems
4. Agenda
The Big Picture
What the Architecture looks like
Protocol Buffers
What they are. Why they rule!
gRPC / Protocol Streams
Versioned Data Lineage as a Service
How this fits into Spark
Structured Streaming with Protobuf support
9. Protocol Buffers
Why use them?
▪ Strict Types
  ▪ Enforce structure at compile time
  ▪ Similar to StructType in Apache Spark
  ▪ Interoperable with Spark via an ExpressionEncoder extension
▪ Versioning the API / Data Pipeline
  ▪ Compiled protobuf (*.proto) can be released like normal code
▪ Interoperable
  ▪ Pick your favorite programming language, then compile and release
  ▪ Supports Java, Scala, C++, Go, Objective-C, Node.js, Python, and more
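As a concrete illustration of the strict typing (a minimal sketch; the AdImpression message and its fields are hypothetical, and real ScalaPB output also extends scalapb.GeneratedMessage):

// Simplified shape of a ScalaPB-compiled message: a strictly typed case class.
case class AdImpression(
  adId: String = "",
  userId: String = "",
  timestampMillis: Long = 0L
)

// Field types are enforced at compile time, much like a StructType schema.
val impression = AdImpression(adId = "ad-123", userId = "u-9", timestampMillis = 1593388800000L)
// impression.copy(timestampMillis = "oops")  // does not compile: type mismatch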
10. Protocol Buffers
Why use them?
▪ Code Gen
  ▪ Automatically generates builder classes
  ▪ Being lazy is okay!
▪ Optimized
  ▪ Messages are optimized and ship with their own serialization/deserialization mechanics (SerDe)
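A hedged sketch of what the generated API looks like, assuming AdImpression is the real ScalaPB-generated message (the simplified case class sketched above omits these generated members):

// Builder-style updates: ScalaPB generates immutable withX helpers.
val updated = impression.withUserId("u-10")

// Each generated message ships with its own SerDe mechanics.
val bytes: Array[Byte] = updated.toByteArray
val decoded: AdImpression = AdImpression.parseFrom(bytes)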
12. gRPC
What is it?
▪ High Performance
  ▪ Compact binary exchange format
  ▪ Make API calls to the server as if they were client-local
▪ Cross-Language / Cross-Platform
  ▪ Autogenerated API definitions for idiomatic clients and servers – just implement the interfaces
▪ Bi-Directional Streaming
  ▪ Pluggable support for streaming over the HTTP/2 transport
[Diagram: a gRPC client calling a pool of gRPC servers over HTTP/2]
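To make the client-local point concrete, a minimal sketch using grpc-java's channel builder with a ScalaPB-generated stub (ClickTrackServiceGrpc and TrackedAd anticipate the service defined on the next slides; the host, port, and plaintext transport are illustrative only):

import io.grpc.ManagedChannelBuilder

// Open an HTTP/2 channel to the server pool.
val channel = ManagedChannelBuilder
  .forAddress("localhost", 9090)
  .usePlaintext()
  .build()

// The generated blocking stub makes the remote call read like a local method call.
val stub = ClickTrackServiceGrpc.blockingStub(channel)
val response = stub.adTrack(TrackedAd(adId = "ad-123"))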
14. gRPC
How it works?
▪ Define Messages
  ▪ What kind of data are you sending? Example: Click Tracking / Impression Tracking
  ▪ What is necessary for the public interface? Example: AdImpression and Response
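In Scala terms, the message definitions boil down to strictly typed request/response classes; a simplified sketch of the hypothetical click-tracking messages (real ScalaPB output also extends scalapb.GeneratedMessage):

// Request: one tracked ad event sent by the client.
case class TrackedAd(adId: String = "", userId: String = "", clicked: Boolean = false)

// Response: the public acknowledgement returned by the service.
case class AdTrackResponse(accepted: Boolean = false, message: String = "")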
15. gRPC
How it works?
▪ Service Definition
  ▪ Compile your rpc definition to generate service interfaces
  ▪ Uses the same protobuf definition (service.proto) as your client/server request and response objects
  ▪ Can be used to create a binding service contract within your organization or publicly
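A hedged sketch of the interface such a compilation step would produce for the ClickTrackService used later in the deck (simplified; ScalaPB's gRPC codegen nests this inside a generated companion object):

import scala.concurrent.Future

// One method per rpc, typed against the request/response messages
// declared in the same service.proto.
trait ClickTrackService {
  def adTrack(request: TrackedAd): Future[AdTrackResponse]
}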
16. gRPC
How it works?
▪ Implement the Service
  ▪ Compiling the service definition auto-generates your interfaces
  ▪ Just implement the service contracts
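Implementing the contract is then just filling in the generated trait; a minimal sketch (the Kafka hand-off from the next slide would live inside adTrack):

import scala.concurrent.Future

// Concrete implementation of the generated service contract.
class ClickTrackServiceImpl extends ClickTrackService {
  override def adTrack(request: TrackedAd): Future[AdTrackResponse] =
    // Validate, enrich, and emit to Kafka here, then acknowledge.
    Future.successful(AdTrackResponse(accepted = true, message = s"tracked ${request.adId}"))
}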
17. gRPC
How it works?
▪ Protocol Streams
  ▪ Messages (protobuf) are emitted to Kafka topic(s) from the server layer
  ▪ Protocol Streams are then available from the Kafka topics bound to a given service / collection of messages
  ▪ This sets up Spark for the hand-off
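A minimal sketch of that emission step, assuming the plain kafka-clients producer, the generated TrackedAd message, and the ads.click.stream topic from the architecture slide (the broker address is illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer")

val producer = new KafkaProducer[String, Array[Byte]](props)

// Emit the compiled protobuf bytes; every downstream consumer of the
// topic shares the exact same versioned schema.
def emit(ad: TrackedAd): Unit =
  producer.send(new ProducerRecord("ads.click.stream", ad.adId, ad.toByteArray))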
18. gRPC
System Architecture
[Diagram: the gRPC client calls service.adTrack(trackedAd) over HTTP/2; a gRPC server handles it via ClickTrackService.adTrack(trackedAd) and emits the message to the Kafka brokers on topic ads.click.stream]
20. Structured Streaming with Protobuf
From Protocol Buffers to StructType through ExpressionEncoders
▪ Expression Encoding
  ▪ Natively interop with protobuf in Apache Spark
  ▪ Protobuf-to-case-class conversion comes from scalapb
  ▪ Product encoding comes for free via import sparkSession.implicits._
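Because ScalaPB messages are plain Scala case classes (Products), Spark can derive an ExpressionEncoder for them directly; a minimal sketch using the hypothetical AdImpression message from earlier:

import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().appName("proto-encoding").getOrCreate()

// Product encoding: the same derivation that import spark.implicits._
// applies automatically to any case class.
implicit val adImpressionEncoder: Encoder[AdImpression] = Encoders.product[AdImpression]

// The derived schema is the StructType twin of the protobuf definition.
adImpressionEncoder.schema.printTreeString()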
21. Structured Streaming with Protobuf
From Protocol Buffers to StructType through ExpressionEncoders
▪ Native is Better
  ▪ Strict, native Kafka-to-DataFrame conversion with no need for transformation to intermediary types
  ▪ Mutations and joins can be done across the DataFrame and Dataset APIs
  ▪ Create real-time data pipelines, machine learning pipelines, and more
  ▪ Rest at night knowing the pipelines are safe!
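Putting it together, a hedged sketch of the native hand-off (continuing the session and encoder above; the broker address is illustrative, and parseFrom comes from the ScalaPB-generated companion): raw protobuf bytes flow from Kafka straight into a typed Dataset with no intermediary format such as JSON:

import spark.implicits._

// Read the protocol stream and decode each value with the generated SerDe.
val impressions = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "ads.click.stream")
  .load()
  .select($"value".as[Array[Byte]])
  .map(bytes => AdImpression.parseFrom(bytes))

// impressions is a Dataset[AdImpression]: joins, mutations, and ML
// pipelines now operate on strictly typed rows.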
22. Structured Streaming with Protobuf
From Protocol Buffers to StructType through ExpressionEncoders
▪ Strict Data Writer
  ▪ Compiled / versioned protobuf can be used to strictly enforce the format of your writers as well
  ▪ Use protobuf to define the StructType used in your conversions to Parquet (output must abide by Parquet nesting rules)
  ▪ Declarative input / output means that streaming applications don't go down due to incompatible data streams
  ▪ Can also be used with Delta so that the schema version lines up with the compiled protobuf
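On the write side the encoder schema pins the sink format; a minimal sketch writing the typed stream as Parquet (paths and trigger interval are illustrative; swap the format for "delta" to get the schema-versioned Delta variant):

import org.apache.spark.sql.streaming.Trigger

// The sink schema is fixed by the compiled protobuf via its encoder,
// so incompatible upstream data fails fast instead of corrupting output.
val query = impressions.writeStream
  .format("parquet")
  .option("path", "/data/ads/impressions")
  .option("checkpointLocation", "/checkpoints/ads-impressions")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()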
23. Structured Streaming with Protobuf
Example: Streaming Transformation Pipeline
▪ Real-World Use Case
  ▪ Close-of-books data lineage job
  ▪ Uses protobuf end to end
  ▪ Enables teams to move quickly, with guarantees about what data is published and at what frequency
  ▪ Output can be emitted at different speeds to different locations based on configuration
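The different-speeds-to-different-locations point falls out of running several queries over the same typed stream; a hypothetical configuration-driven sketch (paths and intervals are illustrative):

import org.apache.spark.sql.streaming.Trigger

// Fast lane: minute-level micro-batches for operational consumers.
impressions.writeStream
  .format("parquet")
  .option("path", "/data/ads/fast")
  .option("checkpointLocation", "/checkpoints/ads-fast")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()

// Slow lane: hourly close-of-books output for analytical consumers.
impressions.writeStream
  .format("parquet")
  .option("path", "/data/ads/close-of-books")
  .option("checkpointLocation", "/checkpoints/ads-cob")
  .trigger(Trigger.ProcessingTime("1 hour"))
  .start()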
24. Streaming Microservice Architecture
[Diagram: the numbered end-to-end flow – the gRPC client calls the gRPC servers over HTTP/2, the servers emit protobuf messages to the Kafka brokers, a Spark application consumes the topics, and results are written to HDFS and S3]
26. What We Learned
Protobuf
▪ Language Agnostic Structured Data
▪ Compile Time Guarantees
▪ Lightning Fast Serialization/Deserialization
gRPC
▪ Language Agnostic Binary Services
▪ Low-Latency
▪ Compile Time Guarantees
▪ Smart Framework
Kafka
▪ Highly Available
▪ Native Connector for Spark
▪ Topic-Based Binary Protobuf Store
▪ Used to Pass Records to One or More Downstream Services
Structured Streaming
▪ Handles Data Reliably
▪ Protobuf to Dataset / DataFrames is awesome
▪ Parquet / Delta plays nice as a Columnar Data Exchange format