Building a Streaming Microservice Architecture: with Apache Spark Structured Streaming and Friends


As we continue to push the boundaries of pipeline throughput and data serving tiers, new methodologies and techniques keep emerging to handle ever larger workloads.

Published in: Data & Analytics
  1. Building a Streaming Microservice Architecture: With Spark Structured Streaming and Friends. Scott Haines, Senior Principal Software Engineer
  2. Introductions. About Me: Scott Haines, Senior Principal Software Engineer (@newfront)
     ▪ I work at Twilio
     ▪ Over 10 years working on Streaming Architectures
     ▪ Helped Bring Streaming-First Spark Architecture to Voice & Voice Insights
     ▪ Leads Spark Office Hours @ Twilio
     ▪ Loves Distributed Systems
  3. Agenda
     ▪ The Big Picture: What the Architecture looks like
     ▪ Protocol Buffers: What they are. Why they rule!
     ▪ gRPC / Protocol Streams: Versioned Data Lineage as a Service
     ▪ How this fits into Spark Structured Streaming with Protobuf support
  4. The Big Picture
  5. Streaming Microservice Architecture [Diagram, components numbered 1 to 9: gRPC Client, gRPC Servers (HTTP/2), Kafka Brokers, Spark Application, HDFS, S3]
  6. Streaming Microservice Architecture [Diagram: Kafka Topics, Spark Applications, Data Tables, gRPC Server]
  7. Protocol Buffers (aka protobuf)
  8. Protocol Buffers: Why use them?
     ▪ Strict Types
       ▪ Enforce structure at compile time
       ▪ Similar to StructType in Apache Spark
       ▪ Interoperable with Spark via ExpressionEncoding extension
     ▪ Versioning API / Data Pipeline
       ▪ Compiled protobuf (*.proto) can be released like normal code
     ▪ Interoperable
       ▪ Pick your favorite programming language, then compile and release
       ▪ Supports Java, Scala, C++, Go, Obj-C, Node.js, Python and more
  9. Protocol Buffers: Why use them?
     ▪ Code Gen
       ▪ Automatically generate Builder classes
       ▪ Being lazy is okay!
     ▪ Optimized
       ▪ Messages are optimized and ship with their own Serialization/Deserialization mechanics (SerDe)
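
To make the "optimized SerDe" point from the Protocol Buffers slides concrete, here is a minimal sketch of how a single integer field is laid out on the protobuf wire, written in plain Python without the protobuf library. The helper names (`encode_varint`, `decode_varint`, `encode_int_field`, `decode_int_field`) are illustrative only and are not part of any protobuf API; in practice the generated message classes do this work for you.

```python
from typing import Tuple

WIRE_TYPE_VARINT = 0  # protobuf wire type used for int32/int64/bool/enum values


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: key = (field_number << 3) | wire_type, then the value."""
    key = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(key) + encode_varint(value)


def decode_int_field(data: bytes) -> Tuple[int, int]:
    """Decode one varint field; return (field_number, value)."""
    key, pos = decode_varint(data)
    if key & 0x07 != WIRE_TYPE_VARINT:
        raise ValueError("not a varint field")
    value, _ = decode_varint(data, pos)
    return key >> 3, value
```

With this layout, field number 1 holding the value 150 takes three bytes (`08 96 01`), and field names never appear on the wire, which is a large part of why the format is compact.
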
  10. gRPC and Protocol Streams
  11. gRPC: What is it?
      ▪ High Performance
        ▪ Compact Binary Exchange Format
        ▪ Make API calls to the Server as if they were Client local
      ▪ Cross Language/Cross Platform
        ▪ Autogenerate API definitions for idiomatic client and server; just implement the interfaces
      ▪ Bi-Directional Streaming
        ▪ Pluggable support for streaming with HTTP/2 transport
      [Diagram: gRPC Client, gRPC Servers, HTTP/2]
  12. gRPC Example: AdTracking
  13. gRPC: How it works?
      ▪ Define Messages
        ▪ What kind of Data are you sending?
        ▪ Example: Click Tracking / Impression Tracking
        ▪ What is necessary for the public interface?
        ▪ Example: AdImpression and Response
  14. gRPC: How it works?
      ▪ Service Definition
        ▪ Compile your rpc definition to generate Service Interfaces
        ▪ Uses the same protobuf definition (service.proto) as your Client/Server request and response objects
        ▪ Can be used to create a binding Service Contract within your organization or publicly
  15. gRPC: How it works?
      ▪ Implement the Service
        ▪ Compilation of the Service auto-generates your interfaces
        ▪ Just implement the service contracts
  16. gRPC: How it works?
      ▪ Protocol Streams
        ▪ Messages (protobuf) are emitted to Kafka topic(s) from the Server Layer
        ▪ Protocol Streams are now available from the Kafka Topics bound to a given Service / Collection of Messages
        ▪ Sets up Spark for the Hand-Off
  17. gRPC System Architecture [Diagram: gRPC Client, gRPC Servers (HTTP/2), Kafka Brokers, Kafka Topic. Client: service.adTrack(trackedAd). Server: ClickTrackService.adTrack(trackedAd)]
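
The flow described in the gRPC slides (define messages, implement the generated service interface, emit each message to a Kafka topic) can be sketched in plain Python. This is a stand-in rather than real gRPC or Kafka code: `AdImpression`, `AdResponse`, their field names, the JSON payload, the topic name `ads.impressions`, and the `InMemoryTopic` class are all assumptions made for illustration. Only the `ClickTrackService.adTrack` name comes from the slides. In a real service the base class would be generated from the service definition and the payload would be protobuf-encoded bytes.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AdImpression:
    """Hypothetical request message; a real one would be compiled from a .proto file."""
    ad_id: str
    user_id: str
    timestamp_ms: int


@dataclass(frozen=True)
class AdResponse:
    """Hypothetical response message."""
    accepted: bool


class InMemoryTopic:
    """Stand-in for a Kafka topic: an append-only list of byte payloads."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: List[bytes] = []

    def send(self, payload: bytes) -> None:
        self.records.append(payload)


class ClickTrackServiceBase(ABC):
    """Plays the role of the interface that compiling the service would generate."""

    @abstractmethod
    def adTrack(self, tracked_ad: AdImpression) -> AdResponse:
        ...


class ClickTrackService(ClickTrackServiceBase):
    """The only hand-written part: implement the contract and emit to the topic."""

    def __init__(self, topic: InMemoryTopic) -> None:
        self._topic = topic

    def adTrack(self, tracked_ad: AdImpression) -> AdResponse:
        # JSON keeps this sketch dependency-free; it is an assumption, not the
        # protobuf binary encoding the slides describe.
        payload = json.dumps(asdict(tracked_ad), sort_keys=True).encode("utf-8")
        self._topic.send(payload)
        return AdResponse(accepted=True)
```

A caller would construct `ClickTrackService(InMemoryTopic("ads.impressions"))` and invoke `adTrack(...)`; every accepted message then sits on the topic for a downstream consumer such as a Spark application to pick up.
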
  18. Structuring Protocol Streams: with Structured Streaming and protobuf
  19. Structured Streaming with Protobuf: From Protocol Buffer to StructType through ExpressionEncoders
      ▪ Expression Encoding
        ▪ Natively interop with Protobuf in Apache Spark
        ▪ Protobuf to Case Class conversion from scalapb
        ▪ Product encoding comes for free via import sparkSession.implicits._
  20. Structured Streaming with Protobuf: From Protocol Buffer to StructType through ExpressionEncoders
      ▪ Native is Better
        ▪ Strict Native Kafka to DataFrame conversion with no need for transformation to intermediary types
        ▪ Mutations and Joins can be done across the DataFrame or Dataset APIs
        ▪ Create Real-Time Data Pipelines, Machine Learning Pipelines and more
        ▪ Rest at night knowing the pipelines are safe!
  21. Structured Streaming with Protobuf: From Protocol Buffer to StructType through ExpressionEncoders
      ▪ Strict Data Writer
        ▪ Compiled / Versioned Protobuf can be used to strictly enforce the format of your Writers
        ▪ Use Protobuf to define the StructType that can be used in your conversions to Parquet (must abide by Parquet nesting rules)
        ▪ Declarative Input / Output means that Streaming Applications don’t go down due to incompatible Data Streams
        ▪ Can also be used with Delta so that the version of the schema lines up with the compiled Protobuf
  22. Structured Streaming with Protobuf: Example: Streaming Transformation Pipeline
      ▪ Real World Use Case
        ▪ Close of Books Data Lineage Job
        ▪ Uses End to End Protobuf
        ▪ Enables teams to move quickly with guarantees regarding the Data being published and at what Frequency
        ▪ Can be emitted at different speeds to different locations based on configuration
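
The "strict data writer" idea from the Structured Streaming slides can be illustrated without Spark: declare the expected fields and types once, then check every record against that declaration before it reaches the sink. In this sketch the schema is a hand-written tuple standing in for what would be derived from a compiled protobuf, and `Field`, `AD_IMPRESSION_SCHEMA`, `SchemaMismatch`, `validate`, and `StrictWriter` are hypothetical names. Setting mismatched records aside instead of raising is one possible reading of "applications don't go down due to incompatible data streams", not something the slides specify.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Field:
    name: str
    py_type: type


# Hypothetical schema; a real pipeline would derive it from the compiled protobuf.
AD_IMPRESSION_SCHEMA: Tuple[Field, ...] = (
    Field("ad_id", str),
    Field("user_id", str),
    Field("timestamp_ms", int),
)


class SchemaMismatch(ValueError):
    """Raised when a record does not match the declared schema."""


def validate(record: Dict[str, Any], schema: Tuple[Field, ...]) -> None:
    """Raise SchemaMismatch unless record has exactly the declared fields and types."""
    expected = {f.name for f in schema}
    actual = set(record)
    if actual != expected:
        raise SchemaMismatch(
            f"missing={sorted(expected - actual)} unexpected={sorted(actual - expected)}"
        )
    for f in schema:
        value = record[f.name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        is_stray_bool = isinstance(value, bool) and f.py_type is not bool
        if is_stray_bool or not isinstance(value, f.py_type):
            raise SchemaMismatch(
                f"field {f.name!r} expected {f.py_type.__name__}, got {type(value).__name__}"
            )


class StrictWriter:
    """Accepts only schema-conforming records; sets the rest aside with a reason."""

    def __init__(self, schema: Tuple[Field, ...]) -> None:
        self._schema = schema
        self.written: List[Dict[str, Any]] = []
        self.rejected: List[Tuple[Dict[str, Any], str]] = []

    def write(self, record: Dict[str, Any]) -> bool:
        try:
            validate(record, self._schema)
        except SchemaMismatch as err:
            self.rejected.append((record, str(err)))
            return False
        self.written.append(dict(record))
        return True
```

Because the writer never raises on bad input, a single malformed record ends up in `rejected` with its reason while the rest of the stream keeps flowing.
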
  24. Recap
  25. What We Learned
      ▪ Protobuf
        ▪ Language Agnostic Structured Data
        ▪ Compile Time Guarantees
        ▪ Lightning Fast Serialization/Deserialization
      ▪ gRPC
        ▪ Language Agnostic Binary Services
        ▪ Low-Latency
        ▪ Compile Time Guarantees
        ▪ Smart Framework
      ▪ Kafka
        ▪ Highly Available
        ▪ Native Connector for Spark
        ▪ Topic Based Binary Protobuf Store
        ▪ Use to Pass Records to one or more Downstream Services
      ▪ Structured Streaming
        ▪ Handle Data Reliably
        ▪ Protobuf to Dataset / DataFrames is awesome
        ▪ Parquet / Delta plays nice as Columnar Data Exchange format
  26. Thanks @newfrontcreative @newfront
