Building a Streaming Microservices Architecture - Data + AI Summit EU 2020

Databricks
DatabricksDeveloper Marketing and Relations at MuleSoft
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Building a Streaming
Microservices
Architecture
With Apache Spark Structured Streaming & Friends
Scott Haines
Senior Principal Software Engineer, Twilio
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
A little about me.
• I work at Twilio building massive data systems
• I run a bi-weekly internal Spark Office Hours where I offer
training and guidance to teams at the company
• >12 years working on large distributed analytics systems
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
A little about me.
• I work at Twilio building massive data systems
• I run a bi-weekly internal Spark Office Hours where I offer
training and guidance to teams at the company
• >12 years working on large distributed analytics systems
• Published work on Distributed Analytics Systems
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Voice Insights
Accountable observability data and
interactive analytics and insights for
the voice business and customers.
VIRGINIA, USA
DUBLIN, IRELAND
SINGAPORE
SYDNEY, AUSTRALIA
TOKYO, JAPAN
SAO PAULO, BRAZIL
DATA CENT ERS
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Events per Second
>1MIL
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Let’s Build a Reliable Data Architecture
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Goal: Reliable E2E Streaming Data Pipeline
GRPC Client
GRPC Server GRPC Server GRPC Server
1
2
3
Kafka Broker
4
Kafka Broker
5
6
Spark Application
7 8
HDFS
S39
HTTP /2
@newfront
Strong and Reliable Data starts at Ingest
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Of people love
JSON*.
>95%
{
"type": "CallEvent",
“call_sid”: "CA123",
"attributes": [
{
“account_sid”: “AC123”
“start_ms": 123,
“end_ms": "435"
}
]
}
• JSON has Structure.
• But JSON isn't strictly Structured Data.
Structured Data
@newfront @twilio
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Of people saddened by
bad data*
100%
{
"type": "CallEvent",
“call_sid": “CA234”,
“attributes":"oops"
}
• JSON has poor runtime guarantees due to
its flexible nature. Optimize for compile time
guarantees.
• Debugging corrupt data in a large
distributed system ruins hopes and dreams.
Structured Data
@newfront @twilio
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Protocol Buffers
message CallEvent {
uint64 created_ms = 1;
string call_sid = 2;
uint64 account_sid = 3;
EventType event_type = 4;
Region region = 5;
}
• Well Defined Events tell their own Story
• Type-Safety Rules
• Versioning your API / Pipeline / Data
now just means sticking to a version of
your schema
• Rely on Releases for versioning
• Interoperable with most major languages
(java/scala/c++/go/obj-c/node-js/
python/...)
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Protocol Buffers
message CallEvent {
uint64 created_ms = 1;
string call_sid = 2;
uint64 account_sid = 3;
EventType event_type = 4;
Region region = 5;
}
• Data Accountability
• Lightning Fast Serialization /
Deserialization
• Plays with nicely gRPC
• Interoperable with Spark SQL
• Like “JSON with Guard Rails”
Of people like when
things work between
releases!
100%
Take Aways
Data Engineers love gRPC
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
*technically not universally accepted
@newfront
GRPC Client
GRPC Server GRPC Server GRPC Server
HTTP /2
gRPC | saves time
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC Client
GRPC Server GRPC Server GRPC Server
HTTP /2
gRPC | saves time
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
val time = “$$$”
GRPC Client
GRPC Server GRPC Server GRPC Server
HTTP /2
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
GRPC // nutshell
• RPC = remote procedure call. “G” stands
for generic or Google
• Great for Internal Services
• High Performance
• Compact Binary Exchange Format
• Compile Idiomatic API Definitions
• Capable of Bi-Directional Streaming
• Pluggable HTTP/2 transport
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
• Building a CallEvent Service.
• 1: Define your messages (call.proto)
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
• Building a CallEvent Service.
• 1: Define your messages (call.proto)
• 2: Define your services (service.proto)
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
• Building a CallEvent Service.
• 1: Define your messages (call.proto)
• 2: Define your services (service.proto)
• 3: Compile your messages and service
stubs.
sbt clean compile package publishLocal
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
• Building a CallEvent Service.
• 1: Define your messages (call.proto)
• 2: Define your services (service.proto)
• 3: Compile your messages and service
stubs.
• 4: Implement Traits and Run!
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
GRPC.
• Building a CallEvent Service.
• Client SDKs essentially write themselves.
• JSON <-> Protobuf is still possible for
maintaining customer facing APIs
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Protocol Gatewaymessage T<:Event { … }
protobuf @ version
Kafka Broker
gRPC
Common Pattern Emerges
Protocol Streams
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Still Building…
GRPC Client
GRPC Server GRPC Server GRPC Server
1
2
3
Kafka Broker
4
Kafka Broker
5
6
Spark Application
7 8
HDFS
S39
HTTP /2
@newfront
GRPC Client
GRPC Server GRPC Server GRPC Server
1
2
3
Kafka Broker
4
Kafka Broker
5
6
Spark Application
7 8
9
HTTP /2
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Kafka + Protobuf + Spark
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Kafka.
• Solve for common Data Pipeline Problems
• Partition Keys are important. Use them to
your advantage
• Spark AQE - adaptive query execution can
handle hot spots. Non-spark services not-
so-much.
Topic: CallEvents
key: call_sid
partitions*: n
* number of partitions: factor of producer records/s * avg(record.bytesize)
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark.
• Use ScalaPB’s ExpressionEncoders to
natively convert protobuf to Catalyst
Optimizable DataFrames
• Marry this with Sparks Kafka
DataSource Reader/Writer
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark.
• Behind the Scenes…
• Spark is doing magic
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark.
• End to End Tests ensure you can press
the release button anywhere in the
pipeline.
• Can be automated to ensure your Spark
Apps can continue working with any
changes to the upstream gRPC
• Can use for Canary testing updates
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark.
• Use ScalaPB to read local streams
• Ensure your Spark Apps can continue
working with any new protobuf
dependencies
• Test simple reads and complex
aggregations locally, deploy globally
@newfront
GRPC Client
GRPC Server GRPC Server GRPC Server
1
Kafka Broker
4
Kafka Broker
5
6
Spark Application
7 8
HDFS
S39
TP /2
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark & Beyond
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark + Friends.
• Proto to Catalyst DataFrame
• Conversion to Parquet in Delta
Table
• Partitioned by Date
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Spark + Friends.
• Now it is easy for Downstream
applications to pick up using
Parquet Protocol Streams
@newfront
© 2019 TWILIO INC. ALL RIGHTS RESERVED.
Rinse and Repeat
@newfront
Just keep adding new Flows.
1. Define your Structured Data
2. Define your service definitions
3. Emit Data / Enqueue to Kafka
4. Read and Drop in HDFS (delta) or
pass along to a new Topic
Kafka Topic Kafka Topic
Spark Application Spark Application Spark Application
Kafka Topic
Data Table Data Table
Spark Application
GRPC Server
THANK YOU
@newfront
1 of 34

Recommended

Building a Cross Cloud Data Protection Engine by
Building a Cross Cloud Data Protection EngineBuilding a Cross Cloud Data Protection Engine
Building a Cross Cloud Data Protection EngineDatabricks
181 views18 slides
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ... by
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...
Scale and Optimize Data Engineering Pipelines with Software Engineering Best ...Databricks
322 views31 slides
Data Driven Decisions at Scale by
Data Driven Decisions at ScaleData Driven Decisions at Scale
Data Driven Decisions at ScaleDatabricks
337 views20 slides
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (... by
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
4.5K views41 slides
Real-Time Robot Predictive Maintenance in Action by
Real-Time Robot Predictive Maintenance in ActionReal-Time Robot Predictive Maintenance in Action
Real-Time Robot Predictive Maintenance in ActionDataWorks Summit
2.2K views34 slides
Automated Testing For Protecting Data Pipelines from Undocumented Assumptions by
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsAutomated Testing For Protecting Data Pipelines from Undocumented Assumptions
Automated Testing For Protecting Data Pipelines from Undocumented AssumptionsDatabricks
762 views32 slides

More Related Content

What's hot

Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee... by
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...HostedbyConfluent
378 views27 slides
Democratizing data science Using spark, hive and druid by
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druidDataWorks Summit
965 views26 slides
Delta Lake: Open Source Reliability w/ Apache Spark by
Delta Lake: Open Source Reliability w/ Apache SparkDelta Lake: Open Source Reliability w/ Apache Spark
Delta Lake: Open Source Reliability w/ Apache SparkGeorge Chow
235 views39 slides
Intuit Analytics Cloud 101 by
Intuit Analytics Cloud 101Intuit Analytics Cloud 101
Intuit Analytics Cloud 101DataWorks Summit/Hadoop Summit
1.1K views22 slides
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented... by
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Data Con LA
1.8K views29 slides
Building Sessionization Pipeline at Scale with Databricks Delta by
Building Sessionization Pipeline at Scale with Databricks DeltaBuilding Sessionization Pipeline at Scale with Databricks Delta
Building Sessionization Pipeline at Scale with Databricks DeltaDatabricks
634 views15 slides

What's hot(20)

Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee... by HostedbyConfluent
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
Qlik and Confluent Success Stories with Kafka - How Generali and Skechers Kee...
HostedbyConfluent378 views
Democratizing data science Using spark, hive and druid by DataWorks Summit
Democratizing data science Using spark, hive and druidDemocratizing data science Using spark, hive and druid
Democratizing data science Using spark, hive and druid
DataWorks Summit965 views
Delta Lake: Open Source Reliability w/ Apache Spark by George Chow
Delta Lake: Open Source Reliability w/ Apache SparkDelta Lake: Open Source Reliability w/ Apache Spark
Delta Lake: Open Source Reliability w/ Apache Spark
George Chow235 views
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented... by Data Con LA
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented...
Data Con LA1.8K views
Building Sessionization Pipeline at Scale with Databricks Delta by Databricks
Building Sessionization Pipeline at Scale with Databricks DeltaBuilding Sessionization Pipeline at Scale with Databricks Delta
Building Sessionization Pipeline at Scale with Databricks Delta
Databricks634 views
Data Privacy with Apache Spark: Defensive and Offensive Approaches by Databricks
Data Privacy with Apache Spark: Defensive and Offensive ApproachesData Privacy with Apache Spark: Defensive and Offensive Approaches
Data Privacy with Apache Spark: Defensive and Offensive Approaches
Databricks735 views
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re... by Databricks
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Unifying Streaming and Historical Telemetry Data For Real-time Performance Re...
Databricks2.3K views
Successful AI/ML Projects with End-to-End Cloud Data Engineering by Databricks
Successful AI/ML Projects with End-to-End Cloud Data EngineeringSuccessful AI/ML Projects with End-to-End Cloud Data Engineering
Successful AI/ML Projects with End-to-End Cloud Data Engineering
Databricks710 views
Reliable Data Intestion in BigData / IoT by Guido Schmutz
Reliable Data Intestion in BigData / IoTReliable Data Intestion in BigData / IoT
Reliable Data Intestion in BigData / IoT
Guido Schmutz1.2K views
Apache Spark in Scientific Applciations by Dr. Mirko Kämpf
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
Dr. Mirko Kämpf380 views
How Workato creates robust data pipelines and automations for you? by Jeraldine Phneah
How Workato creates robust data pipelines and automations for you?How Workato creates robust data pipelines and automations for you?
How Workato creates robust data pipelines and automations for you?
Jeraldine Phneah188 views
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed... by Databricks
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
ML, Statistics, and Spark with Databricks for Maximizing Revenue in a Delayed...
Databricks195 views
Redash: Open Source SQL Analytics on Data Lakes by Databricks
Redash: Open Source SQL Analytics on Data LakesRedash: Open Source SQL Analytics on Data Lakes
Redash: Open Source SQL Analytics on Data Lakes
Databricks514 views
Reltio: Powering Enterprise Data-driven Applications with Cassandra by DataStax Academy
Reltio: Powering Enterprise Data-driven Applications with CassandraReltio: Powering Enterprise Data-driven Applications with Cassandra
Reltio: Powering Enterprise Data-driven Applications with Cassandra
DataStax Academy2.4K views
How to design and implement a data ops architecture with sdc and gcp by Joseph Arriola
How to design and implement a data ops architecture with sdc and gcpHow to design and implement a data ops architecture with sdc and gcp
How to design and implement a data ops architecture with sdc and gcp
Joseph Arriola394 views
Northwestern Mutual Journey – Transform BI Space to Cloud by Databricks
Northwestern Mutual Journey – Transform BI Space to CloudNorthwestern Mutual Journey – Transform BI Space to Cloud
Northwestern Mutual Journey – Transform BI Space to Cloud
Databricks407 views
Building Custom Big Data Integrations by Pat Patterson
Building Custom Big Data IntegrationsBuilding Custom Big Data Integrations
Building Custom Big Data Integrations
Pat Patterson642 views

Similar to Building a Streaming Microservices Architecture - Data + AI Summit EU 2020

Apache Pulsar @Splunk by
Apache Pulsar @SplunkApache Pulsar @Splunk
Apache Pulsar @SplunkKarthik Ramasamy
835 views75 slides
Integrating Postgres with ActiveMQ and Camel by
Integrating Postgres with ActiveMQ and CamelIntegrating Postgres with ActiveMQ and Camel
Integrating Postgres with ActiveMQ and CamelJustin Reock
565 views57 slides
Oracle Modern AppDev Approach to Cloud & Container Native App by
Oracle Modern AppDev Approach to Cloud & Container Native AppOracle Modern AppDev Approach to Cloud & Container Native App
Oracle Modern AppDev Approach to Cloud & Container Native AppPaulo Alberto Simoes ∴
271 views50 slides
Why Splunk Chose Pulsar_Karthik Ramasamy by
Why Splunk Chose Pulsar_Karthik RamasamyWhy Splunk Chose Pulsar_Karthik Ramasamy
Why Splunk Chose Pulsar_Karthik RamasamyStreamNative
2.5K views47 slides
Pulsar summit-keynote-final by
Pulsar summit-keynote-finalPulsar summit-keynote-final
Pulsar summit-keynote-finalKarthik Ramasamy
1.3K views47 slides
CICDforModernApplications-Oslo.pdf by
CICDforModernApplications-Oslo.pdfCICDforModernApplications-Oslo.pdf
CICDforModernApplications-Oslo.pdfAmazon Web Services
379 views90 slides

Similar to Building a Streaming Microservices Architecture - Data + AI Summit EU 2020(20)

Integrating Postgres with ActiveMQ and Camel by Justin Reock
Integrating Postgres with ActiveMQ and CamelIntegrating Postgres with ActiveMQ and Camel
Integrating Postgres with ActiveMQ and Camel
Justin Reock565 views
Why Splunk Chose Pulsar_Karthik Ramasamy by StreamNative
Why Splunk Chose Pulsar_Karthik RamasamyWhy Splunk Chose Pulsar_Karthik Ramasamy
Why Splunk Chose Pulsar_Karthik Ramasamy
StreamNative2.5K views
CI/CD for Containers: A Way Forward for Your DevOps Pipeline by Amazon Web Services
CI/CD for Containers: A Way Forward for Your DevOps PipelineCI/CD for Containers: A Way Forward for Your DevOps Pipeline
CI/CD for Containers: A Way Forward for Your DevOps Pipeline
Kamailio practice Quobis-University of Vigo Laboratory of Commutation 2012-2... by Quobis
Kamailio practice Quobis-University of Vigo Laboratory of Commutation  2012-2...Kamailio practice Quobis-University of Vigo Laboratory of Commutation  2012-2...
Kamailio practice Quobis-University of Vigo Laboratory of Commutation 2012-2...
Quobis1.1K views
The DevOps Promise: Helping Management Realise the Quality, Velocity & Effici... by Splunk
The DevOps Promise: Helping Management Realise the Quality, Velocity & Effici...The DevOps Promise: Helping Management Realise the Quality, Velocity & Effici...
The DevOps Promise: Helping Management Realise the Quality, Velocity & Effici...
Splunk317 views
SplDevOps: Making Splunk Development a Breeze With a Deep Dive on DevOps' Con... by Harry McLaren
SplDevOps: Making Splunk Development a Breeze With a Deep Dive on DevOps' Con...SplDevOps: Making Splunk Development a Breeze With a Deep Dive on DevOps' Con...
SplDevOps: Making Splunk Development a Breeze With a Deep Dive on DevOps' Con...
Harry McLaren376 views
Cisco Connect Ottawa 2018 dev net by Cisco Canada
Cisco Connect Ottawa 2018 dev netCisco Connect Ottawa 2018 dev net
Cisco Connect Ottawa 2018 dev net
Cisco Canada198 views
Continuous Integration and Continuous Delivery Best Practices for Building Mo... by Amazon Web Services
Continuous Integration and Continuous Delivery Best Practices for Building Mo...Continuous Integration and Continuous Delivery Best Practices for Building Mo...
Continuous Integration and Continuous Delivery Best Practices for Building Mo...
Analytics im DevOps Lebenszyklus by Splunk
Analytics im DevOps LebenszyklusAnalytics im DevOps Lebenszyklus
Analytics im DevOps Lebenszyklus
Splunk451 views
Emulators as an Emerging Best Practice for API Providers by Cisco DevNet
Emulators as an Emerging Best Practice for API ProvidersEmulators as an Emerging Best Practice for API Providers
Emulators as an Emerging Best Practice for API Providers
Cisco DevNet207 views
AWS DevDay Cologne - CI/CD for modern applications by Cobus Bernard
AWS DevDay Cologne - CI/CD for modern applicationsAWS DevDay Cologne - CI/CD for modern applications
AWS DevDay Cologne - CI/CD for modern applications
Cobus Bernard303 views
Leveraging Splunk Enterprise Security with the MITRE’s ATT&CK Framework by Splunk
Leveraging Splunk Enterprise Security with the MITRE’s ATT&CK FrameworkLeveraging Splunk Enterprise Security with the MITRE’s ATT&CK Framework
Leveraging Splunk Enterprise Security with the MITRE’s ATT&CK Framework
Splunk2K views
TechWiseTV Workshop: Cisco Hybrid Cloud Platform for Google Cloud by Robb Boyd
TechWiseTV Workshop:  Cisco Hybrid Cloud Platform for Google CloudTechWiseTV Workshop:  Cisco Hybrid Cloud Platform for Google Cloud
TechWiseTV Workshop: Cisco Hybrid Cloud Platform for Google Cloud
Robb Boyd264 views

More from Databricks

DW Migration Webinar-March 2022.pptx by
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
4.3K views25 slides
Data Lakehouse Symposium | Day 1 | Part 1 by
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
1.5K views43 slides
Data Lakehouse Symposium | Day 1 | Part 2 by
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
743 views16 slides
Data Lakehouse Symposium | Day 4 by
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
1.8K views74 slides
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
6.3K views64 slides
Democratizing Data Quality Through a Centralized Platform by
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
1.4K views36 slides

More from Databricks(20)

DW Migration Webinar-March 2022.pptx by Databricks
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks4.3K views
Data Lakehouse Symposium | Day 1 | Part 1 by Databricks
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks1.5K views
Data Lakehouse Symposium | Day 1 | Part 2 by Databricks
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks743 views
Data Lakehouse Symposium | Day 4 by Databricks
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks1.8K views
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop by Databricks
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks6.3K views
Democratizing Data Quality Through a Centralized Platform by Databricks
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks1.4K views
Learn to Use Databricks for Data Science by Databricks
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks1.6K views
Why APM Is Not the Same As ML Monitoring by Databricks
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks743 views
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix by Databricks
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks689 views
Stage Level Scheduling Improving Big Data and AI Integration by Databricks
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks850 views
Simplify Data Conversion from Spark to TensorFlow and PyTorch by Databricks
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks1.8K views
Scaling your Data Pipelines with Apache Spark on Kubernetes by Databricks
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks2.1K views
Scaling and Unifying SciKit Learn and Apache Spark Pipelines by Databricks
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks667 views
Sawtooth Windows for Feature Aggregations by Databricks
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks606 views
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink by Databricks
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks677 views
Re-imagine Data Monitoring with whylogs and Spark by Databricks
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks551 views
Raven: End-to-end Optimization of ML Prediction Queries by Databricks
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks450 views
Processing Large Datasets for ADAS Applications using Apache Spark by Databricks
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks513 views
Massive Data Processing in Adobe Using Delta Lake by Databricks
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks719 views
Machine Learning CI/CD for Email Attack Detection by Databricks
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks389 views

Recently uploaded

Oral presentation.pdf by
Oral presentation.pdfOral presentation.pdf
Oral presentation.pdfreemalmazroui8
5 views10 slides
shivam tiwari.pptx by
shivam tiwari.pptxshivam tiwari.pptx
shivam tiwari.pptxAanyaMishra4
7 views14 slides
Lack of communication among family.pptx by
Lack of communication among family.pptxLack of communication among family.pptx
Lack of communication among family.pptxahmed164023
15 views10 slides
PRIVACY AWRE PERSONAL DATA STORAGE by
PRIVACY AWRE PERSONAL DATA STORAGEPRIVACY AWRE PERSONAL DATA STORAGE
PRIVACY AWRE PERSONAL DATA STORAGEantony420421
7 views56 slides
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf by
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf10urkyr34
7 views259 slides
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...StatsCommunications
7 views26 slides

Recently uploaded(20)

Lack of communication among family.pptx by ahmed164023
Lack of communication among family.pptxLack of communication among family.pptx
Lack of communication among family.pptx
ahmed16402315 views
PRIVACY AWRE PERSONAL DATA STORAGE by antony420421
PRIVACY AWRE PERSONAL DATA STORAGEPRIVACY AWRE PERSONAL DATA STORAGE
PRIVACY AWRE PERSONAL DATA STORAGE
antony4204217 views
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf by 10urkyr34
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
6498-Butun_Beyinli_Cocuq-Daniel_J.Siegel-Tina_Payne_Bryson-2011-259s.pdf
10urkyr347 views
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an... by StatsCommunications
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OECD-Persol Holdings Workshop on Advancing Employee Well-being in Business an...
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf by Oppotus
OPPOTUS - Malaysians on Malaysia 3Q2023.pdfOPPOTUS - Malaysians on Malaysia 3Q2023.pdf
OPPOTUS - Malaysians on Malaysia 3Q2023.pdf
Oppotus31 views
Data about the sector workshop by info828217
Data about the sector workshopData about the sector workshop
Data about the sector workshop
info82821729 views
LIVE OAK MEMORIAL PARK.pptx by ms2332always
LIVE OAK MEMORIAL PARK.pptxLIVE OAK MEMORIAL PARK.pptx
LIVE OAK MEMORIAL PARK.pptx
ms2332always7 views
Customer Data Cleansing Project.pptx by Nat O
Customer Data Cleansing Project.pptxCustomer Data Cleansing Project.pptx
Customer Data Cleansing Project.pptx
Nat O6 views
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P... by DataScienceConferenc1
[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...[DSC Europe 23][AI:CSI]  Dragan Pleskonjic - AI Impact on Cybersecurity and P...
[DSC Europe 23][AI:CSI] Dragan Pleskonjic - AI Impact on Cybersecurity and P...
CRM stick or twist.pptx by info828217
CRM stick or twist.pptxCRM stick or twist.pptx
CRM stick or twist.pptx
info82821711 views

Building a Streaming Microservices Architecture - Data + AI Summit EU 2020

  • 1. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Building a Streaming Microservices Architecture With Apache Spark Structured Streaming & Friends Scott Haines Senior Principal Software Engineer, Twilio @newfront
  • 2. © 2019 TWILIO INC. ALL RIGHTS RESERVED. A little about me. • I work at Twilio building massive data systems • I run a bi-weekly internal Spark Office Hours where I offer training and guidance to teams at the company • >12 years working on large distributed analytics systems @newfront
  • 3. © 2019 TWILIO INC. ALL RIGHTS RESERVED. A little about me. • I work at Twilio building massive data systems • I run a bi-weekly internal Spark Office Hours where I offer training and guidance to teams at the company • >12 years working on large distributed analytics systems • Published work on Distributed Analytics Systems @newfront
  • 4. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Voice Insights Accountable observability data and interactive analytics and insights for the voice business and customers.
  • 5. VIRGINIA, USA DUBLIN, IRELAND SINGAPORE SYDNEY, AUSTRALIA TOKYO, JAPAN SAO PAULO, BRAZIL DATA CENT ERS © 2019 TWILIO INC. ALL RIGHTS RESERVED. Events per Second >1MIL
  • 6. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Let’s Build a Reliable Data Architecture
  • 7. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Goal: Reliable E2E Streaming Data Pipeline GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 HTTP /2 @newfront
  • 8. Strong and Reliable Data starts at Ingest © 2019 TWILIO INC. ALL RIGHTS RESERVED. @newfront
  • 9. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Of people love JSON*. >95% { "type": "CallEvent", “call_sid”: "CA123", "attributes": [ { “account_sid”: “AC123” “start_ms": 123, “end_ms": "435" } ] } • JSON has Structure. • But JSON isn't strictly Structured Data. Structured Data @newfront @twilio
  • 10. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Of people saddened by bad data* 100% { "type": "CallEvent", “call_sid": “CA234”, “attributes":"oops" } • JSON has poor runtime guarantees due to its flexible nature. Optimize for compile time guarantees. • Debugging corrupt data in a large distributed system ruins hopes and dreams. Structured Data @newfront @twilio
  • 11. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Buffers message CallEvent { uint64 created_ms = 1; string call_sid = 2; uint64 account_sid = 3; EventType event_type = 4; Region region = 5; } • Well Defined Events tell their own Story • Type-Safety Rules • Versioning your API / Pipeline / Data now just means sticking to a version of your schema • Rely on Releases for versioning • Interoperable with most major languages (java/scala/c++/go/obj-c/node-js/ python/...)
  • 12. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Buffers message CallEvent { uint64 created_ms = 1; string call_sid = 2; uint64 account_sid = 3; EventType event_type = 4; Region region = 5; } • Data Accountability • Lightning Fast Serialization / Deserialization • Plays with nicely gRPC • Interoperable with Spark SQL • Like “JSON with Guard Rails” Of people like when things work between releases! 100% Take Aways
  • 13. Data Engineers love gRPC © 2019 TWILIO INC. ALL RIGHTS RESERVED. *technically not universally accepted @newfront
  • 14. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 gRPC | saves time © 2019 TWILIO INC. ALL RIGHTS RESERVED.
  • 15. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 gRPC | saves time © 2019 TWILIO INC. ALL RIGHTS RESERVED. val time = “$$$”
  • 16. GRPC Client GRPC Server GRPC Server GRPC Server HTTP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. GRPC // nutshell • RPC = remote procedure call. “G” stands for generic or Google • Great for Internal Services • High Performance • Compact Binary Exchange Format • Compile Idiomatic API Definitions • Capable of Bi-Directional Streaming • Pluggable HTTP/2 transport
  • 17. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto)
  • 18. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) @newfront
  • 19. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) • 3: Compile your messages and service stubs. sbt clean compile package publishLocal
  • 20. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • 1: Define your messages (call.proto) • 2: Define your services (service.proto) • 3: Compile your messages and service stubs. • 4: Implement Traits and Run!
  • 21. © 2019 TWILIO INC. ALL RIGHTS RESERVED. GRPC. • Building a CallEvent Service. • Client SDKs essentially write themselves. • JSON <-> Protobuf is still possible for maintaining customer facing APIs
  • 22. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Protocol Gatewaymessage T<:Event { … } protobuf @ version Kafka Broker gRPC Common Pattern Emerges Protocol Streams
  • 23. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Still Building… GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 HTTP /2 @newfront
  • 24. GRPC Client GRPC Server GRPC Server GRPC Server 1 2 3 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 9 HTTP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. Kafka + Protobuf + Spark
  • 25. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Kafka. • Solve for common Data Pipeline Problems • Partition Keys are important. Use them to your advantage • Spark AQE - adaptive query execution can handle hot spots. Non-spark services not- so-much. Topic: CallEvents key: call_sid partitions*: n * number of partitions: factor of producer records/s * avg(record.bytesize)
  • 26. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Use ScalaPB’s ExpressionEncoders to natively convert protobuf to Catalyst Optimizable DataFrames • Marry this with Sparks Kafka DataSource Reader/Writer
  • 27. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Behind the Scenes… • Spark is doing magic
  • 28. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • End to End Tests ensure you can press the release button anywhere in the pipeline. • Can be automated to ensure your Spark Apps can continue working with any changes to the upstream gRPC • Can use for Canary testing updates
  • 29. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark. • Use ScalaPB to read local streams • Ensure your Spark Apps can continue working with any new protobuf dependencies • Test simple reads and complex aggregations locally, deploy globally @newfront
  • 30. GRPC Client GRPC Server GRPC Server GRPC Server 1 Kafka Broker 4 Kafka Broker 5 6 Spark Application 7 8 HDFS S39 TP /2 © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark & Beyond @newfront
  • 31. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark + Friends. • Proto to Catalyst DataFrame • Conversion to Parquet in Delta Table • Partitioned by Date @newfront
  • 32. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Spark + Friends. • Now it is easy for Downstream applications to pick up using Parquet Protocol Streams @newfront
  • 33. © 2019 TWILIO INC. ALL RIGHTS RESERVED. Rinse and Repeat @newfront Just keep adding new Flows. 1. Define your Structured Data 2. Define your service definitions 3. Emit Data / Enqueue to Kafka 4. Read and Drop in HDFS (delta) or pass along to a new Topic Kafka Topic Kafka Topic Spark Application Spark Application Spark Application Kafka Topic Data Table Data Table Spark Application GRPC Server