WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Landon Robinson & Jack Chapa
Spark Streaming
Headaches and Breakthroughs in Building
Continuous Applications
#UnifiedAnalytics #SparkAISummit
Who We Are
#UnifiedAnalytics #SparkAISummit
Landon Robinson
Data Engineer
Jack Chapa
Data Engineer
Big Data Team @ SpotX
Because Spark Streaming...
• is very powerful
• can supercharge your infrastructure
• … and can be very complex!
Lots of headaches and breakthroughs!
But first… why are we here?
#UnifiedAnalytics #SparkAISummit
Takeaways
Leave with a few actionable items that we wish we knew
when we started with Spark Streaming.
Focus Areas
● Streaming Basics
● Testing
● Monitoring & Alerts
● Batch Intervals & Resources
● Helpful Configurations
● Backpressure
● Data Enrichment & Transformations
Our Company
#UnifiedAnalytics #SparkAISummit
The Trusted Platform For
Premium Publishers and
Broadcasters
We Process a Lot of Data
Data:
- 220 MM+ Total Files/Blocks
- 8 PB+ HDFS Space
- 20 TB+ new data daily
- 100MM+ records/minute
- 300+ Data Nodes
Apps:
- Thousands of daily Spark apps
- Hundreds of daily user queries
- Multiple 24/7 Streaming apps
Spark Streaming is Key for Us
Our uses include:
- Rapid ingestion of data into warehouse for querying
- Machine learning on near-live data streams
- Ability to react to and impact live situations
- Accelerated processing / updating of metadata
- Real-time visualization of data streams and processing
Spark Streaming Basics
a brief overview
Spark Streaming Basics
Spark Streaming is an extension of Spark that
enables scalable, high-throughput, fault-tolerant
processing of live data streams.
• Stream == live data stream
  – Topic == Kafka’s name for a stream
• DStream == sequence of RDDs formed from reading a data stream
• Batch == a self-contained job within your Streaming app that processes a segment of the stream.
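To make these terms concrete, here is a minimal sketch of a Streaming app: a StreamingContext with a batch interval, and a DStream read from a Kafka topic. The broker, topic, and group names are placeholders, and it assumes the spark-streaming-kafka-0-10 integration is on the classpath.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-skeleton")
    val ssc  = new StreamingContext(conf, Seconds(30))   // 30s batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // DStream: a sequence of RDDs, one per batch, read from the stream
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

    stream.map(record => record.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}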
Testing
Rapid development and
testing of Spark apps
Use Spark in Local Mode
You can start building Spark Streaming apps
in minutes, using Spark locally!
On your local machine
• No cluster needed!
• Great for rough testing
We Recommend:
IntelliJ Community Edition
• with SBT: For dependency management
Use Spark in Local Mode
In your build.sbt:
• src/test/scala => “provided”
• src/main/scala => “compiled”
The Scala Build Tool is your friend!
Simply:
• Import Spark libraries
• Invoke a Context and/or Session
• Set master to local[*] or local[n]
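A minimal build.sbt sketch along these lines (the versions and the spark-testing-base artifact string are assumptions, not from the talk): Spark itself is marked "provided" so it is not bundled into the deployable jar, while test-only helpers stay in the Test scope.

name := "streaming-example"
scalaVersion := "2.11.12"

val sparkVersion = "2.4.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                 % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming"            % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
  // pick the spark-testing-base release matching your Spark version (assumed here)
  "com.holdenkarau"  %% "spark-testing-base"         % "2.4.0_0.11.0" % Test
)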
Example Unit Test using just a SparkContext
Invoke a local session:
• In your unit test classes
• Test logic on small datasets
Add to your deployment pipeline
for a nice pre-release gut check!
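A hedged sketch of what such a test can look like (the word-count logic and names are illustrative, not from the talk): spin up a local SparkContext, run the logic on a tiny dataset, and assert on the result.

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite {
  test("word count on a small dataset") {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("unit-test"))
    try {
      val counts = sc.parallelize(Seq("a", "b", "a"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .collectAsMap()

      assert(counts("a") == 2)
      assert(counts("b") == 1)
    } finally {
      sc.stop()   // always stop the context so other tests can create one
    }
  }
}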
Unit Testing
Spark Streaming Apps can easily be unit tested
- Using .queueStream() (see the sketch below)
- Using a spark testing library
Libraries
- spark-testing-base
- sscheck
- spark-tests
Use Cases
- DStream actions
- Business Logic
- Integration
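For instance, a small sketch of the queueStream() approach (illustrative, not the talk's code): hand-built RDDs pushed onto a queue become the batches of a DStream, so the logic can be exercised without Kafka.

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamExample {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("queue-stream"))
    val ssc = new StreamingContext(sc, Seconds(1))

    val queue  = mutable.Queue[RDD[Int]]()
    val stream = ssc.queueStream(queue)

    stream.map(_ * 2).print()          // the DStream logic under test

    ssc.start()
    queue += sc.parallelize(1 to 5)    // each queued RDD becomes one batch
    queue += sc.parallelize(6 to 10)
    Thread.sleep(3000)
    ssc.stop(stopSparkContext = true)
  }
}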
Example Library: spark-testing-base
- Easy to Use
- Helpful wrappers
- Integrates w/ scalatest
- Minimal code required
- Clock management
- Runs alongside other tests
GitHub: https://github.com/holdenk/spark-testing-base
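A minimal sketch of a spark-testing-base test, assuming its StreamingSuiteBase trait and testOperation helper: feed batches of input, apply the operation under test, and compare against the expected output per batch.

import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.FunSuite

class DoubleValuesSuite extends FunSuite with StreamingSuiteBase {
  test("doubles every record in each batch") {
    val input    = List(List(1, 2, 3), List(4, 5))       // two batches in
    val expected = List(List(2, 4, 6), List(8, 10))      // two batches out

    def double(stream: DStream[Int]): DStream[Int] = stream.map(_ * 2)

    // the library drives the streaming clock and compares output per batch
    testOperation(input, double _, expected)
  }
}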
Monitoring
Tracking and visualizing
performance of your
app
Monitoring is Awesome
It can reveal:
• How your app is performing
• Problems + Bugs!
And provide opportunities to:
• See and address issues
• Observe behavior visually
But monitoring can be tough to implement!
Monitoring (a less than ideal approach)
You could do it all in the app...
Example: Looping over RDDs to:
• Count records
• Track Kafka offsets
• Processing time / delays
But it’s less than ideal...
• Calculating performance significantly
impacts performance… not great.
• All these metrics are already calculated by Spark!
Monitoring and Visualization (using Listeners)
Use Spark Listeners to access
metrics in the background!
Let Spark do the hard work:
• Batch duration, delays
• Record throughput
• Stream position recovery
Come to our talk: Spark Listeners:
A Crash Course in Fast, Easy
Monitoring!
• Room 3016 | Today @ 5:30 PM
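A rough sketch of the listener approach (not the talk's exact code): register a StreamingListener and let Spark hand you the batch metrics it already computes, instead of calculating them inside the job.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchMetricsListener extends StreamingListener {
  // Called by Spark after every batch with the metrics it already tracks
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(
      s"batch=${info.batchTime} records=${info.numRecords} " +
      s"processingMs=${info.processingDelay.getOrElse(-1L)} " +
      s"schedulingMs=${info.schedulingDelay.getOrElse(-1L)}")
  }
}

// Registration, e.g. right after building the StreamingContext:
// ssc.addStreamingListener(new BatchMetricsListener)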
Kafka Offset Recovery
Saving your place
elsewhere
Writing Offsets to MySQL
Inside the Spark Listener class, after a batch completes, you can access an
object generated by Spark containing the offsets you processed.
Take those offsets and back them up to a DB...
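A hedged sketch of that pattern (table and column names are illustrative): inside a StreamingListener, pull the Kafka OffsetRanges that Spark attaches to each batch's input info and persist them to MySQL.

import java.sql.DriverManager
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class OffsetBackupListener(jdbcUrl: String, user: String, pass: String) extends StreamingListener {

  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // For direct Kafka streams, each input's metadata carries its OffsetRanges
    val offsetRanges: Seq[OffsetRange] =
      batch.batchInfo.streamIdToInputInfo.values.toSeq
        .flatMap(_.metadata.get("offsets"))
        .flatMap(_.asInstanceOf[List[OffsetRange]])

    val conn = DriverManager.getConnection(jdbcUrl, user, pass)
    try {
      // Assumes a unique key on (topic, kafka_partition) so REPLACE upserts
      val stmt = conn.prepareStatement(
        "REPLACE INTO kafka_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
      offsetRanges.foreach { range =>
        stmt.setString(1, range.topic)
        stmt.setInt(2, range.partition)
        stmt.setLong(3, range.untilOffset)
        stmt.executeUpdate()
      }
    } finally conn.close()
  }
}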
Reading Offsets from MySQL
Your offsets are now stored in a DB after each batch completes.
Whenever your app restarts, it reads those offsets from the DB...
...and starts processing where it last left off!
Getting Offsets from the Database
Example: Reading Offsets from MySQL
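A hedged sketch of the recovery side, matching the illustrative table above: read the stored offsets from MySQL and hand them to the Kafka consumer strategy so the stream resumes where it left off.

import java.sql.DriverManager
import scala.collection.mutable
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object OffsetRecovery {
  // Load the last backed-up offset per topic/partition from MySQL
  def loadOffsets(jdbcUrl: String, user: String, pass: String): Map[TopicPartition, Long] = {
    val conn = DriverManager.getConnection(jdbcUrl, user, pass)
    try {
      val offsets = mutable.Map[TopicPartition, Long]()
      val rs = conn.createStatement()
        .executeQuery("SELECT topic, kafka_partition, until_offset FROM kafka_offsets")
      while (rs.next()) {
        offsets.put(
          new TopicPartition(rs.getString("topic"), rs.getInt("kafka_partition")),
          rs.getLong("until_offset"))
      }
      offsets.toMap
    } finally conn.close()
  }

  // Build the stream starting from the recovered offsets
  def buildStream(ssc: StreamingContext, topics: Seq[String],
                  kafkaParams: Map[String, Object], fromOffsets: Map[TopicPartition, Long]) =
    KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](topics, kafkaParams, fromOffsets))
}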
Timing Logging (around actions)
- Record timing info for fast troubleshooting
- Escalate alarms to the appropriate team
- Quickly resolve while the app continues running
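One small way to do this (a sketch with illustrative names): wrap the work in a timing helper so slow steps stand out in the logs.

import org.slf4j.LoggerFactory

object Timed {
  private val log = LoggerFactory.getLogger(getClass)

  // Run the work, log how long it took, and return its result
  def timed[T](label: String)(work: => T): T = {
    val start  = System.currentTimeMillis()
    val result = work
    log.info(s"$label took ${System.currentTimeMillis() - start} ms")
    result
  }
}

// Usage inside a batch, e.g.:
// stream.foreachRDD { rdd => Timed.timed("enrich+write")(writeToWarehouse(rdd)) }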
React
How do I react to this monitoring?
● Heartbeats
● Scheduled Monitor Jobs
○ Version Updates
○ Ensure Running
○ Act on failure/problem
● Monitoring Alarms
● Look at them!
Batch Intervals
Optimizing for speed
and resource efficiency
Setting Appropriate Batch Intervals
An appropriate batch interval is key to an app that is quick and efficient.
You want batches that process faster than the interval, but not so fast that
resources are idling and therefore wasted!
The effectiveness of an interval is affected by:
• Resource allocation (CPU + RAM)
• Quantity of work
• Quantity of data
Setting Appropriate Batch Intervals
Consider these questions:
How quickly do I need to process data?
• Can I slow it down to save resources?
What is my resource budget / allocation?
• Can I increase? Can I cut back?
• Bigger interval = more time to process
• … but also more data to process
• Smaller interval = the opposite
Setting Appropriate Batch Intervals
Tips for finding an optimal combination:
Start small!
a. Short batch interval (seconds)
b. Modest resources
Then increase whichever of the two (a or b) you have in more flexible supply.
Again: processing time < interval = good.
Comfortably less, not significantly less.
Additional Resource Notes
- Scale down when possible
- Free up resources or save on cloud utilization spend
- Avoid preemption
- Use resource pools with prioritization
- With preemption disabled if you can
- Set appropriate # of partitions for Kafka topics
- Higher volume == higher partition count
- Higher partition count == greater parallelization
Helpful Configuration Settings
Configuring your app to
be performant and
efficient
Helpful Configuration Settings
Spark
• spark.memory.useLegacyMode = true
– spark.storage.memoryFraction=0.03
• spark.submit.deployMode = cluster
• spark.serializer = org.apache.spark.serializer.KryoSerializer
• spark.rdd.compress = true
– spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
• spark.shuffle.service.enabled = true
• spark.streaming.blockInterval = 300
Kafka
• enable.auto.commit = ‘false’
Backpressure
Use Case:
You have irregular spikes in message throughput from Kafka topics.
• Backpressure dynamically adjusts the rate at which data is received per batch from Kafka.
• Prevents the app from being overwhelmed at startup and at peak load.
Settings:
• spark.streaming.backpressure.enabled = true
• spark.streaming.kafka.maxRatePerPartition = 20000
– max rate (messages/second) at which each Kafka partition will be read
• PID Rate Estimator: can be used to tweak the rate based on batch performance
– spark.streaming.backpressure.pid.*
Source: https://www.linkedin.com/pulse/enable-back-pressure-make-your-spark-streaming-production-lan-jiang/
Transformations
Bringing streaming and
static data together
Transformations (Streaming + Static)
transform()
● Allows RDD-level access to data.
● Use case: joining with another RDD
updateStateByKey() / mapWithState()
● Apply a function to each key; useful for keeping track of state
● Use case: maintaining state between batches (e.g. rolling join w/ two streams); see the sketch after this slide
reduceByKey()
● Reduce a keyed RDD with appropriate
function.
● Use case: deduping, aggregations
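A hedged sketch of the mapWithState() case mentioned above, keeping a running count per key across batches (the key and value types are illustrative).

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object RunningCounts {
  // Called once per key per batch: fold the new value into the stored state
  val countSpec = StateSpec.function(
    (key: String, value: Option[Long], state: State[Long]) => {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)                 // emitted downstream for each batch
    })

  def apply(events: DStream[(String, Long)]): DStream[(String, Long)] =
    events.mapWithState(countSpec)
}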
Joining Streaming and Static Data
Using the transform() method on a DStream:
Apply an RDD-to-RDD function to every RDD of the DStream.
• Useful for applying arbitrary RDD operations on a DStream
• Great for enriching streaming data with supplemental static data
Source: https://hadoopsters.net/2017/11/26/how-to-join-static-data-with-streaming-data-dstream-in-spark/
val transactions = … // streaming dataset (DStream)
val transaction_details = … // static dataset (RDD)
val complete_transaction_data = transactions.transform(live_transaction =>
  live_transaction.join(transaction_details))
Effective Static Joining
How do we handle static and persistent data?
Driver:
● Broadcast if small enough
● Read on driver every batch, then join
Worker:
● Connect on worker (lazy val connection object)
● Useful for persisting data
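A hedged sketch of the lazy connection pattern above (the JDBC URL, credentials, and table are placeholders): each executor JVM builds its connection once, on first use, rather than per record or per batch.

import java.sql.{Connection, DriverManager}

object WorkerConnection {
  // Initialized lazily, once per executor JVM, the first time it is referenced there
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:mysql://db-host:3306/streaming", "user", "pass")
}

// Usage inside the stream, e.g.:
// stream.foreachRDD { rdd =>
//   rdd.foreachPartition { records =>
//     val stmt = WorkerConnection.connection.prepareStatement("INSERT INTO enriched VALUES (?)")
//     records.foreach { r => stmt.setString(1, r.toString); stmt.executeUpdate() }
//   }
// }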
Review
Streaming isn’t always easy… but here are some great takeaways!
• Testing: Use Spark Locally w/ Unit Tests
• Monitoring: Use Listeners & React
• Batch Intervals & Resources: Be thoughtful!
• Configuration: Lots of awesome ones!
• Transformations: Do more with your streaming data!
• Offset Recovery: Stop worrying and love the offset management!
Landon Robinson
• lrobinson@spotx.tv
Jack Chapa
• jchapa@spotx.tv
hadoopsters.dev
https://gist.github.com/hadoopsters
Contact Us
Q & A
#UnifiedAnalytics #SparkAISummit
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT
