SlideShare a Scribd company logo
1 of 21
Spark and Online Analytics
Sudarshan Kadambi
Copyright 2016 Bloomberg L.P. All rights reserved.
Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for Online use-cases
Data and Analytics is our Business
Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
– Handle increasingly sophisticated client analytic workflows
– Ad-hoc and cross-domain aggregations, filtering
• Heterogeneous data stores
– Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries
Spark for Bloomberg Analytics
• Distributed compute scales well for
– large security universes and
– multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present consistent interface
for efficient data access
– Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries
Spark as a Service?
6
• Stand-alone Spark Apps on isolated clusters pose
challenges:
– Redundancy in:
» Crafting and Managing of RDDs/DFs
» Coding of the same or similar types of transforms/actions
– Management of clusters, replication of data, etc.
– Analytics are confined to specific content sets making
Cross-Asset Analytics much harder
– Need to handle Real-time ingestion in each App
Spark
Cluster
Spark
App
Spark
Cluster
Spark
Server
Spark
App
Spark
App
Spark
Cluster
Spark
App
Bloomberg Spark Server
Spark
Context
Request
Processor
Request
Processor
Request
Processor
Request Handler
MDF Registry
7
Function Transform
Registry (FTR)
RSI …
use
Ingestion Manager
MDF1
MDF2
1 2
1 2
Spark Server: Content Caching
• Data access has long tail characteristics
• High value data sub-setted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and backing store
Spark HA: State of the World
– Execution lineage in Driver
• Recovery from lost RDDs
– RDD Replication
• Low latency, even with lost executors
• Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”,
“MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to
more replicas if needed.
– Speculative execution
• Minimizing performance hit from stragglers
– Off-heap data
• Minimizing GC stalls
Spark Architecture
RPC Environment RPC Environment
BlockManagerMasterEndpoint BlockManagerMaster
Driver Executor - 2
BlockManager
RPC Environment
BlockManagerMaster
Executor - 1
BlockManager
RDD - 0
RDDBlock(0,
Partition-1)
RDDBlock(0,
Partition-2)
RDD Block Replication
Executor-1 Executor-2Driver
Compute RDD
Computation
complete Get Peers for replication
List of Peers
Replicate block to Peer
Block stored
locallyResults of computation
RDD Block Replication: Challenges
– Lost RDD partitions costly to recover
• Data replenished at query time
– RDD replicated to random executors
• On YARN, multiple executors can be brought up on the same node in
different containers
• Hence multiple replicas possible on the same node/rack, susceptible to
node/rack failure
• Lost block replicas not recovered proactively
Topology Aware Replication (SPARK-15352)
– Ideas & Implementation by Shubham Chopra
– Making Peer selection for replication pluggable
• Driver gets topology information for executors
• Executors informed about this topology information
• Executors use prioritization logic to order peers for block replication
• Pluggable TopologyMapper and BlockReplicationPrioritizer
• Default implementation replicates current Spark behavior
Topology Aware Replication (SPARK-15352)
– Customizable prioritization strategies to suit different deployments
• Variety of replication objectives – ReplicateToDifferentHost,
ReplicateBlockWithinRack, ReplicateBlockOutsideRack
• Optimizer to find a minimum number of peers to meet the objectives
• Replicate to these peers with a higher priority
– Proactive replenishment of lost replicas
• BlockManagerMasterEndpoint triggered replenishment when an executor
failure is detected.
Spark HA: Challenges
– High Availability of Spark Driver
• High bootstrap cost to reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query
inconsistency
– High Availability and Low Tail Latency closely related
Spark HA – A Strawman
• Multiple Spark Servers in Leader-Standby
configuration
• Each Spark Server backed by a different
Spark Cluster
• Each Spark Server refreshed with up-to-
date data
• Queries to standbys redirected to leader
• Only leader responds to queries - Data
consistency
• RDD Partition loss in the leader still a
concern
• Performance still gated by slowest
executor in leader
• Resource usage amplified by the number
of Spark Servers
Spark Driver State
• Spark Driver is an arbitrary Java application
• Only a subset of the state is interesting or expensive to reconstruct
• For online-use cases, only RDDs/DFs created during ingestion are of
interest
• Expressing ingestion using DFs has better decoupling of data/state than
RDDs
Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor assignment
• Cache Manager holds Logical Plan and DataFrame references
– Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
– Keeps a track of various stages and tasks being scheduled
• Executor information
– Hostname and ports of live executors
*Illustrative, not exhaustive
Externalizing Driver State
Benefits:
– Quicker recoveries
– No need to restart executors
– State accessible from multiple Active-Active drivers
Solutions:
– Off-heap storage for RDDs
– Residual book-keeping driver state externalized to ZooKeeper
Quorum of Drivers
THANK YOU.
skadambi@bloomberg.net

More Related Content

What's hot

Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 

What's hot (20)

Operational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache SparkOperational Tips For Deploying Apache Spark
Operational Tips For Deploying Apache Spark
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
 
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
Not your Father's Database: Not Your Father’s Database: How to Use Apache® Sp...
 
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
Extending the R API for Spark with sparklyr and Microsoft R Server with Ali Z...
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
Performance Optimization Case Study: Shattering Hadoop's Sort Record with Spa...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Spark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn WhittickSpark Summit EU talk by Emlyn Whittick
Spark Summit EU talk by Emlyn Whittick
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
Lessons Learned from Managing Thousands of Production Apache Spark Clusters w...
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 

Viewers also liked

Viewers also liked (20)

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Parallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkRParallelizing Existing R Packages with SparkR
Parallelizing Existing R Packages with SparkR
 
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...Introducing apache prediction io (incubating) (bay area spark meetup at sales...
Introducing apache prediction io (incubating) (bay area spark meetup at sales...
 
Exceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETLExceptions are the Norm: Dealing with Bad Actors in ETL
Exceptions are the Norm: Dealing with Bad Actors in ETL
 
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure ExecutionSpark Summit EU 2016: The Next AMPLab:  Real-time Intelligent Secure Execution
Spark Summit EU 2016: The Next AMPLab: Real-time Intelligent Secure Execution
 
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0
 
Spark Summit Europe 2016 Keynote - Databricks CEO
Spark Summit Europe 2016 Keynote  - Databricks CEO Spark Summit Europe 2016 Keynote  - Databricks CEO
Spark Summit Europe 2016 Keynote - Databricks CEO
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDs
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFramesApache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Insights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured StreamingInsights Without Tradeoffs: Using Structured Streaming
Insights Without Tradeoffs: Using Structured Streaming
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick WendellApache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-LearnApache® Spark™ MLlib: From Quick Start to Scikit-Learn
Apache® Spark™ MLlib: From Quick Start to Scikit-Learn
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 

Similar to Apache Spark and Online Analytics

Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Rose Toomey
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 

Similar to Apache Spark and Online Analytics (20)

Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in PinterestMigrating ETL Workflow to Apache Spark at Scale in Pinterest
Migrating ETL Workflow to Apache Spark at Scale in Pinterest
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
East Bay Java User Group Oct 2014 Spark Streaming Kinesis Machine Learning
 
Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)Introduction to Apache Geode (Cork, Ireland)
Introduction to Apache Geode (Cork, Ireland)
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Spark architechure.pptx
Spark architechure.pptxSpark architechure.pptx
Spark architechure.pptx
 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
 
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural LessonsCassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Apache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CITApache Geode Meetup, Cork, Ireland at CIT
Apache Geode Meetup, Cork, Ireland at CIT
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 

More from Databricks

Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Marc Lester
 

Recently uploaded (20)

BusinessGPT - Security and Governance for Generative AI
BusinessGPT  - Security and Governance for Generative AIBusinessGPT  - Security and Governance for Generative AI
BusinessGPT - Security and Governance for Generative AI
 
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
Abortion Clinic In Johannesburg ](+27832195400*)[ 🏥 Safe Abortion Pills in Jo...
 
Encryption Recap: A Refresher on Key Concepts
Encryption Recap: A Refresher on Key ConceptsEncryption Recap: A Refresher on Key Concepts
Encryption Recap: A Refresher on Key Concepts
 
Abortion Clinic In Springs ](+27832195400*)[ 🏥 Safe Abortion Pills in Springs...
Abortion Clinic In Springs ](+27832195400*)[ 🏥 Safe Abortion Pills in Springs...Abortion Clinic In Springs ](+27832195400*)[ 🏥 Safe Abortion Pills in Springs...
Abortion Clinic In Springs ](+27832195400*)[ 🏥 Safe Abortion Pills in Springs...
 
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCAOpenChain Webinar: AboutCode and Beyond - End-to-End SCA
OpenChain Webinar: AboutCode and Beyond - End-to-End SCA
 
Transformer Neural Network Use Cases with Links
Transformer Neural Network Use Cases with LinksTransformer Neural Network Use Cases with Links
Transformer Neural Network Use Cases with Links
 
Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024Modern binary build systems - PyCon 2024
Modern binary build systems - PyCon 2024
 
Community is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea GouletCommunity is Just as Important as Code by Andrea Goulet
Community is Just as Important as Code by Andrea Goulet
 
Rapidoform for Modern Form Building and Insights
Rapidoform for Modern Form Building and InsightsRapidoform for Modern Form Building and Insights
Rapidoform for Modern Form Building and Insights
 
Abortion Clinic In Pretoria ](+27832195400*)[ 🏥 Safe Abortion Pills in Pretor...
Abortion Clinic In Pretoria ](+27832195400*)[ 🏥 Safe Abortion Pills in Pretor...Abortion Clinic In Pretoria ](+27832195400*)[ 🏥 Safe Abortion Pills in Pretor...
Abortion Clinic In Pretoria ](+27832195400*)[ 🏥 Safe Abortion Pills in Pretor...
 
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
Abortion Pill Prices Jane Furse ](+27832195400*)[ 🏥 Women's Abortion Clinic i...
 
Test Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdfTest Automation Design Patterns_ A Comprehensive Guide.pdf
Test Automation Design Patterns_ A Comprehensive Guide.pdf
 
Workshop - Architecting Innovative Graph Applications- GraphSummit Milan
Workshop -  Architecting Innovative Graph Applications- GraphSummit MilanWorkshop -  Architecting Innovative Graph Applications- GraphSummit Milan
Workshop - Architecting Innovative Graph Applications- GraphSummit Milan
 
Software Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements EngineeringSoftware Engineering - Introduction + Process Models + Requirements Engineering
Software Engineering - Introduction + Process Models + Requirements Engineering
 
Jax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined DeckJax, FL Admin Community Group 05.14.2024 Combined Deck
Jax, FL Admin Community Group 05.14.2024 Combined Deck
 
A Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdfA Deep Dive into Secure Product Development Frameworks.pdf
A Deep Dive into Secure Product Development Frameworks.pdf
 
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdfThe Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
The Evolution of Web App Testing_ An Ultimate Guide to Future Trends.pdf
 
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
Wired_2.0_CREATE YOUR ULTIMATE LEARNING ENVIRONMENT_JCON_16052024
 
Your Ultimate Web Studio for Streaming Anywhere | Evmux
Your Ultimate Web Studio for Streaming Anywhere | EvmuxYour Ultimate Web Studio for Streaming Anywhere | Evmux
Your Ultimate Web Studio for Streaming Anywhere | Evmux
 
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
Anypoint Code Builder - Munich MuleSoft Meetup - 16th May 2024
 

Apache Spark and Online Analytics

  • 1. Spark and Online Analytics Sudarshan Kadambi Copyright 2016 Bloomberg L.P. All rights reserved.
  • 2. Agenda • Data and Analytics at Bloomberg • The role of Spark • The Bloomberg Spark Server • Spark for Online use-cases
  • 3. Data and Analytics is our Business
  • 4. Analytics at Bloomberg • Human-time, interactive analytics • Scalability – Handle increasingly sophisticated client analytic workflows – Ad-hoc and cross-domain aggregations, filtering • Heterogeneous data stores – Analytics often requires data from multiple stores • Low-latency updates, in addition to queries
  • 5. Spark for Bloomberg Analytics • Distributed compute scales well for – large security universes and – multi-universe cross-domain queries • Abstract away heterogeneous data sources and present consistent interface for efficient data access – Spark as a tool for systems integration • Connectors and primitives to deal with incoming streams • Cache intermediate compute for fast queries
  • 6. Spark as a Service? 6 • Stand-alone Spark Apps on isolated clusters pose challenges: – Redundancy in: » Crafting and Managing of RDDs/DFs » Coding of the same or similar types of transforms/actions – Management of clusters, replication of data, etc. – Analytics are confined to specific content sets making Cross-Asset Analytics much harder – Need to handle Real-time ingestion in each App Spark Cluster Spark App Spark Cluster Spark Server Spark App Spark App Spark Cluster Spark App
  • 7. Bloomberg Spark Server Spark Context Request Processor Request Processor Request Processor Request Handler MDF Registry 7 Function Transform Registry (FTR) RSI … use Ingestion Manager MDF1 MDF2 1 2 1 2
  • 8. Spark Server: Content Caching • Data access has long tail characteristics • High value data sub-setted within Spark • Specified as a filter predicate at time of registration • Seamless unification of data in Spark and backing store
  • 9. Spark HA: State of the World – Execution lineage in Driver • Recovery from lost RDDs – RDD Replication • Low latency, even with lost executors • Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”, “MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to more replicas if needed. – Speculative execution • Minimizing performance hit from stragglers – Off-heap data • Minimizing GC stalls
  • 10. Spark Architecture RPC Environment RPC Environment BlockManagerMasterEndpoint BlockManagerMaster Driver Executor - 2 BlockManager RPC Environment BlockManagerMaster Executor - 1 BlockManager RDD - 0 RDDBlock(0, Partition-1) RDDBlock(0, Partition-2)
  • 11. RDD Block Replication Executor-1 Executor-2Driver Compute RDD Computation complete Get Peers for replication List of Peers Replicate block to Peer Block stored locallyResults of computation
  • 12. RDD Block Replication: Challenges – Lost RDD partitions costly to recover • Data replenished at query time – RDD replicated to random executors • On YARN, multiple executors can be brought up on the same node in different containers • Hence multiple replicas possible on the same node/rack, susceptible to node/rack failure • Lost block replicas not recovered proactively
  • 13. Topology Aware Replication (SPARK-15352) – Ideas & Implementation by Shubham Chopra – Making Peer selection for replication pluggable • Driver gets topology information for executors • Executors informed about this topology information • Executors use prioritization logic to order peers for block replication • Pluggable TopologyMapper and BlockReplicationPrioritizer • Default implementation replicates current Spark behavior
  • 14. Topology Aware Replication (SPARK-15352) – Customizable prioritization strategies to suit different deployments • Variety of replication objectives – ReplicateToDifferentHost, ReplicateBlockWithinRack, ReplicateBlockOutsideRack • Optimizer to find a minimum number of peers to meet the objectives • Replicate to these peers with a higher priority – Proactive replenishment of lost replicas • BlockManagerMasterEndpoint triggered replenishment when an executor failure is detected.
  • 15. Spark HA: Challenges – High Availability of Spark Driver • High bootstrap cost to reconstructing cluster and cached state • Naïve HA models (such as multiple active clusters) surface query inconsistency – High Availability and Low Tail Latency closely related
  • 16. Spark HA – A Strawman • Multiple Spark Servers in Leader-Standby configuration • Each Spark Server backed by a different Spark Cluster • Each Spark Server refreshed with up-to- date data • Queries to standbys redirected to leader • Only leader responds to queries - Data consistency • RDD Partition loss in the leader still a concern • Performance still gated by slowest executor in leader • Resource usage amplified by the number of Spark Servers
  • 17. Spark Driver State • Spark Driver is an arbitrary Java application • Only a subset of the state is interesting or expensive to reconstruct • For online-use cases, only RDDs/DFs created during ingestion are of interest • Expressing ingestion using DFs has better decoupling of data/state than RDDs
  • 18. Spark Driver State* • BlockManagerMasterEndpoint holds Block<->Executor assignment • Cache Manager holds Logical Plan and DataFrame references – Used to short-circuit queries with pre-cached query plans, if possible • Job Scheduler – Keeps a track of various stages and tasks being scheduled • Executor information – Hostname and ports of live executors *Illustrative, not exhaustive
  • 19. Externalizing Driver State Benefits: – Quicker recoveries – No need to restart executors – State accessible from multiple Active-Active drivers Solutions: – Off-heap storage for RDDs – Residual book-keeping driver state externalized to ZooKeeper