Spark and Online Analytics: Spark Summit East talk by Shubham Chopra

© 2017 Bloomberg Finance L.P. All rights reserved.
February 9, 2017
Shubham Chopra
Software Engineer
Spark and Online Analytics
Spark Summit East 2017

Agenda
• Data and Analytics at Bloomberg
• The role of Spark
• The Bloomberg Spark Server
• Spark for online use cases

Data and Analytics are our Business

Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
• Handle increasingly sophisticated client analytic workflows
• Ad-hoc and cross-domain aggregations, filtering
• Heterogeneous data stores
• Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries

Spark for Bloomberg Analytics
• Distributed compute scales well for:
• Large security universes
• Multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present a consistent interface for
efficient data access
• Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries

Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose challenges:
• Redundancy in:
• Crafting and managing RDDs/DFs
• Coding of the same or similar types of transforms/actions
• Management of clusters, replication of data, etc.
• Analytics are confined to specific content sets making cross-asset
analytics much harder
• Need to handle real-time ingestion in each App

Bloomberg Spark Server
• A single long-running Spark application
• Analytics deployed as Request Processors and served via a REST
API
• Can be deployed on YARN, Mesos, or standalone
• Ingest time transforms to load data in Spark from a backing store
• Query time transforms to run analytics on the ingested data in
Spark
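The split between ingest-time and query-time transforms can be illustrated with a small, Spark-free sketch. Everything here (MiniSparkServer, ingest, register, handle, the /mean_price route) is a hypothetical stand-in for illustration, not the Bloomberg Spark Server API; plain Python lists take the place of RDDs/DFs.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class MiniSparkServer:
    """Toy stand-in for a single long-running application (hypothetical API)."""

    def __init__(self) -> None:
        self._datasets: Dict[str, List[Row]] = {}
        self._processors: Dict[str, Callable[[List[Row], dict], object]] = {}

    def ingest(self, name: str, source_rows: List[Row],
               transform: Callable[[Row], Row]) -> None:
        # Ingest-time transform: runs once, the result stays resident.
        self._datasets[name] = [transform(r) for r in source_rows]

    def register(self, route: str,
                 processor: Callable[[List[Row], dict], object]) -> None:
        # A "request processor" is just a function bound to a route here.
        self._processors[route] = processor

    def handle(self, route: str, dataset: str, params: dict) -> object:
        # Query-time transform: runs per request against ingested data.
        return self._processors[route](self._datasets[dataset], params)


# Rows as they might arrive from a backing store (made-up sample data).
backing_store = [
    {"ticker": "AAA", "px": "10.5"},
    {"ticker": "BBB", "px": "20.0"},
    {"ticker": "AAA", "px": "11.5"},
]

server = MiniSparkServer()
server.ingest("prices", backing_store,
              lambda r: {"ticker": r["ticker"], "px": float(r["px"])})


def mean_price(rows: List[Row], params: dict) -> float:
    px = [r["px"] for r in rows if r["ticker"] == params["ticker"]]
    return sum(px) / len(px)


server.register("/mean_price", mean_price)
result = server.handle("/mean_price", "prices", {"ticker": "AAA"})
```

In a real deployment the handle step would sit behind a REST endpoint and the datasets would be cached RDDs/DFs; the sketch only shows why parsing and cleaning belong at ingest time while per-request logic stays small.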

Spark Server: Content Caching
• Data access has long-tail characteristics
• High-value data is subsetted within Spark
• Specified as a filter predicate at time of registration
• Seamless unification of data in Spark and backing store
• Reliability?
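One way to picture the unification of cached and backing-store data is a dataset that caches a hot subset and falls back to the store only for the part of a query the cache cannot cover. The sketch below is an assumption-laden simplification: the class UnifiedDataset is hypothetical, and the registration-time filter predicate is reduced to "year >= hot_from_year" so that cache coverage is easy to decide.

```python
class UnifiedDataset:
    """Hypothetical sketch: hot subset in memory, cold remainder in a backing store."""

    def __init__(self, backing_rows, hot_from_year):
        self._backing = backing_rows          # stand-in for the backing store
        self._hot_from = hot_from_year        # filter predicate: year >= hot_from_year
        self._cache = [r for r in backing_rows if r["year"] >= hot_from_year]
        self.backing_reads = 0                # counts trips to the backing store

    def query(self, lo, hi):
        # Serve whatever the cache covers.
        rows = [r for r in self._cache if lo <= r["year"] <= hi]
        # Fetch only the uncovered part of the range from the backing store.
        if lo < self._hot_from:
            self.backing_reads += 1
            cold_hi = min(hi, self._hot_from - 1)
            rows += [r for r in self._backing if lo <= r["year"] <= cold_hi]
        return sorted(rows, key=lambda r: r["year"])


store = [{"year": y, "value": y * 2} for y in range(2010, 2017)]
dataset = UnifiedDataset(store, hot_from_year=2015)

hot_only = dataset.query(2015, 2016)    # answered entirely from the cache
reads_after_hot = dataset.backing_reads
mixed = dataset.query(2013, 2015)       # cache plus one backing-store read
reads_after_mixed = dataset.backing_reads
```

The caller sees one result set in both cases, which is the "seamless" part; the cost difference shows up only in the backing-store read count.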

Spark HA: State of the World
• Execution lineage in Driver
• Recovery from lost RDDs
• RDD Replication
• Low latency, even with lost executors
• Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”,
“MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to more
replicas if needed.
• Speculative execution
• Minimizing performance hit from stragglers
• Off-heap data
• Minimizing GC stalls
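As a rough illustration, speculative execution and off-heap memory are switched on through Spark configuration properties, while replicated storage levels such as MEMORY_ONLY_2 are chosen in application code when persisting an RDD. The sketch below only assembles spark-submit style arguments; the 4g off-heap size is an assumed value, and to_submit_args is a hypothetical helper.

```python
# Spark configuration properties related to the HA features listed above.
# The off-heap size is an assumption for illustration; tune it per workload.
ha_conf = {
    "spark.speculation": "true",              # re-launch slow (straggler) tasks
    "spark.memory.offHeap.enabled": "true",   # keep data outside the JVM heap
    "spark.memory.offHeap.size": "4g",        # required when off-heap is enabled
}


def to_submit_args(conf):
    """Turn a config dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = to_submit_args(ha_conf)
```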

Spark Architecture

RDD Block Replication
[Sequence diagram. Participants: Driver, Executor-1, Executor-2. Steps shown: Compute RDD; Computation complete; Get Peers for replication; List of Peers; Replicate block to Peer; Block stored locally; Results of computation.]

RDD Block Replication: Challenges
• Lost RDD partitions costly to recover
• Data replenished at query time
• RDD replicated to random executors
• On YARN, multiple executors can be brought up on the same node in different containers
• Hence multiple replicas possible on the same node/rack, susceptible to node/rack failure
• Lost block replicas not recovered proactively

TopologyAware Replication (SPARK-15352)
• Making Peer selection for replication pluggable
• Driver gets topology information for executors
• Executors informed about this topology information
• Executors use prioritization logic to order peers for block replication
• Pluggable TopologyMapper and BlockReplicationPrioritizer
• Default implementation replicates current Spark behavior
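The pluggable design can be sketched as two small interfaces: a mapper from host to topology (rack) information, and a prioritizer that orders candidate peers. The Python classes below are hypothetical illustrations of that shape, not Spark's actual classes; the default pair mimics random peer selection with no topology knowledge.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Peer:
    executor_id: str
    host: str
    rack: Optional[str] = None


class NoTopologyMapper:
    """Default: no topology information is known for any host."""

    def rack_for(self, host: str) -> Optional[str]:
        return None


class DictTopologyMapper:
    """Looks racks up in a host-to-rack table supplied by the deployment."""

    def __init__(self, host_to_rack: Dict[str, str]) -> None:
        self._host_to_rack = dict(host_to_rack)

    def rack_for(self, host: str) -> Optional[str]:
        return self._host_to_rack.get(host)


class RandomPrioritizer:
    """Default: a random ordering of peers, ignoring topology."""

    def prioritize(self, me: Peer, peers: List[Peer], rng: random.Random) -> List[Peer]:
        shuffled = list(peers)
        rng.shuffle(shuffled)
        return shuffled


class DifferentHostFirstPrioritizer:
    """Prefers peers on other hosts; same-host peers go last."""

    def prioritize(self, me: Peer, peers: List[Peer], rng: random.Random) -> List[Peer]:
        shuffled = list(peers)
        rng.shuffle(shuffled)
        # Stable sort: False (different host) sorts before True (same host).
        return sorted(shuffled, key=lambda p: p.host == me.host)


mapper = DictTopologyMapper({"h1": "r1", "h2": "r1", "h3": "r2"})
me = Peer("e1", "h1", mapper.rack_for("h1"))
peers = [Peer("e2", "h1", mapper.rack_for("h1")),
         Peer("e3", "h2", mapper.rack_for("h2")),
         Peer("e4", "h3", mapper.rack_for("h3"))]

rng = random.Random(0)
random_order = RandomPrioritizer().prioritize(me, peers, rng)
host_aware_order = DifferentHostFirstPrioritizer().prioritize(me, peers, rng)
```

Swapping the prioritizer changes placement policy without touching the replication path itself, which is the point of making peer selection pluggable.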

TopologyAware Replication (SPARK-15352)
• Customizable prioritization strategies to suit different
deployments
• Variety of replication objectives – ReplicateToDifferentHost,
ReplicateBlockWithinRack, ReplicateBlockOutsideRack
• Optimizer to find a minimum number of peers to meet the
objectives
• Replicate to these peers with a higher priority
• Proactive replenishment of lost replicas
• BlockManagerMasterEndpoint triggers replenishment when an
executor failure is detected
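A minimal sketch of the two ideas above, under stated assumptions: the objective names are simplified labels, pick_priority_peers uses a greedy heuristic (which approximates, but does not guarantee, a minimum set of peers), and blocks_to_replenish is a hypothetical helper showing what a master-side check could compute once an executor is reported lost. None of this is Spark's implementation.

```python
from typing import Dict, List, NamedTuple, Set


class Peer(NamedTuple):
    executor_id: str
    host: str
    rack: str


def objectives_met(me: Peer, peer: Peer) -> Set[str]:
    """Which replication objectives a replica on `peer` would satisfy."""
    met = set()
    if peer.host != me.host:
        met.add("different_host")
    met.add("within_rack" if peer.rack == me.rack else "outside_rack")
    return met


def pick_priority_peers(me: Peer, peers: List[Peer], objectives: Set[str]) -> List[Peer]:
    """Greedily pick peers until every objective is covered (or none can help)."""
    unmet = set(objectives)
    remaining = list(peers)
    chosen: List[Peer] = []
    while unmet and remaining:
        best = max(remaining, key=lambda p: len(objectives_met(me, p) & unmet))
        gain = objectives_met(me, best) & unmet
        if not gain:
            break
        chosen.append(best)
        unmet -= gain
        remaining.remove(best)
    return chosen


def blocks_to_replenish(block_locations: Dict[str, Set[str]],
                        failed_executor: str,
                        target_replicas: int) -> Dict[str, int]:
    """Blocks that still have a live copy but fewer replicas than the target."""
    needs = {}
    for block, holders in block_locations.items():
        live = holders - {failed_executor}
        if live and len(live) < target_replicas:
            needs[block] = target_replicas - len(live)
    return needs


me = Peer("e1", "h1", "r1")
peers = [Peer("e2", "h1", "r1"), Peer("e3", "h2", "r1"), Peer("e4", "h3", "r2")]
chosen = pick_priority_peers(
    me, peers, {"different_host", "within_rack", "outside_rack"})

locations = {"rdd_0_0": {"e1", "e3"}, "rdd_0_1": {"e2", "e4"}, "rdd_0_2": {"e3"}}
to_fix = blocks_to_replenish(locations, failed_executor="e3", target_replicas=2)
```

Here two peers cover all three objectives: one on another host in the same rack, one in another rack. The block whose only copy lived on the failed executor is not listed for replenishment because no live replica remains to copy from; it would have to be recomputed from lineage.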

Spark HA: Challenges
• High Availability of Spark Driver
• High bootstrap cost of reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query
inconsistency
• High Availability and Low Tail Latency closely related

Spark HA – A Strawman
• Multiple Spark Servers in Leader-Standby
configuration
• Each Spark Server backed by a different Spark
Cluster
• Each Spark Server refreshed with up-to-date
data
• Queries to standbys redirected to leader
• Only leader responds to queries - Data
consistency
• RDD Partition loss in the leader still a concern
• Performance still gated by slowest executor in
leader
• Resource usage amplified by the number of
Spark Servers
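The strawman can be modeled in a few lines: every server receives each refresh, but only the leader answers, and standbys forward queries to it. The classes and methods below (SparkServerNode, LeaderStandbyGroup, fail_leader) are hypothetical names for illustration, and promotion of the next standby is an assumed failover rule.

```python
class SparkServerNode:
    """One Spark Server backed by its own cluster (modeled as a dict)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.is_leader = False
        self.served = 0

    def answer(self, key):
        self.served += 1
        return self.data[key]


class LeaderStandbyGroup:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.nodes[0].is_leader = True

    @property
    def leader(self):
        return next(n for n in self.nodes if n.is_leader)

    def refresh(self, key, value):
        # Every server is kept up to date, which multiplies resource usage.
        for node in self.nodes:
            node.data[key] = value

    def query(self, entry_node, key):
        # Standbys redirect to the leader so all clients see one consistent view.
        target = entry_node if entry_node.is_leader else self.leader
        return target.answer(key)

    def fail_leader(self):
        # Assumed failover rule: drop the leader and promote the next standby.
        self.nodes.remove(self.leader)
        self.nodes[0].is_leader = True


a, b = SparkServerNode("a"), SparkServerNode("b")
group = LeaderStandbyGroup([a, b])
group.refresh("px", 10)

first = group.query(b, "px")      # sent to standby b, answered by leader a
served_before = (a.served, b.served)

group.fail_leader()
second = group.query(b, "px")     # b is now the leader and answers itself
```

The sketch also makes the listed drawbacks visible: the standby holds a full copy of the data while serving nothing, and a slow or degraded leader still gates every query.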

Spark Driver State
• Spark Driver is an arbitrary Java application
• Only a subset of the state is interesting or expensive to reconstruct
• For online use cases, only RDDs/DFs created during ingestion are of interest
• Expressing ingestion using DFs decouples data from state better than using
RDDs

Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor
assignment
• Cache Manager holds Logical Plan and DataFrame references
• Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
• Keeps track of various stages and tasks being scheduled
• Executor information
• Hostname and ports of live executors
*Illustrative, not exhaustive

Externalizing Driver State
Benefits:
• Quicker recoveries
• No need to restart executors
• State accessible from multiple Active-Active drivers
Solutions:
• Off-heap storage for RDDs
• Residual book-keeping driver state externalized to ZooKeeper
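The recovery benefit can be sketched with a driver that writes its block-location book-keeping to an external store on every change, so a replacement driver starts from that state instead of rebuilding it. The ExternalStore class below is an in-memory stand-in for a service such as ZooKeeper, and the Driver class, its methods, and the state path are all hypothetical.

```python
import json


class ExternalStore:
    """In-memory stand-in for an external coordination service."""

    def __init__(self):
        self._kv = {}

    def put(self, path, value):
        # Serialize so that nothing is shared by reference with the driver.
        self._kv[path] = json.dumps(value)

    def get(self, path, default=None):
        raw = self._kv.get(path)
        return default if raw is None else json.loads(raw)


class Driver:
    STATE_PATH = "/spark-server/block-locations"  # hypothetical path

    def __init__(self, store):
        self._store = store
        # On startup, pick up whatever a previous driver externalized.
        self.block_locations = store.get(self.STATE_PATH, {})

    def record_block(self, block_id, executor_id):
        holders = self.block_locations.setdefault(block_id, [])
        if executor_id not in holders:
            holders.append(executor_id)
        # Write through on every change so the external copy never lags.
        self._store.put(self.STATE_PATH, self.block_locations)


store = ExternalStore()

driver1 = Driver(store)
driver1.record_block("rdd_0_0", "e1")
driver1.record_block("rdd_0_0", "e2")
del driver1                       # simulate a driver crash

driver2 = Driver(store)           # replacement driver, executors untouched
recovered = driver2.block_locations
```

Because the executors and their off-heap blocks outlive the driver in this model, the new driver only needs the book-keeping to resume serving; the same external state is also what multiple active drivers would read.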

Quorum of Drivers

THANK YOU
schopra31@bloomberg.net
