Sudarshan Kadambi presented this talk at the Bay Area Spark Meetup @ Bloomberg. He covered the Bloomberg Apache Spark Server and Bloomberg's contributions to Apache Spark. The talk also covered the challenges of doing high-volume online analytics while still meeting strict SLAs.
4. Analytics at Bloomberg
• Human-time, interactive analytics
• Scalability
– Handle increasingly sophisticated client analytic workflows
– Ad-hoc and cross-domain aggregations, filtering
• Heterogeneous data stores
– Analytics often requires data from multiple stores
• Low-latency updates, in addition to queries
5. Spark for Bloomberg Analytics
• Distributed compute scales well for
– large security universes and
– multi-universe cross-domain queries
• Abstract away heterogeneous data sources and present consistent interface
for efficient data access
– Spark as a tool for systems integration
• Connectors and primitives to deal with incoming streams
• Cache intermediate compute for fast queries
6. Spark as a Service?
• Stand-alone Spark Apps on isolated clusters pose challenges:
– Redundancy in:
» Crafting and managing RDDs/DataFrames
» Coding the same or similar transforms/actions
– Management of clusters, replication of data, etc.
– Analytics confined to specific content sets, making cross-asset analytics much harder
– Real-time ingestion must be handled in each App
[Diagram: stand-alone Spark Apps each on their own Spark Cluster vs. a Spark Server hosting multiple Spark Apps on a shared Spark Cluster]
8. Spark Server: Content Caching
• Data access has long-tail characteristics
• High-value data is sub-set and cached within Spark
• The subset is specified as a filter predicate at registration time
• Seamless unification of data in Spark and the backing store
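The registration-time filter predicate can be pictured with a minimal pure-Python sketch (the class and names below are hypothetical; Bloomberg's Spark Server API is not public):

```python
# Illustrative sketch of predicate-based content caching (hypothetical API,
# not Bloomberg's actual Spark Server interface).

class CachedContentSet:
    """Caches the subset of a backing store matching a filter predicate,
    registered once; queries fall through to the store on a miss."""

    def __init__(self, backing_store, predicate):
        self.store = backing_store      # full data set (long tail)
        self.predicate = predicate      # registration-time filter
        # High-value rows are materialized in memory up front.
        self.cache = {k: v for k, v in backing_store.items() if predicate(v)}

    def get(self, key):
        # Seamless unification: serve from cache when possible,
        # otherwise fall back to the backing store.
        if key in self.cache:
            return self.cache[key]
        return self.store.get(key)

store = {"AAPL": {"volume": 900}, "XYZ": {"volume": 3}}
hot = CachedContentSet(store, lambda row: row["volume"] > 100)
assert "AAPL" in hot.cache and "XYZ" not in hot.cache
assert hot.get("XYZ") == {"volume": 3}  # long-tail key served from the store
```

The point of the predicate is that only the high-value head of the distribution pays the memory cost, while the long tail remains queryable through the same interface.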
9. Spark HA: State of the World
– Execution lineage in Driver
• Recovery from lost RDDs
– RDD Replication
• Low latency, even with lost executors
• Support for “MEMORY_ONLY”, “MEMORY_ONLY_2”, “MEMORY_ONLY_SER”,
“MEMORY_ONLY_SER_2” modes for in-memory persistence. Easily extensible to
more replicas if needed.
– Speculative execution
• Minimizing performance hit from stragglers
– Off-heap data
• Minimizing GC stalls
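The listed persistence modes differ only in replica count and serialization format; a small pure-Python sketch of how the names decode (the decoder function is illustrative, not part of Spark):

```python
def decode_storage_level(name: str):
    """Decode Spark StorageLevel names in the MEMORY_ONLY family:
    a trailing _2 means two in-memory replicas, _SER means the block
    is stored serialized rather than as deserialized Java objects."""
    assert name.startswith("MEMORY_ONLY")
    replicas = 2 if name.endswith("_2") else 1
    serialized = "_SER" in name
    return {"replicas": replicas, "serialized": serialized}

assert decode_storage_level("MEMORY_ONLY") == {"replicas": 1, "serialized": False}
assert decode_storage_level("MEMORY_ONLY_SER_2") == {"replicas": 2, "serialized": True}
```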
11. RDD Block Replication
[Sequence diagram: Driver, Executor-1, Executor-2]
1. Driver asks Executor-1 to compute the RDD block
2. Executor-1 completes the computation and stores the block locally
3. Executor-1 asks the Driver for peers to replicate to
4. Driver returns the list of peers
5. Executor-1 replicates the block to a peer (Executor-2)
6. Results of the computation are returned to the Driver
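The replication exchange can be simulated minimally in pure Python (an illustrative stand-in; Spark's actual BlockManager messages are internal and considerably more involved):

```python
# Minimal simulation of the block-replication handshake.

def replicate_block(all_executors, local_executor, block):
    # 1. The executor computes the block and stores it locally.
    local_executor["blocks"].append(block)
    # 2. The executor asks the driver for candidate peers (everyone but itself).
    peers = [p for p in all_executors if p is not local_executor]
    # 3. The executor pushes the block to the first peer returned.
    if peers:
        peers[0]["blocks"].append(block)
    return peers

e1 = {"name": "executor-1", "blocks": []}
e2 = {"name": "executor-2", "blocks": []}
replicate_block([e1, e2], e1, "rdd_0_partition_3")
assert "rdd_0_partition_3" in e1["blocks"]
assert "rdd_0_partition_3" in e2["blocks"]
```

Note that peer selection here is blind, which is exactly the weakness the next slide describes.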
12. RDD Block Replication: Challenges
– Lost RDD partitions costly to recover
• Data replenished at query time
– RDD replicated to random executors
• On YARN, multiple executors can be brought up on the same node in
different containers
• Hence multiple replicas possible on the same node/rack, susceptible to
node/rack failure
• Lost block replicas not recovered proactively
13. Topology Aware Replication (SPARK-15352)
– Ideas & Implementation by Shubham Chopra
– Making Peer selection for replication pluggable
• Driver gets topology information for executors
• Executors informed about this topology information
• Executors use prioritization logic to order peers for block replication
• Pluggable TopologyMapper and BlockReplicationPrioritizer
• Default implementation replicates current Spark behavior
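The prioritization step can be sketched in pure Python (names and ranking are illustrative, in the spirit of SPARK-15352, not Spark's internal Scala API):

```python
# Sketch of topology-aware peer prioritization: prefer peers whose loss is
# least correlated with the loss of the block's original copy.

def prioritize_peers(self_topology, peers):
    """Order candidate peers: different-rack hosts first, then same-rack
    different-host, then same-host (least preferred)."""
    def rank(peer):
        if peer["host"] == self_topology["host"]:
            return 2   # same node: survives neither node nor rack failure
        if peer["rack"] == self_topology["rack"]:
            return 1   # same rack, different node
        return 0       # different rack: most failure-tolerant
    return sorted(peers, key=rank)

me = {"host": "h1", "rack": "r1"}
peers = [{"host": "h1", "rack": "r1"},
         {"host": "h2", "rack": "r1"},
         {"host": "h3", "rack": "r2"}]
ordered = prioritize_peers(me, peers)
assert ordered[0]["rack"] == "r2"   # off-rack peer ranked first
```

Making this ordering pluggable is what lets one deployment prefer rack diversity while another, say, prefers network locality.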
14. Topology Aware Replication (SPARK-15352)
– Customizable prioritization strategies to suit different deployments
• Variety of replication objectives – ReplicateToDifferentHost,
ReplicateBlockWithinRack, ReplicateBlockOutsideRack
• Optimizer to find a minimum number of peers to meet the objectives
• Replicate to these peers with a higher priority
– Proactive replenishment of lost replicas
• BlockManagerMasterEndpoint triggers replenishment when an executor
failure is detected
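Finding a minimum set of peers that jointly satisfies the objectives is a small set-cover problem; a greedy pure-Python sketch (objective names are from the slide, the functions are illustrative, not Spark's code):

```python
# Greedy sketch: pick as few peers as possible while covering all
# requested replication objectives.

def objectives_met(self_topo, peer):
    met = set()
    if peer["host"] != self_topo["host"]:
        met.add("ReplicateToDifferentHost")
        if peer["rack"] == self_topo["rack"]:
            met.add("ReplicateBlockWithinRack")
    if peer["rack"] != self_topo["rack"]:
        met.add("ReplicateBlockOutsideRack")
    return met

def min_peers(self_topo, peers, objectives):
    remaining, chosen = set(objectives), []
    pool = list(peers)
    while remaining and pool:
        # Greedily take the peer covering the most unmet objectives.
        best = max(pool, key=lambda p: len(objectives_met(self_topo, p) & remaining))
        if not (objectives_met(self_topo, best) & remaining):
            break  # objectives unsatisfiable with the available peers
        chosen.append(best)
        remaining -= objectives_met(self_topo, best)
        pool.remove(best)
    return chosen

me = {"host": "h1", "rack": "r1"}
peers = [{"host": "h2", "rack": "r1"}, {"host": "h3", "rack": "r2"}]
picked = min_peers(me, peers, ["ReplicateBlockWithinRack",
                               "ReplicateBlockOutsideRack"])
assert len(picked) == 2   # both objectives need a distinct peer here
```

Replicating to the chosen peers first gives the failure-domain coverage the objectives encode before any best-effort copies are made.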
15. Spark HA: Challenges
– High Availability of Spark Driver
• High bootstrap cost of reconstructing cluster and cached state
• Naïve HA models (such as multiple active clusters) surface query
inconsistency
– High Availability and Low Tail Latency closely related
16. Spark HA – A Strawman
• Multiple Spark Servers in a Leader-Standby configuration
• Each Spark Server backed by a different Spark Cluster
• Each Spark Server refreshed with up-to-date data
• Queries to standbys redirected to the leader
• Only the leader responds to queries - data consistency
• RDD partition loss in the leader is still a concern
• Performance is still gated by the slowest executor in the leader
• Resource usage is amplified by the number of Spark Servers
17. Spark Driver State
• Spark Driver is an arbitrary Java application
• Only a subset of the state is interesting or expensive to reconstruct
• For online-use cases, only RDDs/DFs created during ingestion are of
interest
• Expressing ingestion using DFs has better decoupling of data/state than
RDDs
18. Spark Driver State*
• BlockManagerMasterEndpoint holds Block<->Executor assignment
• Cache Manager holds Logical Plan and DataFrame references
– Used to short-circuit queries with pre-cached query plans, if possible
• Job Scheduler
– Keeps track of the various stages and tasks being scheduled
• Executor information
– Hostname and ports of live executors
*Illustrative, not exhaustive
19. Externalizing Driver State
Benefits:
– Quicker recoveries
– No need to restart executors
– State accessible from multiple Active-Active drivers
Solutions:
– Off-heap storage for RDDs
– Residual book-keeping driver state externalized to ZooKeeper
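The book-keeping side can be sketched in pure Python, with a dict standing in for the ZooKeeper namespace (paths and schema are hypothetical):

```python
# Sketch of externalizing residual driver book-keeping so a replacement
# driver can recover without restarting executors. A dict stands in for
# ZooKeeper; the znode paths below are illustrative.

class ExternalState:
    def __init__(self):
        self.kv = {}                   # stand-in for a ZooKeeper namespace

    def put(self, path, value):
        self.kv[path] = value

    def get(self, path):
        return self.kv.get(path)

zk = ExternalState()
# The active driver records block placement as it schedules work.
zk.put("/spark/blocks/rdd_0_3", ["executor-1", "executor-2"])

# A failed-over driver reads the placement back instead of recomputing it.
recovered = zk.get("/spark/blocks/rdd_0_3")
assert recovered == ["executor-1", "executor-2"]
```

Because the expensive state (the RDD blocks themselves) lives off-heap on the executors, only this small mapping needs to survive a driver failover.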