2. Why Spark?
• Better support for
– Iterative algorithms
– Interactive data mining
• Fault tolerance, data locality, scalability
3. How
• In-memory data analytics
– RAM is the new disk…
• RAMCloud, H-Store, etc.
• Resilient Distributed Datasets (RDDs)
– Initial RDD on disk (HDFS, etc.)
– Intermediate RDDs in RAM
– Fault recovery based on lineage
– RDD operations are distributed
• Shared variables (see the sketch below)
– Accumulators
– Broadcast variables
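A minimal sketch of both shared-variable kinds, using the standard accumulator and broadcast APIs (variable names and sample data are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("SharedVariables"))

  // Accumulator: workers can only add to it; the driver reads the total
  val errorCount = sc.accumulator(0, "errors")

  // Broadcast variable: read-only data shipped to each worker once
  val stopWords = sc.broadcast(Set("the", "a", "of"))

  val words = sc.parallelize(Seq("the", "spark", "error", "of", "rdd"))
  words.foreach { w =>
    if (w == "error") errorCount += 1   // updated on the workers
  }
  val kept = words.filter(w => !stopWords.value.contains(w)).collect()
  println(s"errors = ${errorCount.value}, kept = ${kept.mkString(",")}")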
8. Some thoughts
• Intuitive potential improvements
– Better schedulers
• Cross-application scheduling (FIFO, fair scheduling)
• Job and task schedulers
– Native support for Cassandra and other DBMSs (see the sketch below)
• CassandraRDD, MongoRDD, etc.
• StratioDeep: an integration between Spark & Cassandra
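Such native support usually takes the form of an RDD subclass whose partitions map to database token ranges; a heavily simplified sketch under that assumption (SimpleCassandraRDD and the fetchRange helper are hypothetical, not the real StratioDeep or DataStax classes):
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition: one token range of the Cassandra table per Spark partition
  class TokenRangePartition(val index: Int, val start: Long, val end: Long) extends Partition

  class SimpleCassandraRDD(sc: SparkContext,
                           ranges: Seq[(Long, Long)],
                           fetchRange: (Long, Long) => Iterator[String])  // hypothetical DB fetch
    extends RDD[String](sc, Nil) {

    override protected def getPartitions: Array[Partition] =
      Array.tabulate[Partition](ranges.length) { i =>
        new TokenRangePartition(i, ranges(i)._1, ranges(i)._2)
      }

    override def compute(split: Partition, context: TaskContext): Iterator[String] = {
      val p = split.asInstanceOf[TokenRangePartition]
      fetchRange(p.start, p.end)   // each task queries only its own token range
    }
  }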
9. Spark: coarse-grained data processing
• Ill-suited for fine-grained operations
– Updating a few parts of an RDD
• A new RDD is created for every operation (see the sketch at the end of this slide)
– Memory usage is not economical
» A filtered RDD is exactly a subset of its parent RDD
– Involves sending tasks to all workers holding RDD partitions
• Actions
– Shadow copy
– Content-addressable memory
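A small illustration of the coarse-grained model criticized above: there is no in-place update, so even a tiny change yields a whole new RDD, and a filter that keeps almost everything still occupies its own cache space (the data and sizes are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("CoarseGrained"))

  val parent = sc.parallelize(1 to 1000000).cache()

  // No "update element 42 in place": any change means deriving a new RDD
  val filtered = parent.filter(_ != 42).cache()   // 999,999 of 1,000,000 elements now cached twice

  // The filter runs as tasks on every worker that holds a partition of `parent`
  println(filtered.count())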
10. Spark in operation with other frameworks
• E.g.:
– An HDFS file is not immutable
– A log writer appends records to an HDFS file
– Spark computes on that HDFS file
– What happens?
• Actions:
– Spark Streaming (see the sketch below)
– In some cases Spark Streaming probably won't help
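A minimal sketch of the Spark Streaming route, assuming the log writer drops new files into an HDFS directory (path and batch interval are illustrative); note that textFileStream only picks up new files, not appends to an existing file, which is why it may not help in the appended-file scenario above:
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val sc = new SparkContext(new SparkConf().setAppName("LogStream"))
  val ssc = new StreamingContext(sc, Seconds(10))   // micro-batch interval, arbitrary choice

  // Picks up files newly created in the directory for each batch
  val logs = ssc.textFileStream("hdfs:///logs/incoming")
  val errors = logs.filter(_.contains("ERROR"))
  errors.count().print()

  ssc.start()
  ssc.awaitTermination()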
11. RDDs backed by mutable disk-based data
• Normal file systems support update(offset, size)
• An RDD backed by a table in a DBMS?
• Actions
– How to re-compute only the RDD partitions affected by the updates?
12. Spark memory efficiency
• Leverage compression
– Trade compute for memory
– When no free memory is available for new RDD partitions
– During shuffles
• Reduced network consumption
• Eager deletion of intermediate RDDs (not to be confused with GC); see the sketch below
– Analyze the job code to identify intermediate RDDs before execution
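A hedged sketch of what Spark already offers along these lines: serialized, compressed caching plus explicit unpersist for eager deletion (application name and paths are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("MemoryEfficiency")
    .set("spark.rdd.compress", "true")   // compress serialized partitions: trades CPU for memory
  val sc = new SparkContext(conf)

  val parsed = sc.textFile("hdfs:///data/input").map(_.split(","))

  // Store partitions as serialized (and, with the flag above, compressed) bytes
  parsed.persist(StorageLevel.MEMORY_ONLY_SER)

  val result = parsed.filter(_.length > 3).count()

  // Eager deletion: drop the cached intermediate RDD as soon as it is no longer needed,
  // rather than waiting for LRU eviction
  parsed.unpersist()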
13. Unbalanced tasks in worker threads
• Context
– Hadoop RDD -> Filtered RDD
– Many small RDD partitions
– Few big RDD partitions
• Many transformations before real actions
– Chained unbalanced tasks -> some workers are much busier than others
– Job finish time = slowest worker's completion time
• Should implement a balancing RDD and inject it implicitly to reduce unbalanced tasks (see the sketch below)
– Balancing RDD = split big RDD partitions + merge small RDD partitions
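A minimal sketch of rebalancing with the operators Spark already has, assuming a selective filter left partition sizes skewed (path and partition counts are illustrative): repartition shuffles data to even out partitions, while coalesce only merges small ones without a full shuffle:
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Rebalance"))

  val raw = sc.textFile("hdfs:///data/logs")        // HadoopRDD
  val errors = raw.filter(_.contains("ERROR"))      // FilteredRDD with skewed partition sizes

  // Full shuffle: splits big partitions and spreads records evenly over 64 partitions
  val balanced = errors.repartition(64)

  // Cheaper when there are merely too many small partitions: merge without a full shuffle
  val merged = errors.coalesce(16)

  println(balanced.count())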
14. Topology-aware scheduling
• Well-known in distributed systems (DynamoDB, MPI, etc.)
• Choose reducers close to the data source (see the sketch below)
– Same rack, large memory, fast processor
• Assign big RDD partitions to big workers
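The hook Spark already exposes for this is per-partition preferred locations: a custom RDD can report the hosts that hold each partition's data, and the scheduler tries to place tasks there. A hedged sketch (class and host handling are illustrative; loading the actual data is omitted):
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  class HostAwarePartition(val index: Int, val hosts: Seq[String]) extends Partition

  class HostAwareRDD(sc: SparkContext, parts: Seq[HostAwarePartition])
    extends RDD[String](sc, Nil) {

    override protected def getPartitions: Array[Partition] = parts.toArray[Partition]

    // Locality hint: the scheduler prefers to run each partition's task on these hosts
    override protected def getPreferredLocations(split: Partition): Seq[String] =
      split.asInstanceOf[HostAwarePartition].hosts

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      Iterator.empty   // fetching the real data is out of scope for this sketch
  }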
15. Sharing RDDs among different applications
• Currently, each application runs in an isolated context => better isolation
• Motivation
– Scientists working on the same challenges, same input data
• Probably come up with the same job tasks during interactive processing
• Shared RDDs to reduce duplicated tasks
16. Single point of failure
• The Driver itself
– All scheduling and RDD information is held centrally at the driver
• Critical:
– Long-running jobs that involve thousands of intermediate RDDs
– How to recover from a Driver failure? Rerun from the beginning?
• Approach: checkpointing when persisting RDDs (see the sketch below)
– Save the context, scheduling information, RDD information, etc.
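For the RDD side of this, Spark already has checkpointing: it writes partitions to reliable storage and truncates the lineage, so a failure does not force re-computation from the very beginning. Recovering the driver's own scheduling state would still need extra work, as the slide suggests. A minimal sketch (paths and the loop are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Checkpointing"))
  sc.setCheckpointDir("hdfs:///spark/checkpoints")   // reliable storage for checkpoint files

  var rdd = sc.textFile("hdfs:///data/input").map(_.toLowerCase)

  for (i <- 1 to 1000) {
    rdd = rdd.map(identity)        // stand-in for a real iterative transformation
    if (i % 100 == 0) {
      rdd.checkpoint()             // marks this RDD; written out on the next action
      rdd.count()                  // force materialization so the checkpoint is saved
    }
  }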
17. Sum up
• Optimization is hard
– More work, but it favors only specific patterns
– Looks like playing with the parameters in research papers
• Put optimization aside for the future
– Build more RDD types: CassandraRDD, DBMS-RDD, etc.
– More transformations and actions
• Balancing RDD, etc.
– Memory shadowing + content-based addressing look promising for better memory efficiency
• Checkpointing for long-running jobs seems critical
19. Update {table} efficiency
• Update -> chain of RDDs
– Memory usage efficiency?
• Solutions:
– Instead of creating a new RDD -> a new version of the RDD (cf. versioning file systems)
– Or remove the previous RDD immediately
– Or make RDDs mutable (breaks the RDD design principle?)
20. Indexing to improve lookup
• Currently no index support
– Not easy, since a new RDD is created for each update
– https://groups.google.com/forum/#!topic/sharkusers/Ugv0uIvNjzU
• SELECT * FROM a WHERE a.id IN range()
– Becomes a .filter over the RDD (see the sketch below)
– Any SELECT becomes a full scan -> inefficient
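A hedged illustration of why such a SELECT turns into a full scan: without an index, the range predicate can only be expressed as a filter that visits every record in every partition (the Record schema, path, and range are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  case class Record(id: Long, payload: String)   // illustrative schema for table "a"

  val sc = new SparkContext(new SparkConf().setAppName("NoIndexScan"))

  val a = sc.textFile("hdfs:///tables/a").map { line =>
    val cols = line.split(",")
    Record(cols(0).toLong, cols(1))
  }

  // SELECT * FROM a WHERE a.id IN range(100, 200)
  // With no index, every partition is scanned and every row is tested
  val selected = a.filter(r => r.id >= 100L && r.id < 200L)
  selected.collect()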