Some thoughts on Apache Spark & Shark

  1. Some thoughts on Spark & Shark (Viet-Trung Tran)
  2. Why Spark?
     • Better support for
       – Iterative algorithms
       – Interactive data mining
     • Fault tolerance, data locality, scalability
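     A minimal Scala sketch of the iterative-algorithm point: the input is
     loaded once, cached in memory, and reused on every iteration (the file
     path and the toy computation are placeholders, not from the slides).

         import org.apache.spark.SparkContext._ // double-RDD implicits (Spark 1.x)
         import org.apache.spark.{SparkConf, SparkContext}

         val sc   = new SparkContext(new SparkConf().setAppName("iterative-sketch"))
         val data = sc.textFile("hdfs:///points.txt").map(_.toDouble).cache()

         // Toy iteration: each pass reuses the cached RDD instead of
         // re-reading the input from HDFS.
         var estimate = 0.0
         for (_ <- 1 to 10) {
           val mean = data.sum() / data.count()
           estimate += (mean - estimate) / 2
         }
         println(s"estimate = $estimate")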
  3. How?
     • In-memory data analytics
       – RAM is the new disk: memory clouds, H-Store, etc.
     • Resilient Distributed Datasets (RDDs)
       – Initial RDD on disk (HDFS, etc.)
       – Intermediate RDDs in RAM
       – Fault recovery based on lineage
       – RDD operations are distributed
     • Shared variables
       – Accumulators
       – Broadcast variables
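     A minimal Scala sketch of the concepts above: lineage-carrying RDD
     transformations plus the two kinds of shared variables. The paths, error
     codes, and log format are placeholders.

         import org.apache.spark.{SparkConf, SparkContext}

         val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))

         // Initial RDD on disk; the filtered RDD lives in RAM and remembers
         // its lineage, so a lost partition can be recomputed from the parent.
         val lines  = sc.textFile("hdfs:///logs/app.log")
         val errors = lines.filter(_.contains("ERROR")).cache()

         // Broadcast variable: read-only data shipped once to every worker.
         val severeCodes = sc.broadcast(Set("E42", "E99"))
         // Accumulator: workers add to it, only the driver reads the result.
         val matched = sc.accumulator(0)

         errors.foreach { line =>
           if (severeCodes.value.exists(line.contains)) matched += 1
         }
         println(s"severe errors: ${matched.value}")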
  4. Single-application mode
     • Driver
       – RDD graph
       – Scheduler
       – Block tracker
       – Shuffle tracker
     • Worker
       – Task threads
       – Block manager
     • A driver runs a set of jobs; each job is a set of tasks
     [Diagram: the driver sends tasks to the workers and collects results; each worker caches the RDD blocks it holds]
  5. Cluster mode
     • Scheduling across applications (drivers)
     • Cluster managers: Mesos, YARN, Standalone
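     A minimal sketch of how an application selects its cluster manager via
     the master URL (host names and ports are placeholders; the YARN string is
     the Spark 1.x-era syntax).

         import org.apache.spark.SparkConf

         val conf = new SparkConf().setAppName("cluster-sketch")

         conf.setMaster("spark://master:7077")    // Standalone
         // conf.setMaster("mesos://master:5050") // Mesos
         // conf.setMaster("yarn-client")         // YARN, client mode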
  6. Spark in the Berkeley Data Analytics Stack
  7. Some thoughts
  8. Some thoughts
     • Intuitive potential improvements
       – Better schedulers
         • Cross-application schedulers (FIFO, fair scheduling)
         • Job and task schedulers
       – Native support for Cassandra and other DBMSs
         • CassandraRDD, MongoRDD, etc. (a hypothetical skeleton follows)
         • StratioDeep: an integration between Spark & Cassandra
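     A hypothetical skeleton of such a CassandraRDD. Only getPartitions and
     compute are Spark's real RDD extension points; everything
     Cassandra-specific here (the token-range partitioning, fetchRows) is an
     assumption for illustration.

         import org.apache.spark.{Partition, SparkContext, TaskContext}
         import org.apache.spark.rdd.RDD

         // One Spark partition per Cassandra token range (placeholder split).
         case class CassandraPartition(index: Int, tokenRange: (Long, Long))
           extends Partition

         class CassandraRDD(sc: SparkContext, keyspace: String, table: String)
           extends RDD[Map[String, Any]](sc, Nil) {

           override protected def getPartitions: Array[Partition] =
             Array.tabulate[Partition](4)(i =>
               CassandraPartition(i, (i * 100L, (i + 1) * 100L)))

           // Pull the rows covered by this partition's token range.
           override def compute(split: Partition,
                                ctx: TaskContext): Iterator[Map[String, Any]] = {
             val p = split.asInstanceOf[CassandraPartition]
             fetchRows(keyspace, table, p.tokenRange)
           }

           // Stub standing in for a real Cassandra driver call.
           private def fetchRows(ks: String, t: String,
                                 range: (Long, Long)): Iterator[Map[String, Any]] =
             Iterator.empty
         }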
  9. Spark: coarse-grained data processing
     • Ill-suited for fine-grained operations, e.g. updating a few parts of an RDD
       – A new RDD is created for every operation (see the sketch below)
         » Memory usage is not economical: a filtered RDD is just a part of its parent
       – Every operation involves sending tasks to all workers holding RDD partitions
     • Possible actions
       – Shadow copying
       – Content-addressable memory
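     A minimal sketch of the coarse-grained limitation: "updating" one element
     means deriving a whole new RDD, touching every partition.

         val nums = sc.parallelize(1 to 1000000)

         // Change the single element equal to 42; all others are copied through.
         val updated = nums.map(x => if (x == 42) -42 else x)

         // A filter likewise yields a brand-new RDD even though its contents
         // are just a subset of the parent's.
         val evens = nums.filter(_ % 2 == 0)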
  10. Spark in operation with other frameworks
      • Example
        – An HDFS file is not immutable: a log writer keeps appending to it
        – Spark computes on that HDFS file
        – What happens?
      • Possible actions
        – Spark Streaming
        – In some cases Spark Streaming probably won't help
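      A minimal Spark Streaming sketch for the scenario above (the directory
      path and batch interval are placeholders). Note that textFileStream
      picks up new files appearing in a directory, not appends to an existing
      file, which is one case where streaming won't help.

          import org.apache.spark.SparkConf
          import org.apache.spark.streaming.{Seconds, StreamingContext}

          val ssc = new StreamingContext(
            new SparkConf().setAppName("stream-sketch"), Seconds(10))

          val logs = ssc.textFileStream("hdfs:///logs/incoming")
          logs.filter(_.contains("ERROR")).count().print()

          ssc.start()
          ssc.awaitTermination()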
  11. RDDs backed by mutable disk-based data
      • Normal file systems support update(offset, size)
      • Could an RDD be backed by a table in a DBMS?
      • Open question
        – How to recompute only the RDD partitions affected by the updates?
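      Spark's JdbcRDD already gives a taste of a DBMS-backed RDD (the
      connection URL, table, and key range below are placeholders). It only
      snapshots the table at read time, though; it does not answer the
      recomputation question above.

          import java.sql.DriverManager
          import org.apache.spark.rdd.JdbcRDD

          val rows = new JdbcRDD(
            sc,
            () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb"),
            "SELECT id, value FROM events WHERE id >= ? AND id <= ?",
            1, 1000000, 8, // key bounds and number of partitions
            rs => (rs.getLong(1), rs.getString(2)))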
  12. Spark memory efficiency
      • Leverage compression: trade CPU time for memory
        – When no free memory is available for new RDD partitions
        – During shuffles (also reduces network consumption)
      • Eager deletion of intermediate RDDs (not to be confused with GC)
        – Analyze the job code before execution to identify intermediate RDDs
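      A sketch using knobs Spark 1.x already exposes for the compression
      trade-off; eager deletion is only approximated today by an explicit
      unpersist once downstream results are materialized.

          import org.apache.spark.SparkConf
          import org.apache.spark.storage.StorageLevel

          val conf = new SparkConf()
            .setAppName("compression-sketch")
            .set("spark.rdd.compress", "true")     // compress serialized RDD blocks
            .set("spark.shuffle.compress", "true") // compress shuffle output

          // spark.rdd.compress only applies to partitions stored serialized:
          //   rdd.persist(StorageLevel.MEMORY_ONLY_SER)
          // Manual "eager deletion" of an intermediate RDD:
          //   intermediate.unpersist()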
  13. Unbalanced tasks across worker threads
      • Context
        – HadoopRDD -> FilteredRDD
        – Many small RDD partitions
        – A few big RDD partitions
      • Many transformations before any real action
        – Chained unbalanced tasks: some workers are far busier than others
        – Job completion time = slowest worker's completion time
      • Should implement a balance RDD and inject it implicitly to reduce unbalanced tasks
        – Balance RDD = split big RDD partitions + aggregate small RDD partitions (see the sketch below)
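      Spark's existing repartition already approximates the proposed balance
      RDD by reshuffling data into evenly sized partitions (the path, filter,
      and partition count are placeholders).

          val lines    = sc.textFile("hdfs:///logs/app.log")  // HadoopRDD
          val filtered = lines.filter(_.contains("ERROR"))    // possibly skewed

          // Reshuffle into 64 roughly equal partitions before the long chain
          // of transformations, so no single worker becomes the straggler.
          val balanced = filtered.repartition(64)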
  14. Topology-aware scheduling
      • Well known in distributed systems (DynamoDB, MPI, etc.)
      • Choose reducers close to the data source
        – Same rack, big memory, fast processor
      • Assign big RDD partitions to big workers
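      Spark already exposes a locality hook that such a policy could build on:
      an RDD can advertise preferred hosts per partition, and the scheduler
      tries to honor them. The host names and the "big partitions go to big
      workers" rule below are placeholders.

          import org.apache.spark.{OneToOneDependency, Partition, TaskContext}
          import org.apache.spark.rdd.RDD

          class TopologyAwareRDD(parent: RDD[String])
            extends RDD[String](parent.context, Seq(new OneToOneDependency(parent))) {

            override protected def getPartitions: Array[Partition] = parent.partitions

            override def compute(split: Partition, ctx: TaskContext): Iterator[String] =
              parent.iterator(split, ctx)

            // The scheduler tries to run each task on one of these hosts.
            override protected def getPreferredLocations(split: Partition): Seq[String] =
              if (split.index < 2) Seq("bigworker1.rack1") else Seq("worker2.rack1")
          }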
  15. Sharing RDDs among different applications
      • Currently each application runs in an isolated context => better isolation
      • Motivation
        – Scientists working on the same challenges and the same input data
          will probably issue the same job tasks during interactive processing
        – Shared RDDs would reduce duplicated tasks
  16. Single point of failure
      • The driver itself
        – All scheduling and RDD information is held centrally at the driver
      • Critical for long-running jobs involving thousands of intermediate RDDs
        – How to recover from a driver failure? Rerun from the beginning?
      • Approach: checkpointing when persisting RDDs
        – Save the context, scheduling information, RDD information, etc.
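      Spark's existing RDD checkpointing covers part of this: a checkpointed
      RDD is written to reliable storage and its lineage truncated, so
      recovery need not restart from the beginning. Checkpointing the driver's
      own scheduling state, as proposed above, would go beyond it. Paths are
      placeholders.

          sc.setCheckpointDir("hdfs:///checkpoints")

          val intermediate = sc.textFile("hdfs:///input")
            .map(_.toLowerCase)
            .cache()

          intermediate.checkpoint() // materialized on the next action
          intermediate.count()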
  17. Summing up
      • Optimization is hard
        – More work, yet it favors only specific patterns
        – Can look like playing with the parameters in research papers
      • Put optimization aside for the future
        – Build more RDD types: CassandraRDD, DBMS RDDs, etc.
        – More transformations and actions (balance RDD, etc.)
        – Memory shadowing + content-based addressing look promising for better memory efficiency
      • Checkpointing for long-running jobs seems critical
  18. Shark
      • Hive over Spark
        – HiveQL
        – UDFs (analytics functions, etc.)
        – Columnar memory store
      • Each Hive operation maps to a set of Spark transformations/actions
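      A rough sketch of that mapping; the table layout and field positions are
      invented for illustration, not Shark's actual plan.

          // HiveQL: SELECT status, COUNT(*) FROM logs GROUP BY status
          // corresponds to Spark operations along the lines of:
          import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)

          val logs   = sc.textFile("hdfs:///warehouse/logs")       // table scan
          val counts = logs.map(line => (line.split("\t")(0), 1L)) // project status
                           .reduceByKey(_ + _)                     // GROUP BY + COUNT
          counts.collect()                                         // action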
  19. UPDATE {table} efficiency
      • An update produces a chain of RDDs
        – Memory usage efficiency?
      • Possible solutions
        – Instead of creating a new RDD, create a new version of the RDD (cf. versioning file systems)
        – Or remove the previous RDD immediately
        – Or make RDDs mutable (breaking the RDD design principle?)
  20. Indexing to improve lookups
      • Currently no index support
        – Not easy? A new RDD is created for each update
        – https://groups.google.com/forum/#!topic/sharkusers/Ugv0uIvNjzU
      • SELECT * FROM a WHERE a.id in range()
        – Becomes a .filter on the RDD
        – So any SELECT becomes a scan -> inefficient (see the sketch below)
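      A minimal sketch of the problem: without an index, even a highly
      selective predicate visits every element of every partition (the schema
      and data are invented).

          case class Row(id: Long, value: String)
          val table = sc.parallelize((1L to 1000000L).map(i => Row(i, "v" + i)))

          // SELECT * FROM a WHERE a.id BETWEEN 100 AND 200  =>  a full scan:
          val result = table.filter(r => r.id >= 100 && r.id <= 200).collect()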
