2. Why Spark?
• Better support for
– Iterative algorithms
– Interactive data mining
• Fault tolerance, data locality, scalability
3. How
• In-memory data analytics
– RAM is the new disk…
• RAMCloud, H-Store, etc.
• Resilient Distributed Datasets (RDDs)
– Initial RDD on disk (HDFS, etc.)
– Intermediate RDDs in RAM
– Fault recovery based on lineage
– RDD operations are distributed
• Shared variables (see the sketch below)
– Accumulators
– Broadcast variables
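A minimal sketch of both shared-variable kinds, using the standard accumulator and broadcast APIs (variable names and sample data are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("SharedVariables"))

  // Accumulator: workers can only add to it; the driver reads the total
  val errorCount = sc.accumulator(0, "errors")

  // Broadcast variable: read-only data shipped to each worker once
  val stopWords = sc.broadcast(Set("the", "a", "of"))

  val words = sc.parallelize(Seq("the", "spark", "error", "of", "rdd"))
  words.foreach { w =>
    if (w == "error") errorCount += 1   // updated on the workers
  }
  val kept = words.filter(w => !stopWords.value.contains(w)).collect()
  println(s"errors = ${errorCount.value}, kept = ${kept.mkString(",")}")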
8. Some thoughts
• Intuitive potential improvements
– Better schedulers
• Cross-application scheduling (FIFO, fair scheduling)
• Job and task schedulers
– Native support for Cassandra and other DBMSs (see the sketch below)
• CassandraRDD, MongoRDD, etc.
• StratioDeep: an integration between Spark & Cassandra
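Such native support usually takes the form of an RDD subclass whose partitions map to database token ranges; a heavily simplified sketch under that assumption (SimpleCassandraRDD and the fetchRange helper are hypothetical, not the real StratioDeep or DataStax classes):
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition: one token range of the Cassandra table per Spark partition
  class TokenRangePartition(val index: Int, val start: Long, val end: Long) extends Partition

  class SimpleCassandraRDD(sc: SparkContext,
                           ranges: Seq[(Long, Long)],
                           fetchRange: (Long, Long) => Iterator[String])  // hypothetical DB fetch
    extends RDD[String](sc, Nil) {

    override protected def getPartitions: Array[Partition] =
      Array.tabulate[Partition](ranges.length) { i =>
        new TokenRangePartition(i, ranges(i)._1, ranges(i)._2)
      }

    override def compute(split: Partition, context: TaskContext): Iterator[String] = {
      val p = split.asInstanceOf[TokenRangePartition]
      fetchRange(p.start, p.end)   // each task queries only its own token range
    }
  }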
9. Spark: coarse-grained data processing
• Ill-suited for fine-grained operations
– Updating a few parts of an RDD
• A new RDD is created for every operation (see the sketch at the end of this slide)
– Memory usage is not economical
» A filtered RDD is exactly a subset of its parent RDD
– Involves sending tasks to all workers holding RDD partitions
• Actions
– Shadow copy
– Content-addressable memory
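A small illustration of the coarse-grained model criticized above: there is no in-place update, so even a tiny change yields a whole new RDD, and a filter that keeps almost everything still occupies its own cache space (the data and sizes are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("CoarseGrained"))

  val parent = sc.parallelize(1 to 1000000).cache()

  // No "update element 42 in place": any change means deriving a new RDD
  val filtered = parent.filter(_ != 42).cache()   // 999,999 of 1,000,000 elements now cached twice

  // The filter runs as tasks on every worker that holds a partition of `parent`
  println(filtered.count())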
10. Spark in operation with other frameworks
• E.g.:
– An HDFS file is not immutable
– A log writer appends records to an HDFS file
– Spark computes on that HDFS file
– What happens?
• Actions:
– Spark Streaming (see the sketch below)
– In some cases Spark Streaming probably won't help
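A minimal sketch of the Spark Streaming route, assuming the log writer drops new files into an HDFS directory (path and batch interval are illustrative); note that textFileStream only picks up new files, not appends to an existing file, which is why it may not help in the appended-file scenario above:
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val sc = new SparkContext(new SparkConf().setAppName("LogStream"))
  val ssc = new StreamingContext(sc, Seconds(10))   // micro-batch interval, arbitrary choice

  // Picks up files newly created in the directory for each batch
  val logs = ssc.textFileStream("hdfs:///logs/incoming")
  val errors = logs.filter(_.contains("ERROR"))
  errors.count().print()

  ssc.start()
  ssc.awaitTermination()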
11. RDDs backed by mutable disk-based data
• Normal file systems support update(offset, size)
• An RDD backed by a table in a DBMS?
• Actions
– How to re-compute only the RDD partitions affected by the updates?
12. Spark memory efficiency
• Leverage compression
– Trade compute for memory
– When no free memory is available for new RDD partitions
– During shuffles
• Reduced network consumption
• Eager deletion of intermediate RDDs (not to be confused with GC); see the sketch below
– Analyze the job code to identify intermediate RDDs before execution
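A hedged sketch of what Spark already offers along these lines: serialized, compressed caching plus explicit unpersist for eager deletion (application name and paths are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .setAppName("MemoryEfficiency")
    .set("spark.rdd.compress", "true")   // compress serialized partitions: trades CPU for memory
  val sc = new SparkContext(conf)

  val parsed = sc.textFile("hdfs:///data/input").map(_.split(","))

  // Store partitions as serialized (and, with the flag above, compressed) bytes
  parsed.persist(StorageLevel.MEMORY_ONLY_SER)

  val result = parsed.filter(_.length > 3).count()

  // Eager deletion: drop the cached intermediate RDD as soon as it is no longer needed,
  // rather than waiting for LRU eviction
  parsed.unpersist()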
13. Unbalanced tasks in worker threads
• Context
– Hadoop RDD -> Filtered RDD
– Many small RDD partitions
– Few big RDD partitions
• Many transformations before real actions
– Chained unbalanced tasks -> some workers are much busier than others
– Job finish time = slowest worker's completion time
• Should implement a balancing RDD and inject it implicitly to reduce unbalanced tasks (see the sketch below)
– Balancing RDD = split big RDD partitions + merge small RDD partitions
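A minimal sketch of rebalancing with the operators Spark already has, assuming a selective filter left partition sizes skewed (path and partition counts are illustrative): repartition shuffles data to even out partitions, while coalesce only merges small ones without a full shuffle:
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Rebalance"))

  val raw = sc.textFile("hdfs:///data/logs")        // HadoopRDD
  val errors = raw.filter(_.contains("ERROR"))      // FilteredRDD with skewed partition sizes

  // Full shuffle: splits big partitions and spreads records evenly over 64 partitions
  val balanced = errors.repartition(64)

  // Cheaper when there are merely too many small partitions: merge without a full shuffle
  val merged = errors.coalesce(16)

  println(balanced.count())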
14. Topology-aware scheduling
• Well-known in distributed systems (DynamoDB, MPI, etc.)
• Choose reducers close to the data source (see the sketch below)
– Same rack, large memory, fast processor
• Assign big RDD partitions to big workers
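The hook Spark already exposes for this is per-partition preferred locations: a custom RDD can report the hosts that hold each partition's data, and the scheduler tries to place tasks there. A hedged sketch (class and host handling are illustrative; loading the actual data is omitted):
  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  class HostAwarePartition(val index: Int, val hosts: Seq[String]) extends Partition

  class HostAwareRDD(sc: SparkContext, parts: Seq[HostAwarePartition])
    extends RDD[String](sc, Nil) {

    override protected def getPartitions: Array[Partition] = parts.toArray[Partition]

    // Locality hint: the scheduler prefers to run each partition's task on these hosts
    override protected def getPreferredLocations(split: Partition): Seq[String] =
      split.asInstanceOf[HostAwarePartition].hosts

    override def compute(split: Partition, context: TaskContext): Iterator[String] =
      Iterator.empty   // fetching the real data is out of scope for this sketch
  }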
15. Sharing RDDs among different applications
• Currently, each application runs in an isolated context => better isolation
• Motivation
– Scientists working on the same challenges, same input data
• Probably come up with the same job tasks during interactive processing
• Shared RDDs to reduce duplicated tasks
16. Single point of failure
• The Driver itself
– All scheduling and RDD information is held centrally at the driver
• Critical:
– Long-running jobs that involve thousands of intermediate RDDs
– How to recover from a Driver failure? Rerun from the beginning?
• Approach: checkpointing when persisting RDDs (see the sketch below)
– Save the context, scheduling information, RDD information, etc.
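For the RDD side of this, Spark already has checkpointing: it writes partitions to reliable storage and truncates the lineage, so a failure does not force re-computation from the very beginning. Recovering the driver's own scheduling state would still need extra work, as the slide suggests. A minimal sketch (paths and the loop are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("Checkpointing"))
  sc.setCheckpointDir("hdfs:///spark/checkpoints")   // reliable storage for checkpoint files

  var rdd = sc.textFile("hdfs:///data/input").map(_.toLowerCase)

  for (i <- 1 to 1000) {
    rdd = rdd.map(identity)        // stand-in for a real iterative transformation
    if (i % 100 == 0) {
      rdd.checkpoint()             // marks this RDD; written out on the next action
      rdd.count()                  // force materialization so the checkpoint is saved
    }
  }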
17. Sum up
• Optimization is hard
– More work, but it favors only specific patterns
– Looks like playing with the parameters in research papers
• Put optimization aside for the future
– Build more RDD types: CassandraRDD, DBMS-RDD, etc.
– More transformations and actions
• Balancing RDD, etc.
– Memory shadowing + content-based addressing look promising for better memory efficiency
• Checkpointing for long-running jobs seems critical
19. Update {table} efficiency
• Update -> chain of RDDs
– Memory usage efficiency?
• Solutions:
– Instead of creating a new RDD -> a new version of the RDD (cf. versioning file systems)
– Or remove the previous RDD immediately
– Or make RDDs mutable (breaks the RDD design principle?)
20. Indexing to improve lookup
• Currently no index support
– Not easy, since a new RDD is created for each update
– https://groups.google.com/forum/#!topic/sharkusers/Ugv0uIvNjzU
• SELECT * FROM a WHERE a.id IN range()
– Becomes a .filter over the RDD (see the sketch below)
– Any SELECT becomes a full scan -> inefficient
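A hedged illustration of why such a SELECT turns into a full scan: without an index, the range predicate can only be expressed as a filter that visits every record in every partition (the Record schema, path, and range are illustrative):
  import org.apache.spark.{SparkConf, SparkContext}

  case class Record(id: Long, payload: String)   // illustrative schema for table "a"

  val sc = new SparkContext(new SparkConf().setAppName("NoIndexScan"))

  val a = sc.textFile("hdfs:///tables/a").map { line =>
    val cols = line.split(",")
    Record(cols(0).toLong, cols(1))
  }

  // SELECT * FROM a WHERE a.id IN range(100, 200)
  // With no index, every partition is scanned and every row is tested
  val selected = a.filter(r => r.id >= 100L && r.id < 200L)
  selected.collect()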