The idea of this presentation is to understand more about Apache Spark internals:
how each component handles resilience, how shard allocation works with RDDs, and how Spark abstracts away the complexity of data partitioning and cluster distribution.
10. Resilience
RDD
● An RDD is an immutable, deterministically re-computable, distributed dataset.
● Each RDD remembers the lineage of deterministic operations that were used on a
fault-tolerant input dataset to create it.
● If any partition of an RDD is lost due to a worker node failure, then that partition can be
re-computed from the original fault-tolerant dataset using the lineage of operations.
● Assuming that all of the RDD transformations are deterministic, the data in the final
transformed RDD will always be the same irrespective of failures in the Spark cluster.
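The bullet points above can be sketched in plain Python. This is a hypothetical miniature (not the real Spark API): an "RDD" that records its lineage as a chain of deterministic transformations over a fault-tolerant source, so any lost partition can simply be recomputed.

```python
# Hypothetical sketch, not Spark's API: an immutable dataset that
# remembers its lineage of deterministic operations.
class MiniRDD:
    def __init__(self, source_partitions, lineage=()):
        self.source = source_partitions      # fault-tolerant input dataset
        self.lineage = lineage               # ordered deterministic ops

    def map(self, fn):
        return MiniRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self.source, self.lineage + (("filter", fn),))

    def compute_partition(self, i):
        # Replaying the lineage always yields the same data, so a
        # partition lost to a worker failure can be rebuilt on demand.
        data = self.source[i]
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = MiniRDD([[1, 2, 3], [4, 5, 6]]).map(lambda x: x * 10).filter(lambda x: x > 20)
# Partition 0 "lost"? Recompute it deterministically from the source:
assert rdd.compute_partition(0) == [30]
assert rdd.compute_partition(1) == [40, 50, 60]
```

Because every transformation is deterministic, the result is identical no matter how many times (or on which node) the partition is recomputed.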
11. cache
Resilience - RDD
[Diagram: an RDD lineage over log records such as "Error, ts, msg1" — logLinesRDD → filter(fx) → errorsRDD → coalesce(2) → cleanedRDD → filter(fx) → errorMsg1RDD — with the actions collect(), count(), and saveToCassandra().]
If a partition is damaged, it can be recomputed from its parent; if the parents aren't in memory anymore, it is reprocessed from disk.
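The cache-then-recompute behavior described above can be sketched as follows. This is an illustrative stand-in (the function names are ours, not Spark's): a cached partition is served from memory, and an evicted one is transparently rebuilt from its source.

```python
# Sketch of cache + recompute fallback (hypothetical, not Spark's API).
cache = {}

def read_from_source(i):
    # Stand-in for re-reading the fault-tolerant input (HDFS, S3, ...)
    # and replaying the lineage of transformations.
    return [f"Error, ts, msg{i}"]

def get_partition(i):
    if i in cache:                 # fast path: partition is in memory
        return cache[i]
    data = read_from_source(i)     # slow path: recompute from the parent/source
    cache[i] = data                # re-cache for subsequent actions
    return data

assert get_partition(0) == ["Error, ts, msg0"]
cache.clear()                      # simulate eviction under memory pressure
assert get_partition(0) == ["Error, ts, msg0"]  # transparently recomputed
```

The caller never sees the difference between a cache hit and a recomputation; only the latency changes.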
12. RDD
Shard allocation
RDD - Resilient Distributed Dataset
[Diagram: a file (HDFS, S3, etc.) containing log records such as "Error, ts, msg1, warn, ts, msg2" is split into partitions.]
Default algorithm: hash partitioning
RDD = data abstraction: it hides data partitioning and distribution complexity.
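The default hash-partitioning scheme mentioned above amounts to `partition = hash(key) % numPartitions`. A minimal sketch (names are ours, not Spark's):

```python
# Illustrative hash partitioning: each keyed record is routed to a
# partition by hashing its key modulo the number of partitions.
def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

num_partitions = 4
records = [("Error", 1), ("info", 2), ("warn", 3), ("Error", 4)]
partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[hash_partition(key, num_partitions)].append((key, value))

# Records with the same key always land in the same partition:
p = hash_partition("Error", num_partitions)
assert ("Error", 1) in partitions[p] and ("Error", 4) in partitions[p]
```

Co-locating equal keys this way is what lets per-key operations run without scanning every partition.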
15. Default settings:
● mapreduce.input.fileinputformat.split.minsize = 1 byte (minSize)
● dfs.block.size = 128 MB (cluster) / fs.local.block.size = 32 MB (local) (blockSize)
Calculating goal size:
e.g.:
● Total size of input files = T = 599 MB
● Desired number of partitions = P = 30 (parametrized)
● Partition goal size = PGS = T / P = 599 / 30 ≈ 19 MB
Result: splitSize = Math.max(minSize, Math.min(PGS, blockSize)) = Math.max(1, Math.min(19, 32)) = 19 MB
Shard allocation
Partition configuration - defining partition size
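The slide's split-size calculation, worked through in code (this mirrors Hadoop's FileInputFormat rule splitSize = max(minSize, min(goalSize, blockSize)); the variable names are ours):

```python
# Worked version of the split-size calculation from the slide.
MB = 1024 * 1024

min_size   = 1                   # mapreduce.input.fileinputformat.split.minsize
block_size = 32 * MB             # fs.local.block.size (local mode)
total_size = 599 * MB            # T: total size of input files
num_splits = 30                  # P: desired number of partitions

goal_size  = total_size // num_splits                 # ~19 MB per partition
split_size = max(min_size, min(goal_size, block_size))

assert goal_size // MB == 19
assert split_size == goal_size   # the 19 MB goal wins over the 32 MB block size
```

With a 128 MB cluster block size the outcome is the same here, since the goal size is still the middle value of the three.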
16. Fewer partitions
● more data in each partition
● less network and disk I/O
● faster access to data
● increased memory pressure
● doesn't make full use of parallelism
More partitions
● increased processing parallelism
● less data in each partition
● more network and disk I/O
Shard allocation
Trade-offs
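The trade-off above is easy to see numerically: for a fixed input size, the partition count directly sets how much data (and memory pressure) each task carries, against how many tasks there are to schedule and shuffle.

```python
# For the slide's 599 MB input, per-partition data size at a few
# partition counts (illustrative numbers only).
total_mb = 599

for num_partitions in (5, 30, 300):
    per_partition = total_mb / num_partitions
    print(f"{num_partitions:>3} partitions -> ~{per_partition:.1f} MB each")

assert total_mb / 5 > total_mb / 300   # fewer partitions => bigger chunks
```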