2. Spark Framework
• Efficient data processing via in-memory RDDs.
• A rich data-flow API (Java, Scala and Python).
• An interactive shell (Scala and Python).
• Runs in local and standalone modes, or on top of Hadoop YARN,
Apache Mesos or Amazon EC2.
• Several extensions on top of the core engine:
• Spark SQL, Spark Streaming, MLlib and GraphX.
4. Resilient Distributed Datasets (RDD)
• Immutable data collection partitioned across the nodes.
• Data-flow model with parallel transformations and actions.
• Transformations are lazy; the actual computation happens only when an
action is invoked.
• Lost partitions are recomputed on failure from the computation graph
(lineage).
• Can be persisted to memory and/or disk for future reuse.
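The lazy transformation/action split described on this slide can be illustrated with a small plain-Python sketch. This is not Spark's API: `ToyRDD`, its fields and the `evaluations` counter are hypothetical names invented here, and everything runs in a single process with no partitioning.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, source: Optional[List] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None):
        self.source = source    # base data, only set on the root dataset
        self.parent = parent    # lineage: the dataset this one derives from
        self.op = op            # how to derive this dataset from its parent
        self.evaluations = 0    # how many times compute() actually ran

    # Transformations only record a step and return a new dataset.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda rows: (f(x) for x in rows))

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda rows: (x for x in rows if pred(x)))

    def compute(self):
        """Walk the lineage back to the root and rebuild the data."""
        self.evaluations += 1
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent.compute())

    # Actions force the computation.
    def collect(self):
        return list(self.compute())

    def count(self):
        return sum(1 for _ in self.compute())


numbers = ToyRDD(source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
work_before_action = numbers.evaluations  # nothing has been computed yet
result = evens_squared.collect()          # the action runs the whole chain
```

Because each dataset keeps a reference to its parent and to the function that derives it, the same chain can be replayed at any time, which is the idea behind lineage-based recovery.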
7. Advanced RDD
• Data sets can be cached in memory for repeated access.
• Data that does not fit in RAM can be stored on disk.
• The user can decide partitioning for better join performance.
• Each RDD is represented as
• a set of partitions
• a set of dependencies on parent RDDs
• a function for computing it from its parents
• metadata about partitioning and data placement
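The four-part representation listed above can be written down as a small record type. The sketch below is plain Python under stated assumptions: `RDDRecord`, `parallelize`, `map_rdd` and the host names are hypothetical and only mirror the bullets, they are not Spark's internal interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class RDDRecord:
    """Hypothetical record mirroring the four parts listed on the slide."""
    num_partitions: int                 # a set of partitions (by index)
    parents: List["RDDRecord"]          # dependencies on parent RDDs
    compute: Callable[[int], List]      # builds one partition from the parents
    partitioner: Optional[str] = None   # metadata: partitioning scheme
    preferred_hosts: Dict[int, str] = field(default_factory=dict)  # placement


def parallelize(chunks: List[List]) -> RDDRecord:
    """Root dataset: one partition per input chunk, no parents."""
    return RDDRecord(num_partitions=len(chunks), parents=[],
                     compute=lambda i: list(chunks[i]),
                     preferred_hosts={i: f"node-{i}" for i in range(len(chunks))})


def map_rdd(parent: RDDRecord, f: Callable) -> RDDRecord:
    """Derived dataset: partition i is computed from parent partition i."""
    return RDDRecord(num_partitions=parent.num_partitions, parents=[parent],
                     compute=lambda i: [f(x) for x in parent.compute(i)])


base = parallelize([[1, 2, 3], [4, 5, 6]])
doubled = map_rdd(base, lambda x: 2 * x)

# Keep the computed partitions around for repeated access.
cache = {i: doubled.compute(i) for i in range(doubled.num_partitions)}
del cache[1]                   # simulate losing one cached partition
cache[1] = doubled.compute(1)  # rebuild only that partition from its parent
```

Only partition 1 is recomputed after the simulated loss; partition 0 stays in the cache untouched.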
8. RDD: Narrow vs Wide Dependencies
• Narrow: each parent partition is used by at most one child partition.
• Can do pipelined execution (operator chaining).
• Easier recovery: only the lost partitions need to be recomputed, and
they can be recomputed in parallel on different nodes.
• Wide: a parent partition is used by multiple child partitions.
• Needs shuffling.
• During computation (triggered by an action), parent partitions are
(or at least were) materialized before the shuffle.
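The difference between the two dependency types can be sketched in plain Python, with lists standing in for partitions. The function names and the byte-sum `bucket` rule are assumptions made for this illustration (chosen to be deterministic), not Spark's partitioner.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Partition = List[Tuple[str, int]]


def bucket(key: str, num_out: int) -> int:
    """Hypothetical deterministic partitioner: byte sum modulo partition count."""
    return sum(key.encode()) % num_out


def map_values(partitions: List[Partition], f: Callable[[int], int]) -> List[Partition]:
    """Narrow: output partition i depends only on input partition i."""
    return [[(k, f(v)) for k, v in part] for part in partitions]


def group_by_key(partitions: List[Partition], num_out: int) -> List[Dict[str, List[int]]]:
    """Wide: every output partition may need rows from every input partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in partitions:              # all parent partitions are read...
        for k, v in part:
            out[bucket(k, num_out)][k].append(v)  # ...and rows are shuffled by key
    return [dict(d) for d in out]


parts = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
scaled = map_values(parts, lambda v: v * 10)   # can be pipelined per partition
grouped = group_by_key(scaled, num_out=2)      # needs a shuffle across partitions
```

In `map_values` a lost output partition can be rebuilt from a single input partition, while rebuilding one partition of `grouped` requires reading all of `scaled` again.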
9. Comparison to DSM and Map-Reduce
• Spark has an expressive API and support for Scala/Java/Python.
• Spark does efficient scheduling and recovery.
• Spark is best suited to iterative batch data-flow operations on large
data sets.
• For ML and graph applications it has shown a 20x speedup by
eliminating I/O and deserialization.
10. Spark Platform
• Spark SQL
• Provides Hive compatible SQL access and JDBC/ODBC.
• GraphX
• Provides a flexible API for graph processing.
• Includes a variety of graph algorithms for computing PageRank,
connected components, triangle count, SVD, label propagation, etc.
11. Spark Platform
• Spark Streaming
• Provides a flexible streaming API based on micro-batch processing.
• Includes methods for stream source definitions, transformations and
window operations.
• MLlib
• Provides a set of ML algorithms for classification (logistic
regression, SVM, naive Bayes), linear regression, clustering (k-means),
matrix decomposition (SVD/PCA) and collaborative filtering (ALS).
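The micro-batch and window ideas from the Spark Streaming bullets on this slide can be illustrated with a plain-Python sketch. Two assumptions apply: batches are cut by element count here, whereas a real micro-batch engine cuts them by time interval, and `micro_batches` and `windowed_sums` are hypothetical helper names, not Spark Streaming methods.

```python
from collections import deque
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a stream into small batches (by count here, by time in practice)."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                       # emit the final, possibly shorter batch
        yield batch


def windowed_sums(batches: Iterable[List[int]], window: int) -> Iterator[int]:
    """Sliding window: after each batch, sum the last `window` batch totals."""
    recent = deque(maxlen=window)   # older batch totals fall out automatically
    for batch in batches:
        recent.append(sum(batch))
        yield sum(recent)


events = [1, 2, 3, 4, 5, 6, 7]
batches = list(micro_batches(events, batch_size=2))
sums = list(windowed_sums(batches, window=2))
```

Each batch is processed as a small ordinary batch job, and the window operation only combines the results of the most recent batches.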
12. Personal impressions
• The interactive shell is awesome!
• Good documentation and lots of examples, but the source code is in
Scala =/
• Tons of info messages are distracting; error messages on teardown are
spooky.
• MLlib lacks methods for data cleaning/transformation, model validation
and exploration.
13. References
• Zaharia et al., 2012: Resilient distributed datasets: A fault-tolerant
abstraction for in-memory cluster computing. (Paper of the week!)
• http://spark.apache.org/
• Slideshare presentations: one, two, three, four, five.