Yet another intro to Apache Spark

A brief intro to Apache Spark
– You eat, I talk…

Spark Framework
• Efficient data processing via in-memory RDD.
• A rich data-flow API (Java, Scala and Python).
• An interactive shell (Scala and Python).
• Execution environment running in Local and Standalone modes, or on
  top of Hadoop YARN, Apache Mesos, Amazon EC2.
• Several extensions on top of the core engine:
  • Spark SQL, Spark Streaming, MLlib and GraphX.

Get It Running
$ git clone https://github.com/apache/spark
$ export JAVA_HOME=...
$ spark/sbt/sbt assembly
$ spark/sbin/start-master.sh
$ spark/sbin/start-slave.sh --master spark://localhost:7077

Resilient Distributed Datasets (RDD)
• Immutable data collection partitioned across the nodes.
• Data-flow model with parallel transformations and actions.
• Transformations are lazy; the actual computation happens only when an
action is called.
• Lost partitions are recomputed on failure from the computation graph (lineage).
• Can be persisted to memory and/or disk for future reuse.
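The lazy evaluation and lineage-based recomputation described above can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions (a single partition, no cluster); `ToyRDD` and `traced_square` are hypothetical names and this is not how Spark itself is implemented.

```python
# Toy model of lazy transformations and lineage; NOT Spark's implementation.
class ToyRDD:
    def __init__(self, source=None, parent=None, op=None):
        self.source = source  # base data, set only for the root dataset
        self.parent = parent  # lineage: the dataset this one is derived from
        self.op = op          # function applied to the parent's iterator

    # Transformations only record what to do and return a new dataset.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)))

    def _compute(self):
        # Walk the lineage back to the root and rebuild the data from there.
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._compute())

    # Actions trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every element the map function actually touches


def traced_square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(source=range(1, 6))
odd_squares = numbers.map(traced_square).filter(lambda x: x % 2 == 1)
calls_before_action = len(calls)  # still 0: nothing has been computed yet

result = odd_squares.collect()    # first action runs the whole chain
calls_after_collect = len(calls)

total = odd_squares.count()       # second action recomputes from the lineage
calls_after_count = len(calls)
```

Because nothing is cached in this toy model, every action replays the full chain of transformations, which is the same mechanism that lets a lost partition be rebuilt from its lineage.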

Transformations and Actions
• Transformations
  • filter, map, flatMap, groupByKey, sortByKey, reduceByKey, distinct,
    union, intersection, cartesian, subtract, join, cogroup, sample
• Actions
  • count, collect, reduce, take, takeSample, foreach, first,
    saveAsTextFile
• Persistence
  • MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER,
    MEMORY_AND_DISK_SER, DISK_ONLY, and replicated variants such as
    MEMORY_ONLY_2
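What persistence buys can be sketched with a small plain-Python stand-in. `CachedDataset` is a hypothetical class that only mimics MEMORY_ONLY-like behaviour on one machine; it assumes nothing about Spark's real storage layer, and the sample words are made up.

```python
# Toy illustration of persistence; NOT Spark's storage implementation.
class CachedDataset:
    def __init__(self, compute):
        self._compute = compute  # zero-argument function producing the records
        self._cache = None       # filled on the first action if persisted
        self.persisted = False
        self.computations = 0    # how many times the records were recomputed

    def persist(self):
        # Like a transformation, this only marks the dataset; nothing runs yet.
        self.persisted = True
        return self

    def collect(self):
        if self._cache is not None:
            return list(self._cache)  # served from memory, no recomputation
        self.computations += 1
        data = list(self._compute())
        if self.persisted:
            self._cache = data
        return data


raw = ["Spark", "RDD", "Spark"]  # made-up sample data

uncached = CachedDataset(lambda: (w.lower() for w in raw))
uncached.collect()
uncached.collect()
uncached_runs = uncached.computations  # recomputed on every action

cached = CachedDataset(lambda: (w.lower() for w in raw)).persist()
cached.collect()
cached.collect()
cached_runs = cached.computations      # computed once, then reused
```

The other storage levels trade memory for CPU or disk (serialized, disk-backed, replicated), but the reuse pattern stays the same.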

Hello World! (pyspark)
>>> file = sc.textFile(".../spark/README.md")
>>> file.first()
u'# Apache Spark'
>>> file.filter(lambda line: "Spark" in line).count()
19
>>> wordCounts = (file.flatMap(lambda line: line.split())
...               .map(lambda word: (word, 1))
...               .reduceByKey(lambda a, b: a + b))
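For readers without a Spark shell at hand, the three steps of the word count (flatMap, map, reduceByKey) can be mimicked in plain Python. This is a single-machine sketch of the semantics only; `reduce_by_key` is a hypothetical helper and the sample lines are made up rather than taken from the README.

```python
from itertools import chain

# Made-up sample input standing in for the lines of a text file.
lines = [
    "# Apache Spark",
    "Spark is a fast engine",
    "a fast and general engine",
]

# flatMap: split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)

# map: pair each word with the count 1.
pairs = ((word, 1) for word in words)


# reduceByKey: merge all values that share a key using the given function.
def reduce_by_key(pairs, func):
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


word_counts = reduce_by_key(pairs, lambda a, b: a + b)
```

As in the pyspark version, the first two steps are lazy (generators here) and nothing is computed until the final step consumes them.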

Advanced RDD
• Data sets can be cached in memory for repeated access.
• Data that does not fit in RAM can be stored on disk.
• The user can decide partitioning for better join performance.
• Each RDD is represented as
  • a set of partitions
  • a set of dependencies on parent RDDs
  • a function for computing it from its parents
  • metadata about partitioning and data placement
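The four parts of the RDD representation listed above can be written down as a small data structure. This is a minimal sketch, assuming one-to-one (narrow) dependencies only; `RDDDescriptor`, `iterate` and the node names are hypothetical and do not mirror Spark's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional


@dataclass
class RDDDescriptor:
    # a set of partitions (here just their number)
    num_partitions: int
    # a set of dependencies on parent datasets
    parents: List["RDDDescriptor"] = field(default_factory=list)
    # a function computing one partition from the parents' iterators
    compute: Optional[Callable[[int, List[Iterator]], Iterator]] = None
    # metadata about partitioning and data placement
    partitioner: Optional[str] = None
    preferred_locations: Dict[int, List[str]] = field(default_factory=dict)


def iterate(rdd, index):
    # Assumes narrow dependencies: child partition i reads parent partition i.
    parent_iterators = [iterate(parent, index) for parent in rdd.parents]
    return rdd.compute(index, parent_iterators)


data = [[1, 2], [3, 4]]  # two partitions of made-up base data

base = RDDDescriptor(
    num_partitions=2,
    compute=lambda i, _parents: iter(data[i]),
    preferred_locations={0: ["node-a"], 1: ["node-b"]},  # hypothetical nodes
)
doubled = RDDDescriptor(
    num_partitions=2,
    parents=[base],
    compute=lambda i, parents: (2 * x for x in parents[0]),
)

partition_1 = list(iterate(doubled, 1))
```

A scheduler could use `preferred_locations` to run a task near its data, and the `parents` links are what make recomputation from lineage possible.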

RDD: Narrow vs Wide Dependencies
• Narrow: each parent partition is used by at most one child partition.
  • Allows pipelined execution (operator chaining).
  • Easier recovery: only the lost partitions need to be recomputed, and
    they can be recomputed in parallel on different nodes.
• Wide: a parent partition may be used by multiple child partitions.
  • Needs shuffling.
  • When an action triggers the computation, the parent partitions are
    (or at least were) materialized before the shuffle.
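The difference between the two dependency types can be made concrete with partitions modelled as plain Python lists. This is a toy sketch, not Spark's shuffle: `map_values`, `shuffle_by_key` and `partition_of` are hypothetical helpers, and the byte-sum partitioner is chosen only because it is deterministic.

```python
# Two parent partitions of made-up key/value records.
parent = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
]


# Narrow dependency: child partition i is built from parent partition i only,
# so the work can be pipelined per partition without moving data.
def map_values(partitions, f):
    return [[(key, f(value)) for key, value in part] for part in partitions]


# Deterministic toy partitioner (assumption; not Spark's hash partitioner).
def partition_of(key, num_children):
    return sum(key.encode()) % num_children


# Wide dependency: records are redistributed by key, so a child partition
# may need records from several parent partitions (a shuffle).
def shuffle_by_key(partitions, num_children):
    children = [[] for _ in range(num_children)]
    sources = [set() for _ in range(num_children)]
    for parent_index, part in enumerate(partitions):
        for key, value in part:
            target = partition_of(key, num_children)
            children[target].append((key, value))
            sources[target].add(parent_index)
    return children, sources


narrow_result = map_values(parent, lambda v: v * 10)
wide_result, wide_sources = shuffle_by_key(parent, num_children=2)
```

Here `wide_sources` records which parent partitions each child partition read from; a child that reads from more than one parent is what makes recovery after a failure more expensive.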

Comparison to DSM and Map-Reduce
• Spark has an expressive API and support for Scala/Java/Python.
• Spark does efficient scheduling and recovery.
• Spark is best suited for iterative batch data-flow operations on large
data sets.
• For ML and graph applications it has shown a 20x speedup due to the
elimination of I/O and deserialization.

Spark Platform
• Spark SQL
  • Provides Hive-compatible SQL access and JDBC/ODBC.
• GraphX
  • Provides a flexible API for graph processing.
  • Includes a variety of graph algorithms for computing PageRank,
    connected components, triangle count, SVD, label propagation, etc.

Spark Platform
• Spark Streaming
  • Provides a flexible streaming API based on micro-batch processing.
  • Includes methods for stream source definitions, transformations and
    window operations.
• MLlib
  • Provides a set of ML algorithms for classification (logistic
    regression, SVM, naive Bayes), linear regression, clustering (k-means),
    matrix decomposition (SVD/PCA) and collaborative filtering (ALS).
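The micro-batch and window idea behind Spark Streaming can be sketched in plain Python: the stream is cut into small batches, and a window operation aggregates over the most recent few of them. `windowed_counts` is a hypothetical helper and the batches are made up; this shows the concept only, not the Spark Streaming API.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """Count items over a sliding window of the last `window_length` batches."""
    window = deque(maxlen=window_length)  # old batches fall out automatically
    results = []
    for batch in batches:
        window.append(Counter(batch))     # per-batch counts
        total = Counter()
        for counts in window:
            total.update(counts)          # merge the batches in the window
        results.append(dict(total))
    return results


# Each inner list stands for the records that arrived in one batch interval.
batches = [["a", "b"], ["a"], ["c", "a"]]
results = windowed_counts(batches, window_length=2)
```

One result is emitted per batch interval, each covering the current batch plus the previous one.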

Personal impressions
• The interactive shell is awesome!
• Good documentation and lots of examples; the source code being in Scala is =/
• Tons of info messages are distracting; error messages on teardown are
spooky.
• MLlib lacks methods for data cleaning/transformation, model validation
and exploration.

References
• Zaharia et al., 2012: Resilient distributed datasets: A fault-tolerant
abstraction for in-memory cluster computing. (Paper of the week!)
• http://spark.apache.org/
• Slideshare presentations: one, two, three, four, five.