Learning Apache Spark by Examples
Samuel Yee
Why Apache Spark?
MapReduce vs Spark
[Diagram: Pig, Hive and Mahout (machine learning) layered on top of MapReduce]
Programming Models
MapReduce: 50 lines of code
Spark: 1 line of code
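The slide does not say which program the comparison refers to. Word count is the usual example and is assumed here. In PySpark the chained form is commonly written as `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, where `sc` is a SparkContext and `path` is a placeholder. The plain-Python sketch below only mimics those stages on an in-memory list so it runs without a cluster. It is not Spark code, and the sample `lines` are made up.

```python
from functools import reduce
from itertools import chain

# Made-up input standing in for the lines of a text file.
lines = ["to be or not to be", "to see or not to see"]

# "flatMap" + "map": split every line into words and emit (word, 1) pairs.
pairs = [(word, 1) for word in chain.from_iterable(line.split() for line in lines)]

# "reduceByKey": sum the counts that share the same word.
def add_pair(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

The point of the comparison is the programming model: the whole job is one chain of small functions rather than separate mapper, reducer and driver classes.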
Resilient Distributed Dataset (RDD)
An RDD is an immutable (read-only), fault-tolerant, distributed collection of records that can be operated on in parallel across different nodes.
An RDD is created by loading data from storage or by transforming other RDDs.
RDD partitions can be kept in memory or on disk while they are processed, and final results are typically written out to disk.
RDD operations fall into two categories: Transformations and Actions.
Transformations vs Actions
Transformations: lazily define new RDDs from existing ones, e.g. map, filter, join, union.
Actions: trigger the computation and return a value, e.g. reduce, count, first.
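The split between the two kinds of operation can be sketched with a toy class in plain Python. `ToyRDD` and everything in it is hypothetical and is not Spark's implementation or API. It only shows that a transformation returns a new object without running anything, while an action walks the chain and returns a plain value.

```python
from functools import reduce as _reduce

class ToyRDD:
    """Hypothetical stand-in for an RDD: read-only, lazy, derived from a parent."""

    def __init__(self, compute):
        # compute: zero-argument function returning a fresh iterator of records
        self._compute = compute

    @classmethod
    def from_list(cls, records):
        data = tuple(records)  # frozen copy, so the source cannot change
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD, run nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def union(self, other):
        def compute():
            yield from self._compute()
            yield from other._compute()
        return ToyRDD(compute)

    # Actions: evaluate the chain and return a plain value.
    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return _reduce(f, self._compute())

    def first(self):
        return next(self._compute())


calls = []  # records when the mapped function actually runs

def square(x):
    calls.append(x)
    return x * x

numbers = ToyRDD.from_list([1, 2, 3, 4])
squares = numbers.map(square)                         # transformation
even_squares = squares.filter(lambda x: x % 2 == 0)   # transformation
calls_before_action = len(calls)                      # 0: nothing has run yet

total = even_squares.reduce(lambda a, b: a + b)       # action: 4 + 16
how_many = numbers.union(squares).count()             # action: 4 + 4 records
```

In this sketch `squares` is recomputed by every action that depends on it. Spark behaves similarly unless an RDD is explicitly cached or persisted.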
Data Transformations and Actions
[Diagram: input data on disk is loaded as RDD 1; transformations T1, T2 and T3 derive RDD 2, RDD 3 and RDD X; action A1 produces the output result on disk.]
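The disk-to-disk flow in the diagram can be imitated with Python generators. This is a sketch under assumptions: the diagram's exact wiring is not recoverable, so a simple linear chain (T1, then T2, then T3, then A1) is assumed, and the file names, sample data and the operations chosen for each step are invented for illustration. It does not use Spark.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")    # hypothetical file names
out_path = os.path.join(workdir, "output.txt")

# Input: data on disk (made-up sample with one blank line).
with open(in_path, "w") as f:
    f.write("3\n8\n\n5\n12\n")

def read_records(path):
    # "RDD 1": records streamed from disk, one line at a time
    with open(path) as f:
        for line in f:
            yield line.strip()

rdd1 = read_records(in_path)
rdd2 = (r for r in rdd1 if r)             # T1: drop blank lines
rdd3 = (int(r) for r in rdd2)             # T2: parse to integers
rdd_x = (n * 10 for n in rdd3 if n > 4)   # T3: keep n > 4 and scale by 10

# A1: the action consumes the whole chain; only now is the file read.
result = sum(rdd_x)

# Output: result on disk.
with open(out_path, "w") as f:
    f.write(f"{result}\n")
```

Nothing is read from `input.txt` until `sum` pulls records through the chain, which mirrors how an action at the end of the diagram drives the transformations before it.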
Demo
Learn from code samples and examples provided by Spark.
Develop Apache Spark apps on the Windows platform:
◦ http://tinyurl.com/nf9bwey
