Learning Apache Spark by examples

1. Learning Apache Spark by Examples SAMUEL YEE

2. Why Apache Spark?

3. MapReduce vs Spark [Diagram labels: Pig, Hive, Mahout (machine learning), MapReduce]

4. Programming Models MapReduce: 50 lines of code. Spark: 1 line of code.

5. Resilient Distributed Dataset (RDD) An RDD is an immutable (read-only), fault-tolerant, distributed collection of records that can be operated on in parallel across different nodes. An RDD is created either by loading data from storage or by transforming other RDDs. RDDs can be processed on disk or in memory, and final results can be persisted to disk. Operations on RDDs fall into two categories: Transformations and Actions.
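
The idea of a read-only collection split into pieces that are processed in parallel can be sketched in plain Python. This is only a toy illustration, not Spark's API: the helper names `partition` and `map_partitions` are hypothetical, and threads on one machine stand in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions read-only chunks (tuples)."""
    return [tuple(records[i::num_partitions]) for i in range(num_partitions)]


def map_partitions(func, partitions):
    """Apply func to every record, one worker per partition.

    The input partitions are never modified; a new list of partitions
    is returned, mirroring the immutability of an RDD.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(lambda part: tuple(func(x) for x in part), partitions))


records = list(range(10))
parts = partition(records, 3)                     # three "nodes"
squared = map_partitions(lambda x: x * x, parts)  # new dataset, parts untouched
total = sum(sum(p) for p in squared)              # combine per-partition results

print(parts)
print(squared)
print(total)
```

In Spark itself the partitioning, scheduling, and recovery of lost partitions are handled by the framework; the sketch only shows why immutable partitions make parallel processing straightforward.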

6. Transformations vs Actions Transformations: define new RDDs based on existing ones (e.g. map, filter, join, union). Actions: return a value (e.g. reduce, count, first).
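
The split between transformations and actions can be illustrated with a small pure-Python model. `ToyRDD` is a hypothetical class written for this sketch, not part of Spark; it borrows the method names from the slide and mimics one property of real RDDs, namely that transformations only record what to do, while actions trigger the computation.

```python
from functools import reduce as fold
from itertools import chain


class ToyRDD:
    """Toy stand-in for an RDD: stores a recipe for the data, not the data."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable returning a fresh iterator

    @classmethod
    def from_list(cls, items):
        data = tuple(items)  # read-only source
        return cls(lambda: iter(data))

    # Transformations: build a new ToyRDD, run nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def union(self, other):
        return ToyRDD(lambda: chain(self._compute(), other._compute()))

    # Actions: run the recipe and return a plain value.
    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return fold(f, self._compute())

    def first(self):
        return next(self._compute())


calls = []  # records every invocation of double()


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD.from_list([1, 2, 3, 4, 5])
doubled = numbers.map(double)            # transformation: nothing runs
big = doubled.filter(lambda x: x > 4)    # transformation: nothing runs
calls_before_action = len(calls)

big_count = big.count()                  # action: the whole chain runs now
calls_after_action = len(calls)

total = numbers.union(big).reduce(lambda a, b: a + b)
first_big = big.first()

print(calls_before_action, big_count, calls_after_action, total, first_big)
```

Note that each action in this toy model recomputes the chain from the source; Spark behaves similarly unless an RDD is explicitly cached or persisted.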

7. Data Transformations and Actions [Diagram labels: Input Data on Disk; RDD 1, RDD 2, RDD 3, RDD X; transformations T1, T2, T3; action A1; Output Result on Disk]
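
A disk-to-disk flow of this kind can be sketched in plain Python as follows. The file names, the sample records, and the exact wiring of T1, T2, and T3 are assumptions made for illustration, since the original diagram does not survive in this transcript. For brevity each step is computed eagerly here, whereas Spark would defer the work until the action A1 runs.

```python
import os
import tempfile


def load_lines(path):
    """Load step: read records from storage (stand-in for creating RDD 1)."""
    with open(path, encoding="utf-8") as f:
        return tuple(line.rstrip("\n") for line in f)


def save_lines(records, path):
    """Action A1: write the final records back to storage."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(f"{record}\n")


# Hypothetical input data placed in a temporary directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "input.txt")
out_path = os.path.join(workdir, "output.txt")
with open(in_path, "w", encoding="utf-8") as f:
    f.write("spark 3\nhadoop 1\nspark 2\nhive 4\n")

rdd1 = load_lines(in_path)                                   # input data on disk -> RDD 1
rdd2 = tuple((w, int(n)) for w, n in (line.split() for line in rdd1))  # T1: parse
rdd3 = tuple(pair for pair in rdd2 if pair[0] == "spark")    # T2: filter
rdd_x = tuple(n for _, n in rdd3)                            # T3: keep the counts
save_lines([sum(rdd_x)], out_path)                           # A1: output result on disk

with open(out_path, encoding="utf-8") as f:
    result = f.read().strip()
print(result)
```

Every intermediate dataset is a new tuple derived from the previous one, so the inputs to each step remain unchanged, as in the RDD model.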

8. Demo Learn from the code samples and examples provided with Spark. Develop Apache Spark apps on the Windows platform: http://tinyurl.com/nf9bwey
