Introduction to Spark

A big data processing tool built with Scala that runs on the JVM
ADB 2017
Yen Hao Huang

Big Data
● 4Vs
○ Volume/Variety/Velocity/Veracity
Due to the rise of Big Data, faster tools are required for
processing data.

Hadoop
● A platform to store and process large scale data
● Features
○ Scalable
○ Economical : many cheap servers
○ Flexible : schema-less
○ Reliable : replicas

Hadoop MapReduce
● Map
- Divide a job into many small tasks and distribute them to servers
● Reduce
- Summarize the results from those servers
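The Map and Reduce steps above can be sketched in plain Python with a word count. This is only an illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each text chunk stands in for a task that would run on a separate server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: summarize the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk is an independent small task (assumption: one per server).
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


counts = word_count(["big data big tools", "data tools data"])
print(counts)
```

Because every chunk is mapped independently, the map work can be spread over many machines, and only the grouped pairs need to be brought together for the reduce step.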

Hadoop MapReduce
Figure Reference

Hadoop - Bottleneck
● File I/O - intermediate data is written to disk
[Figure: each iteration reads its input from disk and writes its output back to disk]

RDD
In-memory computation framework

RDD (Resilient Distributed Dataset)
● Writes intermediate data to memory
● Up to 10 - 100 times faster than Hadoop
[Figure: with RDDs, iterations read from and write to memory instead of disk]
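The difference between keeping intermediate data on disk and keeping it in memory can be sketched as follows. This is a toy model in plain Python, not Hadoop or Spark code: `step`, `run_via_disk`, and `run_in_memory` are hypothetical names, the "job" simply adds 1 to every value, and the counter only tracks reads and writes of intermediate data.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a hypothetical job: add 1 to every value."""
    return [v + 1 for v in values]


def run_via_disk(values, iterations, workdir):
    """Write each intermediate result to disk and read it back."""
    path = os.path.join(workdir, "intermediate.json")
    disk_ops = 0
    current = list(values)
    for _ in range(iterations):
        current = step(current)
        with open(path, "w") as f:  # write the intermediate result
            json.dump(current, f)
        with open(path) as f:  # the next iteration reads it back
            current = json.load(f)
        disk_ops += 2
    return current, disk_ops


def run_in_memory(values, iterations):
    """Keep intermediate results in memory between iterations."""
    current = list(values)
    for _ in range(iterations):
        current = step(current)
    return current, 0


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_via_disk([1, 2, 3], 3, workdir)
memory_result, memory_ops = run_in_memory([1, 2, 3], 3)
print(disk_result, disk_ops)
print(memory_result, memory_ops)
```

Both versions produce the same result. The disk version pays two file operations per iteration, which is the cost that grows with iterative workloads.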

Spark
● Features
○ Speed
○ Ease of use : Scala, Python, Java, R
○ Supports Hadoop : HDFS, MapReduce
○ Accessibility : runs on many platforms

RDD Features
● Computations
○ Transformation - Lazy computation
○ Action - Executes the computations
○ Persistence - Keeps the RDD in RAM / on disk
[Figure: a Transformation turns an RDD into another RDD; an Action turns an RDD into Output]
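The three kinds of computation above can be sketched with a small stand-in class. `MiniRDD` is hypothetical and is not Spark's API. It only mimics the behaviour described on the slide: `map` records work without running it, `collect` triggers the work, and `persist` keeps the computed data for reuse (here in memory only).

```python
class MiniRDD:
    """Hypothetical stand-in for an RDD; not Spark's API."""

    def __init__(self, compute):
        self._compute = compute  # function that produces the data when called
        self._cache = None
        self._persist = False

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        # Transformation: only records what to do and returns a new MiniRDD.
        return MiniRDD(lambda: [fn(x) for x in self._materialize()])

    def persist(self):
        # Persistence: keep the data once it has been computed.
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        data = self._compute()
        if self._persist:
            self._cache = data
        return data

    def collect(self):
        # Action: executes the recorded computations.
        return list(self._materialize())


calls = []


def square(x):
    calls.append(x)  # record every real evaluation
    return x * x


squared = MiniRDD.from_list([1, 2, 3]).map(square).persist()
calls_after_map = len(calls)  # nothing has run yet

first = squared.collect()
calls_after_first = len(calls)

second = squared.collect()
calls_after_second = len(calls)  # unchanged: the persisted data is reused

print(first, calls_after_map, calls_after_first, calls_after_second)
```

Without the `persist()` call, the second `collect()` would run `square` again on every element.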

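RDD Lineage
● Error Fixing
[Figure: RDD1 becomes RDD2 through a Transformation, followed by an Action. Labels in the figure: [ 2, 3 ], [ ?, ? ], [ 7, 10 ], f(x) = x2 +1, "Fix !". Lost data in RDD2 is fixed by recomputing it from RDD1.]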
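Error fixing through lineage can be sketched as below. `LineageRDD` is a hypothetical class, not Spark's implementation. Each dataset remembers its parent and the function that produced it, so lost data can be recomputed instead of restored from a replica. The input values and the doubling function are chosen for illustration and are not taken from the figure.

```python
class LineageRDD:
    """Hypothetical dataset that remembers how it was derived."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent  # the dataset this one was derived from
        self.fn = fn  # the transformation applied to the parent
        self.source = source  # original data, for a root dataset only
        self.partitions = None  # materialized data; may be lost

    def compute(self):
        if self.partitions is None:
            if self.parent is None:
                self.partitions = list(self.source)
            else:
                # Recompute from the parent by replaying the transformation.
                self.partitions = [self.fn(x) for x in self.parent.compute()]
        return list(self.partitions)

    def lose(self):
        """Simulate a failure that wipes the materialized data."""
        self.partitions = None


rdd1 = LineageRDD(source=[1, 2, 3])
rdd2 = LineageRDD(parent=rdd1, fn=lambda x: x * 2)

before = rdd2.compute()
rdd2.lose()  # RDD2's data is gone
after = rdd2.compute()  # rebuilt from RDD1 through the lineage
print(before, after)
```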
