PySpark
Next generation cloud
computing engine using Python
Wisely Chen
Yahoo! Taiwan Data team
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, 2014, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning
Agenda
• What is Spark?
• What is PySpark?
• How to write PySpark applications?
• PySpark demo
• Q&A
What is Spark?
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce → Spark
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From Cloudera CTO http://0rz.tw/y3OfM
What is Spark?
Spark is 3X~25X faster than MapReduce
From Matei’s paper: http://0rz.tw/VVqgP

Running time (s):
Logistic regression: MapReduce 76, Spark 3
KMeans: MapReduce 106, Spark 33
PageRank: MapReduce 171, Spark 23
Most machine learning
algorithms need iterative computing
Diagram: PageRank on four nodes (a, b, c, d). Every rank starts at 1.0; each iteration computes temporary rank contributions and updated ranks (e.g. 1.85 / 1.0 / 0.58 / 0.58 after one iteration, then 1.31 / 1.72 / 0.39 / 0.58 after the next).
HDFS is 100x slower than memory
MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N
Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N

PageRank on 1 billion URL records:
1st iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
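The later iterations are fast because the working set stays in worker memory. A minimal sketch of that pattern, assuming an input file of "source destination" link pairs on HDFS (the path and iteration count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="PageRankSketch")

# Each line: "source destination". groupByKey builds (source, [destinations]);
# cache() keeps it in worker memory, so only the 1st iteration touches HDFS.
lines = sc.textFile("hdfs://...")
links = lines.map(lambda line: tuple(line.split())).groupByKey().cache()
ranks = links.mapValues(lambda _: 1.0)

for i in range(10):
    # links is served from memory after the first pass
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.take(5))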
What is PySpark?
Spark API
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
PySpark
• Processing is done in Python
• CPython
• Python libs (NumPy, SciPy, …)
• Storage and data transfer are handled by Spark
• HDFS access / networking / fault recovery
• Scheduling / broadcast / checkpointing
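Because the per-record processing runs in CPython worker processes, ordinary Python libraries work inside transformations. A small illustrative sketch (the data is made up):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="NumPyOnWorkers")

# The lambda executes in CPython worker processes, so NumPy is available there.
vectors = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
norms = vectors.map(lambda v: float(np.linalg.norm(np.array(v)))).collect()
print(norms)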
Spark Architecture
Diagram: a Client submits the job to the Master (JVM); each Worker runs a Task against its local data block (Block1 / Block2 / Block3).
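As code, the diagram's client side is just a driver program that connects to the master, which then schedules tasks on the workers. The master URL below is an illustrative assumption (on YARN it would be a yarn-* master instead):

from pyspark import SparkConf, SparkContext

# Hypothetical master URL; the master schedules tasks on the workers.
conf = SparkConf().setAppName("ArchitectureDemo") \
                  .setMaster("spark://master-host:7077")
sc = SparkContext(conf=conf)

# Each HDFS block becomes a partition, and each partition is processed
# as a task on some worker.
data = sc.textFile("hdfs://...")
print(data.count())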
PySpark Architecture
Diagram: the driver's Python code talks to the Master (JVM); on every Worker a JVM process pairs with a Python process (Py Proc) to handle its local block (Block1 / Block2 / Block3).
PySpark Architecture
Diagram: the driver's Python code communicates with the Master (JVM) via a Py4J socket and the local filesystem; the Workers (JVM) hold the data blocks (Block1 / Block2 / Block3).
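One way to see the Py4J link from a live PySpark session: the SparkContext holds a Py4J gateway to the JVM (sc._jvm is an internal handle, used here purely for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="Py4JPeek")

# Calls on sc._jvm travel over the Py4J socket to the driver-side JVM.
millis = sc._jvm.java.lang.System.currentTimeMillis()
print(millis)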
PySpark Architecture
Python functions and closures are serialized using PiCloud's CloudPickle module.
Diagram: the serialized Py code is shipped from the driver's Python code through the Master (JVM) to each Worker (JVM) along with its block (Block1 / Block2 / Block3).
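A rough sketch of the same idea using the standalone cloudpickle package (PySpark bundles its own copy); the closure below is just an example:

import pickle
import cloudpickle  # standalone release of the module PySpark bundles

factor = 3
scale = lambda x: x * factor  # a closure: plain pickle cannot serialize a lambda

payload = cloudpickle.dumps(scale)  # driver side: serialize the function by value
restored = pickle.loads(payload)    # worker side: plain pickle can load it
print(restored(10))                 # 30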
PySpark Architecture
On worker launch, the Workers start Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
Diagram: each Worker (JVM) pairs with a Python process (Py Proc) for its block (Block1 / Block2 / Block3), so the cluster ends up running a lot of Python processes.
How to write PySpark applications?
Python Word Count
• file = spark.textFile("hdfs://...")
• counts = file.flatMap(lambda line: line.split(" ")) 
• .map(lambda word: (word, 1)) 
• .reduceByKey(lambda a, b: a + b)
• counts.saveAsTextFile("hdfs://...")
Access data via
Spark API
Process via Python
Python Word Count
• counts = file.flatMap(lambda line: line.split(" ")) 
You can find the
latest Spark
documentation,
including the
guide
Original text → List
['You', 'can', 'find', 'the',
'latest', 'Spark',
'documentation,',
'including', 'the', 'guide']
Python Word Count
• .map(lambda word: (word, 1))
List → Tuple List
['You', 'can', 'find', 'the',
'latest', 'Spark',
'documentation,',
'including', 'the', 'guide']
[('You', 1), ('can', 1),
('find', 1), ('the', 1), ...,
('the', 1), ('guide', 1)]
Python Word Count
• .reduceByKey(lambda a, b: a + b)
Tuple List → Reduce → Tuple List
[('You', 1),
('can', 1),
('find', 1),
('the', 1),
...
('the', 1),
('guide', 1)]
[('You', 1),
('can', 1),
('find', 1),
('the', 2),
...
('guide', 1)]
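Putting the three steps together, a minimal runnable word count (here `spark` is a SparkContext, as on the earlier slide; the HDFS paths are placeholders):

from pyspark import SparkContext

spark = SparkContext(appName="WordCount")

file = spark.textFile("hdfs://...")                   # read lines from HDFS
counts = (file.flatMap(lambda line: line.split(" "))  # line -> words
              .map(lambda word: (word, 1))            # word -> (word, 1)
              .reduceByKey(lambda a, b: a + b))       # sum the 1s per word
counts.saveAsTextFile("hdfs://...")                   # write results back to HDFS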
Can I use Python ML libs on PySpark?
PySpark + scikit-learn
• sgd = lm.SGDClassifier(loss='log')
• for ii in range(ITERATIONS):
• sgd = sc.parallelize(…)
• .mapPartitions(lambda x: …)
• .reduce(lambda x, y: merge(x, y))
Use scikit-learn in single mode (on the master)
Cluster operation
Use scikit-learn functions in cluster mode, each working on a partition of the data
Source code is from: http://0rz.tw/o2CHT
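The snippet on the slide is abbreviated; a fuller sketch of the same pattern, with hypothetical helpers train_partition and merge and toy in-memory data, might look like this (the original code is at the link above):

import numpy as np
from sklearn import linear_model as lm
from pyspark import SparkContext

sc = SparkContext(appName="SklearnOnSpark")

def train_partition(records):
    # Fit a local SGDClassifier on one partition's (features, label) pairs.
    data = list(records)
    X = np.array([f for f, _ in data])
    y = np.array([lbl for _, lbl in data])
    clf = lm.SGDClassifier(loss="log")  # renamed to "log_loss" in newer scikit-learn
    clf.fit(X, y)
    yield clf

def merge(a, b):
    # Naive merge: average the coefficients of two partial models.
    a.coef_ = (a.coef_ + b.coef_) / 2.0
    a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
    return a

# Toy data; each partition trains its own model, then the models are merged.
points = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.2, 0.9], 0), ([0.9, 0.1], 1)]
sgd = sc.parallelize(points, 2).mapPartitions(train_partition).reduce(merge)
print(sgd.coef_)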
PySpark supports MLlib
• MLlib is Spark's built-in machine learning library
• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")
• Check it out on http://0rz.tw/M35Rz
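A minimal runnable version of that call, close in spirit to the MLlib clustering example; the input is assumed to be one space-separated numeric vector per line (the path is a placeholder):

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansExample")

# One space-separated numeric vector per line.
data = sc.textFile("hdfs://...")
parsedData = data.map(lambda line: array([float(x) for x in line.split(" ")]))

# k=2 clusters; the `runs` parameter exists in Spark 1.x and was removed later.
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                        initializationMode="random")
print(clusters.centers)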
DEMO 1 :
Recommendation using ALS
(Data : MovieLens)
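Not the demo code itself, but a sketch of what an MLlib ALS recommender on MovieLens-style input usually looks like (the userID::movieID::rating::timestamp field order and the paths are assumptions):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="ALSSketch")

# MovieLens-style lines: userID::movieID::rating::timestamp
lines = sc.textFile("hdfs://...")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Train a matrix factorization model: rank 10, 10 iterations.
model = ALS.train(ratings, rank=10, iterations=10)

# Predict the rating user 1 would give movie 1.
print(model.predict(1, 1))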
DEMO 2:
Interactive Shell
Conclusion
Join Us
• Our team’s work has been highlighted at top conferences worldwide
• Hadoop Summit San Jose 2013
• Hadoop Summit Amsterdam 2014
• MSTR World Las Vegas 2014
• SparkSummit San Francisco 2014
• Jenkins Conf Palo Alto 2013
Thank you
