PySpark
Next generation cloud
computing engine using Python
Wisely Chen
Yahoo! Taiwan Data team
Who am I?
• Wisely Chen ( thegiive@gmail.com )
• Sr. Engineer in Yahoo! Taiwan data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, 2014, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
• Data Highway
• BI Report
• Serving API
• Data Mart
• ETL / Forecast
• Machine Learning
Agenda
• What is Spark?
• What is PySpark?
• How to write PySpark applications?
• PySpark demo
• Q&A
What is Spark?
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce → Spark
• The leading candidate for “successor to
MapReduce” today is Apache Spark
• No vendor — no new project — is likely to catch
up. Chasing Spark would be a waste of time,
and would delay availability of real-time analytic
and processing services for no good reason.
• From Cloudera CTO http://0rz.tw/y3OfM
What is Spark?
Spark is 3X~25X faster than MapReduce
From Matei’s paper: http://0rz.tw/VVqgP

Running time (s):
Logistic regression: MapReduce 76, Spark 3
KMeans: MapReduce 106, Spark 33
PageRank: MapReduce 171, Spark 23
Most machine learning
algorithms need iterative computing
Diagram: PageRank on four nodes (a, b, c, d). Every rank starts at 1.0; each iteration computes temporary rank contributions and updated ranks (e.g. 1.85 / 1.0 / 0.58 / 0.58 after one iteration, then 1.31 / 1.72 / 0.39 / 0.58 after the next).
HDFS is 100x slower than memory
MapReduce: Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → Tmp (HDFS) → … → Iter N
Spark: Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → Tmp (Mem) → … → Iter N

PageRank on 1 billion URL records:
1st iteration (HDFS) takes 200 sec
2nd iteration (mem) takes 7.4 sec
3rd iteration (mem) takes 7.7 sec
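The later iterations are fast because the working set stays in worker memory. A minimal sketch of that pattern, assuming an input file of "source destination" link pairs on HDFS (the path and iteration count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="PageRankSketch")

# Each line: "source destination". groupByKey builds (source, [destinations]);
# cache() keeps it in worker memory, so only the 1st iteration touches HDFS.
lines = sc.textFile("hdfs://...")
links = lines.map(lambda line: tuple(line.split())).groupByKey().cache()
ranks = links.mapValues(lambda _: 1.0)

for i in range(10):
    # links is served from memory after the first pass
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.take(5))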
What is PySpark?
Spark API
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
PySpark
• Processing is done in Python
• CPython
• Python libs (NumPy, SciPy, …)
• Storage and data transfer are handled by Spark
• HDFS access / networking / fault recovery
• Scheduling / broadcast / checkpointing
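Because the per-record processing runs in CPython worker processes, ordinary Python libraries work inside transformations. A small illustrative sketch (the data is made up):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="NumPyOnWorkers")

# The lambda executes in CPython worker processes, so NumPy is available there.
vectors = sc.parallelize([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
norms = vectors.map(lambda v: float(np.linalg.norm(np.array(v)))).collect()
print(norms)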
Spark Architecture
Diagram: a Client submits the job to the Master (JVM); each Worker runs a Task against its local data block (Block1 / Block2 / Block3).
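As code, the diagram's client side is just a driver program that connects to the master, which then schedules tasks on the workers. The master URL below is an illustrative assumption (on YARN it would be a yarn-* master instead):

from pyspark import SparkConf, SparkContext

# Hypothetical master URL; the master schedules tasks on the workers.
conf = SparkConf().setAppName("ArchitectureDemo") \
                  .setMaster("spark://master-host:7077")
sc = SparkContext(conf=conf)

# Each HDFS block becomes a partition, and each partition is processed
# as a task on some worker.
data = sc.textFile("hdfs://...")
print(data.count())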
PySpark Architecture
Diagram: the driver's Python code talks to the Master (JVM); on every Worker a JVM process pairs with a Python process (Py Proc) to handle its local block (Block1 / Block2 / Block3).
PySpark Architecture
Diagram: the driver's Python code communicates with the Master (JVM) via a Py4J socket and the local filesystem; the Workers (JVM) hold the data blocks (Block1 / Block2 / Block3).
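One way to see the Py4J link from a live PySpark session: the SparkContext holds a Py4J gateway to the JVM (sc._jvm is an internal handle, used here purely for illustration):

from pyspark import SparkContext

sc = SparkContext(appName="Py4JPeek")

# Calls on sc._jvm travel over the Py4J socket to the driver-side JVM.
millis = sc._jvm.java.lang.System.currentTimeMillis()
print(millis)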
PySpark Architecture
Python functions and closures are serialized using PiCloud's CloudPickle module.
Diagram: the serialized Py code is shipped from the driver's Python code through the Master (JVM) to each Worker (JVM) along with its block (Block1 / Block2 / Block3).
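A rough sketch of the same idea using the standalone cloudpickle package (PySpark bundles its own copy); the closure below is just an example:

import pickle
import cloudpickle  # standalone release of the module PySpark bundles

factor = 3
scale = lambda x: x * factor  # a closure: plain pickle cannot serialize a lambda

payload = cloudpickle.dumps(scale)  # driver side: serialize the function by value
restored = pickle.loads(payload)    # worker side: plain pickle can load it
print(restored(10))                 # 30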
PySpark Architecture
On worker launch, the Workers start Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
Diagram: each Worker (JVM) pairs with a Python process (Py Proc) for its block (Block1 / Block2 / Block3), so the cluster ends up running a lot of Python processes.
How to write PySpark applications?
Python Word Count
• file = spark.textFile("hdfs://...")
• counts = file.flatMap(lambda line: line.split(" ")) 
• .map(lambda word: (word, 1)) 
• .reduceByKey(lambda a, b: a + b)
• counts.saveAsTextFile("hdfs://...")
Access data via
Spark API
Process via Python
Python Word Count
• counts = file.flatMap(lambda line: line.split(" ")) 
You can find the
latest Spark
documentation,
including the
guide
Original text → List
['You', 'can', 'find', 'the',
'latest', 'Spark',
'documentation,',
'including', 'the', 'guide']
Python Word Count
• .map(lambda word: (word, 1))
List → Tuple List
['You', 'can', 'find', 'the',
'latest', 'Spark',
'documentation,',
'including', 'the', 'guide']
[('You', 1), ('can', 1),
('find', 1), ('the', 1), ...,
('the', 1), ('guide', 1)]
Python Word Count
• .reduceByKey(lambda a, b: a + b)
Tuple List → Reduce → Tuple List
[('You', 1),
('can', 1),
('find', 1),
('the', 1),
...
('the', 1),
('guide', 1)]
[('You', 1),
('can', 1),
('find', 1),
('the', 2),
...
('guide', 1)]
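Putting the three steps together, a minimal runnable word count (here `spark` is a SparkContext, as on the earlier slide; the HDFS paths are placeholders):

from pyspark import SparkContext

spark = SparkContext(appName="WordCount")

file = spark.textFile("hdfs://...")                   # read lines from HDFS
counts = (file.flatMap(lambda line: line.split(" "))  # line -> words
              .map(lambda word: (word, 1))            # word -> (word, 1)
              .reduceByKey(lambda a, b: a + b))       # sum the 1s per word
counts.saveAsTextFile("hdfs://...")                   # write results back to HDFS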
Can I use Python ML libs on PySpark?
PySpark + scikit-learn
• sgd = lm.SGDClassifier(loss='log')
• for ii in range(ITERATIONS):
• sgd = sc.parallelize(…)
• .mapPartitions(lambda x: …)
• .reduce(lambda x, y: merge(x, y))
Use scikit-learn in single mode (on the master)
Cluster operation
Use scikit-learn functions in cluster mode, each working on a partition of the data
Source code is from: http://0rz.tw/o2CHT
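The snippet on the slide is abbreviated; a fuller sketch of the same pattern, with hypothetical helpers train_partition and merge and toy in-memory data, might look like this (the original code is at the link above):

import numpy as np
from sklearn import linear_model as lm
from pyspark import SparkContext

sc = SparkContext(appName="SklearnOnSpark")

def train_partition(records):
    # Fit a local SGDClassifier on one partition's (features, label) pairs.
    data = list(records)
    X = np.array([f for f, _ in data])
    y = np.array([lbl for _, lbl in data])
    clf = lm.SGDClassifier(loss="log")  # renamed to "log_loss" in newer scikit-learn
    clf.fit(X, y)
    yield clf

def merge(a, b):
    # Naive merge: average the coefficients of two partial models.
    a.coef_ = (a.coef_ + b.coef_) / 2.0
    a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
    return a

# Toy data; each partition trains its own model, then the models are merged.
points = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.2, 0.9], 0), ([0.9, 0.1], 1)]
sgd = sc.parallelize(points, 2).mapPartitions(train_partition).reduce(merge)
print(sgd.coef_)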
PySpark supports MLlib
• MLlib is Spark's built-in machine learning library
• Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")
• Check it out on http://0rz.tw/M35Rz
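A minimal runnable version of that call, close in spirit to the MLlib clustering example; the input is assumed to be one space-separated numeric vector per line (the path is a placeholder):

from numpy import array
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="KMeansExample")

# One space-separated numeric vector per line.
data = sc.textFile("hdfs://...")
parsedData = data.map(lambda line: array([float(x) for x in line.split(" ")]))

# k=2 clusters; the `runs` parameter exists in Spark 1.x and was removed later.
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30,
                        initializationMode="random")
print(clusters.centers)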
DEMO 1 :
Recommendation using ALS
(Data : MovieLens)
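Not the demo code itself, but a sketch of what an MLlib ALS recommender on MovieLens-style input usually looks like (the userID::movieID::rating::timestamp field order and the paths are assumptions):

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="ALSSketch")

# MovieLens-style lines: userID::movieID::rating::timestamp
lines = sc.textFile("hdfs://...")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

# Train a matrix factorization model: rank 10, 10 iterations.
model = ALS.train(ratings, rank=10, iterations=10)

# Predict the rating user 1 would give movie 1.
print(model.predict(1, 1))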
DEMO 2:
Interactive Shell
Conclusion
Join Us
• Our team’s work has been highlighted at top conferences worldwide
• Hadoop Summit San Jose 2013
• Hadoop Summit Amsterdam 2014
• MSTR World Las Vegas 2014
• SparkSummit San Francisco 2014
• Jenkins Conf Palo Alto 2013
Thank you
