PySpark


  1. PySpark: next-generation cloud computing engine using Python. Wisely Chen, Yahoo! Taiwan data team
  2. Who am I? • Wisely Chen (thegiive@gmail.com) • Sr. Engineer in Yahoo! [Taiwan] data team • Loves to promote open source tech • Speaker at Hadoop Summit 2013 San Jose, Jenkins Conf 2013 Palo Alto, Coscup 2006/2012/2013, OSDC 2007/2014, Webconf 2013, PHPConf 2012, RubyConf 2012
  3. Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning
  4. Agenda • What is Spark? • What is PySpark? • How to write PySpark applications? • PySpark demo • Q&A
  5. What is Spark? Spark is the computing engine in a stack where HDFS provides storage and YARN provides resource management; it sits in the slot MapReduce used to occupy.
  6. What is Spark? "The leading candidate for 'successor to MapReduce' today is Apache Spark. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." (Cloudera CTO, http://0rz.tw/y3OfM)
  7. Spark is 3x~25x faster than MapReduce (from Matei's paper: http://0rz.tw/VVqgP). Running time in seconds: logistic regression: MR 76 vs. Spark 3; KMeans: MR 106 vs. Spark 33; PageRank: MR 171 vs. Spark 23.
  8. Most machine learning algorithms need iterative computing.
  9. [Diagram] PageRank example on a four-node graph (a, b, c, d): every rank starts at 1.0, and each iteration produces temporary rank results that feed the next (e.g. a: 1.0 → 1.85 → 1.31).
  10. HDFS is 100x slower than memory. MapReduce writes each iteration's temporary results back to HDFS (Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → … → Iter N); Spark keeps them in memory (Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → … → Iter N).
  11. PageRank on 1 billion URL records: the first iteration (reading from HDFS) takes 200 sec; the 2nd iteration (from memory) takes 7.4 sec; the 3rd iteration (from memory) takes 7.7 sec.
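That speedup comes from caching the loop-invariant data in memory. A minimal PySpark sketch of the pattern, loosely modeled on the well-known Spark PageRank example (the input path, the 10-iteration count, and the damping constants are illustrative assumptions, not the benchmark code):

    # Hedged sketch: assumes a SparkContext `sc` and an input file of
    # "source destination" URL pairs (path and iteration count are made up).
    links = sc.textFile("hdfs://.../links.txt") \
              .map(lambda line: tuple(line.split())) \
              .groupByKey() \
              .cache()                      # keep the link graph in memory across iterations
    ranks = links.mapValues(lambda _: 1.0)  # every page starts at rank 1.0

    for i in range(10):
        # (url, (neighbors, rank)) -> one (neighbor, contribution) pair per out-link
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        # Only the first pass reads HDFS; later passes reuse the cached RDD in memory.
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda r: 0.15 + 0.85 * r)

    print(ranks.collect())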
  12. What is PySpark?
  13. Spark API • Multi-language API • JVM: Scala, Java • PySpark: Python
  14. PySpark • Processing happens in Python • CPython, so ordinary Python libs work (NumPy, SciPy…) • Spark handles data storage and transfer • HDFS access / networking / fault recovery • scheduling / broadcast / checkpointing
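Since the workers run real CPython, C-backed libraries work directly inside transformations. A small sketch to make that concrete (the data is made up; it assumes an existing SparkContext sc):

    import numpy as np

    # Each Python worker process imports NumPy and runs the C-backed code locally.
    rows = sc.parallelize([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    norms = rows.map(lambda row: float(np.linalg.norm(np.array(row))))
    print(norms.collect())  # e.g. [3.741..., 8.774...]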
  15. Spark architecture: the Client submits work to a Master (JVM), which assigns Tasks to Workers; each Worker processes its local data block (Block1, Block2, Block3).
  16. PySpark architecture: Python code goes to the Master (JVM); each Worker (JVM) spawns Python processes (Py Proc) to run it against its block.
  17. PySpark architecture: the Python driver talks to the JVM via Py4J; workers exchange data with Python over sockets and the local file system.
  18. PySpark architecture: Python functions and closures are serialized using PiCloud's CloudPickle module and shipped as Py code to every worker.
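To see why CloudPickle matters, note that the standard pickle module refuses to serialize lambdas and closures. A toy illustration of the capability, not PySpark's internal wiring:

    import pickle
    import cloudpickle

    multiplier = 3
    fn = lambda x: x * multiplier      # a closure over a local variable

    payload = cloudpickle.dumps(fn)    # standard pickle would fail on a lambda
    restored = pickle.loads(payload)   # roughly what a worker does on receipt
    print(restored(5))                 # 15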
  19. PySpark architecture: on launch, workers start Python subprocesses and communicate with them using pipes, sending the user's code and the data to be processed.
  20. A lot of Python processes.
  21. How to write a PySpark application?
  22. Python word count (access data via the Spark API, process via Python):

    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
  23. Python word count, step 1: counts = file.flatMap(lambda line: line.split(" ")) turns the original text "You can find the latest Spark documentation, including the guide" into the word list ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide'].
  24. Python word count, step 2: .map(lambda word: (word, 1)) turns the word list into a tuple list [('You', 1), ('can', 1), ('find', 1), ('the', 1), …, ('the', 1), ('guide', 1)].
  25. Python word count, step 3: .reduceByKey(lambda a, b: a + b) merges tuples with the same key, so [('You', 1), ('can', 1), ('find', 1), ('the', 1), …, ('the', 1), ('guide', 1)] becomes [('You', 1), ('can', 1), ('find', 1), ('the', 2), …, ('guide', 1)].
  26. Can I use Python ML libraries on PySpark?
  27. PySpark + scikit-learn • sgd = lm.SGDClassifier(loss='log') (use scikit-learn in single mode, on the master) • for ii in range(ITERATIONS): sgd = sc.parallelize(…).mapPartitions(lambda x: …).reduce(lambda x, y: merge(x, y)) (cluster operation: scikit-learn functions run in cluster mode, each dealing with partial data; a fuller sketch follows) • Source code from: http://0rz.tw/o2CHT
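The slide elides the details; here is one hedged reconstruction of that train-per-partition-then-merge pattern, where the toy dataset, the train_partition helper, and the coefficient-averaging merge() are all assumptions for illustration:

    import numpy as np
    from sklearn import linear_model as lm

    def train_partition(iterator):
        # Fit an independent SGD model on this partition's share of the data.
        X, y = zip(*iterator)
        sgd = lm.SGDClassifier(loss='log')  # 'log_loss' in modern scikit-learn
        sgd.fit(np.array(X), np.array(y))
        yield sgd

    def merge(a, b):
        # Assumed merge step: average the two models' coefficients.
        a.coef_ = (a.coef_ + b.coef_) / 2.0
        a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
        return a

    # Toy (features, label) pairs; each partition sees both classes.
    points = [([0.0, 1.0], 0), ([1.0, 0.0], 1)] * 100
    sgd = sc.parallelize(points, 4) \
            .mapPartitions(train_partition) \
            .reduce(merge)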
  28. PySpark supports MLlib • MLlib is Spark's built-in machine learning library • Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random") • Check it out at http://0rz.tw/M35Rz
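For context, a runnable KMeans sketch (the toy points are made up, and the parameter names follow the MLlib Python API of that era; the runs argument was removed in later Spark versions):

    from pyspark.mllib.clustering import KMeans

    # Toy 2-D points; a real job would parse vectors from HDFS text files.
    parsedData = sc.parallelize([
        [0.0, 0.0], [1.0, 1.0],
        [9.0, 8.0], [8.0, 9.0],
    ])

    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")
    print(clusters.centers)  # two cluster centers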
  29. DEMO 1: Recommendation using ALS (data: MovieLens)
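The demo itself is not reproduced in the transcript, but a recommendation like this would typically follow the standard MLlib ALS recipe. A minimal hedged sketch, with the MovieLens path, rank, and iteration count as assumptions:

    from pyspark.mllib.recommendation import ALS

    # MovieLens-style ratings "user::movie::rating::timestamp"; path is illustrative.
    ratings = sc.textFile("hdfs://.../ratings.dat") \
                .map(lambda line: line.split("::")) \
                .map(lambda f: (int(f[0]), int(f[1]), float(f[2])))

    model = ALS.train(ratings, rank=10, iterations=10)
    print(model.predict(1, 100))  # predicted rating of movie 100 by user 1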
  30. DEMO 2: Interactive shell
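An interactive session presumably looks something like the following (bin/pyspark is the stock launcher shipped with Spark; the sample computation is made up):

    $ ./bin/pyspark                          # run from a Spark distribution's root
    >>> nums = sc.parallelize(range(1000))   # `sc` is pre-created by the shell
    >>> nums.filter(lambda x: x % 2 == 0).count()
    500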
  31. Conclusion
  32. Join us • Our team's work has been highlighted at top conferences worldwide • Hadoop Summit San Jose 2013 • Hadoop Summit Amsterdam 2014 • MSTR World Las Vegas 2014 • Spark Summit San Francisco 2014 • Jenkins Conf Palo Alto 2013
  33. Thank you
