PySpark

Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad
Ad

Check these out next

1 of 33 Ad

More Related Content

Slideshows for you (20)

Similar to PySaprk (20)

Advertisement

Recently uploaded (20)

PySaprk

  1. PySpark: next generation cloud computing engine using Python. Wisely Chen, Yahoo! Taiwan data team
  2. Who am I? • Wisely Chen (thegiive@gmail.com) • Sr. Engineer in Yahoo! [Taiwan] data team • Loves to promote open source tech • Hadoop Summit 2013 San Jose • Jenkins Conf 2013 Palo Alto • Coscup 2006, 2012, 2013 • OSDC 2007, 2014 • Webconf 2013 • PHPConf 2012 • RubyConf 2012
  3. Taiwan Data Team [Diagram: data pipeline with a Data Highway feeding ETL / Forecast, a Data Mart, Machine Learning, BI Reports, and a Serving API]
  4. Agenda • What is Spark? • What is PySpark? • How to write PySpark applications? • PySpark demo • Q&A
  5. What is Spark? [Diagram: Spark is a computing engine that, like MapReduce, runs on YARN for resource management and HDFS for storage]
  6. What is Spark? • "The leading candidate for 'successor to MapReduce' today is Apache Spark. No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." • From the Cloudera CTO: http://0rz.tw/y3OfM
  7. Spark is 3X~25X faster than MapReduce (from Matei's paper: http://0rz.tw/VVqgP) [Chart: running time in seconds. Logistic regression: MR 76 vs Spark 3; KMeans: MR 106 vs Spark 33; PageRank: MR 171 vs Spark 23]
  8. Most machine learning algorithms need iterative computing
  9. [Diagram: PageRank on four nodes a, b, c, d. Every rank starts at 1.0; each iteration turns the current ranks into a temporary result that feeds the next (e.g. a: 1.0 → 1.85 → 1.31 over the 1st and 2nd iterations)]
  10. HDFS is 100x slower than memory [Diagram: MapReduce writes each iteration's temporary result back to HDFS (Input (HDFS) → Iter 1 → Tmp (HDFS) → Iter 2 → … → Iter N), while Spark keeps it in memory (Input (HDFS) → Iter 1 → Tmp (Mem) → Iter 2 → … → Iter N)]
  11. PageRank on 1 billion URL records: the first iteration (HDFS) takes 200 sec, the 2nd iteration (memory) takes 7.4 sec, and the 3rd iteration (memory) takes 7.7 sec
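
That speedup comes from caching: only the first pass reads HDFS, and later iterations reuse the RDD already sitting in memory. Below is a minimal PySpark sketch of that pattern, not the benchmark's actual code; the "links.txt" path and its "url neighbor" line format are assumptions.

    from pyspark import SparkContext

    sc = SparkContext(appName="pagerank-sketch")

    # Each line of links.txt is assumed to hold "url neighbor".
    links = sc.textFile("hdfs://.../links.txt") \
              .map(lambda line: tuple(line.split())) \
              .groupByKey() \
              .cache()   # first action reads HDFS; later iterations hit memory
    ranks = links.mapValues(lambda _: 1.0)

    for _ in range(3):
        # Each page splits its rank evenly among its outgoing links.
        contribs = links.join(ranks).flatMap(
            lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

    print(ranks.take(5))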
  12. What is PySpark?
  13. Spark API • Multi-language API • JVM: Scala, Java • PySpark: Python
  14. PySpark • Processing runs in Python (CPython), so Python libs (NumPy, SciPy, …) are available • Storage and data transfer stay in Spark: HDFS access / networking / fault recovery • Scheduling / broadcast / checkpointing
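
A quick illustration of the first point: because workers run CPython, a library such as NumPy can be used inside transformations. The sketch below assumes NumPy is installed on every worker node.

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="numpy-in-pyspark")

    # Distribute a few vectors and compute their norms with NumPy on the workers.
    vectors = sc.parallelize([np.array([3.0, 4.0]), np.array([6.0, 8.0])])
    norms = vectors.map(lambda v: float(np.linalg.norm(v))).collect()
    print(norms)   # [5.0, 10.0]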
  15. Spark Architecture [Diagram: a Client submits a job to the Master (JVM); the Master schedules a Task on each Worker, and each Worker reads its local data block (Block1, Block2, Block3)]
  16. PySpark Architecture [Diagram: the Master stays a JVM; each Worker (JVM) launches a Python process (Py Proc) that runs the user's Python code against its block]
  17. PySpark Architecture [Diagram: the driver's Python code talks to the JVM Master through Py4J; workers exchange data with their Python processes via sockets and the local FS]
  18. PySpark Architecture [Diagram: the driver ships serialized Python code (Py code) to every worker] Python functions and closures are serialized using PiCloud's CloudPickle module
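
To make that serialization step concrete, here is a small sketch using the standalone cloudpickle package; PySpark bundles its own copy, so this approximates what the driver does rather than its exact code path.

    import pickle
    import cloudpickle   # pip install cloudpickle

    multiplier = 3
    closure = lambda x: x * multiplier      # a closure over driver-side state

    payload = cloudpickle.dumps(closure)    # bytes the driver would ship out
    restored = pickle.loads(payload)        # a worker rebuilds the function
    print(restored(7))                      # 21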
  19. PySpark Architecture [Diagram: one Python process per worker] On worker launch, the JVM spawns Python subprocesses and communicates with them using pipes, sending the user's code and the data to be processed
  20. A lot of Python processes
  21. How to write a PySpark application?
  22. Python Word Count (access data via the Spark API, process via Python):
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
  23. Python Word Count • counts = file.flatMap(lambda line: line.split(" ")) turns the original text "You can find the latest Spark documentation, including the guide" into the list ['You', 'can', 'find', 'the', 'latest', 'Spark', 'documentation,', 'including', 'the', 'guide']
  24. Python Word Count • .map(lambda word: (word, 1)) turns that list into a list of tuples: [('You', 1), ('can', 1), ('find', 1), ('the', 1), …, ('the', 1), ('guide', 1)]
  25. Python Word Count • .reduceByKey(lambda a, b: a + b) reduces the tuple list to per-word counts: [('You', 1), ('can', 1), ('find', 1), ('the', 2), …, ('guide', 1)]
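
The three steps above can be reproduced locally so each stage is inspectable; a minimal sketch in local mode, using the sample sentence from slide 23:

    from pyspark import SparkContext

    sc = SparkContext("local", "wordcount-steps")
    file = sc.parallelize(
        ["You can find the latest Spark documentation, including the guide"])

    words = file.flatMap(lambda line: line.split(" "))
    print(words.collect())    # ['You', 'can', 'find', ..., 'guide']

    pairs = words.map(lambda word: (word, 1))
    print(pairs.collect())    # [('You', 1), ('can', 1), ...]

    counts = pairs.reduceByKey(lambda a, b: a + b)
    print(counts.collect())   # includes ('the', 2)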
  26. Can I use Python ML libs with PySpark?
  27. PySpark + scikit-learn • sgd = lm.SGDClassifier(loss='log') (scikit-learn in single mode, on the master) • for ii in range(ITERATIONS): sgd = sc.parallelize(…).mapPartitions(lambda x: …).reduce(lambda x, y: merge(x, y)) (cluster operation: scikit-learn functions run in cluster mode, each dealing with partial data) • Source code from: http://0rz.tw/o2CHT
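
A fuller sketch of that pattern, under assumptions: train one SGDClassifier per partition with mapPartitions, then combine the fitted models. The merge() helper and its coefficient-averaging strategy are illustrative, not the linked source's exact code.

    import numpy as np
    from sklearn import linear_model as lm
    from pyspark import SparkContext

    def train_partition(iterator):
        # Fit a local model on this partition's (features, label) pairs.
        data = list(iterator)
        X = np.array([features for features, _ in data])
        y = np.array([label for _, label in data])
        clf = lm.SGDClassifier(loss='log')
        clf.fit(X, y)
        yield clf

    def merge(a, b):
        # Illustrative merge: average the two models' parameters.
        a.coef_ = (a.coef_ + b.coef_) / 2.0
        a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
        return a

    sc = SparkContext(appName="sgd-sketch")
    points = [(np.array([0.0, 1.0]), 0), (np.array([1.0, 0.0]), 1)] * 100
    sgd = sc.parallelize(points, 4) \
            .mapPartitions(train_partition) \
            .reduce(merge)
    print(sgd.coef_)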
  28. PySpark supports MLlib • MLlib is Spark's own machine learning lib • Example: KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random") • Check it out at http://0rz.tw/M35Rz
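
A runnable version of that example, close to the MLlib clustering docs of the era; the "kmeans_data.txt" path and its space-separated-floats format are assumptions:

    from numpy import array
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="kmeans-sketch")

    # Each line of kmeans_data.txt is assumed to hold space-separated floats.
    data = sc.textFile("kmeans_data.txt")
    parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

    clusters = KMeans.train(parsedData, 2, maxIterations=10,
                            runs=30, initializationMode="random")
    print(clusters.predict(parsedData.first()))   # cluster id of the first point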
  29. DEMO 1: Recommendation using ALS (data: MovieLens)
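
The demo's own code isn't reproduced in the deck; here is a hedged sketch of MLlib ALS on MovieLens-style ratings, where the "ratings.dat" path and the "::" separator are assumptions based on the MovieLens file format:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS

    sc = SparkContext(appName="als-sketch")

    # MovieLens ratings.dat lines look like "userId::movieId::rating::timestamp".
    lines = sc.textFile("ratings.dat")
    ratings = lines.map(lambda l: l.split("::")) \
                   .map(lambda p: (int(p[0]), int(p[1]), float(p[2])))

    model = ALS.train(ratings, rank=10, iterations=10)
    print(model.predict(1, 1))   # predicted rating of movie 1 by user 1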
  30. DEMO 2: Interactive shell
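
Nothing from the live demo survives in the deck; a typical interactive session looks like the sketch below (the launch path depends on your Spark installation, and sc is predefined by the shell):

    $ ./bin/pyspark
    >>> rdd = sc.parallelize(range(10))
    >>> rdd.filter(lambda x: x % 2 == 0).collect()
    [0, 2, 4, 6, 8]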
  31. Conclusion
  32. Join Us • Our team's work has been highlighted at top conferences worldwide • Hadoop Summit San Jose 2013 • Hadoop Summit Amsterdam 2014 • MSTR World Las Vegas 2014 • Spark Summit San Francisco 2014 • Jenkins Conf Palo Alto 2013
  33. Thank you
