Java/Scala Lab 2016. Alexander Konopko: Machine Learning in Spark.

16.4.16 Java/Scala Lab
Upcoming events: goo.gl/I2gJ4H

A brief introduction to Spark and to machine learning with Spark, using the example of selecting an advertising campaign for a target audience.

Published in: Software

  1. MACHINE LEARNING WITH APACHE SPARK
  2. HADOOP MR: Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk.
  3. Spark is much faster and more convenient than Hadoop • Caches data in memory • Pipelines calculations through RDDs with optional caching • Organizes calculations as a DAG • Provides user-friendly Scala, Python and Java APIs • Ships with a number of useful Spark libraries: GraphX, MLlib, etc.
  4. SPARK ARCHITECTURE
  5. SPARK GENERAL FLOW
  6. SPARK RDD (FAULT TOLERANT!)
  7. LOGISTIC REGRESSION PERFORMANCE
  8. WORDCOUNT WITH HADOOP
  9. WORDCOUNT WITH SPARK: It’s easier to develop for Spark.
  10. Spark also adds libraries for doing things like machine learning, streaming, graph programming and SQL.
  11. SOME ACTIONS AND TRANSFORMATIONS • Transformations: map(func), flatMap(func), groupByKey(), reduceByKey(func), mapValues(func), sample(…), union(other), distinct(), sortByKey(), … • Actions: reduce(func), collect(), count(), first(), take(n), saveAsTextFile(path), countByKey(), foreach(func), …
  12. SPARK STREAMING
  13. MACHINE LEARNING: Types of Machine Learning
  14. ALS ALGORITHM
  15. ALS MODEL AND ALGORITHM • Model: ratings as the product of the User (A) and Movie Feature (B) matrices, of size U×K and M×K (in this talk's example, advertising campaigns take the place of movies) • Alternating Least Squares (ALS): start with random A and B vectors • Optimize user vectors (A) based on campaigns • Optimize campaign vectors (B) based on users • Repeat until converged
  16. ALS ALGORITHM
  17. CREATE INPUT RDDs
  18. SPLIT INTO TRAINING, VALIDATION AND TEST DATASETS; FIND THE OPTIMAL RANK AND NUMBER OF ITERATIONS
  19. RMSE (ROOT MEAN SQUARE ERROR) CALCULATION METHOD; EVALUATE THE BEST MODEL ON THE TEST SET
  20. CREATE A NAIVE BASELINE AND COMPARE IT WITH THE BEST MODEL OUTPUT
  21. RECOMMEND SOME NEW PRODUCTS FOR USER WITH ID #150 AND SOME OUTPUT...
  22. USER ALREADY REACTED TO SOME CAMPAIGNS
  23. USE THIS INFORMATION FOR PREDICTION AND SOME OUTPUT...
  24. Q&A
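
The ALS loop from slide 15, the RMSE check from slide 19 and the naive baseline from slide 20 can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the slides and not Spark MLlib's ALS: the function names, the regularization term `reg`, the rank, and the toy (user, campaign, rating) triples are all made up for the example.

```python
import math
import random


def solve(matrix, rhs):
    """Solve a small dense linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def update_factors(ratings_by_row, fixed, rank, reg):
    """Least-squares update of one side's vectors while the other side is held fixed."""
    updated = {}
    for row_id, observed in ratings_by_row.items():
        # Normal equations: (sum f f^T + reg * I) x = sum rating * f
        lhs = [[reg if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        rhs = [0.0] * rank
        for col_id, rating in observed:
            f = fixed[col_id]
            for i in range(rank):
                rhs[i] += rating * f[i]
                for j in range(rank):
                    lhs[i][j] += f[i] * f[j]
        updated[row_id] = solve(lhs, rhs)
    return updated


def rmse(A, B, ratings):
    """Root mean square error of the predictions A[u] . B[c] against the given ratings."""
    se = sum((r - sum(a * b for a, b in zip(A[u], B[c]))) ** 2 for u, c, r in ratings)
    return math.sqrt(se / len(ratings))


def als(ratings, rank=2, iterations=20, reg=0.1, seed=0):
    """Alternate between updating user vectors (A) and campaign vectors (B)."""
    rng = random.Random(seed)
    by_user, by_campaign = {}, {}
    for u, c, r in ratings:
        by_user.setdefault(u, []).append((c, r))
        by_campaign.setdefault(c, []).append((u, r))
    # Random start; the first half-step immediately recomputes A from B.
    A = {u: [rng.random() for _ in range(rank)] for u in by_user}
    B = {c: [rng.random() for _ in range(rank)] for c in by_campaign}
    for _ in range(iterations):  # fixed iteration count instead of a convergence test
        A = update_factors(by_user, B, rank, reg)
        B = update_factors(by_campaign, A, rank, reg)
    return A, B


# Hypothetical toy data: (user id, campaign id, rating)
ratings = [
    (1, 10, 5.0), (1, 11, 1.0), (2, 10, 4.0), (2, 12, 5.0),
    (3, 11, 5.0), (3, 12, 1.0), (4, 10, 1.0), (4, 11, 4.0),
]
A, B = als(ratings)
model_rmse = rmse(A, B, ratings)

# Naive baseline: always predict the mean rating.
mean_rating = sum(r for _, _, r in ratings) / len(ratings)
baseline_rmse = math.sqrt(sum((r - mean_rating) ** 2 for _, _, r in ratings) / len(ratings))
print(model_rmse, baseline_rmse)
```

For brevity the sketch scores the model on the same ratings it was fitted on. The slides instead split the data into training, validation and test sets, pick the rank and iteration count on the validation set, and report RMSE on the test set.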

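
The word count from slide 9 chains three of the operations listed on slide 11: flatMap, map and reduceByKey. The sketch below mimics that chain in plain Python so the data flow is visible without a cluster. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this example, not Spark APIs, and the input lines are invented.

```python
from itertools import chain


def flat_map(func, items):
    """Apply func to each item and flatten the results, in the spirit of flatMap(func)."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values of each key with func, in the spirit of reduceByKey(func)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Made-up input standing in for the lines of a text file.
lines = ["spark keeps data in memory", "hadoop writes data to disk"]

words = flat_map(lambda line: line.split(), lines)   # one record per word
pairs = [(word, 1) for word in words]                # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the ones per word

for word, count in sorted(counts.items()):
    print(word, count)
```

In Spark the same three steps would run as lazy transformations over a distributed RDD, and nothing would be computed until an action such as collect() or saveAsTextFile(path) is called. Here everything is eager and in memory.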