Introduction to ML with Apache Spark MLlib

Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background who leverage the Python (scikit-learn, Theano, TensorFlow, etc.) or R ecosystems and use specific tools like Matlab, Octave or similar. Of course, there is a big grain of truth in this statement, but we Java engineers can also take the best of the machine learning universe from an applied perspective by using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, such as regression, classification and clustering, widen your outlook, and use Apache Spark MLlib to distinguish pop music from heavy metal and simply have fun.

Source code: https://github.com/tmatyashovsky/spark-ml-samples

Design by Yarko Filevych: http://filevych.com/

Published in: Engineering


  1. 1. Introduction to ML with Apache Spark MLlib #javaone
  2. 2. https://ua.linkedin.com/in/tarasmatyashovsky 2
  3. 3. I am not a data science engineer 3
  4. 4. 4
  5. 5. lyrics genre 5
  6. 6. “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you” https://github.com/tmatyashovsky/spark-ml-samples 6
  7. 7. “I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you” https://github.com/tmatyashovsky/spark-ml-samples 7
  8. 8. 8
  9. 9. Look for particular words like “fear”, “fight”, “kill”, “devil”, “death”, etc.? Count the length of a verse? Count unique words in a verse?
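
The three hand-made heuristics from slide 9 could be sketched in plain Java roughly as follows. This is only an illustration: the class name, the method names and the exact marker word list are assumptions, not code from the talk or its repository.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper illustrating the naive, rule-based features from slide 9.
class NaiveLyricsFeatures {

    // Marker words quoted on the slide; a real list would be an assumption too.
    static final Set<String> MARKERS =
            new HashSet<>(Arrays.asList("fear", "fight", "kill", "devil", "death"));

    // Lower-cases a verse and splits it on anything that is not a letter or an apostrophe.
    static List<String> words(String verse) {
        return Arrays.stream(verse.toLowerCase(Locale.ROOT).split("[^a-z']+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    // Heuristic 1: how many words of the verse are marker words.
    static long markerWordCount(String verse) {
        return words(verse).stream().filter(MARKERS::contains).count();
    }

    // Heuristic 2: length of the verse in words.
    static int verseLength(String verse) {
        return words(verse).size();
    }

    // Heuristic 3: number of distinct words in the verse.
    static int uniqueWordCount(String verse) {
        return new HashSet<>(words(verse)).size();
    }
}
```

Features like these are easy to compute but have to be invented and tuned by hand, which is the motivation for learning them from data instead.
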
  10. 10. 10
  11. 11. 15-20 11
  12. 12. Machine learning is the study of computer algorithms that improve automatically through experience
  13. 13. Supervised learning, unsupervised learning, reinforcement learning
  14. 14. 14
  15. 15. Date & time, conference name, speaker, talk name, track, duration, type, overall impression, overall rating, number of slides, time spent on live coding, number of jokes, etc.
  16. 16. Learning algorithms: hypotheses, cost function, features, target variable, training example, training set
  17. 17. http://www.slideshare.net/liweiyang5/spark-mllib-training-material 17
  18. 18. Plot axes: number of jokes during a talk; speaker’s rating
  19. 19. 19
  20. 20. 20
  21. 21. 21
  22. 22. 22
  23. 23. 23
  24. 24. 24
  25. 25. Plot axes: number of jokes during a talk; impression (positive or negative)
  26. 26. 26
  27. 27. 27
  28. 28. 28
  29. 29. 29
  30. 30. 30
  31. 31. 31
  32. 32. Plot axes: number of jokes during a talk; time (min.) spent on live coding. Number of clusters: K = 2, K = 5
  33. 33. Initialize cluster centroids; assign each example to the closest cluster centroid; recalculate centroids as an average (mean) of the examples assigned to a cluster
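
The three steps on slide 33 are the core of K-means. A minimal plain-Java sketch is shown below, under a few stated assumptions: the first K points serve as initial centroids (the slide does not say how centroids are initialized), the assign and recalculate steps repeat until assignments stop changing or an iteration cap is hit, and all names are hypothetical. Spark MLlib ships its own distributed implementation; this sketch is not that code.

```java
import java.util.Arrays;

// Hypothetical, single-machine K-means sketch following the steps on slide 33.
class KMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int closestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Returns the cluster index of every point. Assumes k <= points.length and maxIterations >= 1.
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dimensions = points[0].length;

        // Step 1: initialize centroids (assumption: simply take the first k points).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iteration = 0; iteration < maxIterations; iteration++) {
            // Step 2: assign each example to the closest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int closest = closestCentroid(points[p], centroids);
                if (closest != assignment[p]) {
                    assignment[p] = closest;
                    changed = true;
                }
            }
            if (!changed) {
                break; // assignments are stable, so the centroids would not move either
            }

            // Step 3: recalculate each centroid as the mean of its assigned examples.
            double[][] sums = new double[k][dimensions];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dimensions; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep an empty cluster's centroid where it is
                }
                for (int d = 0; d < dimensions; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

With two-dimensional points such as (number of jokes, minutes of live coding), `KMeansSketch.cluster(points, 2, 10)` would split the talks into two groups.
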
  34. 34. 34
  35. 35. 35
  36. 36. 36
  37. 37. Collect a data set of lyrics: pop (Abba, Ace of Base, Backstreet Boys, Britney Spears, Christina Aguilera, Madonna, etc.) and heavy metal (Black Sabbath, In Flames, Iron Maiden, Metallica, Moonspell, Nightwish, Sentenced, etc.). Create a training set, i.e. label (0|1) + features. Train logistic regression (or another classification algorithm). https://github.com/tmatyashovsky/spark-ml-samples
  38. 38. https://github.com/tmatyashovsky/spark-ml-samples 38
  39. 39. 39
  40. 40. Bag of Words, TF-IDF, Word2Vec, GloVe http://spark.apache.org/docs/latest/ml-features.html#feature-extractors
  41. 41. Produces unique fixed-size dense vectors; captures semantic and morphologic similarity https://code.google.com/archive/p/word2vec/
  42. 42. Similar scores (cos ~ 1), opposite scores (cos ~ -1), unrelated scores (cos ~ 0) http://bionlp-www.utu.fi/wv_demo/ http://blog.christianperone.com/wp-content/uploads/2013/09/cosinesimilarityfq1.png
  43. 43. Verse and cosine distance: baby one more time (0.482028); crazy for you (0.437875); show me the meaning of being lonely (0.258147); highway to hell (-0.1120049); kill them all (-0.231876) https://github.com/tmatyashovsky/spark-ml-samples
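
The cosine score from slides 42 and 43 can be computed for any two equal-length vectors as the dot product divided by the product of their norms. The sketch below is a minimal plain-Java version with hypothetical names. The slide labels its numbers as a cosine distance; since they include negative values, they look like cosine similarities in the range [-1, 1], but the slide does not say how they were produced.

```java
// Hypothetical helper: cosine similarity between two dense vectors.
class CosineSimilarity {

    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine is undefined for a zero vector");
        }
        // ~1 for similar directions, ~-1 for opposite ones, ~0 for unrelated ones.
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```
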
  44. 44. https://github.com/tmatyashovsky/spark-ml-samples 44
  45. 45. Under-fitting (high bias), over-fitting (high variance), appropriate fitting http://mlwiki.org/index.php/Overfitting
  46. 46. K = 3: training set (66.6%), test set (33.3%)
  47. 47. K = 3: training set (66.6%), test set (33.3%)
  48. 48. K = 3: training set (33.3%), test set (33.3%), training set (33.3%)
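
Slides 46 to 48 illustrate K-fold cross-validation with K = 3: the data is cut into three parts, and each part takes a turn as the test set while the other two are used for training. A minimal sketch of how such index splits could be produced in plain Java follows. The names and the round-robin assignment of examples to folds are assumptions; Spark's own `CrossValidator` handles the splitting internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that splits example indices 0..n-1 into k folds.
class KFoldSketch {

    // For each fold, the indices held out as the test set (round-robin assignment).
    static List<List<Integer>> testFolds(int n, int k) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            folds.get(i % k).add(i);
        }
        return folds;
    }

    // The training indices for a given fold: everything that is not in its test set.
    static List<Integer> trainingIndices(int n, int k, int fold) {
        List<Integer> training = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (i % k != fold) {
                training.add(i);
            }
        }
        return training;
    }
}
```

With 9 examples and K = 3, every fold tests on 3 examples (33.3%) and trains on the remaining 6 (66.6%), matching the proportions on the slides.
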
  49. 49. 51
  50. 50. Java 52
  51. 51. Weka, Encog, Aerosolve, FlinkML https://github.com/josephmisiti/awesome-machine-learning
  52. 52. Ease of use, cloud computing, speed, generality, data processing
  53. 53. https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html 55
  54. 54. MLlib is a library of ML algorithms and utilities designed to run in parallel on a Spark cluster
  55. 55. Introduces a few new data types, e.g. vector (dense and sparse), labeled point, rating, etc. Allows invoking various algorithms on distributed datasets (RDD/Dataset) http://spark.apache.org/docs/latest/mllib-guide.html
  56. 56. spark.mllib: built on top of RDDs. spark.ml: built on top of Datasets http://spark.apache.org/docs/latest/mllib-guide.html
  57. 57. Utilities (linear algebra, statistics, etc.); feature extraction, feature transformation, etc.; regression; classification; clustering; collaborative filtering, e.g. alternating least squares; dimensionality reduction; and many more http://spark.apache.org/docs/latest/mllib-guide.html
  58. 58. “All” spark.mllib features plus: pipelines; persistence; model selection and tuning (train-validation split, K-fold cross-validation) http://spark.apache.org/docs/latest/ml-guide.html
  59. 59. Raw data Transformer Estimator [parameters] Transformer [parameters] Estimator [parameters] Dataset Dataset Dataset Dataset http://spark.apache.org/docs/latest/ml-pipeline.html Cross Validator [pipeline, evaluator, parameters] Dataset 61
  60. 60. Using Spark MLlib Pipeline
  61. 61. Lyrics https://github.com/tmatyashovsky/spark-ml-samples 63
  62. 62. I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you https://github.com/tmatyashovsky/spark-ml-samples 64
  63. 63. Lyrics -> Cleanser -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  64. 64. I'm a rolling thunder, a pouring rain I'm comin' on like a hurricane My lightning's flashing across the sky You're only young but you're gonna die I won't take no prisoners, won't spare no lives Nobody's putting up a fight I got my bell, I'm gonna take you to hell I'm gonna get you, Satan get you https://github.com/tmatyashovsky/spark-ml-samples 66
  65. 65. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  66. 66. 1: Im a rolling thunder a pouring rain; 2: Im comin on like a hurricane; 3: My lightnings flashing across the sky; 4: Youre only young but youre gonna die; 5: I wont take no prisoners wont spare no lives; 6: Nobodys putting up a fight; 7: I got my bell Im gonna take you to hell; 8: Im gonna get you Satan get you https://github.com/tmatyashovsky/spark-ml-samples
  67. 67. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  68. 68. 1: im a rolling thunder a pouring rain; 2: im comin on like a hurricane; 3: my lightnings flashing across the sky; 4: youre only young but youre gonna die; 5: i wont take no prisoners wont spare no lives; 6: nobodys putting up a fight; 7: i got my bell im gonna take you to hell; 8: im gonna get you satan get you https://github.com/tmatyashovsky/spark-ml-samples
  69. 69. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  70. 70. 1: im rolling thunder pouring rain; 2: im comin like hurricane; 3: lightnings flashing across sky; 4: youre young youre gonna die; 5: wont take prisoners wont spare lives; 6: nobodys putting fight; 7: got bell im gonna take hell; 8: im gonna get satan get https://github.com/tmatyashovsky/spark-ml-samples
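
The effect of the first text stages (cleansing, tokenizing, stop word removal) shown on slides 66 to 70 can be imitated in plain Java as below. This is a sketch under stated assumptions, not the Spark transformers from the talk's repository: the class and method names are hypothetical, and the stop word list is a tiny hand-picked set that happens to reproduce the slide output, whereas Spark's `StopWordsRemover` uses a much longer default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, single-machine imitation of the first lyrics preprocessing stages.
class LyricsPreprocessingSketch {

    // Assumed stop words, chosen only to match the example on the slides.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "i", "my", "on", "the", "but", "only", "no", "up", "you", "to"));

    // Cleanser: drop everything except letters and whitespace, then normalize spaces.
    static String cleanse(String line) {
        return line.replaceAll("[^A-Za-z\\s]", "").replaceAll("\\s+", " ").trim();
    }

    // Tokenizer: lower-case the cleansed line and split it on spaces.
    static List<String> tokenize(String cleansedLine) {
        if (cleansedLine.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(cleansedLine.toLowerCase(Locale.ROOT).split(" ")));
    }

    // Stop words remover: keep only tokens that are not in the stop word set.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !STOP_WORDS.contains(token))
                .collect(Collectors.toList());
    }
}
```

Chaining the three methods on the first lyric line, `removeStopWords(tokenize(cleanse(line)))`, yields the tokens of row 1 on slide 70.
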
  71. 71. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset -> Verser [sentences in verse] -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  72. 72. Sentences in verse = 4. verse1 (rows 1-4): im roll thunder pour rain; im comin like hurrican; lightn flash across sky; your young your gonna die. verse2 (rows 5-8): wont take prison wont spare live; nobodi put fight; got bell im gonna take hell; im gonna get satan get https://github.com/tmatyashovsky/spark-ml-samples
  73. 73. Sentences in verse = 8. verse1 (rows 1-8): im roll thunder pour rain; im comin like hurrican; lightn flash across sky; your young your gonna die; wont take prison wont spare live; nobodi put fight; got bell im gonna take hell; im gonna get satan get https://github.com/tmatyashovsky/spark-ml-samples
  74. 74. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset -> Verser [sentences in verse] -> Dataset -> Word2Vec [vector size] -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  75. 75. Sentences in verse = 4. feature1: [0.036463763926011056, -0.013076733228398295, ... 0.03816963326281462]; feature2: [-0.013962931134021625, 0.049275818325650804, ... -0.058982484615766086] https://github.com/tmatyashovsky/spark-ml-samples
  76. 76. Sentences in verse = 8. feature1: [0.036463763926011056, -0.013076733228398295, 0.044362547532774695, 0.03816963326281462, ... -0.013962931134021625, 0.049275818325650804, -0.058982484615766086] https://github.com/tmatyashovsky/spark-ml-samples
  77. 77. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset -> Verser [sentences in verse] -> Dataset -> Word2Vec [vector size] -> Dataset -> Logistic Regression [max iterations, reg parameter] -> Dataset https://github.com/tmatyashovsky/spark-ml-samples
  78. 78. Probability: [0.9212126972383768, 0.07878730276162313] Prediction: 0.0 https://github.com/tmatyashovsky/spark-ml-samples 80
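
The pair of numbers on slide 78 reads as the probabilities of label 0 and label 1, with the prediction being the more likely label. The sketch below shows, in plain Java, how a binary logistic regression model could turn a feature vector into such a pair. The names, the weights and the 0.5 threshold are assumptions for illustration; the trained model's actual coefficients are not shown on the slides.

```java
// Hypothetical sketch of binary logistic regression scoring.
class BinaryPredictionSketch {

    // Logistic (sigmoid) function, mapping any real score into (0, 1).
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Returns {P(label = 0), P(label = 1)} for the given features.
    static double[] probabilities(double[] weights, double intercept, double[] features) {
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        double probabilityOfOne = sigmoid(z);
        return new double[] {1.0 - probabilityOfOne, probabilityOfOne};
    }

    // Predicts 1.0 when P(label = 1) exceeds the threshold, otherwise 0.0.
    static double predict(double[] probabilities, double threshold) {
        return probabilities[1] > threshold ? 1.0 : 0.0;
    }
}
```

Applied to the probabilities from the slide with a threshold of 0.5, `predict` returns 0.0, in line with the prediction shown there.
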
  79. 79. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset -> Verser [sentences in verse] -> Dataset -> Word2Vec [vector size] -> Dataset -> Logistic Regression [max iterations, reg parameter] -> Dataset; Cross Validator Model https://github.com/tmatyashovsky/spark-ml-samples
  80. 80. [0.8454839775240359, 0.9061236588248319, 0.9527128936788524, 0.9522790271664413, ... 0.9526248129757111, 0.9522790271664411] https://github.com/tmatyashovsky/spark-ml-samples 82
  81. 81. Lyrics -> Cleanser -> Dataset -> Numerator -> Dataset -> Tokenizer -> Dataset -> Stop Words Remover -> Dataset -> Exploder -> Dataset -> Stemmer -> Dataset -> Uniter -> Dataset -> Verser [sentences in verse] -> Dataset -> Word2Vec [vector size] -> Dataset -> Logistic Regression [max iterations, reg parameter] -> Dataset; Cross Validator Model https://github.com/tmatyashovsky/spark-ml-samples
  82. 82. 84
  83. 83. 85
  84. 84. ML is not as complex as it seems from an applied perspective. Existing libraries and frameworks take care of a lot of tedious work. For instance, Spark MLlib can help build nice ML pipelines.
  85. 85. Design by 87
  86. 86. References:
      • https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms
      • Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia
      • https://databricks.com/blog/2015/01/07/ml-pipelines-a-new-high-level-api-for-mllib.html
      • https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html
      • https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
      • https://www.kaggle.com/c/dogs-vs-cats/
      • http://yann.lecun.com/exdb/mnist/
      • http://www.bcl.hamilton.ie/~barak/teach/F98/ECE547/hw1/index.html
      • http://www.slideshare.net/jeykottalam/pipelines-ampcamp
      • https://github.com/master/spark-stemming
      • https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-apache-spark.html
      • http://www.degeneratestate.org/posts/2016/Apr/20/heavy-metal-and-natural-language-processing-part-1/
      • https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/functions.html
      • http://www.slideshare.net/liweiyang5/spark-mllib-training-material
      • https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
      • http://www.slideshare.net/databricks/combining-machine-learning-frameworks-with-apache-spark
      • https://databricks.com/blog/2015/10/20/audience-modeling-with-apache-spark-ml-pipelines.html
      • https://github.com/deeplearning4j/deeplearning4j
      • http://deeplearning4j.org/spark
      • http://mlwiki.org/index.php/Overfitting
      • http://bionlp-www.utu.fi/wv_demo/
      • https://quomodocumque.wordpress.com/2016/01/15/messing-around-with-word2vec/
