Which library should you choose for data-science? That's the question!

This talk presents the data-science ecosystem in two languages: Python and Scala. It demonstrates the use of their libraries on a real dataset to solve a binary classification problem with a decision tree algorithm.

Published in: Data & Analytics

  1. 1. Which library should you choose for data-science? That’s the question! (?) Anastasia Lieva Data Scientist @lievAnastazia
  2. 2. Agenda 1. What is Data-Science? How magic is it? 2. Python & Scala Data-Science ecosystem 3. Demonstration of some libraries on a real dataset 4. Your choice in the pocket?
  3. 3. the sexiest job of the 21st century Data-Science
  4. 4. most laborious job of the 21st century? Data-Science
  5. 5. most laborious?
  6. 6. Credit: Emma Walker
  7. 7. Time series analysis Clustering Classification Regression ... ... Descriptive statistics Frame the problem!
  8. 8. Components that we need to solve the problem Learning/optimization of algorithm Mathematical analysis Tuning/optimization of algorithm Preprocessing Evaluation ... Visualisation
  9. 9. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  10. 10. Which aspects should we focus on? Solution that works / solution out of the box; solution that is well documented; solution that is easy & fast to test; solution that is easy & fast to develop; solution that is easy & fast to industrialize; solution that is easy to maintain; solution that is easy & fast to scale
  11. 11. Frame your search
  12. 12. R Python SQL Scala Scala.io 2016 Anastasia Lieva : “Big-Data-Science in Scala”
  13. 13. Frame your search: Which language to pick? Python, Scala
  14. 14. Frame your search: Which library to pick? Python, Scala
  15. 15. Frame your search: Which library to pick? Scala: Spark, Saddle, Smile, Breeze. Python: Spark, Statsmodels, Scikit-learn, Numpy, Pandas, Simpy, Matplotlib, Bokeh, Seaborn, Vispy, ggplot
  16. 16. Time series analysis Clustering Classification Regression ... ... Descriptive statistics Frame the problem! Python Scala
  17. 17. Components that we need to solve the problem: Learning/optimisation of algorithm, Mathematical analysis, Tuning/optimisation of algorithm, Preprocessing, Evaluation, ..., Visualisation
  18. 18. Frame your search: Which library to pick? Scala: Spark, SparkTS, Smile, Breeze, Saddle. Components: learning algorithms, mathematical analysis, algorithms tuning, preprocessing, evaluation, visualisation
  19. 19. Frame your search: Which library to pick? Python: PySpark, Scikit-learn, statsmodels, Scipy, SymPy, Pandas, Numpy. Components: learning algorithms, mathematical analysis, algorithms tuning/optimization, preprocessing, evaluation, visualisation
  20. 20. Frame your search: Which library to pick? Python, visualisation: Pandas, matplotlib, seaborn, bokeh, vispy, ggplot
  21. 21. Frame your search: Which library to pick? Python, Bayesian inference: BayesPy, PyMC, libpgm, BNFinder, pebl
  22. 22. Frame your search: Which library to pick? Python, deep learning: TensorFlow, Keras, Theano, Caffe, Lasagne
  23. 23. Development environment matters Python Scala
  24. 24. Development environment matters. Python: Anaconda ST3 plugin for Sublime Text 3
  25. 25. Development environment matters Scala
  26. 26. Problem: optimize the click rate of delivered ads. We want to estimate the probability that an ad will be clicked, depending on: ● request configuration ● proposed creative ● user history ● third-party information
  27. 27. Time series analysis Clustering Classification Regression ... ... Descriptive statistics Frame the problem!
  28. 28. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  29. 29. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  30. 30. Decision Tree os Category City Games Android Music iOs Paris Nantes Yes No Yes No
  31. 31. { "id":"951cb9f5-2bab-46ce-b759-8245cffxxxxx", "time":"2016-06-09T0:25:28Z", "bidfloor":2.88, "appOrSite":"app", "adType":"banner", "categories":"games,news,football", "publisherId":"11e281c1123139xxxxx", "carrier":"208-10", "os":"iOS", "connectionType":3, "coords":[48.929256439208984, 2.4255824089050293], "adSize":[320, 50], "exchange":"xxxxx", [...], "clicked":true } Raw data 500 Mb
  32. 32. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  33. 33.
      Os            BidPrice  Time
      Android       7.3       2016-06-09T0:25:28Z
      iOS           4.55      2016-05-09T14:23:12Z
      WindowsPhone  2.89      2016-06-09T11:35:11Z
  34. 34.
      Os            BidPrice  Time
      Android       7.3       2016-06-09T0:25:28Z
      iOS           4.55      2016-05-09T14:23:12Z
      WindowsPhone  2.89      2016-06-09T11:35:11Z
  35. 35.
      Os            BidPrice  Time
      Android       7.3       2016-06-09T0:25:28Z
      iOS           4.55      2016-05-09T14:23:12Z
      WindowsPhone  2.89      2016-06-09T11:35:11Z
  36. 36.
      Os            BidPrice  Time
      Android       7.3       2016-06-09T0:25:28Z
      iOS           4.55      2016-05-09T14:23:12Z
      WindowsPhone  2.89      2016-06-09T11:35:11Z
  37. 37.
      Os            BidPrice  Time                  Click
      Android       7.3       2016-06-09T0:25:28Z   False
      iOS           4.55      2016-05-09T14:23:12Z  True
      WindowsPhone  2.89      2016-06-09T11:35:11Z  False
  38. 38.
      Os            BidPrice  Time                  Click
      Android       7.3       2016-06-09T0:25:28Z   False
      iOS           4.55      2016-05-09T14:23:12Z  True
      WindowsPhone  2.89      2016-06-09T11:35:11Z  False

      Os   BidPrice  Time
      3.0  6.0       1.0
      5.0  3.0       5.0
      1.0  2.0       3.0
  39. 39. Spark Preprocessing Features engineering Features selection Features extraction Scala
  40. 40. Databricks Notebook
  41. 41. Databricks Notebook Display and download options
  42. 42. Databricks Notebook
  43. 43. Databricks Notebook
  44. 44. Spark (Scala): 1. Spark SQL optimized methods 2. MLlib out-of-the-box features engineering / features selection 3. Dataset performance & type safety. Execution time for preprocessing: 43 seconds
  45. 45. Saddle SCALA Preprocessing Features engineering Features selection Features extraction Scala
  46. 46. Saddle
  47. 47. Saddle
  48. 48. Saddle (Scala): 1. Out-of-the-box, easy-to-use structures: frame, matrix, series, vectors 2. Not actively developed. Execution time for preprocessing: 3.5 minutes
  49. 49. Native Scala library (Scala): 1. Type-safe & very performant 2. You have to implement all preprocessing stages and methods yourself. Execution time for preprocessing: 3.1 seconds
  50. 50. Numpy, Pandas, Scikit-learn Preprocessing Features engineering Features selection Features extraction Python
  51. 51. Numpy, Pandas, Scikit-learn (Python): 1. Numpy arrays instead of Python lists for operations on sequences 2. Pandas DataFrame slicing methods to access values 3. Pandas DataFrame methods for data-structure transformation and access 4. Scikit-learn for features engineering. Execution time for preprocessing: 20 minutes
  52. 52. Numpy, Pandas, Scikit-learn (Python): 1. Numpy arrays instead of Python lists for operations on sequences 2. Pandas DataFrame slicing methods to access values 3. Pandas DataFrame methods for data-structure transformation and access 4. Scikit-learn for features engineering. Numpy: homogeneous multidimensional array with its indexing, slicing and reshaping tricks; linear algebra. Execution time for preprocessing: 20 minutes
  53. 53. Numpy, Pandas, Scikit-learn (Python): 1. Numpy arrays instead of Python lists for operations on sequences 2. Pandas DataFrame slicing methods to access values 3. Pandas DataFrame methods for data-structure transformation and access 4. Scikit-learn for features engineering. Pandas: DataFrame with its 425 methods (slicing, multi-indexing, merging, grouping, missing values imputation, …); plotting; time series analysis. Execution time for preprocessing: 20 minutes
  54. 54. Numpy, Pandas, Scikit-learn (Python): 1. Numpy arrays instead of Python lists for operations on sequences 2. Pandas DataFrame slicing methods to access values 3. Pandas DataFrame methods for data-structure transformation and access 4. Scikit-learn for features engineering. Scikit-learn: preprocessing (features engineering, missing value imputation, features selection); decomposing signals in components (PCA, LDA, factor analysis, matrix factorisation). Execution time for preprocessing: 20 minutes
  55. 55. Compare execution time for preprocessing on a laptop: Intel Core i5, 11 GB RAM, 4 cores
  56. 56. Visualisation Preprocessing Features engineering Features selection Features extraction os Category City Games Android Music iOs Paris Nantes Yes No Yes No Decision Tree
  57. 57. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  58. 58. Scikit-learn Machine Learning Hyper-parameters tuning Algorithm optimization Algorithm Python
  59. 59. Scikit-learn
  60. 60. Scikit-learn String Indexer Tokenizer Bucketizer PCA Assembler
  61. 61. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  62. 62. Scikit-learn
  63. 63. Smile Machine Learning Hyper-parameters tuning Algorithm optimization Algorithm Scala
  64. 64. Model importance 0.17041644829479835,0.0,0.24611540915530505,1.1389295846602683,0.07655364222388063,0.0,0.0,0.009896625232551026,4.57453119760533,0.36047880690737855,1.2020833333333334,0.007662298205433167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0 Smile
  65. 65. Spark Machine Learning Hyper-parameters tuning Algorithm optimization Algorithm Scala
  66. 66. Spark
  67. 67. Spark
  68. 68. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  69. 69. Spark
  70. 70. Pipeline interface String Indexer Tokenizer Bucketizer PCA Assembler
  71. 71. Spark
  72. 72. Visualisation; Preprocessing (Features engineering, Features selection, Features extraction); Machine Learning (Hyper-parameters tuning, Algorithm optimization, Algorithm); Evaluation (Evaluation strategies, Visualisation, Evaluation metrics)
  73. 73. Spark Smile Scikit-learn Regression Binary Classification Multiclass Classification Regression Classification Clustering Regression Classification evaluators built-in methods for generation of classification report & confusion matrix
  74. 74. Classification metrics
  75. 75. Compare execution time for learning on a laptop: Intel Core i5, 11 GB RAM, 4 cores
  76. 76. Which aspects should we focus on? Scala, Python: solution that works / solution out of the box; solution well explained/supported; solution easy & fast to test; solution easy & fast to develop; solution easy & fast to industrialize; solution easy to maintain; solution easy & fast to scale
  77. 77. Thank you for your attention! Now go do data science to save the world. @lievAnastazia
  78. 78. Credit: Nicolas Duforet
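
Slide 30 sketches a decision tree that splits on os, category and city to predict a click. As a rough illustration of how such a split might be chosen, the Python sketch below scores each categorical feature by the weighted Gini impurity of the groups it produces and picks the lowest. The toy rows and the helper names (`gini`, `split_impurity`, `best_feature`) are hypothetical, and the talk does not state which split criterion its libraries were configured with. Libraries such as scikit-learn build binary splits on numeric features, so this multiway grouping is a simplification.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature):
    """Weighted Gini impurity after grouping rows by one categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row["clicked"])
    n = len(rows)
    return sum(len(group) / n * gini(group) for group in groups.values())

def best_feature(rows, features):
    """Return the feature whose split leaves the purest groups."""
    return min(features, key=lambda f: split_impurity(rows, f))

# Hypothetical toy data, not taken from the talk's dataset.
ROWS = [
    {"os": "Android", "category": "Games", "clicked": True},
    {"os": "Android", "category": "Music", "clicked": True},
    {"os": "iOS", "category": "Games", "clicked": False},
    {"os": "iOS", "category": "Music", "clicked": False},
]

if __name__ == "__main__":
    for name in ("os", "category"):
        print(name, split_impurity(ROWS, name))
    print("split on:", best_feature(ROWS, ["os", "category"]))
```

In this toy data, grouping by `os` separates clicks from non-clicks completely, so it would become the root of the tree.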

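Slides 31 to 38 go from raw JSON records to a purely numeric feature table. The Python sketch below shows one possible way to do that with the standard library only: it parses JSON lines, maps the `os` string to a float index (most frequent value first, ties broken alphabetically), keeps the bid price, and extracts the hour from the timestamp. The three rows come from slide 37; the JSON field name `bidPrice` and the helpers `fit_index`, `hour_of` and `to_features` are assumptions made for illustration. The encoding differs from the one shown on slide 38, whose exact mapping the slides do not explain.

```python
import json
from collections import Counter

# Rows from slide 37, written as JSON lines ("bidPrice" is an assumed key).
RAW_LINES = [
    '{"os": "Android", "bidPrice": 7.3, "time": "2016-06-09T0:25:28Z", "clicked": false}',
    '{"os": "iOS", "bidPrice": 4.55, "time": "2016-05-09T14:23:12Z", "clicked": true}',
    '{"os": "WindowsPhone", "bidPrice": 2.89, "time": "2016-06-09T11:35:11Z", "clicked": false}',
]

def fit_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

def hour_of(timestamp):
    """Extract the hour from a timestamp such as '2016-06-09T0:25:28Z'."""
    return float(timestamp.split("T")[1].split(":")[0])

def to_features(records):
    """Turn parsed records into numeric feature rows and 0/1 labels."""
    os_index = fit_index(r["os"] for r in records)
    rows = [[os_index[r["os"]], r["bidPrice"], hour_of(r["time"])] for r in records]
    labels = [1.0 if r["clicked"] else 0.0 for r in records]
    return rows, labels

records = [json.loads(line) for line in RAW_LINES]
features, labels = to_features(records)

if __name__ == "__main__":
    for row, label in zip(features, labels):
        print(row, label)
```

Splitting the timestamp by hand avoids depending on how a date parser treats the non-padded hour in `T0:25:28Z`.
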
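
Slides 73 and 74 refer to evaluators, a classification report and a confusion matrix for the binary click model. As a minimal sketch of what those reports contain, the Python below counts the four confusion-matrix cells and derives precision, recall, F1 and accuracy from them. The label vectors and function names are hypothetical; the talk's actual scores are not reproduced in the transcript.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for boolean ground truth and predictions."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif pred:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy; zero when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical labels: True means the ad was clicked.
Y_TRUE = [True, False, True, True, False, False, True, False]
Y_PRED = [True, False, False, True, True, False, True, False]

if __name__ == "__main__":
    print(confusion_counts(Y_TRUE, Y_PRED))
    print(classification_metrics(Y_TRUE, Y_PRED))
```

Click data is usually imbalanced, so precision and recall on the clicked class tend to say more than accuracy alone.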