Big Data Science in Scala V2

Updated version of "Big Data Science in Scala", featuring Spark pipelines and hyper-parameter optimization techniques. This talk shows how three Scala libraries (Smile, Saddle and Spark ML) meet the requirements of new Big Data science projects, using click-through rate prediction as the example.

Published in: Technology


  1. Big Data-Science in Scala. Anastasia Lieva, Data Scientist, @lievAnastazia
  2. Agenda: 1. Big Data as motivation for Scala 2. Overview of data-science libraries in Scala 3. Demonstration of some libraries on a real dataset 4. Your choice in the pocket?
  3. KDnuggets polls (2014, 2015, 2016), most popular tools in data science: 1. R 2. Python 3. SQL
  4. Context: Real-Time Bidding. Raw requests: 200,000 requests per second, 8 terabytes per day
  5. R, Python, SQL, Scala
  6. R, Python, SQL, Scala: Spark ML/DataFrame/SQL, SMILE, Saddle, Breeze
  7. Components that we need to solve the problem: learning/optimisation algorithm, mathematical analysis, tuning/optimisation of the algorithm, preprocessing, evaluation, visualisation, ...
  8. Frame your search: which library to pick? Scala libraries: Spark, SparkTS, Smile, Breeze, Saddle. Criteria: learning algorithms, mathematical analysis, algorithm tuning, preprocessing, evaluation, visualisation
  9. Frame your search: which library to pick? Deep learning in Scala: DeepLearning.scala (ThoughtWorks), Neuron, DeepLearning4j
  10. Problem: optimize the click rate of delivered ads. We want to estimate the probability that an ad will be clicked, depending on: ● request configuration ● proposed creative ● user history ● third-party information
  11. Frame the problem! Time series analysis, clustering, classification, regression, descriptive statistics, ...
  12. Visualisation; Preprocessing (feature engineering, feature selection, feature extraction); Machine Learning (algorithm, algorithm optimization, hyper-parameter tuning); Evaluation (evaluation strategies, visualisation, evaluation metrics)
  13. Algorithm: Random Forest. Averaging the decisions from all the trees. [Diagram: decision trees splitting on os, Category, City, adType, adSize and weekDay (example values: Games, Music, Android, iOS, Paris, Nantes, Banner, Video, 320x50, 480x320, Saturday, Monday), with Yes/No leaves]
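
The "averaging the decisions from all the trees" step on slide 13 can be illustrated with a small, library-free sketch. The three hand-written "trees" and their probabilities are hypothetical stand-ins for trained trees; none of this is the Spark or Smile API.

```scala
object ForestVote {
  // A "tree" is modelled as a function from categorical features
  // to an estimated click probability (an assumption of this sketch).
  type Tree = Map[String, String] => Double

  // Hypothetical hand-written trees standing in for trained ones.
  val trees: Seq[Tree] = Seq(
    f => if (f.get("os").contains("Android")) 0.8 else 0.2,
    f => if (f.get("adType").contains("Video")) 0.6 else 0.1,
    f => if (f.get("weekDay").contains("Saturday")) 0.7 else 0.3
  )

  // The forest's prediction is the mean of the per-tree predictions.
  def predict(features: Map[String, String]): Double =
    trees.map(tree => tree(features)).sum / trees.size
}
```

Calling `ForestVote.predict(Map("os" -> "Android", "adType" -> "Banner", "weekDay" -> "Monday"))` averages 0.8, 0.1 and 0.3. A real forest would learn each tree from a bootstrap sample of the data rather than use fixed rules.
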
  14. Raw data
      {
        "id":"951cb9f5-2bab-46ce-b759-8245cffxxxxx",
        "time":"2016-06-09T0:25:28Z",
        "bidfloor":2.88,
        "appOrSite":"app",
        "adType":"banner",
        "categories":"games,news,football",
        "publisherId":"11e281c1123139xxxxx",
        "carrier":"208-10",
        "os":"iOS",
        "connectionType":3,
        "coords":[48.929256439208984, 2.4255824089050293],
        "adSize":[320, 50],
        "exchange":"xxxxx",
        [...],
        "clicked":true
      }
  15. Os            MaxPrice  Time
      Android       7.3       2016-06-09T0:25:28Z
      iOS           4.55      2016-05-09T14:23:12Z
      WindowsPhone  2.89      2016-06-09T11:35:11Z
  19. Os            MaxPrice  Time                  Click
      Android       7.3       2016-06-09T0:25:28Z   False
      iOS           4.55      2016-05-09T14:23:12Z  True
      WindowsPhone  2.89      2016-06-09T11:35:11Z  False
  20. Os            MaxPrice  Time                  Click
      Android       7.3       2016-06-09T0:25:28Z   False
      iOS           4.55      2016-05-09T14:23:12Z  True
      WindowsPhone  2.89      2016-06-09T11:35:11Z  False
      Os MaxPrice Time
      3.0 6.0 1.0 5.0 3.0 5.0 1.0 2.0 3.0
  21. Preprocessing: Spark ml ● Extraction: extracting features from “raw” data ● Transformation: scaling, converting, or modifying features ● Selection: selecting a subset from a larger set of features
  22. Preprocessing: Saddle. Array-backed, specialized data structures. Pandas-like operations: dealing with missing values; index transformation tools; extracting, slicing, mapping row/column-wise; groupBy/join/concat; sorting/pivoting
  23. Learning: Spark ml. Dataframe-based API ● Classification ● Regression ● Linear Methods ● Decision Trees ● Tree ensembles
  24. Learning: Spark ml. Dataframe-based API ● Classification ● Regression ● Linear Methods ● Decision Trees ● Tree ensembles. Pipeline interface: TF-IDF, String Indexer, Assembler, Random Forest, Evaluation
  25. Compare performance: Spark
  26. Learning: Smile. Array-backed API ● Classification ● Regression ● Linear Methods ● Decision Trees ● Tree ensembles
  27. Learning: Smile ● Classification ● Regression ● Linear Methods ● Decision Trees ● Tree ensembles ★ Visualisation ★ Missing Values Imputation ★ Association Rule Mining ★ Manifold learning ★ Multi-dimensional scaling ★ Feature selection and dimensionality reduction
  28. Saddle (Scala). Preprocessing: feature engineering, feature selection, feature extraction
  29. Saddle: create the dataframe, balance the data
  30. Saddle: index categorical data
  31. Preprocessing: Saddle. Split randomly into train and test sets and convert to the input type needed by the Smile RF implementation
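
The split-and-convert step on slide 31 could look roughly like the sketch below. It uses only the Scala standard library, not Saddle; the array-of-arrays target shape is an assumption based on the slides' "Array-backed API" note for Smile, and the names `split` and `toArrays` are hypothetical.

```scala
import scala.util.Random

object TrainTestSplit {
  // Shuffles the rows with a fixed seed and cuts them at the given fraction.
  def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt((rows.size * trainFraction).toInt)
  }

  // Converts (features, label) rows into a feature matrix and a label vector,
  // the array-backed shape assumed here for the learning step.
  def toArrays(rows: Seq[(Array[Double], Int)]): (Array[Array[Double]], Array[Int]) =
    (rows.map(_._1).toArray, rows.map(_._2).toArray)
}
```

For example, `TrainTestSplit.split(rows, 0.8, seed = 1L)` keeps 80% of the rows for training. Fixing the seed makes the split reproducible between runs.
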
  32. Saddle (Scala): 1. Easy-to-use structures out of the box: frame, matrix, series, vectors 2. Not under active development 3. Dataframes are not type-safe
  33. Spark (Scala). Preprocessing: feature engineering, feature selection, feature extraction
  34. Databricks Notebook
  35. Databricks Notebook: display and download options
  36. Databricks Notebook
  37. Databricks Notebook
  38. Preprocessing: Spark ml. Balance the data
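
The transcript does not preserve the code behind "balance the data". One common approach, shown here as an assumption rather than as the talk's method, is to undersample the majority class (clicks are typically rare). The sketch is plain Scala, and `Row` and `undersample` are hypothetical names.

```scala
import scala.util.Random

object Balance {
  final case class Row(os: String, clicked: Boolean)

  // Keeps every row of the minority class and an equally sized
  // random sample of the majority class.
  def undersample(rows: Seq[Row], seed: Long): Seq[Row] = {
    val (clicks, nonClicks) = rows.partition(_.clicked)
    val (minority, majority) =
      if (clicks.size <= nonClicks.size) (clicks, nonClicks) else (nonClicks, clicks)
    val rng = new Random(seed)
    minority ++ rng.shuffle(majority).take(minority.size)
  }
}
```

Undersampling discards data, so on very small datasets oversampling the minority class or weighting classes may be preferable.
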
  39. Preprocessing: Spark ml. Index categorical data
      timestamp   os             osIdx
      1465037789  iOS            1
      1464983457  Windows Phone  2
      1465019529  Android        0
      1464974567  iOS            1
      1465018552  Android        0
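
The os to osIdx mapping on slide 39 can be reproduced with a frequency-ordered indexer. The sketch below is plain Scala, similar in spirit to a string indexer but not Spark's implementation; breaking ties alphabetically is an assumption made so that the result is deterministic.

```scala
object LabelIndexer {
  // Assigns index 0 to the most frequent label, 1 to the next, and so on.
  // Ties are broken alphabetically (an assumption of this sketch).
  def fit(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // The os column from the slide, in row order.
  val os: Seq[String] = Seq("iOS", "Windows Phone", "Android", "iOS", "Android")
  val index: Map[String, Int] = fit(os)

  // Yields 1, 2, 0, 1, 0, matching the osIdx column on the slide.
  val osIdx: Seq[Int] = os.map(index)
}
```

Note that the indices are arbitrary codes, not magnitudes; tree-based models such as Random Forest can treat them as categories, while linear models usually need one-hot encoding on top.
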
  40. Preprocessing: Spark ml. Conversion and sampling
  41. Spark (Scala): 1. Spark SQL optimized methods 2. MLlib out-of-the-box feature engineering / feature selection 3. Dataset performance & type safety
  42. Native Scala library: 1. Type-safe and very performant 2. You have to implement all preprocessing stages and methods yourself. Execution time for preprocessing 0.3 GB: 1.2 seconds. Execution time for preprocessing 13 GB: 22 seconds
  43. Visualisation; Preprocessing (feature engineering, feature selection, feature extraction); Random Forest. [Diagram: decision trees splitting on os, Category, City, adType, adSize and weekDay (example values: Games, Music, Android, iOS, Paris, Nantes, Banner, Video, 320x50, 480x320, Saturday, Monday), with Yes/No leaves]
  44. Smile (Scala). Machine Learning: algorithm, algorithm optimization, hyper-parameter tuning
  45. Learning: Smile. Construct the classifier and set hyperparameters
  46. Learning: Smile. Train the model and predict on the test dataframe
      0.17041644829479835,0.0,0.24611540915530505,1.1389295846602683,0.07655364222388063,0.0,0.0,0.009896625232551026,4.57453119760533,0.36047880690737855,1.2020833333333334,0.007662298205433167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
  47. Spark (Scala). Machine Learning: algorithm, algorithm optimization, hyper-parameter tuning
  48. Learning: Spark ml. Construct the classifier and set hyperparameters
  49. Spark
  50. Spark
  51. Pipeline interface: String Indexer, Tokenizer, Bucketizer, PCA, Assembler
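
The idea behind a pipeline interface, chaining preprocessing stages so they run in a fixed order, can be sketched without any library. This is a simplified model in which every stage is a plain function; it is not the Spark ML `Pipeline` API, which additionally separates stages that must be fitted from stages that only transform. The stage names and the `bidfloor` example column are hypothetical.

```scala
object MiniPipeline {
  type Row = Map[String, Double]
  type Stage = Seq[Row] => Seq[Row]

  // Hypothetical stage: adds a 0/1 column telling whether `in` reaches a threshold.
  def bucketize(in: String, out: String, threshold: Double): Stage =
    rows => rows.map(r => r + (out -> (if (r(in) >= threshold) 1.0 else 0.0)))

  // Hypothetical stage: rescales a non-negative column to [0, 1] by its maximum.
  def scaleByMax(col: String): Stage = rows => {
    val colMax = rows.map(_(col)).foldLeft(0.0)((a, b) => math.max(a, b))
    if (colMax == 0.0) rows else rows.map(r => r + (col -> (r(col) / colMax)))
  }

  // A pipeline is simply the stages applied in order.
  def pipeline(stages: Stage*): Stage = Function.chain(stages)
}
```

A pipeline built as `MiniPipeline.pipeline(MiniPipeline.bucketize("bidfloor", "highFloor", 3.0), MiniPipeline.scaleByMax("bidfloor"))` is itself a `Stage`, so pipelines compose the same way individual stages do.
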
  52. Visualisation; Preprocessing (feature engineering, feature selection, feature extraction); Machine Learning (algorithm, algorithm optimization, hyper-parameter tuning); Evaluation (evaluation strategies, visualisation, evaluation metrics)
  53. Spark: hyper-parameter tuning
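
The tuning code on slide 53 is not preserved in the transcript. A common strategy is an exhaustive grid search, sketched below in plain Scala as an illustration only. The parameter names `numTrees` and `maxDepth` and the `score` callback are assumptions; in practice `score` would train a model and return a validation metric.

```scala
object GridSearch {
  final case class Params(numTrees: Int, maxDepth: Int)

  // Scores every combination in the grid and returns the best one
  // together with its score (higher is better).
  def search(numTrees: Seq[Int], maxDepths: Seq[Int])(score: Params => Double): (Params, Double) = {
    val grid = for (n <- numTrees; d <- maxDepths) yield Params(n, d)
    grid.map(p => (p, score(p))).maxBy(_._2)
  }
}
```

The cost grows with the product of the grid sizes, since each combination requires training a model, so grids are usually kept small or replaced by random search.
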
  54. Visualisation; Preprocessing (feature engineering, feature selection, feature extraction); Machine Learning (algorithm, algorithm optimization, hyper-parameter tuning); Evaluation (evaluation strategies, visualisation, evaluation metrics)
  55. Evaluators. Spark: Regression, Binary Classification, Multiclass Classification. Smile: Regression, Classification
  56. Compare Spark and Smile Random Forest: classification metrics (the higher the better / the lower the better)
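
The transcript does not say which classification metrics slide 56 compares. As an illustration, the sketch below computes three common ones (precision, recall and F1) from predicted and actual click labels. It is plain Scala, not the Spark or Smile evaluator API.

```scala
object BinaryMetrics {
  final case class Metrics(precision: Double, recall: Double, f1: Double)

  def evaluate(predicted: Seq[Boolean], actual: Seq[Boolean]): Metrics = {
    require(predicted.size == actual.size, "inputs must have the same length")
    val pairs = predicted.zip(actual)
    val tp = pairs.count { case (p, a) => p && a }.toDouble  // true positives
    val fp = pairs.count { case (p, a) => p && !a }.toDouble // false positives
    val fn = pairs.count { case (p, a) => !p && a }.toDouble // false negatives
    // Each ratio falls back to 0.0 when its denominator is zero.
    val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
    val recall = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    val f1 =
      if (precision + recall == 0) 0.0
      else 2 * precision * recall / (precision + recall)
    Metrics(precision, recall, f1)
  }
}
```

All three are "the higher the better" metrics. With rare clicks, they are usually more informative than plain accuracy, which a model can maximise by always predicting "no click".
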
  57. Compare Spark and Smile Random Forest: running time on 13 GB, in minutes
  58. Compare preprocessing: Spark vs Saddle
  59. My List[tools] for THIS project: Preprocessing: Spark. Machine Learning (Random Forest): Smile
  60. Your Option[tools] for YOUR project: Spark, Spark TS, SMILE, Breeze, Saddle
  61. Thank you for your attention! And go do data science to save the world. @lievAnastazia
