TRAINING MODELS
ON TERABYTES OF DATA
RUXANDRA BURTICA, DATA SCIENTIST
2019 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
1 Problem
2 Using embeddings for dimensionality reduction
3 Comparing feature spaces
4 Analyzing used resources
PROBLEM(S)
Terabytes of data
Training time
Hyperparameter tuning
Thousands of dense features
Millions of sparse features
APPROACH
1. Transform sparse features to embeddings
2. Build the following datasets:
§ Dense
§ Dense & sparse
§ Dense & embeddings
§ Dense & embeddings & sparse
3. Sub-sample from these datasets
4. Train to find the best models
5. Analyze results and decide whether to drop the sparse features and use embeddings
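The approach steps can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the function names `build_variants` and `subsample_indices` are hypothetical, the embeddings are assumed to be precomputed (the slides do not say how they were learned), and plain Python lists stand in for what would realistically be sparse matrices at this scale.

```python
import random


def build_variants(dense, sparse, embeddings):
    """Combine per-row feature blocks into the four feature spaces.

    Each argument is a list of rows; row i of every block describes
    the same sample. (Hypothetical helper, for illustration only.)
    """
    def concat(*blocks):
        # Join the matching rows of each block into one flat feature vector.
        return [[value for row in rows for value in row] for rows in zip(*blocks)]

    return {
        "dense": concat(dense),
        "dense+sparse": concat(dense, sparse),
        "dense+embeddings": concat(dense, embeddings),
        "dense+embeddings+sparse": concat(dense, embeddings, sparse),
    }


def subsample_indices(n_rows, fraction, seed=0):
    """Pick a reproducible random subset of row indices (e.g. 1/16 or 1/8)."""
    rng = random.Random(seed)
    k = max(1, int(n_rows * fraction))
    return sorted(rng.sample(range(n_rows), k))


# Toy data: 16 samples, 2 dense features, 3 sparse indicators, 1 embedding value.
dense = [[float(i), float(i) / 2] for i in range(16)]
sparse = [[1 if i % 3 == j else 0 for j in range(3)] for i in range(16)]
embeddings = [[i / 16] for i in range(16)]

variants = build_variants(dense, sparse, embeddings)
rows = subsample_indices(len(dense), 1 / 8)
# Use the same sampled rows for every variant so the comparison is fair.
samples = {name: [data[i] for i in rows] for name, data in variants.items()}
```

Sampling the indices once and reusing them for every variant keeps the feature spaces comparable, since each model then sees exactly the same rows.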
FINDING THE BEST MODELS
SETUP
1. Dataset used for experimentation
§ 1/16th of the dataset
§ 1/8th of the dataset
2. Hyper-parameter tuning method
§ GridSearchCV; RandomizedSearchCV
§ Hyperband
§ Bayesian Optimization
3. Hyper-parameter tuning optimization metric
§ Partial AUC
Credits to: https://playground.tensorflow.org
PARTIAL AUC
§ Area Under the Curve
§ Partial AUC
§ Compute the AUC up to a threshold (e.g., 0.2)
§ Why use Partial AUC?
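As a rough illustration of the metric, the sketch below computes the area under the ROC curve only up to a false-positive-rate cutoff. It assumes the slide's "threshold" refers to the false positive rate, which is the usual convention but is not stated in the slides; the names `roc_points` and `partial_auc` are hypothetical. A common motivation for the metric is that only the low-false-positive region of the curve matters operationally, though the slides do not spell out the reasoning here. scikit-learn's `roc_auc_score` also accepts a `max_fpr` argument, but it returns a standardized value rather than the raw area computed below.

```python
def roc_points(y_true, scores):
    """Return ROC curve points (fpr, tpr), sweeping thresholds from high to low."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def partial_auc(y_true, scores, max_fpr=0.2):
    """Trapezoidal area under the ROC curve for fpr in [0, max_fpr]."""
    if not 0 < max_fpr <= 1:
        raise ValueError("max_fpr must be in (0, 1]")
    pts = roc_points(y_true, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at the cutoff by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

With this unnormalized definition, a perfect ranking scores `max_fpr` (the full area of the clipped region), and `max_fpr=1.0` recovers the ordinary AUC.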
BAYESIAN OPTIMIZATION
Credits to: https://github.com/fmfn/BayesianOptimization
BAYESIAN OPTIMIZATION - RESULTS
BAYESIAN OPTIMIZATION - RESULTS
BAYESIAN OPTIMIZATION - BEST MODELS
BAYESIAN OPTIMIZATION - RESOURCES
HYPERBAND
Credits to: https://arxiv.org/pdf/1603.06560.pdf
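For readers unfamiliar with Hyperband, the following is a compact sketch of the bracket schedule and successive-halving loop as described in the credited paper. It is an illustration under stated assumptions, not the implementation used in the talk: `bracket_sizes`, `hyperband`, `sample_config`, and `evaluate` are hypothetical names, and the toy objective at the end is invented purely to make the example runnable.

```python
import math
import random


def bracket_sizes(max_resource, eta):
    """List (s, n, r) per bracket: n starting configs, each given r resource."""
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1
    sizes = []
    for s in range(s_max, -1, -1):
        total = (s_max + 1) * eta ** s
        n = (total + s) // (s + 1)  # integer ceil(total / (s + 1))
        r = max_resource / eta ** s
        sizes.append((s, n, r))
    return sizes


def hyperband(sample_config, evaluate, max_resource=27, eta=3, seed=0):
    """Return (best_config, best_loss) over all brackets; lower loss is better."""
    rng = random.Random(seed)
    best_config, best_loss = None, math.inf
    for s, n, r in bracket_sizes(max_resource, eta):
        configs = [sample_config(rng) for _ in range(n)]
        # Successive halving: more resource per round, fewer survivors.
        for i in range(s + 1):
            resource = r * eta ** i
            losses = [evaluate(c, resource) for c in configs]
            order = sorted(range(len(configs)), key=losses.__getitem__)
            if losses[order[0]] < best_loss:
                best_config, best_loss = configs[order[0]], losses[order[0]]
            # Keep roughly the best 1/eta of this round, never fewer than one.
            keep = max(1, (n // eta ** i) // eta)
            configs = [configs[j] for j in order[:keep]]
    return best_config, best_loss


# Toy objective (an assumption): loss shrinks with more resource and is
# minimized at config value 0.3.
def toy_evaluate(config, resource):
    return (config - 0.3) ** 2 + 1.0 / resource


best_config, best_loss = hyperband(lambda rng: rng.random(), toy_evaluate)
```

In a real setting `resource` would be something like training iterations or a data fraction, and `evaluate` would train a model with that budget and return a validation loss, for example one minus the partial AUC.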
BAYESIAN OPTIMIZATION vs. HYPERBAND
REFERENCES
§ ROC curves for continuous data
§ https://books.google.be/books?id=UZHwdiwOs4QC
§ Bayesian optimization - code
§ https://scikit-optimize.github.io
§ https://github.com/fmfn/BayesianOptimization
§ Hyperband
§ https://people.eecs.berkeley.edu/~kjamieson/hyperband.html
§ https://arxiv.org/abs/1603.06560
THANK YOU!
ruxandra.burtica@crowdstrike.com
TREES AND ENSEMBLES
SPARSE FEATURES
FEATURES (EXAMPLES)
