Ruxandra Burtica, a data scientist, compared different methods for training models on terabytes of data with thousands of dense features and millions of sparse features. She used dimensionality reduction techniques like embeddings to transform sparse features. Models were trained on different combinations of features and subsets of data to determine the best approach. Bayesian optimization and hyperband techniques were evaluated for hyperparameter tuning. The results and resource usage of different methods were analyzed to decide whether sparse features could be removed and embeddings used instead.
2019 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
1. Problem
2. Using embeddings for dimensionality reduction
3. Comparing feature spaces
4. Analyzing resource usage
PROBLEM(S)
§ Terabytes of data
§ Training time
§ Hyperparameter tuning
§ Thousands of dense features
§ Millions of sparse features
APPROACH
1. Transform sparse features to embeddings
2. Build the following data-sets:
§ Dense
§ Dense & sparse
§ Dense & embeddings
§ Dense & embeddings & sparse
3. Sub-sample from these data-sets
4. Train to find the best models
5. Analyze results and decide whether to drop the sparse features and use embeddings instead
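Steps 1–2 of the approach can be sketched as follows. The talk does not say how the embeddings are learned; TruncatedSVD over the sparse block is one common choice for sparse matrices, and all shapes here are toy stand-ins.

```python
# Sketch: compress a large sparse feature block into dense embeddings,
# then concatenate with the dense features (the "dense & embeddings" set).
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X_sparse = sparse_random(1000, 5000, density=0.001, random_state=0)  # toy sparse block
X_dense = rng.normal(size=(1000, 10))                                # toy dense block

svd = TruncatedSVD(n_components=32, random_state=0)  # 5000 sparse dims -> 32
X_embed = svd.fit_transform(X_sparse)                # dense (1000, 32) embeddings

X_combined = np.hstack([X_dense, X_embed])           # "dense & embeddings" data-set
print(X_combined.shape)
```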
FINDING THE BEST MODELS
SETUP
1. Dataset used for experimentation
§ 1/16th of the dataset
§ 1/8th of the dataset
2. Hyper-parameter tuning method
§ GridSearchCV; RandomizedSearchCV
§ Hyperband
§ Bayesian Optimization
3. Hyper-parameter tuning optimization metric
§ Partial AUC
Credits to: https://playground.tensorflow.org
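The setup above (a randomized search scored by partial AUC) might look like this in scikit-learn. The model and search space are illustrative, not from the talk; a plain callable scorer keeps the partial-AUC threshold explicit.

```python
# Sketch: RandomizedSearchCV tuned on partial AUC (max_fpr=0.2).
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)  # toy stand-in data

def partial_auc(estimator, X, y):
    # Partial AUC: only the low-false-positive region (FPR <= 0.2) counts.
    return roc_auc_score(y, estimator.predict_proba(X)[:, 1], max_fpr=0.2)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # hypothetical search space
    n_iter=8,
    scoring=partial_auc,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```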
PARTIAL AUC
§ Area Under the Curve (AUC)
§ Partial AUC: compute the AUC only up to a false-positive-rate threshold (e.g. 0.2)
§ Why use Partial AUC?
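scikit-learn exposes partial AUC directly through `roc_auc_score`'s `max_fpr` argument (McClish-standardized, so 0.5 still corresponds to a random classifier). The labels and scores below are toy illustrations.

```python
# Full AUC vs. partial AUC restricted to the low-false-positive region.
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.4, 0.75, 0.8, 0.7, 0.9]  # hypothetical model scores

full_auc = roc_auc_score(y_true, y_score)               # whole ROC curve
partial  = roc_auc_score(y_true, y_score, max_fpr=0.2)  # FPR <= 0.2 only
print(full_auc, partial)
```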
BAYESIAN OPTIMIZATION
Credits to: https://github.com/fmfn/BayesianOptimization
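The slide credits the fmfn/BayesianOptimization package. To show the underlying idea without that dependency, here is a minimal Bayesian-optimization loop: a Gaussian-process surrogate fit to observed scores, with the next point chosen by expected improvement. The 1-D objective is a hypothetical stand-in for a model's validation score.

```python
# Minimal Bayesian-optimization sketch: GP surrogate + expected improvement.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Hypothetical objective; the true maximum is at x = 2.0.
    return -(x - 2.0) ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 4.0, size=(3, 1))           # initial random evaluations
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
grid = np.linspace(0.0, 4.0, 200).reshape(-1, 1)  # candidate points

for _ in range(15):
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    best = y.max()
    # Expected improvement over the best score seen so far.
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0.0] = 0.0
    x_next = grid[np.argmax(ei)]                  # most promising candidate
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print(X[np.argmax(y), 0])  # best x found (true optimum is 2.0)
```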