Simpler 
Machine Learning 
with SKLL 1.0 
Dan Blanchard 
Educational Testing Service 
dblanchard@ets.org 
PyData NYC 2014
Survived Perished 
first class, 
female, 
1 sibling, 
35 years old 
third class, 
female, 
2 siblings, 
18 years old 
second class, 
male, 
0 siblings, 
50 years old 
Can we predict survival from data?
SciKit-Learn Laboratory
SKLL 
It's where the learning happens
Learning to Predict Survival 
1. Split up given training set: train (80%) and dev (20%) 
$ ./make_titanic_example_data.py 
Loading train.csv... done 
Writing titanic/train/socioeconomic.csv...done 
Writing titanic/train/family.csv...done 
Writing titanic/train/vitals.csv...done 
Writing titanic/train/misc.csv...done 
Writing titanic/train+dev/socioeconomic.csv...done 
Writing titanic/train+dev/family.csv...done 
Writing titanic/train+dev/vitals.csv...done 
Writing titanic/train+dev/misc.csv...done 
Writing titanic/dev/socioeconomic.csv...done 
Writing titanic/dev/family.csv...done 
Writing titanic/dev/vitals.csv...done 
Writing titanic/dev/misc.csv...done 
Loading test.csv... done 
Writing titanic/test/socioeconomic.csv...done 
Writing titanic/test/family.csv...done 
Writing titanic/test/vitals.csv...done 
Writing titanic/test/misc.csv...done
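The script's 80/20 split amounts to shuffling the labeled rows and slicing. A minimal stdlib sketch (the helper name and toy rows are illustrative, not part of SKLL):

```python
import random

def split_train_dev(rows, train_frac=0.8, seed=0):
    """Shuffle rows, then slice into train and dev portions."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    return rows[:n_train], rows[n_train:]

# Toy stand-in for the Kaggle train.csv rows
passengers = [{"PassengerId": i, "Survived": i % 2} for i in range(10)]
train_rows, dev_rows = split_train_dev(passengers)
```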
Learning to Predict Survival 
2. Pick classifiers to try: 
1. Decision Tree 
2. Naive Bayes 
3. Random Forest 
4. Support Vector Machine (SVM)
Learning to Predict Survival 
3. Create configuration file for SKLL 
[General] 
experiment_name = Titanic_Evaluate_Untuned 
task = evaluate 
[Input] 
train_directory = train 
test_directory = dev 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Output] 
results = output 
models = output
The slides repeat this configuration, each time with a callout annotating one setting: 
• train_directory: directory with feature files for training the learner 
• test_directory: directory with feature files for evaluating performance 
• family.csv: # of siblings, spouses, parents, and children 
• misc.csv: departure port 
• socioeconomic.csv: fare & passenger class 
• vitals.csv: sex & age 
• results: directory to store evaluation results 
• models: directory to store trained models
Learning to Predict Survival 
4. Run the configuration file with run_experiment 
$ run_experiment evaluate.cfg 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
Loading dev/misc.csv... done 
Loading dev/socioeconomic.csv... done 
Loading dev/vitals.csv... done 
Loading train/family.csv... done 
Loading train/misc.csv... done 
Loading train/socioeconomic.csv... done 
Loading train/vitals.csv... done 
Loading dev/family.csv... done 
...
Learning to Predict Survival 
5. Examine results 
Experiment Name: Titanic_Evaluate_Untuned 
SKLL Version: 1.0.0 
Training Set: train (712) 
Test Set: dev (179) 
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"] 
Learner: RandomForestClassifier 
Scikit-learn Version: 0.15.2 
Total Time: 0:00:02.065403 
+-------+------+------+-----------+--------+-----------+ 
| | 0.0 | 1.0 | Precision | Recall | F-measure | 
+-------+------+------+-----------+--------+-----------+ 
| 0.000 | [96] | 19 | 0.865 | 0.835 | 0.850 | 
+-------+------+------+-----------+--------+-----------+ 
| 1.000 | 15 | [49] | 0.721 | 0.766 | 0.742 | 
+-------+------+------+-----------+--------+-----------+ 
(row = reference; column = predicted) 
Accuracy = 0.8100558659217877
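The per-class figures follow directly from the confusion matrix above; e.g., for class 0.0:

```python
# Confusion counts from the table (row = reference, column = predicted)
tp0, fn0 = 96, 19   # reference 0.0: 96 predicted as 0.0, 19 as 1.0
fp0, tp1 = 15, 49   # reference 1.0: 15 predicted as 0.0, 49 as 1.0

precision0 = tp0 / (tp0 + fp0)                        # 96 / 111
recall0 = tp0 / (tp0 + fn0)                           # 96 / 115
f0 = 2 * precision0 * recall0 / (precision0 + recall0)
accuracy = (tp0 + tp1) / (tp0 + fn0 + fp0 + tp1)      # 145 / 179
```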
Aggregate Evaluation Results 
Dev. Accuracy Learner 
0.8101 RandomForestClassifier 
0.7989 DecisionTreeClassifier 
0.7709 SVC 
0.7095 MultinomialNB
Can we do better than default hyperparameters? 
[General] 
experiment_name = Titanic_Evaluate 
task = evaluate 
[Input] 
train_directory = train 
test_directory = dev 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Tuning] 
grid_search = true 
objective = accuracy 
[Output] 
results = output 
The [Tuning] section enables grid search to tune each learner.
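Setting grid_search = true asks SKLL (via scikit-learn) to try a grid of hyperparameter values and keep the combination that maximizes the objective. The core loop, reduced to stdlib Python (the toy train/score functions and parameter grid are illustrative):

```python
from itertools import product

def grid_search(train_fn, score_fn, param_grid):
    """Try every combination in param_grid; return the best params and score."""
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = train_fn(**params)
        score = score_fn(model)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy stand-ins: the "model" is its own parameters; the score peaks at C=1, gamma=0.1
train_fn = lambda C, gamma: (C, gamma)
score_fn = lambda model: -abs(model[0] - 1) - abs(model[1] - 0.1)
best, _ = grid_search(train_fn, score_fn, {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]})
```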
Tuned Evaluation Results 
Untuned Accuracy Tuned Accuracy Learner 
0.8101 0.8380 RandomForestClassifier 
0.7989 0.7989 DecisionTreeClassifier 
0.7709 0.8156 SVC 
0.7095 0.7095 MultinomialNB
Using All Available Data 
Use training and dev to generate predictions on test 
[General] 
experiment_name = Titanic_Predict 
task = predict 
[Input] 
train_directory = train+dev 
test_directory = test 
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]] 
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", 
"MultinomialNB"] 
id_col = PassengerId 
label_col = Survived 
[Tuning] 
grid_search = true 
objective = accuracy 
[Output] 
results = output
Test Set Accuracy 
Learner                    Train only          Train + Dev 
                           Untuned   Tuned     Untuned   Tuned 
RandomForestClassifier     0.727     0.756     0.746     0.780 
DecisionTreeClassifier     0.703     0.742     0.670     0.742 
SVC                        0.608     0.679     0.612     0.679 
MultinomialNB              0.627     0.627     0.622     0.622
Advanced SKLL Features 
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data 
• Parameter grids for all supported scikit-learn learners 
• Custom learners 
• Parallelize experiments on DRMAA clusters via GridMap 
• Ablation experiments 
• Collapse/rename classes from config file 
• Feature scaling 
• Rescale predictions to be closer to observed data 
• Command-line tools for joining, filtering, and converting feature files 
• Python API
Currently Supported Learners 
Classifiers only: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes 
Regressors only: Elastic Net, Lasso, Linear Regression 
Both classifiers & regressors: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
Contributors 
• Nitin Madnani 
• Mike Heilman 
• Nils Murrugarra Llerena 
• Aoife Cahill 
• Diane Napolitano 
• Keelan Evanini 
• Ben Leong
References 
• Dataset: kaggle.com/c/titanic-gettingStarted 
• SKLL GitHub: github.com/EducationalTestingService/skll 
• SKLL Docs: skll.readthedocs.org 
• Titanic configs and data splitting script in examples dir on GitHub 
@dsblanch 
dan-blanchard
Bonus Slides
SKLL API 
from skll import Learner, Reader 
# Load training examples 
train_examples = Reader.for_path('myexamples.megam').read() 
# Train a linear SVM 
learner = Learner('LinearSVC') 
learner.train(train_examples) 
# Load test examples and evaluate 
test_examples = Reader.for_path('test.tsv').read() 
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
The slides repeat this snippet with callouts naming each return value of evaluate(): 
• conf_matrix: confusion matrix 
• accuracy: overall accuracy on the test set 
• prf_dict: precision, recall, and F-score for each class 
• model_params: tuned model parameters 
• obj_score: objective function score on the test set 
SKLL API 
# Generate predictions from trained model 
predictions = learner.predict(test_examples) 
# Perform 10-fold cross-validation with a radial SVM 
learner = Learner('SVC') 
fold_result_list, grid_search_scores = learner.cross_validate(train_examples) 
The return values of cross_validate(): 
• fold_result_list: per-fold evaluation results 
• grid_search_scores: per-fold training-set objective scores
SKLL API 
import numpy as np 
from os.path import join 
from skll import FeatureSet, NDJWriter, Writer 
# Create some training examples 
num_train_examples = 100 
labels = [] 
ids = [] 
features = [] 
for i in range(num_train_examples): 
    label = "dog" if i % 2 == 0 else "cat" 
    labels.append(label) 
    ids.append("{}{}".format(label, i)) 
    features.append({"f1": np.random.randint(1, 4), "f2": np.random.randint(1, 4)}) 
feat_set = FeatureSet('training', ids, labels=labels, features=features) 
# Write them to a file 
train_path = join('train', 'test_summary.jsonlines') 
Writer.for_path(train_path, feat_set).write() 
# Or 
NDJWriter(train_path, feat_set).write()

More Related Content

Viewers also liked

streamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormstreamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormDaniel Blanchard
 
9 Facts Maine Small Business Owners Should Know About Portland Radio
9 Facts Maine Small Business Owners Should Know About Portland  Radio9 Facts Maine Small Business Owners Should Know About Portland  Radio
9 Facts Maine Small Business Owners Should Know About Portland RadioLarry Julius
 
Chemistry and Application of Leuco Dyes
Chemistry and Application of Leuco DyesChemistry and Application of Leuco Dyes
Chemistry and Application of Leuco DyesAsma Khan
 
Unad’s festival film 2016
Unad’s festival film 2016Unad’s festival film 2016
Unad’s festival film 2016Luis Rodriguez
 
Apparato locomotore
Apparato locomotoreApparato locomotore
Apparato locomotorepgiac
 
Le giunture
Le giuntureLe giunture
Le giunturepgiac
 
Torace e bacin0
Torace e bacin0Torace e bacin0
Torace e bacin0pgiac
 

Viewers also liked (7)

streamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with stormstreamparse and pystorm: simple reliable parallel processing with storm
streamparse and pystorm: simple reliable parallel processing with storm
 
9 Facts Maine Small Business Owners Should Know About Portland Radio
9 Facts Maine Small Business Owners Should Know About Portland  Radio9 Facts Maine Small Business Owners Should Know About Portland  Radio
9 Facts Maine Small Business Owners Should Know About Portland Radio
 
Chemistry and Application of Leuco Dyes
Chemistry and Application of Leuco DyesChemistry and Application of Leuco Dyes
Chemistry and Application of Leuco Dyes
 
Unad’s festival film 2016
Unad’s festival film 2016Unad’s festival film 2016
Unad’s festival film 2016
 
Apparato locomotore
Apparato locomotoreApparato locomotore
Apparato locomotore
 
Le giunture
Le giuntureLe giunture
Le giunture
 
Torace e bacin0
Torace e bacin0Torace e bacin0
Torace e bacin0
 

Recently uploaded

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingsocarem879
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 

Recently uploaded (20)

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 

Simpler Machine Learning with SKLL 1.0

  • 1. Simpler Machine Learning with SKLL 1.0 Dan Blanchard Educational Testing Service dblanchard@ets.org PyData NYC 2014
  • 2.
  • 3.
  • 4.
  • 6. Survived Perished first class, female, 1 sibling, 35 years old third class, female, 2 siblings, 18 years old second class, male, 0 siblings, 50 years old
  • 7. Survived Perished first class, female, 1 sibling, 35 years old third class, female, 2 siblings, 18 years old second class, male, 0 siblings, 50 years old Can we predict survival from data?
  • 9. SKLL It's where the learning happens
  • 10. Learning to Predict Survival 1. Split up given training set: train (80%) and dev (20%) $ ./make_titanic_example_data.py Loading train.csv... done Writing titanic/train/socioeconomic.csv...done Writing titanic/train/family.csv...done Writing titanic/train/vitals.csv...done Writing titanic/train/misc.csv...done Writing titanic/train+dev/socioeconomic.csv...done Writing titanic/train+dev/family.csv...done Writing titanic/train+dev/vitals.csv...done Writing titanic/train+dev/misc.csv...done Writing titanic/dev/socioeconomic.csv...done Writing titanic/dev/family.csv...done Writing titanic/dev/vitals.csv...done Writing titanic/dev/misc.csv...done Loading test.csv... done Writing titanic/test/socioeconomic.csv...done Writing titanic/test/family.csv...done Writing titanic/test/vitals.csv...done Writing titanic/test/misc.csv...done
  • 11. Learning to Predict Survival 2. Pick classifiers to try: 1. Decision Tree 2. Naive Bayes 3. Random forest 4. Support Vector Machine (SVM)
  • 12. Learning to Predict Survival
3. Create configuration file for SKLL (slides 12–22 step through this same file; the per-field callouts are folded in here as # comments)
[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate
[Input]
# directory with feature files for training learner
train_directory = train
# directory with feature files for evaluating performance
test_directory = dev
# family.csv: # of siblings, spouses, parents, children
# misc.csv: departure port
# socioeconomic.csv: fare & passenger class
# vitals.csv: sex & age
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived
[Output]
# directory to store evaluation results
results = output
# directory to store trained models
models = output
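SKLL configuration files use standard INI syntax, so a file like the one above can be inspected with Python's built-in configparser (a sketch for illustration only; run_experiment performs its own, stricter parsing and validation). List-valued fields such as learners are JSON inside the INI value:

```python
import configparser
import json

# An abridged version of the config file shown on the slides.
cfg_text = """
[General]
experiment_name = Titanic_Evaluate_Untuned
task = evaluate

[Input]
train_directory = train
test_directory = dev
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)
task = parser["General"]["task"]
# The learners value is a JSON list embedded in the INI file.
learners = json.loads(parser["Input"]["learners"])
```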
  • 23. Learning to Predict Survival
4. Run the configuration file with run_experiment
$ run_experiment evaluate.cfg
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
Loading dev/misc.csv... done
Loading dev/socioeconomic.csv... done
Loading dev/vitals.csv... done
Loading train/family.csv... done
Loading train/misc.csv... done
Loading train/socioeconomic.csv... done
Loading train/vitals.csv... done
Loading dev/family.csv... done
...
  • 24. Learning to Predict Survival
5. Examine results
Experiment Name: Titanic_Evaluate_Untuned
SKLL Version: 1.0.0
Training Set: train (712)
Test Set: dev (179)
Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
Learner: RandomForestClassifier
Scikit-learn Version: 0.15.2
Total Time: 0:00:02.065403
+-------+------+------+-----------+--------+-----------+
|       | 0.0  | 1.0  | Precision | Recall | F-measure |
+-------+------+------+-----------+--------+-----------+
| 0.000 | [96] | 19   | 0.865     | 0.835  | 0.850     |
+-------+------+------+-----------+--------+-----------+
| 1.000 | 15   | [49] | 0.721     | 0.766  | 0.742     |
+-------+------+------+-----------+--------+-----------+
(row = reference; column = predicted)
Accuracy = 0.8100558659217877
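The precision, recall, F-measure, and accuracy figures in that report follow directly from the confusion-matrix counts (96, 19, 15, 49). A quick sanity check for class 0 in plain Python:

```python
# Confusion matrix from the slide: rows = reference, columns = predicted.
# class 0 (perished): 96 correct, 19 misclassified as class 1
# class 1 (survived): 15 misclassified as class 0, 49 correct
tp0, fn0, fp0, tp1 = 96, 19, 15, 49

precision_0 = tp0 / (tp0 + fp0)                    # 96 / 111 ≈ 0.865
recall_0 = tp0 / (tp0 + fn0)                       # 96 / 115 ≈ 0.835
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)   # ≈ 0.850
accuracy = (tp0 + tp1) / (tp0 + fn0 + fp0 + tp1)   # 145 / 179 ≈ 0.810
```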
  • 25. Aggregate Evaluation Results
Dev. Accuracy   Learner
0.8101          RandomForestClassifier
0.7989          DecisionTreeClassifier
0.7709          SVC
0.7095          MultinomialNB
  • 26. Tuning Learner
Can we do better than default hyperparameters?
[General]
experiment_name = Titanic_Evaluate
task = evaluate
[Input]
train_directory = train
test_directory = dev
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived
[Tuning]
grid_search = true
objective = accuracy
[Output]
results = output
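Setting grid_search = true makes SKLL tune each learner's hyperparameters by exhaustive search over a parameter grid, keeping the combination with the best objective score. The core idea can be shown as a toy sketch (the parameter grid, scores, and function names below are made up for illustration; this is not SKLL's implementation, which delegates to scikit-learn's grid search):

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every parameter combination; return the best-scoring one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: pretend dev-set accuracy peaks at C=1, gamma=0.1.
toy_scores = {(0.1, 0.01): 0.70, (1, 0.01): 0.75, (0.1, 0.1): 0.78, (1, 0.1): 0.84}

def score_fn(params):
    return toy_scores[(params["C"], params["gamma"])]

best, score = grid_search(score_fn, {"C": [0.1, 1], "gamma": [0.01, 0.1]})
```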
  • 27. Tuned Evaluation Results
Untuned Accuracy   Tuned Accuracy   Learner
0.8101             0.8380           RandomForestClassifier
0.7989             0.7989           DecisionTreeClassifier
0.7709             0.8156           SVC
0.7095             0.7095           MultinomialNB
  • 28. Using All Available Data
Use training and dev to generate predictions on test
[General]
experiment_name = Titanic_Predict
task = predict
[Input]
train_directory = train+dev
test_directory = test
featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
id_col = PassengerId
label_col = Survived
[Tuning]
grid_search = true
objective = accuracy
[Output]
results = output
  • 29. Test Set Accuracy
                         Train only          Train + Dev
Learner                  Untuned   Tuned     Untuned   Tuned
RandomForestClassifier   0.727     0.756     0.746     0.780
DecisionTreeClassifier   0.703     0.742     0.670     0.742
SVC                      0.608     0.679     0.612     0.679
MultinomialNB            0.627     0.627     0.622     0.622
  • 30. Advanced SKLL Features
• Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
• Parameter grids for all supported scikit-learn learners
• Custom learners
• Parallelize experiments on DRMAA clusters via GridMap
• Ablation experiments
• Collapse/rename classes from config file
• Feature scaling
• Rescale predictions to be closer to observed data
• Command-line tools for joining, filtering, and converting feature files
• Python API
  • 31. Currently Supported Learners
Classifiers only: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
Regressors only: Elastic Net, Lasso, Linear
Both classifiers & regressors: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
  • 32. Contributors
• Nitin Madnani
• Mike Heilman
• Nils Murrugarra Llerena
• Aoife Cahill
• Diane Napolitano
• Keelan Evanini
• Ben Leong
  • 33. References
• Dataset: kaggle.com/c/titanic-gettingStarted
• SKLL GitHub: github.com/EducationalTestingService/skll
• SKLL Docs: skll.readthedocs.org
• Titanic configs and data splitting script in examples dir on GitHub
@dsblanch · dan-blanchard
  • 35. SKLL API (slides 35–45 step through this same code; the callouts are folded in here as comments)
from skll import Learner, Reader

# Load training examples
train_examples = Reader.for_path('myexamples.megam').read()

# Train a linear SVM
learner = Learner('LinearSVC')
learner.train(train_examples)

# Load test examples and evaluate; returns the confusion matrix, the
# overall accuracy on the test set, precision/recall/F-score for each
# class, the tuned model parameters, and the objective function score
# on the test set
test_examples = Reader.for_path('test.tsv').read()
conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)

# Generate predictions from trained model
predictions = learner.predict(test_examples)

# Perform 10-fold cross-validation with a radial SVM; returns per-fold
# evaluation results and per-fold training set objective scores
learner = Learner('SVC')
fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
  • 46. SKLL API
import numpy as np
from os.path import join
from skll import FeatureSet, NDJWriter, Writer

# Create some training examples
num_train_examples = 100
labels = []
ids = []
features = []
for i in range(num_train_examples):
    label = "dog" if i % 2 == 0 else "cat"
    labels.append(label)
    ids.append("{}{}".format(label, i))
    features.append({"f1": np.random.randint(1, 4),
                     "f2": np.random.randint(1, 4)})
feat_set = FeatureSet('training', ids, labels=labels, features=features)

# Write them to a file (_my_dir is a base directory defined elsewhere)
train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
Writer.for_path(train_path, feat_set).write()
# Or: NDJWriter(train_path, feat_set).write()
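The .jsonlines/.ndj format written above stores one JSON object per line; per the SKLL docs, each line carries an "id", a label "y", and a feature dict "x". A minimal standard-library round trip shows the shape of the data (an illustration only, not SKLL's own reader/writer):

```python
import json

# Two examples in SKLL's jsonlines shape: id, label "y", feature dict "x".
examples = [
    {"id": "dog0", "y": "dog", "x": {"f1": 2, "f2": 3}},
    {"id": "cat1", "y": "cat", "x": {"f1": 1, "f2": 2}},
]

# Write: one JSON object per line.
lines = "\n".join(json.dumps(ex) for ex in examples)

# Read them back, one line at a time.
parsed = [json.loads(line) for line in lines.splitlines()]
```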