
Simpler Machine Learning with SKLL 1.0

As machine learning techniques spread to new areas of industry and science, the number of potential machine learning users is growing rapidly. While the fantastic scikit-learn library is widely used in the Python community for tackling such tasks, two significant hurdles remain for people working on new machine learning problems:

• Scikit-learn requires writing a fair amount of boilerplate code to run even simple experiments (as sketched just below this list).
• Obtaining good performance typically requires tuning various model parameters, which can be particularly challenging for beginners.
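
To make the first hurdle concrete, the sketch below (not from the talk) shows roughly how much code a plain scikit-learn version of a simple experiment needs; the file name, the Survived label column, and the feature handling are illustrative assumptions.

    # A rough sketch, not from the talk: the kind of boilerplate a plain
    # scikit-learn experiment needs. 'train.csv' and the 'Survived' label
    # column are hypothetical stand-ins.
    import csv
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in 2014-era releases

    # Read features and labels by hand
    with open('train.csv') as f:
        rows = list(csv.DictReader(f))
    labels = [int(row.pop('Survived')) for row in rows]
    features = DictVectorizer(sparse=False).fit_transform(rows)

    # Split, train, and evaluate, all written out manually
    X_train, X_dev, y_train, y_dev = train_test_split(features, labels, test_size=0.2, random_state=0)
    clf = RandomForestClassifier().fit(X_train, y_train)
    print(accuracy_score(y_dev, clf.predict(X_dev)))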

SciKit-Learn Laboratory (SKLL) is an open-source Python package, originally developed by the NLP & Speech group at the Educational Testing Service (ETS), that addresses these issues by making it possible to run scikit-learn experiments with tuned models without writing any code beyond what generates the features. This talk provides an overview of performing common machine learning tasks with SKLL and highlights some of the features that are new in the 1.0 release.


  1. Simpler Machine Learning with SKLL 1.0. Dan Blanchard, Educational Testing Service, dblanchard@ets.org. PyData NYC 2014.
  2. Survived / Perished
  3. Survived / Perished: first class, female, 1 sibling, 35 years old; third class, female, 2 siblings, 18 years old; second class, male, 0 siblings, 50 years old
  4. Survived / Perished (same examples as slide 3). Can we predict survival from data?
  5. SciKit-Learn Laboratory
  6. SKLL: It's where the learning happens
  7. Learning to Predict Survival
     1. Split the given training set into train (80%) and dev (20%):
        $ ./make_titanic_example_data.py
        Loading train.csv... done
        Writing titanic/train/socioeconomic.csv... done
        Writing titanic/train/family.csv... done
        Writing titanic/train/vitals.csv... done
        Writing titanic/train/misc.csv... done
        Writing titanic/train+dev/socioeconomic.csv... done
        Writing titanic/train+dev/family.csv... done
        Writing titanic/train+dev/vitals.csv... done
        Writing titanic/train+dev/misc.csv... done
        Writing titanic/dev/socioeconomic.csv... done
        Writing titanic/dev/family.csv... done
        Writing titanic/dev/vitals.csv... done
        Writing titanic/dev/misc.csv... done
        Loading test.csv... done
        Writing titanic/test/socioeconomic.csv... done
        Writing titanic/test/family.csv... done
        Writing titanic/test/vitals.csv... done
        Writing titanic/test/misc.csv... done
  8. Learning to Predict Survival
     2. Pick classifiers to try: Decision Tree, Naive Bayes, Random Forest, and Support Vector Machine (SVM)
  9. Learning to Predict Survival
     3. Create a configuration file for SKLL:
        [General]
        experiment_name = Titanic_Evaluate_Untuned
        task = evaluate

        [Input]
        train_directory = train
        test_directory = dev
        featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
        learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
        id_col = PassengerId
        label_col = Survived

        [Output]
        results = output
        models = output
  10. Same configuration; callout: train_directory is the directory with feature files for training the learner.
  11. Same configuration; callout: test_directory is the directory with feature files for evaluating performance.
  12. Same configuration (callout not captured in the transcript).
  13. Same configuration; callout: "# of siblings, spouses, parents, children" (the family.csv features).
  14. Same configuration; callout: "departure port" (the misc.csv features).
  15. Same configuration; callout: "fare & passenger class" (the socioeconomic.csv features).
  16. Same configuration; callout: "sex & age" (the vitals.csv features).
  17. Same configuration (callout not captured in the transcript).
  18. Same configuration; callout: results is the directory to store evaluation results.
  19. Same configuration; callout: models is the directory to store trained models.
  20. Learning to Predict Survival
      4. Run the configuration file with run_experiment:
         $ run_experiment evaluate.cfg
         Loading train/family.csv... done
         Loading train/misc.csv... done
         Loading train/socioeconomic.csv... done
         Loading train/vitals.csv... done
         Loading dev/family.csv... done
         Loading dev/misc.csv... done
         Loading dev/socioeconomic.csv... done
         Loading dev/vitals.csv... done
         Loading train/family.csv... done
         Loading train/misc.csv... done
         Loading train/socioeconomic.csv... done
         Loading train/vitals.csv... done
         Loading dev/family.csv... done
         ...
  21. Learning to Predict Survival
      5. Examine the results:
         Experiment Name: Titanic_Evaluate_Untuned
         SKLL Version: 1.0.0
         Training Set: train (712)
         Test Set: dev (179)
         Feature Set: ["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]
         Learner: RandomForestClassifier
         Scikit-learn Version: 0.15.2
         Total Time: 0:00:02.065403

         +-------+------+------+-----------+--------+-----------+
         |       |  0.0 |  1.0 | Precision | Recall | F-measure |
         +-------+------+------+-----------+--------+-----------+
         | 0.000 | [96] |   19 |     0.865 |  0.835 |     0.850 |
         +-------+------+------+-----------+--------+-----------+
         | 1.000 |   15 | [49] |     0.721 |  0.766 |     0.742 |
         +-------+------+------+-----------+--------+-----------+
         (row = reference; column = predicted)
         Accuracy = 0.8100558659217877
  22. Aggregate Evaluation Results
      Learner                    Dev. Accuracy
      RandomForestClassifier     0.8101
      DecisionTreeClassifier     0.7989
      SVC                        0.7709
      MultinomialNB              0.7095
  23. Tuning the learner: can we do better than the default hyperparameters?
      [General]
      experiment_name = Titanic_Evaluate
      task = evaluate

      [Input]
      train_directory = train
      test_directory = dev
      featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
      learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
      id_col = PassengerId
      label_col = Survived

      [Tuning]
      grid_search = true
      objective = accuracy

      [Output]
      results = output
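      Conceptually (and only as an illustration, not SKLL's actual internals), turning on grid_search with objective = accuracy corresponds to wrapping each learner in a scikit-learn grid search; the parameter grid below is an assumed example, not the grid SKLL ships for this learner.

          # Conceptual sketch only; the parameter grid is an illustrative
          # assumption, not the grid SKLL actually uses for this learner.
          from sklearn.ensemble import RandomForestClassifier
          from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in 2014-era releases

          param_grid = {'max_depth': [2, 5, 10, None], 'n_estimators': [100, 500]}
          search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='accuracy', cv=3)
          # search.fit(X_train, y_train) would then select hyperparameters by
          # cross-validated accuracy before evaluating on the dev set.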
  24. Tuned Evaluation Results
      Learner                    Untuned Accuracy   Tuned Accuracy
      RandomForestClassifier     0.8101             0.8380
      DecisionTreeClassifier     0.7989             0.7989
      SVC                        0.7709             0.8156
      MultinomialNB              0.7095             0.7095
  25. Using All Available Data
      Use the training and dev sets together to generate predictions on test:
      [General]
      experiment_name = Titanic_Predict
      task = predict

      [Input]
      train_directory = train+dev
      test_directory = test
      featuresets = [["family.csv", "misc.csv", "socioeconomic.csv", "vitals.csv"]]
      learners = ["RandomForestClassifier", "DecisionTreeClassifier", "SVC", "MultinomialNB"]
      id_col = PassengerId
      label_col = Survived

      [Tuning]
      grid_search = true
      objective = accuracy

      [Output]
      results = output
  26. Test Set Accuracy
                                 Train only           Train + Dev
      Learner                    Untuned   Tuned      Untuned   Tuned
      RandomForestClassifier     0.727     0.756      0.746     0.780
      DecisionTreeClassifier     0.703     0.742      0.670     0.742
      SVC                        0.608     0.679      0.612     0.679
      MultinomialNB              0.627     0.627      0.622     0.622
  27. Advanced SKLL Features
      • Read & write .arff, .csv, .jsonlines, .libsvm, .megam, .ndj, and .tsv data
      • Parameter grids for all supported scikit-learn learners
      • Custom learners
      • Parallelize experiments on DRMAA clusters via GridMap
      • Ablation experiments
      • Collapse/rename classes from the config file
      • Feature scaling
      • Rescale predictions to be closer to observed data
      • Command-line tools for joining, filtering, and converting feature files
      • Python API (a small reading/writing sketch follows this list)
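      As a small illustration of the first bullet and the Python API, the sketch below uses the Reader and Writer classes that appear in the bonus slides to convert one feature file to another format; the file names and the label_col argument are assumptions, not code from the talk.

          # A minimal sketch, assuming Reader.for_path accepts a label_col
          # argument as in the config files above; file names are hypothetical.
          from skll import Reader, Writer

          examples = Reader.for_path('train/family.csv', label_col='Survived').read()
          Writer.for_path('train/family.jsonlines', examples).write()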
  28. Currently Supported Learners
      Classifiers: Linear Support Vector Machine, Logistic Regression, Multinomial Naive Bayes
      Regressors: Elastic Net, Lasso, Linear
      Both classifier and regressor variants: AdaBoost, Decision Tree, Gradient Boosting, K-Nearest Neighbors, Random Forest, Stochastic Gradient Descent, Support Vector Machine
  29. Contributors: Nitin Madnani, Mike Heilman, Nils Murrugarra Llerena, Aoife Cahill, Diane Napolitano, Keelan Evanini, Ben Leong
  30. References
      • Dataset: kaggle.com/c/titanic-gettingStarted
      • SKLL GitHub: github.com/EducationalTestingService/skll
      • SKLL Docs: skll.readthedocs.org
      • Titanic configs and the data-splitting script are in the examples dir on GitHub
      @dsblanch / dan-blanchard
  31. Bonus Slides
  32. SKLL API
      from skll import Learner, Reader

      # Load training examples
      train_examples = Reader.for_path('myexamples.megam').read()

      # Train a linear SVM
      learner = Learner('LinearSVC')
      learner.train(train_examples)

      # Load test examples and evaluate
      test_examples = Reader.for_path('test.tsv').read()
      conf_matrix, accuracy, prf_dict, model_params, obj_score = learner.evaluate(test_examples)
  33. Same code; callout: conf_matrix is the confusion matrix.
  34. Same code; callout: accuracy is the overall accuracy on the test set.
  35. Same code; callout: prf_dict holds precision, recall, and F-score for each class.
  36. Same code; callout: model_params are the tuned model parameters.
  37. Same code; callout: obj_score is the objective function score on the test set.
  38. SKLL API, continued: same code as slide 32, plus:
      # Generate predictions from the trained model
      predictions = learner.predict(test_examples)
  39. SKLL API, continued: same code as slide 38, plus:
      # Perform 10-fold cross-validation with a radial SVM
      learner = Learner('SVC')
      fold_result_list, grid_search_scores = learner.cross_validate(train_examples)
  40. Same code; callout: fold_result_list holds the per-fold evaluation results.
  41. Same code; callout: grid_search_scores holds the per-fold training-set objective scores.
  42. Repeat of the cross-validation code (no callout captured in the transcript).
  43. SKLL API
      import numpy as np
      from os.path import join
      from skll import FeatureSet, NDJWriter, Writer

      # Create some training examples
      # (num_train_examples and _my_dir are assumed to be defined elsewhere)
      labels = []
      ids = []
      features = []
      for i in range(num_train_examples):
          labels.append("dog" if i % 2 == 0 else "cat")
          ids.append("{}{}".format(labels[-1], i))
          features.append({"f1": np.random.randint(1, 4),
                           "f2": np.random.randint(1, 4)})
      feat_set = FeatureSet('training', ids, labels=labels, features=features)

      # Write them to a file
      train_path = join(_my_dir, 'train', 'test_summary.jsonlines')
      Writer.for_path(train_path, feat_set).write()
      # Or, equivalently:
      NDJWriter(train_path, feat_set).write()
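      A small follow-on to slide 43 (an illustrative sketch, not from the talk): the .jsonlines file written above could be read back and used to train a learner with the same API calls shown on the earlier bonus slides.

          # Follow-on sketch: read the feature file written above and train on it.
          # Assumes the code from slide 43 has run, so train_path exists on disk.
          from skll import Learner, Reader

          train_examples = Reader.for_path(train_path).read()
          learner = Learner('LogisticRegression')   # any learner from slide 28 would do
          learner.train(train_examples)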
