How I did my Master's Thesis
Amendra Shrestha
Uppsala University
January 10, 2017
Introduction Workflow Data Preparation Modeling Process Experiment
Introduction
Find a topic
Reading
• papers
• online machine learning courses (coursera.org)
• Machine Learning Mastery for self-study (http://machinelearningmastery.com)
• writing
Data
• crawl data from social media, discussion forums, blogs
• download from archives
  • KONECT (http://konect.uni-koblenz.de)
  • UCI (http://archive.ics.uci.edu/ml/datasets.html)
  • Kaggle (http://blog.kaggle.com)
  • Språkbanken (https://spraakbanken.gu.se/)
Project workflow
• Data Preparation
  • data cleaning
  • data preparation
  • feature vector creation
• Modeling Process
  • feature selection
  • transformation
  • missing data
  • model generation
  • model selection
Data Preparation
Cleaning data
• removing duplicates
• impossible values
  • negative ages
• misspelt words
• inconsistent time formats
• unwanted elements
  • text: quotes, retweets, strange symbols, URLs, punctuation, function words
• outliers
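
A minimal sketch of a few of the cleaning steps above, using only the Python standard library. The function names, the tiny function-word list, and the age range of 0 to 130 are assumptions made for illustration, not part of the original slides.

```python
import re
import string

# Hypothetical, deliberately tiny function-word list; a real project would use a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "and", "or", "of", "to", "is", "in", "on"}


def clean_text(text):
    """Lowercase a post and strip URLs, retweet markers, punctuation and function words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URLs (removed before punctuation is stripped)
    text = re.sub(r"\brt\b", " ", text)         # retweet marker "RT"
    # Remove ASCII punctuation, which also covers quotes and the "@" of mentions.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in FUNCTION_WORDS]
    return " ".join(tokens)


def clean_records(records):
    """Drop exact duplicates and records with impossible ages from (name, age) pairs."""
    seen = set()
    cleaned = []
    for name, age in records:
        if age < 0 or age > 130:      # assumed plausibility range
            continue
        if (name, age) in seen:       # duplicate of an earlier record
            continue
        seen.add((name, age))
        cleaned.append((name, age))
    return cleaned


tweet = clean_text('RT @user: "The model is great!" http://t.co/abc')
people = clean_records([("ann", 31), ("bob", -4), ("ann", 31), ("eve", 27)])
```

Misspelt words, time formats and outliers need domain-specific rules and are left out of this sketch.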
Preparation of data
• alterations of data
• stemming and lemmatization of text
• standardization of units
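
As an illustration of bringing units to a common scale, the sketch below converts lengths recorded in mixed units to metres. The unit labels and the `to_metres` helper are hypothetical; stemming and lemmatization are not shown because they normally rely on a dedicated NLP library.

```python
# Conversion factors to metres; the set of unit labels is an assumption for illustration.
TO_METRES = {"m": 1.0, "cm": 0.01, "mm": 0.001, "in": 0.0254, "ft": 0.3048}


def to_metres(value, unit):
    """Convert a length to metres, failing loudly on an unknown unit."""
    try:
        return value * TO_METRES[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}")


heights = [(180, "cm"), (1.75, "m"), (70, "in")]
heights_m = [round(to_metres(v, u), 3) for v, u in heights]
```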
Feature vector
• n-dimensional vector of features
• types
  • data dependent
  • data independent
• text
  • bag of words
  • term frequency
  • tf-idf
  • n-grams
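
The text features listed above can be sketched in plain Python as follows. This assumes documents are already tokenized and uses the unsmoothed definition idf = log(N / df); libraries often apply smoothing, so their numbers may differ. All function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens, vocab):
    """Raw count of each vocabulary term in the document."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: c / len(tokens) for term, c in counts.items()}


def tf_idf(docs):
    """docs is a list of token lists; returns (vocabulary, one tf-idf vector per document)."""
    vocab = sorted({t for doc in docs for t in doc})
    df = Counter(t for doc in docs for t in set(doc))       # document frequency
    idf = {t: math.log(len(docs) / df[t]) for t in vocab}   # unsmoothed idf
    vectors = []
    for doc in docs:
        tf = term_frequency(doc)
        vectors.append([tf.get(t, 0.0) * idf[t] for t in vocab])
    return vocab, vectors


docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
vocab, vectors = tf_idf(docs)
counts = bag_of_words(docs[2], vocab)
bigrams = ngrams(["a", "b", "c"], 2)
```

Each document becomes a vector with one dimension per vocabulary term, which is the n-dimensional feature vector the slide refers to.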
Modeling Process
Feature selection
Use a minimal number of maximally informative features, to reduce:
• noise
• overfitting
• computational load
How to find the best features?
• background/expert knowledge
• pairwise statistical analysis
• model validation
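
One simple form of pairwise statistical analysis is to rank features by the absolute Pearson correlation between each feature and the target. The sketch below is only one possible choice, and the column names and values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0   # a constant column carries no linear information
    return cov / (sx * sy)


def rank_features(columns, target):
    """Return feature names sorted by |correlation with target|, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Made-up example data.
columns = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [42, 38, 44, 39, 41],
    "constant": [1, 1, 1, 1, 1],
}
target = [52, 58, 61, 70, 74]
ranking = rank_features(columns, target)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it complements rather than replaces expert knowledge and model validation.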
Transformation
• scaling
• PCA
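
Two common ways of scaling a single feature are sketched below: min-max scaling to [0, 1] and standardization to zero mean and unit variance. The function names are illustrative; PCA is not shown here because it is usually done with a linear algebra library.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = min_max_scale([10, 20, 30])
z = standardize([2, 4, 6])
```

In practice the scaling parameters would be computed on the training data only and then reused for validation and test data.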
Dealing with incomplete data
• use a model that can deal with missing items
• throw away incomplete records
• simple statistic (not recommended)
  • mean, median, mode
• kNN imputation
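
The difference between a simple statistic and kNN imputation can be sketched as follows. Missing values are represented as `None`, and the kNN variant here is a simplified assumption: it measures Euclidean distance on the features the incomplete row does have and averages the k nearest complete rows.

```python
import math
import statistics


def impute_mean(column):
    """Replace None with the mean of the observed values in one column."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest complete rows."""
    complete = [r for r in rows if None not in r]
    result = []
    for row in rows:
        if None not in row:
            result.append(list(row))
            continue
        known = [i for i, v in enumerate(row) if v is not None]

        def dist(other):
            # Distance restricted to the features present in the incomplete row.
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in known))

        neighbours = sorted(complete, key=dist)[:k]
        filled = [
            v if v is not None else statistics.fmean([n[i] for n in neighbours])
            for i, v in enumerate(row)
        ]
        result.append(filled)
    return result


rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
by_mean = impute_mean([r[1] for r in rows])
by_knn = knn_impute(rows, k=2)
```

On this toy data the column mean fills the gap with 40.0, while the two nearest neighbours give 15.0, which is closer to what the row's other feature suggests.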
Model generation
• supervised learning
  • artificial neural network
  • decision tree learning
  • support vector machines
  • random forests
• unsupervised learning
  • clustering
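
As one concrete example of clustering, here is a minimal sketch of k-means (Lloyd's algorithm). The slides do not name a specific clustering method, so this choice is an assumption; the initial centroids are simply the first k points to keep the run deterministic, whereas real implementations usually initialize randomly.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points (tuples of floats) into k groups; returns (labels, centroids)."""
    centroids = [tuple(p) for p in points[:k]]   # simplistic deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                new_centroids.append(centroids[c])   # keep an empty cluster in place
                continue
            new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
        if new_centroids == centroids:               # converged
            break
        centroids = new_centroids
    return labels, centroids


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
```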
Model selection
• how well the model performs on new data
• splitting the data into training, validation and test sets
• cross-validation
  • divide data into n subsets
  • generate n models, each trained on all but one subset
  • test each model on its held-out subset
  • combine results
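
The cross-validation steps above can be sketched as follows. The `fit` and `predict` callables and the majority-class placeholder model are assumptions for illustration; any model with the same two operations could be plugged in. The sketch assumes the data has already been shuffled and that there are at least as many items as folds.

```python
from collections import Counter


def k_fold_indices(n_items, n_folds):
    """Split range(n_items) into n_folds contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(n_folds):
        size = n_items // n_folds + (1 if i < n_items % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, n_folds, fit, predict):
    """Train on all but one fold, test on the held-out fold, and average the accuracies."""
    scores = []
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        correct = sum(predict(model, xs[i]) == ys[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)   # combine results as a mean accuracy


# Placeholder model: always predicts the most common training label.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


xs = list(range(6))
ys = ["a", "a", "a", "b", "a", "a"]
accuracy = cross_validate(xs, ys, n_folds=3, fit=fit_majority, predict=predict_majority)
```

Every item is used for testing exactly once, so the averaged score estimates performance on new data better than a score measured on the training set.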
Experiment
Thank You

How to solve a problem with machine learning
