Learning from data


1. Learning from Data
   Busy Professional's guide to machine learning
   @govindk
   http://govindkanshi.wordpress.com
2. Agenda
   • What we know
   • What we do not know
   • Process
   • What to measure
   • Challenge with Model
   • Challenge with Data
   • Resources
   • Software
   • Books
3. What we know
   • Reports are built from data
   • KPIs are built from data
   • Dashboards are built from data
   • They all measure known metrics and answer known questions
4. What we do not know
   • Will this person turn delinquent in x years, based on their profile (age/income/background…)?
   • Which kind of process or machine will fail?
   • Which people/things are similar to each other – find me a pattern
   • How do we prevent people from being readmitted into hospital?
   • Why can't we answer these today? Because we do not know the question in advance, and databases/applications do not have this functionality out of the box.
5. We are already using applied ML results
   • Mail gets de-spammed
   • Kinect recognizes our gestures
   • Facebook recognizes our photos
   • Siri/Cortana recognize our voice commands
   • Watson used some of these techniques
   • Search uses many of them
   • Recommendations are right there in your face
6. So then
   • Learn from data
   • How?
   • Create a model of the data
   • Test the model for error, then use it
7. Unsupervised
   • Clustering
     • Customer segmentation
     • Topic identification
   • A number of algorithms
     • Hierarchical (distance as the measure – generally Euclidean)
       • Agglomerative (start with n groups and keep merging them), e.g. single link (merge two at a time)
       • Divisive (start with a single cluster and keep splitting it)
   • A minimal sketch follows below
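A minimal sketch of agglomerative clustering, assuming Python and scikit-learn (the deck's demos also use R); the two-feature toy data here is made up purely for illustration:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Toy data: two visibly separated groups in two dimensions.
    X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])

    # Euclidean distance with single linkage, as on the slide:
    # repeatedly merge the two closest groups until 2 remain.
    model = AgglomerativeClustering(n_clusters=2, linkage="single")
    labels = model.fit_predict(X)
    print(labels)  # e.g. [0 0 0 1 1 1]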
8. Simple way
   • Group folks on:
     • Height
     • What they eat
     • Where they are from (state)
   • Next time a new person comes in, let us predict their group
9. Demos
   • USArrests data
   • Wine data
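The original demos are in R; a rough Python equivalent on the Wine data (which ships with scikit-learn) might look like this:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import load_wine
    from sklearn.preprocessing import StandardScaler

    # Scale the 13 chemical measurements so no feature dominates the distance.
    X = StandardScaler().fit_transform(load_wine().data)

    # The Wine data comes from three cultivars, so ask for three clusters.
    labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
    print(labels[:20])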
10. Challenges and next steps
   • How many groups/clusters?
   • How many mis-groupings? (evaluation)
   • Associating topics – and after clustering, what next?
   • Once clusters are formed, someone can name them
   • Now run supervised methods on the data to learn more
11. Supervised learning
   • Given a label L for a set of attributes (a1, a2, a3, …)
   • Learn a model which can predict the label from the attributes
12. Simple way to understand classification
   • Let us say we are labelled North Indian or South Indian
   • How? Attributes (language, food, movie language, music, …)
   • Basically, learning the link between
     • observed data X, and
     • a variable y, usually called the target or label
13. Supervised
   • Data
     • One dataset for training, which has labels
     • One dataset for testing
   • Examples
     • Classification (spam, order data, disease data, Kinect gestures)
       • Binary vs. multiclass
     • Regression (sales)
     • Ranking
       • Search
     • Predictive maintenance
     • Recommendation
       • Netflix competition – SVD
14. Demos
   • Trees
     • Decision tree – Python (show train, test and validation; sketch below)
     • Decision tree – R
     • BigML (network dependent)
   • Challenge: one input at a time
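A minimal sketch of the "Decision tree – Python" demo, assuming scikit-learn with the Wine data as a stand-in dataset:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)

    # Hold out 30% of the rows for testing, train on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    clf = DecisionTreeClassifier(max_depth=3, random_state=42)
    clf.fit(X_train, y_train)
    print("train accuracy:", clf.score(X_train, y_train))
    print("test accuracy: ", clf.score(X_test, y_test))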
15. A few more terms to overcome data issues
   • Bagging (used with tree models; reduces variance)
     • Train an ensemble of models from bootstrap* samples
     • Take a vote among the models
       • The class predicted by the majority of models wins
       • Take an average if the outputs are scores or probabilities
     • * Bootstrap denotes a different random sample of the dataset
   • Boosting (reduces bias)
     • Like bagging, but penalizes and learns from misclassification
     • The challenge is assigning weights to misclassified instances to penalize them
     • Start with equal weights, then increase the weights of misclassified instances each round until the error comes down
   • A sketch contrasting the two follows below
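A minimal sketch contrasting bagging and boosting, assuming scikit-learn's BaggingClassifier and AdaBoostClassifier over tree base learners (dataset and parameters are illustrative only):

    from sklearn.datasets import load_wine
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)

    # Bagging: vote over trees trained on bootstrap samples.
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                random_state=0)
    # Boosting: AdaBoost reweights misclassified instances each round.
    boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

    for name, model in [("bagging", bagging), ("boosting", boosting)]:
        print(name, cross_val_score(model, X, y, cv=5).mean())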
16. Demo
   • RandomForest
     • Take a bootstrap sample of the N training points; at each decision node of a tree, randomly select m input features out of the total M (m ≈ sqrt(M)) and learn the split from those. Finally, each tree in the forest votes on the result.
   • Evaluation
     • Apply a loss function to the margins (penalize misclassification, reward positives)
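A sketch of the RandomForest demo in Python; max_features="sqrt" gives the m ≈ sqrt(M) feature sampling described above:

    from sklearn.datasets import load_wine
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_wine(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 100 trees, each grown on a bootstrap sample, each split chosen
    # from a random subset of ~sqrt(M) features; the trees then vote.
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0)
    forest.fit(X_train, y_train)
    print("test accuracy:", forest.score(X_test, y_test))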
17. Regression
   • Explains the relationship between variables (dependent vs. independent)
   • Simple linear: y = w0 + w1*x
   • Multiple linear: y = w0 + w1*x1 + w2*x2 + …
   • Estimate the weights to predict y
   • Multivariate variants exist as well
18. Demos
   • Excel
   • Simple linear – R
   • RandomForest – Wine
   • Evaluate by applying a loss function to the residuals
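A rough Python stand-in for the "Simple linear – R" demo, assuming synthetic data so the recovered weights can be checked against known ones:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=1.0, size=100)  # true w0=3, w1=2

    model = LinearRegression().fit(x, y)
    print("w0 =", model.intercept_, "w1 =", model.coef_[0])

    # Evaluate by applying a loss function to the residuals.
    residuals = y - model.predict(x)
    print("mean squared residual:", (residuals ** 2).mean())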
19. What to measure
   • Data
     • Cross-validation
       • n-fold cross-validation
       • Leave-one-out validation
       • Hold-out
     • At the end of the day: how much data is enough, and is there bias in the data (only certain kinds of labels)?
   • Model results
     • Contingency table (false negatives and false positives are bad)
     • ROC & AUC (coverage curve: true positives vs. false positives)
     • Precision/recall (from the search world)
     • F-measure
     • Lift (not interested in accuracy on the entire dataset; want it for the top 5–10% of the dataset)
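A minimal sketch of n-fold cross-validation (here n = 5), assuming scikit-learn:

    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)

    # Split the data into 5 folds; train on 4, validate on the 5th, rotate.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print("fold accuracies:", scores)
    print("mean:", scores.mean(), "std:", scores.std())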
20. Is the model working right?

                   Predicted +ve   Predicted -ve   Total
    Actual +ve           40              15           55
    Actual -ve            5              40           45
    Total                45              55          100

   • Precision = 40/45
   • Recall = 40/55
   • F-measure (harmonic mean) = 2 / ((1/precision) + (1/recall))
   • Accuracy = (TP + TN) / total = (40 + 40) / 100 = 0.80
   • How much accuracy is enough?
   • Lift: how much better than random guessing
   • Lift and accuracy are not correlated
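The slide's arithmetic can be checked with scikit-learn's metrics on a hypothetical label vector that reproduces the table above:

    import numpy as np
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 f1_score, precision_score, recall_score)

    # 40 TP, 15 FN, 5 FP, 40 TN, matching the table.
    y_true = np.array([1] * 55 + [0] * 45)
    y_pred = np.array([1] * 40 + [0] * 15 + [1] * 5 + [0] * 40)

    print(confusion_matrix(y_true, y_pred))                # [[40  5] [15 40]]
    print("precision:", precision_score(y_true, y_pred))   # 40/45 ~ 0.889
    print("recall:   ", recall_score(y_true, y_pred))      # 40/55 ~ 0.727
    print("F-measure:", f1_score(y_true, y_pred))
    print("accuracy: ", accuracy_score(y_true, y_pred))    # 0.80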
21. Challenge with Model
   • Overfitting
   • Avoid bias and keep variance low
   • Use regularization
     • L1 (Lasso)
     • L2 (Ridge)
     • If time permits, show the effect of alpha (sketch below)
   • Look up "overfitting model" and "bias and variance"
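A minimal sketch of "the alpha effect", assuming scikit-learn's Ridge and Lasso on the Wine data: larger alpha shrinks the L2 weights and zeroes out more of the L1 weights.

    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.linear_model import Lasso, Ridge
    from sklearn.preprocessing import StandardScaler

    data = load_wine()
    X = StandardScaler().fit_transform(data.data)
    y = data.target  # treating the class label as numeric, purely to show shrinkage

    for alpha in (0.01, 1.0, 100.0):
        ridge = Ridge(alpha=alpha).fit(X, y)
        lasso = Lasso(alpha=alpha).fit(X, y)
        print(f"alpha={alpha}: ridge weight norm={np.abs(ridge.coef_).sum():.2f},"
              f" lasso zeroed weights={(lasso.coef_ == 0).sum()}")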
22. Challenge with Data
   • Feature types: categorical, ordinal, quantitative
   • Measures: mean, median, variance, standard deviation, range, shape (skewness)
   • Always observe the data to get a "feel"/smell for it
   • Discretize/threshold (convert a quantitative feature)
   • Missing feature(s)
     • What do you do? Fill with the median or the average
   • Data encoding
     • Create new features from existing ones vs. encode them in a different way
   • A sketch of two of these fixes follows below
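A minimal sketch of two of these fixes, assuming scikit-learn: fill missing values with the median, then threshold the quantitative column into ordinal bins (the toy column is made up):

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import KBinsDiscretizer

    X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])  # toy column

    # Missing feature: replace NaN with the column median.
    X_filled = SimpleImputer(strategy="median").fit_transform(X)
    print(X_filled.ravel())

    # Discretize/threshold: quantitative feature -> 3 ordinal bins.
    bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
    print(bins.fit_transform(X_filled).ravel())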
23. Feature engineering
   • Feature selection
     • Intuition, testing correlation
     • Subset selection (start small and grow) based on some error function
   • Feature extraction
     • New k dimensions as combinations of the older d dimensions
     • Linear
       • PCA (project to maximize variance – also exposes the impact of outliers; sketch below)
       • LDA (supervised method for dimensionality reduction for classification)
       • FA (factor analysis), multidimensional scaling (distances between points)
     • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
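A minimal PCA sketch, assuming scikit-learn: project the 13 Wine features onto k = 2 new dimensions and check how much variance those directions explain:

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = StandardScaler().fit_transform(load_wine().data)

    # New k=2 dimensions as linear combinations of the old d=13.
    pca = PCA(n_components=2)
    X2 = pca.fit_transform(X)
    print("shape:", X2.shape)
    print("explained variance ratio:", pca.explained_variance_ratio_)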
24. What we could not cover
   • Mechanisms
     • Reinforcement learning (punishments/rewards to learn better)
   • Algorithm types
     • Perceptron (backpropagation, SOM, …)
     • SVM
     • LDA and friends for the unstructured world
     • Regression (OLS, logistic, stepwise, MARS)
     • Regularization (ridge/lasso)
     • Trees (GBM, C4.5, ID3, …)
     • Bayesian
     • Kernel (radial)
     • Deep learning (DBN, Boltzmann, …)
     • Clustering (expectation maximization)
     • Recommendation
   • Probability (distributions) & linear algebra
   • Constraint solving and optimization (Solver, OpenSolver, …)
25. Tools
   • R
   • scikit-learn
   • Theano
   • Weka
   • KNIME
   • Recommender (.NET, …)
   • DataTau
   • BigML
   • WiseIO
   • Skytree
   • SAS/SPSS
   • Yhatr
26. Books
   • Bishop (Pattern Recognition and Machine Learning)
   • Alpaydin (Introduction to Machine Learning)
   • John Foreman (Data Smart)
   • PyMC – search for "Bayesian-Methods-for-Hackers"
   • scikit-learn tutorials
     • jakevdp – search "scikit jake 2014 tutorial"
     • Olivier Grisel – search "scikit olivier grisel tutorial"
   • Recommender (http://mymedialite.net/) – Zeno Gantner
27. What you will be doing
   • Data
     • Touch/feel it (visualize), breathe it in
     • Cleaning, scaling/normalization
     • Selecting
   • Algorithm (choose by task)
     • Classification
     • Regression
     • Ranking (recommendation, search results)
   • Evaluate the candidate algorithms against each other & refine/calibrate
     • AUC, ROC, RMSE, etc. (sketch below)
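A minimal sketch of that last step, assuming scikit-learn: pit two candidate algorithms against each other on a shared metric (ROC AUC on a binary task here):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Same folds, same metric, two candidates: pick the better, then calibrate.
    for name, model in [("tree", DecisionTreeClassifier(random_state=0)),
                        ("forest", RandomForestClassifier(random_state=0))]:
        auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
        print(name, "mean ROC AUC:", round(auc, 3))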
28. If time & the network permit: Yhatr demo
   • Because you need to deploy, test & use the model
   • Yhatr provides good hosting (their cloud, or host your own)
29. Thanks for your time
   • Please fill in the evaluation form
   • See you next time
30. Reference
