Transcript

  • 1. Learning from Data
    • Busy Professional's guide to machine learning
    • @govindk, http://govindkanshi.wordpress.com
  • 2. Agenda
    • What we know
    • What we do not know
    • Process
    • What to measure
    • Challenge with Model
    • Challenge with Data
    • Resources
    • Software
    • Books
  • 3. What we know
    • Reports made from data
    • KPIs made of data
    • Dashboards made of data
    • They all measure known metrics and answer known questions
  • 4. What we do not know
    • Will this person turn delinquent in x years, based on their profile (age/income/background …)?
    • Which kind of process or machine will fail?
    • Which people/things are similar to each other – find me a pattern
    • How do we prevent people from being readmitted to hospital?
    • Why don't we know? Because we do not know the question, and databases/applications do not offer this functionality out of the box.
  • 5. We are already using applied ML results
    • Mail gets de-spammed
    • Kinect recognizes our gestures
    • Facebook recognizes our photos
    • Siri/Cortana recognize our voice commands
    • Watson used some of these techniques
    • Search uses many of them
    • Recommendations are everywhere, right in your face
  • 6. So then
    • Learn from data
    • How?
      • Create a model of the data
      • Test the model for error, and use it
  • 7. Unsupervised
    • Clustering
      • Customer segmentation
      • Topic identification
    • A number of algorithms
      • Hierarchical (distance as the measure, generally Euclidean); a minimal sketch follows below
        • Agglomerative (start with n groups and keep merging them)
        • Single link (merge the two closest at a time) vs. divisive (start with a single cluster and split it)
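The agglomerative idea above fits in a few lines. This is a minimal sketch, assuming SciPy is available and using a handful of made-up 2-D points: `linkage` builds the merge tree with Euclidean distance and single-link merging, and `fcluster` cuts it into flat clusters.

```python
# Minimal hierarchical (agglomerative) clustering sketch; the data is made up.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Single link: repeatedly merge the two closest clusters (Euclidean distance).
Z = linkage(X, method="single", metric="euclidean")

# Cut the merge tree into 2 flat clusters and print the cluster labels.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```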
  • 8. Simple way
    • Group folks on
      • Height
      • What you eat
      • Where you are from (state)
    • Next time a new person comes in, let us predict
  • 9. Demos
    • USArrests data
    • Wine data
  • 10. Challenges and next steps
    • How many groups/clusters? (one way to choose is sketched below)
    • How many mis-groupings? (evaluation)
    • Associating topics, and what to do after clustering
    • Once clusters are formed, someone can name them
    • Now run supervised methods on the data to learn more
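One common answer to "how many clusters?" is to try several values and compare an internal score such as the silhouette. A small sketch, assuming scikit-learn and synthetic data (two well-separated blobs), so the numbers printed are illustrative only:

```python
# Try several k and compare silhouette scores (higher is better).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # 2 blobs

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```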
  • 11. Supervised learning
    • Given a label L for attributes (a1, a2, a3, …)
    • Learn a model that can predict the label from the attributes
  • 12. Simple way to understand classification
    • Let us say we are labelled North Indian or South Indian
    • How?
      • Attributes (language, food, movie language, music …)
    • Basically, learning the link between
      • observed data X and
      • a variable y, usually called the target or label
  • 13. Supervised
    • Data
      • One dataset for training, which has labels
      • One dataset for testing
    • Examples
      • Classification (spam, order data, disease data, Kinect gestures)
        • Binary vs. multiclass
      • Regression (sales)
      • Ranking
        • Search
      • Predictive maintenance
      • Recommendation
        • Netflix: the Netflix competition popularized SVD-based approaches
  • 14. Demos
    • Trees
      • Decision tree – Python (show train and test, validation); a minimal sketch follows below
      • Decision tree – R
      • BigML (network dependent)
    • Challenge
      • One input at a time
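The Python decision-tree demo itself is not included in the transcript; a minimal sketch of that kind of demo, assuming scikit-learn and using the built-in Iris data as a stand-in, would be:

```python
# Train a decision tree on a train split and evaluate on a held-out test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```

Comparing the train and test scores is also the quickest way to spot the overfitting discussed on slide 21.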
  • 15. A few more terms to overcome data issues
    • Bagging (used with tree models; reduces variance)
      • Train an ensemble of models from bootstrap* samples
      • Take a vote amongst the models
        • The class predicted by the majority of the models wins
        • Take an average if the outputs are scores or probabilities
      • * Bootstrap denotes a different random sample of the dataset
    • Boosting (reduces bias)
      • Like bagging, but penalizes and learns from misclassifications
      • The challenge is assigning weights to misclassified instances so later models focus on them; the weights are adjusted each round until the error comes down
    • (Both ideas are sketched below)
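A small sketch of the two ensemble ideas, assuming scikit-learn; the dataset (the built-in breast cancer data) and the estimator counts are arbitrary choices for illustration:

```python
# Bagging: many trees on bootstrap samples, majority vote (variance reduction).
# Boosting: trees trained in sequence; misclassified points get more weight.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```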
  • 16. Demo
    • RandomForest
      • For each tree, draw n training examples out of the N available (a bootstrap sample); at each decision node, randomly select m of the M input features (m ≈ √M) and learn the split from those. Finally, each tree in the forest votes for the result. (A Python sketch follows below.)
    • Evaluation
      • Apply a loss function to the margins (penalize misclassification, reward positive margins)
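A rough scikit-learn version of the RandomForest demo (the deck's own demo runs on the Wine data in R, per slide 18); `max_features="sqrt"` gives the m ≈ √M feature subsampling described above:

```python
# Random forest: bootstrap samples plus random feature subsets at each split.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```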
  • 17. Regression
    • Explains the relationship between two (or more) variables: dependent vs. independent
    • Linear: y = w0 + w1x1 + w2x2 + …
    • Estimate the weights to predict y
    • Simple (one predictor) vs. multivariate (several predictors)
  • 18. Demos
    • Excel
    • Simple linear regression – R
    • RandomForest – Wine
    • Evaluate by applying a loss function to the residuals (a small sketch follows below)
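A minimal sketch of the simple linear regression demo and the residual-based evaluation, assuming scikit-learn and synthetic data generated from y = 3 + 2x plus noise:

```python
# Fit y = w0 + w1*x, then evaluate via the residuals (here, mean squared error).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
residuals = y - pred                      # what the loss function is applied to
print("intercept:", model.intercept_, "slope:", model.coef_[0])
print("MSE:", mean_squared_error(y, pred))
```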
  • 19. What to measure
    • Data
      • Cross-validation
        • n-fold cross-validation
        • Leave-one-out validation
        • Hold-out
      • At the end of the day: how much data is enough, and is there bias in the data (only certain kinds of labels)?
      • (The three validation schemes are sketched below.)
    • Model results
      • Contingency table (false negatives and false positives are the bad cells)
      • ROC & AUC (coverage curve): true positive rate vs. false positive rate
      • Precision/recall (from the search world)
      • F-measure
      • Lift (not interested in accuracy on the entire dataset; want it for the top 5–10% of the dataset)
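The three validation schemes named above, sketched with scikit-learn; the classifier and dataset are arbitrary stand-ins:

```python
# Hold-out, n-fold cross-validation, and leave-one-out on the same model.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)

# Hold-out: a single train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
print("hold-out:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# 10-fold cross-validation.
print("10-fold CV:", cross_val_score(clf, X, y, cv=10).mean())

# Leave-one-out: one fold per example (expensive on large data).
print("leave-one-out:", cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
```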
  • 20. Is the model working right?

                   Predicted +ve   Predicted -ve   Total
      Actual +ve        40              15           55
      Actual -ve         5              40           45
      Total             45              55          100

    • Precision = 40/45
    • Recall = 40/55
    • F-measure (harmonic mean) = 2 / ((1/precision) + (1/recall))
    • Accuracy = (TP + TN) / total = (40 + 40) / 100
    • How much accuracy is enough?
    • Lift: how much better than random guessing; lift and accuracy are not correlated
    • (The arithmetic is worked through below.)
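Working the numbers from the table above, with plain arithmetic only:

```python
# Confusion-matrix arithmetic for the 100-example table above.
tp, fn = 40, 15   # actual +ve: predicted +ve / predicted -ve
fp, tn = 5, 40    # actual -ve: predicted +ve / predicted -ve

precision = tp / (tp + fp)                    # 40/45 ≈ 0.889
recall = tp / (tp + fn)                       # 40/55 ≈ 0.727
f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean ≈ 0.800
accuracy = (tp + tn) / (tp + fn + fp + tn)    # 80/100 = 0.80
print(precision, recall, f_measure, accuracy)
```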
  • 21. Challenge with Model
    • Overfitting
      • Avoid bias and keep variance low
      • Use regularization
        • L2 (Ridge)
        • L1 (Lasso)
      • If time permits, show the effect of alpha (sketched below)
    • Look up "overfitting model" and "bias and variance"
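The "alpha effect" can be shown quickly: as the regularization strength alpha grows, ridge shrinks all weights while lasso drives some of them exactly to zero. A sketch assuming scikit-learn and its built-in diabetes data:

```python
# Ridge (L2) vs. Lasso (L1) as the regularization strength alpha increases.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

for alpha in (0.01, 1.0, 100.0):
    ridge = Ridge(alpha=alpha).fit(X, y)   # shrinks weights toward zero
    lasso = Lasso(alpha=alpha).fit(X, y)   # sets some weights exactly to zero
    print(alpha,
          "ridge max |w|:", round(abs(ridge.coef_).max(), 2),
          "lasso zero weights:", int((lasso.coef_ == 0).sum()))
```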
  • 22. Challenge with Data
    • Categorical, ordinal, quantitative features
    • Measures: mean, median, variance, standard deviation, range, shape (skewness)
    • Always observe the data to get a "feel"/smell of it
    • Discretize/threshold (convert a quantitative feature into categories)
    • Missing feature(s)
      • What do you do? Impute with the median or the average
    • Data encoding
      • Create new features from existing ones vs. encode them in a different way
    • (Imputation and discretization are sketched below.)
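A small sketch of two of the fixes above, assuming pandas and a made-up two-column table: impute missing values with the median, and discretize a quantitative column into bands (the band edges here are arbitrary):

```python
# Median imputation for missing values, then discretization of a numeric column.
import pandas as pd

df = pd.DataFrame({"age": [23, 45, None, 31, 52],
                   "income": [40, 85, 60, None, 120]})

df["age"] = df["age"].fillna(df["age"].median())           # median imputation
df["income"] = df["income"].fillna(df["income"].median())  # (mean also common)

# Thresholding: turn the quantitative feature into low/mid/high categories.
df["income_band"] = pd.cut(df["income"], bins=[0, 50, 100, 1000],
                           labels=["low", "mid", "high"])
print(df)
```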
  • 23. Feature engineering
    • Feature selection
      • Intuition, testing correlation
      • Subset selection (start small and increase) based on some error function
    • Feature extraction
      • New k dimensions as combinations of the older d dimensions
      • Linear
        • PCA (find the variance by projecting; the projections also show the impact of outliers)
        • LDA (supervised method for dimensionality reduction for classification)
        • FA (Factor Analysis), Multidimensional Scaling (preserves distances between points)
      • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
      • (A PCA sketch follows below.)
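A PCA sketch, assuming scikit-learn and its built-in Wine data: the 13 original features are projected onto 2 new dimensions that capture most of the variance (the data is scaled first, since PCA is scale-sensitive):

```python
# Feature extraction with PCA: d original features -> k new components.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)          # shape (n_samples, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)
```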
  • 24. What we could not cover
    • Mechanisms
      • Reinforcement learning (punishment/rewards to learn better)
    • Algorithm types
      • Perceptron (back propagation, SOM, …)
      • SVM
      • LDA and friends for the unstructured (text) world
      • Regression (OLS, logistic, stepwise, MARS)
      • Regularization (ridge/lasso)
      • Trees (GBM, C4.5, ID3 …)
      • Bayesian
      • Kernel (radial)
      • Deep learning (DBN, Boltzmann machines …)
      • Clustering (Expectation Maximization)
      • Recommendation
    • Probability (distributions) & linear algebra
    • Constraint solving and optimization (Solver, OpenSolver …)
  • 25. Tools
    • R
    • scikit-learn
    • Theano
    • Weka
    • KNIME
    • Recommender (.NET …)
    • DataTau
    • BigML
    • WiseIO
    • Skytree
    • SAS/SPSS
    • Yhatr
  • 26. Books
    • Bishop
    • Alpaydin
    • John Foreman
    • PyMC – search query: "Bayesian-Methods-for-Hackers"
    • scikit-learn
      • jakevdp – search query: "scikit jake 2014 tutorial"
      • Olivier Grisel – search query: "scikit olivier grisel tutorial"
    • Recommender (http://mymedialite.net/) – Zeno Gantner
  • 27. What you will be doing
    • Data
      • Touch/feel it (visualize), breathe it in
      • Cleaning, scaling/normalization
      • Selecting
    • Algorithm (choose for the task)
      • Classification
      • Regression
      • Ranking (recommendation, search results)
    • Amongst these
      • Evaluate algorithms against each other & refine/calibrate
      • AUC, ROC, RMSE etc.
  • 28. If time & the network permit: Yhatr demo
    • Because you need to deploy, test & use the model
    • Yhatr provides good hosting (use theirs, or host your own)
  • 29. Thanks for your time
    • Please fill the evaluation form
    • See you next time
  • 30. Reference
