
- 1. Learning from Data Busy Professional’s guide to machine learning @govindk http://govindkanshi.wordpress.com
- 2. Agenda • What we know • What we do not know • Process • What to measure • Challenge with Model • Challenge with Data • Resources • Software • Books
- 3. What we know • Reports made from data • KPIs made of data • Dashboards made of data • They all measure known metrics, questions
- 4. What we do not know • Will this person turn delinquent in x years, based on their profile (age/income/background…)? • Which kind of process or machine will fail? • Which people/things are similar to each other – find me a pattern • How to prevent people from being readmitted to hospital • Why – because we do not know the question, and databases/applications do not have out-of-the-box functionality.
- 5. We are already using applied ML results • Mail gets filtered for spam • Kinect recognizes our gestures • Facebook recognizes our photos • Siri/Cortana recognize our voice commands • Watson used some • Search uses many • Recommendations are right in front of us
- 6. So then • Learn from data • How • Create a model of the data • Test the model for error and use it
- 7. Unsupervised • Clustering • Customer segmentation • Topic identification • A number of algorithms • Hierarchical (distance as the measure – generally Euclidean) • Agglomerative (start with n single-item groups and keep merging them; single link merges the 2 closest at a time) vs divisive (start with a single group and break it down)
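
The agglomerative idea on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the demos: the function name `single_link_clusters` and the toy `people` data (height, weight) are assumptions made up for the example, and the brute-force search is only suitable for tiny datasets.

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_link_clusters(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between the two nearest members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

# Hypothetical toy data: (height in cm, weight in kg)
people = [(150, 50), (152, 53), (180, 80), (183, 85), (165, 65)]
groups = single_link_clusters(people, 2)
print(groups)
```

Choosing `k` up front is itself a judgment call, which is the "how many groups" challenge raised a few slides later.
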
- 8. Simple way • Group folks on • Height • What you eat • Where you are from (state) • Next time a new person comes in – let us predict
- 9. Demos • USArrests Data • Wine Data
- 10. Challenges and next steps • How many groups/clusters • How many mis-groupings (evaluation) • Associate topics & what to do after clustering • Once clusters are formed, someone can name them • Now run supervised methods on the data to learn more
- 11. Supervised learning • Given a label L for a set of attributes (a1, a2, a3, …) • Learn the model which can predict the label based on the attributes
- 12. Simple way to understand Classification • Let us say we are labelled north indian, south indian • How • Attributes (language, food, movie language, music …) • Basically learning the link between • An observed data X and • A variable y usually called target or labels.
- 13. Supervised • Data • One dataset for training which has label • One dataset for testing • Example • Classification (spam, order data, disease data, Kinect gesture) • Classification • binary vs. multiclass • Regression (sales) • Ranking • Search • Predictive maintenance • Recommendation • Netflix - Netflix competition = SVD
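
Splitting labelled data into a training set and a testing set might look like the sketch below. The helper name `train_test_split`, the 25% test fraction and the toy spam/ham rows are assumptions chosen for illustration only.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = rows[:]                 # do not modify the caller's list
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled rows: (attribute, label)
rows = [(i, "spam" if i % 2 else "ham") for i in range(20)]
train, test = train_test_split(rows)
print(len(train), len(test))
```

The model only ever sees `train`; `test` is kept aside to estimate how it behaves on unseen data.
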
- 14. Demos • Trees • DecisionTree – Python (show train and test, validation) • Decision tree – R • BigML (network dependent) • Challenge – • one input every time
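
As a rough stand-in for the tree demos, here is a one-level decision tree (a "decision stump") written from scratch. It is a simplified sketch, not the demo code: it assumes a single numeric feature, 0/1 labels, and that larger values mean class 1. The names `fit_stump` and `predict_stump` and the data are made up for the example.

```python
def fit_stump(xs, ys):
    """Find the threshold on one numeric feature that misclassifies the fewest rows.

    The stump predicts 1 when x > threshold, else 0. Labels must be 0 or 1.
    """
    best_threshold, best_errors = None, len(xs) + 1
    for t in sorted(set(xs)):  # try every observed value as a split point
        errors = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def predict_stump(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical train and test data
train_x, train_y = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0], [0, 0, 0, 1, 1, 1]
test_x, test_y = [2.5, 6.5], [0, 1]

threshold = fit_stump(train_x, train_y)
test_accuracy = sum(predict_stump(threshold, x) == y
                    for x, y in zip(test_x, test_y)) / len(test_x)
print(threshold, test_accuracy)
```

A full decision tree repeats this split search recursively on each side of the threshold and across all features.
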
- 15. Few more terms to overcome data issues • Bagging (used with tree models) (variance reduction) • Train an ensemble of models from bootstrap samples • Get a vote amongst the models • The class predicted by the majority of the models wins • Take an average if the outputs are scores or probabilities • * Bootstrap – denotes a different random sample of the dataset • Boosting (mainly bias reduction) • Like bagging, but penalizes & learns from misclassification • Challenge of assigning “weights” to misclassified instances to penalize them • Start with a higher weight, say 1, and keep reducing it until the error comes down
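
The bagging recipe (bootstrap samples, one model per sample, majority vote) can be sketched as follows. The base model here is a deliberately crude assumption, a threshold halfway between the two class means, chosen only to keep the example short; `fit_midpoint`, `bagged_models` and `predict_vote` are hypothetical names.

```python
import random
from collections import Counter
from statistics import mean

def fit_midpoint(sample):
    """Toy base model: a threshold halfway between the two class means."""
    lo = mean(x for x, y in sample if y == 0)
    hi = mean(x for x, y in sample if y == 1)
    return (lo + hi) / 2

def bootstrap(data, rng):
    """Draw len(data) rows with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_models(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        sample = bootstrap(data, rng)
        if len({y for _, y in sample}) == 2:  # skip samples missing a class
            models.append(fit_midpoint(sample))
    return models

def predict_vote(models, x):
    """Each model votes 0 or 1; the majority class wins."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) rows
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
models = bagged_models(data)
print(predict_vote(models, 1.5), predict_vote(models, 8.5))
```

For models that output scores or probabilities, the vote would be replaced by an average of the outputs.
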
- 16. Demo • RandomForest • Each tree is trained on n training examples sampled out of N; at each decision node of the tree, it randomly selects m input features from the total M input features (m ~ M^0.5) and learns the split from them. Finally, each tree in the forest votes for the result. • Evaluation • Apply a loss function to margins (penalize mis-classification, reward +ve)
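
The two sources of randomness in a random forest (sampled rows per tree, sampled features per node) can be shown in isolation. This sketch only draws the indices and does not grow a tree; the function name `draw_tree_inputs` and the sizes used are assumptions for illustration.

```python
import math
import random

def draw_tree_inputs(n_total, n_sample, n_features, rng):
    """Pick the training rows for one tree and the candidate features for one node."""
    # n rows out of N, drawn with replacement (a bootstrap-style sample)
    rows = [rng.randrange(n_total) for _ in range(n_sample)]
    # m ~ sqrt(M) distinct features considered at a single decision node
    m = max(1, round(math.sqrt(n_features)))
    features = rng.sample(range(n_features), m)
    return rows, features

rng = random.Random(0)
rows, features = draw_tree_inputs(n_total=100, n_sample=60, n_features=16, rng=rng)
print(len(rows), features)
```

In a real forest the feature draw is repeated at every node of every tree, and the finished trees vote on the result.
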
- 17. Regression • Explain the relationship between two variables (dependent vs independent) • Simple linear: y = W0 + W1x • Estimate the weights to predict y • Multivariate: y = W0 + W1x1 + W2x2 + …
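
For the single-variable case, the weights can be estimated with the closed-form least-squares formulas. A minimal sketch, with `fit_simple_linear` and the toy data as assumptions made up for the example:

```python
from statistics import mean

def fit_simple_linear(xs, ys):
    """Least-squares estimates of w0 (intercept) and w1 (slope) for y = w0 + w1*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Hypothetical data generated from y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w0, w1 = fit_simple_linear(xs, ys)
print(w0, w1)
```

With more than one input variable the same idea applies, but the weights are usually solved for with linear algebra rather than by hand.
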
- 18. Demos • Excel • SimpleLinear -R • RandomForest – Wine • Evaluate by applying loss function to residuals
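
One common way to apply a loss function to residuals is the root mean squared error (RMSE). The sketch below assumes RMSE as the loss; the slide does not say which loss the demos used, and the numbers are made up.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error of the residuals (actual - predicted)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))

# Hypothetical actual values and model predictions
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
error = rmse(actual, predicted)
print(error)
```

Squaring the residuals penalizes large misses more than small ones; an absolute-error loss would treat them proportionally.
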
- 19. What to measure • Data • Cross validation • n-fold cross-validation • Leave-one-out validation • Hold out • Eod – how much data is enough, is there bias in the data (only certain kinds of labels) • Model results • Contingency table (false negatives & false positives are bad) • ROC & AUC (coverage curve) (true positive rate vs false positive rate) • Precision/recall (from the search world) • F-measure • Lift (not interested in accuracy on the entire dataset; want it for 5%, 10% of the dataset)
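
The n-fold cross-validation split can be sketched as an index generator. The name `k_fold_indices` is hypothetical, and the sketch assumes the rows have already been shuffled, since it assigns every k-th row to the same fold.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    for fold in range(k):
        test = indices[fold::k]          # every k-th row, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)
```

Setting `k` equal to the number of rows gives leave-one-out validation, and using a single fold as the test set is the hold-out approach.
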
- 20. Is Model working right

  | | Predicted +ve | Predicted -ve | Total |
  |---|---|---|---|
  | Actual +ve | 40 | 15 | 55 |
  | Actual -ve | 5 | 40 | 45 |
  | Total | 45 | 55 | 100 |

  • Precision = 40/45 • Recall = 40/55 • F measure (harmonic mean) = 2/((1/prec) + (1/rec)) • Accuracy = (TP + TN)/total = (40 + 40)/(40 + 15 + 5 + 40) • How much accuracy is enough? • Lift – how much better than random guessing • Lift and accuracy do not have correlation
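
The measures on this slide follow directly from the four cells of the table. A short sketch using the slide's own counts, with accuracy computed as (true positives + true negatives) / total:

```python
# Counts from the slide: rows are actual classes, columns are predictions
tp, fn = 40, 15   # actual +ve: predicted +ve, predicted -ve
fp, tn = 5, 40    # actual -ve: predicted +ve, predicted -ve

precision = tp / (tp + fp)                      # 40 / 45
recall = tp / (tp + fn)                         # 40 / 55
f_measure = 2 / (1 / precision + 1 / recall)    # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)      # 80 / 100

print(round(precision, 3), round(recall, 3), round(f_measure, 3), accuracy)
```
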
- 21. Challenge with Model • Overfitting • Keep bias low and variance low • Use regularization • L1 (Lasso) • L2 (Ridge) • If time permits show the alpha effect • Look for “overfitting model”, “bias and variance”
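
The "alpha effect" can be seen in the simplest possible setting: one feature, no intercept, and a squared penalty on the weight (ridge). Under those assumptions the penalized least-squares slope has the closed form w = Σxy / (Σx² + alpha). The function name `ridge_slope` and the data are made up for this sketch.

```python
def ridge_slope(xs, ys, alpha):
    """Slope of y = w*x (no intercept) minimizing sum((y - w*x)^2) + alpha * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

# Hypothetical data generated from y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

alphas = (0.0, 1.0, 10.0, 100.0)
slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in alphas}
for alpha, w in slopes.items():
    print(alpha, round(w, 4))
```

With alpha at 0 the fit is ordinary least squares; as alpha grows the weight is pulled toward zero, trading a little bias for less variance.
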
- 22. Challenge with Data • Categorical, ordinal, quantitative • Measures – mean, median, variance, std deviation, range, shape (skewness) • Always observe to get “feel”/smell of data • Discretize/Thresholding (convert quantitative feature) • Missing feature(s) – • What do you do – median, avg • Data encoding • Create new from existing vs encode in different way
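
Two of the data fixes listed here, filling missing values with the median and thresholding a quantitative feature, can be sketched with the standard library. The helper names, the `incomes` values and the threshold of 45 are assumptions for illustration.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def discretize(values, threshold):
    """Turn a quantitative feature into a 0/1 feature."""
    return [int(v > threshold) for v in values]

# Hypothetical feature with two missing entries
incomes = [30, None, 50, 90, None, 40]
filled = impute_median(incomes)
high_income = discretize(filled, 45)
print(filled, high_income)
```

The median is less sensitive to outliers than the average, which matters for skewed features such as income.
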
- 23. Feature engineering • Feature selection • Intuition, testing correlation • Subset (start small and increase) based on some error function • Feature extraction • New k dimensions – as a combination of the older d dimensions • Linear • PCA (find the variance by projecting – explains impact of outliers) • LDA (supervised method for dimension reduction for classification) • FA (Factor Analysis), Multidimensional Scaling (distance between points) • Nonlinear • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
- 24. What we could not cover • Mechanisms • Reinforcement learning (punishment/rewards to learn better) • Algorithm types • Perceptron (backpropagation, SOM, ..) • SVM • LDA and friends for the unstructured world • Regression (OLS, logistic, stepwise, MARS) • Regularization (ridge/lasso) • Trees (GBM, C4.5, ID3…) • Bayesian • Kernel (radial) • Deep learning (DBN, Boltzmann..) • Clustering (Expectation Max) • Recommendation • Probability (distributions) & Linear Algebra • Constraint Solving and Optimization (Solver, OpenSolver..)
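
Feature selection by "testing correlation" can be sketched by ranking features on the absolute Pearson correlation with the target. The `pearson` helper is written out by hand to avoid depending on a particular Python version; the feature names and values are invented for the example.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical features and target
features = {
    "size": [1, 2, 3, 4, 5],
    "noise": [2, 1, 2, 1, 2],
}
target = [10, 20, 30, 40, 50]

# Strongest absolute correlation with the target first
ranked = sorted(features,
                key=lambda name: abs(pearson(features[name], target)),
                reverse=True)
print(ranked)
```

Correlation only captures linear relationships, so a low score is a hint to look closer rather than proof that a feature is useless.
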
- 25. Tools • R • scikit-learn • Theano • Weka • KNIME • Recommender (.net….) • DataTau • BigML • WiseIO • Skytree • SAS/SPSS • YHatr
- 26. Books • Bishop • Alpaydin • John Foreman • PyMC – search query (Bayesian-Methods-for-Hackers) • Scikit – • jakevdp – “scikit jake 2014 tutorial” • Olivier – “scikit olivier grisel tutorial” • Recommender (http://mymedialite.net/) – Zeno Gantner
- 27. What you will be doing • Data • Touch/feel (visualize), breathe it in • Cleaning, scaling/normalization • Selecting • Algorithm (choose the task) • Classification • Regression • Ranking (recommendation, search results) • Amongst algorithms • Evaluate algorithms against each other & refine/calibrate • AUC, ROC, RMSE etc…
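
The scaling/normalization step usually means one of two transforms: min-max scaling to the range 0 to 1, or standardization to zero mean and unit standard deviation. A small sketch of both, with hypothetical helper names and data:

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical feature: heights in cm
heights = [150, 160, 170, 180, 190]
scaled = min_max_scale(heights)
z_scores = standardize(heights)
print(scaled)
print([round(z, 3) for z in z_scores])
```

Both helpers assume the feature is not constant; a feature with a single repeated value would cause a division by zero and carries no information anyway.
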
- 28. If time & net permits Yhatr demo • Because you need to deploy,test & use the model • Yhatr provides good host (theirs and host your own)
- 29. Thanks for your time • Please fill the evaluation form • See you next time
- 30. Reference
