
- 1. Learning from Data Busy Professional’s guide to machine learning @govindk http://govindkanshi.wordpress.com
- 2. Agenda • What we know • What we do not know • Process • What to measure • Challenge with Model • Challenge with Data • Resources • Software • Books
- 3. What we know • Reports made from data • KPIs made of data • Dashboards made of data • They all measure known metrics, questions
- 4. What we do not know • Will this person turn delinquent in x years based on their profile (age/income/background…) • Which kind of process or machine will fail • Which people/things are similar to each other – find me a pattern • Prevent people from readmission into hospital • Why – because we do not know the question, and databases/applications do not have out-of-the-box (oob) functionality for it.
- 5. We are already using applied ML results • Mail gets despammed • Kinect recognizes our gestures • Facebook recognizes our photos • Siri/Cortana recognize our voice commands • Watson used some • Search uses many • Recommendations are right in our face
- 6. So then • Learn from data • How • Create a model of the data • Test the model for error and use it
- 7. Unsupervised • Clustering • Customer segmentation • Topic identification • Number of algorithms • Hierarchical (distance as measure – generally Euclidean) • Agglomerative (start with n groups and keep merging them, e.g. single link, 2 at a time) vs divisive (start with a single group – break it down)
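The agglomerative, single-link idea from slide 7 can be sketched in plain Python. This is a minimal illustration, not the deck's demo code: the function name `single_link_clusters`, the toy points, and the choice to stop at `k` clusters are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_link_clusters(points, k):
    """Merge the two closest clusters until only k remain (single linkage)."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        # Single link: the distance between two clusters is the distance
        # between their closest pair of members.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        clusters[i].extend(clusters[j])  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight pairs and one outlier.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 0.5)]
clusters = single_link_clusters(points, 3)
```

Choosing `k` up front sidesteps the "how many clusters" question that slide 10 raises; a real analysis would inspect the merge distances instead.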
- 8. Simple way • Group folks on • Height • What you eat • Where you are from (state) • Next time a new person comes in – let us predict
- 9. Demos • USArrests Data • Wine Data
- 10. Challenges and next steps • How many groups/clusters • How many mis-groupings (Evaluation) • Associate topics & after clustering, what next • Once clusters are formed – someone can name them • Now run supervised methods on data to learn more
- 11. Supervised learning • Given a label L for a set of attributes (a1, a2, a3, …) • Learn the model which can predict the label based on the attributes
- 12. Simple way to understand Classification • Let us say we are labelled North Indian, South Indian • How • Attributes (language, food, movie language, music …) • Basically learning the link between • An observed data X and • A variable y usually called target or label.
- 13. Supervised • Data • One dataset for training which has labels • One dataset for testing • Example • Classification (spam, order data, disease data, Kinect gesture) • binary vs. multiclass • Regression (sales) • Ranking • Search • Predictive maintenance • Recommendation • Netflix competition = SVD
- 14. Demos • Trees • DecisionTree – Python (show train and test, validation) • Decision tree – R • BigML (network dependent) • Challenge – • one input every time
- 15. Few more terms to overcome data issues • Bagging (used with tree models) (variance reduction) • Train an ensemble of models from bootstrap samples • Get a vote amongst models • Class predicted by the majority of the models wins • Get an average if outputs are scores or probabilities • * Bootstrap – denotes a different random sample of the dataset, drawn with replacement • Boosting (bias reduction) • Like bagging but penalizes & learns from misclassification • Challenge of assigning “weights” to misclassified instances to penalize them • Start with higher weight say 1 and keep reducing till error comes down
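The bagging steps on slide 15 (bootstrap samples, one model per sample, majority vote) can be sketched as follows. The base learner here is a one-feature threshold "stump" chosen only to keep the example short; the deck uses tree models. The names `train_stump` and `bagged_predict`, the toy data, and the ensemble size of 25 are assumptions for illustration.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-feature threshold classifier: predict 1 when x >= threshold."""
    best = None
    for threshold, _ in sample:  # try every observed x as a threshold
        errors = sum((x >= threshold) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]


def bagged_predict(thresholds, x):
    """Majority vote across the ensemble of stumps."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [(x, int(x >= 5)) for x in range(10)]  # toy data: label is 1 when x >= 5

thresholds = []
for _ in range(25):
    bootstrap = [rng.choice(data) for _ in data]  # sample with replacement
    thresholds.append(train_stump(bootstrap))
```

For scores or probabilities, the vote in `bagged_predict` would be replaced by an average, as the slide notes.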
- 16. Demo • RandomForest • Each tree is trained on n training samples drawn out of N; at each decision node of the tree, it randomly selects m input features from the total M input features (m ~ M^0.5) and picks the split from those. Finally, each tree in the forest votes for the result. • Evaluation • Apply loss function to margins (penalize misclassification, reward +ve)
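The two sources of randomness that slide 16 describes, a random draw of training rows per tree and a random subset of about sqrt(M) features per decision node, can be sketched like this. The helper names `bootstrap_rows` and `candidate_features` are hypothetical, and drawing the rows with replacement is an assumption; the slide only says "n training data out of N".

```python
import math
import random


def bootstrap_rows(total_rows, n, rng):
    """Pick n row indices out of total_rows for one tree (with replacement)."""
    return [rng.randrange(total_rows) for _ in range(n)]


def candidate_features(total_features, rng):
    """Pick m ~ sqrt(M) distinct feature indices to consider at one node."""
    m = max(1, round(math.sqrt(total_features)))
    return sorted(rng.sample(range(total_features), m))


rng = random.Random(42)
rows = bootstrap_rows(total_rows=100, n=100, rng=rng)
features = candidate_features(total_features=16, rng=rng)  # m = 4 of 16
```

A full forest would call `candidate_features` again at every node of every tree and then combine the trees by voting.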
- 17. Regression • Explain relationship between two variables (dependent vs independent) • Simple linear - y = W0 + W1x1 + W2x2 + … • Estimate the weights to predict y • Multivariate
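For the one-feature case of slide 17, y = W0 + W1x1, the weights can be estimated with the ordinary least squares closed form. This is a minimal sketch; the function name `fit_simple_linear` and the toy data are assumptions for illustration.

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimate of (w0, w1) for y = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Toy data generated from y = 1 + 2x, so the fit should recover those weights.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_simple_linear(xs, ys)

# Residuals are what a loss function such as squared error is applied to.
residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
```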
- 18. Demos • Excel • SimpleLinear -R • RandomForest – Wine • Evaluate by applying loss function to residuals
- 19. What to measure • Data • Cross Validation • n-fold cross-validation • Leave-one-out validation • Hold out • Eod – how much data is enough, is there bias in data (only certain kind of labels) • Model Results • Contingency table (false negatives & false positives are bad) • ROC & AUC (coverage curve) (true positives vs false positives) • Precision/Recall (from search world) • F-measure • Lift (not interested in accuracy on entire dataset, want for 5%, 10% of dataset)
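The n-fold scheme listed on slide 19 amounts to splitting the sample indices into folds and testing on each fold in turn; leave-one-out is the special case where the number of folds equals the number of samples. A minimal sketch, with the hypothetical helper name `k_fold_indices` and no shuffling (which real data usually needs first):

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(10, 3))          # 3-fold split of 10 samples
leave_one_out = list(k_fold_indices(5, 5))   # one sample per test fold
```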
- 20. Is Model working right

  |            | Predicted +ve | Predicted -ve | Total |
  |------------|---------------|---------------|-------|
  | Actual +ve | 40            | 15            | 55    |
  | Actual -ve | 5             | 40            | 45    |
  | Total      | 45            | 55            | 100   |

  • Precision = 40/45 • Recall = 40/55 • F-measure (harmonic mean) = 2/((1/prec) + (1/rec)) • Accuracy = (TP (40) + TN (40)) / (40+15+5+40) = 80/100 • How much accuracy is enough • Lift – how much better than random guessing • Lift and accuracy are not necessarily correlated
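The measures on slide 20 can be checked directly from the four cells of the contingency table. The variable names below are assumptions; the counts are the ones on the slide (40 true positives, 15 false negatives, 5 false positives, 40 true negatives).

```python
# Cell counts taken from the slide's contingency table.
tp, fn = 40, 15   # actual +ve row: predicted +ve, predicted -ve
fp, tn = 5, 40    # actual -ve row: predicted +ve, predicted -ve

precision = tp / (tp + fp)                  # 40 / 45
recall = tp / (tp + fn)                     # 40 / 55
f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)  # correct predictions / all cases
```

Here both the F-measure and the accuracy work out to 0.8, although in general the two can differ widely, especially on imbalanced data.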
- 21. Challenge with Model • Overfitting • Avoid bias and have less variance • Use Regularization • L1 (Lasso) • L2 (Ridge) • If time permits show the alpha effect • Look for “overfitting model”, “bias and variance”
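The "alpha effect" mentioned on slide 21 can be seen in the simplest possible setting: one feature, no intercept, and a ridge (squared-weight) penalty. Minimizing sum((y - w*x)^2) + alpha * w^2 gives w = sum(x*y) / (sum(x^2) + alpha), so a larger alpha shrinks the weight toward zero. The function name `ridge_weight` and the toy data are assumptions for illustration.

```python
def ridge_weight(xs, ys, alpha):
    """Closed-form ridge solution for y = w * x (single feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data from y = 2x; alpha = 0 is plain least squares.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
weights = [ridge_weight(xs, ys, alpha) for alpha in (0.0, 1.0, 10.0)]
```

The weight starts at 2.0 with no penalty and drops as alpha grows, trading a little bias for lower variance.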
- 22. Challenge with Data • Categorical, ordinal, quantitative • Measures – mean, median, variance, std deviation, range, shape (skewness) • Always observe to get “feel”/smell of data • Discretize/Thresholding (convert quantitative feature) • Missing feature(s) – • What do you do – median, avg • Data encoding • Create new from existing vs encode in different way
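Two of the data fixes on slide 22, filling missing values with the median and discretizing a quantitative feature with thresholds, can be sketched with the standard library. The helper names, the use of `None` to mark a missing value, and the example ages and cut points are assumptions for illustration.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def discretize(values, thresholds):
    """Map each value to a bin index, given ascending thresholds."""
    return [sum(v >= t for t in thresholds) for v in values]


ages = [25, None, 40, 61, None, 33]   # hypothetical feature with gaps
filled = impute_median(ages)          # gaps become the median, 36.5
bins = discretize(filled, [30, 50])   # 0: under 30, 1: 30 to 49, 2: 50 and up
```

The median is less sensitive to skew and outliers than the average, which is one reason to look at the shape of the data before choosing between them.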
- 23. Feature engineering • Feature selection • Intuition, testing correlation • Subset (start small and increase) based on some error function • Feature extraction • New k dimensions – as combination of older d dimensions • Linear • PCA (find the variance by projecting – explains impact of outliers) • LDA (supervised method for dimension reduction for classification) • FA (Factor Analysis), Multidimensional Scaling (distance between points) • Nonlinear • IsoMap (geodesic distance) and Locally Linear Embedding (LLE)
- 24. What we could not cover • Mechanisms • Reinforcement Learning (punishment/rewards to learn better) • Algorithm types • Perceptron (backpropagation, SOM, ..) • SVM • LDA and friends for unstructured world • Regression (OLS, logistic, stepwise, MARS) • Regularization (ridge/lasso) • Trees (GBM, C4.5, ID3…) • Bayesian • Kernel (radial) • Deep learning (DBN, Boltzmann..) • Clustering (Expectation Max) • Recommendation • Probability (distributions) & Linear Algebra • Constraint Solving and Optimization (Solver, OpenSolver..)
- 25. Tools • R • Scikit • Theano • Weka • KNIME • Recommender (.net….) • DataTau • BigML • WiseIO • Skytree • SAS/SPSS • Yhatr
- 26. Books • Bishop • Alpaydin • John Foreman • PyMC – Search query (Bayesian-Methods-for-Hackers) • Scikit – • jakevdp – “scikit jake 2014 tutorial” • Olivier – “scikit olivier grisel tutorial” • Recommender (http://mymedialite.net/) – Zeno Gantner
- 27. What you will be doing • Data • Touch/feel (visualize), breathe it in • Cleaning, scaling/normalization • Selecting • Algorithm (choose the task) • Classification • Regression • Ranking (recommendation, search results) • Amongst • Evaluate algorithms against each other & refine/calibrate • AUC, ROC, RMSE etc…
- 28. If time & net permits Yhatr demo • Because you need to deploy, test & use the model • Yhatr provides good hosting (theirs or host your own)
- 29. Thanks for your time • Please fill the evaluation form • See you next time
- 30. Reference
