

At Zipfian Academy, we're working to make it easier to learn these skills. In this 1-day bootcamp, run in partnership with DataWeek, we'll teach the fundamentals of data analysis and machine learning. You'll work hands-on with instructors, using Python in a small-group environment.

The class runs from 10:00am to 4:30pm, with a break from 1:00pm to 1:30pm.

In this 1-Day Bootcamp, you will:

* Gain an understanding of basic principles of data mining and analysis

* Get hands-on experience using machine learning tools in Python

* Work alongside instructors and other students with scikit-learn on classification, regression, and clustering of data

* Learn the next steps for developing your skills as a data scientist, plus resources to help you along the way

* Connect with other aspiring data scientists and start participating in the growing data science community



- 1. Zipfian Academy Introduction to Machine Learning Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Ryan Orban Co-Founder, Zipfian Academy ryan@zipfianacademy.com @ryanorban @ZipfianAcademy
- 2. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A Zipfian Academy -- June 11th, 2013
- 3. What? "Field of study that gives computers the ability to learn without being explicitly programmed." - Arthur Samuel, circa 1959
- 4. What? "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom M. Mitchell
- 5. What? Learning ≠ Thinking
- 6. What? Machine learning is NOT: • Hard-coded logic by the programmer: ifs and elses... • Predefined results: completely deterministic • Burden placed on the programmer at design time • Must anticipate all inputs to the program, and react
- 7. What? Machine learning is: • Automated knowledge acquisition through input • Iterative improvement as more data is seen • Adaptive Algorithms
- 8. [Diagram: a taxonomy of machine learning algorithms. Labeled data → supervised learning: discrete outputs → classification (kNN, SVM, logistic regression, trees); continuous outputs → regression (GLM, linear regression, regression trees). Unlabeled data → unsupervised learning: clustering / dimensionality reduction (K-means, PCA, SVD); association analysis (Apriori, FP-Growth).]
- 9. [Same taxonomy diagram, highlighting today's topics.]
- 10. Applications Regression: • Stock Market Analysis: trend analysis • Utilities: smart grid load forecasting • Web: page traffic prediction • Bioinformatics: protein binding site prediction
- 11. Applications Linear Regression
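The slide shows linear regression only as a plot; a minimal scikit-learn sketch of the idea, with toy data invented for illustration (points on the line y = 2x + 1, so the fit recovers it exactly):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data on the line y = 2x + 1 (no noise), so the fit is exact
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

print(model.coef_[0], model.intercept_)  # ~2.0, ~1.0
print(model.predict([[4.0]])[0])         # ~9.0
```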
- 12. Applications Classification: • Spam Filtering and document classification • Finance: Fraud detection and loan default prediction • Sentiment Analysis: People like to do this with Tweets • National Security: ??? PRISM!
- 13. Applications K-nearest neighbors Source: http://scipy-lectures.github.io/advanced/scikit-learn/
- 14. Applications Clustering: • Product Marketing: Cohort Analysis • Oncology: Malignant cell identification • Computer Vision: entity recognition • Census: demographics analysis
- 15. Applications K-means clustering Source: http://www.flickr.com/photos/kylemcdonald/3866231864/
- 16. Applications [Table contrasting the two fields. Data Mining: discovery of unknown properties, unsupervised, e.g. K-means clustering. Machine Learning: prediction from known properties, supervised, e.g. Naïve Bayes classification.]
- 17. Pitfalls Considerations: performance, number of features, train vs. predict cost, online vs. batch learning, multinomial outputs
- 18. In Practice Source: http://recsys.acm.org/more-data-or-better-models/
- 19. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
- 20. Supervised Inferring a function from labeled training data, e.g. learning from experience
- 21. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
- 22. Classification Prior => Prediction [Table: regression maps continuous, discrete, or structured inputs to a continuous output; classification maps continuous, discrete, or structured inputs to a discrete output.]
- 23. Classification Logistic Regression (contrary to its name... actually used to classify) threshold = 0.5 Source: http://cvxopt.org/examples/book/logreg.html
- 24. Classification Source: http://www.stepbystep.com/difference-between-linear-and-logistic-regression-103607/ Logistic Regression (contrary to its name... actually used to classify)
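The 0.5 threshold on these slides can be sketched in a few lines of scikit-learn (toy 1-D data invented here): `predict_proba` returns P(class = 1), and `predict` applies the threshold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D data: class 0 below x = 5, class 1 above
X = np.array([[1.0], [2.0], [3.0], [4.0], [6.0], [7.0], [8.0], [9.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns P(class = 1); predict applies the 0.5 threshold
p = clf.predict_proba([[2.0]])[0, 1]
print(p)                        # well below 0.5
print(clf.predict([[2.0]])[0])  # 0
print(clf.predict([[8.0]])[0])  # 1
```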
- 25. Process 1. Obtain -- convert to numeric 2. Train -- pre-labeled data 3. Test -- cross validation 4. Use -- unknown label data
- 26. Process Input: historical labeled data + non-parameterized function (linear, logistic, etc.) = Output: parameterized function
- 27. Products A model is just a function
- 28. Products Inputs...
- 29. Products Outputs...
- 30. Others Today: Logistic Regression
- 31. Logistic Regression A = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg
- 32. Logistic Regression A = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg ML Gold
- 33. Multi-dimensional Logistic Regression A = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg Mine it!
- 34. In Practice [Pipeline diagram: Raw Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate (Cross Validation); New Data → Scrubbing → Vectorization → Predict Labels/Classes.]
- 35. In Practice [Same pipeline diagram.] Now:
- 36. Others Work Time!
- 37. Others Additional Methods
- 38. Classification Nearest Neighbors Source: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
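The deck shows kNN only as a figure; a minimal scikit-learn sketch (toy 2-D points invented here) of majority-vote classification among the k nearest neighbors:

```python
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated clusters; a query point gets the majority label
# of its k nearest training points
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```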
- 39. Classification Support Vector Machine (SVM) non-linear input space Kernel transform Linear classifier Source: http://digitheadslabnotebook.blogspot.com/2011/11/support-vector-machines.html
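A short sketch of the kernel-trick point on this slide: XOR-style toy data (invented here) is not linearly separable, but an RBF kernel implicitly maps it into a space where a linear separator exists.

```python
from sklearn.svm import SVC

# XOR-style toy data: no straight line separates the two classes
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 0, 1, 1]

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

# A linear boundary can classify at most 3 of the 4 XOR points;
# the kernelized SVM fits all 4
print(linear.score(X, y))
print(rbf.score(X, y))  # 1.0
```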
- 40. Pitfalls Overcoming the labeling problem Unsupervised/Semi-supervised Learning Label your data by hand Have someone else label your data by hand (Mechanical Turk) Online/Active learning
- 41. Logistic Regression Gotchas 1. Normalize Values 2. Outliers
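The "normalize values" gotcha in a sketch (toy feature matrix invented here): features on wildly different scales let the large-scale one dominate, and StandardScaler rescales each to zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (think age vs. income)
X = np.array([[25.0, 50000.0],
              [40.0, 90000.0],
              [31.0, 62000.0]])

X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```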
- 42. Others (Cross) Validation
- 43. In Practice [Pipeline diagram: Raw Data → Scrubbing → Cleaned Data → Vectorization → Prepared Data → Sampling → Training Set / Test Set → Train Model → Evaluate (Cross Validation); New Data → Scrubbing → Vectorization → Predict Labels/Classes.] Now:
- 44. Evaluate: Split labeled articles into a training set and a test set Naïve Bayes Whatever you do, DO NOT cross the streams
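The split this slide warns about ("do not cross the streams") can be sketched with scikit-learn's built-in iris dataset: the model is trained on one subset and scored only on rows it never saw.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Hold out 30% for testing; these rows never touch training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GaussianNB().fit(X_train, y_train)

# Scoring on held-out data estimates real-world accuracy; scoring on
# the training set would be crossing the streams
test_acc = clf.score(X_test, y_test)
print(test_acc)
```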
- 45. Focus [Two columns.] Do: evaluate the efficacy of your algorithm. Don't: blindly trust theory or intuition.
- 46. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
- 47. Unsupervised Finding hidden structure in unlabeled data
- 48. Unsupervised How is this different? • Unlabeled input data • No error for evaluation • Exploratory • Past rather than Future
- 49. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • Q&A
- 50. Clustering K-means Source: http://shabal.in/visuals.html
- 51. K-means Assumptions/Weaknesses 1. Needs numeric features 2. Greedy/converge on local minima 3. Iterative -- slow on large data sets
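A minimal K-means sketch in scikit-learn (two invented blobs), illustrating the first assumption above: the features must be numeric, and k must be chosen up front.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs; K-means needs numeric features and a choice of k
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Points in the same blob share a cluster id (the ids themselves are arbitrary)
print(km.labels_)
```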
- 52. Dimensionality Reduction Source: http://www.nlpca.org/pca_principal_component_analysis.html
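The PCA figure the slide points to can be sketched numerically: 3-D points that actually lie near a 1-D line (invented here) lose almost no variance when projected onto a single principal component.

```python
import numpy as np
from sklearn.decomposition import PCA

# 3-D toy points near the line (t, 2t, 3t), plus tiny noise
rng = np.random.RandomState(0)
t = rng.rand(100)
X = np.column_stack([t, 2 * t, 3 * t]) + rng.normal(0, 0.01, (100, 3))

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 1)
print(pca.explained_variance_ratio_[0])  # close to 1.0
```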
- 53. Practice Data Science in Practice
- 54. Counting Data Science....
- 55. Counting It’s just like...
- 56. Counting Counting!
- 57. Source: http://www.troll.me/images/x-all-the-things/count-all-the-things.jpg Counting
- 58. Source: https://data.sfgov.org/analytics Counting Counts Data Science
- 59. Source: https://data.sfgov.org/analytics Counting +σ μ 3 Day Moving Average
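The μ, σ, and 3-day moving average on this slide amount to a few lines of NumPy (daily counts invented here), including the kind of spike flagged on the next slides:

```python
import numpy as np

# Invented daily counts with one obvious spike
counts = np.array([10.0, 12, 11, 13, 12, 40, 12, 11])

# 3-day moving average smooths day-to-day noise
window = 3
moving_avg = np.convolve(counts, np.ones(window) / window, mode="valid")

# Days beyond mean + 2*sigma are flagged as anomalies
mu, sigma = counts.mean(), counts.std()
anomalies = np.where(counts > mu + 2 * sigma)[0]
print(moving_avg)
print(anomalies)  # [5], the spike
```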
- 60. Counting Source: https://data.sfgov.org/analytics
- 61. Counting What Happened on Aug. 5th?
- 62. Counting Both Passive and Active measurement
- 63. Counting Passively measure time user spends on page (or slide)
- 64. Source: http://tctechcrunch2011.files.wordpress.com/2010/10/awesome.jpg Counting Actively solicit users feedback e.g.
- 65. Counting But it doesn’t stop at simple Statistics
- 66. Approach Naïve Bayes Source: http://www.emeraldinsight.com/journals.htm?articleid=1550453&show=html
- 67. A Short Digression: How do we turn an article full of words into something an algorithm can understand? Naïve Bayes
- 68. Source: http://users.livejournal.com/_winnie/361188.html Vectorization
- 69. A Short Digression: Put it in a Bag! Vectorization
- 70. A Short Digression: Bag of Words Counts! Tokenization then Vectorization: original document → dictionary of word counts → feature vector (width = size of the English vocabulary). "The brown fox" → { "the": 1, "brown": 1, "fox": 1 } → [0,0,1,0,1,0,...]
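The tokenize-then-vectorize step above is what scikit-learn's CountVectorizer does (tiny two-document corpus invented here):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The brown fox", "The fox jumps"]

# Tokenize, build a sorted vocabulary, count occurrences per document
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))  # ['brown', 'fox', 'jumps', 'the']
print(X.toarray())              # rows are documents, columns are words
```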
- 71. Naïve Bayes Train: P(Y | F1, …, Fn) ∝ P(Y) Πi P(Fi | Y) [0,0,1,0,1,0,...] brown fox Newspaper Section (what we want to predict)
- 72. Naïve Bayes Train: P(Fi | Y) Conditional Probability: count the occurrence of each word in the training data, P(Fi | Y) = count(Fi) / (total # words with label Y)
- 73. Naïve Bayes Train: P(Y): Sports: 0.32, Social Media: 0.12, Nature: 0.41, News: 0.10, ... P(Fi | Sports): the: 0.156, to: 0.153, ball: 0.0045, bat: 0.0033, she: 0.0083, Boston: 0.0002, ... P(Fi | Nature): the: 0.189, us: 0.0167, trees: 0.0945, bat: 0.0068, she: 0.017, Brazil: 0.0042, ...
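The Sports/Nature tables above come from counting word frequencies per class; a sketch with scikit-learn's MultinomialNB on a tiny invented corpus (nothing like a real training set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny labeled corpus, invented for illustration
docs = ["the ball hit the bat", "she threw the ball",
        "trees in the forest", "a bat lives in the trees"]
labels = ["sports", "sports", "nature", "nature"]

# Vectorize (bag of words) then fit Naive Bayes in one pipeline
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)

print(clf.predict(["she hit the ball"])[0])   # sports
print(clf.predict(["the forest trees"])[0])   # nature
```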
- 74. Pitfalls Overcoming the labeling problem Unsupervised/Semi-supervised Learning Label your data by hand Have someone else label your data by hand (Mechanical Turk) Online/Active learning
- 75. Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/ g-6mevh13be74mgnc9i8qifa0 Persistence
- 76. Persistence SerDes • Disk • Database • Memory
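A sketch of the SerDes idea: Python's pickle round-trips a trained model to bytes that can go to disk, a database, or an in-memory cache (toy model invented here; joblib is a common alternative for large numpy-heavy models).

```python
import pickle
from sklearn.linear_model import LogisticRegression

# Train a toy model once...
X = [[0.0], [1.0], [10.0], [11.0]]
y = [0, 0, 1, 1]
model = LogisticRegression().fit(X, y)

# ...serialize to bytes, then deserialize and predict later
blob = pickle.dumps(model)
restored = pickle.loads(blob)

print(restored.predict([[0.5]])[0], restored.predict([[10.5]])[0])
```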
- 77. Scale Issues?
- 78. Scale Issues? • Batch Oriented Algorithms • Data Preparation • Output?
- 79. Scale
- 80. Scale Train vs. Predict
- 81. Source: http://clincancerres.aacrjournals.org/content/5/11/3403/F4.expansion Showdown kNN vs. Trees
- 82. Exposé
- 83. Exposé APIs and Interfaces • Internal • ReSTful • Public • (Web) Application
- 84. Get Involved • Present a guest lecture or share a data story • Donate datasets and propose projects • Sponsor a scholarship • Donate cluster time or resources • Attend our Hiring Day (Nov. 20th)
- 85. We’re Hiring! • Full Time Instructors • Part Time Instructors • Curriculum Developers • TAs
- 86. DataWeek • Monday: Machine Learning Workshop • Tuesday: Hiring Mixer • Wednesday: Why I Teach (Data Science) • Thursday: Why Knowing Your Data is Invaluable to Startups Panel • Nighttime: DATAVIZ ART + TECH
- 87. Outline Q&A
- 88. Thank You! Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Ryan Orban Co-Founder, Zipfian Academy ryan@zipfianacademy.com @ryanorban
