
DataWeek: Intro to Data Science and Machine Learning

Demand for data scientists is growing quickly, but the path to a data science career can be long and winding. Today's great data scientists are part miner and part analyst, using computing power to generate predictions at a scale no person could manage by hand.

At Zipfian Academy, we're working to make it easier to learn these skills. In this 1-day Bootcamp, held in partnership with DataWeek, we'll teach the fundamentals of analysis and machine learning. You'll work hands-on with instructors, using Python in a small-group environment.

The class runs from 10:00am - 4:30pm, with a break from 1:00 - 1:30pm.

In this 1-Day Bootcamp, you will:

* Gain an understanding of basic principles of data mining and analysis

* Get hands-on experience using machine learning tools in Python

* Work alongside instructors and other students with scikit-learn on classification, regression, and clustering of data

* Learn the next steps for developing your skills as a data scientist, plus resources to help you along the way

* Connect with other aspiring data scientists and start participating in the growing data science community


DataWeek: Intro to Data Science and Machine Learning

  1. Zipfian Academy Introduction to Machine Learning Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Ryan Orban Co-Founder, Zipfian Academy ryan@zipfianacademy.com @ryanorban @ZipfianAcademy
  2. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A Zipfian Academy -- June 11th, 2013
  3. What? "Field of study that gives computers the ability to learn without being explicitly programmed." -- Arthur Samuel, circa 1959
  4. What? "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." -- Tom M. Mitchell
  5. What? Learning ≠ Thinking
  6. What? Machine learning is NOT: • Hard-coded logic written by the programmer: ifs and elses... • Predefined results: completely deterministic • Burden placed on the programmer at design time • Must anticipate all inputs to the program, and react
  7. What? Machine learning is: • Automated knowledge acquisition through input • Iterative improvement as more data is seen • Adaptive Algorithms
  8. [Algorithm taxonomy diagram] Labeled data with a discrete target -> Classification (kNN, SVM (Support Vector Machines), logistic regression, trees); labeled data with a continuous target -> Regression (GLM, linear, trees); unlabeled data -> Clustering / Dimensionality Reduction (K-means, PCA, SVD) and Association Analysis (Apriori, FP-Growth)
  9. Today! [The same taxonomy diagram, highlighting the branches covered in this workshop]
  10. Applications Regression: • Stock Market Analysis: trend analysis • Utilities: smart grid load forecasting • Web: page traffic prediction • Bioinformatics: protein binding site prediction
  11. Applications Linear Regression
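
A minimal sketch of what linear regression looks like in scikit-learn, the library used later in the deck; the data below is synthetic and purely illustrative:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Synthetic example: daily page traffic as a noisy linear function of the day index
    rng = np.random.RandomState(0)
    X = np.arange(100).reshape(-1, 1)                        # feature: day number
    y = 3.5 * X.ravel() + 20 + rng.normal(0, 10, size=100)   # target: daily visits

    model = LinearRegression()
    model.fit(X, y)                        # learn slope and intercept from the data
    print(model.coef_, model.intercept_)   # the fitted line's parameters
    print(model.predict([[100]]))          # forecast an unseen day
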
  12. Applications Classification: • Spam Filtering and document classification • Finance: Fraud detection and loan default prediction • Sentiment Analysis: People like to do this with Tweets • National Security: ??? PRISM!
  13. Applications K-nearest neighbors Source: http://scipy-lectures.github.io/advanced/scikit-learn/
  14. Applications Clustering: • Product Marketing: Cohort Analysis • Oncology: Malignant cell identification • Computer Vision: entity recognition • Census: demographics analysis
  15. Applications K-means clustering Source: http://www.flickr.com/photos/kylemcdonald/3866231864/
  16. Applications Data Mining vs. Machine Learning: Data Mining -- discovery of unknown properties, unsupervised (ex: K-means clustering); Machine Learning -- prediction from known properties, supervised (ex: Naïve Bayes classification)
  17. Pitfalls Considerations -- performance, number of features, train vs. predict, online vs. batch, multinomial
  18. In Practice Source: http://recsys.acm.org/more-data-or-better-models/
  19. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
  20. Supervised Inferring a function from labeled training data, e.g. learning from experience
  21. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
  22. Classification Prior => Prediction: Regression maps continuous, discrete, or structured inputs to a continuous prediction; Classification maps continuous, discrete, or structured inputs to a discrete prediction
  23. Classification Logistic Regression (contrary to its name... actually used to classify) threshold = 0.5 Source: http://cvxopt.org/examples/book/logreg.html
  24. Classification Logistic Regression (contrary to its name... actually used to classify) Source: http://www.stepbystep.com/difference-between-linear-and-logistic-regression-103607/
  25. Process 1. Obtain -- convert to numeric 2. Train -- pre-labeled data 3. Test -- cross validation 4. Use -- data with unknown labels
  26. Process Input: historical labeled data + non-parameterized function (linear, logistic, etc.) = Output: parameterized function
  27. Products A model is just a function
  28. Products Inputs...
  29. Products Outputs...
  30. Others Today: Logistic Regression
  31. Logistic Regression logit(p) = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg
  32. Logistic Regression logit(p) = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 ML Gold Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg
  33. Multi-dimensional Logistic Regression logit(p) = β0 + β1x1 + β2x2 + ... + βnxn, threshold = 0.5 Mine it! Source: http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult1.jpg
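
A hedged scikit-learn sketch of the formula above; the two-class toy data is made up for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Two toy classes in a two-feature space (illustrative only)
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    clf = LogisticRegression()
    clf.fit(X, y)                       # fits the beta coefficients
    probs = clf.predict_proba(X)[:, 1]  # p = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2)))
    preds = (probs >= 0.5).astype(int)  # threshold = 0.5, equivalent to clf.predict(X)
    print(preds[:10])
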
  34. In Practice [pipeline diagram] Raw Data -> Scrubbing -> Cleaned Data -> Vectorization -> Prepared Data -> Sampling -> Training Set / Test Set; Training Set -> Train Model -> Cross Validation -> Evaluate; New Data -> Scrubbing -> Cleaned Data -> Vectorization -> Prepared Data -> Predict Labels/Classes
  35. In Practice Now: [the same pipeline diagram, highlighting the current step]
  36. Others Work Time!
  37. Others Additional Methods
  38. Classification Nearest Neighbors Source: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm
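
A quick scikit-learn sketch of k-nearest neighbors, using the bundled iris dataset as a stand-in for real data:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    knn = KNeighborsClassifier(n_neighbors=5)    # vote among the 5 closest training points
    knn.fit(iris.data, iris.target)
    print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))   # classify a new flower by its neighbors
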
  39. Classification Support Vector Machine (SVM): non-linear input space -> Kernel transform -> Linear classifier Source: http://digitheadslabnotebook.blogspot.com/2011/11/support-vector-machines.html
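
A small sketch of the kernel idea with scikit-learn's SVC; the circles dataset is just a convenient non-linear example, not something from the slides:

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Concentric circles are not linearly separable in the original input space
    X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

    # The RBF kernel implicitly transforms the points into a space where a
    # linear separator exists -- the "kernel transform" step above
    clf = SVC(kernel="rbf", gamma=2.0)
    clf.fit(X, y)
    print(clf.score(X, y))   # accuracy on the training data
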
  40. Pitfalls Overcoming the labeling problem: Unsupervised/Semi-supervised Learning; Label your data by hand; Have someone else label your data by hand (Mechanical Turk); Online/Active learning
  41. Logistic Regression Gotchas 1. Normalize Values 2. Outliers
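
One common way to handle the normalization gotcha is to scale features before fitting; a sketch, where X_train and y_train are placeholders for your own labeled data:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    # StandardScaler puts every feature on zero mean / unit variance, so no single
    # feature (or mild outlier) dominates the fitted coefficients
    model = make_pipeline(StandardScaler(), LogisticRegression())
    # model.fit(X_train, y_train)   # X_train, y_train: placeholders for your own data
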
  42. Others (Cross) Validation
  43. In Practice Now: [the pipeline diagram again, highlighting cross validation and evaluation]
  44. Naïve Bayes Evaluate: Split labeled articles into a training set and a test set. Whatever you do, DO NOT cross the streams
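
Not crossing the streams means the test set is touched exactly once, at the very end; a sketch with a tiny invented corpus standing in for real labeled articles:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.naive_bayes import MultinomialNB

    # Tiny illustrative corpus; in practice these would be real labeled articles
    docs = ["the ball flew past the bat", "she hit a home run",
            "the trees in Brazil", "bats roost in old trees"] * 10
    labels = ["sports", "sports", "nature", "nature"] * 10

    X = CountVectorizer().fit_transform(docs)    # bag-of-words features
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.25, random_state=0)

    clf = MultinomialNB()
    print(cross_val_score(clf, X_train, y_train, cv=5))  # validate on training data only
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))                     # the test set is used once, here
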
  45. Focus -- Do: Evaluate the efficacy of your algorithm. Don't: Blindly trust theory or intuition
  46. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • In Practice • Q&A
  47. Unsupervised Finding hidden structure in unlabeled data
  48. Unsupervised How is this different? • Unlabeled input data • No error for evaluation • Exploratory • Past rather than Future
  49. Outline • What is it and why do I care? • Supervised Learning • Regression and Classification • Unsupervised Learning • Clustering and Dimensionality Reduction • Q&A
  50. Clustering K-means Source: http://shabal.in/visuals.html
  51. K-means Assumptions/Weaknesses 1. Needs numeric features 2. Greedy/converge on local minima 3. Iterative -- slow on large data sets
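
A short K-means sketch in scikit-learn; the blobs are synthetic, and the random restarts (n_init) are one practical answer to the local-minima weakness above:

    import numpy as np
    from sklearn.cluster import KMeans

    # Three synthetic blobs of numeric points (K-means needs numeric features)
    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(center, 0.5, (50, 2)) for center in (0, 5, 10)])

    km = KMeans(n_clusters=3, n_init=10, random_state=0)  # 10 random restarts
    labels = km.fit_predict(X)      # cluster assignment for every point
    print(km.cluster_centers_)      # the three learned centroids
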
  52. Dimensionality Reduction Source: http://www.nlpca.org/pca_principal_component_analysis.html
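
A minimal PCA sketch with scikit-learn, again using iris as a convenient stand-in dataset:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data                   # 4 numeric features per flower
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)            # project onto the top 2 principal components
    print(pca.explained_variance_ratio_)   # fraction of variance each component retains
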
  53. Practice Data Science in Practice
  54. Counting Data Science....
  55. Counting It’s just like...
  56. Counting Counting!
  57. Source: http://www.troll.me/images/x-all-the-things/count-all-the-things.jpg Counting
  58. Source: https://data.sfgov.org/analytics Counting Counts Data Science
  59. Counting [chart of daily counts showing the mean μ, a +σ band, and a 3 Day Moving Average] Source: https://data.sfgov.org/analytics
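
The counting itself is simple; a pandas sketch with hypothetical daily counts (the values and the spike date are invented to mirror the chart, not taken from data.sfgov.org):

    import pandas as pd

    # Hypothetical daily page-view counts standing in for the data.sfgov.org series
    counts = pd.Series([120, 131, 98, 110, 300, 115, 105],
                       index=pd.date_range("2013-08-01", periods=7))

    print(counts.rolling(window=3).mean())   # 3-day moving average
    mu, sigma = counts.mean(), counts.std()
    print(counts[counts > mu + sigma])       # days more than one sigma above the mean
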
  60. Counting Source: https://data.sfgov.org/analytics
  61. Counting What Happened on Aug. 5th?
  62. Counting Both Passive and Active measurement
  63. Counting Passively measure time user spends on page (or slide)
  64. Source: http://tctechcrunch2011.files.wordpress.com/2010/10/awesome.jpg Counting Actively solicit users' feedback e.g.
  65. Counting But it doesn’t stop at simple Statistics
  66. Approach Naïve Bayes Source: http://www.emeraldinsight.com/journals.htm?articleid=1550453&show=html
  67. A Short Digression: How do we turn an article full of words into something an algorithm can understand? Naïve Bayes
  68. Source: http://users.livejournal.com/_winnie/361188.html Vectorization
  69. A Short Digression: Put it in a Bag! Vectorization
  70. A Short Digression: Bag of Words counts! Original document “The brown fox” -> Tokenization -> dictionary of word counts { “the”: 1, “brown”: 1, “fox”: 1 } -> Vectorization -> feature vector [0,0,1,0,1,0,...] (width = the size of the “vocabulary” of the English language)
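
scikit-learn does the tokenization and vectorization steps in one object; a small sketch:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The brown fox", "The lazy dog"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)   # tokenize, build the vocabulary, and count
    print(vec.vocabulary_)        # word -> column index in the feature vector
    print(X.toarray())            # one word-count row per document
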
  71. Naïve Bayes Train: P(Y | F1, …, Fn) ∝ P(Y) Πi P(Fi | Y), where the features Fi are the word-count vector [0,0,1,0,1,0,...] (“brown fox”) and Y is the Newspaper Section (what we want to predict)
  72. Naïve Bayes Train: P(Fi | Y) is a conditional probability -- count the occurrence of each word in the training data: count(x) / total # words (with label Y)
  73. Naïve Bayes Train: P(Y): Sports 0.32, Social Media 0.12, Nature 0.41, News 0.10, ... P(Fi | Sports): the 0.156, to 0.153, ball 0.0045, bat 0.0033, she 0.0083, Boston 0.0002, ... P(Fi | Nature): the 0.189, us 0.0167, trees 0.0945, bat 0.0068, she 0.017, Brazil 0.0042, ...
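
A sketch of the same training step with scikit-learn's MultinomialNB; the tiny corpus is invented, but the learned attributes correspond to the P(Y) and P(Fi | Y) tables above:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny labeled corpus standing in for newspaper articles
    docs = ["the bat hit the ball", "she pitched in Boston",
            "bats live in the trees", "trees grow in Brazil"]
    sections = ["sports", "sports", "nature", "nature"]

    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    nb = MultinomialNB()
    nb.fit(X, sections)
    print(np.exp(nb.class_log_prior_))    # learned P(Y), the section priors
    print(np.exp(nb.feature_log_prob_))   # learned P(Fi | Y), word probabilities per section
    print(nb.predict(vec.transform(["the bat and the ball"])))
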
  74. Pitfalls Overcoming the labeling problem: Unsupervised/Semi-supervised Learning; Label your data by hand; Have someone else label your data by hand (Mechanical Turk); Online/Active learning
  75. Source: http://www.glogster.com/mrsallenballard/pickles-i-love-em-/g-6mevh13be74mgnc9i8qifa0 Persistence
  76. Persistence SerDes • Disk • Database • Memory
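
Serialization/deserialization (SerDes) of a trained model can be as simple as pickling it to disk; a sketch, with the training call left as a placeholder:

    import pickle
    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    # model.fit(X_train, y_train)   # train on your labeled data first

    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)       # serialize to disk

    with open("model.pkl", "rb") as f:
        restored = pickle.load(f)   # deserialize later, in another process or machine
    print(restored)
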
  77. Scale Issues?
  78. Scale Issues? • Batch Oriented Algorithms • Data Preparation • Output?
  79. Scale
  80. Scale Train vs. Predict
  81. Source: http://clincancerres.aacrjournals.org/content/5/11/3403/F4.expansion Showdown kNN vs. Trees
  82. Exposé
  83. Exposé APIs and Interfaces • Internal • RESTful • Public • (Web) Application
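
One way to expose a persisted model behind a small RESTful interface; Flask is an assumption here (the deck does not name a framework), and model.pkl is the file from the persistence step:

    import pickle
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)   # the trained model saved during the persistence step

    @app.route("/predict", methods=["POST"])
    def predict():
        # expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}
        features = request.get_json()["features"]
        return jsonify(labels=model.predict(features).tolist())

    if __name__ == "__main__":
        app.run()
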
  84. Get Involved • Present a guest lecture or share a data story • Donate datasets and propose projects • Sponsor a scholarship • Donate cluster time or resources • Attend our Hiring Day (Nov. 20th)
  85. We’re Hiring! • Full Time Instructors • Part Time Instructors • Curriculum Developers • TAs
  86. DataWeek • Monday: Machine Learning Workshop • Tuesday: Hiring Mixer • Wednesday: Why I Teach (Data Science) • Thursday: Why Knowing Your Data is Invaluable to Startups Panel • Nighttime: DATAVIZ ART + TECH
  87. Outline Q&A
  88. Thank You! Jonathan Dinu Co-Founder, Zipfian Academy jonathan@zipfianacademy.com @clearspandex Ryan Orban Co-Founder, Zipfian Academy ryan@zipfianacademy.com @ryanorban
