Data Science and Machine Learning Using Python and Scikit-learn


Workshop at DataEngConf 2016, on April 7-8 2016, at Galvanize, 44 Tehama Street, San Francisco, CA.
Demo and labs for workshop are at https://github.com/asimjalis/data-science-workshop

  1. 1. DATA SCIENCE AND MACHINE LEARNING USING PYTHON AND SCIKIT-LEARN ASIM JALIS GALVANIZE
  2. 2. INTRO
  3. 3. ASIM JALIS Galvanize/Zipfian, Data Engineering Cloudera, Microsoft, Salesforce MS in Computer Science from University of Virginia
  4. 4. GALVANIZE PROGRAMS Program Duration Data Science Immersive 12 weeks Data Engineering Immersive 12 weeks Full Stack Immersive 6 months Galvanize U 1 year
  5. 5. YOU GET TO . . . Immersive Group Learning Master High-Demand Skills and Technologies Intense Focus on Hiring and Outcomes Level UP your Career
  6. 6. WANT MORE INFO OR A TOUR? http://galvanize.com asim.jalis@galvanize.com
  7. 7. WORKSHOP OVERVIEW
  8. 8. WHAT IS THIS WORKSHOP ABOUT? Using Data Science and Machine Learning Building Classifiers Using Python and scikit-learn By the end of the workshop you will be able to build Machine Learning Classification Systems
  9. 9. HOW MANY PEOPLE HERE HAVE USED MACHINE LEARNING ALGORITHMS?
  10. 10. HOW MANY PEOPLE HERE HAVE USED PYTHON?
  11. 11. HOW MANY PEOPLE HERE HAVE USED IPYTHON?
  12. 12. HOW MANY PEOPLE HERE HAVE USED SCIKIT-LEARN?
  13. 13. OUTLINE What is Data Science and Machine Learning? What is scikit-learn? Why is it so helpful? What kinds of problems can we solve with this stuff?
  14. 14. DATA SCIENCE
  15. 15. WHY MACHINE LEARNING IS EXCITING Self-driving cars Voice recognition AlphaGo
  16. 16. DATA SCIENCE AND MACHINE LEARNING Data Science = Machine Learning + Statistics + Domain Expertise
  17. 17. STATISTICS AND MACHINE LEARNING Statistics asks whether milk causes heart disease Machine Learning predicts your death Focused on results and actionable predictions Used in production software systems
  18. 18. HISTORY OF MACHINE LEARNING Input Features Algorithm Output Machine Human Human Machine Machine Human Machine Machine Machine Machine Machine Machine
  19. 19. WHAT IS MACHINE LEARNING? Inputs: Vectors or points of high dimensions Outputs: Either binary vectors or continuous vectors Machine Learning finds the relationship between them Using statistical techniques
  20. 20. SUPERVISED VS UNSUPERVISED Learning Type Meaning Supervised Data needs to be labeled Unsupervised Data does not need to be labeled
  21. 21. TECHNIQUES Classification Regression Clustering Recommendations Anomaly detection
  22. 22. CLASSIFICATION EXAMPLE: EMAIL SPAM DETECTION
  23. 23. CLASSIFICATION EXAMPLE: EMAIL SPAM DETECTION Start with large collection of emails, labeled spam/not-spam Convert email text into vectors of 0s and 1s: 1 if a word occurs, 0 if it does not These are called inputs or features Split data set into training set (70%) and test set (30%) Use algorithm like Random Forests to build model Evaluate model by running it on test set and capturing success rate
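The steps above can be sketched in scikit-learn with a made-up mini-corpus (the emails and labels here are purely illustrative; `CountVectorizer(binary=True)` produces the 0/1 word-occurrence encoding described):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Toy labeled corpus: 1 = spam, 0 = not spam
emails = ["win cash now", "meeting at noon", "cheap pills win",
          "lunch tomorrow", "win win cash", "project update attached"]
labels = [1, 0, 1, 0, 1, 0]

# Convert email text into 0/1 word-occurrence vectors (the features)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(emails)

# Split into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0)

# Build a Random Forest model and evaluate on the held-out test set
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```

The same pipeline scales to a real corpus: only the `emails`/`labels` loading step changes.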
  24. 24. CLASSIFICATION ALGORITHMS Neural Networks Random Forests Support Vector Machines (SVM) Decision Trees Logistic Regression Naive Bayes
  25. 25. CHOOSING ALGORITHM Evaluate different models on data Look at the relative success rates Use rules of thumb: some algorithms work better on some kinds of data
  26. 26. CLASSIFICATION EXAMPLES Is this tumor benign or cancerous? Is this lead profitable or not? Who will win the presidential elections?
  27. 27. CLASSIFICATION: POP QUIZ Is classification supervised or unsupervised learning? Supervised because you have to label the data.
  28. 28. CLUSTERING EXAMPLE: LOCATE CELL PHONE TOWERS Start with GPS coordinates of all cell phone users Represent data as vectors Locate towers in biggest clusters
  29. 29. CLUSTERING EXAMPLE: T-SHIRTS What size should a t-shirt be? Everyone’s real t-shirt size is different Lay out all sizes and cluster Target large clusters with XS, S, M, L, XL
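The t-shirt idea can be sketched with scikit-learn's KMeans, using hypothetical (height, weight) measurements; each cluster center becomes a candidate size spec:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height_cm, weight_kg) measurements
people = np.array([[155, 50], [158, 52], [170, 68],
                   [172, 70], [185, 88], [188, 90]])

# Cluster into 3 groups, e.g. S, M, L
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)

# Each cluster center is a candidate t-shirt size target
print(kmeans.cluster_centers_)
print(kmeans.labels_)
```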
  30. 30. CLUSTERING: POP QUIZ Is clustering supervised or unsupervised? Unsupervised because no labeling is required
  31. 31. RECOMMENDATIONS EXAMPLE: AMAZON Model looks at user ratings of books Viewing a book triggers implicit rating Recommend user new books
  32. 32. RECOMMENDATION: POP QUIZ Are recommendation systems supervised or unsupervised? Unsupervised
  33. 33. REGRESSION Like classification Output is continuous instead of one from k choices
  34. 34. REGRESSION EXAMPLES How many units of product will sell next month What will student score on SAT What is the market price of this house How long before this engine needs repair
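A minimal regression sketch with made-up house data (square footage vs. price; LinearRegression stands in for any regressor, since the point is only that the output is continuous):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: square footage -> price in $1000s
sqft = np.array([[800], [1200], [1500], [2000]])
price = np.array([160, 240, 300, 400])

model = LinearRegression().fit(sqft, price)

# Output is continuous: predict a price for an unseen size
print(model.predict(np.array([[1000]])))  # close to 200
```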
  35. 35. REGRESSION EXAMPLE: AIRCRAFT PART FAILURE Cessna collects data from airplane sensors Predict when part needs to be replaced Ship part to customer’s service airport
  36. 36. REGRESSION: POP QUIZ Is regression supervised or unsupervised? Supervised
  37. 37. ANOMALY DETECTION EXAMPLE: CREDIT CARD FRAUD Train model on good transactions Anomalous activity indicates fraud Can pass transaction down to human for investigation
  38. 38. ANOMALY DETECTION EXAMPLE: NETWORK INTRUSION Train model on network login activity Anomalous activity indicates threat Can initiate alerts and lockdown procedures
  39. 39. ANOMALY DETECTION: POP QUIZ Is anomaly detection supervised or unsupervised? Unsupervised because we only train on normal data
  40. 40. MACHINE LEARNING WORKFLOW
  41. 41. IPYTHON
  42. 42. IPYTHON CHEATSHEET Command Meaning ipython Start IPython np? Help on module np.array? Help on function "hello"? Help on object %cpaste Paste blob of text %timeit Time function call %load FILE Load file as source quit Exit
  43. 43. WORKFLOW
  44. 44. DS AND ML WORKFLOW
  45. 45. PANDAS SCIKIT-LEARN NUMPY
  46. 46. SCIKIT-LEARN David Cournapeau In 2007 Google Summer of Code project
  47. 47. PANDAS Wes McKinney In 2008 At AQR Capital Management
  48. 48. WHY PANDAS “I’m a data janitor.” — Josh Wills Big part of data science is data cleaning Pandas is a power tool for data cleaning
  49. 49. PANDAS AND NUMPY Pandas and NumPy both hold data Pandas has column names as well Makes it easier to manipulate data
  50. 50. SIDE BY SIDE import numpy as np # Create numpy array sales_a = np.array([ [5.0,2000.0], [10.0,500.0], [20.0,200.0]]) # Extract all rows, first column sales_a[:,0] Out: array([ 5., 10., 20.]) import pandas as pd # Create pandas DataFrame sales_df = pd.DataFrame( sales_a, columns=['Price','Sales']) # Extract first column as DataFrame sales_df[['Price']] Out: Price 0 5.0 1 10.0 2 20.0
  51. 51. PANDAS NUMPY SCIKIT-LEARN WORKFLOW Start with CSV Convert to Pandas DataFrame Slice and dice in Pandas Convert to NumPy array to feed to scikit-learn
  52. 52. ARRAYS VS PANDAS VS NUMPY NumPy is faster than Pandas Both are faster than normal Python arrays
  53. 53. PYTHON ARRAY SLICES
  54. 54. INDEXING # Array a = [0,1,2,3,4] # First element a[0] # Second element a[1] # Last element a[-1]
  55. 55. SLICING # Start at index 1, stop before 5, step 2 a[1:5:2] # Start at index 1, stop before 3, step 1 a[1:3] # Start at index 1, stop at end, step 1 a[1:] # Start at index 0, stop before 5, step 2 a[:5:2]
  56. 56. SLICING: POP QUIZ What does a[::] give you? Defaults for everything: start at 0, to end, step 1.
  57. 57. PANDAS
  58. 58. DATA FRAMES FROM CSV # Use CSV file header for column names df = pd.read_csv('file.csv',header=0) # CSV file has no header df = pd.read_csv('file.csv',header=None)
  59. 59. DATA FRAME import pandas as pd df = pd.DataFrame( columns= ['City','State','Sales'], data=[ ['SFO','CA',300], ['SEA','WA',200], ['PDX','OR',150], ]) City State Sales 0 SFO CA 300 1 SEA WA 200 2 PDX OR 150
  60. 60. SELECTING COLUMNS AS DATAFRAME df[['State','Sales']] State Sales 0 CA 300 1 WA 200 2 OR 150
  61. 61. SELECTING COLUMNS AS SERIES df['Sales'] 0 300 1 200 2 150 Name: Sales, dtype: int64 df.Sales 0 300 1 200 2 150 Name: Sales, dtype: int64
  62. 62. SELECTING ROWS WITH SLICE df[1:3] City State Sales 1 SEA WA 200 2 PDX OR 150
  63. 63. SELECTING ROWS WITH CONDITION df[df['Sales'] >= 200] City State Sales 0 SFO CA 300 1 SEA WA 200 df[df.Sales >= 200] City State Sales 0 SFO CA 300 1 SEA WA 200
  64. 64. SELECTING ROWS + COLUMNS WITH SLICES # First 2 rows and all columns df.iloc[0:2,::] City State Sales 0 SFO CA 300 1 SEA WA 200
  65. 65. SELECTING ROWS + COLUMNS WITH SLICES # First 2 rows and all but first column df.iloc[0:2,1::] State Sales 0 CA 300 1 WA 200
  66. 66. ADD NEW COLUMN # New tax column df['Tax'] = df.Sales * 0.085 df City State Sales Tax 0 SFO CA 300 25.50 1 SEA WA 200 17.00 2 PDX OR 150 12.75
  67. 67. ADD NEW BOOLEAN COLUMN # New boolean column df['HighSales'] = (df.Sales >= 200) df City State Sales Tax HighSales 0 SFO CA 300 25.50 True 1 SEA WA 200 17.00 True 2 PDX OR 150 12.75 False
  68. 68. ADD NEW INTEGER COLUMN # New integer 0/1 column df['HighSales'] = (df.Sales >= 200).astype('int') df City State Sales Tax HighSales 0 SFO CA 300 25.50 1 1 SEA WA 200 17.00 1 2 PDX OR 150 12.75 0
  69. 69. APPLYING FUNCTION # Arbitrary function df['State2'] = df.State.apply(lambda x: x.lower()) df City State Sales Tax HighSales State2 0 SFO CA 300 25.50 1 ca 1 SEA WA 200 17.00 1 wa 2 PDX OR 150 12.75 0 or
  70. 70. APPLYING FUNCTION ACROSS AXIS # Calculate mean across all rows, columns 2,3,4 df.iloc[:,2:5].apply(lambda x: x.mean(),axis=0) Sales 216.666667 Tax 18.416667 HighSales 0.666667 dtype: float64
  71. 71. VECTORIZED OPERATIONS Which one is faster? # Vectorized operation %timeit df * 2 160 µs per loop # Loop %timeit for i in xrange(df.size): df.iloc[i,0] * 2 1.72 s per loop
  72. 72. VECTORIZED OPERATIONS Always use vectorized operations Avoid Python loops
  73. 73. VISUALIZATION
  74. 74. WHY VISUALIZE Why do we want to plot and visualize data? Develop intuition about data Relationships and correlations might stand out
  75. 75. SCATTER MATRIX from numpy.random import randn from pandas.tools.plotting import scatter_matrix df = pd.DataFrame(randn(1000,3), columns=['A','B','C']) scatter_matrix(df,alpha=0.2,figsize=(6,6),diagonal='kde')
  76. 76. SCATTER MATRIX
  77. 77. PLOT FEATURES Plot survival by port: C = Cherbourg, France Q = Queenstown, Ireland S = Southampton, UK # Load data df = pd.read_csv('data/titanic.csv',header=0) # Plot by port df.groupby('Embarked')[['Embarked','Survived']].mean() df.groupby('Embarked')[['Embarked','Survived']].mean().plot(kind='bar')
  78. 78. SURVIVAL BY PORT
  79. 79. PREPROCESSING
  80. 80. PREPROCESSING Handling Missing Values Encoding Categories with Dummy Variables Centering and Scaling
  81. 81. MISSING VALUES from numpy import NaN df = pd.DataFrame( columns= ['State','Sales'], data=[ ['CA',300], ['WA',NaN], ['OR',150], ]) State Sales 0 CA 300 1 WA NaN 2 OR 150
  82. 82. MISSING VALUES: BACK FILL df State Sales 0 CA 300 1 WA NaN 2 OR 150 df.fillna(method='bfill',axis=0) State Sales 0 CA 300 1 WA 150 2 OR 150
  83. 83. FORWARD FILL: USE PREVIOUS VALID ROW’S VALUE df State Sales 0 CA 300 1 WA NaN 2 OR 150 df.fillna(method='ffill',axis=0) State Sales 0 CA 300 1 WA 300 2 OR 150
  84. 84. DROP NA df State Sales 0 CA 300 1 WA NaN 2 OR 150 df.dropna() State Sales 0 CA 300 2 OR 150
  85. 85. CATEGORICAL DATA How can we handle column that has data like CA, WA, OR, etc? Replace categorical features with 0 and 1 For example, replace state column containing CA, WA, OR With binary column for each state
  86. 86. DUMMY VARIABLES df = pd.DataFrame( columns= ['State','Sales'], data=[ ['CA',300], ['WA',200], ['OR',150], ]) State Sales 0 CA 300 1 WA 200 2 OR 150
  87. 87. DUMMY VARIABLES df State Sales 0 CA 300 1 WA 200 2 OR 150 pd.get_dummies(df) Sales State_CA State_OR State_WA 0 300 1.0 0.0 0.0 1 200 0.0 0.0 1.0 2 150 0.0 1.0 0.0
  88. 88. CENTERING AND SCALING DATA Why center and scale data? Features with large values can dominate Centering centers it at zero Scaling divides by standard deviation
  89. 89. CENTERING AND SCALING DATA from sklearn import preprocessing import numpy as np X = np.array([[1.0,-1.0, 2.0], [2.0, 0.0, 0.0], [3.0, 1.0, 2.0], [0.0, 1.0,-1.0]]) scaler = preprocessing.StandardScaler().fit(X) X_scaled = scaler.transform(X) X_scaled X_scaled.mean(axis=0) X_scaled.std(axis=0)
  90. 90. INVERSE SCALING scaler.inverse_transform(X_scaled)
  91. 91. SCALING: POP QUIZ Why is inverse scaling useful? Use it to unscale the predictions Back to the units from the problem domain
  92. 92. RANDOM FORESTS
  93. 93. RANDOM FORESTS HISTORY Classification and Regression algorithm Invented by Leo Breiman and Adele Cutler at Berkeley in 2001
  94. 94. LEO BREIMAN
  95. 95. ADELE CUTLER
  96. 96. BASIC IDEA Collection of decision trees Each decision tree looks only at some features For final decision the trees vote Example of ensemble method
  97. 97. DECISION TREES Decision trees play 20 questions on your data Finds feature questions that can split up final output Chooses splits that produce most pure branches
  98. 98. RANDOM FORESTS Collection of decision trees Each tree sees random sample of data Each split point based on random subset of features out of total features To classify new data trees vote
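The tree-building and voting described above can be sketched on scikit-learn's built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small labeled dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 100 trees; each split considers a random subset of the features
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)         # slow part: the trees are built here
print(model.score(X_test, y_test))  # fast part: just walk the trees and vote
```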
  99. 99. RANDOM FORESTS: POP QUIZ Which one takes more time: training random forests or classifying new data point on trained random forests? Training takes more time This is when tree is constructed Running/evaluation is fast You just walk down the tree
  100. 100. MODEL ACCURACY
  101. 101. PROBLEM OF OVERFITTING Model can get attached to sample data Learns specific patterns instead of general pattern
  102. 102. OVERFITTING Which one of these is overfitting, underfitting, just right? 1. Underfitting 2. Just right 3. Overfitting
  103. 103. DETECTING OVERFITTING How do you know you are overfitting? Model does great on training set and terrible on test set
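One way to check: compare training and test scores; a large gap suggests overfitting. A sketch with a deliberately unpruned decision tree on the iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# An unpruned tree memorizes the training set
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # near 1.0
test_score = model.score(X_test, y_test)
print(train_score, test_score)
# A much higher train score than test score signals overfitting
```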
  104. 104. RANDOM FORESTS AND OVERFITTING Random Forests are not prone to overfitting. Why? Random Forests are an ensemble method Each tree only sees and captures part of the data Tends to pick up general rather than specific patterns
  105. 105. CROSS VALIDATION
  106. 106. PROBLEM How can we find out how good our models are? Is it enough for models to do well on training set? How can we know how the model will do on new unseen data?
  107. 107. CROSS VALIDATION Technique to test model Split data into train and test subsets Train on train data set Measure model on test data set
  108. 108. CROSS VALIDATION: POP QUIZ Why can’t we test our models on the training set? The model already knows the training set It will have an unfair advantage It has to be tested on data it has not seen before
  109. 109. K-FOLD CROSS VALIDATION Split data into k sets Repeat k times: train on k-1, test on kth Model’s score is average of the k scores
  110. 110. CROSS VALIDATION CODE from sklearn.cross_validation import cross_val_score # 10-fold (default 3-fold) scores10 = cross_val_score(model, X, y, cv=10) # See score stats pd.Series(scores10).describe()
  111. 111. HYPERPARAMETER TUNING
  112. 112. PROBLEM: OIL EXPLORATION Drilling holes is expensive We want to find the biggest oilfield without wasting money on duds Where should we plant our next oilfield derrick?
  113. 113. PROBLEM: MACHINE LEARNING Testing hyperparameters is expensive We have an N-dimensional grid of parameters How can we quickly zero in on the best combination of hyperparameters?
  114. 114. HYPERPARAMETER EXAMPLE: RANDOM FORESTS {"n_estimators": [10, 50, 100, 300], "max_depth": [3, 5, 10], "max_features": [1, 3, 10], "criterion": ["gini", "entropy"]}
  115. 115. ALGORITHMS Grid Random Bayesian Optimization
  116. 116. GRID Systematically search entire grid Remember best found so far
  117. 117. RANDOM Randomly search the grid 60 random samples gets you within top 5% of grid search with 95% probability Bergstra and Bengio’s result and Alice Zheng’s explanation (see References)
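Random search is built into scikit-learn; this sketch uses the modern `sklearn.model_selection` module name (the 2016-era `sklearn.grid_search` module offered the same class):

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample 10 random combinations instead of sweeping the full grid
param_dist = {
    'n_estimators': randint(10, 100),
    'max_depth': [3, 5, 10, None],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=10, cv=3, random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```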
  118. 118. BAYESIAN OPTIMIZATION Balance between explore and exploit Exploit: test spots within explored perimeter Explore: test new spots in random locations Balance the trade-off
  119. 119. SIGOPT YC-backed SF startup Founded by Scott Clark Raised $2M Sells cloud-based proprietary variant of Bayesian Optimization
  120. 120. BAYESIAN OPTIMIZATION PRIMER Bayesian Optimization Primer by Ian Dewancker, Michael McCourt, Scott Clark See References
  121. 121. OPEN SOURCE VARIANTS Open source alternatives: Spearmint Hyperopt SMAC MOE
  122. 122. GRID SEARCH CODE # Run grid analysis from sklearn import grid_search parameters = { 'n_estimators':[20,50,100,300], 'max_depth':[5,10,20] } model = grid_search.GridSearchCV( RandomForestClassifier(), parameters) model.fit(X,y) # Let's find out which parameters won print model.best_params_ print model.best_score_ print model.grid_scores_
  123. 123. CONCLUSION
  124. 124. SUMMARY
  125. 125. REFERENCES Python Reference scikit-learn Reference http://python.org http://scikit-learn.org
  126. 126. REFERENCES Bayesian Optimization Primer by Dewancker et al Random Search for Hyper-Parameter Optimization by Bergstra and Bengio Evaluating Machine Learning Models by Alice Zheng http://sigopt.com http://jmlr.org http://www.oreilly.com
  127. 127. COURSES Machine Learning by Andrew Ng (Online Course) Intro to Statistical Learning by Hastie et al (PDF) (Video) (Online Course) https://www.coursera.org http://usc.edu http://www.dataschool.io https://stanford.edu
  128. 128. WORKSHOP DEMO AND LAB Titanic Demo and Congress Lab for this workshop https://github.com/asimjalis/data-science-workshop
  129. 129. QUESTIONS