Machine Learning with R

These slides accompany a tutorial on using the R language for data analysis and machine learning tasks.
The workshop was given at OSCON 2017 (Austin, TX).

Published in: Data & Analytics

  1. 1. Machine Learning with R Barbara Fusinska @BasiaFusinska
  2. 2. About me Programmer Machine Learning Data Solutions Architect @BasiaFusinska https://github.com/BasiaFusinska/MachineLearningWithR
  3. 3. Agenda • What’s Machine Learning? • Exploratory Data Analysis • Classification • Clustering • Regression
  4. 4. Setup • Install R: https://www.r-project.org/ • Install RStudio: https://www.rstudio.com/ • GitHub repository: https://github.com/BasiaFusinska/MachineLearningWithR • Packages
  5. 5. Machine Learning?
  6. 6. Movies Genres
        Title            # Kisses   # Kicks   Genre
        Taken            3          47        Action
        Love story       24         2         Romance
        P.S. I love you  17         3         Romance
        Rush hours       5          51        Action
        Bad boys         7          42        Action
        Question: What is the genre of Gone with the Wind?
  7. 7. Data-based classification
        Id   Feature 1   Feature 2   Class
        1.   3           47          A
        2.   24          2           B
        3.   17          3           B
        4.   5           51          A
        5.   7           42          A
        Question: What is the class of the entry with the following features: F1: 31, F2: 4?
  8. 8. Data Visualization [Scatter plot of the two features with a line separating classes A and B] Rule 1: If on the left side of the line then Class = A Rule 2: If on the right side of the line then Class = B
  9. 9. Chick sexing
  10. 10. Supervised learning • Classification, regression • Label, target value • Training & Validation phases
  11. 11. Unsupervised learning • Clustering, feature selection • Finding structure of data • Statistical values describing the data
  12. 12. Supervised Machine Learning workflow [Diagram labels: Clean data; Data split; Machine Learning algorithm; Trained model; Score; Preprocess data; Training data; Test data]
  13. 13. Publishing the model [Diagram labels: Machine Learning Model; Model Training; Published Machine Learning Model; Prediction; Training data; Publish model; Test stream; Scores]
  14. 14. Exploratory Data Analysis Demo
  15. 15. Classification problem Model training Data & Labels
  16. 16. Classification data
        Source       #Links   #Characters   ...   Fake
        TopNews      10       2750          …     T
        Twitter      2        120           …     F
        TopNews      235      502           …     F
        Channel X    1530     3024          …     T
        Twitter      24       70            …     F
        StoryLeaks   722      1408          …     T
        Facebook     98       230           …     T
        …            …        …             …     ...
        The Source through "..." columns are the features; the Fake column holds the labels.
  17. 17. Task: Iris EDA • Descriptive statistics (dimensions, rows, columns, data types, correlation) • Data visualization (distributions, outliers) • Features distributions & classes separation • 2D visualisation http://archive.ics.uci.edu/ml/datasets/Iris
  18. 18. K-Nearest Neighbours Algorithm • Object is classified by a majority vote • k – algorithm parameter • Distance metrics: Euclidean (continuous variables), Hamming (text)
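
The majority vote described on this slide can be sketched in a few lines. The example below is written in plain Python rather than R to stay dependency-free, and it is only an illustration, not the code from the workshop repository. It reuses the five rows of the "Data-based classification" slide and its question (F1: 31, F2: 4); the function name `knn_predict` and the choice k = 3 are assumptions made for this sketch.

```python
from collections import Counter
from math import dist

# Rows from the "Data-based classification" slide: (feature 1, feature 2) -> class
training = [
    ((3, 47), "A"),
    ((24, 2), "B"),
    ((17, 3), "B"),
    ((5, 51), "A"),
    ((7, 42), "A"),
]

def knn_predict(query, rows, k=3):
    """Return the majority class among the k rows closest to query (Euclidean distance)."""
    nearest = sorted(rows, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# The entry asked about on the slide: F1 = 31, F2 = 4
print(knn_predict((31, 4), training, k=3))  # prints "B"
```

Two of the three nearest rows belong to class B, so the vote returns B. An odd k avoids ties in a two-class problem.
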
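  19. 19. Naïve Bayes classifier
        $p(C_k \mid \mathbf{x}) = \frac{p(C_k)\, p(\mathbf{x} \mid C_k)}{p(\mathbf{x})}$, where $\mathbf{x} = (x_1, \dots, x_k)$ and $p(C_k \mid \mathbf{x}) = p(C_k \mid x_1, \dots, x_k)$
        posterior: $p(C_k \mid \mathbf{x})$; prior: $p(C_k)$; likelihood: $p(\mathbf{x} \mid C_k)$; evidence: $p(\mathbf{x})$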
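  20. 20. Naïve Bayes example
        Training data:
          Sex      Height   Weight   Foot size
          Male     6        190      11
          Male     6.2      170      10
          Female   5        130      6
          …        …        …        …
        Entry to classify:
          Sex   Height   Weight   Foot size
          ?     5.9      140      8
        $p(\text{male} \mid \mathbf{x}) = \frac{p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male})}{\text{evidence}}$
        $\text{evidence} = p(\text{male})\, p(5.9 \mid \text{male})\, p(140 \mid \text{male})\, p(8 \mid \text{male}) + p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})$
        $p(\text{female} \mid \mathbf{x}) = \frac{p(\text{female})\, p(5.9 \mid \text{female})\, p(140 \mid \text{female})\, p(8 \mid \text{female})}{\text{evidence}}$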
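
A minimal Python sketch of this calculation follows. Several things in it are assumptions, not content from the slides: the per-feature likelihoods are modelled as Gaussians (the slide does not say which distribution is used), and because the slide's table is truncated, the training rows beyond the three shown are made up purely so that a variance can be computed. The function names are hypothetical as well.

```python
from math import exp, pi, sqrt
from statistics import mean, variance

# (height, weight, foot size) per class. The first two male rows and the first
# female row are from the slide; the remaining rows are HYPOTHETICAL filler.
data = {
    "male": [(6.0, 190, 11), (6.2, 170, 10), (5.8, 180, 12)],
    "female": [(5.0, 130, 6), (5.5, 150, 8), (5.4, 120, 7)],
}

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)

def posteriors(x, data):
    """Return p(class | x) for every class, assuming independent Gaussian features."""
    total = sum(len(rows) for rows in data.values())
    joint = {}
    for label, rows in data.items():
        score = len(rows) / total  # prior p(class)
        for j, value in enumerate(x):
            column = [row[j] for row in rows]
            score *= gaussian_pdf(value, mean(column), variance(column))  # p(x_j | class)
        joint[label] = score
    evidence = sum(joint.values())  # the denominator shared by all classes
    return {label: score / evidence for label, score in joint.items()}

result = posteriors((5.9, 140, 8), data)
print(result)
```

Dividing by the evidence makes the two posteriors sum to one. With these made-up rows the female posterior comes out larger; a different training set could of course change that.
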
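  21. 21. Logistic regression
        $z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
        $y = \begin{cases} 1 & \text{for } z > 0 \\ 0 & \text{for } z < 0 \end{cases}$
        $y = \begin{cases} 1 & \text{for } \phi(z) > 0.5 \\ 0 & \text{for } \phi(z) < 0.5 \end{cases}$
        $\phi$ is the logistic function; the coefficients are chosen as the best fit of $\beta$.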
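
The two decision rules on this slide are equivalent, because the standard logistic function $\phi(z) = 1 / (1 + e^{-z})$ equals 0.5 exactly at $z = 0$. The Python sketch below illustrates this; the coefficient values in `beta` are hypothetical and would normally come from fitting the model.

```python
from math import exp

def logistic(z):
    """Standard logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict(features, beta):
    """Return (label, z); beta[0] is the intercept, beta[1:] pair with the features."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], features))
    label = 1 if logistic(z) > 0.5 else 0
    return label, z

beta = [-1.0, 0.5, 0.25]  # hypothetical coefficients, not fitted to any data

for features in [(4.0, 2.0), (1.0, 0.0)]:
    label, z = predict(features, beta)
    # Thresholding phi(z) at 0.5 gives the same label as thresholding z at 0.
    assert label == (1 if z > 0 else 0)
    print(features, round(z, 2), label)
```
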
  22. 22. Data processing Demo
  23. 23. Data classification Demo
  24. 24. Evaluation methods for classification
        Confusion matrix:
                                 Reference
                                 Positive   Negative
          Prediction  Positive   TP         FP
          Prediction  Negative   FN         TN
        Receiver Operating Characteristic (ROC) curve; area under the curve (AUC)
        $\text{Accuracy} = \frac{\#\text{correct}}{\#\text{predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$
        $\text{Precision} = \frac{TP}{TP + FP}$
        $\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$ (how good it is at detecting positives)
        $\text{Specificity} = \frac{TN}{TN + FP}$ (how good it is at avoiding false alarms)
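
As a small worked example, the Python sketch below computes these measures from the four confusion-matrix cells, with specificity taken as TN / (TN + FP). The counts are invented for illustration and the function name is an assumption of this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's evaluation measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # also called sensitivity
        "specificity": tn / (tn + fp),   # true negatives among all actual negatives
    }

# Hypothetical counts: 100 predictions in total
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
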
  25. 25. Task: Iris Classification • Data preprocessing • Split data into training and test sets • Classification using: kNN and Naïve Bayes • Performance evaluation • Results Visualisation
  26. 26. Task: Binary Classification • Only two classes in the dataset (versicolor & virginica) • Classification using logistic regression • Performance evaluation • Results Visualisation
  27. 27. Resampling: Bootstrapping
  28. 28. k-fold cross validation
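
The slide gives only the name of the technique. As a hedged sketch of the usual procedure: the rows are shuffled once and divided into k folds, and each fold serves as the test set exactly once while the remaining folds form the training set. The Python below generates row indices only; the function name and the fixed seed are assumptions of this example.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(n_rows=10, k=5):
    print("test rows:", sorted(test), "training rows:", len(train))
```

A model would be fitted on each training part and scored on the matching test part, and the k scores averaged. Repeating the whole procedure with different seeds gives repeated k-fold cross validation.
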
  29. 29. Data resampling Demo
  30. 30. Data tuning Demo
  31. 31. Task: Resampling & Tuning • Repeated k-fold cross validation • Use Naïve Bayes as classification algorithm • Tune the parameters using specific values • Performance evaluation
  32. 32. Clustering problem
  33. 33. K-means Algorithm
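
The slide names the algorithm without spelling out its steps. A minimal Python sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) is given below. It is an illustration under assumptions: the starting centroids are passed in explicitly instead of being chosen at random, the iteration count is fixed, and the points are the five feature pairs from the earlier "Data-based classification" slide.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Run a fixed number of assignment/update rounds; return (centroids, clusters)."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(3, 47), (24, 2), (17, 3), (5, 51), (7, 42)]
centroids, clusters = k_means(points, centroids=[(3, 47), (24, 2)])
print(centroids)
print(clusters)
```

On this tiny data set the two clusters coincide with classes A and B from the slide, which is the kind of comparison the clustering task asks for.
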
  34. 34. Hierarchical clustering • Decision of where the cluster should be split • Metric: distance between pairs of observations • Linkage criterion: dissimilarity of sets
  35. 35. Clustering Demo
  36. 36. Evaluation methods for clustering • Sum of squares • Class based measures • Underlying truth
  37. 37. Task: Iris Clustering • Clustering using k-means and hierarchies • Compare clusters with the original classes assignments • Visualise the findings
  38. 38. Regression problem • Dependent value • Predicting the real value • Fitting the coefficients • Analytical solutions • Gradient descent
        $f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
  39. 39. Ordinary linear regression
        Residual sum of squares (RSS): $S(w) = \sum_{i=1}^{n} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw)$
        $\hat{w} = \arg\min_w S(w)$
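
For a single feature plus an intercept, the minimiser of the residual sum of squares has a well-known closed form: the slope is the ratio of the x-y co-variation to the x variation, and the intercept follows from the means. The Python sketch below uses that special case; the data values are invented, and the function names are assumptions of this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def rss(xs, ys, intercept, slope):
    """Residual sum of squares S(w) for the given coefficients."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly following y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

intercept, slope = fit_line(xs, ys)
print(intercept, slope, rss(xs, ys, intercept, slope))
```

Any other pair of coefficients gives a larger RSS on the same data, which is what the arg min expresses.
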
  40. 40. Task: Prestige EDA • Descriptive statistics (dimensions, rows, columns, data types, correlation) • Data visualization (distributions, outliers) • Handle missing data • Features significance
  41. 41. Evaluation methods for regression • Errors
        $\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (f_i - y_i)^2}{n}}$
        $R^2 = 1 - \frac{\sum_i (f_i - y_i)^2}{\sum_i (\bar{y} - y_i)^2}$
      • Statistics (t, ANOVA)
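
Both error measures can be computed directly from the predictions $f_i$ and the observed values $y_i$. The Python sketch below is a small illustration with made-up numbers; the function names are assumptions of this example.

```python
from math import sqrt
from statistics import mean

def rmse(predicted, actual):
    """Root mean squared error between predictions f_i and observations y_i."""
    return sqrt(sum((f - y) ** 2 for f, y in zip(predicted, actual)) / len(actual))

def r_squared(predicted, actual):
    """Coefficient of determination: 1 - residual SS / total SS."""
    y_bar = mean(actual)
    ss_res = sum((f - y) ** 2 for f, y in zip(predicted, actual))
    ss_tot = sum((y_bar - y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observations and model predictions
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 9.0]

print(rmse(predicted, actual))       # about 0.354
print(r_squared(predicted, actual))  # 0.975
```
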
  42. 42. Residuals vs Fitted • Check if residuals have non-linear patterns • Check if the model captures the non-linear relationship • Should show equally spread residuals around the horizontal line
  43. 43. Normal Q-Q • Shows if the residuals are normally distributed • Points should lie along the straight dashed line • Check that residuals do not deviate severely
  44. 44. Scale-Location • Shows if residuals are spread equally along the ranges of predictors • Tests the assumption of equal variance (homoscedasticity) • Should show a horizontal line with equally (randomly) spread points
  45. 45. Residuals vs Leverage • Helps to find influential cases • Cases that fall outside the Cook’s distance lines are influential • With no influential cases, the Cook’s distance lines should be barely visible
  46. 46. Regression problem Demo
  47. 47. Task: Prestige Regression • Numeric and categorical features • Other than linear relations • Combining the features
  48. 48. Categorical data for regression • Categories: A, B, C are coded as dummy variables • In general, if the variable has k categories it will be encoded into k-1 dummy variables
        Category   V1   V2
        A          0    0
        B          1    0
        C          0    1
        $f(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_j x_j + \beta_{j+1} v_1 + \dots + \beta_{j+k-1} v_{k-1}$
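
The coding scheme can be sketched in Python as follows. The first category acts as the baseline and gets all zeros, so k categories need only k-1 indicator columns. The function name `dummy_encode` is an assumption of this example, and treating the first listed category as the baseline is a convention chosen here.

```python
def dummy_encode(value, categories):
    """Return k-1 indicator values; the first category is the all-zero baseline."""
    return [1 if value == category else 0 for category in categories[1:]]

categories = ["A", "B", "C"]
for category in categories:
    print(category, dummy_encode(category, categories))
# A [0, 0]
# B [1, 0]
# C [0, 1]
```

Each category maps to a distinct row, so the regression can estimate a separate offset for B and for C relative to A.
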
  49. 49. Categorical data for regression
        $f(x) = \beta_0 + \beta_1 x + \beta_2 v_1 + \dots + \beta_k v_{k-1} + \beta_{k+1} v_1 x + \dots + \beta_{2k-1} v_{k-1} x$
        Model formula: y ~ x + cat + x:cat
  50. 50. Keep in touch BarbaraFusinska.com @BasiaFusinska https://github.com/BasiaFusinska/MachineLearningWithR
