Polong Lin(林伯龍)/how to approach data science problems from start to end

Polong Lin is a Data Scientist at IBM. He is a regular speaker on data science and develops content for free data education on bigdatauniversity.com using open data tools on datascientistworkbench.com. Polong earned his M.Sc. at the Univ. of Tsukuba.

Published in: Data & Analytics

  1. 1. How to Approach Data Science Problems from Start to End. Polong Lin, Data Scientist, IBM Analytics, Emerging Technologies. @polonglin @bigdatau. 台灣資料科學年會 (Taiwan Data Science Conference)
  2. 2. What is Big Data University (BDU)? • Free online courses • Data Science & Data Engineering • A community initiative led by IBM • Certificates and Badges • > 450,000 users
  3. 3. 3
  4. 4. 4
  5. 5. “5-5-5 Rule”. Course: • Lesson 1 (5 videos) • Lesson 2 (5 videos) • Lesson 3 (5 videos) • Lesson 4 (5 videos) • Lesson 5 (5 videos) • Final Exam • Certificate/Badge
  6. 6. Lab Exercises: Learn hands-on. Exercises in the cloud. DataScientistWorkbench.com
  7. 7. Data Science Methodology: 1. Business Understanding • 2. Analytic Approach • 3. Data Requirements • 4. Data Collection • 5. Data Understanding • 6. Data Preparation • 7. Modelling • 8. Evaluation • 9. Deployment • 10. Feedback. Diagram labels: Prediction, Interpretation, Justification, Testing
  8. 8. Flight delays (San Francisco to New York): “Polong will fly from San Francisco to New York for a meeting at 3:00pm on Friday, July 22.” Can Polong anticipate whether his flight will be delayed?
  9. 9. 1. Business Understanding • Every project begins with business understanding. • What is the project objective? • What are we trying to do: what is our goal? Steps: 1. Formulate a clear question 2. Define problem and solution requirements. Flight delays: Create a solution that helps users predict whether a flight on a given day will be delayed or not delayed.
  10. 10. 2. Analytic Approach • Identify suitable statistical/machine learning technique(s): • Linear regression • Logistic regression • Clustering • Decision trees • Principal component analysis • Text analysis • SVM/SVR • Neural networks • Dimension reduction. Flight delays: Using departure and arrival airports, date, carrier, etc., we could predict flight [DELAY] or [NO-DELAY] using logistic regression.
  11. 11. 3. Data Requirements: What data is required? What format? • Flight data • Open data available • All domestic US flights per year • CSV format. 4. Data Collection: Collect the data. 5. Data Understanding: What does the data look like? What are initial insights? Can we visualize the data? Are we missing anything? • Which airports are busiest? • Which flights are most delayed? • Which airports are best/worst?
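The data-understanding questions on this slide can be explored with a few lines of code. The following is a minimal sketch, not the speaker's own analysis: it assumes the yearly CSV has `Origin` and `DepDelay` columns and marks missing delays as `NA`, and the sample rows are invented purely for illustration.

```python
import csv
import io
from collections import Counter

# Invented toy rows standing in for the real yearly CSV file.
# The column names (Origin, Dest, DepDelay) are an assumption about its header.
SAMPLE_CSV = """Origin,Dest,DepDelay
SFO,JFK,25
SFO,LAX,0
ORD,JFK,40
SFO,JFK,NA
ORD,LAX,5
"""


def flights_per_origin(rows):
    """Count departures per origin airport (which airports are busiest?)."""
    return Counter(row["Origin"] for row in rows)


def mean_departure_delay(rows):
    """Average departure delay in minutes per origin, skipping missing values."""
    totals = {}
    for row in rows:
        if row["DepDelay"] in ("", "NA"):
            continue
        total, count = totals.get(row["Origin"], (0.0, 0))
        totals[row["Origin"]] = (total + float(row["DepDelay"]), count + 1)
    return {origin: total / count for origin, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
busiest = flights_per_origin(rows).most_common()
mean_delay = mean_departure_delay(rows)
print(busiest)
print(mean_delay)
```

For the full seven-million-row file, the same functions can be fed from `csv.DictReader(open(path))` one row at a time, so the file never has to fit in memory.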
  12. 12. Flight Data: We will only look at data from 2007 (seven million flights). http://stat-computing.org/dataexpo/2009/the-data.html [Figure: Departure Delay (min)]
  13. 13. Which airports are busiest?
  14. 14. Which flights are most likely to be delayed?
  15. 15. 6. Data Preparation. Data preparation typically includes: • Data cleaning • Merging data • Transforming data • Feature engineering • Text analysis. Flight delays: • Delayed? [True or False]: flights are classified as “delayed” if >15 min late. • Hour: does time of day for departure predict delays?
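The two engineered features on this slide (the delayed flag and the departure hour) can be sketched as below. This is only an illustration under assumptions: the slide does not say which delay column is used, so the sketch assumes departure delay in a field named `DepDelay`, and a scheduled departure time stored as an `hhmm` number in a field named `CRSDepTime`.

```python
DELAY_THRESHOLD_MIN = 15  # from the slide: "delayed" means more than 15 min late


def is_delayed(delay_minutes):
    """Label a flight as delayed when it is more than 15 minutes late."""
    return delay_minutes > DELAY_THRESHOLD_MIN


def departure_hour(hhmm):
    """Extract the hour from a time written as hhmm, e.g. 1505 -> 15."""
    return int(hhmm) // 100


def prepare(record):
    """Turn one raw record into model-ready features.

    The field names CRSDepTime and DepDelay are assumptions about the raw file.
    """
    return {
        "Hour": departure_hour(record["CRSDepTime"]),
        "Delayed": is_delayed(float(record["DepDelay"])),
    }


example = prepare({"CRSDepTime": "1505", "DepDelay": "20"})
print(example)
```

A flight exactly 15 minutes late counts as not delayed here, because the slide's rule is a strict ">15 min".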
  16. 16. Which day of the week and time of departure are worst?
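One way to answer this question is to compute the share of delayed flights for every (day of week, hour) slot and pick the highest. The sketch below assumes each flight has already been reduced to a `(day_of_week, hour, delayed)` tuple; the flights listed are invented for illustration.

```python
from collections import defaultdict


def delay_rate_by_slot(flights):
    """Return the fraction of delayed flights for each (day_of_week, hour) slot."""
    counts = defaultdict(lambda: [0, 0])  # slot -> [delayed flights, all flights]
    for day_of_week, hour, delayed in flights:
        slot = (day_of_week, hour)
        counts[slot][0] += int(delayed)
        counts[slot][1] += 1
    return {slot: delayed / total for slot, (delayed, total) in counts.items()}


# Invented toy flights: (day of week, departure hour, delayed?).
flights = [
    (5, 18, True), (5, 18, True), (5, 18, False),
    (2, 7, False), (2, 7, False), (2, 7, True), (2, 7, False),
]
rates = delay_rate_by_slot(flights)
worst_slot = max(rates, key=rates.get)
print(worst_slot, rates[worst_slot])
```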
  17. 17. Data Science Methodology: 1. Business Understanding • 2. Analytic Approach • 3. Data Requirements • 4. Data Collection • 5. Data Understanding • 6. Data Preparation • 7. Modelling • 8. Evaluation • 9. Deployment • 10. Feedback. Diagram labels: Prediction, Interpretation, Justification, Testing
  18. 18. 7. Modelling. Modelling is a highly iterative process; multiple models may be used and tested. Flight delays: Logistic regression. Using inputs: • Year • Month • Day of Month • Hour of departure • Distance • Destination airport. Predict: Delay (True/False)
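To make the logistic regression step concrete, here is a small from-scratch sketch trained by batch gradient descent. It is not the model from the talk: it uses only two of the listed inputs (hour of departure and distance), the eight training flights are invented, and the learning rate and epoch count are arbitrary choices. In practice a library implementation would normally be used instead.

```python
import math

# Invented toy flights: (hour of departure, distance in miles, delayed?).
TOY_FLIGHTS = [
    (6, 500, 0), (7, 2500, 0), (8, 1000, 0), (9, 2000, 0),
    (17, 1000, 1), (18, 2000, 1), (19, 500, 1), (20, 2500, 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_scaler(rows):
    """Return per-feature (mean, std) so inputs can be standardized."""
    n = len(rows)
    stats = []
    for j in range(len(rows[0])):
        mean = sum(r[j] for r in rows) / n
        std = math.sqrt(sum((r[j] - mean) ** 2 for r in rows) / n)
        stats.append((mean, std))
    return stats


def scale(row, stats):
    return [(x - mean) / std for x, (mean, std) in zip(row, stats)]


def train(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_delay_probability(row, stats, weights, bias):
    x = scale(row, stats)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


raw = [(hour, dist) for hour, dist, _ in TOY_FLIGHTS]
labels = [delayed for _, _, delayed in TOY_FLIGHTS]
stats = fit_scaler(raw)
weights, bias = train([scale(r, stats) for r in raw], labels)

p_evening = predict_delay_probability((19, 1500), stats, weights, bias)
p_morning = predict_delay_probability((7, 1500), stats, weights, bias)
print(p_morning, p_evening)
```

The model outputs a probability; turning it into the slide's Delay (True/False) prediction means choosing a cut-off such as 0.5, which is itself a decision to revisit during evaluation.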
  19. 19. 8. Evaluation (model evaluation) • How accurately does our model predict delays? • Does the model performance meet our business goals? • Do we need to refine our model?
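"How accurately" can be measured in more than one way. The sketch below computes accuracy, precision, and recall from actual and predicted labels, treating "delayed" as the positive class. The labels are invented for illustration, and the choice of these three metrics is an assumption rather than something stated on the slide.

```python
def confusion_counts(actual, predicted):
    """Count outcomes, treating 'delayed' (True) as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def evaluate(actual, predicted):
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented labels: 3 of 10 flights were actually delayed.
actual = [True, True, True] + [False] * 7
model = [True, False, False] + [False] * 6 + [True]
always_on_time = [False] * 10  # baseline that never predicts a delay

metrics = evaluate(actual, model)
baseline = evaluate(actual, always_on_time)
print(metrics)
print(baseline)
```

In this toy case the model and the "never delayed" baseline have the same accuracy, but the baseline catches no delays at all, which is why accuracy alone may not show whether the model meets the business goal.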
  20. 20. 9. Deployment and 10. Feedback. Deployment: • Once finalized, the model is deployed into a production environment. • May be in a limited / test environment until the model is proven • Involves additional groups, skills, and technologies: solution owner, marketing, application developers and designers, IT administration. Feedback: • Gathering and analysis of feedback to assess the model’s performance and impact • Iterative process for model refinement and redeployment • Accelerate through automated processes. Diagram labels: Prediction, Interpretation, Justification, Testing
  21. 21. Creating a prototype
  22. 22. Data Science Methodology: 1. Business Understanding • 2. Analytic Approach • 3. Data Requirements • 4. Data Collection • 5. Data Understanding • 6. Data Preparation • 7. Modelling • 8. Evaluation • 9. Deployment • 10. Feedback. Diagram labels: Prediction, Interpretation, Justification, Testing
  23. 23. Case study & Demo: Food. Can we use ingredients to predict what cuisine a recipe belongs to?
  24. 24. What cuisine is this? 2 PM 4 minute BLT Beast
  25. 25. What cuisine is this? Ingredients: Rice, Seaweed, Wasabi, Soy sauce. http://allrecipes.com/recipe/189477/california-roll-sushi/
  26. 26. How are we able to tell what kind of cuisine a dish is, even if we’ve never seen it before? Image credits: Schellack at English Wikipedia; https://www.flickr.com/photos/10559879@N00/4004745542
  27. 27. 1. Business Understanding (Research): Food and ingredients. A. Based on the ingredients alone, can we predict what cuisine a food dish belongs to? B. Which cuisines are similar to each other based on their ingredients? Cuisines: Japanese, American, British, Indian, Chinese, French, Italian, Vietnamese, Canadian
  28. 28. 2. Analytic Approach: Decision trees. A. Based on the ingredients alone, can we predict what cuisine a food dish belongs to? Example tree: ALL CUISINES → Rice? NO: NON-ASIAN FOOD; YES: ASIAN FOOD → Wasabi? NO: NOT JAPANESE; YES: JAPANESE
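The example tree on this slide can be written out directly as nested ingredient checks. The sketch below hard-codes the two splits shown (rice, then wasabi) to show how a decision tree reaches a prediction; a real decision tree learner would choose its splits from the recipe data rather than having them written by hand.

```python
def classify_cuisine(ingredients):
    """Walk the two-question tree from the slide: Rice? then Wasabi?"""
    items = {ingredient.strip().lower() for ingredient in ingredients}
    if "rice" not in items:
        return "NON-ASIAN FOOD"
    # Rice is present, so we are in the ASIAN FOOD branch.
    if "wasabi" not in items:
        return "NOT JAPANESE"
    return "JAPANESE"


california_roll = ["Rice", "Seaweed", "Wasabi", "Soy sauce"]
print(classify_cuisine(california_roll))
print(classify_cuisine(["Rice", "Soy sauce"]))
print(classify_cuisine(["Bacon", "Lettuce", "Tomato"]))
```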
  29. 29. Analytic Approach: K-means clustering. B. Which cuisines are similar to each other based on their ingredients? Group similar cuisines together into k clusters.
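As a rough sketch of how k-means groups cuisines, the code below runs the standard assign-then-update loop on a handful of cuisines. Each cuisine is described by two made-up numbers (the share of its recipes using soy sauce and olive oil); these values, the choice of k = 2, and the starting centroids are all assumptions for illustration, not figures from the scraped data.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, max_iterations=100):
    """Assign each point to its nearest centroid, then move centroids to the
    mean of their members, repeating until the assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Invented feature vectors: (soy sauce usage, olive oil usage) per cuisine.
cuisines = ["Japanese", "Chinese", "Vietnamese", "Italian", "French"]
vectors = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.1), (0.1, 0.9), (0.1, 0.7)]

labels, centroids = k_means(vectors, initial_centroids=[vectors[0], vectors[3]])
clusters = {}
for cuisine, label in zip(cuisines, labels):
    clusters.setdefault(label, []).append(cuisine)
print(clusters)
```

Fixed starting centroids keep the example reproducible; k-means is usually started from random points, so results can differ between runs.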
  30. 30. Data Collection: Web scrape. Sources: www.allrecipes.com, www.epicurious.com, www.menupan.com. Data scraped by Yong-Yeol Ahn http://yongyeol.com/
  31. 31. Data Understanding
