Data science for everyone

Taivo Pungas at ML Estonia Meetup (29.03.2016)

Published in: Data & Analytics

  1. Data science for everyone, Taivo Pungas, 29.03.2016
  2. For everyone?
     ● Manual/one-off analyses
       ○ Not production-level code
     ● Personal interest
     ● No deep learning
     Survey:
     ● PhD [student]?
     ● >1 stats/ML class?
     ● Can code?
  3. #1 Finding interesting data
  4. select * from users | Job | Experience
  5. 1. Existing datasets: int’l organisations, Kaggle, opendata.riik.ee → R library (WIP)
  6. 1. Existing datasets
     2. Scrape: kv.ee/2724720, kv.ee/2724721
  7. 1. Existing datasets
     2. Scrape: check ToS, don’t DoS; KV, Postimees, Auto24, Osta, …; 1M Estonian real estate ads
  8. 1. Existing datasets
     2. Scrape
     3. Quantified Self
  9. QS: how did I feel today? [Chart: count per rating; axis labels: very bad, OK, very good]
  10. #2 Process & tools for analysis
  11. My process: Get data, Clean, Explore / visualise, Publish. Also: sleep on it, feedback.
  12. Python vs R
      Python: general purpose; pandas, matplotlib, scikit-learn
      R: data processing; Hadleyverse
  13. R libraries: Hadleyverse
      Step: Libraries
      Get data: rvest, xml2, readxl
      Clean: dplyr, tidyr, stringr
      Explore / visualise: ggplot2
      Publish
      ... and many others from Hadley Wickham.
  14. Alternatives
      Excel / Google Sheets
      ● External data sources
      ● Google Apps Script
        ○ Google Translate API
        ○ Sending e-mail
        ○ …
      ● Not easy to reproduce analyses
      Tons of other software
      R & Python worth the learning curve
  15. R is easy: reading data
      # Read data
      apartments <- read.csv2("data/apartment_rent_tartu.csv", sep=";", header=TRUE)
  16. R is easy: dplyr
      library(dplyr)
      # Find average price by part of city
      apartments %>%
        group_by(Linnaosa) %>%
        summarise(KeskmineHind=mean(HindKohandatud)) %>%
        arrange(desc(KeskmineHind))
  17. R is easy: lin. regression
      # Build linear model
      fit <- lm(HindKohandatud ~ Tube, data=apartments)
      summary(fit)
  18. #3 Presenting your results
  19. interactive > static
  20. D3.js: powerful web visualisations
  21. [Chart: tools placed on two axes, easy to use vs hard to use and limited vs powerful: ggplot2, D3.js, D3 derivatives, Excel, GSheets, AI]
  22. How to reach an audience
      ● Social media
      ● Start a blog
        ○ stat24.ee
        ○ pungas.ee
      ● Offer free content
        ○ Newspapers (tip lines)
        ○ Guest posts on blogs
      ● Push to Estonian data science community
        ○ TODO: FB group? Community blog?
  23. ## Putting it together
  24. Examples
      Apartment prices: R + D3.js, 18k hits
      Salaries of public servants: R + D3.js, 38k hits
      Study data: R + D3.js, 3k hits
      Election promise calculator: D3.js, 42k hits
      Bondora: R
      Alcohol deaths: Illustrator
  25. News & inspiration
      Mailing lists: Information is Beautiful, Data Science Weekly, Data Elixir
      Blogs: FiveThirtyEight, R-bloggers, Stat24, Mike Bostock
  26. Long-term motivation [Photo: Flickr: ucirvine / CC BY-NC-ND]
  27. pungas.ee | taivo@ | tpungas
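
The three "R is easy" slides (15 to 17) depend on data/apartment_rent_tartu.csv, which is not part of this deck. Below is a minimal sketch for trying the same steps end to end, assuming a handful of invented rows in place of the real file. Base R's aggregate() and order() stand in for the dplyr pipeline so that nothing has to be installed. The column names come from the slides; reading them as district (Linnaosa), number of rooms (Tube) and adjusted price (HindKohandatud) is an assumption, and the district names and all values are made up for illustration.

```r
# Toy rows: invented values, NOT taken from the original dataset.
toy <- data.frame(
  Linnaosa       = c("Kesklinn", "Kesklinn", "Annelinn", "Annelinn", "Karlova", "Karlova"),
  Tube           = c(1, 2, 1, 3, 2, 3),
  HindKohandatud = c(300, 450, 200, 380, 320, 420)
)

# Write a semicolon-separated file, the layout read.csv2() expects by default.
csv_path <- tempfile(fileext = ".csv")
write.csv2(toy, csv_path, row.names = FALSE)

# Slide 15: read the data (sep=";" and header=TRUE are already read.csv2 defaults).
apartments <- read.csv2(csv_path)

# Slide 16, with base R standing in for dplyr: average price per district,
# sorted from most to least expensive.
avg_price <- aggregate(HindKohandatud ~ Linnaosa, data = apartments, FUN = mean)
names(avg_price)[2] <- "KeskmineHind"
avg_price <- avg_price[order(-avg_price$KeskmineHind), ]
print(avg_price)

# Slide 17: linear model of price against number of rooms.
fit <- lm(HindKohandatud ~ Tube, data = apartments)
print(coef(fit))
```

With real data, replace the temporary file with the actual CSV path; the dplyr pipeline from slide 16 should produce the same ranking as the aggregate() call here.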
