Fairy tale from the land of data

A fairy tale about falling into the trap of misinterpreting results. It shows the importance of building models and understanding them.

  1. 1. Fairy tales in the land of data, or: do I know what I’m doing? By @przemur (http://about.me/przemek.maciolek)
  2. 2. A story
  3. 3. http://yamao.deviantart.com/art/Cleric-comm-343786321 https://www.flickr.com/photos/jsjgeology/8359854092/
  4. 4. Suspense
  5. 5. “The hammers from the new provider are no good, sayr.”
  6. 6. What would you do?
  7. 7. New hammers since this month
  8. 8. install.packages('ggplot2')   # one-time install
        library('ggplot2')            # load the plotting library
        setwd("/Users/pmm/Desktop/hammer")
        all <- read.csv(file="all.csv")

        # Number of dwarfs and total production per month
        qplot(all$month_sequence, all$dwarfs) + geom_smooth()
        qplot(all$month_sequence, all$production) + geom_smooth()

        # Average production per dwarf per month
        all$prod_per_dwarf <- all$production / all$dwarfs
        qplot(all$month_sequence, all$prod_per_dwarf) + geom_smooth()
  9. 9. Number of dwarfs working in the mine. The hammers from the new provider began to be handed out to the new miners.
  10. 10. Total production of gold
  11. 11. Per-dwarf average production
  12. 12. Who sees any problem?
  13. 13. Let’s look at the production of each dwarf, relative to the time that dwarf applied… Two plots: dwarfs using the OLD hammer design, and dwarfs using the NEW hammer design.
  14. 14. new <- read.csv(file="new_relative.csv")
          old <- read.csv(file="old_relative.csv")

          qplot(new$relative_month, new$production)
          ggplot(new, aes(x=relative_month, y=production)) +
            geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)

          # This will look much better!
          old$type <- 'old'
          new$type <- 'new'
          old_and_new <- rbind(old, new)
          ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
            geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.2)
  15. 15. Scatterplot showing relative production done using old and new hammers
  16. 16. What now?
  17. 17. ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
            geom_point(shape=19, position=position_jitter(width=.5,height=0), alpha=.1) +
            geom_smooth(method=lm)
          The new hammers wear out much faster!
  18. 18. How much did the dwarfs lose?
  19. 19. old_m <- lm(production ~ relative_month, old)
          # Production predicted for the new-hammer dwarfs by the old-hammer model
          new$possible_production <- predict(old_m, new)
          sum(new$possible_production) - sum(new$production)
          (sum(new$possible_production) - sum(new$production))/sum(new$production)
          0.5%
          Now, taking into account the price of a hammer, one can select the optimal strategy… but that’s another story…
  20. 20. Lessons learned…?
          • Don’t trust the data blindly; ask questions
          • Try to understand the underlying rules of the system
          • Don’t be shy about trying various models
          • If using R, go for ggplot2
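
The move from calendar months (slides 9 to 11) to months relative to each dwarf's start (slide 13) can be illustrated with a small base-R sketch. Everything here is invented for illustration and is not the data behind the slides: the `production_at` curve, the hiring schedule, and the `work` data frame are assumptions. In this toy setup every dwarf follows exactly the same production curve, yet the per-dwarf average by calendar month still shifts as soon as new dwarfs join, while the view by relative month recovers the common curve.

```r
# Toy data, invented for illustration (NOT the data from the slides).
# Assumption: every dwarf produces 100 units in the first month of work
# and 2 units less in each following month.
production_at <- function(relative_month) 100 - 2 * relative_month

# Assumption: 10 veterans start in month 1; 5 new dwarfs join in each of months 7..12.
start_month <- c(rep(1, 10), rep(7:12, each = 5))

# One row per dwarf per calendar month worked (calendar months 1..12).
rows <- lapply(seq_along(start_month), function(id) {
  months <- start_month[id]:12
  data.frame(dwarf = id,
             month_sequence = months,
             relative_month = months - start_month[id])
})
work <- do.call(rbind, rows)
work$production <- production_at(work$relative_month)

# Calendar view: average production per dwarf in each calendar month.
calendar_avg <- tapply(work$production, work$month_sequence, mean)

# Relative view: average production by months since each dwarf started.
relative_avg <- tapply(work$production, work$relative_month, mean)

print(calendar_avg[c("6", "7")])  # 90 in month 6, 92 in month 7
print(relative_avg)               # the common curve: 100, 98, 96, ...
```

The jump between months 6 and 7 in the calendar view comes only from the changing mix of dwarfs, not from any change in how an individual dwarf performs, which is why the slides align the dwarfs on `relative_month` before comparing hammers.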

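
The loss estimate on slides 17 to 19 can be sketched the same way in base R. The numbers below are made up (noise-free straight lines with assumed slopes of -1.0 and -1.5), so the result has nothing to do with the 0.5% reported on the slide. The first part follows the slide's approach: fit a linear model on the old-hammer data, predict what the new-hammer dwarfs would have produced, and compare. The interaction model at the end is an addition, not something shown in the deck; it is one possible way to put a number on the difference in wear between the two fitted lines.

```r
# Toy data, invented for illustration (NOT the data from the slides).
relative_month <- 0:11
old <- data.frame(relative_month, production = 100 - 1.0 * relative_month)
new <- data.frame(relative_month, production = 100 - 1.5 * relative_month)

# Label and stack the two groups, as on slide 14.
old$type <- "old"
new$type <- "new"
both <- rbind(old, new)

# Slide 19 approach: model the old hammers, then predict for the new-hammer rows.
old_m <- lm(production ~ relative_month, data = old)
new$possible_production <- predict(old_m, newdata = new)

absolute_loss <- sum(new$possible_production) - sum(new$production)
relative_loss <- absolute_loss / sum(new$production)

# Addition (not in the slides): an interaction term estimates the slope gap.
# With the character column `type`, R uses "new" as the baseline level, so the
# coefficient below is (slope of old) - (slope of new).
slopes_m <- lm(production ~ relative_month * type, data = both)
slope_gap <- coef(slopes_m)[["relative_month:typeold"]]

print(absolute_loss)  # 33 units in this toy setup
print(relative_loss)  # 33 / 1101, about 3%
print(slope_gap)      # 0.5: old hammers lose 0.5 units/month less
```

As in the slide, the relative loss is taken against the actual production of the new-hammer dwarfs; dividing by `sum(new$possible_production)` instead would express the same gap as a share of the counterfactual output.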