Fairy tales in the land of data
Or - do I know what I’m doing?
By @przemur from
http://about.me/przemek.maciolek

A fairy tale about falling into a trap of wrong interpretation of results. Shows the importance of building models and understanding them.


  1. Fairy tales in the land of data. Or: do I know what I’m doing? By @przemur from http://about.me/przemek.maciolek
  2. A story
  3. http://yamao.deviantart.com/art/Cleric-comm-343786321 https://www.flickr.com/photos/jsjgeology/8359854092/
  4. Suspense
  5. “The hammers from the new provider are no good, sayr.”
  6. What would you do?
  7. New hammers since this month
  8. # Load ggplot2 and the monthly totals for the whole mine
     install.packages('ggplot2')
     library(ggplot2)
     setwd("/Users/pmm/Desktop/hammer")
     all <- read.csv(file="all.csv")

     # Number of dwarfs and total production per month
     qplot(all$month_sequence, all$dwarfs) + geom_smooth()
     qplot(all$month_sequence, all$production) + geom_smooth()

     # Average production per dwarf per month
     all$prod_per_dwarf <- all$production / all$dwarfs
     qplot(all$month_sequence, all$prod_per_dwarf) + geom_smooth()
  9. Number of dwarfs working in the mine. The hammers from the new provider started being distributed to the new miners.
  10. Total production of gold
  11. Per-dwarf average production
  12. Who sees any problem?
  13. Let’s look at the production of each dwarf, relative to the time each one applied… Two panels: dwarfs using the OLD hammer design, and dwarfs using the NEW hammer design.
  14. # Per-dwarf production by relative month, one file per hammer type
      new <- read.csv(file="new_relative.csv")
      old <- read.csv(file="old_relative.csv")

      qplot(new$relative_month, new$production)
      ggplot(new, aes(x=relative_month, y=production)) +
        geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.2)
      # This will look much better!

      # Tag each data set with its hammer type and plot both together
      old$type <- 'old'
      new$type <- 'new'
      old_and_new <- rbind(old, new)
      ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
        geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.2)
  15. Scatterplot showing relative production done using old and new hammers
  16. What now?
  17. ggplot(old_and_new, aes(x=relative_month, y=production, color=type)) +
        geom_point(shape=19, position=position_jitter(width=.5, height=0), alpha=.1) +
        geom_smooth(method=lm)

      The new hammers wear much faster!
  18. How much did the dwarfs lose?
  19. # Fit the old-hammer trend, then predict what the new-hammer dwarfs could have produced
      old_m <- lm(production ~ relative_month, old)
      new$possible_production <- predict(old_m, new)

      # Absolute loss
      sum(new$possible_production) - sum(new$production)
      # Loss relative to the actual production
      (sum(new$possible_production) - sum(new$production))/sum(new$production)

      0.5%

      Now, taking into account the price of a hammer, one can select the optimal strategy… but that’s another story…
  20. Lessons learned …?
      • Don’t trust the data blindly, ask questions
      • Try to understand underlying rules of the system
      • Don’t be shy with trying various models
      • If using R, go for ggplot2
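
Slide 17 reads the difference between hammers off two fitted lines. One way to check that reading numerically is to fit a single linear model with an interaction term, so that each hammer type gets its own slope. The sketch below is only an illustration: the original CSV files are not available, so it uses synthetic data, and the helper `make_hammer_data`, the intercepts, the slopes, and the noise level are all invented assumptions. Only the column names `relative_month`, `production`, and `type` come from the slides.

```r
# Synthetic stand-in for old_relative.csv / new_relative.csv (all numbers are invented).
set.seed(42)
n <- 200

# Hypothetical helper: one data set per hammer type with a linear trend plus noise.
make_hammer_data <- function(type, intercept, slope) {
  relative_month <- sample(0:11, n, replace = TRUE)
  production <- intercept + slope * relative_month + rnorm(n, sd = 1)
  data.frame(relative_month = relative_month,
             production = production,
             type = type)
}

old_and_new <- rbind(
  make_hammer_data("old", intercept = 10, slope = -0.1),
  make_hammer_data("new", intercept = 10, slope = -0.4)
)
# Make "old" the reference level so coefficients describe "new" relative to "old".
old_and_new$type <- factor(old_and_new$type, levels = c("old", "new"))

# relative_month * type expands to both main effects plus their interaction.
m <- lm(production ~ relative_month * type, data = old_and_new)

# Difference between the new-hammer slope and the old-hammer slope.
slope_gap <- coef(m)[["relative_month:typenew"]]
print(summary(m)$coefficients)
```

A clearly negative `slope_gap` with a small standard error would support the claim that output under the new hammers falls off faster over time; a gap close to zero would suggest the two lines differ only by chance.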

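
The loss estimate on slide 19 is a counterfactual: fit the trend on old-hammer data, predict what the new-hammer dwarfs would have produced under that trend, and compare with what they actually produced. The following is a minimal, self-contained sketch of that calculation on synthetic data. The trends and noise level are invented for illustration, so the resulting figure has nothing to do with the 0.5% reported in the slides.

```r
# Synthetic data: 20 observations for each of 12 relative months (all numbers are invented).
set.seed(1)
old <- data.frame(relative_month = rep(0:11, each = 20))
old$production <- 10 - 0.1 * old$relative_month + rnorm(nrow(old), sd = 0.5)

new <- data.frame(relative_month = rep(0:11, each = 20))
new$production <- 10 - 0.4 * new$relative_month + rnorm(nrow(new), sd = 0.5)

# Fit the old-hammer trend only.
old_m <- lm(production ~ relative_month, data = old)

# Counterfactual: production of the new-hammer dwarfs if they had followed the old trend.
new$possible_production <- predict(old_m, newdata = new)

# Loss in absolute terms and relative to the actual production, as on the slide.
absolute_loss <- sum(new$possible_production) - sum(new$production)
relative_loss <- absolute_loss / sum(new$production)
print(c(absolute = absolute_loss, relative = relative_loss))
```

The estimate is only as good as the assumption that the old-hammer trend would have carried over unchanged to the new miners, which is the kind of question the closing slide asks the reader to raise.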