
14 Ddply


Published in: Technology, Education

  1. Stat405: Still more ddply. Hadley Wickham. Wednesday, 14 October 2009
  2. (1) Homework (2) Projects (3) Continuing ddply (4) Next week
  3. Homework. Homework 4: out of 20. Code quality not graded (sorry - ran out of time). Homework 5: out of 5 (but equal weight with others). Make sure to check some great examples online. Please check grades online.
  4. Projects. Generally great. Make sure to read through the two I’ve posted online. Lots of interesting findings! One common mistake: don’t forget about the denominator!
  5. b <- read.csv("batting.csv")
     # Plate appearances, times on base, and on-base percentage
     b$onplate <- with(b, ab + bb + ibb + hbp + sh + sf)
     b$onbase <- with(b, h + bb + ibb + hbp)
     b$obp <- with(b, onbase / onplate)
     library(ggplot2)
     # What is this going to look like?
     qplot(obp, data = b, binwidth = 0.01)
     qplot(onplate, obp, data = b)
     qplot(onplate, obp, data = b, alpha = I(1 / 100))
  6. How would you remove these outliers? [Figure: histogram of obp (x-axis, 0.0 to 1.0) against count (y-axis, 0 to 4000).]
  7. How would you remove these outliers? [Figure: the same histogram of obp, drawn with qplot(obp, data = b, binwidth = 0.01).]
  8. Which player would you most like to have on your team?
  11. [Figure: histogram of onplate (x-axis, 0 to 600) against count (y-axis, 0 to 15000).]
  12. [Figure: the same histogram of onplate, drawn with qplot(onplate, data = b, binwidth = 5).]
  13. What cutoff should we choose? [Figure: histogram of onplate zoomed in to values up to 200 (x-axis) against count (y-axis, 0 to 2000).]
  14. What cutoff should we choose? [Figure: the same zoomed histogram of onplate, drawn with last_plot() + xlim(10, 200).]
  15. # How many players make that many appearances
      # for each team in a given year?
      b2000 <- subset(b, year == 2000)
      ddply(b2000, "team", summarise,
        n100 = sum(onplate > 100),
        n200 = sum(onplate > 200),
        n = length(onplate)
      )
      # Problems?
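One possible reading of the "Problems?" prompt, in line with the earlier reminder about denominators, is that raw counts are hard to compare when teams list different numbers of players. The slides do not confirm that this is the intended answer. A minimal base-R sketch on made-up data turns the count into a proportion; the `toy` data frame and its values are hypothetical, and `tapply()` stands in for `ddply()` so the example runs without plyr:

```r
# Hypothetical data: two teams with different roster sizes.
toy <- data.frame(
  team    = c("A", "A", "A", "A", "B", "B"),
  onplate = c(450, 300, 20, 5, 500, 150)
)

# Per-team count above the cutoff and per-team roster size.
n100 <- tapply(toy$onplate > 100, toy$team, sum)
n    <- tapply(toy$onplate, toy$team, length)

# Both teams have two players above the cutoff,
# but the proportion of the roster differs.
summary_by_team <- data.frame(
  team = names(n),
  n100 = as.vector(n100),
  n    = as.vector(n),
  p100 = as.vector(n100 / n)
)
summary_by_team
```

With plyr loaded, the same proportion could presumably be added to the `summarise` call as `p100 = mean(onplate > 100)`.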
  16. qplot(onplate, reorder(team, onplate), data = b2000)
      qplot(year, onplate, data = subset(b, year > 1960),
        geom = "boxplot", group = year)
  17. [Figure: histogram of obp (x-axis, 0.0 to 1.0) against count (y-axis, 0 to 4000).]
  18. [Figure: the same histogram of obp, drawn with qplot(obp, data = b, binwidth = 0.01).]
  19. [Figure: histogram of obp (x-axis, 0.0 to 1.0) against count (y-axis, 0 to 1500).]
  20. [Figure: the same histogram of obp, drawn with qplot(obp, data = subset(b, onplate > 100), binwidth = 0.01) + xlim(0, 1).]
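The effect of the cutoff can be illustrated with a small worked example. This is only a sketch on invented numbers (the `toy` data frame is hypothetical, not taken from batting.csv): a player with very few plate appearances can easily land on an extreme rate such as 0 or 1, and filtering on the denominator removes those rows.

```r
# Hypothetical players: two with very few plate appearances, one regular.
toy <- data.frame(
  onbase  = c(1, 0, 150),
  onplate = c(1, 3, 450)
)
toy$obp <- toy$onbase / toy$onplate  # rates of 1, 0 and 1/3

# Same cutoff as on the slide: keep players with more than 100 appearances.
regulars <- subset(toy, onplate > 100)
regulars$obp
```

Only the third player survives the filter, so the extreme rates produced by tiny denominators drop out of the histogram.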
  21. Project tips. Proofread: far too many projects have obvious mistakes. Include a section on the data, giving a quick plain-English run-down of what you did to the data. The appendix should only contain technical details. Presentation matters: you should be proud of your work, so take a little time to put it in a nice wrapper.
  22. Baby name sex exploration
  23. library(plyr)
      library(ggplot2)
      bnames <- read.csv("baby-names.csv", stringsAsFactors = FALSE)
      # Per name: summed proportion and number of records for each sex
      times <- ddply(bnames, c("name"), summarise,
        boys = sum(prop[sex == "boy"]),
        boys_n = sum(sex == "boy"),
        girls = sum(prop[sex == "girl"]),
        girls_n = sum(sex == "girl"),
        .progress = "text"
      )
      # girls_n (a record count) is compared with 10, to match boys_n;
      # girls is a summed proportion
      both_sexes <- subset(times,
        boys_n > 10 & girls_n > 10 & boys + girls > 0.4)
      selected_names <- both_sexes$name
      selected <- subset(bnames, name %in% selected_names)
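To see what the per-name summary produces, here is a small base-R sketch on invented data. The `toy` data frame, its values, and the helper `summarise_name()` are all hypothetical; the column names (year, name, prop, sex) are taken from the code on the slide, and `split()` plus `lapply()` stands in for `ddply()` so that the example runs without plyr.

```r
# Hypothetical extract using the columns referenced on the slide.
toy <- data.frame(
  year = c(1900, 1901, 1900, 1901, 1900),
  name = c("Leslie", "Leslie", "Leslie", "Leslie", "Mary"),
  prop = c(0.002, 0.003, 0.001, 0.001, 0.05),
  sex  = c("boy", "boy", "girl", "girl", "girl"),
  stringsAsFactors = FALSE
)

# One row per name: summed proportion and number of records for each sex.
summarise_name <- function(df) {
  data.frame(
    name    = df$name[1],
    boys    = sum(df$prop[df$sex == "boy"]),
    boys_n  = sum(df$sex == "boy"),
    girls   = sum(df$prop[df$sex == "girl"]),
    girls_n = sum(df$sex == "girl"),
    stringsAsFactors = FALSE
  )
}

# Split by name, summarise each piece, and bind the rows back together.
times_toy <- do.call(rbind, lapply(split(toy, toy$name), summarise_name))
times_toy
```

A name recorded for only one sex, like "Mary" here, gets a count of zero for the other sex, which is what the `boys_n` and `girls_n` columns are used to filter on.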
  24. Yearly summaries. The next problem is to classify which names are dual-sex and which are errors. To do that, we’ll need to calculate yearly summaries for each of those names, and use our knowledge of names to come up with a good classification criterion.
  25. Your turn. For each name, in each year, figure out the total number of boys and girls. Think of ways to summarise the difference between the number of boys and girls, and start visualising the data.
  26. bysex <- ddply(selected, c("name", "year"), summarise,
        boys = sum(prop[sex == "boy"]),
        girls = sum(prop[sex == "girl"]),
        .progress = "text"
      )
      # It's useful to have a symmetric means of comparing
      # the relative abundance of boys and girls - the log
      # ratio is good for this.
      bysex$lratio <- log10(bysex$boys / bysex$girls)
      bysex$lratio[!is.finite(bysex$lratio)] <- NA
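The symmetry of the log ratio, and the reason for the `is.finite()` clean-up, can be checked on a few hand-picked values. This is a sketch with invented numbers, not output from the baby names data:

```r
# Hypothetical yearly proportions for one name; values are invented.
boys  <- c(0.010, 0.001, 0.005, 0.002, 0)
girls <- c(0.001, 0.010, 0.005, 0,     0)

# A 10:1 ratio maps to 1, a 1:10 ratio to -1, and equal proportions to 0.
lratio <- log10(boys / girls)

# Dividing by zero gives Inf (or NaN for 0 / 0), so those entries
# are replaced with NA, as on the slide.
lratio[!is.finite(lratio)] <- NA
lratio
```

On the raw ratio scale the same two cases would sit at 10 and 0.1, which is why a plain ratio is awkward for comparing boy-dominated and girl-dominated names.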
  27. [Figure: lratio (y-axis, −2 to 2) against year (x-axis, 1880 to 2000).]
  28. [Figure: dot plot of lratio (x-axis, −2 to 2) for each name, with names on the y-axis ordered by reorder(name, lratio, na.rm = T). The names run from Ronald, Mark and Larry at one end to Linda and Susan at the other.]
