This document discusses a case study analyzing gender trends in baby names:
1. It focuses on analyzing a smaller subset of the most popular names over time.
2. The document computes summary statistics like the proportion of babies given each name that were boys or girls and the number of years each name was in the top 1000 for boys and girls.
3. Names are classified as either having dual or separate gender usage over time based on the ratio of boy to girl name assignments each year.
1. Stat405 ddply case study
Hadley Wickham
Tuesday, 5 October 2010
2. 1. Homework
   2. Project
   3. Case study: gender trends
      1. Focus on smaller subset
      2. Develop summary statistic
      3. Classify names
3. Homework
Explain your code!
Comments should explain why not what
Check your indenting - if it’s not indented
correctly, it’s very hard to read
4. # Really bad:
# Set x equal to ten.
x <- 10
# Bad:
# Figure out if all windows are bars
allbars <- all(windows %in% c("B", "BB", "BBB"))
# Better:
# all() / any() combination used to prevent errors in the
# case of three DDs.
# Better:
# Check to see if DD will create a triple
# if (length(unique(windows)) == 2)
5. # Best (but still not perfect):
## DD wild 4 cases and subcases
#### 1c) 3 DD's
#### 2c) 2 DD's
#### 2c) 2 DD's
#### the prize is quadrupled
#### 3c) 1 DD
#### prize doubled
## 3c.1) 1 DD and 2 of a kind
## 3c.2) 1 DD for any bars
## 3c.3) 1 DD for Cherries
#### 4c) NO DD's
## 4c.1) Just any bar
## 4c.2) Just cherries
7. Tips from last year
Proofread - far too many projects with
obvious mistakes.
Include a section on the data, giving a quick
English run-down of what you did to the
data. Only the appendix should contain technical details.
Presentation matters - you should be proud
of your work, so take a little time to put it in a
nice wrapper.
8. Easy ways to lose points
Overplotting
Code style violations
Forgetting about the denominator of a
ratio
9. Team Assessment
Your individual grades will be weighted by
effort.
Each team member should turn in a
(confidential) team evaluation sheet.
Don’t forget to assess yourself.
11. Questions
For names that are used for both boys
and girls, how has usage changed?
Can we use names that clearly have the
incorrect sex to estimate error rates over
time?
12. Getting started
options(stringsAsFactors = FALSE)
library(plyr)
library(ggplot2)
bnames <- read.csv("baby-names2.csv.bz2")
14. First task
Too many names (~7000): need to identify
smaller subset (~100) likely to be
interesting.
Outside of class, would look at more, but
starting with a subset for easier
exploration is a good idea.
For this task, what attributes of a name are
likely to be useful?
15. Your turn
For each name, calculate the total proportion
of boys, the total proportion of girls, the
number of years the name was in the top
1000 as a girl's name, and the number of years
the name was in the top 1000 as a boy's
name.
Hint: Start with a single name and figure out
how to solve the problem. Hint: Use
summarise.
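One way to follow the first hint is sketched below in base R: write a function that summarises a single name, check it on one name, then apply it to every name. The toy data frame and the names `toy`, `summarise_name` and `times_toy` are assumptions made up for illustration; they are not part of the course data or code.

```r
# Toy data standing in for bnames (values are invented for illustration)
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991, 1990),
  name = c("Jordan", "Jordan", "Jordan", "Jordan", "Mary"),
  sex  = c("boy", "girl", "boy", "girl", "girl"),
  prop = c(0.004, 0.002, 0.005, 0.003, 0.010),
  stringsAsFactors = FALSE
)

# Step 1: solve the problem for the rows of a single name
summarise_name <- function(d) {
  data.frame(
    name    = d$name[1],
    boys    = sum(d$prop[d$sex == "boy"]),   # total proportion as a boy's name
    boys_n  = sum(d$sex == "boy"),           # rows (one per year) as a boy's name
    girls   = sum(d$prop[d$sex == "girl"]),  # total proportion as a girl's name
    girls_n = sum(d$sex == "girl"),          # rows (one per year) as a girl's name
    stringsAsFactors = FALSE
  )
}
jordan <- summarise_name(subset(toy, name == "Jordan"))

# Step 2: split by name, apply the same function, and stack the results
times_toy <- do.call(rbind, lapply(split(toy, toy$name), summarise_name))
print(times_toy)
```

The split, apply and combine steps are written out by hand here so that each stage can be inspected separately.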
16. times <- ddply(bnames, "name", summarise,
  boys = sum(prop[sex == "boy"]),
  boys_n = sum(sex == "boy"),
  girls = sum(prop[sex == "girl"]),
  girls_n = sum(sex == "girl"),
  .progress = "text" # Useful for slow operations
)
# But this is rather painful
17. # For this task, the data are much easier to work with
# if we put sex in columns instead of rows. We'll learn
# more about reshaping in a couple of weeks
# install.packages("reshape2")
library(reshape2)
bnames2 <- dcast(bnames, year + name ~ sex,
  value.var = "prop")
# No information unless we have both boys and
# girls for that name in that year
both <- subset(bnames2, !is.na(boy) & !is.na(girl))
dim(both)
head(both)
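To see what moving sex from rows to columns does, here is a small sketch on invented data. It uses base R's `reshape()` rather than reshape2 so that it runs without extra packages; the data frame `toy` and the names `wide` and `both_toy` are hypothetical.

```r
# Toy long-format data: one row per year, name and sex (values invented)
toy <- data.frame(
  year = c(1990, 1990, 1990, 1991, 1991),
  name = c("Jordan", "Jordan", "Mary", "Jordan", "Jordan"),
  sex  = c("boy", "girl", "girl", "boy", "girl"),
  prop = c(0.004, 0.002, 0.010, 0.005, 0.003),
  stringsAsFactors = FALSE
)

# Long to wide: one row per year and name, one column per sex
wide <- reshape(toy, idvar = c("year", "name"), timevar = "sex",
  v.names = "prop", direction = "wide")
# reshape() names the new columns prop.boy and prop.girl; shorten them
names(wide) <- sub("^prop\\.", "", names(wide))

# A name missing for one sex in a given year gets NA in that column,
# so keeping complete rows keeps only the year/name pairs with both
both_toy <- subset(wide, !is.na(boy) & !is.na(girl))
print(wide)
print(both_toy)
```

In this toy example Mary appears only as a girl's name in 1990, so her `boy` value is `NA` and the row is dropped from `both_toy`.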
18. Your turn
Summarise each name with the number
of years it has made the list for both boys
and girls, and the average proportion of
babies given that name.
Which names would you include for
further investigation?
19. both_sum <- ddply(both, "name", summarise,
years = length(name),
avg_usage = mean(boy + girl) / 2
)
# No point in looking at names that only appear once
both_sum <- subset(both_sum, years > 1)
qplot(years, avg_usage, data = both_sum)
20. # Now save our selections
selected_names <- subset(both_sum,
years > 20 & avg_usage > 0.005)$name
selected <- subset(both, name %in% selected_names)
nrow(selected) / nrow(both)
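The summarise-then-filter pattern can be checked by hand on a tiny example. The sketch below uses base R and invented data; the thresholds (`years > 1`, `avg_usage > 0.002`) are scaled down for the toy data and are assumptions, not the cutoffs used in class.

```r
# Toy wide-format data: one row per year and name (values invented)
both_toy <- data.frame(
  year = c(2000, 2001, 2002, 2000, 2001, 2000),
  name = c("Jordan", "Jordan", "Jordan", "Casey", "Casey", "Sky"),
  boy  = c(0.004, 0.005, 0.006, 0.001, 0.001, 0.0002),
  girl = c(0.002, 0.003, 0.002, 0.001, 0.003, 0.0004),
  stringsAsFactors = FALSE
)

# Per-name summary: years on both lists and average usage across the sexes
sum_toy <- do.call(rbind, lapply(split(both_toy, both_toy$name), function(d) {
  data.frame(
    name = d$name[1],
    years = nrow(d),
    avg_usage = mean(d$boy + d$girl) / 2,
    stringsAsFactors = FALSE
  )
}))

# Hypothetical cutoffs chosen for the toy data only
keep <- subset(sum_toy, years > 1 & avg_usage > 0.002)$name
selected_toy <- subset(both_toy, name %in% keep)

# Fraction of rows retained by the selection
nrow(selected_toy) / nrow(both_toy)
```

Only Jordan passes both conditions here: Sky appears in a single year and Casey's average usage falls below the toy cutoff.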
21. Your turn
Explore how the gender assignment of
these names has changed over time.
What is a good summary to use to
compare boy popularity to girl popularity?
22. qplot(year, boy - girl, data = selected,
geom = "line", group = name)
qplot(year, abs(boy - girl), data = selected,
geom = "line", group = name,
colour = sign(boy - girl))
qplot(year, boy / girl, data = selected,
geom = "line", group = name)
qplot(year, log10(boy / girl), data = selected,
geom = "line", group = name)
selected$lratio <- with(selected, log10(boy / girl))
qplot(lratio, name, data = selected)
qplot(lratio, reorder(name, lratio), data = selected)
qplot(abs(lratio), reorder(name, lratio),
data = selected)
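The sequence of plots above moves from a difference to a ratio to a log ratio. A short sketch with invented proportions suggests why the log ratio may be the easier summary to compare: a name that is ten times more common for boys and one that is ten times more common for girls sit at equal distances from zero, whereas the raw ratio squeezes the second case into the interval between 0 and 1.

```r
# Invented proportions for three hypothetical names
boy  <- c(0.010, 0.001, 0.005)
girl <- c(0.001, 0.010, 0.005)

ratio  <- boy / girl      # roughly 10, 0.1 and 1: asymmetric around 1
lratio <- log10(ratio)    # roughly 1, -1 and 0: symmetric around 0

# Swapping boys and girls only flips the sign of the log ratio
flipped <- log10(girl / boy)

print(data.frame(boy, girl, ratio, lratio, flipped))
```

The sign of `lratio` then says which sex the name leans towards, and `abs(lratio)` says how strongly.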
23. Your turn
Compute the mean and range of lratio for
each name.
Plot and come up with cutoffs that you
think separate the two groups.
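A possible starting point for this exercise, sketched in base R on invented values: compute the mean of `lratio` and the width of its range for each name, then flag names against a cutoff. The data, the column name `width`, the group labels and the cutoff of 1 are all assumptions for illustration; the actual cutoffs should come from looking at the plot.

```r
# Toy log ratios for two hypothetical names over three years
selected_toy <- data.frame(
  name   = c("Ashley", "Ashley", "Ashley", "Jamie", "Jamie", "Jamie"),
  lratio = c(1.2, 0.1, -1.0, 0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)

# Per-name mean and range width (max minus min) of the log ratio
by_name <- split(selected_toy$lratio, selected_toy$name)
lsum <- data.frame(
  name  = names(by_name),
  mean  = sapply(by_name, mean),
  width = sapply(by_name, function(x) diff(range(x))),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff: a wide range means the boy/girl balance moved a lot
cutoff <- 1
lsum$group <- ifelse(lsum$width > cutoff, "wide", "narrow")
print(lsum)
```

Plotting `mean` against `width` for the real data would be one way to judge whether a single cutoff separates the groups cleanly.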