Statistics 141 - Homework 6
NO LATE SUBMISSIONS
Write a report showing the code, results and plots for the questions below.
Put a printed version in Charles Arnold's mailbox in the Statistics department office, 4th floor of
the Mathematical Sciences Building, and
send an electronic version to dtemplelang@ucdavis.edu with the subject STA141 Assignment 6.
Place the following text at the top of your report and sign it on the physical version you submit:
I certify that I have acknowledged any code that I used from any other person in the class, from
Piazza or any Web site or book or other source. Any other work is my own.
1 UNIX Shell Tools
In this part of the assignment, you will use UNIX shell tools to process data outside of R and also
to get data into R.
In the Data directory on the class Web site, there is a collection of CSV files for the monthly airline
delay data from July 2012 to June 2013, inclusive. This is a compressed tar file, Airline2012_13.tar.gz.
Within this archive, each file name is of the form year_Month.csv, e.g. 2013_January.csv.
Download this file and extract the files into a single directory. Use a shell command, not a
point-and-click GUI (graphical user interface).
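The extraction step can be sketched as follows. This is a minimal illustration, not the required solution: it builds a tiny stand-in archive (the name Airline_sample.tar.gz, the directory name airline and the file contents are all made up here) so that the commands run anywhere. With the real data, only the final tar command is needed, pointed at the downloaded Airline2012_13.tar.gz.

```shell
#!/bin/sh
# Sketch: extract a gzip-compressed tar archive into one directory.
# The archive built below is a hypothetical stand-in, not the class data.
set -eu

work=$(mktemp -d)
cd "$work"

# Build a small stand-in archive with two CSV files.
mkdir src
printf 'ORIGIN,DEST\nSFO,JFK\n' > src/2012_July.csv
printf 'ORIGIN,DEST\nOAK,LAX\n' > src/2013_January.csv
tar -czf Airline_sample.tar.gz -C src .

# Extract every file into a single directory named "airline".
# -x extract, -z gunzip on the fly, -f archive name, -C target directory.
mkdir airline
tar -xzf Airline_sample.tar.gz -C airline

# Count the extracted CSV files as a quick check.
n_files=$(ls airline/*.csv | wc -l | tr -d ' ')
echo "extracted $n_files CSV files"
```

Running tar -tzf on an archive lists its contents without extracting, which is a cheap way to see how the files are laid out before unpacking.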
We want to count the number of flights for the 5 airports OAK, SFO, SMF, LAX and JFK. The
tasks are simple to state.
i) Compute the number of outbound flights for each of the five airports OAK, SMF, LAX, SFO
and JFK, and sort these counts from largest to smallest.
Perform the same computations in R. Compare the total time for each approach.
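One possible shape for the shell pipeline in i) is sketched below on made-up data. The column names YEAR, MONTH, ORIGIN and DEST and the two sample files are assumptions for illustration; the real files have many more columns and may name them differently. The sketch also assumes no field contains an embedded comma. If the real files have quoted fields such as city names containing commas, cut -d, will select the wrong column, so check the output on a few lines first.

```shell
#!/bin/sh
# Sketch: count outbound flights per airport with cut, grep, sort and uniq.
# The sample files and their column layout are hypothetical.
set -eu

data=$(mktemp -d)
cat > "$data/2012_July.csv" <<'EOF'
YEAR,MONTH,ORIGIN,DEST
2012,7,SFO,JFK
2012,7,SFO,LAX
2012,7,LAX,SFO
2012,7,SEA,SFO
2012,7,OAK,LAX
2012,7,LAX,OAK
EOF
cat > "$data/2012_August.csv" <<'EOF'
YEAR,MONTH,ORIGIN,DEST
2012,8,SFO,OAK
2012,8,LAX,JFK
2012,8,JFK,SFO
2012,8,SFO,SMF
2012,8,OAK,SFO
EOF

# Find the position of the ORIGIN column from the header line
# instead of hard-coding it.
col=$(head -n 1 "$data/2012_July.csv" | tr ',' '\n' | grep -n -x 'ORIGIN' | cut -d: -f1)

# cut keeps only the ORIGIN field; grep -x keeps lines that are exactly one
# of the five codes (this also drops the header lines); sort + uniq -c
# counts each code; sort -rn orders the counts from largest to smallest;
# sed strips the leading padding that uniq -c adds.
counts=$(cat "$data"/*.csv \
  | cut -d, -f"$col" \
  | grep -E -x 'OAK|SFO|SMF|LAX|JFK' \
  | sort | uniq -c | sort -rn \
  | sed 's/^ *//')

echo "$counts"
```

An airport with no outbound flights in the input (SMF in this sample) does not appear in the output at all, so a missing code means a count of zero. Prefixing the pipeline with the shell's time command gives a wall-clock figure to compare against the R version.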
ii) Compute the total number of flights in and out of the five airports, i.e., the sum of both the
inbound and outbound flights. You can do this however you want using a mix of the shell and R
code. One way is to first obtain the lines in the files which involve any of these five airports. Then
obtain a count for each pair of airports, i.e., ORIGIN, DESTINATION pairs. At most, how many
will there be? Then read these counts by ORIGIN, DESTINATION pairs into R and compute the
total number of flights for each of the 5 airports.
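The data-reduction half of ii) might look like the sketch below, again on made-up data with assumed column names (ORIGIN, DEST) and no embedded commas in any field. It keeps only rows where either end of the flight is one of the five airports, counts each ORIGIN, DEST pair, and writes the result as count,ORIGIN,DEST lines to a file (the name pair_counts.txt is arbitrary) that R can then read as a small table.

```shell
#!/bin/sh
# Sketch: reduce the flight data to counts per (ORIGIN, DEST) pair.
# The sample files and their column layout are hypothetical.
set -eu

data=$(mktemp -d)
cat > "$data/2012_July.csv" <<'EOF'
YEAR,MONTH,ORIGIN,DEST
2012,7,SFO,JFK
2012,7,SFO,LAX
2012,7,LAX,SFO
2012,7,SEA,SFO
2012,7,OAK,LAX
EOF
cat > "$data/2012_August.csv" <<'EOF'
YEAR,MONTH,ORIGIN,DEST
2012,8,SFO,JFK
2012,8,SEA,PDX
2012,8,JFK,SMF
EOF

# Locate both columns from the header line.
header=$(head -n 1 "$data/2012_July.csv" | tr ',' '\n')
o=$(echo "$header" | grep -n -x 'ORIGIN' | cut -d: -f1)
d=$(echo "$header" | grep -n -x 'DEST' | cut -d: -f1)

# The output goes to a .txt file so the *.csv glob below cannot pick it up.
out="$data/pair_counts.txt"

# cut emits the selected fields in file order (here ORIGIN then DEST);
# grep keeps rows where either field is one of the five codes, which also
# drops the header; uniq -c counts each pair; sed and tr turn the
# "  2 SFO,JFK" form into "2,SFO,JFK".
cat "$data"/*.csv \
  | cut -d, -f"$o","$d" \
  | grep -E '(^|,)(OAK|SFO|SMF|LAX|JFK)(,|$)' \
  | sort | uniq -c \
  | sed 's/^ *//' | tr ' ' ',' > "$out"

cat "$out"
```

The resulting file has one line per distinct pair rather than one line per flight, so it stays small however large the input is.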
Use only the UNIX shell tools to do i). For ii), use the shell tools to greatly reduce the data and
then finish off the computations in R.
Work on a small subset of the data first to get the code working correctly. You can check the results
by doing the equivalent computations in R. Then run it on the larger data set. Make certain to try
this regardless of how powerful and capable your computer is. If your computer is not capable of
running on the full data set, run it for inputs of different sizes and show a plot of the time taken as a
function of number of lines processed.
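Collecting timings for inputs of different sizes could be sketched as below. Everything here is illustrative: the input is synthetic, the sizes are arbitrary, and the origin airport is assumed to be the third comma-separated field. The elapsed time is taken from date +%s, which only has one-second resolution, so on inputs this small every timing will likely be 0; the approach becomes informative only with inputs in the millions of lines.

```shell
#!/bin/sh
# Sketch: time the same pipeline on the first n lines of a file for
# several values of n, and record "lines,seconds" rows for plotting.
# The input data and the list of sizes are hypothetical.
set -eu

work=$(mktemp -d)
big="$work/flights.csv"

# Generate 6000 synthetic lines (3 lines per loop iteration).
i=0
while [ "$i" -lt 2000 ]; do
  printf '2012,7,SFO,JFK\n2012,7,LAX,SFO\n2012,7,SEA,PDX\n'
  i=$((i + 1))
done > "$big"

out="$work/timings.csv"
echo "lines,seconds" > "$out"

for n in 1000 2000 4000 6000; do
  start=$(date +%s)
  # head -n limits the input to the first n lines.
  head -n "$n" "$big" \
    | cut -d, -f3 \
    | grep -E -x 'OAK|SFO|SMF|LAX|JFK' \
    | sort | uniq -c > /dev/null
  end=$(date +%s)
  echo "$n,$((end - start))" >> "$out"
done

cat "$out"
```

The timings file has a header row and one row per input size, so it can be read into R directly and plotted as time against number of lines.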
Shell commands that may be useful include: sed, egrep, wc, sort, uniq, cut, man, ls, gunzip, tr,
head, tail, echo, cat, xargs. You probably don't need them all.
2 Baseball, Databases and SQL
In this part of the assignment, you will gain experience with databases and SQL, and of course R,
data manipulation and visualization.
2.1 Data
We will use data about many, many aspects of baseball. These data have been compiled by Sean
Lahman and he has kindly made them available for use by many. Jeff Knecht has made the
data, up to 2011, available as an SQLite database. It is available via cloning a git repository
(https://github.com/jknecht/lahmann-2013.sqlite). You can also retrieve it from the class Web
site at http://eeyore.ucdavis.edu/stat141/Data/lahman2013.sqlite. As we saw in class,
there are 24 tables in this database. Each table has columns and rows. Documentation for each of
the tables is available at http://seanlahman.com/files/database/readme2013.txt.
2.2 Software
You will need to install the RSQLite package, typically using install.packages().
2.3 Questions
You can answer these questions with a combination of SQL commands and R manipulation of the
results, if necessary.
Give the answer and show the SQL and R code used to answer each question.
1. What years does the data cover? Are there data for each of these years?
2. How many (unique) people are included in the database? How many are players, managers, etc?
3. What team won the World Series in 2000?
4. What team lost the World Series each year?
5. Do you see a relationship between the number of games won in a season and winning the World
Series?
6. In 2003, what were the three highest salaries? (We refer here to unique salaries, i.e., more than
one player might be paid one of these salaries.)
7. For 1999, compute the total payroll of each of the different teams. Next compute the team
payrolls for all years in the database for which we have salary information. Display these in a plot.
8. Study the change in salary over time. Have salaries kept up with inflation, fallen behind, or
grown faster?
9. Compare payrolls for the teams that are in the same leagues, and then in the same divisions.
Are there any interesting characteristics? Have certain teams always had top payrolls over the
years? Is there a connection between payroll and performance?
10. Has the distribution of home runs for players increased over the years?
When answering the questions, try to summarize the results in convenient and informative form
(e.g. tables and/or plots) that illustrate the key features.
2.4 Bonus Questions
Students who are looking for bonus points (e.g., to make up for other assignments) can compose
additional questions and answer these. Make certain to explicitly state each question, indicate why
it is interesting, and answer it using the data, providing conclusions, evidence and the code used
to answer the question.