Quantitative Methods for Lawyers
Exploring Data in R
Loading Datasets
R Boot Camp - Part 1
Class #14
@ computational
computationallegalstudies.com
professor daniel martin katz danielmartinkatz.com
lexpredict.com slideshare.net/DanielKatz
A Place to Get Familiar with the Language
Can You Earn All 7 Badges?
http://tryr.codeschool.com/
The Cheat Sheet
http://cran.r-project.org/doc/contrib/Short-refcard.pdf
Download it, print it, and keep it with you when you are working.
This can be extremely helpful.
Let Me Start By Flagging Some Additional Resources that are Available to Learn R
http://www.ats.ucla.edu/stat/r/
Sign Up Here: https://www.coursera.org/course/compdata
Videos Are Here: http://www.youtube.com/watch?v=EiKxy5IecUw&list=PL7Tw2kQ2edvpNEGrU0cGKwmdDRKc5A6C4
http://www.r-bloggers.com/
That is Me :) Wearing Google Glasses
http://www.programmingr.com/
http://www.statmethods.net/
http://www.stat.yale.edu/~jay/JSM2012/PDFs/intro.pdf
http://cran.r-project.org/web/packages/IPSUR/vignettes/IPSUR.pdf
A 412 Page Book on Probability and Statistics Using R
As You Learn More, Take a Look at the R Style Guide Produced by Google
http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml
Setting Your Working Directory in R
Initially we need to make sure we understand what directory / folder R is using
We use getwd() in order to determine the current working directory
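As a minimal sketch of that step: getwd() takes no arguments and returns the path as a character string. The variable name current_dir is just an illustrative choice.

```r
# Ask R which folder it is currently working in.
current_dir <- getwd()
print(current_dir)

# List what is in that folder; these are the files R can open by name alone.
list.files(current_dir)
```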
This is a Mac-style file path. I am in my Users/katzd folder.
Within this folder is my Desktop, so let's go there
I have now set my working directory to the Desktop
Actually, I want to point this to my R folder, which is located on my Desktop
If you retype the command, add a slash, and hit Tab ... this menu will pop up and you can find the “R” folder
The use of Tab is very helpful, as it can be used to figure out how to complete lots of arguments in R
So now I have my directory set up properly
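The command itself is not shown in the text, so here is a hedged sketch of changing the working directory with setwd(). A temporary folder named "R" stands in for the folder on the Desktop (that stand-in, and the names r_folder and old_wd, are assumptions for illustration), so the example runs on any machine.

```r
# Hypothetical stand-in for a folder such as the "R" folder on the Desktop.
r_folder <- file.path(tempdir(), "R")
dir.create(r_folder, showWarnings = FALSE)

# setwd() changes the working directory and invisibly returns the old one.
old_wd <- setwd(r_folder)
getwd()   # now reports the new working directory

# Switch back to where we started.
setwd(old_wd)
```

On your own machine you would pass the real path as a quoted string, using Tab inside the quotes to complete folder names.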
Loading Datasets into R
To Get Started You Need to Be Able to Load Datasets into R
In general, your dataset(s) will be located either on your computer or online
This is Calvin Johnson of the Detroit Lions
I have made available to you a simple dataset featuring game-by-game statistics for each game in Calvin's professional career.
It is located on this website:
http://s3.amazonaws.com/KatzCloud/Calvin_Test_Data.csv
How Do I Load Datasets from the Internet?
There are various file formats in which your data may be located
Subject to limitations such as terms of service, etc., it is quite possible to turn anything online into your dataset
http://computationallegalstudies.com/2009/07/01/how-python-can-turn-the-internet-into-your-dataset-part-1/
We will focus upon loading the most common dataset formats:
.dta .csv .xls
http://s3.amazonaws.com/KatzCloud/Calvin_Test_Data.csv
Note the file extension of .csv
As you learned while getting your 7 badges, we will need to assign the dataset a name once we load it into R
Here are all of the default settings, including header=TRUE
Here I read in a .csv file using the full URL
<- is used to assign a value to an object
Here I have given the dataset the name calvin_game_data
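The command itself is not shown in the text, so here is a minimal sketch of reading a .csv file and assigning it a name with <-. To keep it runnable offline, a tiny temporary file stands in for the download; its column names (Game, Yards, Touchdowns) and values are made up for illustration and are not the real dataset.

```r
# Stand-in for the downloaded file: made-up values, hypothetical column names.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Game,Yards,Touchdowns",
             "1,70,1",
             "2,56,0"),
           csv_path)

# header = TRUE is already the default for read.csv(); it is spelled out here
# because the slide highlights it. The first line becomes the column names.
calvin_game_data <- read.csv(csv_path, header = TRUE)

# With a network connection, the URL can be passed in place of the path:
# calvin_game_data <- read.csv("http://s3.amazonaws.com/KatzCloud/Calvin_Test_Data.csv")

# Check what was loaded.
str(calvin_game_data)
```

read.csv() accepts a URL string in the same position as a local file path, which is why the same call works for both cases.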
If you have downloaded the .csv file locally to your machine, then make sure that the file path is set to the location where the dataset currently resides
Type this and then hit Tab when your cursor is between the quotes (this will bring up all files within the current working directory ... in this case, all files on my Desktop)
I then select Calvin_Test_Data.csv
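Tab completion only works interactively. As a rough scripted counterpart, you can list the files R sees and confirm the one you want exists before reading it. In this sketch a temporary folder stands in for the working directory, and the file contents are made up.

```r
# Self-contained setup: a small made-up CSV in a temporary folder that
# stands in for the working directory.
data_dir <- tempdir()
writeLines(c("Game,Yards", "1,70", "2,56"),
           file.path(data_dir, "Calvin_Test_Data.csv"))

# List the .csv files R can see in that folder.
csv_files <- list.files(data_dir, pattern = "\\.csv$")
print(csv_files)

# Build the full path and confirm the file exists before reading it.
csv_path <- file.path(data_dir, "Calvin_Test_Data.csv")
stopifnot(file.exists(csv_path))
calvin_game_data <- read.csv(csv_path)
```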
Getting Rid of the NA’s
If you want to view the data in a spreadsheet form:
View(calvin_game_data)
Notice that we have an issue with extra rows full of NA's. We
need to generate a clean version of the data without those rows
of missing values.
There are several ways to fix this but here is one way:
calvin_games_data <- calvin_game_data[complete.cases(calvin_game_data), ]
(Note: I will explain this syntax on the next slide)
Getting Rid of the NA’s
Let's take this apart. The complete.cases command creates a logical vector
specifying which observations/rows have no missing values across all columns.
To observe this, try running the following:
complete.cases(calvin_game_data)
You will see that each row gets a TRUE/FALSE value. TRUE means the row has no
NA values; FALSE means it has at least one.
In the full command we are creating a new dataset called "calvin_games_data"
In plain language, the right-hand side takes calvin_game_data and keeps only the
complete cases, using row, column logic.
The indexing syntax is as follows: dataset[rows, columns]
Notice that in the following we use rows = complete.cases(calvin_game_data), and
columns is left blank after the comma. The default with the blank is to keep
every column, so we take the whole row.
calvin_games_data <- calvin_game_data[complete.cases(calvin_game_data), ]
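Here is the same idea on a toy data frame, so you can see each piece. The column names and values are made up; the last two rows are all NA, mimicking the empty rows described above.

```r
# Made-up data with hypothetical column names.
game_data <- data.frame(
  Yards      = c(70, 56, NA, NA),
  Touchdowns = c(1, 0, NA, NA)
)

# One TRUE/FALSE per row: TRUE when the row has no missing values.
keep <- complete.cases(game_data)
print(keep)

# Rows go before the comma; the blank after the comma keeps every column.
clean_data <- game_data[keep, ]
print(clean_data)
```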
The head() command will give you the first few rows, but notice that the row numbering is still off
The NULL here will reset the row numbers
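The exact command is not shown in the text. One common way to do this, assumed here, is to assign NULL to the row names of the cleaned data frame. The data below is made up.

```r
# Made-up data with hypothetical column names; rows 1 and 3 are empty.
game_data <- data.frame(
  Yards      = c(NA, 70, NA, 56),
  Touchdowns = c(NA, 1, NA, 0)
)
clean_data <- game_data[complete.cases(game_data), ]

head(clean_data)   # the surviving rows keep their old labels

# Assigning NULL discards the old labels and renumbers the rows from 1.
rownames(clean_data) <- NULL
head(clean_data)
```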
Learning Some of the Syntax
Some Basic Commands
What are the fewest yards Calvin has had in a game?
What are the most touchdowns Calvin has had in a game?
min() selects the smallest value
max() selects the largest value
Syntax is Dataset$Variable
Dollar sign selects the column
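A minimal sketch of both commands follows. The column names Yards and Touchdowns are hypothetical (the real column names are not shown in the text), and the values are made up.

```r
# Made-up values and hypothetical column names, for illustration only.
game_data <- data.frame(
  Yards      = c(70, 56, 129, 0, 84),
  Touchdowns = c(1, 0, 2, 0, 1)
)

# Dataset$Variable picks out one column as a vector.
fewest_yards    <- min(game_data$Yards)
most_touchdowns <- max(game_data$Touchdowns)

print(fewest_yards)
print(most_touchdowns)
```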
Some Basic Commands
How Many Touchdowns has Calvin
had in his career?
What are the respective quantiles of
Calvin’s Yards Per Game?
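The commands are not shown in the text. A reasonable guess is sum() for the career total and quantile() for the quantiles, sketched here on made-up data with hypothetical column names.

```r
# Made-up values and hypothetical column names, for illustration only.
game_data <- data.frame(
  Yards      = c(70, 56, 129, 0, 84),
  Touchdowns = c(1, 0, 2, 0, 1)
)

# Total touchdowns across all games.
total_touchdowns <- sum(game_data$Touchdowns)
print(total_touchdowns)

# By default quantile() reports the 0%, 25%, 50%, 75%, and 100% points.
yard_quantiles <- quantile(game_data$Yards)
print(yard_quantiles)
```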
Some Basic Commands
Across his Career what are Calvin’s
average yards per game?
What is the Standard Deviation of
those Yards?
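Both questions can be answered with base R's mean() and sd(). This sketch uses made-up yardage values; note that sd() computes the sample standard deviation (it divides by n - 1).

```r
# Made-up yards per game, for illustration only.
yards <- c(70, 56, 129, 0, 84)

average_yards <- mean(yards)   # arithmetic mean
sd_yards      <- sd(yards)     # sample standard deviation (n - 1 denominator)

print(average_yards)
print(sd_yards)
```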
Some Basic Commands
What About the Skewness and
Kurtosis of those Yards?
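Base R has no built-in skewness or kurtosis function, and the text does not say which package was used. Add-on packages such as moments provide skewness() and kurtosis(). As a package-free sketch, the moment-based versions can be computed by hand; the data is made up.

```r
# Made-up yards per game, for illustration only.
yards <- c(70, 56, 129, 0, 84)

n        <- length(yards)
centered <- yards - mean(yards)

# Central moments (dividing by n, not n - 1).
m2 <- sum(centered^2) / n
m3 <- sum(centered^3) / n
m4 <- sum(centered^4) / n

skewness_yards <- m3 / m2^1.5   # negative = longer left tail
kurtosis_yards <- m4 / m2^2     # a normal distribution has kurtosis 3

print(skewness_yards)
print(kurtosis_yards)
```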
Some Basic Commands
If you want a high-level perspective on your variables, try the summary command:
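A minimal sketch on made-up data with hypothetical column names: calling summary() on a whole data frame reports the minimum, quartiles, mean, and maximum for each numeric column.

```r
# Made-up values and hypothetical column names, for illustration only.
game_data <- data.frame(
  Yards      = c(70, 56, 129, 0, 84),
  Touchdowns = c(1, 0, 2, 0, 1)
)

# One block of summary statistics per column.
game_summary <- summary(game_data)
print(game_summary)

# summary() also works on a single column.
summary(game_data$Yards)
```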
Plotting Data
Let's Plot Calvin’s Yards Per Game
Notice the Default Bin Widths,
Labels & Style of the Histogram
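The plotting command is not shown in the text; hist() is the base R histogram function, sketched here on made-up data. The second call shows a few arguments you might change; the title, label, and color chosen are illustrative.

```r
# Made-up yards per game, for illustration only.
yards <- c(70, 56, 129, 0, 84, 101, 45, 92, 110, 63)

# Default histogram: R picks the bins, title, and axis labels for you.
default_hist <- hist(yards)

# The returned object records the bin edges and counts that R chose.
print(default_hist$breaks)
print(default_hist$counts)

# Same data with a suggested number of bins and custom labels.
hist(yards, breaks = 10, main = "Yards Per Game", xlab = "Yards", col = "steelblue")
```

The breaks argument given as a single number is only a suggestion; R may round it to produce tidy bin edges.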
Getting Help
Plotting Data
Let's Plot Calvin’s Yards Per Game
Box and Whisker Plot
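A minimal sketch using base R's boxplot() on made-up data. The title and axis label are illustrative choices.

```r
# Made-up yards per game, for illustration only.
yards <- c(70, 56, 129, 0, 84)

# Draw the box and whisker plot; the returned list holds the plotted statistics.
yards_box <- boxplot(yards, main = "Yards Per Game", ylab = "Yards")

# Row 3 of stats is the median line drawn inside the box.
print(yards_box$stats)
```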
Earlier in the Course We Saw This Data ...
New York City
31.5 33.6 42.4 52.5 62.7 71.6 76.8 75.5 68.2 57.5 47.6 36.6
Houston
50.4 53.9 60.6 68.3 74.5 80.4 82.6 82.3 78.2 69.6 61 53.5
San Francisco
48.7 52.2 53.3 55.6 58.1 61.5 62.7 63.7 64.5 61 54.8 49.4
Load the Data from Here:
http://s3.amazonaws.com/KatzCloud/AvgTemp.csv
Load the Data from My Cloud
Take a Peek at the Results
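The layout of AvgTemp.csv is not shown in the text, so this sketch types the values from the slide into a data frame directly. The one-column-per-city layout and the column names are assumptions; the commented line shows how the file itself could be read with a network connection.

```r
# Values copied from the slide; layout and column names are assumptions.
avg_temp <- data.frame(
  NewYork      = c(31.5, 33.6, 42.4, 52.5, 62.7, 71.6, 76.8, 75.5, 68.2, 57.5, 47.6, 36.6),
  Houston      = c(50.4, 53.9, 60.6, 68.3, 74.5, 80.4, 82.6, 82.3, 78.2, 69.6, 61, 53.5),
  SanFrancisco = c(48.7, 52.2, 53.3, 55.6, 58.1, 61.5, 62.7, 63.7, 64.5, 61, 54.8, 49.4)
)

# With a network connection, the file could be read instead:
# avg_temp <- read.csv("http://s3.amazonaws.com/KatzCloud/AvgTemp.csv")

# Take a peek at the first few rows.
head(avg_temp)
```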
Okay so this is not exactly a great looking plot
Notice here how I passed the vector of names
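The plotting command is not shown in the text. One plausible reading, assumed here, is a side-by-side box plot where the city labels are passed through the names argument. The data frame layout and column names are assumptions; the values come from the slide.

```r
# Values copied from the slide; layout and column names are assumptions.
avg_temp <- data.frame(
  NewYork      = c(31.5, 33.6, 42.4, 52.5, 62.7, 71.6, 76.8, 75.5, 68.2, 57.5, 47.6, 36.6),
  Houston      = c(50.4, 53.9, 60.6, 68.3, 74.5, 80.4, 82.6, 82.3, 78.2, 69.6, 61, 53.5),
  SanFrancisco = c(48.7, 52.2, 53.3, 55.6, 58.1, 61.5, 62.7, 63.7, 64.5, 61, 54.8, 49.4)
)

# The vector of names supplies one label per box, in the same order as the data.
city_names <- c("New York City", "Houston", "San Francisco")
temp_box <- boxplot(avg_temp$NewYork, avg_temp$Houston, avg_temp$SanFrancisco,
                    names = city_names,
                    ylab = "Average Temperature")
```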
In the RStudio Plots Window, Use the Copy to Clipboard Option
Then Scale the Plot So that the Y Axis is Larger
The Final Product
More to Come in Part 2 of Boot Camp
Daniel Martin Katz
@ computational
computationallegalstudies.com
lexpredict.com
danielmartinkatz.com
illinois tech - chicago kent college of law

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Professor Daniel Martin Katz