Introduction

INTRODUCTION
Review of Statistics. Stata and Excel introduction

HOMEWORK FOR FRIDAY
• Using files grades, peanuts, and unrate…
• Find summary statistics for each variable
• Create a histogram for grades
• Create a line graph for unrate
• Save everything in a do-file.

DESCRIPTIVE STATISTICS
• Mean – arithmetic mean, arithmetic average.
• Sum of the data values divided by the number of observations
• Mode
• Median
• Minimum, maximum
• Variance
• Standard deviation

MEAN
• Mean - arithmetic mean, arithmetic average. Sum of the data values divided by the
number of observations
• Example: Calculate the mean for the hypothetical data for shipments of peanuts from a
U.S. exporter to five Canadian cities
• Montreal – 640,000 pounds
• Ottawa – 15,000 pounds
• Toronto – 285,000 pounds
• Vancouver – 228,000 pounds
• Winnipeg – 45,000 pounds
• Notes: Σ means sum, unit of observation here is a Canadian city
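
As a worked version of the example above, the mean of the five shipments can be written out as follows. The symbols are the usual ones and are assumed here, since the slide does not define them: x_i is the shipment to city i and n = 5 is the number of cities.

```latex
\[
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
        &= \frac{640{,}000 + 15{,}000 + 285{,}000 + 228{,}000 + 45{,}000}{5} \\
        &= \frac{1{,}213{,}000}{5} = 242{,}600 \text{ pounds}
\end{aligned}
\]
```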

MEAN CONT’D
• In Excel:
• Click on fx and find the function name, or type it in directly:
• =average(range of data)
• In Stata:
• Import the data by clicking on “file” (upper left corner) -> “import” -> pick the format of the file ->
find it by clicking “browse” -> tick the box “import first row as variable names”
• mean peanuts (or summarize peanuts, which also reports the mean)
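
The same calculation can be sketched in Stata without importing a file by typing the five shipments in directly. The variable names city and peanuts are assumptions made for this illustration.

```stata
* Enter the five hypothetical shipments from the example by hand
clear
input str10 city long peanuts
"Montreal"  640000
"Ottawa"     15000
"Toronto"   285000
"Vancouver" 228000
"Winnipeg"   45000
end

* summarize reports the mean along with other statistics
summarize peanuts
display r(mean)

* mean reports the mean with its standard error
mean peanuts
```

The display r(mean) line should print 242600, which matches the calculation by hand.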

STATA
• Stata is a powerful tool for researchers and applied economists.
• Infinitely extensible, gives users the same development tools used by the company’s professional
programmers
• Google is your best friend
• Stata has a few windows:
• bottom middle is the command window – this is where you type in the commands;
• top middle – the commands that you submitted appear and so does the output;
• left – all of the commands you have run;
• right – all of the variables you have in your dataset
• To view your dataset you can click on “data editor” or “data browser”

STATA
• Right now there is no data in Stata. We first have to load the data into it. The way you
load data into Stata (or any other type of statistical software) depends on the type of
data file you have
• Text data, such as comma-delimited files (.csv)
• Excel files (.xlsx)
• Stata files (.dta)
• Please find the dataset “grades” on Blackboard. What type of file is it?
• Stata: file -> import -> type of file. Please tick “import first row as variable names”
• If you want to load a different dataset, type “clear” in the command window first
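
The menu steps above have command equivalents that can be typed or saved in a do-file. The sketch below is self-contained: it writes a small made-up dataset to disk in each of the three formats and then reads each one back. The file names (grades_demo.*) and the values are hypothetical and are not the course files.

```stata
* Build a tiny made-up dataset (values are hypothetical, not the course data)
clear
input classgrade
72
85
91
68
77
end

* Write it out in the three formats discussed above
export delimited using "grades_demo.csv", replace
export excel using "grades_demo.xlsx", firstrow(variables) replace
save "grades_demo.dta", replace

* Read each format back in; the clear option drops the data in memory first
import delimited using "grades_demo.csv", clear
import excel using "grades_demo.xlsx", firstrow clear
use "grades_demo.dta", clear
describe
```

The firstrow option of import excel plays the same role as ticking “import first row as variable names” in the dialog.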

STATA LOGS AND DO-FILES
• log – records your work in Stata; start one before you do anything else!
• .do file – lets you save a series of commands so you can run them again
• Try to make your own log and .do file
• Click on “log” -> “begin” -> give it a name -> save in a location convenient for you (this starts a
log; when you exit Stata the log is closed and saved automatically).
• Click on “do-file editor” and start typing up commands. You save it like any other document
(“save” -> give it a name, save in a convenient location).
• To run the commands in the do-file, simply click “run” at the top of the do-file editor
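
A log can also be started and closed with commands, which is convenient at the top and bottom of a do-file. This is a minimal sketch: the log file name is hypothetical, and the auto dataset that ships with Stata stands in for the course data.

```stata
* Close any log that is already open (capture ignores the error if none is)
capture log close

* Start a plain-text log; replace overwrites an older log with the same name
log using "session1_demo.log", text replace

* Commands run from here on are recorded together with their output
sysuse auto, clear
summarize price

* Close the log so the file is complete on disk
log close
```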

EXAMPLE
• Calculate the mean of the student grades in Excel and in Stata
• You will find the data set “grades” on Blackboard
• Make sure your work in Stata is recorded in a log
• What is the unit of observation in the dataset (i.e. whose grades are these)?
• How many observations are there?
• What is the average grade in that class?

SMALLEST AND LARGEST OBSERVATION
• You might be wondering whether anyone got 100 in the class, or what the highest and
lowest grades in the class were.
• We can find out by looking at the data, by sorting the data, and by using the minimum and
maximum functions in Excel and Stata
• To sort data:
• In Excel: highlight the data you want to sort, “data” -> “sort”
• In Stata: sort variablename (ascending order)
• gsort +variablename (ascending) or gsort -variablename (descending)
• Once you have sorted the data you can see what the first and last observations are
• Functions in Excel: =min(data), =max(data)
• In Stata: summarize variablename (the output includes the minimum and maximum)
• The minimum and maximum can help you spot outliers or other problems with your
data, such as a grade above 100 caused by a typo
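
The sorting and summarize steps above can be sketched as follows. The grades are made up for illustration; classgrade is the variable name used later in these slides.

```stata
* Made-up grades for illustration
clear
input classgrade
72
85
100
68
77
end

* Ascending sort: the lowest grade is now the first observation
sort classgrade
list classgrade in 1
list classgrade in l

* Descending sort: the highest grade is now the first observation
gsort -classgrade
list classgrade in 1

* summarize stores the minimum and maximum for later use
summarize classgrade
display r(min)
display r(max)
```

Here "in l" means the last observation, so after the ascending sort it shows the highest grade.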

APPLICATION 1. USE EXCEL
• Use UNRATE – unemployment rate dataset to find out the…
• Average unemployment rate between 1948 and 2020
• What was the maximum and minimum unemployment rate during that period?
• Any thoughts on your findings?
• TIP: Stata can pull data directly from FRED. There are two ways of accessing the FRED database:
• The freduse command (user-written, so it might need to be installed first): freduse UNRATE, clear
• File >> Import >> Federal Reserve Economic Data (FRED)

APPLICATION 2. USE GRADES2 TO ANSWER THE FOLLOWING
• In Stata:
• What is the minimum grade in that class?
• What is the maximum grade in that class?
• What is the average grade in that class?
• How do the minimums, maximums, and averages compare across the two classes?

STANDARD DEVIATION
• I want to calculate how dispersed the students’ grades are compared to the average
grade in the class
• Standard deviation (square root of variance) – spread of the observations around the
mean value
• Why is it useful? We can find out how much the data fluctuate around the mean in a
dataset and compare datasets. It can also help flag potential outliers, which should be
checked before deciding whether to correct or drop them.
• Examples: income in different cities, unemployment in different regions, returns on
different companies’ stock
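
For reference, the definition above can be written as a formula. This is the sample version, which divides by n - 1 and is what Stata's summarize and Excel's =stdev report. The notation (x_i for an observation, x-bar for the mean, n for the number of observations) is assumed here because the slides do not define it.

```latex
\[
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\]
```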

STANDARD DEVIATION CONT’D
• In Excel the function for standard deviation is: =stdev(data)
• In Stata the standard deviation is part of the output of the summarize command (the Std. dev. column)

STANDARD DEVIATION APPLICATIONS
• Find the standard deviation for both of the classes and compare them. What conclusion
can you draw?
• What was the standard deviation of the unemployment rate before and after outliers
were corrected? What conclusion can you draw?

VARIANCE
• Closely tied to standard deviation
• Variance = squared standard deviation
• Measure of how far away the observations are in a dataset from the mean
• To find the variance in Excel: =var(datarange)
• To find the variance in Stata: square the standard deviation by hand, or use display r(Var)
right after the summarize command
• Stata retains a number of calculations (behind the scenes) after each command. To see them:
• return list
• There are other tools for calculating summary statistics:
• help tabstat
• tabstat UNRATE, s(var)
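
The routes to the variance listed above can be tried on a few made-up numbers. The values below are hypothetical, not actual unemployment rates, and the statistic names are spelled out in full inside tabstat.

```stata
* Made-up values standing in for the unemployment rate
clear
input UNRATE
3.4
3.8
4.0
3.9
3.5
end

* summarize leaves its results behind in r()
summarize UNRATE
display r(Var)
display r(sd)^2

* Show everything summarize stored
return list

* tabstat can report the variance and standard deviation directly
tabstat UNRATE, statistics(variance sd)
```

The two display lines should print the same number, since the variance is the squared standard deviation.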

USING STATA AND EXCEL AS A CALCULATOR
• To find variance you can always square standard deviation
• di r(Var)
• di r(sd)^2
• To use Excel as a calculator, type “=” into a cell and then what you are
trying to calculate
• In Stata, type the word “display” and then what you are trying to calculate
• For example, if the standard deviation is 1.6, then to calculate the variance in
• Excel: =1.6^2 (or =1.6*1.6)
• Stata: display 1.6^2 (or display 1.6*1.6)
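
A few more calculator-style lines, as a small sketch of what display accepts. The last line reuses the peanut shipments from the mean example.

```stata
* Squaring a standard deviation of 1.6 two ways
display 1.6^2
display 1.6*1.6

* Going the other way: square root of a variance of 2.56
display sqrt(2.56)

* Mean of the five peanut shipments, computed by hand
display (640000 + 15000 + 285000 + 228000 + 45000) / 5
```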

CREATING A NEW VARIABLE
• You can create new variables in Excel and Stata. This skill will be useful later on in the
class
• For now let’s imagine the professor gives everyone in the first class a 1-point curve and
calculate their new grades
• In Excel, in a new cell type in: =“cell with data”+1, hover over the bottom right corner of the
new cell and double-click; the column should populate with the calculated values. What is
the class average now that everyone has received extra credit?
• Let’s import the grades into Stata and do the same. To create a new variable:
• generate curvedgrade = classgrade + 1 (curvedgrade is just an example name for the new variable)
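
The curve can be sketched on made-up grades as follows. The name curvedgrade is a hypothetical choice for the new variable; any name not already used in the dataset would work.

```stata
* Made-up grades for illustration
clear
input classgrade
72
85
91
68
77
end

* Add one point to every grade and store the result in a new variable
generate curvedgrade = classgrade + 1

* Compare the original and curved grades side by side
list classgrade curvedgrade
summarize classgrade curvedgrade
```

The mean of curvedgrade is exactly one point higher than the mean of classgrade, while the standard deviation is unchanged, because adding a constant shifts every observation by the same amount.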

BAR CHARTS
• You would like to find out how many people in the class received an A, B, C, and D.
• A good way to look at that is to create a distribution chart (histogram) that shows
how many students received each grade
• In Excel: highlight the data -> insert -> histogram -> right-click on the x-axis labels to change
the number of bins and their range
• In Stata click on graphics -> histogram. There are many options; let’s go through some of
them
• Variable – classgrade
• Width of bins – 10 (this is how “wide” each grade category is)
• Lower limit of first bin – 60 (assuming no one failed the class)
• Y-axis – frequency
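
The dialog choices listed above map onto options of the histogram command. The sketch below uses made-up grades, all at or above 60 so that the lower limit of the first bin is valid.

```stata
* Made-up grades, none below 60
clear
input classgrade
62
68
71
75
78
83
85
88
91
97
end

* width(10): each bin covers 10 points; start(60): first bin begins at 60;
* frequency: the y-axis shows counts instead of densities
histogram classgrade, width(10) start(60) frequency
```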

BAR CHARTS CONT’D
• We can create bar charts to compare the same variable over time (e.g. unemployment) or
across different units (e.g. income across different cities)
• Let’s create a bar chart over time using the unemployment rate data in Excel
• Highlight the unemployment rate column by clicking on the column name twice
• Click “insert” (in the ribbon at the top) -> pick bar chart (2D column)
• Right-click on the x-axis labels -> select data -> edit -> select range (years column) by
highlighting it
• To add labels to the axes, click on the chart -> “+” symbol at the right corner -> tick axis
titles -> type the titles into the boxes

LINE CHARTS
• Showing the progression of a variable over time is easier with a line chart
• Load the unemployment rate data into Stata
• This is time series data. We have to treat it a bit differently: sort the data by date first,
then create a monthly date variable and declare the data to be a time series
sort month
generate daten = tm(1948m1) + _n - 1
format daten %tm
tsset daten, monthly
• Click “graphics” on the top left -> twoway graph -> create -> line plot type -> Y-variable is
the unemployment rate, X-variable is the date (daten, created above) -> submit
• To save your graph: file -> save as -> pick the file type that will make it easy for you to open
the graph
• Compare your graphs to the FRED data: https://fred.stlouisfed.org/series/UNRATE/
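
The whole sequence, including the graph, can be sketched with commands. The series below is generated from a made-up formula purely so the example runs on its own; the variable name unrate and the output file name are assumptions.

```stata
* Two years of made-up monthly values (not real unemployment rates)
clear
set obs 24
generate unrate = 3.5 + 0.1*mod(_n, 6)

* Monthly date variable starting in January 1948
generate daten = tm(1948m1) + _n - 1
format daten %tm
tsset daten, monthly

* Line chart of the series over time
twoway line unrate daten

* Save the graph in a format that is easy to open elsewhere
graph export "unrate_line_demo.png", replace
```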

SIDE NOTE
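• How does Stata work with time series data?
• It stores dates as numbers counted from the start of 1960. For monthly data, January 1960 is
always 0, so tm(1948m1) is -144
• _n refers to the number of the current observation (1 for the first row, 2 for the second, and so on)
• _N refers to the total number of observations
• Why do I need to subtract 1 when finding the correct month?
• Because _n starts at 1, not 0. Without the -1 the first observation would be dated
February 1948 instead of January 1948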
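
The numbering can be checked directly with display. These lines need no data in memory.

```stata
* Monthly dates are counted from January 1960, which is month 0
display tm(1960m1)
display tm(1948m1)

* The same numbers shown as dates using the %tm format
display %tm tm(1948m1)

* First observation (_n = 1) with and without subtracting 1
display %tm (tm(1948m1) + 1 - 1)
display %tm (tm(1948m1) + 1)
```

The first two lines print 0 and -144. The last two show that the first observation is dated 1948m1 with the -1 and 1948m2 without it.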

GDP OVER TIME IN THE US, MEXICO, AND CANADA
• Please google “GDP per capita by country world bank” -> pick the one in current US$
(why do we have to use GDP per capita in current dollars?) -> download the csv file
• Use Ctrl+F to find GDP for the US, Mexico, and Canada. Copy and paste each country’s GDP
into a new document
• Delete third and fourth columns
• Create a line chart. What conclusion can we draw about the relative economic growth of
these countries?

CORRELATION
• Is it possible to improve your score during the semester or is the grade on the first exam
closely related to the grade at the end of the semester?
• Use grades3.xlsx data set to be able to answer this question
• Import the dataset into Stata. We are going to plot the observed points on a graph
where the axes are: exam grade and class grade
• To do so type in: scatter exam1 classgrade (the first variable goes on the y-axis, the second on the x-axis)
• We can tell that there is a positive relationship between the two variables
• The graph that you created is called a scatterplot. By looking at a scatterplot we can often
tell whether there is a relationship between two variables in the data. We can also make
an educated guess whether that relationship is positive or negative
• Can you think of two variables that might be positively or negatively related?
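
A minimal scatterplot sketch on made-up scores. The numbers are hypothetical and are not the grades3 data; the variable names follow the slide.

```stata
* Made-up first-exam and final class grades for five students
clear
input exam1 classgrade
60 65
70 72
75 80
85 84
90 93
end

* The first variable is plotted on the y-axis, the second on the x-axis
scatter exam1 classgrade
```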

CALIFORNIA SCHOOLS DATASET
• The data set includes data on California’s school districts in the 1998-1999 school year
• It includes average test scores for 5th graders in each school district
• The description of the data set is in the Word document titled “California Test Scores”
• Let’s look at the relationship between total enrollment and test scores
• Stata: scatter testscr enrl_tot
• Take a look at the data description and think about what could be related to the test scores.
Is it a positive or a negative relationship?

CORRELATION COEFFICIENT
• We don’t have to guess whether there is a relationship between two variables and
whether the relationship is positive or negative
• We will use something called “correlation coefficient” (usually denoted r) to answer that
• If r is between 0 and 1 the relationship is positive
• If r is between -1 and 0 the relationship is negative
• The closer the absolute value of r is to 1, the stronger the relationship
• The closer the absolute value of r is to 0, the weaker the relationship
• In Stata, to find the correlation coefficient type in: correlate variable1 variable2
• In Excel, to find the correlation coefficient type in: =correl(range of variable 1, range of variable 2)
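
The Stata command can be tried on a few made-up scores. The values are hypothetical and are not the grades3 data.

```stata
* Made-up first-exam and final class grades for five students
clear
input exam1 classgrade
60 65
70 72
75 80
85 84
90 93
end

* Correlation table for the two variables
correlate exam1 classgrade

* correlate stores the coefficient in r(rho)
display r(rho)
```

With these numbers r is positive and close to 1, since a higher exam score always goes with a higher class grade.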

DO IT YOURSELF TIME
• Try to create a scatterplot for the grades3 dataset in Excel
• Hint: a scatterplot is just a type of chart, so your steps will be similar to creating a bar
chart in Excel
• Try to find the correlation coefficient for the grades3 dataset in Excel (on slido)
• Hint: the correlation coefficient is a type of function. This should be similar to finding an
average or a standard deviation in Excel.

LINE OF BEST FIT
• Line of best fit is the line that best represents all of the data points on a scatterplot
• Like any straight line it has an intercept and a slope
• The equation of a straight line is: y=mx+b
• Where b – intercept with the y-axis, m – the slope of the line
• If the line of best fit for a scatterplot is y=-3x+2, this means that the intercept with the
y-axis is 2 and the slope of the line is -3.
• When x = 0, y = 2
• Since the slope is negative the relationship between the two variables is negative.

EXAMPLE: LINE OF BEST FIT FOR CLASS GRADES
• Once you have created a scatterplot in Excel you can add the line of best fit to it
• Click on the “+” in the upper-right corner, tick “trendline”
• You can see that the line of best fit is upward-sloping => the relationship between the
two variables is positive
• To find the equation of the line, right-click on it -> format trendline -> tick “display equation on chart”
• What are the intercept and the slope of the line? What conclusion can we draw from
knowing those numbers?
• Do they make sense?

CONCLUSION
• We have reviewed descriptive statistics. What are some of the descriptive stats we have
discussed?
• How can we find them in Excel?
• How can we find them in Stata?
• What types of charts have you learned to create? How can you do this in Stata and in Excel?
• If the correlation coefficient is -1, what does it mean? What about 0? 0.2?
