Data Analysis
Instructions of Excel 2016
By Yancy Chow
Data Analysis: House Example
House Data : 50 houses
Two variables: Price (Y) Area (X)
Excel: How to Add-in
Setting Up Excel for Statistical Analysis:
Click “File”, then “Options”.
Then find “Add-ins” on the left side and click on it.
After that, click on “Go”.
Then check “Analysis ToolPak” and click on “OK”.
Note: if you use an Apple computer, the add-ins option is under “Tools”.
Now you have successfully added “Data Analysis”.
Click on “Data” at the top and you will see the “Data Analysis” icon.
Excel: How to Calculate Mean, Median and Mode?
Open your data in Excel, or type your data into Excel by column. For example, we want to calculate the mean, median and mode of the variable “Price” in this data. First select “Data”, then click on “Data Analysis”.
After clicking on “Data Analysis”, scroll until you find “Descriptive Statistics” in the Analysis Tools panel and select it. Then click on “OK”.
First, fill in the “Input Range”.
You can either type the range in the box or use the mouse to select the numbers in the column you are interested in. In this example, we select all 50 numbers in the first column. Do not select the label row, such as the “Price” row.
After selecting the “Input Range”, select “Output Range” and choose where you want the output to appear.
Then check “Summary statistics”. Click on “OK” and you will have the data analysis results.
Here are the results of the data analysis, including the mean, median, mode, standard deviation, sample variance, range, minimum and maximum.
Note that Excel reports only one mode. You need to check for yourself whether there is more than one.
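For readers who want to cross-check the Descriptive Statistics output outside Excel, here is a minimal Python sketch. The function name `describe` and the sample prices are hypothetical (they are not the 50-house data), and `statistics.multimode` is used so that every mode is listed rather than just one.

```python
import statistics


def describe(values):
    """Return the main summary statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # multimode lists every most-frequent value, not just the first one
        "modes": statistics.multimode(values),
        # sample (n - 1) versions of standard deviation and variance
        "standard deviation": statistics.stdev(values),
        "sample variance": statistics.variance(values),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
    }


if __name__ == "__main__":
    # Hypothetical prices for illustration only
    prices = [850000, 900000, 900000, 950000, 950000, 1000000]
    for name, value in describe(prices).items():
        print(name, value)
```

If the list of modes has more than one entry, the data has more than one mode.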
Excel: How to Calculate the First Quartile (Q1)?
Q1: Choose an empty cell and enter:
“=QUARTILE(data range, 1)”
Then press “Enter” and you will get the first quartile (Q1).
In this example, A2:A51 is the data range.
Excel: How to Calculate the Third Quartile (Q3)?
Q3: Choose an empty cell and enter:
“=QUARTILE(data range, 3)”
Then press “Enter” and you will get the third quartile (Q3).
In this example, A2:A51 is the data range.
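The quartile formulas above can be sketched in Python as follows. This assumes that Excel's QUARTILE interpolates between data points with the minimum as the 0th percentile and the maximum as the 100th, which corresponds to the "inclusive" method of `statistics.quantiles`. The function name `quartiles_inclusive` is hypothetical.

```python
import statistics


def quartiles_inclusive(values):
    """Return (Q1, median, Q3) using the inclusive interpolation method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


if __name__ == "__main__":
    # Small illustrative data set, not the house data
    print(quartiles_inclusive([1, 2, 3, 4]))
```

Other software may use a different quartile convention, so small differences from Excel are possible.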
Excel: How to Draw Histograms?
First, check the output from “Descriptive Statistics” in “Data Analysis”. In this house data, the mean is $956,396.66, the minimum is $729,870 and the maximum is $1,190,000. A reasonable interval is $50,000, so create a new column of “Bins” running from $700,000 to $1,200,000 in steps of $50,000.
Once you have created reasonable bins, select “Data”, then “Data Analysis”. Find “Histogram” and click on “OK”.
Select the “Input Range”: the 50 house prices.
Select the “Bin Range”: the column you created.
Choose any empty space as your “Output Range”.
Check “Cumulative Percentage” and “Chart Output”.
Here is the output from “Data Analysis”: a frequency table and a histogram.
Now you can edit the text color, text size and fill color if you want.
You can also edit the colors from “Page Layout”.
You can also add effects to the bars if you want. Double-click on one bar and select the “Format” tool. You can set the shape fill, the shape outline and the effects if you want a more polished graph.
Histograms usually have no gaps between the bars. To remove the gaps, select the graph, right-click and choose “Format Plot Area”.
Then choose from “Plot Area Options”.
Then set the “Gap Width” to “0%”.
Now the histogram has no gaps.
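The frequency table behind the histogram can be sketched in Python. This is an illustration under stated assumptions: each bin value is treated as an upper limit (a value is counted in the first bin that is greater than or equal to it), anything above the last bin goes to a "More" row, and the function name `frequency_table` and the sample prices are hypothetical.

```python
from bisect import bisect_left


def frequency_table(values, bins):
    """Return rows of (bin label, frequency, cumulative percentage)."""
    counts = [0] * (len(bins) + 1)  # one extra slot for the "More" row
    for value in values:
        # index of the first bin limit that is >= value
        counts[bisect_left(bins, value)] += 1

    labels = [str(b) for b in bins] + ["More"]
    rows = []
    running = 0
    for label, count in zip(labels, counts):
        running += count
        rows.append((label, count, 100.0 * running / len(values)))
    return rows


if __name__ == "__main__":
    # Bins from 700000 to 1200000 in steps of 50000, as in the text
    bins = list(range(700_000, 1_200_001, 50_000))
    # Hypothetical prices; only the minimum and maximum come from the text
    prices = [729_870, 812_500, 905_000, 956_000, 1_010_000, 1_190_000]
    for row in frequency_table(prices, bins):
        print(row)
```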
Excel: How to Draw Bar Charts (Qualitative data)?
Type the categories of the qualitative data and the corresponding frequencies (how many in each class).
Select the data (class and frequency), then choose “2-D Column” or “3-D Column”.
Double-click each bar and choose “Format” to design the color and effects.
Excel: How to Draw Scatter Plot?
Before doing the data analysis, you may want to look at a scatter plot to see whether there is a relationship between Y and X. Note that you should put X in the first column and Y in the second column. Click “Insert” and then choose “Scatter chart” to make the plot.
Excel: How to Do Simple Linear Regression?
Select “Data” from the toolbar and click on “Data Analysis”. Find “Regression” in the dialog and click on “OK”.
Select the “Input Y Range” from the variable “Price” column.
Select the “Input X Range” from the variable “Area” column.
Set the “Output Range” to somewhere empty.
Check “Residual Plots”, “Line Fit Plots” and “Normal Probability Plots”.
Then click on “OK”.
Data Analysis Output: Excel 2016
Correlation: the strength of the linear relationship between x and y. 0.913 is strong.
R Square: the amount of variation explained by the regression. Is the model a reliable predictor of y? 83.3% of variation is explained.
Standard Error of the Regression: used in prediction.
Coefficients: Y-intercept and slope.
If the p-value is smaller than 0.05, then the parameter is significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is significant/important.
The equation will be: Y = 307953 + 195X
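The quantities read from the regression output (intercept, slope, correlation and R Square) can be reproduced with the usual least squares formulas. The sketch below is a plain-Python illustration. The function name `fit_line` and the sample data are hypothetical and will not reproduce the 0.913 or the equation above.

```python
from math import sqrt


def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # correlation between x and y
    return intercept, slope, r, r * r


if __name__ == "__main__":
    # Hypothetical (area, price) pairs for illustration only
    area = [1500, 1800, 2100, 2400, 2700]
    price = [600000, 660000, 715000, 780000, 830000]
    print(fit_line(area, price))
```

In simple linear regression, R Square is the square of the correlation, which is why 0.913 and roughly 83% appear together in the output.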
Chapter 11
Simple Linear Regression
(Individual Project)
By Yancy Chow
What is Simple Linear Regression?
House 1 and House 2: which one do you think is more expensive? Why?
Simple linear regression can be used to find relationships between two variables.
Examples:
Gene mapping for cancer research
Stock market investment analysis
Sales forecasting
Product quality control
Income demographics
Simple linear regression is one-to-one:
Dependent Variable (y). Ex. House price
Independent Variable (x). Ex. Square feet
House Price = ?? × Square feet + Random Error
Procedure: Model
We always assume that the mean value of the random error equals 0.
Taking the average, the random error drops out of the model for the mean of y.
Figure of the Model
How to fit the Model? The Least Squares Approach
How to interpret?
For every one-unit increase in x, the mean of y is estimated to increase by the estimated slope.
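The formulas on these slides did not survive in the text. As a sketch only, here is the standard way to write the model described above. The symbols $\beta_0$, $\beta_1$ and $\varepsilon$ are assumed notation and are not taken from the slides.

```latex
% Model with random error (assumed notation)
y = \beta_0 + \beta_1 x + \varepsilon, \qquad E(\varepsilon) = 0

% Taking the average of both sides removes the error term
E(y) = \beta_0 + \beta_1 x

% Least squares estimates of the slope and intercept
\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
```

With this notation, the interpretation above reads: for every one-unit increase in x, the mean of y is estimated to increase by $\hat{\beta}_1$ units.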
Bellevue College
Houses near Bellevue College
Home value (Y) as a function of square footage (X)
Example
Data Analysis Output: Excel 2016
Correlation: the strength of the linear relationship between x and y. 0.913 is strong.
R Square: the amount of variation explained by the regression. Is the model a reliable predictor of y? 83.3% of variation is explained.
Standard Error of the Regression: used in prediction.
Coefficients: Y-intercept and slope.
If the p-value is smaller than 0.05, then the parameter is significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is significant/important.
The equation will be: Y = 307953 + 195X
Recently sold house near Bellevue College:
Area: 2,360 sq ft
Lot size: 8,162 sq ft
Year built: 1976
What is the house value from our model?
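The question above can be worked through by plugging the area into the fitted equation Y = 307953 + 195X. This small Python sketch shows the arithmetic. The function name `predict_price` is hypothetical, and only the area is used because the model has a single x variable.

```python
def predict_price(area_sqft, intercept=307953, slope=195):
    """Predicted house value from the fitted line Y = intercept + slope * X."""
    return intercept + slope * area_sqft


if __name__ == "__main__":
    # 307953 + 195 * 2360 = 307953 + 460200
    print(predict_price(2360))
```

The lot size and the year built do not enter the prediction, since they are not part of this simple linear regression.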
Plots: How to Check Assumptions?
Mean of Zero: residuals evenly around 0.
Variance Constancy: no trend; residuals between two lines parallel to 0.
Normality: linear line.
Independence: no pattern.
Assumption Check
Residual Plot: mean of 0, constant variance, independence
Normal Probability Plot: normality
Table of Contents
1. Selecting the Data
• Rationale
• Reliability of Source Data
• Limitations of the Data
• Cleaning Up the Data
• General Assumptions before Data Analysis
2. Describing the Data
3. Empirical Rule
4. Identify Outliers
5. Five Number Summary and Z-Scores
6. The Linear Regression Analysis
7. The Regression Scatterplot
8. The Linear Regression Line Fit Plots Analysis
9. The Significance of the Regression Model
10. The Regression Equation
11. The Reliability of the Regression Model
12. The Assumptions of the Regression Model
13. Conclusion
14. Team Information
Works Cited
1. Selecting the Data
Data Set
Dependent Variable (Y): Top Grossing Movies Worldwide (Gross Profit)
Independent Variable (X1): Budget Amount to Create the Movie
Independent Variable (X2): Length of Movie (Minutes)
Independent Variable (X3): Movie Rating Scores (Out of 100)
In this individual report, I will focus on the X1 variable: the correlation between the budget amounts used to create the listed movies and the Y variable, the top grossing movies worldwide.
Movie Gross Profit (Y) Budget (X)
Avatar $2,787,965,087 $237,000,000
Titanic $2,186,772,302 $200,000,000
Star Wars: The Force Awakens $2,068,223,624 $245,000,000
Jurassic World $1,670,400,637 $150,000,000
The Avengers $1,518,812,988 $220,000,000
Furious 7 $1,516,045,911 $190,000,000
Avengers: Age of Ultron $1,405,403,694 $250,000,000
Harry Potter TDH2 $1,341,511,219 $125,000,000
Frozen $1,287,000,000 $150,000,000
Iron Man 3 $1,214,811,252 $200,000,000
Minions $1,159,398,397 $74,000,000
Captain America: Civil War $1,153,304,495 $250,000,000
Transformers: Dark of the Moon $1,123,794,079 $195,000,000
Lord of the Rings: ROTK $1,119,929,521 $94,000,000
Skyfall $1,108,561,013 $200,000,000
Transformers: Age of Extinction $1,104,054,072 $210,000,000
The Dark Knight Rises $1,084,939,099 $250,000,000
Toy Story 3 $1,066,969,703 $200,000,000
POTC: Dead Man's Chest $1,066,179,725 $225,000,000
POTC: On Stranger Tides $1,045,713,802 $250,000,000
Jurassic Park (Original) $1,029,939,903 $63,000,000
Finding Dory $1,027,865,760 $200,000,000
Star Wars: Phantom Menace $1,027,044,677 $115,000,000
Alice in Wonderland $1,025,467,110 $200,000,000
Zootopia $1,023,784,195 $150,000,000
The Hobbit: Unexpected Journey $1,021,103,568 $180,000,000
The Dark Knight $1,004,558,444 $185,000,000
Rogue One $982,998,446 $200,000,000
Harry Potter TPS $974,755,371 $125,000,000
Despicable Me 2 $970,761,885 $76,000,000
The Lion King $968,483,777 $45,000,000
The Jungle Book (2016) $966,550,600 $175,000,000
POTC: At World's End $963,420,425 $300,000,000
Harry Potter: TDH1 $960,283,305 $161,287,500
The Hobbit: DOS $958,366,855 $225,000,000
The Hobbit: BOFA $956,019,788 $250,000,000
Finding Nemo $940,335,536 $94,000,000
Harry Potter: OOTP $939,885,929 $150,000,000
Harry Potter: HBP $934,416,487 $250,000,000
Lord of the Rings: Two Towers $926,047,111 $94,000,000
Shrek 2 $919,838,758 $150,000,000
Harry Potter: GoF $896,911,078 $150,000,000
Spider-Man 3 $890,871,626 $258,000,000
Ice Age: Dawn of the Dinosaurs $886,686,817 $90,000,000
Spectre $880,674,609 $245,000,000
Harry Potter: COS $878,979,634 $100,000,000
Ice Age: Continental Drift $877,244,782 $95,000,000
Secret Life of Pets $875,457,937 $75,000,000
Batman v Superman $873,260,194 $250,000,000
Lord of the Rings: Fellowship $871,835,347 $93,000,000
Rationale
Our group members come from diverse cultural backgrounds, so we brainstormed ideas based on the activities that matter in our lives. We agreed on cinema, and specifically on how much revenue movies generate. We shared common observations and facts about movies, and we were interested in which individual factors affect a movie’s total gross amount. As a group, we decided on 3 main independent variables and checked that they were complete and reliable enough to analyze against the dependent variable, given the large computations involved for this data set: (1) Budget Amount to Create the Movie, (2) Length of Movie (Minutes), and (3) Movie Rating Scores (Out of 100). During the initial development stage of our project, we researched whether the total gross amounts and the independent variables were available on the web. Detailed and credible information was available online, and it gave us the essential data for both the independent variables and the dependent variable.
Reliability of Source Data
As a group, we believe that our data source, http://www.boxofficemojo.com/alltime/world/, is a reliable and credible source of published and updated data because it is owned by IMDb.com, which is in turn owned by Amazon.com. Given the significance and reputation of IMDb, it and affiliates such as Box Office Mojo make every effort to provide accurate reports and practical, useful information, publishing daily updates on movies worldwide with their gross values and other variables, including the estimated budget used to create each movie. Moreover, IMDb (2017), the owner of Box Office Mojo, states that “studios and distributors have started disclosing detailed figures only recently” (p. 1), within the last 15 years of films, and that these figures are reported as estimates.
Limitations of the Data
As for limitations, the data combines all films and provides no breakdown. As such, we are missing some significant details, such as sub-categorization into genres, a specific year range, and adjustment for inflation. Access to this data could have provided more insight into which movie genres are watched the most and how much they gross, which year range offers the best results, and which movies would have the highest gross value after accounting for inflation.
Cleaning Up the Data
To clean up the data, we selected only the top 50 grossing movies rather than 100, because the values are already in the millions and billions, and too many values are confusing, look sloppy in graphs, and are difficult to use with some formulas. Although we were provided the gross values, some were in a different currency, such as euros, so we converted a couple of movies from euros to dollars. This made the gross values consistent in currency and gave us more accurate data.
General Assumptions before Data Analysis
From looking at the data, my general assumption is that the higher the budget set to create a movie, the higher the gross value will be for that movie. For example, more recent movies that involve computer-generated imagery may require a higher budget and produce a higher gross value, as a result of improvements in technology and in the entertainment offered to consumers.
2. Describing the Data
Here are the vital statistics describing the independent variable, the budget amount to create a movie. Most of the budget amounts were within $200,000,000 - $250,000,000, and the majority were below $250,000,000. The shape of the distribution is close to normal. The skewness is small, at -0.219, but the distribution is slightly left-skewed because the median is greater than the mean.
[Figure: frequency distribution of the budget amounts to create a movie, with the mean, median and mode marked]
Here are the vital statistics describing the dependent variable, the movie’s gross profit (top grossing movies worldwide). Most of the top grossing movies were within $1,000,000,000 - $1,200,000,000, and the majority were below $1,600,000,000. The shape of the distribution suggests this is not a normal distribution. The skewness is large, at around 2.87, due to a couple of outliers on the right, and the distribution is right-skewed because the mean is greater than the median.
[Figure: frequency distribution of the top movies' gross profits worldwide (Gross Profit (Y)), with the median and mean marked]
3. Empirical Rule
X                  The range                         Percent of data falling in the range   Satisfy empirical rule? (Yes or No)
(μ - σ, μ + σ)     (108,286,384, 238,085,116)        54%                                    No.
(μ - 2σ, μ + 2σ)   (43,387,018, 302,984,482)         100%                                   Yes.
(μ - 3σ, μ + 3σ)   (-21,512,348, 367,883,848)        100%                                   Yes.
Y                  The range                         Percent of data falling in the range   Satisfy empirical rule? (Yes or No)
(μ - σ, μ + σ)     (763,092,924, 1,496,252,698)      88%                                    Yes.
(μ - 2σ, μ + 2σ)   (396,513,037, 1,862,832,585)      94%                                    No.
(μ - 3σ, μ + 3σ)   (29,933,150, 2,229,412,472)       98%                                    No.
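The ranges in the table are the mean plus or minus 1, 2 and 3 standard deviations. The Python sketch below shows the arithmetic using the rounded mean and standard deviation reported for Y in this report. The function names `empirical_ranges` and `percent_within` are hypothetical.

```python
def empirical_ranges(mean, sd):
    """Return (k, lower, upper) for k = 1, 2, 3 standard deviations."""
    return [(k, mean - k * sd, mean + k * sd) for k in (1, 2, 3)]


def percent_within(values, lower, upper):
    """Percentage of values that fall inside [lower, upper]."""
    inside = sum(1 for v in values if lower <= v <= upper)
    return 100.0 * inside / len(values)


if __name__ == "__main__":
    # Mean and standard deviation of Y as reported in the summary table
    for k, lower, upper in empirical_ranges(1_129_672_811, 366_579_887):
        print(k, lower, upper)
```

The empirical rule expects roughly 68%, 95% and 99.7% of the data within 1, 2 and 3 standard deviations, so the observed percentages from `percent_within` are compared against those values.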
4. Identify Outliers
For the x-variable, the budget amount to create a movie, no outliers were found because all of the z-scores were below 2 in absolute value. For the y-variable, the top grossing movies worldwide, one “extreme outlier” and two “normal outliers” were found. Titanic and Star Wars: The Force Awakens are considered normal outliers, with z-scores of 2.88 and 2.56, because their z-scores are greater than or equal to 2 and less than 3. Avatar is considered an extreme outlier, with a z-score of 4.52, because its z-score is greater than or equal to 3.
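The z-score rule used above can be sketched in Python. The thresholds (2 and 3) come from the text. Applying them to the absolute value of the z-score is an assumption made here so that unusually low values are also flagged, and the function name `classify` is hypothetical.

```python
def classify(value, mean, sd):
    """Return (z-score, label) using the 2 and 3 standard deviation cut-offs."""
    z = (value - mean) / sd
    if abs(z) >= 3:
        label = "extreme outlier"
    elif abs(z) >= 2:
        label = "normal outlier"
    else:
        label = "not an outlier"
    return z, label


if __name__ == "__main__":
    # Mean and standard deviation of Y as reported in the summary table
    mean_y, sd_y = 1_129_672_811, 366_579_887
    movies = {
        "Avatar": 2_787_965_087,
        "Titanic": 2_186_772_302,
        "Star Wars: The Force Awakens": 2_068_223_624,
    }
    for title, gross in movies.items():
        z, label = classify(gross, mean_y, sd_y)
        print(title, round(z, 2), label)
```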
5. Five Number Summary and Z-Scores
                     x              z-scores   y                z-scores
Mean                 173,185,750    0          1,129,672,811    0
Median               187,500,000    .2206      1,022,443,882    -.2925
Mode                 200,000,000    .4132      N/A              N/A
Standard Deviation   64,899,366     N/A        366,579,887      N/A
Minimum              45,000,000     -1.975     871,835,347      -.7034
25th Percentile      117,500,000    -.8580     939,998,331      -.5174
75th Percentile      225,000,000    .7984      1,122,827,940    -.0187
Max                  300,000,000    1.954      2,787,965,087    4.524
6. The Linear Regression Analysis
7. The Linear Regression Scatterplot
8. The Linear Regression Line Fit Plots Analysis
As we can see, the “Line Fit Plot” doesn’t display a strong linear relationship between the budget and the top grossing movies worldwide. The x variable, the budget, has a positive correlation with the y variable, but the data points are too spread out to fit closely to a straight line. This indicates that there is a slight relationship between the two, but not a very strong one.
[Figure: scatterplot of gross profit against budget]
9. The Significance of the Regression Model
Based on the Simple Linear Regression output, the model isn’t significant: the Significance F is 0.078175279, which is greater than 0.05, so the model is classified as insignificant.
10. The Regression Equation
Based on the coefficients in the Simple Linear Regression output, the Intercept and Budget (x), the mathematical equation of this model is Y = 883,710,603 + 1.420221979x. For every one-dollar increase in the budget, the predicted gross amount for an upcoming film increases by about $1.42. For example, if the budget for an upcoming film were $250,000,000, the value of y would be $1,238,766,098. In other words, the model predicts that the film would gross around $1,238,766,098 with a budget of $250,000,000.
[Figure: Budget (X) Line Fit Plot, showing Gross Profit (Y) and Predicted Gross Profit (Y)]
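The worked example above is a direct substitution into the regression equation. A minimal Python sketch of that arithmetic follows, using the intercept and slope quoted in the text. The function name `predict_gross` is hypothetical.

```python
def predict_gross(budget, intercept=883_710_603, slope=1.420221979):
    """Predicted worldwide gross from Y = intercept + slope * budget."""
    return intercept + slope * budget


if __name__ == "__main__":
    # 883,710,603 + 1.420221979 * 250,000,000
    print(round(predict_gross(250_000_000)))
```

Because the model is not significant and explains little of the variation, a prediction like this should be read as an illustration of the equation rather than a dependable forecast.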
11. The Reliability of the Regression Model
Whether the model is a reliable predictor of y is judged from R square in the Simple Linear Regression output. In this model, R square is 0.063229234, or about 6.32%, which is very low for the amount of variation explained by the regression. Unfortunately, this model is not a very reliable predictor of y, as only about 6.32% of the variation is explained.
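In simple linear regression, the correlation can be recovered from R square by taking the square root and giving it the sign of the slope. The short Python sketch below applies this to the R square reported above. The function name `correlation_from_r_squared` is hypothetical.

```python
from math import copysign, sqrt


def correlation_from_r_squared(r_squared, slope):
    """Correlation r for simple linear regression: sqrt(R^2) with the slope's sign."""
    return copysign(sqrt(r_squared), slope)


if __name__ == "__main__":
    # R square and slope as reported in the regression output
    print(correlation_from_r_squared(0.063229234, 1.420221979))
```

The result is a correlation of roughly 0.25, which is consistent with the weak positive relationship described for the line fit plot.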
12. The Assumptions of the Regression Model
Assumption Check: Yes or No? Why or why not?
Mean of 0: No. Based on the residual plot, most of the data isn’t evenly spread around 0, with exceptions between 100,000,000 and 250,000,000, and a lot of the data is spread far from the mean of 0.
Constant Variability: No. Based on the residual plot, the variance is not constant. There is a clear triangular/cone-like pattern, which shows that the data doesn’t lie between two lines parallel to 0.
Independence: Yes.
Normality: No. As can be observed from the normal probability plot, the points do not form a straight line because there is a tail going upwards; therefore, normality isn’t satisfied.
13. Conclusion
Based on the data analysis of the plots and the output, it is evident that budget explains little of the variation in the top grossing movies worldwide, given that R square is only about 6.32%. This is very low. More importantly, our Significance F, which represents the significance of our model, was 0.078175279, which is greater than 0.05 and therefore renders our model insignificant. This suggests that the budget does not significantly affect the worldwide gross of these movies. Overall, even though our output represents an insignificant model, I learned that not every pair of variables that seems related, such as a film’s budget and a movie’s gross profit, will have a high percentage of variance explained or a strong linear relationship. Contrary to what might seem like conventional wisdom, I also learned that most of the values given for both gross profits and budgets are estimates, so the figures presented to credible sources can differ depending on the public response to the movie.
As for further improvement of my model, I would like to use a larger dataset, such as 100 data points rather than 50, which might give a better picture of the relationship between budget and gross profit. I would also want to analyze my dependent variable more carefully by adjusting for inflation, since movies from the past are underrepresented among the top grossing movies because of inflation. The value of our currency has changed drastically over the last eighty years, and accounting for this could give us more accurate data for our model. Overall, this individual evaluation of the dataset and plots gave me a better perspective on data analysis, because I can see the figures on a smaller scale in histograms, scatterplots, and data charts rather than just as values on a website.
14. Team Information
Team Member X Variable
Shane Cornfield Budget Amount to Create the Movie
Dana Saxton Budget Amount to Create the Movie
Drew Thoman Length of Movie (Minutes)
Zekun Huang Movie Rating Scores (Out of 100)
Works Cited
All Time Worldwide Box Office Grosses. (2017, February 9). Retrieved February 9, 2017, from <http://www.boxofficemojo.com/alltime/world/>.
All Time Worldwide Box Office Profiles. (2017, February 9). Retrieved February 9, 2017, from <http://www.boxofficemojo.com/movies/?id=moviename.htm>.
Why are your budget/gross figures for some movies different than those listed by another source? Why do you have budget/gross data on some movies and not others? (2017). Retrieved February 9, 2017, from <http://www.imdb.com/help/show_leaf?boxofficedifferent>.
Credits
Image of Cinema on front cover: Hayden Dingman from Pcworld.com:
https://i.ytimg.com/vi/5ar91JNLdR4/maxresdefault.jpg
BA240 Individual Project Report
SUBMITTED HARD COPY AT THE BEGINNING OF CLASS
1. THIS PROJECT IS PRESENTED IN WORD FORMAT SO
YOU CAN USE THE TABLES INCLUDED HERE.
2. The instructor reserves the right to adjust individual scores.
3. Individual or team projects that are just Excel printouts will
receive 0 points.
4. Excel instructions are contained under “Project” link on
Canvas.
INDIVIDUAL Project
Each team member is to choose one of the independent variables in the data sets to analyze along with the dependent data set. All team members will have the same dependent variable (y) but a different independent variable (x) (minimum 3 variables in a group). Review the Excel videos and linear regression before you do your own. Each variable should contain at least 50 data points. Number all your answers in your submission. Although you are sharing data, you must complete the analysis and interpretation individually.
1. Introduction:
· Show the data and explain why you selected the data and how
the data was collected.
· Cite websites and evaluate the credibility of your sources.
· List any limitations of the data.
· Describe any steps you took to clean up the data ( if you have
missing data)
· Make assumptions before doing the data analysis
· This section may be reused in your team project
· Write the project more like a report.
2. Describing the data:
· Plot histograms of your x variable and the y variable using
reasonable intervals for each set. (There will be two
histograms.)
· Label the graph correctly
· Comment on the shape of the distribution (skewness).
3. Analyze whether the x and y distributions satisfy the empirical rule (Yes or No, explain why). Show details such as the ranges within 1, 2 and 3 standard deviations and the corresponding true percentage of data falling in each range.
4. Identify and list all outliers in each distribution (Both X and
Y) using appropriate methodology and explain why they are
outliers. If you have more than 10 outliers in either distribution
(X or Y) in your dataset, you can just list out the top 10
outliers.
5. Calculate the mean, median, and mode and show where they are on the histogram graph (you can either edit the graph in Word, Excel or PowerPoint, or you can mark them by pen on the graph). Complete the following table for the five number summary (minimum, Q1, median, Q3, maximum) and the z-scores of each.
                     x     z-scores   y     z-scores
Mean
Median
Mode
Standard Deviation         NA               NA
Min
25th percentile
75th percentile
Max
6. The Regression: Show the output and all the plots from Excel
from Simple Linear Regression analysis. You can copy and
paste from Excel output and plots.
7. The Regression: Create a scatter plot of your independent
variable against the dependent variable using Excel. Make sure
your dependent variable is y and your independent is x on the
graph. Write a paragraph about your finding in the scatter plot.
8. The Regression: Display the “Line Fit Plots” from the Simple
Linear Regression output. Is there a linear relationship between
these two variables from the plot? Explain why?
9. The Regression: Is this regression model important/significant? Why or why not?
10. The Regression: Are all parameters important/significant?
Why or why not?
11. The Regression: Show the mathematical equation of this model. Please give two examples after you have the equation. Select any two meaningful values of X, predict the value of Y for each, and interpret the equation in words.
12. The Regression: Is this model a reliable predictor of y?
Explain how much of variation is explained. Do you think there
is a strong correlation and explain why or why not.
13. The Regression: Assumption check
Write a paragraph for the 4 assumption checks and explain why the model satisfies or violates each assumption.
14. Summary: Write at least one paragraph including: summary
of your findings in the plots, numerical measurements and data
analysis, what you have learned from the project, and any
comments you have or any further improvement of your model.
15. List all your team members’ names and their corresponding
variables.
16. Appendix if needed