Data AnalysisInstructions of Excel 2016By Yancy Chow.docx

Data Analysis
Instructions of Excel 2016
By Yancy Chow
Data Analysis: House Example
House Data: 50 houses
Two variables: Price (Y) and Area (X)
Excel: How to Add the “Data Analysis” Add-in
Setting Up Excel for Statistical Analysis:

Find “Add-ins” on the left side and click on it.
After that, click on “Go”.
Then find “Analysis ToolPak” and click on it.
Then click on “OK”.
Note: if you use an Apple computer, the “Add-ins” option is under “Tools”.

Now you have successfully added “Data Analysis”.
Click on “Data” at the top and you can see the “Data Analysis” icon.
Excel: How to Calculate
Mean, Median and Mode?
Open your data in Excel, or type your data into Excel by column.
For example, suppose we want to calculate the mean, median and mode
for the variable “Price” in this data. First select “Data”, then
click on “Data Analysis”.
After clicking on “Data Analysis”, scroll until you find
“Descriptive Statistics” in the Analysis Tools panel and select it.
Then click on “OK”.

First, fill in the “Input Range”.
You can either type the range into the box or use the mouse to
select the data values in the column you are interested in. In this
example, we select all 50 numbers in the first column. Do not select
the label row, such as the “Price” row.
After selecting the “Input Range”, select “Output Range” and choose
where you want the output to appear.

Then select “Summary statistics”. Click on “OK” and you will
have the data analysis results.
The results include the mean, median, mode, standard deviation,
sample variance, range, minimum and maximum.
Note that Excel reports only one mode. You need to check
for yourself whether there is more than one.
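One way to check for more than one mode outside Excel is a few lines of Python. This is only a sketch: the `prices` list is made up for illustration and is not the 50-house data set, and the standard-library `statistics` module stands in for the “Descriptive Statistics” tool.

```python
from statistics import mean, median, multimode

# Hypothetical prices for illustration only; not the 50-house data set.
prices = [850000, 910000, 910000, 975000, 975000, 1020000]

print(mean(prices))       # arithmetic mean
print(median(prices))     # average of the two middle values for an even count
print(multimode(prices))  # every most-frequent value, so ties are visible
```

Here two prices are tied for most frequent, which is exactly the case where a single reported mode would hide the second one.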
Excel: How to Calculate the
first Quartile (Q1)?
Q1: Choose an empty cell and enter:
“=quartile(data range, 1)”
Then press “Enter” and you will get the first quartile (Q1) result.
In this example, A2:A51 is the range of the data.
Excel: How to Calculate the
third Quartile (Q3)?
Q3: Choose an empty cell and enter:
“=quartile(data range, 3)”
Then press “Enter” and you will get the third quartile (Q3) result.
In this example, A2:A51 is the range of the data.
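To cross-check the quartile formulas, the same calculation can be sketched in Python. The assumption here is that Excel's QUARTILE interpolates between neighbouring values in the “inclusive” way, which is what `statistics.quantiles(..., method="inclusive")` does; the `data` list is made up for illustration.

```python
from statistics import quantiles

# Hypothetical values for illustration only; not the house data.
data = [1, 2, 3, 4, 5, 6, 7, 8]

# n=4 asks for the three cut points that split the data into quarters.
# method="inclusive" is assumed to correspond to =QUARTILE(range, 1) and
# =QUARTILE(range, 3).
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
print(q1, q3)  # → 2.75 6.25
```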
Excel: How to Draw Histograms?
First, check the output from “Descriptive Statistics” in
“Data Analysis”. In this house data, the mean is
$956,396.66, the minimum is $729,870 and the maximum is $1,190,000. A
reasonable interval is $50,000. So create a new column of “Bins”
that runs from $700,000 to $1,200,000 in steps of $50,000.

Once you have created reasonable bins, select “Data” and then
“Data Analysis”. Find “Histogram” and click on “OK”.
Select the “Input Range”, the 50 house prices.
Select the “Bin Range”, the column you created.
Choose any empty space as your “Output Range”.
Check “Cumulative Percentage” and “Chart Output”.

Here is the output from the “Data Analysis”: Frequency Table
and Histogram!
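The frequency table behind the histogram can be sketched in Python to see how values are assigned to bins. This assumes each bin value is an upper limit (a price is counted in the first bin whose value is greater than or equal to it) and that anything above the last bin goes into a final “More” slot. The function names are hypothetical, and only the minimum and maximum in `prices` come from the house data; the other values are made up.

```python
from bisect import bisect_left

def histogram_counts(values, bins):
    """Count values per bin; bins are ascending upper limits, last slot is 'More'."""
    counts = [0] * (len(bins) + 1)
    for v in values:
        # Index of the first bin limit that is >= v; len(bins) means "More".
        counts[bisect_left(bins, v)] += 1
    return counts

def cumulative_percent(counts):
    """Running total of the counts as a percentage of all values."""
    total = sum(counts)
    running, result = 0, []
    for c in counts:
        running += c
        result.append(100 * running / total)
    return result

bins = list(range(700_000, 1_200_001, 50_000))  # $700,000 to $1,200,000 by $50,000
prices = [729_870, 750_000, 750_001, 956_000, 1_190_000]  # mostly hypothetical

counts = histogram_counts(prices, bins)
print(counts)
print(cumulative_percent(counts))
```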
Now you can edit the text color, size and fill color if you want.
You can also change the colors under “Page Layout”.

You can also add effects to the bars if you want.
Double-click on one bar and select the “Format” tool. You can
set the shape fill, the shape outline and the shape effects to
make the graph look better.
Usually, a histogram has no gaps between its bars. To remove the gaps,
select the graph, right-click and choose “Format
Plot Area”.
Then choose from “Plot Area Options”--

Then set the “Gap Width” to “0%”.
Now there is no gap in the histogram!
Excel: How to Draw Bar Charts (Qualitative data)?
Type the categories of the qualitative data and the
corresponding frequencies (how many in each class).
Select the data (class and frequency), then choose “2-D
Column” or “3-D Column”.

Double-click each bar and choose “Format” to set the color
and effects.
Excel: How to Draw Scatter Plot?
Before doing the data analysis, you may want to look at a scatter
plot to see whether there is a relationship between Y and
X. Note that you should put X in the first column and Y in the
second column. Click “Insert” and then choose “Scatter chart”
to make the plot.
Excel: How to Do Simple Linear Regression?
Select “Data” from the toolbar, click on “Data Analysis”, find
“Regression” in the dialog and click on “OK”.
Select the “Input Y Range” from the variable “Price” column.
Select the “Input X Range” from the variable “Area” column.
Set the “Output Range” to somewhere empty.
Check “Residual Plots”, “Line Fit Plots” and “Normal
Probability Plots”.

Then click on “OK”.
Data Analysis Output: Excel 2016
Correlation, or strength of the linear relationship between x and y.
0.913 is strong.
R Square: the amount of variation explained by the regression.
Is the model a reliable predictor of y?
83.3% of variation is explained.
Y-intercept
Slope
If the p-value is smaller than 0.05, then the parameter is
significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is
significant/important.
Standard Error of the Regression: used in prediction.

The equation will be: Y = 307953 + 195X
Chapter 11
Simple Linear Regression
(Individual Project)
By Yancy Chow
What is Simple Linear Regression?
House 1

House 2
Which one do you think is more expensive? Why?
Simple linear regression can be used to find relationships between two variables.
Examples:
Gene mapping for cancer research
Stock market investment analysis
Sales forecasting
Product quality control
Income demographics

Simple linear regression
one-to-one
Dependent Variable (y)
Ex. House price
Independent Variable (x)
Ex. Square feet
House Price = ?? * Square feet + Random Error
Procedure: Model
We always assume that the mean value of the random error
equals 0.
Taking the average……
That is, the model will be:
Figure of the Model
How to fit the Model?
----The Least Squares Approach

How to interpret?
For every unit increase in x, the mean of y is estimated to
increase by $\hat{\beta}_1$ units.
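The least squares approach named above can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual textbook formulas: slope = SS_xy / SS_xx and intercept = mean(y) − slope × mean(x). The function name `least_squares` and the four sample points are made up for illustration and are not the house data.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # SS_xy: how x and y vary together; SS_xx: how x varies on its own.
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical points that lie exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = least_squares(xs, ys)
print(b0, b1)  # → 1.0 2.0
```

With these points, every unit increase in x raises the estimated mean of y by 2 units, which is the slope interpretation given above.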
Bellevue College
Houses near Bellevue College

Home value (Y) as a function of
square footage (X)
Example
Data Analysis Output: Excel 2016
Correlation, or strength of the linear relationship between x and y.
0.913 is strong.
R Square: the amount of variation explained by the regression.
Is the model a reliable predictor of y?
83.3% of variation is explained.
Y-intercept
Slope
If the p-value is smaller than 0.05, then the parameter is
significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is
significant/important.
Standard Error of the Regression: used in prediction.
The equation will be: Y = 307953 + 195X
Recently sold house near Bellevue College:
Area: 2,360 sq ft
Lot size: 8,162 sq ft
Year built: 1976
What is the house value from our model?
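The question can be worked by plugging the area into the fitted equation Y = 307953 + 195X. The sketch below does only that; the names `INTERCEPT`, `SLOPE` and `predict_price` are chosen here for illustration. Lot size and year built are not part of this one-variable model, so they are not used.

```python
# Coefficients read from the equation Y = 307953 + 195X.
INTERCEPT = 307953  # estimated price when the area is 0
SLOPE = 195         # estimated price increase per additional square foot

def predict_price(area_sqft):
    """Estimated house value for a given living area in square feet."""
    return INTERCEPT + SLOPE * area_sqft

print(predict_price(2360))  # → 768153
```

So the model estimates the value of the 2,360 sq ft house at about $768,153.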
Plots: How to Check Assumptions?

Mean of Zero: evenly around 0, no trend
Variance Constancy: two parallel lines to 0
Normality: linear line
Independence: no pattern
Assumption Check
Residual Plot: mean of 0, constant variance, independence
Normal Probability Plot: normality

Plots
[Figure: X Variable 1 Line Fit Plot; Y and Predicted Y plotted against X Variable 1]
[Figure: X Variable 1 Residual Plot; Residuals plotted against X Variable 1]
[Figure: Normal Probability Plot; Y plotted against Sample Percentile]
$y = \beta_0 + \beta_1 x$
or $E(y) = \beta_0 + \beta_1 x$
slope: $\hat{\beta}_1 = \frac{SS_{xy}}{SS_{xx}}$
intercept: $\hat{\beta}_0 = \frac{\sum y_i}{n} - \hat{\beta}_1 \frac{\sum x_i}{n}$
[Figure: Temp Residual Plot; Residuals plotted against Temp]
[Figure: Fuel plotted against Sample Percentile]
The Cinematic Study
Factors Affecting Top Grossing Movies Worldwide
Table of Contents
1. Selecting the Data
• Rationale
• Reliability of Source Data
• Limitations of the Data
• Cleaning Up the Data
• General Assumptions before Data Analysis
2. Describing the Data
3. Empirical Rule
4. Identify Outliers
5. Five Number Summary and Z-Scores
6. The Linear Regression Analysis
7. The Regression Scatterplot
8. The Linear Regression Line Fit Plots Analysis
9. The Significance of the Regression Model
10. The Regression Equation
11. The Reliability of the Regression Model
12. The Assumptions of the Regression Model
13. Conclusion
14. Team Information
Works Cited
1. Selecting the Data

Data Set
Dependent Variable (Y): Top Grossing Movies Worldwide (Gross Profit)
Independent Variable (X1): Budget Amount to Create the Movie
Independent Variable (X2): Length of Movie (Minutes)
Independent Variable (X3): Movie Rating Scores (Out of 100)
In this individual report, I focus on the X1 variable: the correlation between the budget used to create each movie and the Y variable, the top grossing movies worldwide.
Movie Gross Profit (Y) Budget (X)
Avatar $2,787,965,087 $237,000,000
Titanic $2,186,772,302 $200,000,000
Star Wars: The Force Awakens $2,068,223,624 $245,000,000
Jurassic World $1,670,400,637 $150,000,000
The Avengers $1,518,812,988 $220,000,000
Furious 7 $1,516,045,911 $190,000,000
Avengers: Age of Ultron $1,405,403,694 $250,000,000
Harry Potter TDH2 $1,341,511,219 $125,000,000
Frozen $1,287,000,000 $150,000,000
Iron Man 3 $1,214,811,252 $200,000,000
Minions $1,159,398,397 $74,000,000
Captain America: Civil War $1,153,304,495 $250,000,000

Transformers: Dark of the Moon $1,123,794,079 $195,000,000
Lord of the Rings: ROTK $1,119,929,521 $94,000,000
Skyfall $1,108,561,013 $200,000,000
Transformers: Age of Extinction $1,104,054,072 $210,000,000
The Dark Knight Rises $1,084,939,099 $250,000,000
Toy Story 3 $1,066,969,703 $200,000,000
POTC: Dead Man's Chest $1,066,179,725 $225,000,000
POTC: On Stranger Tides $1,045,713,802 $250,000,000
Jurassic Park (Original) $1,029,939,903 $63,000,000
Finding Dory $1,027,865,760 $200,000,000
Star Wars: Phantom Menace $1,027,044,677 $115,000,000
Alice in Wonderland $1,025,467,110 $200,000,000
Zootopia $1,023,784,195 $150,000,000
The Hobbit: Unexpected Journey $1,021,103,568 $180,000,000
The Dark Knight $1,004,558,444 $185,000,000
Rogue One $982,998,446 $200,000,000
Harry Potter TPS $974,755,371 $125,000,000
Despicable Me 2 $970,761,885 $76,000,000
The Lion King $968,483,777 $45,000,000
The Jungle Book (2016) $966,550,600 $175,000,000
POTC: At World's End $963,420,425 $300,000,000
Harry Potter: TDH1 $960,283,305 $161,287,500
The Hobbit: DOS $958,366,855 $225,000,000
The Hobbit: BOFA $956,019,788 $250,000,000
Finding Nemo $940,335,536 $94,000,000
Harry Potter: OOTP $939,885,929 $150,000,000
Harry Potter: HBP $934,416,487 $250,000,000
Lord of the Rings: Two Towers $926,047,111 $94,000,000
Shrek 2 $919,838,758 $150,000,000
Harry Potter: GoF $896,911,078 $150,000,000

Spider-Man 3 $890,871,626 $258,000,000
Ice Age: Dawn of the Dinosaurs $886,686,817 $90,000,000
Spectre $880,674,609 $245,000,000
Harry Potter: COS $878,979,634 $100,000,000
Ice Age: Continental Drift $877,244,782 $95,000,000
Secret Life of Pets $875,457,937 $75,000,000
Batman v Superman $873,260,194 $250,000,000
Lord of the Rings: Fellowship $871,835,347 $93,000,000
Rationale
Our group members share backgrounds in a range of cultural activities. Given how significant those activities are in our lives, we brainstormed ideas and agreed on cinema and its impact on consumers, measured by how much revenue movies generate. We shared common observations and facts about movies, and we were interested in which individual factors affect a movie’s total gross amount. As a group, we decided on three main independent variables (1. Budget Amount to Create the Movie, 2. Length of Movie (Minutes), 3. Movie Rating Scores (Out of 100)) and checked each one for completeness and integrity against the dependent variable, given the large computations involved for this data set. During the initial development stage of our project, we researched whether the total gross amounts and the independent variables were available to us on the web. The detail and credibility we needed were available online, which gave us the essential information for both the independent variables and the dependent variable.
Reliability of Source Data
As a group, we believe that our data source, http://www.boxofficemojo.com/alltime/world/, is a reliable and credible source of published and updated data because of its owner, IMDb.com, which is in turn owned by Amazon.com. Given IMDb’s significance and reputation, it and its affiliates such as Box Office Mojo make every effort to provide accurate reports, reliable sources of practical and useful information, and credibility, publishing daily updates on movies worldwide with their gross values and other variables, including the estimated budgets used to create each movie. Moreover, IMDb (2017), the owner of Box Office Mojo, states that “studios and distributors have started disclosing detailed figures only recently” (p. 1), within about the last 15 years of films, and that these figures are reported as estimates.
Limitations of the Data
As for limitations, the data combines all films and provides no breakdown. As such, we are missing significant details such as sub-categorization into genres, a specific year range, and adjustment for inflation. Access to this data could have provided more insight into which movie genres are watched the most and how much they gross, which year range offers the best results, and which movies would have the highest gross value after accounting for inflation.
Cleaning Up the Data
To clean up the data, we selected only the top 50 grossing movies rather than 100, because the values are already in the millions and billions, and too many values would be confusing, look sloppy in graphs, and be difficult to work with in some formulas. Although we were provided the gross values, some were in a different currency such as euros, so we converted a couple of movies from euros to dollars. This made the gross values consistent in currency and gave us more accurate data.
General Assumptions before Data Analysis

From looking at the data, my general assumption is that the higher the budget set to create a movie, the higher the gross value will be for that movie. For example, more recent movies that involve computer-generated imagery may require a higher budget and, as a result of improved technology and entertainment value for consumers, return a higher gross value.
2. Describing the Data
Here are the vital statistics describing the independent variable, the budget amount to create a movie. Most of the budget amounts were within $200,000,000 - $250,000,000, and a majority were below $250,000,000. The shape of the distribution suggests it is close to a normal distribution: the skewness is small at -0.219, but the distribution is still classified as left-skewed because the median is greater than the mean.
[Figure: Frequency distribution of the Budget amounts to create a movie; Frequency of Budget and cumulative percentage plotted against Budget (X), with the mean, median and mode marked]
Here are the vital statistics describing the dependent variable, the movie’s gross profit (top grossing movies worldwide). Most of the top grossing movies were within $1,000,000,000 - $1,200,000,000, and a majority were below $1,600,000,000. The shape of the distribution suggests this is not a normal distribution: the skewness is larger, at around 2.87, because of a couple of outliers on the right, and the distribution is classified as right-skewed because the mean is greater than the median.
[Figure: Frequency distribution of the top movies' Gross Profits worldwide; Frequency of Gross Profit and cumulative percentage plotted against Gross Profit (Y), with the median and mean marked]
3. Empirical Rule
X
Interval              The range                           Percent of data in the range   Satisfy empirical rule? (Yes or No)
(μ − σ, μ + σ)        (108,286,384, 238,085,116)          54%                            No.
(μ − 2σ, μ + 2σ)      (43,387,018, 302,984,482)           100%                           Yes.
(μ − 3σ, μ + 3σ)      (-21,512,348, 367,883,848)          100%                           Yes.

Y
Interval              The range                           Percent of data in the range   Satisfy empirical rule? (Yes or No)
(μ − σ, μ + σ)        (763,092,924, 1,496,252,698)        88%                            Yes.
(μ − 2σ, μ + 2σ)      (396,513,037, 1,862,832,585)        94%                            No.
(μ − 3σ, μ + 3σ)      (29,933,150, 2,229,412,472)         98%                            No.
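The ranges in these tables follow directly from the mean and standard deviation, and can be re-derived with a short Python sketch. The function names are hypothetical; the mean and standard deviation are the budget (x) values reported in this report, and the benchmark percentages the rule expects are roughly 68%, 95% and 99.7%.

```python
def empirical_rule_ranges(mean, sd):
    """Return the (low, high) interval for 1, 2 and 3 standard deviations."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

def percent_within(values, low, high):
    """Percentage of values that fall strictly between low and high."""
    inside = sum(1 for v in values if low < v < high)
    return 100 * inside / len(values)

# Mean and standard deviation of the budget (x) as reported in this report.
ranges = empirical_rule_ranges(173_185_750, 64_899_366)
print(ranges[1])  # → (108286384, 238085116)
print(ranges[2])  # → (43387018, 302984482)
print(ranges[3])  # → (-21512348, 367883848)
```

Passing the 50 budget values and each range to `percent_within` would give the percentage column to compare against the benchmarks.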
4. Identify Outliers
For the x-variable, the budget amount to create a movie, no outliers were found within the data distribution because all of the z-scores were below 2 in absolute value. For the y-variable, the top grossing movies worldwide, one “extreme outlier” and two “normal outliers” were found. Titanic and Star Wars: The Force Awakens are considered normal outliers, with z-scores of 2.88 and 2.56, because their z-scores are greater than or equal to 2 and less than 3. Avatar is considered an extreme outlier, with a z-score of 4.52, because its z-score is greater than or equal to 3.
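The z-score rule used here can be sketched in Python. The thresholds (2 and 3) and the mean and standard deviation of y are taken from this report; the function names are hypothetical, and applying the thresholds to the absolute value of z, so that unusually low values would also be flagged, is an assumption.

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

def classify(z):
    """Label a z-score using the report's thresholds of 2 and 3."""
    size = abs(z)  # assumption: low-side values are treated the same way
    if size >= 3:
        return "extreme outlier"
    if size >= 2:
        return "normal outlier"
    return "not an outlier"

# Mean and standard deviation of gross profit (y) as reported in this report.
Y_MEAN, Y_SD = 1_129_672_811, 366_579_887

grosses = {
    "Avatar": 2_787_965_087,
    "Titanic": 2_186_772_302,
    "Star Wars: The Force Awakens": 2_068_223_624,
    "Jurassic World": 1_670_400_637,
}
for title, gross in grosses.items():
    z = z_score(gross, Y_MEAN, Y_SD)
    print(f"{title}: z = {z:.2f} ({classify(z)})")
```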
5. Five Number Summary and Z-Scores
                     x              z-scores   y                z-scores
Mean                 173,185,750    0          1,129,672,811    0
Median               187,500,000    .2206      1,022,443,882    -.2925
Mode                 200,000,000    .4132      N/A              N/A
Standard Deviation   64,899,366     N/A        366,579,887      N/A
Minimum              45,000,000     -1.975     871,835,347      -.7034
25th Percentile      117,500,000    -.8580     939,998,331      -.5174
75th Percentile      225,000,000    .7984      1,122,827,940    -.0187
Max                  300,000,000    1.954      2,787,965,087    4.524
6. The Linear Regression Analysis
[Figure: Budget (X) Residual Plot; Residuals plotted against Budget (X)]
[Figure: Top Grossing Movies Worldwide (Y) plotted against Sample Percentile]
[Figure: Budget (X) Line Fit Plot; Gross Profit (Y) and Predicted Gross Profit (Y) plotted against Budget (X)]
7. The Linear Regression Scatterplot
[Figure: Budget vs. Top Grossing Movies Worldwide; Gross Profit (Y) plotted against Budget (X)]
8. The Linear Regression Line Fit Plots Analysis
As we can see, the “Line Fit Plot” doesn’t display a clear linear relationship between the budget and the top grossing movies worldwide. The x variable, the budget, has a positive correlation with the y variable, but the data points are too spread out to fit a straight line. This indicates that there is a slight relationship between the two, but not a very strong one.
9. The Significance of the Regression Model
Based on the Simple Linear Regression output, the model isn’t significant: Significance F is 0.078175279, which is greater than 0.05, and this classifies the model as insignificant.
[Figure: Budget (X) Line Fit Plot; Gross Profit (Y) and Predicted Gross Profit (Y) plotted against Budget (X)]
10. The Regression Equation
Based on the coefficients in the Simple Linear Regression output, the Intercept and Budget (x), the mathematical equation of this model is Y = 883,710,603 + 1.420221979x. For every one-unit increase in the budget, the predicted gross amount for an upcoming film increases by 1.420. For example, if the budget for an upcoming film were $250,000,000, the value of y would be $1,238,766,098. In other words, the model predicts that a film with a budget of $250,000,000 would gross around $1,238,766,098.
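The worked example can be reproduced with a short Python sketch that evaluates the reported equation. The coefficients are the ones stated in this section; the names `INTERCEPT`, `SLOPE` and `predict_gross` are chosen here for illustration.

```python
# Coefficients from the reported equation Y = 883,710,603 + 1.420221979x.
INTERCEPT = 883_710_603
SLOPE = 1.420221979  # predicted change in gross per one-dollar change in budget

def predict_gross(budget):
    """Predicted worldwide gross (in dollars) for a given budget (in dollars)."""
    return INTERCEPT + SLOPE * budget

print(round(predict_gross(250_000_000)))  # → 1238766098
```

Any other meaningful budget can be passed to `predict_gross` in the same way to produce a second example.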
11. The Reliability of the Regression Model
Whether the model is a reliable predictor of y is judged from R square in the Simple Linear Regression output. In this model, R square is 0.063229234, or 6.322%, which is very low for the amount of variation explained by the regression. Unfortunately, this model is not a very reliable predictor of y, as only 6.322% of the variation is explained.
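To connect R square with the strength of correlation, the sketch below takes the square root of R square. It relies on the fact that, in simple linear regression with one x variable, R square is the square of the correlation coefficient; the sign of the correlation follows the sign of the slope. The two R square values are the ones quoted in this document (this model, and the house example).

```python
import math

# R square values quoted in this document.
movie_r_squared = 0.063229234  # budget vs. worldwide gross
house_r_squared = 0.833        # area vs. house price

for label, r_squared in [("movies", movie_r_squared), ("houses", house_r_squared)]:
    r = math.sqrt(r_squared)  # magnitude of the correlation coefficient
    print(f"{label}: R square = {r_squared:.3f}, |r| = {r:.3f}")
```

The house example recovers the correlation of about 0.913 described as strong, while the movie model's correlation is far weaker.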
12. The Assumptions of the Regression Model

Assumption Check: Yes or No? Why or why not?
Mean of 0: No. Based on the residual plot, most of the data isn’t evenly around 0, with exceptions from 100,000,000-250,000,000, and a lot of the data is spread out from the mean of 0.
Constant Variability: No. Based on the residual plot, the variance is not constant. There is a clear triangular/cone-like pattern, which suggests that the data isn’t between two lines parallel to 0.
Independence: Yes.
Normality: No. As can be observed from the normal probability plot, the points do not form a straight line because there is a tail going upwards; therefore, normality isn’t satisfied.
13. Conclusion

Based on the general data analysis of the plots and the output, it is evident that the relationship between budget and top grossing movies worldwide is not well explained, given that R square is only 6.322%. With only 6.322% of the variation explained by the regression, the explanatory power of the model is very low. More importantly, our Significance F, which represents the significance of our model, is 0.078175279, which is greater than 0.05 and therefore renders our model insignificant. Because Significance F is not below 0.05, this independent variable does not affect the dependent variable on a significant scale. Overall, even though our output represents an insignificant model, I learned that not every pair of variables that appear related, such as a film’s budget and its gross profit, will have a high percentage of variance explained or a strong linear relationship. Contrary to what might seem like conventional wisdom, I also learned that most of the values given for both gross profits and budgets are estimates, so the figures reported to credible sources may differ depending on how well the movie does with the public.
As for further improvement of my model, I would like to use a larger data set, such as 100 data points rather than 50, which could possibly give a better relationship between budget and gross profit. I would also want to analyze my dependent variable more carefully by adjusting for inflation, since movies from the past aren’t represented in the top grossing movies because of inflation. The value of our currency has changed drastically over the last eighty years, and accounting for that could give us more accurate data for our model. Finally, this individual evaluation of the data set and plots gave me a better perspective on data analysis, because I can see the figures on a smaller scale in histograms, scatterplots, and data charts rather than just as values on a website.
14. Team Information
Team Member X Variable
Shane Cornfield Budget Amount to Create the Movie

Dana Saxton Budget Amount to Create the Movie
Drew Thoman Length of Movie (Minutes)
Zekun Huang Movie Rating Scores (Out of 100)
Works Cited
All Time Worldwide Box Office Grosses. (2017, February 9). Retrieved February 09, 2017, from <http://www.boxofficemojo.com/alltime/world/>.
All Time Worldwide Box Office Profiles. (2017, February 9). Retrieved February 09, 2017, from <http://www.boxofficemojo.com/movies/?id=moviename.htm>.
Why are your budget/gross figures for some movies different than those listed by another source? Why do you have budget/gross data on some movies and not others? (2017). Retrieved February 09, 2017, from <http://www.imdb.com/help/show_leaf?boxofficedifferent>.
Credits

Image of Cinema on front cover: Hayden Dingman from Pcworld.com:
https://i.ytimg.com/vi/5ar91JNLdR4/maxresdefault.jpg
BA240 Individual Project Report
SUBMITTED HARD COPY AT THE BEGINNING OF CLASS
1. THIS PROJECT IS PRESENTED IN WORD FORMAT SO
YOU CAN USE THE TABLES INCLUDED HERE.
2. The instructor reserves the right to adjust individual scores.
3. Individual or team projects that are just Excel printouts will
receive 0 points.
4. Excel instructions are contained under “Project” link on
Canvas.
INDIVIDUAL Project
Each team member is to choose one of the independent variables in the data sets to analyze along with the dependent data set. All team members will have the same dependent variable (y) but a different independent variable (x) (minimum 3 variables in a group). Review the Excel videos and linear regression before you do your own. Each variable should contain at least 50 data points. Number all your answers in your submission. Although you are sharing data, you must complete the analysis and interpretation individually.
1. Introduction:
· Show the data and explain why you selected the data and how
the data was collected.
· Cite websites and evaluate the credibility of your sources.
· List any limitations of the data.
· Describe any steps you took to clean up the data (if you have missing data).
· Make assumptions before doing the data analysis
· This section may be reused in your team project
· Write the project more like a report.
2. Describing the data:
· Plot histograms of your x variable and the y variable using
reasonable intervals for each set. (There will be two
histograms.)
· Label the graph correctly
· Comment on the shape of the distribution (skewness).
3. Analyze whether the x and y distributions satisfy the empirical rule (Yes or No, explain why). Show details such as the ranges within 1, 2 and 3 standard deviations of the mean and the corresponding true percentage of data falling in each range.
4. Identify and list all outliers in each distribution (Both X and
Y) using appropriate methodology and explain why they are
outliers. If you have more than 10 outliers in either distribution
(X or Y) in your dataset, you can just list out the top 10
outliers.
5. Calculate the mean, median, and mode and show where they are on the histogram graph (you can either edit the graph in Word, Excel or PowerPoint, or you can mark them by pen on the graph). Complete the following table for the five number summary (minimum, Q1, median, Q3, maximum) and the z-score of each.
                     x      z-scores      y      z-scores
Mean
Median
Mode
Standard Deviation          NA                   NA
Min
25th percentile
75th percentile
Max
6. The Regression: Show the output and all the plots from Excel
from Simple Linear Regression analysis. You can copy and
paste from Excel output and plots.
7. The Regression: Create a scatter plot of your independent
variable against the dependent variable using Excel. Make sure
your dependent variable is y and your independent is x on the
graph. Write a paragraph about your finding in the scatter plot.
8. The Regression: Display the “Line Fit Plots” from the Simple Linear Regression output. Is there a linear relationship between these two variables from the plot? Explain why.
9. The Regression: Is this regression model important/significant? Why or why not?
10. The Regression: Are all parameters important/significant?
Why or why not?
11. The Regression: Show the mathematical equation of this
model. Please give two examples after you have the equation.
Select any two meaningful numbers of X and predict the value

of Y and interpret the equation using words.
12. The Regression: Is this model a reliable predictor of y?
Explain how much of variation is explained. Do you think there
is a strong correlation and explain why or why not.
13. The Regression: Assumption check
Write a paragraph on the 4 assumption checks and explain why the model satisfies or violates each assumption.
14. Summary: Write at least one paragraph including: summary
of your findings in the plots, numerical measurements and data
analysis, what you have learned from the project, and any
comments you have or any further improvement of your model.
15. List all your team members’ names and their corresponding
variables.
16. Appendix if needed
