Data Analysis
Instructions of Excel 2016
By Yancy Chow
Data Analysis: House Example
House Data : 50 houses
Two variables: Price (Y) Area (X)
Excel: How to Add-in
Setting Up Excel for Statistical Analysis:
Excel: How to Add-in
Then find “Add ins” on the left side and click on it.
After that, click on “Go”.
Then click on “OK”.
Excel: How to Add-in
Then find “Analysis ToolPak” and click on it.
Then click on “OK”.
Note: if you use an Apple computer, the “add-in” option is under the
“Tools” menu!
Excel: How to Add-in
Now you’ve successfully added the “Data Analysis” tool.
Click on “Data” at the top, and you can now see the “Data Analysis”
icon!
Excel: How to Calculate
Mean, Median and Mode?
Open your data in Excel or type your data in Excel by column.
For example, we want to calculate the mean, median and mode
for the variable “Price” in this data. First select “Data”, then
click on “Data Analysis”.
Excel: How to Calculate
Mean, Median and Mode?
After clicking on “Data Analysis”, scroll until you find
“Descriptive Statistics” in the Analysis Tools panel
and select it. Then click on “OK”.
Excel: How to Calculate
Mean, Median and Mode?
First, you need to enter the “Input Range”.
You can either type the range in the box or use the mouse to
select the data values in the column you are interested in.
In this example, we select all 50 numbers in the first column.
Do not select the label row, such as the “price” row.
Excel: How to Calculate
Mean, Median and Mode?
After selecting the “Input Range”, select “Output Range” and
choose where you want the output to appear.
Excel: How to Calculate
Mean, Median and Mode?
Then select “Summary statistics”. Click on “OK” and you will
have the data analysis results.
Excel: How to Calculate
Mean, Median and Mode?
Here are the results from the data analysis, including
information such as the mean, median, mode, standard deviation,
sample variance, range, minimum and maximum.
Note that EXCEL can only find one mode. You need to check for
yourself whether there is more than one.
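As a cross-check outside Excel, the same summary statistics can be sketched in Python. This is only an illustration and not part of the original slides: the `prices` list is made-up sample data, and `statistics.multimode` is used because it returns every mode instead of just one.

```python
import statistics

# Hypothetical sample of house prices (NOT the 50 houses from the slides).
prices = [729870, 850000, 850000, 956000, 990000, 990000, 1190000]

mean = statistics.mean(prices)
median = statistics.median(prices)
modes = statistics.multimode(prices)    # every most-frequent value, not just one
sample_sd = statistics.stdev(prices)    # sample standard deviation
data_range = max(prices) - min(prices)  # maximum minus minimum

print("mean:", mean)
print("median:", median)
print("modes:", modes)
print("sample standard deviation:", sample_sd)
print("range:", data_range)
```

Here two prices tie for most frequent, so the list of modes has two entries, which is the case the note above asks you to check by hand.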
Excel: How to Calculate the
first Quartiles (Q1)?
Q1: Choose an empty space, enter:
“=quartile(data range, 1)”
Then press “Enter” and you will get the first quartile (Q1)
result.
A2:A51 is the range of the data
Excel: How to Calculate the
third Quartiles (Q3)?
Q3: Choose an empty space, enter:
“=quartile(data range, 3)”
Then press “Enter” and you will get the third quartile (Q3)
result.
A2:A51 is the range of the data
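To check the quartiles outside Excel, here is a small Python sketch. It assumes that Excel's QUARTILE interpolates between data points with the endpoints included, which is believed to correspond to `method="inclusive"` in Python's `statistics.quantiles`. The `sample` list is made-up data, and `quartiles_inclusive` is a hypothetical helper name.

```python
import statistics

def quartiles_inclusive(values):
    """Return (Q1, median, Q3) using the inclusive interpolation method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q2, q3

# Hypothetical data standing in for the range A2:A51.
sample = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = quartiles_inclusive(sample)
print("Q1:", q1, "median:", q2, "Q3:", q3)
```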
Excel: How to Draw Histograms?
First, check the output from “Descriptive Statistics” in
“Data Analysis”. In this house data, the mean is $956396.66,
the minimum is $729870 and the maximum is $1190000. A reasonable
bin width is $50000, so create a new column of “Bins”
running from $700000 to $1200000 in steps of $50000.
Once you have created reasonable bins, select “Data”, then “Data
Analysis”. Find “Histogram” and click on “OK”.
Excel: How to Draw Histograms?
Select the “Input Range”: the 50 house prices.
Select the “Bin Range”: the column you created.
Choose any empty space as your “Output Range”.
Tick “Cumulative Percentage” and “Chart Output”.
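The frequency table behind the histogram can be sketched in Python as follows. This is an illustration under stated assumptions: each bin value is treated as an inclusive upper bound, with anything above the last bin counted as “More”, which is how the Histogram tool is understood to count. The `prices` list is made-up data, and `frequency_table` is a hypothetical helper name.

```python
from bisect import bisect_left

def frequency_table(values, bins):
    """Count values per bin, treating each bin value as an inclusive upper bound.

    Values larger than the last bin are counted in a final "More" bucket.
    """
    counts = [0] * (len(bins) + 1)
    for v in values:
        # bisect_left returns the index of the first bin that is >= v
        counts[bisect_left(bins, v)] += 1
    return counts

# Bins from $700000 to $1200000 in steps of $50000, as on the slide.
bins = list(range(700_000, 1_200_001, 50_000))

# Hypothetical prices (NOT the 50 houses from the slides).
prices = [729870, 750000, 956396, 1190000, 1250000]
counts = frequency_table(prices, bins)

running = 0
for label, count in zip(bins + ["More"], counts):
    running += count
    print(label, count, f"{100 * running / len(prices):.2f}%")
```

The last printed column is the running total as a percentage, matching the “Cumulative Percentage” option.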
Excel: How to Draw Histograms?
Excel: How to Draw Histograms?
Here is the output from the “Data Analysis”: Frequency Table
and Histogram!
Now you can edit the text color, size and fill color if you want.
Excel: How to Draw Histograms?
You can also edit the color of the “Page Layout”.
Excel: How to Draw Histograms?
You can also add effects to the bars if you want.
Double-click on a bar and select the “Format” tool. You can
set the shape fill, the shape outline and the effects if
you want to draw a beautiful graph.
Excel: How to Draw Histograms?
Excel: How to Draw Histograms?
Usually, the histograms have no gap. How to have no gap?
Choose the graph, right click the mouse and choose “Format
Plot Area”.
Usually, the histograms have no gap. How to have no gap?
Then choose from “Plot Area Options”--
Excel: How to Draw Histograms?
Excel: How to Draw Histograms?
Usually, the histograms have no gap. How to have no gap?
Then set the “Gap Width” to “0%”.
Now, there is no gap for the histogram!
Excel: How to Draw Bar Charts (Qualitative data)?
Type the categories of the qualitative data and the
corresponding frequencies (how many in each class).
Excel: How to Draw Bar Charts (Qualitative data)?
Select the data (class and frequency), then choose “2-D
Column” or “3-D Column”.
Double-click each bar and choose “Format” to set the color
and effects.
Excel: How to Draw Scatter Plot?
Before doing the data analysis, you may want to look at a scatter
plot to check whether a relationship exists between Y and
X. Note that you should put X in the first column and Y in the
second column. Click “Insert” and then choose “Scatter chart”
to make the plot.
Excel: How to Draw Scatter Plot?
Excel: How to Do Simple Linear Regression?
Select “Data” from the toolbar, click “Data Analysis”, find
“Regression” in the dialog and click on “OK”.
Excel: How to Do Simple Linear Regression?
Select the “Input Y Range” from the variable “Price” column
Select the “Input X Range” from the variable “Area” column
Select the “Output Range” to somewhere empty.
Click on “Residual Plots”, “Line Fit Plots” and “Normal
Probability Plots”.
Then Click on “OK”.
Correlation or strength of linear relationship between x and y.
0.913 is strong.
R Square: the amount of variation explained by the regression.
Is the model a reliable predictor of y?
83.3% of variation is explained.
Y-intercept
Slope
If the p-value is smaller than 0.05, then the parameter is
significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is
significant/important.
Standard Error of the Regression: use in prediction.
The equation will be: Y = 307953 + 195X
Data Analysis Output: EXCEL 2016
Chapter 11
Simple Linear Regression
(Individual Project)
By Yancy Chow
What is Simple Linear Regression?
House 1
House 2
Which one do you think is more expensive? Why?
What is Simple Linear Regression?
Can be used to find relationships between two variables.
Examples:
Gene mapping for cancer research
Examples:
Stock market investment analysis
What is Simple Linear Regression?
Examples:
Sales forecasting
What is Simple Linear Regression?
Examples:
Product quality control
What is Simple Linear Regression?
Examples:
Income demographics
What is Simple Linear Regression?
Simple linear regression
one-to-one
Dependent Variable
(y)
Ex. House price
Independent Variable (x)
Ex. Square feet
What is Simple Linear Regression?
House Price=??*Square feet+ Random Error
Procedure: Model
Procedure: Model
We always assume that the mean value of the random error
equals 0.
Taking average……
That is, the model will be:
Figure of the Model
How to fit the Model?
----The Least Squares Approach
How to interpret?
For every unit increase in x, the mean of y is estimated to
increase by $\hat{\beta}_1$ units.
Bellevue College
Houses near Bellevue College
Home value (Y) as a function of
square footage (X)
Example
Correlation or strength of linear relationship between x and y.
0.913 is strong.
R Square: the amount of variation explained by the regression.
Is the model a reliable predictor of y?
83.3% of variation is explained.
Y-intercept
Slope
If the p-value is smaller than 0.05, then the parameter is
significant/important for predicting.
If the Significance F is smaller than 0.05, then the model is
significant/important.
Standard Error of the Regression: use in prediction.
The equation will be: Y = 307953 + 195X
Data Analysis Output: EXCEL 2016
Recently sold house near Bellevue College:
Area: 2360 sq ft
Lot Size: 8,162 sq ft
Built year: 1976
What is the house value from our model?
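The question above can be worked through by plugging the area into the fitted equation Y = 307953 + 195X. Here is a minimal Python sketch. The coefficients are the rounded values shown on the slide, so the result is approximate, and `predicted_price` is a hypothetical helper name. Lot size and year built are not in this one-variable model, so they are not used.

```python
# Rounded coefficients from the slide's equation Y = 307953 + 195X.
INTERCEPT = 307953
SLOPE = 195  # estimated price increase per additional square foot

def predicted_price(area_sqft):
    """Predicted house value for a given area under the simple linear model."""
    return INTERCEPT + SLOPE * area_sqft

print(predicted_price(2360))
```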
Plots: How to Check Assumptions?
Mean of Zero:
Variance Constancy:
Normality:
Evenly Around 0
NO Trend
Two parallel lines to 0.
Linear Line
Independence
No Pattern
Assumption Check
Residual Plot:
Mean of 0
Constant variance
Independence
Normal Probability Plot:
Normality
Plots
[Figure: X Variable 1 Line Fit Plot. x-axis: X Variable 1; y-axis: Y; series: Y and Predicted Y]
[Figure: X Variable 1 Residual Plot. x-axis: X Variable 1; y-axis: Residuals]
[Figure: Normal Probability Plot. x-axis: Sample Percentile; y-axis: Y]
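Model: $y = \beta_0 + \beta_1 x$, or $E(y) = \beta_0 + \beta_1 x$
Slope: $\hat{\beta}_1 = \dfrac{SS_{xy}}{SS_{xx}}$
y-intercept: $\hat{\beta}_0 = \dfrac{\sum y_i}{n} - \hat{\beta}_1 \dfrac{\sum x_i}{n}$
where $\hat{\beta}_1$ is the estimated slope.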
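The least squares estimates (slope = SS_xy / SS_xx, intercept = mean of y minus slope times mean of x) can be sketched in Python as below. This is a plain illustration of the formulas, not the routine Excel itself runs. The small `xs` and `ys` lists are made-up data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # SS_xy: sum of products of deviations; SS_xx: sum of squared x deviations
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = ss_xy / ss_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = fit_line(xs, ys)
print("intercept:", intercept, "slope:", slope)
```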
[Figure: Temp Residual Plot. x-axis: Temp; y-axis: Residuals]
[Figure: Normal Probability Plot. x-axis: Sample Percentile; y-axis: Fuel]
The Cinematic Study
Factors Affecting Top Grossing Movies Worldwide
Table of Contents
1. Selecting the Data
• Rationale
• Reliability of Source Data
• Limitations of the Data
• Cleaning Up the Data
• General Assumptions before Data Analysis
2. Describing the Data
3. Empirical Rule
4. Identify Outliers
5. Five Number Summary and Z-Scores
6. The Linear Regression Analysis
7. The Regression Scatterplot
8. The Linear Regression Line Fit Plots Analysis
9. The Significance of the Regression Model
10. The Regression Equation
11. The Reliability of the Regression Model
12. The Assumptions of the Regression Model
13. Conclusion
14. Team Information
Works Cited
1. Selecting the Data
Data Set
Dependent Variable (Y): Top Grossing Movies Worldwide (Gross Profit)
Independent Variable (X1): Budget Amount to Create the Movie
Independent Variable (X2): Length of Movie (Minutes)
Independent Variable (X3): Movie Rating Scores (Out of 100)
In this individual report, I focus on the X1 variable: the relationship
between the budget used to create each movie and the Y variable, the
worldwide gross of the top grossing movies.
Movie Gross Profit (Y) Budget (X)
Avatar $2,787,965,087 $237,000,000
Titanic $2,186,772,302 $200,000,000
Star Wars: The Force Awakens $2,068,223,624 $245,000,000
Jurassic World $1,670,400,637 $150,000,000
The Avengers $1,518,812,988 $220,000,000
Furious 7 $1,516,045,911 $190,000,000
Avengers: Age of Ultron $1,405,403,694 $250,000,000
Harry Potter TDH2 $1,341,511,219 $125,000,000
Frozen $1,287,000,000 $150,000,000
Iron Man 3 $1,214,811,252 $200,000,000
Minions $1,159,398,397 $74,000,000
Captain America: Civil War $1,153,304,495 $250,000,000
Transformers: Dark of the Moon $1,123,794,079 $195,000,000
Lord of the Rings: ROTK $1,119,929,521 $94,000,000
Skyfall $1,108,561,013 $200,000,000
Transformers: Age of Extinction $1,104,054,072 $210,000,000
The Dark Knight Rises $1,084,939,099 $250,000,000
Toy Story 3 $1,066,969,703 $200,000,000
POTC: Dead Man's Chest $1,066,179,725 $225,000,000
POTC: On Stranger Tides $1,045,713,802 $250,000,000
Jurassic Park (Original) $1,029,939,903 $63,000,000
Finding Dory $1,027,865,760 $200,000,000
Star Wars: Phantom Menace $1,027,044,677 $115,000,000
Alice in Wonderland $1,025,467,110 $200,000,000
Zootopia $1,023,784,195 $150,000,000
The Hobbit: Unexpected Journey $1,021,103,568 $180,000,000
The Dark Knight $1,004,558,444 $185,000,000
Rogue One $982,998,446 $200,000,000
Harry Potter TPS $974,755,371 $125,000,000
Despicable Me 2 $970,761,885 $76,000,000
The Lion King $968,483,777 $45,000,000
The Jungle Book (2016) $966,550,600 $175,000,000
POTC: At World's End $963,420,425 $300,000,000
Harry Potter: TDH1 $960,283,305 $161,287,500
The Hobbit: DOS $958,366,855 $225,000,000
The Hobbit: BOFA $956,019,788 $250,000,000
Finding Nemo $940,335,536 $94,000,000
Harry Potter: OOTP $939,885,929 $150,000,000
Harry Potter: HBP $934,416,487 $250,000,000
Lord of the Rings: Two Towers $926,047,111 $94,000,000
Shrek 2 $919,838,758 $150,000,000
Harry Potter: GoF $896,911,078 $150,000,000
Spider-Man 3 $890,871,626 $258,000,000
Ice Age: Dawn of the Dinosaurs $886,686,817 $90,000,000
Spectre $880,674,609 $245,000,000
Harry Potter: COS $878,979,634 $100,000,000
Ice Age: Continental Drift $877,244,782 $95,000,000
Secret Life of Pets $875,457,937 $75,000,000
Batman v Superman $873,260,194 $250,000,000
Lord of the Rings: Fellowship $871,835,347 $93,000,000
Rationale
Our group members come from diverse cultural backgrounds, and films play a
significant part in all of our lives, so we agreed to study cinema and,
specifically, how much revenue movies generate. We shared common
observations about movies and were interested in which individual factors
affect a movie’s total gross amount. As a group, we chose 3 main independent
variables, selected for the completeness and integrity of the available data
given the large computations involved for this data set: (1) Budget Amount to
Create the Movie, (2) Length of Movie (Minutes), and (3) Movie Rating Scores
(Out of 100). During the initial development stage of our project, we
researched whether the total gross amounts and the independent variables
were available to us on the web. They were, with enough detail and
credibility to provide the essential information for both the independent
variables and the dependent variable.
Reliability of Source Data
As a group, we believe that our data source,
http://www.boxofficemojo.com/alltime/world/, is a reliable and credible
source of published and updated data because of its owner, IMDb.com, which
is in turn owned by Amazon.com. Given IMDb’s significance and reputation, it
and its affiliates such as Box Office Mojo make every effort to provide
accurate reports, reliable sources of practical and useful information, and
credible daily updates on movies worldwide, including their gross values and
other variables such as the estimated budgets used to create each movie.
Moreover, IMDb (2017), the owner of Box Office Mojo, states that “studios
and distributors have started disclosing detailed figures only recently”
(p. 1), within roughly the last 15 years of films, and that these figures
are reported as estimates.
Limitations of the Data
As for limitations, the data combines all films and provides no breakdown.
We are therefore missing significant details such as sub-categorization into
genres, a specific year range, and adjustment for inflation. Access to this
data could have provided more insight into which movie genres are watched
the most and how much they earn, which year range offers the best results,
and which movies would have the highest gross value after accounting for
inflation.
Cleaning Up the Data
To clean up the data, we selected only the top 50 grossing movies rather
than 100, because the values are already in the millions and billions and
too many values are confusing, look sloppy in graphs, and are difficult to
use in some formulas. Although we were provided the gross values, some were
in a different currency such as Euros, so we converted a couple of movies
from Euros to dollars. This gave us consistency among the gross values in
terms of currency and provided us with more accurate data.
General Assumptions before Data Analysis
From looking at the data, my general assumption is that the higher the
budget set to create a movie, the higher its gross value will be. For
example, more recent movies that use computer-generated imagery may need a
higher budget and may in turn earn a higher gross value, as improved
technology makes them more entertaining for consumers.
2. Describing the Data
Here are the vital statistics describing the independent variable, the
budget amount to create a movie. Most of the budget amounts were within
$200,000,000 - $250,000,000, and a majority were below $250,000,000. The
shape of the distribution is close to normal: the skewness is small, at
-0.219, but the distribution is slightly left-skewed because the median is
greater than the mean.
[Figure: Frequency distribution of the Budget amounts to create a movie. x-axis: Budget (X), bins from 50,000,000 to 350,000,000 plus “More”; y-axis: Frequency of Budget; cumulative percentage on the secondary axis; mean, median and mode marked]
Here are the vital statistics describing the dependent variable, the
movie’s gross profit (top grossing movies worldwide). Most of the top
grossing movies were within $1,000,000,000 - $1,200,000,000, and a majority
were below $1,600,000,000. The shape of the distribution is not normal: the
skewness is large, at around 2.87, because of a couple of outliers on the
right, and the distribution is right-skewed because the mean is greater
than the median.
[Figure: Frequency distribution of the top movies' Gross Profits worldwide. x-axis: Gross Profit (Y); y-axis: Frequency of Gross Profit; cumulative percentage on the secondary axis; median and mean marked]
3. Empirical Rule
X
Interval             The range                          Percent of data in range   Satisfy empirical rule?
(μ - σ, μ + σ)       (108,286,384, 238,085,116)         54%                        No
(μ - 2σ, μ + 2σ)     (43,387,018, 302,984,482)          100%                       Yes
(μ - 3σ, μ + 3σ)     (-21,512,348, 367,883,848)         100%                       Yes
Y
Interval             The range                          Percent of data in range   Satisfy empirical rule?
(μ - σ, μ + σ)       (763,092,924, 1,496,252,698)       88%                        Yes
(μ - 2σ, μ + 2σ)     (396,513,037, 1,862,832,585)       94%                        No
(μ - 3σ, μ + 3σ)     (29,933,150, 2,229,412,472)        98%                        No
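The ranges in the table follow directly from the mean and standard deviation reported in section 5. Here is a minimal Python sketch of that arithmetic. The helper names `sigma_ranges` and `percent_within` are hypothetical, and the short `demo` list is made-up data used only to show the percentage calculation.

```python
def sigma_ranges(mean, sd):
    """Return the (low, high) ranges within 1, 2 and 3 standard deviations."""
    return [(mean - k * sd, mean + k * sd) for k in (1, 2, 3)]

def percent_within(values, low, high):
    """Percentage of values falling inside [low, high]."""
    inside = sum(1 for v in values if low <= v <= high)
    return 100 * inside / len(values)

# Mean and standard deviation as reported for budget (X) and gross (Y).
x_ranges = sigma_ranges(173_185_750, 64_899_366)
y_ranges = sigma_ranges(1_129_672_811, 366_579_887)
for k, (low, high) in enumerate(y_ranges, start=1):
    print(f"Y within {k} sd: ({low:,}, {high:,})")

# Made-up data, only to demonstrate percent_within.
demo = [1, 2, 3, 4]
print(percent_within(demo, 2, 3))
```

The empirical rule expects roughly 68%, 95% and 99.7% of the data in the three ranges, which is the benchmark the observed percentages are compared against.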
4. Identify Outliers
On the x-variable, the budget amount to create a movie, no outliers were
found in the data distribution because the absolute values of all z-scores
were below 2. On the y-variable, the top grossing movies worldwide, one
“extreme outlier” and two “normal outliers” were found. Titanic and Star
Wars: The Force Awakens are considered normal outliers, with z-scores of
2.88 and 2.56, because their z-scores are greater than or equal to 2 and
less than 3. Avatar is considered an extreme outlier, with a z-score of
4.52, because its z-score is greater than or equal to 3.
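The z-score rule used above (2 or more in absolute value is an outlier, 3 or more is an extreme outlier) can be sketched in Python. The mean and standard deviation are the values reported for the y-variable in section 5, and the helper names `z_score` and `classify` are hypothetical.

```python
Y_MEAN = 1_129_672_811  # mean gross, from the report's summary table
Y_SD = 366_579_887      # standard deviation of gross, from the same table

def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

def classify(z):
    """Apply the report's thresholds to a z-score."""
    if abs(z) >= 3:
        return "extreme outlier"
    if abs(z) >= 2:
        return "normal outlier"
    return "not an outlier"

grosses = {
    "Avatar": 2_787_965_087,
    "Titanic": 2_186_772_302,
    "Star Wars: The Force Awakens": 2_068_223_624,
    "Jurassic World": 1_670_400_637,
}
for movie, gross in grosses.items():
    z = z_score(gross, Y_MEAN, Y_SD)
    print(movie, round(z, 2), classify(z))
```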
5. Five Number Summary and Z-Scores
                     x              z-scores   y                z-scores
Mean                 173,185,750    0          1,129,672,811    0
Median               187,500,000    0.2206     1,022,443,882    -0.2925
Mode                 200,000,000    0.4132     N/A              N/A
Standard Deviation   64,899,366     N/A        366,579,887      N/A
Minimum              45,000,000     -1.975     871,835,347      -0.7034
25th Percentile      117,500,000    -0.8580    939,998,331      -0.5174
75th Percentile      225,000,000    0.7984     1,122,827,940    -0.0187
Max                  300,000,000    1.954      2,787,965,087    4.524
6. The Linear Regression Analysis
[Figure: Budget (X) Residual Plot. x-axis: Budget (X); y-axis: Residuals]
[Figure: Normal Probability Plot. x-axis: Sample Percentile; y-axis: Top Grossing Movies Worldwide (Y)]
[Figure: Budget (X) Line Fit Plot. x-axis: Budget (X); y-axis: Top Grossing Movies Worldwide (Y); series: Gross Profit (Y) and Predicted Gross Profit (Y)]
7. The Linear Regression Scatterplot
8. The Linear Regression Line Fit Plots Analysis
As we can see, the “Line Fit Plot” doesn’t display a strong linear
relationship between the budget and the top grossing movies worldwide. The
x variable, the budget, has a positive correlation with the y variable, but
the points are too spread out to fit a straight line well, which indicates
that there is only a slight relationship between the two.
[Figure: Budget vs. Top Grossing Movies Worldwide (scatter plot). x-axis: Budget (X); y-axis: Top Grossing Movies Worldwide (Y); series: Gross Profit (Y)]
9. The Significance of the Regression Model
Based on the Simple Linear Regression output, the model isn’t significant:
Significance F is 0.078175279, which is greater than 0.05, so the model is
classified as insignificant.
10. The Regression Equation
Based on the coefficients in the Simple Linear Regression output (the
Intercept and Budget (x)), the mathematical equation of this model is
Y = 883,710,603 + 1.420221979x. For every one-dollar increase in the budget,
the predicted gross amount for an upcoming film increases by about $1.42.
For example, if the budget for an upcoming film were $250,000,000, the value
of y would be $1,238,766,098. In other words, the model predicts that a film
with a budget of $250,000,000 would gross around $1,238,766,098.
[Figure: Budget (X) Line Fit Plot. x-axis: Budget (X); y-axis: Top Grossing Movies Worldwide (Y); series: Gross Profit (Y) and Predicted Gross Profit (Y)]
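The worked example for a $250,000,000 budget can be reproduced with a few lines of Python. The coefficients are copied from the regression equation in this section, and `predicted_gross` is a hypothetical helper name. Since the model is not significant, the output is an illustration of the arithmetic rather than a dependable forecast.

```python
# Coefficients from the report's equation Y = 883,710,603 + 1.420221979x.
INTERCEPT = 883_710_603
SLOPE = 1.420221979  # predicted gross dollars per budget dollar

def predicted_gross(budget):
    """Predicted worldwide gross for a given production budget."""
    return INTERCEPT + SLOPE * budget

for budget in (150_000_000, 250_000_000):
    print(f"budget ${budget:,} -> predicted gross ${predicted_gross(budget):,.0f}")
```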
11. The Reliability of the Regression Model
Whether the model is a reliable predictor of y is judged from R square in
the Simple Linear Regression output. In this model, R square is 0.063229234,
or 6.322%, which is very low for the amount of variation explained by the
regression. Unfortunately, this model is not a very reliable predictor of y,
as only 6.322% of the variation is explained.
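In simple linear regression, the correlation coefficient is the square root of R square, carrying the sign of the slope. Here is a minimal Python sketch using the values reported in this report. The helper name `correlation_from_r_squared` is hypothetical.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Correlation r for simple linear regression: sqrt(R^2) with the slope's sign."""
    return math.copysign(math.sqrt(r_squared), slope)

# R square and slope as reported in the regression output above.
r = correlation_from_r_squared(0.063229234, 1.420221979)
print(round(r, 4))
```

A correlation of roughly 0.25 is weak, which agrees with the low R square reported above.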
12. The Assumptions of the Regression Model
Assumption Check       Yes or No?   Why or why not?
Mean of 0              No           Based on the residual plot, most of the residuals aren’t evenly spread around 0, except for budgets of 100,000,000-250,000,000; a lot of the data is spread far from the mean of 0.
Constant Variability   No           Based on the residual plot, the variance is not constant. There is a clear triangular/cone-like pattern, which shows that the data doesn’t lie between two lines parallel to 0.
Independent            Yes
Normality              No           As can be observed from the normal probability plot, the points do not follow a straight line: there is a tail going upwards, so normality isn’t satisfied.
13. Conclusion
Based on the plots and the regression output, budget explains very little of
the variation in worldwide gross among the top grossing movies: R square is
only 6.322%, which is very low. More importantly, Significance F, which
represents the significance of our model, is 0.078175279. This is greater
than 0.05, so the model is insignificant, which suggests that this
independent variable doesn’t affect the dependent variable on a significant
scale in this data set. Overall, even though the output shows an
insignificant model, I learned that two variables that seem related, such as
a film’s budget and a movie’s gross profit, will not necessarily show a high
percentage of variance explained or a strong linear relationship. Contrary
to what might seem like conventional wisdom, I also learned that most of the
values given for both gross profits and budgets are estimates, so the
figures reported to credible sources can differ from the true numbers.
To improve my model, I would like to use a larger dataset, such as 100
movies rather than 50, which might show a clearer relationship between
budget and gross profit. I would also adjust the dependent variable for
inflation, because movies from the past are under-represented among the top
grossing movies. Our currency has changed drastically over the last eighty
years, and accounting for inflation could give us more accurate data for our
model. Finally, this individual evaluation of the dataset and plots gave me
a better perspective on data analysis, because I could see the figures
summarized in histograms, scatterplots, and data charts rather than just as
values on a website.
14. Team Information
Team Member X Variable
Shane Cornfield Budget Amount to Create the Movie
Dana Saxton Budget Amount to Create the Movie
Drew Thoman Length of Movie (Minutes)
Zekun Huang Movie Rating Scores (Out of 100)
Works Cited
All Time Worldwide Box Office Grosses. (2017, February 9). Retrieved February 09, 2017, from
<http://www.boxofficemojo.com/alltime/world/>.
All Time Worldwide Box Office Profiles. (2017, February 9). Retrieved February 09, 2017, from
<http://www.boxofficemojo.com/movies/?id=moviename.htm>.
Why are your budget/gross figures for some movies different than those listed by another
source? Why do you have budget/gross data on some movies and not others? (2017).
Retrieved February 09, 2017, from
<http://www.imdb.com/help/show_leaf?boxofficedifferent>.
Credits
Image of Cinema on front cover: Hayden Dingman from
Pcworld.com:
https://i.ytimg.com/vi/5ar91JNLdR4/maxresdefault.jpg
BA240 Individual Project Report
SUBMITTED HARD COPY AT THE BEGINNING OF CLASS
1. THIS PROJECT IS PRESENTED IN WORD FORMAT SO
YOU CAN USE THE TABLES INCLUDED HERE.
2. The instructor reserves the right to adjust individual scores.
3. Individual or team projects that are just Excel printouts will
receive 0 points.
4. Excel instructions are contained under “Project” link on
Canvas.
INDIVIDUAL Project
Each team member is to choose one of the independent variables
in the data sets to analyze along with the dependent data set. All
team members will have the same dependent variable (y) but a
different independent variable (x) (minimum 3 variables in a
group). Review the Excel videos and linear regression before
you do your own. Each variable should contain at least 50 data points.
Number all your answers in your submission. Although you are
sharing data, you must complete the analysis and interpretation
individually.
1. Introduction:
· Show the data and explain why you selected the data and how
the data was collected.
· Cite websites and evaluate the credibility of your sources.
· List any limitations of the data.
· Describe any steps you took to clean up the data ( if you have
missing data)
· Make assumptions before doing the data analysis
· This section may be reused in your team project
· Write the project more like a report.
2. Describing the data:
· Plot histograms of your x variable and the y variable using
reasonable intervals for each set. (There will be two
histograms.)
· Label the graph correctly
· Comment on the shape of the distribution (skewness).
3. Analyze whether the x and y distributions satisfy the
empirical rule (Yes or No, explain why). Show details such as
the ranges within 1, 2, and 3 standard deviations of the mean
and the true percentage of the data falling in each of these
ranges.
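The check in item 3 can be sketched in a few lines of Python if you want to confirm your Excel numbers. This is only an illustration: the function name `empirical_rule_table` and the data in `sample` are made up for the example, and the sample standard deviation is used on the assumption that it matches the "Standard Deviation" reported by Excel's Descriptive Statistics.

```python
from statistics import mean, stdev

def empirical_rule_table(values):
    """Return (k, low, high, share) for k = 1, 2, 3 standard deviations."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    rows = []
    for k in (1, 2, 3):
        low, high = m - k * s, m + k * s
        inside = sum(1 for v in values if low <= v <= high)
        rows.append((k, low, high, inside / len(values)))
    return rows

# Hypothetical data, for illustration only (not the house data)
sample = [2, 4, 4, 4, 5, 5, 7, 9]
expected = (68, 95, 99.7)  # percentages stated by the empirical rule
for k, low, high, share in empirical_rule_table(sample):
    print(f"within {k} SD: [{low:.2f}, {high:.2f}] "
          f"actual {share:.1%} vs. rule {expected[k - 1]}%")
```

Compare each actual percentage with 68%, 95%, and 99.7%. Large gaps suggest the distribution does not follow the empirical rule.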
4. Identify and list all outliers in each distribution (Both X and
Y) using appropriate methodology and explain why they are
outliers. If you have more than 10 outliers in either distribution
(X or Y) in your dataset, you can list only the 10 most extreme
outliers.
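The assignment leaves the outlier method to you. One common choice, shown here only as a sketch, is the 1.5 x IQR rule: flag any value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR. The function name `iqr_outliers` and the data are hypothetical. The "inclusive" quartile method is used on the assumption that it matches Excel's `=quartile(...)` function.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return (low_fence, high_fence, outliers) using the 1.5 * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [v for v in values if v < low or v > high]
    # Most extreme first: sort by distance beyond the nearer fence
    flagged.sort(key=lambda v: max(low - v, v - high), reverse=True)
    return low, high, flagged

# Hypothetical data, for illustration only
sample = [10, 12, 12, 13, 14, 15, 15, 16, 40]
low, high, flagged = iqr_outliers(sample)
print(f"fences: [{low}, {high}]")
print("outliers (most extreme first):", flagged[:10])
```

A z-score rule (for example, flagging values more than 3 standard deviations from the mean) is another option. Whichever method you use, name it in your report.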
5. Calculate the mean, median, and mode and show where they
are on the histogram (you can either edit the graph in
Word, Excel, or PowerPoint, or mark them on the graph in
pen). Complete the following table for the five-number
summary (minimum, Q1, median, Q3, maximum) and the
z-scores of each.
                       x       z-score     y       z-score
Mean
Median
Mode
Standard Deviation             NA                  NA
Min
25th percentile (Q1)
75th percentile (Q3)
Max
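To cross-check the values you enter in the table, the five-number summary and its z-scores can be computed as in the sketch below. The function name `summary_with_z` and the data are hypothetical. Each z-score is (value - mean) / standard deviation, using the sample standard deviation, and the quartiles use the "inclusive" method on the assumption that it matches Excel's `=quartile(...)`.

```python
from statistics import mean, median, stdev, quantiles

def summary_with_z(values):
    """Map each five-number-summary label to (value, z-score)."""
    m, s = mean(values), stdev(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    stats = {
        "Min": min(values),
        "Q1": q1,
        "Median": median(values),
        "Q3": q3,
        "Max": max(values),
    }
    return {name: (value, (value - m) / s) for name, value in stats.items()}

# Hypothetical data, for illustration only
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for name, (value, z) in summary_with_z(sample).items():
    print(f"{name:<7} value = {value:<5} z-score = {z:+.2f}")
```

Run it once for x and once for y to fill both halves of the table.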
6. The Regression: Show the output and all the plots from
Excel's Simple Linear Regression analysis. You can copy and
paste the Excel output and plots.
7. The Regression: Create a scatter plot of your independent
variable against the dependent variable using Excel. Make sure
your dependent variable is y and your independent is x on the
graph. Write a paragraph about your findings from the scatter plot.
8. The Regression: Display the “Line Fit Plots” from the Simple
Linear Regression output. Is there a linear relationship between
these two variables from the plot? Explain why.
9. The Regression: Is this regression model
important/significant? Why or why not?
10. The Regression: Are all parameters important/significant?
Why or why not?
11. The Regression: Show the mathematical equation of this
model. Then give two examples using the equation: select any
two meaningful values of X, predict the value of Y for each,
and interpret the equation in words.
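As an illustration of item 11, the sketch below plugs two X values into the house example's fitted equation, Y = 307953 + 195X, where Y is the price and X is the area in square feet. The two areas (3000 and 3500) are hypothetical values chosen for the example, not values taken from the house data. In your own report, pick values inside the range of your X data.

```python
# Coefficients from the house example's regression output
INTERCEPT = 307953  # Y-intercept
SLOPE = 195         # estimated change in mean price per extra square foot

def predict_price(area):
    """Predicted house price for a given area, from Y = 307953 + 195X."""
    return INTERCEPT + SLOPE * area

# Hypothetical areas, for illustration only
for area in (3000, 3500):
    print(f"Area {area} sq ft: predicted price ${predict_price(area):,}")
```

This gives $892,953 for 3000 square feet and $990,453 for 3500 square feet. In words: for every additional square foot of area, the mean house price is estimated to increase by $195.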
12. The Regression: Is this model a reliable predictor of y?
State how much of the variation is explained. Do you think there
is a strong correlation? Explain why or why not.
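The numbers behind item 12 come from Excel's Regression output, but a short least-squares sketch shows where they come from. The function name `fit_line` and the data are hypothetical. It returns the slope, the intercept, the correlation r ("Multiple R" in Excel's output for one x variable, up to sign), and R Square, which in simple linear regression is the square of r.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x, plus r and R Square."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / (sxx * syy) ** 0.5  # correlation between x and y
    return slope, intercept, r, r * r

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept, r, r_squared = fit_line(xs, ys)
print(f"Y = {intercept:.2f} + {slope:.2f}X")
print(f"correlation r = {r:.3f}, R Square = {r_squared:.1%} of variation explained")
```

An R Square close to 100% means most of the variation in y is explained by x. A value near 0% means the line explains very little.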
13. The Regression: Assumption check
Write a paragraph on the four assumption checks and explain why
the model satisfies or violates each assumption.
14. Summary: Write at least one paragraph including: a summary
of your findings from the plots, numerical measurements, and data
analysis; what you have learned from the project; and any
comments or suggested further improvements to your model.
15. List all your team members’ names and their corresponding
variables.
16. Appendix if needed

Data AnalysisInstructions of Excel 2016By Yancy Chow.docx

  • 1. Data Analysis Instructions of Excel 2016 By Yancy Chow Data Analysis: House Example House Data : 50 houses Two variables: Price (Y) Area (X) Excel: How to Add-in Setting Up Excel for Statistical Analysis:
  • 2. 3 Excel: How to Add-in Then find “Add ins” on the left side and click on it. After that, click on “Go”. Then click on “OK”. Excel: How to Add-in Then find “Analysis ToolPak” and click on it. ---- n “OK”. Note: if you use Apple computer, the “add-in” option is under “Tool”!
  • 3. Excel: How to Add-in Now you’ve successfully added in the “Data Analysis”. Click on “Data” on the top, now you can see “Data Analysis” icon! Excel: How to Calculate Mean, Median and Mode? Open your data in Excel or type your data in Excel by column. For example, we want to calculate the mean, median and mode for the variable “Price” in this data. Select “Data” firstly, then click on “Data Analysis” Excel: How to Calculate Mean, Median and Mode? After you clicking on “Data Analysis”, scroll the mouse until you find “Descriptive Statistics” in the Analysis Tools Panel and then select it. Then click on “OK”.
  • 4. Excel: How to Calculate Mean, Median and Mode? Firstly, you need to input the “Input Range”. You can either input by typing in the box or clicking using the mouse to select the data numbers in the column which you are interested in. In this example, we select the all 50 numbers in the first column. Do not select the label row, like “price” row. 9 Excel: How to Calculate Mean, Median and Mode? After selecting the “Input Range”, you need to select “Output Range” and choose anywhere you want to the output to be.
  • 5. Excel: How to Calculate Mean, Median and Mode? Then select “Summary statistics”. Click on “OK” and you will have the data analysis results. Excel: How to Calculate Mean, Median and Mode? Here is the results from the data analysis, including the information such like mean, median, mode , standard deviation, sample variance, rang, minimum and maximum. Note that EXCEL can only find one mode. You need to check whether there is mort than one by your own. Excel: How to Calculate the first Quartiles (Q1)? Q1: Choose an empty space, enter: “=quartile(data range, 1)” Then press “Enter” and you will get the first quartile (Q1)
  • 6. result. A2:A51 is the range of the data Excel: How to Calculate the third Quartiles (Q3)? Q1: Choose an empty space, enter: “=quartile(data range, 3)” Then press “Enter” and you will get the third quartile (Q3) result. A2:A51 is the range of the data Excel: How to Draw Histograms? Firstly, check the output from the “Descriptive Statistics” in “Data Analysis”. We notice in this house data, mean is $956396.66, minimum is $729870 and maximum is $1190000. A reasonable will be $50000. So create a new Colum of the “Bins” which is from $700000 to $1200000, by the interval $50000
  • 7. 15 Once you create reasonable Bins, select “Data”- Analysis”. Find “Histograms” and click on “OK”. Excel: How to Draw Histograms? Select the “Input Range”, the 50 house data. Select the “Bin Range”, the column you created. Decide any empty space as your “Output Range” Click on “Cumulative Percentage” and “Chart Output”- Excel: How to Draw Histograms?
  • 8. Excel: How to Draw Histograms? Here is the output from the “Data Analysis”: Frequency Table and Histogram! Now you can edit the words color, size, filled color if you want. Excel: How to Draw Histograms? You can also edit the color of the “Page Layout”. Excel: How to Draw Histograms?
  • 9. You can also design some effects of the bars if you want. Double click on one bar and Select the “Format” Tool. You can design the shape filled, the outline of the shape and effects if you want to draw a beautiful graph. Excel: How to Draw Histograms? Excel: How to Draw Histograms? Usually, the histograms have no gap. How to have no gap? Choose the graph, right click the mouse and choose “Format Plot Area”. Usually, the histograms have no gap. How to have no gap? Then choose from “Plot Area Options”-- Excel: How to Draw Histograms?
  • 10. Excel: How to Draw Histograms? Usually, the histograms have no gap. How to have no gap? Then make the “Gap Width” as “0%” Now, there is no gap for the histogram! Excel: How to Draw Bar Charts (Qualitative data)? Type the categories of the qualitative data and the corresponding frequency (How many in each class) Excel: How to Draw Bar Charts (Qualitative data)? Select the data (class and frequency)-- - -D Column” or “3-D Column”
  • 11. Double click each bar and choose “Format” to design the color and effects. Excel: How to Draw Scatter Plot? Before doing the data analysis, you may want to see the scatter plots to see whether there exists a relationship between Y and X. Note that you should put X in the first column and Y in the second column. Click “Insert” and then choose “Scatter chart” to make the plot. 28
  • 12. Excel: How to Draw Scatter Plot? Excel: How to Do Simple Linear Regression? Select “Data” from the Tool Bar- find out “Regression” in the dialog and click on “OK”. 30 Excel: How to Do Simple Linear Regression? Select the “Input Y Range” from the variable “Price” column Select the “Input X Range” from the variable “Area” column Select the “Output Range” to somewhere empty. Click on “Residual Plots”, “Line Fit Plots” and “Normal Probability Plots”.
  • 13. Then click “OK”. Annotations on the regression output:
    Correlation: the strength of the linear relationship between x and y; 0.913 is strong.
    R Square: the proportion of variation explained by the regression, which tells you whether the model is a reliable predictor of y; here 83.3% of the variation is explained.
    Coefficients: the Y-intercept and the slope.
    P-value: if a parameter's p-value is smaller than 0.05, the parameter is significant/important for predicting.
    Significance F: if it is smaller than 0.05, the model is significant/important.
    Standard Error of the Regression: used in prediction.
  • 14. The equation will be: Y = 307953 + 195X (Data Analysis output, Excel 2016). Chapter 11: Simple Linear Regression (Individual Project), by Yancy Chow. What is Simple Linear Regression? House 1
  • 15. House 2. Which one do you think is more expensive? Why? What is Simple Linear Regression? It can be used to find relationships between two variables. Examples: gene mapping for cancer research; stock market investment analysis.
  • 16. What is Simple Linear Regression? Examples: sales forecasting; product quality control; income demographics.
  • 17. What is Simple Linear Regression? Simple linear regression is one-to-one: one dependent variable (y), e.g. house price, and one independent variable (x), e.g. square feet. House Price = ?? × Square feet + Random Error. Procedure: Model
  • 18. Procedure: Model. We always assume that the mean value of the random error equals 0. Taking the average, the model will be: $E(y) = \beta_0 + \beta_1 x$. Figure of the Model. How to fit the Model? The Least Squares Approach
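The slide's own formulas did not survive in this transcript, so the following is only a minimal sketch of the least squares approach in its standard textbook form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The function name `least_squares_fit` and the toy data are illustrative assumptions, not part of the original slides.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    if ss_xx == 0:
        raise ValueError("x values must not all be equal")
    slope = ss_xy / ss_xx
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return intercept, slope


# Hypothetical toy data lying exactly on y = 1 + 2x (not the house data)
toy_x = [1.0, 2.0, 3.0, 4.0]
toy_y = [3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares_fit(toy_x, toy_y)
print(b0, b1)  # 1.0 2.0
```

Excel's “Regression” tool reports these same two quantities as the Intercept and X Variable coefficients.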
  • 19. How to interpret? For every one-unit increase in x, the mean of y is estimated to increase by the estimated slope. Example: Bellevue College; houses near Bellevue College.
  • 20. Example: home value (Y) as a function of square footage (X). Annotations on the regression output:
    Correlation: the strength of the linear relationship between x and y; 0.913 is strong.
    R Square: the proportion of variation explained by the regression, which tells you whether the model is a reliable predictor of y; here 83.3% of the variation is explained.
    Coefficients: the Y-intercept and the slope.
    P-value: if a parameter's p-value is smaller than 0.05, the parameter is significant/important for predicting.
    Significance F: if it is smaller than 0.05, the model is significant/important.
    Standard Error of the Regression: used in prediction.
    The equation will be: Y = 307953 + 195X
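Using the fitted equation for prediction is a single multiplication and addition. The sketch below plugs in the 2,360 sq ft area used in this example; the coefficients are the rounded values from the equation above, and the names `INTERCEPT`, `SLOPE`, and `predict_price` are illustrative assumptions.

```python
INTERCEPT = 307953  # Y-intercept from the fitted equation Y = 307953 + 195X
SLOPE = 195         # estimated increase in price per additional square foot


def predict_price(area_sqft):
    """Predicted house price for a given area, using the rounded coefficients."""
    return INTERCEPT + SLOPE * area_sqft


predicted = predict_price(2360)
print(predicted)  # 768153
```

Because the coefficients are rounded, the result can differ slightly from a prediction made with Excel's full-precision coefficients.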
  • 21. (Data Analysis output, Excel 2016.) Recently sold house near Bellevue College: area 2,360 sq ft; lot size 8,162 sq ft; built in 1976. What is the house value from our model? Plots: How to Check Assumptions?
  • 22. What to look for in the plots:
    Mean of zero: residuals evenly around 0, no trend.
    Variance constancy: residuals between two lines parallel to 0.
    Normality: points follow a straight line.
    Independence: no pattern.
    Assumption check:
    Residual plot: mean of 0, constant variance, independence.
    Normal probability plot: normality.
  • 23. Plots. [Figure: X Variable 1 Line Fit Plot, showing Y and Predicted Y against X Variable 1]
  • 24. [Figure: X Variable 1 Residual Plot, showing Residuals against X Variable 1]
  • 25. [Figure: Normal Probability Plot, showing Y against Sample Percentile] Model equations: $y = \beta_0 + \beta_1 x$ or $E(y) = \beta_0 + \beta_1 x$
  • 27. [Figure: Temp Residual Plot] [Figure: Normal Probability Plot, Fuel against Sample Percentile] The Cinematic Study: Factors Affecting Top Grossing Movies Worldwide
  • 28. Table of Contents
    1. Selecting the Data
       • Rationale
       • Reliability of Source Data
       • Limitations of the Data
       • Cleaning Up the Data
       • General Assumptions before Data Analysis
    2. Describing the Data
    3. Empirical Rule
    4. Identify Outliers
    5. Five Number Summary and Z-Scores
    6. The Linear Regression Analysis
    7. The Regression Scatterplot
    8. The Linear Regression Line Fit Plots Analysis
  • 29. 9. The Significance of the Regression Model
    10. The Regression Equation
    11. The Reliability of the Regression Model
    12. The Assumptions of the Regression Model
    13. Conclusion
    14. Team Information
    Works Cited
    1. Selecting the Data
  • 30. Data Set
    Dependent Variable (Y): Top Grossing Movies Worldwide (Gross Profit)
    Independent Variable (X1): 1. Budget Amount to Create the Movie
    Independent Variable (X2): 2. Length of Movie (Minutes)
    Independent Variable (X3): 3. Movie Rating Scores (Out of 100)
    In this individual report, I focus on the X1 variable: the correlation between the budget used to create each movie and the Y variable, the top grossing movies worldwide.
    Movie | Gross Profit (Y) | Budget (X)
    Avatar | $2,787,965,087 | $237,000,000
    Titanic | $2,186,772,302 | $200,000,000
    Star Wars: The Force Awakens | $2,068,223,624 | $245,000,000
    Jurassic World | $1,670,400,637 | $150,000,000
    The Avengers | $1,518,812,988 | $220,000,000
    Furious 7 | $1,516,045,911 | $190,000,000
    Avengers: Age of Ultron | $1,405,403,694 | $250,000,000
    Harry Potter TDH2 | $1,341,511,219 | $125,000,000
    Frozen | $1,287,000,000 | $150,000,000
    Iron Man 3 | $1,214,811,252 | $200,000,000
    Minions | $1,159,398,397 | $74,000,000
    Captain America: Civil War | $1,153,304,495 | $250,000,000
  • 31. Transformers: Dark of the Moon | $1,123,794,079 | $195,000,000
    Lord of the Rings: ROTK | $1,119,929,521 | $94,000,000
    Skyfall | $1,108,561,013 | $200,000,000
    Transformers: Age of Extinction | $1,104,054,072 | $210,000,000
    The Dark Knight Rises | $1,084,939,099 | $250,000,000
    Toy Story 3 | $1,066,969,703 | $200,000,000
    POTC: Dead Man's Chest | $1,066,179,725 | $225,000,000
    POTC: On Stranger Tides | $1,045,713,802 | $250,000,000
    Jurassic Park (Original) | $1,029,939,903 | $63,000,000
    Finding Dory | $1,027,865,760 | $200,000,000
    Star Wars: Phantom Menace | $1,027,044,677 | $115,000,000
    Alice in Wonderland | $1,025,467,110 | $200,000,000
    Zootopia | $1,023,784,195 | $150,000,000
    The Hobbit: Unexpected Journey | $1,021,103,568 | $180,000,000
    The Dark Knight | $1,004,558,444 | $185,000,000
    Rogue One | $982,998,446 | $200,000,000
    Harry Potter TPS | $974,755,371 | $125,000,000
    Despicable Me 2 | $970,761,885 | $76,000,000
    The Lion King | $968,483,777 | $45,000,000
    The Jungle Book (2016) | $966,550,600 | $175,000,000
    POTC: At World's End | $963,420,425 | $300,000,000
    Harry Potter: TDH1 | $960,283,305 | $161,287,500
    The Hobbit: DOS | $958,366,855 | $225,000,000
    The Hobbit: BOFA | $956,019,788 | $250,000,000
    Finding Nemo | $940,335,536 | $94,000,000
    Harry Potter: OOTP | $939,885,929 | $150,000,000
    Harry Potter: HBP | $934,416,487 | $250,000,000
    Lord of the Rings: Two Towers | $926,047,111 | $94,000,000
    Shrek 2 | $919,838,758 | $150,000,000
    Harry Potter: GoF | $896,911,078 | $150,000,000
  • 32. Spider-Man 3 | $890,871,626 | $258,000,000
    Ice Age: Dawn of the Dinosaurs | $886,686,817 | $90,000,000
    Spectre | $880,674,609 | $245,000,000
    Harry Potter: COS | $878,979,634 | $100,000,000
    Ice Age: Continental Drift | $877,244,782 | $95,000,000
    Secret Life of Pets | $875,457,937 | $75,000,000
    Batman v Superman | $873,260,194 | $250,000,000
    Lord of the Rings: Fellowship | $871,835,347 | $93,000,000
    Rationale. Our group members have backgrounds in diverse cultural activities. When we brainstormed ideas based on the activities that matter in our lives, we agreed on cinema and its impact on consumers, measured by how much revenue movies generate. We shared common observations and facts about movies, and we were interested in which individual factors affect a movie’s total gross amount. As a group we decided on 3 main independent variables, chosen for the completeness and integrity of their data given the large computations involved for this data set. (1. Budget Amount to Create the Movie, 2. Length of Movie (Minutes), 3. Movie Rating Scores (Out of
  • 33. 100.)) During the initial development stage of our project, we researched whether the movies' total gross amounts and the independent variables were available to us on the web. The detail and credibility we needed were available online, which gave us the essential information for both the independent variables and the dependent variable. Reliability of Source Data. As a group, we believe that our data source, http://www.boxofficemojo.com/alltime/world/, is a reliable and credible source of published and updated data because of its owner, IMDb.com, which is in turn owned by Amazon.com. Given IMDb's reputation, it and affiliates such as Box Office Mojo make every effort to report accurately and to publish daily updates on movies worldwide, including their gross values and other variables such as the estimated budgets used to create each movie. Moreover, IMDb (2017), the owner of Box Office Mojo, notes that “studios and distributors have started disclosing detailed figures only recently” (p. 1), within roughly the last 15 years of films, and that these figures are reported as estimates. Limitations of the Data. As far as limitations in our data, the data combines all films and
  • 34. provides no breakdown. As such, we are missing some significant details, such as sub-categorization into genres, a specific year range, and adjustment for inflation. Access to this data could have provided more insight into which movie genres are watched the most and how much they gross, which year range offers the best results, and which movies would have the highest gross value if we accounted for inflation. Cleaning Up the Data. To clean up the data, we selected only the top 50 grossing movies rather than 100, because the values are already in the millions and billions, and too many values are confusing, look cluttered in graphs, and are difficult to use with some formulas. Some gross values were in a different currency such as euros, so we converted a couple of movies from euros to dollars. This made the gross values consistent in currency and gave us more accurate data. General Assumptions before Data Analysis
  • 35. From looking at the data, my general assumption is that the higher the budget set to create a movie, the higher the gross value will be for that movie. For example, more recent movies that use computer-generated imagery may need a higher budget and earn a higher gross value as a result of improvements in technology and in the entertainment offered to consumers. 2. Describing the Data. Here are the vital statistics for the independent variable, the budget amount to create a movie. Most of the budget amounts were within $200,000,000 - $250,000,000, and a majority were below $250,000,000. The shape of the distribution is fairly close to normal: the skewness is small, at -0.219, but the distribution is classified as left-skewed because the median is greater than the mean.
  • 36. [Figure: histogram of Budget (X), showing frequency of budget and cumulative percentage]
  • 37. Frequency distribution of the budget amounts to create a movie, with the mean, median, and mode marked. Here are the vital statistics for the dependent variable, the movie’s gross profit (top grossing movies worldwide). Most of the top grossing movies were within $1,000,000,000 - $1,200,000,000, and a majority were below $1,600,000,000. The shape of the distribution is not normal: the skewness is large, at around 2.87, because of a couple of outliers on the right, and the distribution is classified as right-skewed because the mean is greater than the median.
  • 39. [Figure: histogram of Gross Profit (Y)] Frequency distribution of the top movies' gross profits worldwide, with the median and mean marked. 3. Empirical Rule
    X: Interval | The range | Percent of data falling in the range | Satisfy empirical rule? (Yes or No)
    $(\mu - \sigma, \mu + \sigma)$ | (108,286,384, 238,085,116) | 54% | No.
    $(\mu - 2\sigma, \mu + 2\sigma)$ | (43,387,018, 302,984,482) | 100% | Yes.
    $(\mu - 3\sigma, \mu + 3\sigma)$ | (-21,512,348, 367,883,848) | 100% | Yes.
  • 40. Y: Interval | The range | Percent of data falling in the range | Satisfy empirical rule? (Yes or No)
    $(\mu - \sigma, \mu + \sigma)$ | (763,092,924, 1,496,252,698) | 88% | Yes.
    $(\mu - 2\sigma, \mu + 2\sigma)$ | (396,513,037, 1,862,832,585) | 94% | No.
    $(\mu - 3\sigma, \mu + 3\sigma)$ | (29,933,150, 2,229,412,472) | 98% | No.
    4. Identify Outliers. On the x-variable, the budget amount to create a movie, no outliers were found because all of the z-scores were within ±2. On the y-variable, the top grossing movies worldwide, one “extreme outlier” and two “normal outliers” were found. Titanic and Star Wars: The Force Awakens are considered normal outliers, with z-scores of 2.88 and 2.56, because their z-scores are greater than or equal to 2 and less than 3.
  • 41. Avatar is considered an extreme outlier, with a z-score of 4.52, because its z-score is greater than or equal to 3. 5. Five Number Summary and Z-Scores
    Statistic | x | z-scores | y | z-scores
    Mean | 173,185,750 | 0 | 1,129,672,811 | 0
    Median | 187,500,000 | 0.2206 | 1,022,443,882 | -0.2925
    Mode | 200,000,000 | 0.4132 | N/A | N/A
    Standard Deviation | 64,899,366 | N/A | 366,579,887 | N/A
    Minimum | 45,000,000 | -1.975 | 871,835,347 | -0.7034
    25th Percentile | 117,500,000 | -0.8580 | 939,998,331 | -0.5174
    75th Percentile | 225,000,000 | 0.7984 | 1,122,827,940 | -0.0187
    Max | 300,000,000 | 1.954 | 2,787,965,087 | 4.524
    6. The Linear Regression Analysis
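The ranges and z-scores reported in sections 3 to 5 can be reproduced from the mean and standard deviation alone. The sketch below does this for the y-variable, using the reported mean, standard deviation, and four gross values from the data table. The function names are illustrative, and applying the outlier thresholds to the absolute value of z is an assumption that extends the report's rule symmetrically to low values.

```python
MEAN_Y = 1_129_672_811  # reported mean of gross profit (Y)
SD_Y = 366_579_887      # reported standard deviation of gross profit (Y)


def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


def classify(z):
    """Apply the report's thresholds: |z| >= 3 extreme, 2 <= |z| < 3 normal."""
    if abs(z) >= 3:
        return "extreme outlier"
    if abs(z) >= 2:
        return "normal outlier"
    return "not an outlier"


def sigma_range(mean, sd, k):
    """Interval (mean - k*sd, mean + k*sd) used by the empirical rule."""
    return (mean - k * sd, mean + k * sd)


grosses = {
    "Avatar": 2_787_965_087,
    "Titanic": 2_186_772_302,
    "Star Wars: The Force Awakens": 2_068_223_624,
    "Jurassic World": 1_670_400_637,
}
for title, gross in grosses.items():
    z = z_score(gross, MEAN_Y, SD_Y)
    print(f"{title}: z = {z:.2f} ({classify(z)})")

for k in (1, 2, 3):
    print(k, sigma_range(MEAN_Y, SD_Y, k))
```

The percentage of data inside each range would then be the count of values between the two bounds divided by 50.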
  • 42. [Figure: Budget (X) Residual Plot, showing Residuals against Budget (X)]
  • 44. [Figure: Normal Probability Plot, showing Top Grossing Movies Worldwide (Y) against Sample Percentile]
  • 45. [Figure: Budget (X) Line Fit Plot, showing Gross Profit (Y) and Predicted Gross Profit (Y) against Budget (X)]
  • 46. 7. The Linear Regression Scatterplot. [Figure: scatter plot “Budget vs. Top Grossing Movies Worldwide”, showing Gross Profit (Y) against Budget (X)]
  • 47. 8. The Linear Regression Line Fit Plots Analysis. As we can see, the “Line Fit Plot” does not show a clear linear relationship between the budget and the top grossing movies worldwide. The x variable, the budget, is positively correlated with the y variable, but the points are spread widely around the line rather than close to it, which indicates only a slight relationship between the two, not a strong one.
  • 48. 9. The Significance of the Regression Model. Based on the Simple Linear Regression output, the model is not significant: Significance F is 0.078175279, which is greater than 0.05. 10. The Regression Equation. Based on the coefficients in the Simple Linear Regression output, the Intercept and Budget (x), the mathematical equation of this model is Y = 883,710,603 + 1.420221979x. For every $1 increase in the budget, the predicted gross amount for an upcoming film increases by about $1.42. For example, if the budget for an upcoming film were $250,000,000, the value of y would be $1,238,766,098; that is, the film would be predicted to gross around $1,238,766,098 with a budget of $250,000,000.
  • 50. [Figure: Budget (X) Line Fit Plot, showing Gross Profit (Y) and Predicted Gross Profit (Y) against Budget (X)] 11. The Reliability of the Regression Model. Whether the model is a reliable predictor of y is judged by R square. In this model, R square is 0.063229234, or about 6.32%, which is very low; with only about 6.32% of the variation explained by the regression, this model is not a reliable predictor of y. 12. The Assumptions of the Regression Model
  • 51. Assumption Check | Yes or No? | Why or why not?
    Mean of 0 | No. | Based on the residual plot, most of the residuals are not evenly spread around 0, except for budgets from 100,000,000 to 250,000,000; many of the points lie far from the mean of 0.
    Constant Variability | No. | Based on the residual plot, the variance is not constant. There is a clear triangular/cone-like pattern, so the residuals do not lie between two parallel lines around 0.
    Independent | Yes |
    Normality | No. | The normal probability plot is not a straight line; it has a tail going upwards, so normality is not satisfied.
    13. Conclusion
  • 52. Based on the plots and the regression output, the budget explains little of the variation in the top grossing movies worldwide: R square is only about 6.32%, which is very low. More importantly, Significance F, which represents the significance of our model, is 0.078175279, which is greater than 0.05, so the model is not significant. Because Significance F is not below 0.05, this independent variable does not appear to affect the dependent variable on a significant scale. Overall, even though our output shows an insignificant model, I learned that not every variable that seems related to a subject, such as a film’s budget and its gross profit, will have a high percentage of variance explained or a strong linear relationship. Contrary to what might seem like conventional wisdom, I also learned that most of the values given for both gross profits and budgets are estimates, so the figures reported to credible sources may differ depending on how well the movie was received by the public.
  • 53. To improve my model, I would like to use a larger dataset, such as 100 movies rather than 50, which might show a clearer relationship between budget and gross profit. I would also want to adjust my dependent variable for inflation, since movies from the past are underrepresented among the top grossing movies because of inflation. The value of our currency has changed drastically over the last eighty years, and accounting for that could give us more accurate data for our model. Finally, this individual evaluation of the dataset and plots gave me a better perspective on data analysis, because I can see the figures on a smaller scale in histograms, scatterplots, and data charts rather than just as values on a website. 14. Team Information
    Team Member | X Variable
    Shane Cornfield | Budget Amount to Create the Movie
    Dana Saxton | Budget Amount to Create the Movie
    Drew Thoman | Length of Movie (Minutes)
    Zekun Huang | Movie Rating Scores (Out of 100)
  • 54. Works Cited
    All Time Worldwide Box Office Grosses. (2017, February 9). Retrieved February 9, 2017, from <http://www.boxofficemojo.com/alltime/world/>.
    All Time Worldwide Box Office Profiles. (2017, February 9). Retrieved February 9, 2017, from <http://www.boxofficemojo.com/movies/?id=moviename.htm>.
    Why are your budget/gross figures for some movies different than those listed by another source? Why do you have budget/gross data on some movies and not others? (2017). Retrieved February 9, 2017, from <http://www.imdb.com/help/show_leaf?boxofficedifferent>.
    Credits
  • 55. Image of Cinema on front cover: Hayden Dingman from Pcworld.com: https://i.ytimg.com/vi/5ar91JNLdR4/maxresdefault.jpg
    BA240 Individual Project Report. SUBMITTED HARD COPY AT THE BEGINNING OF CLASS
    1. THIS PROJECT IS PRESENTED IN WORD FORMAT SO YOU CAN USE THE TABLES INCLUDED HERE.
    2. The instructor reserves the right to adjust individual scores.
    3. Individual or team projects that are just Excel printouts will receive 0 points.
    4. Excel instructions are contained under the “Project” link on Canvas.
    INDIVIDUAL Project. Each team member is to choose one of the independent variables in the data sets to analyze along with the dependent data set. All team members will have the same dependent variable (y) but a different independent variable (x) (minimum 3 variables in a group). Review the Excel videos and linear regression before you do your own. Each variable should contain at least 50 data points. Number all your answers in your submission. Although you are
  • 56. sharing data, you must complete the analysis and interpretation individually.
    1. Introduction:
       · Show the data and explain why you selected the data and how the data was collected.
       · Cite websites and evaluate the credibility of your sources.
       · List any limitations of the data.
       · Describe any steps you took to clean up the data (if you have missing data).
       · Make assumptions before doing the data analysis.
       · This section may be reused in your team project.
       · Write the project more like a report.
    2. Describing the data:
       · Plot histograms of your x variable and the y variable using reasonable intervals for each set. (There will be two histograms.)
       · Label the graph correctly.
       · Comment on the shape of the distribution (skewness).
    3. Analyze whether the x and y distributions satisfy the empirical rule (Yes or No, explain why). Show details such as the ranges within 1 standard deviation, within 2 standard deviations, and within 3 standard deviations, and the corresponding true percentage falling in these ranges.
    4. Identify and list all outliers in each distribution (both X and Y) using appropriate methodology and explain why they are outliers. If you have more than 10 outliers in either distribution (X or Y) in your dataset, you can just list the top 10 outliers.
    5. Calculate the mean, median, and mode and show where they are on the histogram graph (you can either edit on the graph in Word, Excel or PowerPoint, or you can show them by pen on
  • 57. the graph). Finish the following table for the five number summary (Minimum, Q1, median, Q3, maximum) and the z-scores of each.
    Statistic | x | z-scores | y | z-scores
    Mean | | | |
    Median | | | |
    Mode | | | |
    Standard Deviation | | NA | | NA
    Min | | | |
    25 percentile | | | |
  • 58. 75 percentile | | | |
    Max | | | |
    6. The Regression: Show the output and all the plots from Excel from the Simple Linear Regression analysis. You can copy and paste from the Excel output and plots.
    7. The Regression: Create a scatter plot of your independent variable against the dependent variable using Excel. Make sure your dependent variable is y and your independent variable is x on the graph. Write a paragraph about your findings in the scatter plot.
    8. The Regression: Display the “Line Fit Plots” from the Simple Linear Regression output. Is there a linear relationship between these two variables from the plot? Explain why.
    9. The Regression: Is this regression model important/significant? Why or why not?
    10. The Regression: Are all parameters important/significant? Why or why not?
    11. The Regression: Show the mathematical equation of this model. Please give two examples after you have the equation. Select any two meaningful numbers of X and predict the value
  • 59. of Y and interpret the equation using words.
    12. The Regression: Is this model a reliable predictor of y? Explain how much of the variation is explained. Do you think there is a strong correlation? Explain why or why not.
    13. The Regression: Assumption check. Write a paragraph on the 4 assumption checks and explain why the model satisfies or violates each assumption.
    14. Summary: Write at least one paragraph including: a summary of your findings in the plots, numerical measurements and data analysis, what you have learned from the project, and any comments you have or any further improvement of your model.
    15. List all your team members’ names and their corresponding variables.
    16. Appendix if needed
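Items 9, 11, and 12 can be answered directly from the numbers in the Excel regression output. The sketch below is one way to do that, using the intercept, slope, R square, and Significance F from the movie report above as inputs. The names, the 0.05 cutoff constant, and the second budget value of $100,000,000 are illustrative assumptions.

```python
import math

# Values copied from the movie report's regression output
INTERCEPT = 883_710_603
SLOPE = 1.420221979
R_SQUARE = 0.063229234
SIGNIFICANCE_F = 0.078175279
ALPHA = 0.05  # significance cutoff used in the tutorial


def predict_gross(budget):
    """Predicted worldwide gross for a given budget, from Y = b0 + b1 * x."""
    return INTERCEPT + SLOPE * budget


# Item 11: predict Y for two chosen values of X
prediction = predict_gross(250_000_000)
print(round(prediction))  # 1238766098
print(round(predict_gross(100_000_000)))

# Item 9: the model is significant only if Significance F is below the cutoff
model_is_significant = SIGNIFICANCE_F < ALPHA
print(model_is_significant)  # False

# Item 12: share of variation explained, and the implied correlation,
# which takes the sign of the slope in simple linear regression
correlation = math.copysign(math.sqrt(R_SQUARE), SLOPE)
print(f"{R_SQUARE:.2%} of variation explained, r = {correlation:.3f}")
```

In simple linear regression the correlation is the square root of R square with the slope's sign, which is why a 6.32% R square corresponds to a weak correlation of about 0.25.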