Adjusting PageRank parameters and comparing results : REPORT
Subhajit Sahu
This is my report on Adjusting PageRank parameters and comparing results (version 1), written while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
Web graphs unaltered are reducible, and thus the rate of convergence of the power-iteration method is the rate at which α^k → 0, where α is the damping factor and k is the iteration count. An estimate of the number of iterations needed to converge to a tolerance τ is log_α τ. For τ = 10^-6 and α = 0.85, it can take roughly 85 iterations to converge. For α = 0.95 and α = 0.75, with the same tolerance τ = 10^-6, it takes roughly 269 and 48 iterations respectively. For τ = 10^-9 and τ = 10^-3, with the same damping factor α = 0.85, it takes roughly 128 and 43 iterations respectively. Thus, adjusting the damping factor or the tolerance parameter of the PageRank algorithm can have a significant effect on the convergence rate, both in terms of time and iterations. However, especially with the damping factor α, adjustment of the parameter value is a delicate balancing act. For smaller values of α, the convergence is fast, but the link structure of the graph used to determine ranks is less true. Slightly different values for α can produce very different rank vectors. Moreover, as α → 1, convergence slows down drastically, and sensitivity issues begin to surface [langville04].
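The estimate log_α τ above can be evaluated directly; a minimal sketch (the helper name is my own, not from the report):

```python
# Estimated power-iteration count for PageRank: the error shrinks roughly
# like alpha^k, so alpha^k = tau gives k = log(tau) / log(alpha) = log_alpha(tau).
from math import log

def est_iterations(alpha: float, tau: float) -> int:
    """Rough number of iterations to reach tolerance tau (helper name assumed)."""
    return round(log(tau) / log(alpha))

print(est_iterations(0.85, 1e-6))  # roughly 85
print(est_iterations(0.95, 1e-6))  # roughly 269
print(est_iterations(0.85, 1e-9))  # roughly 128
```

These reproduce the figures quoted in the text for the various (α, τ) pairs.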
The ARIMA analytical method predicts future values of a time series using a linear combination of past values and a series of errors. It is suitable when the data is stationary or non-stationary and univariate, with any type of data pattern. It produces accurate, dependable forecasts for short-term planning, and provides forecasted values of target variables for user-specified periods to illustrate results for planning, production, sales, and other factors.
Many business and economic applications of forecasting involve time series data. Regression models can be fit to monthly, quarterly, or yearly data using the techniques described in previous chapters. However, because data collected over time tend to exhibit trends, seasonal patterns, and so forth, observations in different time periods are related, or autocorrelated. That is, for time series data, the sample of observations cannot be regarded as a random sample. Problems of interpretation can arise when standard regression methods are applied to observations that are related to one another over time. Fitting regression models to time series data must be done with considerable care.
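The autocorrelation just described can be computed directly; a minimal sketch, where the lag-1 estimator is standard and the monthly series below is hypothetical:

```python
# Lag-1 autocorrelation: the kind of serial dependence that violates the
# random-sample assumption of standard regression on time series data.
from statistics import mean

def autocorr_lag1(series):
    m = mean(series)
    num = sum((series[t] - m) * (series[t - 1] - m) for t in range(1, len(series)))
    den = sum((x - m) ** 2 for x in series)
    return num / den

# A hypothetical trending monthly series: adjacent observations are related,
# so the lag-1 autocorrelation is strongly positive.
trend = [float(t) for t in range(1, 13)]
print(autocorr_lag1(trend))
```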
Adjusting PageRank parameters and comparing results : REPORT
Subhajit Sahu
This is my report on Adjusting PageRank parameters and comparing results (v2), written while doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
Abstract — The effect of adjusting the damping factor α and tolerance τ on the iterations needed for PageRank computation is studied here. The relative performance of PageRank computation with L1, L2, and L∞ norms used as the convergence check is also compared, using six possible mean ratios. It is observed that increasing the damping factor α linearly increases the iterations needed almost exponentially. On the other hand, decreasing the tolerance τ exponentially increases the iterations needed almost linearly. On average, PageRank with the L∞ norm as the convergence check is the fastest, closely followed by the L2 norm, and then the L1 norm. For large graphs, above certain tolerance τ values, convergence can occur in a single iteration. On the contrary, below certain tolerance τ values, sensitivity issues can begin to appear, causing computation to halt at the maximum iteration limit without convergence. The six mean ratios for relative performance comparison are based on the arithmetic, geometric, and harmonic means, as well as the order of ratio calculation. Among them GM-RATIO, geometric mean followed by ratio calculation, is found to be the most stable, followed by AM-RATIO.
Index terms — PageRank algorithm, Parameter adjustment, Convergence function, Sensitivity issues, Relative performance comparison.
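The order of ratio calculation mentioned in the abstract can be illustrated concretely. In the sketch below, the per-graph runtimes are hypothetical, and the definitions (mean first versus ratio first) are my reading of the abstract, not code from the report:

```python
# Hypothetical per-graph runtimes for two PageRank variants A and B.
from math import prod

a = [1.2, 3.4, 0.9, 2.0]
b = [1.0, 3.0, 1.1, 1.6]

def am(xs): return sum(xs) / len(xs)          # arithmetic mean
def gm(xs): return prod(xs) ** (1 / len(xs))  # geometric mean

# GM-RATIO / AM-RATIO style: take the mean of each series first, then the ratio.
gm_ratio = gm(a) / gm(b)
am_ratio = am(a) / am(b)

# The other order: take per-graph ratios first, then the mean of the ratios.
ratio_gm = gm([x / y for x, y in zip(a, b)])
ratio_am = am([x / y for x, y in zip(a, b)])

# For the geometric mean the two orders coincide exactly, which is consistent
# with GM-RATIO being the most stable of the six; for the arithmetic mean the
# two orders generally differ.
print(gm_ratio, ratio_gm)
print(am_ratio, ratio_am)
```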
International Journal of Computational Engineering Research (IJCER) is an international online journal publishing monthly in English. The journal publishes original research work that contributes significantly to furthering scientific knowledge in engineering and technology.
The X̄-R chart makes maximum utilization of the information available from data and provides detailed information on process average and variation for control of individual dimensions.
With reference to monthly coffee price data (Jan. 2006 to Apr. 2014) for the world indicator prices and retail prices in the US, this paper analyzes how variations observed at the coffee world market price level pass on to the coffee retail price. Unit root tests of the series under study reveal that all the series are non-stationary at level and stationary after first difference. The result of the Johansen test indicates the existence of one cointegration relation between the variables, and there are long-term dynamics between coffee retail and world prices. The Granger causality test indicates that there is transmission of price signals from the world market to the local retail market.
Decision Making Using The Analytic Hierarchy Process
Vaibhav Gaikwad
Analytic Hierarchy Process (AHP) is an effective tool for dealing with complex decision making, and may aid the decision maker to set priorities and make the best decision. By reducing complex decisions to a series of pairwise comparisons, and then synthesizing the results, the AHP helps to capture both subjective and objective aspects of a decision. In addition, the AHP incorporates a useful technique for checking the consistency of the decision maker's evaluations, thus reducing the bias in the decision making process. In this paper we give special emphasis to departure from consistency and its measurement and to the use of absolute and relative measurement, providing examples and justification for rank preservation and reversal in relative measurement.
The objective of the project is to use the dataset 'Factor-Hair-Revised.csv' to build an optimum regression model to predict satisfaction.
Perform exploratory data analysis on the dataset. Showcase some charts and graphs. Check for outliers and missing values.
Check for evidence of multicollinearity.
Perform simple linear regression for the dependent variable with every independent variable.
Perform PCA/factor analysis by extracting 4 factors. Interpret the output and name the factors.
Perform multiple linear regression with customer satisfaction as the dependent variable and the four factors as independent variables.
A parsimonious SVM model selection criterion for classification of real-world ...
o_almasi
This paper proposes and optimizes a two-term cost function consisting of a sparseness term and a generalized v-fold cross-validation term by a new adaptive particle swarm optimization (APSO). APSO updates its parameters adaptively based on dynamic feedback from the success rate of each particle's personal best. Since the proposed cost function is based on choosing fewer support vectors, the complexity of SVM models decreases while the accuracy remains in an acceptable range. Therefore, the testing time decreases, making SVM more applicable for practical applications on real data sets. A comparative study on data sets of the UCI database is performed between the proposed cost function and a conventional cost function to demonstrate the effectiveness of the proposed cost function.
This project requires us to understand what mode of transport employees prefer to commute to their office. The available data includes employee information about their mode of transport as well as their personal and professional details like age, salary, and work experience. We need to predict whether or not an employee will use a car as their mode of transport, and determine which variables are significant predictors behind this decision.
Following is expected in this assessment.
EDA
Perform an EDA on the data
Illustrate the insights based on EDA
Check for multicollinearity: plot the graph based on multicollinearity and treat it.
Data Preparation
Prepare the data for analysis (SMOTE)
Modeling
Create multiple models and explore how each model performs using appropriate model performance metrics
KNN
Naive Bayes
Logistic Regression
Apply both bagging and boosting modeling procedures to create 2 models and compare their accuracy with the best model of the above step.
Customer churn is a burning problem for telecom companies. In this project, we simulate one such case of customer churn, working with data on postpaid customers with a contract. The data has information about customer usage behavior, contract details, and payment details. The data also indicates which customers canceled their service. Based on this past data, we need to build a model which can predict whether a customer will cancel their service in the future or not.
Chapter 5
Cost–Volume–Profit Analysis
Learning Objectives
• Extend your knowledge of fixed and variable costs, and be able to perform cost behavior analysis.
• Understand the contribution margin, contribution margin ratio, and how knowledge of these concepts can be used to calculate breakeven and other performance measures.
• Know the critical assumptions of cost–volume–profit analysis.
• Understand variable versus absorption costing.
• Be able to calculate residual income.
Chapter Outline
5.1 Mixed Costs
5.2 Cost–Volume–Profit Analysis
The Algebra of Break-Even and Targeted Income Analysis
Influence of Taxes
Changing Costs
Changing Revenues
Multiple Products
5.3 CVP Assumptions
Direct Costing
Comprehensive Income Statements Under Variable and Absorption Costing
Fluctuating Inventory
5.4 Evaluating Residual Income
You have previously learned about fixed and variable costs. Fixed costs are the same over the relevant range of expected production. Variable costs fluctuate in direct proportion to volume. You have seen how cost behavior influences measures of income, flexible budgeting, standard costing models, and so forth. Management must understand cost behavior to operate a successful business organization effectively. In this chapter, your knowledge of cost behavior will be extended to encompass techniques useful in studying a business's break-even point and similar concepts. These techniques are commonly referred to as cost–volume–profit analysis or just CVP. You will also apply your knowledge of cost behavior to understand alternative costing methods that are useful in managing business decisions.
5.1 Mixed Costs
Before diving into CVP and alternative costing models, one must give consideration to the prospect of a mixed cost. Mixed costs entail a fixed component and a variable component. They are actually quite common. If you have ever committed to a cell phone contract, it is very possible that you have some hands-on experience with mixed costs. Your monthly cellular bill may include both fixed and variable amounts. Perhaps there is a fixed charge for basic monthly service and variable charges related to Internet access, texting, and so forth. Mixed costs change in response to fluctuations in volume, but not in a way that is immediately apparent. Before a manager can study the effects of volume fluctuation on a business, it is first necessary to develop a model that separates mixed costs into their fixed and variable components.
Assume that Charlie's Restaurant receives a monthly electric bill. Charlie's electricity use fluctuates significantly each month. The cause of the fluctuation relates mostly to seasonal differences in utility consumption, based on heating and air-conditioning needs. Charlie's provides data about its monthly electric bill in Table 5.1.
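One common way to perform that separation is the high-low method. The sketch below uses hypothetical monthly figures, since Table 5.1 is not reproduced in this excerpt:

```python
# High-low method: use the highest- and lowest-activity months to split a
# mixed cost into its variable and fixed parts. Figures are hypothetical.
months = [
    (900,  2600),   # (kilowatt-hours used, total electric bill in $)
    (1500, 3800),
    (2100, 5000),
]
hi = max(months)    # highest-activity month: (2100, 5000)
lo = min(months)    # lowest-activity month:  (900, 2600)

# Slope: change in total cost divided by change in activity.
variable_per_unit = (hi[1] - lo[1]) / (hi[0] - lo[0])
# Fixed part: what remains after removing the variable cost at either endpoint.
fixed_component = hi[1] - variable_per_unit * hi[0]

print(variable_per_unit)   # dollars per kWh
print(fixed_component)     # dollars per month
```

For these numbers the split is $2.00 of variable cost per kWh on top of an $800 fixed monthly charge, the kind of model a manager needs before doing CVP analysis.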
To: Mike Ryan, Deputy Secretary for Highways
From: Berkeley Teate, Analyst
RE: Patchwork Assessment 2016
Introduction: The Pennsylvania State Department of Highways has requested an examination of the effect of using specialized crews on the cost of conducting one particular maintenance activity. Specific interest in this analysis will focus on the maintenance activity titled "manual patching," otherwise known as filling in potholes in state highways. Maintenance managers in the Department believe the increasing reliance on specialized crews for activities will improve efficiency [of labor] and result in lower costs of manual patching. Lastly, the Department has implemented a quality improvement initiative in the past two years, and is interested to learn whether increased quality [of manual patching] leads to a reduction in costs. I intend to analyze these variables to determine the extent to which specialized crews and the quality improvement initiative have impacted costs for the Department of Highways in Pennsylvania.
• Research Hypothesis: An overall increase in "Crew Specialization" should result in lower costs [of manual patching work, in dollars] for the 67 county maintenance units examined.
  o The reliance on specialized crews and the quality improvement initiative will decrease the influence of specific variables [costs of manual patching work in dollars, quality assurance score for manual patching, and production units per 100 activity hours] on the cost of manual patching.
  o The same reliance(s) will increase the influence of specific variables [production units of manual patching completed, lane miles of state highway, number of freeze/thaw cycles, average days with snow on the ground, and material cost index] on the cost of manual patching.
Research Method: An analysis of 67 county maintenance units in the Pennsylvania Department of Highways was completed, looking at the descriptive data provided. The independent variables examined in this analysis are as follows: 1) Crew specialization (interval scale); 2) Production units of manual patching completed (interval scale); 3) Quality assurance score for manual patching (ordinal scale); 4) Production units per 100 activity hours (interval scale); 5) Lane miles of state highway (nominal scale); 6) Number of freeze/thaw cycles (interval scale); 7) Average snow days on the ground (interval scale); and 8) Materials cost index (interval scale). Operating efficiency will also be considered an independent variable, along with the crew specialization and labor productivity variables. The dependent variable in this analysis is the cost of manual patching work, in dollars (interval scale), for each of the 67 county maintenance units. The control variables include: 1) production units [when analyzing the association between crew specialization and cost]; 2) material cost index [when analyzing the effect of crew specialization on overall expenditures]; and 3) the number of freeze/thaw cycles and the average days of snow on the ground [considering climatic conditions in the State of Pennsylvania].
• The quality assurance variable provides scores ranging from 1 to 10, with higher scores reflecting more positive patching jobs.
• The crew specialization variable is an indicator of the extent to which manual patching work is performed by specialized crews. Values on this scale represent the percent of all maintenance crews in the district not required to perform the "first" 50% of manual patching in that district; thus, higher percentages indicate more crew specialization.
Now let's look at specific research procedures. I will use a cross-sectional design to view simple descriptive data; scatterplots to complete a partial correlation, analyzing the relationship between individual variables [while controlling for one or more other variables]; and a correlation matrix as a prelude to my multivariate analysis of contextual variables, where I'll re-purpose SPSS data to visualize regression model(s). This will provide an interpretation of the influence of each variable in terms of slope, statistical significance, and the strength of the overall model. No sampling strategy is applicable to this analysis, as there was no strict selection or particular treatment involved.
• Although not requested, I'm providing the research design of this cost function analysis: a quasi-experimental, interrupted time series design. The Department implemented a quality improvement initiative in the past two years, creating a "pre" and "post" atmosphere for the variables I'm examining. Please see below for a visual aid:
Interrupted Time Series Design:
O O O X O O O
Results:
1. Initial Impression of Raw Data [based on Descriptive Statistics]:
Table A:
Descriptive Statistics

Variable                                    N   Minimum    Maximum          Mean   Std. Deviation
COST OF MANUAL PATCHING                    66   $37,936   $929,733   $272,421.67     $182,290.557
PRODUCTION UNITS OF MANUAL PATCHING        66       544     24,256      5,685.50        4,564.538
NUMBER OF FREEZE THAW CYCLES               67         4         83         24.27           27.030
QUALITY OF MANUAL PATCHING                 67         1         10          5.91            2.360
MATERIAL COST INDEX MANUAL PATCHING        67        75        150        103.09           14.665
NUMBER OF DAYS SNOW ON GRND                67        10         91         44.88           20.751
COUNTY LANE MILES                          67       161       2701       1103.34          516.889
UNIT COST OF MANUAL PATCHING               66    $18.16    $186.96      $56.7203        $27.42233
PERCENT CREW SPECIALIZATION (FIRST 50%)    66        65         95         79.42            6.884
LABOR PRODUCTIVITY INDEX                   66       4.1       18.6        15.022           2.5740
Valid N (listwise)                         65
I ran descriptive statistics, displayed above in a cross-sectional table, to retrieve the minimum, maximum, mean, and standard deviation of all variables provided [independent, dependent, control]. These results provide a simple understanding of what is presented, and hopefully begin to show correlation between variables.
• The minimum cost of manual patching is $37,936, and the maximum is $929,733, a near $900,000 gap. However, the mean cost for manual patching sits at $272,421, with a standard deviation of $182,290. This suggests a high probability of outliers, which we can look for in the scatterplot analysis [in Step 2].
• The minimum quality assurance rating of manual patching is 1, and the maximum is 10. Once again, however, the mean shows just below a "6" average. This variable does have a high standard deviation given the small range, nearly a 2.4 rating. This could mean that quality varies greatly depending on the crew, and possibly the district of the job.
• The minimum crew specialization percentage is 65%, and the maximum is 95%, nearly a 30-point gap in the manual patching requirement(s). However, the mean is nearly 80%, showing a high percentage of crews who are not required to perform the first 50% of work.
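For readers without SPSS, summary statistics of the kind shown in Table A can be reproduced with Python's standard library; the quality scores below are hypothetical stand-ins, not the actual county data:

```python
# Minimal descriptive-statistics sketch with Python's standard library.
# The scores are hypothetical; the real analysis used SPSS on county data.
from statistics import mean, stdev

quality = [1, 10, 5, 7, 4, 8, 6, 3, 9, 6]   # hypothetical quality scores (1-10)
summary = {
    "N": len(quality),
    "Minimum": min(quality),
    "Maximum": max(quality),
    "Mean": round(mean(quality), 2),
    "Std. Deviation": round(stdev(quality), 3),   # sample standard deviation
}
print(summary)
```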
2. Scatterplot Analysis [Two Separate Comparisons]:
Graph A:
Next, I generated a scatterplot analysis to evaluate whether there was a correlation between the cost of manual patching [dependent variable] and crew specialization [independent variable], controlling for material cost index and production units [the total cost of any manual patching activity is heavily dependent on the volume of patching completed]. This is a linear relationship, as defined by the slope equation:
Y = aX + b
The scatterplot above shows a positive, linear association:
• The strength of the scatterplot is related to how tight the points are; in this analysis, we see a "moderate" correlation.
• The association is determined by the R² of the fit line. The "fit line" is also known as the regression line. The slope of the regression line = 0.104.
• The direction of the scatterplot provides a negative gradient. This is most likely due to multiple outliers in the upper-left corner of the graph, with high costs and low crew percentages.
This scatterplot supports the hypothesis, as it shows that as crew specialization goes up, the cost of manual patching decreases. This is reconfirmed by the "moderate" strength in correlation of the clustered points and the negative gradient.
Graph B:
I also generated a scatterplot analysis to evaluate whether there was a correlation between the cost of manual patching [dependent variable] and the quality of manual patching [independent variable]. This is a linear relationship, as defined by the slope equation: Y = aX + b
The scatterplot above shows a positive, linear association:
• The strength of the scatterplot is related to how tight the points are; in this analysis, we see a "moderate to strong" correlation. There are nearly ten cluster points falling exactly on the slope line.
• The association is determined by the R² of the fit line. The "fit line" is also known as the regression line. The slope of the regression line = 0.151.
• The direction of the scatterplot provides a negative gradient. This is most likely due to multiple outliers sitting entirely above the slope line, possibly showing higher costs regardless of the quality of manual patching work.
This scatterplot also supports the hypothesis, as it shows that as quality goes up, the cost of manual patching decreases. This is reconfirmed by the "moderate to strong" strength in correlation of the clustered points and the negative gradient.
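Both fit lines come from the slope equation Y = aX + b. The ordinary-least-squares estimate of a and b can be sketched as follows, with hypothetical points standing in for the SPSS data:

```python
# Ordinary least squares for the fit line Y = aX + b.
# The data points are hypothetical, not the county maintenance data.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of X and Y divided by variance of X.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx          # intercept: the line passes through (mean X, mean Y)
    return a, b

crew_pct = [65, 70, 75, 80, 85, 90, 95]        # hypothetical crew specialization %
cost_k   = [420, 390, 340, 300, 260, 230, 190] # hypothetical patching cost ($000)
a, b = fit_line(crew_pct, cost_k)
print(a, b)   # negative slope: cost falls as specialization rises
```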
3. Correlation Matrix [Refer to Appendix for SPSS Matrix]:
A correlation matrix [Pearson product-moment] was generated to look at correlation among the dependent variable and individual independent variables [not all are featured in the scatterplots above], statistical outliers, and the levels of significance of those relationships. Please refer to the full report correlation matrix labeled "Table E" in the Appendix. Please note: there were no row-wise deletions, as there were few missing values.
• The following independent variable correlations with Cost of Manual Patching were
flagged as significant at the 0.01 level (2-tailed): 1) crew specialization; 2) quality of
manual patching; 3) county lane miles; 4) production units of manual patching; and 5)
number of freeze/thaw cycles.
• The following independent variable correlations with Crew Specialization were flagged as
significant at the 0.01 level (2-tailed): 1) cost of manual patching; 2) quality of manual
patching; and 3) number of freeze/thaw cycles.
• The following independent variable correlations with Quality of Manual Patching were
flagged as significant at the 0.01 level (2-tailed): 1) cost of manual patching; 2) crew
specialization; and 3) production units of manual patching.
  o There was also a significant relationship with county lane miles at the 0.05 level (2-tailed).
This shows a recurring pattern of significance among cost, crew specialization, and quality of
manual patching: the variables that were requested by the Pennsylvania Department of Highways.
These relationships provide additional results that support the hypothesis. Significance at the 0.01
level (two-tailed) means there is less than a 1 percent probability that the observed relationships
between cost of patching, crew specialization, and quality of the patching work arose by chance.
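As a sketch of how such a matrix is assembled, including pairwise deletion of missing values (which is why N varies across cells in Table E), a minimal Python example on hypothetical data:

```python
# Illustrative sketch of a Pearson product-moment correlation matrix with
# pairwise (not listwise) deletion. Variable names and data are hypothetical.

def pearson_r(xs, ys):
    """Pearson r over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical columns; None marks a missing value.
data = {
    "cost":     [410, 380, 300, 250, None, 210, 190],
    "crewspec": [60, 65, 72, 80, 83, None, 90],
    "quality":  [3, 4, 4, 6, 7, 7, 8],
}
names = list(data)
matrix = {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}
for a in names:
    print(a, {b: round(matrix[a][b], 3) for b in names})
```

The matrix is symmetric with 1s on the diagonal; each off-diagonal cell uses only the cases where both variables are present, so its N can differ from the others.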
4. Initial Regression Model(s):
Table C:
In Tables C and D below, I provide a final test for significance between the independent variables
and the dependent variable [cost of manual patching, in dollars]: a test of significance of the
regression slope. I used a multiple regression analysis, adding variables incrementally to predict a
change in the dependent variable, and re-purposed the data requested for Steps 4-7. Please refer
to the Output Data provided separately for a full visual of the step-by-step process.
Cost of Manual Patching:1
            Step 4                Step 5                Step 6                Step 7
Crewspec                                                                    -3275.130  -.124***
Unitcost                                                                     1782.374   .269***
Laborprod                                                                    -864.467   -.012
Quality                                             -8361.313  -.107*       -6482.359  -.083*
Produnit                          25.811   .646***     24.820   .622***        32.197   .807***
Snowdays   2373.561   .272***    -24.264   -.003     -250.361   -.029          97.207    .011
Freethaw   -746.948   -.111     -387.730   -.058     -378.822   -.056        -188.152   -.028
Matcost    1498.102    .121     1102.377    .089*    1272.666    .103*        850.520    .069*
Lanemile2   302.206    .798***   130.412    .344***   123.187    .325***       77.436    .204***
N:               67                   67                   67                      67
R^2            .649                 .861                 .870                    .942
(Each cell shows the unstandardized coefficient followed by the standardized coefficient; * p < .05, ** p < .01, *** p < .001.)
In Table C, I would like to review the "R Square" statistic: the coefficient of determination. I will
review the slope and strength of statistically significant variables in Table D. The R Square statistic
measures the proportion of total variation in the dependent variable (Y) explained by the model.
• This table is informative, as it shows how much of the variation in the dependent variable is
explained as individual independent variables are added. By adding variables in a step-by-step
fashion, it is easier to determine which variables are significant, even without the slope or
two-tailed test. In this analysis, all R Square statistics are strong [above 0.40].
• Step 4:
  o The R Square is 0.649, meaning when looking at those specific variables [snow
  days, freeze/thaw cycles, material cost, and county lane miles] just under 65% of
  the variation in the cost of manual patching is accounted for.
• Step 5:
  o The R Square is 0.861, meaning when the specific independent variable of
  "production units of manual patching" is added into the regression model, just over
  86% of the variation in the cost of manual patching is accounted for.
  o This is important, as it shows that one variable, Production Units, accounts for an
  additional 21% of the variation in the cost of manual patching.
• Step 6:
  o The R Square is 0.870, meaning when the specific independent variable of "quality
  of manual patching" is added into the regression model, 87% of the variation in the
  cost of manual patching is accounted for.
  o This is in line with Graph B, which supports the hypothesis that as quality goes up,
  costs go down. Seeing as there was only about a 1% shift in explained variation,
  this regression supports previous results.
1 Cost of Manual Patching: Dependent Variable
2 Independent variables, full names: County Lane Miles; Material Cost Index, Manual Patching; Number of Freeze/
Thaw Cycles; Number of Days [with] Snow on Ground; Production Units of Manual Patching; Quality of Manual
Patching; Labor Productivity Index; Unit Cost of Manual Patching; Percent Crew Specialization [First 50%]
• Step 7:
  o The R Square is 0.942, meaning when looking at those specific variables [labor
  productivity, unit cost, and crew specialization] just over 94% of the variation in the
  cost of manual patching is accounted for.
  o This is important for similar reasons to Step 6: it is in line with Graph A. This R
  Square statistic supports the hypothesis that crew specialization lowers costs in
  manual patching. There was not a dramatic shift in explained variation, even taking
  into account that two other independent variables were added simultaneously.
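The way R Square grows as predictors are added in Steps 4-7 can be illustrated with a generic ordinary-least-squares sketch. The variable names and data below are hypothetical; this is not a reproduction of the report's SPSS model:

```python
# Illustrative sketch: R^2 for a multiple regression fit by solving the
# normal equations, showing how R^2 can only rise as predictors are added.

def ols_r2(X_cols, y):
    """R^2 of y regressed on the given predictor columns (with intercept)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in X_cols] for i in range(n)]  # design matrix
    k = len(X[0])
    # Normal equations (X'X) beta = X'y, solved by Gaussian elimination.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)] for i in range(k)]
    b = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    for i in range(k):
        piv = max(range(i, k), key=lambda r: abs(A[r][i]))  # partial pivoting
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    yhat = [sum(beta[j] * X[r][j] for j in range(k)) for r in range(n)]
    ym = sum(y) / n
    sse = sum((y[r] - yhat[r]) ** 2 for r in range(n))
    sst = sum((v - ym) ** 2 for v in y)
    return 1.0 - sse / sst

# Hypothetical predictors and outcome.
lanemile = [100, 150, 200, 250, 300, 350, 400, 450]
produnit = [10, 18, 22, 35, 33, 48, 55, 70]
cost     = [120, 160, 230, 300, 310, 420, 460, 590]

r2_step1 = ols_r2([lanemile], cost)            # one predictor
r2_step2 = ols_r2([lanemile, produnit], cost)  # add a second predictor
print(f"R^2 with lanemile only:    {r2_step1:.3f}")
print(f"R^2 after adding produnit: {r2_step2:.3f}")
```

Because the models are nested, the second R^2 is never smaller than the first; the size of the jump is what signals whether a newly added variable matters, as with Production Units in Step 5.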
5. Final Model [without insignificant variables]:
Table D:
In Table D, I have removed the variables not significant at the 0.05 level [or stricter, i.e., 0.01 and
0.001]. I will provide the slope and statistical significance of the following variables, which directly
support the hypothesis: 1) Crew Specialization and 2) Quality of Manual Patching. I will also look at
the slope and significance of Production Units of Manual Patching, based on the roughly 20% of
variation it accounted for in the dependent variable. Lastly, I will look at the County Lane Miles
variable, which was statistically significant at the 0.001 level in Steps 4-7. Furthermore, the slope is
based on that seen in Step 7, once all independent variables are accounted for.
Cost of Manual Patching:3
            Step 4                Step 5                Step 6                Step 7
Crewspec                                                                    -3275.130  -.124***
Unitcost                                                                     1782.374   .269***
Quality                                             -8361.313  -.107*       -6482.359  -.083*
Produnit                          25.811   .646***     24.820   .622***        32.197   .807***
Snowdays   2373.561   .272***    -24.264   -.003     -250.361   -.029          97.207    .011
Matcost    1498.102    .121     1102.377    .089*    1272.666    .103*        850.520    .069*
Lanemile4   302.206    .798***   130.412    .344***   123.187    .325***       77.436    .204***
N:               67                   67                   67                      67
(Each cell shows the unstandardized coefficient followed by the standardized coefficient; * p < .05, ** p < .01, *** p < .001.)
3 Cost of Manual Patching: Dependent Variable
4 Independent variables, full names: County Lane Miles; Material Cost Index, Manual Patching; Number of Freeze/
Thaw Cycles; Number of Days [with] Snow on Ground; Production Units of Manual Patching; Quality of Manual
Patching; Labor Productivity Index; Unit Cost of Manual Patching; Percent Crew Specialization [First 50%]
• Crew Specialization:
  o Statistically significant at the 0.001 level: less than a 0.1% probability that the
  relationship with the dependent variable is due to chance.
  o Using the formula provided in Step 2 of the results, we can determine the slope is
  (-3222.830).
• Quality of Manual Patching:
  o Statistically significant at the 0.05 level: less than a 5% probability that the
  relationship with the dependent variable is due to chance.
  o Using the formula provided in Step 2 of the results, we can determine the slope is
  (-25.811).
• Production Units of Manual Patching:
  o Statistically significant at the 0.001 level [Steps 5-7]: less than a 0.1% probability
  that the relationship with the dependent variable is due to chance.
  o Using the formula provided in Step 2 of the results, we can determine the slope is
  30.534.
• County Lane Miles:
  o Statistically significant at the 0.001 level [Steps 4-7]: less than a 0.1% probability
  that the relationship with the dependent variable is due to chance.
  o Using the formula provided in Step 2 of the results, we can determine the slope is
  90.913.
All four of these independent variables were statistically significant at the 0.05 level at least once
in the multiple regression model. These results show a strong correlation between the independent
variables above and the cost of manual patching. To interpret the slope, consider an example: for
each one-unit increase in production units of manual patching, the cost of manual patching is
predicted to increase by roughly $30.53. This makes sense given the R Square statistic seen in
Table C, as well as the significance level for the quality variable. Something important to note is
the negative slope of the variables crew specialization and quality of manual patching, both of
which had negative gradients in Graphs A and B.
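This reading of the unstandardized slopes (predicted dollar change in cost per one-unit change in a predictor, holding the others constant) can be sketched with the Step 7 coefficients from Table C:

```python
# Sketch of slope interpretation. The coefficients below are the Step 7
# unstandardized values from Table C; the helper function is illustrative.

step7_slopes = {
    "produnit": 32.197,     # dollars per production unit
    "crewspec": -3275.130,  # dollars per unit of crew specialization
    "quality":  -6482.359,  # dollars per unit of quality
}

def predicted_cost_change(slopes, deltas):
    """Predicted change in cost for the given changes in predictors."""
    return sum(slopes[name] * deltas.get(name, 0.0) for name in slopes)

# 10 more production units, everything else held constant: ~ +$321.97
print(predicted_cost_change(step7_slopes, {"produnit": 10}))
# A one-unit rise in quality, everything else held constant: ~ -$6482.36
print(predicted_cost_change(step7_slopes, {"quality": 1}))
```

The negative coefficients on crew specialization and quality translate directly into the cost reductions the hypothesis predicts.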
Given that Crew Specialization and Quality of Manual Patching were statistically
significant at the 0.05 level or stricter, we can conclude that as crew specialization and quality
of patching increase, costs of patching in dollars decrease. Therefore, we can reject the null
hypothesis.
Conclusions:
Despite a wide range in descriptive statistics [minimum and maximum], the variable of Crew
Specialization deviated little: a standard deviation of just under 7% around a mean of nearly 80%.
That shows a high number of specialized crews in the 67 counties. The mean of Cost of Manual
Patching, the other variable analyzed closely for hypothesis support, was $272,421.67, with just
under a $200,000 standard deviation. This gave us the expectation of a high probability of outliers,
which was confirmed in Graphs A and B of the results. When holding for specific independent
variables, both Graphs had a negative gradient, which supported the hypothesis that as quality and
specialization go up, costs go down. The graphs showed moderate to strong correlations, with the
linear regression supported by the numbers found in the Correlation Matrix. Cost of Manual
Patching, Crew Specialization, and Quality of Manual Patching were flagged as statistically
significant at the 0.01 level or stricter, indicating less than a 1 percent probability that the
relationships between cost of patching, specialized crews, and quality of patching work arose by
chance. Both Tables C and D supported previous test results, and provided insight into the slope
and R Square statistics. The most significant finding was that Production Units accounted for over
20% of the variation in the dependent variable. Once again, R Square provided further support for
the hypothesis, showing between 1% and 7% changes in explained variation [taking into account
three variables] for the Quality and Crew variables. Based on these results, I rejected the null
hypothesis.
Limitations: Considering the Research Design [Interrupted Time Series], the limitations include
both History and Instrumentation. Looking at the original brief, there is little qualitative
knowledge as to the history of the limited or widely available transportation in the city. A city
such as Atlanta, for example, may have more potholes due to a lack of centralized transit: more
cars are on the road, and more cars create more potholes. Furthermore, cities may have
different ways of patching a pothole. It is unclear if specialized crews all have the same
instruments, if they were slowly provided updated equipment, or if it varies across the 67 districts.
Lastly, the weather variables, including freeze/thaw cycles and snow on the ground, could support
another analysis entirely of their own. Considering that both independent variables were
requested to be compared in Step 4, and not individually, it is unclear how strongly they affect
the dependent variable. There may also be other natural elements the data did not account for.
Appendix:
Table E: Correlation Matrix [Pearson Product-Moment]

Pearson Correlation:
           Cost      Produnit  Freethaw  Quality   Matcost   Snowdays  Lanemile  Unitcost  Crewspec  Laborprod
Cost       1         .872**    -.372**   -.389**   .217      -.045     .745**    -.032     -.323**   .112
Produnit   .872**    1         -.272*    -.356**   .140      .147      .574**    -.406**   -.109     .442**
Freethaw   -.372**   -.272*    1         .092      -.153     .173      -.360**   .016      .330**    -.038
Quality    -.389**   -.356**   .092      1         .044      -.189     -.272*    .223      .315**    -.155
Matcost    .217      .140      -.153     .044      1         .042      .119      .105      .067      -.031
Snowdays   -.045     .147      .173      -.189     .042      1         -.364**   -.372**   .057      .329**
Lanemile   .745**    .574**    -.360**   -.272*    .119      -.364**   1         .081      -.232     .016
Unitcost   -.032     -.406**   .016      .223      .105      -.372**   .081      1         -.139     -.834**
Crewspec   -.323**   -.109     .330**    .315**    .067      .057      -.232     -.139     1         .226
Laborprod  .112      .442**    -.038     -.155     -.031     .329**    .016      -.834**   .226      1

Sig. (2-tailed):
           Cost      Produnit  Freethaw  Quality   Matcost   Snowdays  Lanemile  Unitcost  Crewspec  Laborprod
Cost       .         .000      .002      .001      .080      .722      .000      .800      .009      .371
Produnit   .000      .         .027      .003      .261      .240      .000      .001      .388      .000
Freethaw   .002      .027      .         .461      .216      .160      .003      .899      .007      .759
Quality    .001      .003      .461      .         .726      .125      .026      .072      .010      .215
Matcost    .080      .261      .216      .726      .         .736      .337      .403      .596      .807
Snowdays   .722      .240      .160      .125      .736      .         .002      .002      .651      .007
Lanemile   .000      .000      .003      .026      .337      .002      .         .520      .061      .900
Unitcost   .800      .001      .899      .072      .403      .002      .520      .         .270      .000
Crewspec   .009      .388      .007      .010      .596      .651      .061      .270      .         .070
Laborprod  .371      .000      .759      .215      .807      .007      .900      .000      .070      .

N = 67 for pairs among the variables with no missing values (Freethaw, Quality, Matcost,
Snowdays, Lanemile); N = 65-66 for pairs involving Cost, Produnit, Unitcost, Laborprod, or
Crewspec, due to a few missing values.
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
Variable key: Cost = Cost of Manual Patching; Produnit = Production Units of Manual Patching;
Freethaw = Number of Freeze Thaw Cycles; Quality = Quality of Manual Patching; Matcost =
Material Cost Index, Manual Patching; Snowdays = Number of Days Snow on Grnd; Lanemile =
County Lane Miles; Unitcost = Unit Cost of Manual Patching; Crewspec = Percent Crew
Specialization (First 50%); Laborprod = Labor Productivity Index.