A statistical analysis of damage to property using a predictive regression model. Also an investigation to ascertain possible differences in reported divisional burglary rates using ANOVA.
Drawing on scholarly data and researchers' points of view on composite materials, we illustrate the application of composite materials in the aerospace industry. Composites are highly efficient for making aircraft parts and structures. We found that the characteristics of composite materials make them very suitable for the aerospace industry. Composites such as carbon fiber, carbon epoxy, and glass epoxy are very light and strong, and are widely used in the aircraft industry. In addition, our study takes a first step toward highlighting the uses of composite materials in manufacturing different aircraft parts.
Water jet machine report mechanical report, Mohammad Asif
A water jet cutter, also known as a water jet or waterjet, is an industrial tool capable of cutting a wide variety of materials using a very high-pressure jet of water, or a mixture of water and an abrasive substance. The term abrasive jet refers specifically to the use of a mixture of water and abrasive to cut hard materials such as metal or granite, while the terms pure water jet and water-only cutting refer to water jet cutting without added abrasives, often used for softer materials such as wood or rubber.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
4. Performed statistical analysis on a chosen data table and understood relationship amongst different data fields using IBM SPSS software.
Methodologies: Multiple linear regression, logistic regression
IBM SPSS
Chapter 9: Inferences from Two Samples
9.3 Two Means, Two Dependent Samples, Matched Pairs
Project Week 7
1.
Both graphs show a possible negative linear relationship between Cost and Annual % ROI for both majors.
2.
Regression analysis for business major
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9701
R Square             0.9410
Adjusted R Square    0.9377
Standard Error       0.0027
Observations         20.0000

ANOVA
              df        SS       MS       F          Significance F
Regression    1.0000    0.0022   0.0022   287.2207   0.0000
Residual      18.0000   0.0001   0.0000
Total         19.0000   0.0023

             Coefficients   Standard Error   t Stat         P-value      Lower 95%     Upper 95%     Lower 95.0%   Upper 95.0%
Intercept    0.11803988     0.00242949       48.58621379    0.00000000   0.11293570    0.12314405    0.11293570    0.12314405
Cost         -0.00000021    0.00000001       -16.94758619   0.00000000   -0.00000024   -0.00000019   -0.00000024   -0.00000019
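The regression equation, using the coefficients above, is Annual % ROI = 0.11803988 - 0.00000021 × Cost.
The Adjusted R Square value is 0.9377.
This means that about 93.77% of the variation in annual % ROI is explained by Cost, after adjusting for the number of predictors.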
Regression analysis for engineering major
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.97543117
R Square             0.951465967
Adjusted R Square    0.948769632
Standard Error       0.003304954
Observations         20

ANOVA
              df   SS            MS            F             Significance F
Regression    1    0.003854341   0.003854341   352.8737765   2.83396E-13
Residual      18   0.000196609   1.09227E-05
Total         19   0.00405095

             Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept    0.126782012    0.002020843      62.73719176    1.56075E-22   0.122536379    0.131027646    0.122536379    0.131027646
Cost         -2.1455E-07    1.14214E-08      -18.78493483   2.83396E-13   -2.38545E-07   -1.90554E-07   -2.38545E-07   -1.90554E-07
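The regression equation, using the coefficients above, is Annual % ROI = 0.126782012 - 2.1455E-07 × Cost.
The Adjusted R Square value is 0.948769632.
This means that about 94.88% of the variation in annual % ROI is explained by Cost, after adjusting for the number of predictors.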
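As a cross-check on the engineering-major output, R Square and F can be recomputed from the sums of squares in the ANOVA table. The sketch below is illustrative only: the constant and function names are made up for this example, and the inputs are the rounded values printed above, so the results agree with the table only to a few decimal places.

```python
# Values copied from the engineering-major ANOVA table above.
SS_REGRESSION = 0.003854341
SS_RESIDUAL = 0.000196609
SS_TOTAL = 0.00405095
DF_REGRESSION = 1
DF_RESIDUAL = 18


def r_squared(ss_regression, ss_total):
    """Proportion of total variation accounted for by the regression."""
    return ss_regression / ss_total


def f_statistic(ss_regression, df_regression, ss_residual, df_residual):
    """Ratio of the regression mean square to the residual mean square."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression / ms_residual


r2 = r_squared(SS_REGRESSION, SS_TOTAL)
f_value = f_statistic(SS_REGRESSION, DF_REGRESSION, SS_RESIDUAL, DF_RESIDUAL)
print(f"R Square = {r2:.4f}, F = {f_value:.2f}")
```

With a single predictor, F is also the square of the slope's t statistic, which gives a second way to check the table.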
3.
1. Estimated ‘Annual % ROI’ when the ‘Cost’ (X) is $160,000.
For engineering major
Therefore the predicted value is
For business major
Therefore the predicted value is
2. To test the hypothesis that
H0: β1 = 0
Ha: β1 ≠ 0
For the business major, the t-statistic is -16.94758619 with a p-value that rounds to 0.00. Since this p-value is less than 0.05, we reject the null hypothesis and conclude that β1 is significantly different from zero.
For the engineering major, the t-statistic is -18.78493483 with a p-value that rounds to 0.00. Since this p-value is less than 0.05, we reject the null hypothesis and conclude that β1 is significantly different from zero.
3. From the output above, all the regression estimates for both majors are significant, since their corresponding p-values are less than 0.05. In both cases, the coefficient of determination is high (more than 90%), indicating that most of the variation in annual % ROI is explained by cost.
The plots indicate a possibility of negative linear relationship, which is confirmed by the regression coefficient estimates. These estimates are significant as confirmed by the test of hypotheses done above. This shows that a linear regression is fit to model the ...
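The prediction at a cost of $160,000 and the slope test described above can be sketched directly from the printed coefficients. This is a minimal illustration, not the original worksheet: the function names are hypothetical, and the business-major slope is only available rounded to eight decimal places, so that prediction is approximate.

```python
def predict_roi(intercept, slope, cost):
    """Fitted simple linear regression: intercept + slope * cost."""
    return intercept + slope * cost


def t_statistic(coefficient, standard_error):
    """t statistic for H0: coefficient = 0."""
    return coefficient / standard_error


COST = 160_000

# Coefficients copied from the two SUMMARY OUTPUT tables above.
engineering_roi = predict_roi(0.126782012, -2.1455e-07, COST)
business_roi = predict_roi(0.11803988, -0.00000021, COST)

# Slope test for the engineering major, from the printed slope and its SE.
engineering_t = t_statistic(-2.1455e-07, 1.14214e-08)

print(f"Engineering: {engineering_roi:.4f}")
print(f"Business:    {business_roi:.4f}")
print(f"Engineering slope t = {engineering_t:.2f}")
```

Under these inputs the predicted annual ROI is roughly 9.2% for engineering and 8.4% for business.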
Math 104 Fall '14 Lab Assignment #4 (andreecapon)
[1] Lipitor is a drug used to control cholesterol. In clinical trials of Lipitor, 94 subjects were treated with Lipitor and 270 subjects were given a placebo. Among those who were treated with Lipitor, 7 developed infections. Among those given a placebo, 27 developed infections. Use a 0.05 significance level to test the claim that the rate of infections was the same for those treated with Lipitor and those given a placebo.
Z Test for Differences in Two Proportions

Data
Hypothesized Difference          0
Level of Significance            0.05
Group 1
  Number of Items of Interest    7
  Sample Size                    94
Group 2
  Number of Items of Interest    27
  Sample Size                    270

Intermediate Calculations
Group 1 Proportion               0.074468085
Group 2 Proportion               0.1
Difference in Two Proportions    -0.02553191
Average Proportion               0.0934
Z Test Statistic                 -0.7326

Two-Tail Test
Lower Critical Value             -1.9600
Upper Critical Value             1.9600
p-Value                          0.4638
Do not reject the null hypothesis
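The intermediate calculations in the table above follow the usual pooled two-proportion z-test. The sketch below reproduces them with the Python standard library; the function name is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for H0: p1 = p2 (two-tailed)."""
    p1 = x1 / n1
    p2 = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # the "Average Proportion" row
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Lipitor group: 7 infections out of 94; placebo group: 27 out of 270.
z, p_value = two_proportion_z_test(7, 94, 27, 270)
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

Because the p-value is well above 0.05, the sketch leads to the same decision as the table: do not reject the null hypothesis.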
[2] Simple random samples of high-interest (5.36%) mortgages and low-interest (3.77%) mortgages were obtained. For the 40 high-interest mortgages, the borrowers had a mean FICO credit score of 594.8 and a standard deviation of 12.2. For the 40 low-interest mortgages, the borrowers had a mean FICO credit score of 785.2 and a standard deviation of 16.3.
a) Use a 0.01 significance level to test the claim that the mean FICO score of borrowers with high-interest mortgages is lower than the mean FICO score of borrowers with low-interest mortgages.
b) Does the FICO credit rating score appear to affect mortgage payments? If so, how?
The test of interest here is,
Because both sample sizes are large (n = 40 each), we can appeal to the CLT for approximate normality of the sample means, so a Z-test is appropriate. The hypotheses also show that it is a left-tailed test.
The output is given below,
Z Test for Differences in Two Means

Data
Hypothesized Difference                      0
Level of Significance                        0.01
Population 1 Sample
  Sample Size                                40
  Sample Mean                                594.8
  Population Standard Deviation              12.2
Population 2 Sample
  Sample Size                                40
  Sample Mean                                785.2
  Population Standard Deviation              16.3

Intermediate Calculations
Difference in Sample Means                   -190.4
Standard Error of the Difference in Means    3.2192
Z Test Statistic                             -59.1451

Lower-Tail Test
Lower Critical Value                         -2.3263
p-Value                                      0.0000
Reject the null hypothesis
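The figures in this table can be reproduced with a short lower-tail z-test for two means. This is a sketch under the same assumption the table makes, namely that the reported standard deviations are treated as population values; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_mean_z_test_lower(mean1, sd1, n1, mean2, sd2, n2, alpha):
    """Lower-tail z-test for H1: mean1 < mean2 with known standard deviations."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    critical_value = NormalDist().inv_cdf(alpha)  # lower-tail cutoff
    p_value = NormalDist().cdf(z)
    return z, critical_value, p_value


# Population 1: high-interest mortgages; population 2: low-interest mortgages.
z, critical_value, p_value = two_mean_z_test_lower(
    594.8, 12.2, 40, 785.2, 16.3, 40, alpha=0.01
)
print(f"z = {z:.4f}, critical value = {critical_value:.4f}")
print("Reject H0" if z < critical_value else "Do not reject H0")
```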
Here the very small p-value (below the 0.01 significance level) suggests that we should reject the null hypothesis and conclude that the mean FICO score of borrowers with high-interest mortgages is lower than the mean FICO score of borrowers with low-interest mortgages.
Note that the above analysis showed that the interest rate is higher for the group with lower FICO scores. Thus the FICO credit rating scores appear to affect mortgage payments. If the FICO score is high then the interest rate is expected to be low so the mortgage payments would be low. And for low FICO score the opposite is expe ...
An unsupervised learning k-means clustering technique is used to identify and focus on a set of crimes (prostitution, narcotics, burglary, battery and interference with a public officer) recorded in the city of Chicago. The crime data is supplemented with orthogonal temperature and unemployment data. ANOVA and Kruskal-Wallis statistical tests assess the temporal significance in crime clusters. The findings indicate various crime hot spots which are temporal and location specific, and therefore may act as input to the scheduling and allocation of policing resources.
The Prepared Executive: A Linguistic Exploration, Tom Donoghue
An exploration of linguistic features in executive answers which investigates how these features might be proxies for detecting executive preparedness. The motivation is to discover hitherto unknown facets of the executive through their language.
A domain expert participates in an experiment to annotate a sample of executive answers providing a valuable ground truth. A set of models are adapted and combined to test their accuracy at predicting whether an executive's answer is unprepared.
Exploration of Call Transcripts with MapReduce and Zipf’s Law, Tom Donoghue
This study implements a proof of concept pipeline to capture web-based call transcripts and produce a word frequency dataset ready for textual analysis.
This paper describes the concept of a data lake and how it compares to a data warehouse. It reviews recent research and discusses the definition of both repositories. What types of data are catered for? Does ingesting data make it available for forging information and beyond into knowledge? What types of people, processes and tools need to be involved to realise the benefits of using a data lake?
Project report on the design and build of a data warehouse from unstructured and structured data sources (Quandl, yelp and UK Office for National Statistics) using SQL Server 2016, MongoDB and IBM Watson. Design and implementation of business intelligence visualisations using Tableau to answer cross domain business questions
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
2. CA2 Statistics
Tom Donoghue v1.0 Page 1
Table of Contents
Background
Regression
  Output from SPSS
  Results
  An Example using the model
ANOVA
  Output from SPSS
  Results
Background
Using the CSO databases and taking a dataset that provides reported crime figures for Dublin Garda
Divisions:
CJQ03 Recorded Crime Offences by Garda Division, Type of Offence and Quarter (2003Q1-2016Q2)
-Modified on 28/09/16 at 11:02
We were asked to conduct two pieces of statistical analysis on the dataset. The following sections
describe the statistical analysis conducted.
Regression
We are looking to see if we can build a model which could be used to predict damage to property
crime rates using other types of crime reported across the given 6 Garda Divisions.
Our dependent continuous outcome variable is the number of damage to property crimes. Our
independent predictor variables are also continuous and comprise Burglary, Sexual offences, and
Weapons and Explosives offences.
As an initial check, we ran scatterplots to examine the relationships between the outcome variable and the
predictors, which provided the following output:
We can see that there is a plausible linear relationship between the predictor variables and the
outcome variable. The output below provides a further preliminary check for multicollinearity and of
the relationships of the predictors and the outcome variables.
This shows that we do not have predictors that are too highly correlated (i.e. r > 0.8) with each other
and hence no multicollinearity in the data.
Taking only the predictor variables into account, the highest correlation is between Burglary and
Sexual offences (r = .275, p < .05). The predictor with the highest correlation with our outcome variable
is Burglary (r = .546, p < .001).
Output from SPSS
The model summary shows a single model as we entered all the variables simultaneously in 1 block
using Forced Entry. The rationale is that we have no research available to us that would indicate
a particular order in which we should input the predictor variables.
The R value shows the correlation between the predictors and the outcome at .812. R² shows how
much of the variability in the outcome is accounted for by the predictors. Its value of .66 means
the predictors account for 66% of the variance in damage to property crime rates.
To obtain an idea of how well our model generalises, we look at the difference between R² and Adjusted
R². The difference is .660 - .642 = .018, or 1.8%. This indicates that if the model were derived from
the population rather than the sample it would account for approximately 1.8% less variance in the
outcome.
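The shrinkage figure can be checked with the usual adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - k - 1). The sketch below assumes n = 60 cases and k = 3 predictors, which is consistent with the F(3, 56) degrees of freedom and the sample of 60 reported in this analysis; the function name is made up for the illustration.

```python
def adjusted_r_squared(r2, n_cases, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


r2 = 0.660
adj_r2 = adjusted_r_squared(r2, n_cases=60, n_predictors=3)
shrinkage = r2 - adj_r2  # loss in explained variance, as a proportion

print(f"Adjusted R-squared = {adj_r2:.3f}")
print(f"Shrinkage = {shrinkage:.3f} ({shrinkage * 100:.1f} percentage points)")
```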
The Durbin-Watson statistic indicates whether the assumption of independent errors is tenable. The
result above is 1.341, which lies between 1 and 3 (the conservative rule of thumb), so the
assumption has been met.
The ANOVA indicates that our model is a significant fit to the data overall, F (3, 56) = 36.299, p < 0.001.
This tells us that by using the model we are significantly better at
predicting values of the outcome than by using the mean.
Unstandardised b values: Burglary b = 0.456 indicates that as burglary increases by 1 unit, Damage to property will
increase by 0.456 units, with the other predictors held constant.
Sexual offences b = 1.67 indicates that as sexual offences increase by 1 unit, Damage to property will
increase by 1.67 units.
Weapons and Explosives b = 4.07 indicates that as Weapons and Explosives offences increase by
1 unit, Damage to property will increase by 4.07 units.
The standardised beta values allow direct comparison of the predictors in the model and indicate
Burglary = 0.55, Weapons and Explosives = 0.53 and Sexual offences = 0.193.
Examining the t-test section we can see that both Burglary, t(56) = 6.68, p < 0.001, and Weapons and
Explosives, t(56) = 6.53, p < 0.001, are making a significant contribution to the model. Sexual offences,
t(56) = 2.29, p < 0.05, also makes a significant contribution but less than that of the other two
predictors.
Checking for multicollinearity, the VIF values are all less than 10 which indicates that there is probably
no cause for concern. The average VIF = 1.12 which is close to 1, again indicating no probable cause
for concern.
Assessing the table below as an additional multicollinearity check, none of the
predictors have high variance proportions on the same small Eigenvalue (i.e. Dimension 4).
The Casewise diagnostics show that we have 3 cases that are treated as outliers (we lowered the residual
threshold from 3 to 2 standard deviations when selecting casewise diagnostics). In an ordinary sample we would
expect to see 95% of the standardised residuals lie within ± 2. In our sample of 60 we see 3 cases,
or 5%, that have standardised residuals outside these limits, which is in line with expectation. As 99% of the
cases should lie within ± 2.5, we would expect to see about 1% outside these limits. We have a single case,
case 51, with a standardised residual of 3.126, which we may wish to investigate further. Other than that,
the diagnostics provide no other cause for concern.
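The expected numbers of cases outside each limit follow from simple arithmetic on the sample size. The sketch below shows that arithmetic and how such cases could be counted; the residual values in the list are hypothetical placeholders (only 3.126 is taken from the report), since the full SPSS residuals are not reproduced here.

```python
def expected_outside(n_cases, coverage):
    """Expected number of cases falling outside limits with the given coverage."""
    return n_cases * (1.0 - coverage)


def count_outside(standardised_residuals, limit):
    """Count residuals whose absolute value exceeds the limit."""
    return sum(1 for r in standardised_residuals if abs(r) > limit)


N_CASES = 60
expected_beyond_2 = expected_outside(N_CASES, 0.95)    # about 3 cases
expected_beyond_2_5 = expected_outside(N_CASES, 0.99)  # about 0.6 of a case

# Hypothetical residuals for illustration; 3.126 is the value reported for case 51.
example_residuals = [0.4, -2.3, 3.126, 1.1]
print(expected_beyond_2, expected_beyond_2_5)
print(count_outside(example_residuals, 2.0))
```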
Charts
The histogram and P-P plot below indicate the normality of residuals. There is some deviation in the
P-P plot and the histogram has a few gaps, which could be improved by increasing the sample size.
In the scatterplots below we check for homoscedasticity and linearity. The zpred v zresid and partial
plots are as follows:
This random scatter pattern indicates that the assumptions of linearity and homoscedasticity
have been met.
This scatter pattern indicates that the assumptions of linearity (positive relationship to
Damage to property) and homoscedasticity (dots are well spaced out with no outliers) have
been met.
This scatter pattern shows a positive relationship to Damage to property (although slightly
less linear than the other predictors), and the assumption of homoscedasticity (dots are well
spaced out with no outliers) has been met.
This scatter pattern indicates a positive linear relationship to Damage to property, so the
assumptions of linearity and homoscedasticity (dots are well spaced out with no outliers)
have been met.
Results
Linear model of predictors of Damage to Property crime rates, with a 95% confidence interval.
Step 1                             B (95% CI)                  SE B     β      p
(Constant)                         -10.285 (-99.83, 79.26)     44.70
Burglary                           0.456 (0.32, 0.59)          .07      .55    p < 0.001
Sexual Offences                    1.67 (0.21, 3.13)           .73      .19    p < 0.05
Weapons and Explosives Offences    4.07 (2.82, 5.31)           .62      .53    p < 0.001
Note: R² = .66
An Example using the model
Damage to property = -10.285 + (0.456 * Burglary) + (1.67 * Sexual offences) + (4.07 * Weapons and Explosives offences)
As an example using the equation:
Burglary = 383
Sexual Offence = 20
Weapons and Explosives = 60
Damage to property = -10.285 + (0.456 * 383) + (1.67 * 20) + (4.07 * 60)
= -10.285 + 174.65 + 33.4 + 244.2
= 442
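The worked example can also be expressed as a small function, which makes it easy to try other inputs. This is an illustrative sketch using the coefficients from the results table; the function and parameter names are invented for this example.

```python
# Unstandardised coefficients from the results table above.
INTERCEPT = -10.285
B_BURGLARY = 0.456
B_SEXUAL_OFFENCES = 1.67
B_WEAPONS_EXPLOSIVES = 4.07


def predict_damage_to_property(burglary, sexual_offences, weapons_explosives):
    """Predicted damage to property count from the fitted linear model."""
    return (
        INTERCEPT
        + B_BURGLARY * burglary
        + B_SEXUAL_OFFENCES * sexual_offences
        + B_WEAPONS_EXPLOSIVES * weapons_explosives
    )


prediction = predict_damage_to_property(
    burglary=383, sexual_offences=20, weapons_explosives=60
)
print(round(prediction))
```

For the inputs above the function returns approximately 441.96, which rounds to the 442 shown in the worked example.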
ANOVA
We are investigating whether there is a difference in reported burglary rates between the 6 Dublin Garda
Divisions.
There is one categorical independent variable with 6 levels of the factor (representing the 6 Garda
divisions).
There is one continuous dependent variable, which is burglary and related offences. Sample sizes are
n = 10 for each Garda division. We are assuming that the observations are independent.
Preliminary tests were conducted to check for a normal distribution using an SPSS histogram and Q-Q plot, as
seen below:
Fig 1. Histogram
Fig. 2 Q-Q Plot
The histogram is symmetrical and more or less bell shaped indicating normality. The Q-Q plot also
indicates normality with the dots lying along the diagonal line.
Our Null Hypothesis is that there is no difference between the means of the 6 Garda Divisions for
burglary rates. Our Alternative Hypothesis is that at least one of the means is different. Due to our
relatively small sample sizes we decided to set an alpha value of 0.01 to reduce the risk of Type I error.
Output from SPSS
The Descriptives give us a sanity check and confirm our k groups and n sample sizes. The
Levene statistic for homogeneity of variance was not significant (p > 0.05), so the assumption of equal variances is tenable.
The omnibus results indicate that the groups are significantly different F (5, 54) = 11.445, p < .001, but
we need to examine the Post Hoc Tests to discover which of the comparisons is different.
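The omnibus F ratio and its degrees of freedom come from partitioning variation into between-group and within-group parts. The sketch below implements that calculation in plain Python on a small toy dataset; the toy values are hypothetical and are not the Garda burglary figures, which are not reproduced here. It also confirms the degrees of freedom for a design with 6 groups of 10.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical toy data, three groups of three observations.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within = one_way_anova(toy_groups)
print(f"F({df_between}, {df_within}) = {f_value:.3f}")

# Degrees of freedom for 6 divisions with 10 observations each.
design_df = (6 - 1, 6 * 10 - 6)
print(design_df)
```

The second print confirms that a design with 6 groups of n = 10 gives the (5, 54) degrees of freedom reported for the omnibus test.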
The Post Hoc tests were Tukey HSD, which compares each group to all remaining groups, hence
indicating whether there is a significant difference between the means. A Bonferroni post hoc
test was included in the test options but may be too strict, as we have already lowered the alpha level
to 0.01 and we could be risking making Type II errors. For the purposes of this analysis we will use
Tukey.
Results
A one-way between-groups ANOVA was carried out to investigate reported burglary crimes between
Dublin Garda Divisions. The Garda Divisions comprised 6 groups (61, D.M.R. South Central;
62, D.M.R. North Central; 63, D.M.R. Northern; 64, D.M.R. Southern; 65, D.M.R.
Eastern; 66, D.M.R. Western). There was a statistically significant difference at the p < 0.01 level in
the burglary crimes reported for the 6 groups: F (5, 54) = 11.445, p < .001. Post Hoc comparisons using
the Tukey HSD test indicated that the mean score for Group 61 (M = 391, SD = 73.5) did not differ
significantly from any of the remaining groups. However, Group 62 (M = 236, SD = 46.9) was
significantly different from Groups 63 (M = 542.2, SD = 175.2), 64 (M = 561, SD = 164.6), 65 (M = 564,
SD = 133.2) and 66 (M = 564, SD = 133.2).
As a result we reject the Null Hypothesis.