This document discusses correlation and regression analysis. It introduces correlation as a measure of the strength of association between two variables. The correlation coefficient r ranges from -1 to 1, where values closer to 1 or -1 indicate a stronger linear relationship. Regression analysis is used to model the relationship between a response variable and one or more predictor variables. A simple linear regression fits a straight line to the data to summarize the relationship between two continuous variables. Diagnostic plots are used to check assumptions of the regression model such as linearity and homoscedasticity of residuals.
2. INTRODUCTION
• We deal with data that consist of random pairs
(or sets) of observations. The elements in each pair
or set are observations from the same subject
Ways to deal with such data
• Ignore any relation between the variables and
analyze them separately
• Use correlation to describe the intensity of
association between the two variables
• Use regression analysis to assess the degree and
nature of association between the variables
3. CORRELATION
• If we have two continuous variables X and Y
we can summarize them with five parameters
namely
• The two means (µx, µy)
• The two variances (σx², σy²) and
• The covariance (σxy)
4. Covariance
• The sample covariance is calculated as the
sum of cross products (of deviations from the means)
divided by the degrees of freedom, n − 1:
$s_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)$
5. Properties of covariance
• If large values of X pair with large values of Y
and small values of X with small values of Y, the
covariance will be positive.
• If large values of X go with small values of Y and
vice versa, the covariance will be negative
• If X and Y are independent, the population covariance
is zero (the sample covariance will be close to zero)
6. Correlation coefficient
• The correlation coefficient can replace the covariance
without any loss of information. The population parameter
is denoted by ρ, while the sample statistic is denoted by r
8. Properties of correlation coefficient
• The value of r is always between -1 and 1.
• Positive values indicate a positive association
between the variables
• Negative values indicate a negative
association between the variables
• If r=1 or r =-1 then all of the cases fall on a
straight line
9. Coefficient of determination
• This is the square of the correlation
coefficient.
• Recall that the total sum of squares is a
measure of variability of a variable
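10. Cont.
• The sum of squares may be given as
$SS_{to} = \sum_{i=1}^{n} (y_i - \bar{y})^2$
• Similarly, the sum of cross products
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
is a measure of the total joint variability of X and Y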
11. Cont
• The measure of variability of Y over and above
that of the joint variability of both X and Y
(SS[Y|X]) is called the sum of squares due to
regression of Y on X, denoted SSr
• It can be shown that r² is the ratio of SSr to
SSto
12. Cont.
• The coefficient of determination is therefore a
measure of the proportion of the variability in Y
that is explained by the variable X
13. Properties of r²
• The coefficient of determination lies between 0
and 1
• When the variables are highly correlated r² is
near 1, and it is near 0 when they are not
correlated.
14. Example
• Consider the following data
X Y
9 0
9 9
8 1
5 1
7 9
• Find the variance of X and of Y, and the covariance of X and Y
• Find the correlation coefficient and the coefficient of
determination
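The exercise above can be worked through with a short script. This is a minimal sketch that takes the five (X, Y) pairs exactly as they appear in the transcript and uses n − 1 degrees of freedom for the sample variances and covariance; the variable names are illustrative only.

```python
from math import sqrt

# Data from the example slide, read as five (X, Y) pairs
x = [9, 9, 8, 5, 7]
y = [0, 9, 1, 1, 9]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross products of deviations from the means
ss_x = sum((xi - mean_x) ** 2 for xi in x)
ss_y = sum((yi - mean_y) ** 2 for yi in y)
ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Sample variances and covariance divide by n - 1 degrees of freedom
var_x = ss_x / (n - 1)
var_y = ss_y / (n - 1)
cov_xy = ss_xy / (n - 1)

# Correlation coefficient and coefficient of determination
r = cov_xy / sqrt(var_x * var_y)
r_squared = r ** 2

print(f"var(x) = {var_x:.3f}, var(y) = {var_y:.3f}, cov(x, y) = {cov_xy:.3f}")
print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
```

For these data the variances are 2.8 and 21, the covariance is 1.25, and r is about 0.163, so X explains under 3% of the variability in Y.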
16. Sampling distribution of r
• The sampling distribution is only symmetric
when the parameter ρ=0.
• It becomes skewed as ρ moves away from 0
• Hence we cannot apply the CLT directly to r when
computing confidence intervals for ρ or in hypothesis
testing
• As a rough rule of thumb, two variables are regarded
as correlated if |r| > 0.5 and the sample is large enough
17. Testing hypothesis about ρ
Test H0:ρ=0
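• Recall that if ρ=0 then the two variables are
not linearly correlated
• The test assesses whether there is correlation
between the variables. The test statistic is
$t = r\sqrt{n-2} / \sqrt{1-r^2}$
which follows a t distribution with n − 2 degrees of
freedom under H0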
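A minimal sketch of the test of H0: ρ = 0, assuming the standard statistic t = r√(n−2)/√(1−r²) with n − 2 degrees of freedom. The values of r and n are hypothetical, the function name is illustrative, and the critical value 3.182 is the two-sided 5% point of t with 3 d.f. taken from standard tables.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, referred to t with n - 2 d.f."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# Hypothetical illustration: r = 0.163 observed in n = 5 pairs
r, n = 0.163, 5
t = correlation_t_statistic(r, n)

# Two-sided 5% critical value of t with 3 d.f. (standard t tables)
T_CRIT_3DF = 3.182
reject_h0 = abs(t) > T_CRIT_3DF

print(f"t = {t:.3f} on {n - 2} d.f., reject H0 at 5%: {reject_h0}")
```

With so few observations a weak sample correlation gives a small t value, so H0 is not rejected here.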
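18. Hypothesis testing cont.
Test H0: ρ = ρ0, where ρ0 is not equal to zero.
We transform r to z' and the test statistic is
$Z = (z' - \zeta_0) / \sigma_z$
where $z' = \frac{1}{2}\ln\frac{1+r}{1-r}$, $\zeta_0 = \frac{1}{2}\ln\frac{1+\rho_0}{1-\rho_0}$ and $\sigma_z = 1/\sqrt{n-3}$
A 95% C.I. on the transformed scale is given by z' ± 1.96×σz;
its limits are transformed back to obtain the interval for ρ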
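The transformation approach can be sketched as follows, assuming the transformation meant here is Fisher's z' = ½ ln((1 + r)/(1 − r)) with standard error 1/√(n − 3). The inputs r = 0.8, n = 28 and ρ0 = 0.5 are hypothetical, and the function names are illustrative.

```python
from math import atanh, sqrt, tanh

def fisher_z_statistic(r, rho0, n):
    """Z statistic for H0: rho = rho0 using Fisher's transformation."""
    sigma_z = 1 / sqrt(n - 3)
    # atanh(v) equals 0.5 * ln((1 + v) / (1 - v))
    return (atanh(r) - atanh(rho0)) / sigma_z

def fisher_confidence_interval(r, n, z_crit=1.96):
    """Approximate 95% C.I. for rho, back-transformed from the z' scale."""
    z_prime = atanh(r)
    sigma_z = 1 / sqrt(n - 3)
    lower = z_prime - z_crit * sigma_z
    upper = z_prime + z_crit * sigma_z
    # tanh is the inverse of the transformation
    return tanh(lower), tanh(upper)

# Hypothetical inputs for illustration
r, n, rho0 = 0.8, 28, 0.5
z_stat = fisher_z_statistic(r, rho0, n)
ci_low, ci_high = fisher_confidence_interval(r, n)

print(f"Z = {z_stat:.3f}")
print(f"95% C.I. for rho: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval is symmetric around z' on the transformed scale but not around r after back-transformation, which reflects the skewness of the sampling distribution of r.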
20. REGRESSION
• A model of the relationship between some
covariates and an outcome.
• Often used in exploratory settings
• Can sometimes be used for confirmatory studies
• A regression line is an equation that describes
the relationship between a response variable
y (outcome) and an explanatory variable x
(covariate).
21. Regression continued
• Statistical relationships may be linear,
exponential, polynomial, logarithmic, etc.
• The simplest form is the linear one
• Linear means linear in the coefficients, i.e. y is
a linear function of the coefficients
• Non-linear relations can often be modified into
forms that are approximately linear through
a transformation
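As an illustration of the last point, an exponential relation y = a·exp(b·x) becomes linear after taking logs: ln y = ln a + b·x. The sketch below uses hypothetical noise-free data generated with a = 2 and b = 0.5, and an illustrative helper fit_line that performs ordinary least squares.

```python
from math import exp, log

def fit_line(x, y):
    """Least-squares intercept and slope for a straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# Hypothetical data from y = 2 * exp(0.5 * x), without noise
x = [0, 1, 2, 3, 4]
y = [2 * exp(0.5 * xi) for xi in x]

# Fit a straight line to (x, ln y), then transform the intercept back
log_intercept, b = fit_line(x, [log(yi) for yi in y])
a = exp(log_intercept)

print(f"a = {a:.3f}, b = {b:.3f}")
```

With real data the log transformation also changes the error structure, so the assumptions of the linear model should be checked on the transformed scale.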
22. Simple linear regression
• A linear relationship may be summarized using an
equation
$y = \beta_0 + \beta_1 x$
where $\beta_0$ is the intercept and $\beta_1$ the slope of the
line.
• For observation i (i = 1, 2, …, 10) whose value of
the explanatory variable is $x_i$ one would expect
the corresponding response $y_i$ to be such that
$E(y_i) = \beta_0 + \beta_1 x_i$
23. • The statistical model fitted for a simple linear
regression is of the form
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, i = 1, ..., n
where $\varepsilon_i$ is the random error term
24. EXAMPLE
Study of the effect of temperature on the rate of development
of the potato leafhopper, Empoasca fabae. The response (y)
was the mean length of the development period (in days)
from egg to adult.
Temperature (F) Mean length (days)
59.8 30.2
67.6 27.3
70.0 26.8
70.4 23.3
74.0 19.1
75.3 19.0
78.0 16.5
80.4 15.9
81.4 14.8
83.2 14.2
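A minimal sketch of fitting the simple linear regression to the ten observations above by ordinary least squares. The variable names and the predict helper are illustrative and not part of the original slides.

```python
# Data from the leafhopper example: temperature (F) and mean length (days)
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n

# Sum of squares of x and sum of cross products
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))

# Least-squares estimates of the slope and intercept
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(t):
    """Fitted mean development length (days) at temperature t."""
    return intercept + slope * t

print(f"intercept = {intercept:.2f}, slope = {slope:.4f}")
print(f"fitted length at 70 F = {predict(70.0):.2f} days")
```

The estimates agree with the Genstat output reported in this transcript (constant 78.09, temp −0.7753): each additional degree of temperature shortens the mean development period by roughly 0.78 days over the observed range.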
25. Mean length of development period of
potato leafhopper versus temperature
[Figure: scatter plot of mean length of development period (days) against temperature (°F)]
27. Output from Genstat
Regression Analysis
Response variate: length
Fitted terms: Constant, temp
Summary of analysis
Source        d.f.     s.s.      m.s.      v.r.    F pr.
Regression    1      282.28   282.282   120.85   <.001
Residual      8       18.69     2.336
Total         9      300.97    33.441
Estimates of parameters
Parameter   estimate    s.e.      t(8)    t pr.
Constant     78.09      5.24     14.90    <.001
temp         -0.7753    0.0705  -10.99    <.001
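The quantities in the Genstat output can be reproduced from the raw data. The sketch below assumes the usual simple linear regression formulas (residual mean square on n − 2 d.f., standard errors based on the sum of squares of x); the variable names are illustrative.

```python
from math import sqrt

# Leafhopper data: temperature (F) and mean development length (days)
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Analysis of variance: total = regression + residual
ss_total = sum((y - mean_y) ** 2 for y in length)
ss_regression = s_xy ** 2 / s_xx
ss_residual = ss_total - ss_regression
ms_residual = ss_residual / (n - 2)
variance_ratio = ss_regression / ms_residual  # F on 1 and n - 2 d.f.

# Standard errors and t statistics of the estimates
se_slope = sqrt(ms_residual / s_xx)
se_intercept = sqrt(ms_residual * (1 / n + mean_x ** 2 / s_xx))
t_slope = slope / se_slope
t_intercept = intercept / se_intercept

# Coefficient of determination as the ratio SSr / SSto
r_squared = ss_regression / ss_total

print(f"SSr = {ss_regression:.2f}, SSres = {ss_residual:.2f}, SSto = {ss_total:.2f}")
print(f"v.r. = {variance_ratio:.2f}, r^2 = {r_squared:.3f}")
print(f"slope s.e. = {se_slope:.4f}, t = {t_slope:.2f}")
print(f"constant s.e. = {se_intercept:.2f}, t = {t_intercept:.2f}")
```

With a single predictor the variance ratio equals the square of the t statistic for the slope, so the two tests are equivalent.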
28. ASSUMPTIONS
• The error terms have constant variance
(homoscedasticity)
• The error terms are independent
• The error terms are normally distributed
• The regression function is linear
• There are no unduly influential outliers
• All important independent variables are included in the model
– These assumptions must be checked
29. How to check the assumptions
• Diagnostic plots
• The plots tell you whether the regression is
even appropriate.
• They include univariate plots, bivariate plots and
residual analysis plots
30. Univariate plots of X and Y
• To look for outliers
• Examine the shape of the distribution
• These include box plots, stem plots, histograms and
dot plots for x and y
31. Bivariate plots
• Plots of X vs Y
• Is the relationship between the two variables
linear?
• Are there two dimensional outliers?
• Does the assumption of constant variance
look reasonable?
32. Plots of residuals versus X
• Useful for detecting non-linearity
• Any observable pattern in the residuals versus
X plot indicates a problem with the model
assumptions.
33. Plot the residuals versus Y' (the fitted values)
• For one predictor variable it carries the same
information as the previous plot.
• For multiple linear regression the plot lets us
examine patterns of the residuals with
increasing fitted response.
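The residual plots described above need the residuals and fitted values. The sketch below computes them for the leafhopper data from the earlier example and prints them as a table, which could then be passed to any plotting tool; the variable names are illustrative.

```python
# Leafhopper data: temperature (F) and mean development length (days)
temp = [59.8, 67.6, 70.0, 70.4, 74.0, 75.3, 78.0, 80.4, 81.4, 83.2]
length = [30.2, 27.3, 26.8, 23.3, 19.1, 19.0, 16.5, 15.9, 14.8, 14.2]
n = len(temp)

mean_x = sum(temp) / n
mean_y = sum(length) / n
s_xx = sum((x - mean_x) ** 2 for x in temp)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, length))
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Fitted values and residuals (observed minus fitted)
fitted = [intercept + slope * x for x in temp]
residuals = [y - f for y, f in zip(length, fitted)]

print(f"{'temp':>6} {'fitted':>8} {'residual':>9}")
for x, f, e in zip(temp, fitted, residuals):
    print(f"{x:6.1f} {f:8.2f} {e:9.2f}")

# Least-squares residuals sum to zero and are uncorrelated with x,
# so any remaining pattern points to non-linearity or unequal variance
print(f"sum of residuals = {sum(residuals):.6f}")
```

Plotting the residual column against the temp column gives the residuals versus X plot, and against the fitted column gives the residuals versus fitted values plot; with one predictor the two show the same pattern, mirrored here because the slope is negative.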