This document summarizes key concepts in statistical hypothesis testing and confidence intervals involving t-distributions. It explains when to use a t-distribution rather than a standard normal distribution (specifically, when the population standard deviation is unknown and must be estimated from a sample), works through hypothesis tests and confidence intervals for a single population mean when the sample size is small, and includes examples involving paired data. Key formulas are given for t-tests, confidence intervals, and the t-distribution itself.
2. Recall: Single population mean (large n)

Hypothesis test:
$$Z = \frac{\bar{x}_{\text{observed}} - \mu_{\text{null}}}{s/\sqrt{n}}$$

Confidence interval:
$$\bar{x}_{\text{observed}} \pm Z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
3. Single population mean (small n, normally distributed trait)

Hypothesis test:
$$T_{n-1} = \frac{\bar{x}_{\text{observed}} - \mu_{\text{null}}}{s/\sqrt{n}}$$

Confidence interval:
$$\bar{x}_{\text{observed}} \pm t_{n-1,\,\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
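Slide 14 later introduces SAS; as a preview, here is a minimal SAS sketch of this small-n test and interval computed from summary statistics. The numbers (n = 10, x̄ = 52, s = 8, μ0 = 50) are hypothetical, chosen only for illustration:

    data _null_;
      n = 10; xbar = 52; s = 8; mu0 = 50;   /* hypothetical summary statistics */
      se = s / sqrt(n);                     /* standard error of the mean */
      t  = (xbar - mu0) / se;               /* T statistic with n-1 df */
      p  = 2 * (1 - probt(abs(t), n - 1));  /* two-sided p-value */
      tcrit = tinv(0.975, n - 1);           /* t(n-1, alpha/2) for a 95% CI */
      lo = xbar - tcrit * se;
      hi = xbar + tcrit * se;
      put t= p= lo= hi=;
    run;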
4. What is a T-distribution?

A t-distribution is like a Z distribution, except it has slightly fatter tails to reflect the uncertainty added by estimating σ. The bigger the sample size (i.e., the more data used to estimate σ), the closer t becomes to Z. If n > 100, t approaches Z.
10. Student’s t Distribution

[Figure: t-distributions with df = 5 and df = 13 plotted against the standard normal (t with df = ∞).]

t-distributions are bell-shaped and symmetric, but have “fatter” tails than the normal.
Note: t → Z as n increases.
from “Statistics for Managers” Using Microsoft® Excel 4th Edition, Prentice-Hall 2004
11. Student’s t Table

Upper-Tail Area:

    df    .25      .10      .05
     1    1.000    3.078    6.314
     2    0.817    1.886    2.920
     3    0.765    1.638    2.353

The body of the table contains t values, not probabilities.

Example: let n = 3, so df = n - 1 = 2. With α = .10, each tail has area α/2 = .05, and the table gives t = 2.920.
from “Statistics for Managers” Using Microsoft® Excel 4th Edition, Prentice-Hall 2004
12. t distribution values

With comparison to the Z value:

    Confidence level   t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
    .80                1.372         1.325         1.310         1.28
    .90                1.812         1.725         1.697         1.64
    .95                2.228         2.086         2.042         1.96
    .99                3.169         2.845         2.750         2.58

Note: t → Z as n increases.
from “Statistics for Managers” Using Microsoft® Excel 4th Edition, Prentice-Hall 2004
13. The T probability density function

What does t look like mathematically? (You may at least recognize some resemblance to the normal distribution function…)

$$f(t) = \frac{\Gamma\!\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\;\Gamma\!\left(\frac{v}{2}\right)}\left(1 + \frac{t^{2}}{v}\right)^{-\frac{v+1}{2}}$$

Where:
v is the degrees of freedom
Γ (gamma) is the Gamma function
π is the constant Pi (3.14...)
14. The t-distribution in SAS

Yikes! The t-distribution looks like a mess! We don’t want to integrate! Luckily, there are charts and SAS! You MUST SPECIFY THE DEGREES OF FREEDOM! The t-function in SAS is:

    probt(t-statistic, df)
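For example, a one-line sketch (probt returns the left-tail probability P(T ≤ t)):

    data _null_;
      p = probt(-1.11, 39);   /* P(T with 39 df <= -1.11), roughly 0.14 */
      put p=;
    run;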
15. The normality assumption…

T-tests (and all linear models, in fact) have a “normality assumption”: if the outcome variable is not normally distributed and the sample size is small, a t-test is inappropriate. It takes longer for the CLT to kick in, and the sample means do not immediately follow a t-distribution. This is the source of the “normality assumption” of the t-test.
16. Computer simulation of the distribution of the sample mean (non-normal, small n):

1. Pick any probability distribution and specify a mean and standard deviation.
2. Tell the computer to randomly generate 1000 samples (of size n) from that probability distribution. E.g., the computer is more likely to spit out values with high probabilities.
3. Calculate 1000 T-statistics:
$$T = \frac{\bar{X} - \mu}{s_X/\sqrt{n}}$$
4. Plot the T-statistics in histograms.
5. Repeat for different sample sizes (n’s).
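A minimal SAS sketch of these steps; the exponential distribution, sample size, and seed are illustrative choices, not from the slides:

    data tsim;
      call streaminit(1234);
      n  = 5;                              /* small sample size */
      mu = 1;                              /* true mean of the chosen distribution */
      do rep = 1 to 1000;
        sum = 0; ssq = 0;
        do i = 1 to n;
          x = rand('EXPONENTIAL');         /* non-normal distribution, mean 1 */
          sum = sum + x;
          ssq = ssq + x*x;
        end;
        xbar = sum / n;
        s = sqrt((ssq - n*xbar**2) / (n - 1));
        t = (xbar - mu) / (s / sqrt(n));   /* T-statistic for this sample */
        output;
      end;
      keep rep t;
    run;

    proc sgplot data=tsim;                 /* histogram of the 1000 T-statistics */
      histogram t;
    run;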
22. Conclusions

If the underlying data are not normally distributed AND n is small**, the means do not follow a t-distribution (so using a t-test will result in erroneous inferences). Data transformation or non-parametric tests should be used instead.

**How small is too small? There is no hard and fast rule; it depends on the true shape of the underlying distribution. Here n > 30 (closer to 100) is needed.
23. Practice Problem:

A manufacturer of light bulbs claims that its light bulbs have a mean life of 1520 hours with an unknown standard deviation. A random sample of 40 such bulbs is selected for testing. If the sample produces a mean value of 1505 hours and a sample standard deviation of 86, is there sufficient evidence to claim that the mean life is significantly less than the manufacturer claimed? Assume that light bulb lifetimes are roughly normally distributed.
24. Answer

1. What is your null hypothesis?
   Null hypothesis: mean life = 1520 hours
   Alternative hypothesis: mean life < 1520 hours
2. What is your null distribution? Since we have to estimate the standard deviation, we need to make inferences from a T-curve with 39 degrees of freedom:
$$\bar{X} \sim T_{39}\left(1520,\ s_{\bar{x}} = \frac{86}{\sqrt{40}} = 13.5\right)$$
3. Empirical evidence: one random sample of 40 has a mean of 1505 hours.
4. Calculate the p-value of what you observed:
$$t_{39} = \frac{1505 - 1520}{13.5} = -1.11$$
$$p\text{-value} = P(T_{39} \le -1.11) = .137$$
5. Probably not sufficient evidence to reject the null. We cannot sue the light bulb manufacturer for false advertising! Notice that using the t-distribution to calculate the p-value didn’t change much! With n > 30, might as well use the Z table.
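A quick SAS check of step 4’s p-value, using probt from slide 14:

    data _null_;
      t = (1505 - 1520) / (86 / sqrt(40));  /* t = -1.11 */
      p = probt(t, 39);                     /* one-sided p-value, about .137 */
      put t= p=;
    run;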
25. Practice problem

You want to estimate the average ages of kids that ride a particular kid’s ride at Disneyland. You take a random sample of 8 kids exiting the ride, and find that their ages are: 2, 3, 4, 5, 6, 6, 7, 7. Assume that ages are roughly normally distributed.

a. Calculate the sample mean.
b. Calculate the sample standard deviation.
c. Calculate the standard error of the mean.
d. Calculate the 99% confidence interval.
26. Answer (a, b)

a. Calculate the sample mean:
$$\bar{X} = \frac{1}{8}\sum_{i=1}^{8} X_i = \frac{2+3+4+5+6+6+7+7}{8} = \frac{40}{8} = 5.0$$

b. Calculate the sample standard deviation:
$$s_X^2 = \frac{\sum_{i=1}^{8}(X_i - \bar{X})^2}{8-1} = \frac{(-3)^2+(-2)^2+(-1)^2+0^2+1^2+1^2+2^2+2^2}{7} = \frac{24}{7} = 3.4$$
$$s_X = \sqrt{3.4} \approx 1.9$$
28. Answer (d)

d. Calculate the 99% confidence interval:
$$\bar{X} \pm t_{df,\,\alpha/2} \cdot s_{\bar{x}} = 5.0 \pm t_{7,\,.005} \times 0.67 = 5.0 \pm 3.50(0.67) = (2.65,\ 7.35)$$
where t7,.005 = 3.5 and the standard error of the mean (from part c) is 1.9/√8 = 0.67.
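The same interval can be reproduced in SAS from the raw ages (a sketch; PROC TTEST with alpha=0.01 reports the 99% CI):

    data ages;
      input age @@;
      datalines;
    2 3 4 5 6 6 7 7
    ;
    run;

    proc ttest data=ages alpha=0.01;   /* 99% confidence interval for the mean */
      var age;
    run;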
29. Example problem, class data: a two-tailed hypothesis test

A researcher claims that Stanford affiliates eat fewer than the recommended intake of 5 fruits and vegetables per day. We have data to address this claim: 24 people in the class provided data on their daily fruit and vegetable intake. Do we have evidence to dispute her claim?
30. Histogram of fruit and veggie intake (n = 24)…

Mean = 3.7 servings
Median = 3 servings
Mode = 3 servings
Std Dev = 1.7 servings
31. Answer

1. Define your hypotheses (null, alternative):
   H0: average servings = 5.0
   Ha: average servings ≠ 5.0 (two-sided)
2. Specify your null distribution:
$$\bar{X}_{24} \sim T_{23}\left(5.0,\ \frac{1.7}{\sqrt{24}} = 0.34\right)$$
32. Answer, continued

3. Do an experiment: observed mean in our experiment = 3.7 servings.
4. Calculate the p-value of what you observed:
$$T_{23} = \frac{3.7 - 5}{0.34} = -3.8$$
The T23 critical value for p < .05, two-tailed, is 2.07, so the p-value < .05.
5. Reject or fail to reject (~accept) the null hypothesis: Reject! Stanford affiliates eat significantly fewer than the recommended servings of fruits and veggies.
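A SAS check of this two-sided p-value (a sketch):

    data _null_;
      t = (3.7 - 5) / (1.7 / sqrt(24));  /* t = -3.8 */
      p = 2 * probt(t, 23);              /* two-sided p-value, well below .05 */
      put t= p=;
    run;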
34. Paired data (repeated measures)

    Patient   BP Before (diastolic)   BP After
       1              100               92
       2               89               84
       3               83               80
       4               98               93
       5              108               98
       6               95               90

What about these data? How do you analyze them?
35. Example problem: paired t-test

    Patient   Diastolic BP Before   D. BP After   Change
       1              100               92          -8
       2               89               84          -5
       3               83               80          -3
       4               98               93          -5
       5              108               98         -10
       6               95               90          -5

Null Hypothesis: Average Change = 0
37. Example problem: paired t-test

Changes: -8, -5, -3, -5, -10, -5

95% CI:
$$-6 \pm 2.571 \times (1.0) = (-8.571,\ -3.43)$$

Note: does not include 0.
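In SAS, the paired analysis can be sketched with PROC TTEST’s PAIRED statement (after - before gives the change):

    data bp;
      input before after @@;
      datalines;
    100 92  89 84  83 80  98 93  108 98  95 90
    ;
    run;

    proc ttest data=bp;      /* paired t-test and 95% CI for the mean change */
      paired after*before;   /* analyzes the differences after - before */
    run;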
38. Summary: Single population mean (small n, normality)

Hypothesis test:
$$t_{n-1} = \frac{\bar{x}_{\text{observed}} - \mu_{\text{null}}}{s_x/\sqrt{n}}$$

Confidence interval:
$$\bar{x}_{\text{observed}} \pm t_{n-1,\,\alpha/2} \cdot \frac{s_x}{\sqrt{n}}$$
39. Summary: paired t-test

Hypothesis test:
$$t_{n-1} = \frac{\bar{d}_{\text{observed}} - 0}{s_d/\sqrt{n}}$$

Confidence interval:
$$\bar{d}_{\text{observed}} \pm t_{n-1,\,\alpha/2} \cdot \frac{s_d}{\sqrt{n}}$$

Where d = change over time or difference within a pair.
40. Summary: Single population mean (large n)

Hypothesis test:
$$Z \approx t_{n-1} = \frac{\bar{x}_{\text{observed}} - \mu_{\text{null}}}{s_x/\sqrt{n}}$$

Confidence interval:
$$\bar{x}_{\text{observed}} \pm \left[t_{n-1,\,\alpha/2} \approx Z_{\alpha/2}\right] \cdot \frac{s_x}{\sqrt{n}}$$
41. Examples of Sample Statistics:

Single population mean (known σ)
Single population mean (unknown σ)
Single population proportion
Difference in means (t-test)
Difference in proportions (Z-test)
Odds ratio/risk ratio
Correlation coefficient
Regression coefficient
…
42. Recall: normal approximation to the binomial…

Statistics for proportions are based on a normal distribution, because the binomial can be approximated as normal if np > 5.
43. Recall: stats for proportions

For the binomial count X:
$$\mu_x = np \qquad \sigma_x^2 = np(1-p) \qquad \sigma_x = \sqrt{np(1-p)}$$

For the proportion p̂ = X/n (“p-hat” stands for “sample proportion”):
$$\mu_{\hat{p}} = p \qquad \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n} \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

The binomial and proportion means differ by a factor of n, and so do their standard deviations.
44. Sampling distribution of a sample proportion

$$\hat{p} \sim Normal\left(p,\ \sqrt{\frac{p(1-p)}{n}}\right)$$

Estimated standard error:
$$s_{\hat{p}} = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Always a normal distribution!
p = true population proportion. BUT… if you knew p you wouldn’t be doing the experiment!
45. Practice Problem

A fellow researcher claims that at least 15% of smokers fail to eat any fruits and vegetables at least 3 days a week. You find this hard to believe and decide to check the validity of this statistic by taking a random (representative) sample of smokers. Do you have sufficient evidence to reject your colleague’s claim if you discover that 17 of the 200 smokers in your sample eat no fruits and vegetables at least 3 days a week?
46. Answer

1. What is your null hypothesis?
   Null hypothesis: p = proportion of smokers who skip fruits and veggies frequently ≥ .15
   Alternative hypothesis: p < .15
2. What is your null distribution?
   Var(p̂) = .15 × .85/200 = .00064; SD(p̂) = .025
   p̂ ~ N(.15, .025)
3. Empirical evidence: one random sample: p̂ = 17/200 = .085
4. Z = (.085 − .15)/.025 = −2.6; p-value = P(Z < −2.6) = .0047
5. Sufficient evidence to reject the claim.
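A SAS sketch of this Z calculation:

    data _null_;
      phat = 17 / 200;                           /* observed proportion = .085 */
      p0   = 0.15;                               /* null proportion */
      z    = (phat - p0) / sqrt(p0*(1-p0)/200);  /* z = -2.6 */
      p    = probnorm(z);                        /* lower-tail p-value, ~.0047 */
      put z= p=;
    run;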
47. OR, use computer simulation…

1. Have SAS randomly pick 200 observations from a binomial distribution with p = .15 (the null).
2. Divide the resulting count by 200 to get the observed sample proportion.
3. Repeat this 1000 times (or some arbitrarily large number of times).
4. Plot the resulting distribution of sample proportions in a histogram, as sketched below:
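A minimal SAS sketch of this simulation (the seed is arbitrary):

    data psim;
      call streaminit(5678);
      do rep = 1 to 1000;
        x = rand('BINOMIAL', 0.15, 200);  /* count of successes under the null */
        phat = x / 200;                   /* simulated sample proportion */
        output;
      end;
      keep rep phat;
    run;

    proc sgplot data=psim;   /* histogram of the 1000 simulated proportions */
      histogram phat;
    run;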
48. How often did we get observed values of 0.085 or lower when the true p = .15?

Only 4/1000 times! Empirical p-value = .004.
49. Practice Problem

In Saturday’s newspaper, in a story about poll results from Ohio, the article said that 625 people in Ohio were sampled and claimed that the margin of error in the results was 4%. Can you explain where that 4% margin of error came from?
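One standard explanation (a sketch, assuming a 95% confidence level and the conservative worst case p = .5):

$$\text{margin of error} = Z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \approx 1.96\sqrt{\frac{(.5)(.5)}{625}} = 1.96 \times .02 \approx .04$$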
51. Paired data proportions test…

Analogous to the paired t-test… Also takes on a slightly different form known as McNemar’s test (we’ll see lots more on this next term…)
52. Paired data proportions test…

1000 subjects were treated with antidepressants for 6 months and with placebo for 6 months (order of treatment was randomly assigned). Question: do suicide attempts (yes/no) differ depending on whether a subject is on antidepressants or on placebo?
53. Paired data proportions test…

Data:
15 subjects attempted suicide in both conditions (non-informative)
10 subjects attempted suicide in the antidepressant condition but not the placebo condition
5 subjects attempted suicide in the placebo condition but not the antidepressant condition
970 did not attempt suicide in either condition (non-informative)

The data boil down to 15 informative observations… In 10/15 cases (66.6%), antidepressant > placebo.
54. Paired proportions test…

Single proportions test: under the null hypothesis, antidepressants and placebo work equally well. So,
H0: among discordant cases, p(antidepressant > placebo) = 0.5
Observed p̂ = .666

$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} = \frac{.666 - .5}{\sqrt{\frac{(.5)(.5)}{15}}} = 1.29; \quad p > .05$$

Not enough evidence to reject the null!
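A SAS sketch of this calculation:

    data _null_;
      phat = 10 / 15;                           /* discordant pairs where antidepressant > placebo */
      p0   = 0.5;                               /* null: p(antidepressant > placebo) = 0.5 */
      z    = (phat - p0) / sqrt(p0*(1-p0)/15);  /* z = 1.29 */
      p    = 1 - probnorm(z);                   /* upper-tail p-value, about .10 */
      put z= p=;
    run;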
55. Key one-sample Hypothesis Tests…

Test for H0: μ = μ0:
$$t_{n-1} = \frac{\bar{x} - \mu_0}{s_x/\sqrt{n}}$$

Test for H0: p = p0:
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$$

T with n-1 df approaches Z for large n.
** If np (expected value) < 5, use the exact binomial rather than the Z approximation…
56. Corresponding confidence intervals…

For a mean:
$$\bar{x} \pm t_{n-1,\,\alpha/2} \cdot \frac{s_x}{\sqrt{n}}$$

For a proportion:
$$\hat{p} \pm Z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

T with n-1 df approaches Z for large n.
** If np (expected value) < 5, use the exact binomial rather than the Z approximation…
57. Symbol overload!

n: Sample size
Z: Z-statistic (standard normal)
t_df: T-statistic (t-distribution with df degrees of freedom)
p̂ (“p-hat”): sample proportion
X̄ (“X-bar”): sample mean
s: Sample standard deviation
p0: Null hypothesis proportion
μ0: Null hypothesis mean