The document discusses the multiple statistical comparisons problem and techniques for controlling error rates when performing many hypothesis tests on data. It introduces the family-wise error rate (FWER) and the false discovery rate (FDR), and methods such as the Šidák correction, the Bonferroni correction, and the Benjamini-Hochberg procedure for controlling them. It also discusses how p-value distributions can be used to estimate the FDR and compute q-values. Interactive demonstrations are provided to illustrate key concepts such as Type I and Type II errors.
1. ... and are you sure?
Multiple statistical comparisons problem
Jiří Haviger
jiri.haviger@uhk.cz
May 12, 2018
Jiří Haviger (jiri.haviger@uhk.cz) ... and are you sure? May 12, 2018 1 / 24
3. Introduction
... and are you sure?
4. Basic idea of inferential statistics: inference, confidence intervals and p-value
Inference
Demonstration of sample-means distributions, shiny.rit.albany.edu
Demonstration of sample-means distributions, rpsychologist.com
5. Basic idea of inferential statistics: inference, confidence intervals and p-value
Confidence intervals
Q: How do we estimate a population characteristic from a known sample? As a point? As an interval?
Probability theory gives us the probability density function (PDF) of a sample statistic across different samples (e.g. the Student t distribution of sample means).
we have: a sample with statistical characteristics (n, x̄, sd, ...)
we have: α, the probability of error we are willing to accept (usually α = 0.05)
to do: from the sample information n, x̄, sd and α, transform the sample characteristics into a variable with a known distribution, e.g. t = (x̄ − µ) · √n / s
to do: based on the PDF and t, determine a confidence interval for the characteristic (e.g. CI(µ))
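The confidence-interval recipe above can be sketched in a few lines of Python. This is a minimal illustration, not part of the talk: it uses the large-sample normal approximation in place of the Student t quantile, and the sample numbers (x̄ = 100, sd = 15, n = 36) are invented.

```python
from statistics import NormalDist

def mean_confidence_interval(xbar, sd, n, alpha=0.05):
    """CI(mu) = xbar +/- z_{1-alpha/2} * sd / sqrt(n) (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    half_width = z * sd / n ** 0.5
    return xbar - half_width, xbar + half_width

lo, hi = mean_confidence_interval(xbar=100.0, sd=15.0, n=36, alpha=0.05)
print(round(lo, 2), round(hi, 2))  # roughly 95.10 104.90
```

For small n the t quantile should replace the z quantile; the structure of the computation is the same.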
7. Hypothesis testing: the hypothesis testing process
Hypothesis testing process
Q: Does our sample come from the population described by the null hypothesis?
we have: an idea about the population (from theory, intuition, government, ...)
we have: a sample with statistical characteristics (n, x̄, sd, ...)
we have: α, the probability of error we are willing to accept (usually α = 0.05)
to do: formulate the null and alternative hypotheses
to do: determine the probability that our sample comes from the null-hypothesis population → the p-value (sig.)
to do: compare the sample p-value with the α level:
p-value < α → reject the null hypothesis
p-value ≥ α → retain the null hypothesis.
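The testing process just described (formulate H0, compute a p-value, compare with α) can be sketched as a two-sided one-sample z-test in plain Python. This is an illustrative stand-in for the t-based test implied by the earlier slide, and all numbers are made up.

```python
from statistics import NormalDist

def one_sample_ztest(xbar, mu0, sd, n, alpha=0.05):
    """Two-sided one-sample z-test: H0: mu = mu0 vs HA: mu != mu0."""
    z = (xbar - mu0) * n ** 0.5 / sd           # standardized test statistic
    pvalue = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    decision = "reject H0" if pvalue < alpha else "retain H0"
    return pvalue, decision

p, decision = one_sample_ztest(xbar=103.0, mu0=100.0, sd=15.0, n=100)
print(round(p, 3), decision)  # ~0.046, reject H0
```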
8. Hypothesis testing: two possible errors
Two possible errors
Q: Which mistakes can I make in null hypothesis testing?
null hypothesis rejected correctly (True Positive, TP)
null hypothesis rejected incorrectly (False Positive, FP, Type I error)
null hypothesis retained correctly (True Negative, TN)
null hypothesis retained incorrectly (False Negative, FN, Type II error)
Terminology: H0 is rejected ∼ test is positive ∼ discovery

                          test result about H0 rejection
                          positive (discovery)   negative
reality    H0 false       TP                     FN
           H0 true        FP                     TN

Online demonstration of the two types of error
9. Hypothesis testing: two possible errors
Two errors
10. Hypothesis testing: power of analysis, sample size, effect size
Power of a test

                          test result about H0
                          positive (discovery)   negative
reality    H0 false       TP (power, 1 − β)      FN (β)
           H0 true        FP (α)                 TN

At the "basic level of statistics" you set α, the probability of false positive results (e.g. false positive diagnoses of cancer).
At the "advanced level of statistics" you compute the minimal required sample size from given α, β and effect size.
Four quantities are in relation: α, β, effect size and sample size;
if the effect size and sample size are fixed, then decreasing α implies increasing β.
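The four-way relation between α, β, effect size and sample size can be probed numerically. Here is a minimal sketch for a two-sided one-sample z-test (a simplification of what dedicated power software computes; the effect size and n are arbitrary). It shows the trade-off from the slide: at fixed effect size and n, lowering α lowers the power 1 − β.

```python
from statistics import NormalDist

def ztest_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.
    Ignores the negligible chance of rejecting in the wrong direction."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * n ** 0.5 - z_crit)

p1 = ztest_power(0.5, 40, alpha=0.05)  # ~0.89
p2 = ztest_power(0.5, 40, alpha=0.01)  # smaller: stricter alpha costs power
print(round(p1, 3), round(p2, 3))
```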
11. Hypothesis testing: power of analysis, sample size, effect size
Software for power analysis
G*Power, packages for R or Python, ...
12. Multiple comparisons problem: introduction
More tests
Q: What happens to the probability of false positives if we use more than one test?
for one test: the probability of a false positive result is
P(FP) = α
for two tests: the probability of at least one false positive result is
P(FP1 or FP2) = P(FP1) + P(FP2) − P(FP1 and FP2) = 1 − P(¬FP1 and ¬FP2) = 1 − (1 − α) · (1 − α) = 1 − (1 − α)²
for m tests: the probability of at least one false positive result is
P(FP1 or ... or FPm) = 1 − (1 − α)^m
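The formula 1 − (1 − α)^m is easy to evaluate directly; for example, at α = 0.05 the chance of at least one false positive already exceeds 50% with m = 14 independent tests.

```python
def prob_at_least_one_fp(alpha, m):
    """P(FP1 or ... or FPm) = 1 - (1 - alpha)^m for m independent tests."""
    return 1 - (1 - alpha) ** m

print(prob_at_least_one_fp(0.05, 1))   # exactly alpha
print(prob_at_least_one_fp(0.05, 14))  # already above 0.5
```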
13. Multiple comparisons problem: family-wise error rate correction
More tests
Q: Relationship between the number of tests m and P(FP1 or ... or FPm) = 1 − (1 − α)^m
14. Multiple comparisons problem: family-wise error rate correction
Basic alpha correction
Q: How do we change α → αcorr so that P(FP1 or ... or FPm) will be α?
P(FP1 or ... or FPm) should be α:
1 − (1 − αcorr)^m = α
αcorr = 1 − (1 − α)^(1/m)
This αcorr is called the Šidák correction, named after the Czech statistician Zbyněk Šidák (see wiki); we will write it αsid.
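A quick numerical check of the Šidák correction: testing each of m hypotheses at the per-test level 1 − (1 − α)^(1/m) restores the family-wise level back to α.

```python
def sidak_alpha(alpha, m):
    """Sidak-corrected per-test level: alpha_sid = 1 - (1 - alpha)**(1/m)."""
    return 1 - (1 - alpha) ** (1 / m)

a_sid = sidak_alpha(0.05, 10)     # per-test level, ~0.0051
fwer = 1 - (1 - a_sid) ** 10      # family-wise rate recovers 0.05
print(round(a_sid, 4), round(fwer, 4))
```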
15. Multiple comparisons problem: family-wise error rate correction
Bonferroni correction
Q: What about the Bonferroni correction αbonf = α/m?
It is a linear approximation of the Šidák correction:
αsid = 1 − (1 − α)^(1/m)
Laurent series at m = ∞: αsid ≈ −log(1 − α)/m + O(1/m²)
Taylor series at α = 0: −log(1 − α)/m ≈ α/m + O(α²)
Practically there is no difference in using
αsid ≈ α/m = αbonf
The αsid and αbonf corrections are based on the number of all tests.
The Bonferroni correction is named after the Italian mathematician Carlo Emilio Bonferroni.
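The claim that the two corrections are practically indistinguishable is easy to verify; for α = 0.05 and m = 20 the two thresholds differ only in the fifth decimal place.

```python
alpha, m = 0.05, 20
a_sid = 1 - (1 - alpha) ** (1 / m)   # Sidak: ~0.0025613
a_bonf = alpha / m                   # Bonferroni: 0.0025
print(a_sid, a_bonf, a_sid - a_bonf)  # difference ~6e-5
```

Bonferroni is slightly more conservative (its threshold is a bit smaller), which is why it never exceeds the nominal FWER even for dependent tests.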
16. Multiple comparisons problem: the two types of errors again...
Balance between FP and FN
Q: And what about β?
Online demonstration of the two types of error
decreasing α → increasing β
increasing β → increasing probability of FN → the test goes "blind"
how to balance FP against FN depends on the problem being solved
sometimes it is better to reduce FP
e.g. in justice: no one falsely imprisoned
sometimes it is better to reduce FN
e.g. in brain disorders: detecting some disorders correctly and some wrongly is better than detecting no disorders at all
17. Multiple comparisons problem: the two types of errors again...
Balance between FP and FN
Q: What if we have thousands of tests?
Šidák and Bonferroni control false positives over all results:
Family-Wise Error Rate (FWER), FWER = P(at least one FP)
FWER corrections are strict and tend to make the test blind
another point of view is needed, so what about ...
... controlling the false positive rate only among the discoveries:
False Discovery Rate (FDR), FDR = FP/(TP + FP)
18. Multiple comparisons problem: controlling the False Discovery Rate
Benjamini-Hochberg algorithm
Q: How do we control the FDR at a predefined level α over m tests?
Benjamini-Hochberg algorithm for independent tests:
1 run all tests and determine all p-values
2 sort the p-values from the smallest: P[i]
3 compute the linear series C[i] = α · i/m
4 set k as the largest i for which P[i] ≤ C[i]
5 αbh = α · k/m
αbh is based on the number of all tests and the concrete p-value series.
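The five steps above translate directly into code. A minimal sketch of the Benjamini-Hochberg procedure for independent tests; the p-values in the example are invented.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by BH at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # step 2: sort p-values
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:  # steps 3-4: P[i] vs C[i] = alpha*i/m
            k = rank                        # keep the largest rank under the line
    return sorted(order[:k])                # reject the k smallest p-values

rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.60])
print(rejected)  # [0, 1]
```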
19. Multiple comparisons problem: controlling the False Discovery Rate
Benjamini-Hochberg visualization
20. Multiple comparisons problem: controlling the False Discovery Rate
p-value distribution
Q: Why does αBH use the number of all tests if it controls only the FDR?
we do not know which p-values come from discoveries and which do not, but ...
we can construct the p-value distribution
from the definition of p-values we know:
all p-values under H0 have a uniform distribution on (0, 1)
all p-values under HA have a decreasing distribution, from a peak (close to 0) down to zero (close to 1)
all p-values together follow the mixture of the two
21. Multiple comparisons problem: controlling the False Discovery Rate
p-value distribution
22. Multiple comparisons problem: controlling the False Discovery Rate
p-value distribution and q-values
Q: So is it possible to use the p-value distribution to control the FDR?
Determining q-values from the p-value distribution (Storey):
1 sort the p-values from the smallest: P[i]
2 create a density plot of P[i] on (0, 1) with step 0.05 (or smaller)
3 determine π0 from the right part of the density: the proportion of true H0 among all tests
4 compute the q-values Q[k] as the false discovery rate
5 select the maximal Q[k] such that Q[k] ≤ α
6 αst = P[k]
7 αst is based on the distribution of the p-values
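Storey's recipe above can be sketched as follows: π0 is estimated from the flat right tail of the p-value distribution (p > λ), and the q-values are cumulative minima taken from the largest p-value downward. The p-values and the choice λ = 0.5 are illustrative, not from the talk.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Sketch of Storey's q-values: pi0 from the p > lam tail, then
    q[i] = min over ranks >= rank(i) of pi0 * m * P[rank] / rank."""
    m = len(pvalues)
    # step 3: fraction of p-values in the "uniform" tail estimates pi0
    pi0 = min(1.0, sum(p > lam for p in pvalues) / ((1 - lam) * m))
    order = sorted(range(m), key=lambda i: pvalues[i])  # step 1: sort
    q = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):   # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pi0 * m * pvalues[i] / rank)  # step 4
        q[i] = running
    return pi0, q

pi0, q = storey_qvalues([0.001, 0.01, 0.30, 0.60, 0.80, 0.90])
print(pi0, [round(v, 3) for v in q])
```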
23. Multiple comparisons problem: controlling the False Discovery Rate
Computational Psycholinguistic Analysis of Czech Text
Two examples of p-value distributions from our research
24. Finish: Questions?
Web sources, contact
https://xkcd.com/882/
https://shiny.rit.albany.edu/stat/confidence/
http://rpsychologist.com/d3/CI/
http://varianceexplained.org/statistics/interpreting-pvalue-histogram/
http://qvalue.princeton.edu/
Jiří Haviger
ResearchGate, ORCID, LinkedIn ...
e: jiri.haviger@uhk.cz