Making believable causal claims can be difficult, especially given the oft-repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation, such as counterfactual reasoning and the application of Hill’s criteria in a data science setting. We will walk through specific examples and provide R code for all methods discussed.
7. D A T A D A Y T E X A S 2 0 1 8
✌ parts
1. Discuss some ways to strengthen a causal
argument: Hill’s criteria
2. Discuss a specific causal inference method:
propensity scores + sensitivity analyses
8. D A T A D A Y T E X A S 2 0 1 8
Hill’s Criteria
https://commons.wikimedia.org/wiki/File:Austin_Bradford_Hill.jpg
9. D A T A D A Y T E X A S 2 0 1 8
http://nssdeviations.com
http://livefreeordichotomize.com/2016/12/15/hill-for-the-data-scientist-an-xkcd-story/
10. D A T A D A Y T E X A S 2 0 1 8
Strength
https://xkcd.com/539/
11. D A T A D A Y T E X A S 2 0 1 8
https://xkcd.com/242/
Consistency
12. D A T A D A Y T E X A S 2 0 1 8
https://xkcd.com/1217/
Specificity
13. D A T A D A Y T E X A S 2 0 1 8
https://xkcd.com/925/
Temporality
14. D A T A D A Y T E X A S 2 0 1 8
https://xkcd.com/323/
Biological
gradient
15. D A T A D A Y T E X A S 2 0 1 8
https://xkcd.com/605/
Plausibility
16. D A T A D A Y T E X A S 2 0 1 8
Coherence
https://xkcd.com/1170/
17. D A T A D A Y T E X A S 2 0 1 8
Analogy
https://xkcd.com/882/
18. D A T A D A Y T E X A S 2 0 1 8
Experiment
https://xkcd.com/1462/
19. D A T A D A Y T E X A S 2 0 1 8
What if you can’t
do a controlled
experiment?
20. D A T A D A Y T E X A S 2 0 1 8
Observational Studies
Goal: Answer a research question
exposure → outcome
21. D A T A D A Y T E X A S 2 0 1 8
“Gold Standard” – Randomized Controlled Trials
Treatment (P = 0.5)
Control (P = 0.5)
Observational Studies
22. D A T A D A Y T E X A S 2 0 1 8
“Gold Standard” – Randomized Controlled Trials
A (P = 0.5)
B (P = 0.5)
Observational Studies
23. D A T A D A Y T E X A S 2 0 1 8
Treatment (P = 0.9)
Control (P = 0.1)
Observational Studies
24. D A T A D A Y T E X A S 2 0 1 8
Exposed (P = 0.8)
Unexposed (P = 0.2)
Observational Studies
25. D A T A D A Y T E X A S 2 0 1 8
Exposed (P = 0.8)
Unexposed (P = 0.2)
Observational Studies
26. D A T A D A Y T E X A S 2 0 1 8
Exposed (P = 0.9)
Unexposed (P = 0.1)
Observational Studies
29. D A T A D A Y T E X A S 2 0 1 8
Confounding
[DAG: Confounder → Exposure, Confounder → Outcome, Exposure → Outcome]
30. D A T A D A Y T E X A S 2 0 1 8
Confounding
[DAG: Confounder → Exposure, Confounder → Outcome, Exposure → Outcome]
31. D A T A D A Y T E X A S 2 0 1 8
Meaningful confounders
1. How imbalanced is the confounder between the
exposure groups?
2. How predictive is the confounder of the outcome?
32. D A T A D A Y T E X A S 2 0 1 8
Confounding
[DAG: Confounder → Exposure, Confounder → Outcome, Exposure → Outcome]
33. D A T A D A Y T E X A S 2 0 1 8
Meaningful confounders
1. How imbalanced is the confounder between the
exposure groups?
2. How predictive is the confounder of the outcome?
34. D A T A D A Y T E X A S 2 0 1 8
Confounding
[DAG: Confounder → Exposure, Confounder → Outcome, Exposure → Outcome]
36. D A T A D A Y T E X A S 2 0 1 8
Rosenbaum and Rubin showed that, in observational studies, conditioning on propensity scores can lead to unbiased estimates of the exposure effect, provided:
1. There are no unmeasured confounders
2. Every subject has a nonzero probability of
receiving either exposure
Propensity scores
37. D A T A D A Y T E X A S 2 0 1 8
Rosenbaum and Rubin showed that, in observational studies, conditioning on propensity scores can lead to unbiased estimates of the exposure effect, provided:
1. There are no unmeasured confounders
2. Every subject has a nonzero probability of
receiving either exposure
Propensity scores
38. D A T A D A Y T E X A S 2 0 1 8
• Fit a logistic regression predicting exposure using known covariates:
$\Pr(\text{exposure} = 1) = \frac{1}{1 + \exp(-X\beta)}$
• Each individual’s predicted value is their propensity score
Propensity scores
39. D A T A D A Y T E X A S 2 0 1 8
Propensity scores
glm(exposure ~ smoker + cowboy_hat + sick . . .,
data = df,
family = binomial)
40. D A T A D A Y T E X A S 2 0 1 8
Propensity scores
glm(exposure ~ smoker + cowboy_hat + sick . . .,
data = df,
family = binomial) %>%
predict(type = "response")
41. D A T A D A Y T E X A S 2 0 1 8
Propensity scores
glm(exposure ~ smoker + cowboy_hat + sick . . .,
data = df,
family = binomial) %>%
predict(type = "response")
#> 1 2 3 4 5
#> 0.6967404 0.6370019 0.7195434 0.9079320 0.5905542
42. D A T A D A Y T E X A S 2 0 1 8
Propensity scores
p <- glm(exposure ~ smoker + cowboy_hat + sick . . .,
data = df,
family = binomial) %>%
predict(type = "response")
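With the scores saved in p, a quick sketch (not from the original deck) of how one might eyeball the positivity assumption from slide 37, that every subject has a nonzero probability of receiving either exposure:

range(p)                  # scores should stay away from 0 and 1
boxplot(p ~ df$exposure)  # distributions should overlap between groups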
43. D A T A D A Y T E X A S 2 0 1 8
[Figure: each subject labeled with an estimated propensity score, values ranging from 0.1 to 0.9]
Propensity scores
44. DA T A DA Y T E X A S 2 0 1 8
[Figure: the subjects' propensity scores, shown by exposure group]
Propensity scores
45. DA T A DA Y T E X A S 2 0 1 8
[Figure: the two exposure groups aligned by propensity score; each group shows 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9]
Propensity scores
46. DA T A DA Y T E X A S 2 0 1 8
Propensity Scores
• Weighting
• Matching
• Stratification
• Direct Adjustment
• …
47. DA T A DA Y T E X A S 2 0 1 8
Propensity Scores
• Weighting
• Matching (see the sketch after this list)
• Stratification
• Direct Adjustment
• …
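The talk's code uses weighting; for the matching option listed above, a minimal hedged sketch with the MatchIt package (not part of the original deck; covariates borrowed from the glm() call earlier) could look like:

library(MatchIt)
# 1:1 nearest-neighbor matching on a propensity score that
# matchit() estimates internally via logistic regression
m <- matchit(exposure ~ smoker + cowboy_hat + sick,
             data = df,
             method = "nearest")
matched_df <- match.data(m)  # matched subset for the outcome model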
48. DA T A DA Y T E X A S 2 0 1 8
Weighting
!"#$ =
min{*+, 1 − *+}
0+*+ + (1 − 0+)(1 − *+)
49. DA T A DA Y T E X A S 2 0 1 8
Weighting
p_0 <- 1 - p
# matching weight: min{p, 1 - p} over the probability of the
# exposure actually received
df$weight <- pmin(p, p_0) /
  ifelse(df$exposure == 1, p, p_0)
50. DA T A DA Y T E X A S 2 0 1 8
[Figure: propensity scores in each exposure group: 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9]
Imbalance between exposures
51. D A T A D A Y T E X A S 2 0 1 8
[Figure: propensity scores in each exposure group: 0.1 0.2 0.2 0.3 0.6 0.7 0.8 0.9]
Propensity Scores
Imbalance between exposures
52. DA T A DA Y T E X A S 2 0 1 8
Standardized mean difference
$d = \frac{\bar{x}_{\text{exposed}} - \bar{x}_{\text{unexposed}}}{\sqrt{\left(s^2_{\text{exposed}} + s^2_{\text{unexposed}}\right)/2}}$
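Before the survey-weighted version on the next slides, a minimal base-R sketch of this formula for a single covariate (unweighted; x and exposed are hypothetical vectors):

# standardized mean difference between exposure groups
smd <- function(x, exposed) {
  m1 <- mean(x[exposed == 1]); m0 <- mean(x[exposed == 0])
  v1 <- var(x[exposed == 1]);  v0 <- var(x[exposed == 0])
  (m1 - m0) / sqrt((v1 + v0) / 2)
}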
53. DA T A DA Y T E X A S 2 0 1 8
Standardized Mean Difference
library(survey)
svy_des <- svydesign(
ids = ~ 1,
data = df,
weights = ~ weight)
54. DA T A DA Y T E X A S 2 0 1 8
Standardized Mean Difference
library(tableone)
smd_table <- svyCreateTableOne(
vars = c("smoker", "cowboy_hat", "sick", . . .),
strata = "exposure",
data = svy_des,
test = FALSE)
print(smd_table, smd = TRUE)
55. DA T A DA Y T E X A S 2 0 1 8
Exposed Unexposed SMD
n 1897.07 1900.24
mustache 60.71 (16.95) 60.90 (15.61) 0.012
cowboy hat 204.59 (101.45) 202.90 (106.44) 0.016
glasses 70.77 (27.00) 70.51 (27.50) 0.01
smoker 37.46 (10.93) 37.40 (11.28) 0.005
crazy hair 2.27 (2.46) 2.27 (1.82) 0.003
smile 2.49 (5.48) 2.47 (4.80) 0.003
sick 30.92 (8.23) 30.95 (7.54) 0.003
harry potter 3.04 (0.69) 3.03 (0.93) 0.01
healthy 145.4 (7.7) 152.2 (8.0) 0.013
Standardized Mean Difference
56. DA T A DA Y T E X A S 2 0 1 8
Standardized Mean Difference
library(tableone)
smd_table_unweighted <- CreateTableOne(
vars = c("smoker", "cowboy_hat", "sick", . . .),
strata = "exposure",
data = df,
test = FALSE)
print(smd_table_unweighted, smd = TRUE)
58. DA T A DA Y T E X A S 2 0 1 8
Love Plots
Imbalance between
exposures
59. DA T A DA Y T E X A S 2 0 1 8
Standardized Mean Difference
library(tidyr)
plot_df <- data.frame(
var = names(ExtractSmd(smd_table)),
Unadjusted = ExtractSmd(smd_table_unweighted),
Weighted = ExtractSmd(smd_table)) %>%
gather("Method", "SMD", -var)
60. DA T A DA Y T E X A S 2 0 1 8
Standardized Mean Difference
library(ggplot2)
ggplot(
data = plot_df,
mapping = aes(x = var, y = SMD, group = Method, color = Method)
) +
geom_line() +
geom_point() +
geom_hline(yintercept = 0.1, color = "black", size = 0.1) +
coord_flip()
62. DA T A DA Y T E X A S 2 0 1 8
Outcome model
svyglm(
outcome ~ exposure,
design = svy_des,
family = binomial)
63. DA T A DA Y T E X A S 2 0 1 8
Outcome model
svyglm(
outcome ~ exposure,
design = svy_des,
family = binomial) %>%
confint(parm = "exposure")
64. DA T A DA Y T E X A S 2 0 1 8
Outcome model
svyglm(
outcome ~ exposure,
design = svy_des,
family = binomial) %>%
confint(parm = "exposure") %>%
exp()
#> 2.5% 97.5%
#> exposure 1.51 2.01
65. DA T A DA Y T E X A S 2 0 1 8
Unmeasured confounding
66. DA T A DA Y T E X A S 2 0 1 8
All you need
✓ Exposure-outcome effect
✓ Exposure-unmeasured confounder effect
✓ Outcome-unmeasured confounder effect
67. DA T A DA Y T E X A S 2 0 1 8
All you need
✓ Exposure-outcome effect: generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio
✓ Exposure-unmeasured confounder effect
✓ Outcome-unmeasured confounder effect
Odds Ratio Confidence Interval: (1.5, 2)
68. DA T A DA Y T E X A S 2 0 1 8
All you need
✓ Exposure-outcome effect: generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio
✓ Exposure-unmeasured confounder effect
✓ Outcome-unmeasured confounder effect
69. DA T A DA Y T E X A S 2 0 1 8
Tipping point analyses
What would it take to tip our confidence bound across 1?
70. DA T A DA Y T E X A S 2 0 1 8
All you need
✓ Exposure-outcome effect: generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio
✓ Exposure-unmeasured confounder effect: can use my observed variables to estimate this
✓ Outcome-unmeasured confounder effect
71. DA T A DA Y T E X A S 2 0 1 8
All you need
✓ Exposure-outcome effect: generally estimated from a model, for example an odds ratio, hazard ratio, or risk ratio
✓ Exposure-unmeasured confounder effect: can use my observed variables to estimate this
✓ Outcome-unmeasured confounder effect
72. DA T A DA Y T E X A S 2 0 1 8
Exposure-unmeasured
confounder effect
• Prevalence of the unmeasured confounder among the exposed
• Prevalence of the unmeasured confounder among the unexposed
73. DA T A DA Y T E X A S 2 0 1 8
Love Plots
!" = 0.3
!' = 0.1
74. DA T A DA Y T E X A S 2 0 1 8
Confounding
[DAG: unmeasured Confounder → Exposure, Confounder → Outcome, Exposure → Outcome]
$p_1 = 0.3$, $p_0 = 0.1$
Confounder-outcome effect: ?
Odds Ratio Confidence Interval: (1.5, 2)
75. DA T A DA Y T E X A S 2 0 1 8
Tipping point analyses
library(tipr)
tip_with_binary(
p0 = 0.1,
p1 = 0.3,
lb = 1.5,
ub = 2)
76. DA T A DA Y T E X A S 2 0 1 8
Tipping point analyses
library(tipr)
tip_with_binary(
p0 = 0.1,
p1 = 0.3,
lb = 1.5,
ub = 2)
#> [1] 4.333333
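Where does 4.33 come from? A sketch, assuming the Lin, Psaty & Kronmal-style correction factor (reference 4 in the deck's reference list): the lower confidence bound is tipped when $(p_1(\gamma - 1) + 1)/(p_0(\gamma - 1) + 1)$ equals the observed bound of 1.5. Solving for $\gamma$:

p1 <- 0.3; p0 <- 0.1; lb <- 1.5
# solve (p1 * (g - 1) + 1) / (p0 * (g - 1) + 1) == lb for g
gamma <- 1 + (lb - 1) / (p1 - lb * p0)
gamma
#> [1] 4.333333

This matches the tip_with_binary() output above.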
77. DA T A DA Y T E X A S 2 0 1 8
“A hypothetical unobserved binary confounder that is prevalent in 30% of the exposed population and 10% of the unexposed population would need to have an association with Y of 4.33 to tip this analysis at the 5% level, rendering it inconclusive.”
78. DA T A DA Y T E X A S 2 0 1 8
✌ parts
1. Discuss some ways to strengthen a causal
argument: Hill’s criteria
2. Discuss a specific causal inference method:
propensity scores + sensitivity analyses
79. DA T A DA Y T E X A S 2 0 1 8
https://xkcd.com/552/
80. DA T A DA Y T E X A S 2 0 1 8
R
• survey
• tableone
• tidyr
• ggplot2
• tipr
81. DA T A DA Y T E X A S 2 0 1 8
References
1. Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (2009). Smoking and lung cancer: recent evidence and a discussion of some questions. 1959. International Journal of Epidemiology, 38, 1175–1191. http://doi.org/10.1093/ije/dyp289
2. Schlesselman, J. J. (1978). Assessing effects of confounding variables. American Journal of Epidemiology, 108(1), 3–8.
3. Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. http://doi.org/10.2307/2335942
4. Lin, D. Y., Psaty, B. M., & Kronmal, R. A. (1998). Assessing the sensitivity of regression results to unmeasured confounders in observational studies. Biometrics, 54(3), 948–963.
5. https://cran.r-project.org/web/packages/tableone/vignettes/smd.html
82. DA T A DA Y T E X A S 2 0 1 8
Thank you!
@LucyStats
http://bit.ly/LucyStatsDDTX18
lucymcgowan.com