Probability Homework Help

Probability Homework Help
For any Homework related queries, Call us at : - +1 678 648 4277
You can mail us at : - info@statisticshomeworkhelper.com or
reach us at : - https://www.statisticshomeworkhelper.com/

Using probability to check the possibility of an event happening
Here, we use probability to see the connection between sleep and stroke. Also we will
use probability to see the connection between two types of cancers.
PROBLEM 1:
a) A number of studies have been conducted to investigate the association between sleep
duration and risk of stroke in middle-aged and older adults. One such study was
conducted in a sample of 31,000 participants with an average age of 62 years.
Information on sociodemographic characteristics and sleep duration was obtained by a
self-administered questionnaire. The study took place over 6 years, during which 1,557
cases of stroke were documented. The risk of stroke was higher in participants who
reported sleeping at least 9 hours per night, in comparison to the group who reported
sleeping 7 to 8 hours per night; the relative risk was 1.23.
A news article reporting the results stated: “Sleeping a lot may increase the risk for
stroke. A new study has found that compared with sleeping seven to eight hours a night,
sleeping nine or more hours increased the relative risk for stroke by 23 percent.
Maintaining appropriate sleep duration might hold great promise as primary prevention
of stroke.”
statisticshomeworkhelper.com

Write a short response to the newspaper editor explaining clearly why the article is
potentially misleading. Be sure to use language accessible to a general audience
without a statistics background. Limit your answer to at most six sentences.
b) Suppose that you have purchased a 40-piece box of salt water taffy; taffy is a type of
soft, chewy candy that became popular in the United States in the 1800s. The box is
advertised as containing a random assortment of 8 flavors. Calculate the probability
that at least one flavor is missing from the box. If using an algebraic approach, explain
your reasoning and any necessary assumptions. If using a simulation approach, be sure
to clearly comment any code and describe your logic.
c) Accurately distinguishing lung cancer from benign lung disease remains challenging,
even with the use of imaging scans; computed tomography (CT) scans are known to
have high sensitivity but poor specificity for lung cancer diagnosis. Tumor markers,
molecules produced by a tumor associated with a cancer or by the body in response to a
cancer, may be useful for clinical diagnosis. Consider two tumor markers for lung
cancer, CYFRA 21-1 and CEA, which tend to be elevated in patients with lung cancer
relative to those with benign lung disease. A study was conducted on patients with
known lung cancer status to assess how these tumor markers could be used for clinical
diagnosis.

The study team observed that in patients with lung cancer, CYFRA 21-1 is normally
distributed with mean 4.7 ng/mL and standard deviation 9.2 ng/mL while CEA is
normally distributed with mean 5.9 ng/mL and standard deviation 19.8 ng/mL. In
patients with benign lung disease, CYFRA 21-1 is normally distributed with mean 1.6
ng/mL and standard deviation 4.3 ng/mL while CEA is normally distributed with
mean 2.2 ng/mL and standard deviation 5.3 ng/mL.
Use the data from this study to answer the following questions.
i. Compute the sensitivity and specificity of a diagnosis test based on classifying
patients with CYFRA 21-1 level greater than 3.3 ng/mL as having lung cancer.
ii. Compute the sensitivity and specificity of a diagnosis test based on classifying
patientswith CEA level greater than 5.0 ng/mL as having lung cancer.
iii. Explain the reasoning behind why a diagnostic test with low sensitivity may not be
recommended for use in the general population but appropriate for use in high-risk
groups, such as patients presenting with several risk factors or symptoms strongly
predictive of lung cancer. Use language accessible to someone who has not taken a
statistics course. Limit your answer to no more than six sentences.

iv. Suppose a high-risk patient is tested for elevated CYFRA 21-1 level and found to
have
CYFRA 21-1 level below the cutoff in part i. Explain whether it seems reasonable to
rule out lung cancer for this patient based on this test result and the test features
computed in part i. Limit your answer to no more than six sentences, referencing
numerical results as necessary. The study team is interested in whether a diagnostic
test based on both CYFRA 21-1 and CEA is an improvement over tests based solely
on one of the markers. Suppose that a patient is classified as having lung cancer if at
least one of the markers is above the cutoffs used in parts i. and ii; i.e., a patient tests
positive for lung cancer if CYFRA 21-1 level is greater than 3.3 ng/mL, CEA level is
greater than 5.0 ng/mL, or both are elevated.
v. Compute the sensitivity and specificity of this diagnostic test. State any
assumptions necessary to make the calculation and comment on whether those
assumptions seem reasonable.
vi. Does the diagnostic test based on both markers represent an improvement over
the tests in parts i. and ii. for use in high-risk patients? Explain your answer,
referencing numerical results to support your reasoning.

Solution
(a) “Dear Sir or madam. With reference to your article about risks of longer sleeping
hours to have a heart stroke I am writing to clarify few points that I do not agree with.
As a statistics graduate I would like to point out that concept of relative risk is more
complicated to express it as an increase in percentage. One would prefer to express
relative risk in terms, risks of people with longer sleeping hours are more by factor
1.23 higer than those of with shorter hours. The other issue why the conclusion could
be misleading is the fact it is self-administered questionnaire. That means there are
could be measurement error or the sample given older age of the sample could be not
representative. For example, if people in the category longer sleeping hours are older
then those with shoter sleeping hours, then conclusion is completely wrong. In that
respect I would like to suggest to take extra care when trying interpret non-scientific
or reviewed studies.”
We need to find probability that in the 40-piece box at least one of the 8 flavors not
included. we calculate first the probability that the 1st flavour, is missing. This is to
say that the probability that each piece does not contain the 1st flavor.

The probability intersection of 40 independent events is the product of the following
event: any sample does not contain the 1st flavor, which is given by and hence the
above probabiilty is the product of all these events:
Next we calculate this probabily for other 7 flavors:
i. Computing sensitivity and specifity with cutpoint of 3.3 ng/mL as having lung
cancer.
We generate data from a large sample with equal number of lung cancer patients

N=10^5
dat=data.frame(cancer=rep(c("YES","NO"),each=N/2))
dat[dat$cancer=="YES","CYFRA"]=rnorm(N/2,mean=4.7,sd=9.2)
dat[dat$cancer=="NO","CYFRA"]=rnorm(N/2,mean=1.6,sd=4.3)
dat[dat$cancer=="YES","CEA"]=rnorm(N/2,mean=5.9,sd=19.8)
dat[dat$cancer=="NO","CEA"]=rnorm(N/2,mean=2.2,sd=5.3)
dat$CYFRA_Pred=ifelse(dat$CYFRA>=3.3,"YES","NO")
CrossTable(dat$cancer,dat$CYFRA_Pred, prop.c = F,prop.r = T,prop.t = F,
expected = F, cell.layout = F, format="SAS", prop.chisq = F)
##
## ======================================
## dat$CYFRA_Pred
## dat$cancer NO YES Total

## --------------------------------------
## NO 32591 17409 50000
## row prop. 0.652 0.348 0.500
## --------------------------------------
## YES 21983 28017 50000
## row prop. 0.440 0.560 0.500
## --------------------------------------
## Total 54574 45426 100000
## ======================================
cat("nSensitivity based on diag test CYFRA 21.1=",
sensitivity(factor(dat$CYFRA_Pred),factor(dat$cancer), negative="NO",
positive="YES"))
##
## Sensitivity based on diag test CYFRA 21.1= 0.56034
cat("nSpecifivity based on diag test CYFRA 21.1=",
specificity(factor(dat$CYFRA_Pred),factor(dat$cancer), negative="NO",
positive="YES"))

##
## Specifivity based on diag test CYFRA 21.1= 0.65182
ii.Test based on CEA 5.0 ng.
dat$CEA_Pred=ifelse(dat$CEA>=5,"YES","NO")
CrossTable(dat$cancer,dat$CEA_Pred)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## dat$CEA_Pred
## ------------------------------------------

## NO 35014 14986 50000
## 1034.948 1488.023
## 0.700 0.300 0.500
## 0.594 0.365
## 0.35 0.15
## ------------------------------------------
## YES 23965 26035 50000
## 1034.948 1488.023
## 0.479 0.521 0.500
## 0.406 0.635
## 0.24 0.26
## ------------------------------------------
## Total 58979 41021 100000
## 0.59 0.41
## =========================================
cat("nSensitivity based on diag test CEA 5.0 ng =",
sensitivity(factor(dat$CEA_Pred),factor(dat$cancer), negative="NO",
positive="YES"))

##
## Sensitivity based on diag test CEA 5.0 ng = 0.5207
cat("nSpecifivity based on diag test CEA 5.0 ng =",
specificity(factor(dat$CEA_Pred),factor(dat$cancer), negative="NO", posit
ive="YES"))
##
## Specifivity based on diag test CEA 5.0 ng = 0.70028
(iii) Diagnostic tests with low sensititivy imply that the in the general population such
tests does correctly identify enough risky patients, therefore such criteria is crucial
for large scale diagnostics. On the other hand considering only high risk population
such diagnostic tests might be effective because these type of misclassification
could be much less among such high risk groups.
(iv) If a patient has a CYFRA-level below the considered cut-off 3.3, then given the
specificity of this diagnosis test there is at least 65% chance that the patient has no
cancer. This can be read from the definition of specificity which can be translated
as
proportion of correctly identified non-risk patients below cut-off.
(v) Here we set the rule for predicted classification of lung cancer if at least one of
the
cut offs or rules in (i) and (ii) apply.

dat$Joint_Pred=factor(ifelse(dat$CEA>=5|dat$CYFRA>=3.3,"YES","NO"))
CrossTable(dat$cancer,dat$Joint_Pred)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## dat$Joint_Pred
## ------------------------------------------
## NO 22828 27172 50000
## 2240.838 1124.693
## 0.457 0.543 0.500
## 0.683 0.408
## 0.228 0.272

## ------------------------------------------
## YES 10590 39410 50000
## 2240.838 1124.693
## 0.212 0.788 0.500
## 0.317 0.592
## 0.106 0.394
## ------------------------------------------
## Total 33418 66582 100000
## 0.334 0.666
## ==========================================
cat("nSensitivity based on the joint diag test =",
sensitivity(factor(dat$Joint_Pred),factor(dat$cancer), negative="NO", posit
ive="YES"))
##
## Sensitivity based on the joint diag test = 0.7882
cat("nSpecifivity based on the joint diag test =",
specificity(factor(dat$Joint_Pred),factor(dat$cancer), negative="NO", posit
ive="YES"))
##
## Specifivity based on the joint diag test = 0.45656

We observe increase of specifity but still lower sensitivity rate. This implies that such
diagnostics is good for identifying people not having the cancer. But at the same time
has poor performance for identifying people having cancer. The
PROBLEM 2: ICE CREAM SALES
Congratulations! You have inherited an ice cream truck for the summer of 2022. You
will be selling ice cream from the truck every day that summer: 101 days in total, from
Memorial Day weekend until Labor Day weekend. This problem will step you through
projecting how much ice cream you expect to sell (as well as how much revenue you
expect to make) during the summer.
a) Ice cream sales are known to be very dependent on weather. Suppose that for any
particular day, the weather is either rainy or sunny; on average, one-third of the days are
rainy and two-thirds of the days are sunny. Compute the expected number of sunny days
in summer 2022 and the probability of there being more rainy days than sunny days in
summer 2022. You may assume that the weather is independent between days.
b) It is more realistic to think that tomorrow’s weather depends on today’s weather (for
all days throughout the summer). Suppose that if it is sunny today, there is an 80%
chance that tomorrow will also be sunny; however, if it is rainy today, there is only a
30% chance

Based on a simulation with 10,000 replicates, estimate the probability of there being more
rainy days than sunny days in summer 2022. Write a brief paragraph outlining the logic of
the simulation and clearly comment your code. The number of ice cream cones sold on a
typical day depends on the weather. When it is sunny, the number of ice cream cones sold
is approximately normally distributed with mean 200 and standard deviation 40. When it
is rainy, the number of ice cream cones sold is approximately normally distributed with
mean 120 and standard deviation 30. Use the round () function to round to the nearest
whole number. You make $2.00 in profit for each ice cream cone sold and your fixed
daily operating cost is $300 per day.
c) Based on a simulation with 10,000 replicates, estimate the probability that you lose
money on any one randomly selected day in the summer; write a brief paragraph
outlining the logic of the simulation and clearly comment your code. Assume the weather
follows the pattern described in part a).
d) Based on a simulation with 10,000 replicates, estimate the probability that you lose
money with your ice cream truck during the summer of 2022; write a brief paragraph
outlining the logic of the simulation and clearly comment your code. Assume the weather
follows the pattern described in part b).
that tomorrow will be sunny. Additionally, suppose that the weather on ‘Day 0’ (i.e.,
the day before your ice cream truck opens) is known to be sunny.

Solution
(a) We note that probability that a single day is sunny follows Bernoulli distribution and
the number of sunny days, in 101 days follows Binomial distribution with the success
rate of a single sunny day () and trial size of 101:
(b)Next we are interested in finding the expected value of :
The event that there are more sunny days than rainy is the same as the event there are
at least 51 sunny days:
We use R-statistical package to calculate this probability (although approximations
using incomplete Beta-function exist).

1-pbinom(50,101,2/3)
## [1] 0.9997276
##equivalently
sum(dbinom(51:101,101,2/3))
## [1] 0.9997276
b. Considering the dependent model
We create the empty vector of 102 days (0 day as first) and then each step draw a
Bernouilli sample with success probability of 0.8 if it the previous day is sunny and 0.3
if the previuos day is raining:
We run collect these samples and count number of cases where sunny days are more than
51.

set.seed(2344)
##number of simulations
N=10^4
n_sun_sim=rep(NA,N)
for(j in1:N)
{
## we create the empty vector of 102 days (0
day as first)
days=rep(NA,102)
days[1]=1
for(i in2:102)
{
days[i]=ifelse(days[i-
1]==1,rbinom(1,1,0.8),rbinom(1,1,0.3))
}
n_sun_sim[j]=sum(days[-1])
}
cat("Probability of more sunny days=",sum(n_sun_sim>=51)/N)

## Probability of more sunny days= 0.886
(c) We are looking for probability that on a single day which could be rainy or sunny
the money is lost. Given the variable profit of $2 and daily fixed cost of $300 this is en
event of selling less than 150 cones:
The number of cone sales,, is conditional on Bernouilli variable that takes 1 if it is sunny
and follows normal distribution.
#N=10^4
n_lose_sim=rep(NA,N)
for(j in1:N)
{
x_s=rbinom(1,1,2/3)
n_lose_sim[j]=ifelse(x_s==1,rnorm(1,mean=200,sd=40),rnorm(1,mean=120,sd=30))

}
cat("Proportion of financially unsuccessful days=",sum(n_lose_sim<=149)/N)
## Proportion of financially unsuccessful days= 0.3479
We can actually check validity (or accuracy) of this estimate using the additivity of
expectation operator and R:
pnorm(149,mean=200,sd=40)*2/3+pnorm(149,mean=120,sd=30)*1/3
## [1] 0.3451513
(d) In order to calculate of losing money during the whole year of 2002, We extend
the
simulation.
n_balance_sim=rep(NA,N)
{
## we create the empty vector of 102 days (0 day as first)

days=rep(NA,102)
balance=rep(NA,102)
balance[1]=0
days[1]=1
for(i in2:102)
{
days[i]=ifelse(days[i-1]==1,rbinom(1,1,0.8),rbinom(1,1,0.3))
n_c=ifelse(days[i]==1,rnorm(1,mean=200,sd=40),rnorm(1,mean=120,sd=30))
balance[i]=round(n_c)*2-300
}
n_balance_sim[j]=sum(balance)
}
cat("Proportion of financially unsuccessful years=",sum(n_balance_sim<=0)/N)
## Proportion of financially unsuccessful years= 0.0093

Probability Homework Help

Recommended

Recommended

More Related Content

Similar to Probability Homework Help

Similar to Probability Homework Help (18)

More from Statistics Homework Helper

More from Statistics Homework Helper (20)

Recently uploaded

Recently uploaded (20)

Probability Homework Help