DCP EPIDEMIOLOGY PROJECT

Abstract
Topic:
The relation between smoking and state of health/lifestyle.
Authors:
Alex Luojos, Axel Linnovaara, Ashwath Venkatasubramanian, Sergei Gordienko.
Aim of the project:
The aim of our project is to investigate the relationship between smoking and health state.
Hypotheses:
1. Currently smoking people are sick more often than former smokers and former smokers are
sick more often than never smokers.
2. Currently smoking people are the least physically active, former smokers are more physically
active, and never-smoking people are the most physically active.
3. Current smokers consume the most alcohol, former smokers consume an intermediate amount
and never smokers consume the least alcohol.
These hypotheses were based on everyday observations of smokers and non-smokers.
Our target group is people aged 18-30. The data will be collected with a questionnaire spread
through social networks, and we expect to collect data from approximately 100 people.
We will collect data regarding:
1. Age
2. Sex
3. Smoking habits
4. Weight/height
5. Sport activity (h/week)
6. Cases of being sick (last 6 months)
7. Alcohol consumption (restaurant portions/week)
8. Personal judgment of health state (1-5)
9. Duration of smoking habit
The survey is included in the appendix of this project.
Introduction
Smoking is the single most important preventable health risk in the developed world. It is
also an important cause of premature death worldwide. Smoking causes a wide range of diseases,
including cancers, chronic obstructive pulmonary disease, coronary heart disease and stroke.
There is also great interest in studying unhealthy lifestyles and their causes, because health
problems related to smoking and alcohol are a major burden on modern society, both economically
and socially. In our research we studied the relation between health state and smoking habits.
The main aim of the project was to understand the relationship between smoking and the health
state of an individual. We also studied whether people who smoke use alcohol more often. The
study was carried out using an internet survey that was sent to Erasmus and medical school
students in Tartu. The study material was analyzed with the program “Stata”. To understand
the lifestyle and health state of our sample, we collected this data through the
questionnaire as well.
Course of work and methods
To collect data, it was decided that an online survey would be the most appropriate, as
many people can access it. Furthermore, opening the survey to a wide range of people gives a
certain degree of randomization, which helps to limit bias. The survey consisted of 10
questions designed to obtain the relevant information about the lifestyle and smoking habits of
the subjects. The survey was made and launched on the “surveymonkey.com” platform and spread
through social networks. Overall, 104 random exchange and degree students of Tartu University,
of different ages and nationalities, took part in the survey. The raw data collected was later
processed, as described in the next section. Before that, however, the sample was
profiled. See figures 1-6.
Figure 1. Gender distribution of the sample.

Figure 2. Age distribution of the sample
Figure 3. Weight (in kg) distribution of the sample

Figure 4. Height (in cm) distribution of the sample
Figure 5. Smoking habit distribution of the sample

Figure 6. Health self-evaluation distribution amongst the sample
Health state evaluation by subjects.
1. My health state is dangerously poor (no results for this category).
2. My health state is poor but it is not a danger, however it still needs improvement.
3. My health state is not too bad however it is not too good either.
4. My health state is quite good.
5. My health state is extremely good, it can’t get much better.
After looking through the general profile of the sample, a more detailed analysis was conducted
to check the 3 hypotheses. Due to the pattern of the data obtained, the main methods applied
were Fisher’s exact test and the Kruskal-Wallis test. More detailed information about the course
of data processing, including graphs and test results, can be found in the next section of the
report.

Data processing and results
To process our data we used the program “Stata”. With this program we were able to
process a large amount of data quickly. Our first step was to determine which statistical tests
should be used for our data. We therefore created histograms of the variables
that are part of our hypotheses. These show the distribution of our data; a normal
distribution would point to the use of a t-test. Refer to figures 7, 8 and 9.
Figure 7. Histogram showing distribution of number of times sick in the last 6 months

Figure 8. Histogram showing distribution of hours of exercise per week
Exercising habits of subjects.
1. 0-2 hours per week
2. 3-5 hours per week
3. 6-8 hours per week
4. 9-11 hours per week
5. More than 11 hours per week

Figure 9. Histogram showing distribution of alcohol consumption (portions per week)
As one can see from the graphs above, none of the variables is normally
distributed. Therefore, the t-test is not appropriate. As a result, it was decided to use
the Kruskal-Wallis test, along with Fisher’s exact test for tabulated data. We use
Fisher’s exact test rather than the chi-squared test because we do not
have a very large amount of data.
The first hypothesis to be tested is that currently smoking people are sick more often than
former smokers and former smokers are sick more often than never smokers. For this the
Kruskal-Wallis test can be used. It is a test that shows whether samples originate from the same
distribution, essentially showing whether a variable has a significant effect on the results. It
is similar to a t-test in that we seek to reject the null hypothesis and therefore accept the
alternative hypothesis. We reject the null hypothesis only when the p-value (the value resulting
from the test) is equal to or less than 0.05.

H0 (null hypothesis) - The hypothesis that there is no significant difference between
specified populations, any observed difference being due to sampling or experimental
error.
H1 (alternative hypothesis) - The hypothesis that the observations are the result of a
real effect, i.e. that the variable in question has a significant effect.
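To make the test described above more concrete, the following is a minimal plain-Python sketch of how a Kruskal-Wallis statistic can be computed for three groups. It is not the Stata procedure used in this project: the function names and the sample values are hypothetical, and the p-value shortcut is valid only for exactly three groups (2 degrees of freedom).

```python
from math import exp


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic.

    Assumes at least two distinct values overall (otherwise the tie
    correction would divide by zero).
    """
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)

    # Assign average ranks to tied values and accumulate the tie term.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                               # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2    # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j

    rank_sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


def p_value_three_groups(h):
    """Chi-squared upper tail with 2 degrees of freedom: exp(-h / 2)."""
    return exp(-h / 2)


# Hypothetical values for illustration only; these are NOT the survey data.
never = [1, 2, 3]
former = [4, 5, 6]
current = [7, 8, 9]

h = kruskal_wallis_h([never, former, current])
p = p_value_three_groups(h)
print("H =", round(h, 4), "p =", round(p, 4))
print("reject H0" if p <= 0.05 else "cannot reject H0")
```

The decision rule in the last line is the one stated above: H0 is rejected only when the p-value is equal to or less than 0.05.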
The p-value found by using the Kruskal-Wallis test of sickness by
smoking was 0.3768. Since this p-value is greater than 0.05, we cannot reject the null
hypothesis: we do not have evidence of a significant effect of
smoking on the number of times fallen sick. A box and whisker plot can also be generated to
show the relationship between the two variables. See figure 10.
Figure 10. Box and whisker plot showing relationship between smoking and the number of times
fallen sick in the last 6 months.
X-axis (smoking): 0 - Never smokers, 1 - Former smokers, 2 - Current smokers

This box and whisker plot also shows no clear relationship, as the medians of
all smoking categories are about the same. If the hypothesis held, one would
expect to see never smokers with the lowest median, former smokers with the middle
median and current smokers with the highest median.
To make sure, we also used Fisher’s exact test on this hypothesis. However, as
Fisher’s exact test can be used only with two categorical variables, we had to turn our sick
variable into two categories: fallen sick in the last 6 months and not fallen sick in the last 6
months. After this new variable was generated, we tabulated the data. See table 1.
Table 1. Table showing frequency of falling sick amongst current smokers, former smokers and
never smokers.
Cells: count (row %)
Smoking               Not fallen sick   Fallen sick     Total
0 (never smokers)     15 (25.86)        43 (74.14)      58 (100.00)
1 (former smokers)     3 (20.00)        12 (80.00)      15 (100.00)
2 (current smokers)    8 (25.81)        23 (74.19)      31 (100.00)
Total                 26 (25.00)        78 (75.00)     104 (100.00)
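As a quick check on Table 1, the row percentages and totals can be recomputed from the cell counts. The sketch below uses only the counts given in the table; the variable and function names are illustrative and are not part of the original Stata workflow.

```python
# Counts copied from Table 1: (not fallen sick, fallen sick) per smoking group.
table1 = {
    "never smokers": (15, 43),
    "former smokers": (3, 12),
    "current smokers": (8, 23),
}


def row_percentages(counts):
    """Percentage of each cell within its row, rounded to two decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


percentages = {group: row_percentages(counts) for group, counts in table1.items()}
column_totals = [sum(column) for column in zip(*table1.values())]
grand_total = sum(column_totals)

for group, counts in table1.items():
    print(group, counts, percentages[group])
print("Total", column_totals, row_percentages(column_totals), grand_total)
```

The recomputed values agree with the table: about three quarters of each group reported falling sick at least once, whatever their smoking status.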
Like the Kruskal-Wallis test, Fisher’s exact test is framed in terms of a null and an
alternative hypothesis. The same rule applies: the p-value has to be equal to or
less than 0.05 for us to reject the null hypothesis and conclude that there is a statistically
significant effect.
The result of Fisher’s exact test is 0.952. This means that we do not have enough
evidence to reject the null hypothesis, so we cannot claim any statistically
significant difference. Once again, we find no evidence that smoking has an effect on the
number of times one falls sick.
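For readers who want to see what Fisher’s exact test does with a table such as Table 1, the sketch below enumerates every table with the same row and column totals and adds up the probabilities of those that are no more likely than the observed one. This is a plain-Python illustration under the usual two-sided definition for tables with two columns; it is not the Stata routine used in this project, and the function name `fisher_exact_rx2` is hypothetical.

```python
from math import comb


def fisher_exact_rx2(rows):
    """Two-sided Fisher exact p-value for an r x 2 table.

    rows: list of (first_column_count, second_column_count), one per group.
    """
    row_totals = [a + b for a, b in rows]
    first_col_total = sum(a for a, _ in rows)

    def weight(first_col_counts):
        # Proportional to the hypergeometric probability of the table
        # (exact integers, so comparisons are not affected by rounding).
        w = 1
        for n, k in zip(row_totals, first_col_counts):
            w *= comb(n, k)
        return w

    def tables(i, remaining):
        # Generate every first column that matches the fixed margins.
        if i == len(row_totals) - 1:
            if remaining <= row_totals[i]:
                yield [remaining]
            return
        for k in range(min(row_totals[i], remaining) + 1):
            for rest in tables(i + 1, remaining - k):
                yield [k] + rest

    observed = weight([a for a, _ in rows])
    total = 0
    as_or_less_likely = 0
    for candidate in tables(0, first_col_total):
        w = weight(candidate)
        total += w
        if w <= observed:
            as_or_less_likely += w
    return as_or_less_likely / total


# Counts from Table 1: (not fallen sick, fallen sick) for never, former, current smokers.
p_table1 = fisher_exact_rx2([(15, 43), (3, 12), (8, 23)])
print("Fisher exact p-value:", round(p_table1, 3))
print("reject H0" if p_table1 <= 0.05 else "cannot reject H0")
```

Because the observed counts lie very close to what equal sickness rates across the three groups would produce, most alternative tables are less likely than the observed one, which is why the p-value is large.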
The next hypothesis that needs to be tested is that currently smoking people are the least
physically active, former smokers are more physically active and never-smoking people are the
most physically active. An immediate problem that we saw with the data about sport activity was
that only one person exercised more than 11 hours per week (5th category). Therefore, the 4th
and 5th categories were joined together. The new categories for sport
activity are:
1. 0-2 hours per week
2. 3-5 hours per week
3. 6-8 hours per week
4. 9 or more hours per week
See table 2 below.
Table 2. Table showing amount of sport activity amongst current smokers, former smokers and
never smokers.
                    Sport activity (hours per week)
Smoking             0-2    3-5    6-8    9+     Total
Never smokers        25     21      7     5        58
Former smokers        8      3      2     2        15
Current smokers      17      8      4     2        31
Total                50     32     13     9       104
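The choice of Fisher’s exact test over the chi-squared test for a table like Table 2 can be illustrated by computing the cell counts expected if smoking and sport activity were unrelated. The sketch below uses the counts from Table 2; the threshold of 5 is a common rule of thumb for the chi-squared approximation and is an assumption here, not something stated in this report.

```python
# Counts from Table 2: hours of sport per week (0-2, 3-5, 6-8, 9+) per smoking group.
observed = {
    "never smokers": [25, 21, 7, 5],
    "former smokers": [8, 3, 2, 2],
    "current smokers": [17, 8, 4, 2],
}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
n = sum(col_totals)

# Expected count under independence: row total * column total / grand total.
expected = {
    group: [row_totals[group] * col_total / n for col_total in col_totals]
    for group in observed
}

# Rule-of-thumb threshold (an assumption, see the note above).
small_cells = sum(1 for row in expected.values() for e in row if e < 5)

for group, row in expected.items():
    print(group, [round(e, 2) for e in row])
print("cells with expected count below 5:", small_cells, "of", 3 * len(col_totals))
```

Several expected counts fall below 5, mostly in the former and current smoker rows, so an exact test is the more cautious choice for this table.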
With this data we can carry out Fisher’s exact test to check whether our hypothesis is supported
or whether the null hypothesis cannot be rejected. The result of Fisher’s exact test is 0.840. This
means that we do not have enough evidence to reject the null hypothesis, which is that smoking
has no effect on sport activity. Once again, this can be
shown on a box and whisker plot. See figure 11.

Figure 11. Box and whisker plot showing relationship between smoking and the amount of hours
of sport activity in a week.
Y-axis (sport activity): 1: 0-2 hours per week, 2: 3-5 hours per week, 3: 6-8 hours per week,
4: 9-11 hours per week, 5: More than 11 hours per week
This box and whisker plot also shows that there is no clear relationship, as the medians are
all in about the same range. If the hypothesis were true, the median of the never smokers group
would be the highest, former smokers in the middle and current smokers the lowest.
The third and final hypothesis that has to be investigated is that current smokers consume
the most alcohol, former smokers consume an intermediate amount and never smokers consume
the least alcohol. The set of data regarding this test is quite large, so Fisher’s exact
test can’t be used. Instead we return to the Kruskal-Wallis test. The p-value of this test
was 0.0001. This means that there is enough evidence to reject the null hypothesis,
which would be that current smokers, former smokers and never smokers consume around the
same amount of alcohol. Since the null hypothesis is rejected, we can accept the
alternative hypothesis that alcohol consumption differs between the groups. The test itself
does not give the order of the groups; the order stated in our hypothesis (current smokers
the most, former smokers an intermediate amount, never smokers the least) is checked against
the group medians.
Furthermore, we can show this relationship using a box and whisker plot once again. See
figure 12 below.
Figure 12. Box and whisker plot showing relationship between smoking and the consumption of
alcohol (restaurant portions per week)
From this graph, it can be observed that there is a very clear pattern, as stated in our
hypothesis. The median of the never smoking group is the lowest, next comes the
former smoking group, and the highest median is that of the current smoking group. This clear
pattern supports our hypothesis.

Conclusions
The study was concentrated around three hypotheses that predicted the associations
between smoking and health, or smoking and lifestyle. In these hypotheses the assumptions were
that people who smoke are sick more often, smokers are less physically active than non-smokers,
and smokers consume more alcohol than non-smokers.
When studying the assumption that smokers are sick more often, no consistent
association was found between susceptibility to infections and smoking using Fisher’s exact test
(p=0.952, therefore H0 couldn’t be rejected). Also, no difference was found between groups when
comparing the self-evaluation of their health.
With the assumption that smokers are less physically active than non-smokers, no reliable
association was found with Fisher’s exact test (p-value=0.324, H0 couldn’t be rejected). With
this hypothesis, however, the results could have been affected by our population, sample size and
wide categories for physical activity. None of the groups was physically very active, and with
our broad physical activity categories, nearly half of the answers (50 of 104) fell in the same
category of exercise (0-2h a week).
Unlike in the previous two hypotheses, a considerable association was found between
the volume of alcohol consumed and smoking within the sample. The Kruskal-Wallis test of
equality of populations showed an association between
smoking and the volume of alcohol consumed (p-value=0.0001). Even though causality cannot be
determined from this study, it would seem that in this sample of students, the habits of smoking
and drinking alcohol were often concentrated in the same subjects.
When interpreting the validity of the results, a few other concerns
should be taken into account. As the study was conducted via social media groups, it was
probably answered mostly by Erasmus students, and also to some degree by medical students.
It can therefore be assumed that the sample consists mainly of people who are in general very
healthy (young age, short smoking time, ability to acquire higher education and travel abroad).
As the sample of the study was very small and was taken from a very specific population
with a presumably similar lifestyle, these results probably shouldn’t be extrapolated to
wider groups of people.
What might also be interpreted from our results is that within these student groups of
young and generally healthy people, the primary health effects of smoking were not yet
visible. If a follow-up study could be conducted, it would be interesting and perhaps
more fruitful to see if the association would be visible in these samples after 10 or 20 years.
Appendix
Survey questions:
1. How old are you?
2. Male/Female?
3. Are you a smoker? Yes, I currently smoke/No, I used to smoke/No, I never smoked
4. For how many years or months have you smoked continuously? (Continuous smoking
defined as at least a pack of cigarettes a week) **Only answer if you are a current smoker or
former smoker**
4. How much do you smoke? (Packs per week)
a) 1
b) 2
c) 3
d) 4
e) 5
f) 6
g) 7
5. What is your weight? (in Kg)

6. What is your height? (in cm)
7. How many times have you fallen sick (flu or cold), in the last 6 months?
8. How many portions of alcohol do you consume in a week? *One portion is defined as a
restaurant portion such as 500 ml of beer (standard glass of beer), 175 ml of wine (standard glass
of wine), 45 ml of hard liquor (standard shot).*
9. How would you grade your state of health from 1 to 5?
1: My health state is dangerously poor.
2: My health state is poor but it is not a danger, however it still needs improvement.
3: My health state is not too bad however it is not too good either.
4: My health state is quite good.
5: My health state is extremely good, it can’t get much better.
10. How many hours of sporting activity do you do in a week (slow walking does not count!)?
a) 0-2 hours a week
b) 3-5 hours a week
c) 6-8 hours a week
d) 9-11 hours a week
e) More than 11 hours a week
