Discover Factors That Affect Smoking With Data Analytics
1. Alight
Technical Report
Introduction
Data discovery means that we already have some understanding of a phenomenon (i.e.
smoking) and have obtained data on factors that we think contribute to it.
Since temporal correlation is difficult to establish for complex phenomena such as
smoking, we have to use mathematical means to discover which of these factors actually
affect the occurrence of the phenomenon.
Take the relationship between income and education. It is widely believed that higher
education leads to higher income; however, individuals cannot easily translate this trend
into their personal lives. For example, someone from a low-income family who wishes to
improve his financial future would know the importance of education, but he cannot act on
this information by itself. However, if researchers include other factors related to
education, such as knowledge of available funding sources or friends who have attended
higher education, then that person can take active steps, either to find more information
about funding sources or to connect with the right peers.
We are trying to change the current paradigm of smoking research, which is similar to the
income and education situation described above, into a personalized one in which the
findings can affect smokers personally.
Data Collection
We will use two main data sources for our data discovery, data mining, and predictive
analytics. First, date, time and location data will be generated each time the user lights a
cigarette. Second, descriptive categorical data such as age, sex, income status, place of
residence, etc. will be collected when the user creates a profile on our online interface.
In addition, we have the option of creating surveys for users to fill in on our
website in case there are specific questions researchers want to ask; for example, the user's
smoking reduction goals.
Statistical Methods
Data discovery
One of the major tools used in statistical studies is regression analysis. Regression gives a
mathematical formula to describe the relationship between different factors. Among these
factors, the dependent variable is the phenomenon we are trying to describe using the
formula, while the independent variables are the factors we think affect the outcome. The
dependent variable will be individual smoking incidence, and the independent variables
are: time, GPS location, demographics, and any other data that can be obtained from
surveys. The regression will take the form of:
Smoke (yes/no) = Location + Time of day + Date + Demographics (age, sex, etc.) + ...
To be more specific about the math, we are using Probit and Poisson models, which estimate
the probability of an event (i.e. smoking) occurring. Probit will give the
probability of a user lighting a cigarette, while Poisson regression will give the probability
of each possible number of cigarettes a user smokes in one day.
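As a rough illustration of what the two models output, the sketch below turns a linear combination of factors into a probability. The coefficient values and the factor names (`near_store`, `evening`) are hypothetical and are not fitted to any data; in practice the coefficients would be estimated from the collected observations.

```python
import math
from statistics import NormalDist


def probit_probability(linear_predictor: float) -> float:
    """Probit: P(smoke = yes) is the standard normal CDF of the linear predictor."""
    return NormalDist().cdf(linear_predictor)


def poisson_pmf(k: int, rate: float) -> float:
    """Poisson: P(exactly k cigarettes in a day) when the expected count is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


# Hypothetical coefficients, for illustration only.
coef = {"intercept": -1.0, "near_store": 0.8, "evening": 0.5}
observation = {"near_store": 1, "evening": 1}

# Linear predictor: intercept plus the weighted sum of the factors.
z = coef["intercept"] + sum(coef[name] * value for name, value in observation.items())
p_light = probit_probability(z)

# Poisson regression models log(expected count) as a linear function of the factors.
log_rate = 1.5 + 0.4 * observation["evening"]  # hypothetical coefficients
rate = math.exp(log_rate)
count_probs = [poisson_pmf(k, rate) for k in range(41)]

print(f"P(lights a cigarette) = {p_light:.3f}")
print(f"P(exactly 5 cigarettes today) = {count_probs[5]:.3f}")
```

Note that the Poisson model yields one probability per possible daily count, so the values in `count_probs` add up to (almost exactly) one.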
The way that we assess the reliability of such estimates is to test whether our result
could have been obtained purely by chance (i.e. a false positive). Imagine that every time a
person lights a cigarette, he also has a cup of coffee; it could be the case that coffee triggers
him to smoke, or it could be purely due to chance that he happened to have a cup of coffee
when he smoked. The way of discerning if coffee is the culprit is to calculate its statistical
significance (i.e. p-value): the probability of seeing an association at least this strong if
coffee had no effect at all.
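The coffee example can be made concrete with a small significance test. The sketch below uses a one-sided Fisher exact test on a 2x2 table of observed moments; this particular test is our choice for illustration (the report's models would produce their own p-values), and the counts are invented.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact test on a 2x2 table.

    a: coffee and smoked        b: coffee and did not smoke
    c: no coffee and smoked     d: no coffee and did not smoke

    Returns the probability of seeing `a` or more coffee-and-smoked
    moments if coffee and smoking were unrelated.
    """
    coffee_total = a + b
    smoked_total = a + c
    n = a + b + c + d
    largest_a = min(coffee_total, smoked_total)
    total_tables = comb(n, smoked_total)
    favourable = sum(
        comb(coffee_total, k) * comb(n - coffee_total, smoked_total - k)
        for k in range(a, largest_a + 1)
    )
    return favourable / total_tables


# Invented counts: 20 observed moments, 10 of them with coffee.
p_strong = fisher_one_sided(8, 2, 2, 8)  # smoking mostly happens with coffee
p_none = fisher_one_sided(5, 5, 5, 5)    # smoking is unrelated to coffee

print(f"strong pattern: p = {p_strong:.4f}")
print(f"no pattern:     p = {p_none:.4f}")
```

A small p-value (conventionally below 0.05) suggests the coffee pattern is unlikely to be a coincidence; it does not by itself prove that coffee causes the smoking.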
Data mining
Smoking involves subtle and often inconspicuous influences such as seeing another person
smoke or passing by a convenience store. Data mining techniques allow us to discover the
hidden relationships among unlikely agents that might affect smoking.
We believe that the overlooked aspects of smoking are where you are and what is around
you. That is to say, the GPS information we obtain can yield additional value. We will
compare the user's location data with publicly available geo-spatial data, such as locations
of businesses (e.g. coffee shops, convenience stores), weather conditions, traffic conditions
or other smokers in the vicinity.
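One simple way to relate a smoking event to its surroundings is to list the known places within a fixed radius of the event's GPS coordinates. The sketch below assumes a haversine (great-circle) distance and a 300 m radius; the coordinates, place names and radius are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def places_within(event, places, radius_m):
    """Names of the places that lie within `radius_m` of the smoking event."""
    lat, lon = event
    return [name for name, p_lat, p_lon in places
            if haversine_m(lat, lon, p_lat, p_lon) <= radius_m]


# Hypothetical smoking event and points of interest.
smoking_event = (43.6580, -79.3990)
points_of_interest = [
    ("coffee shop", 43.6585, -79.3990),
    ("convenience store", 43.6580, -79.3960),
    ("far store", 43.7000, -79.4000),
]

nearby = places_within(smoking_event, points_of_interest, radius_m=300)
print(nearby)
```

The resulting list (or simply its length per category) can then be attached to each smoking event as an additional factor.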
Two families of algorithms suit such analysis: clustering and association rule learning.
Neither requires a researcher to pre-define any set of rules (as we
would do with regression analysis).
A clustering algorithm measures the distance between data points and automatically
creates rules to define the "cluster" (i.e. classification) each data point belongs to. For
example, given a large data pool, we can build "smoking clusters" without the human errors
which often happen with large data pools. The association rule learning algorithm creates
rules that define the probability of an event occurring given the concurrence of a fixed
basket of events. In this case, we can measure how many concurrent events (e.g. number of
convenience stores; number of adjacent smokers) it takes for someone to smoke.
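Both ideas can be sketched in a few lines. The first function below is a plain k-means clustering loop, used here as one common example of a clustering algorithm; the second computes the confidence of an association rule, i.e. how often smoking occurs when a given basket of events is present. The starting centroids, the data points and the event names are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move
    every centroid to the mean of the points assigned to it, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


def rule_confidence(baskets, antecedent, consequent):
    """confidence(A -> B) = (# baskets containing A and B) / (# baskets containing A)."""
    antecedent = set(antecedent)
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    return sum(1 for b in with_antecedent if consequent in b) / len(with_antecedent)


# Hypothetical smoking locations on a local grid, with hand-picked starting centroids.
locations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, groups = kmeans(locations, centroids=[(0, 0), (10, 10)])

# Hypothetical baskets of events observed at the same moment.
baskets = [
    {"store_nearby", "smoker_nearby", "smoke"},
    {"store_nearby", "smoker_nearby", "smoke"},
    {"store_nearby", "smoker_nearby"},
    {"store_nearby"},
    {"smoker_nearby", "smoke"},
    {"coffee"},
]
conf_store = rule_confidence(baskets, {"store_nearby"}, "smoke")
conf_both = rule_confidence(baskets, {"store_nearby", "smoker_nearby"}, "smoke")

print(centres)
print(conf_store, conf_both)
```

In this toy data the rule with two concurrent events has a higher confidence than the rule with one, which is the kind of pattern the analysis is meant to surface.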
Factors unearthed by data mining are reintroduced to the regression analysis in order to
increase the value derived from existing data. The resultant model will describe smoking
more accurately, so that researchers and policy-makers can understand and modify
smoking behaviour.
Prediction and Forecasting
The better a statistical model becomes, the more accurately it describes the relationship
between the outcome and other associated factors. However, no matter how well the model
describes data from the past, it is hard to assess its predictive power directly.
The best that data scientists can do is to randomly separate the original data into training,
validation, and testing groups. The model creation only takes data from the training group,
and the validation data group is used to refine the performance of the model. Lastly, the
model's prediction is matched with data from the testing group, and the predictive
performance is determined based on the difference between the model's predictions and
the data values from the testing group.
For instance, suppose we have 100 data points on smokers' locations. We use 60 data points to
create the model and 10 data points to validate it, and we then use the model to generate 30
predictions. Finally, the 30 predictions are compared with the 30 unused (testing) data points.
If 25 of the predictions match the corresponding testing data points exactly, the predictive
power of the model is 25/30, or about 83%.
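The 60/10/30 example can be written out as follows. This is a minimal sketch: the records are placeholder integers, the fixed seed is only there to make the shuffle repeatable, and the predictions are stand-ins constructed so that 25 of the 30 agree with the testing data.

```python
import random


def split_data(records, n_train, n_valid, seed=0):
    """Randomly split records into training, validation and testing groups."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def accuracy(predicted, actual):
    """Share of predictions that match the held-out values exactly."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)


records = list(range(100))  # placeholder for 100 location observations
train, valid, test = split_data(records, n_train=60, n_valid=10)

# Stand-in predictions: the first 25 agree with the testing data, the last 5 do not.
predicted = test[:25] + [-1] * 5
score = accuracy(predicted, test)

print(len(train), len(valid), len(test))
print(f"predictive power: {score:.0%}")
```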
Future Possibilities
The goal of smoking research should be tied to the health outcomes of the general
population, and it should not be isolated from the wealth of other health-related data.
Currently, it is very difficult for researchers from different fields to pool their separate data
together. Part of this difficulty arises from the lack of a unique identifier for convergent data.
That is to say, if researcher A collects 100 observations on smoking and researcher B
collects 100 observations on blood pressure, it is impossible for either researcher to
know which observations came from the same person. Hence, the researchers cannot
conduct cross data-source research.
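To illustrate why a shared identifier matters, the sketch below links two separately collected data sets by a common participant ID. The IDs and values are hypothetical; without such a key, the two tables could not be matched record by record.

```python
def link_by_id(left, right):
    """Pair up the values of two data sets for participants present in both."""
    shared_ids = sorted(left.keys() & right.keys())
    return {uid: (left[uid], right[uid]) for uid in shared_ids}


# Hypothetical data from two independent studies, keyed by the same identifier.
cigarettes_per_day = {"user-001": 12, "user-002": 5, "user-003": 20}
systolic_pressure = {"user-002": 118, "user-003": 141, "user-004": 125}

linked = link_by_id(cigarettes_per_day, systolic_pressure)
print(linked)
```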
Our solution for the data sharing problem is to endorse Apple's newly announced
ResearchKit. Apple has created a common platform for medical researchers to have easy
access (fingerprint ID approval) to medical information attributed to unique identifiers
(iPhone users).
Version 2.0 of Alight is to be paired via Bluetooth with mobile phones and to have a dedicated
app on the Apple ResearchKit platform. This means researchers at CAMH will have access
to other health-related data without having to conduct additional primary research.