Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
ECML	
  PKDD	
  2015,	
  Porto,	
  Portugal
Assessing	
  the	
  impact	
  of	
  a	
  health	
  intervention	
  
via	
  use...
๏ Background	
  and	
  motivation	
  
๏ Nowcasting	
  disease	
  rates	
  from	
  online	
  text	
  
๏ Estimating	
  the	
...
Online,	
  user-­‐generated	
  data
+ Social	
  media,	
  blogs,	
  search	
  engine	
  query	
  logs	
  
+ Proxy	
  of	
 ...
Online,	
  user-­‐generated	
  data	
  —	
  Applications
+ Politics	
  
• voting	
  intention	
  
• result	
  of	
  an	
  ...
Online,	
  user-­‐generated	
  data	
  for	
  health
Traditional	
  disease	
  surveillance	
  
- does	
  not	
  cover	
  ...
What	
  this	
  work	
  is	
  all	
  about
Health	
  intervention
disease	
  rates
(
Pebody	
  &	
  Cox,	
  2015
impact ?
What	
  this	
  work	
  is	
  all	
  about
Health	
  intervention
disease	
  rates
(Lampos,	
  Yom-­‐Tov,	
  
Pebody	
  &	...
✓ Background	
  and	
  motivation	
  
๏ Estimating	
  disease	
  rates	
  from	
  online	
  text	
  
๏ Estimating	
  the	
...
Estimating	
  disease	
  rates	
  from	
  online	
  textVariables
N
M
X 2 RN⇥M
y 2 RN
time	
  intervals
n-­‐grams
frequenc...
Estimating	
  disease	
  rates	
  from	
  online	
  text
the observation matrix X) we want to learn a function f:
drawn fr...
Estimating	
  disease	
  rates	
  from	
  online	
  text
the observation matrix X) we want to learn a function f:
drawn fr...
Estimating	
  influenza-­‐like	
  illness	
  (ILI)	
  rates	
  —	
  Data
2012 2013 2014
0
0.01
0.02
0.03
0.04
ILIrateper10...
Estimating	
  ILI	
  rates	
  —	
  Feature	
  extraction
• Start	
  with	
  a	
  manually	
  crafted	
  list	
  of	
  36	
...
Estimating	
  ILI	
  rates	
  —	
  Experimental	
  setup
Two	
  time	
  intervals	
  based	
  on	
  the	
  different	
  tem...
Pearson	
  correlation	
  (r)
0.5
0.6
0.7
0.8
0.9
1
User-­‐generated	
  data	
  source
Twitter	
  (Dt1) Twitter	
  (Dt2) B...
MAE
1
1.64
2.28
2.92
3.56
4.2
User-­‐generated	
  data	
  source
Twitter	
  (Dt1) Twitter	
  (Dt2) Bing	
  (Dt2)
1.598
1.9...
✓ Background	
  and	
  motivation	
  
✓ Estimating	
  disease	
  rates	
  from	
  online	
  text	
  
๏ Estimating	
  the	
...
Estimating	
  the	
  impact	
  of	
  a	
  health	
  intervention
1. Disease	
  intervention	
  launched	
  (to	
  a	
  set...
Estimating	
  the	
  impact	
  of	
  a	
  health	
  intervention
Based on a new observation x⇤, a prediction is conduc
the...
Estimating	
  the	
  impact	
  of	
  a	
  health	
  intervention
c
r(q⌧
v, q⌧
c )
f(w, ) : R ! R
argmin
w,
NX
i=1
qti
c w ...
Estimating	
  the	
  impact	
  of	
  a	
  health	
  intervention
c
r(q⌧
v, q⌧
c )
f(w, ) : R ! R
argmin
w,
NX
i=1
qti
c w ...
✓ Background	
  and	
  motivation	
  
✓ Estimating	
  disease	
  rates	
  from	
  online	
  text	
  
✓ Estimating	
  the	
...
Live	
  Attenuated	
  Influenza	
  Vaccine	
  (LAIV)	
  campaign
2012 2013 2014
0
0.01
0.02
0.03
ILIrateper100people
PHE/R...
Target	
  (vaccinated)	
  &	
  control	
  areas
Brighton	
  •	
  Bristol	
  •	
  Cambridge	
  
Exeter	
  •	
  Leeds	
  •	
...
Applying	
  the	
  impact	
  estimation	
  framework
Target	
  vs.	
  control	
  areas	
  
• Use	
  previous	
  flu	
  seas...
Relationship	
  between	
  vaccinated	
  &	
  control	
  areas
Twitter	
  —	
  All	
  areas
Bing	
  —	
  All	
  areas
0 0....
Relationship	
  between	
  vaccinated	
  &	
  control	
  areas
Twitter	
  —	
  London	
  
areas
Bing	
  —	
  London	
  are...
Impact	
  estimation	
  results	
  (strongly	
  correlated	
  controls)
Source Target r δ	
  x	
  103 θ	
  (%)
Twitter All...
Impact	
  estimation	
  results	
  (strongly	
  correlated	
  controls)
Source Target r δ	
  x	
  103 θ	
  (%)
Twitter All...
Source Target r δ	
  x	
  103 θ	
  (%)
Twitter All	
  areas .861 -­‐2.5	
  (-­‐4.1,	
  -­‐1.0) -­‐32.8	
  (-­‐47.4,	
  -­‐...
Source Target r δ	
  x	
  103 θ	
  (%)
Twitter All	
  areas .861 -­‐2.5	
  (-­‐4.1,	
  -­‐1.0) -­‐32.8	
  (-­‐47.4,	
  -­‐...
Impact	
  estimation	
  results	
  (stat.	
  sig.)
-­‐θ	
  (%)
0
7
14
21
28
35
All	
  areas London	
  areas Newham Cumbria...
Projected	
  vs.	
  inferred	
  ILI	
  rates	
  in	
  vaccinated	
  locations
Twitter	
  —	
  All	
  areas
Bing	
  —	
  Al...
Projected	
  vs.	
  inferred	
  ILI	
  rates	
  in	
  vaccinated	
  locations
Twitter	
  —	
  London	
  
areas
Bing	
  —	
...
Sensitivity	
  of	
  impact	
  estimates	
  to	
  variable	
  controls
• Repeat	
  the	
  impact	
  estimation	
  for	
  t...
Sensitivity	
  of	
  impact	
  estimates	
  to	
  variable	
  controls
• Repeat	
  the	
  impact	
  estimation	
  for	
  t...
Sensitivity	
  of	
  impact	
  estimates	
  to	
  variable	
  controls
• Repeat	
  the	
  impact	
  estimation	
  for	
  t...
✓ Background	
  and	
  motivation	
  
✓ Estimating	
  disease	
  rates	
  from	
  online	
  text	
  
✓ Estimating	
  the	
...
Conclusions	
  &	
  points	
  for	
  discussion
• Framework	
  for	
  estimating	
  the	
  impact	
  of	
  a	
  health	
  ...
Potential	
  future	
  work	
  directions
• Improve	
  supervised	
  learning	
  models	
  
- better	
  natural	
  languag...
Collaborators,	
  acknowledgements	
  &	
  material
Elad	
  Yom-­‐Tov,	
  Microsoft	
  Research	
  
Richard	
  Pebody,	
  ...
Bollen,	
  Mao	
  &	
  Zeng.	
  Twitter	
  mood	
  predicts	
  the	
  stock	
  market.	
  J	
  Comp	
  Science,	
  2011.	
...
Upcoming SlideShare
Loading in …5
×

Assessing the impact of a health intervention via user-generated Internet content

2,176 views

Published on

Assessing the effect of a health-oriented intervention by traditional epidemiological methods is commonly based only on population segments that use healthcare services. Here we introduce a complementary framework for evaluating the impact of a targeted intervention, such as a vaccination campaign against an infectious disease, through a statistical analysis of user-generated content submitted on web platforms. Using supervised learning, we derive a nonlinear regression model for estimating the prevalence of a health event in a population from Internet data. This model is applied to identify control location groups that correlate historically with the areas, where a specific intervention campaign has taken place. We then determine the impact of the intervention by inferring a projection of the disease rates that could have emerged in the absence of a campaign. Our case study focuses on the influenza vaccination program that was launched in England during the 2013/14 season, and our observations consist of millions of geo-located search queries to the Bing search engine and posts on Twitter. The impact estimates derived from the application of the proposed statistical framework support conventional assessments of the campaign.

Published in: Science
  • Be the first to comment

  • Be the first to like this

Assessing the impact of a health intervention via user-generated Internet content

  1. 1. ECML  PKDD  2015,  Porto,  Portugal Assessing  the  impact  of  a  health  intervention   via  user-­‐generated  Internet  data   Data  Mining  and  Knowledge  Discovery  29(5),  pp.  1434–1457,  2015 Vasileios  Lampos,  Elad  Yom-­‐Tov,     Richard  Pebody  and  Ingemar  J.  Cox STATUTORY NOTIFICATIONS OF INFECTIOUS D WEEK 2015/33 week ending 16/08/2015 in ENGLAND and WALES Table 1 Statutory notifications of infectious diseases in the past 6 week current year compared with corresponding periods of the two p CONTENTS Table 2 Statutory notifications of infectious diseases for diseases for W Region, county, local and unitary authority including additional 6th April 2010 Registered Medical Practioner in England and Wales have a statutory duty to the local authority, often the CCDC (Consultant in Communicable Disease Co of certain infectious diseases: Acute encephalitis Haemolytic uraemic syndrome * R NOIDs WEEKLY REPORTat bridge
  2. 2. ๏ Background  and  motivation   ๏ Nowcasting  disease  rates  from  online  text   ๏ Estimating  the  impact  of  a  health  intervention   ๏ Case  study:  influenza  vaccination  impact   ๏ Conclusions  &  future  work 1% Assessing  the  impact  of  a  health  intervention  via  online  content
  3. 3. Online,  user-­‐generated  data + Social  media,  blogs,  search  engine  query  logs   + Proxy  of  real-­‐world  (online+offline)  behaviour   + Complementary  information  sensors  to  more   ‘traditional’  crowdsourcing  efforts   + Can  answer  questions  difficult  to  resolve  otherwise   + Strong  predictive  power
  4. 4. Online,  user-­‐generated  data  —  Applications + Politics   • voting  intention   • result  of  an  election   + Finance   • financial  indices   • tourism  patterns   + User  profiling   • age   • gender   • occupation (Preotiuc-­‐Pietro,  Lampos  &  Aletras,  2015) (Burger  et  al.,  2011) (Rao  et  al.,  2010) (Bollen,  Mao  &  Zeng,  2011) (Choi  &  Varian,  2012) (Lampos,  Preotiuc-­‐Pietro  &  Cohn,  2013) (Tumasjan  et  al.,  2010)
  5. 5. Online,  user-­‐generated  data  for  health Traditional  disease  surveillance   - does  not  cover  the  entire  population   - not  present  everywhere  (cities  /  countries)   - not  always  timely   Digital  disease  surveillance   + different  or  better  population  coverage   + better  geographical  granularity   + useful  in  underdeveloped  parts  of  the  world   + almost  instant   - noisy,  unstructured  information e.g.  (Lampos  &  Cristianini,  2010  &  2012),  (Lamb,  Paul  &  Dredze,  2013),  (Lampos  et  al.,  2015)  
  6. 6. What  this  work  is  all  about Health  intervention disease  rates ( Pebody  &  Cox,  2015 impact ?
  7. 7. What  this  work  is  all  about Health  intervention disease  rates (Lampos,  Yom-­‐Tov,   Pebody  &  Cox,  2015) impact ?
  8. 8. ✓ Background  and  motivation   ๏ Estimating  disease  rates  from  online  text   ๏ Estimating  the  impact  of  a  health  intervention   ๏ Case  study:  influenza  vaccination  impact   ๏ Conclusions  &  future  work Assessing  the  impact  of  a  health  intervention  via  online  content 15%
  9. 9. Estimating  disease  rates  from  online  textVariables N M X 2 RN⇥M y 2 RN time  intervals n-­‐grams frequency  of  n-­‐grams  during  the  time  intervals disease  rates  during  the  time  intervals Ridge regression argmin w, 0 @ NX i=1 (xiw + yi)2 +  MX j=1 w2 j 1 A Elastic net min 0 @ NX i=1 (xiw + yi)2 + 1 MX j=1 |wj| + 2 MX j=1 w2 j 1 A (Hoerl  &  Kennard,  1970) Ridge  regression Ridge regression argmin w, 0 @ NX i=1 (xiw + yi)2 +  MX j=1 w2 j 1 A Elastic net argmin w, 0 @ NX i=1 (xiw + yi)2 + 1 MX j=1 |wj| + 2 MX j=1 w2 j 1 A (Zou  &  Hastie,  2005) Elastic  net
  10. 10. Estimating  disease  rates  from  online  text the observation matrix X) we want to learn a function f: drawn from a GP prior f(x) ⇠ GP µ(x) = 0, k(x, x0 ) kSE(x, x0 ) = 2 exp ✓ kx x0k2 2 2`2 ◆ where 2 describes the overall level of variance and ` is r characteristic length-scale parameter. An infinite sum of SE kernels with di↵erent length-scal other well studied covariance function, the Rational Quadra kRQ(x, x0 ) = 2 ✓ 1 + kx x0k2 2 2↵`2 ◆ ↵ ↵ is a parameter that determines the relative weightin and large-scale variations of input pairs. The RQ kernel can Gaussian  Process kSE(x, x0 ) = 2 exp 2 2`2 where 2 describes the overall level of variance and ` is referred to a acteristic length-scale parameter. An infinite sum of SE kernels with di↵erent length-scales results to r well studied covariance function, the Rational Quadratic (RQ) ke kRQ(x, x0 ) = 2 ✓ 1 + kx x0k2 2 2↵`2 ◆ ↵ ↵ is a parameter that determines the relative weighting between s large-scale variations of input pairs. The RQ kernel can be used to m tions that are expected to vary smoothly across many length-scale 1 Rational  Quadratic  covariance  function  (kernel) infinite  sum  of  squared  exponential  (RBF)  kernels k(x, x0 ) = CX n=1 kRQ(gn, g0 n) ! + kN(x, x0 ) One  kernel  per  n-­‐gram  category   varied  usage  patterns,  increasing  semantic  value (Rasmussen  &  Williams,  2006) see  also  (
  11. 11. Estimating  disease  rates  from  online  text the observation matrix X) we want to learn a function f: drawn from a GP prior f(x) ⇠ GP µ(x) = 0, k(x, x0 ) kSE(x, x0 ) = 2 exp ✓ kx x0k2 2 2`2 ◆ where 2 describes the overall level of variance and ` is r characteristic length-scale parameter. An infinite sum of SE kernels with di↵erent length-scal other well studied covariance function, the Rational Quadra kRQ(x, x0 ) = 2 ✓ 1 + kx x0k2 2 2↵`2 ◆ ↵ ↵ is a parameter that determines the relative weightin and large-scale variations of input pairs. The RQ kernel can Gaussian  Process kSE(x, x0 ) = 2 exp 2 2`2 where 2 describes the overall level of variance and ` is referred to a acteristic length-scale parameter. An infinite sum of SE kernels with di↵erent length-scales results to r well studied covariance function, the Rational Quadratic (RQ) ke kRQ(x, x0 ) = 2 ✓ 1 + kx x0k2 2 2↵`2 ◆ ↵ ↵ is a parameter that determines the relative weighting between s large-scale variations of input pairs. The RQ kernel can be used to m tions that are expected to vary smoothly across many length-scale 1 Rational  Quadratic  covariance  function  (kernel) infinite  sum  of  squared  exponential  (RBF)  kernels k(x, x0 ) = CX n=1 kRQ(gn, g0 n) ! + kN(x, x0 ) here gn is used to express the features of each n-gram category One  kernel  per  n-­‐gram  category   varied  usage  patterns,  increasing  semantic  value (Rasmussen  &  Williams,  2006) see  also  (Lampos  et  al.,  2015)
  12. 12. Estimating  influenza-­‐like  illness  (ILI)  rates  —  Data 2012 2013 2014 0 0.01 0.02 0.03 0.04 ILIrateper100people ILI rates (PHE) Bing User-­‐generated  data,  geolocated  in  England   • Twitter:  May  2011  to  April  2014  (308  million  tweets)   • Bing:  end  of  December  2012  to  April  2014 ILI  rates  from  Public  Health  England  (PHE)
  13. 13. Estimating  ILI  rates  —  Feature  extraction • Start  with  a  manually  crafted  list  of  36  textual   markers,  e.g.  flu,  headache,  doctor,  cough     • Extract  frequent  co-­‐occurring  n-­‐grams  from  a  corpus   of  30  million  UK  tweets  (February  &  March,  2014)   after  removing  stop-­‐words   • Set  of  markers  expanded  to  205  n-­‐grams  (n  ≤  4)
 e.g.  #flu,  #cough,  annoying  cough,  worst  sore  throat     • Relatively  small  set  of  features  motivated  by   previous  work   (Culotta,  2013)
  14. 14. Estimating  ILI  rates  —  Experimental  setup Two  time  intervals  based  on  the  different  temporal   coverage  of  Twitter  and  Bing  data   • Dt1:  154  weeks  (May  2011  to  April  2014)   • Dt2:  67  weeks  (December  2012  to  April  2014)   Stratified  10-­‐fold  cross  validation   Error  metrics   • Pearson  correlation  (r)   • Mean  Absolute  Error  (MAE)
  15. 15. Pearson  correlation  (r) 0.5 0.6 0.7 0.8 0.9 1 User-­‐generated  data  source Twitter  (Dt1) Twitter  (Dt2) Bing  (Dt2) 0.952 0.924 0.845 0.867 0.744 0.718 0.814 0.698 0.64 Ridge  Regression Elastic  Net Gaussian  Process Estimating  ILI  rates  —  Performance
  16. 16. MAE 1 1.64 2.28 2.92 3.56 4.2 User-­‐generated  data  source Twitter  (Dt1) Twitter  (Dt2) Bing  (Dt2) 1.598 1.999 2.196 2.564 3.198 2.828 2.963 4.084 3.074 Ridge  Regression Elastic  Net Gaussian  Process Estimating  ILI  rates  —  Performancex  103
  17. 17. ✓ Background  and  motivation   ✓ Estimating  disease  rates  from  online  text   ๏ Estimating  the  impact  of  a  health  intervention   ๏ Case  study:  influenza  vaccination  impact   ๏ Conclusions  &  future  work Assessing  the  impact  of  a  health  intervention  via  online  content 41%
  18. 18. Estimating  the  impact  of  a  health  intervention 1. Disease  intervention  launched  (to  a  set  of  areas)   2. Define  a  distinct  set  of  control  areas   3. Estimate  disease  rates  in  all  areas   4.Identify  pairs  of  areas  with  strong  historical  correlation   in  their  disease  rates   5. Use  this  relationship  during  and  slightly  after  the   intervention  to  infer  diseases  rates  in  the  affected  areas   had  the  intervention  not  taken  place
  19. 19. Estimating  the  impact  of  a  health  intervention Based on a new observation x⇤, a prediction is conduc the mean value of the posterior predictive distribution, E — ⌧ = {t1, . . . , tN } v c r(q⌧ v, q⌧ c ) f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b time  interval(s)  before  the  intervention location(s)  where  the  intervention  took  place control  location(s) log-marginal likelihood function argmin 1,..., C ,`1,...,`C ,↵1,...,↵C , N (y µ)| K 1 (y µ) + log |K where K holds the covariance function evaluations for all pai i.e., (K)i,j = k(xi, xj), and µ = (µ(x1), . . . , µ(xN )). Based on a new observation x⇤, a prediction is conducted by the mean value of the posterior predictive distribution, E[y⇤|y, — ⌧ = {t1, . . . , tN } v c r(q⌧ v, q⌧ c ) f(w, ) : R ! R where K holds the covariance function evaluations i.e., (K)i,j = k(xi, xj), and µ = (µ(x1), . . . , µ(xN )). Based on a new observation x⇤, a prediction is con the mean value of the posterior predictive distribution — ⌧ = {t1, . . . , tN } v c r(q⌧ v, q⌧ c ) f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 such  that i.e., (K)i,j = k(xi, xj), and µ = (µ( Based on a new observation x⇤, the mean value of the posterior pre — ⌧ = {t1, . . . , tN } v c r(q⌧ v, q⌧ c ) f(w, ) : R ! R argmin w, NX i=1 ⇤ disease  rate(s)  in   affected  location   before  intervention disease  rate(s)  in   control  location   before  intervention high
  20. 20. Estimating  the  impact  of  a  health  intervention c r(q⌧ v, q⌧ c ) f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b qv v = qv q⇤ v qv q⇤ v f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b qv v = qv q⇤ v ✓v = qv q⇤ v q⇤ v . such  that qv disease  rate(s)  in  affected  location   during/after  intervention v = qv q⇤ v absolute  difference ✓v = qv q⇤ v q⇤ v relative  difference  (impact) (Lambert  &  Pregibon,  2008 estimate  projected  rate(s)  in  affected   location  during/after  intervention argmin w, i=1 qc w + q⇤ v = q⇤ cw + b q⇤ v = qcw + b qv v = qv q⇤ v 2
  21. 21. Estimating  the  impact  of  a  health  intervention c r(q⌧ v, q⌧ c ) f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b qv v = qv q⇤ v qv q⇤ v f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b qv v = qv q⇤ v ✓v = qv q⇤ v q⇤ v . such  that f(w, ) : R ! R argmin w, NX i=1 qti c w + qti v q⇤ v = q⇤ cw + b qv v = qv q⇤ v ✓v = qv q⇤ v q⇤ v . disease  rate(s)  in  affected  location   during/after  intervention argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b v = qv q⇤ v ✓ = qv q⇤ v absolute  difference argmin w, NX i=1 qti c w + qti v 2 q⇤ v = q⇤ cw + b qv v = qv q⇤ v ✓v = qv q⇤ v q⇤ v relative  difference  (impact) (Lambert  &  Pregibon,  2008) estimate  projected  rate(s)  in  affected   location  during/after  intervention argmin w, i=1 qc w + q⇤ v = q⇤ cw + b q⇤ v = qcw + b qv v = qv q⇤ v 2
  22. 22. ✓ Background  and  motivation   ✓ Estimating  disease  rates  from  online  text   ✓ Estimating  the  impact  of  a  health  intervention   ๏ Case  study:  influenza  vaccination  impact   ๏ Conclusions  &  future  work Assessing  the  impact  of  a  health  intervention  via  online  content 52%
  23. 23. Live  Attenuated  Influenza  Vaccine  (LAIV)  campaign 2012 2013 2014 0 0.01 0.02 0.03 ILIrateper100people PHE/RCGP LAIV Post LAIV ∆t v • LAIV  programme  for  children  (4  to  11  years)  in  pilot   areas  of  England  during  the  2013/14  flu  season   • Vaccination  period  (blue):  Sept.  2013  to  Jan.  2014   • Post-­‐vaccination  period  (green):  Feb.  to  April  2014
  24. 24. Target  (vaccinated)  &  control  areas Brighton  •  Bristol  •  Cambridge   Exeter  •  Leeds  •  Liverpool   Norwich  •  Nottingham  •  Plymouth   Sheffield  •  Southampton  •  York Control  areas Bury  •  Cumbria  •  Gateshead   Leicester  •  East  Leicestershire   Rutland  •  South-­‐East  Essex   Havering  (London)   Newham  (London) Vaccinated  areas
  25. 25. Applying  the  impact  estimation  framework Target  vs.  control  areas   • Use  previous  flu  season  only  to  establish  relationships   • Find  the  best  correlated  areas  or  supersets  of  them   Confidence  intervals   • Bootstrap  sampling  of  the  regression  residuals   (mapping  function  of  control  to  vaccinated  areas)   • Bootstrap  sampling  of  data  prior  to  the  application  of   the  bootstrapped  regressor   • 105  bootstraps;  use  the  .025  and  .975  quantiles   Statistical  significance  assessment   • Impact  estimate  (abs.)  >  2σ  of  the  bootstrap  estimates
  26. 26. Relationship  between  vaccinated  &  control  areas Twitter  —  All  areas Bing  —  All  areas 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 ILIratesinvaccinatedareas ILI rates in control areas pre−vaccination period during/after LAIV 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 ILIratesinvaccinatedareas ILI rates in control areas pre−vaccination period during/after LAIV axes  normalised   from  0  to  1 r  =  .86 r  =  .87
  27. 27. Relationship  between  vaccinated  &  control  areas Twitter  —  London   areas Bing  —  London  areas 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 ILIratesinvaccinatedareas ILI rates in control areas pre−vaccination period during/after LAIV 0 0.25 0.5 0.75 1 0 0.25 0.5 0.75 1 ILIratesinvaccinatedareas ILI rates in control areas pre−vaccination period during/after LAIV axes  normalised   from  0  to  1 r  =  .74 r  =  .85
  28. 28. Impact  estimation  results  (strongly  correlated  controls) Source Target r δ  x  103 θ  (%) Twitter All  areas .861 -­‐2.5  (-­‐4.1,  -­‐1.0) -­‐32.8  (-­‐47.4,  -­‐15.6) Bing All  areas .866 -­‐1.9  (-­‐3.2,  -­‐0.7) -­‐21.7  (-­‐32.1,  -­‐9.10) Twitter London   areas .738 -­‐1.7  (-­‐2.5,  -­‐0.9) -­‐30.5  (-­‐41.8,  -­‐17.5) Bing London   areas .848 -­‐2.8  (-­‐4.1,  -­‐1.6) -­‐28.4  (-­‐36.7,  -­‐17.9)
  29. 29. Impact  estimation  results  (strongly  correlated  controls) Source Target r δ  x  103 θ  (%) Twitter All  areas .861 -­‐2.5  (-­‐4.1,  -­‐1.0) -­‐32.8  (-­‐47.4,  -­‐15.6) Bing All  areas .866 -­‐1.9  (-­‐3.2,  -­‐0.7) -­‐21.7  (-­‐32.1,  -­‐9.10) Twitter London   areas .738 -­‐1.7  (-­‐2.5,  -­‐0.9) -­‐30.5  (-­‐41.8,  -­‐17.5) Bing London   areas .848 -­‐2.8  (-­‐4.1,  -­‐1.6) -­‐28.4  (-­‐36.7,  -­‐17.9)
  30. 30. Source Target r δ  x  103 θ  (%) Twitter All  areas .861 -­‐2.5  (-­‐4.1,  -­‐1.0) -­‐32.8  (-­‐47.4,  -­‐15.6) Bing All  areas .866 -­‐1.9  (-­‐3.2,  -­‐0.7) -­‐21.7  (-­‐32.1,  -­‐9.10) Twitter London   areas .738 -­‐1.7  (-­‐2.5,  -­‐0.9) -­‐30.5  (-­‐41.8,  -­‐17.5) Bing London   areas .848 -­‐2.8  (-­‐4.1,  -­‐1.6) -­‐28.4  (-­‐36.7,  -­‐17.9) Impact  estimation  results  (strongly  correlated  controls)
  31. 31. Source Target r δ  x  103 θ  (%) Twitter All  areas .861 -­‐2.5  (-­‐4.1,  -­‐1.0) -­‐32.8  (-­‐47.4,  -­‐15.6) Bing All  areas .866 -­‐1.9  (-­‐3.2,  -­‐0.7) -­‐21.7  (-­‐32.1,  -­‐9.10) Twitter London   areas .738 -­‐1.7  (-­‐2.5,  -­‐0.9) -­‐30.5  (-­‐41.8,  -­‐17.5) Bing London   areas .848 -­‐2.8  (-­‐4.1,  -­‐1.6) -­‐28.4  (-­‐36.7,  -­‐17.9) Impact  estimation  results  (strongly  correlated  controls)
  32. 32. Impact  estimation  results  (stat.  sig.) -­‐θ  (%) 0 7 14 21 28 35 All  areas London  areas Newham Cumbria Gateshead 30.2 28.7 21.7 21.1 30.430.5 32.8 Twitter Bing
  33. 33. Projected  vs.  inferred  ILI  rates  in  vaccinated  locations Twitter  —  All  areas Bing  —  All  areas Oct Nov Dec Jan Feb Mar Apr 0 0.005 0.01 0.015 0.02 ILIratesper100people weeks during and after the vaccination programme inferred ILI rates projected ILI rates Oct Nov Dec Jan Feb Mar Apr 0 0.005 0.01 0.015 0.02 ILIratesper100people weeks during and after the vaccination programme inferred ILI rates projected ILI rates
  34. 34. Projected  vs.  inferred  ILI  rates  in  vaccinated  locations Twitter  —  London   areas Bing  —  London  areas Oct Nov Dec Jan Feb Mar Apr 0 0.005 0.01 ILIratesper100people weeks during and after the vaccination programme inferred ILI rates projected ILI rates Oct Nov Dec Jan Feb Mar Apr 0 0.005 0.01 0.015 ILIratesper100people weeks during and after the vaccination programme inferred ILI rates projected ILI rates
  35. 35. Sensitivity  of  impact  estimates  to  variable  controls • Repeat  the  impact  estimation  for  the  N  controls  (up  to   a  100)  with  r  ≥  95%  of  the  best  r  —>  μ(δ)  and  μ(θ)  (%)   • Measure  %  of  difference,  Δ(θ),  between  θ  and  μ(θ) Source Target N μ(r) μ(δ)  x  103 μ(θ)  (%) Δθ  (%) Twitter All  areas 100 0.84 -­‐2.5  (0.2) -­‐32.7  (2.1) 0.10 Bing All  areas 46 0.85 -­‐1.4  (0.4) -­‐16.4  (3.6) 24.4 Twitter London   areas 79 0.70 -­‐1.5  (0.1) -­‐27.9  (2.0) 8.32 Bing London   areas 100 0.84 -­‐1.4  (0.2) -­‐16.9  (1.8) 40.4
  36. 36. Sensitivity  of  impact  estimates  to  variable  controls • Repeat  the  impact  estimation  for  the  N  controls  (up  to   a  100)  with  r  ≥  95%  of  the  best  r  —>  μ(δ)  and  μ(θ)  (%)   • Measure  %  of  difference,  Δ(θ),  between  θ  and  μ(θ) Source Target N μ(r) μ(δ)  x  103 μ(θ)  (%) Δθ  (%) Twitter All  areas 100 0.84 -­‐2.5  (0.2) -­‐32.7  (2.1) 0.10 Bing All  areas 46 0.85 -­‐1.4  (0.4) -­‐16.4  (3.6) 24.4 Twitter London   areas 79 0.70 -­‐1.5  (0.1) -­‐27.9  (2.0) 8.32 Bing London   areas 100 0.84 -­‐1.4  (0.2) -­‐16.9  (1.8) 40.4
  37. 37. Sensitivity  of  impact  estimates  to  variable  controls • Repeat  the  impact  estimation  for  the  N  controls  (up  to   a  100)  with  r  ≥  95%  of  the  best  r  —>  μ(δ)  and  μ(θ)  (%)   • Measure  %  of  difference,  Δ(θ),  between  θ  and  μ(θ) Source Target N μ(r) μ(δ)  x  103 μ(θ)  (%) Δθ  (%) Twitter All  areas 100 0.84 -­‐2.5  (0.2) -­‐32.7  (2.1) 0.10 Bing All  areas 46 0.85 -­‐1.4  (0.4) -­‐16.4  (3.6) 24.4 Twitter London   areas 79 0.70 -­‐1.5  (0.1) -­‐27.9  (2.0) 8.32 Bing London   areas 100 0.84 -­‐1.4  (0.2) -­‐16.9  (1.8) 40.4
  38. 38. ✓ Background  and  motivation   ✓ Estimating  disease  rates  from  online  text   ✓ Estimating  the  impact  of  a  health  intervention   ✓ Case  study:  influenza  vaccination  impact   ๏ Conclusions  &  future  work Assessing  the  impact  of  a  health  intervention  via  online  content 89%
  39. 39. Conclusions  &  points  for  discussion • Framework  for  estimating  the  impact  of  a  health   intervention  based  on  online  content   • Access  to  different  &  larger  parts  of  the  population   Evaluation  is  hard,  however:   • PHE’s  impact  estimates:  -­‐66%  based  on  sentinel   surveillance,  -­‐24%  laboratory  confirmed   • Correlation  between  actual  vaccination  uptake  and  our   study’s  estimated  impacts   Why  are  Bing  and  Twitter  estimations  different?   • Different  user  demographics  (?)  —  this  can  be  useful   • Different  temporal  resolution (Pebody  et  al.,  2014)
  40. 40. Potential  future  work  directions • Improve  supervised  learning  models   - better  natural  language  processing  /  machine   learning  modelling   - combination  of  different  data  sources   • Work  on  unsupervised  techniques   - inferring  /  understanding  the  demographics  of  the   online  medium  will  be  essential   • More  rigorous  evaluation
  41. 41. Collaborators,  acknowledgements  &  material Elad  Yom-­‐Tov,  Microsoft  Research   Richard  Pebody,  Public  Health  England   Ingemar  J.  Cox,  UCL  &  University  of  Copenhagen Jens  Geyti,  UCL  (Software  Engineer)   Simon  de  Lusignan,  University  of  Surrey  &  RCGP Slides:  ow.ly/RN7MZPaper:  ow.ly/RN9J2 i-­‐sense.org.uk
  42. 42. Bollen,  Mao  &  Zeng.  Twitter  mood  predicts  the  stock  market.  J  Comp  Science,  2011.   Burger,  Henderson,  Kim  &  Zarrella.  Discriminating  Gender  on  Twitter.  EMNLP,  2011.   Choi  &  Varian.  Predicting  the  Present  with  Google  Trends.  Economic  Record,  2012.   Culotta.  Lightweight  methods  to  estimate  influenza  rates  and  alcohol  sales  volume  from  Twitter  messages.  Lang   Resour  Eval,  2013.   Hoerl  &  Kennard.  Ridge  regression:  biased  estimation  for  nonorthogonal  problems.  Technometrics,  1970.   Lamb,  Paul  &  Dredze.  Separating  Fact  from  Fear:  Tracking  Flu  Infections  on  Twitter.  NAACL,  2013.   Lambert  &  Pregibon.  Online  effects  of  offline  ads.  Data  Mining  &  Audience  Intelligence  for  Advertising,  2008.   Lampos  &  Cristianini.  Tracking  the  flu  pandemic  by  monitoring  the  Social  Web.  CIP,  2010.   Lampos  &  Cristianini.  Nowcasting  Events  from  the  Social  Web  with  Statistical  Learning.  ACM  TIST,  2012.   Lampos,  Miller,  Crossan  &  Stefansen.  Advances  in  nowcasting  influenza-­‐like  illness  rates  using  search  query  logs.   Sci  Rep,  2015.   Lampos,   Yom-­‐Tov,   Pebody   &   Cox.   Assessing   the   impact   of   a   health   intervention   via   user-­‐generated   Internet   content.  DMKD,  2015.   Pebody  et  al.  Uptake  and  impact  of  a  new  live  attenuated  influenza  vaccine  programme  in  England:  early  results  of   a  pilot  in  primary  school-­‐age  children,  2013/14  influenza  season.  Eurosurveillance,  2014.   Preotiuc-­‐Pietro,  Lampos  &  Aletras.  An  analysis  of  the  user  occupational  class  through  Twitter  content.  ACL,  2015.   Rao,  Yarowsky,  Shreevats  &  Gupta.  Classifying  Latent  User  Attributes  in  Twitter.  SMUC,  2010.   Rasmussen  &  Williams.  Gaussian  Processes  for  Machine  Learning.  MIT  Press,  2006.   Tumasjan,   Sprenger,   Sandner   &   Welpe.   Predicting   Elections   with   Twitter:   What   140   characters   Reveal   about   Political  Sentiment.  ICWSM,  2010.   Zou  &  Hastie.  Regularization  and  variable  selection  via  the  elastic  net.  J  R  Stat  Soc  Series  B  Stat  Methodol,  2005. References

×