
Mining online data for public health surveillance

Online user-generated data (search engine activity, social media) offers health-related insights that traditional health monitoring systems cannot always provide. A central paradigm for this has been the estimation of the rate of a disease in a population by modelling the corresponding online search engine activity. However, the various inaccuracies in the estimates of Google Flu Trends (GFT), a popular approach for estimating influenza-like illness (ILI) rates from online search logs, have generated a debate as to whether this is feasible at all. In this talk, we will first go through GFT and try to explain why it has been unsuccessful. We will then present alternative approaches which deploy more adequate statistical and semantic modelling techniques to successfully capture ILI rates both in the US and England. A long-term empirical analysis will show that these models retain a strong predictive performance across many consecutive flu seasons. Expanding on the previous approaches, we will also present a brief overview of more recent work, where multi-task learning across different geographies is used to further improve ILI models, especially in situations where only sporadic historical health reports are available.


Mining online data for public health surveillance

  1. 1. Mining online data for public health surveillance Vasileios Lampos (a.k.a. Bill) Computer Science University College London @lampos
  2. 2. Structure ➡ Using online data for health applications ➡ From web searches to syndromic surveillance i. Google Flu Trends: original failure and correction ii. Better feature selection using semantic concepts iii. Snapshot: Multi-task learning for disease models
  3. 3. Online data 22/03/2017
  4. 4. Online data for health (1/3) When & why? — coverage — speed — cost — real-time How? — collaborate with experts — access to user activity data — machine learning — natural language processing Evaluation? — (partial) ground truth — model interpretation
  8. 8. Online data for health (2/3) Google Flu Trends (discontinued)
  9. 9. Online data for health (2/3) Flu Detector, fludetector.cs.ucl.ac.uk
  10. 10. Online data for health (2/3) Health Map, healthmap.org
  11. 11. Online data for health (3/3) Health intervention Impact?
  13. 13. Online data for health (3/3) Vaccinations against flu Impact? Lampos, Yom-Tov, Pebody, Cox (2015) doi:10.1007/s10618-015-0427-9 Wagner, Lampos, Yom-Tov, Pebody, Cox (2017) doi:10.2196/jmir.8184
  14. 14. Google Flu Trends (GFT) [Figure 2 from Ginsberg et al., Nature 457, 19 February 2009: model estimates for the mid-Atlantic US region (black) against CDC-reported ILI percentages (red), 2004-2008, including points over which the model was fit and validated; a correlation of 0.85 was obtained over the 128 points to which the model was fit.]
  15. 15. Google Flu Trends (GFT) Same figure, raising the question: which function f maps search activity to ILI rates?
  16. 16. GFT — Supervised learning Regression Observations (X): frequencies of n search queries for a location L and m contiguous time intervals of length τ Targets (y): rates of influenza-like illness (ILI) for L and for the same m contiguous time intervals, obtained from a health agency Learn a function f such that f: X ∈ ℝⁿˣᵐ ⟶ y ∈ ℝᵐ
  17. 17. GFT — Supervised learning Query frequency: frequency of qᵢ = count of qᵢ / total count of all queries
  18. 18. GFT v.1 — Model A linear model fit on the log-odds of an ILI physician visit and the log-odds of an ILI-related search query: logit(P) = β0 + β1 × logit(Q) + ε, where P is the percentage (probability) of ILI-related doctor visits in a particular region, Q is the aggregate frequency of a set of search queries (the ILI-related query fraction), β0 is the intercept (regression bias term), β1 is the single regression weight, and ε is independent, zero-centred noise (a code sketch follows below)
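A minimal Python sketch of this v.1 model, fit with ordinary least squares; the arrays q (weekly aggregate query fraction) and p (weekly ILI visit rate) are hypothetical placeholders, both assumed to lie in (0, 1):

```python
import numpy as np

def logit(a):
    # logit(a) = log(a / (1 - a)), defined for a in (0, 1)
    return np.log(a / (1.0 - a))

def fit_gft_v1(q, p):
    # Fit logit(P) = beta0 + beta1 * logit(Q) + eps by least squares;
    # np.polyfit returns the slope first, then the intercept
    beta1, beta0 = np.polyfit(logit(q), logit(p), deg=1)
    return beta0, beta1

def predict_gft_v1(beta0, beta1, q):
    # Invert the logit to map estimates back into (0, 1)
    z = beta0 + beta1 * logit(q)
    return 1.0 / (1.0 + np.exp(-z))
```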
  22. 22. GFT v.1 — “Logit”, why? Exploratory analysis found that pairwise relationships between query rates and ILI rates were stronger in logit space, motivating the use of this transformation across all variables; related work followed the same modelling principle: logit(α) = log(α / (1 − α)), with α ∈ (0,1)
  23. 23. GFT v.1 — “Logit”, why? The transformation “squashes” values close to 0.5 and “emphasises” border values (close to 0 or 1); transformed variables are then z-scored [plots: (x, y) vs. (x, logit(y)) pair values]
  24. 24. GFT v.1 — Data 9 US regions considered 50 million search queries (most frequent) geolocated in these 9 US regions Weekly ILI rates from CDC 170 weeks, 28/9/2003 to 11/5/2008 with ILI rate > 0 First 128 weeks: Training, 9 x 128 = 1,152 samples Last 42 weeks: Testing (per region)
  25. 25. GFT v.1 — Feature selection (1/2) 1. Single-query flu models are trained for each US region: 50 million queries × 9 US regions = 450 million models 2. Inference accuracy is estimated for each query using linear (Pearson) correlation as the metric 3. Starting from the best-performing query and adding one query at a time, a new model is trained and evaluated (sketched in code below)
  26. 26. GFT v.1 — Feature selection (1/2) [Figure 1 from Ginsberg et al. (2009): mean correlation against the number of top-scoring queries included in the model; performance peaks at 45 queries.]
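The selection loop lends itself to a compact sketch. A hedged Python rendering, assuming hypothetical arrays X (weeks × queries, query fractions) and y (weekly ILI rates); the real pipeline evaluated candidate models on held-out data rather than on the fit itself:

```python
import numpy as np

def forward_select(X, y, max_queries=100):
    # Rank queries by absolute Pearson correlation with the ILI signal
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corrs)[::-1]           # best-correlated query first
    best_k, best_score = 1, -np.inf
    for k in range(1, max_queries + 1):
        q = X[:, order[:k]].mean(axis=1)      # aggregate the top-k queries
        score = np.corrcoef(q, y)[0, 1]       # score the aggregate model
        if score > best_score:
            best_k, best_score = k, score
    return order[:best_k]                     # indices of selected queries
```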
  27. 27. GFT v.1 — Feature selection (2/2) Topics found in search queries most correlated with CDC ILI data (Table 1, Ginsberg et al. 2009). Columns: topic | top-45 queries: n (weighted) | next-55 queries: n. The top 45 queries were used in the final model; the next 55 are shown for comparison.
Influenza complication | 11 (18.15) | 5
Cold/flu remedy | 8 (5.05) | 6
General influenza symptoms | 5 (2.60) | 1
Term for influenza | 4 (3.74) | 6
Specific influenza symptom | 4 (2.54) | 6
Symptoms of an influenza complication | 4 (2.21) | 2
Antibiotic medication | 3 (6.23) | 3
General influenza remedies | 2 (0.18) | 1
Symptoms of a related disease | 2 (1.66) | 2
Antiviral medication | 1 (0.39) | 1
Related disease | 1 (6.66) | 3
Unrelated to influenza | 0 (0.00) | 19
Total | 45 (49.40) | 55
  28. 28. GFT v.1 — Performance (1/2) Evaluated on 42 weeks (per region) from 2007-2008 Evaluation metric: Pearson correlation μ(r) = .97 with min(r) = .92 and max(r) = .99 Performance looked great at the time, but this is not a proper performance evaluation! Why? A potentially misleading metric (correlation is not the loss function here) and a rather small testing time span (< 1 flu season)
  29. 29. GFT v.1 — Performance (2/2) Mid-Atlantic US region, Pearson correlation r = .96 [Figure 2 from Ginsberg et al. (2009): model estimates (black) against CDC-reported ILI percentages (red), 2004-2008; r = .85 over the 128 fitting points, r = .96 over the 42 validation points; dotted lines indicate 95% prediction intervals. The region comprises New York, New Jersey and Pennsylvania.]
  30. 30. GFT v.2 — Data & evaluation weekly frequency of 49,708 search queries (US) filtered by a relaxed health topic classifier, intersection of frequent queries across all US regions from 4/1/2004 to 28/12/2013 (521 weeks) corresponding weekly US ILI rates from CDC test on 5 flu seasons, 5 year-long test sets (2008-13) train on increasing data sets starting from 2004, using all data prior to a test period
  31. 31. GFT v.1 was simple to a (significant) fault [Plot: weekly US ILI rates, 2009-2013; CDC ILI rates vs. GFT estimates.]
  32. 32. GFT v.1 was simple to a (significant) fault Top-weighted queries in the model: “rsv” (25%), “flu symptoms” (18%), “benzonatate” (6%), “symptoms of pneumonia” (6%), “upper respiratory infection” (4%) [same plot as above]
  33. 33. GFT v.2 — Linear multivariate regression Least squares: argmin over (w, β) of Σᵢ₌₁ⁿ (xᵢᵀw + β − yᵢ)² Ridge regression: add a penalty λ Σⱼ₌₁ᵐ wⱼ² Elastic net: add penalties λ1 Σⱼ₌₁ᵐ |wⱼ| + λ2 Σⱼ₌₁ᵐ wⱼ² Variables: X ∈ ℝⁿˣᵐ, the frequency of m search queries for n weeks; xᵢ ∈ ℝᵐ, i ∈ {1, …, n}, the frequencies for week i; y ∈ ℝⁿ, the ILI rates from CDC for n weeks, with yᵢ ∈ ℝ the rate for week i; w ∈ ℝᵐ, the weights for the m search queries; β ∈ ℝ, the intercept term
  37. 37. GFT v.2 — Linear multivariate regression ⚠ ☣ ⚠ Least squares regression is not applicable here because we have very few training samples (n) but many features (search queries; m). Models derived from least squares will tend to overfit the data, resulting in bad solutions.
  38. 38. GFT v.2 — Regularisation with elastic net argmin over (w, β) of [ Σᵢ₌₁ⁿ (xᵢᵀw + β − yᵢ)² + λ1 Σⱼ₌₁ᵐ |wⱼ| + λ2 Σⱼ₌₁ᵐ wⱼ² ], i.e. the least squares objective plus L1- and L2-norm regularisers for the weights (λ1, λ2 ∈ ℝ₊) Encourages sparse models (feature selection): many weights will be set to zero! Handles collinear features (search queries) The number of selected features is not limited to the number of samples (n) (a code sketch follows below)
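A minimal sketch of this regularised regression with scikit-learn's ElasticNet; the data below is synthetic and the hyper-parameter values are illustrative, not those of the published model:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))              # n=200 weeks, m=1000 queries
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=200)

model = ElasticNet(alpha=0.1, l1_ratio=0.5)   # mixes L1 and L2 penalties
model.fit(X, y)
selected = np.flatnonzero(model.coef_)        # the L1 part zeroes many weights
print(f"{selected.size} of {X.shape[1]} queries kept")
```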
  43. 43. GFT v.2 — Feature selection 1st layer: keep search queries whose frequency time series has a Pearson correlation ≥ 0.5 with the CDC ILI rates (in the training data) 2nd layer: elastic net assigns zero weights to features (search queries) identified as statistically irrelevant to our task Queries selected, μ (σ) across all training data sets: all queries: 49,708 | r ≥ 0.5: 937 (334) | GFT: 46 (39) | elastic net: 278 (64)
  44. 44. GFT v.2 — Evaluation (1/2) Given ground truth y = y1, …, yN (target variable) and predictions ŷ = ŷ1, …, ŷN (estimates), we assess predictive performance with: Pearson correlation: r = (1/(N−1)) Σₜ₌₁ᴺ ((yₜ − μ(y))/σ(y)) · ((ŷₜ − μ(ŷ))/σ(ŷ)), where μ(·) and σ(·) are the sample mean and standard deviation Mean Absolute Error: MAE(ŷ, y) = (1/N) Σₜ₌₁ᴺ |ŷₜ − yₜ| Mean Absolute Percentage of Error: MAPE(ŷ, y) = (1/N) Σₜ₌₁ᴺ |(ŷₜ − yₜ)/yₜ| Mean Squared Error: MSE(ŷ, y) = (1/N) Σₜ₌₁ᴺ (ŷₜ − yₜ)² Note: performance is measured after reversing the logit transformation
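These metrics translate directly into numpy; y holds the ground truth and y_hat the estimates (np.corrcoef matches the sample-normalised Pearson definition above):

```python
import numpy as np

def pearson_r(y_hat, y):
    return np.corrcoef(y_hat, y)[0, 1]

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))

def mape(y_hat, y):
    return np.mean(np.abs((y_hat - y) / y))

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)
```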
  45. 45. GFT v.2 — Evaluation (2/2) [Plot: weekly US ILI rates, 2009-2013; CDC ILI rates vs. elastic net estimates.]
  46. 46. GFT v.2 — Evaluation (2/2) GFT: r = .89, MAE = 3.81·10⁻³, MAPE = 20.4% Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9%
  47. 47. GFT v.2 — Nonlinearities in the data US ILI rates (CDC) ~ freq. of query ‘flu’ [scatter plot of raw data with a linear fit]
  48. 48. GFT v.2 — Nonlinearities in the data US ILI rates (CDC) ~ freq. of query ‘flu medicine’ [scatter plot of raw data with a linear fit]
  49. 49. GFT v.2 — Nonlinearities in the data US ILI rates (CDC) ~ freq. of query ‘how long is flu contagious’ [scatter plot of raw data with a linear fit]
  50. 50. GFT v.2 — Nonlinearities in the data US ILI rates (CDC) ~ freq. of query ‘how to break a fever’ [scatter plot of raw data with a linear fit]
  51. 51. GFT v.2 — Nonlinearities in the data US ILI rates (CDC) ~ freq. of query ‘sore throat treatment’ [scatter plot of raw data with a linear fit]
  52. 52. GFT v.2 — Gaussian Processes (1/4) A Gaussian Process (GP) learns a distribution over functions that can explain the data Fully specified by a mean (m) and a covariance (kernel) function (k); we set m(x) = 0 in our experiments Collection of random variables, any finite number of which have a multivariate Gaussian distribution f(x) ~ GP(m(x), k(x, x′)), x, x′ ∈ ℝᵐ, f: ℝᵐ → ℝ; inference: f* ~ N(0, K), (K)ᵢⱼ = k(xᵢ, xⱼ)
  53. 53. GFT v.2 — Gaussian Processes (1/4) For a univariate Gaussian N(x | μ, σ²): E[x] = μ and var[x] = E[x²] − E[x]² = σ²
  54. 54. GFT v.2 — Gaussian Processes (1/4) Multivariate Gaussian over x ∈ ℝᴰ: N(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)), where μ ∈ ℝᴰ is the mean and Σ the D×D covariance matrix
  55. 55. GFT v.2 — Gaussian Processes (2/4) Common GP kernels (covariance functions): Squared-exp (SE): k(x, x′) = σf² exp(−(x − x′)²/(2ℓ²)) → local variation Periodic (Per): k(x, x′) = σf² exp(−(2/ℓ²) sin²(π(x − x′)/p)) → repeating structure Linear (Lin): k(x, x′) = σf² (x − c)(x′ − c) → linear functions
  56. 56. GFT v.2 — Gaussian Processes (3/4) Adding or multiplying GP kernels produces a new valid GP kernel Examples of one-dimensional structures expressible by multiplying kernels: Lin × Lin → quadratic functions, SE × Per → locally periodic, Lin × SE → increasing variation, Lin × Per → growing amplitude
  57. 57. GFT v.2 — Gaussian Processes (4/4) (x, y) pairs with an obvious nonlinear relationship [scatter plot of the pairs]
  58. 58. GFT v.2 — Gaussian Processes (4/4) least squares regression (poor solution) [scatter plot with the OLS fit]
  59. 59. GFT v.2 — Gaussian Processes (4/4) sum of 2 GP kernels (periodic + squared exponential) [scatter plot with the OLS and GP fits; see the sketch below]
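A hedged sketch of the toy example with scikit-learn: a GP with a summed kernel (periodic + squared exponential + white noise) fit to synthetic data of the same flavour; hyper-parameters are optimised automatically by marginal likelihood:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(1)
x = np.linspace(0, 60, 120)[:, None]                  # predictor variable
y = 5 * np.sin(x.ravel() / 3) + 0.1 * x.ravel() + rng.normal(0, 0.5, 120)

# A sum of valid kernels is itself a valid kernel
kernel = ExpSineSquared() + RBF() + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(x, y)
mean, sd = gp.predict(x, return_std=True)             # posterior mean and sd
```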
  60. 60. GFT v.2 — k-means and GP regression Clustering queries selected by elastic net into C clusters with k-means Clusters are determined by using cosine similarity as the distance metric (on query frequency time series): cos(v, u) = v·u / (‖v‖ ‖u‖) Groups queries with similar topicality & usage patterns Kernel over the clusters, x = {x_c1, …, x_c10}: k(x, x′) = ( Σᵢ₌₁ᶜ k_SE(x_cᵢ, x′_cᵢ) ) + σ²·δ(x, x′), where k_SE(x, x′) = σ² exp(−‖x − x′‖²/(2ℓ²)) and the δ term models noise
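The cluster-then-sum construction can be sketched as follows; k-means on L2-normalised time series approximates clustering by cosine distance (Euclidean distance on the unit sphere is monotone in cosine distance), and the per-cluster hyper-parameter arrays sigma2 and ell are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def cluster_queries(X, n_clusters=10):
    # X: weeks x queries; cluster the queries' frequency time series
    km = KMeans(n_clusters=n_clusters, n_init=10)
    labels = km.fit_predict(normalize(X.T))   # one row per query, unit norm
    return [np.flatnonzero(labels == c) for c in range(n_clusters)]

def k_cluster_sum(Xa, Xb, clusters, sigma2, ell):
    # k(x, x') = sum_c k_SE(x_c, x'_c): one SE kernel per query cluster
    K = np.zeros((Xa.shape[0], Xb.shape[0]))
    for c, idx in enumerate(clusters):
        d2 = ((Xa[:, None, idx] - Xb[None, :, idx]) ** 2).sum(-1)
        K += sigma2[c] * np.exp(-d2 / (2.0 * ell[c] ** 2))
    return K                                  # noise term added separately
```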
  61. 61. GFT v.2 — Performance [Plot: weekly US ILI rates, 2009-2013; CDC ILI rates vs. GP estimates.]
  62. 62. GFT v.2 — Performance Elastic net: r = .92, MAE = 2.60·10⁻³, MAPE = 11.9% GP: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
  63. 63. GFT v.2 — Queries’ added value Autoregression: combine CDC ILI rates from the previous week(s) with the ILI rate estimate from search queries for the current week; various week lags explored (1, 2, …, 6 weeks) The full ARMAX model, with the season fixed to 52 weeks: yₜ = Σᵢ₌₁ᵖ φᵢ yₜ₋ᵢ + Σᵢ₌₁ᴶ ωᵢ yₜ₋₅₂₋ᵢ (AR and seasonal AR) + Σᵢ₌₁۹ θᵢ εₜ₋ᵢ + Σᵢ₌₁ᴷ νᵢ εₜ₋₅₂₋ᵢ (MA and seasonal MA) + Σᵢ₌₁ᴰ wᵢ hₜ,ᵢ (regression) + εₜ; hyper-parameters are set by minimising the Akaike Information Criterion (AIC)
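A hedged sketch of the plain autoregressive combination (not the full ARMAX model): regress this week's CDC rate on the lagged CDC rate plus the contemporaneous search-based estimate; y_cdc and y_gp are hypothetical aligned weekly series:

```python
import numpy as np

def fit_ar_with_queries(y_cdc, y_gp, lag=1):
    # Design matrix: CDC rate lagged by `lag` weeks, the current
    # search-based estimate, and an intercept column
    X = np.column_stack([y_cdc[:-lag], y_gp[lag:], np.ones(len(y_cdc) - lag)])
    coef, *_ = np.linalg.lstsq(X, y_cdc[lag:], rcond=None)
    return coef        # weights for lagged CDC, GP estimate, intercept
```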
  64. 64. GFT v.2 — Performance
  65. 65. GFT v.2 — Performance 1-week lag for the CDC data: AR(CDC) r = .97, MAE = 1.87·10⁻³, MAPE = 8.2%; AR(CDC,GP) r = .99, MAE = 1.05·10⁻³, MAPE = 5.7% — 2-week lag for the CDC data: AR(CDC) r = .87, MAE = 3.36·10⁻³, MAPE = 14.3%; AR(CDC,GP) r = .99, MAE = 1.35·10⁻³, MAPE = 7.3% — GP alone: r = .95, MAE = 2.21·10⁻³, MAPE = 10.8%
  66. 66. GFT v.2 — Non-optimal feature selection Queries irrelevant to flu are still retained, e.g. “nba injury report” or “muscle building supplements” Feature selection is primarily based on correlation, then on a linear relationship Introduce a semantic feature selection
 — enhance causal connections (implicitly)
 — circumvent the painful training of a classifier
  67. 67. GFT v.3 — Word embeddings Word embeddings are vectors of a certain dimensionality (usually from 50 to 1024) that represent words in a corpus Derive these vectors by predicting contextual word occurrence in large corpora (word2vec) using a shallow neural network approach:
 — Continuous Bag-Of-Words (CBOW): Predict centre word from surrounding ones
 — skip-gram: Predict surrounding words from centre one Other methods available: GloVe, fastText
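Training such embeddings is a one-liner with gensim (4.x API); the tokenised tweets below are a toy stand-in for the Twitter corpora described on the next slide:

```python
from gensim.models import Word2Vec

tweets = [["flu", "jab", "today"], ["feeling", "rough", "fever"]]  # toy data
model = Word2Vec(tweets, vector_size=512, sg=1,  # sg=1: skip-gram; sg=0: CBOW
                 window=5, min_count=1)
vec = model.wv["flu"]                            # 512-dimensional embedding
```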
  68. 68. GFT v.3 — Word embedding data sets Use tweets geolocated in the UK to learn word embeddings that may capture
 — informal language used in searches
 — British English language / expressions
— cultural biases
(a) 215 million tweets (February 2014 to March 2016), CBOW, 512 dimensions, 137,421 words covered
  https://doi.org/10.6084/m9.figshare.4052331.v1
(b) 1.1 billion tweets (2012 to 2016), skip-gram, 512 dimensions, 470,194 words covered
 https://doi.org/10.6084/m9.figshare.5791650.v1
  69. 69. GFT v.3 — Cosine similarity For word embeddings v, u ∈ ℝⁿ: cos(v, u) = v·u / (‖v‖ ‖u‖) = Σᵢ₌₁ⁿ vᵢuᵢ / (√(Σᵢ₌₁ⁿ vᵢ²) √(Σᵢ₌₁ⁿ uᵢ²)), bounded in [−1, 1]
  70. 70. GFT v.3 — Cosine similarity Analogies via additive context: max over v of (cos(v, ‘king’) + cos(v, ‘woman’) − cos(v, ‘man’)) ⟹ v = ‘queen’
  71. 71. GFT v.3 — Cosine similarity Alternatively, a multiplicative form: max over v of (c̃os(v, ‘king’) × c̃os(v, ‘woman’) / c̃os(v, ‘man’)) ⟹ v = ‘queen’, where c̃os(·,·) = (cos(·,·) + 1)/2
  72. 72. GFT v.3 — Cosine similarity Here ‘king’ and ‘woman’ form the positive context and ‘man’ the negative context
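The additive analogy formulation reduces to a few lines of numpy; emb is a hypothetical dict mapping each word to its embedding vector:

```python
import numpy as np

def cos(v, u):
    return float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))

def analogy(emb, positive, negative):
    # argmax_v sum cos(v, p) - sum cos(v, n),
    # e.g. positive=['king', 'woman'], negative=['man'] -> 'queen'
    best, best_score = None, -np.inf
    for word, v in emb.items():
        if word in positive or word in negative:
            continue
        score = (sum(cos(v, emb[p]) for p in positive)
                 - sum(cos(v, emb[n]) for n in negative))
        if score > best_score:
            best, best_score = word, score
    return best
```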
  76. 76. GFT v.3 — Analogies in Twitter embd. “The … for …, not the …, is …”:
woman | king | man | queen
him | she | he | her
better | bad | good | worse
England | Rome | London | Italy
Messi | basketball | football | Lebron
Guardian | Conservatives | Labour | Telegraph
Trump | Europe | USA | Farage
rsv | fever | skin | flu
  77. 77. GFT v.3 — Better query selection (1/3) 1. Query embedding = Average token embedding 2. Derive a concept by specifying a positive (P) and a negative (N) context (sets of n-grams) 3. Rank all queries using their similarity score with this concept
  78. 78. GFT v.3 — Better query selection (1/3) Example concept contexts — positive: flu epidemic, flu, dispensary, hospital, sanatorium, fever, flu outbreak, epidemic, flu medicine, doctors hospital, flu treatment, influenza flu, flu pandemic, gp surgery, clinic, flu vaccine, flu shot, infirmary, hospice, tuberculosis, physician, flu vaccination; negative: tummy ache, nausea, feeling nausea, nausea and vomiting, bloated tummy, dull stomach ache, heartburn, feeling bloated, aches, belly ache, stomach ache, feeling sleepy, spasms, stomach aches, stomach ache after eating, ache, feeling nauseous, headache and nausea
  79. 79. GFT v.3 — Better query selection (1/3) Similarity function (Eq. 7): S(Q, C) = Σᵢ₌₁ᵏ cos(e_Q, e_Pᵢ) / (Σⱼ₌₁ᶻ cos(e_Q, e_Nⱼ) + γ), where e_Q is the query embedding, e_Nⱼ the embedding of a negative concept n-gram, and γ a constant to avoid division by 0
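The scoring step, sketched directly from Eq. 7; the shifted cosine from the analogy slide is used so that all terms are non-negative, emb is a hypothetical token-to-vector dict, and gamma guards against division by zero:

```python
import numpy as np

def shifted_cos(v, u):
    # map cosine similarity from [-1, 1] into [0, 1]
    c = v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
    return (c + 1.0) / 2.0

def query_embedding(tokens, emb):
    # query embedding = average embedding of its tokens
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def concept_score(e_q, positive_vecs, negative_vecs, gamma=1e-3):
    num = sum(shifted_cos(e_q, p) for p in positive_vecs)
    den = sum(shifted_cos(e_q, n) for n in negative_vecs) + gamma
    return num / den
```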
  80. 80. GFT v.3 — Better query selection (2/3)
Positive context: #flu, fever, flu, flu medicine, gp, hospital | Negative context: bieber, ebola, wikipedia | Most similar queries: cold flu medicine, flu aches, cold and flu, cold flu symptoms, colds and flu, flu
Positive context: flu, gp, flu hospital, flu medicine | Negative context: ebola, wikipedia | Most similar queries: flu aches, flu, colds and flu, cold and flu, cold flu medicine
  81. 81. GFT v.3 — Better query selection (3/3) [Figure 2: histogram of the distribution of search-query similarity scores (S; see Eq. 7) with the flu infection concept C1.] Given that the distribution of concept similarity scores appears to be unimodal, we use standard deviations from the mean (μS + θσS) as thresholds to determine the selected queries; the threshold θ is decided empirically, perhaps with the assistance of an expert
  82. 82. GFT v.3 — Hybrid feature selection Embedding-based feature selection is an unsupervised technique, thus not optimal on its own If we combine it with the previous ways of selecting features, will we obtain better inference accuracy? We test 7 feature selection approaches: (1) similarity → elastic net (2) correlation → elastic net (3) correlation → elastic net → GP (4) similarity → correlation → elastic net (5) similarity → correlation → elastic net → GP (6) similarity → correlation → GP (7) correlation → GP
  83. 83. GFT v.3 — GP model details Matérn covariance (to handle abrupt changes in the predictors, given that the experiments are based on a sample of the original Google search data): k_M^(ν)(x, x′) = σ² (2^(1−ν)/Γ(ν)) (√(2ν) r/ℓ)^ν K_ν(√(2ν) r/ℓ), where K_ν is a modified Bessel function, ν a positive constant, ℓ the lengthscale parameter, σ² a scaling factor (variance), and r = ‖x − x′‖ A Squared Exponential (SE) covariance captures smoother trends: k_SE(x, x′) = σ² exp(−r²/(2ℓ²)) The kernels are combined through summation (a sum of GP kernels is a new valid GP kernel; an additive kernel models a sum of independent functions, each accounting for a different type of structure): k(x, x′) = Σᵢ₌₁² k_M^(ν=3/2)(x, x′; σᵢ, ℓᵢ) + k_SE(x, x′; σ₃, ℓ₃) + σ₄² δ(x, x′), where δ is a Kronecker delta; the two Matérn (ν = 3/2) terms model long- and medium- (or short-) term irregularities, plus white noise The 7 parameters are optimised using the Laplace approximation Skipped in the interest of time! If you’re interested, check Section 3.1 of https://doi.org/10.1145/3038912.3052622
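The same composite covariance can be assembled from scikit-learn kernel objects; note that scikit-learn optimises hyper-parameters by maximising the marginal likelihood rather than the Laplace approximation mentioned above, and the initial length-scales here are illustrative:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, RBF, WhiteKernel

kernel = (Matern(length_scale=10.0, nu=1.5)   # long-term irregularities
          + Matern(length_scale=1.0, nu=1.5)  # medium/short-term structure
          + RBF(length_scale=5.0)             # smooth trends (SE kernel)
          + WhiteKernel())                    # observation noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gp.fit(X_train, y_train) with weekly query-cluster features as inputs
```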
  84. 84. GFT v.3 — Data & evaluation weekly frequency of 35,572 search queries (UK) from 1/1/2007 to 9/08/2015 (449 weeks) access to a private Google Health Trends API for health- oriented research corresponding ILI rates for England (Royal College of General Practitioners and Public Health England) test on the last 3 flu seasons in the data (2012-2015) train on increasing data sets starting from 2007, using all data prior to a test period
  85. 85. GFT v.3 — Performance (1/3) (a) similarity → elastic net: r = .87, MAE×0.1 = 0.30, MAPE = 61.05% | (b) correlation → elastic net: r = .88, MAE×0.1 = 0.21, MAPE = 47.15% | (c) similarity → correlation → elastic net: r = .91, MAE×0.1 = 0.19, MAPE = 36.23%
  87. 87. GFT v.3 — Performance (2/3) Elastic net with and without word-embedding filtering [Figure 3: comparative plot of the optimal models for correlation-based and hybrid feature selection under elastic net, estimating ILI rates in England, 2013-2015; RCGP/PHE ILI rates denote the ground truth.]
  88. 88. GFT v.3 — Performance (2/3) Potentially misleading queries filtered out by hybrid feature selection, after previously receiving a nonzero regression weight (shown with their ratio over the highest weight): prof. surname (70.3%), name surname (27.2%), heal the world (21.9%), heating oil (21.2%), name surname recipes (21%), tlc diet (13.3%), blood game (12.3%), swine flu vaccine side effects (7.2%)
  89. 89. GFT v.3 — Performance (3/3) (a) correlation → GP: r = .89, MAE×0.1 = 0.22, MAPE = 34.17% | (b) correlation → elastic net → GP: r = .92, MAE×0.1 = 0.23, MAPE = 35.88% | (c) similarity → correlation → elastic net → GP: r = .93, MAE×0.1 = 0.17, MAPE = 30.30% | (d) similarity → correlation → GP: r = .94, MAE×0.1 = 0.16, MAPE = 25.81%
  91. 91. Multi-task learning m tasks (problems) t1,…,tm observations Xt1,yt1,…,Xtm,ytm learn models fti: Xti → yti jointly (and not independently) Why? When tasks are related, multi-task learning is expected to perform better than learning each task independently Model learning possible even with a few training samples
  92. 92. Multi-task learning for disease modelling m tasks (problems) t1,…,tm observations Xt1,yt1,…,Xtm,ytm learn models fti: Xti → yti jointly (and not independently) Can we improve disease models (flu) from online search: when sporadic training data are available? across the geographical regions of a country? across two different countries?
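One simple, hedged way to realise joint learning on the elastic net side is scikit-learn's MultiTaskElasticNet, which couples the tasks through a shared sparsity pattern (an L21-type penalty), so a query is kept or dropped for all regions at once; this is a stand-in illustration, not necessarily the paper's exact formulation:

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 300))            # weeks x queries, shared input
W = np.zeros((300, 10))
W[:5] = rng.normal(size=(5, 10))           # 5 truly relevant queries
Y = X @ W + 0.1 * rng.normal(size=(150, 10))   # one column per region/task

mt = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
shared = np.flatnonzero(np.any(mt.coef_ != 0, axis=0))  # joint support
```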
  93. 93. Multi-task learning GFT (1/5) Can multi-task learning across the 10 US regions help us improve the national ILI model? [Figure 1: the 10 US regions as specified by the Department of Health & Human Services (HHS).]
  94. 94. Multi-task learning GFT (1/5) Can multi-task learning across the 10 US regions help us improve the national ILI model? With 5 years of training data: Elastic Net r = .96, MAE = 0.35 | MT Elastic Net r = .96, MAE = 0.35 | GP r = .97, MAE = 0.25 | MT GP r = .97, MAE = 0.25
  95. 95. Multi-task learning GFT (1/5) With 1 year of training data: Elastic Net r = .85, MAE = 0.53 | MT Elastic Net r = .87, MAE = 0.46 | GP r = .85, MAE = 0.51 | MT GP r = .88, MAE = 0.44
  96. 96. Multi-task learning GFT (2/5) Can multi-task learning across the 10 US regions help us improve the regional ILI models? With 1 year of training data: Elastic Net r = .85, MAE = 0.53 | MT Elastic Net r = .86, MAE = 0.49 | GP r = .84, MAE = 0.54 | MT GP r = .87, MAE = 0.47
  97. 97. Multi-task learning GFT (3/5) Can multi-task learning across the 10 US regions help us improve regional models under sporadic health reporting? Split the US regions into two groups: one with the 2 most populous regions (4 and 9 on the map), the other with the remaining 8 regions Train and evaluate models for the 8 regions under the hypothesis that only sporadic health reports exist Downsample the data from the 8 regions using burst error sampling (random data blocks removed) with rate γ (γ = 1: no sampling; γ = 0.1: 10% sample)
  98. 98. Multi-task learning GFT (3/5) [Figure 5: MAE of EN (dotted), GP (dashed), MTEN (dash-dot) and MTGP (solid) when estimating ILI rates for the US HHS regions (except Regions 4 and 9), for burst error sampling rates γ from 1.0 down to 0.1.]
  99. 99. Multi-task learning GFT (4/5) Correlations between the US regions (R1-R10) and the national (US) task, induced by the covariance matrix of the MT GP model [correlation heatmap, values roughly 0.64 to 0.96] The multi-task learning model seems to be capturing existing geographical relations
  100. 100. Multi-task learning GFT (5/5) Can multi-task learning across countries (US, England) help us improve the ILI model for England? With 5 years of training data: Elastic Net r = .89, MAE = 0.70 | MT Elastic Net r = .90, MAE = 0.49 | GP r = .89, MAE = 0.60 | MT GP r = .90, MAE = 0.47
  101. 101. Multi-task learning GFT (5/5) With 1 year of training data: Elastic Net r = .84, MAE = 1.00 | MT Elastic Net r = .86, MAE = 0.60 | GP r = .85, MAE = 0.98 | MT GP r = .86, MAE = 0.59
  102. 102. Conclusions Online (user-generated) data can help us improve our current understanding of public health matters The original Google Flu Trends was based on a good idea but on very limited modelling effort, resulting in major errors Subsequent models improved the statistical modelling as well as the semantic disambiguation between possible features, and delivered better / more robust performance Multi-task learning improves disease models further Future direction: models without strong supervision
  103. 103. Acknowledgements Industrial partners — Microsoft Research (Elad Yom-Tov) — Google Public health organisations — Public Health England — Royal College of General Practitioners Funding: EPSRC (“i-sense”) Collaborators: Andrew Miller, Bin Zou, Ingemar J. Cox
  104. 104. Thank you. Vasileios Lampos (a.k.a. Bill) Computer Science University College London @lampos
  105. 105. References Ginsberg et al. Detecting influenza epidemics using search engine query data. Nature 457, pp. 1012—1014 (2009). Lampos, Miller, Crossan and Stefansen. Advances in nowcasting influenza-like illness rates using search query logs. Scientific Reports 5, 12760 (2015). Lampos, Zou and Cox. Enhancing feature selection using word embeddings: The case of flu surveillance. WWW ’17, pp. 695–704 (2017). Zou, Lampos and Cox. Multi-task learning improves disease models from Web search. WWW ’18, In Press (2018). GFT v.1 GFT v.2 GFT v.3 MTL
