User-generated content: collective and personalised inference tasks

Over the past few years, large-scale, user-generated online resources, such as social media or search engine queries, have been driving various interesting research efforts. This talk showcases a series of ideas developed on the hypothesis that this online stream of content represents a fraction, biased at the very least, of real-world situations, opinions or behaviour. Firstly, core algorithms have been proposed for mapping Twitter or Google search data to estimates of the current rate of an infectious disease, such as influenza. Then, through the addition of a user dimension in linear regularised text regression models, a family of bilinear models has been introduced, capable of improving approximations of voting intention trends. Finally, by repurposing the previous modelling principle, social media content has been used to learn as well as qualitatively interpret user attributes, such as impact, occupation, income and socioeconomic status.

  1. 1. User-generated content: collective and personalised inference tasks Vasileios Lampos Department of Computer Science University College London (March, 2016; @ DIKU)
  2. 2. Structure of the talk 1. Introductory remarks 2. Collective inference tasks from user-generated content
 — Nowcasting flu rates from Twitter / Google
 — Modelling voting intention (bilinear text regression) 3. Personalised inference tasks using social media 
 — Occupation, income, socioeconomic status & impact 4. Concluding remarks
  3. 3. Context and motivation + the Internet, the World Wide Web and connectivity + numerous successful web products feeding from user activity + lots of user-generated content & activity logs, e.g. social media and search engine query logs + large volumes of digitised data (‘Big Data’), birth of Data Science (nothing new in principle) How can we use online data to improve our society, interpret human behaviour, and enhance our understanding about our world?
  5. 5. User-generated content: Ongoing applications + Health > disease surveillance, intervention impact + Finance & Commerce > financial indices > consumer satisfaction, market share + Politics > estimation of voting intentions > public opinion barometers + Social and behavioural sciences > complement questionnaire based studies > approach answers to unresolved questions
  6. 6. Added value of user-generated content for health + Online content can potentially access a larger and more representative part of the population
 Note: Traditional health surveillance schemes are based on the subset of people that actively seek medical attention + More timely information (almost instant) about a disease outbreak in a population + Geographical regions with less established health monitoring systems can greatly benefit + Small cost when data access and expertise are in place
  7. 7. Collective inference tasks 
 from user-generated content Lampos & Cristianini, 2012; Lampos, Preotiuc-Pietro & Cohn, 2013; Lampos, Miller, Crossan & Stefansen, 2015
  8. 8. Flu rates from Twitter: The task. Learn a mapping f : X → y from n-gram frequency time series X ∈ ℝ^(M×N) to flu surveillance (disease) rates y ∈ ℝ^M provided by a health agency. [Figure: ILI rate per 100 people (PHE), 2012-2014] (Lampos & Cristianini, 2012)
  9. 9. Flu rates from Twitter: Lasso for feature selection. Regression basics: observations x_i ∈ ℝ^m (the rows of X), responses y_i ∈ ℝ, i ∈ {1, ..., n}, weights w_j ∈ ℝ, j ∈ {1, ..., m}, bias β, and w* = [w; β]. ℓ1-norm regularisation, also known as the lasso (Tibshirani, 1996): argmin_{w,β} { Σ_{i=1}^{n} ( y_i − β − Σ_{j=1}^{m} x_{ij} w_j )² + λ Σ_{j=1}^{m} |w_j| }, or equivalently argmin_{w*} { ‖X* w* − y‖²_2 + λ‖w‖_1 }.
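To make the lasso step concrete, here is a minimal sketch in Python using scikit-learn; the n-gram matrix, target rates and regularisation strength below are synthetic placeholders, not the data or settings from the talk.

```python
# Minimal lasso (L1-regularised regression) sketch on a synthetic n-gram matrix;
# illustrative only, not the actual data or hyper-parameters from the talk.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_weeks, n_ngrams = 120, 500
X = rng.rand(n_weeks, n_ngrams)               # weekly n-gram frequencies
true_w = np.zeros(n_ngrams)
true_w[:10] = rng.rand(10)                    # only a few n-grams are truly relevant
y = X @ true_w + 0.01 * rng.randn(n_weeks)    # synthetic weekly ILI rates

model = Lasso(alpha=0.01)                     # alpha plays the role of lambda
model.fit(X, y)
selected = np.flatnonzero(model.coef_)        # n-grams with nonzero weights
print(f"{selected.size} n-grams selected out of {n_ngrams}")
```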
  10. 10. Flu rates from Twitter: Bootstrap lasso Lasso may not always select the true model
 due to collinearities in the feature space Bootstrapping lasso (‘bolasso’) for feature selection + For a number (N) of bootstraps, i.e. iterations > Sample the feature space with replacement (Xi) > Learn a new model (wi) by applying lasso on Xi and y > Remember the n-grams with nonzero weights + Select the n-grams with nonzero weights in p% of the N bootstraps + p can be optimised; if p<100%, then ‘soft bolasso’ (Zhao & Yu, 2006) (Bach, 2008)
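A toy sketch of the bootstrap procedure above (assumptions: standard bolasso resampling of the observations as in Bach (2008), a placeholder regularisation strength, and a soft threshold p; this is not the authors' implementation).

```python
# Toy (soft) bolasso sketch: run lasso on bootstrap resamples of the data and
# keep the features whose weights are nonzero in at least p% of the runs.
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_select(X, y, n_bootstraps=100, alpha=0.01, p=0.8, seed=0):
    rng = np.random.RandomState(seed)
    n, m = X.shape
    nonzero_counts = np.zeros(m)
    for _ in range(n_bootstraps):
        idx = rng.randint(0, n, size=n)              # resample rows with replacement
        fit = Lasso(alpha=alpha).fit(X[idx], y[idx])
        nonzero_counts += (fit.coef_ != 0)
    # 'soft' bolasso: keep features selected in at least p of the bootstraps
    return np.flatnonzero(nonzero_counts >= p * n_bootstraps)

# usage on synthetic data:
# selected = bolasso_select(np.random.rand(100, 500), np.random.rand(100))
```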
  11. 11. Flu rates from Twitter: Performance [Figure: root mean squared error of ILI inference for 1-gram, 2-gram and hybrid feature sets, comparing soft-bolasso with a baseline that uses correlation-based feature selection] (Lampos & Cristianini, 2012)
  12. 12. Flu rates from Twitter: Selected features Word cloud with selected n-grams. Font size is proportional to the regression’s weight; n-grams that are upside-down have a negative weight.
  13. 13. Rainfall rates from Twitter: Generalisation [Figure: inference for the rainfall case study (Round 5 of 6-fold cross validation)]
  14. 14. Rainfall rates from Twitter: Selected features Word cloud with selected n-grams. Font size is proportional to the regression’s weight; n-grams that are upside-down have a negative weight.
  15. 15. Bilinear regression: The general idea. Users p ∈ ℤ+; observations Q_i ∈ ℝ^(p×m), i ∈ {1, ..., n}; responses y_i ∈ ℝ; user weights u ∈ ℝ^p, word weights w ∈ ℝ^m, bias β ∈ ℝ. The prediction function is f(Q_i) = uᵀ Q_i w + β.
  16. 16. Bilinear regularised regression. With the same setting as before, learn argmin_{u,w,β} { Σ_{i=1}^{n} ( uᵀ Q_i w + β − y_i )² + ψ(u, θ_u) + ψ(w, θ_w) }, where ψ(·) is a regularisation function with a set of hyper-parameters θ. If ψ(v, λ) = λ‖v‖_1: Bilinear Lasso. If ψ(v, λ1, λ2) = λ1‖v‖²_2 + λ2‖v‖_1: Bilinear Elastic Net (BEN). (Lampos, Preotiuc-Pietro & Cohn, 2013)
  17. 17. Bilinear elastic net (BEN): training a model. BEN's objective function: argmin_{u,w,β} { Σ_{i=1}^{n} ( uᵀ Q_i w + β − y_i )² + λ_u1 ‖u‖²_2 + λ_u2 ‖u‖_1 + λ_w1 ‖w‖²_2 + λ_w2 ‖w‖_1 }. The problem is biconvex: fix u and learn w, and vice versa; iterating through these convex optimisation tasks converges (Al-Khayyal & Falk, 1983; Horst & Tuy, 1996). Each convex sub-problem is solved with FISTA (Beck & Teboulle, 2009) as implemented in SPAMS (Mairal et al., 2010), a large-scale optimisation solver with quick convergence. [Figure: global objective function value during training (red) and the corresponding prediction error, RMSE, on held-out data (blue) across the model's iterations]
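A schematic sketch of the biconvex alternating scheme: with u fixed, the problem in w is an ordinary elastic net on the transformed inputs uᵀQ_i, and vice versa, so two elastic-net fits can be alternated. This is an illustration on synthetic data with placeholder hyper-parameters, not the SPAMS/FISTA implementation used in the paper.

```python
# Alternating (biconvex) optimisation sketch for a bilinear elastic net:
# fix u and solve an elastic net in w, then fix w and solve in u; iterate.
# Synthetic data and hyper-parameters; illustrative only.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
n, p, m = 80, 30, 200                        # time points, users, words
Q = rng.rand(n, p, m)                        # user-word frequencies per time point
true_u, true_w = rng.rand(p), np.zeros(m)
true_w[:20] = rng.rand(20)
y = np.einsum("p,npm,m->n", true_u, Q, true_w) + 0.01 * rng.randn(n)

u, w, b = np.ones(p) / p, np.zeros(m), 0.0   # initialise user weights uniformly
for _ in range(10):                          # a few alternating steps
    Xw = np.einsum("p,npm->nm", u, Q)        # with u fixed, features are u^T Q_i
    reg_w = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(Xw, y)
    w, b = reg_w.coef_, reg_w.intercept_
    Xu = np.einsum("npm,m->np", Q, w)        # with w fixed, features are Q_i w
    reg_u = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(Xu, y)
    u, b = reg_u.coef_, reg_u.intercept_

predictions = np.einsum("p,npm,m->n", u, Q, w) + b
```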
  18. 18. Bilinear multi-task learning. Tasks τ ∈ ℤ+; users p ∈ ℤ+; observations Q_i ∈ ℝ^(p×m), i ∈ {1, ..., n}; responses y_i ∈ ℝ^τ; weights U ∈ ℝ^(p×τ), W ∈ ℝ^(m×τ) and bias β ∈ ℝ^τ. The prediction function is f(Q_i) = tr(Uᵀ Q_i W) + β.
  19. 19. Bilinear Group ℓ2,1 (BGL). argmin_{U,W,β} { Σ_{t=1}^{τ} Σ_{i=1}^{n} ( u_tᵀ Q_i w_t + β_t − y_ti )² + λ_u Σ_{k=1}^{p} ‖U_k‖_2 + λ_w Σ_{j=1}^{m} ‖W_j‖_2 } (Argyriou et al., 2008) + a feature (user or word) is usually selected (activated) for all tasks, not just one, but possibly with different weights + especially useful in the domain of political preference inference + BGL can be broken into two convex tasks: first learn {W, β}, then {U, β}, and vice versa; iterate through this process
  20. 20. Inferring voting intention from Twitter: Data United Kingdom + 3 parties (Conservatives, Labour, Lib Dem) + 42,000 Twitter users distributed proportionally to UK’s regional population figures + 60 million tweets & 80,976 1-grams extracted + 240 polls from 30 Apr. 2010 to 13 Feb. 2012 Austria + 4 parties (SPÖ, ÖVP, FPÖ, GRÜ) + 1,100 politically active Twitter users selected by political scientists + 800,000 tweets & 22,917 1-grams extracted + 98 polls from 25 Jan. to 25 Dec. 2012
  21. 21. Inferring voting intention from Twitter: Performance [Figure: root mean squared error for the UK and Austria under five estimators: mean poll, last poll, Elastic Net on words, BEN and BGL] (Lampos, Preotiuc-Pietro & Cohn, 2013)
  22. 22. Inferring voting intention from Twitter: UK [Figure: voting intention (%) over time for CON, LAB and LIB, as measured by YouGov polls and as inferred by BEN and BGL]
  23. 23. Inferring voting intention from Twitter: Austria [Figure: voting intention (%) over time for SPÖ, ÖVP, FPÖ and GRÜ, as measured by the polls and as inferred by BEN and BGL]
  24. 24. Inferring voting intention from Twitter: A qualitative outcome (party | tweet | score | user type)
  SPÖ (centre) | "Inflation rate in Austria slightly down in July from 2.2 to 2.1%. Accommodation, Water, Energy more expensive." | 0.745 | Journalist
  ÖVP (centre right) | "Can really recommend the book “Res Publica” by Johannes #Voggenhuber! Food for thought and so on #Europe #Democracy" | -2.323 | User
  FPÖ (far right) | "Campaign of the Viennese SPÖ on “Living together” plays right into the hands of right-wing populists" | -3.44 | Human rights
  GRÜ (centre left) | "Protest songs against the closing-down of the bachelor course of International Development: <link> #ID_remains #UniBurns #UniRage" | 1.45 | Student Union
  25. 25. Nonlinearities in the data (1) frequency of search query ‘dry cough’ (Google)
  26. 26. Nonlinearities in the data (2) frequency of search query ‘dry cough’ (Google) linear nonlinear
  27. 27. Gaussian Processes (GPs). Formally: sets of random variables, any finite number of which have a multivariate Gaussian distribution. Based on d-dimensional input data we want to learn a function f : ℝ^d → ℝ drawn from a GP prior over the inputs x ∈ ℝ^d: f(x) ~ GP(m(x), k(x, x′)), where m(·) is the mean function drawn on inputs (here 0) and k(·, ·) is the covariance function (or kernel) drawn on pairs of inputs. The squared exponential (SE) kernel (a.k.a. RBF or Gaussian) is usually used to encourage smooth functions. (Rasmussen & Williams, 2006)
  28. 28. Common covariance functions (kernels). Squared exponential (SE): k(x, x′) = σ_f² exp( −(x − x′)² / (2ℓ²) ), expressing local variation. Periodic (Per): k(x, x′) = σ_f² exp( −(2/ℓ²) sin²( π(x − x′)/p ) ), expressing repeating structure. Linear (Lin): k(x, x′) = σ_f² (x − c)(x′ − c), expressing linear functions. [Figure: plot of each kernel and functions f(x) sampled from the corresponding GP priors] (Duvenaud, 2014)
  29. 29. Combining kernels in a GP. It is possible to add or multiply kernels (among other operations). Examples of one-dimensional structures expressible by multiplying kernels: Lin × Lin gives quadratic functions, SE × Per locally periodic structure, Lin × SE increasing variation, and Lin × Per growing amplitude. (Duvenaud, 2014)
  30. 30. GPs for regression: A toy example (1). Take some (x, y) pairs with an obvious nonlinear underlying structure. [Figure: scatter plot of the (x, y) pairs]
  31. 31. GPs for regression: A toy example (2). Addition of two GP kernels plus noise: periodic + squared exponential + noise. [Figure: OLS fit vs GP fit over the (x, y) pairs; training region marked with a dashed line, testing region with a solid line]
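A small sketch of the toy example in Python, assuming scikit-learn; the data-generating function, kernel settings and noise level are made up for illustration.

```python
# GP regression with a sum of kernels (periodic + squared exponential + noise),
# mirroring the toy example; synthetic data and illustrative hyper-parameters.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.RandomState(0)
x = np.linspace(0, 40, 80)[:, None]
y = 0.2 * x.ravel() + 2 * np.sin(x.ravel()) + 0.3 * rng.randn(80)   # nonlinear structure

kernel = ExpSineSquared(length_scale=1.0, periodicity=6.0) \
         + RBF(length_scale=10.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(x[:60], y[:60])                        # train on the first part of the series

x_test = np.linspace(0, 60, 200)[:, None]     # predict beyond the training range
y_mean, y_std = gp.predict(x_test, return_std=True)
```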
  32. 32. More information about GPs + Book — “Gaussian Processes for Machine Learning”
 http://www.gaussianprocess.org/gpml/ + Tutorial — “Gaussian Processes for Natural Language Processing”
 http://people.eng.unimelb.edu.au/tcohn/tutorial.html + Video-lecture — “Gaussian Process Basics”
 http://videolectures.net/gpip06_mackay_gpb/ + Software I — GPML for Octave or MATLAB
 http://www.gaussianprocess.org/gpml/code + Software II — GPy for Python
 http://sheffieldml.github.io/GPy/
  33. 33. Google Flu Trends: The idea Can we turn search query information (statistics) to estimates about the rate of influenza-like illness in the real-world population?
  34. 34. Google Flu Trends: Failure. The original model used the log-odds of an ILI physician visit and the log-odds of an ILI-related search query: logit(P) = β0 + β1 × logit(Q) + ε, where P is the percentage of ILI physician visits, Q is the ILI-related query fraction, β0 is the intercept and ε the error term (Ginsberg et al., 2009). The estimates of the online Google Flu Trends tool were approx. two times larger than the ones from the CDC in 2012/13, and Google estimated high in 100 out of 108 weeks (Lazer et al., 2014). [Figure: Google Flu, lagged CDC, Google Flu + CDC and CDC estimates of %ILI, 2009-2013]
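The original model above is a univariate linear regression in logit space; a minimal sketch with synthetic values for the query fraction Q and the ILI visit percentage P (not Google's implementation or data).

```python
# Sketch of the original GFT-style model: logit(P) = b0 + b1 * logit(Q) + eps.
# Synthetic Q and P; illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression

def logit(x):
    return np.log(x / (1.0 - x))

def inv_logit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.RandomState(0)
Q = rng.uniform(0.001, 0.05, size=200)                       # ILI-related query fraction
P = inv_logit(-4.0 + 0.8 * logit(Q) + 0.1 * rng.randn(200))  # ILI physician visit fraction

reg = LinearRegression().fit(logit(Q)[:, None], logit(P))
P_hat = inv_logit(reg.predict(logit(Q)[:, None]))            # back to a fraction
```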
  35. 35. Google Flu Trends: Hypotheses for failure + ‘Big Data’ are not always good enough; may not always capture the target signal properly + The estimates were based on a rather simplistic model + The model was OK, but some spurious search queries invalidated the ILI inferences, e.g. ‘flu symptoms’ + Media hype about the topic of ‘flu’ significantly increased the search query volume from people that were just seeking information (non patients) + Side note: CDC’s estimates are not necessarily the ground truth; they can also go wrong sometimes, although we generally assume that they are a good representation of the real signal
  36. 36. Google Flu Trends revised: Data (1) Google search query logs > geo-located in US regions > from 4 Jan. 2004 to 28 Dec. 2013 (521 weeks, ~decade) > filtered by a very relaxed health-topic classifier > intersection among frequently occurring search queries in all US regions > weekly frequencies of 49,708 queries (# of features) > all data have been anonymised and aggregated plus corresponding ILI rates from the CDC
  37. 37. Google Flu Trends revised: Data (2) Corresponding ILI rates from the CDC. [Figure: CDC ILI rates for the US covering 2004 to 2013, i.e. the time span of the data used in the experimental process, with a different colouring per flu season]
  38. 38. Google Flu Trends revised: Methods (1). Pipeline: Google search query frequencies (Q) are first filtered by their correlation (r > a) with historical CDC ILI data, giving Q′ ⊆ Q; the Elastic Net then selects a smaller subset Q″ ⊆ Q′; the selected queries are grouped into clusters k1, ..., kN using k-means; and a GP(μ, k) over the clusters produces the ILI inference. (Lampos, Miller, Crossan & Stefansen, 2015)
  39. 39. Google Flu Trends revised: Methods (2) 1. Keep search queries with r ≥ 0.5 (reduces the amount of irrelevant queries) 2. Apply the previous model (GFT) to get a baseline performance estimate 3. Apply elastic net to select a subset of search queries and compute another baseline 4. Group the selected queries into N = 10 clusters using k-means to account for their different semantics 5. Use a different GP covariance function on top of each query cluster to explore non-linearities
  40. 40. Google Flu Trends revised: Methods (3). The selected search queries are partitioned into clusters c = {c1, ..., cC} using the k-means++ algorithm, with a distance metric based on the cosine similarity between query time series (to account for the different magnitudes of the query frequencies). A composite GP covariance function is then applied on the clusters: k(x, x′) = Σ_{i=1}^{C} k_SE(c_i, c_i′) + σ_n² · δ(x, x′), where C denotes the number of clusters, each k_SE has a different set of hyper-parameters and the last term models noise (δ being a Kronecker delta function). + protects the inferred model from radical changes in the frequency of single queries that are not representative of an entire cluster + models the contribution of various thematic concepts (captured by different clusters) to the final prediction + learning a sum of lower-dimensional functions: significantly smaller input space, much easier learning task, fewer samples required, more statistical traction obtained - imposes the assumption that the relationship between queries in separate clusters provides no information about ILI (a reasonable trade-off)
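One way to sketch this composite covariance in Python is with GPy (listed among the GP software earlier): one SE (RBF) kernel per cluster acting only on that cluster's columns via active_dims, plus an additive white-noise term. The data, the number of clusters and the plain k-means step below are placeholders (the paper uses k-means++ with a cosine-similarity-based distance).

```python
# Composite GP covariance sketch: a sum of SE (RBF) kernels, each restricted to
# one cluster of query columns, plus additive noise. Synthetic data; illustrative only.
import numpy as np
import GPy
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
n_weeks, n_queries, n_clusters = 150, 60, 10
X = rng.rand(n_weeks, n_queries)                       # weekly query frequencies
y = X[:, :5].sum(axis=1, keepdims=True) + 0.05 * rng.randn(n_weeks, 1)  # synthetic ILI

# stand-in clustering of query time series (the paper uses cosine-based k-means++)
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X.T)

kernel = GPy.kern.White(input_dim=n_queries)           # additive noise term
for c in range(n_clusters):
    dims = np.flatnonzero(labels == c)
    if dims.size == 0:
        continue
    kernel = kernel + GPy.kern.RBF(input_dim=dims.size, active_dims=dims)

model = GPy.models.GPRegression(X, y, kernel)
model.optimize(messages=False)
```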
  41. 41. Google Flu Trends revised: Results (1) [Figure: ILI nowcasts compared against CDC ILI rates across the test periods]
  42. 42. Google Flu Trends revised: Results (2). Mean absolute percentage of error (MAPE) in flu rate estimates during a 5-year period (2008-2013), on all test data and on peaking moments, for the old Google Flu Trends model, the Elastic Net and the Gaussian Process model; the GP model attains the lowest error (10.8% on test data), clearly improving over the Elastic Net and the old GFT model, with the gap largest at peaking moments.
  43. 43. Google Flu Trends revised: Results (3). Impact of automatically selected queries on a flu estimate during the over-predictions of the previous GFT model: ‘rsv’ (25%), ‘flu symptoms’ (18%), ‘benzonatate’ (6%), ‘symptoms of pneumonia’ (6%), ‘upper respiratory infection’ (4%).
  44. 44. Google Flu Trends revised: Methods (4). Seasonal ARMAX: an auto-regressive moving average model with exogenous inputs, combining an auto-regressive (AR) component of order p, a moving average (MA) component of order q and a regression element over a D-dimensional exogenous input. Given the sequential observations y_1, ..., y_T, the basic model specifies the relationship y_t = Σ_{i=1}^{p} φ_i y_{t−i} + Σ_{i=1}^{q} θ_i ε_{t−i} + Σ_{i=1}^{D} w_i h_{t,i} + ε_t, where the φ_i, θ_i and w_i are coefficients to be learned and ε_t is mean-zero noise with unknown variance. The model is extended with a seasonal component that incorporates yearly (52-week) lags: y_t = Σ_{i=1}^{p} φ_i y_{t−i} + Σ_{i=1}^{J} ω_i y_{t−52−i} + Σ_{i=1}^{q} θ_i ε_{t−i} + Σ_{i=1}^{K} ν_i ε_{t−52−i} + Σ_{i=1}^{D} w_i h_{t,i} + ε_t, with the orders p and q as well as the seasonal orders determined automatically. Instead of using all available query fractions as the exogenous input, the single prediction (D = 1) of a query model, ŷ_t, is used: this distills all the information that search data have to offer about the ILI rate and incorporates this meta-information into the ARMAX procedure. Predictive intervals are estimated through the maximum likelihood variance of the model.
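A hedged sketch of fitting such a model with statsmodels' SARIMAX (a seasonal ARIMA with exogenous regressors); the series, the fixed orders and the 52-week seasonal period below are placeholders, and the paper's automatic order selection is not reproduced.

```python
# Seasonal AR(MA)X sketch: weekly ILI rates with 52-week seasonal terms and the
# query-based nowcast as a single exogenous input. Synthetic data; illustrative only.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.RandomState(0)
T = 260                                            # five years of weekly data
season = np.sin(2 * np.pi * np.arange(T) / 52.0)
ili = 2.0 + season + 0.1 * rng.randn(T)            # synthetic ILI rates
query_nowcast = ili + 0.2 * rng.randn(T)           # stand-in for the GP estimate (D = 1)

model = SARIMAX(ili, exog=query_nowcast,
                order=(2, 0, 1),                   # (p, d, q)
                seasonal_order=(1, 0, 1, 52))      # yearly (52-week) seasonal AR/MA terms
fit = model.fit(disp=False)
in_sample_nowcasts = fit.fittedvalues
```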
  45. 45. Google Flu Trends revised: Results (4) [Figure 2 of the paper: comparison of nowcasts between an autoregressive baseline model, based only on historical ILI rates, and the query-assisted models]
  46. 46. Google Flu Trends revised: Results (5). MAPE (%) in flu rate estimates by autoregressive (AR) model variants during a 4-year period (2009-2013), on all test data and on peaking moments, for the old Google Flu Trends model (AR), Elastic Net (AR), Gaussian Process (AR) and a CDC-only (AR) baseline; the query-assisted AR models reduce the error relative to the CDC-only baseline, particularly around peaking moments.
  47. 47. Personalised inference tasks using social media content Lampos, Aletras, Preotiuc-Pietro & Cohn, 2014; Preotiuc-Pietro, Lampos & Aletras, 2015; Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras, 2015; Lampos, Aletras, Geyti, Zou & Cox, 2015
  48. 48. Occupational class inference: Motivation. “Socioeconomic variables are influencing language use.” (Bernstein, 1960; Labov, 1972/2006) + Validate this hypothesis on a broader, larger data set using social media (Twitter) + Downstream applications > research (social science & other domains) > commercial + Proxy for additional user attributes, e.g. income and socioeconomic status (Preotiuc-Pietro, Lampos & Aletras, 2015)
  49. 49. Occupational class inference: SOC 2010 C1 — Managers, Directors & Senior Officials e.g. chief executive, bank manager C2 — Professional Occupations (e.g. mechanical engineer, pediatrist) C3 — Associate Professional & Technical e.g. system administrator, dispensing optician C4 — Administrative & Secretarial (e.g. legal clerk, secretary) C5 — Skilled Trades (e.g. electrical fitter, tailor) C6 — Caring, Leisure, Other Service e.g. nursery assistant, hairdresser C7 — Sales & Customer Service (e.g. sales assistant, telephonist) C8 — Process, Plant and Machine Operatives e.g. factory worker, van driver C9 — Elementary (e.g. shelf stacker, bartender) Standard Occupational Classification (SOC)
  50. 50. Occupational class inference: Data + 5,191 Twitter users mapped to their occupations, then mapped to one of the 9 SOC categories + 10 million tweets + Download the data set [Figure: % of users per SOC category (C1-C9)]
  51. 51. Occupational class inference: Features User attributes (18) + number of followers, friends, listings, follower/friend ratio, favourites, tweets, retweets, hashtags, @-mentions, @-replies, links and so on Topics — Word clusters (200) + SVD on the graph Laplacian of the word × word similarity matrix using normalised PMI, i.e. a form of spectral clustering (Bouma, 2009; von Luxburg, 2007) + Skip-gram model with negative sampling to learn word embeddings (Word2Vec); pairwise cosine similarity on the embeddings to derive a word × word similarity matrix; then spectral clustering on the similarity matrix (Mikolov et al., 2013)
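A rough sketch of the second clustering pipeline (word embeddings, cosine similarity, spectral clustering); random vectors stand in for Word2Vec embeddings trained on the Twitter corpus, and the cluster count is reduced for speed (the study uses 200 clusters).

```python
# Sketch: derive word clusters ('topics') by spectral clustering of a word-by-word
# cosine similarity matrix built from word embeddings. Random embeddings stand in
# for Word2Vec vectors; illustrative only.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import SpectralClustering

rng = np.random.RandomState(0)
n_words, dim, n_clusters = 300, 50, 20          # toy sizes (the study uses 200 clusters)
embeddings = rng.randn(n_words, dim)            # stand-in for Word2Vec vectors

similarity = cosine_similarity(embeddings)      # word x word similarity matrix
similarity = np.clip(similarity, 0.0, None)     # spectral clustering needs non-negative affinities

clusters = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                              random_state=0).fit_predict(similarity)
```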
  52. 52. Occupational class inference: Performance [Figure: classification accuracy (%) of Logistic Regression, SVM (RBF) and Gaussian Process (SE-ARD) classifiers using user attributes, SVD-200 clusters and Word2Vec-200 clusters as features, against the most frequent class baseline; the best accuracy (52.7%) is obtained by the GP classifier with the Word2Vec-200 clusters]
  53. 53. Occupational class inference: Topic CDFs (1) [Figure: cumulative distribution functions of the ‘Higher Education’ topic (#21) proportion per occupational class C1-C9] A topic is more prevalent in a class (C1-C9) if its CDF line leans closer to the bottom-right corner of the plot.
  54. 54. Occupational class inference: Topic CDFs (2) [Figure: cumulative distribution functions of the ‘Arts’ topic (#116) proportion per occupational class C1-C9] A topic is more prevalent in a class (C1-C9) if its CDF line leans closer to the bottom-right corner of the plot.
  55. 55. Occupational class inference: Topic CDFs (3) [Figure: cumulative distribution functions of the ‘Elongated Words’ topic (#164) proportion per occupational class C1-C9] A topic is more prevalent in a class (C1-C9) if its CDF line leans closer to the bottom-right corner of the plot.
  56. 56. Occupational class inference: Topic similarity [Figure: topic distribution distance (Jensen-Shannon divergence) between the different occupational classes (C1-C9)]
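For reference, the Jensen-Shannon divergence between two topic distributions can be computed as below; p and q are hypothetical class-level topic distributions, not values from the paper.

```python
# Jensen-Shannon divergence between two (topic) distributions; generic sketch.
import numpy as np
from scipy.stats import entropy

def js_divergence(p, q):
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)   # entropy(a, b) is KL(a || b)

# hypothetical topic usage of two occupational classes
print(js_divergence([0.2, 0.5, 0.3], [0.1, 0.6, 0.3]))
```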
  59. 59. Income inference: Data + 5,191 Twitter users (same as in the previous study) mapped to their occupations, then mapped to an average income in GBP (£) using the SOC taxonomy + approx. 11 million tweets + Download the data set [Figure: histogram of users by yearly income (£)] (Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras, 2015)
  60. 60. Income inference: Features + Profile (8)
 e.g. #followers, #followees, times listed etc. + Shallow textual features (10)
 e.g. proportion of hashtags, @-replies, @-mentions etc. + Inferred (perceived) psycho-demographic features (15)
 e.g. gender, age, education level, religion, life satisfaction, excitement, anxiety etc. + Emotions (9)
 e.g. positive / negative sentiment, joy, anger, fear, disgust, sadness, surprise etc. + Word clusters — Topics of discussion (200)
 based on word embeddings and by applying spectral clustering
  61. 61. Income inference: Performance [Figure: income inference error (mean absolute error) using GP regression or a linear ensemble for all features, per feature category (Profile, Demographics, Emotion, Shallow, Topics, All features); the reported MAE values range from roughly £9,500 to £11,500]
  62. 62. Income inference: Qualitative analysis (1). Relating income and emotion. [Figure: per-emotion plots (positive, neutral, negative, joy, sadness, disgust, anger, surprise, fear) of feature value against income, comparing a linear fit with a GP fit]
  63. 63. Income inference: Qualitative analysis (2). Relating income and topics of discussion. [Figure: plots of feature value against income for Topic 107 (Justice), Topic 124 (Corporate 1), Topic 139 (Politics), Topic 163 (NGOs), Topic 196 (Web analytics/Surveys) and Topic 99 (Swearing), comparing a linear fit with a GP fit]
  64. 64. Inferring the socioeconomic status: Task. Pipeline: profile description on Twitter → occupation → SOC category¹ → NS-SEC². 1. Standard Occupational Classification: 369 job groupings 2. National Statistics Socio-Economic Classification: map from the job groupings in SOC to a socioeconomic status, i.e. {upper, middle or lower} (Lampos, Aletras, Geyti, Zou & Cox, 2016)
  65. 65. Inferring the socioeconomic status: Data & Features + 1,342 Twitter user profiles
 distinct data set from the previous works + 2 million tweets + Date interval: Feb. 1, 2014 to March 21, 2015 + Each user has a socioeconomic status (SES) label:
 {upper, middle, lower} + Download the data set 1,291 features representing user behaviour (4), biographical / profile information (523), text in the tweets (560), topics of discussion (200), and impact on the platform (4)
  66. 66. Inferring the socioeconomic status: Results. Classification performance (using a GP classifier): 2-way: Accuracy 82.05% (2.4), Precision 82.2% (2.4), Recall 81.97% (2.6), F1 .821 (.03); 3-way: Accuracy 75.09% (3.3), Precision 72.04% (4.4), Recall 70.76% (5.7), F1 .714 (.05). Confusion matrix for the 2-way classification (rows: outputs O1-O2, columns: targets T1-T2; P: per-row precision, R: per-column recall): O1: 584, 115 (P 83.5%); O2: 126, 517 (P 80.4%); R: 82.3%, 81.8%; overall accuracy 82.0%. Confusion matrix for the 3-way classification: O1: 606, 84, 53 (P 81.6%); O2: 49, 186, 45 (P 66.4%); O3: 55, 48, 216 (P 67.7%); R: 85.4%, 58.5%, 68.8%; overall accuracy 75.1%.
  67. 67. Characterising user impact: Task & Data. User impact, a simplified definition: S(φ_in, φ_out, φ_λ) = ln( (φ_λ + θ)(φ_in + θ)² / (φ_out + θ) ), where φ_in is the number of followers, φ_out the number of followees, φ_λ the number of times the account has been listed, and θ = 1 so that the logarithm is applied on a positive number; note that φ_in²/φ_out = (φ_in − φ_out) × (φ_in/φ_out) + φ_in. 40K Twitter accounts (UK) considered. [Figure: histogram of the user impact scores in the data set, mean μ(S) = 6.776, with example accounts marked: @guardian, @David_Cameron, @PaulMasonNews, @lampos (Vasileios Lampos), @nikaletras (Nikolaos Aletras) and suspected spam accounts] (Lampos et al., 2014)
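The impact score above translates directly into a small function; the follower, followee and listing counts in the example are made up.

```python
# User impact score S(phi_in, phi_out, phi_lambda) as defined on the slide,
# with theta = 1 so the logarithm is always applied to a positive number.
import math

def impact_score(followers, followees, times_listed, theta=1.0):
    return math.log((times_listed + theta) * (followers + theta) ** 2
                    / (followees + theta))

# hypothetical account: 5,000 followers, 300 followees, listed 40 times
print(impact_score(5000, 300, 40))
```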
  68. 68. Characterising user impact: Topic entropy [Figure: user impact distribution for accounts with high (blue) and low (dark grey) topic entropy over the 10 most relevant topics; lines denote the respective mean impact scores and the mean of the entire sample (6.776)] On average, the higher the user impact score, the higher the topic entropy.
  69. 69. Characterising user impact: Use case scenarios [Figure: impact score distributions (x-axis: impact points, y-axis: number of user accounts) under user behaviour scenarios based on subsets of the most relevant attributes and topics; grey denotes negation and lines the respective mean impact scores] Interactive (IA) vs. clique interactive (IAC); posting many links (L) vs. very few links (NL); ‘light’ topics (LT) vs. more ‘serious’ topics (ST).
  70. 70. Concluding remarks + User-generated content is a valuable asset > improve health surveillance tasks > mine collective knowledge > infer user characteristics > numerous other tasks + Nonlinear models tend to perform better given the multimodality of the feature space + Deep representations of text tend to improve performance (better representations) + Qualitative analysis is important > Evaluation > Interesting insights
  71. 71. Future research challenges + Interdisciplinary research tasks require working more closely with domain experts + Understand better the biases in the online media (demographics, information propagation, external influence etc.) + Attack more interesting (usually more complex) questions, attempt to generalise findings, identify and define limitations + Conduct more rigorous evaluation + Improve on existing methods (‘deeper’ understandings & interpretations) + Ethical concerns
  72. 72. Acknowledgements Currently funded by All collaborators (alphabetical order) in research mentioned today Nikolaos Aletras (Amazon), Yoram Bachrach (Microsoft Research), Trevor Cohn (Univ. of Melbourne), Ingemar J. Cox (UCL & Univ. of Copenhagen), Nello Cristianini (Univ. of Bristol), Steve Crossan (Google), Jens K. Geyti (UCL), Andrew C. Miller (Harvard Univ.), Daniel Preotiuc-Pietro (Penn), Christian Stefansen (Google), Svitlana Volkova (PNNL), Bin Zou (UCL)
  73. 73. Thank you. Any questions? Slides can be downloaded from lampos.net/talks-posters
  74. 74. References
Argyriou, Evgeniou & Pontil. Convex Multi-Task Feature Learning (Machine Learning, 2008)
Bach. Bolasso: Model Consistent Lasso Estimation through the Bootstrap (ICML, 2008)
Bernstein. Language and social class (Br J Sociol, 1960)
Bouma. Normalized (pointwise) mutual information in collocation extraction (GSCL, 2009)
Duvenaud. Automatic Model Construction with Gaussian Processes (Ph.D. Thesis, Univ of Cambridge, 2014)
Ginsberg et al. Detecting influenza epidemics using search engine query data (Nature, 2009)
Hastie, Tibshirani & Friedman. The Elements of Statistical Learning (Springer, 2009)
Labov. The Social Stratification of English in New York City (Cambridge Univ Press, 1972; 2006, 2nd ed.)
Lampos & Cristianini. Nowcasting Events from the Social Web with Statistical Learning (ACM TIST, 2012)
Lampos, Aletras, Geyti, Zou & Cox. Inferring the Socioeconomic Status of Social Media Users based on Behaviour and Language (ECIR, 2016)
Lampos, Miller, Crossan & Stefansen. Advances in nowcasting influenza-like illness rates using search query logs (Nature Sci Rep, 2015)
Lampos, Preotiuc-Pietro, Aletras & Cohn. Predicting and Characterising User Impact on Twitter (EACL, 2014)
Lampos, Preotiuc-Pietro & Cohn. A user-centric model of voting intention from Social Media (ACL, 2013)
Lazer, Kennedy, King & Vespignani. The Parable of Google Flu: Traps in Big Data Analysis (Science, 2014)
Mairal, Jenatton, Obozinski & Bach. Network Flow Algorithms for Structured Sparsity (NIPS, 2010)
Mikolov, Chen, Corrado & Dean. Efficient estimation of word representations in vector space (ICLR, 2013)
Preotiuc-Pietro, Lampos & Aletras. An analysis of the user occupational class through Twitter content (ACL, 2015)
Preotiuc-Pietro, Volkova, Lampos, Bachrach & Aletras. Studying User Income through Language, Behaviour and Affect in Social Media (PLoS ONE, 2015)
Rasmussen & Williams. Gaussian Processes for Machine Learning (MIT Press, 2006)
Tibshirani. Regression shrinkage and selection via the lasso (J R Stat Soc Series B Stat Methodol, 1996)
von Luxburg. A tutorial on spectral clustering (Stat Comput, 2007)
Zhao & Yu. On model selection consistency of lasso (JMLR, 2006)
Zou & Hastie. Regularization and variable selection via the elastic net (J R Stat Soc Series B Stat Methodol, 2005)
