Twitter-based Sensing of City-level Air Quality

Presentation on sensing city-level air quality from Twitter data, given at the IEEE Image, Video, and Multidimensional Signal Processing (IVMSP) 2018 workshop in Aristi, Greece.


  1. 1. Twitter-based Sensing of City-level Air Quality
     Polychronis Charitidis, Eleftherios Spyromitros-Xioufis, Symeon Papadopoulos, Yiannis Kompatsiaris
     IVMSP 2018, June 6, Aristi, Greece
  2. 2. source: http://www.worldbank.org
  3. 3. Primary air pollutants
  4. 4. PM and ways of measuring
     • Particulate matter (PM): air-suspended mixture of both solid and liquid particles
       • PM10: particles smaller than 10 μm
       • PM2.5: particles smaller than 2.5 μm
     • Ways of measuring PM
       • Certified reference instruments
       • Certified equivalent instruments
       • Certified indicative instruments
       • Indicative instruments
     [Figure: instrument classes arranged along accuracy and cost axes]
     source: https://www.aeroqual.com/particulate-matters-why-monitor-pm10-and-pm2-5
  5. 5. Sensing AQ using Twitter
     [Diagram: air pollution → affected city → affected citizens → social media discussions → monitoring and mining → prediction]
  6. 6. AQ-related tweets
  7. 7. Related work
     • Related research has focused on using Sina Weibo as a “sensor” of air quality in China, in a standard model-building setting
     • Our setting is different and more challenging
       • Much lower volatility/variability in AQ values → fewer relevant signals in social media
       • Smaller population compared to China → fewer posts in social media
     • We differentiate between monitored and unmonitored cities and adopt a transfer learning formulation
     References:
     • Mei et al. “Inferring air pollution by sniffing social media.” ASONAM 2014
     • Jiang et al. “Using social media to detect outdoor air pollution ...” PLOS One 2015
     • Wang et al. “Social media as a sensor of air …” Journal of Medical Internet Research, 2015
     • Tao et al. “Inferring atmospheric particulate matter concentrations ….” PLOS One 2016
  8. 8. Problem formulation
     • $C_M$ / $C_U$: sets of monitored and unmonitored cities
     • For each $c_j \in C_M$:
       • training samples $D_{c_j} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$
       • $\mathbf{x}_i \in \mathbb{R}^d$: $d$-dimensional vector summarizing the tweets in city $c_j$ during the $i$-th temporal bin; $y_i \in \mathbb{R}$: average PM2.5 concentration during bin $i$
     • For each $c_q \in C_U$:
       • build a model $h_{c_q}: X \to Y$ (only $\mathbf{x}_i$ available)
  9. 9. Transfer learning
     • Data pooling approach
       • Train a regression model $h$ on $D = \bigcup_{c_j \in C_M} D_{c_j}$
       • Simultaneously minimize the prediction error on all monitored cities
     • Feature selection: keep the top $k$ features ranked by their Pearson correlation with $Y$
     • Variant: weighted data pooling, where each training example is weighted by the inverse distance between its city and the target city
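The pooling, feature-ranking and weighting steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the dictionary layout and the distance units are hypothetical, and ranking by the absolute value of the correlation is an assumption (the slide only says "ranked by their Pearson correlation").

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def pool(datasets, target_distances=None):
    """Concatenate the samples of all monitored cities into one training set.

    datasets: {city: [(feature_vector, pm25), ...]} (hypothetical layout).
    target_distances: optional {city: distance to the target city, > 0};
    when given, every sample gets the weight 1 / distance of its city.
    """
    X, y, weights = [], [], []
    for city, samples in datasets.items():
        weight = 1.0 if target_distances is None else 1.0 / target_distances[city]
        for features, value in samples:
            X.append(features)
            y.append(value)
            weights.append(weight)
    return X, y, weights

def top_k_features(X, y, k):
    """Indices of the k features with the largest |Pearson correlation| with y (assumed ranking)."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    return sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]
```

The returned weights could then be passed as per-sample weights to whichever regressor is trained on the pooled data.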
  10. 10. Data
      • Track 120 English air quality-related keywords: air pollution, aqi, emission, smog, haze, cough, wheeze, …
      • Infer the location of each tweet using the geotagging method of Kordopatis-Zilos et al. (2017):
        • Tweet text, if the geotagging confidence is > 0.8
        • Twitter account’s location field, if the above is not possible
      • Air quality data: OpenAQ API
        • Hourly measurements
        • Average the measurements from different stations in the same city
      Reference:
      • Kordopatis-Zilos et al. “Geotagging text content with language models and ...” PIEEE 2017
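The two data-preparation rules on this slide (location fallback and per-city station averaging) can be illustrated as follows. This is a sketch under assumptions: the function names and input layouts are hypothetical, and neither the geotagger nor the OpenAQ API is called here, only their outputs are assumed to be available.

```python
def infer_city(text_city, text_confidence, profile_city, threshold=0.8):
    """Prefer the text-based geotag when its confidence exceeds the threshold,
    otherwise fall back to the city from the account's location field."""
    if text_city is not None and text_confidence > threshold:
        return text_city
    return profile_city  # may be None when the account has no usable location

def city_hourly_average(station_readings):
    """Average the readings of all stations of one city, hour by hour.

    station_readings: {station_id: {hour: pm25}} (hypothetical layout).
    Returns {hour: mean pm25 over the stations that reported in that hour}.
    """
    by_hour = {}
    for readings in station_readings.values():
        for hour, value in readings.items():
            by_hour.setdefault(hour, []).append(value)
    return {hour: sum(values) / len(values) for hour, values in by_hour.items()}
```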
  11. 11. Feature extraction
      • Bag-of-words (BoW):
        • tokenization, lowercasing and stop word removal
        • vocabulary $W = \{w_1, \dots, w_n\}$ of the $n$ = 10K most frequent words in a random sample of 1M of the collected tweets
        • $\mathbf{x} = [x_1, \dots, x_n]$ represents all tweets in city $c$ at time interval $t$; $x_i$ is the number of tweets containing $w_i$ divided by the total number of tweets in $(c, t)$
      • Two variants:
        • “current”: only tweets from the current temporal bin
        • “lagged”: include tweets from previous bins
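A small sketch of the bag-of-words feature defined on this slide: each component is the share of tweets in a (city, time bin) that contain the corresponding vocabulary word. The stop-word list, the tokenizer regex and the function names are illustrative assumptions. The "lagged" variant is read here as pooling the tweets of the current and previous bins, which is only one possible interpretation of the slide.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the one used in the talk).
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to"}

def tokenize(text):
    """Lowercase, split into word-like tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9#@']+", text.lower()) if t not in STOP_WORDS]

def bow_vector(tweets, vocabulary):
    """x_i = (number of tweets containing w_i) / (total number of tweets in the bin)."""
    if not tweets:
        return [0.0] * len(vocabulary)
    token_sets = [set(tokenize(t)) for t in tweets]
    return [sum(w in s for s in token_sets) / len(tweets) for w in vocabulary]

def lagged_bow_vector(bins, t, vocabulary, lags=1):
    """BoW over the tweets of bin t pooled with up to `lags` previous bins (assumed reading)."""
    start = max(0, t - lags)
    pooled = [tweet for b in bins[start:t + 1] for tweet in b]
    return bow_vector(pooled, vocabulary)
```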
  12. 12. Experiments
  13. 13. Setup
      • Five cities in the UK and five in the US
      • Period: Feb 8, 2017 → Jan 18, 2018
      • Each city is in turn considered the test city (i.e. no access to its ground-truth air quality)
      • Three temporal granularities: 6h, 12h, 24h (ground truth → average of hourly measurements)
      • Evaluation: Root Mean Squared Error (RMSE)
        • Macro-averaging for country-wise/overall performance (αRMSE)
      • Gradient Tree Boosting for regression
        • scikit-learn: learning rate = 0.01, number of estimators = 200
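The evaluation protocol on this slide can be sketched with standard-library Python: hourly ground truth is averaged into 6h/12h/24h bins, RMSE is computed per city, and the per-city scores are macro-averaged. The function names and the input layout are hypothetical, and the sketch assumes the hourly series starts on a bin boundary.

```python
import math

def bin_average(hourly, hours_per_bin):
    """Average consecutive hourly values into bins of `hours_per_bin` hours."""
    chunks = (hourly[i:i + hours_per_bin] for i in range(0, len(hourly), hours_per_bin))
    return [sum(chunk) / len(chunk) for chunk in chunks]

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def macro_rmse(per_city):
    """Unweighted mean of the per-city RMSEs.

    per_city: {city: (y_true, y_pred)} (hypothetical layout).
    """
    scores = [rmse(t, p) for t, p in per_city.values()]
    return sum(scores) / len(scores)
```

If the regressor is scikit-learn's `GradientBoostingRegressor`, the stated settings would presumably map to `learning_rate=0.01` and `n_estimators=200`; the slide names only the library and the two values, so the exact class is an assumption.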
  14. 14. UK cities
      CITY          #TWEETS/DAY
      London        3972
      Birmingham    198
      Leeds         112
      Liverpool     108
      Manchester    321
  15. 15. US cities
      CITY          #TWEETS/DAY
      New York      2564
      Philadelphia  478
      Boston        574
      Baltimore     394
      Pittsburgh    169
  16. 16. Baseline performance analysis
      • IDW: Inverse Distance Weighting (spatial interpolation)
        • High correlation between close-by cities
      • mean: always predict the mean PM2.5 value per city
        • Small variability and mostly low PM2.5 values
      αRMSE    UK                  US                  Overall
               6h    12h   24h     6h    12h   24h     6h    12h   24h
      IDW      3.79  3.34  3.09    4.12  3.73  3.41    3.96  3.54  3.25
      mean     7.00  6.64  6.36    4.60  4.26  4.02    5.80  5.46  5.19
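For readers unfamiliar with the IDW baseline, a minimal sketch follows: the PM2.5 value of the target city is estimated as a weighted mean of the values measured in the other cities, with weights that decay with distance. The haversine distance, the exponent `power=2.0` and the function names are assumptions, since the slides do not state which distance or exponent was used.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def idw_estimate(target, neighbours, power=2.0):
    """Inverse-distance-weighted mean of neighbouring measurements.

    target: (lat, lon) of the unmonitored city.
    neighbours: list of ((lat, lon), pm25) for monitored cities.
    power: distance exponent (assumed value).
    """
    numerator = denominator = 0.0
    for coords, value in neighbours:
        distance = haversine_km(target, coords)
        if distance == 0:
            return value  # a station at the target location dominates
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator
```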
  17. 17. Within-city models
      • #tw: total number of tweets in the spatiotemporal bin
      • #aqs: number of tweets related to air quality
      • #high: number of tweets related to high air pollution
      • all: concatenation of #tw, #aqs, #high
      • BoW/BoW-1/BoW-2: BoW and lagged versions
      αRMSE   #tw    #aqs   #high  all    BoW    BoW-1  BoW-2
      6h      5.96   5.93   5.98   5.84   5.15   4.99   4.97
      12h     6.17   5.98   6.02   5.77   4.96   4.84   5.16
      24h     5.83   6.11   5.82   5.52   4.65   4.96   5.16
  18. 18. Ground truth vs features
      [Figure: PM2.5 in London; axes: date, PM2.5 (μg/m³)]
  19. 19. Cross-city models
      • full: full-dimensional BoW (or lagged)
      • k=N: the top-k features are selected
      • w=0/1: without/with sample weighting
      αRMSE        full   k=10   k=20   k=50   k=100  k=200  k=500
      w=0   6h     5.36   5.48   5.28   5.21   5.24   5.29   5.31
            12h    5.21   5.29   5.18   5.12   5.09   5.11   5.15
            24h    4.97   4.89   4.78   4.78   4.75   4.79   4.86
      w=1   6h     5.35   5.47   5.27   5.21   5.24   5.29   5.30
            12h    5.21   5.26   5.18   5.11   5.08   5.11   5.16
            24h    4.95   4.85   4.77   4.76   4.73   4.77   4.84
  20. 20. Fusion
      • Simple fusion of two inputs
        • IDW estimate
        • Twitter-based estimate
      • Overall, still slightly worse than IDW (higher αRMSE)
      • Better for three cities: Boston, London, Pittsburgh (i.e. cities that are far from the rest)
      αRMSE    Overall
               6h    12h   24h
      IDW      3.96  3.54  3.25
      mean     5.80  5.46  5.19
      fusion   4.15  4.00  3.63
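The slides do not say how the two inputs are fused. Purely as an illustration, one simple option is a convex combination of the IDW and Twitter-based estimates; the function name and the mixing weight `alpha` are hypothetical and not taken from the talk.

```python
def fuse(idw_estimate, twitter_estimate, alpha=0.5):
    """Convex combination of two PM2.5 estimates (assumed fusion rule).

    alpha is the weight of the IDW estimate; 1 - alpha goes to the Twitter-based one.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * idw_estimate + (1.0 - alpha) * twitter_estimate
```

With such a rule, a smaller `alpha` would be the natural choice for cities far from any monitored neighbour, where the slide reports that fusion helps.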
  21. 21. Summary & outlook
      • Features extracted from Twitter can offer useful signals that can contribute to coarse air quality estimations
      • Combined with actual air quality measurements from nearby locations, Twitter-based estimations can lead to improved results
      • Still room for further improvements:
        • Better tweet classification, feature extraction, modelling
        • Use of additional modalities (sky images)
  22. 22. Thank you!
      Symeon Papadopoulos
      papadop@iti.gr / @sympap
      code: https://github.com/MKLab-ITI/twitter-aq
  23. 23. Top selected features
      FEATURE        CORRELATION
      measured       0.678
      moderate       0.666
      particles      0.666
      temperature    0.661
      wind           0.655
      humidity       0.654
      pm10           0.591
      weather        0.574
      haze           0.574
      pollutants     0.560
      currently      0.557
      spam           0.551
      forecast       0.529
      tube           0.527
      bonfire        0.523
      polluted       0.515
      air            0.515
      temperatures   0.514
      begun          0.508
      exceeding      0.507
  24. 24. PM2.5 correlations of city pairs
  25. 25. Example tweets (aqs-high)
      Tweets classified as aqs AND high:
      • RT @PlumeInLondon: High pollution (50) at 10PM. High for #London. Avoid physical activities if sensitive https://t.co/3LVRgps965
      • London's air pollution is killing me. Coughs now sound like squeaky chew toy. #sendhelp #sendventolin
      • RT @cargill_taxi: And the mayor of London tries to blame poor air quality on toxic air from German factories.
      • @claireL23 The traffic, poor air quality, the light pollution, the lack of green space, the concrete jungle, the building work. Need I go on?
      • RT @SkyNews: THE GUARDIAN FRONT PAGE: "Toxic air risk to one in four London schools" #skypapers https://t.co/2c6ANlujep
      • RT @MayorofLondon: London’s toxic air is a public health emergency. Here is what I’m doing about it https://t.co/YHw2CVepPI
