Presentation on sensing air quality at city level from Twitter data, given at the IEEE Image, Video, and Multidimensional Signal Processing (IVMSP) 2018 workshop in Aristi, Greece.
8. Related work
• Related research has focused on using Sina Weibo as a “sensor” of air quality in China, in a standard model-building setting
• Our setting is different and more challenging
• Much lower volatility/variability in AQ values → fewer relevant signals in social media
• Smaller population compared to China → fewer posts in social media
• We differentiate between monitored & unmonitored cities and
adopt a transfer learning formulation
Mei et al. “Inferring air pollution by sniffing social media.” ASONAM 2014
Jiang et al. “Using social media to detect outdoor air pollution ...” PLOS One 2015
Wang et al. “Social media as a sensor of air …” Journal of Medical Internet Research 2015
Tao et al. “Inferring atmospheric particulate matter concentrations …” PLOS One 2016
9. Problem formulation
• $C_M$ / $C_U$: sets of monitored and unmonitored cities
• For each $c_j \in C_M$:
• training samples $D_{c_j} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$
• $\mathbf{x}_i \in \mathbb{R}^d$: $d$-dimensional vector summarizing tweets in city $c_j$ during the $i$-th temporal bin; $y_i \in \mathbb{R}$: average PM2.5 concentration during bin $i$
• For each $c_q \in C_U$:
• build model $h_{c_q}: X \to Y$ (only $\mathbf{x}_i$ available)
10. Transfer learning
• Data pooling approach
• Train regression model $h$ on $D = \bigcup_{c_j \in C_M} D_{c_j}$
• Simultaneously minimize prediction error on all
monitored cities
• Feature selection: keep top k features ranked by
their Pearson correlation with Y
• Variant: weighted data pooling, where each training example is weighted by the inverse distance between its city and the target city
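The pooling, feature-selection and weighting steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the helper names (`pool`, `top_k_features`), the toy data and the distances are hypothetical, and ranking by the absolute value of the Pearson correlation is an assumption, since the slide only says that features are ranked by their correlation with Y.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pool(datasets, distances_km=None):
    """Concatenate the samples of all monitored cities into one training set.

    datasets: {city: [(feature_vector, pm25), ...]}
    distances_km: optional {city: distance to the target city}; when given,
    every sample is weighted by the inverse of its city's distance.
    """
    X, y, w = [], [], []
    for city, samples in datasets.items():
        weight = 1.0 if distances_km is None else 1.0 / distances_km[city]
        for features, target in samples:
            X.append(features)
            y.append(target)
            w.append(weight)
    return X, y, w


def top_k_features(X, y, k):
    """Indices of the k features with the largest |Pearson correlation| with y (assumption: absolute value)."""
    d = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(d)]
    return sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: two monitored cities, three features per temporal bin (hypothetical values).
datasets = {
    "A": [([0.1, 0.5, 0.3], 10.0), ([0.2, 0.5, 0.1], 12.0)],
    "B": [([0.4, 0.5, 0.2], 16.0), ([0.3, 0.5, 0.9], 14.0)],
}
distances_km = {"A": 50.0, "B": 200.0}

X, y, w = pool(datasets, distances_km)
selected = top_k_features(X, y, k=1)
X_selected = [[row[j] for j in selected] for row in X]
```

The pooled rows and weights could then be passed to any regressor that accepts per-sample weights, for example scikit-learn's `GradientBoostingRegressor.fit(X_selected, y, sample_weight=w)`.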
11. Data
• Track 120 English air quality-related keywords
air pollution, aqi, emission, smog, haze, cough, wheeze, …
• Infer the location of each tweet using a geotagging method (Kordopatis-Zilos et al., 2017):
• From the tweet text, if geotagging confidence > 0.8
• From the Twitter account’s location field, if the above is not possible
• Air quality data: OpenAQ API
• Hourly measurements
• Average measurements from different stations in the
same city
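The location fallback and the per-city averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the tuple layout and the example values are hypothetical, the geotagger itself is not reproduced, and no actual OpenAQ call is shown.

```python
from collections import defaultdict


def resolve_city(text_city, text_confidence, profile_city, threshold=0.8):
    """Prefer the text-based geotag when it is confident enough, else the profile location.

    Returns None when neither source yields a city.
    """
    if text_city is not None and text_confidence > threshold:
        return text_city
    return profile_city


def city_hourly_mean(measurements):
    """Average PM2.5 over all stations of a city for each hour.

    measurements: iterable of (city, hour, pm25), one entry per station reading.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (city, hour) -> [sum, count]
    for city, hour, pm25 in measurements:
        totals[(city, hour)][0] += pm25
        totals[(city, hour)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical readings from two stations in the same city and hour.
readings = [
    ("London", "2017-02-08T10", 10.0),
    ("London", "2017-02-08T10", 14.0),
    ("London", "2017-02-08T11", 9.0),
]
hourly = city_hourly_mean(readings)
```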
Kordopatis-Zilos et al. “Geotagging text content with language models and ...” PIEEE 2017
12. Feature extraction
• Bag-of-words:
• tokenization, lowercasing and stop word removal
• vocabulary W = {w1,…,wn} of n=10K most frequent words
in a random sample of 1M of the collected tweets
• x = [x1, …, xn] represents all tweets in city c during time interval t; xi is the number of tweets containing wi divided by the total number of tweets in (c, t)
• Two variants:
• “current”: only tweets from current temporal bin
• “lagged”: include tweets from previous bins
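A minimal sketch of the bag-of-words representation described above, in plain Python. The tokenizer, the tiny stop-word list and the three-word vocabulary are stand-ins chosen for illustration (the slides use the 10K most frequent words). The “lagged” variant is implemented here by pooling the tweets of the current bin with those of the previous bins, which is an assumption about what “include tweets from previous bins” means.

```python
import re

# Tiny illustrative stop-word list (assumption; the actual list is not given).
STOP_WORDS = {"the", "is", "a", "in", "of", "and", "to"}


def tokenize(text):
    """Lowercase, split into word-like tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9#@']+", text.lower()) if t not in STOP_WORDS]


def bow_vector(tweets, vocabulary):
    """x_i = number of tweets containing word w_i divided by the number of tweets in the bin."""
    if not tweets:
        return [0.0] * len(vocabulary)
    token_sets = [set(tokenize(t)) for t in tweets]
    return [sum(w in s for s in token_sets) / len(tweets) for w in vocabulary]


def lagged_vector(bins, t, vocabulary, lags=1):
    """BoW vector over the tweets of bin t pooled with those of the `lags` previous bins."""
    start = max(0, t - lags)
    pooled = [tweet for b in bins[start:t + 1] for tweet in b]
    return bow_vector(pooled, vocabulary)


vocabulary = ["smog", "cough", "haze"]
bins = [
    ["Smog is bad today", "cough cough"],                     # bin 0
    ["The haze in London", "smog and haze", "nice day"],      # bin 1
]
current = bow_vector(bins[1], vocabulary)
lagged = lagged_vector(bins, 1, vocabulary, lags=1)
```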
15. Setup
• Five cities in UK and five in US
• Period: Feb 8, 2017 to Jan 18, 2018
• Each city in turn considered as the test city (i.e. no access to its ground truth air quality)
• Three temporal granularities: 6h, 12h, 24h (ground truth: average of hourly measurements)
• Root Mean Squared Error (RMSE)
• Macro-averaging for country-wise/overall performance
(αRMSE)
• Gradient Tree Boosting for regression
• scikit-learn: learning rate = 0.01, nr. estimators = 200
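The evaluation protocol above can be sketched as follows: hourly ground truth is averaged into temporal bins, each city takes a turn as the held-out test city, and per-city RMSE values are macro-averaged. The function names and toy numbers are hypothetical, and it is an assumption that macro-averaging means an unweighted mean of the per-city scores.

```python
import math


def bin_average(hourly_values, width):
    """Average consecutive hourly values into bins of `width` hours (e.g. 6, 12 or 24)."""
    return [
        sum(hourly_values[i:i + width]) / len(hourly_values[i:i + width])
        for i in range(0, len(hourly_values), width)
    ]


def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def macro_rmse(per_city):
    """Unweighted mean of per-city RMSE values; per_city maps city -> (y_true, y_pred)."""
    scores = [rmse(y_true, y_pred) for y_true, y_pred in per_city.values()]
    return sum(scores) / len(scores)


def leave_one_city_out(cities):
    """Yield (test_city, training_cities) with each city held out in turn."""
    for test_city in cities:
        yield test_city, [c for c in cities if c != test_city]


six_hour_bins = bin_average(list(range(12)), 6)
score = macro_rmse({"a": ([0.0, 0.0], [3.0, 3.0]), "b": ([0.0, 0.0], [1.0, 1.0])})
splits = list(leave_one_city_out(["a", "b", "c"]))
```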
18. Baseline performance analysis
• IDW: Inverse Distance Weighting (spatial interpolation)
• High correlation between close-by cities
• mean: always predict mean PM2.5 value per city
• Small variability and mostly low PM2.5 values
αRMSE per country and temporal granularity
        UK                  US                  Overall
        6h    12h   24h     6h    12h   24h     6h    12h   24h
IDW     3.79  3.34  3.09    4.12  3.73  3.41    3.96  3.54  3.25
mean    7.00  6.64  6.36    4.60  4.26  4.02    5.80  5.46  5.19
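For reference, the two baselines can be sketched as below. This is a generic inverse distance weighting formulation, not necessarily the exact one used in the presentation: the great-circle (haversine) distance, the `power` exponent and the example coordinates are all assumptions.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def idw_estimate(target, neighbours, power=1.0):
    """Weighted mean of neighbouring measurements with weights 1 / distance**power.

    neighbours: list of ((lat, lon), pm25). Assumes the target does not coincide
    with any neighbour (the distance would be zero).
    """
    weights = [1.0 / haversine_km(target, loc) ** power for loc, _ in neighbours]
    return sum(w * value for w, (_, value) in zip(weights, neighbours)) / sum(weights)


def mean_baseline(past_values):
    """Always predict the mean of the given PM2.5 values for the city."""
    return sum(past_values) / len(past_values)


# Hypothetical layout: the second neighbour is three times as far as the first,
# so it receives one third of the weight.
estimate = idw_estimate((0.0, 0.0), [((0.0, 1.0), 10.0), ((0.0, 3.0), 50.0)])
```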
19. Within-city models
• #tw: total number of tweets in spatiotemporal bin
• #aqs: number of tweets related to air quality
• #high: number of tweets related to high air pollution
• all: concatenation of #tw, #aqs, #high
• BoW/BoW-1/BoW-2: BoW and lagged versions
αRMSE per feature set and temporal granularity
       #tw    #aqs   #high  all    BoW    BoW-1  BoW-2
6h     5.96   5.93   5.98   5.84   5.15   4.99   4.97
12h    6.17   5.98   6.02   5.77   4.96   4.84   5.16
24h    5.83   6.11   5.82   5.52   4.65   4.96   5.16
22. Fusion
• Simple fusion of two inputs
• IDW estimate
• Twitter-based estimate
• Overall, fusion still performs slightly worse than IDW (higher αRMSE)
• Better for three cities: Boston, London, Pittsburgh
(i.e. cities that are far from the rest)
Overall αRMSE per temporal granularity
        6h    12h   24h
IDW     3.96  3.54  3.25
mean    5.80  5.46  5.19
fusion  4.15  4.00  3.63
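The slides do not state how the two estimates are combined. One simple possibility, shown here purely as an assumption, is a convex combination of the IDW estimate and the Twitter-based estimate with a hypothetical mixing weight `alpha`.

```python
def fuse(idw_estimate, twitter_estimate, alpha=0.5):
    """Convex combination of two PM2.5 estimates (assumed fusion rule, not taken from the slides).

    alpha = 1.0 returns the IDW estimate, alpha = 0.0 the Twitter-based estimate.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * idw_estimate + (1.0 - alpha) * twitter_estimate


fused = fuse(10.0, 20.0, alpha=0.5)
```

In such a scheme, `alpha` would have to be chosen on monitored cities only, since the test city has no ground truth.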
23. Summary & outlook
• Features extracted from Twitter can offer useful
signals that can contribute to coarse air quality
estimations
• Combined with actual air quality measurements
from nearby locations, Twitter-based estimations
can lead to improved results
• Still room for further improvements:
• Better tweet classification, feature extraction, modelling
• Use of additional modalities (sky images)
27. Example tweets (aqs-high)
Tweets classified as aqs AND high
RT @PlumeInLondon: High pollution (50) at 10PM. High for #London. Avoid
physical activities if sensitive https://t.co/3LVRgps965
London's air pollution is killing me. Coughs now sound like squeaky chew toy.
#sendhelp #sendventolin
RT @cargill_taxi: And the mayor of London tries to blame poor air quality on toxic
air from German factories.
@claireL23 The traffic, poor air quality, the light pollution, the lack of green space,
the concrete jungle, the building work. Need I go on?
RT @SkyNews: THE GUARDIAN FRONT PAGE: "Toxic air risk to one in four London
schools" #skypapers https://t.co/2c6ANlujep
RT @MayorofLondon: London’s toxic air is a public health emergency. Here is what
I’m doing about it https://t.co/YHw2CVepPI