Twitter-based Sensing of City-level Air Quality

Polychronis Charitidis, Eleftherios Spyromitros-Xioufis,
Symeon Papadopoulos, Yiannis Kompatsiaris
IVMSP 2018, June 6, Aristi, Greece

source: http://www.worldbank.org

PM and ways of measuring
• Particulate matter: air-suspended mixture of both
solid and liquid particles
• PM10: particles smaller than 10μm
• PM2.5: particles smaller than 2.5μm
• Ways of measuring PM
• Certified reference instruments
• Certified equivalent instruments
• Certified indicative instruments
• Indicative instruments
source: https://www.aeroqual.com/particulate-matters-why-monitor-pm10-and-pm2-5
[Figure annotation: instrument types compared by accuracy and cost]

Sensing AQ using Twitter
air pollution → affected city → affected citizens → social media discussions → monitoring and mining → prediction

Related work
• Related research has focused on Sina Weibo as a “sensor” of air quality in China, using a standard model-building setting
• Our setting is different and more challenging
• Much lower volatility/variability in AQ values → fewer relevant signals in social media
• Smaller population compared to China → fewer posts in social media
• We differentiate between monitored & unmonitored cities and
adopt a transfer learning formulation
Mei et al. “Inferring air pollution by sniffing social media.” ASONAM 2014
Jiang et al. “Using social media to detect outdoor air pollution ...” PLOS One 2015
Wang et al. “Social media as a sensor of air …” Journal of Medical Internet Research, 2015
Tao et al. “Inferring atmospheric particulate matter concentrations …” PLOS One 2016

Problem formulation
• $C_M$ / $C_U$: sets of monitored and unmonitored cities
• For each $c_j \in C_M$:
• training samples $D_{c_j} = \{(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)\}$
• $\mathbf{x}_i \in \mathbb{R}^d$: d-dimensional vector summarizing tweets in city $c_j$ during the i-th temporal bin; $y_i \in \mathbb{R}$: average PM2.5 concentration during bin i
• For each $c_q \in C_U$:
• build model $h_{c_q}: X \to Y$ (only $\mathbf{x}_i$ available)

Transfer learning
• Data pooling approach
• Train regression model h on $D = \bigcup_{c_j \in C_M} D_{c_j}$
• Simultaneously minimize prediction error on all
monitored cities
• Feature selection: keep top k features ranked by
their Pearson correlation with Y
• Variant: weighted data pooling, where each training example is weighted by the inverse distance between its city and the target city
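
The pooling, feature selection and weighting steps above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names, the use of the absolute value of the correlation for ranking, and the distance dictionary are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(vx * vy)

def top_k_features(X, y, k):
    """Indices of the k features with the largest |r| with y.
    Ranking by absolute value is an assumption of this sketch."""
    d = len(X[0])
    scores = [abs(pearson([row[f] for row in X], y)) for f in range(d)]
    return sorted(range(d), key=lambda f: scores[f], reverse=True)[:k]

def pool(datasets, distance_to_target=None):
    """Pool the samples of all monitored cities into one training set.
    datasets: {city: [(x, y), ...]}; distance_to_target: {city: km}.
    Returns (X, y, sample_weights); weights are 1/distance, or 1.0
    for every sample when no distances are given (unweighted variant)."""
    X, y, w = [], [], []
    for city, samples in datasets.items():
        weight = 1.0 if distance_to_target is None else 1.0 / distance_to_target[city]
        for xi, yi in samples:
            X.append(xi)
            y.append(yi)
            w.append(weight)
    return X, y, w
```

The returned sample weights could then be passed to any regressor that accepts per-sample weights during fitting.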

Data
• Track 120 English air quality-related keywords
air pollution, aqi, emission, smog, haze, cough, wheeze, …
• Infer location from each tweet using geotagging
method (Kordopatis-Zilos et al., 2017):
• Tweet text with geotagging confidence > 0.8
• Twitter account’s location field if above not possible
• Air quality data: OpenAQ API
• Hourly measurements
• Average measurements from different stations in the
same city
Kordopatis-Zilos et al. “Geotagging text content with language models and ...” PIEEE 2017
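
A minimal sketch of how hourly station measurements might be reduced to one PM2.5 value per city and temporal bin. The input tuple layout and the `bin_hours` parameter are hypothetical and do not reflect the OpenAQ response format; the order of averaging (stations first, then hours) is also an assumption.

```python
from collections import defaultdict
from datetime import datetime

def bin_average(measurements, bin_hours):
    """measurements: iterable of (city, station, timestamp, pm25).
    Returns {(city, bin_start): mean PM2.5}."""
    # Step 1: average the stations of a city within each hour.
    hourly = defaultdict(list)
    for city, _station, ts, value in measurements:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        hourly[(city, hour)].append(value)
    # Step 2: average the hourly city values within each temporal bin.
    bins = defaultdict(list)
    for (city, hour), values in hourly.items():
        start = hour.replace(hour=(hour.hour // bin_hours) * bin_hours)
        bins[(city, start)].append(sum(values) / len(values))
    return {key: sum(v) / len(v) for key, v in bins.items()}
```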

Feature extraction
• Bag-of-words:
• tokenization, lowercasing and stop word removal
• vocabulary W = {w1,…,wn} of n=10K most frequent words
in a random sample of 1M of the collected tweets
• x = [x1, … , xn] represents all tweets in city c at time
interval t, xi denotes number of tweets containing wi
divided by total number of tweets in (c,t)
• Two variants:
• “current”: only tweets from current temporal bin
• “lagged”: include tweets from previous bins
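
The feature definition above can be illustrated as follows. This is a sketch under stated assumptions: the stop word list and tokenizer are toy stand-ins, and the lagged variant is assumed to pool the tweets of the current bin with those of the previous bins (the slides do not say whether pooling or concatenation is used).

```python
import re

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to"}

def tokenize(text):
    """Lowercase, split into tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9#@']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bow_vector(tweets, vocabulary):
    """x_i = number of tweets containing w_i / total number of tweets."""
    if not tweets:
        return [0.0] * len(vocabulary)
    token_sets = [set(tokenize(t)) for t in tweets]
    return [sum(w in s for s in token_sets) / len(tweets) for w in vocabulary]

def lagged_vector(tweets_by_bin, t, vocabulary, lags=1):
    """Assumed 'lagged' variant: pool bin t with the `lags` previous bins."""
    pooled = []
    for b in range(max(0, t - lags), t + 1):
        pooled.extend(tweets_by_bin[b])
    return bow_vector(pooled, vocabulary)
```

With `lags=0` the lagged function reduces to the “current” variant.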

Setup
• Five cities in UK and five in US
• Period: Feb 8, 2017 to Jan 18, 2018
• Each city in turn considered as the test city (i.e. no access to its ground truth air quality)
• Three temporal granularities: 6h, 12h, 24h (ground truth = average of hourly measurements)
• Root Mean Squared Error (RMSE)
• Macro-averaging for country-wise/overall performance
(αRMSE)
• Gradient Tree Boosting for regression
• scikit-learn: learning rate = 0.01, nr. estimators = 200
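
The evaluation protocol above can be sketched as follows: each city is held out in turn, RMSE is computed per city, and the per-city scores are macro-averaged. The function names and the input layout are assumptions made for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error over one city's temporal bins."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def macro_rmse(per_city):
    """Unweighted mean of per-city RMSE.
    per_city: {city: (y_true, y_pred)}."""
    scores = [rmse(t, p) for t, p in per_city.values()]
    return sum(scores) / len(scores)

def leave_one_city_out(cities):
    """Yield (test_city, training_cities), holding out each city in turn."""
    for test_city in cities:
        yield test_city, [c for c in cities if c != test_city]
```

Macro-averaging gives every city the same influence on the reported score, regardless of how many tweets or measurements it contributes.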

UK cities
CITY #TWEETS/DAY
London 3972
Birmingham 198
Leeds 112
Liverpool 108
Manchester 321

US cities
CITY #TWEETS/DAY
New York 2564
Philadelphia 478
Boston 574
Baltimore 394
Pittsburgh 169

Baseline performance analysis
• IDW: Inverse Distance Weighting (spatial interpol.)
• High correlation between close-by cities
• mean: always predict mean PM2.5 value per city
• Small variability and mostly low PM2.5 values
αRMSE   UK                 US                 Overall
        6h    12h   24h    6h    12h   24h    6h    12h   24h
IDW     3.79  3.34  3.09   4.12  3.73  3.41   3.96  3.54  3.25
mean    7.00  6.64  6.36   4.60  4.26  4.02   5.80  5.46  5.19
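
A minimal sketch of an IDW baseline: the PM2.5 value of the target city is estimated as a weighted mean of the values measured in the other cities, with weights that decrease with distance. The great-circle distance and the `power` exponent (set to 1 here) are assumptions; the slides do not specify either.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def idw_estimate(target, monitored, power=1.0):
    """Inverse Distance Weighting.
    target: (lat, lon); monitored: list of ((lat, lon), pm25)."""
    num = den = 0.0
    for coords, value in monitored:
        d = haversine_km(target, coords)
        if d == 0:
            return value  # a station at the target location decides
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den
```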

Within-city models
• #tw: total number of tweets in spatiotemporal bin
• #aqs: number of tweets related to air quality
• #high: number of tweets related to high air pollution
• all: concatenation of #tw, #aqs, #high
• BoW/BoW-1/BoW-2: BoW and lagged versions
αRMSE   #tw   #aqs  #high  all   BoW   BoW-1  BoW-2
6h      5.96  5.93  5.98   5.84  5.15  4.99   4.97
12h     6.17  5.98  6.02   5.77  4.96  4.84   5.16
24h     5.83  6.11  5.82   5.52  4.65  4.96   5.16

Ground truth vs features
[Figure: PM2.5 in London over time; x-axis: date, y-axis: PM2.5 (μg/m³)]

Cross-city models
• full: full dimensional BoW (or lagged)
• k=N: top-k features are selected
• w=0/1: without/with sample weighting
αRMSE       full  k=10  k=20  k=50  k=100  k=200  k=500
w=0   6h    5.36  5.48  5.28  5.21  5.24   5.29   5.31
      12h   5.21  5.29  5.18  5.12  5.09   5.11   5.15
      24h   4.97  4.89  4.78  4.78  4.75   4.79   4.86
w=1   6h    5.35  5.47  5.27  5.21  5.24   5.29   5.30
      12h   5.21  5.26  5.18  5.11  5.08   5.11   5.16
      24h   4.95  4.85  4.77  4.76  4.73   4.77   4.84

Fusion
• Simple fusion of two inputs
• IDW estimate
• Twitter-based estimate
• Overall, fusion still performs slightly worse than IDW (higher αRMSE)
• Better for three cities: Boston, London, Pittsburgh
(i.e. cities that are far from the rest)
αRMSE (overall)  6h    12h   24h
IDW              3.96  3.54  3.25
mean             5.80  5.46  5.19
fusion           4.15  4.00  3.63
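
The slides do not state how the two estimates are fused. Purely as an illustration, one simple option is a convex combination of the IDW estimate and the Twitter-based estimate; the function and the mixing weight `alpha` below are hypothetical and not taken from the original work.

```python
def fuse(idw_value, twitter_value, alpha=0.5):
    """Convex combination of two PM2.5 estimates (illustrative assumption).
    alpha=1.0 returns the IDW estimate, alpha=0.0 the Twitter-based one."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * idw_value + (1.0 - alpha) * twitter_value
```

In such a scheme, `alpha` would have to be chosen using monitored cities only, since the test city has no ground truth.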

Summary & outlook
• Features extracted from Twitter can offer useful
signals that can contribute to coarse air quality
estimations
• Combined with actual air quality measurements
from nearby locations, Twitter-based estimations
can lead to improved results
• Still room for further improvements:
• Better tweet classification, feature extraction, modelling
• Use of additional modalities (sky images)

Thank you!
Symeon Papadopoulos
papadop@iti.gr / @sympap
code: https://github.com/MKLab-ITI/twitter-aq

Top selected features
feature correlation
measured 0.678
moderate 0.666
particles 0.666
temperature 0.661
wind 0.655
humidity 0.654
pm10 0.591
weather 0.574
haze 0.574
pollutants 0.560
currently 0.557
spam 0.551
forecast 0.529
tube 0.527
bonfire 0.523
polluted 0.515
air 0.515
temperatures 0.514
begun 0.508
exceeding 0.507

PM2.5 correlations of city pairs

Example tweets (aqs-high)
Tweets classified as aqs AND high
RT @PlumeInLondon: High pollution (50) at 10PM. High for #London. Avoid
physical activities if sensitive https://t.co/3LVRgps965
London's air pollution is killing me. Coughs now sound like squeaky chew toy.
#sendhelp #sendventolin
RT @cargill_taxi: And the mayor of London tries to blame poor air quality on toxic
air from German factories.
@claireL23 The traffic, poor air quality, the light pollution, the lack of green space,
the concrete jungle, the building work. Need I go on?
RT @SkyNews: THE GUARDIAN FRONT PAGE: "Toxic air risk to one in four London
schools" #skypapers https://t.co/2c6ANlujep
RT @MayorofLondon: London’s toxic air is a public health emergency. Here is what
I’m doing about it https://t.co/YHw2CVepPI
