Processing and understanding geo-social media content
1. PROCESSING AND UNDERSTANDING GEO-SOCIAL MEDIA CONTENT
EARTH OBSERVATION WITH UNCALIBRATED IN-SITU SENSORS
Frank O. Ostermann
IfGI GI-Forum, 23.06.2015
2. Introduction: Geo-social media APIs as sensors
Where we are: Current state-of-the-art and practical examples from disaster response
Outlook: Future research directions
23.06.2015F.O.Ostermann - ifgi GI-Forum 2
3. ONCE UPON A TIME…
INTRODUCTION
… there were Desktop-GIS and Shapefiles,
digitized or scanned from paper maps,
or from raw surveying or satellite data.
4. THREE DISRUPTIVE INNOVATIONS
INTRODUCTION
Mobile Web 2.0
Cloud Computing
Internet of Things (in particular, sensors)
7. THE BIG PICTURE: NEXT GENERATION DIGITAL EARTH
INTRODUCTION
Real-time data input stream
Citizens as sensors
Multi-layered, interoperable data sets
Linked and open data
GEOSS, Eye on Earth, INSPIRE, …
8. LOW-COST IN-SITU AND MOBILE SENSORS
INTRODUCTION
Publiclaboratory.com
Mikrokopter.de
Libelium
Waspmote
9. CITIZENS AS SENSORS
INTRODUCTION
Why not treat information from the citizens
as another type of sensor data?
12. NEW SOURCES OF GEO-INFORMATION
INTRODUCTION
Typology by explicitness of geography and of participation (adapted from [1]):
• Explicit participation, explicit geography: Volunteered Geographic Information (VGI), e.g. OpenStreetMap
• Explicit participation, implicit geography: Volunteered Geographic Content (VGC), e.g. Wikipedia articles on non-geographic topics containing place names, Foursquare
• Implicit participation, explicit geography: Contributed / Ambient Geographic Information (CGI/AGI), e.g. public Tweets referring to the properties of an identifiable place
• Implicit participation, implicit geography: User-Generated Geographic Content (UGGC), e.g. public Flickr images containing a place name or being georeferenced
13. TWITTER
INTRODUCTION
• 140-character micro-blogging platform
• Asymmetric relationships: following vs. being followed
• Inflated user numbers:
• 100 million daily active vs.
• 300 million monthly active vs.
• 1 billion registered (high number of bots; >40% have never tweeted)
• Two APIs: Streaming API & Search API
• Rich metadata returned
• <5% with coordinates, but many more with toponyms
• Huge ecosystem of third-party apps and services
• Boost to data-driven research, but what about reproducibility?
14. FLICKR
INTRODUCTION
• 92 million users
• 1 million photos shared every day
• Pioneer, then declined, then bounced back
• API offers detailed search functionality
• ~20% geocoded, many more with toponyms
• Potentially rich source of data:
• Title
• Tags
• Description
• But: Bulk uploads (and tagging)
15. GEO-SOCIAL MEDIA SENSORS – SO WHAT'S DIFFERENT?
INTRODUCTION
• Often In-situ
• Rich, pre-processed information but varying level of quality
• Uneven spatio-temporal distribution (stream)
• Redundancy of content and channels (sharing)
• Heterogeneous structure
• Unknown source/lineage
• Unclear / changing licensing, property rights, liability (e.g. OpenStreetMap)
• Unknown/Immeasurable precision, error, completeness
• Uncertainty about the uncertainty!
• How to calibrate? (Should we?)
16. QUALITY OF GEO-SOCIAL MEDIA INFORMATION
INTRODUCTION
Adapted from [2, 3]
Source
Credibility
Relevance
Content
Location
Context
Natural Language Processing
Social Network Analysis
Geographic Contextualization
18. Introduction: Geo-social media APIs as sensors
Where we are: Current state-of-the-art and practical examples from disaster response
Outlook: Future research directions
19. GEO-SOCIAL MEDIA AND CRISIS MANAGEMENT
WHERE WE ARE
Social media offers…                     Crisis management needs…
rich up-to-date information              up-to-date information
new paths of communication               redundant paths of communication
noise, uncertain lineage and accuracy    high-quality and reliable information
Crowd-sourced data curation faces limits of sustainability and scalability
20. HUMANITARIAN OPENSTREETMAP TEAM
INTRODUCTION
• Many activations, most recently after the Nepal earthquake (April 2015)
• Three main communication channels:
• Tasking manager
• E-Mail list
• IRC channel
24. AIDR
WHERE WE ARE
http://irevolution.net/2013/10/01/aidr-artificial-intelligence-for-disaster-response/
25. GEOGRAPHIC CONTEXT ANALYSIS OF VOLUNTEERED INFORMATION (GEOCONAVI)
WHERE WE ARE
1. Deploy a system for using user-generated content (UGC) in crisis decision support on forest fires
2. Assess the added value of using UGC for forest fire response
26. FOREST FIRE CHARACTERISTICS
WHERE WE ARE
• Dynamics require near real-time processing
• Fewer signals, since fires often occur in sparsely populated areas
• Predictability and recurrence facilitate sensor and model calibration
27. GEOCONAVI FIGHTING FOREST FIRES
WHERE WE ARE
Inputs: Twitter Streaming API, Flickr Search API
1.1 Retrieval: scheduled Java code accessing APIs
1.2 Storage: scheduled Java code writing to DBMS (Oracle DBMS)
2.1 Topicality: scheduled PL/SQL job
2.2 Geo-coding: a) scheduled PL/SQL job, b) scheduled Java code
2.3 Geographic context: scheduled PL/SQL job
2.4 Quality assessment: scheduled PL/SQL job
3.1 Spatio-temporal clustering: scheduled Python script calling SaTScan job
3.2 Quality re-assessment: scheduled PL/SQL job
Dissemination: SMS, WFS, WMS, RSS, SES
External data: EFFIS hotspot data, European Media Monitor geo-coding API
28. DATA COLLECTION AND STORAGE
WHERE WE ARE
Flickr API
Twitter Streaming API
Keyword-based:
Domain expertise
Task-oriented
Scheduled scripts
Writing to Oracle DBMS
30. EXAMPLE GEO-SOCIAL MEDIA
WHERE WE ARE
“Back at hotel. Fire skirted round village. Little evidence of significant damage. Helicopters still overhead damping scrub. Beer unaffected”
(Canada BCGovFireInfo): “Important notice from the Reg Dist of Bulkley-Nechako regarding evacuations due to wildfires in the area http://ow.ly/2sBxH”
“Are you a fireman? Cause you’re always there to extinguish the fire inside my heart.”
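All three example messages contain a fire-related term, so a plain keyword query retrieves them all, including the one that has nothing to do with a forest fire. The following is a minimal sketch of such a keyword match, assuming a hypothetical keyword list (`KEYWORDS`) and simple substring matching; the query terms actually used in GeoCONAVI are not given on the slides, and the messages are abridged.

```python
# Hypothetical keyword list; the real GeoCONAVI query terms are not listed on the slides.
KEYWORDS = ("fire", "wildfire", "incendie")

def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Abridged versions of the three example messages.
examples = [
    "Back at hotel. Fire skirted round village. Little evidence of significant damage.",
    "Important notice regarding evacuations due to wildfires in the area",
    "Are you a fireman? Cause you're always there to extinguish the fire inside my heart.",
]

# Every message matches, including the one that is not about a forest fire.
hits = [matches_keywords(message) for message in examples]
```

Substring matching keeps recall high at the cost of precision, which is one reason a separate topicality filter is applied after retrieval.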
31. GEOCONAVI FIGHTING FOREST FIRES
WHERE WE ARE
[Workflow diagram repeated from slide 27]
32. SCORING GEO-SOCIAL MEDIA
WHERE WE ARE
• Sum of weighted scores: $QS(O_j) = \sum_{i=1}^{N} w_i \, s_{ji}$
• with $w_i$ being the weight for criterion $i$, and $s_{ji}$ being the score of geo-social media object $O_j$ on criterion $i$
• Topicality: keyword-based
• Proximity: distance to the nearest concurrently reported hotspot
• Land cover: forest, non-forest, built-up
• Population Density: Risk factor
• Information clusters: Similar messages or lone signal?
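The weighted sum can be sketched in a few lines. The criterion names follow the list above, but the weights and per-criterion scores are hypothetical values chosen for illustration; the values used in GeoCONAVI are not given here.

```python
def quality_score(scores, weights):
    """Weighted sum QS(O_j) = sum_i w_i * s_ji for one geo-social media object."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)

# Hypothetical weights and per-criterion scores (assumptions, not GeoCONAVI values).
weights = {"topicality": 2.0, "proximity": 1.5, "land_cover": 1.0,
           "population_density": 0.5, "cluster": 1.0}
scores = {"topicality": 1, "proximity": 2, "land_cover": 1,
          "population_density": 0, "cluster": -1}

qs = quality_score(scores, weights)  # 2.0 + 3.0 + 1.0 + 0.0 - 1.0
```

Keeping weights and scores in separate mappings makes it easy to re-weight criteria without re-scoring the items.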
33. TOPICALITY MACHINE LEARNING CLASSIFICATION
WHERE WE ARE
1. Manually annotated (Yes/No) a random sample
2. Counted keyword occurrences
3. Used Weka with stratified 10-fold cross-validation to compare
a) Decision trees
b) Naive Bayes
c) Association rules
4. The J48 decision tree performed best
                      Classified as YES    Classified as NO
On Forest Fire        1196                 370
Not on Forest Fire    403                  3712
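Standard evaluation measures can be derived from this confusion matrix. The sketch below assumes that "On Forest Fire" is the positive class; the measures themselves are not reported on the slide.

```python
# Counts taken from the confusion matrix above (J48 decision tree).
tp, fn = 1196, 370    # items on forest fires: classified YES / NO
fp, tn = 403, 3712    # items not on forest fires: classified YES / NO

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of YES classifications that are correct
recall = tp / (tp + fn)      # share of forest-fire items that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under that assumption the classifier reaches roughly 86% accuracy, 75% precision and 76% recall, so about one in four items passed on as relevant is a false positive.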
34. GEOCODING GEO-SOCIAL MEDIA
WHERE WE ARE
Several Geocoders used:
• GISCO/LAU2 brute-force string matching
• European Media Monitor algorithms
• Yahoo! Placemaker (2010)
                               TWITTER                       FLICKR
                               August 2010    August 2011    August 2010    August 2011
Retrieved items                2,904,065      7,996,228      7,991          17,850
Percentage with toponym        35%            27%            53%            50%
Percentage with coordinates    1.1%           0.92%          20%            21%
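Brute-force string matching against a list of place names can be sketched as follows. The three-entry `GAZETTEER` with approximate coordinates is a hypothetical stand-in for the GISCO/LAU2 names, and the whole-word regular expression is an assumption about how such matching might be done, not the GeoCONAVI implementation.

```python
import re

# Tiny illustrative gazetteer (name -> approximate lat, lon); a stand-in for the
# GISCO/LAU2 place names, which are not reproduced here.
GAZETTEER = {
    "Marseille": (43.30, 5.37),
    "Nice": (43.70, 7.27),
    "Ajaccio": (41.92, 8.74),
}

def find_toponyms(text, gazetteer=GAZETTEER):
    """Return (name, (lat, lon)) for every gazetteer name found as a whole word."""
    matches = []
    for name, coords in gazetteer.items():
        # Whole-word, case-insensitive match; no disambiguation is attempted.
        if re.search(r"\b" + re.escape(name) + r"\b", text, flags=re.IGNORECASE):
            matches.append((name, coords))
    return matches

found = find_toponyms("Incendie près de Marseille, fumée visible depuis la plage")
false_positive = find_toponyms("Nice weather for a walk")
```

The second call matches the city of Nice in an English sentence about the weather, which illustrates why plain string matching needs additional disambiguation or context checks.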
35. GEOCONAVI FIGHTING FOREST FIRES
WHERE WE ARE
[Workflow diagram repeated from slide 27]
36. SPATIO-TEMPORAL CLUSTERING
WHERE WE ARE
• SaTScan (external software)
• Scheduled Python script
1. Reads new geo-social media from database
2. Converts it to SaTScan input format
3. Calls SaTScan from the command line with appropriate parameters
4. Waits for SaTScan to complete analysis
5. Reads SaTScan output
6. Stores relevant information in database
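Steps 2 to 4 of such a wrapper script might look like the sketch below. It is not the GeoCONAVI script: the item fields, file names and the space-delimited file layouts are assumptions, and a harmless stand-in command replaces the SaTScan batch executable so that the example runs without SaTScan installed. Database access (steps 1 and 6) and parsing of the results (step 5) are left out.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def write_satscan_inputs(items, workdir):
    """Write a case file and a coordinates file as space-delimited text.

    Assumed layouts: case file "<location id> <cases> <date>",
    coordinates file "<location id> <latitude> <longitude>".
    """
    case_path = Path(workdir) / "items.cas"
    geo_path = Path(workdir) / "items.geo"
    with open(case_path, "w") as cases, open(geo_path, "w") as coords:
        for item in items:
            cases.write(f"{item['id']} 1 {item['date']}\n")  # one case per item
            coords.write(f"{item['id']} {item['lat']} {item['lon']}\n")
    return case_path, geo_path

def run_and_wait(command):
    """Run an external program, block until it exits, and return its standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical items as they might come from the database (step 1).
items = [
    {"id": "t1", "date": "2011/08/01", "lat": 43.30, "lon": 5.37},
    {"id": "f1", "date": "2011/08/02", "lat": 43.31, "lon": 5.40},
]

with tempfile.TemporaryDirectory() as workdir:
    case_path, geo_path = write_satscan_inputs(items, workdir)
    case_lines = case_path.read_text().splitlines()
    # Stand-in command; in practice this would be the SaTScan batch executable
    # followed by the path of a parameter file that points at the input files.
    output = run_and_wait([sys.executable, "-c", "print('analysis finished')"])
```

Because `subprocess.run` blocks until the child process exits, the script waits for the analysis to finish, and `check=True` turns a non-zero exit code into an exception instead of letting the script read stale output.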
37. SPATIO-TEMPORAL CLUSTERING PARAMETERS
WHERE WE ARE
Type of clustering algorithm
Spatial location of clusters based on grid/locations or not
Type of spatial overlap of clusters
Maximum spatial cluster size
Maximum temporal cluster size
Used in 2011: Discrete Poisson model adjusting for population, no grid, no overlap, max radius 50 km, max temporal extent 10% of study period (9 days)
38. GEOCONAVI FIGHTING FOREST FIRES
WHERE WE ARE
[Workflow diagram repeated from slide 27]
42. FRENCH FOREST FIRE SOCIAL MEDIA
WHERE WE ARE
(1) Containing French keywords: 659,676 Tweets and 39,016 Flickr images
(2) Machine-learned relevance filter: 25,684 items left
(3) Geocoded and context enriched: 5,770 items left
(4) Clustered in space and time: 129 clusters with 2,682 items
(5) Second relevance filter: 11 clusters left with 469 items
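The reduction across the five steps is easier to judge as retention rates. The short calculation below uses only the item counts given above; the step labels are shortened paraphrases.

```python
# Item counts per processing step, taken from the French case study above.
funnel = [
    ("keyword retrieval", 659_676 + 39_016),   # Tweets + Flickr images
    ("relevance filter", 25_684),
    ("geocoded and context enriched", 5_770),
    ("clustered in space and time", 2_682),
    ("second relevance filter", 469),
]

initial = funnel[0][1]
retention = []
previous = initial
for step, count in funnel:
    # Share kept relative to the previous step and relative to the initial retrieval.
    retention.append((step, count, count / previous, count / initial))
    previous = count

for step, count, of_previous, of_initial in retention:
    print(f"{step:32s} {count:8d} {of_previous:8.2%} {of_initial:9.4%}")
```

Fewer than one in a thousand of the retrieved items remain in the final clusters, which shows how much of the keyword-based retrieval is filtered out along the way.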
43. GEOCONAVI RESULTS
WHERE WE ARE
• Simple keyword queries suffice
• Additional geo-coding is indispensable
• Topicality and context filtering plus spatio-temporal clustering are crucial
• Fires can be detected from Tweets and Flickr images by spatio-temporal clustering
• Relevance, credibility and overall quality vary greatly, so more rules and human assessment are needed
44. SEMANTICS OF PLACES ACROSS GEO-SOCIAL MEDIA
WHERE WE ARE
Theory-guided research and local case study:
How do people see and understand the places they frequent?
What is different across media sources?
More than one (volunteered) data source
Identification of places and their semantics
Comparison of places between data sources
Comparison of places with geographic features and authoritative data sources
45. SEMANTICS OF PLACES - IMPLEMENTATION
WHERE WE ARE
Shatford-Panofsky and Agnew
Greater London Area
From Twitter to Flickr
Data Mining (Spatio-temporal clustering) -> Semantic Analysis (Cosine Similarity, …)
Geo-demographic data
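Cosine similarity, named above as one of the semantic analysis measures, compares two term-frequency vectors by the angle between them. The sketch below is a generic implementation; the term counts for the two sources are hypothetical and not taken from the London case study.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical term counts for the same place as described in two sources.
twitter_terms = Counter("market food coffee market crowd".split())
flickr_terms = Counter("market architecture food bridge".split())

similarity = cosine_similarity(twitter_terms, flickr_terms)
```

A value close to 1 means the two sources describe the place with similar vocabulary, while a value close to 0 means they share almost no terms.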
50. Introduction: Geo-social media APIs as sensors
Where we are: Current state-of-the-art and practical examples from disaster response
Outlook: Future research directions
51. UNSOLVED PROBLEMS FROM FRENCH CASE STUDY
WHERE WE ARE
Relevant datasets for contextualization
• Choice
• Integration
Settings for data mining and machine learning
• Method
• Parameters
Geospatial Semantic Web
Multi-Sensory Integration
Crowdsourced Supervision
54. HYBRID GEO-INFORMATION PROCESSING
OUTLOOK
Time-consuming and resource-intensive
• Manual annotation and experiments for topicality filtering
• Parameterization of spatio-temporal clustering
Other challenges:
• Dependency on data quality
• Overfitting
• Diversity of contexts and tasks
• Near real-time
Crowdsourced Supervision
55. GEOCONAVI FIGHTING FOREST FIRES
OUTLOOK
[Workflow diagram repeated from slide 27]
56. HYBRID GEO-INFORMATION PROCESSING
OUTLOOK
Developing hybrid quality assurance mechanisms for near real-time geo-information streams
• Link the characteristics of geographic information with machine learning class labelling and regression
• Provide a multi-modal interface to let human oracles simultaneously label instances
• Translate the learner models into nomothetic principles on geographic semantics
57. MACHINE LEARNING FOR GEO-SOCIAL MEDIA
OUTLOOK
Every data instance needs multi-class labelling:
• Content type
• Geographic footprints of locations and/or events
• Distinct event membership
• Credibility based on a combination of the other class labels
Learners have to deal with characteristics of geographic information:
• Spatial autocorrelation
• Vague boundaries and class memberships
• Uncontrolled variance
58. MACHINE LEARNING FOR GEO-SOCIAL MEDIA
OUTLOOK
• Multiple human oracles annotate instances for all model classes
• Responses will modify the
• Learners
• Parameters used for the geographic analysis steps to compute footprints and clusters
• Resulting models indirectly encode the semantic similarity of geographic places and concepts
• Reference to (linked) data repositories such as DBpedia and GeoNames when possible
59. ACTIVE LEARNING
OUTLOOK
• Active learners profit from domain expertise
• Passive learners suited for domain novices
• Learner chooses instances to be labelled and presents them to the human annotator
• Maximize the impact of human annotation
• Learner remains flexible towards new instances
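One common way for a learner to choose instances is uncertainty sampling: it asks for labels on the items it is least sure about. The slides do not name a query strategy, so the sketch below is an assumption for illustration, with hypothetical item ids and predicted probabilities.

```python
def most_uncertain(probabilities, batch_size=1):
    """Pick the unlabelled items whose predicted probability is closest to 0.5.

    `probabilities` maps item id -> predicted probability of the positive class.
    """
    ranked = sorted(probabilities, key=lambda item: abs(probabilities[item] - 0.5))
    return ranked[:batch_size]

# Hypothetical classifier output for four unlabelled items.
predicted = {"item_a": 0.97, "item_b": 0.52, "item_c": 0.10, "item_d": 0.40}

# The two items nearest the decision boundary are sent to the human annotator.
to_label = most_uncertain(predicted, batch_size=2)
```

Items the classifier already scores near 0 or 1 are skipped, so each human annotation is spent where it is expected to change the model most.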
60. EXAMPLE QUERIES
OUTLOOK
Toponym disambiguation:
• “Does this [item] talk about [location A] or [location B], or none, or both?”
Spatial footprint calculation for vague geographies:
• “Is this spatial footprint for [item] correct? If not, is it too large, too small, or wrong shape, or wrong place?”
Spatio-temporal clustering:
• “Does this [item] belong to a cluster named [event] in [location]? If not, what’s wrong: Event, Location, or both?”