The document discusses using crowdsourced supervision to improve machine learning models for analyzing geospatial data from social media. It provides examples of existing systems that calibrate social media data for tasks like detecting crisis events. The author argues that fully automating calibration is challenging due to data quality issues and context dependencies. The proposed approach involves developing hybrid quality assurance mechanisms that link geospatial data characteristics to machine learning models. Crowdsourced annotation would help parameterize analysis steps and provide training data. Active learning strategies are suggested to maximize the impact of human labeling.
Introduction: Why machine learning and why crowdsource supervision?
State of the art: Practical examples
Opportunities and challenges: Future research directions
11.06.2015F.O.Ostermann - 18th AGILE Conference 2
HYBRID GEO-INFORMATION PROCESSING
CROWDSOURCED SUPERVISION OF GEO-SPATIAL MACHINE LEARNING TASKS
NEW SOURCES OF GEO-INFORMATION
GEO-SOCIAL MEDIA AS SENSORS
Typology by explicitness of participation and of geography:
• Explicit participation, explicit geography: Volunteered Geographic
Information (VGI), e.g. OpenStreetMap
• Explicit participation, implicit geography: Volunteered Geographic
Content (VGC), e.g. Wikipedia articles on non-geographic topics
containing place names, Foursquare
• Implicit participation, explicit geography: Contributed / Ambient
Geographic Information (CGI/AGI), e.g. public Tweets referring to the
properties of an identifiable place
• Implicit participation, implicit geography: User-Generated Geographic
Content (UGGC), e.g. public Flickr images containing a place name or
being georeferenced
GEO-SOCIAL MEDIA SENSORS - WHAT’S DIFFERENT?
GEO-SOCIAL MEDIA AS SENSORS
• Rich, pre-processed information
• Uneven distribution
• Heterogeneous level of quality
• Varying but high update frequency (stream)
• Redundancy of content and channels (sharing)
• Heterogeneous structure
• Unknown source/lineage
• Unclear / changing licensing, property rights, liability
• Unknown/Immeasurable precision, error, completeness
CALIBRATING GEO-SOCIAL SENSORS
GEO-SOCIAL MEDIA AS SENSORS
How to calibrate? (Should we?)
So far:
• Crowdsourced curation
• Post-hoc analysis
Crowdsourced curation problems
• Scalability
• Sustainability
Automate curation!
FRENCH FOREST FIRE SOCIAL MEDIA
GEOCONAVI
(1) Containing French keywords: 659,676 Tweets and 39,016 Flickr images
(2) Machine-learned relevance filter: 25,684 items left
(3) Geocoded and context enriched: 5,770 items left
(4) Clustered in space and time: 129 clusters with 2,682 items
(5) Second relevance filter: 11 clusters left with 469 items
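The reduction at each stage of this funnel can be checked with a few lines of arithmetic. The sketch below uses only the counts shown on the slide. The stage labels and variable names are shorthand chosen for illustration, and Tweets and Flickr images are simply added together as "items", which is an assumption about how the later counts were formed.

```python
# Item counts per processing stage, taken from the slide.
# Stage labels are shorthand chosen for this sketch.
stages = [
    ("keyword match", 659_676 + 39_016),   # Tweets + Flickr images
    ("relevance filter", 25_684),
    ("geocoded and enriched", 5_770),
    ("clustered", 2_682),                  # items inside 129 clusters
    ("second relevance filter", 469),      # items inside 11 clusters
]


def retention(stages):
    """Return (label, count, share of previous stage, share of first stage)."""
    rows = []
    first = stages[0][1]
    previous = first
    for label, count in stages:
        rows.append((label, count, count / previous, count / first))
        previous = count
    return rows


funnel = retention(stages)
for label, count, step_share, total_share in funnel:
    print(f"{label:<26}{count:>8}  {step_share:7.2%} of previous  "
          f"{total_share:8.4%} of input")
```

Under these counts the first relevance filter keeps roughly 3.7% of the keyword-matched items, and the final set is well below 0.1% of the input.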
GEOCONAVI FIGHTING FOREST FIRES
TOPICALITY MACHINE LEARNING CLASSIFICATION
1. Manually annotated (Yes/No) random sample
2. Counted keyword occurrences
3. Used Weka 10-fold stratified cross validation with
a) Decision trees
b) Naive Bayes
c) Association Rules
4. J48 Decision Tree works best
                     Classified as YES   Classified as NO
On Forest Fire                    1196                370
Not on Forest Fire                 403               3712
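The usual summary metrics follow directly from this confusion matrix. The sketch below assumes that "On Forest Fire" is the positive class and that rows are the manual annotation while columns are the classifier output. The function and variable names are illustrative.

```python
# Confusion matrix from the slide (rows: annotated class, columns: predicted class).
tp, fn = 1196, 370    # on forest fire: classified YES / NO
fp, tn = 403, 3712    # not on forest fire: classified YES / NO


def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


metrics = classification_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is about 0.864, precision about 0.748 and recall about 0.764. Only roughly 28% of the annotated sample is on topic, so precision and recall say more about the filter than accuracy does.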
GEOCONAVI FIGHTING FOREST FIRES
GEOCONAVI
External data and services
• Twitter Streaming API
• Flickr Search API
• EFFIS Hotspot Data
• European Media Monitor Geo-coding API
Database
• Oracle DBMS
Processing steps
1.1 Retrieval: Scheduled Java code accessing APIs
1.2 Storage: Scheduled Java code writing to DBMS
2.1 Topicality: Scheduled PL/SQL job
2.2 Geo-Coding: a) Scheduled PL/SQL job, b) Scheduled Java code
2.3 Geographic context: Scheduled PL/SQL job
2.4 Quality Assessment: Scheduled PL/SQL job
3.1 Spatio-temporal clustering: Scheduled Python script calling SaTScan job
3.2 Quality Re-Assessment: Scheduled PL/SQL job
Dissemination
• SMS, WFS, WMS, RSS, SES
HYBRID GEO-INFORMATION PROCESSING
WHY THE EFFORT?
Time-consuming and resource-intensive
• Manual annotation and experiments for topicality filtering
• Parameterization of spatio-temporal clustering
Other challenges:
• Dependency on data quality
• Overfitting
• Diversity of contexts and tasks
• Near real-time
Crowdsourced Supervision
HYBRID GEO-INFORMATION PROCESSING
RESEARCH QUESTIONS
Developing hybrid quality assurance mechanisms for near real-
time geo-information streams
• Link the characteristics of geographic information with machine
learning class labelling and regression
• Provide a multi-modal interface to let human oracles simultaneously
label instances
• Translate the learner models into nomothetic principles on
geographic semantics
MACHINE LEARNING FOR GEO-SOCIAL MEDIA
LINKING GEOINFORMATION WITH MACHINE LEARNERS
Every UGGC instance needs multi-class labelling:
• Content type
• Geographic footprints of locations and/or events
• Distinct event membership
• Credibility based on a combination of the other class labels
Learners have to deal with characteristics of geographic information:
• Spatial autocorrelation
• Vague boundaries and class memberships
• Uncontrolled variance
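Spatial autocorrelation matters for learners because nearby instances are not independent, which undermines the usual assumptions behind random train/test splits. One standard way to quantify it is global Moran's I. The sketch below is a minimal version with binary distance-band weights. It assumes planar coordinates (real data would need a projection or great-circle distances), and the points, values and threshold are hypothetical.

```python
from math import dist


def morans_i(points, values, threshold):
    """Global Moran's I with binary weights: w_ij = 1 if i != j and d(i, j) <= threshold."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_sum = 0.0   # sum over i, j of w_ij * dev_i * dev_j
    weight_total = 0.0   # sum of all w_ij
    for i in range(n):
        for j in range(n):
            if i != j and dist(points[i], points[j]) <= threshold:
                weighted_sum += dev[i] * dev[j]
                weight_total += 1.0
    variance_sum = sum(d * d for d in dev)
    if weight_total == 0 or variance_sum == 0:
        raise ValueError("Moran's I is undefined without neighbours or variance")
    return (n / weight_total) * (weighted_sum / variance_sum)


# Hypothetical items along a line, one unit apart, each with a score
# (for example a classifier's relevance probability).
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
smooth = morans_i(points, [1, 2, 3, 4], threshold=1.0)         # neighbours are similar
alternating = morans_i(points, [1, -1, 1, -1], threshold=1.0)  # neighbours differ
print(smooth, alternating)
```

Positive values indicate that neighbouring items carry similar values, negative values that they alternate. A learner evaluated on spatially clustered data can look better than it is when neighbours of training items end up in the test fold.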
MACHINE LEARNING FOR GEO-SOCIAL MEDIA
HUMAN ORACLES AND GEO-SEMANTICS
• Multiple human oracles annotate instances for all model classes
• Responses will modify both the learners and the parameters used for
the geographic analysis steps to compute footprints and clusters
• Resulting models indirectly encode the semantic similarity of
geographic places and concepts
• Reference to (linked) data repositories such as DBpedia and
GeoNames when possible.
ACTIVE LEARNING
BASIC CONSIDERATIONS
• Learner chooses instances to be labelled and presents them to the
human annotator
• Maximize the impact of human annotation
• Learner remains flexible towards new instances
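The loop behind these considerations can be sketched with a deliberately simple learner. The example below is a pool-based variant with uncertainty sampling on a single numeric feature. The threshold learner, the simulated oracle and all names are assumptions made for illustration, not the system described in the talk.

```python
def fit_threshold(labelled):
    """Toy learner: threshold halfway between the class means of a 1-D feature.

    Requires at least one positive and one negative example.
    """
    pos = [x for x, y in labelled if y]
    neg = [x for x, y in labelled if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def most_uncertain(pool, threshold):
    """Uncertainty sampling: the unlabelled value closest to the decision boundary."""
    return min(pool, key=lambda x: abs(x - threshold))


def active_learning(seed, pool, oracle, budget):
    """Let the learner pick `budget` instances from the pool and ask the oracle."""
    labelled = list(seed)
    pool = list(pool)
    for _ in range(budget):
        if not pool:
            break
        threshold = fit_threshold(labelled)
        query = most_uncertain(pool, threshold)
        pool.remove(query)
        labelled.append((query, oracle(query)))  # a human annotator in practice
    return fit_threshold(labelled), labelled


# Hypothetical setup: the feature is a keyword count, and the simulated oracle
# treats items with 4 or more keyword hits as on topic.
def oracle(x):
    return x >= 4


seed = [(0, False), (10, True)]        # seed set covers both classes
pool = [1, 2, 3, 4, 5, 6, 7, 8, 9]
threshold, labelled = active_learning(seed, pool, oracle, budget=3)
print(threshold, labelled[2:])
```

In this run the learner spends its three queries on the values 5, 4 and 3, the ones nearest the boundary, rather than on items it can already classify confidently.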
ACTIVE LEARNING FOR GEO-SOCIAL MEDIA
ADVANCED CONSIDERATIONS
• Active learners profit from domain expertise
• Passive learners suited for domain novices
• Initial training set should be representative with respect to the
classes that the learning process is to handle; omitting classes from
the initial seed set might result in trouble further down the road
• Batch-mode better suited to multiple, parallel annotators
• Learning costs positively related to labeling informativeness
• Crowdsourced labeling might require repeated labeling to de-noise
existing training instances
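Repeated labeling can be sketched as a majority vote with an agreement threshold: instances where the crowd disagrees too much go back into the labeling queue. The threshold value, the function name and the sample answers below are hypothetical.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Majority vote per instance; flag instances whose agreement is too low.

    annotations: dict mapping instance id -> list of labels from different annotators.
    Returns (consensus, relabel): consensus maps id -> winning label,
    relabel lists the ids that should be sent out for labeling again.
    """
    consensus, relabel = {}, []
    for item_id, labels in annotations.items():
        if not labels:
            relabel.append(item_id)
            continue
        counts = Counter(labels).most_common()
        label, votes = counts[0]
        tied = len(counts) > 1 and counts[1][1] == votes
        if tied or votes / len(labels) < min_agreement:
            relabel.append(item_id)
        else:
            consensus[item_id] = label
    return consensus, relabel


# Hypothetical crowd answers to "Is this item about a forest fire?"
annotations = {
    "tweet-1": ["yes", "yes", "yes"],
    "tweet-2": ["yes", "no", "yes"],
    "tweet-3": ["yes", "no"],
    "tweet-4": ["no", "no", "yes", "yes", "no"],
}
consensus, relabel = aggregate_labels(annotations)
print(consensus)
print(relabel)
```

Here only "tweet-3" is tied and returned for another round. A plain majority vote treats all annotators as equally reliable, which is a simplification.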
ACTIVE LEARNING FOR GEO-SOCIAL MEDIA
QUERY STRATEGIES AND TYPES
• Stream-based selective sampling: Learner samples instance and
decides to query it or not; well-established e.g. for word sense
disambiguation.
• Density-weighted margin-based uncertainty sampling: Avoids
choice of outliers which have high uncertainty but will not improve a
model's performance; well-established for classification tasks
• Membership (is this concept an example of the target concept)
• Equivalence (is x equivalent to y)
• Disjointness (are x and y disjoint)
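The two query strategies can be sketched as scoring functions. The example below uses one common formulation of density weighting, margin uncertainty multiplied by the instance's average similarity to the rest of the pool, plus a thresholded decision for the stream-based case. The similarity measure, the exponent `beta`, the threshold and the sample data are assumptions for illustration.

```python
from math import dist


def margin_uncertainty(probs):
    """1 minus the gap between the two most probable classes (1.0 = most uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def density(index, features):
    """Mean similarity of one instance to all others, with similarity = 1 / (1 + distance)."""
    others = [f for k, f in enumerate(features) if k != index]
    return sum(1.0 / (1.0 + dist(features[index], f)) for f in others) / len(others)


def density_weighted_query(features, class_probs, beta=1.0):
    """Index of the pool instance maximising uncertainty * density ** beta."""
    scores = [
        margin_uncertainty(p) * density(i, features) ** beta
        for i, p in enumerate(class_probs)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def stream_query(probs, threshold=0.8):
    """Stream-based selective sampling: query an arriving instance only if uncertain enough."""
    return margin_uncertainty(probs) >= threshold


# Hypothetical pool: three items in a dense region and one far-away outlier.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (9.0, 9.0)]
class_probs = [(0.9, 0.1), (0.6, 0.4), (0.7, 0.3), (0.5, 0.5)]
chosen = density_weighted_query(features, class_probs)
print(chosen)  # picks index 1, not the outlier at index 3
print(stream_query((0.55, 0.45)), stream_query((0.95, 0.05)))
```

The outlier has the highest raw uncertainty, but its low density pushes its score below that of the uncertain item inside the dense region, which is the behaviour the density-weighted strategy is meant to produce.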
ACTIVE LEARNING FOR GEO-SOCIAL MEDIA
QUERIES
Toponym disambiguation:
• “Does this [item] talk about [location A] or [location B], or none, or
both?”
Spatial footprint calculation for vague geographies:
• “Is this spatial footprint for [item] correct? If not, is it too large, too
small, or wrong shape, or wrong place?”
Spatio-temporal clustering:
• “Does this [item] belong to a cluster named [event] in [location]? If
not, what’s wrong: Event, Location, or both?”
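These questions are closed queries, so each one can be represented as a template with a fixed answer set that a learner can consume directly. The sketch below shows one possible representation. The class, the function names and the exact answer labels are hypothetical and only paraphrase the questions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OracleQuery:
    """A closed question shown to a human annotator, with its allowed answers."""
    task: str
    question: str
    answers: tuple

    def validate(self, response):
        """Return the response if it is one of the allowed answers, else raise."""
        if response not in self.answers:
            raise ValueError(f"{response!r} is not one of {self.answers}")
        return response


def toponym_query(item, location_a, location_b):
    return OracleQuery(
        task="toponym disambiguation",
        question=f"Does this {item} talk about {location_a} or {location_b}, "
                 "or none, or both?",
        answers=(location_a, location_b, "none", "both"),
    )


def footprint_query(item):
    return OracleQuery(
        task="spatial footprint",
        question=f"Is this spatial footprint for {item} correct?",
        answers=("correct", "too large", "too small", "wrong shape", "wrong place"),
    )


def cluster_query(item, event, location):
    return OracleQuery(
        task="spatio-temporal clustering",
        question=f"Does this {item} belong to a cluster named {event} in {location}?",
        answers=("yes", "wrong event", "wrong location", "both wrong"),
    )


query = toponym_query("Tweet", "Paris, France", "Paris, Texas")
print(query.question)
print(query.validate("none"))
```

Keeping the answer set closed means every response maps to a label or a parameter adjustment without free-text interpretation.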
HYBRID GEO-INFORMATION PROCESSING
FUTURE STRATEGIES
Two future implementation strategies
• Extension of AIDR with GeoCONAVI functionality
• Extension of GeoCONAVI with facilities to crowdsource the
supervision of machine learning tasks and the parameterization of
analysis functions
“The next step will be to decide on a concrete strategy, followed by a
step-wise, iterative implementation and testing of the geoprocessing
tasks described in this paper.”
CHALLENGES AND OPPORTUNITIES OF GEO-SOCIAL
MEDIA
EARTH OBSERVATION WITH UNCALIBRATED IN-SITU SENSORS
Thank you!
f.o.ostermann@utwente.nl
@f_ostermann
nl.linkedin.com/in/foost