Ambiguity & Plausibility: managing the classification quality in Volunteered Geographic Information

Ahmed Loai Ali, Falko Schmid, Rami Al-Salman, Tomi Kauppinen
University of Bremen
Cognitive Systems Research Group

VGI Data
Quality Management
Reliable Services

Navigation
Turn left onto Schwachhauser Ring
Go straight forward through the park
Exit the park to the right onto Am Weidedamm
Cross the lake

Ambiguity & Plausibility
park
garden
recreation
grass

Locality: maintain locality during learning
Filtering: data with sufficient quality for learning

Classification by Tags
Residential
Industrial
Agriculture
Forest
Park
Garden
Playground

Garden: "a distinguishable planned space, usually outdoors, set aside for the display, cultivation, and
enjoyment of plants and other forms of nature. Residential garden is most common, it is generally
found in proximity to a residence, such as the front or back garden."
Grass: "a smaller areas of mown and managed grass for example in the middle of a roundabout, verges
beside a road or in the middle of a dual-carriageway."
Meadow: "a land primarily vegetated by grass plus other non-woody plants."
Park: "an open, green area for recreation, usually municipal. These are outdoor areas, typically grassy
or green areas, set aside of leisure and recreation. Typically open to the public, but may be fenced, and
may be closed; e.g., at night time."

Classifier properties
Classifier learning
Classifier validation
Classifier application

Data from the densest cities, with density $= \frac{\text{population}}{\text{area}}$
Meta-data analysis
Mapper activities analysis
Version edits analysis

A. L. Ali and F. Schmid, Data quality assurance for Volunteered Geographic Information
In Proceedings of the 8th International Conference on Geographic Information Science,
GIScience2014, pages 126-141, 2014

• 9-Intersection Model (9IM)
  • Meet, Overlap and Contains relations
• Assumptions
  • A park usually contains entertainment facilities
  • A "residential" garden often meets "residential" houses
  • Grass meets roads or buildings and rarely contains other objects
  • A meadow likely meets or overlaps with farms or farmland
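The three relations named above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes every entity is simplified to an axis-aligned rectangle with positive area (a hypothetical `Rect` type), whereas real OSM geometries are arbitrary polygons and lines, and it treats containment as closed (a contained rectangle may touch the container's boundary).

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a mapped area (illustrative simplification)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def relation(a: Rect, b: Rect) -> str:
    """Return the relation of a to b: disjoint, meet, equal, contains, within or overlap."""
    # Width and height of the shared region; a negative value means a gap.
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if dx < 0 or dy < 0:
        return "disjoint"
    if dx == 0 or dy == 0:
        return "meet"  # boundaries touch, interiors do not intersect
    if a == b:
        return "equal"
    if a.xmin <= b.xmin and a.ymin <= b.ymin and a.xmax >= b.xmax and a.ymax >= b.ymax:
        return "contains"
    if b.xmin <= a.xmin and b.ymin <= a.ymin and b.xmax >= a.xmax and b.ymax >= a.ymax:
        return "within"
    return "overlap"


# Made-up coordinates mirroring the assumptions listed above.
park, playground = Rect(0, 0, 10, 10), Rect(2, 2, 4, 4)
garden, house = Rect(0, 0, 5, 5), Rect(5, 0, 8, 5)
meadow, farm = Rect(0, 0, 6, 6), Rect(4, 4, 10, 10)

print(relation(park, playground))  # contains
print(relation(garden, house))     # meet
print(relation(meadow, farm))      # overlap
```

Counting how often an entity meets, overlaps or contains neighbours carrying a given key would turn these relations into numeric features for a classifier.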

Frequent keys involved in the topological relations (share per key)
admin_level 1.99%, building 43.79%, amenity 5.15%, wetland 0.10%, surface 0.00%, bicycle 15.37%, barrier 6.40%, historic 0.49%, aerialway 0.01%, tourism 0.71%
man_made 0.00%, covered 0.05%, landuse 22.79%, aeroway 0.05%, power 1.18%, bridge 0.02%, foot 12.33%, wood 1.19%, bridge 1.22%, service 1.71%
intermittent 0.00%, shop 0.14%, natural 6.16%, leisure 10.30%, office 0.00%, religion 0.28%, ref 0.00%, highway 63.40%, tunnel 1.97%, width 0.00%
construction 0.06%, water 0.08%, harbour 0.00%, military 0.00%, sport 2.74%, place 1.07%, name 0.00%, railway 0.00%, waterway 3.64%, brand 0.01%

Keys with the highest shares (each above 2%): building, amenity, bicycle, barrier, landuse, foot, natural, leisure, highway, sport, waterway

Entity features
Size
Topological relations: meet_A, contains_A, overlap_A, meet_L, contains_L, overlap_L
Keys: land use, amenity, building, leisure, sport, highway, waterway, foot, bicycle, barrier, natural
Classes: grass, meadow, garden, park

Classification: eager vs. lazy
K-nearest neighbours (a lazy learner)
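A minimal sketch of K-nearest-neighbours classification follows, to show what "lazy" means here: nothing is fitted in advance, and the training examples are only consulted when a query arrives. The feature vectors and their values are made up for illustration and are not taken from the slides; `knn_predict` is a hypothetical helper name.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training examples closest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already normalised features: (size, meet share, contains share).
train = [
    ((0.90, 0.10, 0.80), "park"),
    ((0.80, 0.20, 0.70), "park"),
    ((0.10, 0.90, 0.00), "garden"),
    ((0.20, 0.80, 0.10), "garden"),
    ((0.05, 0.70, 0.00), "grass"),
]

print(knn_predict(train, (0.85, 0.15, 0.75)))  # park
print(knn_predict(train, (0.15, 0.85, 0.05)))  # garden
```

An entity whose recorded class differs from the predicted one would be a candidate outlier for review rather than an automatic correction.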

Tag-Based Model (TBM)
Land use: grass, meadow
Leisure: garden, park

Label-Based Model (LBM)
grass, meadow, garden, park
Accuracy
Area Under ROC Curve (AUC)
\[ \text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \times 100\,\% \]
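Both measures can be sketched in a few lines. This is an illustration under assumptions, not the evaluation code behind the slides: it covers the binary case only (the slides deal with four classes, where AUC would typically be averaged over classes), and it computes AUC as the share of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy in percent, following the formula above."""
    return 100.0 * (tp + tn) / (tp + fn + fp + tn)


def auc(scores, labels):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each correctly ordered pair counts 1, each tied pair counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up counts and scores for illustration.
print(accuracy(tp=40, tn=45, fp=5, fn=10))          # 85.0
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))      # 0.75
```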

Model                     Country   Accuracy   AUC
Label-Based Model (LBM)   Germany   64.3 %     0.85
Label-Based Model (LBM)   UK        85.1 %     0.93
Tag-Based Model (TBM)     Germany   76.8 %     0.85
Tag-Based Model (TBM)     UK        89.0 %     0.92

[Bar chart: accuracy (0 to 100 %) of LBM and TBM for Germany and the UK]

Evaluation
• Manual checking
• Re-checking
• Empirical study

Manual checking
park, garden, meadow, grass

Re-checking
Outliers detected in December 2013 and share updated by June 2014:
Germany: 6568 entities detected, ≈ 23 % updated
UK: 310 entities detected, ≈ 60 % updated

Empirical study
• Present a sample of entities to participants
• Ask for their opinions on the current classification of the entities
• In case of disagreement with the current class, the participant is asked to provide an appropriate class

157 participants
115 participants completed the study
81 participants gave complete opinions
They represent different cultures
More than 10 mother languages
Various levels of OSM experience
24 with no knowledge, 17 beginners, 21 with moderate knowledge, 19 experts

• To evaluate the results, we used Light's Kappa for m raters
  • 1.0 means maximum agreement
  • 0 or less means agreement no better than chance
  • Values from 0.01 to 1.0 range from slight through fair, moderate, and substantial to almost perfect agreement
• Light's Kappa for all 81 participants was 0.176
  • slight agreement
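Light's Kappa is the mean of Cohen's Kappa over all pairs of raters. The sketch below shows that calculation on made-up ratings; it is not the study's data or code, and the helper names `cohen_kappa` and `lights_kappa` are chosen here for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(r1, r2):
    """Cohen's Kappa for two raters who labelled the same items in the same order."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum(c1[c] * c2[c] for c in c1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class throughout
    return (observed - expected) / (1 - expected)


def lights_kappa(ratings):
    """Mean pairwise Cohen's Kappa over all raters (one list of labels per rater)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Three hypothetical raters classifying four entities.
ratings = [
    ["park", "park", "garden", "grass"],
    ["park", "park", "garden", "grass"],
    ["park", "garden", "garden", "meadow"],
]
print(round(lights_kappa(ratings), 3))  # 0.556
```

With 81 raters the same function would average over 3240 rater pairs, which is why a single low value such as 0.176 summarises widespread disagreement rather than a few dissenting participants.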

Conclusion
• Quality management mechanisms are required for VGI
• Classification is one facet of data quality of VGI
• In the VGI context, classification depends on multiple factors:
  • User perception, locality, and level of expertise
  • Purpose of use
  • Inherent properties
  • The entity's geographic context

Conclusion (continued)
• The classification process has various characteristics:
  • Structured or unstructured
• With vast amounts of data, learning tackles the problem
• Crowdsourced revisions serve to check the detected outliers
• A guided classification mechanism is needed
  • Guidance only, without enforcement
Questions?
contact: loai@informatik.uni-bremen.de
