IEEE IRI 2016: Automatic Geocoding of Web Data Locations

July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Madhav Sharan
msharan@usc.edu
Dr. Chris Mattmann
mattmann@usc.edu
An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data
Information Retrieval
and Data Science

OUTLINE
• Introduction
• Overview of ecosystem
• Data Flow
• GeoTopic Identification
• Identifying location names
• Lucene Geo Gazetteer
• Geonames Features
• Evolution of Ranking Algorithm
• Evaluation and results
• Challenges
• Conclusion and Future Work

Introduction
• Aim: discover geospatial locations in WWW data culled from diverse domains.
• Locations can be present in the text or the metadata of a file.
• We present the overall approach and then explain in detail the Lucene Geo Gazetteer, derived from the Geonames.org dataset.

Overview of ecosystem
• The WWW is home to a rich diversity and large volume of data across many domains and disciplines.
• In MEMEX we have collected web page and textual data from many domains, including Materials Research and Autonomous Robots.
• There is no single field or textual pattern for discerning location information in all domains.

Example dataset
Source: AMD
Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere, Spectral Engineering, Sun-Earth Interactions
Location-related field:
1. Location Keywords field in Metadata
2. Sometimes mentioned in the title and Summary fields
3. Spatial coordinates provided
Notes:
1. Continent level and Geographic Region
2. Island, sometimes specific
3. Multi-values
4. Not missing in Location Keywords
Source: ACADIS
Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere
Location-related field:
1. Location(s) field in Metadata
2. Location mentioned in Description field, often in the first few sentences or last few sentences. Sometimes missing.
3. Longitude and Latitude provided in the Metadata.
4. Sometimes mentioned in the title
Notes:
1. Province/State level or Geographic Region
2. Multi-values
3. Not missing in Location
Table: Locations mentioned in the NSF TREC-DD-Polar dataset

Data Flow
1. Crawled website data
2. Extract text and metadata
3. Discover location entities
4. Find location in gazetteer (Lucene Geo Gazetteer)

Lucene Geo Gazetteer
• Indexes the Geonames.org dataset into a Lucene-based index.
• Uses the indexed Geonames.org dataset to match location names with latitude and longitude.
• Ranks matched results based on certain features.

Geonames Features
• Name - Legal name of a location.
• Alternate names - All other names by which a location is known. Alternate names provide a comma-separated list of all pronunciations and synonyms for a location.
• Feature class - A high-level bucket for similar locations. This bucket distinguishes continents and countries from cities and villages.
• Feature code - A more granular bucket that identifies a location as a country, state, capital or city.
• Population, Latitude, Longitude, Country code, Admin1 code, Admin2 code
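The features above can be read from one line of the Geonames.org tab-separated dump. The following is a minimal sketch, assuming the standard 19-column dump layout; the GeoRecord class, the parse_geonames_line helper and the sample line are hypothetical and are not taken from the Lucene Geo Gazetteer code.

```python
from dataclasses import dataclass


@dataclass
class GeoRecord:
    """Hypothetical container for the Geonames features listed on the slide."""
    name: str
    alternate_names: list[str]
    latitude: float
    longitude: float
    feature_class: str
    feature_code: str
    country_code: str
    admin1_code: str
    admin2_code: str
    population: int


def parse_geonames_line(line: str) -> GeoRecord:
    """Parse one tab-separated line, assuming the 19-column dump layout."""
    cols = line.rstrip("\n").split("\t")
    return GeoRecord(
        name=cols[1],
        # Alternate names are stored as a single comma-separated field.
        alternate_names=[n for n in cols[3].split(",") if n],
        latitude=float(cols[4]),
        longitude=float(cols[5]),
        feature_class=cols[6],
        feature_code=cols[7],
        country_code=cols[8],
        admin1_code=cols[10],
        admin2_code=cols[11],
        population=int(cols[14] or 0),  # an empty field is treated as 0
    )


# Made-up line for illustration only; it is not a real Geonames entry.
SAMPLE_LINE = (
    "0\tExampleville\tExampleville\tExample City,Ex-ville\t10.5\t-20.25\t"
    "P\tPPL\tXX\t\t01\t002\t\t\t12345\t\t15\tEtc/UTC\t2016-01-01"
)
record = parse_geonames_line(SAMPLE_LINE)
print(record.name, record.feature_class, record.feature_code, record.population)
```

Keeping alternate names as a list makes it easy to count them or compare each one against an input phrase later on.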

Evolution of Ranking Algorithm
The naive version:
1. Search for the input phrase in the indexed name and alternate names fields and get the top N results.
2. Calculate the edit distance between the input phrase and each result's name and alternate names.
3. Rank the results by lowest edit distance between the input phrase and the name and alternate names fields.
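Steps 2 and 3 of the naive version can be sketched in plain Python. This is an illustration under assumptions, not the gazetteer's code: the candidates stand in for the top N hits from the index, the comparison is case-sensitive, and each candidate is scored by the smallest distance over its name and alternate names, a combination the slides do not specify. The names edit_distance and naive_rank are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def naive_rank(phrase, candidates):
    """Order (name, alternate_names) pairs by their smallest edit distance.

    Taking the minimum over the name and all alternate names is an
    assumption made for this sketch.
    """
    def best_distance(candidate):
        name, alternates = candidate
        return min(edit_distance(phrase, s) for s in [name, *alternates])

    return sorted(candidates, key=best_distance)


candidates = [
    ("Chino", []),
    ("People's Republic of China", ["Chaina", "China", "Chine"]),
]
ranked = naive_rank("China", candidates)
print(ranked[0][0])  # People's Republic of China
```

Because the ranking depends only on string similarity, a small place with a similar name can outrank the intended location.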

The naive version, results and observations:
• Poor results: only 1 in 50 locations gave the correct result.
• Many locations have similar names. The string “Pasadena” has 250 matches and “Portland” has 897 matches in the dataset.
• Popular places have a greater number of alternate names.

First improvement:
1. Add multiple sorting criteria to the Lucene query, in the order “feature class”, “feature code” and “population”.
2. Fetch the top N results under these criteria.
3. Add the edit distance between the input phrase and the name and all alternate names.
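The sort-and-cutoff part of this improvement (steps 1 and 2) can be sketched with a composite sort key. The sketch uses plain Python rather than a Lucene sort, and the priority tables are assumptions: the slides name the criteria and their order but not how feature classes or codes are ranked against each other. The sample records are made up.

```python
# Assumed priorities (lower sorts first). "A" (administrative) and "P"
# (populated place) are Geonames feature classes; their relative order
# here is an assumption for illustration.
FEATURE_CLASS_PRIORITY = {"A": 0, "P": 1}
FEATURE_CODE_PRIORITY = {
    "PCLI": 0, "ADM1": 1, "ADM2": 2, "PPLC": 3, "PPLA": 4, "PPL": 5,
}


def sort_key(record):
    """Composite key: feature class, then feature code, then population."""
    return (
        FEATURE_CLASS_PRIORITY.get(record["feature_class"],
                                   len(FEATURE_CLASS_PRIORITY)),
        FEATURE_CODE_PRIORITY.get(record["feature_code"],
                                  len(FEATURE_CODE_PRIORITY)),
        -record["population"],  # larger populations sort first
    )


def top_n(records, n):
    """Return the first n records under the composite ordering."""
    return sorted(records, key=sort_key)[:n]


# Made-up records for illustration only.
records = [
    {"name": "Portland (city)", "feature_class": "P",
     "feature_code": "PPL", "population": 600000},
    {"name": "Portland (county)", "feature_class": "A",
     "feature_code": "ADM2", "population": 50000},
    {"name": "Portland (park)", "feature_class": "L",
     "feature_code": "PRK", "population": 0},
]
print([r["name"] for r in top_n(records, 2)])
```

With feature class as the first criterion, the county outranks the much larger city in this toy data, which shows how strongly the ordering of the criteria shapes the result.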

Observations:
• The name should carry more weight, as it is the legal name of a location. E.g. China is stored with the name “People’s Republic of China”, and its alternate names include “... Cayina,Chaina,China,Chine, ..”
• We cannot rely on edit distance alone.
• Edit distance between “Los Cabos” and “Los Angeles” == 6
• Edit distance between “Los Angeles County” and “Los Angeles” == 7
• Results sorted only by population are very relevant, and query time is improved.

One more try:
Pass 1
1. Get the top N * 3 results for the input phrase from the Lucene index, sorted by population.
2. Sort the results by feature code.
3. Pass on the top N results.
Pass 2
1. Assign a score if the input phrase exactly matches a token in the name, e.g. “Los Angeles” in “Los Angeles County”.
2. Assign a lesser score if the input phrase matches only part of a token in the name, e.g. “India” in “Indiana”.
3. Calculate the edit distance between the input phrase and all alternate names.
4. Assign a weight as a function of the count of alternate names.
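The two passes can be sketched as follows. The slides do not give the score values, the weighting function or how the signals are combined, so every constant and the final formula below are assumptions chosen only to make the sketch runnable. The hits stand in for Lucene matches of the input phrase, and all function names are hypothetical.

```python
import math

# All constants below are assumed values, not taken from the slides.
EXACT_TOKEN_SCORE = 2.0
PARTIAL_TOKEN_SCORE = 1.0
ALT_COUNT_WEIGHT = 0.5
DISTANCE_PENALTY = 0.1
CODE_PRIORITY = {"PCLI": 0, "ADM1": 1, "ADM2": 2, "PPLC": 3, "PPLA": 4, "PPL": 5}


def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def token_score(phrase, name):
    """Full score for a whole-token match, lesser score for a partial one."""
    p, n = phrase.lower().split(), name.lower().split()
    if not p:
        return 0.0
    for i in range(len(n) - len(p) + 1):
        if n[i:i + len(p)] == p:  # phrase words appear as whole tokens
            return EXACT_TOKEN_SCORE
    if phrase.lower() in name.lower():  # e.g. "India" inside "Indiana"
        return PARTIAL_TOKEN_SCORE
    return 0.0


def pass_one(hits, n):
    """Top N * 3 by population, re-sorted by feature code, cut to N."""
    by_population = sorted(hits, key=lambda h: -h["population"])[: n * 3]
    by_code = sorted(
        by_population,
        key=lambda h: CODE_PRIORITY.get(h["feature_code"], len(CODE_PRIORITY)),
    )
    return by_code[:n]


def pass_two(phrase, hits):
    """Combine the token score, alternate-name count and edit distance."""
    def score(hit):
        alts = hit["alternate_names"]
        closest = min((edit_distance(phrase, a) for a in alts),
                      default=len(phrase))
        return (token_score(phrase, hit["name"])
                + ALT_COUNT_WEIGHT * math.log1p(len(alts))
                - DISTANCE_PENALTY * closest)

    return sorted(hits, key=score, reverse=True)


# Made-up hits for illustration only.
hits = [
    {"name": "Pasadena Hills", "alternate_names": [],
     "feature_code": "PPL", "population": 1000},
    {"name": "Pasadena", "alternate_names": ["Pasadena", "Pasadena City"],
     "feature_code": "PPL", "population": 140000},
]
best = pass_two("Pasadena", pass_one(hits, 2))[0]
print(best["name"])  # Pasadena
```

The logarithm of the alternate-name count is one possible choice for the "function of count" in step 4; it rewards well-known places without letting very long alias lists dominate the score.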

Evaluation and Results
We tested our improved algorithm on a diverse set of 50,000+ locations, including countries, states, districts, cities, towns and villages, from the DARPA MEMEX and NSF Polar domains. Comparison against our ground truth data gave an overall accuracy of 94%.

Challenges
• Limitations of the open source Geonames.org dataset. More features would help:
  • Popular name, e.g. “Argentina” for “Argentine Republic”
  • Area of a location
• There is often no single best result.
  • Pasadena is in both California and Texas
• The dataset is huge and diverse, with entries ranging from parks to continents. Performance improves if the dataset is reduced to, e.g., continents, countries, states, counties, capitals and cities.

Conclusion and Future Work
• Early results from this effort are promising, but a number of avenues for future work remain unexplored.
• Automatically augment Geonames.org with crowdsourced popular place names.
• Support geospatial coordinate data beyond points, e.g. bounding boxes and other shapes, to allow for more meaningful and spatially directed queries.

ACKNOWLEDGMENTS
• This work was supported by the DARPA XDATA/Memex
program.
• NSF Polar Cyberinfrastructure award numbers PLR-1348450 and PLR-144562 funded a portion of the work.
• Effort supported in part by JPL, managed by the California
Institute of Technology on behalf of NASA.

• Lucene Geo Gazetteer: https://github.com/chrismattmann/lucene-geo-gazetteer
• Geonames dataset: https://www.geonames.org
• Apache Tika: https://tika.apache.org/
• Apache OpenNLP: https://opennlp.apache.org/
• Apache Lucene: http://lucene.apache.org
• Memex: http://memex.jpl.nasa.gov/
THANK YOU
More about us on GitHub:
Madhav Sharan: smadha
Dr. Chris Mattmann: chrismattmann
