July 28-30, 2016; IEEE IRI, Pittsburgh, PA, USA
Madhav Sharan
msharan@usc.edu
Dr. Chris Mattmann
mattmann@usc.edu
An Automatic Approach for Discovering and Geocoding Locations in Domain-Specific Web Data
Information Retrieval
and Data Science
OUTLINE
• Introduction
• Overview of ecosystem
• Data Flow
• GeoTopic Identification
• Identifying location names
• Lucene Geo Gazetteer
• Geonames Features
• Evolution of Ranking Algorithm
• Evaluation and results
• Challenges
• Conclusion and Future Work
Introduction
• AIM – Discover geospatial locations in WWW data culled from diverse domains
• Locations can be present in the text or the metadata of a file
• We present the overall approach and then explain in detail the Lucene Geo Gazetteer, derived from the Geonames.org dataset
Overview of ecosystem
• The WWW is home to a rich diversity and large volumes of data across many domains and disciplines
• In MEMEX we have collected web pages and textual data from many domains, including Materials Research and Autonomous Robots
• There is no single field or textual pattern for discerning location information in all domains
Example dataset
Source: AMD
Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere, Spectral Engineering, Sun-Earth Interactions
Location-related field:
1. Location Keywords field in Metadata
2. Sometimes mentioned in the title and Summary fields
3. Spatial coordinates provided
Notes:
1. Continent level and Geographic Region
2. Island, sometimes specific
3. Multi-values
4. Not missing in Location Keywords

Source: ACADIS
Science Topic: Agriculture, Atmosphere, Biological Classification, Biosphere, Climate Indicators, Cryosphere, Human Dimensions, Land Surface, Oceans, Paleoclimate, Solid Earth, Terrestrial Hydrosphere
Location-related field:
1. Location(s) field in Metadata
2. Location mentioned in Description field, often in the first few sentences or last few sentences. Sometimes missing.
3. Longitude and Latitude provided in the Metadata.
4. Sometimes mentioned in the title
Notes:
1. Province/State level or Geographic Region
2. Multi-values
3. Not missing in Location

Locations mentioned in the NSF TREC-DD-Polar dataset
Data Flow
1. Crawled website data
2. Extract text and metadata
3. Discover location entities
4. Find location in gazetteer (Lucene Geo Gazetteer)
Lucene Geo Gazetteer
• Indexes the Geonames.org dataset into a Lucene-based index
• Uses the indexed Geonames.org dataset to match location names with latitude and longitude
• Ranks matched results based on certain features
Geonames Features
• Name - Legal name of a location.
• Alternate names - All other names by which a location is known. Alternate names provide a comma-separated list of all pronunciations and synonyms for a location.
• Feature class - A high-level bucket for similar locations. This bucket distinguishes continents and countries from cities and villages.
• Feature code - A more granular bucket that identifies a location as a country, state, capital, or city.
• Population, Latitude, Longitude, Country code, Admin1 code, Admin2 code
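As a rough illustration of how these features might be read from the dataset, the sketch below parses one tab-separated line into a record. It assumes the 19-column layout of the main Geonames dump; the `GeoRecord` class, the `parse_geonames_line` helper and the sample line (including its id, coordinates and population) are hypothetical and not taken from the Lucene Geo Gazetteer code or from the real data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeoRecord:
    """Subset of Geonames columns used as ranking features (hypothetical type)."""
    name: str
    alternate_names: List[str]
    latitude: float
    longitude: float
    feature_class: str
    feature_code: str
    country_code: str
    admin1_code: str
    admin2_code: str
    population: int


def parse_geonames_line(line: str) -> GeoRecord:
    """Split one tab-separated dump line; column positions are an assumption."""
    cols = line.rstrip("\n").split("\t")
    return GeoRecord(
        name=cols[1],
        # Alternate names arrive as one comma-separated string.
        alternate_names=[a for a in cols[3].split(",") if a],
        latitude=float(cols[4]),
        longitude=float(cols[5]),
        feature_class=cols[6],
        feature_code=cols[7],
        country_code=cols[8],
        admin1_code=cols[10],
        admin2_code=cols[11],
        population=int(cols[14] or 0),
    )


# Illustrative line with placeholder values, not copied from the dataset.
SAMPLE_LINE = "\t".join([
    "0", "Pasadena", "Pasadena", "Pasadina,Pasadena CA", "34.15", "-118.14",
    "P", "PPL", "US", "", "CA", "037", "", "", "140000", "", "",
    "America/Los_Angeles", "2016-01-01",
])

record = parse_geonames_line(SAMPLE_LINE)
print(record.name, record.feature_class, record.feature_code, record.population)
```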
Evolution of Ranking Algorithm
The naive version –
1. Search for the input phrase in the name and alternate name fields stored in the index and get the top N results.
2. Calculate the edit distance between the input phrase and each of the name and alternate name fields.
3. Rank results by lowest edit distance between the input phrase and the name and alternate name fields.
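The naive steps can be sketched as follows. This is a minimal sketch, not the gazetteer's implementation: the Lucene search of step 1 is replaced by a small in-memory list, the candidate records are hypothetical, and `edit_distance` is a standard Levenshtein distance, which is an assumption about the metric used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def naive_rank(phrase, candidates, top_n=5):
    """Steps 2 and 3: rank by the lowest edit distance to any stored name."""
    def best_distance(candidate):
        names = [candidate["name"], *candidate["alternate_names"]]
        return min(edit_distance(phrase, n) for n in names)
    return sorted(candidates, key=best_distance)[:top_n]


# Stand-in for the top N hits of step 1 (hypothetical records).
CANDIDATES = [
    {"name": "Chino", "alternate_names": []},
    {"name": "People's Republic of China", "alternate_names": ["China", "Chine"]},
]

for hit in naive_rank("China", CANDIDATES):
    print(hit["name"])
```

In this toy case the alternate name "China" gives a distance of 0, so the country ranks first. With hundreds of similarly named places in the real index, distance alone has little to separate them.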
Evolution of Ranking Algorithm
The naive version: results and observations –
• Poor results. Only 1 in 50 locations gave the correct result.
• There are many locations with similar names. The string “Pasadena” has 250 matches and “Portland” has 897 matches in the dataset.
• Popular places have a greater number of alternate names.
Evolution of Ranking Algorithm
First improvement –
1. Add multiple sorting criteria to the Lucene query, in the order “feature class”, “feature code” and “population”.
2. Fetch the top N results using these criteria.
3. Add the edit distance between the input phrase and the name and all alternate names.
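A minimal in-memory sketch of the multi-criteria ordering in steps 1 and 2 is shown below. The feature classes and codes are real Geonames values, but the priority tables, the `sort_key` and `top_candidates` helpers and the sample records are assumptions made for illustration; the slides do not give the actual priority order, and in the gazetteer the sort would be part of the Lucene query rather than Python code.

```python
# Assumed priorities: a lower number sorts first. Unknown values sort last.
CLASS_PRIORITY = {"A": 0, "P": 1}
CODE_PRIORITY = {"PCLI": 0, "ADM1": 1, "PPLC": 2, "PPLA": 3, "ADM2": 4, "PPL": 5}


def sort_key(record):
    """Order by feature class, then feature code, then population (descending)."""
    return (
        CLASS_PRIORITY.get(record["feature_class"], len(CLASS_PRIORITY)),
        CODE_PRIORITY.get(record["feature_code"], len(CODE_PRIORITY)),
        -record["population"],
    )


def top_candidates(records, n):
    """Return the first n records under the combined ordering."""
    return sorted(records, key=sort_key)[:n]


# Hypothetical matches for one input phrase; populations are placeholders.
RECORDS = [
    {"name": "Portland (small town)", "feature_class": "P", "feature_code": "PPL", "population": 1000},
    {"name": "Portland (large city)", "feature_class": "P", "feature_code": "PPL", "population": 500000},
    {"name": "Portland (admin area)", "feature_class": "A", "feature_code": "ADM2", "population": 20000},
]

for rec in top_candidates(RECORDS, 2):
    print(rec["name"])
```

Under these assumed priorities the administrative area outranks the much larger city, because feature class is compared before population.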
Evolution of Ranking Algorithm
Observations –
• More weightage to name, as it is the legal name of a location, e.g. China is stored with the name “People’s Republic of China” and its alternate names include “... Cayina,Chaina,China,Chine, ..”
• We cannot rely on edit distance alone.
• Edit distance between “Los Cabos” and “Los Angeles” == 6
• Edit distance between “Los Angeles County” and “Los Angeles” == 7
• Results sorted only by population are very relevant, and query time is improved.
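The two distances quoted above can be checked with a standard Levenshtein implementation, sketched below on the assumption that this is the edit distance meant. The `edit_distance` helper is illustrative and not taken from the gazetteer code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


query = "Los Angeles"
names = ["Los Angeles County", "Los Cabos"]

# Sorting by distance alone puts the unrelated place first.
for name in sorted(names, key=lambda n: edit_distance(n, query)):
    print(name, edit_distance(name, query))
```

"Los Cabos" comes out at 6 and "Los Angeles County" at 7, so a ranking on edit distance alone prefers the unrelated place over the one that contains the query.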
Evolution of Ranking Algorithm
One more try –
Pass 1
1. Get the top N * 3 results from the Lucene index, sorted by population, for the input phrase.
2. Sort the results by feature code.
3. Pass on the top N results.
Pass 2
1. Assign a score if the input phrase is exactly a token in the name, e.g. “Los Angeles” in “Los Angeles County”.
2. Assign a lesser score if the input phrase is only part of a token in the name, e.g. “India” in “Indiana”.
3. Calculate the edit distance of the input phrase with all alternate names.
4. Assign a weight as a function of the count of alternate names.
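The two passes might be combined along the following lines. This is a sketch under stated assumptions: the slides give no score values, weighting function or feature-code order, so the constants 2.0 and 1.0, the `1 / (1 + distance)` term, the logarithmic weight on the number of alternate names and the `CODE_PRIORITY` table are all hypothetical, as are the sample records. The Lucene query of Pass 1 is replaced by an in-memory list.

```python
import math

# Assumed feature-code order: a lower number sorts first.
CODE_PRIORITY = {"PCLI": 0, "ADM1": 1, "PPLC": 2, "PPLA": 3, "ADM2": 4, "PPL": 5}


def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pass_one(records, n):
    """Top N * 3 by population, re-sorted by feature code, then cut to N."""
    by_population = sorted(records, key=lambda r: -r["population"])[: n * 3]
    by_code = sorted(
        by_population,
        key=lambda r: CODE_PRIORITY.get(r["feature_code"], len(CODE_PRIORITY)),
    )
    return by_code[:n]


def contains_tokens(phrase, name):
    """True if the phrase's tokens appear as consecutive whole tokens of name."""
    p, t = phrase.lower().split(), name.lower().split()
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


def pass_two_score(phrase, record):
    """Combine token match, alternate-name distance and alternate-name count."""
    score = 0.0
    if contains_tokens(phrase, record["name"]):
        score += 2.0  # assumed score for an exact token match
    elif phrase.lower() in record["name"].lower():
        score += 1.0  # assumed lesser score for a partial token match
    alternates = record["alternate_names"]
    if alternates:
        best = min(edit_distance(phrase.lower(), a.lower()) for a in alternates)
        score += 1.0 / (1 + best)  # closer alternate names score higher
    score += 0.1 * math.log1p(len(alternates))  # assumed weight on name count
    return score


def rank(phrase, records, n):
    shortlist = pass_one(records, n)
    return sorted(shortlist, key=lambda r: pass_two_score(phrase, r), reverse=True)


# Hypothetical records; populations are placeholders.
RECORDS = [
    {"name": "Indiana", "alternate_names": ["IN", "Hoosier State"],
     "feature_code": "ADM1", "population": 6000000},
    {"name": "Republic of India", "alternate_names": ["India", "Bharat"],
     "feature_code": "PCLI", "population": 1000000000},
]

for rec in rank("India", RECORDS, 2):
    print(rec["name"], round(pass_two_score("India", rec), 2))
```

Here "India" matches a whole token of "Republic of India" but only part of the token "Indiana", so the country scores higher.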
Evaluation and Results
We tested our improved algorithm on a diverse set of 50,000+ locations, including countries, states, districts, cities, towns and villages, from the DARPA MEMEX and NSF Polar domains. Comparison against our ground truth data gave an overall accuracy of 94%.
Challenges
• Limitations of the open-source Geonames.org data set. More features would help –
• Popular name – “Argentina” for “Argentine Republic”
• Area of a location
• No single best result.
• Pasadena exists in both California and Texas
• The data set is huge and diverse, with entries ranging from parks to continents. Performance improves if the data set is reduced to, e.g., continents, countries, states, counties, capitals and cities
Conclusion and Future Work
• Early results from this effort are promising but a number of
avenues of future work remain unexplored.
• Automatically augment Geonames.org with crowd sourced
popular place names
• Geospatial coordinate data beyond points, e.g., bounding
boxes, and other shapes, to allow for more meaningful and
spatially directed queries.
ACKNOWLEDGMENTS
• This work was supported by the DARPA XDATA/Memex
program.
• NSF Polar Cyberinfrastructure award numbers PLR-1348450 and PLR-144562 funded a portion of the work.
• Effort supported in part by JPL, managed by the California
Institute of Technology on behalf of NASA.
• Lucene Geo Gazetteer: https://github.com/chrismattmann/lucene-geo-gazetteer
• Geonames dataset: https://www.geonames.org
• Apache Tika: https://tika.apache.org/
• Apache OpenNLP: https://opennlp.apache.org/
• Apache Lucene: http://lucene.apache.org
• Memex: http://memex.jpl.nasa.gov/
More about us on github:
• Madhav Sharan: smadha
• Dr. Chris Mattmann: chrismattmann
THANK YOU


Editor's Notes

  1. research questions - geospatial location and temporal constraints