Open Source Software for Geospatial
Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist
Background




                                                                 About Me:
                                                                        Data Scientist
                                                                        Natural Language Processing
                                                                        Unstructured Text  Information


                                                                 Berico Technologies:
                                                                        Veteran-owned Small Business
                                                                        Big Data Analytics in the Cloud
                                                                        Defense & Intel Community



All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.
The Problem: geotagging unstructured text




     Growing demand for
     geospatial analytics

     Most of human knowledge
     remains “trapped” in text

     Existing solutions are
     expensive and don’t scale

     Need an open source solution



The Solution: an open source geoparser



                                                                     1. Data Ingestion
                                                                                  Input: unstructured text
                                                                     2. Entity Extraction
                                                                                  Named entity recognition
                                                                                  Find location names in text
                                                                     3. Entity Resolution
                                                                                  Match against a gazetteer
                                                                                  “The Springfield Problem”
                                                                     4. Data Enrichment
                                                                                  Output: structured geo data
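
The four stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not CLAVIN's implementation: the two-entry gazetteer, the rounded coordinates, the capitalized-word extractor standing in for named entity recognition, and every function name are made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    country: str
    lat: float
    lon: float


# Hypothetical mini-gazetteer; coordinates are rounded approximations.
GAZETTEER = {
    "boston": Place("Boston", "US", 42.36, -71.06),
    "cuba": Place("Cuba", "CU", 21.5, -77.8),
}


def extract_candidates(text):
    """Stage 2 stand-in: treat capitalized words as candidate names.

    A real system would use a trained named entity recognizer here.
    """
    return re.findall(r"\b[A-Z][a-z]+\b", text)


def resolve(candidates):
    """Stage 3: keep only the candidates that appear in the gazetteer."""
    return [GAZETTEER[c.lower()] for c in candidates if c.lower() in GAZETTEER]


def geoparse(text):
    """Stages 1-4: unstructured text in, structured geo records out."""
    return [
        {"name": p.name, "country": p.country, "lat": p.lat, "lon": p.lon}
        for p in resolve(extract_candidates(text))
    ]


records = geoparse("I was born in Boston. He never went to Cuba.")
```

The candidate "He" is extracted but dropped at the resolution stage because it has no gazetteer entry, which is the role the gazetteer lookup plays in this sketch.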



Data Ingestion: unstructured text




                                                                                                              photo: Flickr user NS Newsflash


Entity Extraction: named entity recognition




Entity Resolution: match against a gazetteer




Data Enrichment: structured geo data




“The Springfield Problem”




Dealing with Ambiguity


     Intelligent Context-based Heuristics
            First: rank by population
            Next: look for other locations mentioned in the same document
                   “Springfield” + “Chicago” = Illinois
                   “Springfield” + “Boston” = Massachusetts
            Soon: calculate distance based on lat/lons
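
The first two heuristics can be sketched as a small voting scheme: each name's candidates are ranked by population by default, and a region shared with other place names in the same document overrides that ranking. This is a minimal sketch, not CLAVIN's actual algorithm; the gazetteer, the `Candidate` type, the `resolve_all` function, and the population figures (placeholders, not authoritative data) are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    region: str      # state, province, or country used as context
    population: int  # placeholder figures, not authoritative data


# Hypothetical gazetteer: each name maps to every place that bears it.
GAZETTEER = {
    "springfield": [
        Candidate("Springfield", "Massachusetts", 150_000),
        Candidate("Springfield", "Illinois", 110_000),
    ],
    "chicago": [Candidate("Chicago", "Illinois", 2_700_000)],
    "boston": [Candidate("Boston", "Massachusetts", 650_000)],
}


def resolve_all(names):
    """Pick one candidate per name.

    Prefer the region that the most names in the document could belong
    to; break ties by population. Assumes every name is in GAZETTEER.
    """
    options = [GAZETTEER[n.lower()] for n in names]
    # Each name casts one vote for every region it could belong to.
    votes = Counter(r for cands in options for r in {c.region for c in cands})
    return [
        max(cands, key=lambda c: (votes[c.region], c.population))
        for cands in options
    ]


with_chicago = resolve_all(["Springfield", "Chicago"])
with_boston = resolve_all(["Springfield", "Boston"])
alone = resolve_all(["Springfield"])
```

With these toy entries, "Springfield" resolves to Illinois next to "Chicago" and to Massachusetts next to "Boston"; on its own it falls back to whichever candidate has the larger placeholder population.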


     Resolve alternate names to same geospatial entity
            “Ivory Coast” = “Côte d’Ivoire”


     Use fuzzy matching to capture misspelled place names
            Including both phonetic spelling & typographical errors
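
Alternate-name resolution and typo tolerance can be sketched with an alias table plus approximate string matching. The sketch below uses Python's standard `difflib` module, which is a stand-in chosen for this example rather than whatever CLAVIN uses internally; the alias table, the `lookup` function, and the 0.8 cutoff are assumptions. It covers typographical errors only; phonetic misspellings would need a phonetic encoding, which is not shown.

```python
import difflib

# Hypothetical alias table: every known spelling points at one canonical entry.
ALIASES = {
    "ivory coast": "Côte d’Ivoire",
    "côte d’ivoire": "Côte d’Ivoire",
    "massachusetts": "Massachusetts",
    "illinois": "Illinois",
}


def lookup(name, cutoff=0.8):
    """Return the canonical place name, tolerating small typos.

    Exact alias matches win; otherwise the closest known spelling is
    used if its similarity ratio reaches the cutoff. Returns None when
    nothing is close enough.
    """
    key = name.strip().lower()
    if key in ALIASES:
        return ALIASES[key]
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else None


same_entity = lookup("Ivory Coast") == lookup("Côte d’Ivoire")
typo_fixed = lookup("Massachusets")  # one "t" missing
unknown = lookup("Qwerty")
```

Lowering the cutoff catches more misspellings at the cost of more false matches, so the threshold would need tuning against real data.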

CLAVIN: an open source geoparser




System Architecture




Live Demonstration




Live Demonstration




                                              What can I do
                                              with this data?




Map Visualizations




Hierarchical Geospatial Search




                                                     Virginia

                                                               Reston           Arlington
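
One way to picture hierarchical search is a parent lookup: a query for "Virginia" should also return documents that only mention Reston or Arlington. The sketch below is an illustration under assumptions, not the system shown in the talk; the `PARENT` table, the `DOC_PLACES` index, and the function names are hypothetical.

```python
# Hypothetical containment hierarchy: child place -> enclosing place.
PARENT = {
    "Reston": "Virginia",
    "Arlington": "Virginia",
    "Virginia": "United States",
    "Boston": "Massachusetts",
    "Massachusetts": "United States",
}

# Hypothetical index: document id -> place names resolved in that document.
DOC_PLACES = {
    "doc1": {"Reston"},
    "doc2": {"Arlington"},
    "doc3": {"Boston"},
}


def ancestors(place):
    """Yield the place itself, then each enclosing region in turn."""
    while place is not None:
        yield place
        place = PARENT.get(place)


def search(region):
    """Return ids of documents mentioning the region or anything inside it."""
    return sorted(
        doc
        for doc, places in DOC_PLACES.items()
        if any(region in ancestors(p) for p in places)
    )


virginia_docs = search("Virginia")
```

Here `search("Virginia")` finds both the Reston and the Arlington documents even though neither mentions Virginia by name.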




Geospatial Bounding Box Search
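
Once place names carry coordinates, a bounding box search reduces to a range check on latitude and longitude. This is a minimal sketch, assuming a simple list of geotagged mentions; the `BoundingBox` type, the sample records (with rounded coordinates), and the box limits are hypothetical, and boxes crossing the antimeridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        """True if the point lies inside the box (no antimeridian handling)."""
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Hypothetical geotagged output: (document id, place name, lat, lon).
MENTIONS = [
    ("doc1", "Reston", 38.96, -77.36),
    ("doc2", "Arlington", 38.88, -77.10),
    ("doc3", "Boston", 42.36, -71.06),
]


def search_bbox(box):
    """Return ids of documents with at least one mention inside the box."""
    return [doc for doc, _name, lat, lon in MENTIONS if box.contains(lat, lon)]


# An arbitrary box drawn around the Washington, DC area for illustration.
dc_area = BoundingBox(south=38.5, west=-78.0, north=39.5, east=-76.5)
hits = search_bbox(dc_area)
```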




Geospatial Analytics on Unstructured Text




Performance Metrics & Features


     CLAVIN: “Cartographic Location And Vicinity INdexer”

     Accurate: 0.75 F-measure

     Fast: 100 locations per sec per CPU

     Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster

     Smart: natural language processing, context-based heuristics, & fuzzy matching

     Easy to use: simple Java-based API

     Open source: Apache License
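
The accuracy and scalability figures can be unpacked with a little arithmetic. The sketch below assumes the F-measure is the balanced F1 score (the slides do not say which variant was used) and takes the 5.7 million place names from the speaker notes for this slide; the function names are made up for the example.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def locations_per_second(total_locations, seconds):
    """Average throughput over a whole run."""
    return total_locations / seconds


# Equal precision and recall of 0.75 is just one of many combinations
# that yield an F-measure of 0.75; the actual split is not given.
example_f = f_measure(0.75, 0.75)

# 5.7M place names in one hour on 9 nodes. The run took under an hour,
# so both rates are lower bounds.
cluster_rate = locations_per_second(5_700_000, 60 * 60)
per_node_rate = cluster_rate / 9
```

This works out to roughly 1,580 locations per second for the cluster, or about 176 per node. How that relates to the per-CPU figure depends on the number of cores per node, which the slides do not state.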
clavin.bericotechnologies.com
                                                                                    Charlie Greenbacker
                                                                                          @greenbacker
                                 meetup.com/DC-NLP
                                 @DCNLP




Editor's Notes

  • #3 “Berico specializes in building open source software to support analytic missions, and implementing them through our services.” “We help our customers optimize the use of open source solutions for Cloud environments to replace the functionality traditionally licensed based projects.” “All of our products are built to run on and optimize cloud technologies – specifically HBase or Accumulo. We are the first authorized Cloudera partner in the federal sector” “CLAVIN is one of 7 open source products that we’ve built and implemented with customers in the DoD and IC. We’ve chosen CLAVIN as example to walk through today to illustrate how Berico’s open source products deliver great, market-leading, functionality with no licensing constraints, and at a fraction of the cost of proprietary tools in the market” (an infinite fraction – it’s free)
  • #11 Paris, France > Paris, Texas
  • #14 The interactive live demo will be run offline from the presenter’s laptop. The CLAVIN demo interface accepts plain text as input, and returns a list of geospatial entities (with lat/lons, etc.) corresponding to the place names extracted and resolved from the text, along with a visualization plotting these locations on a map. The example text used in the demo may include the following:
      • the sample text file built into the CLAVIN demo interface
      • “Grover Cleveland was the 22nd president of the United States. He never went to Cuba.” (shows that CLAVIN knows “Grover Cleveland” is not a city in Ohio)
      • “I was born in Boston and grew up in Springfield.” (produces a map of Massachusetts)
      • “I was born in Chicago and grew up in Springfield.” (produces a map of Illinois)
      • “I traveled to London and Oxford last summer.” (produces a map of England)
      • “I traveled to London and Toronto last summer.” (produces a map of Ontario)
      • a random news article from CNN.com (or a similar source)
      • any example text provided by the audience
  • #20 geotag 1M documents containing 5.7M place names in under 1 hour on a 9-node Hadoop cluster vs. the prohibitively expensive enterprise licenses of competing solutions like MetaCarta