Open Source Software for Geospatial Analytics on Unstructured Big Data
Charlie Greenbacker, Principal Data Scientist

Greenbacker open analyticsdc


Published in: Technology
  • “Berico specializes in building open source software to support analytic missions, and implementing it through our services.” “We help our customers optimize the use of open source solutions for cloud environments to replace the functionality of traditionally license-based products.” “All of our products are built to run on and optimize cloud technologies, specifically HBase or Accumulo. We are the first authorized Cloudera partner in the federal sector.” “CLAVIN is one of 7 open source products that we’ve built and implemented with customers in the DoD and IC. We’ve chosen CLAVIN as an example to walk through today to illustrate how Berico’s open source products deliver great, market-leading functionality with no licensing constraints, and at a fraction of the cost of proprietary tools in the market.” (an infinitesimal fraction: it’s free)
  • Paris, France > Paris, Texas
  • The interactive live demo will be run offline from the presenter’s laptop. The CLAVIN demo interface accepts plain text as input and returns a list of geospatial entities (with lat/lons, etc.) corresponding to the place names extracted and resolved from the text, along with a visualization plotting these locations on a map. The example text used in the demo may include the following:
      • the sample text file built into the CLAVIN demo interface
      • “Grover Cleveland was the 22nd president of the United States. He never went to Cuba.” (shows that CLAVIN knows “Grover Cleveland” is not a city in Ohio)
      • “I was born in Boston and grew up in Springfield.” (produces a map of Massachusetts)
      • “I was born in Chicago and grew up in Springfield.” (produces a map of Illinois)
      • “I traveled to London and Oxford last summer.” (produces a map of England)
      • “I traveled to London and Toronto last summer.” (produces a map of Ontario)
      • a random news article from CNN.com (or a similar source)
      • any example text provided by the audience
  • Geotag 1M documents containing 5.7M place names in under 1 hour on a 9-node Hadoop cluster, vs. the prohibitively expensive enterprise licenses of competing solutions like MetaCarta.
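The Paris and Springfield examples in the notes above can be illustrated with a small sketch of the two heuristics named in the deck: rank candidates by population, then prefer the candidate whose region is also mentioned elsewhere in the same document. This is not CLAVIN's implementation or API. The `Place` type, the `GAZETTEER` list, the `resolve` function, and the rounded population figures are all hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    region: str
    population: int


# Hypothetical mini-gazetteer. Populations are rough, rounded values used
# only to make the ranking visible; they are not authoritative data.
GAZETTEER = [
    Place("Springfield", "Illinois", 116_000),
    Place("Springfield", "Massachusetts", 153_000),
    Place("Springfield", "Missouri", 160_000),
    Place("Chicago", "Illinois", 2_700_000),
    Place("Boston", "Massachusetts", 620_000),
    Place("Paris", "France", 2_100_000),
    Place("Paris", "Texas", 25_000),
]


def resolve(mentions: list[str]) -> dict[str, Place]:
    """Pick one gazetteer entry per mentioned place name."""
    # All gazetteer entries that share each mentioned name.
    options = {m: [p for p in GAZETTEER if p.name == m] for m in mentions}
    # Context: regions of the names that have exactly one candidate.
    context = Counter(c[0].region for c in options.values() if len(c) == 1)
    # Prefer candidates in a region seen in context; break ties (including
    # the no-context case) by population.
    return {
        m: max(c, key=lambda p: (context[p.region], p.population))
        for m, c in options.items()
        if c
    }


if __name__ == "__main__":
    for doc in (["Boston", "Springfield"], ["Chicago", "Springfield"], ["Paris"]):
        picks = resolve(doc)
        print(doc, "->", {m: p.region for m, p in picks.items()})
```

With no other place names in the document, the sketch falls back to population alone, which is why a bare "Paris" resolves to France rather than Texas.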

    1. Open Source Software for Geospatial Analytics on Unstructured Big Data. Charlie Greenbacker, Principal Data Scientist
    2. Background. About Me: Data Scientist; Natural Language Processing; Unstructured Text → Information. Berico Technologies: Veteran-owned Small Business; Big Data Analytics in the Cloud; Defense & Intel Community. All information contained within this presentation is UNCLASSIFIED // PROPRIETARY and belongs to Berico Technologies, LLC.
    3. The Problem: geotagging unstructured text. Growing demand for geospatial analytics. Most of human knowledge remains “trapped” in text. Existing solutions are expensive and don’t scale. Need an open source solution.
    4. The Solution: an open source geoparser. (1) Data Ingestion. Input: unstructured text. (2) Entity Extraction. Named entity recognition; find location names in text. (3) Entity Resolution. Match against a gazetteer; “The Springfield Problem”. (4) Data Enrichment. Output: structured geo data.
    5. Data Ingestion: unstructured text (photo: Flickr user NS Newsflash)
    6. Entity Extraction: named entity recognition
    7. Entity Resolution: match against a gazetteer
    8. Data Enrichment: structured geo data
    9. “The Springfield Problem”
    10. Dealing with Ambiguity. Intelligent context-based heuristics. First: rank by population. Next: look for other locations mentioned in the same document (“Springfield” + “Chicago” = Illinois; “Springfield” + “Boston” = Massachusetts). Soon: calculate distance based on lat/lons. Resolve alternate names to the same geospatial entity (“Ivory Coast” = “Côte d’Ivoire”). Use fuzzy matching to capture misspelled place names, including both phonetic spelling and typographical errors.
    11. CLAVIN: an open source geoparser
    12. System Architecture
    13. Live Demonstration
    14. Live Demonstration: What can I do with this data?
    15. Map Visualizations
    16. Hierarchical Geospatial Search: Virginia (Reston, Arlington)
    17. Geospatial Bounding Box Search
    18. Geospatial Analytics on Unstructured Text
    19. Performance Metrics & Features. CLAVIN: “Cartographic Location And Vicinity INdexer”. Accurate: 0.75 F-measure. Fast: 100 locations per sec per CPU. Scalable: processes 1 million documents in 1 hour on a 9-node Hadoop cluster. Smart: natural language processing, context-based heuristics, & fuzzy matching. Easy to use: simple Java-based API. Open source: Apache License.
    20. clavin.bericotechnologies.com. Charlie Greenbacker, @greenbacker. meetup.com/DC-NLP, @DCNLP
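Slide 10 mentions resolving alternate names to a single entity and using fuzzy matching to catch misspelled place names. A minimal sketch of that idea follows, assuming a small alias table and Python's standard `difflib` for approximate string matching. The `ALTERNATE_NAMES` table, the `normalize` and `lookup` helpers, and the 0.8 cutoff are hypothetical choices for illustration, not CLAVIN's data or API. Character-level similarity only covers typographical errors; the phonetic matching the slide refers to would need a different technique and is not shown.

```python
import difflib
from typing import Optional

# Hypothetical alias table: every known spelling maps to one canonical name.
ALTERNATE_NAMES = {
    "ivory coast": "Côte d'Ivoire",
    "côte d'ivoire": "Côte d'Ivoire",
    "massachusetts": "Massachusetts",
    "illinois": "Illinois",
}


def normalize(name: str) -> str:
    """Lowercase, trim, and unify curly apostrophes before matching."""
    return name.strip().lower().replace("’", "'")


def lookup(name: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the canonical name for an exact or near-miss spelling."""
    key = normalize(name)
    # Exact hit on a known name or alias.
    if key in ALTERNATE_NAMES:
        return ALTERNATE_NAMES[key]
    # Otherwise take the single closest known spelling above the cutoff.
    close = difflib.get_close_matches(key, list(ALTERNATE_NAMES), n=1, cutoff=cutoff)
    return ALTERNATE_NAMES[close[0]] if close else None


if __name__ == "__main__":
    for raw in ("Ivory Coast", "Cote d'Ivoire", "Massachusets", "Atlantis"):
        print(repr(raw), "->", lookup(raw))
```

The cutoff trades recall against false matches: a lower value catches more misspellings but starts mapping unrelated names onto gazetteer entries.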