Natural Language Processing and Graph Databases in Lumify

Lumify is an open source platform for big data analysis and visualization, designed to help organizations derive actionable insights from the large volumes of diverse data flowing through their enterprise. Utilizing both Hadoop and Storm, it ingests and integrates virtually any kind of data, from unstructured text documents and structured datasets, to images and video. Several open source analytic tools (including Tika, OpenNLP, CLAVIN, OpenCV, and ElasticSearch) are used to enrich the data, increase its discoverability, and automatically uncover hidden connections. All information is stored in a secure graph database implemented on top of Accumulo to support cell-level security of all data and metadata elements. A modern, browser-based user interface enables analysts to explore and manipulate their data, discovering subtle relationships and drawing critical new insights. In addition to full-text search, geospatial mapping, and multimedia processing, Lumify features a powerful graph visualization supporting sophisticated link analysis and complex knowledge representation.

Charlie Greenbacker, Director of Data Science at Altamira, will provide an overview of Lumify and discuss how natural language processing (NLP) tools are used to enrich the text content of ingested data and automatically discover connections with other bits of information. Joe Ferner, Senior Software Engineer at Altamira, will describe the creation of SecureGraph and how it supports authorizations, visibility strings, multivalued properties, and property metadata in a graph database.
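
The authorization and visibility-string idea mentioned above can be illustrated with a small sketch. This is not SecureGraph's or Accumulo's API: the class names (`Property`, `Vertex`), the `is_visible` helper, and the simplified visibility syntax (labels joined only by `&`, with an empty string meaning unrestricted) are assumptions made for illustration. Accumulo's real visibility expressions also support `|` and parentheses, which are omitted here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def is_visible(visibility: str, authorizations: set[str]) -> bool:
    """Return True if the user holds every '&'-joined label.

    Simplified on purpose: an empty visibility string means unrestricted.
    """
    required = {label.strip() for label in visibility.split("&") if label.strip()}
    return required <= authorizations


@dataclass
class Property:
    name: str
    value: object
    visibility: str = ""  # hypothetical visibility string, e.g. "secret&fraud"
    metadata: dict = field(default_factory=dict)  # property metadata, e.g. source


@dataclass
class Vertex:
    vertex_id: str
    visibility: str = ""
    # A list (not a dict) so the same property name can hold several values.
    properties: list[Property] = field(default_factory=list)

    def visible_properties(self, authorizations: set[str]) -> list[Property]:
        """Return only the properties these authorizations may read."""
        if not is_visible(self.visibility, authorizations):
            return []
        return [p for p in self.properties if is_visible(p.visibility, authorizations)]


# Hypothetical example data, not taken from the presentation.
person = Vertex(
    "person-1",
    properties=[
        Property("name", "Jane Doe"),
        Property("alias", "JD", visibility="secret"),
        Property("alias", "J. Doe", metadata={"source": "hypothetical report"}),
    ],
)

public_view = [(p.name, p.value) for p in person.visible_properties(set())]
secret_view = [(p.name, p.value) for p in person.visible_properties({"secret"})]
```

In this sketch a user with no authorizations sees the name and one alias, while a user holding `secret` sees all three values. Filtering at read time on each property is the point of the illustration; in Lumify the enforcement is delegated to Accumulo's cell-level security.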

Published in: Data & Analytics

  1. 1. NLP and Graph Databases in Lumify Charlie Greenbacker & Joe Ferner
  2. 2. Agenda •  Introductions •  Lumify Overview •  Natural Language Processing •  Graph Databases
  3. 3. About me: @greenbacker Theories: popular tripe Methods: sloppy Conclusions: highly questionable photo: Columbia Pictures
  4. 4. Best reason for not finishing PhD
  5. 5. @ExploreAltamira
  6. 6. Lumify is an open source big data analysis and visualization platform built by Altamira engineers
  7. 7. Key Lumify Concepts •  Ontology: structure for organizing information (i.e., your data model) •  Entities: any “thing” you want to represent (e.g., person, place, event) •  Relationships: a link between two entities (e.g., leader-of, works-for, sibling-of) •  Properties: data about an entity (e.g., first name, last name, date of birth) •  Graph: collection of entities and the relationships between them
  8. 8. Live Demo
  9. 9. Who can Lumify help?
  10. 10. Intelligence Analyst: Lumify helps analysts fuse structured and unstructured data from myriad sources into actionable intelligence.
  11. 11. Police Investigator: Law enforcement personnel can use Lumify to explore criminal networks, uncover hidden connections, and develop leads.
  12. 12. Financial Analyst: Lumify analyzes financial data and transaction records to help detect fraud and identify possible insider threats. photo: Ken Teegardin (https://flic.kr/p/9rn9Yh)
  13. 13. Research Staff: Scientists, law firms, news organizations, and others can track their research in Lumify to unearth latent knowledge and discover critical new insights. photo: UK National Archives (http://bit.ly/1n9dhR8)
  14. 14. Why Lumify?
  15. 15. Free and Open Source •  Distributed under the permissive Apache 2.0 license •  No restrictions on modifications •  No licensing or usage constraints
  16. 16. Built on Scalable Open Source Tech: Hadoop CDH 4, Accumulo, ElasticSearch, tesseract, CLAVIN, CMU Sphinx, OpenNLP, OpenCV, ffmpeg, Apache Storm, Secure Graph, custom code
  17. 17. Highly Secure •  Separate security restrictions at the entity, property, and relationship level •  Implemented in and enforced by Accumulo cell-level security Example entities: Joaquin Guzman Loera (DOB: 1957-04-04; POB: Badiraguato; Nationality: Mexican); Zarka de Mexico (Founded: 2010-01-11; Location: Mexico City; Employees: 121)
  18. 18. Supported •  Full-time development staff •  Custom development and customization services •  Commercial support offerings
  19. 19. AWS Compatible •  Day-to-day development done on Amazon infrastructure •  Primarily use EC2, VPC, S3, SES, CloudWatch •  Altamira is an AWS consulting partner
  20. 20. Natural Language Processing in Lumify
  21. 21. Text Extraction [diagram: inputs are video, text docs, structured data, images, and audio; processing steps shown are an extractor, OCR (tesseract), and CMU Sphinx]
  22. 22. Text Enrichment •  Apache OpenNLP (Named Entity Recognition): extracts names of entities from unstructured text (persons, orgs, and locations); highlighted in preview text; user must confirm/resolve •  CLAVIN (Geospatial Entity Resolution): resolves extracted location names to gazetteer records; solves the “Springfield problem”; disambiguates place names; turns text docs into maps!
  23. 23. Enriched Text Documents: Machine-powered entity extraction and resolution, combined with human QA and supplementation, supports rich semantic analysis of raw text. Example document: Drug Lord “El Chapo” Captured in Mexico (published date: 2014/02/22; source: Wikipedia). Although Guzman had long hidden successfully in remote areas of the Sierra Madre mountains, the arrested members of his security team told the military he had begun venturing out to Culiacan and the beach town of Mazatlan. A week prior to his capture, Guzman and Zambada were reported to have attended a family reunion in Sinaloa. The Mexican military followed the bodyguards’ tips to Guzman’s ex-wife’s house, but they had trouble ramming the steel-reinforced front door, which allowed Guzman to escape through a system of secret tunnels that connected six houses, eventually moving south to Mazatlan. He planned to stay a few days in Mazatlan to see his twin baby daughters before retreating to the mountains. On 22 February 2014, at around 6:40 a.m., Mexican authorities arrested Guzman at a hotel in a beach front area in Mazatlan, Sinaloa, following an operation by the Mexican Navy, with joint intelligence from the DEA and
  24. 24. Benefits to Users •  Increases Discoverability: quickly find relevant data without reading •  Helps Deal with Information Overload: machines process text faster than humans •  Uncovers Hidden Connections: enables object-based analysis & investigations
  25. 25. Future NLP Integration •  Support other NER tools: e.g., Stanford NER, SUTime, MITIE •  Event/Relationship Extraction: e.g., OpenIE (formerly ReVerb) •  Coreference Resolution: augmenting/extending GATE/ANNIE •  Additional Text Analytics: e.g., frequency analysis, topic modeling, sentiment analysis •  Multilingual Support: use non-English language models for NER, etc.
  26. 26. Graph Databases in Lumify. View part 2 of the presentation here: github.com/altamiracorp/secure-graph-presentation
  27. 27. Questions? more info: lumify.io
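
The “Springfield problem” from slide 22 can be illustrated with a toy resolver. This is not CLAVIN's algorithm or API: the tiny in-memory gazetteer, its approximate population and coordinate figures, and the scoring heuristic (prefer a candidate whose region is mentioned in the surrounding text, otherwise fall back to the largest population) are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerRecord:
    name: str
    region: str
    population: int  # approximate, illustrative only
    lat: float       # approximate, illustrative only
    lon: float


# Hypothetical three-row gazetteer; a real one holds millions of records.
GAZETTEER = [
    GazetteerRecord("Springfield", "Illinois", 114_000, 39.80, -89.64),
    GazetteerRecord("Springfield", "Missouri", 169_000, 37.21, -93.29),
    GazetteerRecord("Springfield", "Massachusetts", 155_000, 42.10, -72.59),
]


def resolve(place_name: str, context: str) -> GazetteerRecord | None:
    """Pick one gazetteer record for an extracted place name.

    Candidates whose region appears in the context text win; ties and
    context-free cases fall back to the most populous candidate.
    """
    candidates = [r for r in GAZETTEER if r.name.lower() == place_name.lower()]
    if not candidates:
        return None
    context_lower = context.lower()
    mentioned = [r for r in candidates if r.region.lower() in context_lower]
    pool = mentioned or candidates
    return max(pool, key=lambda r: r.population)


text = "The governor of Illinois spoke in Springfield on Tuesday."
match = resolve("Springfield", text)
if match is not None:
    # Once resolved, the mention has coordinates and can be placed on a map.
    print(f"{match.name}, {match.region} -> ({match.lat}, {match.lon})")
```

Once a mention is tied to a single record, it carries coordinates, which is what lets extracted location names be plotted on a map.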
