Open Analytics DC June 2012 Presentation

Published in: Technology

  1. 1. Document Analysis and Big Data: Making Sense out of the Flood
  2. 2. Agenda
     • Define Big Data and Document Analysis
     • The Infinit.e Solution
     • Questions
  3. 3. What is Big Data?
     “Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.”
     Source: http://en.wikipedia.org/wiki/Big_data
  4. 4. This is What Big Data Feels Like
     Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  5. 5. What is Document Analysis?
     “Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.”
     Source: http://www.text-tech.com/docanalysis/definition.html
  6. 6. Document Analysis
     • The goal is to:
       – Extract Entities (people, places, things)
       – Create Associations between entities (in the form of noun-verb-noun), e.g.:
         • John Doe lives in Washington, D.C.
         • John Doe is married to Jane Doe
         • John Doe is a Virgo
         • John Doe traveled to Mexico on July 6th, 2011
     • And…
  7. 7. Document Analysis
     • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
       – Who: people, organizations, facilities, company
       – When: past, present, future dates
       – What: events, summaries, facts, themes
       – Where: city, state, country, coordinate
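The noun-verb-noun associations and the Who/What/When/Where dimensions can be sketched as a small data structure. This is only an illustration, not Infinit.e's actual schema: the class names `Entity` and `Association`, the `DIMENSION_BY_TYPE` mapping, and the entity type labels are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from an entity type to one of the four dimensions.
DIMENSION_BY_TYPE = {
    "Person": "Who",
    "Organization": "Who",
    "Event": "What",
    "Date": "When",
    "City": "Where",
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str

    @property
    def dimension(self) -> str:
        # Types missing from the mapping are reported as "Unknown".
        return DIMENSION_BY_TYPE.get(self.type, "Unknown")


@dataclass(frozen=True)
class Association:
    subject: Entity  # first noun
    verb: str
    obj: Entity      # second noun

    def as_sentence(self) -> str:
        return f"{self.subject.name} {self.verb} {self.obj.name}"


john = Entity("John Doe", "Person")
dc = Entity("Washington, D.C.", "City")
lives_in = Association(john, "lives in", dc)
```

Calling `lives_in.as_sentence()` rebuilds the first example sentence from slide 6, and `john.dimension` and `dc.dimension` place the two entities on the Who and Where axes.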
  8. 8. The Infinit.e Solution
     • Infinit.e is an Open Source document discovery and analysis platform that has these very cool open source tools lurking under the hood.
     github.com/ikanow/Infinit.e
  9. 9. The Infinit.e Solution
     Infinit.e is a scalable framework for collecting, storing, enriching, retrieving, analyzing, and visualizing structured and unstructured documents.
  10. 10. Harvesting
      • Infinit.e’s harvester:
        – Collects documents from specified data sources (URLs, RDBMSs via JDBC, file shares)
        – Marshals each document through the enrichment process
        – Saves each metadata document, entity, and association created to MongoDB
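The three harvesting steps (collect, enrich, save) can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not Infinit.e's harvester: the function names `harvest`, `fetch_stub`, and `enrich_stub` are hypothetical, the example URL is a placeholder, and a plain Python list stands in for MongoDB.

```python
def harvest(sources, fetch, enrich, store):
    """Collect documents from each source, enrich them, and save the results.

    Returns the number of documents saved.
    """
    saved = 0
    for source in sources:
        for doc in fetch(source):
            store(enrich(doc))
            saved += 1
    return saved


def fetch_stub(source):
    # Hypothetical collector: yields one placeholder document per source.
    yield {"url": source, "title": "Example title", "description": "Example text"}


def enrich_stub(doc):
    # Hypothetical enrichment: adds empty entity and association lists.
    enriched = dict(doc)
    enriched.setdefault("entities", [])
    enriched.setdefault("associations", [])
    return enriched


collection = []  # stand-in for a MongoDB collection
count = harvest(["http://example.com/feed"], fetch_stub, enrich_stub, collection.append)
```

Passing the fetch, enrich, and store steps in as functions keeps the loop independent of any particular source type or database.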
  11. 11. Source Ingestion Data Flow
  12. 12. Sample RSS Document
      <rss version="2.0">
        <channel>
          …
          <item>
            <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia…</title>
            <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
            <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most …</description>
            <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
            <dc:creator>unknown</dc:creator>
            <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
          </item>
          …
        </channel>
      </rss>
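An RSS item like the one above could be turned into a flat metadata record roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`, not Infinit.e's harvester code. The function name `rss_items_to_docs`, the output field names, and the placeholder feed content are assumptions, and the sketch declares the Dublin Core namespace on the `<rss>` element because the parser needs it to resolve the `dc:` prefix.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core element namespace, needed to resolve the dc: prefix.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Placeholder feed, shaped like the slide's sample but with made-up values.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example headline</title>
      <link>http://example.com/story.html</link>
      <description>Example summary text</description>
      <dc:publisher>Example Publisher</dc:publisher>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def rss_items_to_docs(xml_text):
    """Return one dict per <item>, with whitespace-trimmed text fields."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "publisher": item.findtext("dc:publisher", default=None, namespaces=DC_NS),
        })
    return docs


docs = rss_items_to_docs(SAMPLE)
```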
  13. 13. Full Text Source
  14. 14. Document Metadata
      • doc_metadata.metadata
        {
          "_id" : ObjectId("4f93638e0cf212156d0559d2"),
          "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
          "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
          "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
          "created" : ISODate("2012-04-22T01:49:02Z"),
          "metadata" : {…},
          "associations" : […],
          "entities" : […],
          ...
        }
  15. 15. Harvested Document Metadata
      • document.metadata
        "metadata" : {
          "location" : [
            {
              "region" : "South Asia",
              "citystateprovince" : { "stateprovince" : "Rolpa", "city" : "Newang" },
              "country" : "Nepal"
            }
          ],
          "icn" : [ "200573487" ],
          "incidentdate" : [ "07/25/2005" ],
          "organization" : [ "Communist Party of Nepal (Maoist)/United Peoples Front" ],
          ...
        },
  16. 16. Document Enrichment
      • Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:
  17. 17. Harvested Entities
      • feature.entity
        {
          "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
          "alias" : [
            "Zine el Abidine Ben Ali",
            "Zine El Abidine Ben Ali",
            "Zine el Abidine ben Ali"
          ],
          "batch_resync" : true,
          "communityId" : ObjectId("4f8f138103644ee8003bf518"),
          "db_sync_doccount" : NumberLong(143),
          "db_sync_time" : "1338751174988",
          "dimension" : "Who",
          "disambiguated_name" : "Zine El Abidine Ben Ali",
          "doccount" : 152,
          "index" : "zine el abidine ben ali/person",
          "totalfreq" : 353,
          "type" : "Person"
        }
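In the sample entity, the `index` value matches the lowercased `disambiguated_name` joined to the lowercased `type` with a slash. The sketch below encodes that observation and uses the `alias` list for a case-insensitive lookup. It is an inference from this one sample, not a documented Infinit.e rule, and the function names `entity_index` and `build_alias_lookup` are hypothetical.

```python
def entity_index(disambiguated_name, entity_type):
    # Assumed convention, inferred from the single sample on the slide.
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def build_alias_lookup(entities):
    """Map each lowercased alias to the index of the entity it belongs to."""
    lookup = {}
    for entity in entities:
        for alias in entity.get("alias", []):
            lookup[alias.lower()] = entity["index"]
    return lookup


# Fields copied from the slide's sample entity (other fields omitted).
entity = {
    "alias": [
        "Zine el Abidine Ben Ali",
        "Zine El Abidine Ben Ali",
        "Zine el Abidine ben Ali",
    ],
    "disambiguated_name": "Zine El Abidine Ben Ali",
    "type": "Person",
    "index": "zine el abidine ben ali/person",
}
lookup = build_alias_lookup([entity])
```

Because the three aliases differ only in capitalization, they collapse to a single key in `lookup`.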
  18. 18. Harvested Entities
  19. 19. Harvested Associations
      • feature.association
        {
          "_id" : ObjectId("4f9189d48baf188282a1ca24"),
          "assoc_type" : "Fact",
          "communityId" : ObjectId("4f8f138103644ee8003bf518"),
          "db_sync_doccount" : NumberLong(70),
          "db_sync_time" : "1338491609281",
          "doccount" : NumberLong(73),
          "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ],
          "entity1_index" : "zine el abidine ben ali/person",
          "entity2" : [ "president", "president/position" ],
          "entity2_index" : "president/position",
          "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
          "verb" : [ "career", "current", "past" ],
          "verb_category" : "career"
        }
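Association records shaped like the sample can be grouped into a simple adjacency map keyed by the first entity, which is one way to feed a link-style visualization. This is an illustrative sketch only: the function name `build_graph` is hypothetical, and the input is a plain list of dicts rather than a MongoDB query result.

```python
from collections import defaultdict


def build_graph(associations):
    """Group associations into {entity1_index: [(verb_category, entity2_index), ...]}."""
    graph = defaultdict(list)
    for assoc in associations:
        graph[assoc["entity1_index"]].append(
            (assoc["verb_category"], assoc["entity2_index"])
        )
    return dict(graph)


# Fields copied from the slide's sample association (other fields omitted).
associations = [
    {
        "entity1_index": "zine el abidine ben ali/person",
        "entity2_index": "president/position",
        "verb_category": "career",
        "doccount": 73,
    }
]
graph = build_graph(associations)
```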
  20. 20. Harvested Associations
  21. 21. Geolocation of Entities/Events
      • feature.geo
        {
          "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
          "search_field" : "cairo",
          "country" : "Egypt",
          "country_code" : "EG",
          "city" : "cairo",
          "region" : "Al Qahirah",
          "region_code" : "EG11",
          "population" : 7734602,
          "latitude" : "30.05",
          "longitude" : "31.25",
          "geoindex" : { "lon" : 31.25, "lat" : 30.05 }
        }
        Note: geoindex uses a MongoDB 2d index.
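The sample stores coordinates twice: as strings in `latitude` and `longitude`, and as numbers in `geoindex`. The sketch below converts the string pair into the numeric form and computes a great-circle distance between two such points with the haversine formula. The helper names `to_geoindex` and `haversine_km` are hypothetical, the Tunis coordinates are approximate values chosen for illustration, and the distance calculation is a generic technique, not a description of how MongoDB's 2d index works internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def to_geoindex(latitude, longitude):
    """Convert string coordinates into the numeric {"lon", "lat"} form."""
    return {"lon": float(longitude), "lat": float(latitude)}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two geoindex-style points."""
    lat1, lat2 = math.radians(a["lat"]), math.radians(b["lat"])
    dlat = lat2 - lat1
    dlon = math.radians(b["lon"] - a["lon"])
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


cairo = to_geoindex("30.05", "31.25")   # values from the slide's sample
tunis = to_geoindex("36.80", "10.18")   # approximate, for illustration only
distance_km = haversine_km(cairo, tunis)
```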
  22. 22. Geolocation of Entities/Events
  23. 23. Who, What, Where and When
  24. 24. Thank You!
      Craig Vitter
      www.ikanow.com
      cvitter@ikanow.com
      github.com/ikanow/Infinit.e
