How IKANOW uses MongoDB to help organizations solve really big problems


Published in: Technology, Education
  1. 1. The Open Source document analysis platform. Or, how IKANOW uses MongoDB to help organizations solve really big problems
  2. 2. Agenda
     • What is Document Analysis?
     • The Infinit.e Solution
       – Infinit.e’s Architecture
       – Why and How we use MongoDB
     • Analyzing #MongoDC
     • Questions
  3. 3. This is what Big Data Looks Like
     Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  4. 4. What is Document Analysis?
     "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set."
     Source: http://www.text-tech.com/docanalysis/definition.html
  5. 5. Document Analysis
     • Common document source formats: RSS, JSON, XML, HTML, PDF, TXT, RTF, Word, PPT, multimedia files, RDBMS records, etc.
  6. 6. Document Analysis
     • The goal is to:
       – Extract Entities (people, places, things)
       – Create Associations between entities (in the form of noun-verb-noun), e.g.:
         • John Doe lives in Washington, D.C.
         • John Doe is married to Jane Doe
         • John Doe is a Virgo
         • John Doe traveled to Mexico on July 6th, 2011
     • And…
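The noun-verb-noun form on this slide can be sketched as a small data structure. This is a minimal illustration only: the `Entity` and `Association` classes and the `related_to` helper are hypothetical names, not part of Infinit.e.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A person, place, or thing extracted from a document (hypothetical type)."""
    name: str
    type: str


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun relationship between two entities (hypothetical type)."""
    subject: Entity
    verb: str
    obj: Entity


john = Entity("John Doe", "Person")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Place")),
    Association(john, "is married to", Entity("Jane Doe", "Person")),
    Association(john, "traveled to", Entity("Mexico", "Place")),
]


def related_to(entity, assocs):
    """Return (verb, object name) pairs for associations whose subject is `entity`."""
    return [(a.verb, a.obj.name) for a in assocs if a.subject == entity]
```

Calling `related_to(john, associations)` lists everything the sample set asserts about John Doe, which is the kind of lookup the extracted associations are meant to support.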
  7. 7. Document Analysis
     • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
       – Who: people, organizations, facilities, company
       – When: past, present, future dates
       – What: events, summaries, facts, themes
       – Where: city, state, country, coordinate
  8. 8. The Infinit.e Solution
     • Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood.
     github.com/ikanow/Infinit.e
  9. 9. The Infinit.e Solution
     Infinit.e is a scalable framework for Visualizing, Analyzing, Retrieving, Enriching, Storing, and Collecting Structured and Unstructured Documents
  10. 10. IkanMeow
  11. 11. Document Collection
      • Infinit.e harvests documents from:
        – URLs
        – File Shares
        – Databases
  12. 12. Sample RSS Document
          <rss version="2.0">
          <channel>
          …
          <item>
            <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
            <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
            <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description>
            <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
            <dc:creator>unknown</dc:creator>
            <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
          </item>
          …
          </channel>
          </rss>
  13. 13. Full Text Source
  14. 14. Source Ingestion Data Flow
  15. 15. Document DBs and Collections
  16. 16. Document Metadata
      • doc_metadata.metadata
          {
            "_id" : ObjectId("4f93638e0cf212156d0559d2"),
            "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
            "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
            "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
            "created" : ISODate("2012-04-22T01:49:02Z"),
            "metadata" : {…},
            "associations" : […],
            "entities" : […],
            ...
          }
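As a rough sketch of how an RSS item could be normalized into a record shaped like the metadata document on this slide, the following uses only Python's standard library. It is not Infinit.e's harvester: the `rss_items_to_docs` function, the shortened sample feed, and the `example.com` URL are assumptions for illustration, and the `xmlns:dc` declaration is added so the sample parses as standalone XML.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Shortened, hypothetical feed; the namespace declaration is added for parsing.
RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia</title>
<link>http://example.com/story.html</link>
<description>Report by egyptlastminute.com CAIRO: On Monday ...</description>
<dc:creator>unknown</dc:creator>
</item>
</channel>
</rss>"""


def rss_items_to_docs(xml_text, created=None):
    """Convert each <item> into a dict with the top-level fields shown on the slide."""
    created = created or datetime.now(timezone.utc)
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "created": created,
            "metadata": {},       # source-specific fields would go here
            "associations": [],   # filled in later by enrichment
            "entities": [],       # filled in later by enrichment
        })
    return docs


docs = rss_items_to_docs(RSS)
```

Each resulting dict could be inserted into a document store as-is, which is the appeal of keeping the harvested record semi-structured.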
  17. 17. Harvested Document Metadata
      • doc_metadata.metadata.metadata
          "metadata" : {
            "location" : [
              {
                "region" : "South Asia",
                "citystateprovince" : {
                  "stateprovince" : "Rolpa",
                  "city" : "Newang"
                },
                "country" : "Nepal"
              }
            ],
            "icn" : [ "200573487" ],
            "incidentdate" : [ "07/25/2005" ],
            "organization" : [ "Communist Party of Nepal (Maoist)/United Peoples Front" ],
            ...
          },
      Note: It is okay to laugh at this
  18. 18. Document Enrichment
      • Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and third-party NLP APIs including:
  19. 19. Harvested Entities
      • feature.entity
          {
            "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
            "alias" : [
              "Zine el Abidine Ben Ali",
              "Zine El Abidine Ben Ali",
              "Zine el Abidine ben Ali"
            ],
            "batch_resync" : true,
            "communityId" : ObjectId("4f8f138103644ee8003bf518"),
            "db_sync_doccount" : NumberLong(143),
            "db_sync_time" : "1338751174988",
            "dimension" : "Who",
            "disambiguated_name" : "Zine El Abidine Ben Ali",
            "doccount" : 152,
            "index" : "zine el abidine ben ali/person",
            "totalfreq" : 353,
            "type" : "Person"
          }
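In the sample entity, the `index` value looks like the lowercased name joined to the lowercased type with a slash, and the `alias` array collects the spelling variants seen in documents. The sketch below reproduces that pattern under that assumption. The slide does not show how Infinit.e actually builds the index or merges aliases, so `entity_index` and `merge_aliases` are hypothetical helpers.

```python
def entity_index(name, entity_type):
    """Build a key like 'zine el abidine ben ali/person' (assumed format)."""
    return f"{name.lower()}/{entity_type.lower()}"


def merge_aliases(mentions):
    """Group (name, type) mentions by index key, collecting aliases and a frequency."""
    features = {}
    for name, entity_type in mentions:
        key = entity_index(name, entity_type)
        feature = features.setdefault(
            key, {"index": key, "alias": [], "totalfreq": 0}
        )
        if name not in feature["alias"]:
            feature["alias"].append(name)
        feature["totalfreq"] += 1
    return features


mentions = [
    ("Zine el Abidine Ben Ali", "Person"),
    ("Zine El Abidine Ben Ali", "Person"),
    ("Zine el Abidine ben Ali", "Person"),
    ("Zine El Abidine Ben Ali", "Person"),
]
features = merge_aliases(mentions)
```

With these four mentions, all three spellings collapse into one feature keyed by `zine el abidine ben ali/person`.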
  20. 20. Harvested Entities
  21. 21. Harvested Associations
      • feature.association
          {
            "_id" : ObjectId("4f9189d48baf188282a1ca24"),
            "assoc_type" : "Fact",
            "communityId" : ObjectId("4f8f138103644ee8003bf518"),
            "db_sync_doccount" : NumberLong(70),
            "db_sync_time" : "1338491609281",
            "doccount" : NumberLong(73),
            "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ],
            "entity1_index" : "zine el abidine ben ali/person",
            "entity2" : [ "president", "president/position" ],
            "entity2_index" : "president/position",
            "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
            "verb" : [ "career", "current", "past" ],
            "verb_category" : "career"
          }
  22. 22. Harvested Associations
  23. 23. Geolocation of Entities/Events
      • feature.geo
          {
            "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
            "search_field" : "cairo",
            "country" : "Egypt",
            "country_code" : "EG",
            "city" : "cairo",
            "region" : "Al Qahirah",
            "region_code" : "EG11",
            "population" : 7734602,
            "latitude" : "30.05",
            "longitude" : "31.25",
            "geoindex" : {
              "lat" : 30.05,
              "lon" : 31.25
            }
          }
      Note: MongoDB 2d Index (the geoindex field)
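To give a feel for what a proximity lookup over `geoindex` values computes, here is a plain-Python stand-in. It is not MongoDB's 2d index and does not query a database: it scans an in-memory list and uses the haversine great-circle distance. The `GEO` records, the Tunis coordinates, and the `nearest` helper are illustrative assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


# Trimmed-down records in the shape of feature.geo (second entry is illustrative).
GEO = [
    {"search_field": "cairo", "geoindex": {"lat": 30.05, "lon": 31.25}},
    {"search_field": "tunis", "geoindex": {"lat": 36.8, "lon": 10.18}},
]


def nearest(lat, lon, records):
    """Return the record whose geoindex is closest to (lat, lon) by linear scan."""
    return min(
        records,
        key=lambda r: haversine_km(lat, lon, r["geoindex"]["lat"], r["geoindex"]["lon"]),
    )
```

A real geospatial index avoids the linear scan, which matters once the gazetteer holds many thousands of places.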
  24. 24. Geolocation of Entities/Events
  25. 25. Who, What, Where and When
  26. 26. Why MongoDB? – Reason #1: Document-Oriented Storage
      • MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
  27. 27. Why MongoDB? – Reason #2: JSON
      • The standard language of open document analysis
        – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
        – BSON (Binary JSON) is MongoDB’s native data format
        – Infinit.e ingests and exports JSON natively via the REST-based API
      Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
      This is the JSON logo
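The JSON-to-object round trip mentioned in the note can be illustrated in Python with `dataclasses` and the standard `json` module. This is an analogy only: Infinit.e does this in Java with GSON, and the `DocumentPojo` class and its fields are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DocumentPojo:
    """Hypothetical plain object mirroring a few document fields."""
    title: str
    url: str
    entities: list = field(default_factory=list)


def from_json(text):
    """Parse a JSON string into a DocumentPojo."""
    return DocumentPojo(**json.loads(text))


def to_json(doc):
    """Serialize a DocumentPojo back to a JSON string."""
    return json.dumps(asdict(doc), sort_keys=True)


raw = '{"title": "Sample", "url": "http://example.com/a", "entities": ["cairo/city"]}'
doc = from_json(raw)
round_tripped = to_json(doc)
```

Because the stored record, the API payload, and the in-memory object all share one shape, no separate mapping layer is needed between them.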
  28. 28. Why MongoDB? – Reason #3: MongoDB Is Web Scale*
      *Shards are the secret ingredients in the web scale sauce. They just work.
  29. 29. Why MongoDB? – Reason #3: Scalability
      • Seriously, MongoDB Scales
        – Harvesting and enriching documents requires a lot of disk space
        – MongoDB scales to arbitrary sizes in both read/write dimensions
        – Sophisticated sharding keys provide powerful/flexible balancing
        – BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
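The idea of a shard key deciding where each document lives can be sketched in a few lines. This is a simplified stand-in, not MongoDB's mechanism: MongoDB splits shard-key ranges into chunks and balances those chunks across shards, whereas the hypothetical `shard_for` function below just hashes the key and takes a modulo. The URLs are made up.

```python
import hashlib
from collections import Counter


def shard_for(shard_key_value, num_shards):
    """Map a shard key value to a shard number with a stable hash (illustrative only)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route 1,000 hypothetical document URLs across 4 shards and count the placement.
urls = [f"http://example.com/doc/{i}" for i in range(1000)]
placement = Counter(shard_for(url, 4) for url in urls)
```

A key whose values spread evenly gives each shard a similar share of documents, which is what the balancing point on the slide depends on.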
  30. 30. Why MongoDB? – Reason #4: Integration with Apache Hadoop
      • Hadoop is rapidly becoming the de facto standard for data analytics
        – Open Source, very customizable
        – Proven scalability
        – Java libraries
      • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
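The kind of analytics job Hadoop would run over stored documents follows the map/reduce pattern. The sketch below imitates that pattern in a single Python process to count how many documents mention each entity. It does not use Hadoop or the MongoDB adaptor; the sample documents and the `map_doc`, `reduce_counts`, and `run_job` functions are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical documents carrying entity index keys.
docs = [
    {"title": "a", "entities": ["cairo/city", "egypt/country"]},
    {"title": "b", "entities": ["egypt/country", "tunisia/country"]},
    {"title": "c", "entities": ["egypt/country"]},
]


def map_doc(doc):
    """Map step: emit (entity index, 1) for every entity in a document."""
    for index in doc["entities"]:
        yield index, 1


def reduce_counts(key, values):
    """Reduce step: sum the emitted counts for one entity index."""
    return key, sum(values)


def run_job(documents):
    """Group mapped pairs by key, then reduce each group (single-process stand-in)."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


doccounts = run_job(docs)
```

In a real cluster the map and reduce steps run in parallel on many nodes, with the grouping handled by the framework's shuffle phase.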
  31. 31. Tweeting about MongoDC
      • Source: http://search.twitter.com/search.rss?q=mongodc
        – Who’s Tweeting?
        – What are they Tweeting?
        – What does basic document analysis of these Tweets tell us?
  32. 32. Who’s Tweeting about MongoDC?
  33. 33. How are Tweeters Connected?
  34. 34. What are they Tweeting About?
  35. 35. Sentiment?
  36. 36. Twitter has its Limits…
  37. 37. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com
