How IKANOW uses MongoDB to help organizations solve really big problems

  • 1,388 views
Published in Technology, Education

Transcript

  • 1. The Open Source document analysis platform: or, how IKANOW uses MongoDB to help organizations solve really big problems
  • 2. Agenda
    • What is Document Analysis?
    • The Infinit.e Solution
      – Infinit.e’s Architecture
      – Why and How we use MongoDB
    • Analyzing #MongoDC
    • Questions
  • 3. This is what Big Data Looks Like Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  • 4. What is Document Analysis? "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set." Source: http://www.text-tech.com/docanalysis/definition.html
  • 5. Document Analysis
    • Common document source formats: RSS, JSON, XML, HTML, PDF, TXT, RTF, Word, PPT, multimedia files, RDBMS records, etc.
  • 6. Document Analysis
    • The goal is to:
      – Extract Entities (people, places, things)
      – Create Associations between entities (in the form of noun-verb-noun), e.g.:
        • John Doe lives in Washington, D.C.
        • John Doe is married to Jane Doe
        • John Doe is a Virgo
        • John Doe traveled to Mexico on July 6th, 2011
    • And…
  • 7. Document Analysis
    • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
      – Who: people, organizations, facilities, company
      – When: past, present, future dates
      – What: events, summaries, facts, themes
      – Where: city, state, country, coordinate
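The entity and association model described on slides 6 and 7 can be sketched in a few lines of Python. This is only an illustration, not Infinit.e's actual implementation: the class names, the `DIMENSION_BY_TYPE` table, and the choice of entity types are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical mapping from an entity type to one of the four
# dimensions named on the slide (Who, What, When, Where).
DIMENSION_BY_TYPE = {
    "person": "Who",
    "organization": "Who",
    "event": "What",
    "date": "When",
    "city": "Where",
    "country": "Where",
}


@dataclass(frozen=True)
class Entity:
    name: str
    type: str

    @property
    def dimension(self) -> str:
        # Raises KeyError for a type the table does not know about.
        return DIMENSION_BY_TYPE[self.type.lower()]


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities."""
    subject: Entity
    verb: str
    obj: Entity

    def as_sentence(self) -> str:
        return f"{self.subject.name} {self.verb} {self.obj.name}"


john = Entity("John Doe", "Person")
washington = Entity("Washington, D.C.", "City")
lives_in = Association(john, "lives in", washington)
```

Here `lives_in.as_sentence()` reproduces the first example association from slide 6, while `john.dimension` and `washington.dimension` place the two entities on the Who and Where axes.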
  • 8. The Infinit.e Solution• Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood. github.com/ikanow/Infinit.e
  • 9. The Infinit.e Solution Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing and Visualizing Structured and Unstructured Documents
  • 10. IkanMeow
  • 11. Document Collection
    • Infinit.e harvests documents from:
      – URLs
      – File Shares
      – Databases
  • 12. Sample RSS Document
    <rss version="2.0">
      <channel>
        …
        <item>
          <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia…</title>
          <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
          <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most …</description>
          <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
          <dc:creator>unknown</dc:creator>
          <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
        </item>
        …
      </channel>
    </rss>
  • 13. Full Text Source
  • 14. Source Ingestion Data Flow
  • 15. Document DBs and Collections
  • 16. Document Metadata
    • doc_metadata.metadata
    {
        "_id" : ObjectId("4f93638e0cf212156d0559d2"),
        "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
        "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
        "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
        "created" : ISODate("2012-04-22T01:49:02Z"),
        "metadata" : {…},
        "associations" : […],
        "entities" : […],
        ...
    }
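As a rough illustration of how an RSS item like the one on slide 12 could end up in the shape shown on slide 16, here is a minimal Python sketch. It is not Infinit.e's harvester. The function name `item_to_document`, the decision to store the publisher under `metadata`, and the reading of `created` as the harvest time (rather than the feed's `dc:date`) are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Dublin Core namespace used by the dc:* elements in the feed.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
  <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...</title>
  <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
  <description>Report by egyptlastminute.com CAIRO: ...</description>
  <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
</item>
</channel>
</rss>"""


def item_to_document(item, harvested_at):
    """Map one <item> element to a dict shaped like the slide's document."""
    def text(tag):
        node = item.find(tag, DC_NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text("title"),
        "url": text("link"),
        "description": text("description"),
        "created": harvested_at,  # assumption: time of harvest
        "metadata": {"publisher": text("dc:publisher")},  # assumption
        "entities": [],       # filled in later by enrichment
        "associations": [],   # filled in later by enrichment
    }


harvest_time = datetime(2012, 4, 22, 1, 49, 2, tzinfo=timezone.utc)
root = ET.fromstring(SAMPLE_RSS)
docs = [item_to_document(item, harvest_time) for item in root.iter("item")]
```

The empty `entities` and `associations` lists stand in for the enrichment step described on later slides.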
  • 17. Harvested Document Metadata
    • doc_metadata.metadata.metadata
    Note: It is okay to laugh at this
    "metadata" : {
        "location" : [
            {
                "region" : "South Asia",
                "citystateprovince" : {
                    "stateprovince" : "Rolpa",
                    "city" : "Newang"
                },
                "country" : "Nepal"
            }
        ],
        "icn" : [ "200573487" ],
        "incidentdate" : [ "07/25/2005" ],
        "organization" : [ "Communist Party of Nepal (Maoist)/United Peoples Front" ],
        ...
    },
  • 18. Document Enrichment
    • Infinit.e supports the extraction of entities and creation of associations using a combination of built-in enrichment libraries and 3rd-party NLP APIs including:
  • 19. Harvested Entities
    • feature.entity
    {
        "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
        "alias" : [
            "Zine el Abidine Ben Ali",
            "Zine El Abidine Ben Ali",
            "Zine el Abidine ben Ali"
        ],
        "batch_resync" : true,
        "communityId" : ObjectId("4f8f138103644ee8003bf518"),
        "db_sync_doccount" : NumberLong(143),
        "db_sync_time" : "1338751174988",
        "dimension" : "Who",
        "disambiguated_name" : "Zine El Abidine Ben Ali",
        "doccount" : 152,
        "index" : "zine el abidine ben ali/person",
        "totalfreq" : 353,
        "type" : "Person"
    }
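In the sample above, the `index` value looks like the lower-cased `disambiguated_name` joined to the lower-cased `type` with a slash, and all three aliases collapse to the same string once case is ignored. The sketch below encodes that observation. It is inferred from the sample data only, and the function names are hypothetical, not Infinit.e APIs.

```python
def entity_index(disambiguated_name: str, entity_type: str) -> str:
    """Build an index key in the 'name/type' form seen in the sample."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def same_entity(alias_a: str, alias_b: str) -> bool:
    """Treat two aliases as the same name when they differ only by case."""
    return alias_a.lower() == alias_b.lower()


aliases = [
    "Zine el Abidine Ben Ali",
    "Zine El Abidine Ben Ali",
    "Zine el Abidine ben Ali",
]
key = entity_index("Zine El Abidine Ben Ali", "Person")
distinct_names = {alias.lower() for alias in aliases}
```

A real disambiguation step would need more than case folding (spelling variants, transliterations), so this only covers the case shown on the slide.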
  • 20. Harvested Entities
  • 21. Harvested Associations
    • feature.association
    {
        "_id" : ObjectId("4f9189d48baf188282a1ca24"),
        "assoc_type" : "Fact",
        "communityId" : ObjectId("4f8f138103644ee8003bf518"),
        "db_sync_doccount" : NumberLong(70),
        "db_sync_time" : "1338491609281",
        "doccount" : NumberLong(73),
        "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ],
        "entity1_index" : "zine el abidine ben ali/person",
        "entity2" : [ "president", "president/position" ],
        "entity2_index" : "president/position",
        "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
        "verb" : [ "career", "current", "past" ],
        "verb_category" : "career"
    }
  • 22. Harvested Associations
  • 23. Geolocation of Entities/Events
    • feature.geo
    {
        "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
        "search_field" : "cairo",
        "country" : "Egypt",
        "country_code" : "EG",
        "city" : "cairo",
        "region" : "Al Qahirah",
        "region_code" : "EG11",
        "population" : 7734602,
        "latitude" : "30.05",
        "longitude" : "31.25",
        "geoindex" : {
            "lat" : 30.05,
            "lon" : 31.25
        }
    }
    Note: MongoDB 2d Index (on "geoindex")
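The sample stores latitude and longitude twice: as strings at the top level and as numbers under `geoindex`, the field the slide marks as 2d-indexed. A minimal sketch of deriving the numeric pair from the strings follows. The helper name `to_geoindex` and the range check are assumptions for illustration, not something the slides describe.

```python
def to_geoindex(record: dict) -> dict:
    """Convert string latitude/longitude fields to a numeric sub-document."""
    lat = float(record["latitude"])
    lon = float(record["longitude"])
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError(f"coordinates out of range: {lat}, {lon}")
    return {"lat": lat, "lon": lon}


cairo = {"search_field": "cairo", "latitude": "30.05", "longitude": "31.25"}
cairo["geoindex"] = to_geoindex(cairo)

# With PyMongo, a 2d index on this field could then be declared along the
# lines of: collection.create_index([("geoindex", pymongo.GEO2D)])
# (not executed here, since it needs a running MongoDB server).
```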
  • 24. Geolocation of Entities/Events
  • 25. Who, What, Where and When
  • 26. Why MongoDB? – Reason #1: Document-Oriented Storage
    • MongoDB’s document-oriented storage (i.e., schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format.
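To make the "one unified, semi-structured format" idea concrete, the sketch below folds two differently shaped records into the same outer shape, keeping source-specific fields in a nested `metadata` sub-document as on slides 16 and 17. The `normalize` function, the `COMMON_FIELDS` tuple, and the two input records are hypothetical. Only the field names `icn`, `incidentdate` and `publisher` are taken from the slides.

```python
# Fields every normalized document carries, whatever its source.
COMMON_FIELDS = ("title", "url", "description")


def normalize(raw: dict) -> dict:
    """Keep common fields at the top level; move the rest under 'metadata'."""
    doc = {name: raw.get(name) for name in COMMON_FIELDS}
    doc["metadata"] = {k: v for k, v in raw.items() if k not in COMMON_FIELDS}
    return doc


# Hypothetical record harvested from an RSS feed.
rss_item = {
    "title": "Mediterranean conference seeks to flourish tourism ...",
    "url": "http://www.pressreleasebureau.com/",
    "description": "Report by egyptlastminute.com ...",
    "publisher": "Latest Press Releases | Press Release Bureau",
}

# Hypothetical record harvested from a database table.
db_row = {
    "title": "Incident record",
    "icn": ["200573487"],
    "incidentdate": ["07/25/2005"],
}

rss_doc = normalize(rss_item)
db_doc = normalize(db_row)
```

Both results share the same top-level keys, so they can live in one collection and be queried uniformly, while a schema-less store lets the `metadata` sub-documents differ freely.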
  • 27. Why MongoDB? – Reason #2: JSON
    • The standard language of open document analysis
      – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines
      – BSON (Binary JSON) is MongoDB’s native data format
      – Infinit.e ingests and exports JSON natively via the REST-based API
    Note: Infinit.e uses Google’s GSON Java library to convert JSON to POJOs and back
    This is the JSON logo
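The slide says Infinit.e uses GSON to move between JSON and plain Java objects. As a loose Python analogue of that round trip (not GSON itself, and not Infinit.e code), the sketch below converts a small typed object to JSON and back using only the standard library. The `EntityRecord` class and its fields are assumptions modelled on the entity sample from slide 19.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EntityRecord:
    """Hypothetical plain object mirroring a few entity fields."""
    disambiguated_name: str
    type: str
    dimension: str
    alias: List[str] = field(default_factory=list)


def to_json(record: EntityRecord) -> str:
    # sort_keys gives a stable string, which is handy for comparisons.
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> EntityRecord:
    return EntityRecord(**json.loads(text))


original = EntityRecord(
    disambiguated_name="Zine El Abidine Ben Ali",
    type="Person",
    dimension="Who",
    alias=["Zine el Abidine Ben Ali"],
)
round_tripped = from_json(to_json(original))
```

Because dataclasses compare by value, `round_tripped == original` holds when nothing is lost in the conversion.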
  • 28. Why MongoDB? – Reason #3: MongoDB Is Web Scale*
    *Shards are the secret ingredients in the web scale sauce. They just work.
  • 29. Why MongoDB? – Reason #3: Scalability
    • Seriously, MongoDB Scales
      – Harvesting and enriching documents requires a lot of disk space
      – MongoDB scales to arbitrary sizes in both read/write dimensions
      – Sophisticated sharding keys provide powerful/flexible balancing
    • BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
  • 30. Why MongoDB? – Reason #4: Integration with Apache Hadoop
    • Hadoop is rapidly becoming the de facto standard for data analytics
      – Open Source, very customizable
      – Proven scalability
      – Java libraries
    • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS
  • 31. Tweeting about MongoDC
    • Source: http://search.twitter.com/search.rss?q=mongodc
      – Who’s Tweeting?
      – What are they Tweeting?
      – What does basic document analysis of these Tweets tell us?
  • 32. Who’s Tweeting about MongoDC?
  • 33. How are Tweeters Connected?
  • 34. What are they Tweeting About?
  • 35. Sentiment?
  • 36. Twitter has its Limits…
  • 37. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com