SlideShare a Scribd company logo
The Open Source document analysis platform



  Or, how IKANOW uses
to help organizations solve really big problems
Agenda
• What is Document Analysis?
• The Infinit.e Solution
  – Infinit.e’s Architecture
  – Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
This is what Big Data Looks Like




                                          Shamelessly stolen from:
   http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
 "Document Analysis refers to
 computer-assisted analysis of large numbers
 of documents in order to answer questions
 about the content of a document set.”
 Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• Common document source formats:
RSS                JSON            XML

HTML               PDF             TXT

RTF                Word            PPT

Multimedia Files   RDBMS Records   ETC.
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
Who                                When
people, organizations,             past, present, future
facilities, company                dates

What                               Where
events, summaries,                 city, state, country,
facts, themes                      coordinate
The Infinit.e Solution
• Infinit.e is an Open Source
  document discovery and
  analysis platform that has
  these very cool Open Source
  tools lurking under the hood.


      github.com/ikanow/Infinit.e
The Infinit.e Solution

      Infinit.e is a
        scalable
    framework for                                           Visualizing
                                                Analyzing
                                   Retrieving
                       Enriching
             Storing
Collecting

                                        Structured and
                                   Unstructured Documents
IkanMeow
Document Collection
• Infinit.e harvests documents from:

  – URLs

  – File Shares

  – Databases
Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in-
    egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the             countries of the
     Mediterranean opened a conference seeking to enhance the             future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia the most …
    </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
Full Text Source
Source Ingestion Data Flow
Document DBs and Collections
Document Metadata
• doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-
    in-egypt-tunisia-report-by-egyptlastminute-com-13613.html"
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the
    Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    “metadata” : {…},
    "associations" : […],
    "entities" : […],
    ...
}
Harvested Document Metadata
• doc_metadata.metadata.metadata
"metadata" : {
     "location" : [
            {                                                          Note: It is okay to laugh at this
                   "region" : "South Asia",
                   "citystateprovince" : {
                          "stateprovince" : "Rolpa”, "city" : "Newang"
                   },
                   "country" : "Nepal"
            }
     ],
     "icn" : [ "200573487" ],
     "incidentdate" : [ "07/25/2005" ],
     "organization" : [
            "Communist Party of Nepal (Maoist)/United People's Front”
     ],
     ...
},
Document Enrichment
• Infinit.e supports the extraction of entities
  and creation of associations using a
  combination of built in enrichment libraries
  and 3rd party NLP APIs including:
Harvested Entities
• feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
           "Zine el Abidine Ben Ali",
           "Zine El Abidine Ben Ali",
           "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
Harvested Entities
Harvested Associations
• feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
           "zine el abidine ben ali",
           "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president”,"president/position”],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career”,"current”,"past”],
    "verb_category" : "career"
}
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
           "lat" : 30.05,
           "lon" : 31.25                            Note: MongoDB 2d Index
    }
}
Geolocation of Entities/Events
Who, What, Where and When
Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage
  (i.e. schema-less) is perfectly suited to the
  data design requirements of a system that
  needs to ingest a wide variety of
  structured and unstructured document
  formats and normalize them into one
  unified, semi-structured format
Why MongoDB? – Reason #2
JSON
• The standard language of open document
  analysis
  – JSON is a common interchange format supported
    by tools like elasticsearch and SaaS NLP engines
  – BSON (Binary JSON) is MongoDB’s native data
    format
  – Infinit.e ingests and exports JSON
    natively via the REST based API
    Note: Infinit.e uses Google’s GSON JAVA library to convert
    JSON to POJOs and back




                                               This is the JSON logo
Why MongoDB? – Reason #3
MongoDB Is Web Scale*




  *Shards are the secret ingredients in the web scale sauce. They just work.
Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
  – Harvesting and enriching documents requires
    a lot of disk space
  – MongoDB scales to arbitrary sizes in both
    read/write dimensions
  – Sophisticated sharding keys provide
    powerful/flexible balancing
   BUT building an initial cluster can be complex
    and managing cluster changes is “fiddly”
Why MongoDB? – Reason #4
Integration with Apache Hadoop
•   Hadoop is rapidly becoming the de-facto standard for
    data analytics
     – Open Source, very customizable
     – Proven scalability
     – Java libraries
•   The MongoDB Hadoop Adaptor allows Hadoop to read
    from and write to MongoDB instead of HDFS

                  +                 =
Tweeting about MongoDC
• Source:
  http://search.twitter.com/search.rss?q=mongodc
   – Who’s Tweeting?
   – What are they Tweeting?
   – What does basic document analysis of these
     Tweets tell us?
Who’s Tweeting about MongoDC?
How are Tweeter’s Connected?
What are they Tweeting About?
Sentiment?
Twitter has its Limits…
Thank You!

             Craig Vitter



         www.ikanow.com
        cvitter@ikanow.com

More Related Content

Similar to How IKANOW uses MongoDB to help organizations solve really big problems

What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
Andrea Volpini
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDB
MongoDB
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
OpenAIRE
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
SpazioDati
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko Grobelnik
Marko Grobelnik
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
Joanne Klein
 
ElasticSearch - index server used as a document database
ElasticSearch - index server used as a document databaseElasticSearch - index server used as a document database
ElasticSearch - index server used as a document database
Robert Lujo
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
Sebastian Ryszard Kruk
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
National Information Standards Organization (NISO)
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
MongoDB
 
Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"
National Information Standards Organization (NISO)
 
Digital Content Management
Digital Content ManagementDigital Content Management
Wiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School PkuWiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School Pku
wiser pku
 
Wiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School PkuWiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School Pku
guest8ed46d
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social Semantics
John Breslin
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integration
keepingfoundthingsfound
 
LOD2 Webinar: SIREn
LOD2 Webinar: SIREnLOD2 Webinar: SIREn
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
Deep Kayal
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries
mdabrowski
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
DESTIN-Informatique.com
 

Similar to How IKANOW uses MongoDB to help organizations solve really big problems (20)

What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDB
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko Grobelnik
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
 
ElasticSearch - index server used as a document database
ElasticSearch - index server used as a document databaseElasticSearch - index server used as a document database
ElasticSearch - index server used as a document database
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"
 
Digital Content Management
Digital Content ManagementDigital Content Management
Digital Content Management
 
Wiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School PkuWiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School Pku
 
Wiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School PkuWiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School Pku
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social Semantics
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integration
 
LOD2 Webinar: SIREn
LOD2 Webinar: SIREnLOD2 Webinar: SIREn
LOD2 Webinar: SIREn
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 

More from ikanow

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
ikanow
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014
ikanow
 
Open Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media AnalysisOpen Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media Analysis
ikanow
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
ikanow
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS
ikanow
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
ikanow
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetup
ikanow
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?
ikanow
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analytics
ikanow
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Data
ikanow
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysis
ikanow
 

More from ikanow (11)

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014
 
Open Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media AnalysisOpen Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media Analysis
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetup
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analytics
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Data
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysis
 

Recently uploaded

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Safe Software
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 

Recently uploaded (20)

Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Driving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success StoryDriving Business Innovation: Latest Generative AI Advancements & Success Story
Driving Business Innovation: Latest Generative AI Advancements & Success Story
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 

How IKANOW uses MongoDB to help organizations solve really big problems

  • 1. The Open Source document analysis platform Or, how IKANOW uses to help organizations solve really big problems
  • 2. Agenda • What is Document Analysis? • The Infinit.e Solution – Infinit.e’s Architecture – Why and How we use MongoDB • Analyzing #MongoDC • Questions
  • 3. This is what Big Data Looks Like Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  • 4. What is Document Analysis? "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.” Source: http://www.text-tech.com/docanalysis/definition.html
  • 5. Document Analysis • Common document source formats: RSS JSON XML HTML PDF TXT RTF Word PPT Multimedia Files RDBMS Records ETC.
  • 6. Document Analysis • The goal is to: – Extract Entities (people, places, things) – Create Associations between entities (in the form of noun-verb-noun), e.g.: • John Doe lives in Washington, D.C • John Doe is married to Jane Doe • John Doe is a Virgo • John Doe traveled to Mexico on July 6th, 2011 • And…
  • 7. Document Analysis • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization. Who When people, organizations, past, present, future facilities, company dates What Where events, summaries, city, state, country, facts, themes coordinate
  • 8. The Infinit.e Solution • Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood. github.com/ikanow/Infinit.e
  • 9. The Infinit.e Solution Infinit.e is a scalable framework for Visualizing Analyzing Retrieving Enriching Storing Collecting Structured and Unstructured Documents
  • 11. Document Collection • Infinit.e harvests documents from: – URLs – File Shares – Databases
  • 12. Sample RSS Document <rss version="2.0"> <channel> … <item> <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title> <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in- egypt-tunisia-report-by-egyptlastminute-com-13613.html</link> <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description> <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher> <dc:creator>unknown</dc:creator> <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date> </item> … </channel> </rss>
  • 15. Document DBs and Collections
  • 16. Document Metadata • doc_metadata.metadata { "_id" : ObjectId("4f93638e0cf212156d0559d2"), "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...", "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism- in-egypt-tunisia-report-by-egyptlastminute-com-13613.html" "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...", "created" : ISODate("2012-04-22T01:49:02Z"), “metadata” : {…}, "associations" : […], "entities" : […], ... }
  • 17. Harvested Document Metadata • doc_metadata.metadata.metadata "metadata" : { "location" : [ { Note: It is okay to laugh at this "region" : "South Asia", "citystateprovince" : { "stateprovince" : "Rolpa”, "city" : "Newang" }, "country" : "Nepal" } ], "icn" : [ "200573487" ], "incidentdate" : [ "07/25/2005" ], "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front” ], ... },
  • 18. Document Enrichment • Infinit.e supports the extraction of entities and creation of associations using a combination of built in enrichment libraries and 3rd party NLP APIs including:
  • 19. Harvested Entities • feature.entity { "_id" : ObjectId("4f9189d48baf188282a1c9ef"), "alias" : [ "Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Zine el Abidine ben Ali" ], "batch_resync" : true, "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(143), "db_sync_time" : "1338751174988", "dimension" : "Who", "disambiguated_name" : "Zine El Abidine Ben Ali", "doccount" : 152, "index" : "zine el abidine ben ali/person", "totalfreq" : 353, "type" : "Person" }
  • 21. Harvested Associations • feature.association { "_id" : ObjectId("4f9189d48baf188282a1ca24"), "assoc_type" : "Fact", "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(70), "db_sync_time" : "1338491609281", "doccount" : NumberLong(73), "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ], "entity1_index" : "zine el abidine ben ali/person", "entity2" : ["president”,"president/position”], "entity2_index" : "president/position", "index" : "5e3fff27ddb78d6873ccfc77cf05c52f", "verb" : ["career”,"current”,"past”], "verb_category" : "career" }
  • 23. Geolocation of Entities/Events • feature.geo { "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"), "search_field" : "cairo", "country" : "Egypt", "country_code" : "EG", "city" : "cairo", "region" : "Al Qahirah", "region_code" : "EG11", "population" : 7734602, "latitude" : "30.05", "longitude" : "31.25", "geoindex" : { "lat" : 30.05, "lon" : 31.25 Note: MongoDB 2d Index } }
  • 25. Who, What, Where and When
  • 26. Why MongoDB? – Reason #1 Document-Oriented Storage • MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
  • 27. Why MongoDB? – Reason #2 JSON • The standard language of open document analysis – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines – BSON (Binary JSON) is MongoDB’s native data format – Infinit.e ingests and exports JSON natively via the REST based API Note: Infinit.e uses Google’s GSON JAVA library to convert JSON to POJOs and back This is the JSON logo
  • 28. Why MongoDB? – Reason #3 MongoDB Is Web Scale* *Shards are the secret ingredients in the web scale sauce. They just work.
  • 29. Why MongoDB? – Reason #3 Scalability • Seriously, MongoDB Scales – Harvesting and enriching documents requires a lot of disk space – MongoDB scales to arbitrary sizes in both read/write dimensions – Sophisticated sharding keys provide powerful/flexible balancing  BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
  • 30. Why MongoDB? – Reason #4 Integration with Apache Hadoop • Hadoop is rapidly becoming the de-facto standard for data analytics – Open Source, very customizable – Proven scalability – Java libraries • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS + =
  • 31. Tweeting about MongoDC • Source: http://search.twitter.com/search.rss?q=mongodc – Who’s Tweeting? – What are they Tweeting? – What does basic document analysis of these Tweets tell us?
  • 33. How are Tweeter’s Connected?
  • 34. What are they Tweeting About?
  • 36. Twitter has its Limits…
  • 37. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com