SlideShare a Scribd company logo
1 of 37
The Open Source document analysis platform



  Or, how IKANOW uses
to help organizations solve really big problems
Agenda
• What is Document Analysis?
• The Infinit.e Solution
  – Infinit.e’s Architecture
  – Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
This is what Big Data Looks Like




                                          Shamelessly stolen from:
   http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
 "Document Analysis refers to
 computer-assisted analysis of large numbers
 of documents in order to answer questions
 about the content of a document set.”
 Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• Common document source formats:
RSS                JSON            XML

HTML               PDF             TXT

RTF                Word            PPT

Multimedia Files   RDBMS Records   ETC.
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
Who                                When
people, organizations,             past, present, future
facilities, company                dates

What                               Where
events, summaries,                 city, state, country,
facts, themes                      coordinate
The Infinit.e Solution
• Infinit.e is an Open Source
  document discovery and
  analysis platform that has
  these very cool Open Source
  tools lurking under the hood.


      github.com/ikanow/Infinit.e
The Infinit.e Solution

      Infinit.e is a
        scalable
    framework for                                           Visualizing
                                                Analyzing
                                   Retrieving
                       Enriching
             Storing
Collecting

                                        Structured and
                                   Unstructured Documents
IkanMeow
Document Collection
• Infinit.e harvests documents from:

  – URLs

  – File Shares

  – Databases
Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in-
    egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the             countries of the
     Mediterranean opened a conference seeking to enhance the             future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia the most …
    </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
Full Text Source
Source Ingestion Data Flow
Document DBs and Collections
Document Metadata
• doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-
    in-egypt-tunisia-report-by-egyptlastminute-com-13613.html"
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the
    Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    “metadata” : {…},
    "associations" : […],
    "entities" : […],
    ...
}
Harvested Document Metadata
• doc_metadata.metadata.metadata
"metadata" : {
     "location" : [
            {                                                          Note: It is okay to laugh at this
                   "region" : "South Asia",
                   "citystateprovince" : {
                          "stateprovince" : "Rolpa”, "city" : "Newang"
                   },
                   "country" : "Nepal"
            }
     ],
     "icn" : [ "200573487" ],
     "incidentdate" : [ "07/25/2005" ],
     "organization" : [
            "Communist Party of Nepal (Maoist)/United People's Front”
     ],
     ...
},
Document Enrichment
• Infinit.e supports the extraction of entities
  and creation of associations using a
  combination of built in enrichment libraries
  and 3rd party NLP APIs including:
Harvested Entities
• feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
           "Zine el Abidine Ben Ali",
           "Zine El Abidine Ben Ali",
           "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
Harvested Entities
Harvested Associations
• feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
           "zine el abidine ben ali",
           "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president”,"president/position”],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career”,"current”,"past”],
    "verb_category" : "career"
}
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
           "lat" : 30.05,
           "lon" : 31.25                            Note: MongoDB 2d Index
    }
}
Geolocation of Entities/Events
Who, What, Where and When
Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage
  (i.e. schema-less) is perfectly suited to the
  data design requirements of a system that
  needs to ingest a wide variety of
  structured and unstructured document
  formats and normalize them into one
  unified, semi-structured format
Why MongoDB? – Reason #2
JSON
• The standard language of open document
  analysis
  – JSON is a common interchange format supported
    by tools like elasticsearch and SaaS NLP engines
  – BSON (Binary JSON) is MongoDB’s native data
    format
  – Infinit.e ingests and exports JSON
    natively via the REST based API
    Note: Infinit.e uses Google’s GSON JAVA library to convert
    JSON to POJOs and back




                                               This is the JSON logo
Why MongoDB? – Reason #3
MongoDB Is Web Scale*




  *Shards are the secret ingredients in the web scale sauce. They just work.
Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
  – Harvesting and enriching documents requires
    a lot of disk space
  – MongoDB scales to arbitrary sizes in both
    read/write dimensions
  – Sophisticated sharding keys provide
    powerful/flexible balancing
   BUT building an initial cluster can be complex
    and managing cluster changes is “fiddly”
Why MongoDB? – Reason #4
Integration with Apache Hadoop
•   Hadoop is rapidly becoming the de-facto standard for
    data analytics
     – Open Source, very customizable
     – Proven scalability
     – Java libraries
•   The MongoDB Hadoop Adaptor allows Hadoop to read
    from and write to MongoDB instead of HDFS

                  +                 =
Tweeting about MongoDC
• Source:
  http://search.twitter.com/search.rss?q=mongodc
   – Who’s Tweeting?
   – What are they Tweeting?
   – What does basic document analysis of these
     Tweets tell us?
Who’s Tweeting about MongoDC?
How are Tweeter’s Connected?
What are they Tweeting About?
Sentiment?
Twitter has its Limits…
Thank You!

             Craig Vitter



         www.ikanow.com
        cvitter@ikanow.com

More Related Content

Similar to How IKANOW uses MongoDB to help organizations solve really big problems

What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us? Andrea Volpini
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBMongoDB
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarSpazioDati
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikMarko Grobelnik
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointJoanne Klein
 
ElasticSearch - index server used as a document database
ElasticSearch - index server used as a document databaseElasticSearch - index server used as a document database
ElasticSearch - index server used as a document databaseRobert Lujo
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBMongoDB
 
Wiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School PkuWiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School Pkuwiser pku
 
Wiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School PkuWiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School Pkuguest8ed46d
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsJohn Breslin
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integrationkeepingfoundthingsfound
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteDeep Kayal
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries mdabrowski
 

Similar to How IKANOW uses MongoDB to help organizations solve really big problems (20)

What do we want computers to do for us?
What do we want computers to do for us? What do we want computers to do for us?
What do we want computers to do for us?
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDB
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
 
Global Media Monitor - Marko Grobelnik
Global Media Monitor - Marko GrobelnikGlobal Media Monitor - Marko Grobelnik
Global Media Monitor - Marko Grobelnik
 
Navigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePointNavigating the Mess of a Shared drive Migration to SharePoint
Navigating the Mess of a Shared drive Migration to SharePoint
 
ElasticSearch - index server used as a document database
ElasticSearch - index server used as a document databaseElasticSearch - index server used as a document database
ElasticSearch - index server used as a document database
 
Semantic Web in Action
Semantic Web in ActionSemantic Web in Action
Semantic Web in Action
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Dev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDBDev Jumpstart: Build Your First App with MongoDB
Dev Jumpstart: Build Your First App with MongoDB
 
Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"
 
Digital Content Management
Digital Content ManagementDigital Content Management
Digital Content Management
 
Wiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School PkuWiserpku Lecture@Life Science School Pku
Wiserpku Lecture@Life Science School Pku
 
Wiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School PkuWiser Pku Lecture@Life Science School Pku
Wiser Pku Lecture@Life Science School Pku
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social Semantics
 
ItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information IntegrationItemMirror, XML & The Promise of Information Integration
ItemMirror, XML & The Promise of Information Integration
 
LOD2 Webinar: SIREn
LOD2 Webinar: SIREnLOD2 Webinar: SIREn
LOD2 Webinar: SIREn
 
Information Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ DeloitteInformation Extraction from Text, presented @ Deloitte
Information Extraction from Text, presented @ Deloitte
 
Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries Geo-annotations in Semantic Digital Libraries
Geo-annotations in Semantic Digital Libraries
 
Publishing Linked Data using Schema.org
Publishing Linked Data using Schema.orgPublishing Linked Data using Schema.org
Publishing Linked Data using Schema.org
 

More from ikanow

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Dataikanow
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014ikanow
 
Open Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media AnalysisOpen Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media Analysisikanow
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computingikanow
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS ikanow
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetupikanow
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?ikanow
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analyticsikanow
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Dataikanow
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysisikanow
 

More from ikanow (11)

Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big DataAliasing Use Cases - How to Use IKANOW to Crunch Big Data
Aliasing Use Cases - How to Use IKANOW to Crunch Big Data
 
Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014
 
Open Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media AnalysisOpen Analytics: Building Effective Frameworks for Social Media Analysis
Open Analytics: Building Effective Frameworks for Social Media Analysis
 
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud ComputingDr. Michael Valivullah, NASS/USDA - Cloud Computing
Dr. Michael Valivullah, NASS/USDA - Cloud Computing
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetup
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analytics
 
Social Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big DataSocial Intelligence: Realizing Business Value in Big Data
Social Intelligence: Realizing Business Value in Big Data
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysis
 

Recently uploaded

How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfSrushith Repakula
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGDSC PJATK
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FIDO Alliance
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jNeo4j
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfFIDO Alliance
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandIES VE
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfFIDO Alliance
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 

Recently uploaded (20)

How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdfLinux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
Linux Foundation Edge _ Overview of FDO Software Components _ Randy at Intel.pdf
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Using IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & IrelandUsing IESVE for Room Loads Analysis - UK & Ireland
Using IESVE for Room Loads Analysis - UK & Ireland
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 

How IKANOW uses MongoDB to help organizations solve really big problems

  • 1. The Open Source document analysis platform Or, how IKANOW uses to help organizations solve really big problems
  • 2. Agenda • What is Document Analysis? • The Infinit.e Solution – Infinit.e’s Architecture – Why and How we use MongoDB • Analyzing #MongoDC • Questions
  • 3. This is what Big Data Looks Like Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  • 4. What is Document Analysis? "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.” Source: http://www.text-tech.com/docanalysis/definition.html
  • 5. Document Analysis • Common document source formats: RSS JSON XML HTML PDF TXT RTF Word PPT Multimedia Files RDBMS Records ETC.
  • 6. Document Analysis • The goal is to: – Extract Entities (people, places, things) – Create Associations between entities (in the form of noun-verb-noun), e.g.: • John Doe lives in Washington, D.C • John Doe is married to Jane Doe • John Doe is a Virgo • John Doe traveled to Mexico on July 6th, 2011 • And…
  • 7. Document Analysis • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization. Who When people, organizations, past, present, future facilities, company dates What Where events, summaries, city, state, country, facts, themes coordinate
  • 8. The Infinit.e Solution • Infinit.e is an Open Source document discovery and analysis platform that has these very cool Open Source tools lurking under the hood. github.com/ikanow/Infinit.e
  • 9. The Infinit.e Solution Infinit.e is a scalable framework for Visualizing Analyzing Retrieving Enriching Storing Collecting Structured and Unstructured Documents
  • 11. Document Collection • Infinit.e harvests documents from: – URLs – File Shares – Databases
  • 12. Sample RSS Document <rss version="2.0"> <channel> … <item> <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title> <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in- egypt-tunisia-report-by-egyptlastminute-com-13613.html</link> <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description> <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher> <dc:creator>unknown</dc:creator> <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date> </item> … </channel> </rss>
  • 15. Document DBs and Collections
  • 16. Document Metadata • doc_metadata.metadata { "_id" : ObjectId("4f93638e0cf212156d0559d2"), "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...", "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism- in-egypt-tunisia-report-by-egyptlastminute-com-13613.html" "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...", "created" : ISODate("2012-04-22T01:49:02Z"), “metadata” : {…}, "associations" : […], "entities" : […], ... }
  • 17. Harvested Document Metadata • doc_metadata.metadata.metadata "metadata" : { "location" : [ { Note: It is okay to laugh at this "region" : "South Asia", "citystateprovince" : { "stateprovince" : "Rolpa”, "city" : "Newang" }, "country" : "Nepal" } ], "icn" : [ "200573487" ], "incidentdate" : [ "07/25/2005" ], "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front” ], ... },
  • 18. Document Enrichment • Infinit.e supports the extraction of entities and creation of associations using a combination of built in enrichment libraries and 3rd party NLP APIs including:
  • 19. Harvested Entities • feature.entity { "_id" : ObjectId("4f9189d48baf188282a1c9ef"), "alias" : [ "Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Zine el Abidine ben Ali" ], "batch_resync" : true, "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(143), "db_sync_time" : "1338751174988", "dimension" : "Who", "disambiguated_name" : "Zine El Abidine Ben Ali", "doccount" : 152, "index" : "zine el abidine ben ali/person", "totalfreq" : 353, "type" : "Person" }
  • 21. Harvested Associations • feature.association { "_id" : ObjectId("4f9189d48baf188282a1ca24"), "assoc_type" : "Fact", "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(70), "db_sync_time" : "1338491609281", "doccount" : NumberLong(73), "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ], "entity1_index" : "zine el abidine ben ali/person", "entity2" : ["president”,"president/position”], "entity2_index" : "president/position", "index" : "5e3fff27ddb78d6873ccfc77cf05c52f", "verb" : ["career”,"current”,"past”], "verb_category" : "career" }
  • 23. Geolocation of Entities/Events • feature.geo { "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"), "search_field" : "cairo", "country" : "Egypt", "country_code" : "EG", "city" : "cairo", "region" : "Al Qahirah", "region_code" : "EG11", "population" : 7734602, "latitude" : "30.05", "longitude" : "31.25", "geoindex" : { "lat" : 30.05, "lon" : 31.25 Note: MongoDB 2d Index } }
  • 25. Who, What, Where and When
  • 26. Why MongoDB? – Reason #1 Document-Oriented Storage • MongoDB’s document-oriented storage (i.e. schema-less) is perfectly suited to the data design requirements of a system that needs to ingest a wide variety of structured and unstructured document formats and normalize them into one unified, semi-structured format
  • 27. Why MongoDB? – Reason #2 JSON • The standard language of open document analysis – JSON is a common interchange format supported by tools like elasticsearch and SaaS NLP engines – BSON (Binary JSON) is MongoDB’s native data format – Infinit.e ingests and exports JSON natively via the REST based API Note: Infinit.e uses Google’s GSON JAVA library to convert JSON to POJOs and back This is the JSON logo
  • 28. Why MongoDB? – Reason #3 MongoDB Is Web Scale* *Shards are the secret ingredients in the web scale sauce. They just work.
  • 29. Why MongoDB? – Reason #3 Scalability • Seriously, MongoDB Scales – Harvesting and enriching documents requires a lot of disk space – MongoDB scales to arbitrary sizes in both read/write dimensions – Sophisticated sharding keys provide powerful/flexible balancing  BUT building an initial cluster can be complex and managing cluster changes is “fiddly”
  • 30. Why MongoDB? – Reason #4 Integration with Apache Hadoop • Hadoop is rapidly becoming the de-facto standard for data analytics – Open Source, very customizable – Proven scalability – Java libraries • The MongoDB Hadoop Adaptor allows Hadoop to read from and write to MongoDB instead of HDFS + =
  • 31. Tweeting about MongoDC • Source: http://search.twitter.com/search.rss?q=mongodc – Who’s Tweeting? – What are they Tweeting? – What does basic document analysis of these Tweets tell us?
  • 33. How are Tweeter’s Connected?
  • 34. What are they Tweeting About?
  • 36. Twitter has its Limits…
  • 37. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com