SlideShare a Scribd company logo
1 of 24
Document Analysis and Big Data
    Making Sense out of the Flood
Agenda
• Define Big Data and Document Analysis
• The Infinit.e Solution
• Questions
What is Big Data?
 “Big data is a term applied to data sets
 whose size is beyond the ability of commonly
 used software tools to capture, manage, and
 process the data within a tolerable elapsed
 time.”
 Source: http://en.wikipedia.org/wiki/Big_data
This is what Big Data Feels Like




                                          Shamelessly stolen from:
   http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
 "Document Analysis refers to
 computer-assisted analysis of large numbers
 of documents in order to answer questions
 about the content of a document set.”
 Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
Who                                When
people, organizations,             past, present, future
facilities, company                dates

What                               Where
events, summaries,                 city, state, country,
facts, themes                      coordinate
The Infinit.e Solution
• Infinit.e is an Open Source
  document discovery and
  analysis platform that has
  these very cool open source
  tools lurking under the hood.


       github.com/ikanow/Infinit.e
The Infinit.e Solution

      Infinit.e is a
        scalable
    framework for                                           Visualizing
                                                Analyzing
                                   Retrieving
                       Enriching
             Storing
Collecting

                                        Structured and
                                   Unstructured Documents
Harvesting
• Infinit.e’s harvester:
  – Collects documents for specified data sources
    (URLs, RDBMs via JDBC, file shares)
  – Marshalls each document through the
    enrichment process
  – Saves each metadata document, entity, and
    association created to MongoDB
Source Ingestion Data Flow
Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in-
    egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the             countries of the
     Mediterranean opened a conference seeking to enhance the             future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia the most …
    </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
Full Text Source
Document Metadata
• doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-
    in-egypt-tunisia-report-by-egyptlastminute-com-13613.html"
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the
    Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    “metadata” : {…},
    "associations" : […],
    "entities" : […],
    ...
}
Harvested Document Metadata
• document.metadata
"metadata" : {
     "location" : [
            {
                   "region" : "South Asia",
                   "citystateprovince" : {
                          "stateprovince" : "Rolpa”, "city" : "Newang"
                   },
                   "country" : "Nepal"
            }
     ],
     "icn" : [ "200573487" ],
     "incidentdate" : [ "07/25/2005" ],
     "organization" : [
            "Communist Party of Nepal (Maoist)/United People's Front”
     ],
     ...
},
Document Enrichment
• Infinit.e supports the extraction of entities
  and creation of associations using a
  combination of built in enrichment libraries
  and 3rd party NLP APIs including:
Harvested Entities
• feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
           "Zine el Abidine Ben Ali",
           "Zine El Abidine Ben Ali",
           "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
Harvested Entities
Harvested Associations
• feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
           "zine el abidine ben ali",
           "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president”,"president/position”],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career”,"current”,"past”],
    "verb_category" : "career"
}
Harvested Associations
Geolocation of Entities/Events
• feature.geo
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
           "lon" : 31.25,
           "lat" : 30.05                            Note: MongoDB 2d Index
    }
}
Geolocation of Entities/Events
Who, What, Where and When
Thank You!

             Craig Vitter



         www.ikanow.com
        cvitter@ikanow.com
        github.com/ikanow/Infinit.e

More Related Content

Similar to Open Analytics DC June 2012 Presentation

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studydeep.bi
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBMongoDB
 
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYAmit Sheth
 
lawTechCamp - Knowledge Management Panel
lawTechCamp - Knowledge Management PanellawTechCamp - Knowledge Management Panel
lawTechCamp - Knowledge Management Panellawtechcamp
 
Users as Data
Users as DataUsers as Data
Users as Datapdingles
 
Reading Group 2013 (DERI NUIG)
Reading Group 2013 (DERI NUIG)Reading Group 2013 (DERI NUIG)
Reading Group 2013 (DERI NUIG)Bianca Pereira
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic searchEdgar Meij
 
Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)ibwhite
 
Test Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsTest Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsHugh McCamphill
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...OpenAIRE
 
A fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainA fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainChristian Martorella
 
Augmenting the World using Semantic Web Technologies
Augmenting the World using Semantic Web TechnologiesAugmenting the World using Semantic Web Technologies
Augmenting the World using Semantic Web TechnologiesJens Grubert
 
Searching Relational Data with Elasticsearch
Searching Relational Data with ElasticsearchSearching Relational Data with Elasticsearch
Searching Relational Data with Elasticsearchsirensolutions
 
Montreal Elasticsearch Meetup
Montreal Elasticsearch MeetupMontreal Elasticsearch Meetup
Montreal Elasticsearch MeetupLoïc Bertron
 
MongoDB In Production At Sailthru
MongoDB In Production At SailthruMongoDB In Production At Sailthru
MongoDB In Production At Sailthruibwhite
 
Open Source Information Gathering Brucon Edition
Open Source Information Gathering Brucon EditionOpen Source Information Gathering Brucon Edition
Open Source Information Gathering Brucon EditionChris Gates
 

Similar to Open Analytics DC June 2012 Presentation (20)

Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Webinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDBWebinar: Building Your First Application with MongoDB
Webinar: Building Your First Application with MongoDB
 
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
 
lawTechCamp - Knowledge Management Panel
lawTechCamp - Knowledge Management PanellawTechCamp - Knowledge Management Panel
lawTechCamp - Knowledge Management Panel
 
LOD2 Webinar: SIREn
LOD2 Webinar: SIREnLOD2 Webinar: SIREn
LOD2 Webinar: SIREn
 
Users as Data
Users as DataUsers as Data
Users as Data
 
Reading Group 2013 (DERI NUIG)
Reading Group 2013 (DERI NUIG)Reading Group 2013 (DERI NUIG)
Reading Group 2013 (DERI NUIG)
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic search
 
Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)Mongo at Sailthru (MongoNYC 2011)
Mongo at Sailthru (MongoNYC 2011)
 
Test Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely testsTest Trend Analysis : Towards robust, reliable and timely tests
Test Trend Analysis : Towards robust, reliable and timely tests
 
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
DataCite – Bridging the gap and helping to find, access and reuse data – Herb...
 
A fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP SpainA fresh new look into Information Gathering - OWASP Spain
A fresh new look into Information Gathering - OWASP Spain
 
Augmenting the World using Semantic Web Technologies
Augmenting the World using Semantic Web TechnologiesAugmenting the World using Semantic Web Technologies
Augmenting the World using Semantic Web Technologies
 
Searching Relational Data with Elasticsearch
Searching Relational Data with ElasticsearchSearching Relational Data with Elasticsearch
Searching Relational Data with Elasticsearch
 
Montreal Elasticsearch Meetup
Montreal Elasticsearch MeetupMontreal Elasticsearch Meetup
Montreal Elasticsearch Meetup
 
Line,,NATIONAL SEMINAR ORGANIZED BY KULISAA 15.01.2015
Line,,NATIONAL SEMINAR ORGANIZED BY KULISAA 15.01.2015Line,,NATIONAL SEMINAR ORGANIZED BY KULISAA 15.01.2015
Line,,NATIONAL SEMINAR ORGANIZED BY KULISAA 15.01.2015
 
MongoDB In Production At Sailthru
MongoDB In Production At SailthruMongoDB In Production At Sailthru
MongoDB In Production At Sailthru
 
Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"Krnarich "Assessing Contribution & Value"
Krnarich "Assessing Contribution & Value"
 
Open Source Information Gathering Brucon Edition
Open Source Information Gathering Brucon EditionOpen Source Information Gathering Brucon Edition
Open Source Information Gathering Brucon Edition
 

More from ikanow

Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014ikanow
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS ikanow
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetupikanow
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?ikanow
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analyticsikanow
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysisikanow
 

More from ikanow (7)

Mongo db washington dc 2014
Mongo db washington dc 2014Mongo db washington dc 2014
Mongo db washington dc 2014
 
Cloud computing with AWS
Cloud computing with AWS Cloud computing with AWS
Cloud computing with AWS
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Open Analytics DC April 2012 Meetup
Open Analytics DC April 2012 MeetupOpen Analytics DC April 2012 Meetup
Open Analytics DC April 2012 Meetup
 
Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?Hadoop MapReduce - I'm Sold, Now What?
Hadoop MapReduce - I'm Sold, Now What?
 
Agile intelligence through Open Analytics
Agile intelligence through Open AnalyticsAgile intelligence through Open Analytics
Agile intelligence through Open Analytics
 
Value Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs AnalysisValue Mining: How Entity Extraction Informs Analysis
Value Mining: How Entity Extraction Informs Analysis
 

Recently uploaded

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Recently uploaded (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Open Analytics DC June 2012 Presentation

  • 1. Document Analysis and Big Data Making Sense out of the Flood
  • 2. Agenda • Define Big Data and Document Analysis • The Infinit.e Solution • Questions
  • 3. What is Big Data? “Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.” Source: http://en.wikipedia.org/wiki/Big_data
  • 4. This is what Big Data Feels Like Shamelessly stolen from: http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
  • 5. What is Document Analysis? "Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.” Source: http://www.text-tech.com/docanalysis/definition.html
  • 6. Document Analysis • The goal is to: – Extract Entities (people, places, things) – Create Associations between entities (in the form of noun-verb-noun), e.g.: • John Doe lives in Washington, D.C • John Doe is married to Jane Doe • John Doe is a Virgo • John Doe traveled to Mexico on July 6th, 2011 • And…
  • 7. Document Analysis • Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization. Who When people, organizations, past, present, future facilities, company dates What Where events, summaries, city, state, country, facts, themes coordinate
  • 8. The Infinit.e Solution • Infinit.e is an Open Source document discovery and analysis platform that has these very cool open source tools lurking under the hood. github.com/ikanow/Infinit.e
  • 9. The Infinit.e Solution Infinit.e is a scalable framework for Visualizing Analyzing Retrieving Enriching Storing Collecting Structured and Unstructured Documents
  • 10. Harvesting • Infinit.e’s harvester: – Collects documents for specified data sources (URLs, RDBMs via JDBC, file shares) – Marshalls each document through the enrichment process – Saves each metadata document, entity, and association created to MongoDB
  • 12. Sample RSS Document <rss version="2.0"> <channel> … <item> <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title> <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish- tourism-in- egypt-tunisia-report-by-egyptlastminute-com-13613.html</link> <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia the most … </description> <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher> <dc:creator>unknown</dc:creator> <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date> </item> … </channel> </rss>
  • 14. Document Metadata • doc_metadata.metadata { "_id" : ObjectId("4f93638e0cf212156d0559d2"), "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...", "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism- in-egypt-tunisia-report-by-egyptlastminute-com-13613.html" "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...", "created" : ISODate("2012-04-22T01:49:02Z"), “metadata” : {…}, "associations" : […], "entities" : […], ... }
  • 15. Harvested Document Metadata • document.metadata "metadata" : { "location" : [ { "region" : "South Asia", "citystateprovince" : { "stateprovince" : "Rolpa”, "city" : "Newang" }, "country" : "Nepal" } ], "icn" : [ "200573487" ], "incidentdate" : [ "07/25/2005" ], "organization" : [ "Communist Party of Nepal (Maoist)/United People's Front” ], ... },
  • 16. Document Enrichment • Infinit.e supports the extraction of entities and creation of associations using a combination of built in enrichment libraries and 3rd party NLP APIs including:
  • 17. Harvested Entities • feature.entity { "_id" : ObjectId("4f9189d48baf188282a1c9ef"), "alias" : [ "Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Zine el Abidine ben Ali" ], "batch_resync" : true, "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(143), "db_sync_time" : "1338751174988", "dimension" : "Who", "disambiguated_name" : "Zine El Abidine Ben Ali", "doccount" : 152, "index" : "zine el abidine ben ali/person", "totalfreq" : 353, "type" : "Person" }
  • 19. Harvested Associations • feature.association { "_id" : ObjectId("4f9189d48baf188282a1ca24"), "assoc_type" : "Fact", "communityId" : ObjectId("4f8f138103644ee8003bf518"), "db_sync_doccount" : NumberLong(70), "db_sync_time" : "1338491609281", "doccount" : NumberLong(73), "entity1" : [ "zine el abidine ben ali", "zine el abidine ben ali/person" ], "entity1_index" : "zine el abidine ben ali/person", "entity2" : ["president”,"president/position”], "entity2_index" : "president/position", "index" : "5e3fff27ddb78d6873ccfc77cf05c52f", "verb" : ["career”,"current”,"past”], "verb_category" : "career" }
  • 21. Geolocation of Entities/Events • feature.geo { "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"), "search_field" : "cairo", "country" : "Egypt", "country_code" : "EG", "city" : "cairo", "region" : "Al Qahirah", "region_code" : "EG11", "population" : 7734602, "latitude" : "30.05", "longitude" : "31.25", "geoindex" : { "lon" : 31.25, "lat" : 30.05 Note: MongoDB 2d Index } }
  • 23. Who, What, Where and When
  • 24. Thank You! Craig Vitter www.ikanow.com cvitter@ikanow.com github.com/ikanow/Infinit.e