SlideShare a Scribd company logo
1 of 29
Content analysis with Apache Tika Paolo Mottadelli  - [email_address]   or   [email_address]
Main challenge Lucene index
Other challenges
What is Tika? Another Indian Lucene project? No.
What is Tika? It is a  Toolkit
Current coverage
A brief history of Tika Sponsored by the Apache Lucene PMC
Tika organization Changing after graduation
Getting Tika …  and contributing
Tika Design
The Parser interface ,[object Object]
Tika Design
Document input stream
Tika Design
XHTML SAX events ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Why XHTML? ,[object Object],[object Object],[object Object]
ContentHandler (CH) and Decorators (CHD)
Tika Design
Document metadata
…  more metadata: HPSF
Tika Design
Parser implementations
The AutoDetectParser ,[object Object],[object Object]
Type Detection MimeType type = types.getMimeType(…);
tika-mimetypes.xml ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Supported formats
A really simple example ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Future Goals
Who uses Tika?

More Related Content

What's hot

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
Swapnil & Patil
 

What's hot (20)

What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Scientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache TikaScientific data curation and processing with Apache Tika
Scientific data curation and processing with Apache Tika
 
Lucene
LuceneLucene
Lucene
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Integrating Doctrine with Laravel
Integrating Doctrine with LaravelIntegrating Doctrine with Laravel
Integrating Doctrine with Laravel
 
Roaring with elastic search sangam2018
Roaring with elastic search sangam2018Roaring with elastic search sangam2018
Roaring with elastic search sangam2018
 
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
Elasticsearch Tutorial | Getting Started with Elasticsearch | ELK Stack Train...
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 

Similar to Content Analysis with Apache Tika

HTML Introduction
HTML IntroductionHTML Introduction
HTML Introduction
eceklu
 
CustomizingStyleSheetsForHTMLOutputs
CustomizingStyleSheetsForHTMLOutputsCustomizingStyleSheetsForHTMLOutputs
CustomizingStyleSheetsForHTMLOutputs
Suite Solutions
 
Authoring and Publishing with XMetaL and DITA
Authoring and Publishing with XMetaL and DITAAuthoring and Publishing with XMetaL and DITA
Authoring and Publishing with XMetaL and DITA
Scott Abel
 
Web topic 2 html
Web topic 2  htmlWeb topic 2  html
Web topic 2 html
CK Yang
 
HTML Introduction
HTML IntroductionHTML Introduction
HTML Introduction
c525600
 

Similar to Content Analysis with Apache Tika (20)

Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Mdst 3559-02-01-html
Mdst 3559-02-01-htmlMdst 3559-02-01-html
Mdst 3559-02-01-html
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
HTML Introduction
HTML IntroductionHTML Introduction
HTML Introduction
 
Wisneski TeI workshop 2009-2010
Wisneski TeI workshop 2009-2010Wisneski TeI workshop 2009-2010
Wisneski TeI workshop 2009-2010
 
Xml Case Learns 2008
Xml Case Learns 2008Xml Case Learns 2008
Xml Case Learns 2008
 
CustomizingStyleSheetsForHTMLOutputs
CustomizingStyleSheetsForHTMLOutputsCustomizingStyleSheetsForHTMLOutputs
CustomizingStyleSheetsForHTMLOutputs
 
The Big Documentation Extravaganza
The Big Documentation ExtravaganzaThe Big Documentation Extravaganza
The Big Documentation Extravaganza
 
Learning XSLT
Learning XSLTLearning XSLT
Learning XSLT
 
XML Transformations With PHP
XML Transformations With PHPXML Transformations With PHP
XML Transformations With PHP
 
Html
HtmlHtml
Html
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Basic of HTML
Basic of  HTMLBasic of  HTML
Basic of HTML
 
Authoring and Publishing with XMetaL and DITA
Authoring and Publishing with XMetaL and DITAAuthoring and Publishing with XMetaL and DITA
Authoring and Publishing with XMetaL and DITA
 
Xml Lecture Notes
Xml Lecture NotesXml Lecture Notes
Xml Lecture Notes
 
Decoding and developing the online finding aid
Decoding and developing the online finding aidDecoding and developing the online finding aid
Decoding and developing the online finding aid
 
Web topic 2 html
Web topic 2  htmlWeb topic 2  html
Web topic 2 html
 
HTML Introduction
HTML IntroductionHTML Introduction
HTML Introduction
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 

More from Paolo Mottadelli

More from Paolo Mottadelli (11)

Open Architecture in the Adobe Marketing Cloud - Summit 2014
Open Architecture in the Adobe Marketing Cloud - Summit 2014Open Architecture in the Adobe Marketing Cloud - Summit 2014
Open Architecture in the Adobe Marketing Cloud - Summit 2014
 
Integrating with Adobe Marketing Cloud - Summit 2014
Integrating with Adobe Marketing Cloud - Summit 2014Integrating with Adobe Marketing Cloud - Summit 2014
Integrating with Adobe Marketing Cloud - Summit 2014
 
Evolve13 cq-commerce-framework
Evolve13 cq-commerce-frameworkEvolve13 cq-commerce-framework
Evolve13 cq-commerce-framework
 
AEM (CQ) eCommerce Framework
AEM (CQ) eCommerce FrameworkAEM (CQ) eCommerce Framework
AEM (CQ) eCommerce Framework
 
Adobe AEM Commerce with hybris
Adobe AEM Commerce with hybrisAdobe AEM Commerce with hybris
Adobe AEM Commerce with hybris
 
Java standards in WCM
Java standards in WCMJava standards in WCM
Java standards in WCM
 
JCR and Sling Quick Dive
JCR and Sling Quick DiveJCR and Sling Quick Dive
JCR and Sling Quick Dive
 
Open Development
Open DevelopmentOpen Development
Open Development
 
Apache Poi Recipes
Apache Poi RecipesApache Poi Recipes
Apache Poi Recipes
 
Jira as a Project Management Tool
Jira as a Project Management ToolJira as a Project Management Tool
Jira as a Project Management Tool
 
Interoperability at Apache Software Foundation
Interoperability at Apache Software FoundationInteroperability at Apache Software Foundation
Interoperability at Apache Software Foundation
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

Content Analysis with Apache Tika