Content analysis for ECM with Apache Tika
Paolo Mottadelli
ON BOARD!
Agenda
Main challenge Lucene index
Other challenges
A real world challenge: searching .docx, .xlsx and .pptx in Alfresco ECM
Agenda
What is Tika? Another Indian Lucene project? No.
What is Tika? It is a  Toolkit
Current coverage
A brief history of Tika Sponsored by the Apache Lucene PMC
Tika organization Changing after graduation
Getting Tika …  and contributing
Tika Design
The Parser interface
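As a rough illustration of the design the following slides describe (a document input stream goes in, XHTML SAX events and metadata come out), here is a minimal sketch that uses only the JDK. `SketchParser`, `PlainTextSketchParser` and `ParserSketchDemo` are hypothetical names, not Tika classes, and a plain `Map` stands in for Tika's `Metadata` object.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical single-method contract: stream in, SAX events and metadata out. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Treats the input as UTF-8 text and reports each line as an XHTML paragraph. */
final class PlainTextSketchParser implements SketchParser {
    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl noAttrs = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", noAttrs);
        handler.startElement("", "body", "body", noAttrs);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            handler.startElement("", "p", "p", noAttrs);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement("", "p", "p");
        }
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }
}

final class ParserSketchDemo {
    /** Runs the sketch parser and returns the reported text, one paragraph per line. */
    static String extractText(byte[] document, Map<String, String> metadata)
            throws IOException, SAXException {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(qName)) {
                    text.append('\n');
                }
            }
        };
        new PlainTextSketchParser().parse(new ByteArrayInputStream(document), collector, metadata);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new HashMap<>();
        byte[] doc = "first line\nsecond line".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(doc, metadata));
        System.out.println(metadata);
    }
}
```

The caller never learns which format was parsed: it only sees SAX events and metadata entries, which is what lets one handler serve every document type.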
Tika Design
Document input stream
Tika Design
XHTML SAX events
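To make the idea of "XHTML as SAX events" concrete, the sketch below fires a handful of events by hand and lets the JDK's identity `TransformerHandler` serialize them to markup. The class name `XhtmlEventsSketch` is hypothetical, and the XHTML namespace is left out to keep the example short; it shows only the event model, not Tika's own output.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class XhtmlEventsSketch {
    /** Emits <name>text</name> as three SAX events. */
    private static void element(ContentHandler h, String name, String text) throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    /** Fires the events a parser might produce and returns them serialized as markup. */
    static String render(String title, String paragraph) throws Exception {
        // The JDK's default TransformerFactory also implements SAXTransformerFactory.
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        serializer.startDocument();
        serializer.startElement("", "html", "html", none);
        serializer.startElement("", "head", "head", none);
        element(serializer, "title", title);
        serializer.endElement("", "head", "head");
        serializer.startElement("", "body", "body", none);
        element(serializer, "p", paragraph);
        serializer.endElement("", "body", "body");
        serializer.endElement("", "html", "html");
        serializer.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("Sample", "Hello from a parser"));
    }
}
```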
Why XHTML?
ContentHandler (CH) and Decorators (CHD)
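A decorator wraps another ContentHandler and changes what reaches it. The sketch below is a JDK-only illustration with hypothetical class names (`TextCollector`, `BodyOnlyDecorator`, `DecoratorDemo`): the decorator lets character events through only while inside `<body>`, so the wrapped handler collects body text and never sees the title.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Plain handler: collects every character event it receives. */
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

/** Hypothetical decorator: forwards character events only while inside <body>. */
final class BodyOnlyDecorator extends DefaultHandler {
    private final ContentHandler delegate;
    private int bodyDepth = 0; // 0 means "outside <body>"

    BodyOnlyDecorator(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

final class DecoratorDemo {
    private static void element(ContentHandler h, String name, String text) throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    /** Sends a tiny document through the decorator and returns what the collector saw. */
    static String run() throws SAXException {
        TextCollector collector = new TextCollector();
        ContentHandler handler = new BodyOnlyDecorator(collector);
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "head", "head", none);
        element(handler, "title", "Ignored title");
        handler.endElement("", "head", "head");
        handler.startElement("", "body", "body", none);
        element(handler, "p", "Kept text");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return collector.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```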
Tika Design
Document metadata
…  more metadata: HPSF
Tika Design
Parser implementations
The AutoDetectParser
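The general pattern behind an auto-detecting parser is to peek at the start of the stream, decide on a type, then hand the still-intact stream to the parser registered for that type. The sketch below shows that pattern with `mark`/`reset` from the JDK. `AutoDetectSketch`, its two-entry registry and the toy "parsers" are assumptions made for illustration and do not reproduce Tika's implementation. It needs Java 9 or later for `readNBytes` and `readAllBytes`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class AutoDetectSketch {
    /** Hypothetical registry: media type -> toy "parser" that turns bytes into a string. */
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch() {
        parsers.put("application/pdf", bytes -> "PDF document, " + bytes.length + " bytes");
        parsers.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
    }

    /** Peeks at the first four bytes, then rewinds so no input is lost. */
    static String detect(InputStream stream) throws IOException {
        if (!stream.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[4];
        stream.mark(header.length);
        int read = stream.readNBytes(header, 0, header.length);
        stream.reset();
        String start = new String(header, 0, read, StandardCharsets.US_ASCII);
        return start.equals("%PDF") ? "application/pdf" : "text/plain";
    }

    /** Detects the type, then delegates the whole stream to the matching parser. */
    String parse(InputStream raw) throws IOException {
        InputStream stream = raw.markSupported() ? raw : new BufferedInputStream(raw);
        String type = detect(stream);
        return parsers.get(type).apply(stream.readAllBytes());
    }

    public static void main(String[] args) throws IOException {
        AutoDetectSketch sketch = new AutoDetectSketch();
        byte[] pdf = "%PDF-1.4 x".getBytes(StandardCharsets.US_ASCII);
        byte[] txt = "just text".getBytes(StandardCharsets.UTF_8);
        System.out.println(sketch.parse(new ByteArrayInputStream(pdf)));
        System.out.println(sketch.parse(new ByteArrayInputStream(txt)));
    }
}
```

Because the stream is rewound after detection, the chosen parser still receives the document from its first byte.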
Type Detection

    MimeType type = types.getMimeType(…);
tika-mimetypes.xml
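A type registry of this kind pairs each media type with magic byte patterns and file name globs. The sketch below hard-codes a tiny table to show one possible lookup order: magic bytes first, then the file name, then a generic fallback. `MimeSketch` and its table are hypothetical and far smaller and simpler than the real registry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class MimeSketch {
    // Assumed mini-registry; a real one is loaded from XML and is far larger.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    private static final Map<String, String> GLOB_SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", new byte[] {'%', 'P', 'D', 'F', '-'});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        GLOB_SUFFIXES.put(".pdf", "application/pdf");
        GLOB_SUFFIXES.put(".txt", "text/plain");
        GLOB_SUFFIXES.put(".docx",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Magic bytes win; otherwise fall back to the file name; otherwise a generic type. */
    static String detect(byte[] header, String fileName) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(header, entry.getValue())) {
                return entry.getKey();
            }
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : GLOB_SUFFIXES.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));
        System.out.println(detect(new byte[] {'h', 'i'}, "notes.txt"));
        System.out.println(detect(new byte[] {'P', 'K', 3, 4}, "letter.docx"));
    }
}
```

With this naive ordering a .docx file is reported as application/zip, because an Office Open XML document is a ZIP container. Telling the two apart needs either the file name hint or a look inside the archive.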
Supported formats
A really simple example
Demo ?
Future Goals
Who uses Tika?
Agenda
ECM: what is it?
ECM: Manage
ECM: we love SEARCHING!
Don’t do it on your own Tika shields ECM from using many single components
Agenda
Alfresco: short presentation
Who uses Alfresco?
Alfresco Repository JSR-170 Level2 Compatible
Repository Architecture Services Components Storage Hibernate Content Lucene Content Index Database Search Node Node Content Query Index
Alfresco Search
Use case
Without Tika:
Step 1
Step 2
ContentTransformerRegistry: provides the most appropriate ContentTransformer
Step 2 (explained) Too many different ContentTransformer implementations
Step 3 Transform
Example: PoiHssfContentTransformer

    public void transformInternal(ContentReader reader, ContentWriter writer,
            TransformationOptions options) throws Exception {
        ...
        HSSFWorkbook workbook = new HSSFWorkbook(is);
        ...
        for (int i = 0; i < sheetCount; i++) {
            HSSFSheet sheet = workbook.getSheetAt(i);
            String sheetName = workbook.getSheetName(i);
            writeSheet(os, sheet, encoding);
        }
        ...
    }
Step 3 (explained) Too many different ContentTransformer implementations ... again !?!
Step 4 Lucene index creation

    ContentReader reader = contentService.getReader(nodeRef, propertyName);
    ContentTransformer transformer = contentService.getTransformer(
            reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
    transformer.transform(reader, writer);
    reader = writer.getReader();
    ...
    doc.add(new Field(attributeName, reader, Field.TermVector.NO));
Let’s do it using Tika
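Step 1 + Step 2 + Step 3

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    String name = "resource.doc";
    InputStream input = getResourceAsStream(name);
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    // Type detection, parser selection and text extraction in one call
    new AutoDetectParser().parse(input, handler, metadata);
    String title = metadata.get(Metadata.TITLE);
    String content = handler.toString();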
Step 1 to 4 (compressed)

    String name = "resource.doc";
    InputStream input = getResourceAsStream(name);
    Reader reader = new ParsingReader(input, name);
    ...
    doc.add(new Field(attributeName, reader, Field.TermVector.NO));
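The compressed version works because the indexer only needs a `Reader`: parsing can run in the background while the extracted text is pulled through a pipe. The sketch below shows that mechanism with the JDK's `PipedReader` and `PipedWriter`. `BackgroundReaderSketch` is a hypothetical name, and plain UTF-8 decoding stands in for a real parser.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class BackgroundReaderSketch {
    /** Starts a background "parse" and returns a Reader over the text it produces. */
    static Reader open(InputStream input) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Stand-in for a real parser: decode UTF-8 and push the text into the pipe.
            try (Reader source = new InputStreamReader(input, StandardCharsets.UTF_8);
                 Writer out = writer) {
                char[] buffer = new char[1024];
                int n;
                while ((n = source.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } catch (IOException e) {
                // The consumer closed the pipe early; nothing left to deliver.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    /** Drains a Reader into a String, as an indexer would consume it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[256];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "text extracted in the background".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(open(new ByteArrayInputStream(document))));
    }
}
```

The consumer never holds the whole document text in memory at once, which matters when indexing large files.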
Results: 1 & 2
Extension use case Adding support for Microsoft Office Open XML Documents (Office 2007+)
Apache POI
Apache POI provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification (WordprocessingML and PresentationML to come).
Apache POI Apache POI status
Apache POI TextExtractors
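To show what text extraction from an Office Open XML file involves, here is a deliberately simplified JDK-only sketch. It is not how POI's extractors are implemented. A .docx file is a ZIP archive whose main part is `word/document.xml`, with text runs stored in `w:t` elements. `DocxTextSketch` and its `sampleDocx` fixture are hypothetical, and the fixture is not a complete, openable document. It needs Java 9 or later for `readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class DocxTextSketch {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Finds word/document.xml in the archive and returns its text, one paragraph per line. */
    static String extractText(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    return textOf(zip.readAllBytes()); // reads only the current entry
                }
            }
        }
        return "";
    }

    private static String textOf(byte[] documentXml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations, a common hardening step for untrusted XML.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        StringBuilder text = new StringBuilder();
        factory.newSAXParser().parse(new ByteArrayInputStream(documentXml), new DefaultHandler() {
            private boolean inText;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (W_NS.equals(uri) && "t".equals(local)) {
                    inText = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inText) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (!W_NS.equals(uri)) {
                    return;
                }
                if ("t".equals(local)) {
                    inText = false;
                } else if ("p".equals(local)) {
                    text.append('\n');
                }
            }
        });
        return text.toString();
    }

    /** Test fixture: a ZIP holding only word/document.xml, not a complete .docx file. */
    static byte[] sampleDocx() throws IOException {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
                + "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
                + "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
                + "</w:body></w:document>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(extractText(new ByteArrayInputStream(sampleDocx())));
    }
}
```

A production extractor also has to handle headers, footers, tables, footnotes and embedded objects, which is the kind of work a dedicated library takes on.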
Can we find it?
Results: 3 & 4
Q & A
