Content analysis for ECM with Apache Tika
Paolo Mottadelli - [email_address]
ON BOARD!
Agenda
Main challenge: Lucene index
Other challenges
A real-world challenge: searching .docx, .xlsx and .pptx files in Alfresco ECM
Agenda
What is Tika? Another Indian Lucene project? No.
What is Tika? It is a Toolkit
Current coverage
A brief history of Tika: sponsored by the Apache Lucene PMC
Tika organization: changing after graduation
Getting Tika …  and contributing
Tika Design
The Parser interface

void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
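To make the interface concrete, here is a hypothetical implementation sketch (the PlainTextParser class below is not part of Tika): it treats the stream as plain text and uses Tika's XHTMLContentHandler helper to emit well-formed XHTML SAX events, which is the contract every Tika parser follows.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class PlainTextParser implements Parser {
    public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        metadata.set(Metadata.CONTENT_TYPE, "text/plain");
        // XHTMLContentHandler wraps the caller's handler and takes care of
        // the <html>/<head>/<body> envelope.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            xhtml.element("p", line);   // one <p> element per input line
        }
        xhtml.endDocument();
    }
}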
Tika Design
Document input stream
Tika Design
XHTML SAX events

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body>
    ...
  </body>
</html>
Why XHTML?
- Reflect the structured text content of the document
- Do not recreate the low-level details
- For low-level details, use the low-level parser libraries directly
ContentHandler (CH) and Decorators (CHD)
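Decorators wrap an existing ContentHandler and filter or observe the SAX event stream on its way through. A minimal sketch, assuming Tika's ContentHandlerDecorator base class (the CountingContentHandler name and its counting behaviour are invented for illustration):

import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CountingContentHandler extends ContentHandlerDecorator {
    private long count = 0;

    public CountingContentHandler(ContentHandler handler) {
        super(handler);   // all events are forwarded to the wrapped handler
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        count += length;                    // observe the text as it flows through
        super.characters(ch, start, length);
    }

    public long getCharacterCount() {
        return count;
    }
}

Decorators like this compose: pass a CountingContentHandler wrapping a BodyContentHandler to parse(), and you get both the extracted text and its length in a single pass.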
Tika Design
Document metadata
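As a quick sketch of the metadata side: the Metadata class is essentially a string-to-string map with well-known keys (Metadata.TITLE and Metadata.CONTENT_TYPE are real Tika constants; the values here are made up):

import org.apache.tika.metadata.Metadata;

public class MetadataExample {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.CONTENT_TYPE, "application/pdf");  // example value
        metadata.set(Metadata.TITLE, "Quarterly report");        // example value
        // Parsers fill this object in; callers read it back afterwards:
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}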
… more metadata: HPSF (the OLE2 property sets of Microsoft Office files, read via Apache POI)
Tika Design
Parser implementations
The AutoDetectParser: encapsulates all Tika functionality and can handle any type of document
Type Detection

MimeType type = types.getMimeType(…);
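A sketch of where that types object can come from, assuming the Tika 0.x API in which TikaConfig exposes the default MimeTypes registry; both name-based (glob) and content-based (magic) lookups are shown:

import org.apache.tika.config.TikaConfig;
import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypes;

public class TypeDetectionExample {
    public static void main(String[] args) throws Exception {
        MimeTypes types = TikaConfig.getDefaultConfig().getMimeRepository();
        // Lookup by resource name, matched against the glob patterns:
        MimeType byName = types.getMimeType("archive.tar.gz");
        // Lookup by leading bytes; 0x1f 0x8b is the gzip magic number:
        MimeType byMagic = types.getMimeType(new byte[] {0x1f, (byte) 0x8b});
        System.out.println(byName + " / " + byMagic);
    }
}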
tika-mimetypes.xml, an example: Gzip

<mime-type type="application/x-gzip">
  <magic priority="40">
    <match value="\037\213" type="string" offset="0" />
  </magic>
  <glob pattern="*.tgz" />
  <glob pattern="*.gz" />
  <glob pattern="*-gz" />
</mime-type>
Supported formats
A really simple example

InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new OfficeParser().parse(input, handler, metadata);
String contentType = metadata.get(Metadata.CONTENT_TYPE);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();
Demo?
Future Goals
Who uses Tika?
Agenda
ECM: what is it?
ECM: Manage: Indexing, Categorization
ECM: we love SEARCHING!
Don’t do it on your own Tika shields ECM from using many single components
Agenda
Alfresco: short presentation
Who uses Alfresco?
Alfresco Repository: JSR-170 Level 2 compatible
Repository Architecture [diagram: Services, Components and Storage layers; the Node service is backed by Hibernate over the Database; Search and Query are backed by the Lucene Content Index; Content lives in the content store]
Alfresco Search
Use case
Without Tika:
Step 1
Step 2: ContentTransformerRegistry provides the most appropriate ContentTransformer

for (ContentTransformer transformer : transformers) {
    long transformationTime = transformer.getTransformationTime();
    if (bestTransformer == null || transformationTime < bestTime) {
        bestTransformer = transformer;
        bestTime = transformationTime;
    }
}
return bestTransformer;
Step 2 (explained): too many different ContentTransformer implementations
Step 3: Transform. Example: PoiHssfContentTransformer

public void transformInternal(ContentReader reader, ContentWriter writer,
        TransformationOptions options) throws Exception {
    ...
    HSSFWorkbook workbook = new HSSFWorkbook(is);
    ...
    for (int i = 0; i < sheetCount; i++) {
        HSSFSheet sheet = workbook.getSheetAt(i);
        String sheetName = workbook.getSheetName(i);
        writeSheet(os, sheet, encoding);
    }
    ...
}
Step 3 (explained): too many different ContentTransformer implementations … again!?
Step 4: Lucene index creation

ContentReader reader = contentService.getReader(nodeRef, propertyName);
ContentTransformer transformer = contentService.getTransformer(
        reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
transformer.transform(reader, writer);
reader = writer.getReader();
...
doc.add(new Field(attributeName, reader, Field.TermVector.NO));
Let’s do it using Tika
Step 1 + Step 2 + Step 3

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
new AutoDetectParser().parse(input, handler, metadata);
String title = metadata.get(Metadata.TITLE);
String content = handler.toString();
Step 1 to 4 (compressed)

String name = "resource.doc";
InputStream input = getResourceAsStream(name);
Reader reader = new ParsingReader(input, name);
...
doc.add(new Field(attributeName, reader, Field.TermVector.NO));

ParsingReader parses the stream in a background thread and exposes the extracted text as a plain java.io.Reader, so it can be handed straight to Lucene's Field(String, Reader, TermVector) constructor.
Results: 1 & 2
Extension use case: adding support for Microsoft Office Open XML documents (Office 2007+)
Apache POI: provides text extraction support for Office OpenXML formats, plus advanced coverage of the SpreadsheetML specification (WordprocessingML and PresentationML to come)
Apache POI status
Apache POI TextExtractors

POIXMLDocument document;
Package pkg = Package.open(stream);
textExtractor = ExtractorFactory.createExtractor(pkg);
if (textExtractor instanceof XSSFExcelExtractor) {
    setType(metadata, OOXML_EXCEL_MIMETYPE);
    document = new XSSFWorkbook(pkg);
} else if (textExtractor instanceof XWPFWordExtractor) { … }
else if (textExtractor instanceof XSLFPowerPointExtractor) { … }
setPOIXMLProperties(metadata, document);
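For comparison, a standalone sketch of the same extractors used directly, outside Tika, assuming the early POI 3.5/openxml4j API shown above (the file name is made up; later POI versions renamed Package to OPCPackage):

import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.openxml4j.opc.Package;

public class OoxmlTextDump {
    public static void main(String[] args) throws Exception {
        InputStream stream = new FileInputStream("report.xlsx");  // made-up file
        Package pkg = Package.open(stream);
        // ExtractorFactory inspects the package and returns the matching
        // XSSF / XWPF / XSLF extractor as a common POITextExtractor.
        POITextExtractor extractor = ExtractorFactory.createExtractor(pkg);
        System.out.println(extractor.getText());
    }
}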
Can we find it?
Results: 3 & 4
Q & A [email_address]

Content analysis for ECM with Apache Tika