Content analysis for ECM with Apache Tika

Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. It covers the Apache Tika project and how it was integrated into Alfresco to support full-text search of the Office Open XML formats.

Transcript

  • 1. Content analysis for ECM with Apache Tika, by Paolo Mottadelli
  • 2. [email_address]
  • 3. ON BOARD!
  • 4. Agenda
  • 5. Main challenge: the Lucene index
  • 6. Other challenges
  • 7. A real world challenge: searching .docx, .xlsx and .pptx in Alfresco ECM
  • 8. Agenda
  • 9. What is Tika? Another Indian Lucene project? No.
  • 10. What is Tika? It is a Toolkit
  • 11. Current coverage
  • 12. A brief history of Tika: sponsored by the Apache Lucene PMC
  • 13. Tika organization: changing after graduation
  • 14. Getting Tika … and contributing
  • 15. Tika Design
  • 16. The Parser interface
    • void parse(InputStream stream, ContentHandler handler, Metadata metadata) throws IOException, SAXException, TikaException;
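A minimal sketch of what implementing this interface looks like, assuming Tika's XHTMLContentHandler helper for emitting events; the PlainTextParser class is hypothetical, not code from the deck:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.XHTMLContentHandler;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    // Hypothetical parser: emits one XHTML paragraph per line of plain text
    public class PlainTextParser implements Parser {
        public void parse(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException {
            metadata.set(Metadata.CONTENT_TYPE, "text/plain");
            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
            xhtml.startDocument();
            BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
            String line;
            while ((line = reader.readLine()) != null) {
                xhtml.element("p", line);   // sends <p>...</p> as SAX events
            }
            xhtml.endDocument();
        }
    }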
  • 17. Tika Design
  • 18. Document input stream
  • 19. Tika Design
  • 20. XHTML SAX events
    • <html xmlns="http://www.w3.org/1999/xhtml">
    • <head>
    • <title>...</title>
    • </head>
    • <body> ... </body>
    • </html>
  • 21. Why XHTML?
    • Reflects the structured text content of the document
    • Does not recreate low-level details
    • For low-level details, use a dedicated low-level parser library
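One way to see the full event stream as markup, sketched with JAXP's identity transformer serving as the ContentHandler (the test resource name is illustrative):

    SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
    TransformerHandler handler = factory.newTransformerHandler();
    handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
    StringWriter writer = new StringWriter();
    handler.setResult(new StreamResult(writer));

    InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
    new AutoDetectParser().parse(input, handler, new Metadata());
    String xhtml = writer.toString();   // the XHTML document shown above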
  • 22. ContentHandler (CH) and Decorators (CHD)
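A sketch of the decorator idea, assuming Tika's ContentHandlerDecorator base class; the CountingHandler name is hypothetical:

    // Counts text characters while passing every SAX event through
    // to the decorated handler (e.g. a BodyContentHandler)
    public class CountingHandler extends ContentHandlerDecorator {
        private long count = 0;

        public CountingHandler(ContentHandler handler) {
            super(handler);
        }

        public void characters(char[] ch, int start, int length) throws SAXException {
            count += length;
            super.characters(ch, start, length);
        }

        public long getCount() {
            return count;
        }
    }

Tika ships decorators built on the same pattern, e.g. BodyContentHandler, which keeps only the events inside the <body> element.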
  • 23. Tika Design
  • 24. Document metadata
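A sketch of dumping whatever metadata a parse produced, in the style of the example on slide 32 (the resource name is illustrative):

    InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
    Metadata metadata = new Metadata();
    new AutoDetectParser().parse(input, new BodyContentHandler(), metadata);
    for (String name : metadata.names()) {
        System.out.println(name + ": " + metadata.get(name));
    }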
  • 25. … more metadata: HPSF
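HPSF is POI's reader for the OLE2 property-set streams where Office documents store their metadata. A sketch of the underlying POI API that Tika maps into its Metadata object, assuming input is a stream over an OLE2 document:

    POIFSFileSystem fs = new POIFSFileSystem(input);
    DocumentInputStream dis =
            fs.createDocumentInputStream(SummaryInformation.DEFAULT_STREAM_NAME);
    SummaryInformation si = (SummaryInformation) PropertySetFactory.create(dis);
    System.out.println(si.getTitle() + " / " + si.getAuthor());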
  • 26. Tika Design
  • 27. Parser implementations
  • 28. The AutoDetectParser
    • Encapsulates all of Tika's functionality
    • Can handle any type of document
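A sketch contrasting with the explicit OfficeParser example on slide 32: no format is named up front, because AutoDetectParser detects the type and dispatches to the right parser (the resource name is illustrative):

    InputStream input = MyTest.class.getResourceAsStream("unknown.bin");
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    new AutoDetectParser().parse(input, handler, metadata);
    String contentType = metadata.get(Metadata.CONTENT_TYPE);   // detected type
    String content = handler.toString();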
  • 29. Type Detection MimeType type = types.getMimeType(…);
  • 30. tika-mimetypes.xml
    • An example: Gzip
    • <mime-type type="application/x-gzip">
    •   <magic priority="40">
    •     <match value="\037\213" type="string" offset="0" />
    •   </magic>
    •   <glob pattern="*.tgz" />
    •   <glob pattern="*.gz" />
    •   <glob pattern="*-gz" />
    • </mime-type>
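A sketch of exercising these rules through the MimeTypes repository; treat the exact accessors as assumptions about the 2008-era API:

    MimeTypes types = TikaConfig.getDefaultConfig().getMimeRepository();
    MimeType byName = types.getMimeType("archive.tgz");                      // glob: *.tgz
    MimeType byMagic = types.getMimeType(new byte[] { 0x1f, (byte) 0x8b });  // magic: \037\213
    System.out.println(byName.getName());   // application/x-gzip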
  • 31. Supported formats
  • 32. A really simple example
    • InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
    • Metadata metadata = new Metadata();
    • ContentHandler handler = new BodyContentHandler();
    • new OfficeParser().parse(input, handler, metadata);
    • String contentType = metadata.get(Metadata.CONTENT_TYPE);
    • String title = metadata.get(Metadata.TITLE);
    • String content = handler.toString();
  • 33. Demo ?
  • 34. Future Goals
  • 35. Who uses Tika?
  • 36. Agenda
  • 37. ECM: what is it?
  • 38. ECM: Manage
    • Indexing
    • Categorization
  • 39. ECM: we love SEARCHING!
  • 40. ECM: we love SEARCHING!
  • 41. ECM: we love SEARCHING!
  • 42. Don’t do it on your own: Tika shields the ECM from having to integrate many individual parser components
  • 43. Agenda
  • 44. Alfresco: short presentation
  • 45. Alfresco: short presentation
  • 46. Who uses Alfresco?
  • 47. Alfresco Repository: JSR-170 Level 2 compliant
  • 48. Repository Architecture (diagram: a Services layer with Node, Content and Search/Query services; a Components layer with Hibernate, Lucene and the Content store; a Storage layer with the Database, Content and Index)
  • 49. Repository Architecture (same diagram repeated)
  • 50. Alfresco Search
  • 51. Alfresco Search
  • 52. Use case
  • 53. Use case
  • 54. Without Tika:
  • 55. Step 1
  • 56. Step 2
    • // declarations assumed from the surrounding Alfresco code
    • ContentTransformer bestTransformer = null;
    • long bestTime = -1;
    • for (ContentTransformer transformer : transformers)
    • {
    •     long transformationTime = transformer.getTransformationTime();
    •     if (bestTransformer == null || transformationTime < bestTime)
    •     {
    •         bestTransformer = transformer;
    •         bestTime = transformationTime;
    •     }
    • }
    • return bestTransformer;
    ContentTransformerRegistry provides the most appropriate ContentTransformer
  • 57. Step 2 (explained) Too many different ContentTransformer implementations
  • 58. Step 3 Transform (example: PoiHssfContentTransformer)
    • public void transformInternal(ContentReader reader, ContentWriter writer,
    •         TransformationOptions options) throws Exception
    • {
    •     ...
    •     HSSFWorkbook workbook = new HSSFWorkbook(is);
    •     ...
    •     for (int i = 0; i < sheetCount; i++)
    •     {
    •         HSSFSheet sheet = workbook.getSheetAt(i);
    •         String sheetName = workbook.getSheetName(i);
    •         writeSheet(os, sheet, encoding);
    •     }
    •     ...
    • }
  • 59. Step 3 (explained) Too many different ContentTransformer implementations ... again !?!
  • 60. Step 4 Lucene index creation
    • ContentReader reader = contentService.getReader(nodeRef, propertyName);
    • ContentTransformer transformer = contentService.getTransformer(
    •         reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
    • transformer.transform(reader, writer);
    • reader = writer.getReader();
    • . . .
    • doc.add(new Field(attributeName, reader, Field.TermVector.NO));
  • 61. Let’s do it using Tika
  • 62. Step 1 + Step 2 + Step 3
    • String name = "resource.doc";
    • InputStream input = getResourceAsStream(name);
    • Metadata metadata = new Metadata();
    • ContentHandler handler = new BodyContentHandler();
    • new AutoDetectParser().parse(input, handler, metadata);
    • String title = metadata.get(Metadata.TITLE);
    • String content = handler.toString();
  • 63. Step 1 to 4 (compressed)
    • String name = "resource.doc";
    • InputStream input = getResourceAsStream(name);
    • Reader reader = new ParsingReader(input, name);
    • . . .
    • doc.add(new Field(attributeName, reader, Field.TermVector.NO));
  • 64. Results: 1 & 2
  • 65. Extension use case Adding support for Microsoft Office Open XML Documents (Office 2007+)
  • 66. Apache POI Apache POI provides text extraction support for the Office Open XML formats, and advanced coverage of the SpreadsheetML specification (WordprocessingML and PresentationML to come)
  • 67. Apache POI Apache POI status
  • 68. Apache POI TextExtractors
    • POIXMLDocument document;
    • Package pkg = Package.open(stream);
    • // declared type assumed from the era's ExtractorFactory API
    • POIXMLTextExtractor textExtractor = ExtractorFactory.createExtractor(pkg);
    • if (textExtractor instanceof XSSFExcelExtractor) {
    •     setType(metadata, OOXML_EXCEL_MIMETYPE);
    •     document = new XSSFWorkbook(pkg);
    • }
    • else if (textExtractor instanceof XWPFWordExtractor) { … }
    • else if (textExtractor instanceof XSLFPowerPointExtractor) { … }
    • setPOIXMLProperties(metadata, document);
  • 69. Can we find it?
  • 70. Results: 3 & 4
  • 71. Q & A [email_address]