Content analysis for ECM with Apache Tika

Presentation at ApacheCon US 2008 (New Orleans) by Paolo Mottadelli. It covers the Apache Tika project and how Tika was integrated into Alfresco to support full-text search of Office Open XML formats.


  1. Content analysis for ECM with Apache Tika Paolo Mottadelli -
  2. [email_address]
  3. ON BOARD!
  4. Agenda
  5. Main challenge: Lucene index
  6. Other challenges
  7. A real-world challenge: searching .docx, .xlsx and .pptx in Alfresco ECM
  8. Agenda
  9. What is Tika? Another Indian Lucene project? No.
  10. What is Tika? It is a Toolkit
  11. Current coverage
  12. A brief history of Tika: sponsored by the Apache Lucene PMC
  13. Tika organization: changing after graduation
  14. Getting Tika … and contributing
  15. Tika Design
  16. The Parser interface
      void parse(InputStream stream, ContentHandler handler, Metadata metadata)
          throws IOException, SAXException, TikaException;
  17. Tika Design
  18. Document input stream
  19. Tika Design
  20. XHTML SAX events
      <html xmlns="http://www.w3.org/1999/xhtml">
        <head>
          <title>...</title>
        </head>
        <body> ... </body>
      </html>
  21. Why XHTML?
      - It reflects the structured text content of the document
      - It does not recreate the low-level details
      - For low-level details, use low-level parser libraries
  22. ContentHandler (CH) and Decorators (CHD)
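The handler and decorator idea on slides 20 to 22 can be illustrated with the JDK's own SAX classes. This is a minimal sketch, not Tika code: `TextCollector`, `BodyOnlyDecorator` and `XhtmlSaxSketch` are hypothetical names, and the XHTML is parsed from a string rather than produced by a Tika parser. The decorator wraps another `ContentHandler` and forwards character events only while inside `<body>`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects every character event it receives. */
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }
}

/** Hypothetical decorator: forwards character events only inside <body>. */
class BodyOnlyDecorator extends DefaultHandler {
    private final ContentHandler delegate;
    private int bodyDepth = 0; // 0 means "outside <body>"

    BodyOnlyDecorator(ContentHandler delegate) {
        this.delegate = delegate;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (bodyDepth > 0) {
            delegate.characters(ch, start, length);
        }
    }
}

class XhtmlSaxSketch {
    /** Parses an XHTML string and returns only the text found inside <body>. */
    static String bodyText(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new BodyOnlyDecorator(collector));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
        return collector.toString();
    }
}
```

Because the decorator itself is a `ContentHandler`, several decorators could be stacked in front of one collecting handler.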
  23. Tika Design
  24. Document metadata
  25. … more metadata: HPSF
  26. Tika Design
  27. Parser implementations
  28. The AutoDetectParser
      - Encapsulates all Tika functionality
      - Can handle any supported document type
  29. Type Detection
      MimeType type = types.getMimeType(…);
  30. tika-mimetypes.xml (an example: Gzip)
      <mime-type type="application/x-gzip">
        <magic priority="40">
          <match value="\037\213" type="string" offset="0" />
        </magic>
        <glob pattern="*.tgz" />
        <glob pattern="*.gz" />
        <glob pattern="*-gz" />
      </mime-type>
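To make the Gzip entry on slide 30 concrete, here is a minimal sketch of what a magic match plus glob patterns amount to. It assumes the match value stands for the gzip magic bytes 0x1f 0x8b (octal 037 213) at offset 0. `GzipTypeSketch` and the `application/octet-stream` fallback are assumptions of this sketch, and it does not model how Tika weighs magic priority against globs.

```java
/** Hypothetical sketch of the Gzip rule: magic bytes first, then file name globs. */
class GzipTypeSketch {
    static final String GZIP = "application/x-gzip";
    static final String FALLBACK = "application/octet-stream"; // assumed default

    /** Magic match: bytes 0x1f 0x8b at offset 0 of the stream prefix. */
    static boolean hasGzipMagic(byte[] prefix) {
        return prefix != null
            && prefix.length >= 2
            && (prefix[0] & 0xff) == 0x1f
            && (prefix[1] & 0xff) == 0x8b;
    }

    /** Glob match for the patterns *.tgz, *.gz and *-gz. */
    static boolean hasGzipName(String name) {
        return name != null
            && (name.endsWith(".tgz") || name.endsWith(".gz") || name.endsWith("-gz"));
    }

    /** Returns the gzip type if either rule matches, otherwise the fallback. */
    static String detect(byte[] prefix, String name) {
        if (hasGzipMagic(prefix) || hasGzipName(name)) {
            return GZIP;
        }
        return FALLBACK;
    }
}
```

The masking with `0xff` matters because Java bytes are signed, so 0x8b would otherwise compare as a negative number.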
  31. Supported formats
  32. A really simple example
      InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
      Metadata metadata = new Metadata();
      ContentHandler handler = new BodyContentHandler();
      new OfficeParser().parse(input, handler, metadata);
      String contentType = metadata.get(Metadata.CONTENT_TYPE);
      String title = metadata.get(Metadata.TITLE);
      String content = handler.toString();
  33. Demo ?
  34. Future Goals
  35. Who uses Tika?
  36. Agenda
  37. ECM: what is it?
  38. ECM: Manage
      - Indexing
      - Categorization
  39. ECM: we love SEARCHING!
  40. ECM: we love SEARCHING!
  41. ECM: we love SEARCHING!
  42. Don’t do it on your own: Tika shields the ECM from having to use many single components
  43. Agenda
  44. Alfresco: short presentation
  45. Alfresco: short presentation
  46. Who uses Alfresco?
  47. Alfresco Repository: JSR-170 Level 2 compatible
  48. Repository Architecture (diagram of the Services, Components and Storage layers; labels include Node, Content, Search, Query, Index, Hibernate, Lucene and Database)
  49. Repository Architecture (same diagram as slide 48)
  50. Alfresco Search
  51. Alfresco Search
  52. Use case
  53. Use case
  54. Without Tika:
  55. Step 1
  56. Step 2: ContentTransformerRegistry provides the most appropriate ContentTransformer
      for (ContentTransformer transformer : transformers)
      {
          long transformationTime = transformer.getTransformationTime();
          if (bestTransformer == null || transformationTime < bestTime)
          {
              bestTransformer = transformer;
              bestTime = transformationTime;
          }
      }
      return bestTransformer;
  57. Step 2 (explained): too many different ContentTransformer implementations
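The selection loop on slide 56 can be run in isolation with a stand-in type. This is a minimal sketch, not Alfresco code: `TimedTransformer`, `BestTransformerSketch` and the `of` helper are hypothetical, and only the "keep the transformer with the lowest reported time" logic is taken from the slide.

```java
import java.util.List;

/** Hypothetical stand-in for the slide's ContentTransformer. */
interface TimedTransformer {
    String name();

    long getTransformationTime();
}

class BestTransformerSketch {
    /** Returns the transformer with the lowest reported time, or null if the list is empty. */
    static TimedTransformer fastest(List<TimedTransformer> transformers) {
        TimedTransformer bestTransformer = null;
        long bestTime = Long.MAX_VALUE;
        for (TimedTransformer transformer : transformers) {
            long transformationTime = transformer.getTransformationTime();
            if (bestTransformer == null || transformationTime < bestTime) {
                bestTransformer = transformer;
                bestTime = transformationTime;
            }
        }
        return bestTransformer;
    }

    /** Hypothetical helper that builds a transformer with a fixed name and time. */
    static TimedTransformer of(final String name, final long millis) {
        return new TimedTransformer() {
            @Override
            public String name() {
                return name;
            }

            @Override
            public long getTransformationTime() {
                return millis;
            }
        };
    }
}
```

With transformers reporting 30, 10 and 20 milliseconds, `fastest` returns the one reporting 10. On ties, the first transformer with the lowest time wins because the comparison is strict.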
  58. Step 3: Transform (example: PoiHssfContentTransformer)
      public void transformInternal(ContentReader reader, ContentWriter writer,
              TransformationOptions options) throws Exception
      {
          ...
          HSSFWorkbook workbook = new HSSFWorkbook(is);
          ...
          for (int i = 0; i < sheetCount; i++)
          {
              HSSFSheet sheet = workbook.getSheetAt(i);
              String sheetName = workbook.getSheetName(i);
              writeSheet(os, sheet, encoding);
          }
          ...
      }
  59. Step 3 (explained): too many different ContentTransformer implementations ... again !?!
  60. Step 4: Lucene index creation
      ContentReader reader = contentService.getReader(nodeRef, propertyName);
      ContentTransformer transformer = contentService.getTransformer(
              reader.getMimetype(), MimetypeMap.MIMETYPE_TEXT_PLAIN);
      transformer.transform(reader, writer);
      reader = writer.getReader();
      ...
      doc.add(new Field(attributeName, reader, Field.TermVector.NO));
  61. Let’s do it using Tika
  62. Step 1 + Step 2 + Step 3
      String name = "resource.doc";
      InputStream input = getResourceAsStream(name);
      Metadata metadata = new Metadata();
      ContentHandler handler = new BodyContentHandler();
      new AutoDetectParser().parse(input, handler, metadata);
      String title = metadata.get(Metadata.TITLE);
      String content = handler.toString();
  63. Step 1 to 4 (compressed)
      String name = "resource.doc";
      InputStream input = getResourceAsStream(name);
      Reader reader = new ParsingReader(input, name);
      ...
      doc.add(new Field(attributeName, reader, Field.TermVector.NO));
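Slide 63 hands the indexer a `Reader` instead of a finished string, so text can be consumed while it is still being extracted. The sketch below shows one way such a reader could work, using a JDK pipe fed by a background thread. It is an illustration under assumptions, not Tika's `ParsingReader`: `StreamingTextSketch` and `TextProducer` are hypothetical names, and the "parser" is any callback that writes text.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

class StreamingTextSketch {
    /** Hypothetical stand-in for a parser that pushes extracted text to a writer. */
    interface TextProducer {
        void writeTo(Writer out) throws IOException;
    }

    /** Returns a reader whose content is produced by a background thread. */
    static Reader asReader(final TextProducer producer) {
        try {
            PipedReader reader = new PipedReader();
            final PipedWriter writer = new PipedWriter(reader);
            Thread worker = new Thread(() -> {
                try (Writer out = writer) {
                    producer.writeTo(out);
                } catch (IOException e) {
                    // The writer is still closed by try-with-resources,
                    // so the consumer simply sees the end of the stream.
                }
            });
            worker.setDaemon(true);
            worker.start();
            return reader;
        } catch (IOException e) {
            throw new IllegalStateException("Could not connect pipe", e);
        }
    }

    /** Drains a reader into a string; stands in for the indexer consuming the field. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read text", e);
        }
        return sb.toString();
    }
}
```

The design point is that the consumer only ever sees a plain `java.io.Reader`, which is the type the `Field` constructor on the slide accepts.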
  64. Results: 1 & 2
  65. Extension use case: adding support for Microsoft Office Open XML documents (Office 2007+)
  66. Apache POI: provides text extraction support for Office Open XML formats and advanced coverage of the SpreadsheetML specification (WordprocessingML and PresentationML to come)
  67. Apache POI Apache POI status
  68. Apache POI TextExtractors
      POIXMLDocument document;
      Package pkg = Package.open(stream);
      textExtractor = ExtractorFactory.createExtractor(pkg);
      if (textExtractor instanceof XSSFExcelExtractor) {
          setType(metadata, OOXML_EXCEL_MIMETYPE);
          document = new XSSFWorkbook(pkg);
      }
      else if (textExtractor instanceof XWPFWordExtractor) {…}
      else if (textExtractor instanceof XSLFPowerPointExtractor) {…}
      setPOIXMLProperties(metadata, document);
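Slide 68 decides the Office Open XML type from the extractor class that POI returns. The sketch below uses a different, simpler technique with only the JDK: it looks for the conventional main part inside the zip package. It is not POI's `ExtractorFactory` logic. `OoxmlTypeSketch` and `zipOf` are hypothetical, and the part names (`word/document.xml`, `xl/workbook.xml`, `ppt/presentation.xml`) are the names Office conventionally writes, which is an assumption; a robust detector would read `[Content_Types].xml` instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class OoxmlTypeSketch {
    static final String DOCX =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String XLSX =
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
    static final String PPTX =
        "application/vnd.openxmlformats-officedocument.presentationml.presentation";

    /** Returns the guessed MIME type, or null if no known main part is found. */
    static String detect(byte[] packageBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                String name = entry.getName();
                if (name.equals("word/document.xml")) return DOCX;
                if (name.equals("xl/workbook.xml")) return XLSX;
                if (name.equals("ppt/presentation.xml")) return PPTX;
            }
        } catch (IOException e) {
            return null; // unreadable input: treat as "unknown"
        }
        return null;
    }

    /** Hypothetical helper: builds an in-memory zip with empty entries, for trying detect(). */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String entryName : entryNames) {
                zip.putNextEntry(new ZipEntry(entryName));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not build zip", e);
        }
        return bytes.toByteArray();
    }
}
```

For example, `OoxmlTypeSketch.detect(OoxmlTypeSketch.zipOf("[Content_Types].xml", "xl/workbook.xml"))` yields the SpreadsheetML type, which corresponds to the `XSSFExcelExtractor` branch on the slide.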
  69. Can we find it?
  70. Results: 3 & 4
  71. Q & A [email_address]
