Apache Tika

Published in: Technology, Education

  1. Apache Tika: An extensible, configurable content analysis framework toolkit
  2. Agenda: The Problem, The Solution, The Project, The Design
  3. The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
  4. It’s Worse Than That: Licensing, Dependencies, Metadata extraction, Structured content, Encryption/Compression, Package formats, Streaming, Processing of digital media
  5. Agenda: The Problem, The Solution, The Project, The Design
  6. The Solution: Technical
       • Generic API for extracting metadata and structured text content from a document
           • Input: byte stream + optional metadata
           • Output: XHTML SAX events + metadata
       • Automatic content type detection
           • Magic bytes
           • File name patterns
  7. The Solution: Legal / Social
       • Apache License
           • (L)GPL projects can implement the Tika API
       • Pooling of efforts
           • Active development and maintenance
           • Already beyond the functionality of most custom solutions
           • Cool future goals: OCR, speech recognition, …
  8. Agenda: The Problem, The Solution, The Project, The Design
  9. Project Status
       • Initially planned in early 2006
       • Incubating since March 2007
       • Sponsoring PMC: Apache Lucene
       • No releases yet
           • 0.1 release being planned
       • Small development team
           • 6 committers, 3-4 currently active
  10. Current Features
       • Media type framework
           • Shared MIME info spec (freedesktop.org)
           • Default media type registry (incl. glob and magic patterns)
       • Parser components
           • PDF (PDFBox)
           • Plain text (ICU4J)
           • XML (SAX)
           • HTML (NekoHTML)
           • Word, PowerPoint, Excel (POI)
           • ODF (SAX)
           • RTF (Swing)
  11. Project Statistics
  12. Codebase History
       • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
       • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  13. Agenda: The Problem, The Solution, The Project, The Design
  14. Content Extraction
       • Input: PPT
       • Output: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting
       • Code: new PowerPointParser().parse(…);
  15. Media Type Detection
       • Input: ?
       • Output: application/vnd.ms-powerpoint
       • Code: MimeTypes types = …; MimeType type = types.getMimeType(…);
       • Pattern sources: tika-mimetypes.xml, /etc/magic, mime.types
  16. Combined Detection and Extraction
       • Inputs: PPT, TXT, PDF, XML, ?
       • Output: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting
       • Code: new AutoDetectParser().parse(…);
  17. Agenda: The Problem, The Solution, The Project, The Design. Thank You!
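
The slides on content type detection name two signals: magic bytes and file name patterns. The sketch below shows one way the two could be combined. It uses only the Java standard library and is not Tika's MimeTypes implementation: the class name MediaTypeSketch, the rule tables, the suffix-only treatment of glob patterns, and the choice to check magic bytes before the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only: a tiny stand-in for a media type registry,
// not the Tika MimeTypes implementation.
class MediaTypeSketch {
    // Hypothetical magic-byte rules: a leading ASCII sequence and the type it implies.
    private static final String[][] MAGIC = {
        {"%PDF-", "application/pdf"},
        {"{\\rtf", "application/rtf"},
        {"<?xml", "application/xml"},
    };

    // Hypothetical file name rules, simplified from "*.ext" globs to suffixes.
    private static final String[][] GLOBS = {
        {".ppt", "application/vnd.ms-powerpoint"},
        {".pdf", "application/pdf"},
        {".txt", "text/plain"},
        {".xml", "application/xml"},
    };

    // Returns a media type for the given file name and leading bytes.
    // Either argument may be null; unknown input maps to application/octet-stream.
    static String detect(String fileName, byte[] head) {
        // 1. Magic bytes: compare the start of the stream against known prefixes.
        for (String[] rule : MAGIC) {
            byte[] prefix = rule[0].getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, prefix)) {
                return rule[1];
            }
        }
        // 2. File name patterns: fall back to the suffix of the name.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (String[] rule : GLOBS) {
                if (lower.endsWith(rule[0])) {
                    return rule[1];
                }
            }
        }
        // 3. Nothing matched.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("slides.ppt", new byte[0]));
        System.out.println(detect("unknown.bin", pdfHead));
    }
}
```

Checking magic bytes first means a misnamed file is still classified by its content. A real registry would load its rules from a data file rather than hard-coding them.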

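
The technical solution slide describes the parser contract as a byte stream plus optional metadata going in, and XHTML SAX events plus metadata coming out. The sketch below shows that shape for plain text using only the JDK's SAX classes. It is not the Tika parser API: the class PlainTextSketchParser, the use of a Map for metadata, the "Content-Type" key, the UTF-8 assumption, and the one-paragraph-per-line rule are all hypothetical choices for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: mimics the "stream in, XHTML SAX events out" idea.
class PlainTextSketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Reads UTF-8 text (an assumption) and emits one <p> element per non-empty line.
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain"); // hypothetical metadata key
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // blank lines produce no paragraph
            }
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Example consumer: collects character events into text, one line per paragraph.
    static String extractText(byte[] data, Map<String, String> metadata)
            throws IOException, SAXException {
        StringBuilder text = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        parse(new ByteArrayInputStream(data), handler, metadata);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new HashMap<>();
        byte[] data = "Apache Tika\n\nContent analysis".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(data, metadata));
        System.out.println(metadata);
    }
}
```

Because the output is a stream of SAX events, the consumer decides what to keep: the same parse call could feed an indexer, an XML serializer, or a plain-text collector like the one above.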