Apache Tika

Published in: Technology, Education
Transcript of "Apache Tika"

  1. Apache Tika: An extensible, configurable content analysis framework toolkit
  2. Agenda: The Problem, The Solution, The Project, The Design
  3. The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
  4. It’s Worse Than That
      • Licensing
      • Dependencies
      • Metadata extraction
      • Structured content
      • Encryption/Compression
      • Package formats
      • Streaming
      • Processing of digital media
  5. Agenda: The Problem, The Solution, The Project, The Design
  6. The Solution: Technical
      • Generic API for extracting metadata and structured text content from a document
          • Input: byte stream + optional metadata
          • Output: XHTML SAX events + metadata
      • Automatic content type detection
          • Magic bytes
          • File name patterns
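
The "byte stream + optional metadata in, XHTML SAX events + metadata out" contract on this slide can be sketched with nothing but the JDK's SAX classes. This is a minimal illustration, not Tika's actual API: `SketchParser`, `PlainTextSketchParser`, and `SketchParserDemo` are hypothetical names, the metadata is assumed to be a plain string map, and the input is assumed to be UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface for illustration only; not Tika's Parser interface.
interface SketchParser {
    void parse(InputStream in, ContentHandler out, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Assumes UTF-8 input and emits every non-empty line as an XHTML <p> element.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream in, ContentHandler out, Map<String, String> metadata)
            throws IOException, SAXException {
        String text = new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
        metadata.putIfAbsent("Content-Type", "text/plain"); // keep caller-supplied metadata

        out.startDocument();
        start(out, "html");
        start(out, "body");
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            start(out, "p");
            out.characters(line.toCharArray(), 0, line.length());
            end(out, "p");
        }
        end(out, "body");
        end(out, "html");
        out.endDocument();
    }

    private static void start(ContentHandler out, String name) throws SAXException {
        out.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler out, String name) throws SAXException {
        out.endElement(XHTML, name, name);
    }
}

class SketchParserDemo {
    // Consumes the SAX events and keeps only the text, one line per <p> element.
    static String extractText(SketchParser parser, byte[] data, Map<String, String> metadata) {
        StringBuilder text = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        try {
            parser.parse(new ByteArrayInputStream(data), handler, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] data = "Apache Tika\n\nToolkit".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(new PlainTextSketchParser(), data, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events rather than returning a string lets the consumer decide what to keep: an indexer can collect only the character data, while another client can serialize the full XHTML structure.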
  7. The Solution: Legal / Social
      • Apache License
          • (L)GPL projects can implement the Tika API
      • Pooling of efforts
          • Active development and maintenance
          • Already beyond the functionality of most custom solutions
          • Cool future goals: OCR, speech recognition, …
  8. Agenda: The Problem, The Solution, The Project, The Design
  9. Project Status
      • First planned in early 2006
      • Incubating since March 2007
      • Sponsoring PMC: Apache Lucene
      • No releases yet
          • 0.1 release being planned
      • Small development team
          • 6 committers, 3-4 currently active
  10. Current Features
      • Media type framework
          • Shared MIME info spec (freedesktop.org)
          • Default media type registry (incl. glob and magic patterns)
      • Parser components
          • PDF (PDFBox)
          • Plain text (ICU4J)
          • XML (SAX)
          • HTML (NekoHTML)
          • Word, PowerPoint, Excel (POI)
          • ODF (SAX)
          • RTF (Swing)
  11. Project Statistics
  12. Codebase History
      • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
      • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  13. Agenda: The Problem, The Solution, The Project, The Design
  14. Content Extraction
      • Input: PPT
      • Code: new PowerPointParser().parse(…);
      • Output: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting
  15. Media Type Detection
      • Code: MimeTypes types = …; MimeType type = types.getMimeType(…);
      • Result: application/vnd.ms-powerpoint
      • Type definitions: tika-mimetypes.xml, /etc/magic, mime.types
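
To make the two detection signals from the deck (magic bytes and file name patterns) concrete, here is a rough JDK-only sketch. It is not Tika's `MimeTypes` implementation: `MediaTypeSketch`, its tiny hard-coded tables, the "magic first, then file name" order, and the suffix-only glob matching are all simplifying assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical detector for illustration only; not Tika's MimeTypes class.
class MediaTypeSketch {
    static final String UNKNOWN = "application/octet-stream";

    // Media type -> leading bytes that identify it (a tiny hand-picked table).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Glob pattern of the form "*.ext" -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", ascii("%PDF-"));
        MAGIC.put("application/xml", ascii("<?xml"));
        GLOBS.put("*.ppt", "application/vnd.ms-powerpoint");
        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.txt", "text/plain");
    }

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Tries magic bytes first, then the file name, then falls back to UNKNOWN.
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey();
            }
        }
        if (fileName != null) {
            String name = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                String suffix = e.getKey().substring(1); // "*.ppt" -> ".ppt"
                if (name.endsWith(suffix)) {
                    return e.getValue();
                }
            }
        }
        return UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(detect(new byte[0], "slides.ppt"));
        System.out.println(detect(ascii("%PDF-1.4"), "download.bin"));
    }
}
```

Checking content before the name means a mislabeled file (a PDF saved as `download.bin`) is still typed correctly, while the file name remains a fallback when the leading bytes are not distinctive.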
  16. Combined Detection and Extraction
      • Input: PPT, TXT, PDF, XML
      • Code: new AutoDetectParser().parse(…);
      • Output: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting
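
The idea behind combining detection with extraction is a dispatcher: detect the media type, then hand the bytes to whichever parser is registered for that type. The sketch below shows only that dispatch pattern and makes no claim about how `AutoDetectParser` is implemented. `AutoDetectSketch`, the `Function<byte[], String>` parser shape, and the three-way `detect` rule are hypothetical simplifications.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for illustration only; not Tika's AutoDetectParser.
class AutoDetectSketch {
    // Media type -> parser; a parser here simply turns bytes into text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Deliberately tiny detection rule: two magic prefixes, else plain text.
    static String detect(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.ISO_8859_1);
        if (head.startsWith("%PDF-")) {
            return "application/pdf";
        }
        if (head.startsWith("<?xml")) {
            return "application/xml";
        }
        return "text/plain";
    }

    // Detects the type, records it in the metadata, and delegates to the matching parser.
    String parse(byte[] data, Map<String, String> metadata) {
        String type = detect(data);
        metadata.put("Content-Type", type);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        auto.register("text/plain", d -> new String(d, StandardCharsets.UTF_8));
        // Crude tag stripping stands in for a real XML parser in this sketch.
        auto.register("application/xml",
                d -> new String(d, StandardCharsets.UTF_8).replaceAll("<[^>]+>", ""));

        Map<String, String> metadata = new HashMap<>();
        byte[] xml = "<?xml version=\"1.0\"?><doc>Hi</doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(auto.parse(xml, metadata) + " " + metadata);
    }
}
```

The caller never names a format: adding support for a new type means registering one more parser, which is the property the single `parse(…)` call on the slide is meant to convey.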
  17. Agenda: The Problem, The Solution, The Project, The Design. Thank You!