MIME Magic with Apache Tika

Fast Feather Track presentation at ApacheCon EU 2008 in Amsterdam

Published in: Technology

  1. MIME Magic with Apache Tika
       Jukka Zitting, Tika committer and mentor
  2. Agenda: The Problem, The Solution, The Project, The Client
  3. The Problem
       • PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
       • Lucene index
  4. It's even worse!
       • Licensing/Patents
       • Dependencies
       • Metadata extraction
       • Structured content
       • Encryption/Compression
       • Package formats
       • Streaming/Performance
       • Processing of digital media
  5. Agenda: The Problem, The Solution, The Project, The Client
  6. The Solution: Technical
       • Generic API for extracting metadata and structured text content from a document
           • Input: byte stream + optional metadata
           • Output: XHTML SAX events + metadata
       • Automatic content type detection
           • Magic bytes
           • File name patterns
  7. The Solution: Legal / Social
       • Apache License
           • (L)GPL projects can implement the Tika API
       • Pooling of efforts
           • Active development and maintenance
           • Already beyond the functionality of most custom solutions
  8. Agenda: The Problem, The Solution, The Project, The Client
  9. Project Status
       • Incubating since March 2007
       • Sponsoring PMC: Apache Lucene
       • First release (0.1-incubating) in December 2007
       • Interaction with PDFBox, POI, etc.
       • Currently in early adopter phase
  10. Current Features
       • 73 registered media types
           • 167 glob patterns
           • 26 magic header patterns
       • 7 built-in parser classes
           • 51 supported media types
           • MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text
  11. Project Statistics
  12. Agenda: The Problem, The Solution, The Project, The Client
  13. Tika Parser API
        package org.apache.tika.parser;

        import java.io.IOException;
        import java.io.InputStream;
        import org.apache.tika.exception.TikaException;
        import org.apache.tika.metadata.Metadata;
        import org.xml.sax.ContentHandler;
        import org.xml.sax.SAXException;

        public interface Parser {
            // Parses document content and metadata
            void parse(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException;

            // Parses document metadata, @since Tika 0.2
            void parse(InputStream stream, Metadata metadata)
                throws IOException, TikaException;
        }
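  14. Example: Text extraction
        import java.io.InputStream;
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.sax.WriteOutContentHandler;
        import org.xml.sax.ContentHandler;
        // The class name is chosen here for illustration; the slide shows only main()
        public class TextExtraction {
            public static void main(String[] args) throws Exception {
                // Read the document from standard input
                InputStream stream = System.in;
                // Write the extracted text to standard output
                ContentHandler handler = new WriteOutContentHandler(System.out);
                Metadata metadata = new Metadata();
                // Detect the content type and delegate to the matching parser
                new AutoDetectParser().parse(stream, handler, metadata);
            }
        }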
  15. Demo: Tika GUI
  16. Agenda: The Problem, The Solution, The Project, The Client
       Thank You!
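
The two detection mechanisms named on the "The Solution: Technical" slide (magic bytes and file name patterns) can be illustrated with a small standard-library sketch. This is not Tika's implementation or its media type registry: the class name `MimeSketch`, the three magic patterns, the three extensions, and the choice to try magic bytes before the file name are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of content type detection; not Tika's API or registry. */
class MimeSketch {
    // Magic header patterns: leading bytes -> media type (the map is only iterated, never looked up).
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    // File name patterns, reduced here to simple extensions -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put(new byte[] {'P', 'K', 3, 4}, "application/zip");
        MAGIC.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".pdf", "application/pdf");
    }

    /** Tries magic bytes first (an assumption of this sketch), then the file name. */
    static String detect(byte[] prefix, String fileName) {
        for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
            byte[] magic = e.getKey();
            if (prefix.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic)) {
                return e.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        // Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));         // magic bytes win over the name
        System.out.println(detect(new byte[0], "index.html")); // falls back to the file name
        System.out.println(detect(new byte[0], null));         // nothing to go on
    }
}
```

Letting the magic bytes take precedence means a mislabeled file is still recognized by its content, while the file name remains a fallback for formats without a distinctive header, such as plain text.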
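
The slides describe the parser output as XHTML SAX events. As a rough sketch of what consuming such events looks like on the client side, the handler below collects the title and paragraph text from a stream of SAX events. It uses only the JDK's SAX parser to produce the events from a literal XHTML string, so no Tika classes are involved; the class name `TextCollector` and the sample markup are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical client-side handler that turns XHTML SAX events into plain text. */
class TextCollector extends DefaultHandler {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            inTitle = false;
        } else if ("p".equals(qName)) {
            text.append('\n'); // end each paragraph with a line break
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route character data to the title or to the body text.
        (inTitle ? title : text).append(ch, start, length);
    }

    String getTitle() { return title.toString(); }

    String getText() { return text.toString(); }

    /** Feeds an XHTML string through the JDK SAX parser into a new collector. */
    static TextCollector collect(String xhtml) {
        TextCollector handler = new TextCollector();
        InputStream in = new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8));
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        TextCollector c = collect(
            "<html><head><title>Demo</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>");
        System.out.println(c.getTitle());
        System.out.print(c.getText());
    }
}
```

Because the output is delivered as events rather than a finished string, a client can process large documents as they stream by and decide for itself which structure to keep, here just the title and paragraph breaks.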
