MIME Magic with Apache Tika. Jukka Zitting, Tika committer and mentor.
Agenda: The Problem, The Solution, The Project, The Client
The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
It's even worse! Licensing/Patents. Dependencies. Metadata extraction. Structured content. Encryption/Compression. Package formats. Streaming/Performance. Processing of digital media.
Agenda: The Problem, The Solution, The Project, The Client
The Solution: Technical. Generic API for extracting metadata and structured text content from a document. Input: byte stream + optional metadata. Output: XHTML SAX events + metadata. Automatic content type detection: magic bytes, file name patterns.
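
The two detection signals named on this slide (magic bytes and file name patterns) can be illustrated with a toy detector. This is a minimal sketch, not Tika's detection code: the class `SimpleTypeDetector`, its two lookup tables, and the rule that magic bytes are checked before the file name are all assumptions made for illustration, and it uses only the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: a toy detector, not part of Tika.
final class SimpleTypeDetector {

    // Hypothetical magic-byte table: leading bytes -> media type.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    // Hypothetical file-name table: extension -> media type.
    private static final Map<String, String> EXTENSIONS = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        MAGIC.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        EXTENSIONS.put(".html", "text/html");
        EXTENSIONS.put(".txt", "text/plain");
        EXTENSIONS.put(".pdf", "application/pdf");
    }

    private SimpleTypeDetector() {
    }

    // Returns a media type for the given leading bytes and optional file name.
    static String detect(byte[] prefix, String fileName) {
        // Step 1 (assumed order): compare the start of the stream with each magic pattern.
        for (Map.Entry<byte[], String> entry : MAGIC.entrySet()) {
            byte[] magic = entry.getKey();
            if (prefix.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(prefix, magic.length), magic)) {
                return entry.getValue();
            }
        }
        // Step 2: fall back to the file name extension, if a name was supplied.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : EXTENSIONS.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // Step 3: nothing matched, so report the generic binary type.
        return "application/octet-stream";
    }
}
```

In this sketch, a stream that starts with `%PDF-` is reported as PDF even if it is named `report.txt`, which is why content-based detection is useful alongside name patterns.
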
The Solution: Legal / Social. Apache License: (L)GPL projects can implement the Tika API. Pooling of efforts. Active development and maintenance. Already beyond the functionality of most custom solutions.
Agenda: The Problem, The Solution, The Project, The Client
Project Status. Incubating since March 2007. Sponsoring PMC: Apache Lucene. First release (0.1-incubating) in December 2007. Interaction with PDFBox, POI, etc. Currently in early adopter phase.
Current Features. 73 registered media types. 167 glob patterns. 26 magic header patterns. 7 built-in parser classes. 51 supported media types. MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text.
Project Statistics
Agenda: The Problem, The Solution, The Project, The Client
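Tika Parser API

    package org.apache.tika.parser;

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;

    public interface Parser {

        // Parses document content and metadata
        void parse(InputStream stream, ContentHandler handler, Metadata metadata)
                throws IOException, SAXException, TikaException;

        // Parses document metadata, @since Tika 0.2
        void parse(InputStream stream, Metadata metadata)
                throws IOException, TikaException;
    }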
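Example: Text extraction

    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.xml.sax.ContentHandler;

    // Class name is illustrative; the slide shows only main()
    public class TextExtraction {
        public static void main(String[] args) throws Exception {
            InputStream stream = System.in;
            ContentHandler handler = new WriteOutContentHandler(System.out);
            Metadata metadata = new Metadata();
            new AutoDetectParser().parse(stream, handler, metadata);
        }
    }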
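
Because the parser reports content as XHTML SAX events, a client can pass its own `ContentHandler` instead of writing everything out. The sketch below is an assumption-laden illustration of that idea: `TitleAndTextHandler` is a hypothetical class, not part of Tika, and the `collect` helper drives it with the JDK's own SAX parser on a hand-written XHTML string as a stand-in for parser output, so the example runs without Tika on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not part of Tika.
final class TitleAndTextHandler extends DefaultHandler {

    private final StringBuilder title = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private boolean inTitle;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            inTitle = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route character data to the title or to the body text.
        (inTitle ? title : text).append(ch, start, length);
    }

    String title() {
        return title.toString().trim();
    }

    String text() {
        return text.toString().trim();
    }

    // Stand-in for a real parse: feeds an XHTML string through the JDK SAX parser.
    static TitleAndTextHandler collect(String xhtml) {
        TitleAndTextHandler handler = new TitleAndTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                    handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not process XHTML input", e);
        }
        return handler;
    }
}
```

Since the class extends `DefaultHandler`, it implements `ContentHandler`, so an instance could in principle be passed as the `handler` argument in the text extraction example in place of `WriteOutContentHandler`.
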
Demo: Tika GUI
Agenda: The Problem, The Solution, The Project, The Client. Thank You!
