Mime Magic With Apache Tika

MIME Magic with Apache Tika
Jukka Zitting, Tika committer and mentor

Agenda
- The Problem
- The Solution
- The Project
- The Client

The Problem
[Diagram labels: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index]

It's even worse!
- Licensing/Patents
- Dependencies
- Metadata extraction
- Structured content
- Encryption/Compression
- Package formats
- Streaming/Performance
- Processing of digital media

The Solution: Technical
- Generic API for extracting metadata and structured text content from a document
- Input: byte stream + optional metadata
- Output: XHTML SAX events + metadata
- Automatic content type detection
  - Magic bytes
  - File name patterns
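The two detection strategies named on this slide can be illustrated with a minimal sketch in plain Java. This is not Tika's detector or its type registry: the class `MimeMagicSketch`, its methods, and the handful of hard-coded types are hypothetical names chosen for illustration. It assumes magic bytes are checked first and the file name is used only as a fallback.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch only: not Tika's detector and not its type registry.
final class MimeMagicSketch {
    // Well-known file signatures ("magic bytes") at offset 0.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04}; // "PK" 03 04

    // True if data begins with the given signature.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Tries magic bytes first, then file name patterns, then a generic default.
    static String detect(byte[] prefix, String fileName) {
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, ZIP_MAGIC)) {
            return "application/zip";
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".html") || lower.endsWith(".htm")) {
                return "text/html";
            }
            if (lower.endsWith(".txt")) {
                return "text/plain";
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));          // application/pdf
        System.out.println(detect(new byte[0], "index.HTML"));  // text/html
    }
}
```

Checking content before the name means a mislabeled file (a PDF saved as `report.bin`) is still recognized, while the name pattern covers formats such as plain text that have no signature.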

The Solution: Legal / Social
- Apache License
- (L)GPL projects can implement the Tika API
- Pooling of efforts
- Active development and maintenance
- Already beyond the functionality of most custom solutions

Project Status
- Incubating since March 2007
- Sponsoring PMC: Apache Lucene
- First release (0.1-incubating) in December 2007
- Interaction with PDFBox, POI, etc.
- Currently in early adopter phase

Current Features
- 73 registered media types
- 167 glob patterns
- 26 magic header patterns
- 7 built-in parser classes
- 51 supported media types
- MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text

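Tika Parser API
    package org.apache.tika.parser;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    public interface Parser {
        // Parses document content and metadata
        void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException;
        // Parses document metadata, @since Tika 0.2
        void parse(InputStream stream, Metadata metadata)
            throws IOException, TikaException;
    }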

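Example: Text extraction
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.WriteOutContentHandler;
    import org.xml.sax.ContentHandler;
    // Wrapper class name is illustrative; the slide shows only main()
    public class TextExtraction {
        public static void main(String[] args) throws Exception {
            InputStream stream = System.in;
            ContentHandler handler = new WriteOutContentHandler(System.out);
            Metadata metadata = new Metadata();
            // Detects the content type and delegates to the matching parser
            new AutoDetectParser().parse(stream, handler, metadata);
        }
    }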
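Because the parser output is delivered as SAX events, a client consumes it by supplying its own `ContentHandler`. The sketch below shows the idea using only the JDK: a handler that collects character events into a string. The class `TextCollector` and its `extract` helper are hypothetical names, and the JDK's XML parser stands in for a Tika parser as the source of events, so this shows the handler side only and is not Tika code.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch: a SAX handler that keeps only the text content.
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    // Called for each run of character data between tags.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String text() {
        return text.toString();
    }

    // Feeds an XHTML string through the JDK SAX parser (a stand-in for a
    // Tika parser) and returns the collected text.
    static String extract(String xhtml) {
        TextCollector collector = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extract(xhtml)); // HelloTika
    }
}
```

A handler like this ignores element boundaries, so adjacent paragraphs run together. A real client would typically also override `startElement` or `endElement` to insert whitespace or to keep the document structure.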

Agenda
- The Problem
- The Solution
- The Project
- The Client
Thank You!
