Apache Tika

Apache Tika: an extensible, configurable content analysis framework toolkit

Agenda
- The Problem
- The Solution
- The Project
- The Design

The Problem
[Diagram: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc. feeding a Lucene index]

It’s Worse Than That
- Licensing
- Dependencies
- Metadata extraction
- Structured content
- Encryption/Compression
- Package formats
- Streaming
- Processing of digital media

The Solution: Technical
- Generic API for extracting metadata and structured text content from a document
  - Input: byte stream + optional metadata
  - Output: XHTML SAX events + metadata
- Automatic content type detection
  - Magic bytes
  - File name patterns
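
The input/output contract on this slide (byte stream plus metadata in, XHTML SAX events plus metadata out) can be illustrated with a small JDK-only sketch. Everything named here (`SketchParser`, `PlainTextSketchParser`, `TextCollector`, the `Content-Type` metadata key) is a hypothetical stand-in chosen for illustration and is not the Tika API itself; the sketch also assumes Java 9+ for `readAllBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical interface for illustration only; not the actual Tika parser API. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Toy parser: reads the stream as UTF-8 text and emits it as a single XHTML paragraph. */
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        metadata.put("Content-Type", "text/plain"); // output metadata (key name is an assumption)

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }
}

/** Consumer side: collects the character events and ignores the markup events. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}

class SketchParserDemo {
    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new HashMap<>();
        TextCollector collector = new TextCollector();
        byte[] input = "Hello Tika".getBytes(StandardCharsets.UTF_8);
        new PlainTextSketchParser().parse(new ByteArrayInputStream(input), collector, metadata);
        System.out.println(metadata.get("Content-Type") + ": " + collector.text);
    }
}
```

Because the output is a stream of SAX events rather than a string, a consumer such as an indexer can pick out only what it needs (here, plain characters) without the parser knowing anything about it.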

The Solution: Legal / Social
- Apache License
- (L)GPL projects can implement the Tika API
- Pooling of efforts
- Active development and maintenance
- Already beyond the functionality of most custom solutions
- Cool future goals: OCR, speech recognition, …

Project Status
- Initially planned in early 2006
- Incubating since March 2007
  - Sponsoring PMC: Apache Lucene
- No releases yet
  - 0.1 release being planned
- Small development team
  - 6 committers, 3-4 currently active

Current Features
- Media type framework
  - Shared MIME info spec (freedesktop.org)
  - Default media type registry (incl. glob and magic patterns)
- Parser components
  - PDF (PDFBox)
  - Plain text (ICU4J)
  - XML (SAX)
  - HTML (NekoHTML)
  - Word, PowerPoint, Excel (POI)
  - ODF (SAX)
  - RTF (Swing)

Codebase History
[Diagram] Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
[Diagram] People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett

Content Extraction
[Diagram labels: PPT; Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting]
new PowerPointParser().parse(…);

Media Type Detection
[Diagram labels: application/vnd.ms-powerpoint; tika-mimetypes.xml, /etc/magic, mime.types]
MimeTypes types = …;
MimeType type = types.getMimeType(…);
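
To make the two detection signals concrete (magic bytes first, file name patterns as a fallback), here is a minimal JDK-only sketch. `SketchTypeDetector` and its small hard-coded tables are hypothetical and are not Tika's `MimeTypes` class or its registry; the only facts relied on are that PDF files begin with `%PDF-` and RTF files with `{\rtf`.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector for illustration only; not Tika's MimeTypes class. */
class SketchTypeDetector {
    // Magic patterns: leading bytes that identify a format regardless of file name.
    private final Map<byte[], String> magics = new LinkedHashMap<>();
    // Simplified glob patterns of the form "*.ext", stored as lower-case suffixes.
    private final Map<String, String> suffixes = new LinkedHashMap<>();

    SketchTypeDetector() {
        magics.put("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf");
        magics.put("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf");
        suffixes.put(".ppt", "application/vnd.ms-powerpoint");
        suffixes.put(".pdf", "application/pdf");
        suffixes.put(".txt", "text/plain");
    }

    /** Checks magic bytes first, then the file name, then falls back to a generic type. */
    String detect(byte[] prefix, String fileName) {
        for (Map.Entry<byte[], String> entry : magics.entrySet()) {
            if (startsWith(prefix, entry.getKey())) {
                return entry.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : suffixes.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SketchTypeDetector detector = new SketchTypeDetector();
        byte[] pdfStart = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        // The content wins over the misleading file name.
        System.out.println(detector.detect(pdfStart, "report.bin"));
        // No known magic, so the file name pattern decides.
        System.out.println(detector.detect(new byte[0], "slides.ppt"));
    }
}
```

Checking content before the name is a design choice in this sketch: a renamed file is still recognised, while the name only decides when the leading bytes match nothing known.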

Combined Detection and Extraction
[Diagram labels: PPT, TXT, PDF, XML; Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting]
new AutoDetectParser().parse(…);
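
The idea of combining detection and extraction can be sketched as a dispatcher that first detects the media type and then hands the document to whichever parser is registered for it. `SketchAutoDetect` below is a hypothetical JDK-only illustration, not Tika's `AutoDetectParser`; the detector and the two toy parsers in `main` are deliberately naive assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher for illustration only; not Tika's AutoDetectParser. */
class SketchAutoDetect {
    private final Function<byte[], String> detector;
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    SketchAutoDetect(Function<byte[], String> detector) {
        this.detector = detector;
    }

    /** Registers a text-extracting parser for one media type. */
    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    /** Detects the type, records it in the metadata, and delegates to the matching parser. */
    String parse(byte[] document, Map<String, String> metadata) {
        String type = detector.apply(document);
        metadata.put("Content-Type", type); // key name is an assumption
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(document);
    }

    public static void main(String[] args) {
        // Naive detector: anything starting with '<' is treated as XML.
        SketchAutoDetect auto = new SketchAutoDetect(
                bytes -> bytes.length > 0 && bytes[0] == '<' ? "application/xml" : "text/plain");
        auto.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Naive XML "parser": strips tags with a regular expression.
        auto.register("application/xml",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]+>", ""));

        Map<String, String> metadata = new HashMap<>();
        String text = auto.parse("<title>Apache Tika</title>".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

The caller never names a format: adding support for a new one means registering another parser, which matches the single entry point shown on the slide.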

Agenda
- The Problem
- The Solution
- The Project
- The Design
Thank You!
