Apache Tika: an extensible, configurable content analysis toolkit
Agenda: The Problem; The Solution; The Project; The Design
The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
It’s Worse Than That: licensing; dependencies; metadata extraction; structured content; encryption/compression; package formats; streaming; processing of digital media
Agenda: The Problem; The Solution; The Project; The Design
The Solution: Technical. Generic API for extracting metadata and structured text content from a document. Input: byte stream + optional metadata. Output: XHTML SAX events + metadata. Automatic content type detection: magic bytes, file name patterns.
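The input/output contract on this slide (byte stream in, XHTML SAX events plus metadata out) can be sketched with the JDK's SAX classes alone. Everything named here is hypothetical and only for illustration: the `SketchParser` interface, the `PlainTextSketchParser` and `TextCollector` classes, and the `Content-Type` metadata key are assumptions, not the Tika API.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical interface for illustration only; not the Tika API.
interface SketchParser {
    void parse(InputStream in, ContentHandler out, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Turns UTF-8 plain text into XHTML SAX events, one <p> per non-empty line.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream in, ContentHandler out, Map<String, String> metadata)
            throws IOException, SAXException {
        // The parser adds what it knows to the caller-supplied metadata (assumed key name).
        metadata.put("Content-Type", "text/plain");
        AttributesImpl none = new AttributesImpl();
        out.startDocument();
        out.startElement(XHTML, "html", "html", none);
        out.startElement(XHTML, "body", "body", none);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // blank lines only separate paragraphs
            }
            out.startElement(XHTML, "p", "p", none);
            out.characters(line.toCharArray(), 0, line.length());
            out.endElement(XHTML, "p", "p");
        }
        out.endElement(XHTML, "body", "body");
        out.endElement(XHTML, "html", "html");
        out.endDocument();
    }
}

// A consumer that keeps only the text, roughly what an indexer would want.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n'); // end each paragraph with a newline
        }
    }
}

class ExtractionSketch {
    // Runs the sketch parser over a string and returns the collected text.
    static String extract(String input, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            new PlainTextSketchParser().parse(
                    new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)),
                    collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        System.out.print(extract("Apache Tika\n\ncontent analysis", metadata));
        System.out.println(metadata);
    }
}
```

Because the output is a stream of SAX events rather than a string, a consumer can index the text, serialize it as XHTML, or discard it without the parser knowing which.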
The Solution: Legal / Social. Apache License; (L)GPL projects can implement the Tika API. Pooling of efforts; active development and maintenance. Already beyond the functionality of most custom solutions. Cool future goals: OCR, speech recognition, …
Agenda: The Problem; The Solution; The Project; The Design
Project Status. First planned in early 2006. Incubating since March 2007; sponsoring PMC: Apache Lucene. No releases yet; 0.1 release being planned. Small development team: 6 committers, 3-4 currently active.
Current Features. Media type framework: shared MIME info spec (freedesktop.org); default media type registry (incl. glob and magic patterns). Parser components: PDF (PDFBox); plain text (ICU4J); XML (SAX); HTML (NekoHTML); Word, PowerPoint, Excel (POI); ODF (SAX); RTF (Swing).
Project Statistics
Codebase History. Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit. People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett.
Agenda: The Problem; The Solution; The Project; The Design
Content Extraction. Input: PPT. Call: new PowerPointParser().parse(…); Output metadata: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting.
Media Type Detection. Call: MimeTypes types = …; MimeType type = types.getMimeType(…); Result: application/vnd.ms-powerpoint. Type definitions: tika-mimetypes.xml, /etc/magic, mime.types.
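A minimal sketch of how magic bytes and file name patterns can work together, using only the JDK. The `MediaTypeSketch` class and its tiny built-in tables are hypothetical stand-ins for a real registry such as tika-mimetypes.xml; this is not the `MimeTypes` API named on the slide. The PDF and OLE2 signatures and the media type names are the standard ones.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical detector for illustration; a real registry would be data-driven.
class MediaTypeSketch {
    private static final String UNKNOWN = "application/octet-stream";

    // "%PDF-" opens every PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound document header, shared by legacy Word, Excel and PowerPoint files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // File name patterns, reduced to simple suffix matches for this sketch.
    private static final Map<String, String> BY_SUFFIX = new LinkedHashMap<>();
    private static final Map<String, String> OLE2_BY_SUFFIX = new LinkedHashMap<>();
    static {
        OLE2_BY_SUFFIX.put(".ppt", "application/vnd.ms-powerpoint");
        OLE2_BY_SUFFIX.put(".doc", "application/msword");
        OLE2_BY_SUFFIX.put(".xls", "application/vnd.ms-excel");
        BY_SUFFIX.putAll(OLE2_BY_SUFFIX);
        BY_SUFFIX.put(".pdf", "application/pdf");
        BY_SUFFIX.put(".txt", "text/plain");
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    private static String lookup(Map<String, String> table, String fileName) {
        if (fileName == null) {
            return null;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : table.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    // head: the first bytes of the document; fileName: optional hint, may be null.
    static String detect(byte[] head, String fileName) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf"; // content wins over a misleading name
        }
        if (startsWith(head, OLE2_MAGIC)) {
            // The magic only identifies the container; the name narrows it down.
            String byName = lookup(OLE2_BY_SUFFIX, fileName);
            return byName != null ? byName : UNKNOWN;
        }
        String byName = lookup(BY_SUFFIX, fileName);
        return byName != null ? byName : UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "report.txt"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

The OLE2 case shows why both signals are useful: the same eight leading bytes open legacy Word, Excel and PowerPoint files, so the magic alone cannot tell them apart.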
Combined Detection and Extraction. Input: PPT, TXT, PDF, XML. Call: new AutoDetectParser().parse(…); Output metadata: Type: application/vnd.ms-powerpoint; Title: Apache Tika; Author: Jukka Zitting.
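The combined step amounts to "detect the type, then hand the document to whichever parser is registered for it". The sketch below illustrates only that dispatch idea: `AutoDetectSketch`, its toy detector, and its two toy parsers are hypothetical and do not reflect how `AutoDetectParser` is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for illustration; not the AutoDetectParser implementation.
class AutoDetectSketch {
    private final Function<byte[], String> detector;
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch(Function<byte[], String> detector) {
        this.detector = detector;
    }

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Detects the media type, then delegates to the parser registered for it.
    String parse(byte[] document) {
        String type = detector.apply(document);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return type + ": " + parser.apply(document);
    }

    // Builds a dispatcher with a toy detector and two toy parsers.
    static AutoDetectSketch demo() {
        // Toy rule: a leading '<' means XML, anything else is plain text.
        AutoDetectSketch auto = new AutoDetectSketch(
                bytes -> bytes.length > 0 && bytes[0] == '<' ? "application/xml" : "text/plain");
        auto.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8).trim());
        // Crude tag stripping by regex; good enough for a demo, not for real XML.
        auto.register("application/xml",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
        return auto;
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = demo();
        System.out.println(auto.parse("<doc>Apache Tika</doc>".getBytes(StandardCharsets.UTF_8)));
        System.out.println(auto.parse(" hello ".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The caller never names a format: adding support for a new one means registering another parser, with no change at the call site.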
Agenda: The Problem; The Solution; The Project; The Design. Thank You!
