Mime Magic With Apache Tika

Fast Feather Track presentation at ApacheCon EU 2008 in Amsterdam

    Presentation Transcript

    • MIME Magic with Apache Tika, by Jukka Zitting, Tika committer and mentor
    • Agenda: The Problem, The Solution, The Project, The Client
    • The Problem
      • PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
      • Lucene index
    • It's even worse!
      • Licensing/Patents
      • Dependencies
      • Metadata extraction
      • Structured content
      • Encryption/Compression
      • Package formats
      • Streaming/Performance
      • Processing of digital media
    • The Solution: Technical
      • Generic API for extracting metadata and structured text content from a document
        • Input: byte stream + optional metadata
        • Output: XHTML SAX events + metadata
      • Automatic content type detection
        • Magic bytes
        • File name patterns
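The two detection strategies named on this slide can be sketched in plain Java. This is only an illustration of the idea, not Tika's implementation: the class `MimeSniffer`, its method `detect`, and the small pattern tables are hypothetical names and data chosen for the example. Magic bytes are checked first because they describe the actual content; the file name is used only as a fallback.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector illustrating magic-byte and file-name matching. */
final class MimeSniffer {

    /** One magic header pattern: leading bytes plus the media type they indicate. */
    private static final class Magic {
        final byte[] header;
        final String type;

        Magic(byte[] header, String type) {
            this.header = header;
            this.type = type;
        }
    }

    // Illustrative subset of magic header patterns (assumed for this sketch).
    private static final List<Magic> MAGICS = new ArrayList<>();
    // Illustrative subset of "*.ext" glob patterns, stored as lower-case suffixes.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGICS.add(new Magic("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        MAGICS.add(new Magic("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"));
        MAGICS.add(new Magic(new byte[] {'P', 'K', 3, 4}, "application/zip"));
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".rtf", "application/rtf");
        SUFFIXES.put(".html", "text/html");
        SUFFIXES.put(".txt", "text/plain");
    }

    /** Returns a media type guess from the first bytes and an optional file name. */
    static String detect(byte[] prefix, String fileName) {
        // 1. Magic bytes: compare the start of the stream with known headers.
        for (Magic magic : MAGICS) {
            if (startsWith(prefix, magic.header)) {
                return magic.type;
            }
        }
        // 2. File name patterns: fall back to the extension if one is known.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        // 3. Nothing matched: report generic binary data.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] header) {
        if (data == null || data.length < header.length) {
            return false;
        }
        for (int i = 0; i < header.length; i++) {
            if (data[i] != header[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfStart = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfStart, "report.bin")); // prints application/pdf
        System.out.println(detect(new byte[0], "INDEX.HTML")); // prints text/html
    }
}
```

In the first call the content wins over the misleading `.bin` extension; in the second there are no bytes to inspect, so the name decides.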
    • The Solution: Legal / Social
      • Apache License
        • (L)GPL projects can implement the Tika API
      • Pooling of efforts
        • Active development and maintenance
        • Already beyond the functionality of most custom solutions
    • Project Status
      • Incubating since March 2007
      • Sponsoring PMC: Apache Lucene
      • First release (0.1-incubating) in December 2007
      • Interaction with PDFBox, POI, etc.
      • Currently in early adopter phase
    • Current Features
      • 73 registered media types
        • 167 glob patterns
        • 26 magic header patterns
      • 7 built-in parser classes
        • 51 supported media types
        • MS Office, OpenOffice, HTML, PDF, XML, RTF, plain text
    • Project Statistics
    • Tika Parser API
      • package org.apache.tika.parser;
      • public interface Parser {
      •     // Parses document content and metadata
      •     void parse(InputStream stream, ContentHandler handler, Metadata metadata)
      •         throws IOException, SAXException, TikaException;
      •     // Parses document metadata, @since Tika 0.2
      •     void parse(InputStream stream, Metadata metadata)
      •         throws IOException, TikaException;
      • }
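To make the shape of this contract concrete, here is a minimal sketch of what an implementation for plain text might look like. It compiles against the JDK alone, so it does not use the Tika classes: `SimpleParser` is a hypothetical stand-in for the `Parser` interface above, a `Map<String, String>` stands in for `Metadata`, and the helper `extract` exists only for the demonstration. UTF-8 input is assumed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class PlainTextParserSketch {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Hypothetical stand-in for the Parser interface; a Map replaces Metadata. */
    interface SimpleParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Emits each input line as an XHTML paragraph via SAX events. */
    static final class PlainTextParser implements SimpleParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain");
            AttributesImpl noAttributes = new AttributesImpl();
            handler.startDocument();
            handler.startElement(XHTML, "html", "html", noAttributes);
            handler.startElement(XHTML, "body", "body", noAttributes);
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                handler.startElement(XHTML, "p", "p", noAttributes);
                handler.characters(line.toCharArray(), 0, line.length());
                handler.endElement(XHTML, "p", "p");
            }
            handler.endElement(XHTML, "body", "body");
            handler.endElement(XHTML, "html", "html");
            handler.endDocument();
        }
    }

    /** Demo helper: runs the parser and returns the text, one line per paragraph. */
    static String extract(String text, Map<String, String> metadata) {
        final StringBuilder out = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    out.append('\n');
                }
            }
        };
        try {
            InputStream stream = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
            new PlainTextParser().parse(stream, handler, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        System.out.print(extract("first line\nsecond line", metadata));
        System.out.println(metadata);
    }
}
```

The parser never builds a document tree; it pushes events to whatever handler the caller supplies, which is what keeps the approach stream-friendly.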
    • Example: Text extraction
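      • public static void main(String[] args) throws Exception {
      •     // Read the document from standard input
      •     InputStream stream = System.in;
      •     // Write the extracted text to standard output
      •     ContentHandler handler = new WriteOutContentHandler(System.out);
      •     Metadata metadata = new Metadata();
      •     // Detect the media type and delegate to the matching parser
      •     new AutoDetectParser().parse(stream, handler, metadata);
      • }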
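The consumer side of this example can be sketched without Tika on the classpath. The handler below keeps only character data from a stream of XHTML SAX events and writes it to a `Writer`, which is roughly the role the example gives to `WriteOutContentHandler`. Everything here is an assumption made for illustration: `TextOnlyHandler` and `textOf` are hypothetical names, and the JDK's own SAX parser reading an XHTML string stands in for a Tika parser as the event source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class TextOnlyHandlerSketch {

    /** Hypothetical handler: forwards character events to a Writer, ignores markup. */
    static final class TextOnlyHandler extends DefaultHandler {
        private final Writer writer;

        TextOnlyHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                // SAX callbacks may only throw SAXException, so wrap the I/O failure.
                throw new SAXException(e);
            }
        }
    }

    /** Demo helper: parses an XHTML string with the JDK SAX parser and returns its text. */
    static String textOf(String xhtml) {
        StringWriter out = new StringWriter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new TextOnlyHandler(out));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>Hello, <b>Tika</b>!</p></body></html>";
        System.out.println(textOf(xhtml)); // prints Hello, Tika!
    }
}
```

Because the handler only sees events, the same class works whether the events come from an XML file, as here, or from a parser that synthesizes XHTML for some other document format.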
    • Demo: Tika GUI
    • Thank You!