Apache Tika
 

    Presentation Transcript

    • Apache Tika
      • An extensible, configurable content analysis framework toolkit
    • Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
    • The Problem
      • Parser libraries: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
      • Target: Lucene index
    • It’s Worse Than That
      • Licensing
      • Dependencies
      • Metadata extraction
      • Structured content
      • Encryption/Compression
      • Package formats
      • Streaming
      • Processing of digital media
    • The Solution: Technical
      • Generic API for extracting metadata and structured text content from a document
        • Input: byte stream + optional metadata
        • Output: XHTML SAX events + metadata
      • Automatic content type detection
        • Magic bytes
        • File name patterns
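The two detection signals listed above can be illustrated with a minimal, standard-library-only sketch. Everything here is hypothetical and is not Tika's API: the class name `MediaTypeSketch`, the tiny registries, and the choice to check magic bytes before file name patterns are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of Tika): detect a media type from the
// leading "magic" bytes of a stream, falling back to a file name pattern.
class MediaTypeSketch {

    // ASCII magic prefixes mapped to media types (illustrative subset).
    private static final Map<String, String> MAGIC = new LinkedHashMap<>();
    // File name suffixes, i.e. the "*.ext" glob case, mapped to media types.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("%PDF-", "application/pdf");
        MAGIC.put("{\\rtf", "application/rtf");
        SUFFIXES.put(".ppt", "application/vnd.ms-powerpoint");
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".txt", "text/plain");
    }

    // Assumption: magic bytes take priority over the file name.
    static String detect(byte[] prefix, String fileName) {
        for (Map.Entry<String, String> e : MAGIC.entrySet()) {
            byte[] magic = e.getKey().getBytes(StandardCharsets.US_ASCII);
            if (startsWith(prefix, magic)) {
                return e.getValue();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        // Generic fallback when neither signal matches.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));          // application/pdf
        System.out.println(detect(new byte[0], "slides.ppt"));  // application/vnd.ms-powerpoint
    }
}
```

Checking content before the name means a mislabeled file is still recognized when its bytes are distinctive; a real registry would carry far more patterns than this sketch.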
    • The Solution: Legal / Social
      • Apache License
        • (L)GPL projects can implement the Tika API
      • Pooling of efforts
        • Active development and maintenance
        • Already beyond the functionality of most custom solutions
        • Cool future goals: OCR, speech recognition, …
    • Project Status
      • First planned in early 2006
      • Incubating since March 2007
      • Sponsoring PMC: Apache Lucene
      • No releases yet
        • 0.1 release being planned
      • Small development team
        • 6 committers, 3-4 currently active
    • Current Features
      • Media type framework
        • Shared MIME-info specification (freedesktop.org)
        • Default media type registry (incl. glob and magic patterns)
      • Parser components
        • PDF (PDFBox)
        • Plain text (ICU4J)
        • XML (SAX)
        • HTML (NekoHTML)
        • Word, PowerPoint, Excel (POI)
        • ODF (SAX)
        • RTF (Swing)
    • Project Statistics
    • Codebase History
      • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
      • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
    • Content Extraction
      • Input: PPT
      • new PowerPointParser().parse(…);
      • Output metadata
        • Type: application/vnd.ms-powerpoint
        • Title: Apache Tika
        • Author: Jukka Zitting
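The extraction contract described in the deck (byte stream plus optional metadata in, XHTML SAX events plus metadata out) can be sketched with the JDK's SAX classes alone. The names `SketchParser`, `PlainTextSketchParser` and `TextCollector`, the plain `Map` used for metadata, and the `Content-Type` key are all assumptions for illustration; they are not the signatures elided in the slide.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical contract: bytes and metadata in, XHTML SAX events and metadata out.
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Toy parser for plain text: each input line becomes an XHTML <p> element.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Reads everything at once for brevity; a real parser would stream.
        String text = new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        metadata.put("Content-Type", "text/plain"); // illustrative metadata key

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        for (String line : text.split("\\R")) {
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> metadata = new HashMap<>();
        TextCollector collector = new TextCollector();
        InputStream in = new ByteArrayInputStream(
                "Hello\nTika".getBytes(StandardCharsets.UTF_8));
        new PlainTextSketchParser().parse(in, collector, metadata);
        System.out.print(collector.text);
        System.out.println(metadata);
    }
}

// Consumer side: collects character data, adding a newline after each paragraph.
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }
}
```

Because the output is a stream of SAX events, the same parser can feed a text collector, an indexer, or an XHTML serializer without buffering a whole document tree.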
    • Media Type Detection
      • Input: ?
      • Type definitions: tika-mimetypes.xml, /etc/magic, mime.types
      • MimeTypes types = …; MimeType type = types.getMimeType(…);
      • Result: application/vnd.ms-powerpoint
    • Combined Detection and Extraction
      • Input: PPT, TXT, PDF, XML, ?
      • new AutoDetectParser().parse(…);
      • Output metadata
        • Type: application/vnd.ms-powerpoint
        • Title: Apache Tika
        • Author: Jukka Zitting
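The combined step can be pictured as "detect the type, then hand the bytes to whichever parser is registered for it". The sketch below shows only that dispatch idea with the standard library; `AutoDetectSketch`, its toy detector, and the `Function`-based parser registry are hypothetical and do not reflect how `AutoDetectParser` is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: detect the media type, then delegate to the
// parser registered for that type.
class AutoDetectSketch {

    // Media type -> extractor turning raw bytes into text (illustrative).
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Toy detector: two magic prefixes, then one file name pattern.
    static String detect(byte[] data, String fileName) {
        if (startsWithAscii(data, "%PDF-")) {
            return "application/pdf";
        }
        if (startsWithAscii(data, "<?xml")) {
            return "application/xml";
        }
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    // Records the detected type in the metadata, then delegates extraction.
    String parse(byte[] data, String fileName, Map<String, String> metadata) {
        String type = detect(data, fileName);
        metadata.put("Content-Type", type); // illustrative metadata key
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(data);
    }

    private static boolean startsWithAscii(byte[] data, String magic) {
        byte[] m = magic.getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < m.length) {
            return false;
        }
        for (int i = 0; i < m.length; i++) {
            if (data[i] != m[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        auto.register("text/plain", d -> new String(d, StandardCharsets.UTF_8));
        Map<String, String> metadata = new HashMap<>();
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(auto.parse(data, "note.txt", metadata));
        System.out.println(metadata);
    }
}
```

The caller never names a format: supporting a new one means registering another parser, which is the appeal of a single auto-detecting entry point.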
    • Thank You!