Apache Tika

    Apache Tika - Presentation Transcript

    1. Apache Tika
      • An extensible, configurable content analysis framework toolkit
    2. Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
    3. The Problem
      • PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
      • Lucene index
    4. It’s Worse Than That
      • Licensing
      • Dependencies
      • Metadata extraction
      • Structured content
      • Encryption/Compression
      • Package formats
      • Streaming
      • Processing of digital media
    5. Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
    6. The Solution: Technical
      • Generic API for extracting metadata and structured text content from a document
        • Input: byte stream + optional metadata
        • Output: XHTML SAX events + metadata
      • Automatic content type detection
        • Magic bytes
        • File name patterns
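The contract on slide 6 (byte stream plus optional metadata in, XHTML SAX events plus metadata out) can be sketched with JDK classes alone. This is a minimal illustration, not the actual Tika API: the names `SketchParser` and `PlainTextSketchParser`, the `Content-Type` key, the plain `Map` used for metadata, and the UTF-8 assumption are all hypothetical choices made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical interface illustrating the contract on the slide;
// it is not the real Tika parser interface.
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

// Toy plain-text parser: every input line becomes one XHTML <p> element.
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Output metadata: keep a caller-supplied type hint if one is present.
        metadata.putIfAbsent("Content-Type", "text/plain");

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);

        // Assumption: the input is UTF-8 encoded text.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", none);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }

        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}
```

Any `ContentHandler` can consume the events, for example an `org.xml.sax.helpers.DefaultHandler` subclass that appends the character data to a buffer for indexing. Emitting SAX events rather than a string lets the consumer stream large documents without holding them in memory.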
    7. The Solution: Legal / Social
      • Apache License
        • (L)GPL projects can implement the Tika API
      • Pooling of efforts
        • Active development and maintenance
        • Already beyond the functionality of most custom solutions
        • Cool future goals: OCR, speech recognition, …
    8. Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
    9. Project Status
      • First planned in early 2006
      • Incubating since March 2007
      • Sponsoring PMC: Apache Lucene
      • No releases yet
        • 0.1 release being planned
      • Small development team
        • 6 committers, 3-4 currently active
    10. Current Features
      • Media type framework
        • Shared MIME info spec (freedesktop.org)
        • Default media type registry (incl. glob and magic patterns)
      • Parser components
        • PDF (PDFBox)
        • Plain text (ICU4J)
        • XML (SAX)
        • HTML (NekoHTML)
        • Word, PowerPoint, Excel (POI)
        • ODF (SAX)
        • RTF (Swing)
    11. Project Statistics
    12. Codebase History
      • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
      • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
    13. Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
    14. Content Extraction
      • Input: PPT
      • new PowerPointParser().parse(…);
      • Output:
        • Type: application/vnd.ms-powerpoint
        • Title: Apache Tika
        • Author: Jukka Zitting
    15. Media Type Detection
      • MimeTypes types = …;
      • MimeType type = types.getMimeType(…);
      • Result: application/vnd.ms-powerpoint
      • Type definitions: tika-mimetypes.xml, /etc/magic, mime.types
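The two detection strategies named in the deck, magic bytes and file name patterns, can be sketched as follows. This is an illustration under stated assumptions, not how `MimeTypes.getMimeType(…)` is implemented: the class `MediaTypeSketch`, its `detect` method, and the tiny hard-coded tables are hypothetical, whereas a real registry would load its rules from a definition file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical detector; the rule tables are illustrative only.
class MediaTypeSketch {
    // One magic rule: a byte prefix and the media type it indicates.
    private static final class MagicRule {
        final byte[] prefix;
        final String type;

        MagicRule(String asciiPrefix, String type) {
            this.prefix = asciiPrefix.getBytes(StandardCharsets.ISO_8859_1);
            this.type = type;
        }
    }

    private static final List<MagicRule> MAGIC = Arrays.asList(
            new MagicRule("%PDF-", "application/pdf"),
            new MagicRule("PK\u0003\u0004", "application/zip"),
            new MagicRule("<?xml", "application/xml"));

    // File name suffixes, standing in for glob patterns such as *.ppt.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();
    static {
        GLOBS.put(".ppt", "application/vnd.ms-powerpoint");
        GLOBS.put(".pdf", "application/pdf");
        GLOBS.put(".xml", "application/xml");
        GLOBS.put(".txt", "text/plain");
    }

    // Magic bytes are checked first; the file name is only a fallback.
    static String detect(byte[] head, String fileName) {
        for (MagicRule rule : MAGIC) {
            if (startsWith(head, rule.prefix)) {
                return rule.type;
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> glob : GLOBS.entrySet()) {
                if (lower.endsWith(glob.getKey())) {
                    return glob.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Checking content before the name is a design choice in this sketch: a mislabeled file (a PDF saved as `notes.txt`) is still typed by what it contains, and the name is consulted only when no magic rule matches.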
    16. Combined Detection and Extraction
      • Input: PPT, TXT, PDF, XML
      • new AutoDetectParser().parse(…);
      • Output:
        • Type: application/vnd.ms-powerpoint
        • Title: Apache Tika
        • Author: Jukka Zitting
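Combining the two steps amounts to a dispatcher: detect the media type, then hand the data to whichever extractor is registered for that type. The sketch below shows that pattern only. `AutoDetectSketch`, its `register` and `parse` methods, and the two-way detection stub are hypothetical names and simplifications, not the `AutoDetectParser` from the slides.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical composite: detect the type first, then delegate to the
// extractor registered for it.
class AutoDetectSketch {
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String mediaType, Function<byte[], String> extractor) {
        extractors.put(mediaType, extractor);
    }

    // Stand-in detection step: an XML declaration means XML,
    // anything else is treated as plain text.
    static String detect(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.US_ASCII);
        return head.equals("<?xml") ? "application/xml" : "text/plain";
    }

    // Returns "<media type>: <extracted text>" for the given document bytes.
    String parse(byte[] data) {
        String type = detect(data);
        Function<byte[], String> extractor = extractors.get(type);
        if (extractor == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return type + ": " + extractor.apply(data);
    }
}
```

A caller would register one extractor per media type and then pass documents of unknown type to `parse`. Keeping the registry separate from the detection step means new formats can be added without touching the dispatcher.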
    17. Agenda
      • The Problem
      • The Solution
      • The Project
      • The Design
      • Thank You!

    Jukka Zitting, 3 years ago

    © All Rights Reserved
