Apache Tika

Published in Technology, Education
Transcript

  • 1. Apache Tika An extensible, configurable content analysis framework toolkit
  • 2. Agenda: The Problem, The Solution, The Project, The Design
  • 3. The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc. → Lucene index
  • 4. It’s Worse Than That
    • Licensing
    • Dependencies
    • Metadata extraction
    • Structured content
    • Encryption/Compression
    • Package formats
    • Streaming
    • Processing of digital media
  • 5. Agenda: The Problem, The Solution, The Project, The Design
  • 6. The Solution: Technical
    • Generic API for extracting metadata and structured text content from a document
      • Input: byte stream + optional metadata
      • Output: XHTML SAX events + metadata
    • Automatic content type detection
      • Magic bytes
      • File name patterns
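The two detection strategies named on this slide can be illustrated with a minimal sketch. It is not Tika's API: the class `DetectionSketch`, its tables, and the fallback type are hypothetical and use only the JDK. Magic bytes are checked first, and the file name pattern is consulted only when no magic prefix matches.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch (not Tika's API): detect a media type from magic bytes,
// falling back to a file name pattern.
final class DetectionSketch {

    // Media type -> magic byte prefix (small illustrative subset).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // File name suffix -> media type, standing in for glob patterns such as "*.pdf".
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/rtf", "{\\rtf".getBytes(StandardCharsets.US_ASCII));
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".ppt", "application/vnd.ms-powerpoint");
        SUFFIXES.put(".txt", "text/plain");
    }

    // True if data begins with the given prefix.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // head: the first bytes of the stream; fileName: optional hint, may be null.
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream"; // assumed generic fallback
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "unknown.bin")); // application/pdf
        System.out.println(detect(new byte[0], "slides.ppt")); // application/vnd.ms-powerpoint
    }
}
```

Checking magic bytes before the name means a mislabeled file is still typed by its content; a real registry would likely also weigh conflicting hints, which this sketch does not attempt.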
  • 7. The Solution: Legal / Social
    • Apache License
      • (L)GPL projects can implement the Tika API
    • Pooling of efforts
      • Active development and maintenance
      • Already beyond the functionality of most custom solutions
      • Cool future goals: OCR, speech recognition, …
  • 8. Agenda: The Problem, The Solution, The Project, The Design
  • 9. Project Status
    • Initially planned in early 2006
    • Incubating since March 2007
    • Sponsoring PMC: Apache Lucene
    • No releases yet
      • 0.1 release being planned
    • Small development team
      • 6 committers, 3-4 currently active
  • 10. Current Features
    • Media type framework
      • Shared MIME info spec (freedesktop.org)
      • Default media type registry (incl. glob and magic patterns)
    • Parser components
      • PDF (PDFBox)
      • Plain text (ICU4J)
      • XML (SAX)
      • HTML (NekoHTML)
      • Word, PowerPoint, Excel (POI)
      • ODF (SAX)
      • RTF (Swing)
  • 11. Project Statistics
  • 12. Codebase History
    • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
    • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  • 13. Agenda: The Problem, The Solution, The Project, The Design
  • 14. Content Extraction
    • Input: PPT
    • new PowerPointParser().parse(…);
    • Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting
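The extraction contract from the deck (byte stream in, XHTML SAX events plus metadata out) can be sketched with the JDK's SAX interfaces alone. This is an illustration under assumptions, not Tika's parser API: `ExtractionSketch`, `parsePlainText`, `TextCollector`, and `extractText` are hypothetical names, and a plain-text parser stands in for the PowerPoint parser because real PPT parsing would need POI.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch (not Tika's API): a parser that turns a byte stream into
// XHTML SAX events and fills a metadata map.
final class ExtractionSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Emits one <p> element per non-empty line of UTF-8 text.
    static void parsePlainText(InputStream in, ContentHandler handler,
                               Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain");
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            char[] chars = line.toCharArray();
            handler.startElement(XHTML, "p", "p", noAttributes);
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // A consumer of the events: collects text, one line per paragraph.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("p".equals(localName)) {
                text.append('\n');
            }
        }
    }

    // Convenience wrapper that rethrows checked exceptions as unchecked ones.
    static String extractText(byte[] data, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            parsePlainText(new ByteArrayInputStream(data), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        byte[] data = "Apache Tika\n\nJukka Zitting".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(data, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events rather than returning a string keeps the parser streaming: a consumer such as an indexer can process paragraphs as they arrive without holding the whole document in memory.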
  • 15. Media Type Detection
    • Input: a document of unknown type (?)
    • Type definitions: tika-mimetypes.xml, /etc/magic, mime.types
    • MimeTypes types = …; MimeType type = types.getMimeType(…);
    • Result: application/vnd.ms-powerpoint
  • 16. Combined Detection and Extraction
    • Input: a document of unknown type (?), e.g. PPT, TXT, PDF, XML
    • new AutoDetectParser().parse(…);
    • Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting
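Combining the two steps amounts to detecting the type first and then delegating to whichever parser is registered for it. The sketch below shows that dispatch pattern only; it is not how `AutoDetectParser` is implemented. The `Parser` interface, the `register` method, and the three-way `detect` rule are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Tika's AutoDetectParser): detect the media type,
// then delegate to the parser registered for that type.
final class AutoDetectSketch {

    // Assumed parser contract; returns extracted text for brevity.
    interface Parser {
        String parse(byte[] data, Map<String, String> metadata);
    }

    private final Map<String, Parser> parsers = new HashMap<>();

    void register(String mediaType, Parser parser) {
        parsers.put(mediaType, parser);
    }

    // Deliberately tiny detector: two magic prefixes, plain text otherwise.
    static String detect(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5),
                StandardCharsets.ISO_8859_1);
        if (head.startsWith("%PDF-")) {
            return "application/pdf";
        }
        if (head.startsWith("<?xml")) {
            return "application/xml";
        }
        return "text/plain";
    }

    String parse(byte[] data, Map<String, String> metadata) {
        String type = detect(data);
        metadata.put("Content-Type", type); // record what was detected
        Parser parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.parse(data, metadata);
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        auto.register("text/plain",
                (data, metadata) -> new String(data, StandardCharsets.UTF_8));
        Map<String, String> metadata = new HashMap<>();
        String text = auto.parse("Apache Tika".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

The caller never names a concrete parser, so supporting a new format means registering one more entry rather than changing calling code.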
  • 17. Agenda: The Problem, The Solution, The Project, The Design. Thank You!