Apache Tika
Transcript

  • 1. Apache Tika: An extensible, configurable content analysis framework toolkit
  • 2. Agenda: The Problem, The Solution, The Project, The Design
  • 3. The Problem: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.; Lucene index
  • 4. It’s Worse Than That
    • Licensing
    • Dependencies
    • Metadata extraction
    • Structured content
    • Encryption/Compression
    • Package formats
    • Streaming
    • Processing of digital media
  • 5. Agenda: The Problem, The Solution, The Project, The Design
  • 6. The Solution: Technical
    • Generic API for extracting metadata and structured text content from a document
      • Input: byte stream + optional metadata
      • Output: XHTML SAX events + metadata
    • Automatic content type detection
      • Magic bytes
      • File name patterns
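The input/output contract on slide 6 can be illustrated with a minimal sketch. The names below (`SketchParser`, `PlainTextSketchParser`, `TextCollector`, `ParserContractDemo`) are hypothetical and are not Tika's actual API; only the SAX classes from the JDK are real. The sketch assumes UTF-8 plain text as input and uses a plain `Map` for metadata.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical contract: byte stream + optional metadata in, XHTML SAX events + metadata out. */
interface SketchParser {
    void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException;
}

/** Toy parser (an assumption for illustration): each non-empty UTF-8 line becomes an XHTML p element. */
class PlainTextSketchParser implements SketchParser {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    @Override
    public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Output metadata: record what the parser learned about the document.
        metadata.put("Content-Type", "text/plain");

        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", noAttributes);
        handler.startElement(XHTML, "body", "body", noAttributes);

        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines instead of emitting empty paragraphs
            }
            handler.startElement(XHTML, "p", "p", noAttributes);
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }

        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }
}

/** Consumer side: collects character events, one output line per paragraph. */
class TextCollector extends DefaultHandler {
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }
}

class ParserContractDemo {
    /** Runs the toy parser over the bytes and returns the collected text. */
    static String extract(byte[] bytes, Map<String, String> metadata) {
        TextCollector collector = new TextCollector();
        try {
            new PlainTextSketchParser().parse(new ByteArrayInputStream(bytes), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return collector.text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        byte[] input = "Apache Tika\n\nContent analysis".getBytes(StandardCharsets.UTF_8);
        System.out.print(extract(input, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events rather than returning a string lets a consumer process large documents as a stream, and keeps whatever structure the parser found (paragraphs here) available to the caller.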
  • 7. The Solution: Legal / Social
    • Apache License
      • (L)GPL projects can implement the Tika API
    • Pooling of efforts
      • Active development and maintenance
      • Already beyond the functionality of most custom solutions
      • Cool future goals: OCR, speech recognition, …
  • 8. Agenda: The Problem, The Solution, The Project, The Design
  • 9. Project Status
    • First planned in early 2006
    • Incubating since March 2007
    • Sponsoring PMC: Apache Lucene
    • No releases yet
      • 0.1 release being planned
    • Small development team
      • 6 committers, 3-4 currently active
  • 10. Current Features
    • Media type framework
      • Shared MIME info spec (freedesktop.org)
      • Default media type registry (incl. glob and magic patterns)
    • Parser components
      • PDF (PDFBox)
      • Plain text (ICU4J)
      • XML (SAX)
      • HTML (NekoHTML)
      • Word, PowerPoint, Excel (POI)
      • ODF (SAX)
      • RTF (Swing)
  • 11. Project Statistics
  • 12. Codebase History
    • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
    • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  • 13. Agenda: The Problem, The Solution, The Project, The Design
  • 14. Content Extraction
    • Input: PPT
    • Call: new PowerPointParser().parse(…);
    • Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting
  • 15. Media Type Detection
    • Type definitions: tika-mimetypes.xml, /etc/magic, mime.types, ?
    • Call: MimeTypes types = …; MimeType type = types.getMimeType(…);
    • Result: application/vnd.ms-powerpoint
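The two detection signals named in the deck, magic bytes and file name patterns, can be sketched in plain Java. `MediaTypeSketch` is a hypothetical class, not Tika's `MimeTypes`. The small tables are illustrative assumptions rather than a real registry, and only `*.ext` style globs are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical detector: magic bytes first, file name pattern as a fallback. */
class MediaTypeSketch {
    // Media type -> byte prefix expected at the very start of the stream.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // Glob of the form "*.ext" -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/xml", "<?xml".getBytes(StandardCharsets.US_ASCII));

        GLOBS.put("*.pdf", "application/pdf");
        GLOBS.put("*.ppt", "application/vnd.ms-powerpoint");
        GLOBS.put("*.txt", "text/plain");
    }

    /** Returns true if data begins with all bytes of prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Detects a media type from the first bytes of a document and an optional
     * file name. Falls back to application/octet-stream when nothing matches.
     */
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : GLOBS.entrySet()) {
                String suffix = entry.getKey().substring(1); // "*.ppt" -> ".ppt"
                if (lower.endsWith(suffix)) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "report.bin"));
        System.out.println(detect(new byte[0], "slides.ppt"));
    }
}
```

Checking content before the file name is a design choice of this sketch: a misnamed file is still recognized by its bytes, and the name is consulted only when the bytes are inconclusive.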
  • 16. Combined Detection and Extraction
    • Input: PPT, TXT, PDF, XML, ?
    • Call: new AutoDetectParser().parse(…);
    • Output: Type: application/vnd.ms-powerpoint, Title: Apache Tika, Author: Jukka Zitting
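Combining the two steps amounts to detecting the type first and then delegating to whichever parser is registered for it. The sketch below shows that dispatch pattern only. `AutoDetectSketch`, its two-entry detector, and the toy extractors are hypothetical and do not reflect how Tika's `AutoDetectParser` is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical dispatcher: detect the media type, then delegate to a registered extractor. */
class AutoDetectSketch {
    // Media type -> extractor (raw bytes in, plain text out).
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();

    void register(String mediaType, Function<byte[], String> extractor) {
        extractors.put(mediaType, extractor);
    }

    /** Very small magic-byte detector; anything unrecognized is treated as plain text. */
    static String detect(byte[] data) {
        String head = new String(data, 0, Math.min(data.length, 5), StandardCharsets.ISO_8859_1);
        if (head.startsWith("%PDF-")) {
            return "application/pdf";
        }
        if (head.startsWith("<?xml")) {
            return "application/xml";
        }
        return "text/plain";
    }

    /** Detects the type of the data and runs the matching extractor. */
    String parse(byte[] data) {
        String type = detect(data);
        Function<byte[], String> extractor = extractors.get(type);
        if (extractor == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return extractor.apply(data);
    }

    /** Builds a dispatcher with two toy extractors, for demonstration only. */
    static AutoDetectSketch withDefaults() {
        AutoDetectSketch sketch = new AutoDetectSketch();
        sketch.register("text/plain",
                bytes -> new String(bytes, StandardCharsets.UTF_8).trim());
        // Crude tag stripping; a real XML parser would use SAX instead.
        sketch.register("application/xml",
                bytes -> new String(bytes, StandardCharsets.UTF_8).replaceAll("<[^>]*>", "").trim());
        return sketch;
    }

    public static void main(String[] args) {
        AutoDetectSketch parser = withDefaults();
        byte[] xml = "<?xml version=\"1.0\"?><title>Apache Tika</title>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(xml));
        System.out.println(parser.parse(xml));
    }
}
```

Callers use a single entry point regardless of format, and supporting a new format only requires registering another extractor.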
  • 17. Agenda: The Problem, The Solution, The Project, The Design. Thank You!