Apache Tika

Published in: Technology, Education

Transcript

  • 1. Apache Tika: an extensible, configurable content analysis toolkit
  • 2. Agenda: The Problem, The Solution, The Project, The Design
  • 3. The Problem
    • Separate parser libraries: PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
    • Target: Lucene index
  • 4. It’s Worse Than That
    • Licensing
    • Dependencies
    • Metadata extraction
    • Structured content
    • Encryption/Compression
    • Package formats
    • Streaming
    • Processing of digital media
  • 5. Agenda: The Problem, The Solution, The Project, The Design
  • 6. The Solution: Technical
    • Generic API for extracting metadata and structured text content from a document
      • Input: byte stream + optional metadata
      • Output: XHTML SAX events + metadata
    • Automatic content type detection
      • Magic bytes
      • File name patterns
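
The input/output contract on this slide (byte stream plus optional metadata in, XHTML SAX events plus metadata out) can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `PlainTextToXhtmlSketch`, `EventLog`, `render`, and the `Map`-based metadata are hypothetical names, not Tika classes, and the "parser" only handles UTF-8 plain text.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; class and method names are NOT the Tika API.
class PlainTextToXhtmlSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Input: byte stream + optional metadata. Output: XHTML SAX events + metadata.
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        // Keep a caller-supplied content type; otherwise record the one we assume.
        metadata.putIfAbsent("Content-Type", "text/plain");
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines; every other line becomes one <p>
            }
            start(handler, "p");
            handler.characters(line.toCharArray(), 0, line.length());
            end(handler, "p");
        }
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    private static void start(ContentHandler h, String name) throws SAXException {
        h.startElement(XHTML, name, name, new AttributesImpl());
    }

    private static void end(ContentHandler h, String name) throws SAXException {
        h.endElement(XHTML, name, name);
    }

    // Records the event sequence as text (an event log, not a serializer: no escaping).
    static class EventLog extends DefaultHandler {
        final StringBuilder log = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            log.append('<').append(local).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            log.append("</").append(local).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            log.append(ch, start, length);
        }
    }

    // Demo convenience: runs parse() on a byte array and wraps checked exceptions.
    static String render(byte[] document, Map<String, String> metadata) {
        EventLog handler = new EventLog();
        try {
            parse(new ByteArrayInputStream(document), handler, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.log.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] doc = "Hello\n\nTika".getBytes(StandardCharsets.UTF_8);
        System.out.println(render(doc, metadata));
        System.out.println(metadata);
    }
}
```

Emitting SAX events rather than returning a string lets a consumer process the content as it streams, which fits the streaming concern listed on slide 4.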
  • 7. The Solution: Legal / Social
    • Apache License
      • (L)GPL projects can implement the Tika API
    • Pooling of efforts
      • Active development and maintenance
      • Already beyond the functionality of most custom solutions
      • Cool future goals: OCR, speech recognition, …
  • 8. Agenda: The Problem, The Solution, The Project, The Design
  • 9. Project Status
    • First planned in early 2006
    • Incubating since March 2007
    • Sponsoring PMC: Apache Lucene
    • No releases yet
      • 0.1 release being planned
    • Small development team
      • 6 committers, 3-4 currently active
  • 10. Current Features
    • Media type framework
      • Shared MIME info spec (freedesktop.org)
      • Default media type registry (incl. glob and magic patterns)
    • Parser components
      • PDF (PDFBox)
      • Plain text (ICU4J)
      • XML (SAX)
      • HTML (NekoHTML)
      • Word, PowerPoint, Excel (POI)
      • ODF (SAX)
      • RTF (Swing)
  • 11. Project Statistics
  • 12. Codebase History
    • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
    • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  • 13. Agenda: The Problem, The Solution, The Project, The Design
  • 14. Content Extraction
    • Input: PPT
    • Output metadata
      • Type: application/vnd.ms-powerpoint
      • Title: Apache Tika
      • Author: Jukka Zitting
    • Code: new PowerPointParser().parse(…);
  • 15. Media Type Detection
    • Result: application/vnd.ms-powerpoint
    • Code: MimeTypes types = …; MimeType type = types.getMimeType(…);
    • Type definition sources: tika-mimetypes.xml, /etc/magic, mime.types, ?
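
As a rough illustration of the two detection signals named earlier (magic bytes and file name patterns), the sketch below checks a document's leading bytes first and falls back to the file name. It is a simplified assumption, not Tika's `MimeTypes` implementation: the class `MediaTypeDetectorSketch`, its tables, and the suffix matching (a stand-in for glob patterns such as `*.ppt`) are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not Tika's MimeTypes class and not its registry format.
class MediaTypeDetectorSketch {
    // Media type -> byte prefix ("magic") expected at the start of the document.
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // File name suffix -> media type (a simplified form of glob patterns like *.ppt).
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("application/zip", new byte[] {0x50, 0x4B, 0x03, 0x04}); // "PK" 03 04
        SUFFIXES.put(".ppt", "application/vnd.ms-powerpoint");
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".txt", "text/plain");
    }

    // Magic bytes win over the file name; unknown input maps to a generic type.
    static String detect(byte[] head, String fileName) {
        for (Map.Entry<String, byte[]> entry : MAGIC.entrySet()) {
            if (startsWith(head, entry.getValue())) {
                return entry.getKey();
            }
        }
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (lower.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdfHead, "report.bin"));
        System.out.println(detect(new byte[0], "slides.PPT"));
    }
}
```

Checking content before the name is a design choice of this sketch: a misnamed file is still classified by what it contains, and the name only decides when the bytes are inconclusive.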
  • 16. Combined Detection and Extraction
    • Inputs: PPT, TXT, PDF, XML, ?
    • Output metadata
      • Type: application/vnd.ms-powerpoint
      • Title: Apache Tika
      • Author: Jukka Zitting
    • Code: new AutoDetectParser().parse(…);
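
One way to picture combining detection with extraction is a dispatcher that detects the media type and then delegates to whichever parser is registered for it. The sketch below shows only that dispatch pattern under stated assumptions: `AutoDetectSketch`, `register`, and the string-returning parser functions are hypothetical, and nothing here is taken from the real `AutoDetectParser`.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of detect-then-delegate; not Tika's AutoDetectParser.
class AutoDetectSketch {
    // Media type -> parser; a "parser" here just maps bytes to extracted text.
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Toy detection from the first bytes only; anything unrecognized counts as plain text.
    static String detect(byte[] data) {
        int length = Math.min(data.length, 5);
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);
        if (head.startsWith("%PDF-")) {
            return "application/pdf";
        }
        if (head.startsWith("<?xml")) {
            return "application/xml";
        }
        return "text/plain";
    }

    // Detects the type, picks the matching parser, and prefixes its output with the type.
    String parse(byte[] data) {
        String type = detect(data);
        Function<byte[], String> parser = parsers.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return type + ": " + parser.apply(data);
    }

    public static void main(String[] args) {
        AutoDetectSketch auto = new AutoDetectSketch();
        auto.register("text/plain", d -> new String(d, StandardCharsets.UTF_8).trim());
        auto.register("application/xml", d -> "xml document");
        System.out.println(auto.parse("hello ".getBytes(StandardCharsets.UTF_8)));
        System.out.println(auto.parse("<?xml version=\"1.0\"?><a/>".getBytes(StandardCharsets.UTF_8)));
    }
}
```

The caller never names a format; supporting a new one means registering another parser, with no change to calling code.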
  • 17. Agenda: The Problem, The Solution, The Project, The Design. Thank You!