Apache Tika
Apache Tika: Presentation Transcript

  • Apache Tika
    • An extensible, configurable content analysis framework toolkit
  • Agenda
    • The Problem
    • The Solution
    • The Project
    • The Design
  • The Problem
    • PDFBox, Apache POI, Apache Xerces, ICU4J, NekoHTML, etc.
    • Lucene index
  • It’s Worse Than That
    • Licensing
    • Dependencies
    • Metadata extraction
    • Structured content
    • Encryption/Compression
    • Package formats
    • Streaming
    • Processing of digital media
  • The Solution: Technical
    • Generic API for extracting metadata and structured text content from a document
      • Input: byte stream + optional metadata
      • Output: XHTML SAX events + metadata
    • Automatic content type detection
      • Magic bytes
      • File name patterns
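The input/output contract on this slide (byte stream plus optional metadata in, XHTML SAX events plus metadata out) can be illustrated with a minimal sketch. It uses only the JDK's SAX classes. The class `PlainTextSketchParser`, its methods, and the metadata keys are hypothetical names chosen for illustration; this is not Tika's actual API.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical toy parser, NOT Tika's API: it only illustrates the
// "byte stream + metadata in, XHTML SAX events + metadata out" contract.
final class PlainTextSketchParser {

    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Reads UTF-8 text and reports every non-empty line as an XHTML <p> element.
    static void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
            throws IOException, SAXException {
        metadata.put("Content-Type", "text/plain"); // assumed metadata key
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            handler.startElement(XHTML, "p", "p", none);
            char[] chars = line.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement(XHTML, "p", "p");
        }
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    // Consumer side: collects the character events, one output line per <p>.
    static String extractText(byte[] document, Map<String, String> metadata) {
        StringBuilder text = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p".equals(localName)) {
                    text.append('\n');
                }
            }
        };
        try {
            parse(new ByteArrayInputStream(document), collector, metadata);
        } catch (IOException | SAXException e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("resourceName", "notes.txt"); // optional input metadata (assumed key)
        byte[] document = "Apache Tika\nContent analysis".getBytes(StandardCharsets.UTF_8);
        System.out.print(extractText(document, metadata));
        System.out.println(metadata.get("Content-Type"));
    }
}
```

Because the output is a stream of SAX events rather than a string, a consumer can index, filter, or serialize the content without the parser buffering the whole document.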
  • The Solution: Legal / Social
    • Apache License
      • (L)GPL projects can implement the Tika API
    • Pooling of efforts
      • Active development and maintenance
      • Already beyond the functionality of most custom solutions
      • Cool future goals: OCR, speech recognition, …
  • Project Status
    • First planned in early 2006
    • Incubating since March 2007
    • Sponsoring PMC: Apache Lucene
    • No releases yet
      • 0.1 release being planned
    • Small development team
      • 6 committers, 3-4 currently active
  • Current Features
    • Media type framework
      • Shared MIME info spec (freedesktop.org)
      • Default media type registry (incl. glob and magic patterns)
    • Parser components
      • PDF (PDFBox)
      • Plain text (ICU4J)
      • XML (SAX)
      • HTML (NekoHTML)
      • Word, PowerPoint, Excel (POI)
      • ODF (SAX)
      • RTF (Swing)
  • Project Statistics
  • Codebase History
    • Projects: Lius, Nutch, Lius Lite, Tika, textmining, Jackrabbit
    • People: Andy Clark, Jukka Zitting, Rida Benjelloun, Chris Mattmann, Jerome Charron, Sami Siren, Bertrand Delacretaz, Keith Bennett
  • Content Extraction
    • PPT
    • new PowerPointParser().parse(…);
    • Type: application/vnd.ms-powerpoint
    • Title: Apache Tika
    • Author: Jukka Zitting
  • Media Type Detection
    • MimeTypes types = …;
    • MimeType type = types.getMimeType(…);
    • application/vnd.ms-powerpoint
    • tika-mimetypes.xml, /etc/magic, mime.types
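As a rough illustration of detection by magic bytes and file name patterns, the sketch below checks a few well-known signatures and falls back to the file extension. The class `MediaTypeSketch` and its hard-coded rules are assumptions made for this example. It does not read `tika-mimetypes.xml` and is not the `MimeTypes` API named on the slide.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical detector, NOT Tika's MimeTypes class: a hard-coded sketch of
// "magic bytes first, file name pattern as fallback".
final class MediaTypeSketch {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // OLE2 compound document header shared by legacy Word, Excel and PowerPoint files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    static String detect(byte[] prefix, String fileName) {
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);

        // 1. Magic bytes: the most reliable signal when a signature is present.
        if (startsWith(prefix, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(prefix, OLE2_MAGIC)) {
            // The container signature is ambiguous, so the name narrows it down.
            if (name.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
            if (name.endsWith(".doc")) return "application/msword";
            if (name.endsWith(".xls")) return "application/vnd.ms-excel";
        }

        // 2. File name patterns (equivalent to globs such as *.txt).
        if (name.endsWith(".pdf")) return "application/pdf";
        if (name.endsWith(".txt")) return "text/plain";
        if (name.endsWith(".xml")) return "application/xml";

        // 3. Nothing matched: generic binary type.
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));
        System.out.println(detect(new byte[0], "notes.txt"));
    }
}
```

Checking content before the name means a mislabeled file (for example a PDF saved as `report.bin`) is still recognized.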
  • Combined Detection and Extraction
    • PPT, TXT, PDF, XML
    • new AutoDetectParser().parse(…);
    • Type: application/vnd.ms-powerpoint
    • Title: Apache Tika
    • Author: Jukka Zitting
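The idea of combining detection with extraction can be sketched as a dispatcher: detect the media type, record it in the metadata, then delegate to whichever parser is registered for that type. `AutoDetectSketch`, its `register` method, and the function-based parser type are hypothetical. The sketch shows the delegation pattern only and makes no claim about how `AutoDetectParser` is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher, NOT Tika's AutoDetectParser: detect the type,
// record it as metadata, then delegate to the parser registered for it.
final class AutoDetectSketch {

    private final Function<byte[], String> detector;
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    AutoDetectSketch(Function<byte[], String> detector) {
        this.detector = detector;
    }

    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    // Returns the extracted text, or an empty string when no parser is registered.
    String parse(byte[] document, Map<String, String> metadata) {
        String type = detector.apply(document);
        metadata.put("Content-Type", type); // assumed metadata key
        Function<byte[], String> parser = parsers.get(type);
        return parser == null ? "" : parser.apply(document);
    }

    public static void main(String[] args) {
        // Toy detector: anything starting with '<' is treated as XML.
        AutoDetectSketch auto = new AutoDetectSketch(
                bytes -> bytes.length > 0 && bytes[0] == '<' ? "application/xml" : "text/plain");
        auto.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));

        Map<String, String> metadata = new HashMap<>();
        String text = auto.parse("hello".getBytes(StandardCharsets.UTF_8), metadata);
        System.out.println(metadata.get("Content-Type") + ": " + text);
    }
}
```

The caller never names a format-specific parser, so supporting a new format only requires registering one more entry.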
  • Thank You!