Apache Tika end-to-end


From the Fast Feather Track at ApacheCon NA 2010 in Atlanta

This quick talk provides an overview of Apache Tika and looks at its new features and supported file formats. It then shows how to create a new parser, and finishes with how to use Tika from your own application.

Published in: Technology, Education


  1. Apache Tika End-to-End: an introduction to Apache Tika, and integrating it into your application
  2. Nick Burch, Software Engineer, Alfresco
  3. Apache Tika http://tika.apache.org/
     • Project which started in 2006
     • Grew out of the Lucene community, now widely used
     • Provides detection of files – eg this binary blob is really a Word file, that one is UTF-8 plain text
     • Plain text, HTML and XHTML versions of a wide range of different file formats
     • Consistent Metadata from different files
     • Tika hides the complexity of the
  4. What's new?
     • Lots of new parsers – text, office formats, publishing formats, images, audio, CAD, fonts etc
     • Long standing parsers improved – better HTML from Word for example
     • Embedded resources and containers
     • Usage expanding – used by many Solr users, Alfresco, lots of people crunching masses of data on Hadoop
  5. Supported Formats Page 1
     • Audio – WAV, RIFF, MIDI
     • DWG (CAD)
     • Epub
     • RSS and ATOM Feeds
     • True Type Fonts
     • HTML
     • Images – JPEG, GIF, PNG, TIFF, Bitmap (including EXIF where found)
     • iWork (Keynote, Pages etc)
     • RFC822 mbox Mail
  6. Supported Formats Page 2
     • Microsoft Outlook .msg Email
     • Microsoft Office (Binary) – Word, PowerPoint, Excel, Visio, Publisher, Works
     • Microsoft Office (OOXML) – Word, PowerPoint, Excel
     • MP3 (ID3 v1 and v2)
     • CDF (Scientific Data)
     • Open Document Format (Open Office)
     • Old-style Open Office (.sxw etc)
  7. Supported Formats Page 3
     • Zip and Tar archives
     • RDF
     • Plain Text
     • FLV Video
     • XML
     • Java class files
     And I probably forgot one...!
  8. Metadata
     • Tika provides consistent metadata across the range of parsers
     • No need to know if it's “Last Author”, “Last Editor” or “Previous Author” in a file format, they all come back with the same metadata key
     • Keys and values are strings, but strongly typed metadata entries provide converters to dates, ints etc
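The idea on the Metadata slide can be illustrated with a small, JDK-only sketch. It is not Tika's `Metadata` class: the class name `CommonMetadataSketch`, the alias table and the common key `last-author` are all hypothetical, chosen only to show format-specific names collapsing onto one key, with string values converted to typed values on request.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of consistent metadata keys; NOT Tika's Metadata class. */
public class CommonMetadataSketch {
    // Assumed alias table: format-specific names mapped onto one common key.
    private static final Map<String, String> ALIASES = Map.of(
        "Last Author", "last-author",
        "Last Editor", "last-author",
        "Previous Author", "last-author");

    // Keys and values are stored as plain strings.
    private final Map<String, String> values = new HashMap<>();

    /** Stores a value under the common key if the format-specific key has an alias. */
    public void setRaw(String formatKey, String value) {
        values.put(ALIASES.getOrDefault(formatKey, formatKey), value);
    }

    public String get(String key) {
        return values.get(key);
    }

    /** Typed accessor: converts the stored string to an int. */
    public int getInt(String key) {
        return Integer.parseInt(values.get(key));
    }

    /** Typed accessor: converts a stored ISO-8601 string to an Instant. */
    public Instant getDate(String key) {
        return Instant.parse(values.get(key));
    }

    public static void main(String[] args) {
        CommonMetadataSketch metadata = new CommonMetadataSketch();
        metadata.setRaw("Last Editor", "nick");   // as one file format might name it
        metadata.setRaw("Page-Count", "12");
        System.out.println(metadata.get("last-author"));
        System.out.println(metadata.getInt("Page-Count") + 1);
    }
}
```

Whatever the file format called the field, the caller only ever asks for one key.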
  9. Text Content
     • Tika generates HTML-like SAX events as it parses
     • Uses Java SAX API
     • Events can be captured or transformed
     • BodyContentHandler used for plain text
     • HTML and XHTML available
     • Can customise with your own handler, with XSLT or with E4X from JavaScript
     • eg HTML Table → CSV
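As a rough sketch of the "HTML Table → CSV" idea, the handler below turns table-related SAX events into CSV. It uses only the JDK: the class name `TableToCsvHandler` is hypothetical, and a JDK SAX parser reading an XHTML string stands in for a Tika parser as the source of events. The JDK parser here is not namespace-aware, so the sketch matches on `qName`; a handler attached to Tika would probably want to check `localName` as well.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical custom handler: captures table SAX events and emits CSV. */
public class TableToCsvHandler extends DefaultHandler {
    private final StringBuilder csv = new StringBuilder();
    private final StringBuilder cell = new StringBuilder();
    private boolean inCell = false;
    private boolean firstCellInRow = true;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("tr")) {
            firstCellInRow = true;
        } else if (qName.equals("td") || qName.equals("th")) {
            inCell = true;
            cell.setLength(0);          // start collecting a fresh cell
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inCell) {
            cell.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("td") || qName.equals("th")) {
            if (!firstCellInRow) {
                csv.append(',');
            }
            csv.append(quote(cell.toString().trim()));
            firstCellInRow = false;
            inCell = false;
        } else if (qName.equals("tr")) {
            csv.append('\n');           // one CSV line per table row
        }
    }

    /** Quotes a value only when it contains a comma, quote or newline. */
    private static String quote(String value) {
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    /** Parses an XHTML string with the JDK SAX parser (stand-in for a Tika parser). */
    public static String convert(String xhtml) {
        try {
            TableToCsvHandler handler = new TableToCsvHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.csv.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><table>"
            + "<tr><th>Format</th><th>Type</th></tr>"
            + "<tr><td>Word</td><td>application/msword</td></tr>"
            + "</table></body></html>";
        System.out.print(convert(xhtml));
        // prints:
        // Format,Type
        // Word,application/msword
    }
}
```

Because the handler only sees SAX events, the same class works regardless of which file format the events originally came from.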
  10. Calling Tika
  11. // Get a content detector, and an auto-selecting Parser
      TikaConfig config = TikaConfig.getDefaultConfig();
      ContainerAwareDetector detector = new ContainerAwareDetector(
          config.getMimeRepository()
      );
      Parser parser = new AutoDetectParser(detector);
      // We’ll only want the plain text contents
      ContentHandler handler = new
  12. // Plain text only content handler
      ContentHandler handler = new BodyContentHandler();
      String text = handler.toString();
      // XHTML content handler
      // (newInstance() is declared to return TransformerFactory, so a cast is needed)
      SAXTransformerFactory factory = (SAXTransformerFactory)
          SAXTransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
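The XHTML handler on slide 12 stops before the handler is given anywhere to write. A minimal, JDK-only sketch of how it might be completed is below. The class name `XhtmlCapture` and the `render` helper are hypothetical, and the hand-fired SAX events are a stand-in for whatever a Tika parser would send to the handler during `parse(...)`; Tika's real events also carry the XHTML namespace, which is left out here for brevity.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: serialise SAX events to XML text with a TransformerHandler. */
public class XhtmlCapture {

    public static String render(String paragraph) {
        try {
            // The factory must be cast: newInstance() is declared to return TransformerFactory.
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Tell the handler where to write before any events arrive.
            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // Stand-in for a parser: fire the events by hand.
            AttributesImpl none = new AttributesImpl();
            handler.startDocument();
            handler.startElement("", "html", "html", none);
            handler.startElement("", "body", "body", none);
            handler.startElement("", "p", "p", none);
            char[] text = paragraph.toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "p", "p");
            handler.endElement("", "body", "body");
            handler.endElement("", "html", "html");
            handler.endDocument();

            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialise events", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render("Hello from a SAX event stream"));
    }
}
```

In real use the hand-fired events would be replaced by passing `handler` to the parser, and `out.toString()` read once parsing has finished.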
  13. Tika Parsers
  14. Parser Interface
      • Two key methods – what mime types are supported, and do the parsing
      public interface Parser {
          Set<MediaType> getSupportedTypes(ParseContext context);
          void parse(InputStream stream, ContentHandler handler,
                  Metadata metadata, ParseContext context)
              throws IOException, SAXException,
  15. public class HelloWorldParser implements Parser {
          public Set<MediaType> getSupportedTypes(ParseContext context) {
              Set<MediaType> types = new HashSet<MediaType>();
              types.add(MediaType.parse("hello/world"));
              return types;
          }
          public void parse(InputStream stream,
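The `parse` method on slide 15 is cut off, so what follows is only a guess at the general shape of such a method, not the slide's actual code. To stay runnable with the JDK alone, the sketch does not implement Tika's `Parser` interface. The class `HelloWorldSketch`, the `Map` standing in for `Metadata`, and the `TextCollector` handler (a simplified stand-in for a plain-text handler such as `BodyContentHandler`) are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a parse body: record metadata, then emit XHTML-like SAX events. */
public class HelloWorldSketch {

    /** Stand-in for Parser.parse(...): a Map replaces Tika's Metadata, and no stream is read. */
    static void parse(ContentHandler handler, Map<String, String> metadata) throws SAXException {
        metadata.put("Content-Type", "hello/world");

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "body", "body", none);
        handler.startElement("", "p", "p", none);
        char[] text = "Hello, World!".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Simplified plain-text handler: keeps character data and ignores the markup events. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Runs the sketch parser against the text collector and returns the captured text. */
    public static String extractText(Map<String, String> metadata) {
        try {
            TextCollector handler = new TextCollector();
            parse(handler, metadata);
            return handler.toString();
        } catch (SAXException e) {
            throw new IllegalStateException("Handler rejected an event", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        System.out.println(extractText(metadata));
        System.out.println(metadata);
    }
}
```

The parser never builds the output itself: it only reports events, and the caller's choice of handler decides whether the result is plain text, XHTML or something else.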
  16. Demo: Tika-App
  17. Demo: Geo-Tagged Images in Alfresco Share via Tika
  18. Any Questions?