MIME Magic with Apache Tika


Apache Tika aims to make it easier to extract metadata and structured text content from all kinds of files. Tika is a subproject of Apache Lucene and leverages libraries like Apache POI and Apache PDFBox to provide a powerful yet simple interface for parsing dozens of document formats. This makes Tika an ideal companion for Apache Lucene, or for any search engine that needs to index metadata and content from many different types of files. This presentation introduces Apache Tika and shows how it is used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs.

  1. 1. MIME Magic with Apache Tika Jukka Zitting
  3. 4. Type, Text, Metadata
  4. 5. import org.apache.tika.Tika; // unreleased code alert
  5. 6. // Type detection
        String type = Tika.detect(InputStream);
        String type = Tika.detect(File);
        String type = Tika.detect(URL);
        String type = Tika.detect(String);
  6. 7. // Text extraction
        String content = Tika.parseToString(InputStream);
        String content = Tika.parseToString(File);
        String content = Tika.parseToString(URL);
  7. 8. // Streaming text extraction
        Reader reader = Tika.parse(InputStream);
        Reader reader = Tika.parse(File);
        Reader reader = Tika.parse(URL);
  8. 9. // … with metadata
        String type = Tika.detect(InputStream, Metadata);
        String content = Tika.parseToString(InputStream, Metadata);
        Reader reader = Tika.parse(InputStream, Metadata);
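
The streaming variants above return a `java.io.Reader`, so the caller can process text as it arrives instead of holding the whole document in memory. Below is a minimal sketch of that consumption pattern using only the JDK. The class name `StreamingSketch` and the `countChars` helper are hypothetical, and a `StringReader` stands in for the reader Tika would return.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical helper: consumes a Reader chunk by chunk. Not part of Tika. */
class StreamingSketch {

    /** Counts characters using a fixed-size buffer, then closes the reader. */
    static long countChars(Reader reader) {
        char[] buffer = new char[4096];
        long total = 0;
        // try-with-resources closes the reader even if reading fails
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n; // a real consumer would index or scan the chunk here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for the Reader returned by a streaming parse call
        Reader extracted = new StringReader("hello world");
        System.out.println(countChars(extracted));
    }
}
```

Memory use stays bounded by the buffer size regardless of document length, which is the main reason to prefer the `Reader` form over the `String` form for large files.
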
  9. 10. How does the magic work?
  10. 11. // Metadata handling
          import org.apache.tika.metadata.Metadata;
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, "…");
          String title = metadata.get(Metadata.TITLE);
          // XMP?
  11. 12. // Type detection and the type registry
          import org.apache.tika.detect.Detector;
          Detector detector =
              TikaConfig.getDefaultConfig().getMimeTypeRepository();
          MediaType type = detector.detect(InputStream, Metadata);
          // 1274 known types, 627 detectable
          // (927 globs, 280 magic byte patterns)
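
The registry slide mentions two kinds of detection rules: file name globs and magic byte patterns. The sketch below illustrates the general idea in plain Java. It is not Tika's implementation; the class `DetectionSketch`, its three hard-coded signatures, and its two extension rules are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch of magic-byte detection with a file name fallback. */
class DetectionSketch {

    // Well-known leading byte signatures
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC = {(byte) 0x89, 'P', 'N', 'G'};
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    /** Returns true if data begins with the given magic bytes. */
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    /** Tries magic bytes first, then falls back to the file name extension. */
    static String detect(byte[] prefix, String name) {
        if (startsWith(prefix, PDF_MAGIC)) return "application/pdf";
        if (startsWith(prefix, PNG_MAGIC)) return "image/png";
        if (startsWith(prefix, ZIP_MAGIC)) return "application/zip";
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".txt")) return "text/plain";
            if (lower.endsWith(".html") || lower.endsWith(".htm")) return "text/html";
        }
        // Generic fallback when no rule matches
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(pdf, "report.bin"));        // content wins over the name
        System.out.println(detect(new byte[0], "notes.txt")); // name used when bytes say nothing
    }
}
```

Checking content before the name means a mislabeled file is still typed by what it actually contains, while the name still helps for formats such as plain text that have no signature.
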
  12. 13. // The parser API
          import org.apache.tika.parser.Parser;
          Parser parser = new AutoDetectParser();
          parser.parse(
              InputStream, ContentHandler, Metadata, Map<String, Object>);
          // Throws IOException, SAXException, TikaException
          // Parser packages: asm, audio, epub, html, image, mbox,
          // ms, mp3, odf, pdf, pkg, rtf, txt, xml
  13. 14. Parser API in detail
  14. 15. // Document input stream
          InputStream stream = …;
          try {
              parser.parse(stream, …);
          } finally {
              stream.close(); // Remember!
          }
          // Streaming if possible, otherwise a
          // temporary file is automatically used
  15. 16. // XHTML SAX events for structured text
          ContentHandler handler =
              new BodyContentHandler(…); // etc.
          try {
              parser.parse(…, handler, …);
          } catch (SAXException e) {
              …;
          }
          // <html>
          //   <head>…<title>…</title>…</head>
          //   <body>…</body>
          // </html>
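
To make the SAX event model above concrete, here is a minimal sketch of a handler that keeps only the text inside `<body>`, in the spirit of the `BodyContentHandler` named on the slide. It uses only the JDK's SAX classes. `BodyTextSketch` and `extractBody` are hypothetical names, the JDK XML parser stands in for a Tika parser as the event source, and checked exceptions are wrapped for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data found inside <body>. */
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // 0 while outside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Feeds an XHTML string through the handler and returns the body text. */
    static String extractBody(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not process XHTML", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello, Tika!</p></body></html>";
        System.out.println(extractBody(xhtml)); // prints "Hello, Tika!"
    }
}
```

Because the handler sees events one at a time, it could just as well write to an index or a file as it goes, rather than accumulating a string.
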
  16. 17. // Input and output metadata
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, "…");
          parser.parse(…, metadata, …);
          String title = metadata.get(Metadata.TITLE);
          // See parser implementations for
          // supported metadata keys (TODO: XMP, DC)
  17. 18. // Parse context
          Map<String, Object> context =
              new HashMap<String, Object>();
          context.put(
              Parser.class.getName(),
              new MyCustomComponentParser());
          // hierarchy while streaming!
          parser.parse(…, context);
          // Also for Locale, XML parser, etc.
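
The parse context above is a plain map keyed by class name. One way the receiving side could read such a map safely is sketched below. `ContextSketch` and its `get` helper are hypothetical and not part of Tika; `Locale` is used as the example component because the slide lists it as another context entry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for reading a class-name-keyed context map. */
class ContextSketch {

    /** Returns the entry stored under the type's class name, or the fallback. */
    static <T> T get(Map<String, Object> context, Class<T> type, T fallback) {
        Object value = context.get(type.getName());
        // isInstance is false for null and for values of the wrong type
        return type.isInstance(value) ? type.cast(value) : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();
        context.put(Locale.class.getName(), Locale.FRENCH);

        Locale locale = get(context, Locale.class, Locale.ROOT);
        String missing = get(context, String.class, "none");
        System.out.println(locale + " " + missing);
    }
}
```

Keying by class name keeps the map loosely typed for callers while the lookup helper restores type safety and supplies a default when nothing was provided.
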
  18. 19. Putting it all together
  19. 20. // Tika components
          tika-core    - API/core classes, no deps
          tika-parsers - parser libraries
          tika-app     - all in one jar + cli/gui
          $ java -jar tika-app-0.x.jar < …
          $ java -jar tika-app-0.x.jar --text
          $ java -jar tika-app-0.x.jar --metadata
          $ java -jar tika-app-0.x.jar --help
  20. 21. // Tika GUI
          $ java -jar tika-app-0.x.jar --gui
  21. 22. // Customizing Tika
          org/apache/tika/tika-config.xml
          org/apache/tika/mime/tika-mimetypes.xml
          // … or add custom components at runtime
          org.apache.tika.detect.CompositeDetector
          org.apache.tika.parser.CompositeParser
          // … or submit a patch!
          // In the future: OSGi support?
  22. 23. // Tika integration
          Lucene Java
              document.add(new Field("content", Tika.parse(…)));
              // see also the ParsingReader class
          Solr Cell, in Solr 1.4
              http://wiki.apache.org/solr/ExtractingRequestHandler
          TODO: Nutch, ports (but see Tika cli)
  23. 24. Questions? http://lucene.apache.org/tika/