MIME Magic with Apache Tika


Apache Tika aims to make it easier to extract metadata and structured text content from all kinds of files. Tika is a subproject of Apache Lucene, and leverages libraries like Apache POI and Apache PDFBox to provide a powerful yet simple interface for parsing dozens of document formats. This makes Tika an ideal companion for Apache Lucene, or for any search engine that needs to be able to index metadata and content from many different types of files. This presentation introduces Apache Tika and shows how it's being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs.

Published in: Technology, Education

  1. 1. MIME Magic with Apache Tika Jukka Zitting
  2. 3. YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before…

        3 May. Bistritz.--Left Munich at 8:35 P.M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule…

        To Sherlock Holmes she is always THE woman. I have seldom heard him mention her under any other name. In his eyes she eclipses and predominates the whole of her sex. It was not that he felt any emotion akin to love for Irene Adler. All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind. He was, I take it, the most perfect reasoning and observing machine that the world has seen, but as a lover he would have placed himself in a false position. He never spoke of the softer passions, save with a gibe and a sneer. They were admirable things for the observer--excellent for drawing the veil from men's motives and actions. But for the trained reasoner to admit such intrusions into his own delicate and finely adjusted temperament was to introduce a distracting factor which might throw a doubt upon all his mental results. Grit in a sensitive instrument, or a crack in one of his own high-power lenses, would not be more disturbing than a strong emotion in a nature such as his. And yet there was but one woman to him, and that woman was the late Irene Adler, of dubious and questionable memory…

        Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her…

        It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer…
  3. 4. Type, Text, Metadata
  4. 5. import org.apache.tika.Tika; // unreleased code alert
  5. 6. // Type detection
        String type = Tika.detect(InputStream);
        String type = Tika.detect(File);
        String type = Tika.detect(URL);
        String type = Tika.detect(String);
  6. 7. // Text extraction
        String content = Tika.parseToString(InputStream);
        String content = Tika.parseToString(File);
        String content = Tika.parseToString(URL);
  7. 8. // Streaming text extraction
        Reader reader = Tika.parse(InputStream);
        Reader reader = Tika.parse(File);
        Reader reader = Tika.parse(URL);
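Returning a Reader instead of a String lets the caller consume extracted text incrementally. The following is a minimal sketch of that consumption pattern using only the JDK. The `StringReader` merely stands in for the `Reader` that the extraction call on the slide would return, and `ReaderWordCount` and `countWords` are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

class ReaderWordCount {
    // Counts whitespace-separated words line by line, so the full text
    // never has to be held in memory. The caller still owns the reader
    // and is responsible for closing it.
    static long countWords(Reader reader) {
        BufferedReader in = new BufferedReader(reader);
        return in.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .mapToLong(line -> line.split("\\s+").length)
                .sum();
    }

    public static void main(String[] args) {
        // Stand-in for the Reader produced by text extraction (assumption).
        Reader reader = new StringReader("It is a truth\nuniversally acknowledged");
        System.out.println(countWords(reader));
    }
}
```

The same loop would work on text from a very large document, because only one line is buffered at a time.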
  8. 9. // … with metadata
        String type = Tika.detect(InputStream, Metadata);
        String content = Tika.parseToString(InputStream, Metadata);
        Reader reader = Tika.parse(InputStream, Metadata);
  9. 10. How does the magic work?
  10. 11. // Metadata handling
          import org.apache.tika.metadata.Metadata;
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, "…");
          String title = metadata.get(Metadata.TITLE);
          // XMP?
  11. 12. // Type detection and the type registry
          import org.apache.tika.detect.Detector;
          Detector detector = TikaConfig.getDefaultConfig().getMimeTypeRepository();
          MediaType type = detector.detect(InputStream, Metadata);
          // 1274 known types, 627 detectable
          // (927 globs, 280 magic byte patterns)
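The slide above mentions two kinds of detection rules: file name globs and magic byte patterns. The sketch below illustrates the general idea with a tiny hand-written registry. It is not Tika's implementation; `TypeSniffer`, its two maps, and the "magic first, then name" ordering are assumptions made for this example, and plain file extensions stand in for glob patterns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class TypeSniffer {
    static final String FALLBACK = "application/octet-stream";

    // Media type -> magic byte prefix (a few well-known file signatures).
    static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // File extension -> media type, standing in for globs such as "*.pdf".
    static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        MAGIC.put("application/zip", new byte[] {'P', 'K', 3, 4});
        GLOBS.put(".txt", "text/plain");
        GLOBS.put(".html", "text/html");
        GLOBS.put(".pdf", "application/pdf");
    }

    // Checks the leading bytes first, then falls back to the file name.
    static String detect(byte[] head, String name) {
        if (head != null) {
            for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
                byte[] magic = e.getValue();
                if (head.length >= magic.length
                        && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                    return e.getKey();
                }
            }
        }
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head, "report.txt"));
        System.out.println(detect(new byte[0], "notes.TXT"));
    }
}
```

Content-based rules take priority here, so a PDF that was renamed to `report.txt` is still reported as a PDF.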
  12. 13. // The parser API
          import org.apache.tika.parser.Parser;
          Parser parser = new AutoDetectParser();
          parser.parse(InputStream, ContentHandler, Metadata, Map<String, Object>);
          // IOException, SAXException, TikaException
          // asm, audio, epub, html, image, mbox,
          // ms, mp3, odf, pdf, pkg, rtf, txt, xml
  13. 14. Parser API in detail
  14. 15. // Document input stream
          InputStream stream = …;
          try {
              parser.parse(stream, …);
          } finally {
              stream.close(); // Remember!
          }
          // Streaming if possible, otherwise a
          // temporary file is automatically used
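On Java 7 and later, the try/finally pattern on this slide can also be written with try-with-resources, which closes the stream even when parsing throws. The sketch below demonstrates that guarantee with JDK classes only. `consume` is a hypothetical stand-in for the `parser.parse(stream, …)` call, and `TrackedStream` exists only to make the close observable.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamClosing {
    // Wrapper that records whether close() was called.
    static class TrackedStream extends FilterInputStream {
        boolean closed = false;

        TrackedStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Hypothetical stand-in for parser.parse(stream, …): reads to the end.
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    // Reads the data inside try-with-resources and reports whether the
    // stream was closed once the block exited.
    static boolean closedAfterUse(byte[] data) {
        TrackedStream tracked = new TrackedStream(new ByteArrayInputStream(data));
        try (InputStream stream = tracked) {
            consume(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tracked.closed;
    }

    public static void main(String[] args) {
        System.out.println(closedAfterUse(new byte[] {1, 2, 3}));
    }
}
```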
  15. 16. // XHTML SAX events for structured text
          ContentHandler handler = new BodyContentHandler(…); // etc.
          try {
              parser.parse(…, handler, …);
          } catch (SAXException e) {
              …;
          }
          // <html>
          //   <head>…<title>…</title>…</head>
          //   <body>…</body>
          // </html>
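To make the XHTML event model concrete, here is a rough sketch of a SAX handler that keeps only the character data inside `<body>`. It is a simplified illustration and not the `BodyContentHandler` named on the slide; `BodyTextCollector` and `bodyText` are hypothetical names, and the JDK's own SAX parser plays the role of the document parser by emitting events for a small XHTML string.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than zero while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    // Feeds an XHTML string through the JDK SAX parser and returns the
    // text found inside <body>; head content such as <title> is skipped.
    static String bodyText(String xhtml) {
        try {
            BodyTextCollector handler = new BodyTextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>T</title></head>"
                + "<body><p>Hello </p><p>Tika</p></body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

Because the handler only reacts to events, the same class would work regardless of which parser produces them.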
  16. 17. // Input and output metadata
          Metadata metadata = new Metadata();
          metadata.set(Metadata.RESOURCE_NAME_KEY, "…");
          parser.parse(…, metadata, …);
          String title = metadata.get(Metadata.TITLE);
          // See parser implementations for
          // supported metadata keys (TODO: XMP, DC)
  17. 18. // Parse context
          Map<String, Object> context = new HashMap<String, Object>();
          context.put(Parser.class.getName(), new MyCustomComponentParser());
          // hierarchy while streaming!
          parser.parse(…, context);
          // Also for Locale, XML parser, etc.
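The context map on this slide is keyed by class name, which suggests a simple typed lookup on the receiving side. The following sketch shows one way such a lookup could be written with the JDK alone. `ContextLookup` and its `get` helper are hypothetical and not part of Tika; the `Locale` entry follows the slide's remark that the context is also used for the locale.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class ContextLookup {
    // Returns the entry stored under the class name of 'type', or
    // 'fallback' when the entry is missing or has an unexpected type.
    static <T> T get(Map<String, Object> context, Class<T> type, T fallback) {
        Object value = context.get(type.getName());
        return type.isInstance(value) ? type.cast(value) : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<String, Object>();
        context.put(Locale.class.getName(), Locale.FRENCH);

        // Present entry: the stored locale is returned.
        System.out.println(get(context, Locale.class, Locale.ROOT));
        // Missing entry: the fallback is returned.
        System.out.println(get(context, String.class, "none"));
    }
}
```

Keying by class name keeps the map untyped, so the `isInstance` check guards against an entry of the wrong type being stored under a key.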
  18. 19. Putting it all together
  19. 20. // Tika components
          tika-core - API/core classes, no deps
          tika-parsers - parser libraries
          tika-app - all in one jar + cli/gui
          $ java -jar tika-app-0.x.jar < …
          $ java -jar tika-app-0.x.jar --text
          $ java -jar tika-app-0.x.jar --metadata
          $ java -jar tika-app-0.x.jar --help
  20. 21. // Tika GUI
          $ java -jar tika-app-0.x.jar --gui
  21. 22. // Customizing Tika
          org/apache/tika/tika-config.xml
          org/apache/tika/mime/tika-mimetypes.xml
          // … or add custom components at runtime
          org.apache.tika.detect.CompositeDetector
          org.apache.tika.parser.CompositeParser
          // … or submit a patch!
          // In the future: OSGi support?
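Adding custom components at runtime relies on the composite pattern: several detectors (or parsers) are wrapped behind one object with the same interface. The sketch below shows that pattern in plain Java. `SimpleDetector` is a hypothetical single-method interface and not Tika's `Detector`, and the "first answer other than the fallback wins" rule is a simplification chosen for this example; the real `CompositeDetector` may combine results differently.

```java
import java.util.Arrays;
import java.util.List;

class CompositeSketch {
    static final String UNKNOWN = "application/octet-stream";

    // Hypothetical detector interface, used only for this illustration.
    interface SimpleDetector {
        String detect(byte[] head, String name);
    }

    // Wraps several detectors; they are tried in order and the first
    // answer other than UNKNOWN is returned.
    static SimpleDetector compose(List<SimpleDetector> detectors) {
        return (head, name) -> {
            for (SimpleDetector detector : detectors) {
                String type = detector.detect(head, name);
                if (!UNKNOWN.equals(type)) {
                    return type;
                }
            }
            return UNKNOWN;
        };
    }

    // A custom detector for a made-up ".demo" format (assumption).
    static final SimpleDetector CUSTOM = (head, name) ->
            name != null && name.endsWith(".demo") ? "application/x-demo" : UNKNOWN;

    // A trivial detector for plain text files.
    static final SimpleDetector TEXT = (head, name) ->
            name != null && name.endsWith(".txt") ? "text/plain" : UNKNOWN;

    public static void main(String[] args) {
        SimpleDetector detector = compose(Arrays.asList(CUSTOM, TEXT));
        System.out.println(detector.detect(null, "sample.demo"));
        System.out.println(detector.detect(null, "notes.txt"));
        System.out.println(detector.detect(null, "blob.bin"));
    }
}
```

Callers see a single detector, so a custom component can be placed ahead of the defaults without changing any calling code.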
  22. 23. // Tika integration
          Lucene Java
          document.add(new Field("content", Tika.parse(…)));
          // see also the ParsingReader class
          Solr Cell, in Solr 1.4
          http://wiki.apache.org/solr/ExtractingRequestHandler
          TODO: Nutch, ports (but see Tika cli)
  23. 24. Questions? http://lucene.apache.org/tika/