Content Analysis with Apache Tika


    Content Analysis with Apache Tika - Presentation Transcript

    1. Content analysis with Apache Tika Paolo Mottadelli - [email_address] or [email_address]
    2. Main challenge Lucene index
    3. Other challenges
    4. What is Tika? Another Indian Lucene project? No.
    5. What is Tika? It is a Toolkit
    6. Current coverage
    7. A brief history of Tika Sponsored by the Apache Lucene PMC
    8. Tika organization Changing after graduation
    9. Getting Tika … and contributing
    10. Tika Design
    11. The Parser interface
      • void parse(InputStream stream, ContentHandler handler, Metadata metadata) throws IOException, SAXException, TikaException;
    12. Tika Design
    13. Document input stream
    14. Tika Design
    15. XHTML SAX events
      • <html xmlns="http://www.w3.org/1999/xhtml">
      • <head>
      • <title>...</title>
      • </head>
      • <body> ... </body>
      • </html>
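
To make the XHTML SAX event stream concrete, here is a minimal sketch of a consumer that separates title text from body text. It uses only the JDK's SAX classes and feeds the handler from a hand-written XHTML string instead of a real Tika parser; the class name `XhtmlTextCollector` and its methods are hypothetical, chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects <title> and <body> text from XHTML SAX events. */
class XhtmlTextCollector extends DefaultHandler {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(localName)) inTitle = true;
        if ("body".equals(localName)) inBody = true;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(localName)) inTitle = false;
        if ("body".equals(localName)) inBody = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Route text to whichever section is currently open.
        if (inTitle) title.append(ch, start, length);
        if (inBody) body.append(ch, start, length);
    }

    String title() { return title.toString(); }
    String body() { return body.toString(); }

    /** Parses an XHTML string with the JDK SAX parser (a stand-in for a document parser). */
    static XhtmlTextCollector collect(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed so localName is populated
            XhtmlTextCollector handler = new XhtmlTextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

Because the handler only sees structural events such as head, title and body, the same consumer would work no matter which document format produced them.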
    16. Why XHTML?
      • Reflect the structured text content of the document
      • Not recreating the low level details
      • For low level details use low level parser libs
    17. ContentHandler (CH) and Decorators (CHD)
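
The decorator idea can be sketched with the JDK alone: a handler that wraps another ContentHandler, adds one behaviour, and forwards every event unchanged. The sketch below uses the JDK's `XMLFilterImpl` as the forwarding base class; it is an illustrative stand-in, not Tika's own decorator class, and `CountingDecorator` and `DecoratorDemo` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical decorator: counts characters, then forwards each event to the wrapped handler. */
class CountingDecorator extends XMLFilterImpl {
    private int characterCount;

    CountingDecorator(ContentHandler wrapped) {
        setContentHandler(wrapped); // XMLFilterImpl forwards events to this handler
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        characterCount += length;            // added behaviour
        super.characters(ch, start, length); // unchanged forwarding
    }

    int characterCount() { return characterCount; }
}

class DecoratorDemo {
    /** Returns the text seen by the wrapped handler and the decorator's count, joined by '|'. */
    static String run(String xml) {
        StringBuilder text = new StringBuilder();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        CountingDecorator decorator = new CountingDecorator(sink);
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(decorator);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return text + "|" + decorator.characterCount();
    }
}
```

The wrapped handler never needs to know it is decorated, so decorators can be stacked to add further behaviour.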
    18. Tika Design
    19. Document metadata
    20. … more metadata: HPSF
    21. Tika Design
    22. Parser implementations
    23. The AutoDetectParser
      • Encapsulates all Tika functionalities
      • Can handle any type of document
    24. Type Detection MimeType type = types.getMimeType(…);
    25. tika-mimetypes.xml
      • An example: Gzip
      • <mime-type type="application/x-gzip">
      • <magic priority="40">
      • <match value="\037\213" type="string" offset="0" />
      • </magic>
      • <glob pattern="*.tgz" />
      • <glob pattern="*.gz" />
      • <glob pattern="*-gz" />
      • </mime-type>
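
The gzip entry above pairs a magic-byte match at offset 0 with three filename globs. Gzip streams begin with the bytes 0x1f 0x8b (octal 037 213), so the two checks can be sketched in plain Java as below. This is a simplified illustration under assumed rules (either check is enough, with a generic fallback type); it is not how Tika itself combines magic and glob priorities, and `GzipSniffer` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sniffer mirroring the gzip entry: one magic match plus three globs. */
class GzipSniffer {
    static final String GZIP_TYPE = "application/x-gzip";
    static final String FALLBACK_TYPE = "application/octet-stream"; // assumed default

    /** Magic match: bytes 0x1f 0x8b at offset 0. */
    static boolean hasGzipMagic(byte[] prefix) {
        return prefix.length >= 2
                && (prefix[0] & 0xff) == 0x1f
                && (prefix[1] & 0xff) == 0x8b;
    }

    /** Glob match for the patterns *.tgz, *.gz and *-gz. */
    static boolean hasGzipName(String name) {
        return name.endsWith(".tgz") || name.endsWith(".gz") || name.endsWith("-gz");
    }

    /** Simplified rule: accept if either the magic bytes or the file name match. */
    static String detect(byte[] prefix, String name) {
        if (hasGzipMagic(prefix) || (name != null && hasGzipName(name))) {
            return GZIP_TYPE;
        }
        return FALLBACK_TYPE;
    }

    /** Helper that produces real gzip bytes so the magic check can be tried out. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new IllegalStateException("in-memory gzip failed", e);
        }
        return out.toByteArray();
    }
}
```

Checking content first and the name second means a gzip stream is still recognized when the file has been renamed.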
    26. Supported formats
    27. A really simple example
      • InputStream input = MyTest.class.getResourceAsStream("testPPT.ppt");
      • Metadata metadata = new Metadata();
      • ContentHandler handler = new BodyContentHandler();
      • new OfficeParser().parse(input, handler, metadata);
      • String contentType = metadata.get(Metadata.CONTENT_TYPE);
      • String title = metadata.get(Metadata.TITLE);
      • String content = handler.toString();
    28. Future Goals
    29. Who uses Tika?

    Paolo Mottadelli, 2 years ago
