Apache Tika: 1 point Oh!

ApacheCon NA 2011 talk on Apache Tika 1.0.

Published in: Technology, Education

Transcript

  • 1. Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address] November 9, 2011
  • 2.
    • Apache Member involved in
      • OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor)
    • Senior Computer Scientist at NASA JPL in Pasadena, CA USA
    • Software Architecture/Engineering Prof at Univ. of Southern California
    And you are?
  • 3. Roadmap
    • 1st part of the talk
      • Why Tika?
      • What is Tika?
      • What are the current versions of Tika?
      • What can it do?
    • 2nd part of the talk
      • NASA Earth Science Data Systems
      • Data System Needs and Requirements
      • How does Tika help?
  • 4. The Information Landscape
  • 5. Proliferation of content types available
    • By some accounts, 16K to 51K content types*
    • What to do with content types?
      • Parse them
        • How?
        • Extract their text and structure
      • Index their metadata
        • In an indexing technology like Lucene, Solr, or in Google Appliance
      • Identify what language they belong to
        • Ngrams
    *http://filext.com/
  • 6. Importance of content types
  • 7. Importance of content type detection
  • 8. Search Engine Architecture
  • 9. Goals
    • Identify and classify file types
      • MIME detection
        • Glob pattern
          • *.txt
          • *.pdf
        • URL
          • http://…pdf
          • ftp://myfile.txt
        • Magic bytes
        • Combination of the above means
    • Classification means reaction can be targeted
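To make the "combination of the above means" concrete, here is a minimal sketch in plain Java of a detector that checks magic bytes first and falls back to the name glob. It is not Tika's implementation: the class name `MimeSketch`, the tiny signature tables, the type labels, and the content-before-name policy are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Illustrative toy detector (hypothetical, not Tika code). */
final class MimeSketch {
    static final String FALLBACK = "application/octet-stream";

    /** Glob step: map a file name or URL to a type by its extension. */
    static String byName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".txt")) return "text/plain";
        if (lower.endsWith(".pdf")) return "application/pdf";
        return null; // no glob matched
    }

    /** Magic step: compare the leading bytes against known signatures. */
    static String byMagic(byte[] head) {
        byte[] pdf = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        byte[] hdf5 = {(byte) 0x89, 'H', 'D', 'F'}; // start of the HDF5 signature
        if (startsWith(head, pdf)) return "application/pdf";
        if (startsWith(head, hdf5)) return "application/x-hdf";
        return null; // no signature matched
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Combination (assumed policy): trust content over name, then fall back. */
    static String detect(String name, byte[] head) {
        String magic = byMagic(head);
        if (magic != null) return magic;
        String glob = (name == null) ? null : byName(name);
        return (glob != null) ? glob : FALLBACK;
    }
}
```

With this policy a PDF that was renamed to `report.txt` is still classified as `application/pdf`, while an empty or unrecognized stream is classified by its name alone.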
  • 10. Tika is…
    • A content analysis and detection toolkit
    • A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries
    • A rich Metadata API for representing different Metadata models
    • A command line interface to the underlying Java code
    • A GUI interface to the Java code
  • 11. Tika's (Brief) History
    • Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006
    • Proposed as Lucene sub-project
      • Others interested, didn't gain much traction
    • Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit
      • A Content Management System
    • Graduated from the Incubator to Lucene sub-project in 2008
    • Graduated to Apache TLP in April 2010
    • 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9
  • 12. Community
    • Mailing lists
      • User: 125 peeps, ~70 msg/mo.
      • Dev: 210 peeps, ~250 msg/mo.
    • Committers/PMC
      • 13 peeps
      • Large majority of them active
    • Releases
      • 11 releases so far
      • Just pushed out 1 point OH
        • http://s.apache.org/N0I
    Credit: svnsearch.org
  • 13. Use in the classroom
    • Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class
      • Several student final projects have turned into contributions for the project and merit for the students
    • Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.
  • 14. Some recent 1 point oh press
  • 15. Getting started rapidly…like now!
    • Download Tika from:
      • http://tika.apache.org/download.html
    • Grab tika-app-1.0.jar
    • alias tika="java -jar tika-app-1.0.jar"
    • tika < somefile.doc > extracted-text.xhtml
    • tika -m < somefile.doc > extracted.met
        • Works on Windows too (alias only on UNIX)
  • 16. A quick NASA dataset
    • Atmospheric Infrared Sounder Mission (AIRS)
      • Level 2 Cloud Clear Radiance Product
      • Grab it from here:
        • ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/
      • Just grab the first file
        • java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf
      • Hopefully this worked for you, if not, blame..
        • Windows
          • And Bill Gates
  • 17. Detecting MIME types from Java
    • String type = new Tika().detect(…)
      • java.io.InputStream
      • java.io.File
      • java.net.URL
      • java.lang.String
  • 18. Adding new MIME types
    • Got XML?
    • Based on freedesktop.org spec (loosely)
  • 19. Many custom applications and tools
    • You need this: to read this:
  • 20. Third-party parsing libraries
    • Most of the custom applications come with software libraries and tools to read/write these files
      • Rather than re-invent the wheel, figure out a way to take advantage of them
    • Parsing text and structure is a difficult problem
      • Not all libraries parse text in equivalent manners
      • Some are faster than others
      • Some are more reliable than others
  • 21. Parsing
    • String content = new Tika().parseToString(…)
      • InputStream
      • File
      • URL
  • 22. Streaming Parsing
    • Reader reader = new Tika().parse(…)
      • InputStream
      • File
      • URL
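The point of the `Reader` form is that the caller can consume extracted text in chunks instead of holding all of it in one `String`. The sketch below shows that consumption pattern in plain Java. The class name `StreamingSketch` is hypothetical, and in the test a `StringReader` stands in for the `Reader` a parser would hand back, so nothing here depends on Tika itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/** Illustrative only: consumes any Reader in fixed-size chunks. */
final class StreamingSketch {
    /** Counts characters without ever buffering the full text in memory. */
    static long countChars(Reader reader, int bufferSize) {
        char[] buf = new char[bufferSize];
        long total = 0;
        try {
            int n;
            // read() returns -1 at end of stream, otherwise the chunk length
            while ((n = reader.read(buf)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

The same loop shape works for feeding an indexer or a language detector: only `bufferSize` characters are resident at a time, which matters for very large documents.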
  • 23. Extraction of Metadata
    • Important to follow common Metadata models
      • Dublin Core – any electronic resource
      • XMP – also general like Dublin Core
      • Word Metadata – specific to .doc, .ppt, etc.
      • EXIF – image related
    • Lots of standards and models out there
      • The use and extraction of common models allows for content intercomparison
      • All standardize mechanisms for searching
      • You always know for X file type that field Y is there and of type String or Int or Date
  • 24. Cancer Research Example
  • 25. Cancer Research Example Attributes Relationships Credit: A. Hart
  • 26. Tika Sponsoring the Any23 Project
    • Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011)
    • Any23 = “Anything to Triples”
    • Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.)
    • Related to Apache Jena
    • Looking for synergies between 2 efforts
  • 27. Metadata
    • Metadata met = new Metadata(); // Dublin Core
      met.set(Metadata.FORMAT, "text/html");
      // multi-valued: add() appends, whereas a second set() would overwrite
      met.add(Metadata.FORMAT, "text/plain");
      System.out.println(Arrays.toString(met.getValues(Metadata.FORMAT)));
    • Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.)
      • Run: tika --list-met-models
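A multi-valued metadata field comes down to the difference between replacing and appending. The sketch below models that with a plain map of lists. `MultiValueMetaSketch` is a hypothetical stand-in written for this note, not Tika's `Metadata` class, and it assumes that `set` replaces the stored values while `add` appends to them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative stand-in for a multi-valued metadata container. */
final class MultiValueMetaSketch {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Replace whatever is stored under the key with a single value. */
    void set(String key, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(key, single);
    }

    /** Append a value, keeping the ones already stored under the key. */
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** All values for the key, in insertion order; empty if none. */
    List<String> getValues(String key) {
        return values.getOrDefault(key, new ArrayList<>());
    }
}
```

Calling `set("format", "text/html")` and then `add("format", "text/plain")` leaves two values under `format`; calling `set` a second time would leave only the last one.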
  • 28. Methods for language identification
    • N-grams
      • Method of detecting next character or set of characters in a sequence
      • Useful in determining whether small snippets of text come from a particular language or character set
    • Non-computational approaches
      • Tagging
      • Looking for common words or characters
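One common way to apply n-grams to language identification is to build a character-trigram frequency profile per language and pick the profile closest to the input text. The sketch below does this with cosine similarity. It is a simplified illustration under stated assumptions, not the algorithm Tika or Nutch ships: the class name `NgramSketch`, the choice of trigrams, and the similarity measure are all picked here for brevity, and real profiles would be trained on far more text than one sentence.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative trigram-profile language guesser (hypothetical). */
final class NgramSketch {
    /** Count overlapping character trigrams; '_' marks word boundaries. */
    static Map<String, Integer> profile(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count profiles, in [0, 1]. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : b.values()) nb += (double) v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Return the language key whose sample text is most similar to the input. */
    static String guess(String text, Map<String, String> samplesByLang) {
        Map<String, Integer> target = profile(text);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, String> e : samplesByLang.entrySet()) {
            double score = similarity(target, profile(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Short inputs share few trigrams with any profile, which is one reason results on small snippets depend heavily on the quality of the training text.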
  • 29. Language Detection
    • LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile( FileUtils.readFileToString(new File(filename))));
    • System.out.println(lang.getLanguage());
    • Uses Ngram analysis included with Tika
      • Originating from Nutch
      • Can be improved
  • 30. Running Tika in GUI form
    • tika --gui
    <html xmlns:html="…"> <body> … </body> </html>
  • 31. Integrating Tika into your App
    • Maven
    • Ant
    • Eclipse
    • It's just a set of jars
      • tika-core
      • tika-parsers
      • tika-app
      • tika-bundle
      • tika-server
  • 32. Some really great stuff in 1.0
    • Super improved OSGi support
      • New tika-bundle module
    • Improved RTF parsing support, OO support, and parsing of Outlook email attachments
    • Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian
    • Improved PDF parsing (extract annotation)
    NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen
  • 33. Things to watch out for
    • Deprecated APIs->gone
      • Recompile code
    • No more JDK 1.4 version of Tika
      • Upgrade
  • 34. Improvements to Tika
    • Adding more parsers for content types
    • Improve the JAX-RS server support
    • Expanding ability to handle random access file parsing
      • Scientific data file formats, some work on this
      • Leverage improvements in file representation TIKA-701, TIKA-654, TIKA-645, TIKA-153
    • Geospatial parsing support through GDAL
    • Improving language and charset detection
  • 35. Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
  • 36. NASA Ground Data Systems Credit: D. Woollard
  • 37. Context
    • NASA develops science data processing systems for multiple earth science missions
    • These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research
    • Typical characteristics
      • Remote sensing instruments that orbit the Earth multiple times daily
      • Data are acquired constantly
      • Complex algorithms convert instrument measurements to geophysical quantities
  • 38. The Square Kilometer Array
    • 1 sq. km of antennas
    • Never-before seen resolution looking into the sky
    • 700 TB
      • Per second!
  • 39. NASA DESDynI Mission
    • 16 TB/day
    • Geographically distributed
    • 10s of 1000s of jobs per day
    • Tier 1 Earth Science Decadal Mission
  • 40. Some Considerations
    • Scale
      • Data throughput rates
      • # of data types
      • # of metadata types
      • # of users to send the data to
    • Federation
      • Must leave the data where it is
      • Socio/Economic/Political
    • Heterogeneity
      • Technology, data formats, skills!
  • 41. Apache OODT
    • We've got some components to deal with these issues
  • 42. How are we building these systems now?
    • Allow for push/pull of data over arbitrary protocols - Ingestion builds std catalog and archive
    • Deliver product metadata to search, portal or GIS
    • Plug in arbitrary met extractors
  • 43. How are we building these systems now?
    • Separation of file management from workflow management
    • Allow for heterogeneous computing resources
    • Easily integrate PGEs
    • Leverages same ingestion crawler
  • 44. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA!
  • 45. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA!
  • 46. Science Data File Formats
    • Hierarchical Data Format (HDF)
      • http://www.hdfgroup.org
      • Versions 4 and 5
      • Lots of NASA data is in 4, newer NASA data in 5
      • Encapsulates
        • Observation (Scalars, Vectors, Matrices, NxMxZ…)
        • Metadata (Summary info, date/time ranges, spatial ranges)
      • Custom readers/writers/APIs in many languages
        • C/C++, Python, Java
  • 47. Science Data File Formats
    • network Common Data Form (netCDF)
      • www.unidata.ucar.edu/software/netcdf/
      • Versions 3 and 4
      • Heavily used in DOE, NOAA, etc.
      • Encapsulates
        • Observation (Scalars, Vectors, Matrices, NxMxZ…)
        • Metadata (Summary info, date/time ranges, spatial ranges)
      • Custom readers/writers/APIs in many languages
        • C/C++, Python, Java
      • Not a hierarchical representation: all flat
  • 48. So how does it work?
    • Ingestion
      • Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format
      • Need to extract their met, catalog and archive them, etc.
        • Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability
    • Processing
      • Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
  • 49. Tool support
    • Entire stacks of tools written around these formats
      • OPeNDAP, LAS, readers, writers, custom NASA mission toolkits
      • OGC
        • WMS, WCS, etc.
      • Unique, one-of-a-kind software built around these data file formats
    • Apache can contribute strongly in this area!
  • 50. Besides processing science files
    • … Tika also helps with
    • MIME identification
      • Useful in remote file acquisition
      • Useful in classification (catalog/archive) of existing content
      • Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU)
    • Language identification
      • Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it
  • 51. Big Goal
    • More closely link OODT and Tika
      • Add new parser to Tika
      • Easily get OODT met extractor based on it
    • Contribute back some features still baking in OODT
      • Configuration aspects of parsing
      • File types and extensions for science data files
    • Spatial
      • Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr
  • 52. NASA Geo Challenges
    • Sometimes the data isn't annotated with lat and lon
      • How to discover this?
    • Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult
    • Efficiency and speed are difficult since data is at scale
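A small, related illustration of why bounding boxes are tricky: taking the plain minimum and maximum longitude breaks as soon as the data straddles the ±180° meridian, and the same wrap-around is part of what makes regions near the poles awkward. The sketch below picks the narrowest longitude interval by excluding the largest empty gap on the circle. It assumes longitudes in degrees within [-180, 180], does not handle a region that actually contains a pole, and the class name `LonBoxSketch` is hypothetical.

```java
import java.util.Arrays;

/** Illustrative wrap-aware longitude range (hypothetical helper). */
final class LonBoxSketch {
    /**
     * Returns {west, east} in degrees. If west > east, the range
     * crosses the antimeridian (for example 170 to -170).
     */
    static double[] lonRange(double[] lons) {
        if (lons.length == 0) {
            throw new IllegalArgumentException("need at least one longitude");
        }
        double[] s = lons.clone();
        Arrays.sort(s);
        int n = s.length;
        // Start with the wrap-around gap from the largest back to the smallest value.
        double bestGap = s[0] + 360.0 - s[n - 1];
        int bestIdx = n - 1;
        // Look for a larger empty gap between sorted neighbours.
        for (int i = 0; i < n - 1; i++) {
            double gap = s[i + 1] - s[i];
            if (gap > bestGap) {
                bestGap = gap;
                bestIdx = i;
            }
        }
        // The range starts just after the largest gap and ends just before it.
        double west = s[(bestIdx + 1) % n];
        double east = s[bestIdx];
        return new double[] {west, east};
    }
}
```

For points at 170°, 175°, and -170° this yields a 20° wide range from 170 to -170, where a naive min/max would report a 345° wide box from -170 to 175.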
  • 53. Acknowledgements
    • Some Tika material inspired by Jukka Zitting's talks
      • http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika
      • http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630
    • NASA Jet Propulsion Laboratory
      • OODT Team
  • 54. Book
    • Jukka and I have finished the first definitive guide on Apache Tika
    • Official release date: 11/17
    • Early Access available through MEAP program
    • http://manning.com/mattmann/
  • 55. Alright, I'll shut up now
    • Any questions?
    • THANK YOU!
      • [email_address]
      • [email_address]
      • @chrismattmann on Twitter
