Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address]  November 9, 2011
Apache Member involved in OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor) Senior Computer Scientist at NASA JPL in Pasadena, CA USA Software Architecture/Engineering Prof at Univ. of Southern California  And you are?
Roadmap 1st part of the talk Why Tika? What is Tika? What are the current versions of Tika? What can it do? 2nd part of the talk NASA Earth Science Data Systems Data System Needs and Requirements How does Tika help?
The Information Landscape
Proliferation of content types available By some accounts, 16K to 51K content types* What to do with content types? Parse them How? Extract their text and structure Index their metadata In an indexing technology like Lucene, Solr, or in Google Appliance Identify what language they belong to Ngrams *http://filext.com/
Importance of content types
Importance of content type detection
Search Engine Architecture
Goals Identify and classify file types MIME detection Glob pattern *.txt *.pdf URL http://…pdf ftp://myfile.txt Magic bytes Combination of the above means Classification means reaction can be targeted
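The combination of glob patterns and magic bytes can be sketched in plain Java. This is a hypothetical illustration only, not Tika's detector: the class name `MimeSketch`, the two signatures, and the rule that magic bytes take priority over the file name are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of combined detection; not Tika's implementation. */
final class MimeSketch {
    // Magic-byte signatures: leading bytes of the content -> media type.
    private static final Map<byte[], String> MAGIC = new LinkedHashMap<>();
    // Glob-style patterns such as *.pdf, stored as suffix -> media type.
    private static final Map<String, String> GLOBS = new LinkedHashMap<>();

    static {
        MAGIC.put(new byte[] {'%', 'P', 'D', 'F', '-'}, "application/pdf");
        MAGIC.put(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png");
        GLOBS.put(".pdf", "application/pdf");
        GLOBS.put(".txt", "text/plain");
    }

    /** Tries magic bytes first (assumed priority), then the name or URL suffix. */
    static String detect(String name, byte[] head) {
        for (Map.Entry<byte[], String> e : MAGIC.entrySet()) {
            byte[] sig = e.getKey();
            if (head.length >= sig.length
                    && Arrays.equals(Arrays.copyOf(head, sig.length), sig)) {
                return e.getValue();
            }
        }
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : GLOBS.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        // Nothing matched: fall back to the generic binary type.
        return "application/octet-stream";
    }
}
```

With this ordering, a PDF that was saved under a `.txt` name is still classified by its content, which is the point of combining the means rather than trusting any single one.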
Tika is… A content analysis and detection toolkit A set of Java APIs providing MIME type detection, language identification, integration of various parsing libraries A rich Metadata API for representing different Metadata models A command line interface to the underlying Java code A GUI interface to the Java code
Tika's (Brief) History Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 Proposed as Lucene sub-project Others interested, didn't gain much traction Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit A Content Management System Graduated from the Incubator to Lucene sub-project in 2008 Graduated to Apache TLP in April 2010 40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9
Community Mailing lists User: 125 peeps, ~70 msg/mo. Dev: 210 peeps, ~250 msg/mo. Committers/PMC 13 peeps Large majority of them active Releases 11 releases so far Just pushed out 1 point OH http://s.apache.org/N0I   Credit: svnsearch.org
Use in the classroom Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class Several student final projects have turned into contributions for the project and merit for the students Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc.
Some recent 1 point oh press
Getting started rapidly…like now! Download Tika from: http://tika.apache.org/download.html Grab tika-app-1.0.jar alias tika="java -jar tika-app-1.0.jar" tika < somefile.doc > extracted-text.xhtml tika -m < somefile.doc > extracted.met Works on Windows too (alias only on UNIX)
A quick NASA dataset Atmospheric Infrared Sounder Mission (AIRS) Level 2 Cloud Clear Radiance Product Grab it from here: ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/ Just grab the first file java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf Hopefully this worked for you, if not, blame.. Windows And Bill Gates
Detecting MIME types from Java Tika tika = new Tika(); String type = tika.detect(…) java.io.InputStream java.io.File java.net.URL java.lang.String
Adding new MIME types Got XML? Based on freedesktop.org spec  (loosely)
Many custom applications and tools You need this:  to read this:
Third-party parsing libraries Most of the custom applications come with software libraries and tools to read/write these files Rather than re-invent the wheel, figure out a way to take advantage of them Parsing text and structure is a difficult problem Not all libraries parse text in equivalent manners Some are faster than others Some are more reliable than others
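One way to take advantage of existing libraries rather than re-invent them is to hide each one behind a common interface and dispatch on the detected media type. The sketch below is a hypothetical illustration of that idea in plain Java; the class `ParserRegistrySketch` and its method names are invented for this example and are not Tika's parser API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: one text-extraction entry point, many underlying libraries. */
final class ParserRegistrySketch {
    // Media type -> adapter that wraps some third-party library (assumed shape).
    private final Map<String, Function<byte[], String>> parsers = new HashMap<>();

    /** Registers an adapter for one media type. */
    void register(String mediaType, Function<byte[], String> parser) {
        parsers.put(mediaType, parser);
    }

    /** Delegates to whichever adapter was registered for the media type. */
    String extractText(String mediaType, byte[] content) {
        Function<byte[], String> parser = parsers.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.apply(content);
    }
}
```

Callers only see `extractText`, so a slower or less reliable library can be swapped for another behind the same media type without touching the calling code.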
Parsing Tika tika = new Tika(); String content = tika.parseToString(…) InputStream File URL
Streaming Parsing Tika tika = new Tika(); Reader reader = tika.parse(…) InputStream File URL
Extraction of Metadata Important to follow common Metadata models Dublin Core – any electronic resource XMP – also general like Dublin Core Word Metadata – specific to .doc, .ppt, etc. EXIF – image related Lots of standards and models out there The use and extraction of common models allows for content intercomparison All standardize mechanisms for searching You always know for X file type that field Y is there and of type String or Int or Date
Cancer Research Example
Cancer Research Example Attributes Relationships Credit: A. Hart
Tika Sponsoring the Any23 Project Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011) Any23 = “Anything to Triples” Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.) Related to Apache Jena Looking for synergies between 2 efforts
Metadata Metadata met = new Metadata(); // Dublin Core met.set(Metadata.FORMAT, "text/html"); // multi-valued: add() appends, set() would replace met.add(Metadata.FORMAT, "text/plain"); System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT))); Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.) Run: tika --list-met-models
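The difference between replacing a field and appending to a multi-valued field can be sketched with a plain map of lists. This is a hypothetical stand-in written for illustration, not the Tika Metadata class; the name `MultiValuedMetSketch` and its internals are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a multi-valued metadata container. */
final class MultiValuedMetSketch {
    private final Map<String, List<String>> fields = new HashMap<>();

    /** Replaces any existing values of the field with a single value. */
    void set(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        fields.put(name, values);
    }

    /** Appends a value, keeping the values that are already there. */
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns a copy of all values of the field, or an empty list. */
    List<String> getValues(String name) {
        return new ArrayList<>(fields.getOrDefault(name, new ArrayList<>()));
    }
}
```

Calling `set` and then `add` on the same field leaves two values; calling `set` twice leaves only the second one.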
Methods for language identification N-grams Method of detecting next character or set of characters in a sequence Useful in determining whether small snippets of text come from a particular language, or character set Non-computational approaches Tagging Looking for common words or characters
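The N-gram idea can be sketched in a few lines: build a set of character trigrams per language from sample text, then pick the language whose profile shares the most trigrams with the input. This is a deliberately minimal illustration under stated assumptions, not the algorithm shipped with Tika; the class `NgramLangSketch`, the use of trigrams, and the overlap-count scoring are all choices made for the example.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of N-gram language identification. */
final class NgramLangSketch {
    // Language code -> set of character trigrams seen in its training sample.
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    /** Splits lower-cased text into overlapping three-character windows. */
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    /** Builds (or replaces) the profile for one language. */
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    /** Returns the language sharing the most trigrams, or "unknown" if none match. */
    String identify(String text) {
        Set<String> input = trigrams(text);
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> shared = new HashSet<>(input);
            shared.retainAll(e.getValue());
            if (shared.size() > bestHits) {
                bestHits = shared.size();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Real profiles are built from far larger corpora and usually weight trigrams by frequency; the sketch only shows why short snippets can still be classified.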
Language Detection LanguageIdentifier lang =  new LanguageIdentifier(new LanguageProfile( FileUtils.readFileToString(new File(filename)))); System.out.println(lang.getLanguage()); Uses Ngram analysis included with Tika Originating from Nutch Can be improved
Running Tika in GUI form tika --gui <html xmlns:html= “…”> <body> … </body> </html>
Integrating Tika into your App Maven Ant Eclipse It's just a set of jars tika-core tika-parsers tika-app tika-bundle tika-server
Some really great stuff in 1.0 Super improved OSGi support New tika-bundle module Improved RTF parsing support, OO support, and parsing of Outlook email attachments Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian Improved PDF parsing (extract annotation) NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen
Things to watch out for  Deprecated APIs->gone Recompile code No more JDK 1.4 version  of Tika Upgrade
Improvements to Tika Adding more parsers for content types Improve the JAX-RS server support Expanding ability to handle random access file parsing Scientific data file formats, some work on this Leverage improvements in file representation  TIKA-701, TIKA-654, TIKA-645, TIKA-153 Geospatial parsing support through GDAL Improving language and charset detection
Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
NASA Ground Data Systems Credit: D. Woollard
Context NASA develops science data processing systems for multiple earth science missions These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research Typical characteristics Remote sensing instruments that orbit the Earth multiple times daily Data are acquired constantly Complex algorithms convert instrument measurements to geophysical quantities
The Square Kilometer Array 1 sq. km of antennas Never-before seen  resolution  looking into the sky 700 TB Per second!
NASA DESDynI Mission 16 TB/day Geographically distributed 10s of 1000s of jobs per day Tier 1 Earth Science Decadal Mission
Some Considerations Scale Data throughput rates # of data types # of metadata types # of users to send the data to Federation Must leave the data where it is Socio/Economic/Political Heterogeneity Technology, data formats, skills!
Apache OODT We've got some components to deal with these issues
How are we building these systems now? Allow for push/pull of data over arbitrary protocols - Ingestion builds std catalog and archive Deliver product metadata to search, portal or GIS Plug in arbitrary met extractors
How are we building these systems now? Separation of file management from workflow management Allow for heterogeneous computing resources Easily integrate PGEs Leverages same ingestion crawler
What does this have to do with Tika? Metadata Ext: TIKA! Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
Science Data File Formats Hierarchical Data Format (HDF) http://www.hdfgroup.org   Versions 4 and 5 Lots of NASA data is in 4, newer NASA data in 5 Encapsulates  Observation (Scalars, Vectors, Matrices, NxMxZ…) Metadata (Summary info, date/time ranges, spatial ranges) Custom readers/writers/APIs in many languages C/C++, Python, Java
Science Data File Formats Network Common Data Form (netCDF) www.unidata.ucar.edu/software/netcdf/ Versions 3 and 4 Heavily used in DOE, NOAA, etc. Encapsulates Observation (Scalars, Vectors, Matrices, NxMxZ…) Metadata (Summary info, date/time ranges, spatial ranges) Custom readers/writers/APIs in many languages C/C++, Python, Java Not Hierarchical representation: all flat
So how does it work? Ingestion Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format Need to extract their met, catalog and archive them, etc. Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability  Processing Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive
Tool support Entire stacks of tools written around these formats OPeNDAP, LAS, readers, writers, custom NASA mission toolkits OGC WMS, WCS, etc. Unique, one-of-a-kind software built around these data file formats Apache can contribute strongly in this area!
Besides processing science files … Tika also helps with MIME identification Useful in remote file acquisition Useful in classification (catalog/archive) of existing content Useful in crawling see my Nutch talk last year  http://s.apache.org/UvU Language identification Can be useful when data is coming from around the world, but need to quickly identify whether or not we can process it
Big Goal More closely link OODT and Tika Add new parser to Tika Easily get OODT met extractor based on it Contribute back some features still baking in OODT Configuration aspects of parsing File types and extensions for science data files Spatial Some work done in my CS572 class on spatial parser for Tika – would be great to integrate  with Tika, OODT, SIS, and Solr
NASA Geo Challenges Sometimes the data isn't annotated with lat and lon How to discover this? Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult Efficiency and speed are difficult since data is at scale
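Part of the bounding-box difficulty is that longitude wraps around: a naive min/max over points at 170° and -170° yields a box spanning almost the whole globe instead of a 20° strip. A minimal sketch of one common workaround follows, assuming longitudes are given in degrees within [-180, 180]; the class `LonExtentSketch` and its return convention are hypothetical. It does not solve the polar case, where data circling a pole covers every longitude and no meaningful east/west bounds exist.

```java
import java.util.Arrays;

/** Hypothetical sketch: smallest longitude range covering a set of points. */
final class LonExtentSketch {
    /**
     * Returns {west, east} in degrees. If west is greater than east,
     * the box crosses the antimeridian (the 180-degree line).
     */
    static double[] lonExtent(double[] lonsDeg) {
        if (lonsDeg.length == 0) {
            throw new IllegalArgumentException("no points");
        }
        double[] s = lonsDeg.clone();
        Arrays.sort(s);
        int n = s.length;
        // Start with the gap that wraps from the largest longitude to the smallest.
        double bestGap = s[0] + 360.0 - s[n - 1];
        double west = s[0];
        double east = s[n - 1];
        // The box is the complement of the largest empty gap between neighbours.
        for (int i = 1; i < n; i++) {
            double gap = s[i] - s[i - 1];
            if (gap > bestGap) {
                bestGap = gap;
                west = s[i];
                east = s[i - 1];
            }
        }
        return new double[] {west, east};
    }
}
```

Sorting makes this O(n log n) per granule, which is one reason efficiency becomes a concern when the same computation has to run over data at scale.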
Acknowledgements Some Tika material inspired by Jukka Zitting's talks http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630 NASA Jet Propulsion Laboratory OODT Team
Book Jukka and I have finished the first definitive guide  on Apache Tika Official release date: 11/17 Early Access available through MEAP program http://manning.com/mattmann/
Alright, I'll shut up now Any questions? THANK YOU! [email_address] [email_address] @chrismattmann on Twitter
