Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address]  November 9, 2011
ApacheCon NA 2011 talk on Apache Tika 1.0.

Published in: Technology, Education

  1. 1. Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address] November 9, 2011
  2. 2. <ul><li>Apache Member involved in </li></ul><ul><ul><li>OODT (VP, PMC), Tika (VP,PMC), Nutch (PMC), Incubator (PMC), SIS (Mentor), Lucy (Mentor) and Gora (Champion), MRUnit (Mentor), Airavata (Mentor) </li></ul></ul><ul><li>Senior Computer Scientist at NASA JPL in Pasadena, CA USA </li></ul><ul><li>Software Architecture/Engineering Prof at Univ. of Southern California </li></ul>And you are?
  3. 3. Roadmap <ul><li>1st part of the talk </li></ul><ul><ul><li>Why Tika? </li></ul></ul><ul><ul><li>What is Tika? </li></ul></ul><ul><ul><li>What are the current versions of Tika? </li></ul></ul><ul><ul><li>What can it do? </li></ul></ul><ul><li>2nd part of the talk </li></ul><ul><ul><li>NASA Earth Science Data Systems </li></ul></ul><ul><ul><li>Data System Needs and Requirements </li></ul></ul><ul><ul><li>How does Tika help? </li></ul></ul>
  4. 4. The Information Landscape
  5. 5. Proliferation of content types available <ul><li>By some accounts, 16K to 51K content types* </li></ul><ul><li>What to do with content types? </li></ul><ul><ul><li>Parse them </li></ul></ul><ul><ul><ul><li>How? </li></ul></ul></ul><ul><ul><ul><li>Extract their text and structure </li></ul></ul></ul><ul><ul><li>Index their metadata </li></ul></ul><ul><ul><ul><li>In an indexing technology like Lucene, Solr, or in Google Appliance </li></ul></ul></ul><ul><ul><li>Identify what language they belong to </li></ul></ul><ul><ul><ul><li>Ngrams </li></ul></ul></ul>*http://filext.com/
  6. 6. Importance of content types
  7. 7. Importance of content type detection
  8. 8. Search Engine Architecture
  9. 9. Goals <ul><li>Identify and classify file types </li></ul><ul><ul><li>MIME detection </li></ul></ul><ul><ul><ul><li>Glob pattern </li></ul></ul></ul><ul><ul><ul><ul><li>*.txt </li></ul></ul></ul></ul><ul><ul><ul><ul><li>*.pdf </li></ul></ul></ul></ul><ul><ul><ul><li>URL </li></ul></ul></ul><ul><ul><ul><ul><li>http://…pdf </li></ul></ul></ul></ul><ul><ul><ul><ul><li>ftp://myfile.txt </li></ul></ul></ul></ul><ul><ul><ul><li>Magic bytes </li></ul></ul></ul><ul><ul><ul><li>Combination of the above means </li></ul></ul></ul><ul><li>Classification means reaction can be targeted </li></ul>
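The combination of detection methods listed on the Goals slide can be sketched in plain Java. This is a minimal illustration, not Tika's implementation: the class name `SimpleMimeDetector`, its two small lookup tables, and the "magic bytes first, then file name, then fallback" ordering are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy detector: magic bytes first, then name suffix, then a fallback. */
final class SimpleMimeDetector {
    // MIME type -> leading bytes expected at offset 0 (tiny illustrative table).
    private static final Map<String, byte[]> MAGIC = new LinkedHashMap<>();
    // File-name suffix (the "*.txt" style glob) -> MIME type.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();

    static {
        MAGIC.put("application/pdf", "%PDF-".getBytes(StandardCharsets.US_ASCII));
        MAGIC.put("image/png", new byte[] {(byte) 0x89, 'P', 'N', 'G'});
        SUFFIXES.put(".txt", "text/plain");
        SUFFIXES.put(".pdf", "application/pdf");
    }

    /** name may be a file name or a URL; head holds the first bytes of the content. */
    static String detect(String name, byte[] head) {
        for (Map.Entry<String, byte[]> e : MAGIC.entrySet()) {
            if (startsWith(head, e.getValue())) {
                return e.getKey(); // content evidence wins over the name
            }
        }
        if (name != null) {
            String lower = name.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
                if (lower.endsWith(e.getKey())) {
                    return e.getValue();
                }
            }
        }
        return "application/octet-stream"; // nothing matched
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("notes.txt", new byte[0]));   // text/plain
        System.out.println(detect("report.txt", pdfHead));      // application/pdf
        System.out.println(detect("ftp://host/data.bin", null)); // application/octet-stream
    }
}
```

Checking the content before the name means a mislabeled file (a PDF saved as `report.txt`) is still classified by what it actually contains.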
  10. 10. Tika is… <ul><li>A content analysis and detection toolkit </li></ul><ul><li>A set of Java APIs providing MIME type detection, language identification, and integration of various parsing libraries </li></ul><ul><li>A rich Metadata API for representing different Metadata models </li></ul><ul><li>A command line interface to the underlying Java code </li></ul><ul><li>A GUI for the Java code </li></ul>
  11. 11. Tika's (Brief) History <ul><li>Original idea for Tika came from Chris Mattmann and Jerome Charron in 2006 </li></ul><ul><li>Proposed as Lucene sub-project </li></ul><ul><ul><li>Others interested, didn't gain much traction </li></ul></ul><ul><li>Went the Incubator route in 2007 when Jukka Zitting found that there was a need for Tika capabilities in Apache Jackrabbit </li></ul><ul><ul><li>A Content Management System </li></ul></ul><ul><li>Graduated from the Incubator to Lucene sub-project in 2008 </li></ul><ul><li>Graduated to Apache TLP in April 2010 </li></ul><ul><li>40, 88 and 29 issues resolved in versions 1.0, 0.10, and 0.9 </li></ul>
  12. 12. Community <ul><li>Mailing lists </li></ul><ul><ul><li>User: 125 peeps, ~70 msg/mo. </li></ul></ul><ul><ul><li>Dev: 210 peeps, ~250 msg/mo. </li></ul></ul><ul><li>Committers/PMC </li></ul><ul><ul><li>13 peeps </li></ul></ul><ul><ul><li>Large majority of them active </li></ul></ul><ul><li>Releases </li></ul><ul><ul><li>11 releases so far </li></ul></ul><ul><ul><li>Just pushed out 1 point OH </li></ul></ul><ul><ul><ul><li>http://s.apache.org/N0I </li></ul></ul></ul>Credit: svnsearch.org
  13. 13. Use in the classroom <ul><li>Have used Apache Tika for the past 2 years in both my Search Engines/Information Retrieval class and my Software Architecture class </li></ul><ul><ul><li>Several student final projects have turned into contributions for the project and merit for the students </li></ul></ul><ul><li>Define data management projects that involve the use of OODT, and other technologies like Solr, Tika, Nutch, Hadoop, etc. </li></ul>
  14. 14. Some recent 1 point oh press
  15. 15. Getting started rapidly…like now! <ul><li>Download Tika from: </li></ul><ul><ul><li>http://tika.apache.org/download.html </li></ul></ul><ul><li>Grab tika-app-1.0.jar </li></ul><ul><li>alias tika "java -jar tika-app-1.0.jar" </li></ul><ul><li>tika < somefile.doc > extracted-text.xhtml </li></ul><ul><li>tika -m < somefile.doc > extracted.met </li></ul><ul><ul><ul><li>Works on Windows too (alias only on UNIX; the form shown is csh/tcsh syntax, bash needs alias tika='java -jar tika-app-1.0.jar') </li></ul></ul></ul>
  16. 16. A quick NASA dataset <ul><li>Atmospheric Infrared Sounder Mission (AIRS) </li></ul><ul><ul><li>Level 2 Cloud Clear Radiance Product </li></ul></ul><ul><ul><li>Grab it from here: </li></ul></ul><ul><ul><ul><li>ftp://airspar1u.ecs.nasa.gov/ftp/data/s4pa/Aqua_AIRS_Level2/AIRI2CCF.003/2007/005/ </li></ul></ul></ul><ul><ul><li>Just grab the first file </li></ul></ul><ul><ul><ul><li>java -jar tika-app-1.0.jar -m < AIRS.2007.01.05.001.L2.CC.v4.0.9.0.G07006021239.hdf </li></ul></ul></ul><ul><ul><li>Hopefully this worked for you; if not, blame… </li></ul></ul><ul><ul><ul><li>Windows </li></ul></ul></ul><ul><ul><ul><ul><li>And Bill Gates </li></ul></ul></ul></ul>
  17. 17. Detecting MIME types from Java <ul><li>String type = new Tika().detect(…) // detect() is an instance method; it accepts any of: </li></ul><ul><ul><li>java.io.InputStream </li></ul></ul><ul><ul><li>java.io.File </li></ul></ul><ul><ul><li>java.net.URL </li></ul></ul><ul><ul><li>java.lang.String </li></ul></ul>
  18. 18. Adding new MIME types <ul><li>Got XML? </li></ul><ul><li>Based on freedesktop.org spec (loosely) </li></ul>
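As an illustration of the XML this slide refers to, here is a sketch of a single type definition written in the style of Tika's `tika-mimetypes.xml`. The media type `application/x-hypothetical`, the `HYPO` magic string, and the `*.hypo` glob are invented for the example and do not correspond to a real format.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <!-- Hypothetical type: every value below is illustrative only -->
  <mime-type type="application/x-hypothetical">
    <_comment>Hypothetical science data format</_comment>
    <!-- Magic bytes: the file starts with the ASCII string "HYPO" -->
    <magic priority="50">
      <match value="HYPO" type="string" offset="0"/>
    </magic>
    <!-- Glob pattern: used when only the file name is available -->
    <glob pattern="*.hypo"/>
  </mime-type>
</mime-info>
```

Each entry pairs the two kinds of evidence from the Goals slide: a `magic` match on the leading bytes and a `glob` match on the file name.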
  19. 19. Many custom applications and tools <ul><li>You need this: to read this: </li></ul>
  20. 20. Third-party parsing libraries <ul><li>Most of the custom applications come with software libraries and tools to read/write these files </li></ul><ul><ul><li>Rather than re-invent the wheel, figure out a way to take advantage of them </li></ul></ul><ul><li>Parsing text and structure is a difficult problem </li></ul><ul><ul><li>Not all libraries parse text in equivalent manners </li></ul></ul><ul><ul><li>Some are faster than others </li></ul></ul><ul><ul><li>Some are more reliable than others </li></ul></ul>
  21. 21. Parsing <ul><li>String content = new Tika().parseToString(…) // instance method; returns all extracted text at once </li></ul><ul><ul><li>InputStream </li></ul></ul><ul><ul><li>File </li></ul></ul><ul><ul><li>URL </li></ul></ul>
  22. 22. Streaming Parsing <ul><li>Reader reader = new Tika().parse(…) // instance method; text is read incrementally from the Reader </li></ul><ul><ul><li>InputStream </li></ul></ul><ul><ul><li>File </li></ul></ul><ul><ul><li>URL </li></ul></ul>
  23. 23. Extraction of Metadata <ul><li>Important to follow common Metadata models </li></ul><ul><ul><li>Dublin Core – any electronic resource </li></ul></ul><ul><ul><li>XMP – also general like Dublin Core </li></ul></ul><ul><ul><li>Word Metadata – specific to .doc, .ppt, etc. </li></ul></ul><ul><ul><li>EXIF – image related </li></ul></ul><ul><li>Lots of standards and models out there </li></ul><ul><ul><li>The use and extraction of common models allows for content intercomparison </li></ul></ul><ul><ul><li>They also standardize mechanisms for searching </li></ul></ul><ul><ul><li>You always know for X file type that field Y is there and of type String or Int or Date </li></ul></ul>
  24. 24. Cancer Research Example
  25. 25. Cancer Research Example Attributes Relationships Credit: A. Hart
  26. 26. Tika Sponsoring the Any23 Project <ul><li>Tika PMC is sponsoring the Any23 project in the Incubator (entered: 10/1/2011) </li></ul><ul><li>Any23 = “Anything to Triples” </li></ul><ul><li>Semantic Toolkit for parsing, identification of all major semantic web content types (RDF, etc.) </li></ul><ul><li>Related to Apache Jena </li></ul><ul><li>Looking for synergies between 2 efforts </li></ul>
  27. 27. Metadata <ul><li>Metadata met = new Metadata(); </li><li>met.set(Metadata.FORMAT, "text/html"); // Dublin Core </li><li>met.add(Metadata.FORMAT, "text/plain"); // multi-valued: add() appends, set() would overwrite </li><li>System.out.println(java.util.Arrays.toString(met.getValues(Metadata.FORMAT))); </li></ul><ul><li>Other met models supported (HTTP Headers, Word, Creative Commons, Climate Forecast, etc.) </li></ul><ul><ul><li>Run: tika --list-met-models </li></ul></ul>
  28. 28. Methods for language identification <ul><li>N-grams </li></ul><ul><ul><li>Method of detecting the next character or set of characters in a sequence </li></ul></ul><ul><ul><li>Useful in determining whether small snippets of text come from a particular language or character set </li></ul></ul><ul><li>Non-computational approaches </li></ul><ul><ul><li>Tagging </li></ul></ul><ul><ul><li>Looking for common words or characters </li></ul></ul>
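The N-gram approach can be sketched with character trigrams in plain Java. This is a simplified illustration and not the analysis that ships with Tika: the class name `TrigramLanguageGuesser`, the cosine-similarity scoring, and the idea of building profiles from short sample sentences are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical toy language guesser based on character trigram counts. */
final class TrigramLanguageGuesser {

    /** Counts overlapping 3-character windows; runs of non-letters become "_". */
    static Map<String, Integer> profile(String text) {
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two trigram count vectors (0 when either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the key of the most similar profile, or "unknown" if nothing overlaps. */
    static String guess(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> target = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = cosine(target, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        // Tiny sample sentences stand in for real training corpora.
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and the cat is on the table"));
        profiles.put("de", profile("der schnelle braune fuchs springt und der hund ist auf dem tisch"));
        System.out.println(guess("the dog is on the table", profiles));
        System.out.println(guess("der hund ist auf dem tisch", profiles));
    }
}
```

Real profiles are built from much larger corpora; with samples this small the guess is only reliable for text that closely resembles the training sentences.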
  29. 29. Language Detection <ul><li>LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile( FileUtils.readFileToString(new File(filename)))); </li></ul><ul><li>System.out.println(lang.getLanguage()); </li></ul><ul><li>Uses Ngram analysis included with Tika </li></ul><ul><ul><li>Originating from Nutch </li></ul></ul><ul><ul><li>Can be improved </li></ul></ul>
  30. 30. Running Tika in GUI form <ul><li>tika --gui </li></ul><html xmlns:html= “…”> <body> … </body> </html>
  31. 31. Integrating Tika into your App <ul><li>Maven </li></ul><ul><li>Ant </li></ul><ul><li>Eclipse </li></ul><ul><li>It's just a set of jars </li></ul><ul><ul><li>tika-core </li></ul></ul><ul><ul><li>tika-parsers </li></ul></ul><ul><ul><li>tika-app </li></ul></ul><ul><ul><li>tika-bundle </li></ul></ul><ul><ul><li>tika-server </li></ul></ul>
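For the Maven route, a dependency declaration along these lines is typical. It is a sketch that assumes the 1.0 release discussed in this talk and a project that wants the full parser set; `tika-parsers` brings in `tika-core` transitively, so a project that only needs detection could depend on `tika-core` alone.

```xml
<!-- Goes inside the <dependencies> section of pom.xml -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.0</version>
</dependency>
```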
  32. 32. Some really great stuff in 1.0 <ul><li>Super improved OSGi support </li></ul><ul><ul><li>New tika-bundle module </li></ul></ul><ul><li>Improved RTF parsing support, OO support, and parsing of Outlook email attachments </li></ul><ul><li>Language Detection for Belarusian, Catalan, Esperanto, Galician, Lithuanian, Romanian, Slovak, Slovenian, and Ukrainian </li></ul><ul><li>Improved PDF parsing (extract annotation) </li></ul>NICK ALREADY TALKED ABOUT THIS!!! Thunder stolen
  33. 33. Things to watch out for <ul><li>Deprecated APIs->gone </li></ul><ul><ul><li>Recompile code </li></ul></ul><ul><li>No more JDK 1.4 version of Tika </li></ul><ul><ul><li>Upgrade </li></ul></ul>
  34. 34. Improvements to Tika <ul><li>Adding more parsers for content types </li></ul><ul><li>Improve the JAX-RS server support </li></ul><ul><li>Expanding ability to handle random access file parsing </li></ul><ul><ul><li>Scientific data file formats, some work on this </li></ul></ul><ul><ul><li>Leverage improvements in file representation TIKA-701, TIKA-654, TIKA-645, TIKA-153 </li></ul></ul><ul><li>Geospatial parsing support through GDAL </li></ul><ul><li>Improving language and charset detection </li></ul>
  35. 35. Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
  36. 36. NASA Ground Data Systems Credit: D. Woollard
  37. 37. Context <ul><li>NASA develops science data processing systems for multiple earth science missions </li></ul><ul><li>These systems convert the instrument telemetry delivered to earth from space into useful data for scientific research </li></ul><ul><li>Typical characteristics </li></ul><ul><ul><li>Remote sensing instruments that orbit the Earth multiple times daily </li></ul></ul><ul><ul><li>Data are acquired constantly </li></ul></ul><ul><ul><li>Complex algorithms convert instrument measurements to geophysical quantities </li></ul></ul>
  38. 38. The Square Kilometer Array <ul><li>1 sq. km of antennas </li></ul><ul><li>Never-before seen resolution looking into the sky </li></ul><ul><li>700 TB </li></ul><ul><ul><li>Per second! </li></ul></ul>
  39. 39. NASA DESDynI Mission <ul><li>16 TB/day </li></ul><ul><li>Geographically distributed </li></ul><ul><li>10s of 1000s of jobs per day </li></ul><ul><li>Tier 1 Earth Science Decadal Mission </li></ul>
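To put the 16 TB/day figure in terms of sustained throughput, a rough conversion follows. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even data flow over 24 hours, neither of which is stated on the slide.

```latex
\[
\frac{16 \times 10^{12}\ \text{B}}{86\,400\ \text{s}}
\approx 1.85 \times 10^{8}\ \text{B/s}
\approx 185\ \text{MB/s}
\]
```

So the ingest, cataloging, and archiving pipeline would need to keep up with roughly 185 MB/s around the clock, before accounting for any bursts.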
  40. 40. Some Considerations <ul><li>Scale </li></ul><ul><ul><li>Data throughput rates </li></ul></ul><ul><ul><li># of data types </li></ul></ul><ul><ul><li># of metadata types </li></ul></ul><ul><ul><li># of users to send the data to </li></ul></ul><ul><li>Federation </li></ul><ul><ul><li>Must leave the data where it is </li></ul></ul><ul><ul><li>Socio/Economic/Political </li></ul></ul><ul><li>Heterogeneity </li></ul><ul><ul><li>Technology, data formats, skills! </li></ul></ul>
  41. 41. Apache OODT <ul><li>We've got some components to deal with these issues </li></ul>
  42. 42. How are we building these systems now? <ul><li>Allow for push/pull of data over arbitrary protocols - Ingestion builds std catalog and archive </li></ul><ul><li>Deliver product metadata to search, portal or GIS </li></ul><ul><li>Plug in arbitrary met extractors </li></ul>
  43. 43. How are we building these systems now? <ul><li>Separation of file management from workflow management </li></ul><ul><li>Allow for heterogeneous computing resources </li></ul><ul><li>Easily integrate PGEs </li></ul><ul><li>Leverages same ingestion crawler </li></ul>
  44. 44. What does this have to do with Tika? Metadata Ext: TIKA! Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
  45. 45. What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
  46. 46. Science Data File Formats <ul><li>Hierarchical Data Format (HDF) </li></ul><ul><ul><li>http://www.hdfgroup.org </li></ul></ul><ul><ul><li>Versions 4 and 5 </li></ul></ul><ul><ul><li>Lots of NASA data is in 4, newer NASA data in 5 </li></ul></ul><ul><ul><li>Encapsulates </li></ul></ul><ul><ul><ul><li>Observation (Scalars, Vectors, Matrices, NxMxZ…) </li></ul></ul></ul><ul><ul><ul><li>Metadata (Summary info, date/time ranges, spatial ranges) </li></ul></ul></ul><ul><ul><li>Custom readers/writers/APIs in many languages </li></ul></ul><ul><ul><ul><li>C/C++, Python, Java </li></ul></ul></ul>
  47. 47. Science Data File Formats <ul><li>network Common Data Form (netCDF) </li></ul><ul><ul><li>www.unidata.ucar.edu/software/netcdf/ </li></ul></ul><ul><ul><li>Versions 3 and 4 </li></ul></ul><ul><ul><li>Heavily used in DOE, NOAA, etc. </li></ul></ul><ul><ul><li>Encapsulates </li></ul></ul><ul><ul><ul><li>Observation (Scalars, Vectors, Matrices, NxMxZ…) </li></ul></ul></ul><ul><ul><ul><li>Metadata (Summary info, date/time ranges, spatial ranges) </li></ul></ul></ul><ul><ul><li>Custom readers/writers/APIs in many languages </li></ul></ul><ul><ul><ul><li>C/C++, Python, Java </li></ul></ul></ul><ul><ul><li>Not a hierarchical representation: all flat </li></ul></ul>
  48. 48. So how does it work? <ul><li>Ingestion </li></ul><ul><ul><li>Science data files, ancillary information from other missions, etc., arrive in NetCDF or HDF format </li></ul></ul><ul><ul><li>Need to extract their met, catalog and archive them, etc. </li></ul></ul><ul><ul><ul><li>Can now use Tika to do this! TIKA-399 and TIKA-400 added this capability </li></ul></ul></ul><ul><li>Processing </li></ul><ul><ul><li>Processors (PGEs) generate NetCDF and HDF, must extract met, catalog and archive </li></ul></ul>
  49. 49. Tool support <ul><li>Entire stacks of tools written around these formats </li></ul><ul><ul><li>OPeNDAP, LAS, readers, writers, custom NASA mission toolkits </li></ul></ul><ul><ul><li>OGC </li></ul></ul><ul><ul><ul><li>WMS, WCS, etc. </li></ul></ul></ul><ul><ul><li>Unique, one-of-a-kind software built around these data file formats </li></ul></ul><ul><li>Apache can contribute strongly in this area! </li></ul>
  50. 50. Besides processing science files <ul><li>… Tika also helps with </li></ul><ul><li>MIME identification </li></ul><ul><ul><li>Useful in remote file acquisition </li></ul></ul><ul><ul><li>Useful in classification (catalog/archive) of existing content </li></ul></ul><ul><ul><li>Useful in crawling (see my Nutch talk last year: http://s.apache.org/UvU) </li></ul></ul><ul><li>Language identification </li></ul><ul><ul><li>Can be useful when data is coming from around the world and we need to quickly identify whether or not we can process it </li></ul></ul>
  51. 51. Big Goal <ul><li>More closely link OODT and Tika </li></ul><ul><ul><li>Add new parser to Tika </li></ul></ul><ul><ul><li>Easily get OODT met extractor based on it </li></ul></ul><ul><li>Contribute back some features still baking in OODT </li></ul><ul><ul><li>Configuration aspects of parsing </li></ul></ul><ul><ul><li>File types and extensions for science data files </li></ul></ul><ul><li>Spatial </li></ul><ul><ul><li>Some work done in my CS572 class on spatial parser for Tika – would be great to integrate with Tika, OODT, SIS, and Solr </li></ul></ul>
  52. 52. NASA Geo Challenges <ul><li>Sometimes the data isn't annotated with lat and lon </li></ul><ul><ul><li>How to discover this? </li></ul></ul><ul><li>Even when the data is annotated with spatial information, computation of e.g., bounding box around the poles is difficult </li></ul><ul><li>Efficiency and speed are difficult since data is at scale </li></ul>
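One way to see why bounding boxes near the poles are hard is to try the obvious approach. The sketch below is an illustration written for this transcript, not code from OODT or Tika: the class `NaiveBoundingBox` and the four-point footprint are hypothetical. It takes the min/max of latitude and longitude for a footprint that circles the North Pole and shows that the resulting box excludes the pole itself.

```java
/** Hypothetical naive lat/lon box: min and max of each coordinate, no wraparound handling. */
final class NaiveBoundingBox {
    final double minLat, maxLat, minLon, maxLon;

    /** Each point is {latitude, longitude} in degrees. */
    NaiveBoundingBox(double[][] latLonPoints) {
        double loLat = Double.POSITIVE_INFINITY, hiLat = Double.NEGATIVE_INFINITY;
        double loLon = Double.POSITIVE_INFINITY, hiLon = Double.NEGATIVE_INFINITY;
        for (double[] p : latLonPoints) {
            loLat = Math.min(loLat, p[0]);
            hiLat = Math.max(hiLat, p[0]);
            loLon = Math.min(loLon, p[1]);
            hiLon = Math.max(hiLon, p[1]);
        }
        minLat = loLat;
        maxLat = hiLat;
        minLon = loLon;
        maxLon = hiLon;
    }

    boolean contains(double lat, double lon) {
        return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
    }

    public static void main(String[] args) {
        // Footprint corners on the 85N parallel, surrounding the North Pole.
        double[][] ring = {{85, 0}, {85, 90}, {85, 180}, {85, -90}};
        NaiveBoundingBox box = new NaiveBoundingBox(ring);
        System.out.println(box.contains(90, 0));    // false: the pole lies outside the box
        System.out.println(box.contains(88, -135)); // false: inside the footprint, outside the box
    }
}
```

The footprint encloses everything north of its corners, yet the min/max box is a zero-height strip at 85°N, so a correct answer needs special handling for polar (and longitude wraparound) cases.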
  53. 53. Acknowledgements <ul><li>Some Tika material inspired by Jukka Zitting's talks </li></ul><ul><ul><li>http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika </li></ul></ul><ul><ul><li>http://www.slideshare.net/jukka/text-and-metadata-extraction-with-apache-tika-4427630 </li></ul></ul><ul><li>NASA Jet Propulsion Laboratory </li></ul><ul><ul><li>OODT Team </li></ul></ul>
  54. 54. Book <ul><li>Jukka and I have finished the first definitive guide on Apache Tika </li></ul><ul><li>Official release date: 11/17 </li></ul><ul><li>Early Access available through MEAP program </li></ul><ul><li>http://manning.com/mattmann/ </li></ul>
  55. 55. Alright, I'll shut up now <ul><li>Any questions? </li></ul><ul><li>THANK YOU! </li></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>[email_address] </li></ul></ul><ul><ul><li>@chrismattmann on Twitter </li></ul></ul>