
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison

Published in: Data & Analytics

  1. Apache Tika. Tim Allison, tallison@apache.org, @_tallison. April 24, 2019, Haystack Conference. © 2019 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
  2. Overview
     ▪ What is Tika
     ▪ tika-eval
     ▪ Running Tika safely
     ▪ Coming out in 1.21 and beyond
  3. Text/Metadata Extraction
  4. Things Can Happen
     ▪ Tired:
       – Exceptions
       – Unsupported file formats
       – Encrypted files
       – Garbled text
       – Missing text
     ▪ Wired:
       – OOM
       – Seg fault
       – Infinite loops
       – Multithreaded garbage collector pegging all CPU resources
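The "wired" failures above (OOM, seg faults, infinite loops) cannot be caught with an ordinary exception handler inside the same process. A minimal sketch of one common defense, assuming the extractor can be launched as a separate command: run it in a child process with a timeout and classify the outcome. The function name `run_isolated` and the status labels are illustrative, not part of Tika.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s=10.0):
    """Run an extraction command in a child process and classify the result.

    `cmd` is whatever launches the extractor (hypothetical here).
    Returns a (status, output) tuple; status is one of
    "ok", "timeout", "killed", or "error".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a hung
        # parser (for example, an infinite loop) does not linger.
        return ("timeout", "")
    if proc.returncode < 0:
        # On POSIX a negative return code means the child died from a
        # signal, e.g. a segmentation fault in native code.
        return ("killed", proc.stderr)
    if proc.returncode != 0:
        # Non-zero exit: the extractor reported a failure of its own.
        return ("error", proc.stderr)
    return ("ok", proc.stdout)


if __name__ == "__main__":
    # Stand-in commands; a real pipeline would launch the extractor here.
    print(run_isolated([sys.executable, "-c", "print('extracted text')"]))
    print(run_isolated([sys.executable, "-c", "while True: pass"], timeout_s=1.0))
```

Because the parent only ever sees a status and captured output, one bad file costs at most `timeout_s` seconds rather than taking down the whole indexing job.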
  5. Stands up on Soap Box
  6. Upgrade from PDFBox 1.8.6 -> 1.8.7
  7. Soap Box: If your search system can’t tell the difference between those two…
  8. Soap Box: If your search system can’t tell the difference between those two… You don’t have a search system.
  9. Soap Box: If your search system can’t tell the difference between those two… You don’t have a search system. 👍 You’ve got a neat little demo! 👍
  10. Steps Off of Soap Box
  11. tika-eval
     ▪ Profile individual runs
     ▪ Compare two runs
     ▪ Exceptions by mime
     ▪ Out of vocabulary (OOV) statistics
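To make the out-of-vocabulary idea concrete, here is a minimal sketch of an OOV rate: the share of tokens in an extract that do not appear in a reference word list. This is not tika-eval's implementation; the tokenizer, the function names, and the tiny vocabulary are assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def oov_rate(text, vocabulary):
    """Return the fraction of tokens not found in `vocabulary`.

    Returns None when the text has no tokens at all, so that an empty
    extract is reported separately from a garbled one.
    """
    tokens = tokenize(text)
    if not tokens:
        return None
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


if __name__ == "__main__":
    # Placeholder vocabulary; a real one would hold common words
    # for each language of interest.
    vocab = {"the", "quick", "brown", "fox"}
    print(oov_rate("The quick brown fox", vocab))      # clean extract
    print(oov_rate("Th e qu ick br own fox", vocab))   # broken word boundaries
```

A sharp rise in this rate between two parser versions is a hint that text has become garbled even though no exception was thrown.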
  12. tika-eval: Eating our own dog food
     ▪ 3 million files (~1 TB) from Common Crawl and govdocs1, hosted on a public virtual machine provided by Rackspace
     ▪ Code to profile a single run or compare two runs before release
     ▪ Evaluation methodology co-developed with, and now co-run by, open source colleagues around the world on the MSOffice parser project and the PDF parser project
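Comparing two runs before a release can be sketched as a per-file similarity check between the text extracted by the old and new versions. The sketch below uses a Dice coefficient over whitespace-split token counts and flags files whose score falls under a threshold. The function names, the threshold, and the sample data are assumptions for illustration, not tika-eval's actual code.

```python
from collections import Counter


def dice(tokens_a, tokens_b):
    """Dice coefficient over token multisets: 2 * overlap / total size."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty extracts are treated as identical
    overlap = sum((a & b).values())  # multiset intersection
    return 2 * overlap / total


def compare_runs(run_a, run_b, threshold=0.9):
    """Return (file_id, score) pairs for files whose extracts diverge.

    `run_a` and `run_b` map a file id to the text extracted in each run;
    a file missing from one run is compared against an empty extract.
    """
    flagged = []
    for file_id in sorted(set(run_a) | set(run_b)):
        score = dice(run_a.get(file_id, "").split(),
                     run_b.get(file_id, "").split())
        if score < threshold:
            flagged.append((file_id, round(score, 3)))
    return flagged


if __name__ == "__main__":
    old = {"a.pdf": "hello world foo bar", "b.docx": "same text here"}
    new = {"a.pdf": "hello world", "b.docx": "same text here"}
    print(compare_runs(old, new))  # a.pdf lost half of its tokens
```

Sorting the flagged files by score gives reviewers a short list to inspect by hand instead of millions of extracts.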
  13. Tika 1.21 and beyond
     ▪ Tika 1.21
       – csv/tsv detector and parser (Apache commons-csv)
       – Improved detection and parsing of zip-based files (.docx, .pptx, .xlsx)
     ▪ Beyond
       – Modularize tika-eval and include stats within the extract, for scalability and for aggregation of stats within Solr/Elastic
       – Increase coverage and speed of zip-based file detection; can we move entirely to streaming detection?
       – Improve language coverage and the language identification component within tika-eval
     ▪ Help!
       – What do you need?
       – How can you help us help you?
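Zip-based detection is tricky because .docx, .pptx, and .xlsx are all ordinary zip archives on the outside. A minimal sketch of the idea, assuming the conventional main part names inside each format: open the archive and look for a marker entry. This is not Tika's detection logic, and the names `OOXML_MARKERS` and `detect_zip_based` are hypothetical.

```python
import io
import zipfile

# Conventional main part names mapped to their MIME types (assumption:
# a real detector would consult the package's content types and
# relationships rather than rely on fixed paths).
OOXML_MARKERS = {
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def detect_zip_based(data: bytes) -> str:
    """Guess a MIME type for zip-based content from its entry names."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return "application/octet-stream"
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
    for marker, mime in OOXML_MARKERS.items():
        if marker in names:
            return mime
    return "application/zip"  # a plain archive, or a format not covered here


if __name__ == "__main__":
    # Build a tiny in-memory archive that mimics a .docx layout.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("word/document.xml", "<document/>")
    print(detect_zip_based(buffer.getvalue()))
```

Note that `zipfile` reads the central directory at the end of the archive, so this sketch needs the whole file in hand. That limitation is exactly what makes the slide's question about fully streaming detection an interesting one.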
