Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison

OpenSource Connections
© 2019 The MITRE Corporation. All rights reserved.
Apache Tika
Tim Allison
tallison@apache.org, @_tallison
April 24, 2019
Haystack Conference
Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
Text/Metadata Extraction
Things Can Happen
▪ Tired:
– Exceptions
– Unsupported file formats
– Encrypted files
– Garbled text
– Missing text
▪ Wired:
– OOM
– Seg fault
– Infinite loops
– Multithreaded garbage collector pegging all CPU resources
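The "Wired" failures above (OOM, seg fault, infinite loop) can take down the calling process, so one common defensive pattern is to run each parse in a child process with a time limit. The sketch below is a generic Python illustration of that idea, not Tika's own mechanism. `run_isolated` and its status labels are hypothetical names, and the child here is a stand-in Python snippet rather than a real parser.

```python
import subprocess
import sys


def run_isolated(child_code: str, timeout_s: float) -> dict:
    """Run child_code in a separate Python process and classify the outcome.

    child_code is a hypothetical stand-in for "parse this one file";
    a real setup would launch the actual extraction tool instead.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a parse stuck
        # in an infinite loop cannot keep consuming CPU.
        return {"status": "timeout", "text": ""}
    if proc.returncode != 0:
        # Non-zero exit covers ordinary errors as well as hard crashes.
        return {"status": "crash", "text": "", "returncode": proc.returncode}
    return {"status": "ok", "text": proc.stdout}
```

A hung child comes back as `timeout` and a failed one as `crash`, so the caller can record the failure instead of silently losing the document. Memory limits are not handled here and would need OS-level controls.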
Stands up on Soap Box
Upgrade from PDFBox 1.8.6->1.8.7
Soap Box
If your search system can’t tell the difference between those two…
👍You’ve got a neat, little demo!👍
You don’t have a search system.
Steps Off of Soap Box
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
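An out-of-vocabulary statistic can flag garbled or missing text without anyone reading the extract: if most tokens are not common words, something probably went wrong. The sketch below shows one plausible way to compute such a rate. It is an assumption for illustration and may not match tika-eval's exact definition; `oov_rate` and the tiny `COMMON_WORDS` set are hypothetical.

```python
import re

# Tiny illustrative vocabulary (assumption); a real list would hold
# thousands of common words per language.
COMMON_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that", "for", "it"}

# Runs of Unicode letters: word characters minus digits and underscore.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def oov_rate(text, vocabulary=COMMON_WORDS):
    """Fraction of alphabetic tokens not in vocabulary, or None if no tokens."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return None
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)
```

Returning `None` for empty extracts keeps "no text at all" separate from "text that looks garbled", which are different failure modes on the "Things Can Happen" slide.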
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1 hosted on a public virtual machine, provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with and now co-run by open source colleagues (around the world) on the MSOffice parser project and the PDF parser project
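Comparing two runs over the same corpus, for example before and after a parser upgrade, comes down to lining up per-file results. The sketch below assumes a hypothetical data model in which each run maps a file id to a record with `mime`, `exception` (`None` on success) and `text`. The function names and the choice of a Dice coefficient over tokens are illustrative assumptions, not tika-eval's documented output.

```python
from collections import Counter


def exceptions_by_mime(run):
    """Count failed files per mime type in a single run."""
    counts = Counter()
    for record in run.values():
        if record["exception"] is not None:
            counts[record["mime"]] += 1
    return counts


def dice_overlap(text_a, text_b):
    """Dice coefficient over token multisets; 1.0 means identical bags of words."""
    tokens_a, tokens_b = Counter(text_a.split()), Counter(text_b.split())
    total = sum(tokens_a.values()) + sum(tokens_b.values())
    if total == 0:
        return 1.0  # two empty extracts are treated as identical
    shared = sum((tokens_a & tokens_b).values())
    return 2 * shared / total


def compare_runs(run_a, run_b):
    """Report exceptions that appeared or disappeared, plus per-file text overlap."""
    report = {"new_exceptions": [], "fixed_exceptions": [], "overlap": {}}
    for file_id in sorted(run_a.keys() & run_b.keys()):
        rec_a, rec_b = run_a[file_id], run_b[file_id]
        if rec_a["exception"] is None and rec_b["exception"] is not None:
            report["new_exceptions"].append(file_id)
        elif rec_a["exception"] is not None and rec_b["exception"] is None:
            report["fixed_exceptions"].append(file_id)
        report["overlap"][file_id] = dice_overlap(rec_a["text"], rec_b["text"])
    return report
```

Files with low overlap between runs are the ones worth opening by hand, since the extracted text changed even though no exception was thrown.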
Tika 1.21 and beyond
▪ Tika 1.21
– csv/tsv detector and parser (Apache commons-csv)
– Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
– Modularize tika-eval and include stats within the extract for scalability and aggregation of stats w/in Solr/Elastic
– Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
– Improve language coverage/lang id component w/in tika-eval
▪ Help!
– What do you need?
– How can you help us help you?
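As background for the zip-based detection items above: .docx, .pptx and .xlsx files are all zip containers, so telling them apart means looking at the entries inside. The sketch below is a simplified Python illustration, not Tika's detector. `detect_ooxml` is a hypothetical name, and checking a single marker entry per format is an assumption that ignores variants such as macro-enabled files.

```python
import io
import zipfile

# Main-part entries conventionally found in the three formats named above.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def detect_ooxml(data: bytes):
    """Return 'docx', 'pptx', 'xlsx', 'zip' for other zips, or None for non-zip input."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return None
    with zipfile.ZipFile(buffer) as archive:
        names = set(archive.namelist())
    for marker, kind in OOXML_MARKERS.items():
        if marker in names:
            return kind
    return "zip"
```

This version needs the whole file, because `zipfile` reads the central directory stored at the end of the archive. That is the cost the streaming-detection question is about.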