Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
OpenSource Connections
1.
Apache Tika
Tim Allison (tallison@apache.org, @_tallison)
April 24, 2019, Haystack Conference
© 2019 The MITRE Corporation. All rights reserved. Approved for Public Release; Distribution Unlimited. Case Number 18-3138-6
2.
Overview
▪ What is Tika
▪ tika-eval
▪ Running Tika safely
▪ Coming out in 1.21 and beyond
3.
Text/Metadata Extraction
4.
Things Can Happen
▪ Tired:
  – Exceptions
  – Unsupported file formats
  – Encrypted files
  – Garbled text
  – Missing text
▪ Wired:
  – OOM
  – Seg fault
  – Infinite loops
  – Multithreaded garbage collector pegging all CPU resources
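The "Wired" failures above (OOM, seg faults, infinite loops) cannot be caught with an ordinary try/except inside the parsing process, which is why extraction is often run in a separate process with a deadline. The following is a minimal sketch of that idea using only the Python standard library. It is not Tika's own API: `run_isolated` and the three stand-in commands are hypothetical names invented for illustration, and a real setup would launch an actual extractor instead of the toy child processes shown here.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run an extraction command in a child process and classify the outcome.

    Returns (status, stdout) where status is "ok", "timeout", or "crash".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a hung
        # parser cannot keep consuming CPU after the deadline.
        return "timeout", ""
    if proc.returncode != 0:
        # A non-zero exit code covers uncaught exceptions, OOM kills,
        # and seg faults in the child without taking down the caller.
        return "crash", proc.stdout
    return "ok", proc.stdout


# Hypothetical stand-ins for a real extractor: tiny Python children that
# behave like a healthy parser, a crashing parser, and a hung parser.
ok_cmd = [sys.executable, "-c", "print('extracted text')"]
crash_cmd = [sys.executable, "-c", "import sys; sys.exit(3)"]
hang_cmd = [sys.executable, "-c", "while True: pass"]
```

Calling `run_isolated(hang_cmd, 1)` returns a `"timeout"` status after about one second instead of blocking forever. Recording that status per file keeps a hung or crashed document visible in the index rather than silently missing.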
5.
Stands up on Soap Box
6.
Upgrade from PDFBox 1.8.6->1.8.7
7.
Soap Box
If your search system can’t tell the difference between those two…
8.
Soap Box
If your search system can’t tell the difference between those two…
You don’t have a search system.
9.
Soap Box
If your search system can’t tell the difference between those two…
👍You’ve got a neat, little demo!👍
You don’t have a search system.
10.
Steps Off of Soap Box
11.
tika-eval
▪ Profile individual runs
▪ Compare two runs
▪ Exceptions by mime
▪ Out of vocabulary (OOV) statistics
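Two of the statistics listed above can be illustrated in a few lines. The sketch below is not tika-eval's implementation: the word list, the tokenizer, and the function names `oov_ratio` and `exceptions_by_mime` are assumptions chosen for illustration. It treats OOV as the share of alphabetic tokens that do not appear in a list of common words, so a high ratio hints at garbled text, and it tallies failed extractions per MIME type.

```python
import re
from collections import Counter

# Hypothetical, tiny "common words" list. A realistic list would be
# per-language and contain thousands of entries.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "search", "system"}


def oov_ratio(text, vocabulary=COMMON_WORDS):
    """Return the fraction of alphabetic tokens not found in `vocabulary`.

    Returns None when the text has no alphabetic tokens at all, so that
    empty extracts are not mistaken for clean ones.
    """
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return None
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


def exceptions_by_mime(records):
    """Count failed extractions per MIME type.

    `records` is an iterable of (mime_type, exception_name_or_None) pairs.
    """
    counts = Counter()
    for mime, exception_name in records:
        if exception_name is not None:
            counts[mime] += 1
    return counts
```

For example, `oov_ratio("the search system")` is `0.0`, while a garbled extract such as `"xq zzv the"` scores two out of three tokens as out of vocabulary.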
12.
tika-eval: Eating our own dog food
▪ 3 million files (~1 TB) from Common Crawl and govdocs1 hosted on a public virtual machine, provided by Rackspace
▪ Code to profile a single run or compare two runs before release
▪ Evaluation methodology co-developed with and now co-run by open source colleagues (around the world) on the MSOffice parser project and the PDF parser project
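Comparing two runs before a release can be sketched as a per-file text similarity check. The example below is an illustration under stated assumptions, not the project's actual methodology: it uses a Dice coefficient over token counts, and the names `tokenize`, `dice`, `regressions`, and the `0.9` threshold are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def dice(text_a, text_b):
    """Dice coefficient over token counts: 1.0 is identical, 0.0 is disjoint."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        # Two empty extracts are treated as identical.
        return 1.0
    shared = sum((a & b).values())  # multiset intersection
    return 2 * shared / total


def regressions(run_a, run_b, threshold=0.9):
    """List (doc_id, score) for files whose text changed a lot between runs.

    `run_a` and `run_b` map a document id to its extracted text.
    Only documents present in both runs are compared.
    """
    flagged = []
    for doc_id in sorted(run_a.keys() & run_b.keys()):
        score = dice(run_a[doc_id], run_b[doc_id])
        if score < threshold:
            flagged.append((doc_id, score))
    return flagged
```

A file whose text goes from `"hello world"` in one run to `"h e l l o w o r l d"` in the next scores `0.0` and is flagged, while unchanged files score `1.0` and pass.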
13.
Tika 1.21 and beyond
▪ Tika 1.21
  – csv/tsv detector and parser (Apache commons-csv)
  – Improved zip-based (.docx, .pptx, .xlsx) file detection and parsing
▪ Beyond
  – Modularize tika-eval and include stats within the extract for scalability and aggregation of stats w/in Solr/Elastic
  – Increase coverage/speed of zip-based file detection; can we move entirely to streaming detection?
  – Improve language coverage/lang id component w/in tika-eval
▪ Help!
  – What do you need?
  – How can you help us help you?
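To make the zip-based detection problem concrete, here is a minimal sketch that tells .docx, .pptx, and .xlsx apart by looking for a characteristic entry inside the zip container. It is a simplification and not Tika's detector: the marker table and the names `detect_zip_based` and `make_zip` are assumptions for illustration, and a robust detector would inspect more than a single entry name.

```python
import io
import zipfile

# Assumed marker entries: the main part of each OOXML package
# conventionally lives at these paths inside the zip container.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def detect_zip_based(data):
    """Classify bytes as "docx", "pptx", "xlsx", generic "zip", or None."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
    for marker, kind in OOXML_MARKERS.items():
        if marker in names:
            return kind
    return "zip"


def make_zip(entry_names):
    """Build a small in-memory zip with the given entry names (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in entry_names:
            archive.writestr(name, "<x/>")
    return buffer.getvalue()
```

Note that `zipfile` reads the central directory stored at the end of the archive, so this approach needs the whole file or a seekable stream. That constraint is one reason fully streaming detection, as asked in the list above, is a harder problem.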