SlideShare a Scribd company logo
Apache Tika: 1 point Oh! Chris A. Mattmann NASA JPL/Univ. Southern California/ASF [email_address]  November 9, 2011
[object Object],[object Object],[object Object],[object Object],And you are?
Roadmap ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Information Landscape
Proliferation of content types available ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],*http://filext.com/
Importance of content types
Importance of content type detection
Search Engine Architecture
Goals ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
is… ,[object Object],[object Object],[object Object],[object Object],[object Object]
Tika ’s (Brief) History ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Community ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Credit: svnsearch.org
Use in the classroom ,[object Object],[object Object],[object Object]
Some recent 1 point oh press
Getting started rapidly…like now! ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
A quick NASA dataset ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],25-Mar-11 CORDEX-MATTMANN
Detecting MIME types from Java ,[object Object],[object Object],[object Object],[object Object],[object Object]
Adding new MIME types ,[object Object],[object Object]
Many custom applications and tools ,[object Object]
Third-party parsing libraries ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Parsing ,[object Object],[object Object],[object Object],[object Object]
Streaming Parsing ,[object Object],[object Object],[object Object],[object Object]
Extraction of Metadata ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Cancer Research Example
Cancer Research Example Attributes Relationships Credit: A. Hart
Tika Sponsoring the Any23 Project ,[object Object],[object Object],[object Object],[object Object],[object Object]
Metadata ,[object Object],[object Object],[object Object]
Methods for language identification ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Language Detection ,[object Object],[object Object],[object Object],[object Object],[object Object]
Running Tika in GUI form ,[object Object],<html xmlns:html= “…”> <body> … </body> </html>
Integrating Tika into your App ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],tika-core tika-parsers tika-app tika-bundle tika-server
Some really great stuff in 1.0 ,[object Object],[object Object],[object Object],[object Object],[object Object],NICK ALREADY TALKED ABOUT THIS!!!  Thunder stolen
Things to watch out for  ,[object Object],[object Object],[object Object],[object Object]
Improvements to Tika ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Part 2 Science Data Systems at NASA Credit: http://www.jpl.nasa.gov/news/news.cfm?release=2011-295
NASA Ground Data Systems Credit: D. Woollard
Context ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The Square Kilometer Array ,[object Object],[object Object],[object Object],[object Object]
NASA DESDynI Mission ,[object Object],[object Object],[object Object],[object Object]
Some Considerations ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Apache OODT ,[object Object]
How are we building these systems now? ,[object Object],[object Object],[object Object]
How are we building these systems now? ,[object Object],[object Object],[object Object],[object Object]
What does this have to do with Tika? Metadata Ext: TIKA! Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
What does this have to do with Tika? Metadata Ext: TIKA! MIME identification: TIKA! MIME identification: TIKA!
Science Data File Formats ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Science Data File Formats ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
So how does it work? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tool support ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Besides processing science files ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Big Goal ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
NASA Geo Challenges ,[object Object],[object Object],[object Object],[object Object]
Acknowledgements  ,[object Object],[object Object],[object Object],[object Object],[object Object]
Book ,[object Object],[object Object],[object Object],[object Object]
Alright, I ’ll shut up now ,[object Object],[object Object],[object Object],[object Object],[object Object]

More Related Content

What's hot

Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
GokulD
 
Presentation of OpenNLP
Presentation of OpenNLPPresentation of OpenNLP
Presentation of OpenNLP
Robert Viseur
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
WO Community
 

What's hot (20)

Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
What's with the 1s and 0s? Making sense of binary data at scale with Tika and...
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Lucene
LuceneLucene
Lucene
 
NLP and LSA getting started
NLP and LSA getting startedNLP and LSA getting started
NLP and LSA getting started
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Presentation of OpenNLP
Presentation of OpenNLPPresentation of OpenNLP
Presentation of OpenNLP
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Tutorial 5 (lucene)
Tutorial 5 (lucene)Tutorial 5 (lucene)
Tutorial 5 (lucene)
 
Full Text Search with Lucene
Full Text Search with LuceneFull Text Search with Lucene
Full Text Search with Lucene
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
S4
S4S4
S4
 
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
Data FAIRport Prototype & Demo - Presentation to Elsevier, Jul 10, 2015
 
Web search engines
Web search enginesWeb search engines
Web search engines
 
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval SystemsIntent Algorithms: The Data Science of Smart Information Retrieval Systems
Intent Algorithms: The Data Science of Smart Information Retrieval Systems
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Building a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engineBuilding a real time, solr-powered recommendation engine
Building a real time, solr-powered recommendation engine
 
Introduction to libre « fulltext » technology
Introduction to libre « fulltext » technologyIntroduction to libre « fulltext » technology
Introduction to libre « fulltext » technology
 

Viewers also liked

Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
mteutelink
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
Manish kumar
 

Viewers also liked (20)

Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Web Crawling with Apache Nutch
Web Crawling with Apache NutchWeb Crawling with Apache Nutch
Web Crawling with Apache Nutch
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Search engine
Search engineSearch engine
Search engine
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Introducción a Solr
Introducción a SolrIntroducción a Solr
Introducción a Solr
 
Conferencia 3: solrconfig.xml
Conferencia 3: solrconfig.xmlConferencia 3: solrconfig.xml
Conferencia 3: solrconfig.xml
 
Conferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo SolrConferencia 5: Extendiendo Solr
Conferencia 5: Extendiendo Solr
 
Conferencia 4: Queries
Conferencia 4: QueriesConferencia 4: Queries
Conferencia 4: Queries
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
Seminario Apache Solr
Seminario Apache SolrSeminario Apache Solr
Seminario Apache Solr
 

Similar to Apache Tika: 1 point Oh!

Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
Iddo
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
butest
 

Similar to Apache Tika: 1 point Oh! (20)

A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
Supercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control SystemSupercharging your Apache OODT deployments with the Process Control System
Supercharging your Apache OODT deployments with the Process Control System
 
247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench247th ACS Meeting: The Eureka Research Workbench
247th ACS Meeting: The Eureka Research Workbench
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
WoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific DataWoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific Data
 
Jeff Grethe: CAMERA
Jeff Grethe: CAMERAJeff Grethe: CAMERA
Jeff Grethe: CAMERA
 
Empowering Transformational Science
Empowering Transformational ScienceEmpowering Transformational Science
Empowering Transformational Science
 
GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003GATE, HLT and Machine Learning, Sheffield, July 2003
GATE, HLT and Machine Learning, Sheffield, July 2003
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
 
160606 data lifecycle project outline
160606 data lifecycle project outline160606 data lifecycle project outline
160606 data lifecycle project outline
 
Dspace
DspaceDspace
Dspace
 
Dspace
DspaceDspace
Dspace
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information Retrieval
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadata
 
Democratizing Big Semantic Data management
Democratizing Big Semantic Data managementDemocratizing Big Semantic Data management
Democratizing Big Semantic Data management
 
Hd3113831386
Hd3113831386Hd3113831386
Hd3113831386
 
Preservation Metadata
Preservation MetadataPreservation Metadata
Preservation Metadata
 
Liberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaLiberating Laboratory Data - Eureka
Liberating Laboratory Data - Eureka
 
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
[IJET V2I3P7] Authors: Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav S...
 

More from Chris Mattmann

More from Chris Mattmann (7)

Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODTWengines, Workflows, and 2 years of advanced data processing in Apache OODT
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT
 
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre ArrayScalable Data Mining and Archiving in the Era of the Square Kilometre Array
Scalable Data Mining and Archiving in the Era of the Square Kilometre Array
 
Teaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache WayTeaching NASA to Open Source its Software the Apache Way
Teaching NASA to Open Source its Software the Apache Way
 
Understanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source SoftwareUnderstanding the Meaningful Use of Open Source Software
Understanding the Meaningful Use of Open Source Software
 
An Open Source Strategy for NASA
An Open Source Strategy for NASAAn Open Source Strategy for NASA
An Open Source Strategy for NASA
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and b...
 

Recently uploaded

Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
Quantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIsQuantum Computing: Current Landscape and the Future Role of APIs
Quantum Computing: Current Landscape and the Future Role of APIs
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
Exploring UiPath Orchestrator API: updates and limits in 2024 🚀
 
Demystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John StaveleyDemystifying gRPC in .Net by John Staveley
Demystifying gRPC in .Net by John Staveley
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 

Apache Tika: 1 point Oh!