Text and Metadata Extraction with Apache Tika Jukka Zitting @ Day Software
from files… The Problem
YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before… 3 May. Bistritz.--Left Munich at 8:35 P.M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule… It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer… …  to text … The Problem
…  and metadata © Doug Schepers The Problem
The Agenda
Project Basics: Some History
Project Basics: Some Statistics
 
Tika on the command line: tika-app-0.7.jar (17MB)
Tika GUI: tika-app-0.7.jar (17MB) $ java -jar tika-app-0.7.jar --gui
Tika façade (import org.apache.tika.Tika;) Where … can be: java.lang.String, java.io.File, java.net.URL, java.io.InputStream
Dependency management
Dependencies listed pdfbox-1.1.0.jar fontbox-1.1.0.jar jempbox-1.1.0.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar poi-3.6.jar poi-scratchpad-3.6.jar poi-ooxml-3.6.jar tika-core-0.7.jar tika-parsers-0.7.jar tagsoup-1.2.jar asm-3.1.jar xmlbeans-2.3.0.jar dom4j-1.6.1.jar xml-apis-1.0.b2.jar log4j-1.2.14.jar poi-ooxml-schemas-3.6.jar  commons-compress-1.0.jar metadata-extractor-2.4.0-beta-1.jar geronimo-stax-api_1.0_spec-1.0.1.jar commons-logging-1.1.1.jar …  yes, that’s 21 jars
Tika Parser API (import org.apache.tika.parser.Parser;)
java.io.InputStream input = …;  // the input document; remember to close the stream!
org.xml.sax.ContentHandler handler = …;
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
parser.parse(input, handler, metadata, context);
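The reminder to close the stream implies that the caller, not the parser, owns the input. A minimal sketch of that ownership pattern follows, using only JDK classes. The names are assumptions made for illustration: consume is a hypothetical stand-in for the parser.parse(...) call, and CloseTrackingStream exists only to make the close observable.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class StreamOwnership {

    /** Demo helper: wraps a stream and records whether close() was called. */
    static final class CloseTrackingStream extends FilterInputStream {
        private boolean closed = false;

        CloseTrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical stand-in for parser.parse(...): reads the stream but never closes it. */
    static int consume(InputStream input) throws IOException {
        int count = 0;
        while (input.read() != -1) {
            count++;
        }
        return count;
    }

    /** Opens, consumes, and closes the stream; returns the tracker for inspection. */
    static CloseTrackingStream readAndClose(byte[] data) throws IOException {
        CloseTrackingStream tracker =
                new CloseTrackingStream(new ByteArrayInputStream(data));
        // try-with-resources closes the stream even if consume() throws.
        try (InputStream input = tracker) {
            consume(input);
        }
        return tracker;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("closed: " + readAndClose(data).isClosed());
    }
}
```

With a real parser, the same try-with-resources block would wrap the parse call, so the stream is released whether parsing succeeds or fails.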
TikaInputStream (import org.apache.tika.io.TikaInputStream;), new in Tika 0.8. Where … can be: java.lang.String, java.io.File, java.net.URL, java.io.InputStream
Tika Parser API: the handler argument (org.xml.sax.ContentHandler) is the XHTML SAX event handler for the extracted structured text.
XHTML SAX events (import org.apache.tika.sax.*;)
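Because the extracted text arrives as XHTML SAX events, any org.xml.sax.ContentHandler can consume it. The sketch below is one possible handler that keeps only the character data inside the body element. It is an illustration, not a class from Tika: BodyTextCollector and extract are hypothetical names, and the JDK's own SAX parser stands in for Tika as the source of events so that the example runs without extra jars.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextCollector extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than zero while inside <body>

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (bodyDepth > 0 || (XHTML.equals(uri) && "body".equals(local))) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
            // End each paragraph with a newline so paragraphs stay separated.
            if (XHTML.equals(uri) && "p".equals(local)) {
                text.append('\n');
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String getText() {
        return text.toString().trim();
    }

    /** Feeds an XHTML string through the JDK SAX parser into the collector. */
    static String extract(String xhtml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to see the XHTML namespace URI
        BodyTextCollector handler = new BodyTextCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Demo</title></head>"
                + "<body><p>Hello</p><p>World</p></body></html>";
        System.out.println(extract(xhtml));
    }
}
```

An instance of such a handler would be passed as the handler argument of parser.parse(input, handler, metadata, context); the title in the head element is skipped here because it is metadata rather than body text.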
Tika Parser API: the metadata argument carries document metadata, both for input (filename) and output (title).
Tika Metadata API (import org.apache.tika.metadata.Metadata;)
Types of Metadata (import org.apache.tika.metadata.Metadata;)
Tika Parser API: the context argument (ParseContext) is the parsing context for extra options passed to the parser instances.
Examples of the parse context (import org.apache.tika.parser.ParseContext;)
Tika Parser API: AutoDetectParser automatically selects the best parser based on the detected document type.
Parser libraries (AL-compatible) (import org.apache.tika.parser.*.*;)
Tika Parser API: parser.parse(input, handler, metadata, context) throws TikaException, IOException, SAXException.
Integrations: Solr Cell (extracting request handler)
$ curl "http://localhost:8983/solr/update/extract?literal.id=doc1" -F file=@document.doc
http://wiki.apache.org/solr/ExtractingRequestHandler
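The curl command above posts the file as multipart/form-data and passes the document id as a query parameter. A rough Java equivalent can be sketched with java.net.http (Java 11 or later). The class and method names are hypothetical, the fixed boundary string is a simplification that assumes the boundary never occurs in the file content, and the URL is the one from the curl example.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class SolrCellRequest {

    /** Builds a multipart/form-data body containing a single file part. */
    static byte[] multipartBody(String boundary, String field, String fileName, byte[] content) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(content);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Builds (but does not send) the POST request for the extracting request handler. */
    static HttpRequest extractRequest(String docId, String fileName, byte[] content) {
        String boundary = "----tika-demo-boundary"; // assumption: not present in content
        URI uri = URI.create("http://localhost:8983/solr/update/extract?literal.id="
                + URLEncoder.encode(docId, StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(
                        multipartBody(boundary, "file", fileName, content)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = extractRequest("doc1", "document.doc", new byte[0]);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request is then a call to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running Solr instance; that step is left out of the sketch so it runs without a server.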
Integrations: Other Integrations
MEAP starting soon! Questions?
Extras
1-1 mapping of input HTML (import org.apache.tika.parser.html.HtmlMapper;)
Parsing documents from inside packages (import org.apache.tika.parser.Parser;)
OutOfProcessParser (import org.apache.tika.parser.OutOfProcessParser;): work in progress

More Related Content

What's hot

Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Christos Manios
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
Chandler Huang
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
Yifeng Jiang
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
Neo4j
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
Apache spark
Apache sparkApache spark
Apache spark
shima jafari
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Apache flink
Apache flinkApache flink
Apache flink
Ahmed Nader
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
KafkaZone
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
Mohammed Fazuluddin
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
Aljoscha Krettek
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Andrew Lamb
 
Spark
SparkSpark
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learned
Tin Le
 

What's hot (20)

Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Airflow introduction
Airflow introductionAirflow introduction
Airflow introduction
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Graphs for Enterprise Architects
Graphs for Enterprise ArchitectsGraphs for Enterprise Architects
Graphs for Enterprise Architects
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Apache spark
Apache sparkApache spark
Apache spark
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Apache flink
Apache flinkApache flink
Apache flink
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Kafka presentation
Kafka presentationKafka presentation
Kafka presentation
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
Spark
SparkSpark
Spark
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
ELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learnedELK at LinkedIn - Kafka, scaling, lessons learned
ELK at LinkedIn - Kafka, scaling, lessons learned
 

Viewers also liked

Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
Jukka Zitting
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
Paolo Mottadelli
 
Wicked notes #3
Wicked notes #3Wicked notes #3
Wicked notes #3
Kennisland
 
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
VisaPro Immigration Services LLC
 
Walkathon Pre Activity
Walkathon Pre ActivityWalkathon Pre Activity
Walkathon Pre Activity
Kwan Tuck Soon
 
互联网商品设计
互联网商品设计互联网商品设计
互联网商品设计
easychen
 
The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012
Jonathan Waddingham
 
3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations
Mohamed Moustafa
 
Enabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worldsEnabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worlds
Thomas Kohler
 
Visual communication
Visual communicationVisual communication
Visual communication
Chiara Antonacci
 
Week 01
Week 01Week 01
Week 01
tjutel
 
Publizitatearen Historia 4
Publizitatearen Historia 4Publizitatearen Historia 4
Publizitatearen Historia 4
katixa
 
Cristian Forcadell
Cristian ForcadellCristian Forcadell
Cristian Forcadell
mforcadell
 
Migrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend WebinarMigrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend Webinar
Ivo Jansch
 
Ako hľadajú boha extroverti ppt
Ako hľadajú boha extroverti pptAko hľadajú boha extroverti ppt
Ako hľadajú boha extroverti ppt
Cirkev bratská Svätý Jur
 
Rulang Cooling Robot
Rulang Cooling RobotRulang Cooling Robot
Rulang Cooling Robot
Kwan Tuck Soon
 
How small charities can use the web to punch above their weight
How small charities can use the web to punch above their weightHow small charities can use the web to punch above their weight
How small charities can use the web to punch above their weight
Jonathan Waddingham
 

Viewers also liked (20)

Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Wicked notes #3
Wicked notes #3Wicked notes #3
Wicked notes #3
 
Herodesov chram
Herodesov chramHerodesov chram
Herodesov chram
 
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
 
Walkathon Pre Activity
Walkathon Pre ActivityWalkathon Pre Activity
Walkathon Pre Activity
 
互联网商品设计
互联网商品设计互联网商品设计
互联网商品设计
 
The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012
 
Kavkaz 2009 + kazani
Kavkaz 2009 + kazaniKavkaz 2009 + kazani
Kavkaz 2009 + kazani
 
3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations
 
Enabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worldsEnabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worlds
 
Visual communication
Visual communicationVisual communication
Visual communication
 
Week 01
Week 01Week 01
Week 01
 
Prezentacia
PrezentaciaPrezentacia
Prezentacia
 
Publizitatearen Historia 4
Publizitatearen Historia 4Publizitatearen Historia 4
Publizitatearen Historia 4
 
Cristian Forcadell
Cristian ForcadellCristian Forcadell
Cristian Forcadell
 
Migrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend WebinarMigrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend Webinar
 
Ako hľadajú boha extroverti ppt
Ako hľadajú boha extroverti pptAko hľadajú boha extroverti ppt
Ako hľadajú boha extroverti ppt
 
Rulang Cooling Robot
Rulang Cooling RobotRulang Cooling Robot
Rulang Cooling Robot
 
How small charities can use the web to punch above their weight
How small charities can use the web to punch above their weightHow small charities can use the web to punch above their weight
How small charities can use the web to punch above their weight
 

Similar to Text and metadata extraction with Apache Tika

Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
Alfresco Software
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
Paolo Mottadelli
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
source{d}
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
eComm2008
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
Alfresco Software
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
Sutthipong Kuruhongsa
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
Sutthipong Kuruhongsa
 
Storm real-time processing
Storm real-time processingStorm real-time processing
Storm real-time processing
Michael Vogiatzis
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
gagravarr
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Apache tika
Apache tikaApache tika
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to Kafka
Jason Bell
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax Trees
Jamund Ferguson
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
Chui-Wen Chiu
 
Why Java Needs Hierarchical Data
Why Java Needs Hierarchical DataWhy Java Needs Hierarchical Data
Why Java Needs Hierarchical Data
Marakana Inc.
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Databricks
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
Jay Luker
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!
Chris Mattmann
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossum
oscon2007
 
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML ParsingApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
Ted Leung
 

Similar to Text and metadata extraction with Apache Tika (20)

Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Storm real-time processing
Storm real-time processingStorm real-time processing
Storm real-time processing
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Apache tika
Apache tikaApache tika
Apache tika
 
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to Kafka
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax Trees
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Why Java Needs Hierarchical Data
Why Java Needs Hierarchical DataWhy Java Needs Hierarchical Data
Why Java Needs Hierarchical Data
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossum
 
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML ParsingApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
 

More from Jukka Zitting

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
Jukka Zitting
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
Jukka Zitting
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
Jukka Zitting
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
Jukka Zitting
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
Jukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
Jukka Zitting
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
Jukka Zitting
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
Jukka Zitting
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
Jukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
Jukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
Jukka Zitting
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
Jukka Zitting
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
Jukka Zitting
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
Jukka Zitting
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
Jukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
Jukka Zitting
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
Jukka Zitting
 
Apache Tika
Apache TikaApache Tika
Apache Tika
Jukka Zitting
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
Jukka Zitting
 

Text and metadata extraction with Apache Tika

  • 1. Text and Metadata Extraction with Apache Tika. Jukka Zitting @ Day Software
  • 3. The Problem: … to text …
        YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before…
        3 May. Bistritz.--Left Munich at 8:35 P.M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule…
        It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer…
  • 4. The Problem: … and metadata. © Doug Schepers
  • 10. Tika GUI: tika-app-0.7.jar (17MB)
        $ java -jar tika-app-0.7.jar --gui
  • 13. Dependencies listed: pdfbox-1.1.0.jar, fontbox-1.1.0.jar, jempbox-1.1.0.jar, bcmail-jdk15-1.45.jar, bcprov-jdk15-1.45.jar, poi-3.6.jar, poi-scratchpad-3.6.jar, poi-ooxml-3.6.jar, tika-core-0.7.jar, tika-parsers-0.7.jar, tagsoup-1.2.jar, asm-3.1.jar, xmlbeans-2.3.0.jar, dom4j-1.6.1.jar, xml-apis-1.0.b2.jar, log4j-1.2.14.jar, poi-ooxml-schemas-3.6.jar, commons-compress-1.0.jar, metadata-extractor-2.4.0-beta-1.jar, geronimo-stax-api_1.0_spec-1.0.1.jar, commons-logging-1.1.1.jar … yes, that’s 21 jars
  • 15. Tika Parser API. The input document. Remember to close the stream!
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 17. Tika Parser API. XHTML SAX event handler for the extracted structured text.
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 19. Tika Parser API. Document metadata, both for input (filename) and output (title).
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 22. Tika Parser API. Parsing context for extra options passed to the parser instances.
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 24. Tika Parser API. Automatically selects best parser based on detected document type.
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 26. Tika Parser API. throws TikaException, IOException, SAXException
        import org.apache.tika.parser.Parser;
        java.io.InputStream input = …;
        org.xml.sax.ContentHandler handler = …;
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        parser.parse(input, handler, metadata, context);
  • 27. Integrations: Solr Cell (extracting request handler)
        $ curl http://localhost:8983/solr/update/extract?literal.id=doc1 -F file=@document.doc
        http://wiki.apache.org/solr/ExtractingRequestHandler
  • 29. MEAP starting soon! Questions?
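
The Tika Parser API slides leave the `handler` argument elided. As a minimal sketch of what such an "XHTML SAX event handler" could look like, the class below collects the character data from SAX events into plain text. The class name `TextCollector` and its `extract` helper are hypothetical, and Tika itself is not used here: the JDK's own SAX parser replays a small XHTML string as a stand-in for the events that `parser.parse(input, handler, metadata, context)` would deliver.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): gathers text from XHTML SAX events.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data may arrive in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // End a line after each paragraph or title so words do not run together.
        // The default JDK factory is not namespace-aware, so qName is checked.
        if (qName.equals("p") || qName.equals("title")) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }

    // Stand-in for the Tika parse call: feeds an XHTML string through
    // the JDK SAX parser and returns the collected text.
    static String extract(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.print(extract(xhtml));
    }
}
```

Because the handler is an ordinary `org.xml.sax.ContentHandler`, an instance of it could presumably be passed as the second argument of the `parse` call shown on the slides; that wiring is an assumption and is not tested here.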