SlideShare a Scribd company logo
1 of 33
Text and Metadata Extraction with Apache Tika Jukka Zitting @ Day Software
from files… The Problem
YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before… 3 May. Bistritz.--Left Munich at 8:35 P.M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule… It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer… …  to text … The Problem
…  and metadata © Doug Schepers The Problem
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],The Agenda
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Some History Project Basics
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Some Statistics Project Basics
 
[object Object],[object Object],[object Object],[object Object],[object Object],tika-app-0.7.jar (17MB) Tika on the command line
tika-app-0.7.jar (17MB) $ java -jar tika-app-0.7.jar --gui Tika GUI
[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.Tika; Tika façade Where … can be: java.lang.String java.io.File java.net.URL java.io.InputStream
Dependency management ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Dependencies listed pdfbox-1.1.0.jar fontbox-1.1.0.jar jempbox-1.1.0.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar poi-3.6.jar poi-scratchpad-3.6.jar poi-ooxml-3.6.jar tika-core-0.7.jar tika-parsers-0.7.jar tagsoup-1.2.jar asm-3.1.jar xmlbeans-2.3.0.jar dom4j-1.6.1.jar xml-apis-1.0.b2.jar log4j-1.2.14.jar poi-ooxml-schemas-3.6.jar  commons-compress-1.0.jar metadata-extractor-2.4.0-beta-1.jar geronimo-stax-api_1.0_spec-1.0.1.jar commons-logging-1.1.1.jar …  yes, that’s 21 jars
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.parser.Parser; Tika Parser API
import  org.apache.tika.parser.Parser; Tika Parser API The input document. Remember to close the stream! java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context);
[object Object],[object Object],import  org.apache.tika.io.TikaInputStream; new in Tika 0.8 Where … can be: java.lang.String java.io.File java.net.URL java.io.InputStream ,[object Object],[object Object]
import  org.apache.tika.parser.Parser; Tika Parser API XHTML SAX event handler for the extracted structured text. java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context);
[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.sax.*; XHTML SAX events ,[object Object],[object Object],[object Object],[object Object]
import  org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context); Document metadata, both for input (filename) and output (title)
[object Object],[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.metadata.Metadata; Tika Metadata API
[object Object],[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.metadata.Metadata; Types of Metadata
import  org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context); Parsing context for extra options passed to the parser instances
[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.parser.ParseContext; Examples of the parse context
import  org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context); Automatically selects best parser based on detected document type
[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.parser.*.*; Parser libraries (AL-compatible)
import  org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata =  new  Metadata(); ParseContext context =  new  ParseContext(); Parser parser =  new  AutoDetectParser(); parser.parse(input, handler, metadata, context); throws TikaException, IOException, SAXException
Solr Cell (extracting request handler) $ curl http://localhost:8983/solr/update/extract?literal.id=doc1 -F file=@document.doc http://wiki.apache.org/solr/ExtractingRequestHandler Integrations
[object Object],[object Object],[object Object],[object Object],Other Integrations Integrations
MEAP starting soon! Questions?
[object Object],[object Object],[object Object],Extras
[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.parser.html.HtmlMapper; 1-1 mapping of input HTML
[object Object],[object Object],[object Object],[object Object],import  org.apache.tika.parser.Parser; Parsing documents from inside packages
import  org.apache.tika.parser.OutOfProcessParser; ,[object Object],[object Object],[object Object],[object Object],[object Object],work in progress

More Related Content

What's hot

High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoAlluxio, Inc.
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkReynold Xin
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...Chester Chen
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Umesh Prasad
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...Amazon Web Services
 
Azure landing zones - Terraform module design considerations - Azure Architec...
Azure landing zones - Terraform module design considerations - Azure Architec...Azure landing zones - Terraform module design considerations - Azure Architec...
Azure landing zones - Terraform module design considerations - Azure Architec...DubemJavapi
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Kevin Weil
 
Comparison of ACFS and DBFS
Comparison of ACFS and DBFSComparison of ACFS and DBFS
Comparison of ACFS and DBFSDanielHillinger
 

What's hot (20)

High Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3GoHigh Performance Data Lake with Apache Hudi and Alluxio at T3Go
High Performance Data Lake with Apache Hudi and Alluxio at T3Go
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Postgres index types
Postgres index typesPostgres index types
Postgres index types
 
Stanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache SparkStanford CS347 Guest Lecture: Apache Spark
Stanford CS347 Guest Lecture: Apache Spark
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr Consuming RealTime Signals in Solr
Consuming RealTime Signals in Solr
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
Azure landing zones - Terraform module design considerations - Azure Architec...
Azure landing zones - Terraform module design considerations - Azure Architec...Azure landing zones - Terraform module design considerations - Azure Architec...
Azure landing zones - Terraform module design considerations - Azure Architec...
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)Rainbird: Realtime Analytics at Twitter (Strata 2011)
Rainbird: Realtime Analytics at Twitter (Strata 2011)
 
Comparison of ACFS and DBFS
Comparison of ACFS and DBFSComparison of ACFS and DBFS
Comparison of ACFS and DBFS
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 

Viewers also liked

Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tikaJukka Zitting
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Wicked notes #3
Wicked notes #3Wicked notes #3
Wicked notes #3Kennisland
 
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?VisaPro Immigration Services LLC
 
Walkathon Pre Activity
Walkathon Pre ActivityWalkathon Pre Activity
Walkathon Pre ActivityKwan Tuck Soon
 
互联网商品设计
互联网商品设计互联网商品设计
互联网商品设计easychen
 
The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012Jonathan Waddingham
 
3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizationsMohamed Moustafa
 
Enabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worldsEnabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worldsThomas Kohler
 
Week 01
Week 01Week 01
Week 01tjutel
 
Publizitatearen Historia 4
Publizitatearen Historia 4Publizitatearen Historia 4
Publizitatearen Historia 4katixa
 
Cristian Forcadell
Cristian ForcadellCristian Forcadell
Cristian Forcadellmforcadell
 
Migrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend WebinarMigrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend WebinarIvo Jansch
 
How small charities can use the web to punch above their weight
How small charities can use the web to punch above their weightHow small charities can use the web to punch above their weight
How small charities can use the web to punch above their weightJonathan Waddingham
 

Viewers also liked (20)

Content extraction with apache tika
Content extraction with apache tikaContent extraction with apache tika
Content extraction with apache tika
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Wicked notes #3
Wicked notes #3Wicked notes #3
Wicked notes #3
 
Herodesov chram
Herodesov chramHerodesov chram
Herodesov chram
 
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
H1B Visa 2015 Predictions: What Are Your Chances of Being Selected?
 
Walkathon Pre Activity
Walkathon Pre ActivityWalkathon Pre Activity
Walkathon Pre Activity
 
互联网商品设计
互联网商品设计互联网商品设计
互联网商品设计
 
The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012The future of Facebook fundraising - IoF National Convention 2012
The future of Facebook fundraising - IoF National Convention 2012
 
Kavkaz 2009 + kazani
Kavkaz 2009 + kazaniKavkaz 2009 + kazani
Kavkaz 2009 + kazani
 
3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations3/5 Performance measurment and balanced scorecard in government organizations
3/5 Performance measurment and balanced scorecard in government organizations
 
Enabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worldsEnabling co-­creation of e-services through virtual worlds
Enabling co-­creation of e-services through virtual worlds
 
Visual communication
Visual communicationVisual communication
Visual communication
 
Week 01
Week 01Week 01
Week 01
 
Prezentacia
PrezentaciaPrezentacia
Prezentacia
 
Publizitatearen Historia 4
Publizitatearen Historia 4Publizitatearen Historia 4
Publizitatearen Historia 4
 
Cristian Forcadell
Cristian ForcadellCristian Forcadell
Cristian Forcadell
 
Migrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend WebinarMigrating from PHP4 To PHP5 - Zend Webinar
Migrating from PHP4 To PHP5 - Zend Webinar
 
Ako hľadajú boha extroverti ppt
Ako hľadajú boha extroverti pptAko hľadajú boha extroverti ppt
Ako hľadajú boha extroverti ppt
 
Rulang Cooling Robot
Rulang Cooling RobotRulang Cooling Robot
Rulang Cooling Robot
 
How small charities can use the web to punch above their weight
How small charities can use the web to punch above their weightHow small charities can use the web to punch above their weight
How small charities can use the web to punch above their weight
 

Similar to Text and metadata extraction with Apache Tika

Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout source{d}
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008eComm2008
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationAlfresco Software
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tikaSutthipong Kuruhongsa
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tikaSutthipong Kuruhongsa
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaJason Bell
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesJamund Ferguson
 
Why Java Needs Hierarchical Data
Why Java Needs Hierarchical DataWhy Java Needs Hierarchical Data
Why Java Needs Hierarchical DataMarakana Inc.
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentJay Luker
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Chris Mattmann
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossumoscon2007
 
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML ParsingApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML ParsingTed Leung
 

Similar to Text and metadata extraction with Apache Tika (20)

Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout Introduction to source{d} Engine and source{d} Lookout
Introduction to source{d} Engine and source{d} Lookout
 
Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008Rocky Nevin's presentation at eComm 2008
Rocky Nevin's presentation at eComm 2008
 
PLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and TransformationPLAT-13 Metadata Extraction and Transformation
PLAT-13 Metadata Extraction and Transformation
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Understanding information content with apache tika
Understanding information content with apache tikaUnderstanding information content with apache tika
Understanding information content with apache tika
 
Storm real-time processing
Storm real-time processingStorm real-time processing
Storm real-time processing
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Apache tika
Apache tikaApache tika
Apache tika
 
Virtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to KafkaVirtual Bash! A Lunchtime Introduction to Kafka
Virtual Bash! A Lunchtime Introduction to Kafka
 
Don't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax TreesDon't Be Afraid of Abstract Syntax Trees
Don't Be Afraid of Abstract Syntax Trees
 
Dynamic Python
Dynamic PythonDynamic Python
Dynamic Python
 
Why Java Needs Hierarchical Data
Why Java Needs Hierarchical DataWhy Java Needs Hierarchical Data
Why Java Needs Hierarchical Data
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Letting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search ComponentLetting In the Light: Using Solr as an External Search Component
Letting In the Light: Using Solr as an External Search Component
 
Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!Apache Tika: 1 point Oh!
Apache Tika: 1 point Oh!
 
Os Vanrossum
Os VanrossumOs Vanrossum
Os Vanrossum
 
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML ParsingApacheCon 2000 Everything you ever wanted to know about XML Parsing
ApacheCon 2000 Everything you ever wanted to know about XML Parsing
 

More from Jukka Zitting

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6Jukka Zitting
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CIJukka Zitting
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Jukka Zitting
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repositoryJukka Zitting
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStoreJukka Zitting
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorJukka Zitting
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Jukka Zitting
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repositoryJukka Zitting
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuningJukka Zitting
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical modelJukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache TikaJukka Zitting
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitJukka Zitting
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiJukka Zitting
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On SteroidsJukka Zitting
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache TikaJukka Zitting
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of JackrabbitJukka Zitting
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache JackrabbitJukka Zitting
 

More from Jukka Zitting (19)

The new repository in AEM 6
The new repository in AEM 6The new repository in AEM 6
The new repository in AEM 6
 
Apache development with GitHub and Travis CI
Apache development with GitHub and Travis CIApache development with GitHub and Travis CI
Apache development with GitHub and Travis CI
 
Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3Oak, the architecture of Apache Jackrabbit 3
Oak, the architecture of Apache Jackrabbit 3
 
/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository/path/to/content - the Apache Jackrabbit content repository
/path/to/content - the Apache Jackrabbit content repository
 
MicroKernel & NodeStore
MicroKernel & NodeStoreMicroKernel & NodeStore
MicroKernel & NodeStore
 
Open source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache IncubatorOpen source masterclass - Life in the Apache Incubator
Open source masterclass - Life in the Apache Incubator
 
Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011Apache Jackrabbit @ Swiss Open Source Awards 2011
Apache Jackrabbit @ Swiss Open Source Awards 2011
 
OSGifying the repository
OSGifying the repositoryOSGifying the repository
OSGifying the repository
 
Repository performance tuning
Repository performance tuningRepository performance tuning
Repository performance tuning
 
The return of the hierarchical model
The return of the hierarchical modelThe return of the hierarchical model
The return of the hierarchical model
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
NoSQL Oakland
NoSQL OaklandNoSQL Oakland
NoSQL Oakland
 
Content Storage With Apache Jackrabbit
Content Storage With Apache JackrabbitContent Storage With Apache Jackrabbit
Content Storage With Apache Jackrabbit
 
Introduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache JackrabbiIntroduction to JCR and Apache Jackrabbi
Introduction to JCR and Apache Jackrabbi
 
File System On Steroids
File System On SteroidsFile System On Steroids
File System On Steroids
 
Mime Magic With Apache Tika
Mime Magic With Apache TikaMime Magic With Apache Tika
Mime Magic With Apache Tika
 
Design and architecture of Jackrabbit
Design and architecture of JackrabbitDesign and architecture of Jackrabbit
Design and architecture of Jackrabbit
 
Apache Tika
Apache TikaApache Tika
Apache Tika
 
Content Management With Apache Jackrabbit
Content Management With Apache JackrabbitContent Management With Apache Jackrabbit
Content Management With Apache Jackrabbit
 

Recently uploaded

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 

Text and metadata extraction with Apache Tika

  • 1. Text and Metadata Extraction with Apache Tika Jukka Zitting @ Day Software
  • 3. YOU don't know about me without you have read a book by the name of The Adventures of Tom Sawyer; but that ain't no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly. There was things which he stretched, but mainly he told the truth. That is nothing. I never seen anybody but lied one time or another, without it was Aunt Polly, or the widow, or maybe Mary. Aunt Polly--Tom's Aunt Polly, she is--and Mary, and the Widow Douglas is all told about in that book, which is mostly a true book, with some stretchers, as I said before… 3 May. Bistritz.--Left Munich at 8:35 P.M., on 1st May, arriving at Vienna early next morning; should have arrived at 6:46, but train was an hour late. Buda-Pesth seems a wonderful place, from the glimpse which I got of it from the train and the little I could walk through the streets. I feared to go very far from the station, as we had arrived late and would start as near the correct time as possible. The impression I had was that we were leaving the West and entering the East; the most western of splendid bridges over the Danube, which is here of noble width and depth, took us among the traditions of Turkish rule… It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters. "My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?" Mr. Bennet replied that he had not. "But it is," returned she; "for Mrs. Long has just been here, and she told me all about it." Mr. Bennet made no answer… … to text … The Problem
  • 4. … and metadata © Doug Schepers The Problem
  • 5.
  • 6.
  • 7.
  • 8.  
  • 9.
  • 10. tika-app-0.7.jar (17MB) $ java -jar tika-app-0.7.jar --gui Tika GUI
  • 11.
  • 12.
  • 13. Dependencies listed pdfbox-1.1.0.jar fontbox-1.1.0.jar jempbox-1.1.0.jar bcmail-jdk15-1.45.jar bcprov-jdk15-1.45.jar poi-3.6.jar poi-scratchpad-3.6.jar poi-ooxml-3.6.jar tika-core-0.7.jar tika-parsers-0.7.jar tagsoup-1.2.jar asm-3.1.jar xmlbeans-2.3.0.jar dom4j-1.6.1.jar xml-apis-1.0.b2.jar log4j-1.2.14.jar poi-ooxml-schemas-3.6.jar commons-compress-1.0.jar metadata-extractor-2.4.0-beta-1.jar geronimo-stax-api_1.0_spec-1.0.1.jar commons-logging-1.1.1.jar … yes, that’s 21 jars
  • 14.
  • 15. import org.apache.tika.parser.Parser; Tika Parser API The input document. Remember to close the stream! java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context);
  • 16.
  • 17. import org.apache.tika.parser.Parser; Tika Parser API XHTML SAX event handler for the extracted structured text. java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context);
  • 18.
  • 19. import org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context); Document metadata, both for input (filename) and output (title)
  • 20.
  • 21.
  • 22. import org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context); Parsing context for extra options passed to the parser instances
  • 23.
  • 24. import org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context); Automatically selects best parser based on detected document type
  • 25.
  • 26. import org.apache.tika.parser.Parser; Tika Parser API java.io.InputStream input = …; org.xml.sax.ContentHandler handler = …; Metadata metadata = new Metadata(); ParseContext context = new ParseContext(); Parser parser = new AutoDetectParser(); parser.parse(input, handler, metadata, context); throws TikaException, IOException, SAXException
  • 27. Solr Cell (extracting request handler) $ curl http://localhost:8983/solr/update/extract?literal.id=doc1 -F file=@document.doc http://wiki.apache.org/solr/ExtractingRequestHandler Integrations
  • 28.
  • 29. MEAP starting soon! Questions?
  • 30.
  • 31.
  • 32.
  • 33.