Presented By: Rohit Sharma
What Is Apache Tika?
Apache Tika is a library used for document type detection and content extraction from various file formats.
Internally, Tika uses existing document parsers and document type detection techniques to detect types and
extract data.
Tika provides a single generic API for parsing different file formats. It uses 83 existing specialized parser libraries,
one per document type, such as Apache PDFBox and Apache POI.
All these parser libraries are encapsulated under a single interface called the Parser interface.
Why Tika?
According to filext.com, there are about 15,000 to 51,000 content types, and this number is growing day by day.
Data is stored in various formats such as text documents, Excel spreadsheets, PDFs, images, and multimedia
files.
Therefore, applications such as search engines and content management systems need additional support for easy
extraction of data from these document types.
Apache Tika serves this purpose by providing a generic API to locate and extract data from multiple file formats.
Apache Tika Applications
Search Engines
Tika is widely used while developing search engines to index the text contents of digital documents.
Digital Asset Management
Some organizations manage their digital assets such as photographs, e-books, drawings, music, and video
using a special application known as a digital asset manager (DAM).
Such applications rely on document type detectors and metadata extractors to classify the various
documents.
Content Analysis
Websites like Amazon recommend newly released content to individual users according to
their interests.
To do so, these websites draw on social media sites like Facebook to extract information
such as users' likes and interests. This gathered information comes in the form of HTML tags or other
formats that require further content type detection and extraction.
Features Of Tika
Unified parser interface: Tika encapsulates all the third-party parser libraries within a single Parser interface.
This frees the user from the burden of selecting a suitable parser library for each file type
encountered.
Fast processing: Quick content detection and extraction can be expected.
Parser integration : Tika can use various parser libraries available for each document type in a single application.
MIME type detection : Tika can detect and extract content from all the media types included in the MIME
standards.
Language detection: Tika includes a language identification feature, so it can be used to classify documents by
language on multilingual websites.
Functionalities Of Tika
Tika supports various functionalities:
Document type detection
Content extraction
Metadata extraction
Language detection
Tika Facade Class
Users can embed Tika in their applications using the Tika facade class. It has methods to explore all the
functionalities of Tika.
Being a facade, the Tika class hides the complexity behind its functions: it abstracts all the internal
implementations and provides simple methods to access Tika's functionality.
Some Methods of the Tika Facade Class
String parseToString(File file): This method and its variants parse the file passed as a parameter and return the
extracted text content as a String. By default, the length of this string is limited.
int getMaxStringLength(): Returns the maximum length of strings returned by the parseToString methods.
void setMaxStringLength(int maxStringLength): Sets the maximum length of strings returned by the parseToString methods.
Reader parse(File file): This method and its variants parse the file passed as a parameter and return the extracted text content as
a java.io.Reader object.
String detect(InputStream stream, Metadata metadata): This method and its variants accept an InputStream object and a
Metadata object as parameters, detect the type of the given document, and return the document type name as a String. This
method abstracts the detection mechanisms used by Tika.
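A minimal sketch of these facade methods in use (assumes Apache Tika on the classpath; the file name is illustrative):

```java
import java.io.File;

import org.apache.tika.Tika;

public class TikaFacadeDemo {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File file = new File("example.pdf"); // illustrative path

        // detect(): identify the MIME type of the document.
        String type = tika.detect(file);
        System.out.println(type);

        // Raise the default cap on how much text parseToString() returns.
        tika.setMaxStringLength(10 * 1000 * 1000);
        System.out.println(tika.getMaxStringLength());

        // parseToString(): extract the text content as a String.
        String text = tika.parseToString(file);
        System.out.println(text.length() + " characters extracted");
    }
}
```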
Parser interface
This is the interface that is implemented by all the parser classes of Tika package.
The following is the important method of Tika Parser interface:
❖ parse (InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
This method parses the given document into a sequence of XHTML SAX events. After parsing, it places the extracted
document content in the ContentHandler object and the metadata in the Metadata object.
Metadata class
The following are the important methods of this class:
1. add (Property property, String value)
2. add (String name, String value)
3. String get (Property property)
4. String get (String name)
5. Date getDate (Property property)
6. String[] getValues (Property property)
7. String[] getValues (String name)
8. String[] names()
9. set (Property property, Date date)
10. set(Property property, String[] values)
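Since Metadata behaves as a simple multi-valued map of names to strings, the add/get/set methods above can be exercised directly (a sketch; the property name "author" is illustrative):

```java
import org.apache.tika.metadata.Metadata;

public class MetadataDemo {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();

        // add() appends: calling it twice keeps both values.
        metadata.add("author", "Alice");
        metadata.add("author", "Bob");
        System.out.println(metadata.get("author"));               // first value: Alice
        System.out.println(metadata.getValues("author").length);  // 2

        // set() replaces every existing value for the name.
        metadata.set("author", "Carol");
        System.out.println(metadata.getValues("author").length);  // 1

        // names() lists every metadata name currently stored.
        for (String name : metadata.names()) {
            System.out.println(name);
        }
    }
}
```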
Language Identifier class
This class identifies the language of the given content.
LanguageIdentifier(String content): This constructor instantiates a language identifier, taking the text content to analyze as a String.
String getLanguage(): Returns the language detected for the content given to the current LanguageIdentifier object.
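A short sketch of the class in use (in Tika 1.x it lives in org.apache.tika.language; the sample sentence is illustrative):

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageDemo {
    public static void main(String[] args) {
        // Pass the text to analyze directly to the constructor.
        LanguageIdentifier identifier =
                new LanguageIdentifier("This sentence is written in plain English.");

        // getLanguage() returns the ISO 639-1 code of the detected
        // language, e.g. "en" for English text.
        System.out.println(identifier.getLanguage());
    }
}
```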
Type Detection in Tika
Tika supports all the Internet media types defined in the MIME standards. Whenever a file is passed to Tika, it detects the
file's document type.
Type detection using Facade class
The detect() method of the facade class is used to detect the document type. This method accepts a file as input.
Eg:
File file = new File("example.mp3");
Tika tika = new Tika();
String filetype = tika.detect(file);
Content Extraction Using Tika
For parsing documents, the parseToString() method of Tika facade class is generally used.
The steps involved in the parsing process are all abstracted by the Tika parseToString()
method.
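A sketch of that single-call usage, with the two checked exceptions parseToString() declares (the file name is illustrative):

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ParseToStringDemo {
    public static void main(String[] args) {
        Tika tika = new Tika();
        try {
            // One call hides type detection, parser selection, and stream handling.
            String content = tika.parseToString(new File("sample.docx")); // illustrative path
            System.out.println(content);
        } catch (IOException e) {
            System.err.println("The file could not be read: " + e.getMessage());
        } catch (TikaException e) {
            System.err.println("The document could not be parsed: " + e.getMessage());
        }
    }
}
```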
Content Extraction Using the Parser Interface
The parser package of Tika provides several interfaces and classes using which we can parse a text document.
Content Extraction Using the Parser Interface
parse() method
Along with parseToString(), you can also use the parse() method of the Parser interface; its prototype is given
in the Parser interface section above.
The following steps show how the parse() method is used.
1. Instantiate any of the classes implementing this interface:
Parser parser = new AutoDetectParser();
(or)
Parser parser = new CompositeParser();
(or) an object of any of the individual parsers provided in the Tika library.
2. Create a handler class object:
BodyContentHandler handler = new BodyContentHandler();
Content Extraction Using the Parser Interface
3. Create the Metadata object as shown below:
Metadata metadata = new Metadata();
4. Create an input stream object and pass it the file whose content should be extracted:
File file = new File(filepath);
FileInputStream inputstream = new FileInputStream(file);
(or)
InputStream stream = TikaInputStream.get(new File(filename));
5. Create a parse context object as shown below:
ParseContext context =new ParseContext();
6. Invoke the parse() method, passing all the objects created above:
parser.parse(inputstream, handler, metadata, context);
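Putting steps 1–6 together, a complete sketch (the file name is illustrative; the tika-parsers module is assumed on the classpath):

```java
import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ParserDemo {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();                 // step 1
        BodyContentHandler handler = new BodyContentHandler();  // step 2
        Metadata metadata = new Metadata();                     // step 3
        try (InputStream stream =
                 TikaInputStream.get(new File("sample.pdf"))) { // step 4 (illustrative path)
            ParseContext context = new ParseContext();          // step 5
            parser.parse(stream, handler, metadata, context);   // step 6
        }
        System.out.println(handler.toString()); // the extracted text
    }
}
```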
Metadata Extraction Using the Parser Interface
Besides content, Tika also extracts metadata from a file. Metadata is the additional information supplied with a file.
For an audio file, the artist name, album name, and title come under metadata.
Whenever we parse a file using parse(), we pass an empty metadata object as one of the parameters.
This method extracts the metadata of the given file (if that file contains any), and places them in the metadata object.
Therefore, after parsing the file using parse(), we can extract the metadata from that object.
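A sketch that parses a file and then enumerates whatever metadata the parser placed into the object (the file name is illustrative):

```java
import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class MetadataExtractionDemo {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata(); // passed in empty, filled during parsing
        try (InputStream stream =
                 TikaInputStream.get(new File("song.mp3"))) { // illustrative path
            parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        }
        // Enumerate every name/value pair the parser extracted.
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
    }
}
```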
Adding new and setting values in existing metadata
Adding New
We can add new metadata values using the add() method of the Metadata class:
metadata.add("author", "Tutorials point");
The Metadata class has some predefined properties.
metadata.add(Metadata.SOFTWARE,"ms paint");
Edit Existing
You can set values to the existing metadata elements using the set() method.
metadata.set(Metadata.DATE, new Date());
Language Detection in Tika
Of the 184 languages standardized by ISO 639-1, Tika can detect 18.
Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class.
This method returns the code name of the language in String format.
Each detected language is returned as its two-letter ISO 639-1 code (for example, en for English).
Language Detection in Tika
While instantiating the LanguageIdentifier class, you pass the content whose language is to be detected as a String.
Eg:
LanguageIdentifier object = new LanguageIdentifier("this is english");
Language Detection of a Document
To detect the language of a given document, you have to parse it using the parse() method.
The parse() method parses the content and stores it in the handler object, which was passed to it as one of the arguments.
Eg:
parser.parse(inputstream, handler, metadata, context);
LanguageIdentifier object = new LanguageIdentifier(handler.toString());
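Combining these two steps, a sketch that parses a document and then identifies the language of its extracted text (the file name is illustrative):

```java
import java.io.File;
import java.io.InputStream;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.language.LanguageIdentifier;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DocumentLanguageDemo {
    public static void main(String[] args) throws Exception {
        BodyContentHandler handler = new BodyContentHandler();
        try (InputStream stream =
                 TikaInputStream.get(new File("sample.txt"))) { // illustrative path
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        // The handler now holds the extracted text; identify its language.
        LanguageIdentifier identifier = new LanguageIdentifier(handler.toString());
        System.out.println(identifier.getLanguage());
    }
}
```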
Questions ?
Thanks !

Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 

Apache tika

  • 1. Presented By: Rohit Sharma
  • 2. What Is Apache Tika? Apache Tika is a library used for document type detection and content extraction from various file formats. Internally, Tika uses various existing document parsers and document type detection techniques to detect and extract data. Tika provides a single generic API for parsing different file formats. It uses 83 existing specialized parser libraries, one for each document type, such as Apache PDFBox and Apache POI. All these parser libraries are encapsulated under a single interface called the Parser interface.
  • 3. Why Tika? According to filext.com, there are about 15k to 51k content types, and this number is growing day by day. Data is stored in various formats such as text documents, Excel spreadsheets, PDFs, images, and multimedia files. Therefore, applications such as search engines and content management systems need additional support to extract data from these document types easily. Apache Tika serves this purpose by providing a generic API to locate and extract data from multiple file formats.
  • 4. Apache Tika Applications Search Engines Tika is widely used while developing search engines to index the text contents of digital documents. Digital Asset Management Some organizations manage their digital assets such as photographs, e-books, drawings, music, and video using a special application known as digital asset management (DAM). Such applications take the help of document type detectors and metadata extractors to classify the various documents. Content Analysis Websites like Amazon recommend newly released content to individual users according to their interests. To do so, these websites take the help of social media websites like Facebook to extract required information such as the likes and interests of users. This gathered information will be in the form of HTML tags or other formats that require further content type detection and extraction.
  • 5. Features of Tika Unified parser interface: Tika encapsulates all the third-party parser libraries within a single parser interface. Thanks to this feature, the user is freed from the burden of selecting a suitable parser library for each file type encountered. Fast processing: Quick content detection and extraction from applications can be expected. Parser integration: Tika can use the various parser libraries available for each document type in a single application. MIME type detection: Tika can detect and extract content from all the media types included in the MIME standards. Language detection: Tika includes a language identification feature and can therefore be used to classify documents by language on multilingual websites.
  • 6. Functionalities Of Tika Tika supports various functionalities: Document type detection Content extraction Metadata extraction Language detection
  • 7. Tika Facade Class Users can embed Tika in their applications using the Tika facade class. It has methods to explore all the functionalities of Tika. Since it is a facade class, Tika hides the complexity behind its functions: it abstracts all the internal implementations and provides simple methods to access Tika's functionality.
  • 8. Some Methods of the Tika Facade Class String parseToString(File file): This method and all its variants parse the file passed as a parameter and return the extracted text content as a String. By default, the length of this returned string is limited. int getMaxStringLength(): Returns the maximum length of strings returned by the parseToString methods. void setMaxStringLength(int maxStringLength): Sets the maximum length of strings returned by the parseToString methods. Reader parse(File file): This method and all its variants parse the file passed as a parameter and return the extracted text content as a java.io.Reader object. String detect(InputStream stream, Metadata metadata): This method and all its variants accept an InputStream object and a Metadata object as parameters, detect the type of the given document, and return the document type name as a String. This method abstracts the detection mechanisms used by Tika.
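The facade methods above can be sketched as follows. This is a minimal sketch assuming tika-core and tika-parsers 1.x are on the classpath; an in-memory plain-text stream stands in for a real file so the example needs nothing on disk:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class TikaFacadeSketch {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();

        // Cap the length of strings returned by parseToString().
        tika.setMaxStringLength(10 * 1000);
        System.out.println(tika.getMaxStringLength()); // 10000

        // parseToString() extracts the text content of the stream.
        InputStream in = new ByteArrayInputStream(
                "Hello from Tika".getBytes(StandardCharsets.UTF_8));
        System.out.println(tika.parseToString(in).trim()); // Hello from Tika
    }
}
```

The same `Tika` object can be reused across calls; the max-string-length setting applies to every subsequent `parseToString()` invocation.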
  • 9. Parser Interface This is the interface implemented by all the parser classes of the Tika package. The following is the important method of the Tika Parser interface: ❖ parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) This method parses the given document into a sequence of XHTML SAX events. After parsing, it places the extracted document content in the ContentHandler object and the metadata in the Metadata object.
  • 10. Metadata Class Following are some methods of this class: 1. add(Property property, String value) 2. add(String name, String value) 3. String get(Property property) 4. String get(String name) 5. Date getDate(Property property) 6. String[] getValues(Property property) 7. String[] getValues(String name) 8. String[] names() 9. set(Property property, Date date) 10. set(Property property, String[] values)
  • 11. LanguageIdentifier Class This class identifies the language of the given content. LanguageIdentifier(String content): This constructor instantiates a language identifier by passing it a String of text content. String getLanguage(): Returns the language of the content given to the current LanguageIdentifier object.
  • 12. Type Detection in Tika Tika supports all the Internet media document types listed in the MIME standards. Whenever a file is passed to Tika, it detects the file's document type. Type detection using the facade class: the detect() method of the facade class is used to detect the document type. This method accepts a file as input. E.g.: File file = new File("example.mp3"); Tika tika = new Tika(); String filetype = tika.detect(file);
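Besides detect(File), the facade exposes overloads that work from a file name alone or from raw bytes. A hedged sketch (tika-core on the classpath; no file needs to exist, since the name-based overload only inspects the extension and the byte-based overload only inspects leading "magic" bytes):

```java
import org.apache.tika.Tika;

public class TypeDetectionSketch {
    public static void main(String[] args) {
        Tika tika = new Tika();

        // detect(String) looks only at the file-name extension.
        System.out.println(tika.detect("example.mp3")); // audio/mpeg
        System.out.println(tika.detect("report.pdf"));  // application/pdf

        // detect(byte[]) falls back on magic bytes at the start of the
        // content; "%PDF-" marks a PDF document regardless of file name.
        byte[] head = "%PDF-1.4".getBytes();
        System.out.println(tika.detect(head));          // application/pdf
    }
}
```

When both a stream and a file name are available, Tika combines the two signals, which is why the detect(InputStream, Metadata) variant from slide 8 takes a Metadata object carrying the resource name.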
  • 13. Content Extraction Using Tika For parsing documents, the parseToString() method of the Tika facade class is generally used. Shown below are the steps involved in the parsing process; these are abstracted by Tika's parseToString() method.
  • 14. Content Extraction Using the Parser Interface The parser package of Tika provides several interfaces and classes with which we can parse a text document.
  • 15. Content Extraction Using the Parser Interface parse() method: Along with parseToString(), you can also use the parse() method of the Parser interface, whose prototype was shown earlier. Given below is an example that shows how the parse() method is used. 1. Instantiate any of the classes providing the implementation for this interface: Parser parser = new AutoDetectParser(); (or) Parser parser = new CompositeParser(); (or) an object of any individual parser given in the Tika library. 2. Create a handler class object: BodyContentHandler handler = new BodyContentHandler();
  • 16. Content Extraction Using the Parser Interface 3. Create the Metadata object as shown below: Metadata metadata = new Metadata(); 4. Create any of the input stream objects, and pass it the file whose content should be extracted: File file = new File(filepath); FileInputStream inputstream = new FileInputStream(file); (or) InputStream stream = TikaInputStream.get(new File(filename)); 5. Create a ParseContext object as shown below: ParseContext context = new ParseContext(); 6. Invoke the parse method on the parser object, passing all the objects required: parser.parse(inputstream, handler, metadata, context);
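The six steps above can be combined into one runnable sketch. Assumptions: tika-core and tika-parsers 1.x on the classpath, and an in-memory plain-text stream in place of the FileInputStream so the example needs no file fixture:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class ParserInterfaceSketch {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();                // step 1: parser implementation
        BodyContentHandler handler = new BodyContentHandler(); // step 2: collects body text
        Metadata metadata = new Metadata();                    // step 3: empty metadata holder
        InputStream stream = new ByteArrayInputStream(         // step 4: input stream
                "Parsed through the Parser interface".getBytes(StandardCharsets.UTF_8));
        ParseContext context = new ParseContext();             // step 5: parse context

        parser.parse(stream, handler, metadata, context);      // step 6: parse

        System.out.println("Content: " + handler.toString().trim());
        System.out.println("Type   : " + metadata.get("Content-Type"));
    }
}
```

After the call returns, the extracted text sits in the handler and the detected Content-Type (along with any other properties the parser found) sits in the Metadata object, exactly as the Parser interface slide described.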
  • 17. Metadata Extraction Using the Parser Interface Besides content, Tika also extracts the metadata from a file. Metadata is the additional information supplied with a file. For an audio file, the artist name, album name, and title come under metadata. Whenever we parse a file using parse(), we pass an empty Metadata object as one of the parameters. This method extracts the metadata of the given file (if the file contains any) and places it in the Metadata object. Therefore, after parsing the file using parse(), we can extract the metadata from that object.
  • 18. Adding New and Setting Values in Existing Metadata Adding New We can add new metadata values using the add() method of the Metadata class: metadata.add("author", "Tutorials Point"); The Metadata class also has some predefined properties: metadata.add(Metadata.SOFTWARE, "ms paint"); Edit Existing You can set values on existing metadata elements using the set() method: metadata.set(Metadata.DATE, new Date());
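The difference between add() and set() is worth making concrete: add() appends another value under the same name, while set() replaces all existing values. A sketch needing only tika-core (it sticks to string-keyed entries, since constants like Metadata.SOFTWARE are deprecated in recent Tika versions):

```java
import org.apache.tika.metadata.Metadata;

public class MetadataSketch {
    public static void main(String[] args) {
        Metadata metadata = new Metadata();

        // add() appends; calling it twice keeps both values.
        metadata.add("author", "Tutorials Point");
        metadata.add("author", "Second Author");
        System.out.println(metadata.getValues("author").length); // 2

        // set() replaces every existing value for the name.
        metadata.set("author", "Only Author");
        System.out.println(metadata.get("author"));              // Only Author
        System.out.println(metadata.getValues("author").length); // 1
    }
}
```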
  • 19. Language Detection in Tika Of the 184 standard languages standardized by ISO 639-1, Tika can detect 18. Language detection in Tika is done using the getLanguage() method of the LanguageIdentifier class. This method returns the code name of the language (for example, "en" for English) as a String. The original slide listed the 18 language-code pairs detected by Tika.
  • 20. Language Detection in Tika While instantiating the LanguageIdentifier class, you should pass it the String content whose language is to be identified. E.g.: LanguageIdentifier object = new LanguageIdentifier("this is english"); Language Detection of a Document To detect the language of a given document, you have to parse it using the parse() method. The parse() method parses the content and stores it in the handler object, which was passed to it as one of the arguments. E.g.: parser.parse(inputstream, handler, metadata, context); LanguageIdentifier object = new LanguageIdentifier(handler.toString());
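Putting the two snippets above together, a minimal sketch of standalone language identification. Assumption: Tika 1.x, where LanguageIdentifier lives in org.apache.tika.language inside tika-core (later versions deprecate it in favor of the tika-langdetect module):

```java
import org.apache.tika.language.LanguageIdentifier;

public class LanguageSketch {
    public static void main(String[] args) {
        String english = "This is a simple sentence written in the English language.";

        // The constructor takes the text; getLanguage() returns the
        // ISO 639-1 code of the best-matching language profile.
        LanguageIdentifier identifier = new LanguageIdentifier(english);
        System.out.println(identifier.getLanguage()); // typically "en"
    }
}
```

For a real document, the text would come from handler.toString() after a parse() call, as shown on this slide.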