DOI: 10.13140/RG.2.1.2897.3684
Introduction
• There is a real lack of open-source tools to facilitate the development of downstream applications, encourage code reuse, enable comparative studies, and foster further research.
• Existing tools are developed under different scenarios and evaluated in different domains using proprietary language resources, making comparison difficult.
• It is unclear whether and how well these tools can adapt to different domain tasks and scale up to large data.
• Automatic Term Extraction (ATE/ATR) is an important
Natural Language Processing (NLP) task that deals with
the extraction of terminologies from domain-specific
textual corpora.
• ATE is widely used by both industry and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007).
Automatic Term Recognition with Apache Solr
Ziqi Zhang and Jie Gao
1. JATE2.0 is an open-source library (under the LGPLv3 license), available to download via https://github.com/ziqizhang/jate
2. For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.
Contact: Jie Gao j.gao@sheffield.ac.uk, Ziqi Zhang ziqi.zhang@sheffield.ac.uk
OAK Group, Department of Computer Science, University of Sheffield,
Sheffield, S1 4DP, United Kingdom
Figure: Example setting of Part-of-Speech (PoS) pattern-based candidate extraction
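As an illustration of PoS-pattern-based candidate extraction, the following is a minimal sketch. The pattern, tag set and function names are ours, chosen for illustration; they are not JATE2.0's actual configuration syntax.

```python
import re

# A toy PoS pattern in the spirit of noun-phrase extraction: a run of
# adjectives/nouns ending in a noun. JATE2.0 configures such patterns via
# Solr; this regex-over-tags version is only an illustration.
POS_PATTERN = re.compile(r"\b(?:(?:JJ|NN)\s)*NN\b")

def extract_candidates(tagged_tokens):
    """Return candidate terms from (token, tag) pairs by matching the
    pattern against the space-joined tag sequence."""
    tags = " ".join(tag for _, tag in tagged_tokens)
    candidates = []
    for m in POS_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")        # token index of match start
        end = start + m.group(0).count(" ") + 1    # token index past match end
        candidates.append(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return candidates
```

For example, on the pre-tagged input `[("automatic", "JJ"), ("term", "NN"), ("extraction", "NN"), ("is", "VBZ"), ("useful", "JJ")]` this yields the single candidate "automatic term extraction".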
Evaluation
• Two datasets: the GENIA dataset (Kim et al., 2003), a corpus of 1,999 Medline abstracts for bio-text mining previously used by (Zhang et al., 2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing over 10,900 publications in the domain of computational linguistics
• Three types of candidate extractors are tested (NP, N-gram, PoS pattern)
• Overall recall, precision at Top K, and CPU time are measured
Figure 5: Comparison of Top K precisions on ACL RD-TEC
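The precision at Top K reported in the figures is simply the fraction of the K highest-ranked candidates that appear in the gold-standard term list; a minimal sketch (our own helper, not JATE2.0 code):

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked candidate terms that appear in the
    gold-standard set: the 'precision at Top K' metric used above."""
    return sum(1 for term in ranked_terms[:k] if term in gold_terms) / k
```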
Acknowledgements
Part of this research has been sponsored by the EU-funded project WeSenseIt under grant agreement number 308429, and the SPEEAK-PC collaboration agreement 101947 of Innovate UK.
Use cases
ATE in combination with sentiment analysis
• ATE used to improve sentiment analysis for homeland security forces (in both English and Italian)
• Training corpus collection and annotation based on distant supervision
• ATE for text normalization and standardization, and key term extraction (uni-/bi-gram) from the corpus
• Key terms used as features to train sentiment classifiers (SVM, Naïve Bayes, logistic regression)
JATE2.0 for Translation
ATE is a very useful starting point for a human terminologist or translator. JATE2.0 can work with very large corpora efficiently; it is also easy to use and highly configurable for different domains and languages. With more than 10 algorithms, JATE2.0 can simply take a large corpus as input: important, domain-specific terms will be identified, extracted, normalised, ranked and exported with scores to an external file.
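To make the ranking step concrete, here is a sketch of one classic ATE ranking measure, C-value (Frantzi et al.), which is among the algorithms in JATE's collection. The simplified implementation below is our illustration, not JATE2.0's own code.

```python
from math import log2

def c_value(freq):
    """Simplified C-value: rank multiword candidates by frequency, discounting
    terms that mostly occur nested inside longer candidates.
    `freq` maps candidate term -> corpus frequency."""
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates that contain `a` as a nested word sequence.
        longer = [f for b, f in freq.items() if b != a and f" {a} " in f" {b} "]
        if longer:
            score = log2(len(a.split())) * (f_a - sum(longer) / len(longer))
        else:
            score = log2(len(a.split())) * f_a
        scores[a] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For instance, with frequencies {"soft contact lens": 2, "contact lens": 6}, "contact lens" is discounted for its nested occurrences but still ranks first with score 4.0.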
JATE2.0 for knowledge engineering
JATE2.0 can be used as a concept extraction tool to support the creation of a domain ontology or terminology base directly from a text corpus. Users can take a domain-specific corpus as input and use JATE2.0 to generate normalised candidate terms/concepts as a starting point for further ontology engineering. Future versions will support importing the output into Protégé, or working as a Protégé plugin.
Objective
To bring both academia and industry under a uniform development and benchmark framework that addresses:
• Adaptability
• Scalability
• High configurability and extensibility
Solution: JATE 2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible and flexible text processing libraries; it can be used either as a separate module, or as a Solr plugin that enriches indexed documents with candidate terms during document processing.
Unique Features
• Expands JATE 1.0's collection of state-of-the-art algorithms, which are not available in any other tools;
• Linguistic processors (candidate term extraction) are highly customizable and developed as Solr plugins, making JATE2.0 adaptable to many different domains and languages;
• Two usage modes for various usage scenarios, which can be applied directly to digital archives (both indexed and unindexed) in industry.
Usage Modes
Embedded mode: runs as a standalone application from the command line. This mode is recommended when users need a list of candidate terms extracted from a corpus to support subsequent knowledge engineering tasks.
Plugin mode: works as a Solr plugin. This mode is recommended when users need to index new documents, or enrich an existing index with candidate terms, which can, e.g., support faceted search and query boosting (implemented as a custom request handler that performs term extraction in response to a simple HTTP request).
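In plugin mode the request handler is invoked over HTTP. The exact handler path and parameters depend on how the plugin is registered in Solr, so the names below (`/termcandidates`, `algorithm`) are purely illustrative, not JATE2.0's documented API:

```python
from urllib.parse import urlencode

def term_extraction_url(solr_base, core, algorithm="CValue"):
    """Build the URL for a hypothetical JATE2.0 term-extraction request
    handler. The path and parameter names are illustrative only; the real
    ones depend on the plugin's registration in solrconfig.xml."""
    query = urlencode({"algorithm": algorithm, "wt": "json"})
    return f"{solr_base}/{core}/termcandidates?{query}"
```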
JATE2.0 Architecture
1. Parses ingested documents to raw text content and performs character-level normalisation.
2. The 'cleansed' text is then passed through the candidate extraction component (as a Solr analyzer chain).
3. Candidate terms are loaded from the Solr index and processed by the subsequent filtering component, where different ATE algorithms can be configured.
4. Candidate terms can be indexed or exported to support specific use cases (e.g., faceted query, knowledge base construction).
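The four stages above can be sketched end-to-end. The trivial stand-in implementations (whitespace normalisation, bigram candidates, frequency ranking, tab-separated export) are ours; JATE2.0's real components are Solr-based:

```python
import re

def normalise(raw):
    """Step 1: character-level normalisation (toy version: lowercase,
    strip punctuation, collapse whitespace)."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", raw.lower())).strip()

def extract_candidates(text):
    """Step 2: candidate extraction (toy version: word bigrams instead of
    a Solr analyzer chain)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def filter_and_rank(candidates):
    """Step 3: filtering/scoring (toy version: raw frequency in place of a
    configurable ATE algorithm)."""
    freq = {}
    for c in candidates:
        freq[c] = freq.get(c, 0) + 1
    return sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

def export(ranked):
    """Step 4: export candidate terms with scores (tab-separated lines)."""
    return [f"{term}\t{score}" for term, score in ranked]
```

Chaining the four functions over a small input, e.g. `export(filter_and_rank(extract_candidates(normalise(...))))`, mirrors the pipeline's data flow from raw documents to scored terms.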
Figure 4: Comparison of Top K precisions on GENIA
Terminology-driven Faceted Search for interactive cause analysis
• TATA Steel scenario: cause analysis via text analytics
• Goal: to understand the types of potential factors and actions that lead to product failures
• Users (domain experts) collect and select unstructured documentation (e.g., Lotus Notes) from various data sources
• JATE 2.0 is applied to the documents to extract industrial terms for analyzing and linking domain-relevant concepts from textual data
• Terms used to enable dynamic faceted search/navigation for concept-driven text analytics