DOI: 10.13140/RG.2.1.2897.3684
Introduction
• There is a real lack of open-source tools to facilitate the development of downstream applications, encourage code reuse, enable comparative studies, and foster further research.
• Existing tools are developed under different scenarios and evaluated in different domains using proprietary language resources, making comparison difficult.
• It is unclear whether and how well these tools can adapt to different domain tasks and scale up to large data.
• Automatic Term Extraction (ATE/ATR) is an important
Natural Language Processing (NLP) task that deals with
the extraction of terminologies from domain-specific
textual corpora.
• ATE is widely used by both industry and researchers in many complex tasks, such as Information Retrieval (IR), machine translation, ontology engineering and text summarization (Bowker, 2003; Brewster et al., 2007; Maynard et al., 2007).
Automatic Term Recognition with Apache Solr
Ziqi Zhang and Jie Gao
1. JATE2.0 is an open-source library (under the LGPLv3 license), available to download via https://github.com/ziqizhang/jate
2. For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.
Contact: Jie Gao j.gao@sheffield.ac.uk, Ziqi Zhang ziqi.zhang@sheffield.ac.uk
OAK Group, Department of Computer Science, University of Sheffield,
Sheffield, S1 4DP, United Kingdom
Figure: Example setting of Part-of-Speech (PoS) pattern-based candidate extraction
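As an illustration of PoS-pattern-based candidate extraction, the following is a minimal sketch. The pattern, tag set and function names are ours, chosen for illustration; they are not JATE2.0's actual configuration syntax.

```python
import re

# A toy PoS pattern in the spirit of noun-phrase extraction: a run of
# adjectives/nouns ending in a noun. JATE2.0 configures such patterns via
# Solr; this regex-over-tags version is only an illustration.
POS_PATTERN = re.compile(r"\b(?:(?:JJ|NN)\s)*NN\b")

def extract_candidates(tagged_tokens):
    """Return candidate terms from (token, tag) pairs by matching the
    pattern against the space-joined tag sequence."""
    tags = " ".join(tag for _, tag in tagged_tokens)
    candidates = []
    for m in POS_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")        # token index of match start
        end = start + m.group(0).count(" ") + 1    # token index past match end
        candidates.append(" ".join(tok for tok, _ in tagged_tokens[start:end]))
    return candidates
```

For example, on the pre-tagged input `[("automatic", "JJ"), ("term", "NN"), ("extraction", "NN"), ("is", "VBZ"), ("useful", "JJ")]` this yields the single candidate "automatic term extraction".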
Evaluation
• Two datasets: the GENIA dataset (Kim et al., 2003), a corpus of 1,999 Medline abstracts for bio-text mining previously used by (Zhang et al., 2008); and the ACL RD-TEC dataset (Zadeh and Handschuh, 2014), containing over 10,900 publications in the domain of computational linguistics
• Three types of candidate extractors are tested (NP, N-gram, PoS pattern)
• Overall recall, precision at Top K, and CPU time are measured
Figure 5: Comparison of Top K precisions on ACL RD-TEC
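The precision at Top K reported in the figures is simply the fraction of the K highest-ranked candidates that appear in the gold-standard term list; a minimal sketch (our own helper, not JATE2.0 code):

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked candidate terms that appear in the
    gold-standard set: the 'precision at Top K' metric used above."""
    return sum(1 for term in ranked_terms[:k] if term in gold_terms) / k
```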
Acknowledgements
Part of this research has been sponsored by the EU-funded project WeSenseIt under grant agreement number 308429, and the SPEEAK-PC collaboration agreement 101947 of Innovate UK.
Use cases
ATE in combination with sentiment analysis
• ATE used to improve sentiment analysis for homeland security forces (in both English and Italian)
• Training corpus collection and annotation based on distant supervision
• ATE for text normalization and standardization, and key term extraction (uni-/bi-gram) from the corpus
• Key terms used as features to train sentiment classifiers (SVM, Naïve Bayes, logistic regression)
JATE2.0 for Translation
ATE is a very useful starting point for a human terminologist or translator. JATE2.0 can work with very large corpora efficiently; it is also easy to use and highly configurable for different domains and languages. With more than 10 algorithms, JATE2.0 can simply take a large corpus as input: important, domain-specific terms will be identified, extracted, normalised, ranked and exported with scores to an external file.
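To make the ranking step concrete, here is a sketch of one classic ATE ranking measure, C-value (Frantzi et al.), which is among the algorithms in JATE's collection. The simplified implementation below is our illustration, not JATE2.0's own code.

```python
from math import log2

def c_value(freq):
    """Simplified C-value: rank multiword candidates by frequency, discounting
    terms that mostly occur nested inside longer candidates.
    `freq` maps candidate term -> corpus frequency."""
    scores = {}
    for a, f_a in freq.items():
        # Longer candidates that contain `a` as a nested word sequence.
        longer = [f for b, f in freq.items() if b != a and f" {a} " in f" {b} "]
        if longer:
            score = log2(len(a.split())) * (f_a - sum(longer) / len(longer))
        else:
            score = log2(len(a.split())) * f_a
        scores[a] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For instance, with frequencies {"soft contact lens": 2, "contact lens": 6}, "contact lens" is discounted for its nested occurrences but still ranks first with score 4.0.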
JATE2.0 for knowledge engineering
JATE2.0 can be used as a concept extraction tool to support the creation of a domain ontology or terminology base directly from a text corpus. Users can take a domain-specific corpus as input and use JATE2.0 to generate normalised candidate terms/concepts as a starting point for further ontology engineering. Future versions will support importing the output into Protégé, or working as a Protégé plugin.
Objective
To bring both academia and industry under a uniform development and benchmark framework that addresses:
• Adaptability
• Scalability
• High configurability and extensibility
Solution: JATE 2.0 integrates with the Apache Solr framework to benefit from its extensive, extensible and flexible text processing libraries; it can be used either as a separate module, or as a Solr plugin that enriches indexed documents with candidate terms during document processing.
Unique Features
• Expands JATE 1.0's collection of state-of-the-art algorithms, which are not available in any other tools;
• Linguistic processors (candidate term extraction) are highly customizable and developed as Solr plugins, making JATE2.0 adaptable to many different domains and languages;
• Two usage modes for various usage scenarios, which can be applied directly to digital archives (both indexed and unindexed) in industry.
Usage Modes
Embedded mode: runs as a standalone application from the command line. This mode is recommended when users need a list of candidate terms extracted from a corpus to support subsequent knowledge engineering tasks.
Plugin mode: works as a Solr plugin. This mode is recommended when users need to index new documents, or enrich an existing index with candidate terms, which can, e.g., support faceted search and query boosting (implemented as a custom request handler that performs term extraction in response to a simple HTTP request).
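In plugin mode the request handler is invoked over HTTP. The exact handler path and parameters depend on how the plugin is registered in Solr, so the names below (`/termcandidates`, `algorithm`) are purely illustrative, not JATE2.0's documented API:

```python
from urllib.parse import urlencode

def term_extraction_url(solr_base, core, algorithm="CValue"):
    """Build the URL for a hypothetical JATE2.0 term-extraction request
    handler. The path and parameter names are illustrative only; the real
    ones depend on the plugin's registration in solrconfig.xml."""
    query = urlencode({"algorithm": algorithm, "wt": "json"})
    return f"{solr_base}/{core}/termcandidates?{query}"
```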
JATE2.0 Architecture
1. Parses ingested documents to raw text content and performs character-level normalisation.
2. The 'cleansed' text is then passed through the candidate extraction component (as a Solr analyzer chain).
3. Candidate terms are loaded from the Solr index and processed by the subsequent filtering component, where different ATE algorithms can be configured.
4. Candidate terms can be indexed or exported to support specific use cases (e.g., faceted query, knowledge base construction).
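The four stages above can be sketched end-to-end. The trivial stand-in implementations (whitespace normalisation, bigram candidates, frequency ranking, tab-separated export) are ours; JATE2.0's real components are Solr-based:

```python
import re

def normalise(raw):
    """Step 1: character-level normalisation (toy version: lowercase,
    strip punctuation, collapse whitespace)."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", raw.lower())).strip()

def extract_candidates(text):
    """Step 2: candidate extraction (toy version: word bigrams instead of
    a Solr analyzer chain)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def filter_and_rank(candidates):
    """Step 3: filtering/scoring (toy version: raw frequency in place of a
    configurable ATE algorithm)."""
    freq = {}
    for c in candidates:
        freq[c] = freq.get(c, 0) + 1
    return sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

def export(ranked):
    """Step 4: export candidate terms with scores (tab-separated lines)."""
    return [f"{term}\t{score}" for term, score in ranked]
```

Chaining the four functions over a small input, e.g. `export(filter_and_rank(extract_candidates(normalise(...))))`, mirrors the pipeline's data flow from raw documents to scored terms.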
Figure 4: Comparison of Top K precisions on GENIA
Terminology-driven Faceted Search for interactive cause analysis
• TATA Steel scenario: cause analysis via text analytics
• Goal: to understand the types of potential factors and actions that lead to product failures
• Users (domain experts) collect and select unstructured documentation (e.g., Lotus Notes) from various data sources
• JATE 2.0 is applied to the documents to extract industrial terms for analyzing and linking domain-relevant concepts from textual data
• Terms used to enable dynamic faceted search/navigation for concept-driven text analytics