•  There is a real lack of open-source tools that facilitate the
development of downstream applications, encourage code reuse and
comparative studies, and foster further research.
•  Existing tools are developed under different scenarios and
evaluated in different domains using proprietary language
resources, making comparison difficult.
•  It is unclear whether and how well these tools can adapt to
different domain tasks and scale up to large data.
•  Automatic Term Extraction (ATE/ATR) is an important
Natural Language Processing (NLP) task that deals with
the extraction of terminologies from domain-specific
textual corpora.
•  It is widely used by both industry and researchers in many
complex tasks, such as Information Retrieval (IR),
machine translation, ontology engineering and text
summarization (Bowker, 2003; Brewster et al., 2007;
Maynard et al., 2007).
JATE2.0 Architecture
Automatic Term Recognition with Apache Solr
Ziqi Zhang and Jie Gao
1.  JATE2.0 is an open-source library (under the LGPLv3 license), available to download via https://github.com/ziqizhang/jate
2.  For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.
Contact: Jie Gao j.gao@sheffield.ac.uk, Ziqi Zhang ziqi.zhang@sheffield.ac.uk
OAK Group, Department of Computer Science, University of Sheffield,
Sheffield, S1 4DP, United Kingdom
Example setting of Part-of-Speech (PoS) pattern based candidate extraction
Acknowledgements
Unique Features
Use cases
Usage Modes
ATE algorithms in JATE2.0 (beta)
Evaluation
•  Two datasets: the GENIA dataset (Kim et al., 2003),
containing 1,999 MEDLINE abstracts for
bio-text mining and previously used by Zhang et al.
(2008); and the ACL RD-TEC dataset (Zadeh
and Handschuh, 2014), containing over 10,900
publications in the domain of computational
linguistics
•  3 types of candidate extractors are tested (NP,
N-gram, POS pattern)
•  Overall recall, precision at Top K, and CPU time
are measured
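The two quality measures named above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for the poster: the function names and the toy term lists are assumptions, and precision is divided by the number of candidates actually returned when fewer than K are available (conventions differ on this point).

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked candidates that appear in the gold standard."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for term in top_k if term in gold_terms)
    return hits / len(top_k)


def overall_recall(candidates, gold_terms):
    """Fraction of gold-standard terms found anywhere in the candidate list."""
    if not gold_terms:
        return 0.0
    return len(set(candidates) & gold_terms) / len(gold_terms)


# Toy data, invented for illustration only (not taken from GENIA or ACL RD-TEC).
ranked = ["t cell", "gene expression", "the study", "nf kappa b", "result"]
gold = {"t cell", "gene expression", "nf kappa b", "transcription factor"}

p_at_2 = precision_at_k(ranked, gold, 2)
p_at_5 = precision_at_k(ranked, gold, 5)
recall = overall_recall(ranked, gold)
```

On this toy input the top two candidates are both correct, three of the five candidates are correct overall, and three of the four gold terms are recovered.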
Figure 5: Comparison of Top K precisions on ACL RD-TEC
Part of this research has been sponsored by the EU-funded project
WeSenseIt under grant agreement number 308429, and by the
SPEEAK-PC collaboration agreement 101947 of Innovate UK.
Terminology-driven Faceted Search for interactive cause analysis
ATE in combination with sentiment analysis
•  ATE is used to improve the sentiment analysis used by homeland
security forces (for both English and Italian)
•  Training corpus collection and annotation are based on
distant supervision
•  ATE supports text normalization and standardization, and key term
extraction (uni-/bi-gram) from the corpus
•  Key terms are used as features to train sentiment classifiers
(SVM, Naïve Bayes, logistic regression)
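As a rough illustration of the last point, extracted key terms can be turned into binary features for a classifier. The sketch below is an assumption about how such features might be built, not the project's actual code: the key terms and example sentences are invented, tokenisation is a plain whitespace split, and the classifier itself (SVM, Naïve Bayes or logistic regression) is left out.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_features(text, key_terms):
    """Binary vector: 1 if the uni-/bi-gram key term occurs in the text, else 0."""
    tokens = text.lower().split()
    present = set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))
    return [1 if term in present else 0 for term in key_terms]


# Hypothetical key terms, standing in for the output of term extraction.
key_terms = ["border control", "delay", "helpful", "long queue"]
docs = ["Border control was helpful", "Long queue and another delay"]

vectors = [term_features(d, key_terms) for d in docs]
```

Each vector has one position per key term, so the rows can be passed to any standard classifier together with the sentiment labels.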
JATE2.0 for Translation
ATE is a very useful starting point for a human terminologist
or translator. JATE2.0 can process very large corpora
efficiently. It is also easy to use and highly configurable for
different domains and languages. With more than
10 algorithms available, JATE2.0 takes a large corpus as
input; important, domain-specific terms are
identified, extracted, normalised, ranked and exported
with their scores to an external file.
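The final export step can be pictured as follows. The poster does not describe JATE2.0's output format, so the CSV layout, column names and scores here are purely hypothetical; the sketch only shows ranking terms by score and writing them out.

```python
import csv
import io


def export_terms(scored_terms, out):
    """Write terms to `out` as CSV, highest score first (ties broken alphabetically)."""
    ranked = sorted(scored_terms.items(), key=lambda kv: (-kv[1], kv[0]))
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["rank", "term", "score"])
    for rank, (term, score) in enumerate(ranked, start=1):
        writer.writerow([rank, term, f"{score:.4f}"])
    return ranked


# Invented scores for illustration; real scores depend on the ATE algorithm used.
scores = {"machine translation": 7.5, "term extraction": 12.25, "corpus": 3.0}

buffer = io.StringIO()  # stands in for an open text file
export_terms(scores, buffer)
csv_text = buffer.getvalue()
```

Replacing the `StringIO` buffer with `open("terms.csv", "w", newline="")` would write the same rows to disk.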
JATE2.0 for knowledge engineering
JATE2.0 can be used as a concept extraction tool to support the
creation of a domain ontology or a terminology base directly
from a text corpus. Users can take a domain-specific corpus as
input and use JATE2.0 to generate normalised candidate
terms/concepts as a starting point for further ontology engineering.
A future version will support importing the output into Protege or
working as a Protege plugin.
To bring both academia and industry under a
uniform development and benchmarking
framework that addresses:
•  Adaptability
•  Scalability
•  High configurability and extensibility
Solution: JATE 2.0 integrates with the Apache Solr
framework to benefit from its extensive,
extensible, flexible text processing libraries; it
can be used either as a separate module, or as
a Solr plugin during document processing
to enrich the indexed documents with candidate
terms.
•  Expands the JATE 1.0 collection of state-of-the-art algorithms, which are not
available in any other tool;
•  Linguistic processors (candidate term extraction) are highly customizable and
developed as Solr plugins, making JATE2.0 adaptable to many different
domains and languages;
•  Two usage modes cover various usage scenarios and can be applied directly to
digital archives in industry (both indexed and not indexed);
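To make the idea of a configurable linguistic processor concrete, here is a minimal sketch of PoS-pattern based candidate extraction in plain Python. It does not reproduce JATE2.0's Solr configuration: the tag names, the pattern (any run of adjectives and nouns ending in a noun) and the pre-tagged input are all assumptions chosen for illustration.

```python
import re

# Hypothetical tag set and pattern: zero or more adjectives/nouns, then a noun.
TAG_CODES = {"ADJ": "A", "NOUN": "N"}
PATTERN = re.compile(r"[AN]*N")


def extract_candidates(tagged_tokens):
    """Return the longest non-overlapping token spans whose tags match PATTERN."""
    # One character per token, so regex offsets map directly to token indices.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    candidates = []
    for match in PATTERN.finditer(codes):
        words = [w.lower() for w, _ in tagged_tokens[match.start():match.end()]]
        candidates.append(" ".join(words))
    return candidates


# Input is assumed to be already PoS-tagged by some external tagger.
tagged = [("Automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("an", "DET"), ("important", "ADJ"), ("task", "NOUN")]

candidates = extract_candidates(tagged)
```

Adapting to another domain or language then amounts to swapping the tagger and editing the pattern, which is the kind of customisation the feature list refers to. Note that this sketch returns only maximal matches, so nested sub-terms such as "term extraction" are not listed separately.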
Embedded mode: runs as a standalone application
from the command line. This mode is recommended
when users need a list of candidate terms
extracted from a corpus to support
subsequent knowledge engineering tasks.
Plugin mode: works as a Solr plugin. This mode
is recommended when users need to index new
content or enrich an existing index with candidate terms,
which can, e.g., support faceted search or query
boosting (implemented as a custom request
handler that performs term extraction via a
simple HTTP request).
Introduction
Objective
Photo credit to K-NOW
1.  Ingested documents are parsed to raw text content and
character-level normalisation is performed.
2.  The ‘cleansed’ text is then passed through the candidate
extraction component (a Solr analyzer chain).
3.  Candidate terms are loaded from the Solr index and processed by the
subsequent filtering component, where different ATE algorithms
can be configured.
4.  Candidate terms can be indexed or exported to
support specific use cases (e.g., faceted query,
knowledge base construction).
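The four stages above can be mimicked end to end in a few lines of plain Python. This is only a conceptual sketch under strong assumptions: it uses no Solr components, the stopword list is invented, candidates are simple n-grams, and raw frequency stands in for a real ATE algorithm.

```python
import re
import unicodedata
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for"}


def normalise(raw):
    """Step 1: character-level normalisation (Unicode NFKC, lower-case, squeeze spaces)."""
    text = unicodedata.normalize("NFKC", raw).lower()
    return re.sub(r"\s+", " ", text).strip()


def extract_candidates(text, max_n=2):
    """Step 2: n-gram candidates that neither start nor end with a stopword."""
    candidates = []
    for sentence in re.split(r"[.!?;]", text):
        tokens = re.findall(r"[a-z0-9]+", sentence)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                    candidates.append(" ".join(gram))
    return candidates


def score(candidates, min_freq=2):
    """Step 3: filter and rank by raw frequency (placeholder for an ATE algorithm)."""
    counts = Counter(candidates)
    kept = [(term, c) for term, c in counts.items() if c >= min_freq]
    return sorted(kept, key=lambda tc: (-tc[1], tc[0]))


def run_pipeline(documents):
    """Step 4: return ranked terms, ready to be indexed or exported."""
    candidates = []
    for doc in documents:
        candidates.extend(extract_candidates(normalise(doc)))
    return score(candidates)


docs = ["Term extraction is useful. Term   extraction scales.",
        "Faceted search uses term extraction."]
ranked_terms = run_pipeline(docs)
```

Keeping each stage as a separate function mirrors the architecture's separation of candidate extraction from scoring, so either stage can be swapped without touching the other.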
Figure 4: Comparison of Top K precisions on GENIA
•  TATA Steel scenario: cause analysis via text analytics
•  Goal: to understand the types of potential factors and actions that lead
to product failures
•  Users (domain experts) collect and select unstructured
documents (e.g., Lotus Notes) from various data sources
•  JATE 2.0 is applied to the documents to extract industrial terms for
analyzing and linking domain-relevant concepts from textual data
•  Terms are used to enable dynamic faceted search/navigation for
concept-driven text analytics
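The faceted navigation described above can be illustrated with a toy term-to-document index. In a real deployment the search engine would compute facets itself; this plain-Python stand-in, with invented document IDs and terms (not taken from the TATA Steel data), only shows the drill-down logic.

```python
from collections import defaultdict


def build_term_index(doc_terms):
    """Map each extracted term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def facet_counts(index, doc_ids=None):
    """Count documents per term, optionally restricted to a subset of documents."""
    counts = {}
    for term, docs in index.items():
        n = len(docs if doc_ids is None else docs & doc_ids)
        if n:
            counts[term] = n
    return sorted(counts.items(), key=lambda tc: (-tc[1], tc[0]))


def drill_down(index, selected_terms):
    """Return the documents that contain every selected facet term."""
    sets = [index.get(term, set()) for term in selected_terms]
    return set.intersection(*sets) if sets else set()


# Hypothetical documents and terms, for illustration only.
doc_terms = {
    "note-1": ["surface defect", "cooling rate"],
    "note-2": ["surface defect", "roll wear"],
    "note-3": ["cooling rate"],
}
index = build_term_index(doc_terms)

facets = facet_counts(index)
matching = drill_down(index, ["surface defect"])
refined = facet_counts(index, matching)
```

Selecting a facet narrows the document set, and recomputing the counts on that subset shows which other terms co-occur with it, which is the basis of the interactive cause analysis.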
