•  There is a real lack of open-source tools that facilitate the
development of downstream applications, encourage code reuse
and comparative studies, and foster further research.
•  Existing tools are developed under different scenarios and
evaluated in different domains using proprietary language
resources, making comparison difficult.
•  It is unclear whether and how well these tools can adapt to
different domain tasks and scale up to large data.
•  Automatic Term Extraction (ATE/ATR) is an important
Natural Language Processing (NLP) task that deals with
the extraction of terminologies from domain-specific
textual corpora.
•  ATE is widely used by both industry and researchers in many
complex tasks, such as Information Retrieval (IR),
machine translation, ontology engineering and text
summarization (Bowker, 2003; Brewster et al., 2007;
Maynard et al., 2007).
JATE2.0 Architecture
Automatic Term Recognition with Apache Solr
Ziqi Zhang and Jie Gao
1.  JATE2.0 is an open-source library (under the LGPLv3 license), available to download via https://github.com/ziqizhang/jate
2.  For more examples of JATE2.0 usage scenarios and ideas, please refer to the JATE2.0 wiki.
Contact: Jie Gao j.gao@sheffield.ac.uk, Ziqi Zhang ziqi.zhang@sheffield.ac.uk
OAK Group, Department of Computer Science, University of Sheffield,
Sheffield, S1 4DP, United Kingdom
Example setting of Part-of-Speech (PoS) pattern based candidate extraction
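As an illustration of PoS pattern based candidate extraction, the following is a minimal sketch, not the JATE2.0 configuration itself. It assumes the text has already been tokenised and tagged, and the pattern (any run of adjectives and nouns ending in a noun), the tag mapping and the function name are hypothetical choices made for this example.

```python
import re

# Hypothetical pattern: zero or more adjectives/nouns followed by a noun.
# Each PoS tag is mapped to one letter so a plain regex can scan the tag sequence.
TAG_TO_CODE = {"JJ": "A", "NN": "N", "NNS": "N", "NNP": "N"}
PATTERN = re.compile(r"[AN]*N")


def extract_candidates(tagged_tokens):
    """Return the token spans whose PoS tag sequence matches PATTERN."""
    # Tags outside the mapping become "x", which never matches the pattern.
    codes = "".join(TAG_TO_CODE.get(tag, "x") for _, tag in tagged_tokens)
    candidates = []
    for match in PATTERN.finditer(codes):
        words = [tok for tok, _ in tagged_tokens[match.start():match.end()]]
        candidates.append(" ".join(words))
    return candidates


tagged = [("automatic", "JJ"), ("term", "NN"), ("extraction", "NN"),
          ("is", "VBZ"), ("a", "DT"), ("useful", "JJ"), ("task", "NN")]
print(extract_candidates(tagged))
```

Because one letter stands for one token, match offsets in the tag string map directly back to token positions.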
Acknowledgements
Unique Features
Use cases
Usage Modes
ATE algorithms in JATE2.0 (beta)
Evaluation
•  Two datasets: the GENIA dataset (Kim et al., 2003),
a corpus of 1,999 MEDLINE abstracts for
bio-text mining previously used by Zhang et al.
(2008); and the ACL RD-TEC dataset (Zadeh
and Handschuh, 2014), containing over 10,900
publications in the domain of computational
linguistics
•  Three types of candidate extractors are tested (NP,
N-gram, PoS pattern)
•  Overall recall, precision at Top K, and CPU time
are measured
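The two accuracy measures named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the poster: the function names and the toy term lists are hypothetical, and terms are compared by exact string match.

```python
def precision_at_k(ranked_terms, gold_terms, k):
    """Fraction of the top-k ranked candidates that appear in the gold standard."""
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for term in top_k if term in gold_terms)
    # Assumption: if fewer than k candidates exist, divide by the number available.
    return hits / len(top_k)


def overall_recall(candidate_terms, gold_terms):
    """Fraction of gold-standard terms found anywhere in the candidate list."""
    if not gold_terms:
        return 0.0
    return len(set(candidate_terms) & gold_terms) / len(gold_terms)


# Toy data, invented for illustration only.
ranked = ["t cell", "the study", "gene expression", "results", "nf kappa b"]
gold = {"t cell", "gene expression", "nf kappa b", "il 2"}

print(precision_at_k(ranked, gold, 2))
print(precision_at_k(ranked, gold, 5))
print(overall_recall(ranked, gold))
```

Precision at Top K depends on the ranking produced by the ATE algorithm, whereas overall recall depends only on which candidates the extractor produced.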
Figure 5: Comparison of Top K precisions on ACL RD-TEC
Part of this research has been sponsored by the EU-funded project
WeSenseIt under grant agreement number 308429, and by the
SPEEAK-PC collaboration agreement 101947 of Innovate UK.
Terminology-driven Faceted Search for interactive cause analysis
ATE in combination with sentiment analysis
•  ATE used to improve sentiment analysis used by homeland
security forces (both English and Italian)
•  Training corpus collection and annotation based on
distant supervision
•  ATE for text normalization & standardization, key term
extraction (uni-/bi-gram) from corpus
•  Key terms used as features to train sentiment classifiers
(SVM, Naïve Bayes, logistic regression)
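A minimal sketch of how extracted key terms could be turned into classifier features is given below. It assumes whitespace tokenisation and binary presence features over uni- and bi-grams; the function name, the key terms and the sample sentence are hypothetical and not taken from the project.

```python
def term_features(text, key_terms):
    """Return one binary feature per key term: 1 if the term occurs in the text."""
    tokens = text.lower().split()
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    present = unigrams | bigrams
    return [1 if term in present else 0 for term in key_terms]


# Hypothetical key terms, as an ATE step might produce them.
key_terms = ["flood", "water level", "safe"]
vector = term_features("Water level rising near the bridge", key_terms)
print(vector)
```

Vectors of this form can be passed to any of the classifiers listed above; the choice of binary rather than weighted features is an assumption of this sketch.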
JATE2.0 for Translation
ATE is a very useful starting point for a human terminologist
or translator. JATE2.0 can process very large corpora
efficiently. It is also easy to use and highly configurable for
different domains and languages. With more than
10 algorithms to choose from, JATE2.0 takes a
large corpus as input. Important, domain-specific terms are
identified, extracted, normalised, ranked and exported
with scores to an external file.
JATE2.0 for knowledge engineering
JATE2.0 can be used as a concept extraction tool to support the
creation of a domain ontology or a terminology base directly
from a text corpus. Users can take a domain-specific corpus as
input and use JATE2.0 to generate normalised candidate terms/
concepts as a starting point for further ontology engineering.
A future version will support importing the output into Protégé or
working as a Protégé plugin.
To bring both academia and industry under a
uniform development and benchmarking
framework that addresses:
•  Adaptability
•  Scalability
•  High configurability and extensibility
Solution: JATE 2.0 integrates with the Apache Solr
framework to benefit from its extensive,
extensible, flexible text processing libraries; it
can be used either as a separate module or as
a Solr plugin during document processing
to enrich the indexed documents with candidate
terms.
•  Expands the JATE 1.0 collection of state-of-the-art algorithms, which are not
available in any other tool;
•  Linguistic processors (candidate term extraction) are highly customizable and
developed as Solr plugins, making JATE2.0 adaptable to many different
domains and languages;
•  Two usage modes cover different usage scenarios and can be applied directly to
digital archives in industry, whether already indexed or not;
Embedded mode: runs as a standalone application
from the command line. This mode is recommended
when users need a list of candidate terms
extracted from a corpus to support
subsequent knowledge engineering tasks.
Plugin mode: works as a Solr plugin. This mode
is recommended when users need to build a new
index or enrich an existing index with candidate terms,
which can, e.g., support faceted search or query
boosting (implemented as a custom request
handler that triggers term extraction with a
simple HTTP request)
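To illustrate what triggering extraction through a request handler might look like in plugin mode, the sketch below only builds the request URL. The handler name `/termExtraction` and the core name `mycore` are hypothetical placeholders, not the names registered by JATE2.0; the `/solr/<core>/<handler>` path layout, port 8983 and the `wt` parameter follow common Solr conventions.

```python
from urllib.parse import urlencode, urlunsplit


def build_extraction_url(host, core, handler, params):
    """Build the URL of a Solr request handler on the given core."""
    path = f"/solr/{core}/{handler.lstrip('/')}"
    return urlunsplit(("http", host, path, urlencode(params), ""))


# Hypothetical core and handler names; check the Solr configuration for the real ones.
url = build_extraction_url("localhost:8983", "mycore", "/termExtraction", {"wt": "json"})
print(url)
```

The resulting URL could then be requested with `urllib.request.urlopen(url)` against a running Solr instance in which such a handler has been configured.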
Introduction
Objective
Photo credit to K-NOW
1.  Ingested documents are parsed to raw text content
and character-level normalisation is performed
2.  'Cleansed' text is then passed
through the candidate
extraction component (as a Solr
analyzer chain)
3.  Candidate terms are loaded from the Solr
index and processed by the subsequent
filtering component, where different ATE
algorithms can be configured
4.  Candidate terms can be indexed or exported to
support specific use cases (e.g., faceted query,
knowledge base construction)
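The four stages above can be mimicked in a few lines of plain Python. This is a toy sketch of the data flow only, assuming n-gram candidates and raw frequency as a stand-in for a real ATE algorithm; it does not use Solr, and every name and the stopword list are hypothetical.

```python
import re
import unicodedata
from collections import Counter

STOPWORDS = {"the", "of", "a", "is", "and", "in", "to"}


def normalise(raw):
    """Stage 1: character-level normalisation of raw text."""
    text = unicodedata.normalize("NFKC", raw)
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_candidates(text, max_n=2):
    """Stage 2: n-gram candidates that do not start or end with a stopword."""
    tokens = re.findall(r"[a-z0-9]+", text)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)


def rank_terms(documents):
    """Stages 3 and 4: score candidates (here by frequency) and return a ranked list."""
    counts = Counter()
    for doc in documents:
        counts.update(extract_candidates(normalise(doc)))
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


docs = ["Term extraction  is useful.", "Term extraction in Solr."]
ranked = rank_terms(docs)
for term, score in ranked:
    print(f"{term}\t{score}")
```

Keeping candidate extraction separate from scoring mirrors the architecture described above, where the filtering component can be swapped without re-extracting candidates.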
Figure 4: Comparison of Top K precisions on GENIA
•  TATA Steel Scenario: cause analysis via text analytics
•  To understand the types of potential factors and actions that lead
to product failures
•  Users (domain experts) collect and select unstructured
documents (e.g., Lotus Notes) from various data sources
•  JATE 2.0 is applied to the documents to extract industrial terms for
analyzing and linking domain-relevant concepts from textual data
•  Terms are used to enable dynamic faceted search/navigation for
concept-driven text analytics
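The idea of term-driven faceted navigation can be sketched without a search engine. The example below assumes each document has already been annotated with extracted terms; the document ids and terms are invented for illustration and the function names are hypothetical.

```python
from collections import Counter


def facet_counts(docs):
    """Count, for every term, the number of documents it occurs in."""
    counts = Counter()
    for terms in docs.values():
        counts.update(set(terms))
    return counts


def filter_by_term(docs, term):
    """Return the ids of documents annotated with the selected facet term."""
    return sorted(doc_id for doc_id, terms in docs.items() if term in terms)


# Invented documents and terms, for illustration only.
docs = {
    "report-1": ["surface defect", "cooling rate"],
    "report-2": ["surface defect"],
    "report-3": ["cooling rate", "coil"],
}

print(facet_counts(docs).most_common())
print(filter_by_term(docs, "cooling rate"))
```

Selecting a facet narrows the document set, and recomputing the counts on the narrowed set gives the next level of navigation.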
