
OpenAIRE-COAR conference 2014: Argo - a platform for interoperable and customisable text analytics, by Sophia Ananiadou - University of Manchester

Presentation at the OpenAIRE-COAR Conference: "Open Access Movement to Reality: Putting the Pieces Together", Athens - May 21-22, 2014.
Argo: a platform for interoperable and customisable text analytics, by Sophia Ananiadou - School of Computer Science, Director, National Centre for Text Mining, University of Manchester


  1. Argo: a platform for interoperable and customisable text mining
     Sophia Ananiadou, National Centre for Text Mining, School of Computer Science, The University of Manchester
  2. Overview
     • Sharing tools, resources and text mining workflows
     • Challenges
     • Interoperable infrastructure for processing and annotation
     Footer: OpenAIRE-COAR Conference, Ananiadou
  3. NaCTeM
     • 1st publicly funded national text mining centre
     • Location: Manchester Institute of Biotechnology
     • Phase I - Biology (2004-2008)
     • Phase II - Biology, Medicine, Social Sciences (2008-2011)
     • Phase III - Biology, Medicine, Humanities, Social Sciences; fully sustainable centre (2011- )
     www.nactem.ac.uk
  4. Challenges: Language Technology
     Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
     Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …
     Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
     Text Types: Newswire, Scientific Literature, Full papers/abstracts, Twitter, Patents, Clinical records, EMR, Textbooks, monographs, Online forums, …
     Technology: Sentence Splitter, Paragraph Splitter, NP Chunkers, C-parser, D-parser, Semantic parser, NE recognizers, Relation recognizers, …
     Diversity of Languages, Diversity of Contexts, Diversity of Applications
     TM Workflows, TM Modules: Shared!
  5. Metadata
     Languages: English, French, German, Spanish, Portuguese, Italian, Polish, …, Chinese, Hindi, Urdu, Japanese, Korean, …
     Tasks: Translation, Information Extraction, Semantic Search, Question Answering, Sentiment Analysis, Summarization, Knowledge Discovery, …
     Text Types: Newswire, Scientific Literature, Full papers/abstracts, Twitter, Patents, Clinical records, EMR, Textbooks, monographs, Online forums, …
     Domains: Finance/Business, Health, Biology, Social Sciences, Humanities, …
     Language Technology, Linguistic Resources, Knowledge Resources
     Resource-Rich, Big Data, Big Text, Cloud Computing, Crowd Sourcing, Big Ontology
     OPEN SCIENCE
  6. Requirements from TM infrastructure
     • Modularity of TM modules
     • Interoperability among TM modules and resources
     • Generic across different languages, domains, and text types
       – Adaptability
  7. Interoperability and Adaptability
     Diagram labels: Module, Module, Module; Resources; Dictionaries; Ontologies; Adaptation; Rule Writing; (Annotated) Text
     Interoperability and Adaptability in Resource-rich TM INFRASTRUCTURES!
     Diagram labels: Dependency Parser; POS Tagger; Named Entity; English, French, German, Japanese, Greek; Languages, Text Types, Domains
  8. Example: extracting proteins, annotations
     Corpora: GENIA, PennBioIE, AIMed, GENETAG
     Incompatibility: Type definitions, Texts
     Problem: Inconsistency
  9. The problem with incompatibility
     • Difficult to evaluate NERs
     Diagram labels: Corpus C, Corpus D; NER A, NER B
     Which NER is best for my task?
     A: 93%, B: 36%. A is better than B.
     A: 63%, B: 90%. B is better than A.
     Why do the results differ so much across corpora and NERs?
  10. Text mining workflows
      • A pipeline that executes particular tools and resources in order
      • Example: semantic search
      • Various versions (language- or domain-specific) of basic components are needed for different applications and tasks
      • Different workflows can be created, compared and evaluated thanks to the ability to seamlessly “mix and match” various versions of components
      Diagram labels: PoS Tagger, Dictionary Lookup, NE Extraction, Chunking, Parsing, Semantic Query
  11. Text mining workflows
      Interoperability: Common Data Representation and Types
      Kano, Y., Miwa, M., Cohen, K. B., Hunter, L., Ananiadou, S. and Tsujii, J. (2011). U-Compare: a modular NLP workflow construction and evaluation system. IBM Journal of Research and Development.
  12. Common Type System
      • A common type system is required for complete interoperability
      • Solution: Maintain local type systems and bridge them via a sharable type system
      A single common type system is almost impossible to impose on all developers.
      Diagram labels: U-Compare Sharable Type System; Local Type System A; Local Type System B; bridging
  13. U-Compare Type System
      Syntactic Level, Document Level, Semantic Level
  14. U-Compare: Evaluate and Compare TM Workflows
      Library: Sentence Splitter A, Sentence Splitter B, POS tagger A, POS tagger B, NER
      Workflow A, Workflow B, Workflow C: each combines a sentence splitter, a POS tagger and the NER
      F-Score A, F-Score B, F-Score C
      Sentence detectors: UIMA SD, OpenNLP SD, GENIA SD
      Tokenizers: UIMA Tokenizer, OpenNLP Tokenizer, GENIA Tagger as Tokenizer
      Taggers: GENIA Tagger, Stepp Tagger, OpenNLP Tagger
      NERs: ABNER, MedT-NER, GENIA Tagger as NER
  15. • Web-based application
      • Interactive creation of workflows
      • Cloud and high-performance computing
      • Integrated TM/NLP processing system
      • GUI for workflow creation
      • Library of ready-to-use processing components
      • Statistics, visualizations, developer APIs
      • Supports UIMA
      • http://argo.nactem.ac.uk
      Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
  16. Diagram labels: Structured Data; Remote Processing; Workflow Diagramming; Workflow Designer; Manual Editing; Annotator/Curator; Processing Components; Developers; UIMA Compliance
  17. Processing Components
      • Approaching 100 components (U-Compare)
        – Additional 50 will be added soon
      • META-NET
      • Developed or co-developed by NaCTeM
        – Planned: Make the library open to others to contribute
      • Generic Listener component
        – Developers can plug their own locally run UIMA component into a workflow in Argo
  18. Remote Processing
      • Single machine execution
        – In-house high-performance machines
      • Distributed processing
        – HTCondor
        – VMware vCloud (EBI) EUPMC
        – Planned: EC2, Azure, …
  19. Workflows
      • Users create workflows as block diagrams
      • Workflows can be shared among users
        – Read only
        – Planned: Read & write
        – Planned: downloadable workflows
      • Workflows can be deployed as web services
        – Plain text (input only), XMI, RDF, BioC
  20. Workflows view
  21. Workflow Editor
  22. Sample Use Cases
      1 Recognition of chemical entities (chemical NER)
      2 Semi-automatic curation of metabolic pathways
      3 Evaluation of inter-annotator agreement
      4 Information extraction as a Web service
  23. Use Case 1: Chemical NER
      Workflow diagram callouts:
      • Supplies gold standard corpus
      • Removes gold annotations so that they can be created automatically
      • Combinations of syntactic and semantic components create annotations
      • Compares and reports precision, recall and F1 of the different branches against the gold standard corpus
  24. Chemical Entity Recogniser
      • Chemical model evaluated at BioCreative IV CHEMDNER challenge
      • The challenge
        – Data: 10,000 manually annotated PubMed abstracts
        – Task: automatically recognise names of chemical entities in text
  25. Chemical Entity Recogniser
      • Our solution
        – Ranked unique mentions: ranked 1st out of 18 groups
        – All mentions: ranked 3rd out of 19 groups

      Subtask                   Precision %   Recall %   F-score %
      Ranked unique mentions    91            85         88
      All mentions              93            81         87
  26. Use Case 2: Semi-automatic Curation – Metabolic Pathways
      Workflow diagram callouts:
      • Search for relevant documents
      • Manual correction of automatic annotations
      • NER for chemicals, genes, process indicators
      • Linking to ontologies: CTD, ChEBI, UniProt
      • Save results in various formats, e.g., RDF for querying and incorporation into databases
  27. Manual Annotation Editor
      • Create new annotations by selecting text
      • Create, modify or delete annotations
      • Edit details of annotations
      • Open a graphical interface to link annotations to ontologies
  28. Filtering and converting annotations
  29. Manual Annotation Editor: linking to ontologies
      • Automatic pre-selection can be modified by the user
      • Details show ontology entry webpage
  30. Use Case 3: Information extraction as a Web service
      Diagram labels: Web service-enabled reader; Web service-enabled writer
  31. Language Universal
      • Reusable modules
      • Generic TM modules: Competence
      • Annotated Text, corpora: Performance
      • Standards of Data Representation and Types for Resources: Competence
      • Dictionaries, Thesauri, Ontologies: Performance
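
As a quick arithmetic check on the table in slide 25, the sketch below recomputes the F-score from the reported precision and recall. It assumes the reported F-score is the balanced F1 (the harmonic mean of precision and recall), which matches the "F1" mentioned on slide 23 but is not stated explicitly on slide 25. The function name `f_score` is illustrative only, and the inputs are the rounded percentages from the slide, so the result can only be compared after rounding.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall percentages as reported on slide 25 (already rounded).
reported = {
    "Ranked unique mentions": (91, 85),
    "All mentions": (93, 81),
}

# Recompute the F-score for each subtask (assumption: balanced F1).
scores = {task: f_score(p, r) for task, (p, r) in reported.items()}

for task, value in scores.items():
    print(f"{task}: F-score = {value:.1f}%")
```

Rounded to whole percentages, both recomputed values agree with the F-score column of the table (88 and 87).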

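
The "mix and match" idea from slides 10, 12 and 14 can be illustrated with a toy sketch: components become interchangeable once they read and write annotations under shared type names. Everything below is hypothetical and is not the Argo, U-Compare or UIMA API; the `Document` class, the type names `"Sentence"` and `"Token"`, and the four component functions are assumptions made purely for illustration.

```python
import re
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets


@dataclass
class Document:
    text: str
    # Shared annotation store: type name -> spans. The type names play the
    # role of a (very small) common type system.
    annotations: Dict[str, List[Span]] = field(default_factory=dict)


Component = Callable[[Document], None]


def period_splitter(doc: Document) -> None:
    # Variant A: a sentence ends at each full stop.
    doc.annotations["Sentence"] = [m.span() for m in re.finditer(r"[^.]+\.?", doc.text)]


def newline_splitter(doc: Document) -> None:
    # Variant B: each line is one sentence.
    doc.annotations["Sentence"] = [m.span() for m in re.finditer(r"[^\n]+", doc.text)]


def _tokenize(doc: Document, pattern: str) -> None:
    # Tokenizers depend only on the shared "Sentence" type, not on which
    # splitter produced it.
    tokens: List[Span] = []
    for begin, end in doc.annotations["Sentence"]:
        for m in re.finditer(pattern, doc.text[begin:end]):
            tokens.append((begin + m.start(), begin + m.end()))
    doc.annotations["Token"] = tokens


def whitespace_tokenizer(doc: Document) -> None:
    _tokenize(doc, r"\S+")


def punctuation_tokenizer(doc: Document) -> None:
    _tokenize(doc, r"\w+|[^\w\s]")


def run_workflow(components: List[Component], text: str) -> Document:
    # A workflow is an ordered list of components applied to one document.
    doc = Document(text)
    for component in components:
        component(doc)
    return doc


text = "Argo supports UIMA. Workflows can be shared."
results = {}
for splitter, tokenizer in product(
    [period_splitter, newline_splitter],
    [whitespace_tokenizer, punctuation_tokenizer],
):
    doc = run_workflow([splitter, tokenizer], text)
    results[(splitter.__name__, tokenizer.__name__)] = (
        len(doc.annotations["Sentence"]),
        len(doc.annotations["Token"]),
    )

for names, counts in results.items():
    print(names, "-> sentences, tokens:", counts)
```

Because every splitter writes `"Sentence"` and every tokenizer reads it, all four combinations run without adapter code. A real system would bridge richer local type systems to the shared one, as slide 12 describes, and would score each workflow against a gold standard rather than just counting annotations.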