MER: a Minimal Named-Entity Recognition Tagger and Annotation Server

Francisco M. Couto, Luis F. Campos, and Andre Lamurias
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal
BioCreative V.5 Workshop, April 26‐27, 2017

Why Minimal?
• TIPS (Technical interoperability and performance of annotation servers)
– it’s cool, we have to participate somehow 
• But we have limited computational resources
• Idea: Go Minimal
– Minimize the number of tools and steps to
perform Named‐Entity Recognition (NER)

What is Minimal?
• Flexibility
– Simple input
• Autonomy
– minimal set of components and software
dependencies
• Efficiency
– Low execution time

How Minimal?
• Only requires a lexicon as input
– a text file
• Only two components:
1. process the lexicon (offline)
2. produce the annotations (on‐the‐fly)
• GNU Bash shell script
– Using high performance grep and awk tools
– Portability: any Unix‐like operating system

Input
• lexicon text file
α‐maltose
nicotinic acid
nicotinic acid D‐ribonucleotide
nicotinic acid‐adenine dinucleotide phosphate

Pre‐Processing
== one‐word (...word1.txt)
α.maltose
== two‐word (...word2.txt)
nicotinic acid
== more‐words (...words.txt)
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate
== first‐two‐words (...words2.txt)
nicotinic acid
nicotinic acid.adenine
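
The split above can be sketched with standard tools. This is only an illustration, not the MER script itself: the `lexicon_*` file names, the temporary directory, and the normalization rule (lowercase, hyphen replaced by `.`) are assumptions inferred from the example, and `a-maltose` is an ASCII stand-in for `α‐maltose`.

```shell
#!/usr/bin/env bash
# Sketch of the offline lexicon split (assumed file names and rules).
workdir=$(mktemp -d)
lexicon="$workdir/lexicon"

# ASCII stand-in for the example lexicon shown on the Input slide.
printf '%s\n' \
  'a-maltose' \
  'nicotinic acid' \
  'nicotinic acid D-ribonucleotide' \
  'nicotinic acid-adenine dinucleotide phosphate' > "$lexicon.txt"

# Lowercase each term and turn hyphens into "." (a regex wildcard).
tr '[:upper:]' '[:lower:]' < "$lexicon.txt" | sed 's/-/./g' > "$lexicon.norm"

# Split the terms by their number of space-separated words (NF).
awk 'NF == 1' "$lexicon.norm" > "${lexicon}_word1.txt"
awk 'NF == 2' "$lexicon.norm" > "${lexicon}_word2.txt"
awk 'NF >= 3' "$lexicon.norm" > "${lexicon}_words.txt"

# First two words of every longer term, without duplicates.
awk 'NF >= 3 { print $1, $2 }' "$lexicon.norm" | sort -u > "${lexicon}_words2.txt"

cat "${lexicon}_words2.txt"
```

The first-two-words file lets a recognizer check cheaply whether a word pair in the text could start a longer term before it tries the full pattern.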

Recognition
• Common Solution
– Apply grep directly to the input text
– execution time is proportional to the size of the
lexicon
• Inverted Solution
– input text as patterns matched against the lexicon
– more than 100 times faster (TIPS chemical lexicon)
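
A minimal sketch of the inverted idea, under stated assumptions: the words and word pairs of the input text become fixed-string patterns (`grep -F -x -f`) that are matched against the pre-processed lexicon files. The file names, the normalization rule, and the ASCII stand-in `a-maltose` are hypothetical; the actual MER script may build its patterns differently.

```shell
#!/usr/bin/env bash
# Sketch: the text supplies the patterns, the lexicon is the searched file.
workdir=$(mktemp -d)
printf '%s\n' 'a.maltose'      > "$workdir/lexicon_word1.txt"
printf '%s\n' 'nicotinic acid' > "$workdir/lexicon_word2.txt"

text='a-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid'

# Normalize the text like the lexicon: lowercase, other symbols become ".".
norm=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9 ]/./g')

# One pattern per word, and one per pair of consecutive words.
printf '%s\n' "$norm" | tr ' ' '\n' | sort -u > "$workdir/patterns1.txt"
printf '%s\n' "$norm" |
  awk '{ for (i = 1; i < NF; i++) print $i, $(i + 1) }' |
  sort -u > "$workdir/patterns2.txt"

# -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
matches1=$(grep -F -x -f "$workdir/patterns1.txt" "$workdir/lexicon_word1.txt")
matches2=$(grep -F -x -f "$workdir/patterns2.txt" "$workdir/lexicon_word2.txt")

printf '%s\n' "$matches1" "$matches2"
```

The number of patterns here depends on the length of the text rather than on the size of the lexicon, which is the point of inverting the search.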

Output
./get_entities.sh 'α‐maltose and nicotinic acid
D‐ribonucleotide was found, but not nicotinic
acid' lexicon
0       9       α‐maltose
14      28      nicotinic acid
65      79      nicotinic acid
14      45      nicotinic acid D‐ribonucleotide
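
The offsets are 0-based character positions, with the end position exclusive. A hedged sketch of how such lines could be produced for already-recognized terms follows. `find_offsets` is a hypothetical helper, not part of MER: it matches literally and case-sensitively, and it uses `a-maltose` as a one-character ASCII stand-in for `α‐maltose` so that byte and character offsets coincide.

```shell
#!/usr/bin/env bash
# Print "start<TAB>end<TAB>term" for every occurrence of a term in a text.
find_offsets() {
  awk -v text="$1" -v term="$2" 'BEGIN {
    n = length(term)
    pos = 0          # 0-based offset in text where "rest" begins
    rest = text
    while ((i = index(rest, term)) > 0) {
      start = pos + i - 1
      printf "%d\t%d\t%s\n", start, start + n, term
      pos += i                   # resume one character after this match start
      rest = substr(rest, i + 1)
    }
  }'
}

text='a-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid'
for term in 'a-maltose' 'nicotinic acid' 'nicotinic acid D-ribonucleotide'; do
  find_offsets "$text" "$term"
done
```

With this stand-in text, the loop prints the same four offset pairs as the output above.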

Input: Lexicons
• Cell line and cell type
– Cellosaurus
• Chemical
– HMDB, ChEBI and ChEMBL
• Disease:
– Human Disease Ontology
• miRNA:
– miRBase
• Protein:
– Protein Ontology
• Subcellular structure:
– cellular component aspect of Gene Ontology
• Tissue and organ:
– tissue and organ subsets of UBERON
https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip

Lexicon Size
• more than 1M terms composed of more than
2M words and more than 25M characters

Input: text
• jq
– a command‐line JSON processor
– to parse the requests
• cURL
– to download each document
• Parsers
– PubMed, Patents, PMC
https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services
• NO CACHE

Output
• Added some more columns to MER output
– BeCalm TSV format
• The score
– score = 1 - 1/ln(nc)
– nc = number of characters in the recognized term
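
The score formula can be evaluated with awk, whose `log` is the natural logarithm. This is a sketch only: the `score` function name and the four-decimal output are assumptions, and the guard for terms shorter than two characters (where ln(nc) is zero) is added here, not taken from the slides.

```shell
#!/usr/bin/env bash
# score = 1 - 1/ln(nc), with nc the number of characters of the term.
score() {
  awk -v term="$1" 'BEGIN {
    nc = length(term)
    if (nc < 2) exit 1           # ln(1) = 0, so the score is undefined
    printf "%.4f\n", 1 - 1 / log(nc)
  }'
}

score 'nicotinic acid'                    # 14 characters
score 'nicotinic acid D-ribonucleotide'   # 31 characters
```

Longer terms get scores closer to 1, so a long match counts as more reliable than a short one.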

Infrastructure
• Three virtual machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)
• One VM (primary) processes the requests, distributes
the jobs, and executes MER
• The other two VMs (secondary) only execute
MER
• NGINX as HTTP server running CGI scripts
– high performance
• Task Spooler to manage and distribute jobs

Results
• April 21, 2017
• less than 3 seconds on average

Web Tool
http://labs.fc.ul.pt/mer/

Conclusions
• MER: a minimal NER tagger
– Flexible: extensible to any lexicon
– Autonomous: only requires a GNU Bash shell
– Efficient: high‐performance capacity of grep
• Annotation Server
– developed in‐house
– minimal software dependencies
– open‐source
• Future: entity linking functionality in MER

Acknowledgments
• Portuguese National Distributed Computing
Infrastructure (http://www.incd.pt)
• Links
– https://github.com/lasigeBioTM/MER
– http://labs.fc.ul.pt/mer/
