Named Entity Recognition (NER) is one of many techniques in Natural Language Processing. NER techniques can be applied to biomedical data, but doing so raises several problems, which are described in this presentation.
Named Entity Recognition problems in biomedical literature
Tools and techniques to help researchers cope with the information overload
are therefore needed.
NER tools can be applied to find all kinds of entities, such as gene or protein
names, diseases and drugs, mutations or properties of protein structures.
The Medline database contained approx. 15 million scientific abstracts with a
growth rate of about 400,000 articles per year.
Identification of proteins or genes is important to find out protein
Names in text
concepts in our mind
The concept denoted by a gene name is usually not clearly defined
agreement to name
• A clone in the mapping phase of the Human Genome Project had up to 15 different names
• FLT4 has four names: PCL; FLT41; LMPH1A; VEGFR3
Many genes and proteins have more than one name
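One common way to cope with synonymous names is a lookup table that maps every known alias to a single canonical symbol. The sketch below is only an illustration: the table holds just the FLT4 example from this slide, the function names are hypothetical, and case-insensitive matching is an assumption that may be too permissive for real gene names.

```python
# Toy synonym table containing only the FLT4 example from the slide.
# A real system would load aliases from a curated gene database.
SYNONYMS = {
    "FLT4": ["PCL", "FLT41", "LMPH1A", "VEGFR3"],
}


def build_lookup(synonyms):
    """Map every known name (upper-cased) to its canonical symbol."""
    lookup = {}
    for canonical, aliases in synonyms.items():
        for name in [canonical, *aliases]:
            lookup[name.upper()] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def canonical_symbol(name):
    """Return the canonical symbol for a name, or None if it is unknown."""
    return LOOKUP.get(name.strip().upper())


print(canonical_symbol("VEGFR3"))  # FLT4
print(canonical_symbol("flt41"))   # FLT4
print(canonical_symbol("BRCA1"))   # None (not in the toy table)
```

A table like this only helps with names that are already known; it does nothing for new names or for spelling variants that are missing from the alias list.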
• Cbp/p300-interacting transactivator
• CCAAT/enhancer binding protein, C/EBP alpha
Inconsistent use of variations of names
• In the BioCreative corpus of expert-tagged gene names, 53% of all names consist of more than one token
• Human T-cell leukaemia lymphotropic virus type 1 Tax protein
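To make the notion of a multi-token name concrete, the following sketch counts tokens in a few names taken from these slides. It assumes plain whitespace tokenization, which may differ from the tokenization used for the BioCreative corpus, and the name list is a toy sample, not corpus data.

```python
def token_count(name):
    """Count tokens using simple whitespace splitting (an assumption)."""
    return len(name.split())


def multi_token_share(names):
    """Fraction of names that consist of more than one token."""
    multi = sum(1 for name in names if token_count(name) > 1)
    return multi / len(names)


# Toy sample of names that appear on the slides.
names = [
    "FLT4",
    "VEGFR3",
    "CCAAT/enhancer binding protein",
    "Human T-cell leukaemia lymphotropic virus type 1 Tax protein",
]

for name in names:
    print(token_count(name), name)

print(multi_token_share(names))  # 0.5 for this toy sample
```

Long names such as the last one are hard for NER systems because both the start and the end of the name have to be found correctly.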
• SEC stands for
• surface epithelial cell
• size exclusion chromatography
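An ambiguous abbreviation such as SEC can sometimes be resolved from the surrounding words. The sketch below scores each sense by how many of its cue words occur in the sentence. The cue word lists and the function name are hypothetical and chosen only for illustration; this is a simple keyword-overlap heuristic, not a method taken from the cited paper.

```python
import re

# Hypothetical cue words for each sense of "SEC" (illustration only).
SENSES = {
    "surface epithelial cell": {
        "cell", "cells", "epithelium", "tissue", "ovarian",
    },
    "size exclusion chromatography": {
        "column", "elution", "purified", "fractions", "chromatography",
    },
}


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose cue words overlap most with the sentence.

    Returns None when no sense has a strictly higher score than the rest.
    Expects at least two candidate senses.
    """
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    return best if top > second else None


print(disambiguate("The protein was purified by SEC on a Superdex column"))
print(disambiguate("SEC were isolated from ovarian tissue"))
print(disambiguate("SEC was measured"))
```

With these cue lists, the first sentence resolves to size exclusion chromatography, the second to surface epithelial cell, and the third stays unresolved because it contains no cue words.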
Leser, U. and Hakenberg, J. (2005), ‘What makes a gene name? Named entity
recognition in the biomedical literature’, Briefings in Bioinformatics, Vol. 6(4), pp.