Aisha Kalsoom
Tools and techniques to help researchers cope with the information overload
are therefore needed.
NER tools can be applied to find all kinds of entities, such as gene or protein
names, diseases and drugs, mutations, or properties of protein structures.
The Medline database contained approximately 15 million scientific abstracts,
with a growth rate of about 400,000 articles per year.
Identifying proteins or genes is an important step toward uncovering protein
interaction networks.
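As a minimal sketch of what an NER tool does, the snippet below tags gene names by dictionary lookup. The lexicon, the `GENE` label and the function name `tag_entities` are assumptions made for illustration; real systems typically rely on much larger lexicons or on statistical models.

```python
import re

# Hypothetical toy lexicon (lowercased surface form -> entity type).
LEXICON = {
    "p53": "GENE",
    "sonic hedgehog": "GENE",
    "vegfr3": "GENE",
}

def tag_entities(text, lexicon=LEXICON):
    """Return (start, end, matched_text, label) for each lexicon hit."""
    # Try longer names first so multi-word names win over shorter overlaps.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]

hits = tag_entities("Sonic Hedgehog and p53 were studied; VEGFR3 was not.")
for start, end, name, label in hits:
    print(start, end, name, label)
```

A plain lookup like this only finds names that are already in the lexicon, which is one reason the naming problems listed in this deck matter.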
Concepts, meaning and representation
• Names in text represent real-life concepts in our mind
• The concept denoted by a gene name is usually not clearly defined
• No community-wide agreement on how to name a particular gene
• Example names: Supermarket; Sonic Hedgehog gene in human; p53; 2WRU
Many genes and proteins have more than one name
• A clone during the mapping phase of the Human Genome Project had up to 15
different names
• FLT4 has four names: PCL; FLT41; LMPH1A; VEGFR3
Inconsistent use of variations of names
• Cbp/p300-interactive transactivator
• CCAAT/enhancer binding protein, C/EBP alpha
Multi-word names
• In the BioCreative corpus of expert-tagged gene names, 53% of all names
consist of more than one token
• Human T-cell leukaemia lymphotropic virus type 1 Tax protein
Acronyms are homonyms
• SEC stands for
• surface epithelial cell
• size exclusion chromatography
• selenocysteine
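The synonym, spelling-variation and acronym problems above can be sketched in a few lines. This is an illustration only: the helper names (`normalize`, `resolve`, `senses`), the choice of FLT4 as the canonical symbol, and the normalization rule are assumptions, not a method taken from the slides.

```python
import re

# Synonyms from the slide; treating FLT4 as the canonical symbol is an assumption.
SYNONYMS = {"FLT4": ["PCL", "FLT41", "LMPH1A", "VEGFR3"]}

# Expansions of SEC listed on the slide.
ACRONYM_SENSES = {
    "SEC": [
        "surface epithelial cell",
        "size exclusion chromatography",
        "selenocysteine",
    ],
}

def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Map every normalized alias (and the symbol itself) to its canonical symbol.
CANONICAL = {
    normalize(alias): symbol
    for symbol, aliases in SYNONYMS.items()
    for alias in [symbol, *aliases]
}

def resolve(name):
    """Return the canonical symbol for a name, or None if it is unknown."""
    return CANONICAL.get(normalize(name))

def senses(acronym):
    """Return all known expansions of an acronym (possibly several)."""
    return ACRONYM_SENSES.get(acronym.upper(), [])

print(resolve("VEGFR3"), resolve("vegfr-3"), resolve("p53"))
print(normalize("C/EBP alpha") == normalize("c/ebp-alpha"))
print(senses("SEC"))
```

Normalizing this aggressively collapses spelling variants, but it can also merge names that are genuinely different, and the acronym table only lists candidate senses. Choosing among them requires the surrounding context.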
• Leser, U. and Hakenberg, J. (2005), ‘What makes a gene name? Named entity
recognition in the biomedical literature’, Briefings in Bioinformatics, Vol. 6(4),
pp. 357-369.
• http://www.bioinformatics.org/textknowledge/acronym.php?textfield=SEC&sub=search
• http://www.rcsb.org/pdb/explore/explore.do?structureId=2WRU

Named Entity Recognition problems in biomedical literature
