Analysing Entity Type Variation across Biomedical Subdomains


Published in: Health & Medicine, Technology

  1. Analysing Entity Type Variation across Biomedical Subdomains
     Claudiu Mihăilă, Riza Theresa Batista-Navarro, Sophia Ananiadou
     Claudiu Mihăilă, National Centre for Text Mining, School of Computer Science, University of Manchester
     26 May 2012
  2. BioTxtM 2012: Introduction
     • Named entities
         o Atomic elements, classified into various categories (protein, gene, disease, treatment, metabolite etc.)
     Example sentence: "In contrast to the phenotype of the pta ackA double mutant, pbgP transcription was reduced in the pmrD mutant."
     [Figure: the example sentence annotated with the labels Theme, Organism, Pro, Transcription and +Reg]
  3. Introduction
     • Corpora
  4. Methodology
     • Full-text open-access journal articles from UKPMC
     • 20 subdomains, 400 single broad-subject-termed articles
         Allergy & Immunology    Biology               Cell Biology               Communicable Diseases   Critical Care
         Environmental Health    Genetics              Health Services Research   Medical Informatics     Medicine
         Microbiology            Neoplasms             Neurology                  Pharmacology            Physiology
         Public Health           Pulmonary Medicine    Rheumatology               Tropical Medicine       Virology
  5. Methodology
     • NE source: A_Silver = A_UKPMC ∪ A_Oscar ∪ A_NeMine
     [Figure: Corpus (the 20 subdomains) and Annotation (UKPMC, OSCAR, NeMine)]
  6. Methodology
     Silver Annotation
     • NeMine: Gene, Protein, Disease, Drug, Metabolite, Bacteria, Diagnostic process, General phenomenon, Indicator, Natural phenomenon, Organ, Pathologic function, Symptom, Therapeutic process
     • UKPMC: Gene, Protein, Disease, Drug, Metabolite, Gene|Protein
     • OSCAR: Chemical molecule, Chemical adjective, Enzyme, Reaction
  7. Methodology
     • Feature vectors
       Document d: entity type counts and their relative frequencies
       Entity type          Count   Frequency
       Enzyme                   2       0.45%
       Chemical molecule       71      14.85%
       Disease                  8       1.67%
       Drug                    12       2.51%
       Gene                    15       3.13%
       Gene|Protein           155      32.43%
       Metabolite               3       0.62%
       Protein                188      39.33%
       Reaction                24       5.02%
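
The feature vector on slide 7 can be illustrated with a minimal sketch. It assumes each feature is the count of one entity type divided by the total number of entities in the document; the slide does not state the normalisation, and the function name `relative_frequencies` is hypothetical.

```python
def relative_frequencies(type_counts):
    """Turn raw named-entity type counts into shares of the document total.

    Assumption: the normalisation is count / total, which the slide
    does not spell out.
    """
    total = sum(type_counts.values())
    if total == 0:
        # An empty document gets an all-zero vector.
        return {entity_type: 0.0 for entity_type in type_counts}
    return {entity_type: count / total
            for entity_type, count in type_counts.items()}


# Counts for "Document d", copied from the slide.
doc_d = {
    "Enzyme": 2,
    "Chemical molecule": 71,
    "Disease": 8,
    "Drug": 12,
    "Gene": 15,
    "Gene|Protein": 155,
    "Metabolite": 3,
    "Protein": 188,
    "Reaction": 24,
}

features = relative_frequencies(doc_d)
for entity_type, share in features.items():
    print(f"{entity_type:<18} {share:.2%}")
```

Normalising by the document total makes documents of different lengths comparable, which is presumably why the slide shows percentages next to the raw counts.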
  8. Methodology
  9. Methodology
  10. Methodology
      • Chi-squared statistics
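
The slide names chi-squared statistics without showing how they were computed. As a rough illustration only, the sketch below computes the standard Pearson chi-squared statistic for a contingency table whose rows are two subdomains and whose columns are entity types. The table layout and the counts are hypothetical, and the authors' setup may differ.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a contingency table.

    `observed` is a list of rows, each a list of non-negative counts.
    Cells with an expected count of zero are skipped.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    if grand_total == 0:
        return 0.0
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:
                statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two subdomains, columns are two entity types.
table = [
    [10, 20],
    [20, 10],
]
print(chi_squared(table))
```

A larger value means the entity type distribution differs more between the two rows than independence would predict.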
  11. Methodology
      • Frobenius norm
        Value shown on slide: 1247.0725
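
For reference, the Frobenius norm is the square root of the sum of squared entries; for a single vector it coincides with the Euclidean norm. A minimal sketch follows, with made-up input values (the data behind the 1247.0725 figure is not in the transcript).

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


# Hypothetical input: two score vectors stacked as the rows of a matrix.
scores = [
    [3.0, 0.0],
    [0.0, 4.0],
]
print(frobenius_norm(scores))
```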
  12. Feature evaluation
      • Good features for
          o Cell Biology
          o Pharmacology
          o Health Sciences
          o Public Health
      • Not-so-good features for
          o Medical Informatics
          o Medicine
          o Microbiology
          o Neoplasms
          o Neurology
      [Figure: Frobenius norm of χ² vectors for each pair.]
  13. Feature evaluation
      • Mean Chi-Squared for every feature over all pairs
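
Averaging a per-feature score over all subdomain pairs can be sketched as follows. The data structure (a dictionary from a pair of subdomains to per-feature scores), the function name and all numbers are hypothetical; the slide only names the quantity.

```python
from statistics import mean


def mean_score_per_feature(pair_scores):
    """Average each feature's score over every subdomain pair that has it.

    `pair_scores` maps (subdomain_a, subdomain_b) -> {feature: score}.
    """
    features = sorted({f for scores in pair_scores.values() for f in scores})
    return {
        feature: mean(scores[feature]
                      for scores in pair_scores.values()
                      if feature in scores)
        for feature in features
    }


# Hypothetical chi-squared scores for two subdomain pairs.
pair_scores = {
    ("Cell Biology", "Public Health"): {"Gene": 2.0, "Drug": 4.0},
    ("Cell Biology", "Pharmacology"): {"Gene": 4.0, "Drug": 0.0},
}
print(mean_score_per_feature(pair_scores))
```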
  14. Classifier selection
      Classifier                   Top result count    Share
      J48                                         0       0%
      JRip                                        4    2.10%
      Logistic                                    2    1.05%
      Random Tree                                 0       0%
      Random Forest                              86   45.26%
      SMO                                         0       0%
      AdaBoost + J48                              6    3.15%
      AdaBoost + JRip                             7    3.68%
      AdaBoost + Decision Stump                  16    8.42%
      AdaBoost + Logistic                         0       0%
      AdaBoost + Random Tree                      0       0%
      AdaBoost + Random Forest                   68   35.78%
      AdaBoost + SMO                              1    0.52%
      Random Forest F-score for each pair.
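
The "top result" counts in the table add up to 190, which matches the number of unordered pairs that can be formed from 20 subdomains. The sketch below shows that arithmetic, assuming each pair was evaluated once and has exactly one top classifier. That reading is an assumption, and the helper name `share_of_pairs` is hypothetical.

```python
from math import comb

N_SUBDOMAINS = 20

# Number of unordered subdomain pairs: 20 choose 2.
n_pairs = comb(N_SUBDOMAINS, 2)


def share_of_pairs(top_count, total_pairs=n_pairs):
    """Percentage of subdomain pairs on which a classifier ranked first."""
    return 100.0 * top_count / total_pairs


print(n_pairs)
print(f"Random Forest: {share_of_pairs(86):.2f}%")
```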
  15. Classifier evaluation
      • Dissimilar subdomains
          o Cell Biology
          o Pharmacology
          o Health Sciences
          o Public Health
      • Similar subdomains
          o Medical Informatics
          o Medicine
          o Microbiology
          o Neoplasms
          o Neurology
      [Figure: Random Forest F-score for each pair.]
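
The slide reports an F-score per subdomain pair without saying which variant. The sketch below assumes the balanced F1 (the harmonic mean of precision and recall) as the default and uses made-up confusion counts; the `beta` parameter is the usual generalisation, not something taken from the slides.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from confusion counts; beta=1.0 gives the balanced F1."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts for one subdomain pair.
print(f_score(true_pos=8, false_pos=2, false_neg=2))
```

Under this reading, a high F-score for a pair means the classifier separates the two subdomains well, which is how the slide groups them as dissimilar or similar.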
  16. Conclusions
      • To remember
          o Significant semantic variation of biomedical sublanguages
          o Distinguishable bio-subdomains using only NE types
          o Caution needed when adapting NLP tools to subdomains
      • To do
          o Extension to bio-events
          o Combination with lexical, syntactical, discourse features
          o Extension to other domains
  17. Thank you!
      http://misteringo.deviantart.com/art/Bunnies-Scream-Again-79745974