Automatic extraction of microorganisms and their habitats from free text using text-mining workflows
BalaKrishna Kolluru, Sirintra Nakjang, Robert P. Hirt, Anil Wipat and Sophia Ananiadou
Outline of the talk
• Motivation
• Experiments
• Results & inferences
• Discussion
• Current work
Motivation
• In the study of symbiotic relationships, host-microbe interactions play an important role
• To date, there is no comprehensive database of microbe-habitat relations, yet the number of known taxa is exploding
• There is therefore an urgent need for automated host-microbe relation extraction
Experiments: relevant work
• Identification of named entities such as microorganisms, diseases and genes has received considerable attention from the scientific community [Sasaki, Hanisch, Nobata]
• Researchers have also used ontology-based approaches to identify concepts such as public health rumors [BioCaster]
Experiments: our approach
[Workflow diagram: free-text articles and pdf → text processing → named entity recognition (habitats & organisms) → relation mining]
We employ text-mining workflows consisting of
• A text/pdf processor
• A named entity recognizer to identify microorganisms and their habitats
• A relation mining component to extract sentences that express this relation
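The three workflow stages can be sketched roughly as follows. This is only an illustration of how the components chain together, not the system described in the talk: the tiny ORGANISMS and HABITATS dictionaries, the naive sentence splitter and the co-occurrence rule are all hypothetical stand-ins for the trained CRF components.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy dictionaries (assumption); the talk's recognizer
# combined dictionaries with a trained CRF.
ORGANISMS = {"helicobacter pylori", "h. pylori", "escherichia coli"}
HABITATS = {"stomach", "lung", "gut", "soil"}


def split_sentences(text: str) -> List[str]:
    """Stage 1 (text processing): naive split after ., ! or ? when the
    next sentence starts with an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tag_entities(sentence: str) -> List[Tuple[str, str]]:
    """Stage 2 (named entity recognition): dictionary lookup only."""
    lowered = sentence.lower()
    found = []
    for label, terms in (("ORGANISM", ORGANISMS), ("HABITAT", HABITATS)):
        for term in sorted(terms):
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                found.append((term, label))
    return found


def candidate_relations(text: str) -> List[Dict[str, object]]:
    """Stage 3 (relation mining): keep sentences that mention at least
    one organism and one habitat. A trained classifier would then decide
    whether the sentence really expresses the relation."""
    results = []
    for sentence in split_sentences(text):
        entities = tag_entities(sentence)
        organisms = [e for e, label in entities if label == "ORGANISM"]
        habitats = [e for e, label in entities if label == "HABITAT"]
        if organisms and habitats:
            results.append(
                {"sentence": sentence, "organisms": organisms, "habitats": habitats}
            )
    return results
```

Simple co-occurrence over-generates, which is one reason a learned relation classifier is used on top of the entity mentions.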
Experiments: our approach
• The named entity recognizer used a hybrid dictionary and machine learning approach
– It combined information from dictionaries with a feature set for a conditional random field (CRF) based classifier [Mallet]
– The CRFs used a linear-chain model and were trained on a corpus consisting of 32 full papers
Experiments: our approach
– The feature set included
  • Lexical information about the word, e.g. the word itself and its POS tag
  • Orthographic information, e.g. any uppercase letters or numbers
  • Contextual information: the two words preceding and the two words succeeding the word
• For the relation mining component, a linear-chain CRF was trained using
– Occurrence of organisms and habitats
– Contextual information of all the entities in a sentence
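The named entity feature set listed above could be built per token along these lines. This is a minimal sketch under stated assumptions: the feature names, the "<PAD>" marker, the hyphen flag and the single-token dictionary lookup are illustrative choices, and POS tags are assumed to come from an external tagger. The resulting dictionaries would be passed to a linear-chain CRF toolkit for training.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, bool]


def token_features(
    tokens: List[str],
    pos_tags: List[str],
    index: int,
    dictionary: Set[str],
    window: int = 2,
) -> Dict[str, Feature]:
    """Build lexical, orthographic, dictionary and contextual features
    for the token at position `index`."""
    word = tokens[index]
    feats: Dict[str, Feature] = {
        # Lexical information
        "word": word.lower(),
        "pos": pos_tags[index],
        # Orthographic information
        "has_upper": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
        # Dictionary information (the "hybrid" part); lookup is an assumption
        "in_dictionary": word.lower() in dictionary,
    }
    # Contextual information: `window` words before and after the token
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = index + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats
```

A word such as "lung" yields no positive orthographic flags, so the classifier has to rely on the dictionary and context features for it.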
Results and inference
Performance of our named entity recognizer on a 9-fold cross-validation

Class of entities   Precision (%)   Recall (%)   F-score (%) = 2PR/(P+R)
Organisms           84              79           81
Habitats            68              55           61

(Improved results from the time of submission)
• Microorganisms have been recognized quite well
• Habitat recognition is modest
• One observation is that in free text, the description of habitats/hosts is devoid of any salient features such as uppercase letters or hyphens
• Instances such as "abscess" and "lung" were typical misses
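The F-score column follows directly from the precision and recall columns via 2PR/(P+R). A short check, with the function name chosen here purely for illustration:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the results table (percentages)
organisms_f = f_score(84, 79)  # about 81.4, reported as 81
habitats_f = f_score(68, 55)   # about 60.8, reported as 61
```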
Results and inference
Relation mining results
• For the relation mining experiment, the CRF-based classifier achieved a precision of ~80%
• Most of the false negatives (sentences which should have been picked up, but were not) were due to noise in the pdf-to-text conversion
• Another reason for false negatives is the modest performance of habitat recognition, which affected the relation mining algorithm
Discussion
• The workflows we have developed bring together pdf conversion, machine learning and dictionaries
– The performance of individual components has an impact on overall performance
– Pdf conversion is far from trivial, and this component is the most limiting factor for any sentence-based classification task
Discussion
• Pdf-to-text sentence examples
– "These mechanisms may have evolved in bacterial pathogens to increase the frequency of phenotypic variation in genes involved in 1 100,000 200,000 300,000 1,600,00 Figure 2 Circular representation of the H. pylori 26695 chromosome." [Clearly, data from a table and figure corrupted the sentence]
– "airborne pigs" [noisy conversion of a table discussing airborne diseases in pigs]
Discussion
• The CRF model for habitats is evidently weak
– There is a need to augment the features to alleviate this weakness. We are currently enhancing the model to include more features such as character-level n-grams
– Results reflect initial success
• Relation mining is a hyper-classification task and is perhaps prone to cascading errors
Current work
• Work is underway to improve the relation mining component using bag-of-words and character-level n-grams to augment the feature space
• We are also working on less noisy conversion techniques for pdf-to-text
• We plan to export the workflows to the public domain so that scientists across the spectrum can use them
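As a rough illustration of the two feature types mentioned above, the sketch below extracts character-level n-grams from a word and a bag-of-words from a sentence. The function names, the default n of 3, the "^" and "$" boundary markers and the simple tokenization are assumptions made for this example, not details given in the talk.

```python
import re
from collections import Counter
from typing import List


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Character-level n-grams of a lowercased word, with '^' and '$'
    marking the word boundaries (an illustrative convention)."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def bag_of_words(sentence: str) -> Counter:
    """Unordered counts of the lowercased alphanumeric tokens in a sentence."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
```

For example, `char_ngrams("lung")` gives `["^lu", "lun", "ung", "ng$"]`. Sub-word units like these give a classifier some signal for habitat words that carry no uppercase letters, digits or hyphens.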
Snapshot of relation miner
[Screenshot not preserved]
References
• Hanisch, D. et al. ProMiner: Organism specific protein name detection using approximate string matching. EMBO Workshop, Granada, Spain, 2004.
• Sasaki, Y. et al. How to make the most of NE dictionaries in statistical NER? BMC Bioinformatics, 9(Suppl 11), S5, 2008.
• Collier, N. et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics, 24(24), 2008.
• Nobata, C. et al. Mining Metabolites: Extracting the Yeast Metabolome from the Literature. Metabolomics, 2010.