MACHINE LEARNING ALGORITHMS FOR NAMED ENTITY RECOGNITION
Neha Gupta, Thomas Hahn, Prasanna Balakrishnan, Dr. Richard Segall
University of Arkansas at Little Rock, University of Arkansas for Medical Sciences, Arkansas State University, Boston Children’s Hospital,
Sathyabama Deemed University
METHODS
MAP REDUCE
REFERENCES
ABSTRACT
CONTACT
University of Arkansas at Little Rock
Email: thomas.f.hahn@gmail.com,
nneha2ggupta@gmail.com,
balkiprasanna1984@gmail.com,
rssegall@aol.com
Existing libraries such as ABNER and GATE are used to extract information from the wealth of available biomedical literature, but they run on only a single machine. A MapReduce implementation, by contrast, runs on multiple machines and proved more efficient than these frameworks in our analysis. The objective of this project is to develop a statistically based model for biomedical text mining that addresses current limitations in handling large-scale datasets. A statistical model based on the Hidden Markov Model (HMM) is built to identify biological entities in the literature; these entities can in turn be used to identify and build networks crucial for understanding human diseases. The HMM is generated and the Viterbi algorithm is applied to the list of observations.
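As a concrete illustration of the decoding step described above (generating an HMM and applying the Viterbi algorithm to a list of observations), here is a minimal Python sketch. The two states, tokens, and probabilities are made-up toy values, not the trained biomedical model reported in this poster.

```python
# Minimal Viterbi decoding sketch for an HMM tagger. All states,
# observations, and probabilities are illustrative toy values.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path and its probability."""
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            predecessor state at time t-1 on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    path.reverse()
    return path, V[-1][last][0]

# Toy two-state model: is a token part of a biological entity or not?
states = ("ENTITY", "OTHER")
start_p = {"ENTITY": 0.3, "OTHER": 0.7}
trans_p = {"ENTITY": {"ENTITY": 0.6, "OTHER": 0.4},
           "OTHER": {"ENTITY": 0.2, "OTHER": 0.8}}
emit_p = {"ENTITY": {"IL-2": 0.7, "binds": 0.05, "receptor": 0.25},
          "OTHER": {"IL-2": 0.05, "binds": 0.8, "receptor": 0.15}}

path, prob = viterbi(("IL-2", "binds", "receptor"),
                     states, start_p, trans_p, emit_p)
print(path)  # most likely tag sequence for the three tokens
```

The same dynamic-programming recurrence, applied to real emission and transition matrices estimated from tagged training words, yields the tag sequence for a sentence in one left-to-right pass.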
1. Lantz, B. (2013). Machine Learning with R. Birmingham, UK: Packt Publishing.
2. Kim, J., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus—a semantically annotated corpus for bio-textmining. Retrieved 02/19, 2014, from http://0-bioinformatics.oxfordjournals.org.iii-server.ualr.edu/content/19/suppl_1/i180.abstract
3. Zhang, J., Shen, D., Zhou, G. D., Su, J., & Tan, C. L. (2004). Enhancing HMM-based biomedical named entity recognition by studying special phenomena. Journal of Biomedical Informatics, 37(6), 411-422. doi:10.1016/j.jbi.2004.08.005. Retrieved 02/21, 2014, from http://download.journals.elsevierhealth.com/pdfs/journals/1532-0464/PIIS1532046404000838.pdf
BAUM-WELCH ALGORITHM
- The HMM generated in R had a log Viterbi score of -16.57684.
- The HMM was successfully trained in MapReduce to generate the expected counts of initial states, emissions, and transitions taken to generate the sequence.
- The maximum likelihood score L of our training set S was 151 using the Baum-Welch implementation in Perl.
- The training data was classified and compared using the kNN, Naïve Bayes, Logistic Regression, and SVM algorithms.
- With the linear kernel in SVM, 50 of the model's predictions did not agree with the actual test dataset.
- With the Radial Basis Function (RBF) kernel, 192 model predictions did not agree.
- Clustered seven entities from the GENIA corpus of 2000 MEDLINE abstracts.
- Aligned the seven clusters to the PubMed abstract words and tagged them.
- The transition probability matrix and emission probability matrix were generated from these 500 tagged training words.
- The HMM was generated using the RHmm package.
- HMM training was also parallelized using the MapReduce algorithm in Perl.
- Maximum likelihood estimates (MLE) of the HMM's hidden parameters were inferred using a Baum-Welch implementation in Perl.
- Implemented machine learning classifiers including kNN, Naïve Bayes, Logistic Regression, and SVM.
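The matrix-generation step above can be sketched as straightforward maximum-likelihood counting: tally tag-to-tag transitions and tag-to-word emissions over the tagged training words, then row-normalize. The tagged sentences below are made-up examples, not the poster's 500 GENIA-derived training words.

```python
# Sketch of building HMM transition and emission probability matrices
# from (word, tag) training pairs by counting and row-normalizing.
# The tagged sentences are illustrative, not the poster's data.
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Return row-normalized transition and emission probability tables."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count

    for sentence in tagged_sentences:
        prev = "<START>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return ({p: normalize(c) for p, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

tagged = [[("IL-2", "protein"), ("gene", "DNA"), ("expression", "other")],
          [("NF-kappa", "protein"), ("B", "protein"), ("binds", "other")]]
trans_p, emit_p = estimate_hmm(tagged)
print(trans_p["<START>"])  # both toy sentences start with a protein tag
```

In practice the counting step parallelizes naturally, which is what the MapReduce training in Perl exploits: mappers emit per-sentence counts and reducers sum and normalize them.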
RESULTS
This project was supported by the Arkansas INBRE program, with grants from the National Center for Research Resources - NCRR
(P20RR016460) and the National Institute of General Medical Sciences - NIGMS (P20 GM103429) from the National Institutes of Health.