1. Manual Curation and Extraction
of miRNAs using miRTex miRNA
recognition NER.
Seerat Sidhu
Supervisors: Jean-Marc Schwartz, Goran Nenadic
Faculty of Life Sciences
The University of Manchester
2. AIMS
Using miRTex for Analysis.
Automating the process of information retrieval.
Manual Curation and Evaluation of results.
Study the Distribution of errors.
3. INTRODUCTION
MicroRNAs (miRNAs) are endogenous non-coding RNAs, ~22 nucleotides long.
Automated tools help address the problem of the rapidly growing
biomedical literature.
NER (Named Entity Recognition) is a text mining technique
used to identify mentions of key biological entities in text.
The miRTex NER system uses a rule-based approach to extract mentions
of miRNAs from free text.
4. METHODS
miRNA Extraction Pipeline
Corpus Selection
Pre-Processing
* continuous data
Data Input
miRNA-mention Recognition
* Rule Based
* miRNA Nomenclature Based
miRNA Extraction
6. METHODS
“mir” (or “miRNA”, “microRNA”, “miR”) is the prefix for
microRNAs; it is usually followed by a dash and a unique identifying
number.
The performance of the NER tool was evaluated on two corpora:
The miRTex corpus, consisting of 150 abstracts.
An in-house corpus, consisting of 13 full-length articles.
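The nomenclature rule above can be illustrated with a regular expression. This is a minimal sketch, not the miRTex implementation: the pattern, the optional letter and arm suffixes (such as "a" or "-5p"), and the function name `extract_mirna_mentions` are assumptions made for illustration.

```python
import re

# Hypothetical pattern (not taken from miRTex): a prefix (mir, miR, miRNA,
# microRNA), an optional dash, a numeric identifier, then an optional
# letter suffix ("a") and an optional arm suffix ("-3p" or "-5p").
# Names that do not use these prefixes (e.g. let-7) are not covered.
MIRNA_PATTERN = re.compile(
    r"\b(?:miRNA|microRNA|miR)-?\d+[a-z]?(?:-[35]p)?\b",
    re.IGNORECASE,
)


def extract_mirna_mentions(text):
    """Return every miRNA-like mention found in the text, in order."""
    return MIRNA_PATTERN.findall(text)


sentence = (
    "Expression of miR-21 and microRNA-155 was elevated, "
    "whereas mir-122a and miRNA-34a-5p were reduced."
)
print(extract_mirna_mentions(sentence))
# ['miR-21', 'microRNA-155', 'mir-122a', 'miRNA-34a-5p']
```

A rule of this kind favours precision; variants that depart from the prefix-dash-number form would need additional rules.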
7. METHODS
Gaps between lines and paragraphs were removed to create
continuous text.
The data were stored in text files, which were then used to
construct dictionaries.
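The pre-processing step can be sketched as follows. The function names and the choice of a dictionary keyed by document identifier are assumptions for illustration; the slides do not specify how the dictionaries were structured.

```python
import re


def to_continuous_text(raw):
    """Collapse line breaks, blank lines and repeated spaces into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()


def build_document_dictionary(documents):
    """Map each document identifier (hypothetical key) to its continuous text."""
    return {doc_id: to_continuous_text(text) for doc_id, text in documents.items()}


raw_documents = {
    "doc1": "miR-21 is\nupregulated.\n\n\nIt targets PTEN.",
}
corpus = build_document_dictionary(raw_documents)
print(corpus["doc1"])
# miR-21 is upregulated. It targets PTEN.
```

Words hyphenated across a line break are not rejoined by this sketch and would need separate handling.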
The results obtained with the NER tool were manually curated.
The results were further evaluated by calculating precision, recall,
and F-score.
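The evaluation metrics can be computed from the counts produced by manual curation. A minimal sketch, assuming the standard balanced F-score (F1); the counts in the example are illustrative and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and balanced F-score from raw counts.

    tp: mentions correctly extracted (true positives)
    fp: spurious extractions (false positives)
    fn: mentions missed by the tool (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts only (not taken from the study).
p, r, f = precision_recall_f1(tp=45, fp=5, fn=10)
print(f"precision={p:.2f} recall={r:.2f} F-score={f:.2f}")
# precision=0.90 recall=0.82 F-score=0.86
```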
9. Evaluation
miRTex corpus: F-score of 0.99, with recall of 0.99
and precision of 1.
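As a consistency check on the miRTex corpus figures, and assuming the reported F-score is the balanced F-score (harmonic mean of precision P and recall R), the rounded values give:

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \times 1.00 \times 0.99}{1.00 + 0.99}
  = \frac{1.98}{1.99}
  \approx 0.995
\]
```

This agrees with the reported F-score of 0.99, allowing for rounding of the precision and recall values.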
In-house corpus:
10. Distribution of errors in the text.
PubMed ID Introduction Discussion Procedures Results Abstract
22435726 1(FN)
23496142 2(FP) 3(FN)
25081906 1(FP) 1(FN) 1(FP)
24431276 3(FN)
26026730 1(FP) 1(FN)
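The entries above can be tallied per article. A minimal sketch: the rows are copied from the table, while the parsing pattern and function name are assumptions for illustration. It produces per-article and overall totals only and does not rely on the section columns.

```python
import re
from collections import Counter

# Rows as listed in the table: PubMed ID followed by count(type) entries,
# where FP = false positive and FN = false negative.
ROWS = [
    "22435726 1(FN)",
    "23496142 2(FP) 3(FN)",
    "25081906 1(FP) 1(FN) 1(FP)",
    "24431276 3(FN)",
    "26026730 1(FP) 1(FN)",
]

ENTRY = re.compile(r"(\d+)\((FP|FN)\)")


def tally_errors(rows):
    """Return {pubmed_id: Counter({'FP': n, 'FN': m})} for each row."""
    per_document = {}
    for row in rows:
        pubmed_id, _, rest = row.partition(" ")
        counts = Counter()
        for number, kind in ENTRY.findall(rest):
            counts[kind] += int(number)
        per_document[pubmed_id] = counts
    return per_document


per_document = tally_errors(ROWS)
totals = sum(per_document.values(), Counter())
print(totals["FP"], totals["FN"])
# 5 9
```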
11. CONCLUSION
Up to 100 documents can be processed at a time.
The system was automated to retrieve text files containing data.
The F-score was ~0.98 for the majority of the results.
Reliable and Accurate predictions.
Good precision and recall values.
Errors were randomly distributed across article sections.
12. FUTURE RESEARCH
Study only a particular set of miRNAs.
Integration with curation pipelines, to attempt an analysis of the
relationship between miRNAs and diseases.
Identification of potential miRNA targets by carrying out a
systematic investigation.