Presentation slide at Japan Science & Engineering Challenge 2020.
In this study, I treated nucleic acid sequences as a natural language, built a large-scale corpus, and used machine learning software for natural language processing to identify gene sequences that improve protein expression levels.
The large-scale corpus focuses on the sequence of the gene's untranslated region (UTR) and its secondary structure. The procedure is as follows.
1) Treat stems and loops, the components of the secondary structure, as morphemes.
2) Next, link the produced corpus with information on gene expression efficiency.
3) Run machine learning (NLP) with Facebook's fastText, using this linked corpus as training data. Specifically, the distributed representation of words consisting of gene parts and this amount
4) Build a neural network model based on the distributed representation and maximize the expression efficiency using the resulting model.
5) Perform a cluster analysis of the words. Furthermore, from the cluster centroid, the same gene fragment that improves expression efficiency
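As a rough illustration of step 1, the sketch below splits a UTR sequence into stem and loop "words". It assumes the secondary structure is available in dot-bracket notation, which the source does not state. The function name `structure_words` and the `S_`/`L_` prefixes are hypothetical, and treating each run of identical structure symbols as one word is a simplification, not necessarily the segmentation used in the study.

```python
from itertools import groupby


def structure_words(seq: str, dot_bracket: str) -> list[str]:
    """Split a sequence into words, one per run of identical structure symbols.

    Assumption: "(" and ")" mark paired (stem) bases and "." marks
    unpaired (loop) bases, as in dot-bracket notation.
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    kind = {"(": "S", ")": "S", ".": "L"}  # S = stem side, L = loop/unpaired
    words = []
    pos = 0
    for symbol, run in groupby(dot_bracket):
        n = len(list(run))
        # Prefix the bases with the element type so stems and loops
        # with the same bases remain distinct words.
        words.append(f"{kind[symbol]}_{seq[pos:pos + n]}")
        pos += n
    return words


# Toy hairpin: a 3-base stem enclosing a 3-base loop.
utr = "GGGAAACCC"
fold = "(((...)))"
sentence = " ".join(structure_words(utr, fold))
```

Joining the words with spaces gives one "sentence" per UTR, which is the whitespace-separated form that word-embedding tools generally expect.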
In this process, the numerical expression efficiency values were converted into labels, and I verified whether classification prediction could be performed.
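The labeling step might look like the following sketch. The three classes and the cut-off values 0.33 and 0.66 are assumptions chosen for illustration, and the word strings are hypothetical placeholders. Only the `__label__` prefix is taken from fastText's documented input format for supervised training.

```python
def efficiency_label(value: float, low: float = 0.33, high: float = 0.66) -> str:
    """Map a numeric expression efficiency to a class name.

    The cut-offs are hypothetical; the source does not give the binning.
    """
    if value < low:
        return "low"
    if value < high:
        return "mid"
    return "high"


def to_fasttext_line(words: list[str], value: float) -> str:
    """Format one training example: label first, then the words."""
    return f"__label__{efficiency_label(value)} " + " ".join(words)


# Hypothetical example: one UTR "sentence" with a measured efficiency of 0.8.
line = to_fasttext_line(["S_GGG", "L_AAA", "S_CCC"], 0.8)
```

A text file with one such line per sequence could then be passed to `fasttext.train_supervised(input=...)` from the fastText Python package to check whether the labels can be predicted.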
Next, I performed a cluster analysis of the resulting word vectors, identified the gene fragments (words) with high expression efficiency, and retained them.
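The source does not name the clustering algorithm, so the sketch below substitutes plain k-means as one possible choice. It then keeps the words in the cluster with the highest mean expression efficiency. The two-dimensional vectors, the efficiency values, and the function names are all made up for illustration. Real fastText vectors would have many more dimensions.

```python
from math import dist
from statistics import fmean


def kmeans(points, k, iters=20):
    """Minimal k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        assign = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [fmean(col) for col in zip(*members)]
    return assign, centroids


def best_cluster_words(words, assign, efficiency, k):
    """Return the words of the cluster with the highest mean efficiency."""
    by_cluster = {c: [w for w, a in zip(words, assign) if a == c] for c in range(k)}
    scored = {c: fmean(efficiency[w] for w in ws) for c, ws in by_cluster.items() if ws}
    return by_cluster[max(scored, key=scored.get)]


# Hypothetical word vectors and per-word expression efficiencies.
vectors = {
    "S_GGG": (0.9, 0.1),
    "S_GCC": (1.0, 0.2),
    "L_AAA": (0.1, 0.9),
    "L_UUU": (0.0, 1.0),
}
efficiency = {"S_GGG": 0.8, "S_GCC": 0.9, "L_AAA": 0.2, "L_UUU": 0.1}

words = list(vectors)
assign, centroids = kmeans([vectors[w] for w in words], k=2)
high_words = best_cluster_words(words, assign, efficiency, k=2)
```

Using the first points as initial centroids keeps the toy example deterministic. A real analysis would more likely use a library implementation with several random restarts.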
Finally, I proposed a method for synthesizing, from these gene fragments, a new base sequence (text) that can be expected to be highly expressed.