Prediction in medical diagnosis plays an important role in bioinformatics and serves as a vital tool for managing human health care efficiently. The technologies involved in the prediction process are complex to integrate and implement effectively. Optimized prediction in medical diagnosis reduces the complexity of medical treatment and saves time and cost. DNA (deoxyribonucleic acid) sequence and structure information makes disease diagnosis prediction more accurate and better suited to comparison and conclusion. For inherited diseases, early prediction is pivotal for improving cure strategies. Prediction based on existing disease diagnosis cases also supports medical treatment, both technically and procedurally. This paper proposes Optimized Prediction in Medical Diagnosis Using DNA Sequences and Structure Information, which applies pattern matching and comparison analysis to the DNA sequence and structure information of an individual patient's condition. In future, this work will be extended with an artificial intelligence based implementation using a genetic algorithm to attain a DNA based Medical Diagnosis Disease Prediction System.
2. Optimized Prediction in Medical Diagnosis Using DNA Sequences and Structure Information
https://iaeme.com/Home/journal/IJARET 2062 editor@iaeme.com
1. INTRODUCTION
1.1. Prediction
A prediction, or forecast, is a statement about a future event or future data. Predictions are
often, but not always, based upon experience or knowledge [1]. There is no universal agreement
about the exact difference from "estimation"; different authors and disciplines ascribe different
connotations. Future events are necessarily uncertain, so guaranteed accurate information about
the future is impossible. Prediction can nevertheless be useful in making plans about possible
developments [3, 4].
1.2. Pattern Matching
Pattern matching is the act of checking a given sequence of tokens for the presence of the
constituents of some pattern. In contrast to pattern recognition, the match usually has to be
exact: "either it will or will not be a match." The patterns generally have the form of either
sequences or tree structures. Uses of pattern matching include outputting the locations (if any)
of a pattern within a token sequence, outputting some component of the matched pattern, and
substituting the matching pattern with some other token sequence. Sequence patterns are often
described using regular expressions and matched using techniques such as backtracking.
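The three uses listed above can be illustrated with a short Python sketch that relies only on the standard `re` module. The sequence and the motif pattern `TATA[AT]A` are illustrative assumptions and do not come from the data used in this paper.

```python
import re

def find_motif(sequence, pattern):
    """Return (start, matched_text) for every non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]

# Hypothetical DNA fragment and motif, chosen only for illustration.
sequence = "ATGCGTATAAAGGCTATAAATGA"
motif = r"TATA[AT]A"

# Use 1: output the locations of the pattern within the sequence.
hits = find_motif(sequence, motif)

# Use 2: output a component of the matched pattern (here the fifth base).
fifth_bases = [text[4] for _, text in hits]

# Use 3: substitute the matching pattern with another token sequence.
masked = re.sub(motif, "NNNNNN", sequence)
```

A match is exact in the sense described above: a position either satisfies the regular expression or it does not.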
1.3. DNA
DNA is a polymer composed of two polynucleotide chains that coil around each other to form
a double helix carrying genetic instructions for the development, functioning, growth, and
reproduction of all known organisms and many viruses [1]. DNA and ribonucleic acid (RNA)
are nucleic acids. Alongside proteins, lipids and complex carbohydrates (polysaccharides),
nucleic acids are one of the four major types of macromolecules that are essential for all known
forms of life [2, 4].
Figure 1 Sample DNA Structure
1.4. Problem Statement
Retrieving and comparing DNA structure and sequence data for prediction in medical analysis
is a tedious process. It requires a pattern matching based approach applied to medical data
collected from different medical resources, with proper verification and validation for
disease diagnosis.
The objective of this work is to retrieve and compare DNA structure and sequence data from
different medical resources and to perform prediction through a pattern matching approach
for patient disease diagnosis.
3. Edward Daniel Christopher .B and Victor S.P
2. PROPOSED METHODOLOGY
The proposed methodology focuses on optimized prediction in disease diagnosis through DNA
sequence analysis. The first stage converts high dimensional DNA sequence data into low
dimensional sequence data that is easy to process, and filters structured data from
unstructured data. The second stage optimizes the critical data for exact prediction. The third
stage applies the pattern matching approach against the existing target DNA sequence for exact
mapping. The final stage produces the disease diagnosis prediction result in an optimized way to
support a better cure of the disease. Figure 2 shows the structure of the proposed methodology.
2.1. Optimized Prediction of Medical diagnosis using DNA Sequence Information
Figure 2 Proposed DNA data analysis for the Prediction of patient disease identification analysis.
3. IMPLEMENTATION
3.1. DNA Sequence Converter
The DNA sequence converter performs the conversion of high dimensional data to low
dimensional data together with the process of retrieving structured data from the unstructured
data.
1. Unwanted Data Filtration and removal
The Database Maintenance utility lets you purge the Net Support DNA database of redundant
data, applications and programs, remove Agent PCs that are no longer in use, remove users,
create backups of key data using an export/import facility, and set a data retention policy
that schedules automatic deletion of old data.
To keep the Net Support DNA database at a manageable size, it is recommended that you
delete or archive historical or unwanted records on a regular basis.
In the Tools tab, click the Database Maintenance icon. The Database Maintenance dialog
will appear.
Select the Data Size tab. This shows how far back data is being collected and which areas are
taking up the most space.
If required, export data in the Net Support DNA database before it is removed. Select the
Export Data tab. This can act as a secure backup in the event of a database corruption, or it can
be imported to another database. It also frees up space and keeps a record of the data should
you need to return to it later.
• In the Available PCs Tree, select the Agent(s) to export data from.
• Click to transfer to the PCs to export window.
• Limit the amount of exported data by applying a Date filter.
• Indicate if any additional system data is to be included.
• In the case of Applications and Custom User pages, click and select the items to include.
• Click Export when ready. Enter a name for the .XML file that will be created.
• Click OK. A confirmation message will appear when the export is complete.
• Next, on the Delete Data tab, decide which data to clear from the database.
• Select the database tables to include in the purge.
• Choose the required ‘cut-off’ date. All records logged before the specified date will be
deleted.
• Click Delete and confirm to proceed.
• A confirmation dialog will appear indicating how many records have been deleted.
Figure 3 shows the tool installation procedure.
Figure 3 Net support DNA tool installation
2. High dimension to low dimension structured data
To reduce the dimensionality of the DNA sequence data, the Stochastic Neighbor Embedding
approach is used, as follows.
Stochastic Neighbor Embedding (SNE)
This technique reduces the n numeric columns in the dataset to fewer dimensions m (m < n) based
on nonlinear local relationships among the data points. Specifically, it models each high-
dimensional object by a two- or three-dimensional point in such a way that similar objects are
modeled by nearby points and dissimilar objects are modeled by distant points in the new lower
dimensional space.
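As a minimal sketch of the first step of SNE, the Python code below converts pairwise distances between high-dimensional points into conditional neighbor probabilities, so that similar objects receive a high probability and dissimilar objects a low one. It assumes a single fixed bandwidth `sigma` for all points, whereas SNE normally tunes a bandwidth per point. The subsequent fitting of low-dimensional coordinates is not shown. The three sample vectors are hypothetical.

```python
import math

def conditional_similarities(points, sigma=1.0):
    """Return P where P[i][j] is the probability that point i picks j as its neighbor."""
    n = len(points)
    P = []
    for i in range(n):
        weights = []
        for j in range(n):
            if i == j:
                weights.append(0.0)  # a point is never its own neighbor
            else:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                weights.append(math.exp(-d2 / (2 * sigma ** 2)))
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

# Hypothetical 4-dimensional feature vectors: two similar, one distant.
points = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.1, 0.0],
    [3.0, 3.0, 3.0, 3.0],
]
P = conditional_similarities(points)
```

In the full method, low-dimensional points are then adjusted until their own neighbor probabilities resemble `P` as closely as possible.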
3. Feature extraction.
RStudio and the periodicDNA package are the software tools used to extract features from a DNA
sequence. The installation code in R is as follows.
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("periodicDNA")
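To make the idea of sequence features concrete, the following Python sketch counts overlapping k-mers (here dinucleotides) in a DNA string and normalizes them to frequencies. It is an illustrative assumption about what a simple feature vector could look like. It does not reproduce the periodicDNA API, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(sequence, k=2):
    """Count every overlapping k-mer in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def kmer_frequencies(sequence, k=2):
    """Normalize k-mer counts so that the frequencies sum to 1."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

# Hypothetical short sequence used only to show the output shape.
counts = kmer_counts("ACGTAC")
freqs = kmer_frequencies("ACGTAC")
```

Each distinct k-mer becomes one numeric column, which is why such feature tables quickly become high dimensional.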
3.2. Optimization in DNA Prediction
1. Community Detection
To handle large networks, the network is divided into a number of communities. These
communities are detected using the fast greedy modularity optimization algorithm, based on the
multi-scale algorithm. Modularity is defined as the sum, over all partitions, of the difference
between the fraction of links that fall within a partition and the expected fraction of such
links if edges were placed at random:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(i, j)$$
where A is the adjacency matrix, m the number of edges (or total strength for a weighted
network), d_i the degree (or strength) of node i, and the δ(i, j) function returns one if nodes
i and j belong to the same community, zero otherwise.
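The modularity definition can be checked on a small example. The Python sketch below evaluates it for an unweighted, undirected graph given as an edge list. The graph (two triangles joined by one edge) and the community labels are hypothetical, and the code only scores a given partition. It does not implement the fast greedy search itself.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an unweighted, undirected simple graph."""
    m = len(edges)
    degree = {}
    adjacent = set()
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    q = 0.0
    for i in degree:
        for j in degree:
            if community[i] != community[j]:
                continue  # delta(i, j) is zero across communities
            a_ij = 1.0 if (i, j) in adjacent else 0.0
            q += a_ij - degree[i] * degree[j] / (2.0 * m)
    return q / (2.0 * m)

# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph into its two triangles gives Q = 5/14, while placing every node in one community gives Q = 0, so the split is the better partition under this measure.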
2. Whale Optimization Approach (WOA)
In general, the WOA algorithm begins with a set of random individuals. At each iteration,
individuals update their positions with respect to either a randomly selected search agent or the
best solution obtained so far. A control parameter a is decreased from 2 to 0 over the iterations
to shift from exploration to exploitation, and the coefficient A is drawn at random from the
interval [-a, a]. A random search agent is chosen when |A| > 1, while the best solution is chosen
when |A| < 1, for updating the position of the search agents. Based on the value of another
parameter p, WOA switches between a spiral and a circular movement to balance exploration and
exploitation.
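The update rules described above can be sketched in Python as follows. This is a simplified illustration rather than the implementation used in this work. It assumes a minimization problem, scalar coefficients A and C per agent, a spiral constant of 1, and a switch probability of 0.5 for p. The sphere test function and all parameter values are hypothetical.

```python
import math
import random

def woa_minimize(f, dim, bounds, n_agents=20, max_iter=200, seed=0):
    """Minimal Whale Optimization Algorithm sketch for minimizing f."""
    rng = random.Random(seed)
    lo, hi = bounds
    agents = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    best = min(agents, key=f)[:]
    best_f = f(best)
    for t in range(max_iter):
        a = 2.0 - 2.0 * t / max_iter  # control parameter, decreases from 2 to 0
        for i, x in enumerate(agents):
            A = 2.0 * a * rng.random() - a  # lies in [-a, a]
            C = 2.0 * rng.random()
            p = rng.random()
            if p < 0.5:
                # Circular (encircling) move towards the best agent when |A| < 1,
                # otherwise towards a randomly selected agent (exploration).
                ref = best if abs(A) < 1 else rng.choice(agents)
                new = [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(dim)]
            else:
                # Spiral move around the best solution found so far.
                l = rng.uniform(-1.0, 1.0)
                new = [abs(best[d] - x[d]) * math.exp(l) * math.cos(2 * math.pi * l)
                       + best[d] for d in range(dim)]
            agents[i] = [min(hi, max(lo, v)) for v in new]  # keep inside bounds
            fx = f(agents[i])
            if fx < best_f:
                best, best_f = agents[i][:], fx
    return best, best_f

def sphere(x):
    """Hypothetical test objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best, best_f = woa_minimize(sphere, dim=2, bounds=(-10.0, 10.0))
```

For a maximization problem such as the G-mean fitness used later, the same loop applies with the comparison reversed or the objective negated.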
3. Link Prediction
The following algorithm identifies the exact link prediction for the optimized prediction in
disease identification.
Algorithm: Link Prediction based on Whale Optimization Algorithm
Input: Population-Size, Max-iterations
Output: G-mean // fitness function
Load network Dataset
A = Generate adjacency matrix for the given network dataset
C = Compute community detection on A
Sub graphs = Divide the network based on C
For each sub graph (i) do
    search Agents = Initialization (sub graph (i));
    // apply Whale Optimization Algorithm on each individual
    t = 1;
    While t < Max-iterations do
        G-mean = Fitness Function (search Agents);
        If G-mean > best Fitness
            best Fitness = G-mean;
        WOA (search Agents)
        t = t + 1;
    End-While
End-For
Return best Fitness
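The algorithm uses G-mean as its fitness value but does not define it. A common definition, assumed here, is the geometric mean of sensitivity and specificity when link prediction is treated as a binary classification of node pairs (link or no link). The sketch below follows that assumption, and the counts in the example are hypothetical.

```python
import math

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true links recovered
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # non-links rejected
    return math.sqrt(sensitivity * specificity)

# Hypothetical outcome: 8 of 10 true links and 9 of 10 non-links classified correctly.
score = g_mean(tp=8, fn=2, tn=9, fp=1)
```

Because the two rates are multiplied, a predictor that ignores one class scores zero, which suits networks where true links are much rarer than non-links.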
3.3. Pattern Matching and Comparison of Target DNA Sequence
The Basic Local Alignment Search Tool (BLAST) is used for pattern matching and comparison in
the DNA data analysis. BLAST finds regions of similarity between biological sequences. The
program compares nucleotide or protein sequences to sequence databases and calculates the
statistical significance of the matches. The outcome is the final disease diagnosis report for
the target patient, based on the prediction from DNA structure and sequence analysis, which is
stored as a data representation for further study.
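BLAST itself is a heuristic search tool and is not reproduced here. To illustrate what a region of local similarity means, the Python sketch below computes a Smith-Waterman local alignment score, an exact dynamic programming method that is swapped in purely for illustration. The scoring values (match 2, mismatch -1, gap -2) and the sequences are assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A score never drops below zero, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Hypothetical patient and target fragments sharing the region "ACGT".
shared = local_alignment_score("TTACGTTT", "GGACGTGG")
unrelated = local_alignment_score("AAAA", "TTTT")
```

A high score indicates a shared region between the patient sequence and the target sequence, while unrelated sequences score zero.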
3.4. Disease Diagnosis Prediction Result
The disease diagnosis prediction result includes three phases.
Phase-1: Identification
This phase identifies the final parameter for the exact defect in the human health condition,
expressed in medical technical terms. Example: Lumbar spine L1-L5 (L5 defect ratio 100:32)
Phase-2: Classification
A decision tree is used for classification. Given a dataset of attributes together with their
classes, the tree produces a sequence of rules that can be used to classify the data. The
algorithm splits the sample into two or more homogeneous sets (leaves) based on the most
significant differentiators among the input variables. To choose a differentiator (predictor),
the algorithm considers all features and does a binary split on each (for categorical data, it
splits by category; for continuous data, it picks a cut-off threshold). It then chooses the split
with the least cost (i.e. highest accuracy) and repeats recursively until it has successfully
split the data in all leaves (or reaches the maximum depth).
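The choice of a cut-off threshold for a continuous feature can be sketched as follows. The text does not name the cost function, so the weighted Gini impurity is assumed here. The feature values and class labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, cost) of the binary split with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no cut-off fits between two equal values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical measurements for six patients and their known classes.
values = [1, 2, 3, 10, 11, 12]
labels = ["healthy"] * 3 + ["affected"] * 3
threshold, cost = best_threshold(values, labels)
```

A full tree repeats this search over every feature and then recurses on each of the two resulting subsets.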
Phase-3: Medical disease diagnosis result
This phase reports the person's medical diagnosis results as mathematical representational
data, together with the certainty of whether the person is affected or not.
The final result supports medical practitioners in treating the patient efficiently. The
proposed methodology thus plays a vital role in improving the health care subsystem.
4. RESULTS AND DISCUSSION
The proposed optimized prediction in medical diagnosis using DNA sequence and structure
information produced the following results. Table 1 shows the amino acid requirement as
prescribed by the WHO [12].
Table 1 Age-wise amino acid requirement per day (mg/kg)
Amino Acid                    Infants,       Children,     Children,       Adults
                              3–4 months     ∼2 years      10–12 years
Isoleucine                    70             31            28              10
Leucine                       161            73            42              14
Lysine                        103            64            44              12
Methionine plus cystine       58             27            22              13
Phenylalanine plus tyrosine   125            69            22              14
Threonine                     87             37            28              7
Tryptophan                    17             12.5          3.3             3.5
Valine                        93             38            25              10
Total                         714            352           214             83.5
Table 2 shows the amino acid status results for an adult patient AAA, taken from the lab
result of the xxx Medical Centre, Tirunelveli, Tamilnadu, after applying the proposed
methodology with optimized prediction in medical diagnosis. The deficiency is the adult
requirement minus the measured value.
Table 2 Amino acid count status results
Amino Acid                    Measured           Adult          Deficiency
                              (adult patient)    requirement
Isoleucine                    4                  10             6
Leucine                       6                  14             8
Lysine                        11                 12             1
Methionine plus cystine       12                 13             1
Phenylalanine plus tyrosine   13                 14             1
Threonine                     6                  7              1
Tryptophan                    3                  3.5            0.5
Valine                        3                  10             7
Total                         58                 83.5           25.5
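The deficiency column can be reproduced with a few lines of Python, as sketched below. The values are copied from the amino acid status table for the adult patient. The variable names are illustrative.

```python
# Adult requirement and measured patient values, copied from the status table.
requirement = {
    "Isoleucine": 10, "Leucine": 14, "Lysine": 12,
    "Methionine plus cystine": 13, "Phenylalanine plus tyrosine": 14,
    "Threonine": 7, "Tryptophan": 3.5, "Valine": 10,
}
measured = {
    "Isoleucine": 4, "Leucine": 6, "Lysine": 11,
    "Methionine plus cystine": 12, "Phenylalanine plus tyrosine": 13,
    "Threonine": 6, "Tryptophan": 3, "Valine": 3,
}

# Deficiency is the requirement minus the measured value.
deficiency = {name: requirement[name] - measured[name] for name in requirement}
total_deficiency = sum(deficiency.values())

# The three amino acids with the largest shortfall.
largest = sorted(deficiency, key=deficiency.get, reverse=True)[:3]
```

The three largest shortfalls are leucine, valine and isoleucine, which are the three branched-chain amino acids.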
Figure 4 shows the amino acid deficiencies identified by the proposed methodology for DNA
sequence prediction.
Figure 4 Proposed Methodology results in Amino acid count from DNA sequence for Disease
Prediction
4.1. Prediction Result for Disease Cure
Maple syrup urine disease (MSUD) is a rare genetic disorder characterized by deficiency of an
enzyme complex (branched-chain alpha-keto acid dehydrogenase) that is required to break
down (metabolize) the three branched-chain amino acids (BCAAs), leucine, isoleucine and
valine, in the body.
Table 3 compares the proposed methodology of optimized prediction for medical diagnosis
using DNA sequences with the existing non-bioinformatics approach on a sample set of 52
patients. The success rate is the exact prediction count of the proposed methodology divided by
the actual count.
Table 3 Existing data processing vs. proposed medical diagnosis prediction efficiency
Sl. No   Patient data   Actual   Existing non-bioinformatics   Exact prediction of    Success
                        count    data processing approach      proposed methodology   rate %
1        Cancer         05       01                            04                     80
2        Diabetes       24       15                            23                     95.8
3        Hair fall      16       08                            14                     87.5
4        Paralysis      04       01                            03                     75
5        Arthritis      03       01                            02                     66.7
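The per-disease success rates and the overall figures can be recomputed from the counts in the comparison table. The Python sketch below does this; the counts are copied from the table, and the variable names are illustrative.

```python
# (patient data, actual count, existing approach, proposed methodology)
rows = [
    ("Cancer", 5, 1, 4),
    ("Diabetes", 24, 15, 23),
    ("Hair fall", 16, 8, 14),
    ("Paralysis", 4, 1, 3),
    ("Arthritis", 3, 1, 2),
]

def rate(hits, total):
    """Success rate in percent, rounded to two decimals."""
    return round(100.0 * hits / total, 2)

per_disease = {name: rate(proposed, actual) for name, actual, _, proposed in rows}

total_actual = sum(actual for _, actual, _, _ in rows)
total_existing = sum(existing for _, _, existing, _ in rows)
total_proposed = sum(proposed for _, _, _, proposed in rows)

overall_proposed = rate(total_proposed, total_actual)
overall_existing = rate(total_existing, total_actual)
gain_points = round(overall_proposed - overall_existing, 2)
```

The totals are 52 patients, 26 correct predictions for the existing approach and 46 for the proposed methodology, giving 50% and 88.46% respectively, a difference of 38.46 percentage points.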
Figure 5 shows the performance of the proposed prediction optimization in the medical
diagnosis system vs. the existing data processing approach (non-bioinformatics).
Figure 5 Proposed Prediction optimization in medical diagnosis vs. existing approach
The final results show an 88.46% success rate (46/52) for the proposed methodology, which is
38.46 percentage points higher than the existing non-DNA analysis approach (26/52 = 50%). This
provides strong motivation for the future implementation of a DNA based Medical Diagnosis
Disease Prediction System (DNAMDDPS).
5. CONCLUSION
Optimized prediction in a bioinformatics based medical care system, combined with DNA
analysis, moves medical diagnosis towards a better health care system. The proposed
methodology first performs dimensionality reduction and converts unstructured data into
structured data. It then performs community detection and whale optimization, together with
link prediction, on the patient's information, followed by pattern matching and comparison
analysis of the target DNA dataset for exact disease analysis. Finally, it performs
identification and classification of the disease diagnosis state to support future curing
strategies. The final analysis compares the proposed methodology with the existing system. The
proposed methodology achieved an 88.46% success rate (46/52), which is 38.46 percentage points
higher than the existing non-DNA analysis approach (26/52 = 50%) and provides strong motivation
for the future implementation of a DNA based Medical Diagnosis Disease Prediction System
(DNAMDDPS). In future, this research will be implemented with the support of machine learning
and artificial intelligence for further development of an optimal medical diagnosis system.
REFERENCES
[1] Jacques Cohen, Computer science and bioinformatics, Communications of the ACM, Vol. 48,
No. 3, March 2005.
[2] Jana, R., Aqel, M., Srivastava, P., and Mahanti, P. K., Soft Computing Methodologies in
Bioinformatics, European Journal of Scientific Research, Vol. 26, No. 2, pp. 189-203, 2009.
[3] www.nature.com/clinicalpractice/onc
[4] Dressman MA. Gene expression profiling detects gene amplification and differentiates tumor
types in breast cancer. Cancer Res, 63:2194-2199, 2003.
[5] Subramanian S. Gastrointestinal stromal tumors (GISTs) with KIT and PDGFRA mutations
have distinct gene expression profiles. Oncogene, 23:7780-7790, 2004.
[6] Daisuke Kihara, Yifeng David Yang, and Troy Hawkins, Bioinformatics resources for cancer
research with an emphasis on gene function and structure prediction tools. Cancer Inform.2: 25–
35, 2006.
[7] Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM, Meta-analysis of microarrays:
interstudy Validation of gene expression profiles reveals pathway dysregulation in prostate
cancer. Cancer Res.; 62(15):4427-33, 2007.
[8] Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R,
Geisler S,
[9] Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D, Repeated
observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad
Sci U S A.; 100(14):8418-23, 2008.
[10] Rhodes DR, Chinnaiyan AM, Integrative analysis of the cancer transcriptome. Nat Gene; 37
Suppl: S31-7,2008.
[11] Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, Basic local alignment search tool.
J Mol Biol.; 215(3):403-10, 2011.
[12] https://en.wikipedia.org/
[13] https://www.ncbi.nlm.nih.gov/books/NBK234922/table/ttt00008/?report=objectonly