Applications and Trends in Data Mining
Data Mining for Biological Data Analysis
Factors that led to the development
• The past decade has seen explosive growth in:
  1. Genomics
  2. Proteomics
  3. Functional genomics
  4. Biomedical research
• Identification and comparative analysis of the genomes of humans and other species, for the investigation of genetic networks.
• Development of new pharmaceuticals and advances in cancer therapies.
• DNA sequences form the foundation of the genetic codes of all living organisms.
• DNA sequences are composed of four basic building blocks called nucleotides:
  1. adenine (A)
  2. cytosine (C)
  3. guanine (G)
  4. thymine (T)
• These four nucleotides (or bases) are combined to form long chains that resemble a twisted ladder.
• Example DNA sequence: … CTA CAC ACG TGT AAC …
• A gene usually comprises hundreds of individual nucleotides arranged in a particular order.
• A genome is the complete set of genes of an organism.
• Genomics is the analysis of genome sequences.
• A proteome is the complete set of protein molecules present in a cell, tissue, or organism.
• Proteomics is the study of proteome sequences.
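As a small illustration of treating a DNA sequence as data, the sketch below counts the four bases in the example fragment and splits it into the three-base groups shown on the slide. The function names `base_counts` and `triplets` are hypothetical, chosen only for this example.

```python
from collections import Counter


def base_counts(seq):
    """Count how often each nucleotide letter occurs in a DNA string."""
    return Counter(seq.upper())


def triplets(seq):
    """Split a DNA string into consecutive groups of three bases.

    A trailing group shorter than three bases is dropped.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# The fragment from the slide, with the spaces removed.
fragment = "CTACACACGTGTAAC"

counts = base_counts(fragment)
print({base: counts[base] for base in "ACGT"})  # {'A': 5, 'C': 5, 'G': 2, 'T': 3}
print(triplets(fragment))  # ['CTA', 'CAC', 'ACG', 'TGT', 'AAC']
```

Real genomic data is far longer than this fragment, so the same operations are normally run by dedicated tools rather than plain string slicing.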
Biological data mining has become an essential part of a new research field called bioinformatics.
Data mining may contribute to biological data analysis in the following aspects:
1) Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Genomic and proteomic data sets are often generated at different labs and by different methods.
• They are distributed, heterogeneous, and of a wide variety.
• Integrating such data is essential for cross-site analysis of biological data.
• Such integration and linkage analysis would facilitate the systematic and coordinated analysis of genome and biological data.
• This need for integration has promoted the development of integrated data warehouses to store and manage derived biological data.
• Data cleaning, data integration, reference reconciliation, classification, and clustering methods will facilitate the integration of biological data and the construction of data warehouses for biological data analysis.
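A minimal sketch of what reference reconciliation and integration might look like for records coming from two labs. Everything here is an assumption made for illustration: the record fields, the identifier-cleaning rule in `normalise_gene_id`, and the toy values are hypothetical and not taken from any real database.

```python
def normalise_gene_id(raw):
    """Map differently formatted identifiers to one canonical form.

    Assumed rule for this sketch: trim whitespace, uppercase, drop hyphens.
    """
    return raw.strip().upper().replace("-", "")


def integrate(*sources):
    """Merge records from several sources, keyed by the canonical gene id."""
    merged = {}
    for source in sources:
        for record in source:
            key = normalise_gene_id(record["gene"])
            entry = merged.setdefault(key, {"gene": key})
            for field, value in record.items():
                if field != "gene":
                    # A later source overwrites an earlier one on conflicts.
                    entry[field] = value
    return merged


# Made-up records: the same gene written two different ways.
lab_a = [{"gene": "abc-1", "expression": 2.4}]
lab_b = [{"gene": " ABC1 ", "length": 120}]

warehouse = integrate(lab_a, lab_b)
print(warehouse)  # {'ABC1': {'gene': 'ABC1', 'expression': 2.4, 'length': 120}}
```

In practice the hard part is the reconciliation rule itself; a real warehouse would rely on curated identifier mappings rather than string cleanup.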
2) Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences.
• BLAST and FASTA, in particular, are widely used tools for the systematic analysis of genomic and proteomic data.
• Biological sequence analysis methods differ from many sequential pattern analysis algorithms proposed in data mining: they must allow for gaps and mismatches between the sequences being compared.
• For protein sequences, two amino acids should also be considered a “match” if one can be derived from the other by substitutions that are likely to occur in nature.
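The relaxed notion of a protein "match" can be sketched with a few groups of chemically similar amino acids. This is a deliberately simplified illustration: the groups in `SIMILAR_GROUPS` are an assumption for the example, whereas real tools score substitutions with matrices such as BLOSUM or PAM.

```python
# Simplified groups of similar amino acids (one-letter codes).
# Assumption for illustration only; real tools use substitution matrices.
SIMILAR_GROUPS = [
    set("ILV"),  # aliphatic
    set("FWY"),  # aromatic
    set("KRH"),  # positively charged
    set("DE"),   # negatively charged
    set("ST"),   # small polar
]


def residues_match(a, b):
    """Return True if two residues are identical or fall in the same group."""
    if a == b:
        return True
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def match_fraction(seq1, seq2):
    """Fraction of aligned positions that match under the relaxed rule.

    Both sequences must already be aligned and of equal length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    if not seq1:
        return 0.0
    hits = sum(residues_match(a, b) for a, b in zip(seq1, seq2))
    return hits / len(seq1)


print(residues_match("I", "L"))        # True
print(residues_match("I", "D"))        # False
print(match_fraction("LKDS", "IRET"))  # 1.0
```

Under exact matching, `LKDS` and `IRET` share no positions; under the relaxed rule every position counts as a match.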
• There is a combinatorial number of ways to approximately align multiple sequences. Methods developed for multiple alignment include:
  1) reducing a multiple alignment to a series of pairwise alignments and then combining the results;
  2) using Hidden Markov Models (HMMs).
• Multiple alignment can be used to identify highly conserved residues among genomes, and such residues can be used to build phylogenetic trees that infer evolutionary relationships among species.
• Genomic and proteomic sequences isolated from diseased and healthy tissues can be compared to identify critical differences between them.
• Sequences that occur more frequently in the diseased samples may indicate genetic factors of the disease.
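The first approach builds on pairwise alignment. The slides do not name a pairwise algorithm, so the sketch below uses Needleman-Wunsch style dynamic programming as one common choice and computes only the best global alignment score. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of two sequences (Needleman-Wunsch style).

    Only the score is computed, keeping one row of the table at a time.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against empty `b`
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in `b`
                curr[j - 1] + gap,   # gap in `a`
            ))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "ACGT"))  # 4
print(global_alignment_score("ACGT", "AGT"))   # 2
```

A progressive multiple aligner would compute such scores for every pair of sequences and use them to decide the order in which sequences are merged; that combining step is not shown here.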
3) Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Protein sequences are folded into 3D structures, and such structures interact with each other based on their relative positions and the distances between them.
• Such complex interactions lead to the formation of genetic networks and protein pathways.
• It is important to develop powerful and scalable data mining methods to discover patterns and to study regularities and irregularities in such complex biological networks.
4) Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development.
• Many studies have focused on comparing one gene with another.
• However, most diseases are not triggered by a single gene but by a combination of genes acting together.
• Association analysis methods can be used to determine the kinds of genes that are likely to co-occur in target samples.
• A group of genes may contribute to a disease process, with different genes becoming active at different stages of the disease; path analysis is expected to play an important role here.
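The co-occurrence idea behind association analysis can be sketched by counting how often pairs of genes appear together across samples and keeping the pairs whose support reaches a threshold. The gene labels, the samples, and the function name `frequent_gene_pairs` are made up for this example; a full association-mining method would also handle larger gene sets and rule confidence.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support):
    """Return gene pairs whose support (fraction of samples containing
    both genes) is at least `min_support`."""
    total = len(samples)
    if total == 0:
        return {}
    counts = Counter()
    for genes in samples:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical samples: each set lists the genes observed in one sample.
samples = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g2", "g3"},
    {"g1", "g2", "g4"},
]

print(frequent_gene_pairs(samples, min_support=0.5))
# {('g1', 'g2'): 0.75, ('g2', 'g3'): 0.5}
```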
5) Visualization tools in genetic data analysis.
• Alignments among genomic or proteomic sequences, and the interactions between them, can be expressed in graphic forms or transformed into various kinds of easy-to-understand visual displays.
• Such displays facilitate pattern understanding, knowledge discovery, and interactive data exploration.
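Even a plain-text rendering shows how a visual display can make an alignment easier to read. The sketch below marks identical positions in two already-aligned sequences with a vertical bar; the function name `render_alignment` and the use of `-` as the gap character are assumptions for this example.

```python
def render_alignment(top, bottom):
    """Render two aligned sequences with '|' under identical positions.

    Both inputs must already be aligned (equal length, '-' for gaps).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    marks = "".join(
        "|" if x == y and x != "-" else " "
        for x, y in zip(top, bottom)
    )
    return "\n".join([top, marks, bottom])


print(render_alignment("ACGT", "A-GT"))
# ACGT
# | ||
# A-GT
```

Graphical tools go much further, for example with colour coding and interactive zooming, but the underlying idea of mapping matches to a visual cue is the same.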