Data mining: principle and application in biology

Data Mining

WHAT KIND OF INFORMATION ARE YOU MINING?
● How can one find all the members of a human gene family?
● For a given protein, how can one determine whether it
contains any functional domains of interest?
● How does one find a gene of interest, determine that
gene's structure, and easily examine other genes
in that same region?

DATA MINING
● uses informatics and statistics
● helps extract information from a
huge amount of data
● now accessible to everyone

Data
• Publicly available from the Lambert Lab at
http://lambertlab.uams.edu/publicdata.htm
• 105 samples run on the Affymetrix HuGeneFL array
• 74 myeloma samples
• 31 normal samples

Three main data browsers
I. University of California, Santa Cruz (UCSC) Genome
Browser (http://genome.ucsc.edu/)
II. National Center for Biotechnology Information’s
Map Viewer (http://www.ncbi.nlm.nih.gov/)
III. European Molecular Biology Laboratory -
European Bioinformatics Institute: Ensembl
(http://www.ensembl.org)

3 levels in data mining
I. single-query analysis (-> genome browser)
II. selection of a set of genes that meet a criterion (->
"Sister programs")
III. more in-depth analysis (-> R/Bioconductor,
biomaRt, ...)

The genome browsers: UCSC & Ensembl
I. UCSC (University of California, Santa Cruz)
● Gene Sorter
● Table Browser
II. Ensembl
● BioMart

How toolboxes work
UCSC
Gene Sorter
● Explore gene families and the relationships among genes
● Select genes based on several characteristics
Table Browser
● Query data using the database structure
Ensembl
BioMart
● Database reorganised for easier data mining

Common Approaches
• Comparing two measurements at a time
• Person 1, gene G: 1000
• Person 2, gene G: 3200
• Greater than 3-fold change: flag this gene
• Comparing one measurement with a population of
measurements… is it unlikely that the new
measurement was drawn from the same distribution?
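
The two comparisons above can be sketched in a few lines of Python. This is only an illustration: the function names, the z-score cutoff of 3, and the reference population values are assumptions made for the example, not part of the original analysis. Only the values 1000 and 3200 and the 3-fold threshold come from the slide.

```python
from statistics import mean, stdev


def fold_change(a, b):
    """Ratio of the larger to the smaller value, so direction is ignored."""
    low, high = min(a, b), max(a, b)
    if low <= 0:
        raise ValueError("expression values are assumed to be positive")
    return high / low


def flag_by_fold_change(a, b, threshold=3.0):
    """Flag a gene when two measurements differ by more than `threshold`-fold."""
    return fold_change(a, b) > threshold


def z_score(new_value, population):
    """Distance of a new measurement from the population mean,
    in units of the population's sample standard deviation."""
    return (new_value - mean(population)) / stdev(population)


# Pairwise comparison from the slide: person 1 vs. person 2 for gene G.
person_1, person_2 = 1000.0, 3200.0
gene_flagged = flag_by_fold_change(person_1, person_2)

# One measurement against a population (hypothetical reference values).
reference_population = [950.0, 1000.0, 1050.0, 980.0, 1020.0]
z = z_score(person_2, reference_population)
looks_unusual = abs(z) > 3.0  # the cutoff of 3 is an assumption

print(gene_flagged, looks_unusual)
```

A z-score treats the population as roughly normal; with skewed expression values, a log transform before the comparison is a common precaution.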

Approaches (Continued)
• Clustering or Unsupervised Data Mining
• Hierarchical Clustering, Self-Organizing (Kohonen) Maps
(SOMs), K-Means Clustering
• Cluster patients with similar expression patterns
• Cluster genes with similar patterns across patients or
samples (genes that go up or down together)
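
As a rough illustration of unsupervised clustering, here is a minimal K-means sketch in plain Python that groups patients by expression profile. The patient profiles are made-up toy numbers, and seeding the centroids with the first k profiles is a simplification chosen to keep the example deterministic; practical implementations usually use random or smarter initialisation.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(profiles, k, iterations=20):
    """Assign each profile to one of k clusters.

    Assumption: the first k profiles seed the centroids.
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Rows are patients, columns are genes (hypothetical expression values).
patients = [
    [1000.0, 200.0, 50.0],
    [3100.0, 180.0, 900.0],
    [1100.0, 210.0, 60.0],
    [3200.0, 190.0, 950.0],
]
patient_clusters = k_means(patients, k=2)
print(patient_clusters)
```

Clustering genes instead of patients uses the same function on the transposed table, for example `k_means([list(col) for col in zip(*patients)], k=2)`.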

• Classification or Supervised Data Mining.
• Use our knowledge of class values… myeloma vs. normal,
positive response vs. no response to treatment, etc., to gain
added insight.
• Find genes that are best predictors of class.
• Can provide useful tests, e.g. for choosing treatment.
• If predictor is comprehensible, may provide novel insight,
e.g., point to a new therapeutic target.
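
One simple way to look for genes that predict class is to score each gene by how well it separates the two classes and then rank the genes. The sketch below uses a signal-to-noise score (difference of class means divided by the sum of class standard deviations). The slides do not say which score was used, so treat this choice, the function names, and the toy expression values as assumptions.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """(mean_a - mean_b) / (sd_a + sd_b); large magnitude = good separation.

    Assumes each class has at least two samples and non-zero spread.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))


def rank_genes(expression, labels, positive):
    """Return gene names ordered from most to least class-separating."""
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == positive]
        b = [v for v, lab in zip(values, labels) if lab != positive]
        scores[gene] = signal_to_noise(a, b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical data: three myeloma and three normal samples per gene.
labels = ["myeloma", "myeloma", "myeloma", "normal", "normal", "normal"]
expression = {
    "gene_A": [3000.0, 3200.0, 3100.0, 1000.0, 1100.0, 900.0],
    "gene_B": [500.0, 520.0, 480.0, 510.0, 490.0, 505.0],
    "gene_C": [200.0, 900.0, 400.0, 800.0, 300.0, 600.0],
}
ranked = rank_genes(expression, labels, positive="myeloma")
print(ranked)
```

With thousands of genes and only about a hundred samples, some genes will score well by chance, so a ranking like this is normally checked on held-out samples.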

• Classification or Supervised Learning.
• UC Santa Cruz: Furey et al. 2001 (support vector
machines).
• MIT Whitehead: Golub et al. 1999, Slonim et al. 2000
(voting).
• SNPs and Proteomics are coming.
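
The voting approach cited above can be sketched as weighted voting: each gene casts a vote for one class, weighted by how well it separates the classes and by how far the new sample lies from the midpoint of the class means. This is a simplified illustration in the spirit of Golub et al. 1999, not a reproduction of their procedure; the function names and all expression values are hypothetical.

```python
from statistics import mean, stdev


def train_voters(expression, labels, class_a, class_b):
    """For each gene, store (weight, midpoint).

    weight   = (mean_a - mean_b) / (sd_a + sd_b)
    midpoint = (mean_a + mean_b) / 2
    """
    voters = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        midpoint = (mean(a) + mean(b)) / 2
        voters[gene] = (weight, midpoint)
    return voters


def predict(voters, sample, class_a, class_b):
    """Sum the weighted votes and return the class with the larger total."""
    votes_a = votes_b = 0.0
    for gene, (weight, midpoint) in voters.items():
        vote = weight * (sample[gene] - midpoint)
        if vote > 0:
            votes_a += vote
        else:
            votes_b += -vote
    return class_a if votes_a > votes_b else class_b


# Hypothetical training data: three myeloma and three normal samples.
train_labels = ["myeloma"] * 3 + ["normal"] * 3
train_expression = {
    "gene_A": [3000.0, 3200.0, 3100.0, 1000.0, 1100.0, 900.0],
    "gene_D": [100.0, 120.0, 110.0, 400.0, 420.0, 380.0],
}
voters = train_voters(train_expression, train_labels, "myeloma", "normal")

new_sample = {"gene_A": 2900.0, "gene_D": 150.0}
print(predict(voters, new_sample, "myeloma", "normal"))
```

In practice only a selected subset of informative genes votes, and the margin between the two vote totals can serve as a measure of prediction confidence.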

Outline
• Data and Task
• Supervised Learning Approaches and Results
• Tree Models and Boosting
• Support Vector Machines
• Voting
• Bayesian Networks
• Conclusions
