Introduction to Data Integration in Bioinformatics
Yan Xu

Dec. 2013
Data Integration
Data types to be integrated (diagram labels):
• Copy number
• Epigenome (methylation)
• miRNA
• Gene expression
• Clinical data
• Pathways

Recent Publications
R. Louhimo, T. Lepikhova, O. Monni, and S. Hautaniemi, "Comparative analysis of
algorithms for integration of copy number and expression data," Nature
Methods, 2012.
The ENCODE Project Consortium, "An integrated encyclopedia of DNA elements in
the human genome," Nature, 2012.
S. Aerts and J. Cools, "Cancer: Mutations close in on gene regulation," Nature, Jul.
2013.
V. J. H. Powell and A. Acharya, "Disease Prevention: Data Integration," Science, Dec.
2012.
A. Vinayagam, Y. Hu, M. Kulkarni, C. Roesel, R. Sopko, S. E. Mohr, and N. Perrimon,
"Protein Complex–Based Analysis Framework for High-Throughput Data Sets,"
Science Signaling, Feb. 2013.

DNA: the molecule of life

Protein-coding DNA makes up barely 2% of the human genome.
About 80% of the bases in the genome may be expressed
without an identified function.

Gene Expression
DNA: two long biopolymers made of nucleotides, each carrying one of four nucleobases:
A: Adenine
T: Thymine
C: Cytosine
G: Guanine

Diagram labels (mRNA to protein): cap, start codon, termination codon, poly-A tail, sequence of amino acids
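
The path from DNA to a sequence of amino acids can be sketched in a few lines of Python. This is a minimal illustration rather than a full implementation: the codon table is a small subset of the standard genetic code, the input sequence is made up, and the function names `transcribe` and `translate` are hypothetical. The cap and poly-A tail are not modeled.

```python
# Illustrative subset of the standard genetic code (mRNA codon -> amino acid).
# A real analysis would use the full 64-codon table.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GCU": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "AAA": "Lys", "AAG": "Lys",
    "UAA": None, "UAG": None, "UGA": None,  # termination codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """Translate from the first start codon to the first termination codon."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, nothing to translate
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # termination codon reached
            break
        peptide.append(amino_acid)
    return peptide


dna = "CCATGTTTGGCGCAAAGTAAGG"  # made-up example sequence
mrna = transcribe(dna)
peptide = translate(mrna)
print(mrna)
print("-".join(peptide))
```

Reading starts at the first AUG and stops at the first termination codon in the same frame, which mirrors the start codon and termination codon labels in the diagram.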

Microarray

Reverse Transcription

Result

Next generation RNA-sequencing
EST: Expressed Sequence Tag

Animation labels:
• Reads of a single type of nucleotide at one moment
• The number of nucleotide reads at one moment (axis: time)

Reference: Open Reading Frame

DNA structural variation: Copy number
CNV (Copy Number Variation):
• Accounts for about 12% of human genomic DNA
• About 0.4% of the genomes of unrelated people differ with respect
to copy number
• Ranges from about 1,000 nucleotide bases to several megabases
• Inherited or caused by de novo mutation (not inherited
from either parent)
Relation to disease:
• Higher EGFR (epidermal growth factor receptor) copy number
is found in non-small cell lung cancer (Cappuzzo et al., Journal of the
National Cancer Institute, 2005).
• Higher copy number of CCL3L1 decreases susceptibility to HIV
(Gonzalez et al., Nature, 2005).
• Low copy number of FCGR3B increases susceptibility to
inflammatory autoimmune disorders (Aitman et al., Nature, 2006).
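
As a rough sketch of how "higher" and "lower" copy number might be encoded for analysis, the snippet below labels each gene as a gain, a loss, or neutral relative to a baseline. The baseline of two copies is an assumption (a simple diploid reference; the typical copy number of some genes differs between populations), the per-gene values are made up, and `classify_copy_number` is a hypothetical helper, not a method from the cited studies.

```python
DIPLOID_COPIES = 2  # assumed baseline; not appropriate for every gene or population


def classify_copy_number(copies: int, baseline: int = DIPLOID_COPIES) -> str:
    """Label a copy number as a gain, a loss, or neutral relative to the baseline."""
    if copies > baseline:
        return "gain"
    if copies < baseline:
        return "loss"
    return "neutral"


# Made-up copy numbers for one hypothetical sample.
sample_copy_numbers = {"EGFR": 6, "CCL3L1": 2, "FCGR3B": 1}

calls = {gene: classify_copy_number(n) for gene, n in sample_copy_numbers.items()}
for gene, call in calls.items():
    print(f"{gene}: {sample_copy_numbers[gene]} copies -> {call}")
```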

Epigenome: DNA Methylation
Why do we look so different even when we have exactly identical genes?

Diagram labels: Genome; Epigenome directions (what, when and where)

• Addition of a methyl group to cytosine (C) or adenine (A)
DNA nucleotides
• Typically permanent and unidirectional
• Can be copied across cell divisions or
even passed on to offspring

miRNA (microRNA)
The genome has protein-coding genes and also genes that code for small RNAs:
• "transfer RNA", which is used in translation, is coded by genes
• "ribosomal RNA", which forms part of the structure of the ribosome, is also
coded by genes
miRNA: a 21-22 nucleotide non-coding RNA

miRNA Pathway

• Perfect complementary
binding leads to degradation
of the target gene's mRNA
• Imperfect pairing inhibits
translation of the mRNA to
protein

RISC: RNA-induced silencing complex.
It uses the miRNA as a template for
recognizing complementary mRNA.
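
The two outcomes above can be illustrated with a toy complementarity check. This is a deliberately simplified sketch: it uses Watson-Crick pairing only, treats any mismatch as "imperfect pairing", and ignores the rules real target prediction relies on. The sequences are toy examples, and the helpers `count_mismatches` and `predicted_outcome` are hypothetical.

```python
# Watson-Crick pairing between RNA bases (wobble pairs are ignored in this sketch).
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Return the sequence that pairs perfectly with `rna` in antiparallel orientation."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def count_mismatches(mirna: str, target_site: str) -> int:
    """Count positions at which the miRNA fails to pair with the target site."""
    if len(mirna) != len(target_site):
        raise ValueError("miRNA and target site must have the same length")
    expected = reverse_complement(mirna)
    return sum(1 for a, b in zip(expected, target_site) if a != b)


def predicted_outcome(mirna: str, target_site: str) -> str:
    """Simplified rule: perfect pairing -> degradation, otherwise inhibition."""
    if count_mismatches(mirna, target_site) == 0:
        return "mRNA degradation"
    return "translation inhibition"


mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # 22-nucleotide toy sequence
perfect_site = reverse_complement(mirna)
# Introduce two mismatches in the middle of the site.
imperfect_site = perfect_site[:9] + "GG" + perfect_site[11:]

print(predicted_outcome(mirna, perfect_site))
print(predicted_outcome(mirna, imperfect_site))
```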

Clinical data
General clinical checkup data: temperature, blood pressure;
Pathology: blood test, antibody test;
Radiology: X-ray, CT (computed tomography), ultrasound, MRI (magnetic
resonance imaging).

Imaging traits (figure panels, each with a high-score and a low-score example):
• Texture heterogeneity
• Internal arteries

Challenges of data integration analysis
• Large, highly connected data sources and
ontologies
• Heterogeneity: functions, structures, data access
and analysis methods, dissemination formats
• Incomplete or overlapping data sources
• Frequent changes
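
The "incomplete or overlapping data sources" challenge can be made concrete with a small sketch. Here three sources keyed by sample ID are combined with an outer join, so gaps stay visible as `None` instead of silently dropping samples. All sample IDs and values are made up, and `integrate` is a hypothetical helper written with plain dictionaries.

```python
# Three made-up data sources, each keyed by sample ID and each missing some samples.
expression = {"S1": 8.2, "S2": 5.1, "S3": 7.4}
copy_number = {"S1": 4, "S3": 2, "S4": 1}
clinical = {"S1": "stage II", "S2": "stage I", "S4": "stage III"}


def integrate(**sources):
    """Outer-join the sources on sample ID; missing values become None."""
    sample_ids = sorted(set().union(*sources.values()))
    return {
        sample_id: {name: table.get(sample_id) for name, table in sources.items()}
        for sample_id in sample_ids
    }


merged = integrate(expression=expression, copy_number=copy_number, clinical=clinical)

# Samples that are present in every source.
complete = [
    sample_id
    for sample_id, row in merged.items()
    if all(value is not None for value in row.values())
]

for sample_id, row in merged.items():
    print(sample_id, row)
print("complete samples:", complete)
```

Keeping every sample and marking the gaps explicitly leaves the choice between dropping incomplete samples and imputing values to the downstream analysis.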

Case I

E. Segal et al., "Decoding global gene expression programs in liver cancer by noninvasive
imaging," Nature Biotechnology, May 2007.

E. Segal et al., "Module networks: identifying regulatory modules and their
condition-specific regulators from gene expression data," Nature Genetics, 2003.

Case II

O. Gevaert et al., "Non–Small Cell Lung Cancer: Identifying Prognostic Imaging Biomarkers
by Leveraging Public Gene Expression Microarray Data—Methods and Preliminary Results,"
Radiology, Aug. 2012.


Editor's Notes

  • #10 Researchers are now learning that another level of information—the epigenome—controls gene expression in part by controlling access to DNA. The gene-reading machinery is blocked when methyl molecules bind to DNA or histones.