Kulakova sbb2014

COMPUTER DATA ANALYSIS
OF GENOME SEQUENCING
BY TECHNOLOGY ChIP-seq
AND Hi-C
Adviser: Yuri Orlov, ICG SB RAS
Author: Ekaterina Kulakova, bachelor's student

Relevance
• Automated systems can decode DNA and genomic sequences up to whole genomes.
Whole-genome sequencing leads to an avalanche-like growth of sequence information
(megabytes to gigabytes of data).
• Methods based on chromatin immunoprecipitation (ChIP-seq, ChIA-PET) provide
qualitatively new data.
• New tasks arise in computational genomics, such as the analysis of spatial,
non-linear chromosome structures.
Aim and Scientific Novelty
The aim of this work is to study chromosomal contacts in the cell nucleus using
computer programs: statistical analysis of genes and chromosomal domains, and
analysis of experimental ChIP-seq and Hi-C data.
• Integration of modern genome-wide ChIP-seq and Hi-C data, which became available
only in the last two or three years.
• Use of a precision parameter for location on the chromosome when analyzing the data.
• Compilation of a list of genes located at the boundaries of topological chromosomal domains.
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End-Tag sequencing

Methods Hi-C and ChIA-PET*
Figure: Arrangement of chromosomes in the cell nucleus (reconstruction from Hi-C data).
Source: "Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the
Human Genome", Science, 2009.
Figure: Topological arrangement of chromosomal domains and its mapping onto the genome.
Figure: Scheme of local chromosomal domains ("tangle" contacts); panel labels: separate loops,
"tangle" (Dixon et al., 2012).
Figure: Scheme of the arrangement of genes on a chromosome.
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End-Tag sequencing
Hi-C = high-throughput chromosome conformation capture

Genomic data: genes, ChIP-seq peaks,
ChIA-PET contact regions
Figure tracks: genes; plot of ChIA-PET chromosomal contacts; chromosomal domain;
peaks of ChIP-seq profiles.

File formats and their presentation
BED file example
>track name=ER_E2 description=ER_E2
chr1 557112 558114
chr1 559459 560286
chr1 998864 999397
chr1 999399 999604
chr1 1004343 1005146
chr1 1070346 1071080
chr1 1305474 1306502
chr1 1358287 1358744
chr1 1776987 1777750
chr1 1820476 1821168
chr1 1922754 1923628
chr1 2131962 2132747
chr1 2325805 2326447
chr1 2368996 2369977
chr1 3119829 3120541
chr1 3244610 3245121
…
Domain data for mouse cells were obtained in the laboratory of O.L. Serov (ICG
SB RAS) (Fib_domains, Sp_domains).
The size of one genomic profile file ranges from 100 MB to 2-3 GB.
RefSeq annotation was taken from the UCSC Genome Browser:
http://genome.ucsc.edu/cgi-bin/hgTables
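
A minimal sketch of reading the three-column BED records shown on this slide. It is written in Python for illustration only and is not the Java program described later; the names `Interval` and `parse_bed` are hypothetical. It assumes whitespace-separated columns (chromosome, start, end) and skips header lines such as the `>track` line in the example.

```python
from typing import Iterable, List, NamedTuple


class Interval(NamedTuple):
    """One BED record: chromosome name, start and end coordinates."""
    chrom: str
    start: int
    end: int


def parse_bed(lines: Iterable[str]) -> List[Interval]:
    """Parse BED lines, keeping only the first three columns."""
    intervals = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and header lines
        # ("track ...", "browser ...", or the ">track ..." form on the slide).
        if not line or line.startswith(("#", ">", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # not a complete record
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2])))
    return intervals


example = """>track name=ER_E2 description=ER_E2
chr1 557112 558114
chr1 559459 560286
chr1 998864 999397
"""
peaks = parse_bed(example.splitlines())
```

Because `parse_bed` accepts any iterable of lines, an open file object can be passed in directly, so a profile of several gigabytes is read line by line rather than loaded as one string.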

Calculation of the position of genes and
domain boundaries
• A1: left coordinate of the gene; B1: right coordinate of the gene.
• A2: left coordinate of the domain; B2: right coordinate of the domain.
• E: accuracy, user-defined.
• If (|A1 - A2| <= E) and (B1 < A2 + (B2 - A2)/2) is true, we assume that the gene
lies close to the left boundary of the domain. A similar condition applies to the right
boundary.
Figure: The linear arrangement of genes in the domain; labels along the chromosome:
A2, A1, B1, B2 (domain, gene), with the accuracy E marked.
Figure: Example of the location of chromosomal domains and genes on mouse
chromosome 10.
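
The boundary condition on this slide can be sketched as a pair of small functions. This is a Python illustration, not the author's Java code, and the function names are hypothetical. The left-boundary test follows the slide directly. The right-boundary test is an assumed mirror image, since the slide only says the condition is "similar".

```python
def near_left_boundary(a1: int, b1: int, a2: int, b2: int, e: int) -> bool:
    """Slide condition: |A1 - A2| <= E and B1 < A2 + (B2 - A2)/2."""
    midpoint = a2 + (b2 - a2) / 2
    return abs(a1 - a2) <= e and b1 < midpoint


def near_right_boundary(a1: int, b1: int, a2: int, b2: int, e: int) -> bool:
    """Assumed mirror condition: |B1 - B2| <= E and A1 > A2 + (B2 - A2)/2."""
    midpoint = a2 + (b2 - a2) / 2
    return abs(b1 - b2) <= e and a1 > midpoint


# Hypothetical domain and accuracy, chosen only for illustration.
domain_start, domain_end, accuracy = 1_000_000, 2_000_000, 10_000

left_hit = near_left_boundary(1_004_343, 1_005_146, domain_start, domain_end, accuracy)
right_hit = near_right_boundary(1_995_000, 1_999_000, domain_start, domain_end, accuracy)
```

The midpoint term keeps a gene from being assigned to the left boundary when it extends past the middle of the domain, which matters for short domains and large values of E.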

Table: Types of gene location relative to chromosomal
domains
Other: other genes
Inside: genes that lie within the domains
onBorder: genes lying on the domain
boundaries.
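
A sketch of how genes might be sorted into the three categories of this table and tallied. It is an illustration in Python under stated assumptions, not the program described in the slides: genes and domains are taken to be on the same chromosome, "Inside" is assumed to mean fully contained in a domain without meeting a boundary condition, and "Other" covers everything else. The function `classify_gene` and all coordinates are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int]  # (left coordinate, right coordinate)


def classify_gene(gene: Span, domains: List[Span], e: int) -> str:
    """Return 'onBorder', 'Inside', or 'Other' for one gene."""
    a1, b1 = gene
    label = "Other"
    for a2, b2 in domains:
        midpoint = a2 + (b2 - a2) / 2
        on_left = abs(a1 - a2) <= e and b1 < midpoint
        # Right-hand test is an assumed mirror of the left-hand one.
        on_right = abs(b1 - b2) <= e and a1 > midpoint
        if on_left or on_right:
            return "onBorder"
        if a2 <= a1 and b1 <= b2:
            label = "Inside"
    return label


# Hypothetical coordinates for illustration only.
domains = [(1_000_000, 2_000_000), (2_500_000, 3_500_000)]
genes = [
    (1_004_343, 1_005_146),  # close to the left boundary of the first domain
    (1_400_000, 1_450_000),  # well inside the first domain
    (2_100_000, 2_200_000),  # between the two domains
    (3_495_000, 3_499_000),  # close to the right boundary of the second domain
]

counts = Counter(classify_gene(g, domains, 10_000) for g in genes)
shares = {label: 100 * n / len(genes) for label, n in counts.items()}
```

The boundary test is checked before containment, so a gene that is both inside a domain and within E of its edge is counted as onBorder rather than Inside.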

Analysis of the location of a gene set relative to
the domains in different cell types
• The user specifies a list of genes. It is also possible to analyze all genes in the genome
(20,000 genes).
• Cell types: mouse embryonic stem cells (fibroblasts, Fib) and sperm (Sp).
Hi-C experiment, ICG SB RAS.
Sp (densely packed
structure)
92.5% of genes within domains
1.4% on border
6.1% other
Fib (open chromatin)
72.6% of genes within domains
3.2% on border
24% other

Experimental data.
Gene Ontology categories
Genes lying on the domain boundaries were taken
for analysis.
The result was sorted by the number of
genes sharing a common biological process
category.
Online resource used: DAVID,
http://david.abcc.ncifcrf.gov/

Analysis of the co-expression of genes lying on the
borders of spatial domains
Genes located on the domain boundaries were taken for analysis.
Online resource used: STRING, http://string-db.org/
Main result: graphs of gene networks with different degrees of
connectivity for the two cell types.
Fib
698: total number of genes on
the domain boundaries
88: genes involved in
connections
160 connected pairs
12% of the total number of genes
Sp
314: total number of
genes on the domain
boundaries
13: genes involved in
connections
10 connected pairs
4% of the total number of genes
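
The percentages on this slide appear to be the share of boundary genes that take part in at least one connection. This reading is an assumption, since the slide does not define them. Under it, the arithmetic is:

```latex
\frac{88}{698} \approx 0.126 \;(\approx 12.6\%), \qquad
\frac{13}{314} \approx 0.041 \;(\approx 4.1\%)
```

Both values agree with the reported 12% and 4% up to rounding.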

Conclusion
• A Java program was implemented.
• The program was applied to experimental data (ICG SB RAS
and databases of chromosome contacts).
• The location of gene sets in chromosomal domains was analyzed
(control: computer simulation).

Next Steps
• Identify domains containing pluripotency genes in the mouse genome (Dixon
et al., 2012).
• Make the developed project compatible with other programs developed at
ICG SB RAS for microarray data, written in Java and C/C++.
• Integrate the program with microarray gene expression data for the human genome
from the BioGPS database.
Thank you for your attention!

Publications (Abstracts)
• Safronova N.S., Kulakova E.V., Orlov Yu.L. (2013) Applications of text complexity measures to
genome sequences analysis // Proceedings of GIW-2013, National University of Singapore,
16-18 Dec 2013. P. 42.
• Medvedeva I.V., Vishnevsky O.V., Kulakova E.V., Spitsina A.M., Afonnikov D.A., Kochetov
A.V., Orlov Yu.L. (2014) Genomic organization and context characteristics of genes with
elevated expression in brain cells // XVI All-Russian Scientific and Technical Conference
"Neuroinformatics-2014": Collection of scientific papers.
Moscow: NRNU MEPhI. Part 2, pp. 32-42.
• Kulakova E.V., Bryzgalov L.O., Orlov Y.L., Li G., Ruan Y. Computer analysis of chromosome
contacts revealed by sequencing // BGRS\SB-2014 conference (Bioinformatics of Genome
Regulation and Structure\Systems Biology).
• Kulakova E.V., Podkolodnaya O.A., Serov O.L., Orlov Y.L. Computer data analysis of genome
sequencing by technology ChIP-seq and Hi-C // BGRS\SB-2014 conference (Bioinformatics
of Genome Regulation and Structure\Systems Biology). P. 90.
• Kulakova E.V. Computer data analysis of genome sequencing by technology
ChIP-seq and Hi-C // MNSK-2014 conference (International Scientific Student
Conference). P. 207.
• Spitsina A., Kulakova E.V., Safronova N., Orlova N.G. Statistical analysis
of gene expression data by rank correlation coefficients // BGRS\SB-2014 conference
(Bioinformatics of Genome Regulation and Structure\Systems Biology). P. 91.
