This document discusses computer analysis of genome sequencing data from ChIP-seq and Hi-C technologies. It presents a study analyzing chromosomal contacts and spatial organization of chromosomes using these experimental methods. The author developed a Java program to integrate ChIP-seq and Hi-C data and analyze the location of genes relative to topological domains. The program was applied to mouse genome data, identifying genes on domain boundaries and their gene ontology categories and co-expression networks. Further development of the program and integration with other genomic datasets is proposed.
Kulakova sbb2014
1. COMPUTER DATA ANALYSIS OF GENOME SEQUENCING BY TECHNOLOGY ChIP-seq AND Hi-C
Adviser: Yuri Orlov, ICG SB RAS
Author: Ekaterina Kulakova, bachelor
2. Topicality
Automated systems make it possible to decode DNA and genomic sequences, up to whole genomes. Complete genome sequencing leads to avalanche-like growth in sequence information (megabytes and gigabytes of data).
The development of methods based on chromatin immunoprecipitation (ChIP-seq, ChIA-PET) yields qualitatively new data.
New tasks arise in computational genomics (analysis of spatial, non-linear chromosome structures).
Aim and scientific novelty
The aim of this work is to study chromosomal contacts in the cell nucleus using computer programs for statistical processing of data on genes and chromosomal domains, and to analyze ChIP-seq and Hi-C experimental data.
Integrating modern genome-wide ChIP-seq and Hi-C data, which became available only in the last two or three years.
Using a precision parameter for location on the chromosome with which the data are analyzed.
Establishing a list of genes located on the boundaries of topological chromosomal domains.
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End-Tag sequencing
3. Methods Hi-C and ChIA-PET*
Figure: Arrangement of chromosomes in the cell nucleus (reconstruction according to Hi-C). Source: Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science, 2009.
Figure: Topological arrangement of chromosome domains and its mapping in the genome. Scheme of local chromosomal domains ("tangle" contacts) and separate loops (Dixon et al., 2012).
Figure: Scheme of arrangement of genes on the chromosome.
*ChIP-seq = Chromatin ImmunoPrecipitation sequencing
ChIA-PET = Chromatin Interaction Analysis by Paired-End-Tag sequencing
Hi-C = high-throughput chromosome conformation capture
4. Genomic data: genes, ChIP-seq peaks, ChIA-PET contact areas
[Figure: tracks showing genes, a plot of ChIA-PET chromosomal contacts, a chromosomal domain, and peaks of ChIP-seq profiles]
5. File formats and their presentation
Bed-file example
>track name=ER_E2 description=ER_E2
chr1 557112 558114
chr1 559459 560286
chr1 998864 999397
chr1 999399 999604
chr1 1004343 1005146
chr1 1070346 1071080
chr1 1305474 1306502
chr1 1358287 1358744
chr1 1776987 1777750
chr1 1820476 1821168
chr1 1922754 1923628
chr1 2131962 2132747
chr1 2325805 2326447
chr1 2368996 2369977
chr1 3119829 3120541
chr1 3244610 3245121
…
Data on domains in mouse cells were obtained in the laboratory of O.L. Serov (ICG SB RAS) (Fib_domains, Sp_domains).
The size of one file with a genomic profile ranges from 100 MB to 2-3 GB.
RefSeq annotation was taken from the UCSC Genome Browser:
http://genome.ucsc.edu/cgi-bin/hgTables
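The BED layout above (chromosome, start, end) can be read with a few lines of Java. This is a minimal sketch rather than the author's program: the class names `BedInterval` and `BedParser` are hypothetical, and it assumes whitespace-separated columns and that header lines such as the `>track ...` line shown above should be skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** One interval from a BED file: chromosome plus start and end coordinates. */
final class BedInterval {
    final String chrom;
    final long start;
    final long end;

    BedInterval(String chrom, long start, long end) {
        this.chrom = chrom;
        this.start = start;
        this.end = end;
    }
}

/** Hypothetical helper, not part of the original program. */
final class BedParser {
    /** Parses BED lines, skipping blank lines, comments and track/browser headers. */
    static List<BedInterval> parse(List<String> lines) {
        List<BedInterval> intervals = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#") || line.startsWith("track")
                    || line.startsWith(">track") || line.startsWith("browser")) {
                continue; // header or comment, not an interval
            }
            String[] fields = line.split("\\s+");
            if (fields.length < 3) {
                continue; // not a complete chrom/start/end record
            }
            intervals.add(new BedInterval(fields[0],
                    Long.parseLong(fields[1]), Long.parseLong(fields[2])));
        }
        return intervals;
    }
}
```

For files at the upper end of the sizes quoted above, reading line by line through a `BufferedReader` instead of holding all lines in a list would probably be the better choice.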
6. Calculation of the position of genes and domain boundaries
A1 – left coordinate of the gene, B1 – right coordinate of the gene.
A2 – left coordinate of the domain, B2 – right coordinate of the domain.
E – accuracy, user-defined.
If (|A1 – A2| <= E) & (B1 < A2 + (B2 – A2)/2) is true, we assume that the gene lies close to the left boundary of the domain. Similar conditions apply for the right border.
[Diagram: a domain spanning A2 to B2 with a gene spanning A1 to B1 near its left edge; E marks the allowed distance between A1 and A2]
Figure: Example of the location of chromosomal domains and genes on mouse chromosome 10.
Figure: The linear arrangement of genes in the domain.
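The left-boundary rule on this slide translates directly into a predicate. The sketch below uses the slide's symbols as parameter names; the class name `BoundaryCheck` is hypothetical, and the right-border rule is only an assumed mirror image of the left one, since the slide says "similar conditions" without spelling them out.

```java
/** Hypothetical helper illustrating the near-boundary rule from the slide. */
final class BoundaryCheck {
    /** Slide rule: |A1 - A2| <= E and B1 < A2 + (B2 - A2)/2. */
    static boolean nearLeftBoundary(long a1, long b1, long a2, long b2, long e) {
        double midpoint = a2 + (b2 - a2) / 2.0; // middle of the domain
        return Math.abs(a1 - a2) <= e && b1 < midpoint;
    }

    /** Assumed mirror rule (not on the slide): |B1 - B2| <= E and A1 > midpoint. */
    static boolean nearRightBoundary(long a1, long b1, long a2, long b2, long e) {
        double midpoint = a2 + (b2 - a2) / 2.0;
        return Math.abs(b1 - b2) <= e && a1 > midpoint;
    }
}
```

With a domain from 1000 to 2000 and E = 50, a gene at 1020 to 1300 satisfies the left rule, while a gene at 1020 to 1600 does not, because its right end passes the domain midpoint at 1500.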
7. Table of gene location types in chromosomal domains
Other – other genes.
Inside – genes that lie within the domains.
onBorder – genes lying on the domain boundaries.
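The three categories could be assigned along the following lines. This is a sketch under stated assumptions, not the author's implementation: the names `GeneLocation` and `GeneClassifier` are hypothetical, a gene counts as onBorder if it crosses a domain edge or satisfies the near-boundary rule with accuracy E, and genes longer than a domain, genes in gaps between domains, and genes on chromosomes without domains all fall into Other.

```java
import java.util.List;

enum GeneLocation { INSIDE, ON_BORDER, OTHER }

/** Hypothetical classifier for one gene against the domains of its chromosome. */
final class GeneClassifier {
    /** domains: {left, right} coordinate pairs on the same chromosome as the gene. */
    static GeneLocation classify(long a1, long b1, List<long[]> domains, long e) {
        boolean inside = false;
        for (long[] d : domains) {
            long a2 = d[0], b2 = d[1];
            if (b1 - a1 > b2 - a2) {
                continue; // assumption: a gene longer than the domain is never assigned to it
            }
            double mid = a2 + (b2 - a2) / 2.0;
            boolean crosses = (a1 < a2 && b1 >= a2) || (a1 <= b2 && b1 > b2);
            boolean nearLeft = Math.abs(a1 - a2) <= e && b1 < mid;
            boolean nearRight = Math.abs(b1 - b2) <= e && a1 > mid; // assumed mirror rule
            if (crosses || nearLeft || nearRight) {
                return GeneLocation.ON_BORDER;
            }
            if (a1 >= a2 && b1 <= b2) {
                inside = true; // fully contained, but keep checking other domains for borders
            }
        }
        return inside ? GeneLocation.INSIDE : GeneLocation.OTHER;
    }
}
```

Returning ON_BORDER as soon as any domain matches gives border status priority over Inside, which is a design choice of this sketch rather than something the slides state.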
8. Analysis of the location of a gene set on the domains in different cell types
The user specifies a list of genes. It is possible to analyze all the genes in the genome (20,000 genes).
Cell types: mouse embryonic stem cells (fibroblasts, Fib) and sperm (Sp). Hi-C experiment, ICG SB RAS.
Sp (densely packed structure)
92.5% genes within domains
1.4% on border
6.1% other
Fib (open chromatin)
72.6% genes within domains
3.2% on border
24% other
9. Experimental data. Gene Ontology categories
Genes lying on the domain boundaries were taken for analysis.
The result was sorted by the number of genes sharing a common biological process category.
Online resource used: DAVID, http://david.abcc.ncifcrf.gov/
10. Analysis of the co-expression of genes lying on the borders of spatial domains
Genes located on the domain boundaries were taken for analysis.
Online resource used: STRING, http://string-db.org/
The main result: graphs of gene networks with different degrees of connectivity for the two cell types.
Fib
698 – total number of genes on the domain boundaries
88 – genes involved in connections
160 connected pairs
12% of all genes
Sp
314 – total number of genes on the domain boundaries
13 – genes involved in connections
10 connected pairs
4% of all genes
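The percentages on this slide follow from the gene counts. As a small worked check, assuming the percentage is the share of boundary genes that take part in at least one connection (the class and method names are hypothetical):

```java
/** Hypothetical helper: share of boundary genes that appear in the network. */
final class NetworkShare {
    static double connectedPercent(int connectedGenes, int totalGenes) {
        if (totalGenes <= 0) {
            throw new IllegalArgumentException("totalGenes must be positive");
        }
        return 100.0 * connectedGenes / totalGenes;
    }
}
```

`NetworkShare.connectedPercent(88, 698)` is about 12.6 and `NetworkShare.connectedPercent(13, 314)` is about 4.1, which agrees with the 12% and 4% quoted on the slide if the values were truncated to whole percent.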
11. Conclusion
A Java program was implemented.
The program was applied to experimental data (ICG SB RAS and databases on chromosome contacts).
The location of a gene set in chromosomal domains was analyzed (with a control computer simulation).
12. Next Steps
Define domains that include pluripotency genes in the mouse genome (Dixon et al., 2012).
Make the developed project compatible with other programs designed at ICG SB RAS for microarray data, written in Java and C/C++.
Integrate the program with microarray gene expression data from the BioGPS database for the human genome.
Thank you for your attention!
13. Publications (Theses)
Safronova N.S., Kulakova E.V., Orlov Yu.L. (2013) Applications of text complexity measures to genome sequences analysis // Proceedings of GIW-2013, National University of Singapore, 16-18 Dec 2013. P. 42.
Medvedeva I.V., Vishnevsky O.V., Kulakova E.V., Spitsina A.M., Afonnikov D.A., Kochetov A.V., Orlov Yu.L. (2014) Genomic organization and context characteristics of genes with elevated expression in brain cells // XVI All-Russian Scientific and Technical Conference "Neuroinformatics-2014": Collection of scientific papers. Moscow: NRNU MEPhI. Part 2, pp. 32-42. (In Russian.)
Kulakova E.V., Bryzgalov L.O., Orlov Y.L., Li G., Ruan Y. Computer analysis of chromosome contacts revealed by sequencing // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\Systems Biology).
Kulakova E.V., Podkolodnaya O.A., Serov O.L., Orlov Y.L. Computer data analysis of genome sequencing by technology ChIP-seq and Hi-C // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\Systems Biology). P. 90.
Kulakova E.V. Computer data analysis of genome sequencing by technology ChIP-seq and Hi-C // MNSK-2014 conference (International Scientific Student Conference). P. 207. (In Russian.)
Spitsina A., Kulakova E.V., Safronova N., Orlova N.G. Statistical analysis of gene expression data by rank correlation coefficients // BGRS\SB-2014 conference (Bioinformatics of Genome Regulation and Structure\Systems Biology). P. 91.
Editor's Notes
The relevance of this topic rests on the fact that automated systems for determining DNA base sequences make it possible to decode DNA and genomic sequences, up to whole genomes.
Complete genome sequencing leads to avalanche-like growth in the volume of nucleotide sequence information (megabytes and gigabytes of data).
The development of chromatin immunoprecipitation and sequencing methods (ChIP-seq, ChIA-PET, pronounced "chip-seq" and "chia-pet"; the English expansions of these abbreviations are shown here) for studying regulatory regions of the genome yields qualitatively new data.
New tasks arise in computational genomics (analysis of spatial rather than linear chromosome structures).
The aim of this work is to study chromosomal contacts in the cell nucleus using computer programs for statistical processing of data on the location of genes and chromosomal domains, and to analyze ChIP-seq and Hi-C experimental data.
The scientific novelty lies in integrating modern genome-wide ChIP-seq and Hi-C data, which became available only in the last two to three years,
in using a parameter for the precision of location on the chromosome with which the data are to be analyzed,
and in establishing a list of genes located on the boundaries of topological chromosomal domains.
Chromosomes in the cell nucleus are compacted into tangles and knots. The Hi-C method (high-throughput chromosome conformation capture) gives an idea of the spatial arrangement of chromosomes in the cell nucleus. The figure on the left shows the result of a reconstruction from Hi-C data. The tangles formed by the chromosomes are divided into domains: one tangle is one domain, as shown in the center figure. On the right is a result obtained with the ChIA-PET method (Chromatin Interaction Analysis by Paired-End-Tag sequencing). This method identifies transcription factor binding sites and interacting chromatin regions that lie far apart from each other in the genome.
This slide shows a mapping of integrated ChIP-seq and ChIA-PET data. Chromosomal domains are identified from the contact data (the purple contact lines and the red triangle at the top). Such data make it possible to study the location of genes relative to domains (shown by arrows).
Domain data are stored in a BED file in the form shown on the slide: the chromosome on which the domain lies and the coordinates of its start and end.
Domain data for two cell types, fibroblasts and sperm cells, were provided by the laboratory of Oleg Leonidovich Serov, ICG SB RAS.
Gene data were taken from the UCSC Genome Browser database.
An example is shown at the bottom of the slide.
The size of one file with a genomic profile ranges from 100 MB to 2-3 GB.
The file contains the following fields: gene identifier, gene name, chromosome, start and end coordinates, gene symbol, and others.
One of the tasks in my work was to compile lists of genes located on the boundaries of spatial domains. These lists included genes of two categories: those that directly cross a boundary and those that lie "close" to it. The notion of "close" is based on an accuracy parameter entered by the user. The picture shows that the distance between the left coordinates of the domain and the gene must be within the accuracy, and the right coordinate of the gene must not go beyond the midpoint of the domain.
The slide shows the resulting table. It contains the gene name and the category: inside, on the border, or "other". By "other" I mean genes that are longer than the domain or a gene that belongs to a specific chromosome.
A calculation for random groups of genes is also possible, where the gene list is produced with a random number generator, in order to estimate the frequencies with which genes fall into these groups for the domain organization under study.
Another task was to analyze the location of a set of genes. The user specifies the gene list.
All genes in the genome can be analyzed.
The location of all genes was analyzed on the domains of two cell types, fibroblasts and sperm cells, using experimental data from Serov's laboratory, Institute of Cytology and Genetics.
In sperm cells almost all genes lie within domains. In fibroblasts a fairly large percentage of genes fall into the "other" category.
The lists of genes on domain boundaries were analyzed for Gene Ontology categories using the online resource DAVID. The charts show the biological processes, the number of genes responsible for them, the significance coefficient, and the observed and expected numbers of genes. Both charts show that the largest number of genes corresponds to phosphoproteins. The significant processes for fibroblasts are functions related to the plasma membrane; for sperm cells they are also membrane-related, which indicates dense chromatin packing of the genome.
Thus, the program produced a biological result.
The same gene list was analyzed for co-expression links using STRING, an online resource for reconstructing gene networks. The web address is shown.
Statistics on the number of genes and the percentage of links formed in the network are shown.
The gene network in fibroblasts is much more connected (figure on the left): 12 percent of genes versus 4 percent.
(If asked: in sperm cells the chromatin is closed and the DNA is in a very compact state.)
Conclusion.
A program was implemented in Java.
The program was applied to experimental data.
The location of a set of genes in chromosomal domains was analyzed (genes from gene networks and a control computer simulation).
Next steps:
Studying the location of individual genes,
developing a package compatible with programs previously developed at ICG SB RAS for microarray data
in Java and C/C++,
and integrating the developed program with microarray gene expression data from the BioGPS database.
Thank you for your attention!