ENCODE is a project that aims to identify all functional elements in the human genome. It has characterized elements such as transcribed regions, protein-coding genes, transcription factor binding sites, chromatin structure, and DNA methylation sites across many cell types using various methods like RNA-seq, ChIP-seq, and RRBS. While it has provided valuable information on the functional elements encoded in the human genome, it is limited by the small number of cell types and factors analyzed. Future goals include expanding the analysis to more cell types to further understanding of the human genome.
New insights into the human genome by ENCODE project
Senthil Natesan
It’s been ten years since scientists sequenced the human genome. But what do all these letters mean?
Researchers could identify in its 3 billion letters many of the regions that code for proteins, but those make up little more than 1% of the genome, contained in around 21,000 genes: a few familiar objects in an otherwise stark and unrecognizable landscape. Many biologists suspected that the information responsible for the wondrous complexity of humans lay somewhere in the ‘deserts’ between the genes (The ENCODE Project Consortium, 2012).
Interpreting the human genome sequence is one of the leading challenges of 21st century biology (Collins et al., 2003). In 2003, the National Human Genome Research Institute (NHGRI) embarked on an ambitious project, the Encyclopedia of DNA Elements (ENCODE), aiming to delineate all of the functional elements encoded in the human genome sequence (The ENCODE Project Consortium, 2004). To further this goal, NHGRI organized the ENCODE Consortium, an international group of investigators with diverse backgrounds and expertise in the production and analysis of high-throughput functional genomic data. In a pilot project phase spanning 2003–2007, the Consortium applied and compared a variety of experimental and computational methods to annotate functional elements in a defined 1% of the human genome (The ENCODE Project Consortium, 2007).
2. What is a gene?
ENCODE
• Union of genomic sequences encoding a coherent set of potentially overlapping functional products.
(Gerstein et al., 2007)
3. It’s been ten years since scientists sequenced the human genome.
But what do all these letters mean?
9. ENCODE
• Major methods
• Data production and initial analysis
• Accessing ENCODE data
• Working with ENCODE data
• Data analysis
• Limitations
• Threads – Nature explorer
13. RNA-seq – Isolation of RNA sequences followed by high-throughput sequencing
CAGE – Capture of the methylated cap at the 5′ end of RNA, followed by high-throughput sequencing
RNA-PET – Simultaneous capture of RNAs with both a 5′ methyl cap and a poly(A) tail
ChIP-seq – Chromatin immunoprecipitation followed by sequencing
FAIRE-seq – Formaldehyde-assisted isolation of regulatory elements: crosslinking, phenol extraction, and sequencing of the DNA fragments in the aqueous phase
16. ENCODE data production and initial analyses
• Since 2007, ENCODE has developed methods and performed a large
number of sequence-based studies to map functional elements across
the human genome.
• The elements mapped (and approaches used) include
RNA transcribed regions (RNA-seq, CAGE, RNA-PET and manual
annotation),
Protein-coding regions (mass spectrometry),
Transcription-factor-binding sites (ChIP-seq and DNase-seq),
Chromatin structure (DNase-seq, FAIRE-seq, histone ChIP-seq),
DNA methylation sites (RRBS assay)
(The ENCODE Project Consortium, 2012)
17. Transcribed and protein-coding regions
• In total, GENCODE-annotated exons of protein-coding genes cover 2.94% of the
genome or 1.22% for protein-coding exons.
• Protein-coding genes span 33.45% from the outermost start to stop codons, or
39.54% from promoter to poly(A) site.
• Additional protein-coding genes remain to be found.
• In addition, they annotated 8,801 automatically derived small RNAs and 9,640
manually curated long non-coding RNA (lncRNA) loci.
• GENCODE also annotated 11,224 pseudogenes.
(The ENCODE Project Consortium, 2012)
18. Process flow of experimental evaluation of
pseudogene transcription
Experimental validation
results showing the
transcription of pseudogenes
in different tissues
(Pei et al., 2012)
19. ENCODE gene and transcript annotations.
(The ENCODE Project Consortium, 2011)
20. RNA
• They sequenced RNA from different cell lines and multiple
subcellular fractions to develop an extensive RNA expression
catalogue.
• They used CAGE-seq (5′ cap-targeted RNA isolation and
sequencing) to identify 62,403 transcription start sites (TSSs) in tier 1 and 2 cell types
(The ENCODE Project Consortium, 2012)
21. A large majority of GENCODE elements are detected by
RNA-seq data
(Djebali et al., 2012)
22. Protein bound regions
• They mapped the binding locations of 119 different DNA-binding proteins and a number of RNA polymerase components in 72 cell types using ChIP-seq.
• Overall, 636,336 binding regions covering 231 megabases (8.1%) of the genome are enriched for regions bound by DNA-binding proteins across all cell types.
(The ENCODE Project Consortium, 2012)
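As a quick sanity check on the figures above, the arithmetic can be sketched in a few lines of Python. The three input numbers come from the slide; the variable names are illustrative, and the "mean region size" is only a simple ratio that assumes the 636,336 regions do not overlap one another.

```python
# Figures quoted on the slide (The ENCODE Project Consortium, 2012).
binding_regions = 636_336      # number of ChIP-seq binding regions
covered_bases = 231e6          # 231 megabases covered
genome_fraction = 0.081        # 8.1% of the genome

# Simple ratio; assumes the regions do not overlap (an assumption, not from the source).
mean_region_bp = covered_bases / binding_regions

# Genome size implied by "231 Mb is 8.1% of the genome".
implied_genome_bp = covered_bases / genome_fraction

print(f"mean region size: about {mean_region_bp:.0f} bp")
print(f"implied genome size: about {implied_genome_bp / 1e9:.2f} Gb")
```

The implied denominator of roughly 2.85 Gb is somewhat below the "3 billion letters" usually quoted; the slide does not say which genome size was used, so the check only confirms that the two quoted figures are broadly consistent.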
23. Occupancy of transcription factors and RNA polymerase II on human chromosome 6p as determined by ChIP-seq
25. DNase I hypersensitive sites and footprinting
• Chromatin accessibility characterized by DNase I hypersensitivity
is the hallmark of regulatory DNA regions.
• 2.89 million unique, non-overlapping DNase I hypersensitive sites (DHSs) were identified by DNase-seq in 125 cell types; the overwhelming majority lie distal to TSSs.
• In tier 1 and tier 2 cell types, an average of 205,109 DHSs were identified per cell type, encompassing an average of 1.0% of the genomic sequence in each cell type, and 3.9% in aggregate.
(The ENCODE Project Consortium, 2012)
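The aggregate figure is smaller than the sum of the per-cell-type figures because sites found in different cell types often overlap. A minimal sketch of that bookkeeping follows, assuming sites are half-open (start, end) intervals on a single chromosome. The cell-type names and coordinates are made up for illustration, and this is not the ENCODE pipeline itself.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def covered_bases(intervals):
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Hypothetical DHS calls for two cell types (coordinates are invented).
dhs_by_cell_type = {
    "cell_type_A": [(100, 250), (1000, 1150)],
    "cell_type_B": [(200, 350), (5000, 5150)],
}

per_type = {name: covered_bases(sites) for name, sites in dhs_by_cell_type.items()}
all_sites = [site for sites in dhs_by_cell_type.values() for site in sites]
aggregate = covered_bases(all_sites)

print(per_type)   # bases covered within each cell type
print(aggregate)  # bases covered by the union across cell types
```

Dividing either number by the genome length would give the corresponding percentage; in this toy data the union covers fewer bases than the two cell types summed, because one site is shared.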
26. Density of DNase I cleavage sites for selected cell types
(Thurman et al., 2012)
27. • On average, 98.5% of the occupancy sites of transcription factors mapped by ENCODE ChIP-seq lie within accessible chromatin defined by DNase I hotspots.
• Using genomic DNase I footprinting on 41 cell types, they identified 8.4 million distinct DNase I footprints.
(The ENCODE Project Consortium, 2012)
28. Regions of histone modification
• They assayed chromosomal locations for up to 12 histone
modifications and variants in 46 cell types, across tier 1 and 2.
(The ENCODE Project Consortium, 2012) (http://www.factorbook.org)
29. DNA methylation
• They used reduced representation bisulphite sequencing (RRBS) to profile DNA methylation quantitatively for an average of 1.2 million CpGs in each of 82 cell lines and tissues (8.6% of non-repetitive genomic CpGs), including CpGs in intergenic regions, proximal promoters and intragenic regions.
(The ENCODE Project Consortium, 2012)
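To illustrate what "quantitatively" means here: bisulphite treatment converts unmethylated cytosines so that they are read as T, while methylated cytosines are still read as C, so the methylation level at a CpG can be estimated as the fraction of reads showing C. The sketch below assumes per-CpG read counts are already available; the function name, positions and counts are hypothetical and not taken from ENCODE data.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads supporting methylation at one CpG.

    c_reads: reads showing C (cytosine protected, i.e. methylated)
    t_reads: reads showing T (cytosine converted, i.e. unmethylated)
    Returns None when the site has no coverage.
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


# Hypothetical read counts per CpG: (chromosome, position) -> (C reads, T reads)
cpg_counts = {
    ("chr1", 10468): (18, 2),
    ("chr1", 10470): (0, 25),
    ("chr1", 10483): (0, 0),
}

levels = {site: methylation_level(c, t) for site, (c, t) in cpg_counts.items()}
for site, level in levels.items():
    print(site, level)
```

Sites with no coverage are reported as None rather than 0, so that "unmeasured" is not confused with "unmethylated".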
30. Proteomics
To assess putative protein products generated from novel RNA
transcripts and isoforms, proteins are sequenced and quantified
by mass spectrometry and mapped back to their encoding
transcripts.
Protein studies have begun in the K562 and GM12878 cell lines.
(The ENCODE Project Consortium, 2011)
32. Accessing ENCODE Data
ENCODE Data Release and Use Policy
• The ENCODE Data Release and Use Policy is described at
http://www.encodeproject.org/ENCODE/terms.html.
• ENCODE data are released for viewing in a publicly accessible
browser (initially at http://genome-preview.ucsc.edu/ENCODE
and, after additional quality checks, at http://encodeproject.org)
Public Repositories
• UCSC Genome Browser database (http://genome.ucsc.edu).
(The ENCODE Project Consortium, 2011)
34. Working with ENCODE Data
Using ENCODE Data in the UCSC Browser
• Many users will want to view and interpret the ENCODE data for
particular genes of interest. At the online ENCODE portal
(http://encodeproject.org), users should follow a “Genome
Browser” link to visualize the data in the context of other genome
annotations.
(The ENCODE Project Consortium, 2011)
35. ENCODE Data Analysis
• Development and implementation of algorithms and pipelines for processing and analyzing data is a major activity of the ENCODE Project.
• 1st phase: short sequences are aligned to the reference genome
• 2nd phase: the enriched regions are identified
• 3rd phase: the identified regions of enriched signal are integrated with each other and with other data types
(The ENCODE Project Consortium, 2011)
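The second and third phases listed above can be sketched with a toy example. This is not the ENCODE pipeline: it assumes reads have already been aligned (first phase) and are given as half-open (start, end) intervals on one chromosome, it uses a plain depth threshold in place of a statistical peak caller, and all function names and coordinates are illustrative.

```python
from collections import Counter


def coverage(reads):
    """Per-base read depth from aligned reads given as (start, end), end exclusive."""
    depth = Counter()
    for start, end in reads:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


def enriched_regions(depth, min_depth):
    """2nd phase (toy version): merge consecutive positions with depth >= min_depth."""
    regions = []
    for pos in sorted(p for p, d in depth.items() if d >= min_depth):
        if regions and regions[-1][1] == pos:
            regions[-1][1] = pos + 1      # extend the current region
        else:
            regions.append([pos, pos + 1])  # start a new region
    return [tuple(r) for r in regions]


def intersect(regions_a, regions_b):
    """3rd phase (toy version): overlapping portions of two region lists."""
    shared = []
    for a_start, a_end in regions_a:
        for b_start, b_end in regions_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                shared.append((start, end))
    return shared


# Hypothetical aligned reads from one assay.
reads = [(100, 150), (120, 170), (130, 180), (400, 450)]
peaks = enriched_regions(coverage(reads), min_depth=2)

# Hypothetical regions from another data type, e.g. a second assay.
other_regions = [(160, 200)]
print(peaks)
print(intersect(peaks, other_regions))
```

The per-base counter keeps the example short but would be far too slow for real data; production tools work on sorted intervals and model the background signal rather than applying a fixed threshold.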
37. Integrating ENCODE with other projects and the
Scientific Community
1. defining promoter and enhancer regions by combining transcript
mapping and biochemical marks,
2. delineating distinct classes of regions within the genomic
landscape by their specific combinations of biochemical and
functional characteristics, and
3. defining transcription factor co-associations and regulatory
networks.
(The ENCODE Project Consortium, 2011)
38. • ENCODE Project data aid the interpretation of human genome variation that is associated with disease or quantitative phenotypes
• Integration with the 1000 Genomes Project: to study how SNPs and structural variation may affect transcript, regulatory and DNA methylation data
• ENCODE data complement GWAS and other sequence variation driven studies of human phenotypes
• ENCODE is a major contributor not only of data but also of novel technologies for deciphering the human genome
(The ENCODE Project Consortium, 2011)
39. Limitations of ENCODE Annotations
• Cell types are physiologically and genetically inhomogeneous.
• Local micro-environments in culture may also vary.
• The use of DNA sequencing to annotate functional genomic features is also constrained.
• There is considerable quantitative variation in signal strength along the genome.
(The ENCODE Project Consortium, 2011)
40. Challenges
• The adult human body contains several hundred distinct cell types, each of which expresses a unique subset of the 1,800 TFs encoded in the human genome
• The brain alone contains thousands of types of neurons that are likely to express not only different sets of TFs but also a larger variety of non-coding RNAs
• A truly comprehensive atlas of human functional elements is not practical with current technologies
(The ENCODE Project Consortium, 2011)
41. Outcome
• Improved understanding of the human genome
• The broad coverage of ENCODE annotations enhances our understanding of common diseases with a genetic component and of rare genetic diseases
• ENCODE sampled only 119 of 1,800 known transcription factors and 13 of more than 60 currently known histone or DNA modifications across 147 cell types
• Overall, these data reflect a minor fraction of the potential functional information encoded in the human genome
(The ENCODE Project Consortium, 2012)
43. 13 Threads
1. Transcription factor motifs
2. Chromatin patterns at transcription factor binding sites
3. Characterization of intergenic regions and gene definition
4. RNA and chromatin modification patterns around promoters
5. Epigenetic regulation of RNA processing
6. Non-coding RNA characterization
7. DNA methylation
8. Enhancer discovery and characterization
9. Three-dimensional connections across the genome
10. Characterization of network topology
11. Machine learning approaches to genomics
12. Impact of functional information on understanding variation
13. Impact of evolutionary selection on functional regions
48. Future goal
• Understand the mechanistic processes that generate these elements, and how and
where they function
• Enlarge the data set to additional factors, modifications and cell
types, complementing the other related projects
• Constitute foundational resources for human genomics, allowing a
deeper interpretation of the organization of gene and regulatory
information and the mechanisms of regulation, and thereby
provide important insights into human health and disease
(The ENCODE Project Consortium, 2012)
49. Project is still far from complete
Conclusion
For updates: https://www.facebook.com/ENCODEProject
These analyses reveal that the human genome encodes a diverse array of transcripts. For example, in the proto-oncogene TP53 locus, RNA-seq data indicate that, while TP53 transcripts are accurately assigned to the minus strand, those for the oppositely transcribed, adjacent gene WRAP53 emanate from the plus strand (Figure 3). An independent transcript within the first intron of TP53 is also observed in both GM12878 and K562 cells (Figure 3).
The upper portion shows the ChIP-seq signal of five sequence-specific transcription factors and RNA Pol2 throughout the 58.5 Mb of the short arm of human chromosome 6 of the human lymphoblastoid cell line GM12878. Input control signal is shown below the RNA Pol2 data. At this level of resolution, the sites of strongest signal appear as vertical spikes in blue next to the name of each experiment (“BATF,” “EBF,” etc.).
A 116 kb segment of the HLA region is expanded; here, individual sites of occupancy can be seen mapping to specific regions of the three HLA genes shown at the bottom, with asterisks indicating binding sites called by peak calling software. Finally, the lower left region shows a 3,500 bp region around two tandem histone genes, with RNA Pol2 occupancy at both promoters and two of the five transcription factors, BATF and cFos, occupying sites nearby.
They organized all the information associated with each transcription factor including the ChIP-seq peaks, discovered motifs and associated histone modification patterns in FactorBook (http://www.factorbook.org), a public resource that will be updated as the project proceeds.
After curation and review at the Data Coordination Center, all processed ENCODE data are publicly released to the UCSC Genome Browser database (http://genome.ucsc.edu).
Three different types of regulatory data are represented for an area of the genome: motif-based predictions, DNase I hypersensitivity peaks, and ChIP-seq peaks. This region contains six SNPs. SNP1 is associated with a phenotype in a genome-wide association study. SNP3 is an eQTL associated with changes in gene expression in a different study. SNP6 overlaps a predicted motif, a DNase I hypersensitivity peak, and a ChIP-seq peak. There are, therefore, multiple sources of evidence that SNP6 is in a regulatory region. Furthermore, SNP6 is in perfect linkage disequilibrium (r² = 1.0) with SNP1 and SNP3, meaning that there is transitive evidence due to the LD that SNP6 is also associated with the phenotype and is also an eQTL. SNP6 is therefore the most likely functional SNP in this associated region.
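The reasoning in this caption can be sketched in a few lines of Python. This is an illustrative toy, not the scoring scheme of any ENCODE tool: the `Snp` class, the `rank_by_evidence` function, and the rule of adding one point per piece of evidence are all assumptions, as is the choice to give SNP2, SNP4 and SNP5 no evidence (the caption does not describe them).

```python
from dataclasses import dataclass, field


@dataclass
class Snp:
    """Hypothetical record of the evidence attached to one SNP."""
    name: str
    regulatory: set = field(default_factory=set)   # direct overlaps with regulatory data
    association: set = field(default_factory=set)  # direct GWAS or eQTL evidence


def rank_by_evidence(snps, perfect_ld_groups):
    """Score = direct regulatory overlaps + association evidence, where
    association evidence is shared among SNPs in perfect LD (r^2 = 1.0)."""
    by_name = {snp.name: snp for snp in snps}
    scores = {}
    for snp in snps:
        inherited = set(snp.association)
        for group in perfect_ld_groups:
            if snp.name in group:
                for partner in group:
                    inherited |= by_name[partner].association
        scores[snp.name] = len(snp.regulatory) + len(inherited)
    # Highest score first; ties are broken alphabetically by name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


snps = [
    Snp("SNP1", association={"GWAS"}),
    Snp("SNP2"),  # assumed: no evidence
    Snp("SNP3", association={"eQTL"}),
    Snp("SNP4"),  # assumed: no evidence
    Snp("SNP5"),  # assumed: no evidence
    Snp("SNP6", regulatory={"motif", "DNase I peak", "ChIP-seq peak"}),
]
ranking = rank_by_evidence(snps, perfect_ld_groups=[{"SNP1", "SNP3", "SNP6"}])
print(ranking[0])
```

With these inputs SNP6 ranks first: it has three direct regulatory overlaps and inherits both the GWAS and the eQTL evidence through LD, whereas SNP1 and SNP3 carry only the two association signals.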
Aggregate overlap of phenotypes to selected transcription-factor-binding sites (left matrix) or DHSs in selected cell lines (right matrix), with a count of overlaps between the phenotype and the cell line/factor. Values in blue squares pass an empirical P-value threshold ≤ 0.01 (based on the same analysis of overlaps between randomly chosen, GWAS-matched SNPs and these epigenetic features) and have at least a count of three overlaps. The P value for the total number of phenotype–transcription factor associations is < 0.001.
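An empirical P value of this kind is typically obtained by comparing the observed overlap count with the counts seen for many random, matched SNP sets. The sketch below assumes the common (exceedances + 1) / (draws + 1) estimator, which the caption does not specify, and uses made-up null counts. The function names are hypothetical; only the 0.01 threshold and the minimum of three overlaps come from the caption.

```python
def empirical_p_value(observed, null_counts):
    """Fraction of null draws with at least as many overlaps as observed,
    using the add-one correction so the estimate is never exactly zero."""
    exceed = sum(1 for count in null_counts if count >= observed)
    return (exceed + 1) / (len(null_counts) + 1)


def passes_threshold(observed, null_counts, alpha=0.01, min_overlaps=3):
    """Apply both criteria from the caption: P <= alpha and >= 3 overlaps."""
    if observed < min_overlaps:
        return False
    return empirical_p_value(observed, null_counts) <= alpha


# Made-up overlap counts for 999 random, GWAS-matched SNP sets.
null_counts = [0] * 900 + [1] * 90 + [2] * 9

p_five = empirical_p_value(5, null_counts)
print(p_five, passes_threshold(5, null_counts))
print(passes_threshold(2, null_counts))
```

In this toy example, five observed overlaps are never matched by the null sets, so the cell would be coloured blue. Two overlaps fail regardless of the P value because they fall below the minimum count of three.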
Several SNPs associated with Crohn’s disease and other inflammatory diseases that reside in a large gene desert on chromosome 5, along with some epigenetic features indicative of function. The SNP (rs11742570) strongly associated to Crohn’s disease overlaps a GATA2 transcription-factor-binding signal determined in HUVECs. This region is also DNase I hypersensitive in HUVECs and T-helper TH1 and TH2 cells. An interactive version of this figure is available in the online version of the paper.
Users are able to interface with our database by entering lists of SNVs or regions to identify common SNVs at http://www.RegulomeDB.org/ (a). They are then presented with a sorted list of the most important SNVs (b). These SNVs can be examined for the evidence used to rank them as well as a citation for the evidence.
Scientists in the Encyclopedia of DNA Elements Consortium have applied 24 experiment types (across) to more than 150 cell lines (down) to assign functions to as many DNA regions as possible, but the project is still far from complete.