Biodb 2011-05

BioinformaticsInstitute

Copyright OpenHelix. No use or reproduction without express written consent

Important note to slide users:
 To maintain the color schemes/cues and the animations, if you
import these slides into other slide sets please click the checkbox
in the PowerPoint Insert/Reuse window that maintains slide
format. Otherwise important information may be lost.
Mac users
PC users

Version 3
ENCODE Data Available through
The UCSC Genome Browser
Materials prepared by
Mary Mangan, Ph.D.
Warren C. Lathe, Ph.D.
www.openhelix.com
Updated: Q1 2011

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

ENCODE: www.genome.gov/10005107
 ENCyclopedia of DNA Elements, NHGRI
 Consortium of international researchers
 UCSC is the Data Coordination Center

ENCODE Background
 Pilot phase, or phase I: www.genome.gov/26525202
 Selected regions of the genome: 1%, 30 MB

ENCODE Discoveries
 “Marker” papers: Nature and a special issue of Genome Research
 Changes to our conceptual framework for the genome

ENCODE Pilot Data and Beyond
 ENCODE portal: http://genome.ucsc.edu/ENCODE/
 Pilot ENCODE browser: genome.ucsc.edu/ENCODE/pilot.html

ENCODE Next Phase: Production Phase
 UCSC is the DCC for human and mouse data
 The portal is available: genome.ucsc.edu/ENCODE/
 New aspects of the Production Phase projects

ENCODE Production Phase Focus
 ENCODE is now genome-wide
 Specific cell types and new technologies being applied
 Project focus topics selected, then supplemented
Focus areas (figure labels): chromatin; transcriptome/genes; promoters/regulatory sites; DNase sites

ENCODE Data is Flowing!
 Data being submitted to UCSC DCC by data providers
 “Wranglers” ensure meta data is present
 Quality checks occur, then data is released for use

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

ENCODE Data Types
 Mapping data
 Genes
 Expression
 Regulation
 Variation
Figure callout: ENCODE tracks identified with icon

Mapability Data
 Mapability for unique regions
 Higher the peak, the more unique
 Cleavage intensity for structural profiling
Figure labels: Broad: 36 mers; Duke: 20-35 mers; Rosetta: 35 mers; UMass: 15 mers
Scale (figure labels): more unique / not unique
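The idea that a higher peak means a more unique region can be illustrated with a toy calculation. The following is a minimal sketch, not the method used by any of the data providers named on the slide: it assumes a simple score of 1 divided by the number of times a k-mer occurs in the sequence, ignores the reverse strand and mismatches, and uses a made-up 12-base sequence. The function name `kmer_uniqueness` is hypothetical.

```python
from collections import Counter
from typing import List


def kmer_uniqueness(sequence: str, k: int) -> List[float]:
    """Score each start position as 1 / (occurrences of the k-mer starting there).

    A score of 1.0 means the k-mer occurs exactly once in the sequence
    (fully unique); lower scores mean the k-mer is repeated.
    Assumption: forward strand only, exact matches only.
    """
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy sequence (invented for illustration): "ACGT" occurs twice.
toy_sequence = "ACGTACGTTTGA"
scores = kmer_uniqueness(toy_sequence, 4)
for position, score in enumerate(scores):
    print(position, toy_sequence[position:position + 4], score)
```

Positions whose 4-mer is repeated receive a lower score, which corresponds to a lower peak in a display of this kind. Longer k-mers are repeated less often, which is one reason tracks built with different k-mer sizes can look different over the same region.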

GENCODE http://www.sanger.ac.uk/PostGenomics/encode/
 Gencode for assessment of protein coding genes

Expression Data: RNA Localization
 RNA molecules, location in various cell types and fractions
http://en.wikipedia.org/wiki/MRNA

Expression Data: Presence of RNA or Exons
 RNAs of various types
 Special look for long mRNAs and exons
http://en.wikipedia.org/wiki/MRNA

Regulation Data
 Regulation data
 Structure: modifications, open vs. closed chromatin
Image from NIH

Regulation Data II
 Transcription factor binding sites, TFBS
 RNA binding proteins
TATA bound to DNA

Variation Data
 Copy Number Variation (CNV) Data

Super-Tracks
 New strategies to integrate and display data
 Super-Tracks provide multiple data types to view
 See Track Description page for details, options, and keys

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

General Organization
 Tracks identified with icon
 Also available in Table Browser
 Description pages have options, settings, filters,
display keys, meta data, and references
Figure callouts: click for configuration choices, options, filters; display key, techniques, references, contacts

ENCODE Data Policy genome.ucsc.edu/ENCODE/terms.html
 Non-scoop window
 “Ft. Lauderdale agreement”

Awareness of Embargo Dates
 Track description pages, Table Browser interface
 Download pages
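Checking whether a data set is still inside its non-scoop window is simple date arithmetic. The sketch below is illustrative only: the slides do not state the window length, so it is a parameter here, and the value of 9 months used in the example is an assumption that should be checked against the terms page. The function names and the release date are hypothetical; the embargo dates shown on the track description and download pages remain the authoritative source.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    If the target month is shorter, the day is clamped to its last day
    (for example, 31 January plus one month gives 28 or 29 February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def is_embargoed(release: date, today: date, window_months: int) -> bool:
    """True while `today` is before the end of the non-scoop window."""
    return today < add_months(release, window_months)


# Hypothetical release date and an assumed 9-month window.
release_date = date(2010, 6, 15)
print(add_months(release_date, 9))
print(is_embargoed(release_date, date(2011, 1, 1), 9))
```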

ChIP-seq Data for TFBS
 Yale TFBS
 Sample display near TP53 in “dense” visibility mode
 ChIP-seq graphic adapted from: wikipedia.org/wiki/ChIP-on-chip
Figure labels: TP53; stronger signals; cell types + antibodies

Description Page, Upper
 See description page for more display options
 Choose tracks and view styles
Figure callouts: display mode; peak configure; download

Description Page, Lower
 Display conventions explained
 Methods and references

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

Downloads and Release Log
 Release log for a handy list of available data
 Download is offered; FTP recommended
Figure callouts: Release log; Human; Mouse

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

New Features
 Mouse data
 Proteomics data
 Publications
 Questions? UCSC mailing list, or ENCODE at NHGRI
encode-announce mailing list:
https://lists.soe.ucsc.edu/mailman/listinfo/encode-announce
UCSC Genome Browser discussion list:
http://genome.ucsc.edu/contacts.html

modENCODE: modencode.org
 A separate modENCODE project: www.genome.gov/26524507
 C. elegans and D. melanogaster
 modENCODE DCC: www.modencode.org
Figure labels: Science 24 December 2010: Vol. 330; new; February 2011 issue

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

Summary
 Encyclopedia of DNA Elements
 Data Coordination Center at UCSC Genome Browser

ENCODE DCC at UCSC
ENCODE at UCSC: http://encodeproject.org
 Introduction
 ENCODE Data Types
 Find and Use ENCODE Data
 ENCODE Downloads
 Additional ENCODE Topics
 Summary
 Exercises

Hands-on session for ENCODE at UCSC
 Exercises on the handouts
 We will walk through them together
 2 styles: questions only, and step-by-step
 When we are finished the formal exercises, we can
help you to investigate issues that you want to
understand for your research

Notice:
 The materials and slides offered are for non-commercial use only.
Reproduction, distribution and/or use for commercial purposes is
strictly prohibited.
 Copyright 2010, OpenHelix, LLC
 http://www.openhelix.com/ENCODE

Editor's Notes

Welcome to an OpenHelix tutorial.
To maintain the color schemes/cues and the animations, if you import these slides into other slide sets please click the checkbox in the PowerPoint Insert window that maintains slide format. Otherwise important information may be lost.
Welcome to the tutorial on the ENCODE data available through the UCSC Genome Browser. The UCSC Genome Browser presents the reference genomic sequences for many species including human, and provides related data to facilitate interpretation of the genomic sequences. Researchers use this browser to find genes and gene predictions, SNPs and variations, cross-species comparative data, and much more. In this tutorial we will focus on the data from the ENCODE, or ENCyclopedia Of DNA Elements, project. ENCODE is an international research consortium that aims to catalog and describe all functional elements in the human genome. This tutorial was created by Doctors Mary Mangan and Warren Lathe of OpenHelix, with guidance from the ENCODE Data Coordination Center at the University of California Santa Cruz, and is freely available to the public because it is sponsored by the UCSC Genome Bioinformatics Group.
The main features of the UCSC Genome Browser are explored in additional introductory and advanced tutorials. We encourage you to familiarize yourself with that material before proceeding to the ENCODE materials. This tutorial assumes that the viewer has sufficient background information to move forward with these additional data types and features of the ENCODE project. The agenda for this tutorial is shown here. We will begin with an introduction to the ENCODE project. Then, to explore the cutting-edge information being produced by the ENCODE project, we’ll walk through the various types of ENCODE data. Once you understand what the ENCODE project is, and what data is available from it, we will show you how to access and use the ENCODE data. This will include an understanding of the strategies and displays. We’ll explore how to access the downloadable files. Additional topics related to the ENCODE project will be introduced. Finally, we’ll summarize this tutorial. At the end you’ll have the opportunity to watch a screen-capture demonstration of an exercise using the UCSC Genome Browser with ENCODE data. Now let’s proceed with the introduction to the ENCODE DCC at UCSC.
Following the major sequencing of the human genome, a new project was launched to take a closer look at the elements that comprise the genome sequence features. This international research consortium, sponsored by the National Human Genome Research Institute (or NHGRI), is called ENCODE: ENCyclopedia Of DNA Elements. UCSC has been designated as the Data Coordination Center (DCC)—a repository for submission and retrieval of ENCODE Consortium data. Next we’ll briefly describe some background and framework for the ENCODE project.
The completion of the human genome project made it apparent that merely having the nucleotide sequence of the genome would not provide complete knowledge of genome organization and function. The ENCODE project was devised to catalog all of the functional elements in the human genome, and to examine them rigorously to gain a better understanding of their roles in the cell. To accomplish this, data has been generated using cutting-edge technologies. This data is at the forefront of our genomic knowledge, moving beyond the mere transcription of known genes into more in-depth questions on chromatin remodeling, transcription factor binding, transcription of non-coding RNA genes, and more. The project began with a first phase referred to as the “pilot” phase, and information about it can be obtained at the URL shown. In the pilot phase, selected regions of the genome were chosen for detailed examination. Approximately 30 megabases, representing 1% of the genome, were defined as the target. Half of the regions included well-studied gene-rich or known feature-rich areas with substantial similarity in other species. The other half were randomly selected, so their features would be less well known or understood.
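The pilot-phase figures can be cross-checked with a little arithmetic. This is only a worked example using the two approximate numbers quoted in these notes (30 megabases and 1%); the variable names are illustrative.

```python
# Figures quoted for the pilot phase (both approximate).
PILOT_TARGET_MB = 30          # megabases examined in the pilot phase
PILOT_PERCENT_OF_GENOME = 1   # share of the genome those regions represent

# If 30 Mb is 1% of the genome, the whole genome is about 100 times larger.
implied_genome_mb = PILOT_TARGET_MB * 100 // PILOT_PERCENT_OF_GENOME

# Going genome-wide (100%) multiplies the target by this factor.
scale_up_factor = 100 // PILOT_PERCENT_OF_GENOME

print(f"Implied genome size: about {implied_genome_mb} Mb")
print(f"Scale-up from pilot to genome-wide: about {scale_up_factor}x")
```

The implied size of about 3,000 megabases (3 gigabases) is consistent with the commonly cited size of the human genome.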
The ENCODE pilot phase was completed in 2007. The results were published as a whole-project overview “marker” paper in Nature, and as individual papers by research teams, each describing its specific data focus, in a special issue of Genome Research. Scores of important insights were generated and are highlighted in the Nature paper. These include the following. There is abundant transcription beyond the known protein-coding genes: both intragenic and intergenic transcription, including both non-coding RNA and transcribed pseudogenes. This had been observed before, but the ENCODE pilot phase demonstrated it conclusively. At the same time, known protein-coding genes revealed unexpected complexity: distal untranslated region (UTR) exons as many as 200 kilobases away, overlapping or interleaved loci, and antisense transcription. Together these findings challenge the conventional definition of a "gene". Patterns of histone modification and DNase sensitivity reveal "domains" of packed or accessible chromatin, and these accessibility patterns correlate well with rates of transcription, DNA replication, and binding of regulatory protein factors to the DNA. This underscores the regulatory importance of epigenetic factors. These items, and many more that can be explored in detail in the individual papers by the research teams, resulted in changes to our conceptual framework for understanding the organization and functional aspects of the genome.
A portal for the ENCODE project was created on the UCSC website, which can be accessed by clicking the ENCODE link from the left navigation bar on the UCSC Genome Browser homepage. On the ENCODE Data Coordination Center page there is a link for Pilot Project on the left. This data will remain available at the UCSC site. But the ENCODE project has now grown and moved beyond the pilot phase. The success of the pilot project enabled the continuation of this path towards understanding the whole genome—leading to a second phase of the ENCODE project: the genome-wide ENCODE Production phase or ENCODE Scale-up phase. The goal is now to examine 100% of the genome. The focus in this tutorial will be the UCSC Genome Browser and DCC, now with the genome-wide ENCODE data. In addition to the UCSC DCC, there are other places to locate ENCODE data as well. For example, certain strategies employ microarray studies, and the GEO or Gene Expression Omnibus at NCBI, as well as the ArrayExpress repository, will store that data. Sequences from high-throughput sequencing assays are stored at the NCBI or EBI short read archives. Other data types may be found in other appropriate repositories. However, for the remainder of this tutorial we will focus on exploring the ENCODE Production Phase data in the UCSC Genome Browser.
The current ENCODE Production Phase project builds upon the knowledge of the prior phase. The production phase portal at UCSC provides access and details on this specific focus. But the data coming in from the new phase of the ENCODE project are fully incorporated into the regular browser interface that you are used to. On the portal page there are some helpful links to items of relevance around the project. One section highlights key differences from the earlier work, and we’ll touch on those features next.
The genome-wide production phase of the ENCODE project is proceeding now. A key feature of the production phase is that several cell types have been selected to form the main data collection efforts. All project teams will use these same cells in their work for consistency. The cell types are organized into tiers: 1, 2, and 3 to prioritize the experimental investigations. This will enable better coordination of the studies and interpretation of the results. It might also be something for researchers to consider for their own experiments, as there will be a great deal of supporting data for these cell types that might be informative. The cell types are publicly available from a variety of providers. More can be learned about them from the ENCODE project site or the UCSC portal. Several areas of focus have been funded. These include chromatin organization and DNase hypersensitive sites. Detailed studies of gene structures, and the transcriptome as a whole, are underway. Exploration of regulatory elements, including transcription factor binding sites, is ongoing. These topics will help us to understand much more about genomic mechanisms. In October 2009, a number of new ENCODE groups were funded as part of the NIH American Recovery and Reinvestment Act grants. These newer groups will expand the scope of ENCODE research to the mouse genome, and will provide new assays (such as proteo-genomics and epitope-tagging protein binding) to supplement results produced during the first two years of the project. We’ll provide you with the information on how to identify and use ENCODE project data that comes from additional methods as we proceed.
ENCODE project researchers are submitting data to the Data Coordination Center at the UCSC Genome Browser now. Among the first data to be released in this new phase has been transcription factor binding site research from the consortium members shown in this news announcement, and more data is regularly being submitted. The UCSC DCC team ensures that there is sufficient information about the project for users to understand the features. This information is called the metadata, and it includes the lab and institution, the grant information, annotation type, antibodies used, etc. The data is quality checked to ensure it meets basic criteria. However, it is important to note that this may be pre-publication data and should be considered as such. After quality checking, the data is released to the public for general use, subject to data policies, which will be covered in an upcoming section of this tutorial. The first production phase data was mapped to the March 2006 assembly. Going forward this data will be remapped, or coordinate converted, to all subsequent assemblies. It is important to be aware of the assembly to which the data refers when examining data coordinates in publications. Older assemblies are always available at UCSC, either from the menu selections or in the UCSC archives.
[end of Introduction] That completes our introduction to the ENCODE project. [beginning of Data Types] In this section we’ll discuss the types of ENCODE data that can be found at UCSC.
As we described in the last section, various types of data are now being submitted to the UCSC ENCODE DCC from numerous different providers. Here we will provide an overview of these data types as they are organized in the UCSC Genome Browser “track groups” area, and mention some key features. These are the data types present at the time we created this tutorial—more ENCODE data is flowing in regularly, and you may find more as you examine the site. ENCODE data is identified in the tracks area by the ENCODE NHGRI helix icon, a few examples of which are indicated on the slide. Most participating research groups have provided several tracks for any data set, and generally only selected data from each research group is displayed by default. Track details can be accessed by clicking any of the hyperlinked track names on the web page. This will open a window providing details of all the available tracks and data features available to view. This report will also provide the full details of any data type, including a description and associated references. ENCODE data can be found in the Mapping and Sequencing Tracks group, Genes and Gene Predictions group, Expression, Regulation, and Variation at this time. Next we will briefly discuss each of these data types.
Found in the Mapping and Sequencing Tracks Group, Mapability is a way to assess the uniqueness of the genomic sequence in a region. With many short read sequencing tools in use, assembling the data into longer and complete sequences relies on the confidence that the region is unique. Therefore, several groups are calculating mapability with various technologies, and over various window lengths, of the sequence. For example, for a 30 base pair region of the genome, researchers determine how unique that fragment is. They score that, and the score can be used to generate a histogram that indicates the uniqueness. This is a sample of one of these tracks on a large region of chromosome 21. If we examine the region around the centromere, we can see that a portion of this region appears not to be unique. Further downstream the uniqueness of the region is much more apparent. Essentially we see that the higher the peak, the greater the likelihood that the sequence is unique. A zoomed-in view of several mapability tracks is provided for more detail. Again, one can check the track details page for a deeper understanding of the graphical displays and the underlying data characteristics. As with other UCSC tracks, clicking on specific track items will open a new web page that provides additional details about that specific item. Also in this group, the BU ORChID track displays predicted hydroxyl radical cleavage intensity on naked DNA for each nucleotide, which represents a structural profile of the DNA in the genome. [Mapability region shown: chr21:9,488,411-14,616,795 Duke 20bp track, full. BU ORChID chr21:13,260,011-13,261,418 full]
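To make the idea of a uniqueness score concrete, here is a minimal sketch in Python. It is a toy, not the method used by any of the ENCODE mapability data providers: it assumes exact matching on a single strand, and it assumes the score for a window is simply one divided by the number of times that window occurs. The function name and the tiny example sequence are hypothetical.

```python
from collections import Counter
from typing import List


def uniqueness_scores(sequence: str, window: int) -> List[float]:
    """Toy per-position uniqueness: 1 / (number of times the window occurs)."""
    sequence = sequence.upper()
    # Every window-length fragment, indexed by its start position.
    kmers = [sequence[i:i + window] for i in range(len(sequence) - window + 1)]
    counts = Counter(kmers)
    # A fragment seen once scores 1.0; a fragment seen twice scores 0.5.
    return [1.0 / counts[kmer] for kmer in kmers]


# Hypothetical 13-base "genome" in which ACGT occurs twice.
toy_genome = "ACGTACGTTTGCA"
scores = uniqueness_scores(toy_genome, window=4)
```

Plotted as a histogram, the positions with a score of 1.0 would be the tall peaks (unique sequence), and the two positions where ACGT starts would be the lower ones. Real tracks work over whole genomes, and the details page for each track describes the actual scoring that provider used.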
In the Genes and Gene Predictions Track group is a Gencode track. The GENCODE project is generating a set of validated reference genes for the ENCODE project. The goal is to identify and map protein coding genes in the genome. This annotated set of genes carries information on the likely status of the genes, whether they are known genes, novel genes, putative genes, pseudogenes, or likely artifacts. There are various levels of confidence included—validated, manually annotated, and automated annotations. Gencode genes that have various characterizations can be examined using the filter options on the Gencode details page. The Gencode key provides information about the color codes used for the displays, which allows users to visualize the Gencode genes and quickly understand their status. Clicking on any Gencode gene provides additional details about it. More details about this project are provided in the track details page mentioned earlier, or at the Gencode site at the URL shown. [region: chr21:13,211,401-14,313,093 Gencode track in Pack]
A number of interesting data tracks are available in the Expression group. A variety of techniques are being used to locate and characterize the messenger RNA molecules in the ENCODE cell types. Some projects are looking at the subcellular localization of mRNAs. The diagram on the left illustrates some of the regions where one might expect to find mRNA molecules in various stages of their lifespan. Researchers can take the cells and fractionate them, and then determine which pieces of the mRNAs are found. For example, the nuclear fraction may have different characteristics than mRNAs out in the cytoplasm. One can choose to display the different cell types and the different fractions on the track details pages. Shown here is a simple example of a region of the TPTE gene, and the composition of the mRNAs observed. The pattern is different in the cytosol and in the nucleus. [Image from: http://en.wikipedia.org/wiki/MRNA; gene sample is TPTE gene region on chromosome chr21:9,911,130-10,020,902]
In addition to the subcellular localization, various groups are looking to identify new mRNAs or confirm the presence of known mRNAs. Using different technologies and strategies—some with extra effort to obtain longer mRNAs which are technically challenging—data providers will offer an extensive look at mature mRNAs and their corresponding exons. As always, one can examine the track details to learn more about the technology and the data involved.
A major focus of the ENCODE project is to investigate regulatory components and features of the genome. A wide range of tools and methods are being used to investigate this. Some projects have the goal of looking at structural aspects of the genome organization. For example, the position and modification of histones can provide clues about which regions of the genome are accessible or inaccessible. DNA fragments can be chemically bound to histones and then isolated and the corresponding DNA can be sequenced to determine the location. Leads on promoters, enhancers, silencers, and various other types of functional elements may be gleaned from this type of data. Commonly the data will show strong signal regions in summary form, and the more dispersed pattern from the signal data. Again, the data description page provides additional details on the methods and displays. [sample region: chr21:9,939,013-9,949,270] [Image credit http://www.genome.gov/Images/press_photos/highres/20150-300.jpg]
The binding of proteins to possible regulatory elements is also a key focus of the ENCODE work. A number of projects are studying the binding of proteins to either genomic DNA, or to mRNA, to learn more about various aspects of regulation. We will provide an example of that type of data in an upcoming tutorial section, when we describe the Yale TFBS project. Several additional strategies to examine transcription factor binding sites (or TFBS) are underway. Other work is looking at other features of possible promoter regions—such as bi-directional promoters between two genes. Negative regulatory elements, another interesting and important aspect of regulation, are also being pursued. RNA binding proteins may provide further clues to regulation of gene expression. [Sample region: chr21:29,281,340-29,292,754] [Image credit: http://en.wikipedia.org/wiki/File:TATA-binding_protein.png]
Understanding variation is also crucial to complete interpretations of genomics data. Copy number variations, or CNVs, had been elusive before we had a reference genome sequence and the appropriate technologies for their detection. But researchers have recently become more aware of these structural variations of the genome that can impact the functions of cells. Some CNVs appear to be present in humans without a detectable impact on health status. Others may be important in disease. In the framework of the ENCODE project, it is important to know whether the cell lines under examination have extra or missing copies of genomic segments. Copy number variations may include amplification or duplication of a segment, or deletion of a piece of the genomic region with respect to the reference genome. These types of variation are present in this data set. An alteration may affect one or both members of a chromosome pair, and therefore the deletions may be scored as heterozygous or homozygous deletions. Such data can be investigated using the Common Cell CNV track data. The example shows a region on the X chromosome, with observable patterns indicating the presence and absence of segments on one or both chromosomes, and amplifications in some cases. Normal segments are also indicated. If a gene of interest is in a region with variation, this could be crucial information about these cells. With the strategies we have described in this tutorial for locating and using the ENCODE data types, researchers will have the skills to access and visualize any new data types that come along. The ENCODE icon and details pages can be examined for further guidance. [sample image: chrX:45,908,749-80,283,771, front menu on pack, both details page views set to pack]
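The categories just described (amplification, heterozygous deletion, homozygous deletion, and normal) can be sketched as a simple rule on the copy number of a segment. This is an illustration only: it assumes a segment normally present in two copies, it assumes an integer copy number is already known, and the function name is hypothetical. It is not how the Common Cell CNV track itself was produced.

```python
def classify_copy_number(copies: int, expected: int = 2) -> str:
    """Label a segment by comparing its copy number with the expected count."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies == 0:
        # Missing from both members of the chromosome pair.
        return "homozygous deletion"
    if copies < expected:
        # Missing from one member of the pair.
        return "heterozygous deletion"
    if copies == expected:
        return "normal"
    return "amplification"


# Labels for copy numbers 0 through 3 under the two-copy assumption.
labels = [classify_copy_number(n) for n in range(4)]
```

For a chromosome that is not present in two copies in a given cell line, the `expected` value would need to change, which is one reason to read the track details page before interpreting the display.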
The ENCODE project is generating tremendous volumes of new data. Faced with the problem of how to display such a large amount of data in ways to facilitate analysis, UCSC is developing new visualization methods that cluster and overlay the data, and display the resulting tracks on a single screen. These are called super-tracks. Shown here is an example of one super-track. The ENCODE Integrated Regulation super-track is a collection of regulatory tracks containing state-of-the-art information about the mechanisms that turn genes on and off at the transcription level. Individual tracks within the set show enrichment of histone modifications suggestive of enhancer and promoter activity, DNase clusters indicating open chromatin, regions of transcription factor binding, and transcription levels. When viewed in combination, the complementary nature of the data within these tracks has the potential to greatly facilitate our understanding of regulatory DNA. The data comprising these tracks were generated from hundreds of experiments on multiple cell lines, and each of the cell lines in a track is associated with a particular color. You can learn more about the data and the features of the track by accessing the Description pages from the ENCODE Regulation link in the Regulation track section. From that page you’ll have access to the sub-tracks that comprise the set. You can alter the visualization here, or you can go to the individual track pages themselves for further options and keys to the display features. Other super-tracks may have different data, but the details and features will also be accessible from their description and details pages. All will provide new and effective means to visualize the data from the ENCODE project and other projects.
[end of Data Types] That completes our discussion of ENCODE project data types. [beginning of Find and Use] Now that we have provided a full background on what the ENCODE project is, and what data types are available from it, we will next examine how to find and utilize this data.
When ENCODE production phase project data becomes publicly available in the Genome Browser, it will be available in the standard track menu areas. The ENCODE icon identifies those data tracks that are ENCODE-specific on the main Genome Browser Viewer interface. For this example we will focus on the Yale TFBS (Transcription Factor Binding Site) data that we mentioned in a previous section. Of course, there are other ENCODE data that could be examined in this track group and other track groups as well. The data will also appear in the Table Browser, and if it is subject to the data use restriction policy that will be indicated. We will explore the data use policies later in this tutorial. The ENCODE radio button on the Table Browser restricts the search to the pilot target regions mentioned in the introduction. The button does not need to be clicked to examine the broader production phase ENCODE data that is the focus of this tutorial. As mentioned previously, to learn more about any of the data types that are available one can simply click the hyperlinked short name of the track. For this example, the Yale TFBS in the Regulation track group has been selected and a new page of details has opened. The upper portion of the details page provides many configuration options, settings, data set choices, and possibly filters for the data. The lower portion of the page offers keys to understanding the visual display and information about the techniques and strategies involved in generating the data. If publications relevant to the data are available, they will be listed here as well. Contact information for the data providers is also present. The data can be queried to learn additional details about genes and regions of interest, which may lead to new insights. It is hoped that a broad range of researchers take advantage of this opportunity. Before exploring specific ENCODE data, it is important to understand ENCODE data usage policies.
We will discuss this on the next slide.
All of the data that is generated by the ENCODE project will become publicly available. Submitted data is made available in the browser after it is quality control checked and tested. This means that the researchers who submit the data may not yet have published their analysis of the data. To balance the needs of the public for rapid access to this data with the rights of the providers to publish their findings, there is a policy around the use of this data. Providers will make the data available as quickly as they can, and provide documentation about the data. Researchers can access, query, download, and utilize the data in their own work. However, users of the data are requested to not publish findings about this data as a whole before the providers have had 9 months to do so. This 9-month moratorium on publications is essentially a “non-scoop” agreement, and is based on principles established during the human genome sequencing project that had similar needs—rapid access with some protection against scooping. These principles are sometimes referred to as the “Fort Lauderdale agreement” and more details about that are available from NHGRI. However, there are some cases where researchers may publish results from the data. For example, a researcher is allowed to publish results on a single gene. Researchers interested in publishing results based on ENCODE data are encouraged to contact the ENCODE project consortium and work with them on projects focused on this data. For full details on how to use the data please see the Data Policy terms page for specific instructions.
Although researchers should be aware of the data embargo system, they should not be deterred from using the data in their work. The goal of the ENCODE project is that many publishable discoveries and advances will be made from the data! It is important to be aware of the embargo dates as they are presented on the data one is using. The dates are available from multiple places at the UCSC site. On the data description pages it may look like this sample where the “Restricted Until” column indicates the embargo date. On the Table Browser embargo dates are indicated when a table is chosen, which is illustrated here. The download pages also provide the status of the data. Shown here is a sample with a date, as well as one set that is not restricted. So however one interacts with the data: in the main graphical browser, in the Table Browser, or by downloading the data, the dates will be available.
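As a rough illustration of the 9-month moratorium, the sketch below estimates when a restriction would lapse. It assumes the 9 months are counted in calendar months from a release date, which is our assumption for the example; the function names and dates are hypothetical. The “Restricted Until” date shown at the UCSC site is the one to rely on.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    # Clamp to the last day of the target month (e.g. 31 May + 9 months).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def is_restricted(released: date, today: date, embargo_months: int = 9) -> bool:
    """True while `today` falls before the estimated end of the embargo."""
    return today < add_months(released, embargo_months)


# Hypothetical release date, for illustration only.
restricted_until = add_months(date(2010, 8, 15), 9)
```

With the hypothetical release date above, the estimate comes out as 15 May 2011, so a check made the day before would still report the data as restricted.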
As an example of a specific ENCODE data set, we will now examine the Yale TFBS set of data. The data will be described briefly, and then we will look at the available display features. The principles of access, usage, and display that we cover will be true for all of the new ENCODE data. The new techniques and data types being submitted may require different graphical displays or controls than other data. And each project will have different aspects. We will use this first data set as an example. This data is ChIP-seq data for identifying Transcription Factor Binding Sites. We are focusing on an area around the start of the TP53 gene for this sample. If the Yale TFBS data in the “Regulation” tracks is displayed in “dense” mode, it will appear as shown on this slide. Here the underlying premise for identifying transcription factor binding sites with the ChIP-seq methods is shown graphically. Cells for the ENCODE project are grown. Cross-linking agents bind transcription factors to genomic DNA. The DNA is fragmented. The proteins of interest (or POI) bound with DNA fragments can be captured by different antibodies; for example, an antibody for the c-FOS factor or one for the Pol2 protein. Then immunoprecipitation pulls out the complexes. The DNA sequences can then be isolated and high-throughput sequencing technologies can be used to identify the specific sequence. The sequences can then be mapped to the genomic reference sequence and indicated on the browser. Shown is a segment near the beginning of the TP53 gene. It looks as if there are positive signals from this technique, near the beginning of this gene, in certain cell types, and with certain antibodies to that transcription factor. That suggests that the transcription factor may bind in some conditions to this region. To understand more about the display and ways to view and understand the data, one can refer to the description page.
The hyperlink above the menu for Yale TFBS links to that page, and will be examined next.
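The last steps of the ChIP-seq premise, mapping sequenced fragments to the reference and looking for places where they pile up, can be sketched in a few lines. This is a toy under stated assumptions: reads are already mapped, coordinates are zero-based and half-open, and a fixed depth threshold stands in for the statistical peak calling that real pipelines use. The function names and read positions are hypothetical, and this is not the Yale group's method.

```python
from typing import List, Tuple


def coverage(reads: List[Tuple[int, int]], region_length: int) -> List[int]:
    """Count how many reads cover each base; reads are half-open (start, end)."""
    depth = [0] * region_length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, region_length)):
            depth[pos] += 1
    return depth


def enriched_intervals(depth: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Return half-open intervals where depth is at least `threshold`."""
    intervals = []
    run_start = None
    for pos, value in enumerate(depth):
        if value >= threshold and run_start is None:
            run_start = pos  # a run of enriched bases begins
        elif value < threshold and run_start is not None:
            intervals.append((run_start, pos))  # the run has ended
            run_start = None
    if run_start is not None:
        intervals.append((run_start, len(depth)))
    return intervals


# Hypothetical mapped reads within a 20-base region.
reads = [(2, 8), (4, 10), (5, 11), (15, 20)]
depth = coverage(reads, region_length=20)
peaks = enriched_intervals(depth, threshold=3)
```

Here the per-base depth plays the role of the signal display, and the intervals above the threshold play the role of the peaks, which matches the peaks and signal views offered on the track configuration page.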
On the right is shown an overview of the whole description and configuration page for the Yale TFBS data from the menu hyperlink. There are several basic features of these pages. Here we’ll focus on the top and magnify portions for clarity. At the uppermost part of the page are display mode menu choices similar to those available on the genome viewer. One can set the display view for individual data types, and also the maximum and minimum values that are of interest for this data set. These settings can be considered sophisticated filters for customizing displays to one’s personal research needs. There is also the option to choose which specific cell types and antibody data sets to examine. This configuration section of the page offers extensive choices for which data to view. Next on the page one can choose to examine different aspects of the data by selecting or deselecting the display, such as peaks, signal, or even raw data for the cell line and antibodies of interest. For more details about any of the choices one can click the ellipsis to access the metadata associated with this particular item. The complete data can also be downloaded from here. Also, the data policy restriction date is indicated.
Further down on the page there is more crucial “meta” data and information about how and why this work was done, and experimental method details important for understanding the data acquisition and display. Publications that might offer guidance on the methods are also listed, when such papers are available. Any publications on the studies by the data providers will be listed here as well.
[end of Find and Use] That completes the section on finding and using ENCODE data with the UCSC Genome Browser. [beginning of Downloads] In this section we will explore where to access the data to download in bulk.
For many people, access to the data from the graphical browser and the Table Browser will serve their needs. However, the data is also available in bulk for use in other ways. All of the data can be accessed and downloaded from the DCC site. We will examine that briefly next. The Release Log provides a list of what is available and what is new. The Downloads link provides access to the complete data sets that are available. Keep in mind that these are subject to the same data use policies that we described in the browser discussion, and that will be indicated with those files as well. For bulk downloading, it is recommended that researchers access the FTP site that is indicated. The site will also offer plain-text files with metadata, and a file with checksums, to supplement the data. Other data that is not suited for browser display can also be found here. Researchers have early access to the ENCODE data in a variety of ways: browser viewing, table queries, and downloads.
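After a bulk download, the checksum file can be used to confirm that files arrived intact. The sketch below assumes the checksums are MD5 digests listed one per line as a hex digest followed by a file name, in the style of the `md5sum` tool, and that file names contain no spaces. The source does not state the algorithm or layout, so treat both as assumptions and check the actual file first. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text: str) -> Dict[str, str]:
    """Parse assumed 'md5sum'-style lines: '<hex digest>  <file name>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # md5sum may prefix the name with '*' for binary mode.
            expected[parts[-1].lstrip("*")] = parts[0].lower()
    return expected


def verify(directory: Path, checksum_text: str) -> Dict[str, bool]:
    """Map each listed file name to True if it exists and its digest matches."""
    results = {}
    for name, digest in parse_checksum_lines(checksum_text).items():
        path = directory / name
        results[name] = path.is_file() and md5_of_file(path) == digest
    return results
```

A file reported as False is either missing or differs from what was published, and downloading it again is the usual next step.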
[end of Downloads] That completes our look at downloads. [beginning of Additional ENCODE Topics] In this section we’ll discuss some other aspects of ENCODE.
As the ENCODE project proceeds, new aspects and features will arise. The homepage News section will announce updates, as will the release logs. Publications will become available. The ENCODE-announce mailing list is also a good way to stay informed and be notified when new data is released. One new feature to come will be ENCODE strategies aimed at the mouse genome. Proteomics data may be generated for the cell lines involved. New ENCODE data will follow the same principles and formats covered in this tutorial so that researchers familiar with one type of data will readily be able to utilize all new data as well. And there will be publications on the ENCODE project as a whole, and from individual data providers, over time. For any questions about using the data, it is suggested that researchers use the UCSC mailing list. There is also a searchable archive associated with the mailing list, and this is an excellent first source of immediate answers without having to send an email. Researchers can also contact the UCSC ENCODE team directly. For larger project-level issues please contact the NHGRI. [proteomics image: http://en.wikipedia.org/wiki/File:Protein_pattern_analyzer.jpg]
A separate branch of the ENCODE project is the modENCODE project. Inquiries like those for the human genome are being made for worm and fly genomes as well. There is a separate Data Coordination Center for that data, and access to that part of the project is available from the URL shown. The modENCODE team coordinates with the human ENCODE team. But the data is handled separately. Researchers can also learn more about the modENCODE project from this feature in Nature, as well as other publications from the modENCODE groups.
[end of Additional ENCODE Topics] That completes the Additional Topics section. [beginning of Summary] In this section we’ll summarize this tutorial.
The ENCODE project is delivering a tremendous amount of information on functional elements of the genome, and will illuminate many new areas of research. The role of the UCSC Genome Browser team as the Data Coordination Center ensures that the submitted data will become quickly available to researchers worldwide. The data is visible in the graphical browser interface, in the Table Browser for custom queries, and prepared for downloading as well. We hope that a broad range of researchers will find many ways to utilize the ENCODE data to expand our knowledge of the genome, and how it functions.
[end of Summary] That completes our Summary. [beginning of Exercises] In this section we will present a screencast recording of an exercise done with the ENCODE data at the UCSC Genome Browser site to reinforce the concepts developed in this tutorial.
[Not read in recorded materials] Hands-on session. The exercises that match this presentation can be found on the “ENCODE Data Available at the UCSC Genome Browser” OpenHelix tutorial homepage, or in your folders if you are at a live OpenHelix training. You can choose to do some now, or wait until the end. The exercises are on the handouts, and in live OpenHelix training sessions we will walk through them together. They come in 2 styles: questions only, and step-by-step. You can choose to just read the question and try to find the answer, or use the checklist guide to hit each item in a step-by-step manner. When we have finished the formal exercises, we can help you to investigate issues that you want to understand for your research.
The materials and slides offered are for non-commercial use only. Reproduction, distribution and/or use for commercial purposes strictly prohibited. Copyright 2010, OpenHelix, LLC.
Thank you for using this OpenHelix tutorial.
