NCBI Boot Camp for Beginners Slides


The NCBI Boot Camp for Beginners was designed to offer an overview of the NCBI suite of resources. The first half of the presentation covered highlighted databases in four main categories: literature; sequences; genes and genomes; and expression and structure. The second half of the class used apolipoprotein E (APOE) as a query that was explored through many of the NCBI databases, from identifying the reference sequences to a structural analysis of the Cys130Arg variant.

Published in: Education

  1. 1. NCBI<br />Boot<br />Camp<br />
  2. 2. NCBI<br />“<br />”<br />...advances science and health by providing access to biomedical and genomic information<br />
  3. 3. NCBI<br />Sequences<br />Expression<br />Genome maps<br />Structures<br />Protein Domains<br />Homology (gene, protein, structure)<br />Pathways<br />Genetic Variation<br />
  4. 4. NCBI<br />tools<br />databases<br />
  5. 5. databases*<br />* a brief survey of selected dbs<br />
  6. 6. 1<br />literature<br />
  7. 7. PubMed<br />Bookshelf<br />OMIM<br />
  8. 8. PubMed<br />20,672,941<br />citations<br />2,157,529<br />PubMed Central<br />5,519<br />indexed journals<br />
  9. 9. Bookshelf<br />767<br />
  10. 10. Dr. McKusick<br />OMIM<br />
  11. 11. Lesch-Nyhan<br />If you query for Lesch-Nyhan, you<br />get a very long OMIM record <br />OMIM<br />
  12. 12. Clinical Features<br />Biochemical Features<br />Inheritance<br />Pathogenesis<br />Diagnosis<br />History<br />Description<br />Cloning<br />Gene Structure<br />Mapping<br />Molecular Genetics<br />Pathogenesis<br />Evolution<br />Animal Model<br />Allelic Variants<br />See Also<br />References<br />Contributors<br />Creation Date<br />Edit History<br />OMIM<br />Note: there are separate entries for Lesch-Nyhan syndrome and the protein that causes the defect<br />
  13. 13. OMIM<br />Every OMIM Record has an extensive list of internal and external links<br />
  14. 14. 2<br />sequences<br />
  15. 15. Nucleotide<br />GenBank<br />RefSeq<br />
  16. 16. DNA<br />RNA<br />Protein<br />EST: expressed sequence tag<br />SNP: single nucleotide polymorphism<br />WGS: whole genome shotgun<br />CDS: coding sequence<br />STS: sequence tagged site<br />
  17. 17. NCBI<br />SNP<br />Primary <br />Databases<br />GEO<br />GenBank<br />Protein<br />
  18. 18. GenBank Format<br />GenBank<br />
  19. 19. LOCUS<br />Locus name, size, type, division, modification date<br />Search tips: <br /> Locus names can change!<br /> Division names are historical, <br /> not taxonomical!<br />
  20. 20. DEFINITION<br />As the author sees fit…<br />Search tip: No Controlled Vocabulary in Definitions!<br />
  21. 21. ACCESSION/Version<br />Accession numbers do not change, even if information in the record is changed at the author's request.<br />Version and GI numbers change<br />
  22. 22. Keywords, Source, Organism<br />Organism: Tied into Taxonomy Browser<br />Search tip: Keywords are often blank<br />When performing a “keyword” style search, use [all], [word] or [title]<br />
  23. 23. Selected References<br />Newest First<br />Last “reference” covers submission information<br />
  24. 24. Features I<br />Source, gene, misc features<br />
  25. 25. Features II<br />CDS: links, translation<br />
  26. 26. Sequence<br />
  27. 27. GenBank Format<br />GenBank<br />(also for protein)<br />
  28. 28. 132,015,054<br />Sequences in GenBank 3/20/11<br />+<br />HARD WORK<br />-<br />redundancy<br />RefSeq<br />
  29. 29. RefSeqs<br />provides a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes<br />RefSeq<br />
  30. 30. bio mol<br />DNA<br />RNA<br />Protein<br />RefSeqs<br />provides a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes<br />RefSeq<br />
  31. 31. bio mol<br />DNA<br />RNA<br />Protein<br />RefSeqs<br />HELLO<br />my name is<br />provides a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes<br />XX_123456<br />RefSeq<br />
  32. 32. bio molecules<br />Genomic DNA<br />(NC)<br />Incomplete<br />(NG)<br />mRNA<br />(NM)<br />Model mRNA<br />(XM)<br />Curated Protein<br />(NP)<br />Model protein <br />(XP)<br />RefSeq<br />
  33. 33. NG_012250.1 NM_000690.2 AY621070.1 EU414258.1 EU414257.1 EU414256.1 EU414255.1 EU414254.1 EU414253.1 EU414252.1 EU414251.1 EU414250.1 EU414249.1 AF164120.1 EU373813.1 EU373812.1 EU373811.1 EU373810.1 EU373809.1 EU373808.1 EU373807.1 EU373806.1 EU373805.1 EU373804.1 AH002599.1 M20456.1 M20455.1 M20454.1 M20453.1 M20452.1 M20451.1 M20450.1 M20449.1 M20448.1 M20447.1 M20446.1 M20445.1 M20444.1 CR456991.1 AB385105.1 CU678321.1 CU678320.1 AF073514.1 AF073513.1 AF073512.1 AF073511.1 <br />NG_012250.1 <br />NM_000690.2 <br />RefSeq<br />
  34. 34. Note: the NP sequence would not normally be found using a nucleotide search – I have included it only to show the complete suite of RefSeq for ALDH2<br />NG_012250.1 <br />NM_000690.2 <br />NP_000681.2<br />RefSeq<br />
  35. 35. 3<br />genes/genome<br />
  36. 36. Genome<br />Gene<br />HomoloGene<br />
  37. 37. Genome<br />1090<br />eukaryota<br />1483<br />prokaryota<br />2507<br />viruses<br />
  38. 38. Note: genome records are either mitochondrial or chromosome<br />Note: no common names are listed as genome query results<br />
  39. 39. The genome record shows a variety of stats for different databases, as well as a map of the genome that is scrollable<br />
  40. 40. Searching in BioProject yields common names<br />
  41. 41. BioProject results contain background information<br />
  42. 42. Instead of searching Genome, you can also browse via the Genome Resource Guide<br />
  43. 43. Genome Resources<br />G<br />Genome BLAST<br />B<br />Map Viewer<br />M<br />Genome Project<br />(BioProject)<br />P<br />
  44. 44. G<br />Genes and Human Health<br />Epigenomics<br />The Genomic Sequence<br />Maps and Markers<br />Transcribed Sequences<br />Cytogenetics<br />Comparative Genomics<br />A standard record in Genome Resources contains many links out along with brief database summaries<br />
  45. 45. M<br />Map Viewer starts by letting you select a chromosome (or section of a circular genome)<br />
  46. 46. M<br />
  47. 47. To the left of each gene, there are a variety of links out. Note: these change based on the level of information known about a given gene.<br />M<br />HUGO Gene Nomenclature<br />Sequence Viewer<br />Protein<br />Download<br />Evidence Viewer<br />Molecular Model<br />STS, OMIM, CCDS, SNP<br />
  48. 48. Regulatory<br />Gene <br />Intron<br />Exon<br />Intron<br />
  49. 49.
  50. 50. Each gene record provides extensive details. We will go through an example Gene record in the following slides.<br />
  51. 51. Sequence Viewer and MapViewer<br />Genomic Info<br />
  52. 52. Bibliography<br />PubMed<br />NOTE: Gene Reference into Function is an excellent resource for literature related to function. These articles have been submitted for inclusion into GeneRIF and are not the product of an automated text search.<br />GeneRIF<br />
  53. 53. There’s<br />Even More!<br />Interactions<br />Gene Ontology<br />Genotypes<br />Homologues<br />Protein Information<br />Interactions will list all known interacting molecules, providing links to <br />
  54. 54. RefSeq<br />These reference sequences are stable and are independent of genome builds<br />
  55. 55. The NCBI Assembly<br />~100 individuals<br />The Celera assembly<br />~5 individuals<br />These reference sequences refer to specific builds<br />HuRef<br />Just Craig Venter<br />
  56. 56. LINKS<br />LINKS: internal, external and commercial<br />
  57. 57. HomoloGene<br />
  58. 58. Homologs<br />paralogs<br />orthologs<br />orthologs<br />frog α<br />chick α<br />mouse α<br />mouse β<br />chick β<br />frog β<br />α-chain gene<br />β-chain gene<br />GENE DUPLICATION<br />Early Gene of Interest<br />
  59. 59. P3H1<br />
  60. 60. Protein of Interest<br />(P3H1)<br />Cross-species identity is automatically calculated<br />Automatic sequence alignments are easily accessible<br />
  61. 61. Protein of Interest<br />(P3H1)<br />Note: UniGene may come up with different results, since it is based on EST clusters and not protein sequence<br />
  62. 62. 4<br />expression<br />& structure<br />
  63. 63. UniGene<br /> EST, GEO <br />Structures <br /> CDD, MMDB, PubChem… <br />
  64. 64. UniGene<br />…an organized view of the transcriptome<br />
  66. 66. GENE EXPRESSION<br />EST<br />This is a “virtual northern” where ESTs are counted to get a rough sense of overall expression levels<br />
  67. 67. GENE EXPRESSION<br />GEO<br />Note: the GEO results contain all arrays that assay for this gene; most of these results are for specific disease or altered states and do not necessarily reflect wild type, normal levels of expression <br />
  68. 68. Structures <br /> CDD, MMDB, PubChem… <br />
  69. 69.
  70. 70.
  71. 71. Cn3D colored by secondary structure<br />Note: Cn3D has aligned the individual chains for you<br />
  72. 72. Cn3D colored by chain (there are 7)<br />
  73. 73. “structure function”: the hemolysin protein bores a hole into red blood cells and sucks their insides out. The structure kind of looks like a hollow tack.<br />
  74. 74. Note: the structure listing shows each individual chain (along with 3D domains and superfamilies) AND the chemical that was found in the crystal structure (see arrow)<br />
  75. 75. Another example… this time a single chain with distinct domains<br />
  76. 76. Now we are coloring by domain. Also note the funky space-filling model. It makes proteins look fat.<br />
  77. 77. Note that Super Families are defined: clicking on them will take you to the conserved domain database<br />
  78. 78. The Conserved Domains Database provides alignments across species of conserved domains, along with a general description of the domain<br />
  79. 79. 3D domains are color coded. Note: 3D domains do not always correlate to Super Families! Clicking on the 3D domain will take you to related structures<br />
  80. 80. You can select structures and then view the 3D alignment in Cn3D<br />
  81. 81. Voilà! Structural alignment. Note: the sequences are aligned in the Sequence View box.<br />
  82. 82. PubChem has three primary areas:<br />BioAssay – registry of assays that can be searched by small molecule<br />Substance – a redundant registry of compounds<br />Compound – a non-redundant, curated chemical database<br />
  83. 83. You can search PubChem by chemical name, CAS number, or even by similar structures.<br />Records contain lots of additional information. Highlights: synonyms (which can be quite extensive in chemical nomenclature). Of particular note: if the compound shows up in Structure, you can link to a view in Cn3D that shows it complexed with protein/DNA/RNA! <br />
  84. 84.
  85. 85. BioSystems will display a short verbal description, a schematic of the system in question and a link to all of the genes, proteins and small molecules found in the system, along with links to related systems.<br />
  86. 86. NCBI<br />discovery<br />initiative<br />
  87. 87. NCBI<br />high quality DB<br />discovery tools<br />
  88. 88. high quality DB<br />discovery tools<br />RefSeq<br />GenBank<br />Database Ads<br />check out these resources!<br />Sensors<br />are you looking for…<br />Analysis tools<br />pre-computed & on the fly<br />
  89. 89. where<br />do I<br />start?<br />
  90. 90. anywhere*<br />*but gene acts as a good hub<br />
  91. 91. Apolipoprotein E<br />APOE<br />Cys130Arg<br />
  92. 92. We <br />Can <br />Do <br />It!<br />Gene and RefSeq<br />Genome Maps<br />Allelic Var/Disease<br />Expression<br />Homologous G/P<br />Structure<br />
  93. 93. Search for APOE in Entrez:<br />Note that there are many different records in several different databases that have hits for APOE. <br />Select PubMed.<br />
  94. 94. Select APOE in Homo sapiens<br />We have used PubMed for its gene sensor, which is fantastically useful. However, you can also search directly in the Gene database. <br />
  95. 95. LOTS of information in this report, including links IN the report, links to other NCBI databases and links to outside resources.<br />
  96. 96. Let’s check out the reference sequences….<br />
  97. 97. Note the genomic, mRNA and protein RefSeq that are independently maintained.<br />
  98. 98. Separate records for ref sequences associated with specific genomic builds…<br />
  99. 99. (many databases here)<br />Let’s check out the SNP:Variation Viewer<br />
  100. 100. Note, the Cys130Arg variant has been frequently observed and well documented<br />
  101. 101. Let’s observe the Sequence Viewer and MapViewer for this gene.<br />
  102. 102. Note, you can change which sequence you want to observe (Stable reference, reference, Celera and HuRef)<br />
  103. 103. The full view shows genes in the area, along with info on SNPs and other variation classifications.<br />
  104. 104. Full screen of MapViewer<br />
  105. 105. ab initio modeling<br />Ensembl<br />Genes<br />UniGene<br />RefSeq<br />These are default maps<br />
  106. 106. You can change the maps you wish to view, both in terms of how you are annotating the genome but also in which organisms.<br />
  107. 107. Here we are looking at Chimp, Mouse and Human Gene maps.<br />You can zoom out to get a larger picture of the area.<br />
  108. 108. APOE is part of the APO gene cluster. Note: lines between maps are mapped homologues. <br />
  109. 109. Each gene has a series of links following its name. We’ll jump to the APOE OMIM record.<br />
  110. 110. Another extensive record! Let’s jump to allelic variants.<br />
  111. 111. Let’s go to the SNP record<br />Cys130Arg is .0016<br />.0016<br />Extensive documentation…<br />
  112. 112. OMIM<br />Prot 3D<br />SeqView<br />GeneView<br />MapView<br />VarView<br />PubMed<br />
  113. 113. NOTE: the default setting doesn’t show much, because it doesn’t include clinically associated variants – click this box and refresh.<br />
  114. 114. This is the one we want! Let’s jump to this reference SNP<br />
  115. 115. Hummmm… the two reference assemblies have a wild type allele, whereas Celera and HuRef carry the mutant allele.<br />Let’s check out this area in HuRef using the sequence viewer – click on the chromosome position link.<br />
  116. 116. Clicking on sequence will bring up the sequence and the CDS. You will note that HuRef (which means Craig Venter) carries the mutant allele.<br />
  117. 117. (many databases here)<br />Let’s check out expression in UniGene<br />
  118. 118. Click on EST profile to go to the virtual northern.<br />
  119. 119. Disease<br />State<br />BODY SITES<br />Development<br />
  120. 120. Click on GEO Profiles to see actual gene expression array data.<br />
  121. 121. Note, there are thousands of hits, meaning many gene arrays have assayed for this gene. However, most of these are in reference to a disease or altered state. <br />
  122. 122. Use GDS596 – it is the results for “normal” gene expression.<br />Click on a chart to see detailed results.<br />
  123. 123. Highest expression in the liver, with lower levels throughout the brain.<br />liver<br />brain<br />
  124. 124. (many databases here)<br />Let’s check out homologs<br />
  125. 125. You can show a pairwise alignment using BLAST…<br />
  126. 126. E<br />value<br />Note the very low E value<br />1e-158<br />
  127. 127. The alignment shows that the Chimp genome carries an R at the allele in question!<br />
  128. 128. You can also check out homologs found in UniGene – a different way to search.<br />
  129. 129. Bunnies show up using the UniGene homolog search, but not the HomoloGene search.<br />
  130. 130. Let’s go check out the protein record…<br />
  131. 131. Click here to link to the RefSeq protein record.<br />
  132. 132. Let’s run a BLAST to see if we can identify the giant panda homolog.<br />
  133. 133. I’ve changed the search to focus on the RefSeq protein database and limit it to the giant panda.<br />
  134. 134. Note: BLAST automatically detects domains<br />The highest hit is a hypothetical protein. Let’s take a look at the alignment.<br />
  135. 135. Note, the panda has the mutant arginine...<br />… does this mean pandas and chimps both have early onset Alzheimer's disease? Nobody knows!<br />
  136. 136. Let’s check out some related structures.<br />
  137. 137. This is the default setting. Change to all similar MMDB.<br />
  138. 138. Click here to go to an alignment between your query and the structure’s sequence.<br />
  139. 139. Click here to view in Cn3D<br />Note: the structure sequence contains the mutant arginine<br />
  140. 140. Showing side chains, colored by hydrophobicity. <br />The arginine is shown in yellow.<br />Click here to go to the structure summary for 1B68<br />
  141. 141. Click here to find similar 3D domains<br />
  142. 142. Select another structure and then view 3D alignment.<br />
  143. 143. Overall alignment, showing side chains colored by hydrophobicity. <br />Note, the Cys vs. Arg doesn’t make a huge change structurally.<br />
  144. 144. science<br />can be<br />complex...<br />
  145. 145. …we can<br />help you<br />with that.<br />
  146. 146. thank <br />you<br />
  147. 147. Jackie Wirz, PhD<br />
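
Slide 32 pairs each RefSeq accession prefix with a molecule type, and slide 21 separates the stable accession from its changing version. As a rough illustration only, that scheme can be sketched as a lookup table. The function name `describe_accession` and the dictionary layout are hypothetical, and the labels are copied from the slide rather than from an official NCBI list, which contains more prefixes than these six.

```python
import re

# Prefix labels as they appear on slide 32 (not a complete list of RefSeq prefixes).
REFSEQ_PREFIXES = {
    "NC": "Genomic DNA",
    "NG": "Incomplete",
    "NM": "mRNA",
    "XM": "Model mRNA",
    "NP": "Curated Protein",
    "XP": "Model protein",
}

# RefSeq-style identifier: two letters, underscore, digits, dot, version number.
_REFSEQ = re.compile(r"^([A-Z]{2})_(\d+)\.(\d+)$")


def describe_accession(acc_version):
    """Split a RefSeq-style 'accession.version' string and label its prefix.

    Returns None for anything that does not look like a RefSeq accession,
    for example a plain GenBank accession such as AY621070.1.
    """
    match = _REFSEQ.match(acc_version.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    return {
        "accession": f"{prefix}_{number}",  # stable part (slide 21)
        "version": int(version),            # changes when the sequence changes
        "type": REFSEQ_PREFIXES.get(prefix, "unknown prefix"),
    }


if __name__ == "__main__":
    # Identifiers taken from the ALDH2 example on slides 33 and 34.
    for acc in ["NG_012250.1", "NM_000690.2", "NP_000681.2", "AY621070.1"]:
        print(acc, describe_accession(acc))
```

Filtering the long hit list on slide 33 with a function like this would leave only the RefSeq records, which is the point those slides make about reducing redundancy.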
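
The walkthrough names the variant as Cys130Arg, while the alignments on slides 127 and 135 show it as a single letter (an R, arginine). A minimal sketch of converting between the two notations follows. The helper names `parse_substitution` and `short_form` are hypothetical, and the pattern only covers simple three-letter substitutions of the form used in the deck, not the full range of variant nomenclature.

```python
import re

# Standard three-letter to one-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Reference residue, position, variant residue, e.g. Cys130Arg.
_SUBSTITUTION = re.compile(r"^([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_substitution(text):
    """Return (reference, position, variant) using one-letter residue codes."""
    match = _SUBSTITUTION.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple substitution: {text!r}")
    ref, pos, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        raise ValueError(f"unknown residue code in {text!r}")
    return THREE_TO_ONE[ref], int(pos), THREE_TO_ONE[alt]


def short_form(text):
    """Rewrite a three-letter substitution in one-letter form."""
    ref, pos, alt = parse_substitution(text)
    return f"{ref}{pos}{alt}"


if __name__ == "__main__":
    print(short_form("Cys130Arg"))
```

With the variant in one-letter form, the residue at position 130 can be compared directly against the letters shown in a BLAST or Cn3D alignment.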