Genomics lecture 3

Background to genomics - based on the C. elegans genome project.

Published in: Technology

  1. 1. C. elegans cosmid K06A5, 24323 bp.Flat sequence file –3955 bp shown.>CEK06A5acaagagagggcgcctcggccgtatgttgaatgggagatcgatggaaccgagacaacgagaaaaggaatagagacggagaaagagagagagagcgcgcgttgttggaaggatgaaaaagaaaaaagacatgagctgcttcacaagagcttggcgaaagcaaagggcaaagtgttgacagcttagtggtggtagttggatcttctctcctcgttctctgctcacaactcgtctatcactcatatcacatttatttcccaatatcattttaacaacatcttccgatgcatgttcgtcaatattgcgcaaccactttgcaatattgtcaaaacttttcgcatttgtgatatcgtaaaccagcataattcccattgctccgcggtaatatgatgttgtgattgtgtggaatcgttcttgtccagctgtgtcccagatttgtaatttaatcttttttccttttaattcgatagttttaattttgaagtcgattcctgaatgaaaaaagaaaattattttgaaatcactagattctgaataaaaactaaccaatagttgagatgaatgtggtgttaaaggcatcatccgaaaatctgtacagaatgcaagtttttccaactcctgagtcgcctattagcagcaatttgaagagcatgtcatacggtcggcgagccatttttcttctgaaatgagaaaaagttgagaactaaagttgcacaaaagtaagagaaaagcacttgagtcatggcaaatagaacgaacactttgagatttcgaagaagttatcaagagttgacaattggaagatatttggaagaactttctaatttttttctagttttccaaaattaggtttttgtcataaaatgttgtcaaagaaaaaacaggacaaaatagttaattgttgtttccattataacaaaaaaaaatttgaacggagctattaacgcgtgcatgcgcaaatcacatcgattagctgtttctgggaaattctcgggaaaaggtgaacagcagctgctggcttcctctgcgggtcacgaaaacacaaagagatcattataattgttatttggaaaggaagcgaatctaaaacgggtacaggtggacgtttattgatcgaaagtgctttttatttgaaattgaatggtgaactttgcaattttgtaatgcaaagtacgttatcagatggcatgagatgtgtgaagtgataaggaataaaatgtgaacgacatgttcaagaaactgtgatttttcaataatttgtgatgaaatattttaggaacagaaatgaacatattaattgatataaaaacaataggaacactaactcataattatgataggtgaatatcaaaatgtgctagattttttgaagttaaaaaatacatttctaatattttttcaaataataagtttcagctgaaatttcagggtgatttcagaaagctatgttttgataaattgttttgaaaattaaaagaagctacagcaaaaaaaaattaaagagaacatcgctccctcgtagtgtataatttttgattatcgaaaaaaatgagtcaatgatgaaaaggaagtcgcaatctcaaaacttcaaaaatcaaaagaagccgttgcctctgtcatcaaaaattcagaagacaaggttgttgacaagggtcaattctcagtggtggagggcattgggcgtggtgaaatttttgaaggctagtgtggttggacctctactagatagacaaaacccccgaaatagacgtttaatttgatgagatggtggagaaagaaaaggactcattctctagatgatagagagaccagagatacagacaagagagggcgcctcggccgtatgttgaatgggagatcgatggaaccgagacaacgagaaaaggaatagagacggagaaagagagagagagcgcgc
gttgttggaaggatgaaaaagaaaaaagacatgagctgcttcacaagagcttggcgaaagcaaagggcaaagtgttgacagcttagtggtggtagttggatcatgtgtttttatgtttccggtgggagaaggttcaacaaaaaatgaaaagaaaaagttcaagcggcatgaatcattctgagtttaaaacaaaattattgcgaaaattaatattaaaaccttttcacaaaacttcaagctaatctgttcatgaaaatttgaataatagttttttcccacctatttagaattaacttcatattaacgaaattaattaacgaatcgaaaattatgacttttcagaatcatctgaagttttttcacattccatgctgcatggaataatttgatcctggaatcgatatgtttttatggtatactttttaaccttcaatttagctggaaaagtatggaataaataattcccgaagctatgtacatatatgtagaattattgaatgattgtgagaacaacttgactttagcttgagtaggaatcggaatggctatcgaccgatcaacacttaggattgtaagaatggcagtaagaatatattgaagaaagaatgtttgttcataggaagagaaagagtattgcgaaatcatcatcgcccactttagaatggacgggcggtgagcggacatagagaattgtgaatgactaatgcttttgcagaatctagggcaaaatcgtaggaacaaacaattgtaatacggagaaaacaatcatatcgatcgatgatcatggagaaaaatgtgatttaagtgagtagacttggaaaaattaataaaagcatgaattgtcgatatttttcatttattttcattataaagctctttaaaaacaaattaaatattgagaatggcttcgaagaatattgtttcaaatatgttcaatggtgacaccttgcggataaaattaatgtaaaaatcatggaacacagattcactgatatctcattatctcaagcagtgtaattagagattttttggaacaattattttataaaactataaataaaccgtttatactactcaaagccaaatattcaagctattaccattttttttctaactaattcttgagcaattaaagtattccccagtttttattttgcaacgactccaggcaaacacgctccgttgcacttgccgccaaggcgttgcattcaaatcagagagacatctcattccgatttctgtttttcttccaataaacggtattttatgcctaatgggtgatacggaaattgttcctcttcgagtacaaaatgtacttgatagcgaaatcattcgtctcaacttgtggtccatgaaggtaactgtctagtttttttaagttttcatgatttcaatatttttacagtttaacgcgaccagtttcaaactcgaaggttttgtgagaaatgaagaaggcactatgatgcagaaagtttgttccgaatttatttgtgtaagtcgagaaacatattcgtcaacaattttcattaaatattcagagacgcttcacttctacgttgcttttcgatgtttccggacgtttcttcgacttggtcggacagattgatcgggaatatcaacaaaaaatgggaatgcctagtagaattattgatgaattttcaaatggaattcctgaaaattgggccgaccttatctattcctgcatgtcagccaaccaaagaagcgcacttcgccctatccaacaggctccaaaagaaccaattagaactagaacagaaccaattgttacgttggcagatgaaaccgagctaactggaggatgccagaaaaattccgaaaacgagaaagaaaggaacagacgtgagcgtgaagaacagcaaacaaaggaacgtgagagaagattagaagaagaaaaacaacgacgagatgctgaagctgaggctgaaagaaggcgaaaagaagaggaagagctggaagaagctaattacacccttcgtgct
ccgaaatctcagaacggcgagccaatcactccgataaga
  2. 2. Genome sequence of C. elegans. Sequence of entire genome. Sequence of cDNA clones. Approximately 19,500 PREDICTED protein-coding gene sequences. Large number of various kinds of functional RNAs – not discussed further. For this lecture – focus on predicted proteins. Gene prediction? How? Science, December 1998.
  3. 3. Computer-based predictions. GENEFINDER (C. elegans), BLAST (all genomes) and other computer programs. Biases in coding sequence - in C. elegans non-coding is AT rich. Splice site signals, initiator methionines, termination codons. Likely exons and probable/possible splice patterns. BLAST – compare the translation of all 6 reading frames. • Evidence that a prediction is correct? • Homology with genes in other organisms – homologues. • Known protein families. • Experimental evidence.
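The "translation of all 6 reading frames" mentioned on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code; the function names (`six_frame_translation`, `translate`, `reverse_complement`) are invented for this example and are not part of GENEFINDER or BLAST.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate the three forward and the three reverse-strand frames."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("atggcctaa").items():
        print(name, protein)
```

A gene finder would then look in each frame for long stretches without stop codons; that step is not shown here.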
  4. 4. The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. http://www.ncbi.nlm.nih.gov/ The National Center for Biotechnology Information (NCBI), the U.S. National Library of Medicine. How does BLAST work? mqnpmillifclfcavicsrgtdsdiphef Protein sequence, single-letter code. Search windows. BLAST compares small sequential blocks – or WINDOWS – of sequence against massive databases. It looks for regions of similarity and scores them.
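The window idea can be illustrated with a toy sketch. This is not how BLAST is actually implemented (the real program also scores near-matches and extends hits into alignments); it only shows, under that simplifying assumption, how short windows of a query can be looked up in a subject sequence. The function names and the subject string are hypothetical.

```python
def windows(seq, size=3):
    """Return (position, window) pairs for every overlapping window."""
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


def find_seed_hits(query, subject, size=3):
    """Report (query_pos, subject_pos, window) for every exact shared window."""
    index = {}
    for j, window in windows(subject, size):
        index.setdefault(window, []).append(j)
    return [(i, j, window)
            for i, window in windows(query, size)
            for j in index.get(window, [])]


if __name__ == "__main__":
    query = "mqnpmillif"      # first ten residues of the slide's example protein
    subject = "xxpmillyy"     # made-up subject sequence
    for hit in find_seed_hits(query, subject):
        print(hit)
```

Runs of hits on the same diagonal (here query positions 3 to 5 against subject positions 2 to 4) mark a local region of similarity.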
  5. 5. More BLAST. [Figure: a large protein with conserved regions (high-similarity BLAST score) and non-conserved regions (low-similarity BLAST score).] Small windows of comparison detect LOCAL regions of similarity. Output - % identity and % similarity (permits conservative substitutions of amino acids). Gives overall score and probability of relatedness. If the entire protein sequence were compared in one go, you might get a relatively low overall similarity. How did genes and gene families evolve and what is meant by protein domains? We need to come back to this – remember the question!
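A small sketch of the two output measures, % identity and % similarity, for two already-aligned sequences of equal length. The conservative substitution groups below are an assumption made for illustration; BLAST itself derives "positives" from a scoring matrix rather than from fixed groups. The last function shows why local windows can score highly even when the whole-protein comparison is low.

```python
# Hypothetical grouping of amino acids treated as conservative substitutions.
CONSERVATIVE_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]


def _group_of(residue):
    for group in CONSERVATIVE_GROUPS:
        if residue in group:
            return group
    return None


def percent_identity(a, b):
    """Percentage of aligned positions carrying the same residue."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    a, b = a.upper(), b.upper()
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def percent_similarity(a, b):
    """Like identity, but conservative substitutions also count."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    a, b = a.upper(), b.upper()
    similar = sum(
        x == y or (_group_of(x) is not None and _group_of(x) == _group_of(y))
        for x, y in zip(a, b)
    )
    return 100.0 * similar / len(a)


def best_window_identity(a, b, size):
    """Highest identity found in any aligned window of the given size."""
    return max(percent_identity(a[i:i + size], b[i:i + size])
               for i in range(len(a) - size + 1))


if __name__ == "__main__":
    conserved = "MKTAYIAKQR"
    a = conserved + "GGGGGGGGGG"
    b = conserved + "PPPPPPPPPP"
    print(percent_identity(a, b), best_window_identity(a, b, 10))
```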
  6. 6. HOMEWORK. Below is the sequence of a protein: mqnpmillif clfcavicsr gtdsdiphef hkmlkhaksl nsllrdlhvi yspemtnrhv ektdkhgaal slksgsmsaq rivsiqnisd demdgytlfh lqsmkdikqg ndtcnlqsvc vpipqlsddp qvlmypkcye vkqcvgsccn svetchpgti nlvkkhvael lyigngrfmf nmtkeitmee htscscfdcg sntpqcapgf vvgrsctcec ankeernncv gnatwnaetc kcecdlkcee gkilhkdrcd cvrrrqhhgg prghhghrhh hrsrpidtee vqkigqlkvg rigg. Go to NCBI http://www.ncbi.nlm.nih.gov/ Go to Blast then look down the left for “Choose a BLAST program to run”. From within that section, select “protein blast”. Copy the above protein sequence and paste it into the box on the top left of the web page. Scroll down the page and click the big blue BLAST button. Have a look at the outcome – any questions – post to the Forum on Moodle. BLAST is one of the powerful computational tools for Comparative Genomics.
  7. 7. Computational biology is mostly predictive – not EXPERIMENTAL. Let's look at simple experimental evidence for the existence of genes. “The Central Dogma” of Molecular Biology: DNA → mRNA → Protein. Expressed sequence tags (ESTs) – cDNA clones. To make cDNA, mRNA is copied to DNA with reverse transcriptase: RNA → DNA. Retroviruses (e.g. HIV): RNA genome → DNA → integration → mRNA → protein.
  8. 8. Making cDNA. Typical eukaryotic gene - double-stranded DNA with exons and introns. 1. RNA polymerase: primary transcript – single sense-strand RNA, 5’ to 3’OH – introns present. 2. Capping, splicing, polyadenylation: messenger RNA (mRNA), 5’ CAP … AAAAAAAAAAA 3’OH, annealed to a DNA primer OH-TTTTTTTT-5’. 3. First-strand cDNA synthesis - reverse transcriptase: RNA/cDNA duplex (AAAAAAAAAAA paired with TTTTTTTT). 4. Second-strand cDNA – DNA polymerase: double-stranded cDNA (AAAAAAAA paired with TTTTTTTT).
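The four steps on this slide can be followed on a toy sequence. This is a sketch only: the gene, the intron coordinates and all function names are made up for illustration, capping is ignored, and enzymes are modelled as plain string operations.

```python
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(sense_dna):
    """Step 1: primary transcript has the sense-strand sequence with U for T."""
    return sense_dna.upper().replace("T", "U")


def splice(pre_mrna, introns):
    """Step 2a: remove introns given as (start, end) half-open intervals."""
    pieces, position = [], 0
    for start, end in sorted(introns):
        pieces.append(pre_mrna[position:start])
        position = end
    pieces.append(pre_mrna[position:])
    return "".join(pieces)


def polyadenylate(mrna, tail_length=8):
    """Step 2b: add a poly-A tail to the 3' end."""
    return mrna + "A" * tail_length


def first_strand_cdna(mrna):
    """Step 3: reverse transcriptase copies mRNA into complementary DNA (5'->3')."""
    return mrna.translate(RNA_TO_DNA_COMPLEMENT)[::-1]


def second_strand_cdna(first_strand):
    """Step 4: DNA polymerase copies the first strand."""
    return first_strand.translate(DNA_COMPLEMENT)[::-1]


if __name__ == "__main__":
    gene = "ATGGCC" + "GTAAGTTTCAG" + "TTTTAA"   # exon, intron, exon (hypothetical)
    mrna = polyadenylate(splice(transcribe(gene), [(6, 17)]))
    first = first_strand_cdna(mrna)
    print(mrna)
    print(first)
    print(second_strand_cdna(first))
```

The first strand begins with a run of T because synthesis is primed from the poly-A tail, and the second strand reproduces the mRNA sequence in DNA, without the intron.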
  9. 9. EST sequencing was carried out in parallel to genome sequencing. Simplest experimental evidence that a bit of genomic DNA contains a gene. Making cDNA: (a) cDNA synthesis by oligo dT priming – messenger RNA (mRNA) … AAAAAAAAAAA 3’OH with DNA primer OH-TTTTTTTT-5’. (b) cDNA synthesis by random priming – DNA primer OH-NNNNNNNNN-5’, random 6-mers or 9-mers. The advantage of random priming is that cDNA clones are not biased towards the 3’ end of the gene.
  10. 10. Sequence data from random-primed cDNA – ESTs. [Figure: a typical eukaryotic gene (double-stranded DNA) with the EST sequences EST 1 to EST 4 aligned against it.] The sequencing of ESTs uncovered frequent examples of differential splicing, common examples of which are exon skipping (above), alternative 5’ exons, alternative splicing altering stop codons, genes within genes, etc. The above is true for C. elegans, humans, flies, and many other species.
  11. 11. • C. elegans EST data from approximately 50,000 cDNA clones. • Identified 9,356 different genes. 1. Grind up thousands of worms. 2. Prepare mRNA – convert to cDNA with reverse transcriptase – clone in plasmid. 3. Some mRNAs exist at extremely low levels of abundance. 4. Low-abundance cDNAs may be impossible to clone randomly.
  12. 12. Reverse transcriptase PCR – very sensitive. [Figure: gene and its mRNA (poly-A tail AAAAAAAA) with Primer A and Primer B.] cDNA is made from mRNA using reverse transcriptase. Amplify cDNA by PCR – primers designed from predicted genes. Clone and analyse products. Experimentally confirmed genes raised to > 18,000. Full-length cDNA – valuable for confirming intron/exon structure.
  13. 13. Summary of predicted and known gene sequences in C. elegans. 1. Predicted 19,500 genes. 2. At least 18,000 expressed as RNA. 3. Average of 1 gene per 5 kb. 4. ~42% have detectable homologies to genes/proteins outside Nematoda.
  14. 14. Genome Size
      Organism                      Genome     Genes
      E. coli (bacteria)            4.64 Mb    4,377
      S. cerevisiae (fungal)        12.1 Mb    6,163
      C. elegans (metazoan)         100 Mb     19,300
      Arabidopsis (plant)           118 Mb     ~20,000
      D. melanogaster (fruit fly)   135.6 Mb   13,472
      Mus musculus (mouse)          3059 Mb    ~25,000
      Homo sapiens (obvious)        3286 Mb    ~25,000
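The "1 gene per 5 kb" figure quoted for C. elegans can be checked against the table with a short calculation. The sketch below uses the slide's numbers as they stand, treating the "~" estimates as exact; the names `GENOMES`, `kb_per_gene` and `density_table` are invented for the example.

```python
# (organism, genome size in Mb, number of genes), taken from the Genome Size slide.
GENOMES = [
    ("E. coli", 4.64, 4377),
    ("S. cerevisiae", 12.1, 6163),
    ("C. elegans", 100.0, 19300),
    ("Arabidopsis", 118.0, 20000),
    ("D. melanogaster", 135.6, 13472),
    ("Mus musculus", 3059.0, 25000),
    ("Homo sapiens", 3286.0, 25000),
]


def kb_per_gene(genome_mb, genes):
    """Average genomic DNA per gene, in kilobases."""
    return genome_mb * 1000 / genes


def density_table():
    """Map each organism to its average kb of genome per gene."""
    return {name: kb_per_gene(mb, genes) for name, mb, genes in GENOMES}


if __name__ == "__main__":
    for name, value in density_table().items():
        print(f"{name:<16} {value:8.1f} kb per gene")
```

The spread, from roughly 1 kb per gene in E. coli to over 100 kb per gene in mouse and human, is the point made later in the lecture: gene number alone does not explain the increase in DNA content.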
  15. 15. The C. elegans Top 20 protein Homologies
      Number  Description
      650     7 TM chemoreceptor
      410     Eukaryotic protein kinase domain
      240     Zinc finger, C4 (transcription factor)
      170     Collagen
      140     7 TM receptor
      130     Zinc finger, C2H2 (transcription factor)
      120     Lectin C-type domain short and long forms
      100     RNA recognition motif (RRM, RBD, or RNP domain)
      90      Zinc finger, C3HC4 type (transcription factor)
      90      Protein-tyrosine phosphatase
      90      Ankyrin repeat
      90      WD domain, G-beta repeats
      80      Homeobox domain (transcription factor)
      80      Neurotransmitter-gated ion channel
      80      Cytochrome P450
      80      Helicases conserved C-terminal domain
      80      Alcohol/other dehydrogenases, short-chain type
      70      UDP-glucoronosyl and UDP-glucosyl transferases
      70      EGF-like domain
      70      Immunoglobulin superfamily
  16. 16. Does the “Top 20” list tell us anything? Previous slide looked rather boring? Test your memory – what was on the list? Many of the large gene families are implicated in developmental control. Core set of proteins needed for general cell biology/metabolism to make a cell – e.g. S. cerevisiae ~6,163 genes. Evolution of developmental complexity – amplification of families of regulatory molecules. The above in part explains the increase in number of genes in multicellular organisms – it does not fully explain the increase in DNA content.
  17. 17. How much does DNA sequence teach us? Remember that what we can learn from protein similarities is limited by what we know about the similar proteins. We still need to connect genes/proteins with functions.
  18. 18. How has genomics influenced genetics? C. elegans mutants. Wild type. dpy-7: short fat worm – exoskeletal defect. ced-4: programmed cell death defective. unc-51: paralysed - abnormal axons. dec-2: long defecation cycle – genetically constipated.
  19. 19. We wanted to investigate the molecular detail of genes defined by mutation. We knew where mutant genes mapped and we knew their phenotype. Genetic mapping – recombination. m.u. = map unit. 1 m.u. is 1% recombination per meiosis. [Figure: genetic map of Chromosome I, scale -15 to 25 m.u., running from the left arm through the central cluster to the right arm; markers shown: bli-3, egl-30, mab-20, fog-1, unc-73, unc-57, dpy-5, dpy-14, fer-1, lin-11, unc-29, unc-75, unc-101, glp-4, unc-54. Inset: parent and recombinant chromosomes for the markers fog-1 and glp-4.]
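The definition "1 m.u. is 1% recombination per meiosis" translates directly into a calculation. The sketch below assumes recombinant and total progeny have already been counted and that the distance is short enough for the percentage to be used without a mapping-function correction; the function name and the counts are made up.

```python
def map_distance(recombinants, total):
    """Genetic distance in map units: percentage of recombinant progeny."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need 0 <= recombinants <= total and total > 0")
    return 100.0 * recombinants / total


if __name__ == "__main__":
    # Hypothetical cross: 30 recombinant animals among 1000 scored.
    print(map_distance(30, 1000), "m.u.")
```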
  20. 20. Sequence of genomes – individual chromosomes: AGCCTTTATGGCGAGATGGATAGCT…TATAA. [Figure: physical map of clones drawn above the genetic map, which carries the markers bli-3, egl-30, mab-20, fog-1, unc-73, dpy-5, fer-1, lin-11, unc-75, unc-101, glp-4 and unc-54 on a scale of -15 to 25 m.u.] How can the physical and genetic maps be aligned? Identify the sequence of genes defined by mutation.
  21. 21. [Figure: genetic map (markers bli-3, egl-30, mab-20, fog-1, unc-73, dpy-5, fer-1, lin-11, unc-75, unc-101, glp-4, unc-54; scale -15 to 25 m.u.) drawn against the physical map.] • An association or alignment between the physical and genetic maps.
  22. 22. Positional cloning of genes defined by mutation. [Figure: genetic map (markers bli-3, egl-30, mab-20, fog-1, unc-73, dpy-5, fer-1, lin-11, unc-75, unc-101, glp-4, unc-54; scale -15 to 25 m.u.) drawn against the physical map.] Imagine lin-11 and unc-101 had both been cloned. Where on the physical map might unc-75 be?
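One way to reason about the unc-75 question is linear interpolation between the two cloned flanking markers. The sketch below rests on a strong assumption, that recombination is uniform along the interval, which real chromosomes do not satisfy, so the result only suggests where to start testing cosmids. All positions are hypothetical numbers chosen for the example, not values from the slide.

```python
def estimate_physical_position(target_mu, left, right):
    """Interpolate a physical position (Mb) from a genetic position (m.u.).

    `left` and `right` are (genetic position in m.u., physical position in Mb)
    pairs for two cloned markers that flank the target gene.
    """
    (left_mu, left_mb), (right_mu, right_mb) = left, right
    if not left_mu < target_mu < right_mu:
        raise ValueError("target must lie between the flanking markers")
    fraction = (target_mu - left_mu) / (right_mu - left_mu)
    return left_mb + fraction * (right_mb - left_mb)


if __name__ == "__main__":
    lin_11 = (5.0, 9.0)      # hypothetical: 5 m.u., 9.0 Mb
    unc_101 = (10.0, 10.0)   # hypothetical: 10 m.u., 10.0 Mb
    print(estimate_physical_position(8.0, lin_11, unc_101))
```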
  23. 23. Transgenic C. elegans – rescue of mutant phenotype. DNA is injected into the gonads of adult hermaphrodites. It forms large heritable DNA molecules termed "free arrays".
  24. 24. Phenotypic Rescue. 1. Inject cosmid into the mutant. 2. Observe transgenic progeny for phenotypic rescue. 3. Subclone individual genes from cosmid. 4. Observe transgenic progeny for phenotypic rescue. [Figure: cosmid sequence and its genes; inject unc-75 mutant worms.]
  25. 25. Positional cloning of genes defined by mutation. [Figure: genetic map (markers bli-3, egl-30, mab-20, fog-1, unc-73, dpy-5, fer-1, lin-11, unc-75, unc-101, glp-4, unc-54; scale -15 to 25 m.u.) drawn against the physical map.] Attempt phenotypic rescue with cosmids. • The standard route to clone C. elegans genes defined by mutation. • The more genes are cloned, the easier it becomes to clone others.
  26. 26. Can’t make transgenic humans – but the same positional information is used to identify human disease genes.
  27. 27. RNA Interference (RNAi). RNAi - sequence-specific inactivation of gene function by either double-stranded RNA or siRNA. Since its discovery in C. elegans, it has been found to work in many organisms – e.g. cultured vertebrate cells, plants, trypanosomes, Drosophila.
  28. 28. Mediators of RNAi - short interfering RNAs (siRNAs), 21-23 nt dsRNA duplexes. DICER – highly conserved family of RNase III enzymes. Targets double-stranded RNA.
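The sequence specificity of RNAi can be shown with a toy model: cut a long dsRNA trigger into 21 nt pieces and ask whether any piece is complementary to a given mRNA. This is a deliberately simplified sketch. It ignores the 2 nt overhangs of real siRNA duplexes, mismatch tolerance and strand selection, and the sequences and function names are invented for the example.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def dice(sense_strand, length=21):
    """Toy Dicer: cut the sense strand into consecutive fragments of `length` nt."""
    return [sense_strand[i:i + length]
            for i in range(0, len(sense_strand) - length + 1, length)]


def is_targeted(mrna, sirna_sense_fragments):
    """True if the antisense (guide) strand of any fragment can pair with the mRNA."""
    for fragment in sirna_sense_fragments:
        guide = reverse_complement_rna(fragment)
        # The guide pairs where the mRNA carries the guide's reverse complement.
        if reverse_complement_rna(guide) in mrna:
            return True
    return False


if __name__ == "__main__":
    trigger = "AUGGCUAGCUAGGAUCCGAUCGAUUAGCCGAUAGCUAGGCAUCG"  # made-up dsRNA sense strand
    fragments = dice(trigger)
    print(is_targeted("GGGAAA" + trigger + "CCCUUU", fragments))
    print(is_targeted("ACGU" * 20, fragments))
```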
  29. 29. Argonaute. Single-stranded interfering RNA.
  30. 30. RNAi in C. elegans. Inject dsRNA. Observe phenotype of F1 offspring. Noticed that site of injection did not matter – intestine works?? How could that affect embryos? Systemic RNAi.
  31. 31. Bacterial Feeding Method in C. elegans. Express dsRNA of a cloned C. elegans gene in a strain of E. coli. Worms eat the bacteria as food. RNAi of the gene can be obtained both in the worms that feed on the dsRNA-expressing bacteria, and in the F1 progeny of these worms.
  32. 32. sid-1 mutants are defective in systemic RNAi. SID-1 protein. "Transport of dsRNA into Cells by the Transmembrane Protein SID-1", Science 301, 1545 (2003).
  33. 33. RNAi as a tool for genetic analysis. Loss-of-function phenotype can be estimated by RNAi. RNAi by feeding method – whole-genome RNAi projects. Clones of 16,757 predicted genes tested in genome-wide screen. 10.3% gave obvious phenotype. Redundancy between genes. RNAi is capable of functioning for more than one gene at a time. Permits analysis of functionally redundant genes.
  34. 34. Summary, C. elegans Genomics. Permits comparisons with human genes. Most human disease genes have C. elegans homologues. Powerful genetic tools – experiments on genes. Detailed anatomy – relate gene to function. Examples of processes investigated: programmed cell death, signalling, cell adhesion, axonal guidance, oncogene function, insulin pathway, ageing.
  35. 35. How did genes evolve and what are gene/protein families?
  36. 36. Early genomes – Early genomes made of RNA • RNA world - no cells (in the modern sense), just RNA, starting with 1 gene • Ribonucleotide polymerase activity - catalyse own synthesis • Later on - translation - encoded information for production of proteins – Involves nucleic acids ‘coding for’ proteins – Later emergence of DNA as the information store - genome stability - less labile – Modern functions of nucleic acids • coding - proteins via mRNA • catalytic – ribozymes • structural – rRNA, tRNA • regulatory - miRNAs. [Figure labels: inorganic surface, nucleotides, RNA, DNA, tRNA, rRNA, mRNA, protein.]
  37. 37. Where did our genome come from? ‘Tree of Life’ - tree of all animals. Common ancestor => common genome. • Each species’ genome descended with modification from the genome of an ancestor. Reconstruction of a picture of the ‘ancestral genome’? Comparative genomics - tells us about the state of the ancestor and the changes along each branch.
  38. 38. Genes and genome evolution • What processes lead to genome evolution? Initial ligation to form early chromosomes; inversion; duplication / deletion; accumulation of point mutations; invasion - horizontal gene transfer & transposable elements.
  39. 39. Structure of a typical eukaryotic gene. [Figure: gene with promoter, TSS, ATG, Exon 1, Intron 1, Exon 2, Exon 3, Exon 4 and stop; mRNA with 5’-UTR, 3’-UTR and poly-A tail; protein with Domain 1 and Domain 2.] What features of all genes are missing from this diagram?
