Good morning. I’m going to talk today about my PhD, “Evolutionary Genomics of Organismal Diversity”, which I started a few weeks ago with Dave and Domino.
I’m going to start with a bit of an introduction, just to give you a flavor of the areas I intend to cover. Then I’ll move on to a few slides detailing my summer work with Dave and Domino, to put things in context. That leads nicely on to my first year, which is essentially a continuation and, more importantly, an expansion of my summer work. This is followed by some preliminary plans for the second and third years. And to wrap it all up, I’ll give you an “In a nutshell” take-home message.
The official title of the PhD is “Evolutionary Genomics of Organismal Diversity”. This is a deliberately broad title, to allow some freedom in the direction of the research. Essentially it revolves around evolutionary comparative genomics. I’m hoping to ask questions such as “What forces drive genome evolution?” and “What types of genomic change underlie organismal adaptation?” To do this I’ll be using bioinformatics and computational biology approaches, purely because the amount of data that exists as a result of whole-genome projects is impossible to quantify without computational methods. I also believe a cross-disciplinary approach enables us to answer these questions more effectively.
My summer work was based on an area of Katie Tindall’s PhD thesis. It involved studying particular features of teleost fish genomes (Zebrafish, Stickleback, Medaka, Fugu, and Tetraodon) in order to understand differences in genome architecture. In particular I looked at introns and, to a lesser extent, exons. The data for these were retrieved from the EnsEMBL server. The focus was on undertaking comparative genomics analyses to try to understand the causes of change in genome structure. Hopefully this will lead to a paper, possibly in BMC Genomics? (Steve Moss, Dave Lunt, Domino Joyce, Stuart Humphries)
This is a chart taken from Katie’s thesis, which shows the frequency of intron sizes. It has a 5000bp cut-off, with introns grouped into windows of 10bp for each point on the plot. Here we see a bimodal distribution across the five fish. The inset chart here (point) uses a 1000bp cut-off to highlight the differences in this area (point). This second peak was of particular interest, because most comparative genomics papers looking at introns only describe the initial peak. The divergence of Zebrafish (the blue line) was also of great interest. But could I reproduce this?
Katie had found this underlying pattern of intron size, but the approach was ad hoc, as she obviously wasn’t approaching things with the same methodology that I am. So I decided it would be best to develop some novel scripts to retrieve the data I needed from the EnsEMBL server. To do this I used a programming language called Perl, specifically EnsEMBL’s own Perl API and the BioPerl API, which provide easier access to the data I needed without having to program everything from scratch. I couldn’t retrieve the April 2007 data Katie had used, as the archive wasn’t available on the EnsEMBL server, so instead I had to use the August 2007, version 46 release. I used another language called Python and the BioPython API to create some scripts to analyze these data. But I was unable to reproduce the bimodal distribution, only producing a single peak. I also ran the analyses using the version 58 database release and produced the same results. Was the difference possibly due to annotation quality in the April 2007 data? It was decided to use the version 58 release, as it was assumed to have better annotation quality.
So after retrieving and analyzing the data I produced this scatterplot using R. I used the same 5000bp cut-off for intron size and grouped the points into windows of 10bp of intron size, to maintain the same format as Katie’s original plot. I determined the mode intron size (the peak) to be an average of 80bp across the fish. Obviously there is no second peak. And as you can see, Zebrafish still diverges from the other fish between 500bp and 1000bp before tailing off. So this means that Zebrafish has a greater number of larger introns, which is quite interesting. This distribution is a reflection of the underlying processes shaping the genome: there is a difference in the way the genome of Danio is evolving in comparison to the others. Fundamental process, amazingly important force: something different is going on. Although the difference is clear to the eye, we also wanted to confirm this statistically.
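As a rough illustration of the binning described here, the sketch below groups intron lengths into 10bp windows with a 5000bp cut-off and reports the modal window. It is not the original script: the function names are hypothetical, and the input is assumed to be a plain list of intron lengths in bp.

```python
from collections import Counter


def bin_intron_sizes(lengths, window=10, cutoff=5000):
    """Count introns per `window`-bp bin, skipping introns longer than `cutoff`.

    Keys are the start of each bin in bp (e.g. 80 covers 80-89bp).
    """
    counts = Counter()
    for length in lengths:
        if 0 < length <= cutoff:
            counts[(length // window) * window] += 1
    return dict(sorted(counts.items()))


def modal_window(binned):
    """Return the bin start with the highest count (lowest bin wins a tie)."""
    return max(binned, key=binned.get)


# Toy data only, not real intron lengths.
example = [75, 82, 85, 88, 120, 640, 7000]
binned = bin_intron_sizes(example)
print(binned)                # {70: 1, 80: 3, 120: 1, 640: 1}
print(modal_window(binned))  # 80
```

The resulting bin counts are what would be plotted as one point per window for each species.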
Again using R (a statistical language; Stuart helped), we performed various statistical analyses in the hope of obtaining a p-value. We looked at methods for analyzing the distributions using relative distribution analysis, such as Lorenz curves, PDFs and CDFs. But due to the ambiguity of the data, standard statistical tests were difficult. In the end we decided to use the mean and the plus and minus 5% confidence intervals in sliding windows of 100bp across the distribution. Where the intervals overlapped, we could say that the distributions weren’t significantly different. But where there was a gap between them, we could confirm that they were statistically different.
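The sliding-window comparison can be sketched as follows. This is an illustration in Python rather than the R analysis itself, and it rests on several assumptions: the interval is taken to be a 95% normal-approximation confidence interval around the window mean, each species is a dictionary mapping 10bp bin starts to frequencies, and all function names are hypothetical.

```python
from statistics import mean, stdev


def window_interval(values, z=1.96):
    """Mean of `values` with a normal-approximation interval (assumed 95%)."""
    centre = mean(values)
    half_width = z * stdev(values) / len(values) ** 0.5
    return centre - half_width, centre + half_width


def intervals_overlap(a, b):
    """True if the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def differing_windows(freq_a, freq_b, window=100, step=10,
                      bin_width=10, cutoff=5000):
    """Return the start (bp) of every window whose intervals do not overlap."""
    offsets = range(0, window, bin_width)
    differing = []
    for start in range(0, cutoff - window + 1, step):
        values_a = [freq_a.get(start + o, 0) for o in offsets]
        values_b = [freq_b.get(start + o, 0) for o in offsets]
        if not intervals_overlap(window_interval(values_a),
                                 window_interval(values_b)):
            differing.append(start)
    return differing
```

Windows returned by `differing_windows` are those with a gap between the two intervals; everything else is treated as not significantly different.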
So we had shown that there was a difference, but why? One hypothesis was that Zebrafish could have an increased number of repeat elements (REs). To test this, I used a program called RepeatMasker to analyze the intron sequence data across the five fish. Then I used Python to create a script to determine the unique intron size (i.e. original intron length minus RE length, for each intron). The number of REs ranges from 47,000 to 520,000. The total length of REs ranges from 2,500,000bp to 36,500,000bp. So Zebrafish definitely has the highest RE content. However, after reproducing the plot using the unique intron sizes, the distribution was still the same, although it had shifted down the Y axis. I’m not sure why this is! But perhaps it would be interesting to undertake phylogenetic analysis, to see if this can account for the difference?
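A minimal sketch of the unique intron size calculation is given below. It does not parse RepeatMasker output; it assumes the repeat annotations for one intron have already been extracted as (start, end) pairs in 1-based inclusive coordinates, and it merges overlapping annotations so that no base is subtracted twice. Whether the original script merged overlaps is not stated, so that step and the function names are assumptions.

```python
def masked_length(intervals):
    """Total bases covered by (start, end) repeats, 1-based inclusive.

    Overlapping or adjacent annotations are merged before counting.
    """
    total = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            if current_end is not None:
                total += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start + 1
    return total


def unique_intron_size(intron_length, repeat_intervals):
    """Original intron length minus the length covered by repeat elements."""
    return intron_length - masked_length(repeat_intervals)


# Toy example: a 1000bp intron with two overlapping repeats and one separate.
print(unique_intron_size(1000, [(101, 200), (151, 300), (601, 650)]))  # 750
```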
The summer work is a great example of how we can address fundamental questions and investigate the forces driving genome change using comparative genomics. There is plenty of scope. Some questions I would like to answer in my first year involve areas such as: can phylogeny explain differences in genome architecture?
A Lynch and Conery paper in 2003 showed a statistical correlation between Ne (effective population size) and genome architecture sizes, but without taking phylogeny into account (Ordinary Least Squares, OLS).
Lynch M (2007) The origins of genome architecture. Sunderland, Massachusetts, USA: Sinauer Associates.
Lynch M, Conery JS (2003) The origins of genome complexity. Science 302: 1401–1404.
A Whitney and Garland paper this year reproduced the same study, taking phylogeny into account (Phylogenetic Generalized Least Squares, PGLS; phylogeny with OU transform, RegOU, a regression model with residuals modeled as an Ornstein-Uhlenbeck process) and obtained the opposite result, i.e. no statistical correlation.
Whitney KD, Garland T Jr (2010) Did Genetic Drift Drive Increases in Genome Complexity? PLoS Genet 6(8): e1001080. doi:10.1371/journal.pgen.1001080
Also I could ask questions such as, do orthologous introns evolve in a clock-like manner? A paper by Zhu et al. focused on this.
Zhu et al. (2009) Patterns of exon-intron architecture variation of genes in eukaryotic genomes. BMC Genomics 10: 47.
Gene families: the Hahn Lab produced the CAFE toolkit for the study of gene family evolution and published a paper in 2006 on the evolution of mammalian gene families.
Demuth JP, De Bie T, Stajich JE, Cristianini N, Hahn MW (2006) The evolution of mammalian gene families. PLoS ONE 1: e85.
I also hope to undertake some analysis using social insect data, such as from ants and bees, which would be a collaboration with Dr Rob Hammond. Castes and no castes: genome changes.
This moves me nicely into my first year, which is a continuation and expansion of the summer work. I’m going to be focusing on computational evolutionary comparative genomics. So I want to integrate the scripts I have developed over the summer and expand on them to create a comparative genomics toolkit that I am going to call PyTEA (Python Toolkit for Evolutionary Analysis) or just TEA. One of the first goals is to create a basic automated pipeline that will allow the analysis of any genome at “the push of a button”. As a lot of the code already exists, it’s just a case of integrating it all to work more uniformly. The initial pipeline should produce an output of descriptive statistics for genomic components, such as the number and size of introns, exons and genes. Then the user could customize the pipeline to perform analysis of repeat element content, for example. One of the more complex tasks would involve determining orthology between different genomic components, but this is something I hope to expand on in future software updates. The aim is to expand this so it can be used across taxa! It is useful to note that my PhD is not primarily about bioinformatics; it’s more about using these methods to answer important evolutionary questions.
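To give a feel for the kind of output the initial pipeline step might produce, here is a small sketch of descriptive statistics per genomic component. It is purely illustrative: PyTEA's actual interface is not described here, so the function names and the input format (component name mapped to a list of feature lengths in bp) are hypothetical.

```python
from statistics import mean, median


def describe_component(lengths):
    """Number and size summary for one component type (e.g. introns)."""
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "total_bp": sum(lengths),
        "mean_bp": mean(lengths),
        "median_bp": median(lengths),
        "min_bp": min(lengths),
        "max_bp": max(lengths),
    }


def describe_genome(components):
    """Apply describe_component to every component type in a genome."""
    return {name: describe_component(lengths)
            for name, lengths in components.items()}


# Toy lengths only, not real annotation data.
summary = describe_genome({"intron": [80, 90, 100], "exon": [150, 150]})
print(summary["intron"]["count"], summary["exon"]["total_bp"])  # 3 300
```

Further steps, such as repeat element analysis, could then be added as optional stages that take the same per-component input.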
On to the second year… Initial plans have been made to spend some time up at the Blaxter Lab in Edinburgh working on nematode genomics. I’m looking to ask questions such as, how do meiosis and recombination influence genome structure and content? To do this, I could take a nematode sequence pair, where one of the species has lost meiosis, and then compare the two genomes to see any effects on content. Also, as part of the PhD, we should have some money to sequence 3 nematode genomes. So hopefully I can look a bit at doing some annotation. And it will also be a good real-world test for my comparative genomics toolkit (PyTEA).
There are a number of big questions to tackle, and I will have developed the toolkit significantly by then. There are no definitive plans for the third year yet. I plan to focus on any interesting ideas or discoveries that I have had or made over the previous two years. Obviously new genomes will have been published for me to analyze, and existing annotations will have improved, so I can look at expanding my analyses. GOLD, the Genomes OnLine Database, lists approximately 1500 complete genomes published across taxa to date, with over 7000 still ongoing. Of course, as methods improve and new tools are published, I can update the TEA pipeline to reflect this.
So, my PhD in a nutshell… It will focus on evolutionary comparative genomics across taxa. I’ll use bioinformatics and computational biology methodologies, via development of a novel toolkit that also integrates existing tried and tested tools; it makes sense not to reinvent the wheel. The focus, however, is not on the bioinformatics development, but on using bioinformatics and in silico analysis to answer important evolutionary and biological questions. Hopefully it will shed some light on questions surrounding the evolution of genome architecture, from the abstract level (so gene families, gene duplication and loss) down to sub-level architecture such as exons and introns. Hopefully I can piece together a bit more of the puzzle to understand what this all means… for individual organisms, species, biodiversity, ecosystems? In terms of historical and future adaptations?
Thank you! Any questions?
Evolutionary Genomics of Organismal Diversity
Steve Moss
EVOLUTIONARY GENOMICS OF ORGANISMAL DIVERSITY
My PhD
• Introduction
• Summer work
• First year
• Second year
• Third year
• In a Nutshell
Summer Work
Corroboration
• Retrieve original data
• Developed novel scripts
• EnsEMBL release 46
• EnsEMBL release 58
• Unable to reproduce
The PhD
First Year
• Phylogenetic comparative method?
• Genome size
• Intron number and size
• Transposable elements
• Orthologous loci?
• Do orthologous introns evolve in a clock-like manner?
• Gene families?
• Significance of gene family size changes?
• Social insect data
The PhD
First Year
• Data
• Computational evolutionary comparative genomics
• Develop software toolkit
• Implement basic pipeline
• Genomic components
• Include orthology
• Expand across taxa
The PhD
Second Year
• Patterns of heterozygosity
• Expansion of transposable element families
• Polyploidy
• Decay of alleles
• Regain of diploidy
• Nematode genomics
• Blaxter Lab
• Meiosis
• Sequence pair
• 3 nematode genomes
The PhD
Third Year
• Follow up interesting ideas and discoveries
• Analyze any new genomes
• GOLD
• Make updates to toolkit
In a nutshell…
• Answer important biological questions
• Evolution of genome architecture
• Evolutionary Comparative Genomics
• Bioinformatics and Computational Biology
• Development of software and tools