
A new method for sequencing the hexaploid genome of Triticum aestivum:
Assembly and analysis of the group 7 chromosomes using second-generation
sequencing technology
Paul James Berkman

A thesis submitted for the degree of Doctor of Philosophy at
The University of Queensland in February 2012
School of Agriculture and Food Sciences

Declaration by author
This thesis is composed of my original work, and contains no material previously published or
written by another person except where due reference has been made in the text. I have clearly
stated the contribution by others to jointly-authored works that I have included in my thesis.
I have clearly stated the contribution of others to my thesis as a whole, including statistical
assistance, survey design, data analysis, significant technical procedures, professional editorial
advice, and any other original research work used or reported in my thesis. The content of my thesis
is the result of work I have carried out since the commencement of my research higher degree
candidature and does not include a substantial part of work that has been submitted to qualify for
the award of any other degree or diploma in any university or other tertiary institution. I have
clearly stated which parts of my thesis, if any, have been submitted to qualify for another award.
I acknowledge that an electronic copy of my thesis must be lodged with the University Library and,
subject to the General Award Rules of The University of Queensland, immediately made available
for research and study in accordance with the Copyright Act 1968.
I acknowledge that copyright of all material contained in my thesis resides with the copyright
holder(s) of that material.

Statement of Contributions to Jointly Authored Works Contained in the Thesis
BERKMAN, P. J., SKARSHEWSKI, A., LORENC, M., LAI, K., DURAN, C., LING, E. Y. S.,
STILLER, J., SMITS, L., IMELFORT, M., MANOLI, S., MCKENZIE, M., KUBALÁKOVÁ, M.,
ŠIMKOVÁ, H., BATLEY, J., FLEURY, D., DOLEŽEL, J. & EDWARDS, D. 2011. Sequencing
and assembly of low copy and genic regions of isolated Triticum aestivum chromosome arm 7DS.
Plant Biotechnology Journal, 9: 768-775
PJB was responsible for 40% of conception and design, 40% of the coding, analysis and
interpretation of the data, and 40% of the drafting and writing. DE was responsible for 40% of the
conception and design and 40% of the drafting and writing. JD was responsible for 10% of
conception and design. MM and HS were each 30% responsible for sequence data generation. MK
and JB were each 20% responsible for sequence data generation. AS, ML, KL, LS, and SM were
each responsible for 10% of the coding, analysis and interpretation of the data. CD and JS were
each responsible for 5% of the coding, analysis and interpretation of the data. AS, JB, JD, and DF
were each responsible for 5% of the drafting and writing.
BERKMAN, P. J., SKARSHEWSKI, A., MANOLI, S., LORENC, M. T., STILLER, J., SMITS,
L., LAI, K., CAMPBELL, E., KUBALÁKOVÁ, M., ŠIMKOVÁ, H., BATLEY, J., DOLEŽEL, J.,
HERNANDEZ, P. & EDWARDS, D. 2012. Sequencing wheat chromosome arm 7BS delimits the
7BS/4AL translocation and reveals homoeologous gene conservation. Theoretical and Applied
Genetics, 124: 423-432
PJB was responsible for 35% of conception and design, 35% of the analysis and interpretation of
the data, and 50% of the drafting and writing. DE was responsible for 30% of the conception and
design, 20% of the analysis and interpretation, and 40% of the drafting and writing. AS was
responsible for 10% of the conception and design and 10% of the analysis and interpretation of the
data. ML and KL were each responsible for 10% of the analysis and interpretation. SM, JS, and LS
were each responsible for 5% of the analysis and interpretation. JD was responsible for 15% of
conception and design and 5% of the drafting and writing. MK, HŠ, and PH were each responsible
for 10% of data generation. EC was responsible for 40% of data generation. JB was responsible for
10% conception and design, 5% drafting and writing, and 30% of data generation.

LAI, K., BERKMAN, P. J., LORENC, M. T., DURAN, C., SMITS, L., MANOLI, S., STILLER,
J. & EDWARDS, D. 2012. WheatGenome.info: An integrated database and portal for wheat
genome information. Plant and Cell Physiology, 52: e2(1-7).
PJB contributed 20% to the conception and design, 20% to the database implementation, and 30%
to the drafting and writing. KL contributed 25% to the conception and design, 20% to the database
implementation, and 30% to the drafting and writing. ML contributed 15% to the conception and
design, 15% to the database implementation, and 10% to the drafting and writing. DE contributed
40% to the conception and design, 10% to the database implementation, and 30% to the drafting
and writing. LS contributed 15% to the database implementation. JS contributed 10% to the
database implementation. SM and CD both contributed 5% to the database implementation.
BERKMAN, P. J., LAI, K., LORENC, M. & EDWARDS, D. 2012. Next generation sequencing
applications for wheat crop improvement. American Journal of Botany, 99: 365-371
PJB and DE each contributed 35% to the conception and design and 30% to the drafting and
writing. KL contributed 20% to the conception and design and 20% to the drafting and writing. ML
contributed 10% to the conception and design and 20% to the drafting and writing.
Statement of Contributions by Others to the Thesis as a Whole
Introductory chapter – Prof. David Edwards and Dr Jacqueline Batley respectively contributed 10%
and 5% to the drafting of this chapter, particularly during early iterations.
BERKMAN, P. J., VISENDI, P., LEE, H. C., STILLER, J., MANOLI, S., LORENC, M. T., LAI,
K., BATLEY, J., FLEURY, D., SIMKOVA, H., KUBALAKOVA, M., WEINING, S., DOLEZEL,
J. & EDWARDS, D. 2012. Dispersion and domestication shaped the genome of bread wheat.
Nature Genetics, Manuscript submitted as Letter, Under review.
PJB was responsible for 35% of the conception and design, 30% of the coding and analysis, and
40% of the drafting and writing. DE was responsible for 35% of the conception and design, 10% of
the coding and analysis, and 40% of the drafting and writing. SW and JD were each responsible for
5% of the conception and design. JB, ML, KL, and JD were each responsible for 5% of the drafting
and writing. ML, PV, and HCL were each responsible for 15% of the coding and analysis. JS, SM,
and KL were each responsible for 5% of the coding and analysis.
Statement of Parts of the Thesis Submitted to Qualify for the Award of Another Degree
None.
Published Works by the Author Incorporated into the Thesis
BERKMAN, P. J., SKARSHEWSKI, A., LORENC, M., LAI, K., DURAN, C., LING, E. Y. S.,
STILLER, J., SMITS, L., IMELFORT, M., MANOLI, S., MCKENZIE, M.,
KUBALÁKOVÁ, M., ŠIMKOVÁ, H., BATLEY, J., FLEURY, D., DOLEŽEL, J. &
EDWARDS, D. 2011. Sequencing and assembly of low copy and genic regions of isolated
Triticum aestivum chromosome arm 7DS. Plant Biotechnology Journal, 9: 768-775 –
Incorporated completely as Chapter 3
BERKMAN, P. J., SKARSHEWSKI, A., MANOLI, S., LORENC, M. T., STILLER, J., SMITS,
L., LAI, K., CAMPBELL, E., KUBALÁKOVÁ, M., ŠIMKOVÁ, H., BATLEY, J.,
DOLEŽEL, J., HERNANDEZ, P. & EDWARDS, D. 2012. Sequencing wheat chromosome
arm 7BS delimits the 7BS/4AL translocation and reveals homoeologous gene conservation.
Theoretical and Applied Genetics, 124: 423-432 – Incorporated completely as Chapter 4
LAI, K., BERKMAN, P. J., LORENC, M. T., DURAN, C., SMITS, L., MANOLI, S., STILLER,
J. & EDWARDS, D. 2012. WheatGenome.info: An integrated database and portal for wheat
genome information. Plant and Cell Physiology, 52: e2(1-7) – Incorporated completely as
Chapter 6
BERKMAN, P. J., LAI, K., LORENC, M. & EDWARDS, D. 2012. Next generation sequencing
applications for wheat crop improvement. American Journal of Botany, 99: 365-371 –
Incorporated completely as Chapter 7

Additional Published Works by the Author Relevant to the Thesis but not Forming Part of it
DURAN, C., EALES, D., MARSHALL, D., IMELFORT, M., STILLER, J., BERKMAN, P. J.,
CLARK, T., MCKENZIE, M., APPLEBY, N., BATLEY, J., BASFORD, K. & EDWARDS,
D. 2010. Future tools for association mapping in crop plants. Genome, 53, 1017-23.
MARSHALL, D. J., HAYWARD, A., EALES, D., IMELFORT, M., STILLER, J., BERKMAN, P.
J., CLARK, T., MCKENZIE, M., LAI, K., DURAN, C., BATLEY, J. & EDWARDS, D.
2010. Targeted identification of genomic regions using TAGdb. Plant Methods, 6, 19.
WANG, X., WANG, H., WANG, J., SUN, R., WU, J., LIU, S., BAI, Y., MUN, J. H., BANCROFT,
I., CHENG, F., HUANG, S., LI, X., HUA, W., FREELING, M., PIRES, J. C., PATERSON,
A. H., CHALHOUB, B., WANG, B., HAYWARD, A., SHARPE, A. G., PARK, B. S.,
WEISSHAAR, B., LIU, B., LI, B., TONG, C., SONG, C., DURAN, C., PENG, C., GENG,
C., KOH, C., LIN, C., EDWARDS, D., MU, D., SHEN, D., SOUMPOUROU, E., LI, F.,
FRASER, F., CONANT, G., LASSALLE, G., KING, G. J., BONNEMA, G., TANG, H.,
BELCRAM, H., ZHOU, H., HIRAKAWA, H., ABE, H., GUO, H., JIN, H., PARKIN, I. A.,
BATLEY, J., KIM, J. S., JUST, J., LI, J., XU, J., DENG, J., KIM, J. A., YU, J., MENG, J.,
MIN, J., POULAIN, J., HATAKEYAMA, K., WU, K., WANG, L., FANG, L., TRICK, M.,
LINKS, M. G., ZHAO, M., JIN, M., RAMCHIARY, N., DROU, N., BERKMAN, P. J.,
CAI, Q., HUANG, Q., LI, R., TABATA, S., CHENG, S., ZHANG, S., SATO, S., SUN, S.,
KWON, S. J., CHOI, S. R., LEE, T. H., FAN, W., ZHAO, X., TAN, X., XU, X., WANG,
Y., QIU, Y., YIN, Y., LI, Y., DU, Y., LIAO, Y., LIM, Y., NARUSAKA, Y., WANG, Z.,
LI, Z., XIONG, Z. & ZHANG, Z. 2011. The genome of the mesopolyploid crop species
Brassica rapa. Nature Genetics, 43, 1035-1039.

Acknowledgements
It is a strange experience to reach the end of a journey like this one. So many hours of worry and
intense thought, so many painful and heart-sinking moments; so many moments of sheer
exhilaration and joy, so many light-bulb moments of delight in seeing something new; so much
caffeine consumed and so many nails chewed all the way down, one cannot adequately explain to
another the intensity of the stress and satisfaction in accomplishing each step unless the other
has been there too. All of these experiences of mine are now past, a momentous occasion indeed to
have reached the end. I could not have arrived on my own.
Professor David Edwards, thank you for your support and advice through it all, you’ve taught me
the value of communicating science well. Dr Jiri Stiller, thank you for your measured approach to
everything and for your grounding influence, you’ve provided good perspective at each step. Adam
Skarshewski, Paul Visendi and Hong Lee, you have all helped greatly to sharpen my thinking about
the science, even at the last. Monica Ogierman and Kaye Hunt, thank you both for your advice and
support in officialdom, things would be so much harder without two such capable people to assist.
To those who have gone before me, Mike, Chris, Dom, and Daniel, thank you all for your
mentoring and words of advice during my time here. You each helped me to keep my head down,
get my work done, and get through. Mike, you somehow helped me in finishing my thesis by
continually distracting me from it; thanks for the random conversations over coffee.
My brothers and sisters at Liberty Community Church, thank you all for loving and caring for my
family despite our frequent absences. Your friendship means so much. Greg and D, I love you guys.
Denis and Shirley, thank you so much for your continual support of my family and me, especially
over the last few months, I suspect we would have gone insane without your help. We love you lots.
To my Mum and Dad, thank you for your fostering of a quirky and enquiring mind in me and for
your endless support of my family. I still can’t quite believe you put up with us free-loading for
two years! Thank you for all that you’ve taught me, I love you both so very much.
My boys, Jonas and Quinn, you guys are the spice of our life. Thank you for all of the cuddles,
giggles and smiles, I am so proud of you both. To my gorgeous girl, a million clichés couldn’t
capture your loveliness. You have stood with me through these years even through 5 countries,
living between 2 cities, and caring for 2 little boys. You are the delight of my life and I love you.
Finally to my Father and Creator, thank you for the beauty in all that you have made and for
redeeming me in Jesus. You alone have made me who I am and I stand here only because of You.

Abstract
Wheat is an extremely important crop species to Australia and the rest of the world in both
economic and social terms, with population growth, disease, and climate-related pressures
requiring improvements to this crop if it is to be a secure food source into the future. Experience
with genome sequence data from the first sequenced plant genomes has demonstrated the utility of
this knowledge not just in scientific terms to extend understanding of plant biology and evolution,
but in the same economic and social terms under which these plants are deemed important. The
polyploid complexity of wheat is a hindrance to the determination of its genome sequence using the
techniques that have been more readily applicable to the plants whose genomes have already been
sequenced. Second generation sequencing technologies are accelerating genome sequencing efforts
in a number of crops but come with major computational challenges, and while a number of
bioinformatics tools have been established to align and assemble second generation sequencing data,
specialized thinking is required to appropriately apply these technologies to polyploid genomes. A
number of factors are of great importance in sequencing complex genomes, and while it should be
both feasible and valuable to leverage second generation sequencing technologies, the wheat
community has been slow to do so.
This thesis describes a new approach to sequencing the wheat genome that utilises second
generation sequencing technologies and syntenic relationships within the grasses to produce gene-
based genomic scaffolds of wheat and demonstrates the application of this approach to extend
wheat crop improvement and our understanding of wheat genome evolution. This approach was
initially developed, applied, and validated using second generation sequencing data from isolated
wheat chromosome arm 7DS, through which the capacity of this approach to assemble all or nearly
all wheat genes was demonstrated. This approach was subsequently applied in chromosome arm
7BS to delimit a previously identified translocation within the range of a few genes and to predict a
total gene count in wheat of ~77,000 genes. Finally, the approach was applied to all of the wheat
group 7 chromosomes providing the basis for the first assembly comparison of wheat’s sub-
genomes, which identified dispersion as one of the key factors that have driven genome
fractionation in the recent evolution of the hexaploid wheat genome. The syntenic builds produced
by this approach have been made publicly available through a user-friendly resource that simplifies
the access to functional information. The public accessibility of the data provides a powerful
resource to support wheat crop research and improvement as a template for varietal polymorphism
prediction and transcriptomic analysis.

While a number of challenges exist for polyploid crop improvement, the syntenic build
approach provides a strong basis upon which not only to conduct crop improvement research but
also to investigate polyploid genome evolution. The methodology can be further improved and
extended in a number of ways, and will therefore be a valuable approach to assist both wheat
improvement and future genome sequencing efforts alike.
Keywords
Triticum aestivum, wheat, genome, second generation sequencing, polyploidy, fractionation,
genome assembly, bioinformatics, synteny, evolution
Australian and New Zealand Standard Research Classifications (ANZSRC)
0604 Genomics 40%, 0607 Plant Biology not elsewhere classified 30%, 0803 Bioinformatics 30%

Table of Contents
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
List of Abbreviations
1 Introduction and Literature Review
1.1 Hexaploid bread wheat
1.2 Polyploid plant genomics
1.3 Plant genome sequencing
1.4 Second-generation sequencing (2GS)
1.5 Principles for complex genome sequencing
1.6 Summary
1.7 References
2 Materials, Methods, and Validation
2.1 Architecture and development of critical tools
2.2 Issues of chromosome arm 2GS data
2.3 Synteny-based assembly validation
2.4 Synteny-based assembly with SynBuilder.py
2.5 References
3 Sequencing and assembly of low copy and genic regions of isolated Triticum aestivum chromosome arm 7DS
3.1 Journal cover art
3.2 Summary
3.3 Introduction
3.4 Results
3.5 Discussion
3.6 Experimental procedures
3.7 Acknowledgements
3.8 References
3.9 Supplementary information
4 Sequencing wheat chromosome arm 7BS delimits the 7BS/4AL translocation and reveals homoeologous gene conservation
4.1 Abstract
4.2 Introduction
4.3 Materials and methods
4.4 Results
4.5 Discussion
4.6 Conclusion
4.7 References
5 Dispersion and domestication shaped the genome of bread wheat
5.1 Introductory paragraph
5.2 Main text
5.3 Online methods
5.4 Acknowledgements
5.5 References
6 WheatGenome.info: an integrated database and portal for wheat genome information
6.1 Abstract
6.2 Introduction
6.3 Database contents
6.4 Conclusions and future directions
6.5 Funding
6.6 Acknowledgements
6.7 References
7 Next-generation sequencing applications for wheat crop improvement
7.1 Abstract
7.2 Introduction
7.3 Wheat genomics
7.4 Analysis of the wheat transcriptome
7.5 Wheat genetic marker discovery
7.6 Conclusions and future directions
7.7 Literature cited
8 Concluding remarks and future directions
8.1 Concluding remarks
8.2 Future directions
8.3 References
9 Appendices
9.1 Appendix 1 – Future tools for association mapping in crop plants
9.2 Appendix 2 – Targeted identification of genomic regions using TAGdb
9.3 Appendix 3 – Discovery of salinity tolerance gene orthologs using TAGdb (Poster)
9.4 Appendix 4 – Second generation sequence analysis of wheat chromosome 7DS (Poster)
9.5 Appendix 5 – Applying second generation sequencing technology in the assembly and analysis of the wheat group 7 chromosomes (Poster)
9.6 Appendix 6 – Applying Second Generation Sequencing Technology for the Analysis of Isolated Wheat Chromosomes (Poster)
9.7 Appendix 7 – The genome of the mesopolyploid crop species Brassica rapa
9.8 Appendix 8 – Gene content, loss, conservation, and genetic variation among Triticum aestivum group 7 chromosomes (Poster)
9.9 Appendix 9 – Dispersion and domestication shaped the genome of bread wheat (Chapter 5 Supplementary Information)

List of Figures
Figure 1-1 Schematic representation of the evolutionary history of wheat species
Figure 1-2 Data structures based on a prefix trie
Figure 2-1 Schema for TAGdb architecture
Figure 2-2 TAGdb front page
Figure 2-3 Schema of MetaDB database structure
Figure 2-4 MetaDB process pipeline schema
Figure 2-5 Insert size histograms of 7DS paired-end library vs. 2 different queries
Figure 2-6 Insert size histogram of 7DS mate-pair library vs. good query
Figure 2-7 TAGdb results page displaying 7DS reads aligned to OVP1
Figure 2-8 Geneious display of OVP1 gene aligned with 7DS contigs
Figure 3-0 Cover image from Plant Biotechnology Journal Volume 9, Issue 7
Figure 3-1 Heat maps of wheat 7DS read pairs mapped onto O. sativa and B. distachyon
Figure 3-2 Wheat 7DS reads mapped to a genomic region of B. distachyon
Figure 3-3 GBrowse2 view of annotated wheat 7DS syntenic build
Figure 3-4 CMap view comparing 7DS syntenic build with B. distachyon and Ae. tauschii
Figure 3-S1 Histogram of bin-mapped cDNAs aligned to 7DS assembled contigs
Figure 4-1 Heatmaps of 7BS and 7DS read paired coverage against B. distachyon genome
Figure 4-2 Bin-mapped loci RBB hits with wheat 7BS and 7DS assembled contigs
Figure 4-3 Venn diagram displaying the common genes between 7BS and 7DS
Figure 4-4 Comparison of B. distachyon chromosome 1 with 7DS, 7BS, and 4AL
Figure 5-1 Venn diagram representing genes present on 7A, 7B and 7D
Figure 5-2 SNP distribution across the syntenic builds of 7A, 7B and 7D
Figure 5-Supp 1 Histogram of bin-mapped cDNAs aligned to 7A, 7B, and 7D
Figure 5-Supp 2 Network image representing gene networking on 7A, 7B, and 7D
Figure 6-1 Example of detailed information for 7AS syntenic build from GBrowse2
Figure 6-2 Screenshot of TAGdb showing alignment of wheat short reads
Figure 6-3 CMap3D view of 7DS syntenic build, Ae. tauschii, B. distachyon
Figure 6-4 The wheat autoSNPdb web interface displaying wheat predicted SNPs
Figure 7-1 Graphical representation of possible NGS data from wheat homoeologs
Figure 7-2 Screenshot of autoSNPdb displaying 10 SNPs in a wheat gene
List of Tables
Table 5-1 Syntenic Build Summary Table
Table 5-2 Enriched GO terms of wheat sub-genomes
Table 5-3 Sub-genomic varietal SNP profiles from 4 Australian cultivars
Table 5-Supp 1 Chromosome arm assembly and syntenic build full details
Table 5-Supp 2 Chromosome arm bin-mapped cDNA RBB alignment results
Table 5-Supp 3 Top 10 GO functional annotation clusters for group 7 chromosomes
Table 5-Supp 4 Number of reads mapped from 4 Australian varieties
Table 5-Supp 5 SNPs called between 4 Australian wheat cultivars
Table 6-1 Summary of wheat group 7 chromosome data available
Table 7-1 Cumulative volume of NGS data in NCBI SRA

List of Abbreviations
2GS second-generation sequencing
BAC bacterial artificial chromosome
bp base pairs
Gbp gigabase-pairs (billion base-pairs)
GO gene ontology
indel insertion or deletion mutation
IWGSC International Wheat Genome Sequencing Consortium
Mbp megabase-pairs (million base-pairs)
NGS next-generation sequencing
QTL quantitative trait loci
SNP single-nucleotide polymorphism
SRA short read archive
SynBuild synteny-based scaffolding assembly
WGD whole-genome duplication
WGS whole-genome shotgun

1 Introduction and Literature Review
1.1 Hexaploid bread wheat
1.1.1 Importance of wheat
Common bread wheat (Triticum aestivum) is a plant of great economic and social significance
throughout the world and has held this status for over 5000 years (Chantret et al., 2005). Since the
beginning of its cultivation during the Neolithic age, farmers have applied breeding techniques to
improve crop growth and yield to contribute to the survival of people groups around the world. The
wheat genome has evolved a high degree of complexity that confounds the application of modern
molecular biological techniques. Yet at a time when environmental challenges are on the rise with
problems such as drought, salinity and disease, it is critically important for farmers to have access to
new varieties of wheat that can yield good crops under challenging conditions.
Wheat is Australia’s largest grain crop, being around three times larger than Australia’s second
largest grain crop, barley, both in terms of land area usage and production volume (Australian
Bureau of Statistics, 2010). On a four-year average over the years 2006 to 2010, Australia was the
9th largest producer of wheat, 16th largest consumer of wheat, and 5th largest exporter of wheat in
the world (United States Department of Agriculture Foreign Agricultural Service, 2010). It is
estimated that wheat annually accounts for $5 billion of the Australian economy and in 2009 wheat
exports alone brought $4.9 billion into Australia from overseas (Australian Bureau of Statistics,
2010, Commonwealth Scientific and Industrial Research Organisation, 2010).
In addition to its importance to the Australian economy, wheat plays an important social role in
Australia. Primary production occurs predominantly in rural and remote regions of Australia. As a
consequence of the locality of wheat production and its value, wheat is a fundamental pillar for a
number of communities that are dependent on the continued local farming of this crop to sustain the
community (Queensland Department of Primary Industries and Fisheries, 2009). T. aestivum is a
highly valuable species to Australia in both social and economic terms.
Beyond Australia, wheat is one of the most important plants in the world, with wheat
accounting for nearly 20% of the world’s daily food consumption by energy, second only to rice by
a very narrow margin, and accounting for the greatest volume of agricultural production of a single
crop internationally (Food and Agriculture Organisation of the United Nations, 2012). With the rise
of developing economies in Asia, many Asian populations that have previously relied on rice as
their primary food source are moving towards a bread-based diet, which is likely to drive the
consumption of wheat even higher in coming decades (Miskelly, 2005).
While generating food products and grain feed remains the predominant use of wheat crops in
Australia and throughout the world, wheat is now being used alongside a number of grain and other
crops in the production of biofuels, namely ethanol (Biofuels Taskforce, 2005). The aim of new
technologies in grain crops is to utilise waste products from the milling of grain for other
applications, such as generating ethanol; however, current technologies in this area are diverting some wheat
from food and grain feed production in Europe, Canada, and China (Smith, 2006, Balat and Balat,
2009). As demand for renewable energy sources continues to grow, it is possible that demand for
wheat for ethanol production will also increase. With a growing population nationally
and internationally it is unlikely that demand for wheat as a food source will decrease, resulting in a
net increase in demand for wheat crops. Yet as demand increases, the ability to supply faces a number
of environmental pressures and challenges.
1.1.2 Environmental and growth challenges in wheat production
Wheat is grown in countries with some of the harshest climates in the world including Syria,
Mexico, throughout Africa, and in Australia (Food and Agriculture Organisation of the United
Nations, 2012). Australia is world-renowned for its harsh and varied climate, from droughts to
flooding rains. The world is currently experiencing significant pressure from climate change with
intense weather patterns on the increase, an intensification of both wet and dry climate events
(McCarthy et al., 2001). Both as a consequence of this and in the context of widespread drought
throughout the country, crops are being required to grow under extremely dry conditions. A second
consequence of dry conditions is an increase in the presence of soil components that have
adverse effects on crop health and yield. Examples are boron and salt, both of which can be
toxic to plants at moderate concentrations (Schnurbusch et al., 2007, Byrt and Munns, 2008).
Plants are required to adapt to survive such conditions, either by better excluding compounds from
uptake or by intracellular processing of these compounds to mitigate harmful effects (Munns and
Tester, 2008).
A number of diseases account for a significant loss of wheat crops in Australia. Yellow spot,
stripe rust, Septoria nodorum blotch, crown rot, and other reportable diseases account for $913
million of crop losses annually (Murray and Brennan, 2010), with the first three diseases accounting
for half of this loss. This loss equates to nearly 20% of total annual wheat exports from Australia
(Australian Bureau of Statistics, 2010). Internationally, rusts, Septoria spp., and other pathogens
contribute to losses of about 29% (Oerke and Dehne, 2004), highlighting the importance of
developing resistant lines.
While some existing varieties demonstrate tolerance for drought, salinity, and boron, as well as
the efficient use of nitrogen, resistance to disease, and other issues not mentioned here, optimized
introgression of these capacities within commercial wheat varieties is necessary. This will ensure
the continued success of T. aestivum crops in Australia and throughout the world.
1.2 Polyploid plant genomics
Flowering plants demonstrate an exceptional capacity for genome duplication with a large
number of crop species possessing polyploid genomes including varieties of wheat, canola, maize,
sugarcane, potato, and strawberry. Indeed, the genomic comparison enabled by the plant genome
sequences published over the last decade suggests that ancient genome duplication has occurred in
nearly all angiosperms (reviewed in Doyle et al., 2008, Soltis and Soltis, 1999).
1.2.1 Types of polyploidy
Polyploidy is a term frequently used to describe genome duplication within a species. There are
in general terms two types of polyploid genomes: allopolyploid and autopolyploid (reviewed in
Soltis et al., 2004).
Allopolyploidy is the result of a hybridization event of two often closely related species in
which one or both contribute either their entire diploid genome or their haploid genome which is
then doubled in subsequent generations (reviewed in Soltis et al., 2004). Examples of allopolyploid
plant species include tetraploid species such as Brassica napus (oilseed rape/canola), Triticum
turgidum (durum wheat), Gossypium hirsutum (cotton) and Zea mays (maize) (Eckardt, 2001,
Schnable et al., 2009, U, 1935, Wendel and Cronn, 2002), as well as hexaploid species such as
Avena sativa (oat) and octoploid species such as Fragaria grandiflora (strawberry) (Devos and
Gale, 1997, Shulaev et al., 2011).
Autopolyploidy describes the change in which a species multiplies its own genome (reviewed in
Soltis et al., 2004). This often occurs in plants where the entire genome is duplicated or triplicated
with a key example being Musa sapientum (banana) (Daniells et al., 2001).
It can sometimes be difficult to distinguish between allopolyploidy and autopolyploidy in plants
with a particularly high degree of ploidy. For example, modern sugarcane crop varieties are
generated by crossing the two sugarcane species, Saccharum officinarum (2n=8x=80, x=10) and
Saccharum spontaneum (2n=40-128, x=8), which produces an aneuploid crop containing a variable
number of chromosomes (Grivet and Arruda, 2002). In the above karyotype nomenclature, ‘n’
represents the haploid number of chromosomes for the species while ‘x’ represents the base number
of chromosomes for each sub-genome. Sugarcane varieties are typically propagated vegetatively,
thereby retaining their aneuploidy and confounding analyses which might indicate the species of
origin for its subgenomes.
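The karyotype notation above can be made concrete with a small calculation. The following is a minimal Python sketch written for illustration only; the function name `ploidy_level` is hypothetical, and it simply divides the somatic chromosome number (2n) by the base chromosome number (x), using the sugarcane values quoted in the text.

```python
from fractions import Fraction


def ploidy_level(somatic_count: int, base_number: int) -> Fraction:
    """Return the ploidy level k implied by the notation 2n = kx.

    somatic_count is the 2n chromosome number and base_number is x.
    A Fraction is returned so that non-integer results (for example
    in aneuploid material) are not silently rounded.
    """
    if somatic_count <= 0 or base_number <= 0:
        raise ValueError("chromosome numbers must be positive")
    return Fraction(somatic_count, base_number)


# Saccharum officinarum: 2n = 8x = 80 with x = 10 (values from the text).
officinarum = ploidy_level(80, 10)

# Saccharum spontaneum: 2n = 40-128 with x = 8, so the ploidy level
# spans a range rather than a single value.
spontaneum_low = ploidy_level(40, 8)
spontaneum_high = ploidy_level(128, 8)

if __name__ == "__main__":
    print(f"S. officinarum ploidy: {officinarum}x")
    print(f"S. spontaneum ploidy: {spontaneum_low}x to {spontaneum_high}x")
```

Under this notation S. officinarum is octoploid, while the quoted range for S. spontaneum corresponds to ploidy levels from 5x to 16x, which illustrates why the sub-genome composition of hybrid cultivars is hard to resolve.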
1.2.2 Wheat genome size, polyploidy, and complexity
Current estimates suggest the genome size of common bread wheat is approximately 17 billion
nucleotides (Paux et al., 2008). While it is difficult to accurately estimate the number of genes in a
species without a reference genome sequence, current estimates suggest there are between 77,000
and 295,900 genes in T. aestivum (Paux et al., 2006, Rabinowicz et al., 2005, Berkman et al., 2012).
Given wheat’s duplicated genomes, it is likely that not all of these genes are expressed but rather
include a large number of pseudo-genes (Wicker et al., 2011).
T. aestivum is an allohexaploid species of wheat, meaning that its genome consists of six sets of
chromosomes originating from three distinct diploid genomes yet functions much like a typical
diploid genome (Gill et al., 2004). This diploid-like function is generally attributed to the Ph1
locus, which is associated with the prevention of chromosome pairing between the sub-genomes
(Segal et al., 1997, Griffiths et al., 2006). In the case of T. aestivum, the three diploid donor species
each had seven pairs of chromosomes, resulting in 21 pairs of chromosomes in common bread
wheat (2n=6x=42). The donor species are proposed to have diverged from an ancestral diploid
species between 2.5 and 6 MYA. This was followed by an inter-species hybridisation event which
combined the genomes of Triticum urartu (AᵘAᵘ) and an unidentified species (BB) that bears high
similarity to Aegilops speltoides between 0.5 and 3 MYA, to produce the allotetraploid genome of
wild emmer wheat or Triticum turgidum (AᵘAᵘBB) (Eckardt, 2001, Huang et al., 2002, Chantret et
al., 2005). Following the domestication of wheat, the second inter-species hybridisation event
occurred between T. turgidum (AᵘAᵘBB) and Aegilops tauschii (DD), which produced the
allohexaploid genome of T. aestivum (AᵘAᵘBBDD) 7-10 KYA (Chantret et al., 2005). The above is
displayed in Figure 1-1 below, taken from Chantret et al. (2005).

Figure 1-1 Schematic Representation of the Evolutionary History of Wheat Species (Triticum and Aegilops) (taken from Chantret et al., 2005).
The process of genome duplication has the consequence that many genic regions, which
would typically be considered unique in a diploid genome, have highly homologous counterparts
in syntenic parts of the genomes from the other donor species. These regions of
homology between donor genomes are said to be “homoeologous”. As would be expected, the
presence of homoeologous chromosomes within a genome causes some difficulties in identifying
the precise location of genes with homoeologs; however, homoeologous chromosomes may be
distinguished using a number of molecular biology techniques (Pedersen and Langridge,
1997, Gill et al., 1991).
In addition to the presence of multiple genomes within T. aestivum, it is estimated that between
75% and 90% of the genome sequence consists of repetitive sequences (Flavell et al., 1977,
Wanjugi et al., 2009). This repetitive DNA is predominantly found to consist of transposable
elements (TEs) with some low-complexity repeats. This adds a significant difficulty to the problem
of elucidating a genome sequence for bread wheat, as it means that the relatively unique genic
regions, for which it is easiest to assemble shotgun DNA sequence, are interspersed with long
regions of repetitive DNA. It is understood that the proliferation of transposable elements within the
wheat genome is a key element in the overly large size of the wheat genome and apparent
accelerated breakdown of gene collinearity (Wicker et al., 2010, Berkman et al., 2012, Wicker et
al., 2011).

1.2.3 Impacts of polyploidy
Given the abundance of genome duplication in angiosperms, it is quite clear that this
evolutionary phenomenon confers an advantage. It has been suggested that the duplication of a
whole genome provides a basis for the differentiation of duplicated genes, resulting in sub- or
neo-functionalisation: a process in which a duplicated gene evolves either to specialise in a
sub-function that it previously possessed or to develop an entirely new function missing from the
donor genomes (Flagel et al., 2008, Force et al., 1999). An example of neo-functionalisation is the
Ph1 gene in wheat which has been identified and characterized as the basis of preventing
homoeologous chromosome pairing, that is the pairing of orthologous chromosomes from separate
donor genomes, during meiosis (Griffiths et al., 2006, Segal et al., 1997). Ph1 has been
characterized as present in the hexaploid and tetraploid wheat on chromosome 5B, with recent
evidence indicating that this function came about following genome duplication in the Triticeae
(Griffiths et al., 2006, Hao et al., 2011).
Another hypothesis as to the impact of polyploidisation on the gene content of a species is the
gene dosage balance hypothesis. This hypothesis suggests that following a whole genome
duplication event, neo-functionalisation is suppressed due to the organism’s retention of balanced
dosage impacts of genes operating within the same protein-interaction networks (Freeling and
Thomas, 2006). The conclusion of this hypothesis is that the function of homoeologous genes
which are highly networked with other genes will be retained from both genomes of origin in order
to preserve the overall function of the gene-network (Thomas et al., 2006). It has recently been
suggested that this dosage balance hypothesis is true in circumstances where the selective
environment in which the species is located following whole genome duplication remains
consistent, while changes in the selective environment may result in the neo- or sub-
functionalisation of genes with specialized and potentially non-networked function to gain a
selective advantage (Bekaert et al., 2011).
While genome duplication provides an advantage to angiosperms, neopolyploids present a
challenge to genetic and genomic research because they confound techniques that can be applied
successfully to diploid genomes. For instance, while detection of genetic polymorphisms as
molecular markers can be applied relatively simply in diploids, the issue of “marker dosage” must
be considered when looking at a polyploid species (Bundock et al., 2009). Rather than a marker
having a clearly identifiable physical location at a single locus, as in a diploid, a marker can easily
appear on homoeologous chromosomes, preventing clear association with traits. Similarly, in
genome sequencing efforts, trying to assemble whole-genome shotgun data from a polyploid
genome will almost inevitably result in the construction of chimeric contiguous sequences
containing a combination of homoeologous chromosomes, since the sequences of homoeologous
chromosomes are often too similar to distinguish during assembly. In wheat, the additional
problem of repetitive DNA within the genome further increases the challenge involved in the task
of assembling genomic sequence data for this species.
In short, polyploid crop species require a new way of thinking to enable the accurate
determination of the processes involved in their evolution and the genetic basis of traits relevant to
their cultivation.
1.3 Plant genome sequencing
The value of genome sequencing has been demonstrated in a variety of crop species. Indeed, in
the face of so many external pressures on the growth of crops it is vital for breeders to have access
to high quality biological reference data.
The availability of a reference genome sequence for crops such as rice, maize and sorghum has
paved the way for a deeper understanding of the association between genetic sequence information
and desirable biological traits (International Rice Genome Sequencing Project, 2005, Schnable et
al., 2009, Paterson et al., 2009). Such knowledge has allowed for the development of genetic
markers and a greater awareness of the physiological basis for desirable traits, in turn providing a
feedback loop for breeders and researchers to work on crop improvement in the field and
laboratory. Without a reference genome sequence, the scope of crop improvement research is
limited to a much narrower focus and efforts to understand the impacts of gene regulation and
interactions between genes are hindered. A complete genome sequence for T. aestivum will be an
extremely helpful tool in the continued improvement of this key species. Even so, given the
enormous time and effort required to determine the genome sequence of a plant species, the
question must be and is often asked ‘why should we sequence genomes?’
1.3.1 Sequenced plant genomes
There are 34 plant species for which the genome sequence has already been determined, of
which 33 are listed on Phytozome (Goodstein et al., 2011, van Bakel et al., 2011). While the
sequences for these species have been made publicly available, only 24 have been published
through a peer-reviewed process. Published plant genomes include an increasing
number of crop species such as grapevine, papaya, cucumber, Chinese cabbage, potato, rice,
sorghum, maize, soybean, castor bean, apple, cocoa, and strawberry (Argout et al., 2011, Chan et
al., 2010, Goff et al., 2002, Huang et al., 2009a, Jaillon et al., 2007, Ming et al., 2008, Paterson et
al., 2009, Schmutz et al., 2010, Schnable et al., 2009, Shulaev et al., 2011, Velasco et al., 2010,
Wang et al., 2011, Xu et al., 2011, Yu et al., 2002).
The availability of genome sequences has facilitated the improvement of crops in two ways:
molecular breeding and genetic engineering (reviewed in Varshney et al., 2011). Molecular
breeding is the process of identifying desirable traits within a species along with molecular markers
for the traits, followed by breeding and testing varieties of a crop using markers to confirm that they
contain the desired trait. Genetic engineering is the process of identifying and characterising a gene
of interest, either from within or outside a species, and applying modern molecular techniques to
introduce the gene into the genome of a target variety. Both of these approaches are enhanced by a
reference genome sequence through improved quantitative trait loci (QTL) and association analysis,
candidate gene characterisation, gene expression analysis, and comparative genomics, all of which
have increased our understanding of plant developmental processes and environmental responses
critical to crop productivity (Fabi et al., 2010, Satish et al., 2009, Paull et al., 2008, Degenkolbe et
al., 2009, Li et al., 2006). Knowledge of the genomic basis for desirable physiological traits
supports breeders in their crop improvement endeavours.
The lack of a reference genome sequence for wheat hinders research into crop improvement and
the production of superior wheat cultivars. With rapid advances in DNA sequencing and
bioinformatics techniques, there are several independent efforts to establish comprehensive wheat
genome resources, though there is debate over exactly what is required and how best to produce the
resources required for wheat crop improvement.
1.3.2 Genome sequencing efforts in wheat
The establishment of an international wheat genome-sequencing project was suggested at a
wheat genomics workshop held in November 2003 (Gill et al., 2004). A BAC-by-BAC approach
was adopted, starting with a 5-year pilot project to produce a high-resolution physical map with
anchored BACs, followed by the sequencing of a BAC minimum tiling path (Gill et al., 2004). The
International Wheat Genome Sequencing Consortium (IWGSC) was set up the following year with
the goal of producing a complete and annotated genome sequence for T. aestivum cv. Chinese
Spring (www.wheatgenome.org). In the seven years since its establishment the consortium has
produced BAC libraries for all individual wheat chromosome arms (Šafář et al., 2010), physical
maps of chromosome 3B (Paux et al., 2008) and 3DS (Fleury et al., 2010), data analysis pipelines
(Sabot et al., 2005), and sub-genome-specific molecular markers (You et al., 2011), supported by
numerous book chapters, review articles, workshops and conference presentations.
Whole chromosome shotgun (WCS) sequencing was suggested as an alternative to the
BAC-by-BAC approach (Gill et al., 2004); however, the BAC-by-BAC approach was adopted because,
given the technology available at the time, it was thought likely to provide the most thorough and robust
reference genome sequence. Unfortunately, it may well be many years before such a sequence is
available to researchers due to the complexity of coordinating research funding across international
groups. Commercial investment in the IWGSC project also delays broad application of the genome
data, with access to the genome sequence being restricted to investors and core participants for a
period prior to public release. Given the critical economic and social importance of bread wheat
worldwide, and the large quantities of global public funds supporting the project, it could be argued
that the draft genome data should be immediately released, as it would add great value to crop
research and improvement conducted by researchers and breeders who are not directly contributing
to the actual sequencing project.
At the time of the establishment of the IWGSC, whole genome shotgun (WGS) sequencing of
the bread wheat genome was not considered feasible due to the size and complexity of the genome
(Gill et al., 2004). A draft wheat genome assembly has been produced from 5x Roche 454 WGS
sequencing (http://www.cerealsdb.uk.net/), as well as assembly of the donor species of the wheat D-
genome, Aegilops tauschii (http://www.cshl.edu/genome/wheat). While data from both of these
projects were made available in 2010, their analysis has not been published by peer review. This
highlights the conclusions of Gill et al. (2004), that WGS sequencing of wheat is confounded by its
complexity, and that this is likely to hold true even with recent advances in sequencing technology.
1.3.2.1 Chromosome-arm sequencing
Flow cytometry was developed some time ago as a means by which chromosomes can be
separated based on size differences (Laat and Blaas, 1984). This has been optimised to sort plant
chromosomes with high levels of purity (Kubaláková et al., 2000) and was found to be applicable to
a number of important plant species including Oryza sativa (Lee and Arumuganathan, 1999), Cicer
arietinum (Vlacilova et al., 2002), Secale cereale (Kubaláková et al., 2003), Hordeum vulgare
(Lysák et al., 1999), and the complex genome of T. aestivum (Vrána et al., 2000).
This method was initially proved effective only in isolating chromosome 3B of wheat;
however, the use of cytogenetic stock with chromosome arm deletions was found to assist in the
separation of individual chromosome arms for the whole genome (Kubaláková et al., 2002). The
broad range of cytogenetic stock in wheat has been a critical step in the separation of wheat’s
homoeologous genomes. From the separation of chromosome 3B, a BAC library was generated
representing the complete sequence of the chromosome arm at a depth of 6.2x coverage (Šafář et
al., 2004), providing the foundation to the mapping and assembly of this chromosome (Doležel et
al., 2004). This approach has now been applied to generate BAC libraries for the complete genome
of T. aestivum (Šafář et al., 2010).
Despite the availability of isolated chromosome arms for the complete wheat genome at the
commencement of the IWGSC, sequencing of complete isolated chromosomes was not
attempted until very recently. Barley researchers were the first to apply sequencing technology
directly to an isolated barley chromosome in order to analyse the gene content of chromosome 1H
(Mayer et al., 2009). Similar work has now been completed on all chromosomes of barley to
determine barley’s overall gene content (Mayer et al., 2011).
Work described in this thesis represents the first published sequencing of isolated wheat
chromosome arms (Berkman et al., 2011), an approach that has now been applied by a number of
groups to wheat chromosome arms (Berkman et al., 2012, Hernandez et al., 2012, Wicker et al.,
2011). The ability to shotgun assemble highly abundant repetitive regions remains a challenge,
even within isolated chromosome arms. This approach is therefore limited to the analysis of unique
and low copy gene rich regions (Berkman et al., 2011, Mayer et al., 2011, Wicker et al., 2011). This
limitation is balanced by the benefits of speed and low cost, enabling the rapid production of gene
scaffolds which can be applied for wheat crop improvement research. The fast, low-cost
sequencing has come about as a result of new DNA sequencing technologies that have
revolutionized biological research over recent years.
1.4 Second-generation sequencing (2GS)
Original Sanger sequencing methods provided a way to obtain reasonably long sequences, up to
several hundred nucleotides, with a high degree of certainty regarding the sequence integrity.
Drawbacks of this method are the time required to generate the sequence data and the limited
ability to parallelise the process, which reduces the overall capacity to generate data in high volumes.
Over the last few years, a number of second-generation sequencing (2GS) methods have been
produced in a series of iterations that have introduced continually increasing volumes of sequence
data with increasing quality and read length. 2GS is defined in this thesis as the generation of
sequencing technologies immediately following Sanger methods, which have commenced an
exponential acceleration of DNA and RNA sequencing. 2GS is embodied in the following
platforms. The Roche 454 GS FLX Titanium technology (Margulies et al., 2005) is currently
capable of producing a million reads up to 1,000 nucleotides in length in a single run of 23 hours
(http://www.454.com). Applied Biosystems’ SOLiD platform currently has the ability to produce
over 20 billion nucleotides per day, with a read length of up to 75 nucleotides
(http://www.appliedbiosystems.com). Illumina’s HiSeq2000 is capable of producing 600 billion
nucleotides of sequence data with a read length of 100 nucleotides in a run of ~11 days
(http://www.illumina.com), and Illumina have recently announced a kit for their MiSeq platform
that can produce paired reads up to 250 bp long.
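To put these throughput figures in the context of the approximately 17 Gbp wheat genome, the expected depth of coverage can be estimated as total sequenced bases divided by genome size. The sketch below is illustrative only: the function and constant names are hypothetical, the per-run yields are the nominal maxima quoted above rather than measured outputs, and usable coverage would be lower after quality filtering.

```python
import math

# Approximate bread wheat genome size quoted in Section 1.2.2 (17 billion nucleotides).
WHEAT_GENOME_BP = 17_000_000_000


def coverage_depth(total_bases, genome_size=WHEAT_GENOME_BP):
    """Average depth of coverage: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def runs_for_depth(target_depth, bases_per_run, genome_size=WHEAT_GENOME_BP):
    """Smallest whole number of runs whose combined yield reaches target_depth."""
    return math.ceil(target_depth * genome_size / bases_per_run)


# Nominal per-run yields taken from the figures quoted above (assumed maxima).
HISEQ2000_RUN_BP = 600_000_000_000   # 600 billion nucleotides per run
GS_FLX_RUN_BP = 1_000_000 * 1_000    # one million reads of up to 1,000 nucleotides

if __name__ == "__main__":
    print(f"HiSeq2000, one run: {coverage_depth(HISEQ2000_RUN_BP):.1f}x")
    print(f"GS FLX runs needed for 5x: {runs_for_depth(5, GS_FLX_RUN_BP)}")
```

Under these assumptions a single HiSeq2000 run corresponds to roughly 35x nominal coverage of the wheat genome, whereas at least 85 GS FLX runs at maximum read length would be needed to reach 5x.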
Following the development of 2GS technologies, another generation of sequencing technologies
is emerging. These new technologies promise greater volumes of sequence data and longer reads
than 2GS, which will have significant impacts on de novo sequence assembly (Imelfort and
Edwards, 2009, Paszkiewicz and Studholme, 2010). Two of the more recent sequencing
technologies are the Ion Torrent from Life Technologies and the SMRT (Single Molecule Real
Time) technology from Pacific Biosciences. Ion Torrent sequencing uses a semiconductor based
high density array of micro reaction chambers (http://www.iontorrent.com), producing sequence
reads of 100–200 bp, with up to 1 Gbp of data per run. During the sequencing reaction, the four
DNA nucleotides are flowed separately across the micro reaction chambers. The system records the
sequence by sensing the pH change when a hydrogen ion is released during extension of a
specific base. The error profile of this data is biased towards homopolymer errors with a per-base
accuracy of 98.897% for the first 100 bp (Rothberg et al., 2011), and the technology has significant
potential for cost effective resequencing and variant discovery. Pacific Biosciences produces one of
the first “third-generation” sequencing systems to go on the market (Eid et al., 2009). Read lengths
of around 1,000 bp have been reported (http://www.pacificbiosciences.com) with the potential to
take snapshots of shorter reads over an extended fragment of over 10,000 bp. Little is known about
the error profile of the data, but it would be expected that missing bases and hence
insertion/deletion (indel) calling will be a likely issue with this technology.
The explosive growth in sequencing technologies makes future predictions problematic, though
we can be certain that the increase in sequence data volumes, read lengths, and data quality will
continue. While the quality of sequence data produced by each of these technologies is not as high
as that of traditional Sanger sequencing, the continuous improvements and updates to sequencing
protocols are yielding sequences of increasingly high quality across all platforms. Most of these
technologies are also capable of producing paired read sequences with a variety of fragment/insert
sizes, which helps analysis algorithms to overcome repetitive sequence issues (Robison, 2010).
1.4.1 Critical issues
Each of the abovementioned technologies displays a distinct error profile requiring different
approaches to address. The GS FLX technology has difficulty correctly interpreting homopolymer
runs of nucleotides, often inserting or deleting bases from the sequence output in these regions
(Mardis, 2008). In contrast, Illumina’s technology has a tendency to introduce C↔A and G↔T
substitutions (Erlich et al., 2008, Erlich et al., 2009), and nucleotides towards the 3’ end of
sequenced reads display a lower base-calling confidence, as reported frequently on community-
based websites such as SEQanswers (http://www.seqanswers.com). Regarding the ABI SOLiD
platform, a similar decrease in quality towards the 3’ ends of sequenced reads is observed (Flicek
and Birney, 2009). Both the Illumina and ABI SOLiD technologies also appear to have a bias
towards producing low sequence coverage of AT-rich repetitive sequences (Harismendy et al.,
2009). When applying each type of data to specific biological questions, it remains an important
step to understand each technology’s individual error profile to be able to identify errors accurately.
While error rates published by technology suppliers typically appear as low as 0.01%
(http://www.appliedbiosystems.com), it is important to understand what this means. One study
found that the Illumina GA and SOLiD platforms respectively had only 43% and 34% of the reads
pass quality filters and subsequently align to a reference sequence (Harismendy et al., 2009). This
increased error rate from that quoted by technology vendors does not render the technologies
redundant, since data volumes produced by second-generation sequencers are still greater than
traditional methods even after stringent quality filtering. With error rates improving and sequence
volume continuing to increase, these technologies will continue to have value in a variety of
applications into the future (Harismendy et al., 2009, Imelfort et al., 2009, Duran et al., 2010,
Edwards and Batley, 2010).
To complicate issues further, a number of discrepancies exist within commonly used file
formats. For example, the FASTQ file format, which is used to relate read sequence data with
Phred nucleotide quality scores, exists in three different forms. The first is the original Sanger
FASTQ format, which was produced to retain quality information relating to Sanger sequences.
Secondly, when Solexa produced its Genome Analyzer technology it also introduced a slightly
modified version of the FASTQ format. Finally, when Illumina purchased the Genome Analyzer
technology and updated this to the Genome Analyzer II, it released a third version of the FASTQ
format (Cock et al., 2010). To confound the situation even further, Illumina have recently updated
their CASAVA base-calling software to produce the original Sanger FASTQ file format. The
variations between these formats exist in the relationship between ASCII characters and Phred
quality scores, and while formulae have been published to relate the differing formats (Cock et al.,
2010), it remains difficult to distinguish between these three formats without prior knowledge.
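The relationship between ASCII characters and quality scores can be illustrated with a minimal Python sketch. The offsets and the Solexa-to-Phred conversion follow the description in Cock et al. (2010); the variant labels and function names are assumptions chosen for this illustration and do not belong to any particular tool.

```python
import math

# (ASCII offset, score type) for each FASTQ variant, after Cock et al. (2010).
# The dictionary keys are labels chosen for this sketch.
FASTQ_VARIANTS = {
    "sanger": (33, "phred"),    # original Sanger format
    "solexa": (64, "solexa"),   # early Solexa/Illumina format
    "illumina": (64, "phred"),  # later Illumina format
}


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_quality(quality_string, variant):
    """Return integer Phred scores for one FASTQ quality line."""
    offset, score_type = FASTQ_VARIANTS[variant]
    scores = [ord(ch) - offset for ch in quality_string]
    if score_type == "solexa":
        scores = [round(solexa_to_phred(q)) for q in scores]
    return scores


def compatible_variants(quality_strings):
    """List the variants whose ASCII range could contain the observed characters.

    Only the lowest character is inspected, so this is a heuristic check rather
    than a reliable detector.
    """
    lowest = min(ord(ch) for line in quality_strings for ch in line)
    if lowest < 59:   # below the Solexa minimum (score -5 at offset 64)
        return ["sanger"]
    if lowest < 64:   # below the later Illumina minimum (score 0 at offset 64)
        return ["sanger", "solexa"]
    return ["sanger", "solexa", "illumina"]
```

The same character therefore decodes to very different scores depending on the assumed variant, and a file containing only high-scoring characters is compatible with all three, which is why prior knowledge of the format is usually needed.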
In order to address the issues of error-rates and quality, a number of tools have been developed
to trim and/or filter reads based on quality, as well as undertake some basic statistical analysis of
read quality. One such tool is SolexaQA, which can generate statistics on FASTQ files to
understand the profile of a particular dataset and handle the filtering and trimming of 2GS data
based on quality scores (Cox et al., 2010). Another tool useful to undertake basic kmer analysis of
new datasets is Tallymer, which has been successfully applied in the analysis of repetitive elements
in crop genomes and can be similarly applied to raw and quality-filtered datasets (Kurtz et al.,
2008). While it is important to understand the data and minimise the inclusion of errors in data
analysis, it is even more critical to understand the biological question being pursued.
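As an illustration of quality-based trimming, the following Python sketch keeps the longest contiguous run of bases whose Phred scores meet a cutoff. This is one simple strategy written for this example and is not presented as the SolexaQA implementation; the function names and the default cutoff of 20 are assumptions.

```python
def longest_high_quality_segment(scores, cutoff):
    """Return (start, end) of the longest run with every score >= cutoff.

    The interval is half-open; (0, 0) is returned when no base passes.
    """
    best_start, best_end = 0, 0
    run_start = None
    # A sentinel score below the cutoff closes a run that reaches the read end.
    for i, q in enumerate(list(scores) + [cutoff - 1]):
        if q >= cutoff:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_end - best_start:
                best_start, best_end = run_start, i
            run_start = None
    return best_start, best_end


def trim_read(sequence, scores, cutoff=20):
    """Trim a read and its scores to the longest high-quality segment."""
    start, end = longest_high_quality_segment(scores, cutoff)
    return sequence[start:end], list(scores[start:end])
```

Reads that become shorter than a chosen minimum length after trimming would normally be discarded in a subsequent filtering step.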
1.4.2 Applications of 2GS
2GS applications can be divided into two main categories of questions: re-sequencing questions
and de novo sequencing questions.
In the category of re-sequencing questions, a wide variety of applications exist. Many
researchers are applying multiplex 2GS technologies to conduct high-throughput sequencing of a
target genomic region from a number of different samples (Parameswaran et al., 2007, Appleby et
al., 2009, Erlich et al., 2009). Alignment of such data to a reference sequence can provide insight
into the genomic basis for phenotypic differences. Increasingly, the sequencing of RNA using 2GS
followed by alignment to a reference sequence is replacing the application of microarray technology
and providing results with greater precision (Cloonan et al., 2009, Morozova et al., 2009). Another
application of 2GS is the analysis of non-coding RNAs, in particular microRNAs, and their
differential expression in various tissue types and individual samples. Analysis of non-coding
RNAs using 2GS is revealing regulatory mechanisms behind gene expression, a field providing
greater insight into biological function (Guffanti et al., 2009).
De novo sequencing using 2GS data has a large number of applications in areas of great
biological significance. A number of genome sequences published in the last few years have used
2GS data as the primary or sole technology for de novo sequencing. Examples
include the genomes of the giant panda (Li et al., 2010), Brachypodium distachyon (Vogel et al.,
2010), Cucumis sativus (cucumber) (Huang et al., 2009a), Homo neanderthalensis (Green et al.,
2010), and Brassica rapa (Wang et al., 2011). Many other projects are currently underway and
will likely be published in the coming years. In many circumstances, laboratories lack the funds to
produce a complete genome sequence for a species of interest that has no reference genome.
Despite the lack of a reference genome, such laboratories increasingly need to interrogate genomic
data. Publicly available 2GS data have recently been combined with algorithms to allow the
identification of PCR primers, followed by the amplification and Sanger sequencing of regions of
interest (Marshall et al., 2010). Also, de novo sequencing can be approached as a re-sequencing
effort based on syntenic regions from related species (Mayer et al., 2009). Another emergent de
novo method to compare 2GS datasets is differential kmer analysis, in which statistical comparisons
investigate the differences in kmer abundance between 2GS datasets and then determine the
functional relevance of the significantly different kmers. This approach is quite new, and references
describing this method should be forthcoming shortly.
Whether approaching questions from a perspective of re-sequencing or de novo application,
appropriate algorithms are required to ensure data is interrogated correctly.
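The first step of the differential kmer analysis mentioned above, counting kmers in two datasets and comparing their abundance, might look like the following Python sketch. The text does not specify the statistical test used, so this example only computes a log2 ratio of normalised counts with a pseudocount; the function names and the pseudocount are assumptions, and no significance testing is attempted.

```python
import math
from collections import Counter


def count_kmers(reads, k):
    """Count every kmer of length k across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def log2_abundance_ratios(counts_a, counts_b, pseudocount=1):
    """Return log2(frequency in A / frequency in B) for every observed kmer.

    Counts are normalised by the total kmer count of each dataset; the
    pseudocount avoids division by zero for kmers absent from one dataset.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    ratios = {}
    for kmer in set(counts_a) | set(counts_b):
        freq_a = (counts_a[kmer] + pseudocount) / total_a
        freq_b = (counts_b[kmer] + pseudocount) / total_b
        ratios[kmer] = math.log2(freq_a / freq_b)
    return ratios
```

Kmers with large positive or negative ratios would be the candidates passed on to a statistical test and then examined for functional relevance.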
1.4.3 Importance of Bioinformatics
The typical first step in applying 2GS data to a biological question, after filtering and/or
trimming data, is either the alignment of short-reads to a reference sequence or the assembly of
short-reads into contiguous sequences (contigs). A number of algorithms exist for both of these
tasks; however, these broadly fall into a few categories.
1.4.3.1 Alignment
Li and Homer (2010) have divided alignment algorithms into three categories: those
based on hash tables, suffix trees, or merge sorting.

Hash table alignment
Indexing based on hash tables stems from the most highly cited alignment algorithm, BLAST
(Altschul et al., 1990, Vouzis and Sahinidis, 2010). A hash table is a look-up table containing
sequence data, either query or reference, that is compared with other sequence data using a Smith-
Waterman alignment (Smith and Waterman, 1981, Li and Homer, 2010). For example, the original
BLAST implementation builds a hash table of the query sequence that uses kmers of a specified
size (word size) as the keys to values that describe the sequence position on the query, and this is
then searched through the reference database (Altschul et al., 1990). A number of other
implementations of this form of algorithm apply a spaced seed approach to alignment, including
GNUMAP (Clement et al., 2010), MAQ (Li et al., 2008a), CloudBurst, which runs on the MapReduce
framework (Schatz, 2009), PerM (Chen et al., 2009), RMAP (Smith et al., 2008, Smith et al., 2009), SeqMap
(Jiang and Wong, 2008), SOAP (Li et al., 2008b), and ZOOM (Lin et al., 2008). Given that the
spaced seed does not allow for gapped alignment, a number of other tools have implemented
gapped alignments, usually after seed extension, including AGILE (Misra et al., 2010), BLAT
(Kent, 2002), RazerS (Weese et al., 2009), SHRiMP (Rumble et al., 2009), and SSAHA2 (Ning et
al., 2001). In addition, recent algorithms such as Novoalign (Novocraft, 2010), CLC
Genomics Workbench (CLC Bio, 2010), and SHRiMP (Rumble et al., 2009) apply vectorisation,
which significantly improves performance compared with the BLAST
implementation.
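The hash table idea can be sketched in a few lines of Python. This example indexes the kmers of a reference (rather than the query, as in the original BLAST) and lets each exact kmer hit from a read vote for a candidate alignment start; it uses contiguous seeds, not spaced seeds, and omits the Smith-Waterman verification step. All function names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every kmer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Rank candidate alignment starts by the number of supporting seed hits.

    Each seed hit at reference position p from read offset o votes for the
    read starting at p - o. Returns (start, votes) pairs, best first.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            votes[ref_pos - offset] += 1
    return votes.most_common()
```

In a complete aligner the top-ranked candidates would then be extended and scored, for example with a Smith-Waterman alignment, to allow for mismatches and gaps.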
Suffix-based alignment
A suffix tree algorithm represents sequence data in a different manner to a hash table. Rather
than a sequence being stored as a key by string, a sequence is represented as a series of addresses in
a graph. Examples of various ideas for suffix-based sequence structures are displayed in Figure 1-2,
taken from Li and Homer (2010). Li and Homer (2010) describe algorithms in this category as
“reducing the inexact matching problem to the exact match problem” in two steps. First, these
algorithms identify exact matches; they then build inexact alignments supported by the exact
matches (Li and Homer, 2010). Algorithms that implement suffix/prefix based structures for
sequence alignment include Bowtie (Langmead et al., 2009), BWA (Li and Durbin, 2009), BWA-
SW (Li and Durbin, 2010), BWT-SW (Lam et al., 2008), MUMmer (Delcher et al., 2003), and
Tallymer (Kurtz et al., 2008).
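The suffix array and Burrows-Wheeler transform that underlie several of these tools can be reproduced for the example string AGGAGC used by Li and Homer (2010) with a short Python sketch. The construction below simply sorts all suffixes, which is adequate for illustration only; production aligners use far more memory-efficient constructions and compressed indexes. The function names are assumptions.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def burrows_wheeler_transform(text, sa):
    """Return the character preceding each sorted suffix.

    For the suffix starting at position 0, text[-1] is the terminator '$'.
    """
    return "".join(text[i - 1] for i in sa)


def suffix_array_interval(text, sa, pattern):
    """Return the (first, last) suffix array ranks whose suffixes start with pattern.

    A linear scan is used for clarity; real indexes find this interval by
    binary search or backward search on the BWT. Returns None if absent.
    """
    ranks = [rank for rank, start in enumerate(sa) if text.startswith(pattern, start)]
    return (ranks[0], ranks[-1]) if ranks else None


text = "AGGAGC$"  # '$' marks the end and sorts before A, C, G and T
sa = suffix_array(text)
bwt = burrows_wheeler_transform(text, sa)
interval = suffix_array_interval(text, sa, "AG")
```

Here every exact occurrence of a query corresponds to one contiguous interval of the suffix array, which is the property that allows exact matches to be found quickly before inexact alignments are built from them.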

Figure 1-2: Data structures based on a prefix trie (taken from Li and Homer, 2010). (A) Prefix trie of string AGGAGC, where
the symbol ^ marks the start of the string. The two numbers in each node give the suffix array interval of the substring represented by
the node, which is the string concatenation of edge symbols from the node to the root. (B) Compressed prefix trie by contracting
nodes with in- and out-degree both being one. (C) Prefix tree by representing the substring on each edge as the interval on the
original string. (D) Prefix directed word graph (prefix DAWG) created by collapsing nodes of the prefix trie with identical suffix array
interval. (E) Constructing the suffix array and Burrows-Wheeler transform of AGGAGC. The dollar symbol marks the end of the
string and is lexicographically smaller than all the other symbols. The suffix array interval of a substring W is the maximal interval in
the suffix array with all suffixes in the interval having W as prefix. For example, the suffix array interval of AG is [1, 2]. The two
suffixes in the interval are AGC$ and AGGAGC$, starting at position 3 and 0, respectively. They are the only suffixes that have AG as
prefix.
Merge-sorting alignment
Very few algorithms fit into this category; currently only Slider (Malhis et al., 2009) and
SliderII (Malhis and Jones, 2010) implement merge sorting for sequence alignment. These
algorithms differ from the others in accepting the Illumina GAII probability file (now redundant in recent
iterations of the technology) as input, rather than FASTQ or other directly sequence-based files, in order
to generate all probable reads and associate a probability value with each read. These reads are then
processed in a sorted sliding window manner based on specified parameters for window size to
identify sequence matches. The use of sequencing I/O removes the need for a hash table or suffix-
based structure, and is intended to improve SNP-calling based on 2GS data (Malhis et al., 2009,
Malhis and Jones, 2010).

2GS short-read alignment issues
Regardless of the algorithm implemented in alignment, it is known that accuracy may be
improved by using sequencing quality scores for individual nucleotides (Smith et al., 2008). Such
information allows a probabilistic approach to determining the effect of erroneous nucleotides
within error-prone 2GS data, allowing for the clearer identification of mis-alignments and SNPs.
The most commonly used current format for storing alignment data is SAM format, which allows
for integration of quality information to assist with alignment (Li et al., 2009).
It has been highlighted that, because many of the above-mentioned tools were built and optimised
for short-read alignment, such tools may become redundant in coming years as 2GS technologies
continue to produce longer read-lengths of better quality (Misra et al., 2010, Li and Homer, 2010).
Consequently, more work will be needed to optimise alignment tools for long reads as the
technology improves, and to continue developing sequence assembly tools
that can integrate data from a variety of sources (Misra et al., 2010, Imelfort and Edwards, 2009).
1.4.3.2 Assembly
Assembly algorithms can be broadly divided into two categories, being based either on the
Overlap-Layout-Consensus or the Eulerian methodologies (Imelfort and Edwards, 2009).
Overlap-Layout-Consensus assembly
The overlap-layout-consensus assembly method was first devised for the assembly of sequence
reads from traditional Sanger sequences in early sequencing projects. The method is based on the
process of identifying regions of overlap between multiple sequence reads followed by the layout
and calculation of a consensus sequence for the overlapping regions. Examples of early
implementations of this algorithm include ARACHNE (Batzoglou et al., 2002, Jaffe et al., 2003),
CAP3 (Huang and Madan, 1999), PCAP (Huang et al., 2003), and Phrap (Phil Green, Phrap
Documentation, www.phrap.org). This approach has been implemented for 2GS data in very few
algorithms, one example being Edena (Hernandez et al., 2008). A later extension of this
idea saw the introduction of a string graph, which allows the application of graph ideas similar to
the Eulerian method (Myers, 1995). This approach has subsequently been implemented in SaSSY,
an in-house algorithm for short reads that promises to generate more
robust contigs than current assembly methods (Imelfort and Edwards, 2009).
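The overlap and layout steps can be illustrated with a toy Python sketch that finds exact suffix-prefix overlaps between reads and greedily merges the best-overlapping pair. It assumes error-free reads from a single strand and skips the consensus step, in which real assemblers derive each base from a multiple alignment of the overlapping reads; it is not the method of any assembler named above, and the function names are assumptions.

```python
from itertools import permutations


def overlap(a, b, min_length):
    """Return the length of the longest suffix of a that is a prefix of b.

    Overlaps shorter than min_length are ignored and reported as 0.
    """
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            length = overlap(a, b, min_length)
            if length > best_len:
                best_len, best_a, best_b = length, a, b
        if best_len == 0:
            break  # no remaining overlaps; leave the contigs separate
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_len:])
    return contigs
```

The all-against-all comparison in the inner loop grows quadratically with the number of reads, which hints at why this approach becomes computationally demanding for the read volumes produced by 2GS.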

Eulerian assembly
Eulerian assembly methods have been heavily used in recent short-read assembly algorithms.
The idea was introduced by Idury and Waterman (1995) who suggested sequence assembly by
Eulerian tours of a graph based on equal length sequence fragments. These ideas were first
implemented in Euler (Pevzner et al., 2001) for traditional sequencing methods. The Eulerian
approach to assembly has since been implemented for 2GS data in ABySS (Birol et al., 2009),
AllPaths (Butler et al., 2008, Maccallum et al., 2009), EulerSR (Chaisson and Pevzner, 2008,
Chaisson et al., 2009), SHARCGS (Dohm et al., 2007), VCAKE (Jeck et al., 2007), and Velvet
(Zerbino and Birney, 2008). Velvet is currently one of the most popular assemblers and now has an
add-on algorithm, Oases, which has been developed for transcriptome assembly (Paszkiewicz and
Studholme, 2010). The popularity of Eulerian assemblers is probably due to the
ability of this method to overcome computational limitations, which remain a significant bottleneck
in 2GS assembly (Imelfort and Edwards, 2009).
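The Eulerian idea can be sketched in Python by building a graph in which each distinct kmer is an edge between its (k-1)-base prefix and suffix, and then walking every edge exactly once. This toy example assumes error-free reads, a single strand, and that each kmer occurs once in the target sequence; practical assemblers additionally handle sequencing errors, reverse complements, and repeats. The function names are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph with one edge (prefix -> suffix) per distinct kmer."""
    kmers = sorted({read[i:i + k] for read in reads for i in range(len(read) - k + 1)})
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Return a node path using every edge once (Hierholzer's algorithm).

    Assumes such a path exists. The walk starts at the node with one more
    outgoing than incoming edge, or at an arbitrary node for a cycle.
    """
    in_degree = Counter(t for targets in graph.values() for t in targets)
    start = next(
        (node for node in graph if len(graph[node]) - in_degree[node] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

Because reads are decomposed into kmers rather than compared with each other directly, graph construction scales with the number of kmers and avoids the all-against-all overlap computation of the overlap-layout-consensus approach.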
1.4.3.3 2GS bioinformatics in polyploid genomics
Given the ‘data deluge’ that has been precipitated by 2GS technologies, accelerating the
production of sequencing data at a rate greater than Moore’s Law, computational analysis is indeed
a critical hurdle in successfully applying 2GS to answer important biological questions. Alignment
and assembly are two central aspects of the bioinformatic analysis of 2GS data; however, the
methodologies described above that have been developed to address
these questions tend to focus, for good reason, on the mathematical validity of the results they
produce.
In applying 2GS data and bioinformatic analysis to polyploid genome biology, the biological
factors relating to the polyploid genome must drive both the development of research questions to
be pursued and the approach to be taken in answering the biological question. 2GS technologies
are certainly applicable to polyploid genomes; however, given the particular issues these genomes raise, a new way
of thinking is required to gain the full leverage these technologies offer, and this is most true when
considering the issue of polyploid genome sequencing.

1.5 Principles for complex genome sequencing
There are a number of principles that should inform any genome-sequencing project. This is
true of projects utilising 2GS technologies and
those based on Sanger sequencing, and it is true for plant, animal, and microbial genome projects
alike.
1.5.1 Complexity reduction
One of the key factors in assembling any genome sequence is its size. If the size of the target
sequence is too large, the complexity of the assembly, namely the memory and time required for
algorithms to execute, increases beyond the scope of current computing capabilities. A method to
mitigate this problem that is embodied in even the early genome sequencing projects is to reduce
the target genome size, either by reduced representation sequencing or other sequence isolation
methods. While size is an important factor in genome sequencing, polyploidy adds an additional
layer of complexity with duplicated regions of the genome. Regardless of the size of the target
genome, any duplicated regions within a genome need to be resolved before any shotgun
sequencing approach can succeed. Sequence isolation can resolve duplicated regions of the genome
prior to sequence generation, and the application of paired reads during sequence generation can also
provide for the resolution of duplications following sequencing (Medvedev et al., 2011).
1.5.2 Defined sequencing goal
Another important factor to be considered in a genome-sequencing project is that the project
must have clear objectives that ideally align with real-world requirements. The assumption is often
made that a gold-standard assembly is required for every species, and given unlimited resources
access to a gold-standard assembly would of course be the ideal. However, if resources are limited,
as is the case in most research environments, and a lesser assembly could suffice, then a
genome could be considered finished when the assembly is fit for the intended purpose. Should
real-world requirements be fully satisfied by gene-rich scaffolds, then the cost and time associated
with a gold-standard ‘finished’ assembly may be considered inordinate. In cases where a gold-
standard assembly is necessary, a step-wise approach that delivers needs-based interim results
should be considered within the scope of a genome-sequencing project. For example, if the
objective is to identify the gene content and provide a reference for molecular marker discovery,
then a syntenic gene scaffold could be considered finished. In contrast, if the purpose is to
investigate the role of transposable elements in species evolution, a more traditional approach may
be required to elucidate the repetitive regions of the genome.
The rapid changes in genome sequencing, together with the way that these data are being applied,
raise the question of whether a gold-standard reference genome assembly is required for wheat.
The re-sequencing of many ecotypes of Arabidopsis (Weigel and Mott, 2009), as well as many rice,
maize and soybean varieties (Yu et al., 2002, Goff et al., 2002, Huang et al., 2009b, Xu et al., 2010,
Lai et al., 2010, Lam et al., 2010) highlights the value of comparative genome information and the
limitations of using a single reference. The future of wheat genomics may therefore benefit from the
production of multiple reference sequences rather than a single reference. While it is certain that
some aspects of biological significance will only be identified with a gold-standard assembly
(Feuillet et al., 2011), such pursuits should not limit the development and application of lesser
assemblies where these are sufficient to answer the biological question in hand. It is unlikely that
every single nucleotide for a wheat variety will be correctly assembled and it is therefore important
for a project to have clear specifications, enabling the end-point to be achieved and celebrated.
Given that different quality assemblies may be valuable for different purposes, it is valid to have
multiple stages of finishing appropriate to the purpose of the assembly. This ensures a gradual and
constant release of information for wheat improvement researchers without waiting for a final goal
which may never be achieved.
1.5.3 Current technologies
The final factor discussed here is that, upon determining the clear goal for a genome-sequencing
project, the research team must keep abreast of relevant emerging technologies. In this context,
being prepared and flexible enough to adopt new, appropriate technologies as they become available can
expedite the delivery of the required genome sequence. The advent of 2GS technologies could be
considered a relevant example of technological developments that have impacted existing genome
sequencing projects. In reviewing sequencing projects that have been completed over the last few
years and the manner in which 2GS has helped to accelerate the genome sequencing projects of
cucumber, apple, strawberry, and cocoa (Huang et al., 2009a, Velasco et al., 2010, Shulaev et al.,
2011, Argout et al., 2011), it is clearly advantageous to utilise improvements in sequencing
technology if at all possible within the scope of the project. If new technologies can be employed to
achieve the goals of the project, it is likely that the genome sequence will be more rapidly
‘finished’, providing advantages to the research community and beyond.

1.6 Summary
As described above, wheat is an extremely important crop species for Australia and the rest of
the world in both economic and social terms, with population growth, disease, and climate-related
stresses requiring improvements to this crop if it is to be a secure food source in the future.
Experience with genome sequence data from the first sequenced plant genomes has demonstrated
the utility of this technology not just in scientific terms to extend understanding of plant biology
and evolution, but in the same economic and social terms under which these plants are deemed
important. The polyploid complexity of wheat is a hindrance to the determination of its genome
sequence using the techniques that have been more readily applicable to the plants whose genomes
have already been sequenced. 2GS technologies are accelerating genome-sequencing efforts in a
number of crops but come with major computational challenges, and while a number of
bioinformatics tools have been established to align and assemble 2GS data, specialised thinking is
required to appropriately apply these technologies to polyploid genomes. A number of factors are
of great importance in sequencing complex genomes, and while it should be both feasible and
valuable to leverage 2GS technologies, the wheat community has been slow to do so.
It is the purpose of this thesis to establish and apply an approach that utilises 2GS technologies
and syntenic relationships within the grasses to determine the genome sequence of wheat and apply
this to extend wheat crop improvement and our understanding of wheat genome evolution. With the
correct application of new DNA sequencing and bioinformatic techniques, it is my hypothesis that
the complexity of the wheat genome can be circumvented to provide researchers and breeders with
critical genomic data to assist them to develop new wheat varieties that can address the challenges
of climate change and a growing population.
1.7 References
ALTSCHUL, S. F., GISH, W., MILLER, W., MYERS, E. W. & LIPMAN, D. J. 1990. Basic local
alignment search tool. J Mol Biol, 215, 403-10.
APPLEBY, N., EDWARDS, D. & BATLEY, J. 2009. New technologies for ultra-high throughput
genotyping in plants. Methods Mol Biol, 513, 19-39.

ARGOUT, X., SALSE, J., AURY, J.-M., GUILTINAN, M. J., DROC, G., GOUZY, J., ALLEGRE,
M., CHAPARRO, C., LEGAVRE, T., MAXIMOVA, S. N., ABROUK, M., MURAT, F.,
FOUET, O., POULAIN, J., RUIZ, M., ROGUET, Y., RODIER-GOUD, M., BARBOSA-
NETO, J. F., SABOT, F., KUDRNA, D., AMMIRAJU, J. S. S., SCHUSTER, S. C.,
CARLSON, J. E., SALLET, E., SCHIEX, T., DIEVART, A., KRAMER, M., GELLEY, L.,
SHI, Z., BERARD, A., VIOT, C., BOCCARA, M., RISTERUCCI, A. M., GUIGNON, V.,
SABAU, X., AXTELL, M. J., MA, Z., ZHANG, Y., BROWN, S., BOURGE, M., GOLSER,
W., SONG, X., CLEMENT, D., RIVALLAN, R., TAHI, M., AKAZA, J. M., PITOLLAT,
B., GRAMACHO, K., D'HONT, A., BRUNEL, D., INFANTE, D., KEBE, I., COSTET, P.,
WING, R., MCCOMBIE, W. R., GUIDERDONI, E., QUETIER, F., PANAUD, O.,
WINCKER, P., BOCS, S. & LANAUD, C. 2011. The genome of Theobroma cacao. Nature
Genetics, 43, 101-108.
AUSTRALIAN BUREAU OF STATISTICS 2010. Australian Farming in Brief 2010. Sydney
NSW, Australia: Australian Bureau of Statistics.
BALAT, M. & BALAT, H. 2009. Recent trends in global production and utilization of bio-ethanol
fuel. Applied Energy, 86, 2273-2282.
BATZOGLOU, S., JAFFE, D. B., STANLEY, K., BUTLER, J., GNERRE, S., MAUCELI, E.,
BERGER, B., MESIROV, J. P. & LANDER, E. S. 2002. ARACHNE: a whole-genome
shotgun assembler. Genome Res, 12, 177-89.
BEKAERT, M., EDGER, P. P., PIRES, J. C. & CONANT, G. C. 2011. Two-phase resolution of
polyploidy in the Arabidopsis metabolic network gives rise to relative and absolute dosage
constraints. The Plant cell, 23, 1719-28.
BERKMAN, P. J., SKARSHEWSKI, A., LORENC, M., LAI, K., DURAN, C., LING, E. Y. S.,
STILLER, J., SMITS, L., IMELFORT, M., MANOLI, S., MCKENZIE, M.,
KUBALÁKOVÁ, M., ŠIMKOVÁ, H., BATLEY, J., FLEURY, D., DOLEŽEL, J. &
EDWARDS, D. 2011. Sequencing and assembly of low copy and genic regions of isolated
Triticum aestivum chromosome arm 7DS. Plant Biotechnol J, 9, 768-775.
BERKMAN, P. J., SKARSHEWSKI, A., MANOLI, S., LORENC, M. T., STILLER, J., SMITS, L.,
LAI, K., CAMPBELL, E., KUBALAKOVA, M., SIMKOVA, H., BATLEY, J., DOLEZEL,
J., HERNANDEZ, P. & EDWARDS, D. 2012. Sequencing wheat chromosome arm 7BS
delimits the 7BS/4AL translocation and reveals homoeologous gene conservation.
Theoretical and Applied Genetics, 124, 423-32.
BIOFUELS TASKFORCE 2005. Report of the Biofuels Taskforce to the Prime Minister. Canberra
ACT, Australia: Biofuels Taskforce.
BIROL, I., JACKMAN, S. D., NIELSEN, C., QIAN, J. Q., VARHOL, R., STAZYK, G., MORIN,
R. D., ZHAO, Y., HIRST, M., SCHEIN, J. E., HORSMAN, D. E., CONNORS, J. M.,
GASCOYNE, R. D., MARRA, M. A. & JONES, S. J. 2009. De novo Transcriptome
Assembly with ABySS. Bioinformatics.
BUNDOCK, P. C., ELIOTT, F. G., ABLETT, G., BENSON, A. D., CASU, R. E., AITKEN, K. S.
& HENRY, R. J. 2009. Targeted single nucleotide polymorphism (SNP) discovery in a
highly polyploid plant species using 454 sequencing. Plant Biotechnology Journal, 7, 347-
354.
BUTLER, J., MACCALLUM, I., KLEBER, M., SHLYAKHTER, I. A., BELMONTE, M. K.,
LANDER, E. S., NUSBAUM, C. & JAFFE, D. B. 2008. ALLPATHS: de novo assembly of
whole-genome shotgun microreads. Genome Res, 18, 810-20.
BYRT, C. S. & MUNNS, R. 2008. Living with salinity. New Phytol, 179, 903-5.
CHAISSON, M. J., BRINZA, D. & PEVZNER, P. A. 2009. De novo fragment assembly with short
mate-paired reads: Does the read length matter? Genome Res, 19, 336-46.
CHAISSON, M. J. & PEVZNER, P. A. 2008. Short read fragment assembly of bacterial genomes.
Genome Res, 18, 324-30.
CHAN, A. P., CRABTREE, J., ZHAO, Q., LORENZI, H., ORVIS, J., PUIU, D., MELAKE-
BERHAN, A., JONES, K. M., REDMAN, J., CHEN, G., CAHOON, E. B., GEDIL, M.,
STANKE, M., HAAS, B. J., WORTMAN, J. R., FRASER-LIGGETT, C. M., RAVEL, J. &
RABINOWICZ, P. D. 2010. Draft genome sequence of the oilseed species Ricinus
communis. Nat Biotech, 28, 951-956.
CHANTRET, N., SALSE, J., SABOT, F., RAHMAN, S., BELLEC, A., LAUBIN, B., DUBOIS, I.,
DOSSAT, C., SOURDILLE, P., JOUDRIER, P., GAUTIER, M. F., CATTOLICO, L.,
BECKERT, M., AUBOURG, S., WEISSENBACH, J., CABOCHE, M., BERNARD, M.,
LEROY, P. & CHALHOUB, B. 2005. Molecular basis of evolutionary events that shaped
the hardness locus in diploid and polyploid wheat species (Triticum and Aegilops). Plant
Cell, 17, 1033-45.
CHEN, Y., SOUAIAIA, T. & CHEN, T. 2009. PerM: Efficient Mapping of Short Sequencing
Reads with Periodic Full Sensitive Spaced Seeds. Bioinformatics.
CLC BIO. 2010. CLC bio: CLC Genomics Workbench [Online]. Available:
http://www.clcbio.com/index.php?id=1240 [Accessed 25 November 2010].
CLEMENT, N. L., SNELL, Q., CLEMENT, M. J., HOLLENHORST, P. C., PURWAR, J.,
GRAVES, B. J., CAIRNS, B. R. & JOHNSON, W. E. 2010. The GNUMAP algorithm:
unbiased probabilistic mapping of oligonucleotides from next-generation sequencing.
Bioinformatics, 26, 38-45.
CLOONAN, N., XU, Q., FAULKNER, G. J., TAYLOR, D. F., TANG, D. T., KOLLE, G. &
GRIMMOND, S. M. 2009. RNA-MATE: a recursive mapping strategy for high-throughput
RNA-sequencing data. Bioinformatics, 25, 2615-6.
COCK, P. J., FIELDS, C. J., GOTO, N., HEUER, M. L. & RICE, P. M. 2010. The Sanger FASTQ
file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants.
Nucleic Acids Res, 38, 1767-71.
COMMONWEALTH SCIENTIFIC AND INDUSTRIAL RESEARCH ORGANISATION. 2010.
Breeding better cereal varieties and improving crop management [Online]. Commonwealth
Scientific and Industrial Research Organisation. Available:
http://www.csiro.au/science/PIcereals.html [Accessed 25 November 2010].
COX, M. P., PETERSON, D. A. & BIGGS, P. J. 2010. SolexaQA: At-a-glance quality assessment
of Illumina second-generation sequencing data. BMC Bioinformatics, 11, 485.
DANIELLS, J. W., GEERING, A. D. W., BRYDE, N. J. & THOMAS, J. E. 2001. The effect of
Banana streak virus on the growth and yield of dessert bananas in tropical Australia. Annals
of Applied Biology, 139, 51-60.

DEGENKOLBE, T., DO, P., ZUTHER, E., REPSILBER, D., WALTHER, D., HINCHA, D. &
KÖHL, K. 2009. Expression profiling of rice cultivars differing in their tolerance to long-
term drought stress. Plant Molecular Biology, 69, 133-153.
DELCHER, A. L., SALZBERG, S. L. & PHILLIPPY, A. M. 2003. Using MUMmer to identify
similar regions in large sequence sets. Curr Protoc Bioinformatics, Chapter 10, Unit 10 3.
DEVOS, K. M. & GALE, M. D. 1997. Comparative genetics in the grasses. Plant Mol Biol, 35, 3-
15.
DOHM, J. C., LOTTAZ, C., BORODINA, T. & HIMMELBAUER, H. 2007. SHARCGS, a fast
and highly accurate short-read assembly algorithm for de novo genomic sequencing.
Genome Res, 17, 1697-706.
DOLEŽEL, J., KUBALÁKOVÁ, M., BARTOŠ, J. & MACAS, J. 2004. Flow cytogenetics and
plant genome mapping. Chromosome Res, 12, 77-91.
DOYLE, J. J., FLAGEL, L. E., PATERSON, A. H., RAPP, R. A., SOLTIS, D. E., SOLTIS, P. S. &
WENDEL, J. F. 2008. Evolutionary Genetics of Genome Merger and Doubling in Plants.
Annual Review of Genetics, 42, 443-461.
DURAN, C., EALES, D., MARSHALL, D., IMELFORT, M., STILLER, J., BERKMAN, P. J.,
CLARK, T., MCKENZIE, M., APPLEBY, N., BATLEY, J., BASFORD, K. & EDWARDS,
D. 2010. Future tools for association mapping in crop plants. Genome, 53, 1017-23.
ECKARDT, N. A. 2001. A Sense of Self: The Role of DNA Sequence Elimination in
Allopolyploidization. The Plant Cell, 13, 1699-1704.
EDWARDS, D. & BATLEY, J. 2010. Plant genome sequencing: applications for crop
improvement. Plant Biotechnol J, 8, 2-9.
EID, J., FEHR, A., GRAY, J., LUONG, K., LYLE, J., OTTO, G., PELUSO, P., RANK, D.,
BAYBAYAN, P., BETTMAN, B., BIBILLO, A., BJORNSON, K., CHAUDHURI, B.,
CHRISTIANS, F., CICERO, R., CLARK, S., DALAL, R., DEWINTER, A., DIXON, J.,
FOQUET, M., GAERTNER, A., HARDENBOL, P., HEINER, C., HESTER, K., HOLDEN,
D., KEARNS, G., KONG, X., KUSE, R., LACROIX, Y., LIN, S., LUNDQUIST, P., MA,
C., MARKS, P., MAXHAM, M., MURPHY, D., PARK, I., PHAM, T., PHILLIPS, M.,
ROY, J., SEBRA, R., SHEN, G., SORENSON, J., TOMANEY, A., TRAVERS, K.,
TRULSON, M., VIECELI, J., WEGENER, J., WU, D., YANG, A., ZACCARIN, D.,
ZHAO, P., ZHONG, F., KORLACH, J. & TURNER, S. 2009. Real-time DNA sequencing
from single polymerase molecules. Science, 323, 133-8.
ERLICH, Y., CHANG, K., GORDON, A., RONEN, R., NAVON, O., ROOKS, M. & HANNON,
G. J. 2009. DNA Sudoku--harnessing high-throughput sequencing for multiplexed specimen
analysis. Genome Res, 19, 1243-53.
ERLICH, Y., MITRA, P. P., DELABASTIDE, M., MCCOMBIE, W. R. & HANNON, G. J. 2008.
Alta-Cyclic: a self-optimizing base caller for next-generation sequencing. Nat Methods, 5,
679-82.
FABI, J. P., MENDES, L. R. B. C., LAJOLO, F. M. & DO NASCIMENTO, J. R. O. 2010.
Transcript profiling of papaya fruit reveals differentially expressed genes associated with
fruit ripening. Plant Science, 179, 225-233.
FEUILLET, C., LEACH, J. E., ROGERS, J., SCHNABLE, P. S. & EVERSOLE, K. 2011. Crop
genome sequencing: lessons and rationales. Trends in Plant Science, 16, 77-88.
FLAGEL, L., UDALL, J., NETTLETON, D. & WENDEL, J. 2008. Duplicate gene expression in
allopolyploid Gossypium reveals two temporally distinct phases of expression evolution.
BMC Biology, 6, 16.
FLAVELL, R. B., RIMPAU, J. & SMITH, D. B. 1977. Repeated sequence DNA relationships in 4
cereal genomes. Chromosoma, 63, 205-222.
FLEURY, D., LUO, M.-C., DVORAK, J., RAMSAY, L., GILL, B., ANDERSON, O., YOU, F.,
SHOAEI, Z., DEAL, K. & LANGRIDGE, P. 2010. Physical mapping of a large plant
genome using global high-information-content-fingerprinting: the distal region of the wheat
ancestor Aegilops tauschii chromosome 3DS. BMC Genomics, 11, 382.
FLICEK, P. & BIRNEY, E. 2009. Sense from sequence reads: methods for alignment and
assembly. Nature Methods, 6, S6-S12.
