1. A
C
G
T
The Medicago truncatula genome:
a progress report
Dr. Bruce A. Roe
Advanced Center for Genome Technology
Department of Chemistry and Biochemistry
University of Oklahoma
broe@ou.edu www.genome.ou.edu
Plant and Animal Genome
San Diego, January 11, 2004
Photos by Steve Hughes, Genetic Resource Centre (PIRSA-SARDI), Adelaide, Australia.
http://www.fao.org/ag/AGP/AGPC/doc/gallery/pictures/meditrunc/meditrunc.htm
2.
Why sequence the Medicago genome?
• An important forage crop
• A genetically tractable model legume
• A relatively small (~500 Mbp) diploid genome
• Active legume research community
• Medicago Research Consortium
• Large collection of ESTs
• Excellent BAC library
• Integrated physical and genetic map
• Large number of BAC-end sequences
3.
Sequence Pipeline at the University of Oklahoma
Genome Center, OU-ACGT
From DNA to GenBank
• DNA shearing (Hydroshear™)
• Colony picking (QPixII™)
• Growing subclones (HiGro™)
• Subclone isolation I (Mini-Staccato™)
• Subclone isolation II (VPrep™)
• Thermocycling (ABI 9700)
• Sequencing (ABI 3700)
• Data assembly and analysis
• Primer synthesis
• Miscellaneous liquid handling
• Closure
4.
Subclone Isolation (Mini-Staccato™)
• This Zymark robot has a 384-cannula array, four built-in shakers, three
attached storage racks, built-in barcoding and a Twister II robotic arm.
• This automation has allowed us to perform the DNA isolation completely
unattended from as many as eighty 384-well plates of bacterial cells per
5.
Subclone Isolation (Mini-Staccato™)
• Once all three solutions have been added, the plates are transferred from
the SciClone workspace deck to a storage rack by the Twister II robotic arm.
6.
Subclone Isolation and Sequencing Reaction
Pipetting (Velocity 11 VPrep)
• Liquid handling station with 384-channel pipettor head
• Four movable shelves on either side of the pipettor head
• Used for subclone isolation, sequencing reaction set-up and clean-up.
7.
Data assembly and Analysis
• Sun V880 server with 32 GB RAM running Solaris 8 OS and 3
TB of data stored on RAID-5 arrays
with autoloader tape backup
• Also: 12 workstations each with 1 GB RAM
• Software: Phred/Phrap/Consed, Exgap
8.
Initial WGS Skimming for ~500 Mb
Medicago truncatula genome
• Collected ~25,000 end-sequences from ~12,500
plasmid-based WGS clones.
• Of these ~25,000 sequences, ~1,000 have
homology with Medicago truncatula ESTs.
• URL:
http://www.genome.ou.edu/medicago.html
9.
Phrap assembly of our Medicago truncatula whole
genome shotgun survey sequencing data
at 0.005-fold genomic sequence coverage
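As a rough illustration of what this coverage figure means, fold coverage is the total number of bases sequenced divided by the genome size. The sketch below only converts between the two using the ~500 Mb genome size and 0.005-fold figure from the slides; the function names are hypothetical, and the read lengths behind the survey are not stated in this deck, so they are not modeled.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Fold coverage = total bases sequenced / genome size."""
    return total_bases / genome_size


def bases_for_coverage(coverage: float, genome_size: float) -> float:
    """Total bases that correspond to a given fold coverage."""
    return coverage * genome_size


GENOME_SIZE = 500e6  # ~500 Mb estimate quoted on the slides

# 0.005-fold of a ~500 Mb genome corresponds to roughly 2.5 Mb of sequence.
survey_bases = bases_for_coverage(0.005, GENOME_SIZE)
print(f"{survey_bases / 1e6:.1f} Mb")
```

At this depth most of the genome is unsampled, so contigs are expected mainly from sequences present in many copies.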
10.
DotPlot of a Phrap-assembled whole genome
shotgun contig showing multiple repeated regions
[Figure: dot plot of the contig against itself; both axes in bases, 0 to 700]
11.
DotPlot of a Phrap-assembled whole genome shotgun
contig showing 4 repeated blocks of ~600 bases
[Figure: dot plot of the contig against itself; both axes in bases, 0 to 1000]
12.
Yet another genomic contig showing extensive repeated regions
[Figure: dot plot of Contig 1931 against itself; both axes in bases, 0 to 600]
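The dot plots above compare a contig against itself, so repeats show up as matches away from the main diagonal. A minimal sketch of that idea follows; it is a plain exact word-match dot plot, not the tool used to draw the slides, and the function names, word size and toy sequence are all hypothetical.

```python
def dotplot_matches(seq_a: str, seq_b: str, word: int = 4):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    hits = []
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            hits.append((i, j))
    return hits


def off_diagonal(hits):
    """In a self-comparison, hits off the main diagonal indicate repeats."""
    return [(i, j) for i, j in hits if i != j]


# Toy sequence (not Medicago data): the same 8-mer at positions 0 and 16.
toy = "ACGTTGCA" + "GGATCCAT" + "ACGTTGCA"
repeats = off_diagonal(dotplot_matches(toy, toy))
print(repeats[:5])
```

Plotting the `(i, j)` pairs as points gives the familiar picture: a solid main diagonal plus parallel off-diagonal lines for each repeated block.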
14.
Summary of our Medicago truncatula WGS
Sequencing Assembly with only 0.005-fold
Genomic Sequence Coverage
• The largest contig (21,157 bp) contained the 26S
rRNA genes
• 19 smaller contigs (105,455 bp total) were from the
chloroplast genome
• The remaining ~500 contigs, ranging in size from
2,000 to 12,000 bp, contain highly repetitive DNA
that was unique to Medicago, as it had no
significant homology in the GenBank database
• We concluded that a more directed strategy was
needed
15.
Mapped BAC approach in
collaboration with Doug Cook
and DJ Kim at U.C. Davis with
funding from the Noble
Foundation, Ardmore, OK
16.
The first ~1000 Medicago truncatula BACs
• Initially concentrated on BACs with known biological
markers and in regions of biological interest that were
supplied to us by the UC Davis group.
• Requests for sequencing specific BACs were directed
to Doug Cook and DJ Kim at UC Davis, and they
supplied us with the BACs once these BACs had
been characterized.
• Once the BACs were received, we created the shotgun
libraries, isolated the sequencing templates and
obtained the working draft sequence followed by
closure and finishing.
• All data was made publicly available in GenBank
within 24 hours of sequence assembly.
19.
The next ~750 Medicago truncatula BACs
• With recent NSF funding, we will be
sequencing BACs from chromosomes
1, 4, 6, and 8 with the goal of completing
the sequence of the euchromatic regions
of these chromosomes over the next 3
years.
• Chromosomes 2 and 7 will be sequenced
at TIGR, chromosome 3 at The Sanger
Institute and chromosome 5 at
Genoscope.
• All data will be released immediately as
before.
27.
Gene Size Distribution (All Sequence Data)
(FgeneSH vs. Genscan)
[Figure: histogram of number of genes (0 to 4,500) by gene size range, in 1,000-wide bins from 1-1000 to 19001-20000 plus 20001-above, for FgeneSH and Genscan]
13,396 FgeneSH predicted genes
11,488 Genscan predicted genes
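The bins on this chart are 1,000 wide, starting at 1, with everything over 20,000 collected into one overflow bin. A minimal sketch of that binning is shown below; the function names and the example lengths are hypothetical and are not taken from the FgeneSH or Genscan output.

```python
from collections import Counter


def size_bin(length: int, width: int = 1000, top: int = 20000) -> str:
    """Label for the size bin containing `length` (length >= 1)."""
    if length > top:
        return f"{top + 1}-above"
    lo = ((length - 1) // width) * width + 1
    return f"{lo}-{lo + width - 1}"


def size_histogram(lengths):
    """Count how many lengths fall into each size bin."""
    return Counter(size_bin(n) for n in lengths)


# Made-up gene lengths, for illustration only.
example = [850, 1000, 1001, 2300, 2999, 25000]
print(size_histogram(example))
```

The same function covers the exon and intron charts if the bin edges are passed in explicitly, since those use narrower bins at the low end.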
28.
Exon Size Distribution (All Sequence Data)
(FgeneSH vs. Genscan)
[Figure: histogram of number of exons (0 to 20,000) by exon size range, with bins 1-50, 51-100, 100-wide bins from 101-200 to 901-1000, and 500-wide bins from 1001-1500 to 3501-4000, for FgeneSH and Genscan]
59,808 FgeneSH predicted exons
55,792 Genscan predicted exons
29.
Intron Size Distribution (All Sequence Data)
(FgeneSH vs. Genscan)
[Figure: histogram of number of introns (0 to 12,000) by intron size range, with bins 1-50, 51-100, 100-wide bins from 101-200 to 901-1000, and 500-wide bins from 1001-1500 to 3501-4000, for FgeneSH and Genscan]
46,412 FgeneSH predicted introns
44,305 Genscan predicted introns
30.
Gene Density of the ~450 Mb Medicago truncatula genome
                                   FgeneSH          Genscan
Total number of genes              13,397           11,488
Total length of genes              30,793,326       51,687,528
Total exon length                  15,794,243       14,400,445
Total number of exons              59,808           55,792
Total intron length                14,999,083       37,287,083
Total number of introns            46,412           44,305
Base Pairs Sequenced               87,423,457       87,423,457
Gene Space
(Gene Length/BP Sequenced)         35%              59%
Gene Density (Genes/200Mb)         30,649           26,281
                                   1 gene/6.5 kb    1 gene/7.6 kb
Arabidopsis: 25,498 protein coding genes
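The derived rows of this table follow directly from the totals: gene space is total gene length divided by base pairs sequenced, and gene density scales the gene count from the sequenced bases to 200 Mb. The sketch below reproduces that arithmetic from the figures on the slide; the function and variable names are hypothetical.

```python
BP_SEQUENCED = 87_423_457  # base pairs sequenced (from the table)

# Totals from the table for the two gene predictors.
PREDICTIONS = {
    "FgeneSH": {"genes": 13_397, "gene_length": 30_793_326},
    "Genscan": {"genes": 11_488, "gene_length": 51_687_528},
}


def gene_space(gene_length: int, bp_sequenced: int = BP_SEQUENCED) -> float:
    """Fraction of the sequenced bases covered by predicted genes."""
    return gene_length / bp_sequenced


def genes_per(target_bp: float, genes: int, bp_sequenced: int = BP_SEQUENCED) -> float:
    """Gene count scaled linearly from the sequenced bases to target_bp."""
    return genes / bp_sequenced * target_bp


def kb_per_gene(genes: int, bp_sequenced: int = BP_SEQUENCED) -> float:
    """Average spacing between predicted genes, in kb."""
    return bp_sequenced / genes / 1000


for name, p in PREDICTIONS.items():
    print(
        name,
        f"{gene_space(p['gene_length']):.0%}",
        round(genes_per(200e6, p["genes"])),
        f"1 gene/{kb_per_gene(p['genes']):.1f} kb",
    )
```

The linear scaling assumes the sequenced BACs are representative of the whole 200 Mb being extrapolated to, which is an assumption of the estimate rather than something the arithmetic can check.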
32.
Metabolic Overview of Medicago
13,396 FgeneSH predicted genes using the COG Database
• DNA Metabolism 23%
• Cellular Processes 23%
• Metabolism 24%
• Poorly Characterized 17%
• No Hits 5%
• Multiple COG Hits 8%
33.
Metabolic Overview (detailed view) of Medicago
13,396 FgeneSH predicted genes using the COG Database
• No Hits 5%
• Translation, ribosomal structure & biogenesis 7%
• Transcription 5%
• DNA replication, recombination & repair 11%
• Multiple COG Hits 8%
• Poorly Characterized 17%
• Cell division & chromosome partitioning 2%
• Posttranslational modification, protein turnover, chaperones 5%
• Cell envelope biogenesis, outer membrane 4%
• Cell motility & secretion 3%
• Inorganic ion transport & metabolism 3%
• Signal transduction mechanisms 5%
• Energy production & conversion 5%
• Carbohydrate transport & metabolism 4%
• Amino acid transport & metabolism 5%
• Nucleotide transport & metabolism 2%
• Coenzyme metabolism 2%
• Lipid metabolism 2%
• Secondary metabolites biosynthesis, transport & catabolism 3%
35.
AC138448.fg.10 MATKRSVGTLKEAELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHL-----
AC138448.fg.11 MA-KKSVGDLSGAELKGKKVFVRADLNVPLDDNQNITDDTRIRAAIPTIKYLIQNGAKVILSSHL-----
AC138448.fg.8 MATKRSVGTLKEGELKGKRVFVRVDLNVPLDDNLNITDDTRIRAAVPTIKYLTGYGAKVILSSHLEIYKT
AC138448.fg.10 ------------------------------------------GRPKGVTPKYSLKPLVPRLSELLGTQVK
AC138448.fg.11 ------------------------------------------GRPKGVTPKYSLAPLVPRLSELIGIEVI
AC138448.fg.8 EVSVSEYNLAVSEYKLAISDTYRYRIRVRHDSSPFLEYRGSQGRPKGVTPKYSLKPLVPRLSELLETQVK
AC138448.fg.10 IADDSIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNDPEFAKKLASLADLYVNDAFGTAHRAHASTEGV
AC138448.fg.11 KAEDSIGPEVEKLVASLPDGGVLLLENVRFYKEEEKNDPEHAKKLAALADLYVNDAFGTAHRAHASTEGV
AC138448.fg.8 ISDDCIGEEVEKLVAQIPEGGVLLLENVRFHKEEEKNEPEFAKKLASLADLYVNDAFGTAHRAHASTEGV
AC138448.fg.10 AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.11 TKYLKPSVAGFLLQKELDYLVGAVSSPKRPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIFTFYKA
AC138448.fg.8 AKYLKPSVAGFLMQKELDYLVGAVSNPKKPFAAIVGGSKVSSKIGVIESLLEKVDILLLGGGMIYTFYKA
AC138448.fg.10 QGYAVGSSLVEEDKLDLATTLIEKAKAKGVSLLLPTDVVIADKFAADANDKIVPASSIPDGWMGLDIGPD
AC138448.fg.11 QGLAVGSSLVEEDKLELATTLIAKAKAKGVSLLLPSDVVIADKFAPDANSQIVPASAIPDGWMGLDIGPD
AC138448.fg.8 QGYSIGSSLVEEDKLDLATSLMEKAKAKGVSLLLPTDVVIADKFSADANDKIVPASSIPDGWMGLDIGPD
AC138448.fg.10 SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM
AC138448.fg.11 SIKTFNEALDTTQTIIWNGPMGVFEFDKFAVGTESIAKKLADLSGKGVTTIIGGGDSVAAVEKVGVADVM
AC138448.fg.8 SIKTFNEALDKSQTIIWNGPMGVFEFDKFAAGTEAIAKKLAEVSGKGVTTIIGGGDSVAAVEKVGLADKM
AC138448.fg.10 SHISTGGGASLELLEGKPLPGVLALDDA* 401 amino acids
AC138448.fg.11 SHISTGGGASLELLEGKELPGVLALDEATPVAV* 405 amino acids, differs at 42 positions
AC138448.fg.8 SHISTGGGASLELLEGKPLPGVLALDDA* 448 amino acids, differs at 6 positions
Gene Duplication: Three copies of phosphoglycerate kinase in one BAC
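Counts such as "differs at 42 positions" come from comparing aligned sequences column by column. A minimal sketch of that comparison follows; the function name is hypothetical, and because the slide does not say whether gap columns were counted as differences, that choice is left as a flag. The example uses only the first seven alignment columns of AC138448.fg.10 and AC138448.fg.11.

```python
def count_differences(aln_a: str, aln_b: str, count_gaps: bool = False) -> int:
    """Count columns where two aligned sequences differ.

    Columns containing a gap ('-') are skipped unless count_gaps is True.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    diffs = 0
    for x, y in zip(aln_a, aln_b):
        if x == y:
            continue
        if not count_gaps and "-" in (x, y):
            continue
        diffs += 1
    return diffs


# First seven alignment columns of fg.10 and fg.11 from the slide.
fg10_start = "MATKRSV"
fg11_start = "MA-KKSV"
print(count_differences(fg10_start, fg11_start))
print(count_differences(fg10_start, fg11_start, count_gaps=True))
```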
36.
Printrepeat Analysis of
M. truncatula BAC AC121240 vs. A. thaliana Chr.2
Expansion, Duplication, Repeat Elements
~5 kb region
~25 kb region
38.
Medicago truncatula
Summary and Conclusions
• Average predicted gene density of 1 gene per 6.5 to
7.6 kb by FgeneSH and Genscan, respectively.
• Genome characteristics such as %GC, intron/exon
size and conserved unique 5’ splice sites reveal
Medicago-specific characteristics.
• The sequence of the Medicago truncatula genome
shows homology to the sequenced Arabidopsis
thaliana genome but expansion, rearrangements
and duplications are evident.
39.
Data Release and Preliminary Annotation
• All our sequence data is available through links on our
web site to GenBank and on our ftp site at URL:
ftp.genome.ou.edu/medicago
• Keyword and BLAST searches can be done on our web site
at URL: http://www.genome.ou.edu/medicago.html
• Additional annotation via the Genome Browser database
is available on our web site at URL:
http://www.genome.ou.edu/medicago_table.html
• E-mail suggestions for additional annotation to Bruce
Roe at: broe@ou.edu
40.
Three Year Plan
• Obtain the contiguous sequence of the gene-rich
regions of four of the 8 Medicago truncatula
chromosomes at OU, with the remaining four being
completed by our international partners at TIGR,
Sanger, and Genoscope.
• This information will serve as a solid foundation
for anticipated comparative and functional
legume genomics.
41.
Laboratory Organization
Bruce Roe, PI
Informatics
Support Teams
Production Administration
Jim White
Steve Kenton
Hongshing Lai
Sean Qian
Rose Morales-Diaz*
Mounir Elharam*
Yonas Tesfai
Steve Shaull**
Doug White
Work-study Undergraduates**
Kay Lynn Hale
Dixie Wishnuck
Tami Womack
Mary Catherine Williams
DNA Synthesis
Phoebe Loh*
Sulan Qi
Bart Ford*
Reagents &
Equip. Maint.
Mounir Elharam*
Doug White
Axin Hua
Weihong Xu
Jami Milam
Sara Downard**
Limei Yang
Angie Prescott*
Audra Wendt**
Mandi Aycock**
Ziyun Yao
Steve Shaull*
Youngju Yoon
Trang Do
Anh Do
Lily Fu
Yang Ye
James Yu
Tessa Manning**
Fu Ying
Liping Zhou
Ruihua Shi
Junjie Wu
Stephan Deschamps
Shelly Oommen
Christopher Lau
Yanhong Li
Research Teams
Doris Kupfer
Julia Kim*
Sun So
Graham Wiley**
Lauren Ritterhouse**
Lin Song
Ying Ni
Huarong Jiang
ShaoPing Lin
Honggui Jia
Hongming Wu
Baifang Qin
Peng Zhang
Fares Najar
Chunmei Qu
Keqin Wang
Carson Qu
Shuling Li
Funding from the Noble Foundation, DOE, and NSF
Collaborators at Univ. Minnesota, UC Davis, TIGR,
Sanger, Genoscope, and the Noble Foundation
Pheobe Loh **
Sulan Qi
Bart Ford*
* Previous undergraduate
research student
** Present undergraduate
research student
44.
Conserved Intron/Exon Boundary Features by a FELINES**
Analysis of 181,444 Medicago truncatula ESTs in GenBank
vs Genomic Sequence
            Size Range        Mean Length
Exons       6 - 5,789 nt      268 nt
Introns     20 - 3,921 nt     429 nt
Intron Conserved Splice Site Sequence Elements      Percent
Introns w/ 5’ GU                                    99.21%
Introns w/ 5’ GC                                    0.36%*
Introns w/ 5’ AU                                    0.31%
Introns w/ U12 branch sites instead of A12          0.13%
*Compared to 0.5 - 2.5% in fungi, and 0.5% in mammals with an EST minimum identity
of 90%
** S. Drabenstot, D. Kupfer, J. White, D. Dyer, B. Roe, K. Buchanan and J. Murphy.
FELINES: A Utility for Extracting and Examining EST-Defined Introns and Exons.
Nucleic Acids Research 31(22), e141 (2003).
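The percentages in this table are tallies of introns by the first two bases at their 5’ end. A minimal sketch of such a tally is shown below; it is not FELINES itself, the function name is hypothetical, and the toy intron sequences are made up for illustration rather than taken from the EST analysis.

```python
from collections import Counter


def five_prime_classes(introns):
    """Percent of introns per 5' dinucleotide, written as RNA (T -> U)."""
    counts = Counter(
        seq[:2].upper().replace("T", "U") for seq in introns if len(seq) >= 2
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {dinuc: 100.0 * n / total for dinuc, n in counts.items()}


# Made-up intron sequences (DNA, 5' to 3'), for illustration only.
toy_introns = [
    "GTAAGTTTTCAG",
    "GTATGTAACAG",
    "GCAAGTTTGCAG",
    "ATATCCTTTTAC",
]
print(five_prime_classes(toy_introns))
```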
45.
Consensus Logogram of the 5’GU vs the 5’AU Class of Introns
in Medicago truncatula determined by FELINES
AU intron consensus
GU intron consensus