Challenges with data quality,
     sharing, and versioning
      David Dooling <ddooling@wustl.edu>
                      ...
Production Centers
• Tony Cox, Sanger        • David Dooling, WUStL
  Sequencing              Scale
  Scale            ...
sub scale {



Moore’s Law

                       [Chart legend: Personnel, Moore’s Law, CPU]
           ...
Images




                 200 TB/week




Images




                   10 PB/year




Perspective




                   20 PB/day

Perspective




                       2 PB/s

FASTQ
  @HWI-EAS404:5:1:6:180#0/1
  GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
  +HWI-EAS...
FASTQ
  @HWI-EAS404:5:1:6:180#0/1
  GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
  +HWI-EAS...
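The speaker notes for this slide estimate about 200 bytes for a 75 b read, a quarter of it header text, with a saving of roughly an eighth from not repeating the header on the `+` line. The sketch below checks that arithmetic on the first record from the slide. The function name `fastq_record_bytes` is hypothetical, and the quality string is a placeholder of the right length rather than the real quality values.

```python
def fastq_record_bytes(header: str, seq: str, qual: str,
                       repeat_header: bool = True) -> int:
    """Bytes for one four-line FASTQ record, counting one newline per line."""
    plus_line = "+" + header if repeat_header else "+"
    lines = ["@" + header, seq, plus_line, qual]
    return sum(len(line) + 1 for line in lines)


# Read name and bases are copied from the slide; the quality string is a
# placeholder of matching length, not real data.
header = "HWI-EAS404:5:1:6:180#0/1"
seq = ("GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTG"
       "TGGAAAGGTGTTTGTCATGTATTTT")
qual = "a" * len(seq)

full = fastq_record_bytes(header, seq, qual)
trimmed = fastq_record_bytes(header, seq, qual, repeat_header=False)
header_share = 2 * (len(header) + 1) / full   # both header lines, with markers
saving = (full - trimmed) / full

print(f"{full} bytes per record, {header_share:.1%} of it header text")
print(f"{trimmed} bytes without the repeated header ({saving:.1%} saved)")
```

For this record the totals land close to the figures in the notes: a little over 200 bytes, about a quarter of it header, and a saving on the order of 12%.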
Mapping




                       2 TB/week




Mapping




                       100 TB/year




Mapping




                  42,000 core-hr/week




Mapping




                       5 core-yr/week




Mapping




                       250 core cluster
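The scale figures on these slides can be reproduced with a few lines of arithmetic. The three expressions below are copied verbatim from the speaker notes, which do not say what each factor stands for, so they are left unnamed here. Two things are assumptions: the units (TB for images, GB for FASTQ and mapping, inferred from the slide totals) and the 50-week year, which is what makes 200 TB/week, 7 TB/week, and 2 TB/week line up with 10 PB/year, 350 TB/year, and 100 TB/year.

```python
HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24
WEEKS_PER_YEAR = 50  # assumption: matches the slides' yearly totals better than 52

# Expressions copied verbatim from the speaker notes (factor meanings not given).
images_tb_per_week = 3.4 * 125 / 75 * 35
fastq_gb_per_week = (90 * 2 + 90 / 125 * 50) * 35
mapping_gb_per_week = 8 * 90 / 12 * 35

# Mapping compute, starting from the 42,000 core-hr/week slide.
core_hours_per_week = 42_000
core_years_per_week = core_hours_per_week / HOURS_PER_YEAR
cores_needed = core_hours_per_week / HOURS_PER_WEEK

print(f"images:  {images_tb_per_week:.0f} TB/week, "
      f"{images_tb_per_week * WEEKS_PER_YEAR / 1000:.1f} PB/year")
print(f"FASTQ:   {fastq_gb_per_week / 1000:.1f} TB/week, "
      f"{fastq_gb_per_week * WEEKS_PER_YEAR / 1000:.0f} TB/year")
print(f"mapping: {mapping_gb_per_week / 1000:.1f} TB/week, "
      f"{mapping_gb_per_week * WEEKS_PER_YEAR / 1000:.0f} TB/year")
print(f"compute: {core_years_per_week:.1f} core-yr/week, "
      f"{cores_needed:.0f} cores running around the clock")
```

Dividing 42,000 core-hours by the 168 hours in a week gives the 250-core cluster on the slide; it assumes the cluster is fully busy all week.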




The Weakest Link




The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
•...
The balanced PS         1




        10   gosub     get(sequencers)
        20   gosub     get(disk)
        30   gosub  ...
The unbalanced PS



        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacit...
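The "weakest link" point behind these two slides can be put in one line: a pipeline moves data only as fast as its slowest stage. The sketch below is a minimal illustration of that idea. The stage names come from the slide; the capacities are hypothetical numbers chosen for the example, and `pipeline_throughput` is not from the talk.

```python
def pipeline_throughput(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the limiting stage and the throughput it allows."""
    stage = min(capacities, key=capacities.get)
    return stage, capacities[stage]


# Hypothetical capacities in TB/week, for illustration only.
stages = {
    "sequencers": 7.0,
    "disk": 10.0,
    "backup_capacity": 4.0,
    "network_capacity": 12.0,
    "cluster_nodes": 6.0,
}

bottleneck, rate = pipeline_throughput(stages)
print(f"limited by {bottleneck}: {rate} TB/week")
```

Adding sequencers in this example changes nothing until backup capacity grows, which is the imbalance the `goto 10` loop pokes fun at.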
The GHz race




} # scale



sub quality {



Honda




Honda




Honda




Ford




Ford




Ford




Ford




Ford




Ford




Ford




Quality is Job 1




...must be more than
         just a slogan



Quality missteps
          Initial low fidelity between base
              quality values and quality




                ...
An aside




            “basecall calibration predicted vs. observed”
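"Predicted vs. observed" can be made concrete. Assuming the standard Phred definition, Q = -10 log10(p), a calibration check bins base calls by their predicted quality and compares each bin with the quality implied by the error rate actually observed there. The sketch below shows that comparison; the function names and the toy counts are illustrative and not taken from the talk.

```python
import math
from collections import defaultdict


def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred quality value."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality value for an error probability p > 0."""
    return -10 * math.log10(p)


def observed_quality(calls):
    """Map each predicted Q to the Q implied by its observed error rate.

    `calls` is an iterable of (predicted_q, is_error) pairs. Bins with no
    observed errors map to None because no rate can be estimated.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_error in calls:
        totals[q] += 1
        errors[q] += bool(is_error)
    return {
        q: error_prob_to_phred(errors[q] / totals[q]) if errors[q] else None
        for q in sorted(totals)
    }


# Toy data: 1,000 calls predicted at Q30, of which 10 were wrong.
calls = [(30, True)] * 10 + [(30, False)] * 990
for predicted, observed in observed_quality(calls).items():
    print(f"predicted Q{predicted}, observed Q{observed:.0f}")
```

In this toy case the caller claims one error in a thousand but delivers one in a hundred, which is the kind of gap between prediction and observation the quality slides are about.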
Cult of traces




Quality is the key
Need high fidelity between prediction and observed

                 50 bytes per base


              ...
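The speaker notes call 2 bits per base the absolute minimum. A minimal sketch of what that means in practice: four letters fit in two bits, so four bases pack into one byte. The function names are hypothetical, and the encoding has no room for `N` or for quality values.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from 2-bit packed data."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack_2bit("GATTACA")
print(len(packed), "bytes for 7 bases")
print(unpack_2bit(packed, 7))
```

The sequence length has to be stored separately, because the final byte is padded.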
The down side




http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf

...
} # quality



sub sharing {



1000 Genomes




3.8 Tb




~50 B/b




190 TB
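The three slides above are a multiplication: 3.8 Tb of sequence at roughly 50 bytes per base is 190 TB. The sketch below repeats that calculation and, for comparison, applies the other densities listed on the "Quality is the key" slide plus the 2 bits/base floor from the speaker notes. It assumes decimal units (1 TB = 10^12 bytes); `storage_tb` is a hypothetical helper.

```python
BASES = 3.8e12  # 3.8 Tb, from the slide


def storage_tb(bases: float, bits_per_base: float) -> float:
    """Storage in decimal terabytes for a given encoding density."""
    return bases * bits_per_base / 8 / 1e12


# Densities from the slides and notes, expressed in bits per base.
densities = {
    "50 bytes per base": 400,
    "20 bytes per base": 160,
    "2 bytes per base": 16,
    "3 bits per base": 3,
    "2 bits per base": 2,
}

for label, bits in densities.items():
    print(f"{label:>18}: {storage_tb(BASES, bits):8.2f} TB")
```

The spread between the first and last rows is a factor of 200, which is why the talk treats trustworthy base calls and quality values as a storage question.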




Submitted to central
         repositories



... and replicated
            across the pond



The goal of this project is to provide a system
for storing and retrieving huge amounts of
data, distributed among a large...
Write-only databases




          Search limited to sequence and
           values of specific XML entities
             ...
Write-only databases




                             x
          Search limited to sequence and
           values of spec...
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?>             <?xml version="1.0" encoding="...
} # sharing



sub versioning {



The Cathedral and the Bazaar
Linux overturned much of what I thought I
knew. I had been preaching the Unix gospel of
small...
The Vatican and the Reformation




The popes




                   Will this scale?
GenBank genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

git genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
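The speaker notes describe the "git genome" idea as pushing and pulling diffs. As a rough sketch of what a reference revision could look like as a diff, the example below wraps two versions of a sequence into FASTA-style 60-character lines and produces a unified diff with Python's standard `difflib`. The sequences, the names `chr_toy`, `r1`, and `r2`, and the helper functions are all made up for illustration; this is not how any repository distributes reference updates.

```python
import difflib


def wrap(seq: str, width: int = 60) -> list[str]:
    """Split a sequence into FASTA-style fixed-width lines."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def reference_diff(old_seq: str, new_seq: str,
                   name: str, old_rev: str, new_rev: str) -> list[str]:
    """Unified diff between two revisions of one reference sequence."""
    return list(difflib.unified_diff(
        wrap(old_seq), wrap(new_seq),
        fromfile=f"{name}@{old_rev}", tofile=f"{name}@{new_rev}",
        lineterm="",
    ))


# Toy revisions: one substituted base, for illustration only.
old = "ACGT" * 40
new = old[:72] + "T" + old[73:]

patch = reference_diff(old, new, "chr_toy", "r1", "r2")
print("\n".join(patch))
```

A substitution touches a single line, but an inserted or deleted base would shift every later line of fixed-width FASTA, which is one concrete way to read the notes' question "How far will FASTA get you?"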

The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTC...
The Human Reference




The Human Reference
  (a)                                                                                                 ...
} # versioning



sub thank {"you"}



Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

Talk from the Genome Informatics Alliance 2009 meeting.


  • What are the challenges that the large genome centers are currently facing that the typical researcher will be facing soon?



    Do not store images
    Do not store SRF
    Keep FASTQ




  • This acceleration breaks everything
  • 3.4*125/75*35 = 198.333333333333
  • We need to stop having to deal with images
    It should be transparent to the end user


  • LHC http://atlasexperiment.org/
  • (90*2+90/125*50)*35 = 7560
    Uncompressed
  • For a 75 b read, you need about 200 bytes, of which 25% is headers
    Save 12.5% by simply not replicating the sequence header
  • 8*90/12*35 = 2100






  • Cost of software
  • The chain is only as strong as its weakest link.
    Images: Assembly line backing up? Keystone cops piling up? Stooges?



    Transition: a situation not unlike that faced by PC manufacturers over the past decade




  • This analogy works on another level as well...
  • Intel convinced everyone that the speed of the computer was equal to the clock speed of the processor
    Many people believed this
    Even when using a 56k modem
    Even when AMD Opteron came out
    Even when Intel went to multi-core and lower clock speeds
    A cautionary tale for those joining the Gb race
    Which wraps up the scale up...
  • ... and leads us into quality
  • Make the best small engine in the world


  • Made high-quality cars for years
    Recognized after years of consistent performance


  • Now enjoy premium cost and high resale value
    Everyone I know has a Honda Odyssey


  • Money from the T-bird allowed them to design, develop, and introduce the...














  • It’s gotten better
  • Google image search second or third result
    Draw your own conclusions
  • This distrust of base calls and quality values has reinforced the cult of traces
    This does not scale for human resources, disk space, etc.
    This leads to a very bad situation for those of us responsible for the computing, storage, and network infrastructure
  • Quality is at the core of all other issues, storage, compute, throughput, etc.
    If it’s a bad base, call it a bad base
    Don’t forget the GHz race
  • Reducing data to base calls and quality values does reduce its value
    Especially for data not natively in “base space”
    Is there a richness in this data that is lost?
    But you gain by not needing custom tool chains for each native data type








  • 2 bits/base is absolute minimum






  • Grid
  • No one ever feels lucky
  • They have learned their lesson, by creating an incredible amount of XML to submit
    Study, Sample, Experiment, Run




  • He may know a lot about software, but he does not know anything about building cathedrals


  • Currently, revisions are tightly controlled by central repositories, NCBI, UCSC, EBI


  • Push and pull around diffs
    Balance curation with rapid advances
    Debian web of trust




  • How far will FASTA get you?
    C. elegans - part of genome repeat structure
    http://genomebiology.com/2006/7/1/R7
    Can you use the current de Bruijn graph assembly engines for alignment?


  • Talk to me
  • Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

    1. Challenges with data quality, sharing, and versioning David Dooling <ddooling@wustl.edu> GIA 2009
    2. Production Centers
       • Tony Cox, Sanger: Sequencing, Scale, Infrastructure, Data flow
       • David Dooling, WUStL: Scale, Quality, Sharing, Versioning
       • Toby Bloom, Broad: Quality, Integration, Standards, Sharing
    3. sub scale {
    4. Moore’s Law [chart: Personnel, Moore’s Law, CPU, Storage, Sequence; x-axis 2000 to 2010]
    5. Images 200 TB/week
    6. Images 10 PB/year
    7. Perspective 20 PB/day
    8. Perspective 2 PB/s
    9. FASTQ
         @HWI-EAS404:5:1:6:180#0/1
         GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
         +HWI-EAS404:5:1:6:180#0/1
         aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a
         @HWI-EAS404:5:1:6:396#0/1
         TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
         +HWI-EAS404:5:1:6:396#0/1
         Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ
         @HWI-EAS404:5:1:6:1344#0/1
         GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
         +HWI-EAS404:5:1:6:1344#0/1
         aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[
         @HWI-EAS404:5:1:6:1814#0/1
         AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
         +HWI-EAS404:5:1:6:1814#0/1
         aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X
       7 TB/week
    10. FASTQ
         @HWI-EAS404:5:1:6:180#0/1
         GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
         +HWI-EAS404:5:1:6:180#0/1
         aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a
         @HWI-EAS404:5:1:6:396#0/1
         TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
         +HWI-EAS404:5:1:6:396#0/1
         Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ
         @HWI-EAS404:5:1:6:1344#0/1
         GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
         +HWI-EAS404:5:1:6:1344#0/1
         aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[
         @HWI-EAS404:5:1:6:1814#0/1
         AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
         +HWI-EAS404:5:1:6:1814#0/1
         aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X
        350 TB/year
    11. Mapping 2 TB/week
    12. Mapping 100 TB/year
    13. Mapping 42,000 core-hr/week
    14. Mapping 5 core-yr/week
    15. Mapping 250 core cluster
    16. The Weakest Link
    17. The Balanced PC • Clock speed • AGP • Front-side bus • Hypertransport • 1 Gbps • PCI-X • SATA • PCI-Express • Infiniband • Multi-core • Front-side bus • GPU • 10 Gbps
    18. The balanced PS [1]
          10 gosub get(sequencers)
          20 gosub get(disk)
          30 gosub get(backup_capacity)
          40 gosub get(network_capacity)
          50 gosub get(cluster_nodes)
        [1] Pipeline for Sequencing
    19. The unbalanced PS
          10 gosub get(sequencers)
          20 gosub get(disk)
          30 gosub get(backup_capacity)
          40 gosub get(network_capacity)
          50 gosub get(cluster_nodes)
          60 goto 10
    20. The GHz race
    21. } # scale
    22. sub quality {
    23. Honda
    24. Honda
    25. Honda
    26. Ford
    27. Ford
    28. Ford
    29. Ford
    30. Ford
    31. Ford
    32. Ford
    33. Quality is Job 1
    34. ...must be more than just a slogan
    35. Quality missteps Initial low fidelity between base quality values and quality Tsonev, S. SEP 2007
    36. An aside “basecall calibration predicted vs. observed”
    37. Cult of traces
    38. Quality is the key
        Need high fidelity between prediction and observed
        50 bytes per base, 20 bytes per base, 2 bytes per base, 3 bits per base
    39. The down side
        http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf
        http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg
    40. } # quality
    41. sub sharing {
    42. 1000 Genomes
    43. 3.8 Tb
    44. ~50 B/b
    45. 190 TB
    46. Submitted to central repositories
    47. ... and replicated across the pond
    48. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods.
    49. Write-only databases Search limited to sequence and values of specific XML entities submitted as metadata
    50. Write-only databases x Search limited to sequence and values of specific XML entities submitted as metadata
    51. Speaking of XML
        [Four XML submission documents shown side by side; the column text is interleaved in this transcript. Recoverable fields:]
        STUDY_SET: STUDY alias="LowSalternSDbayVir111005" accession="SRP000145"; STUDY_TITLE "Solar Salterns, viral fraction from low salinity saltern in San Diego, CA"; existing_study_type="Metagenomics"; CENTER_NAME SDSU; PROJECT_ID 28373
        SAMPLE_SET: SAMPLE alias="28373" accession="SRS000373"; TAXON_ID 496920; COMMON_NAME saltern metagenome; collection_date 11/10/05; lat_lon 32.599040, -117.107356
        EXPERIMENT_SET: EXPERIMENT alias="LowSalternSDbayVir111005_experiment" accession="SRX000217" expected_number_runs="2"; PLATFORM LS454, INSTRUMENT_MODEL GS 20, FLOW_COUNT 168; LIBRARY_STRATEGY OTHER, LIBRARY_SOURCE OTHER, LIBRARY_SELECTION RANDOM, LIBRARY_LAYOUT SINGLE; BASE_CALLER 454BaseCaller; QUALITY_SCORES qtype="phred", NUMBER_OF_LEVELS 64
        RUN_SET: RUN alias="D0IIGP3" accession="SRR001053" run_date="2006-03-17T09:39:51Z", file D0IIGP301.sff, total_spots="51121"; RUN alias="D1LDSHL" accession="SRR001054" run_date="2006-04-06T09:25:19Z", file D1LDSHL01.sff, total_spots="70935"; RUN_ATTRIBUTE flow_count 168, key_sequence TCAG
    52. } # sharing
    53. sub versioning {
    54. The Cathedral and the Bazaar Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.
    55. The Vatican and the Reformation
    56. The popes Will this scale?
    57. GenBank genome http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
    58. git genome http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
    59. The Human Reference
        >7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
        ...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG
        GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT
        TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT
        GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT
        ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA
        AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA
        TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA
        CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA
        TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA
        AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT
        TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC
        AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
    60. The Human Reference
    61. The Human Reference [figure: repeat graphs, panels (a), (b), (c), from the reference below]
        D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7
    62. } # versioning
    63. sub thank {"you"}
