Challenges with data quality, sharing, and versioning
David Dooling <ddooling@wustl.edu>
GIA 2009
Production Centers
• Tony Cox, Sanger
  Sequencing
  Scale
  Infrastructure
  Data flow

• David Dooling, WUStL
  Scale
  Quality
  Sharing
  Versioning

• Toby Bloom, Broad
  Quality
  Integration
  Standards
  Sharing

sub scale {



Moore’s Law

[Chart legend: Personnel, Moore’s Law, CPU, Storage, Sequence; x-axis: 2000 to 2010]




Images




                 200 TB/week




Images




                   10 PB/year




Perspective




                   20 PB/day

Perspective




                       2 PB/s

FASTQ
  @HWI-EAS404:5:1:6:180#0/1
  GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
  +HWI-EAS404:5:1:6:180#0/1
  aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a
  @HWI-EAS404:5:1:6:396#0/1
  TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
  +HWI-EAS404:5:1:6:396#0/1
  Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ
  @HWI-EAS404:5:1:6:1344#0/1
  GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
  +HWI-EAS404:5:1:6:1344#0/1
  aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[
  @HWI-EAS404:5:1:6:1814#0/1
  AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
  +HWI-EAS404:5:1:6:1814#0/1
  aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X


                               7 TB/week
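Each FASTQ record above is four lines: an `@` header, the bases, a `+` separator, and one quality character per base. The sketch below shows one way such records might be read and the quality characters turned into Phred scores. The ASCII offset of 64 is an assumption (it matches Illumina pipelines of that era, but the slide does not state it), and the short record and the name `read1` are hypothetical.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) tuples from four-line FASTQ records."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        yield header[1:], seq, qual


def phred_scores(quality, offset=64):
    """Convert quality characters to integers; offset=64 is an assumption."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical, shortened record in the same layout as the slide.
record = """@read1
GCTGGTTT
+read1
aaaa`]aD
"""

for name, seq, qual in parse_fastq(record):
    scores = phred_scores(qual)
    print(name, seq, scores)
    print("worst base error probability:", error_probability(min(scores)))
```

Under the assumed offset, `a` decodes to Q33 and `D` to Q4, so the tail of a read like the ones above would carry far less confidence than its start.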
FASTQ


                               350 TB/year
Mapping




                       2 TB/week




Mapping




                       100 TB/year
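The weekly and yearly figures for images, FASTQ, and mapping output are all consistent with multiplying by a round 50 weeks per year. That factor is an inference from the numbers, not something the slides state. A minimal sketch of the conversion, using decimal units (1 PB = 1000 TB):

```python
WEEKS_PER_YEAR = 50  # assumption: round figure that reproduces the slide totals

# Weekly volumes in TB, taken from the slides.
weekly_tb = {"images": 200, "fastq": 7, "mapping": 2}


def yearly_tb(tb_per_week, weeks=WEEKS_PER_YEAR):
    """Scale a weekly volume to a yearly one."""
    return tb_per_week * weeks


for name, tb in weekly_tb.items():
    total = yearly_tb(tb)
    print(f"{name}: {tb} TB/week -> {total} TB/year ({total / 1000:g} PB/year)")
```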




Mapping




                  42,000 core-hr/week




Mapping




                       5 core-yr/week




Mapping




                       250 core cluster
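The three compute figures for mapping are the same quantity in different units. The arithmetic can be sketched as follows, assuming a 168-hour week and a 365-day year, with every core kept busy around the clock:

```python
HOURS_PER_WEEK = 7 * 24      # 168
HOURS_PER_YEAR = 365 * 24    # 8760

core_hours_per_week = 42_000  # from the slide

# How many years one core would need to do a week's worth of mapping.
core_years_per_week = core_hours_per_week / HOURS_PER_YEAR

# How many cores must run continuously to keep up with one week's data.
cores_needed = core_hours_per_week / HOURS_PER_WEEK

print(f"{core_years_per_week:.1f} core-yr/week")
print(f"{cores_needed:.0f} cores running continuously")
```

The first figure rounds to the 5 core-yr/week on the slide, and the second is exactly the 250-core cluster. Any idle time or failed jobs would push the real requirement higher.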




The Weakest Link




The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps
The balanced PS (Pipeline for Sequencing)

        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacity)
        40   gosub get(network_capacity)
        50   gosub get(cluster_nodes)
The unbalanced PS



        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacity)
        40   gosub get(network_capacity)
        50   gosub get(cluster_nodes)
        60   goto 10
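The loop suggests that each new purchase exposes the next constraint. One way to picture this is that the pipeline runs at the pace of its slowest stage. The sketch below finds that stage. The stage names come from the slide, but every capacity number is hypothetical and chosen only for illustration.

```python
def bottleneck(capacity_tb_per_week):
    """Return (stage, capacity) for the stage with the lowest sustained throughput."""
    stage = min(capacity_tb_per_week, key=capacity_tb_per_week.get)
    return stage, capacity_tb_per_week[stage]


# Hypothetical sustained throughput per stage, in TB/week.
capacity = {
    "sequencers": 200,
    "disk": 250,
    "backup_capacity": 120,
    "network_capacity": 300,
    "cluster_nodes": 180,
}

stage, limit = bottleneck(capacity)
print(f"pipeline limited to {limit} TB/week by {stage}")
```

Raising the limiting stage simply moves the bottleneck to the next lowest one, which is the `goto 10` in the listing.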




The GHz race




} # scale



sub quality {



Honda




Ford




Quality is Job 1




...must be more than just a slogan



Quality missteps
          Initial low fidelity between base quality values and observed quality




                       Tsonev, S. SEP 2007

An aside




            “basecall calibration predicted vs. observed”
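Comparing predicted and observed quality usually means grouping bases by their predicted score, counting how many in each group turn out to be wrong against a known truth, and converting that error rate back to the Phred scale. A minimal sketch of that comparison follows. The function name and the input data are hypothetical, and real pipelines would obtain the error flags from alignments to a trusted reference.

```python
import math
from collections import defaultdict


def observed_quality(calls):
    """Map each predicted Q to the Q implied by the observed error rate.

    calls: iterable of (predicted_q, is_error) pairs for bases with known truth.
    A bin with no observed errors maps to None (the rate cannot be estimated).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for predicted_q, is_error in calls:
        totals[predicted_q] += 1
        errors[predicted_q] += int(is_error)

    table = {}
    for q in sorted(totals):
        if errors[q] == 0:
            table[q] = None
        else:
            table[q] = -10 * math.log10(errors[q] / totals[q])
    return table


# Hypothetical data: both bins show 10 errors in 1000 bases.
calls = (
    [(20, True)] * 10 + [(20, False)] * 990
    + [(30, True)] * 10 + [(30, False)] * 990
)

for predicted, observed in observed_quality(calls).items():
    print(f"predicted Q{predicted} -> observed Q{observed:.1f}")
```

In this made-up data the Q20 bin is well calibrated, while the Q30 bin behaves like Q20, which is the kind of low fidelity between prediction and observation that the talk describes.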
Cult of traces




Quality is the key
Need high fidelity between predicted and observed quality

                 50 bytes per base


                 20 bytes per base


                  2 bytes per base


                       3 bits per base

The down side




http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf

http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg




} # quality



sub sharing {



1000 Genomes




3.8 Tb




~50 B/b




190 TB
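The three numbers on these slides multiply out directly, reading "Tb" as terabases and "B/b" as bytes per base (an interpretation consistent with the "50 bytes per base" figure in the quality section):

```python
bases = 3.8e12          # 3.8 Tb (terabases) of sequence
bytes_per_base = 50     # ~50 B/b including quality and supporting data

total_bytes = bases * bytes_per_base
total_tb = total_bytes / 1e12   # decimal terabytes

print(f"{total_tb:.0f} TB")
```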




Submitted to central repositories



... and replicated across the pond



The goal of this project is to provide a system
for storing and retrieving huge amounts of
data, distributed among a large number of
heterogenous server nodes, under a single
virtual filesystem tree with a variety of
standard access methods.




Write-only databases




          Search limited to sequence and
           values of specific XML entities
              submitted as metadata
Write-only databases




                             x
          Search limited to sequence and
           values of specific XML entities
              submitted as metadata
Speaking of XML
<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT>
      <CENTER_NAME>SDSU</CENTER_NAME>
      <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME>
      <PROJECT_ID>28373</PROJECT_ID>
    </DESCRIPTOR>
    <STUDY_ATTRIBUTES>
      <STUDY_ATTRIBUTE>
        <TAG>NCBI parent project ID</TAG>
        <VALUE>28725</VALUE>
      </STUDY_ATTRIBUTE>
    </STUDY_ATTRIBUTES>
  </STUDY>
</STUDY_SET>

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217">
    <TITLE>454 sequencing of saltern metagenome fragment library</TITLE>
    <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/>
    <DESIGN>
      <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME>
        <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT>
          <SINGLE/>
        </LIBRARY_LAYOUT>
        <LIBRARY_CONSTRUCTION_PROTOCOL>
          none provided
        </LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT>
          <READ_SPEC>
            <READ_INDEX>0</READ_INDEX>
            <READ_CLASS>Technical Read</READ_CLASS>
            <READ_TYPE>Adapter</READ_TYPE>
            <BASE_COORD>1</BASE_COORD>
          </READ_SPEC>
          <READ_SPEC>
            <READ_INDEX>1</READ_INDEX>
            <READ_CLASS>Application Read</READ_CLASS>
            <READ_TYPE>Forward</READ_TYPE>
            <BASE_COORD>5</BASE_COORD>
          </READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <LS454>
        <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL>
        <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE>
        <FLOW_COUNT>168</FLOW_COUNT>
      </LS454>
    </PLATFORM>
    <PROCESSING>
      <BASE_CALLS>
        <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE>
        <BASE_CALLER>454BaseCaller</BASE_CALLER>
      </BASE_CALLS>
      <QUALITY_SCORES qtype="phred">
        <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER>
        <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS>
        <MULTIPLIER>1</MULTIPLIER>
      </QUALITY_SCORES>
    </PROCESSING>
  </EXPERIMENT>
</EXPERIMENT_SET>

<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D0IIGP301.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
  <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0">
      <FILES>
        <FILE filename="D1LDSHL01.sff" filetype="sff"/>
      </FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE>
        <TAG>flow_count</TAG>
        <VALUE>168</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>flow_sequence</TAG>
        <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE>
      </RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE>
        <TAG>key_sequence</TAG>
        <VALUE>TCAG</VALUE>
      </RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
</RUN_SET>
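Since searches are limited to the values of particular XML entities, anything a submitter wants to be findable has to live in a `TAG`/`VALUE` pair. The sketch below pulls those pairs out of a sample record using Python's standard `xml.etree.ElementTree`. The XML string is a trimmed copy of the sample metadata from the slide, and the function name `sample_attributes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the SAMPLE_SET shown on the slide.
SAMPLE_XML = """<SAMPLE_SET>
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE>
        <TAG>collection_date</TAG>
        <VALUE>11/10/05</VALUE>
      </SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE>
        <TAG>lat_lon</TAG>
        <VALUE>32.599040, -117.107356</VALUE>
      </SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>
"""


def sample_attributes(xml_text):
    """Return {accession: {tag: value}} for every SAMPLE in the document."""
    root = ET.fromstring(xml_text)
    result = {}
    for sample in root.iter("SAMPLE"):
        result[sample.get("accession")] = {
            attr.findtext("TAG"): attr.findtext("VALUE")
            for attr in sample.iter("SAMPLE_ATTRIBUTE")
        }
    return result


for accession, attrs in sample_attributes(SAMPLE_XML).items():
    print(accession, attrs)
```

Note that the values are free text: the collection date `11/10/05` is ambiguous between day-first and month-first formats, which illustrates how hard it is to search reliably on such fields.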




} # sharing



sub versioning {



The Cathedral and the Bazaar
Linux overturned much of what I thought I
knew. I had been preaching the Unix gospel of
small tools, rapid prototyping and evolutionary
programming for years. But I also believed
there was a certain critical complexity above
which a more centralized, a priori approach was
required. I believed that the most important
software (operating systems and really large
tools like the Emacs programming editor)
needed to be built like cathedrals, carefully
crafted by individual wizards or small bands of
mages working in splendid isolation, with no
beta to be released before its time.
The Vatican and the Reformation




The popes




                   Will this scale?
GenBank genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

git genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG
GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT
TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT
GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT
ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA
AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA
TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA
CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA
TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA
AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT
TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC
AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
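The only version information in this record is the assembly name embedded in the header line. The header follows the Ensembl-style layout `coord_system:assembly:name:start:end:strand`. The sketch below splits it into named fields so the assembly can be checked before two sets of coordinates are compared. The `Region` type and `parse_header` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    coord_system: str
    assembly: str
    name: str
    start: int
    end: int
    strand: int


def parse_header(line):
    """Parse the last whitespace-separated field of a FASTA header into a Region."""
    location = line.lstrip(">").split()[-1]
    coord_system, assembly, name, start, end, strand = location.split(":")
    return Region(coord_system, assembly, name, int(start), int(end), int(strand))


header = ">7 dna:chromosome chromosome:NCBI36:7:1:158821424:1"
region = parse_header(header)
print(region)
print("assembly:", region.assembly, "length:", region.end - region.start + 1)
```

Coordinates from a different assembly of chromosome 7 would not be comparable to these without conversion, which is why tracking the reference version matters.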


The Human Reference
[Figure, panels (a) to (c): nodes labeled A to H joined by edges with numeric labels such as 2(219) and 4(24); layout not recoverable from the text extraction]
                                                       139                                                                                                D
                                                                                                                                                                                                                                                                                         F
                                                                          13(2)                55(3)
                      E                                                                                              D                                          21                                                                                                    158
                                                                                                                                                                                                              32(3)               45(3)
                                                                                                                                                          A
                                                   13(2)                                                                                                                                                                                                                                 C
                                                                                                                                                                                                  s5766
                                                                                                                                                                                   13(2)
                                                                          38(6)
                                                                                                                                                                                                              20(2)
                                                                                                                                                                18
                      F                                                                                              G
                                                                                                                                                          B
                                                                                         8
                                                       8(5)                                                                                                                                                                                                                              A
                                                                                                                                                                                        81
                                                                          18(6)                58(7)                 E
                                                       171                                                                                                C
                      G                                                                                                                                                           123(2)                                                          82
                                                                                                                                                                                                                                                                                         B




             D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in
             large genomes. Genome Biology 2006, 7:R7
} # versioning



sub thank {"you"}




Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

  • 1. Challenges with data quality, sharing, and versioning David Dooling <ddooling@wustl.edu> GIA 2009
  • 2. Production Centers • Tony Cox, Sanger • David Dooling, WUStL Sequencing Scale Scale Quality Infrastructure Sharing Data flow Versioning • Toby Bloom, Broad Quality Integration Standards Sharing <ddooling@wustl.edu>
  • 4. Moore’s Law ,-./011-2# 300.-4/#567# 8,9# :;0.6<-# :-=>-1?-# !quot;quot;quot;# !quot;quot;$# !quot;quot;!# !quot;quot;%# !quot;quot;&# !quot;quot;'# !quot;quot;(# !quot;quot;)# !quot;quot;*# !quot;quot;+# !quot;$quot;# <ddooling@wustl.edu>
  • 5. Images 200 TB/week <ddooling@wustl.edu>
  • 6. Images 10 PB/year <ddooling@wustl.edu>
  • 7. Perspective 20 PB/day <ddooling@wustl.edu>
  • 8. Perspective 2 PB/s <ddooling@wustl.edu>
  • 9. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 7 TB/week <ddooling@wustl.edu>
  • 10. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 350 TB/year <ddooling@wustl.edu>
  • 11. Mapping 2 TB/week <ddooling@wustl.edu>
  • 12. Mapping 100 TB/year <ddooling@wustl.edu>
  • 13. Mapping 42,000 core-hr/week <ddooling@wustl.edu>
  • 14. Mapping 5 core-yr/week <ddooling@wustl.edu>
  • 15. Mapping 250 core cluster <ddooling@wustl.edu>
  • 17. The Balanced PC • Clock speed • AGP • Front-side bus • Hypertransport • 1 Gbps • PCI-X • SATA • PCI-Express • Infiniband • Multi-core • Front-side bus • GPU • 10 Gbps <ddooling@wustl.edu>
  • 18. The balanced PS 1 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 1 - Pipeline for Sequencing <ddooling@wustl.edu>
  • 19. The unbalanced PS 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 60 goto 10 <ddooling@wustl.edu>
  • 33. Quality is Job 1 <ddooling@wustl.edu>
  • 34. ...must be more than just a slogan <ddooling@wustl.edu>
  • 35. Quality missteps Initial low fidelity between base quality values and quality Tsonev, S. SEP 2007 <ddooling@wustl.edu>
  • 36. An aside “basecall calibration predicted vs. observed” <ddooling@wustl.edu>
  • 38. Quality is the key Need high fidelity between prediction and observed 50 bytes per base 20 bytes per base 2 bytes per base 3 bits per base <ddooling@wustl.edu>
  • 39. The down side http://www3.appliedbiosystems.com/cms/ groups/mcb_marketing/documents/ generaldocuments/cms_057559.pdf http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg <ddooling@wustl.edu>
  • 46. Submitted to central repositories <ddooling@wustl.edu>
  • 47. ... and replicated across the pond <ddooling@wustl.edu>
  • 48. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods. <ddooling@wustl.edu>
  • 49. Write-only databases Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 50. Write-only databases x Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 51. Speaking of XML <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <LS454> <STUDY_SET xmlns:xsi=quot;http://www.w3.org/2001/ <EXPERIMENT_SET xmlns:xsi=quot;http://www.w3.org/ <INSTRUMENT_MODEL>GS 20</ <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT XMLSchema-instancequot;> 2001/XMLSchema-instancequot;> INSTRUMENT_MODEL> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <STUDY alias=quot;LowSalternSDbayVir111005quot; <EXPERIMENT ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT accession=quot;SRP000145quot;> alias=quot;LowSalternSDbayVir111005_experimentquot; <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGT ACGTACGTACGTACGTACGTACGTACGTACG</VALUE> <DESCRIPTOR> expected_number_runs=quot;2quot; accession=quot;SRX000217quot;> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT </RUN_ATTRIBUTE> <STUDY_TITLE>Solar Salterns, viral <TITLE>454 sequencing of saltern metagenome ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <RUN_ATTRIBUTE> fraction from low salinity saltern in San Diego, fragment library</TITLE> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</ <TAG>key_sequence</TAG> CA </STUDY_TITLE> <STUDY_REF accession=quot;SRP000145quot; FLOW_SEQUENCE> <VALUE>TCAG</VALUE> <STUDY_TYPE refname=quot;LowSalternSDbayVir111005quot;/> <FLOW_COUNT>168</FLOW_COUNT> </RUN_ATTRIBUTE> existing_study_type=quot;Metagenomicsquot;/> <DESIGN> </LS454> </RUN_ATTRIBUTES> <STUDY_ABSTRACT>Viral community from a <DESIGN_DESCRIPTION>454 Sequencing of </PLATFORM> </RUN> quot;lowquot; salinity saltern and sequenced at 454 Life viral fraction from low salinity saltern in San <PROCESSING> <RUN alias=quot;D1LDSHLquot; instrument_model=quot;454 GS Sciences. 
[Remainder of slide: SRA submission XML shown in interleaved columns (Study, Sample, Experiment, Run). Legible fields: center SDSU, project LowSalternSDbayVir111005 (project ID 28373, NCBI parent project ID 28725); sample SRS000373, saltern metagenome (taxon ID 496920), "viral fraction from low salinity saltern in San Diego, CA", collection_date 11/10/05, lat_lon 32.599040, -117.107356; experiment SRX000217, library lowSalternSDbayVir111005, strategy OTHER, source OTHER, selection RANDOM, single layout, 454BaseCaller with phred quality scores; runs SRR001053 (D0IIGP3, 2006-03-17, 51121 spots, file D0IIGP301.sff) and SRR001054 (D1LDSHL, 2006-04-06, 70935 spots, file D1LDSHL01.sff), instrument model 454 GS 20, run center 454MSC, flow_count 168, key_sequence TCAG.]
  • 54. The Cathedral and the Bazaar: "Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time."
  • 55. The Vatican and the Reformation
  • 56. The popes: Will this scale?
  • 59. The Human Reference >7 dna:chromosome chromosome:NCBI36:7:1:158821424:1 ...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
  • 61. The Human Reference [Figure, panels (a) to (c): repeat structure graphs with segments labeled A to H.] D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7

Editor's Notes

  1. What are the challenges that the large genome centers are currently facing that the typical researcher will be facing soon? Do not store images. Do not store SRF. Keep FASTQ.
  2. This acceleration breaks everything
  3. 3.4*125/75*35 = 198.333333333333
  4. We need to stop having to deal with images. It should be transparent to the end user.
  5. LHC http://atlasexperiment.org/
  6. (90*2+90/125*50)*35 = 7560, uncompressed
  7. For a 75-base read, you need about 200 bytes, 25% of which is headers. Save 12.5% by simply not replicating the sequence header.
  8. 8*90/12*35 = 2100
  9. Cost of software
  10. The chain is only as strong as its weakest link. Images: Assembly line backing up? Keystone cops piling up? Stooges? Transition: a situation not unlike that faced by PC manufacturers over the past decade.
  11. This analogy works on another level as well...
  12. Intel convinced everyone that the speed of the computer was equal to the clock speed of the processor. Many people believed this, even when using a 56k modem, even when the AMD Opteron came out, even when Intel went to multi-core and lower clock speeds. A cautionary tale for those joining the Gb race. Which wraps up the scale up...
  13. ... and leads us into quality
  14. ... and leads us into quality
  15. Make the best small engine in the world
  16. Made high-quality cars for years. Recognized after years of consistent performance.
  17. Now enjoy premium cost and high resale value. Everyone I know has a Honda Odyssey.
  18. Money from the T-bird allowed them to design, develop, and introduce the...
  19. It’s gotten better
  20. Google image search, second or third result. Draw your own conclusions.
  21. This distrust of base calls and quality values has reinforced the cult of traces. This does not scale for human resources, disk space, etc. This leads to a very bad situation for those of us responsible for the computing, storage, and network infrastructure.
  22. Quality is at the core of all other issues: storage, compute, throughput, etc. If it’s a bad base, call it a bad base. Don’t forget the GHz race.
  23. Reducing data to base calls and quality values does reduce its value, especially for data not natively in “base space”. Is there a richness in this data that is lost? But you gain not having to have custom tool tails for each native data type.
  24. 2 bits/base is absolute minimum
  25. Grid
  26. No one ever feels lucky
  27. No one ever feels lucky
  28. They have learned their lesson by creating an incredible amount of XML to submit: Study, Sample, Experiment, Run
  29. He may know a lot about software, but he does not know anything about building cathedrals
  30. Currently, revisions are tightly controlled by central repositories: NCBI, UCSC, EBI
  31. Push and pull around diffs. Balance curation with rapid advances. Debian web of trust.
  32. How far will FASTA get you? C. elegans: part of the genome repeat structure, http://genomebiology.com/2006/7/1/R7. Can you use the current de Bruijn graph assembly engines for alignment?
  33. Talk to me
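
Notes 7 and 24 both come down to bytes per read. As a rough sketch in Python, the FASTQ header overhead and the 2-bit packing floor can be checked as below. The helper names are made up for illustration and are not from the talk; the sketch assumes one byte per character, Unix newlines, and reads containing only A, C, G, and T.

```python
# Hypothetical helpers for illustration only; none of these names come from the talk.

def fastq_record_bytes(name: str, read_length: int, repeat_header: bool = True) -> int:
    """Bytes in one four-line FASTQ record (one byte per character, "\n" line ends)."""
    header = len("@" + name) + 1                                 # "@name\n"
    sequence = read_length + 1                                   # bases + "\n"
    separator = len("+" + (name if repeat_header else "")) + 1   # "+name\n" or "+\n"
    qualities = read_length + 1                                  # one quality char per base + "\n"
    return header + sequence + separator + qualities


def pack_2bit(sequence: str) -> bytes:
    """Pack an A/C/G/T string at 2 bits per base (no N calls, no quality values)."""
    codes = {"A": 0, "C": 1, "G": 2, "T": 3}
    packed = bytearray((len(sequence) + 3) // 4)  # four bases per byte, rounded up
    for i, base in enumerate(sequence):
        packed[i // 4] |= codes[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Inverse of pack_2bit; the base count must be stored separately."""
    bases = "ACGT"
    return "".join(bases[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length))


# Read name and sequence taken from the first record on the FASTQ slide.
READ_NAME = "HWI-EAS404:5:1:6:180#0/1"
READ = "GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT"

full = fastq_record_bytes(READ_NAME, len(READ))
trimmed = fastq_record_bytes(READ_NAME, len(READ), repeat_header=False)
print("FASTQ record bytes:", full)
print("without repeated header:", trimmed)
print("fraction saved:", (full - trimmed) / full)
print("2-bit packed bytes:", len(pack_2bit(READ)))
```

With the 24-character read name and the 75-base read from the FASTQ slide, this gives 204 bytes per record, of which 52 are the two header lines, and 180 bytes when the name is not repeated after the `+`. That is close to the 200 bytes, 25%, and 12.5% quoted in note 7. The same 75 bases pack into 19 bytes at 2 bits per base, but only by discarding the quality values and any N calls.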