Introduction to
Next Generation Sequencing
           Alex Sánchez

        Statistics and Bioinformatics Research Group
        Statistics department, Universitat de Barcelona

        Statistics and Bioinformatics Unit
        Vall d’Hebron Institut de Recerca




     Introduction to NGS      http://ueb.ir.vhebron.net/NGS
Outline

Introduction, Presentation, Goals.
Next generation sequencing technologies.
  Evolution, Description, Comparison.
Bioinformatics challenges.
Some aspects of NGS data analysis.
  NGS data, and data preprocessing (QC)
  Types of analyses, workflows, tools
Conclusions and perspectives




Who, where, what?




Introduction




Why is NGS revolutionary?
• NGS has brought high speed not only to genome
  sequencing and personal medicine,
• it has also changed the way we do genome research

  Got a question on genome organization?


         SEQUENCE IT !!!

                  Ana Conesa, bioinformatics researcher at
                         Principe Felipe Research Center



Sequencing: the Sanger Method (1977)




History of DNA sequencing is related to the combination of new technologies.




The human genome project




Next generation sequencing

            The future is here, now




Next generation Sequencing
• Improvements in enzymes, chemistry and image
  analysis, mature by the middle of the last decade,
  dramatically increased sequencing capabilities.
• The newest type of technology, called “next-generation
  sequencing”, appeared with the potential to dramatically
  accelerate biological and biomedical research
   – by enabling the comprehensive analysis of genomes,
     transcriptomes and interactomes,
   – by tending to become inexpensive, routine and
     widespread, rather than requiring very costly
     production-scale efforts.



NGS technologies




Next-generation DNA sequencing
Sanger sequencing vs. cyclic-array (next-generation) sequencing

Advantages of NGS:
• Construction of a sequencing library, then clonal amplification to
  generate sequencing features
   – No in vivo cloning, transformation, colony picking...
• Array-based sequencing
   – Higher degree of parallelism than capillary-based sequencing

NGS means high sequencing capacity




  • GS FLX 454 (Roche)
  • HiSeq 2000 (Illumina)
  • 5500xl SOLiD (ABI)
  • GS Junior
  • Ion Torrent




NGS Platforms Performance




                                454 GS Junior
                                35MB




454 Sequencing




ABI SOLID Sequencing




Solexa sequencing




Comparison of second-generation sequencing platforms




Some numbers




The sequencing process, in detail
1   Library preparation
     – DNA fragmentation and in vitro adaptor ligation
2   Clonal amplification
     – Emulsion PCR (454, SOLiD)
     – Bridge PCR (Solexa)
3   Cyclic array sequencing
     – Pyrosequencing (454 sequencing)
     – Sequencing-by-ligation (SOLiD platform)
     – Sequencing-by-synthesis (Solexa technology)

Next next generation sequencing
• Pacific Biosciences
  – Real time DNA
    synthesis
  – Up to 12000nt (?)
  – 50 bases/second (?)

• Promises delivery of
  human genome in
  minutes?
  – Company on track for
    2013


Bioinformatics challenges of NGS




I have my sequences/images. Now what?




NGS pushes (bio)informatics needs up
• Need for large amount of CPU power
   – Informatics groups must manage compute clusters
   – Challenges in parallelizing existing software or redesign of
     algorithms to work in a parallel environment
   – Another level of software complexity and challenges to
     interoperability
• VERY large text files (~10 million lines long)
   – Can’t do ‘business as usual’ with familiar tools such as
     Perl/Python.
   – Impossible memory usage and execution time
   – Impossible to browse for problems
• Need sequence Quality filtering
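
One common way around the memory problem is to stream the file record by record instead of loading it whole. The following is only a minimal sketch in Python, assuming a FASTQ file with exactly four lines per record; the helper names `open_maybe_gzip` and `count_reads` are hypothetical and not part of any NGS tool.

```python
import gzip
from itertools import islice


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_reads(path):
    """Count reads and bases while holding only one record in memory."""
    n_reads = 0
    n_bases = 0
    with open_maybe_gzip(path) as handle:
        while True:
            # Assumed layout: header, sequence, '+', qualities
            record = list(islice(handle, 4))
            if len(record) < 4:
                break
            n_reads += 1
            n_bases += len(record[1].strip())
    return n_reads, n_bases
```

Because only four lines are held at a time, memory use stays flat regardless of file size; execution time still grows linearly with the number of reads.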


Data management issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
   – 20 million reads (50bp) ~1Gb
• More of an issue for a facility: HiSeq recommends
  32 CPU cores, each with 4GB RAM

• Certain studies are much more data-intensive than others
   – Whole genome sequencing
      • A 30X coverage genome pair (tumor/normal) ~500 GB
      • 50 genome pairs ~ 25 TB
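
The whole-genome figures above follow from simple multiplication, which is easy to redo for other project sizes. A small sketch, assuming decimal units (1 TB = 1000 GB) and taking the ~500 GB per tumor/normal pair from the slide; the function name `storage_tb` is hypothetical.

```python
GB_PER_GENOME_PAIR = 500  # 30X tumor/normal pair, figure quoted on the slide


def storage_tb(n_pairs, gb_per_pair=GB_PER_GENOME_PAIR):
    """Return the estimated storage need in TB (assuming 1 TB = 1000 GB)."""
    return n_pairs * gb_per_pair / 1000


print(storage_tb(50))  # 25.0
```

The estimate covers only the data volume quoted above; backups and intermediate analysis files would come on top.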




So what?

• In NGS we have to process really big amounts of data,
  which is not trivial in computing terms.

• Big NGS projects require supercomputing infrastructures

• Or put another way: not everyone can do
  everything.
   – Small facilities must carefully choose projects that match
     their computing capabilities.




Computational infrastructure for NGS
• There is great variety, but a good starting point is:

   – Computing cluster
       • Multiple nodes (servers) with multiple cores
       • High performance storage (TB, PB level)
       • Fast networks (10Gb ethernet, infiniband)
   – Enough space and suitable conditions for the equipment
     ("server room")
   – Skilled people (sysadmin, developers)
       • CNAG, in Barcelona: 36 people, more than 50% of them
         informaticians




Big computing infrastructure
• Distributed memory cluster
   –   Starting at 20 computing nodes
   –   160 to 240 cores
   –   amd64 (x86_64) is the most widely used CPU architecture
   –   At least 48 GB RAM per node
• Fast networks
   – 10 Gbit Ethernet
   – InfiniBand
• Batch queue system (SGE, Condor, PBS, Slurm)
• Optional MPI and GPU environments, depending on
  project requirements.



Big infrastructure is expensive
• Starting at 200,000 €
   – 200,000 € is just the hardware
   – Plus data center (server room)
   – Plus informaticians' salaries
• Not every partner knows about supercomputing.
   – SGI
   – Bull
   – IBM
   – HP




Mid-size infrastructure
• "Small" distributed filesystem (around 50 TB).

• "Small" cluster (around 10 nodes, 80 to 120 cores).

• At least a gigabit Ethernet network.

• Price range: 50,000 – 100,000 € (just hardware)
   – plus data center and informaticians' salaries




Small infrastructure
• At least 2 machines recommended
   – 8 or 12 cores per machine.
   – 48 GB RAM minimum per machine.
   – BIG local disk: at least 4 TB per machine
      • As many local disks as we can afford


• Price range: starting at 8,000 – 10,000 € (2 machines)




Alternatives (1): Cloud Computing
•   Pros
     – Flexibility.
     – You pay for what you use.
     – No need to maintain a data center.
•   Cons
     – Transferring big datasets over the internet is
       slow.
     – You pay for consumed bandwidth.
       That is a problem with big datasets.
     – Lower performance, especially in disk
       read/write.
     – Privacy/security concerns.
     – More expensive for big and long-term
       projects.




Alternatives (2): Grid Computing
• Pros
   – Cheaper.
   – More resources available.
• Cons
   – Heterogeneous
     environment.
   – Slow connectivity (especially
     in Spain).
   – Much time required to find
     good resources in the grid.



So what?
• Think before you NGS
• Decide what you …
   – want to do,
   – can afford
   – know how to do
• Consider all alternatives
• Look for expert advice …




NGS data analysis




NGS data analysis stages




Applications of Next-Generation Sequencing
Metagenomics and other community-based “omics”




Zoetendal E G et al.
Gut 2008;57:1605-1615




De novo sequencing




Transcriptomics by NGS: RNASeq




• Analog signal
   – Easy to convey the signal’s information
   – Continuous strength
   – Signal loss and distortion
• Digital signal
   – Harder to achieve & interpret
   – Read counts: discrete values
   – Weak background or no noise




Which software for NGS (data) analysis?
•   The answer is not straightforward
    (see http://seqanswers.com/wiki/Software/list).
•   Many possible classifications
     – Biological domains
         • SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
     – Bioinformatics methods
         • Mapping, Assembly, Alignment, Seq-QC,…
     – Technology
         • Illumina, 454, ABI SOLID, Helicos, …
     – Operating system
         • Linux, Mac OS X, Windows, …
     – License type
         • GPLv3, GPL, Commercial, Free for academic use,…
     – Language
         • C++, Perl, Java, C, Python
     – Interface
         • Web Based, Integrated solutions, command line tools, pipelines,…




Combining tools in a typical workflow




Other popular tools




Quality control and preprocessing of NGS data




Data types




Why QC and preprocessing
• Sequencer output:
   – Reads + quality
• Natural questions
   – Is the quality of my sequenced
     data OK?
   – If something is wrong can I fix it?
• Problem: HUGE files... What
  do they look like?
• Files are flat files and big:
  tens of GB (hard even to
  browse)



Preprocessing sequences improves results




How is quality measured?




•   Sequencing systems usually assign a quality score to each base call (peak).
•   Phred scores are log10-transformed error probabilities:
    if p is the probability that the base call is wrong, the Phred score is
                  $Q = -10 \log_{10} p$
     – score = 20 corresponds to a 1% error rate
     – score = 30 corresponds to a 0.1% error rate
     – score = 40 corresponds to a 0.01% error rate
•   The base calling (A, T, G or C) is performed based on Phred scores.
•   Ambiguous positions with Phred scores <= 20 are labeled with N.
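
The relation between a Phred score and its error probability can be checked numerically. A minimal sketch in Python, using the standard Phred definition (Q is minus ten times the base-10 logarithm of p); the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p (0 < p <= 1) into a Phred score."""
    return -10 * math.log10(p)


# Reproduce the three reference points listed on the slide.
for q in (20, 30, 40):
    print(f"Q = {q}: error rate = {phred_to_error_prob(q):.2%}")
```

Each step of 10 in the score divides the error probability by 10, which is why Q30 is often used as a "good quality" threshold.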



Data formats
• FastA format (everybody knows about it)
   – Header line starts with “>” followed by a sequence ID
   – Sequence (string of nt).


• FastQ format (http://maq.sourceforge.net/fastq.shtml)
   – First comes a header line (like FastA, but starting with “@”), then the sequence
   – Then “+” with an optional sequence ID, and on the following line the
     QVs (quality values) encoded as single-byte ASCII codes
       • Several quality-encoding variants exist


• Nearly all downstream analyses take FastQ as input
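
To make the FastA layout concrete, here is a minimal reader sketched in Python. It assumes the simple layout described above (a “>” header followed by one or more sequence lines); the function name `read_fasta` is hypothetical, and real projects would more likely rely on an established library.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FastA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first word after '>'; the rest is a description.
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)
```

Because it is a generator, it can be fed an open file handle directly and never holds more than one sequence in memory.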



The fastq format
•   A FASTQ file normally uses four lines per sequence.
    – Line 1 begins with a '@' character and is followed by a sequence
      identifier and an optional description (like a FASTA title line).
    – Line 2 is the raw sequence letters.
    – Line 3 begins with a '+' character and is optionally followed by the same
      sequence identifier (and any description) again.
    – Line 4 encodes the quality values for the sequence in Line 2, and must
      contain the same number of symbols as letters in the sequence.
        • Different encodings are in use
        • Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126


@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
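
The example record above can be decoded with a few lines of Python. This is a sketch that assumes the Sanger encoding (ASCII offset 33) described on the slide; other encodings use a different offset, and the function names are hypothetical.

```python
SANGER_OFFSET = 33  # Sanger encoding: Phred 0..93 stored as ASCII 33..126


def decode_qualities(quality_line, offset=SANGER_OFFSET):
    """Turn the ASCII quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, sequence, qualities)."""
    header, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, decode_qualities(quality)


record = [
    "@Seq description",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, quals = parse_fastq_record(record)
print(name, len(seq), quals[:4])
```

The length check mirrors the rule that line 4 must contain as many symbols as line 2 contains letters.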




Some tools to deal with QC
• Use FastQC to see your starting state.

• Use Fastx-toolkit to optimize different datasets and then
  visualize the result with FastQC to prove your success!

• Hints:
   – Trimming, clipping and filtering may improve quality
   – But beware of removing too many sequences…
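
The trade-off in the hints can be illustrated with a simplified 3'-end quality trimmer. This is only an illustrative sketch in Python, not the algorithm used by Fastx-toolkit; the thresholds (`min_quality=20`, `min_length=30`) and function names are hypothetical choices.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Cut low-quality bases from the 3' end until a base reaches min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def filter_reads(reads, min_quality=20, min_length=30):
    """Trim each (sequence, qualities) pair and drop reads that end up too short.

    Returns (kept_reads, n_dropped) so the loss can be monitored.
    """
    kept = []
    dropped = 0
    for sequence, qualities in reads:
        seq, quals = trim_3prime(sequence, qualities, min_quality)
        if len(seq) >= min_length:
            kept.append((seq, quals))
        else:
            dropped += 1
    return kept, dropped
```

Reporting the number of dropped reads alongside the kept ones makes it easy to notice when a threshold is removing too many sequences.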


Go to the tutorial and try the exercises...




Acknowledgements
Statistics and Bioinformatics Research Group, Statistics
department, Universitat de Barcelona.

Xavier de Pedro and Ferran Briansó (but also Jose Luis
Mosquera and Israel Ortega) of the Statistics and
Bioinformatics Unit at VHIR (Vall d’Hebron Institut de
Recerca)

Unitat de Serveis Científico Tècnics (UCTS), the scientific and
technical services unit at VHIR (Vall d’Hebron Institut de Recerca)

People whose materials have been borrowed
   Manel Comabella, Rosa Prieto, Paqui Gallego, Javier
   Santoyo, Ana Conesa, Pablo Escobar, Thomas Girke
   …



Thank you for your attention and patience




