Genomics Course - UAT (VHIR) 2012 - NGS Data Analysis


Published in: Education

  1. 1. Introduction to NGS (Now Generation Sequencing) Data Analysis
      Alex Sánchez
      Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
      Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca
      NGS Data analysis
  2. 2. Outline
      • Introduction
      • Bioinformatics Challenges
      • NGS data analysis: Some examples and workflows
        • Metagenomics, De novo sequencing, Variant detection, RNA-seq
      • Software
        • Galaxy, Genome viewers
      • Data formats and quality control
  3. 3. Introduction
  4. 4. Why is NGS revolutionary?
      • NGS has brought high speed not only to genome sequencing and personal medicine,
      • it has also changed the way we do genome research
      “Got a question on genome organization? SEQUENCE IT!!!” (Ana Conesa, bioinformatics researcher at Principe Felipe Research Center)
  5. 5. NGS means high sequencing capacity
      Instruments shown: GS FLX 454 (Roche), HiSeq 2000 (Illumina), 5500xl SOLiD (ABI), GS Junior, Ion Torrent
  6. 6. NGS Platforms Performance: 454 GS Junior 35MB
  7. 7. 454 Sequencing
  8. 8. ABI SOLiD Sequencing
  9. 9. Solexa sequencing
  10. 10. Applications of Next-Generation Sequencing
  11. 11. Comparison of 2nd NGS
  12. 12. Some numbers
      | Platform | 454/FLX | Solexa (Illumina) | ABI SOLiD |
      |---|---|---|---|
      | Read length | ~350-400 bp | 36, 75, or 106 bp | 50 bp |
      | Single read | Yes | Yes | Yes |
      | Paired-end reads | Yes | Yes | Yes |
      | Long-insert (several Kbp) mate-paired reads | Yes | Yes | No |
      | Number of reads per instrument run | 5.00K | >100 M | 400 M |
      | Max data output | 0.5 Gbp | 20.5 Gbp | 20 Gbp |
      | Run time to 1 Gb | 6 days | >1 day | >1 day |
      | Ease of use (workflow) | Difficult | Least difficult | Difficult |
      | Base calling | Flow space | Nucleotide space | Color space |
      | DNA applications | | | |
      | Whole genome sequencing and resequencing | Yes | Yes | Yes |
      | De novo sequencing | Yes | Yes | Yes |
      | Targeted resequencing | Yes | Yes | Yes |
      | Discovery of genetic variants (SNPs, InDels, CNV, ...) | Yes | Yes | Yes |
      | Chromatin immunoprecipitation (ChIP) | Yes | Yes | Yes |
      | Methylation analysis | Yes | Yes | Yes |
      | Metagenomics | Yes | No | No |
      | RNA applications | Yes | Yes | Yes |
      | Whole transcriptome | Yes | Yes | Yes |
      | Small RNA | Yes | Yes | Yes |
      | Expression tags | Yes | Yes | Yes |
  13. 13. Bioinformatics challenges of NGS
  14. 14. I have my sequences/images. Now what?
  15. 15. NGS pushes (bio)informatics needs up
      • Need for computer power
        • VERY large text files (~10 million lines long)
          – Can’t do ‘business as usual’ with familiar tools such as Perl/Python.
          – Impossible memory usage and execution time
        • Impossible to browse for problems
        • Need for sequence quality filtering
        • Need for large amounts of CPU power
        • Informatics groups must manage compute clusters
        • Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
      • Need for Bioinformatics power!!!
        • The challenge shifts from data generation to data analysis!
        • How should bioinformatics be structured?
          • Bigger centralized bioinformatics services? (or research groups providing service?)
          • Distributed model: bioinformaticians must be part of the teams. Interoperability?
  16. 16. Data management issues
      • Raw data are large. How long should they be kept?
      • Processed data are manageable for most people
        – 20 million reads (50bp) ~1Gb
      • More of an issue for a facility: HiSeq recommends 32 CPU cores, each with 4GB RAM
      • Certain studies are much more data intensive than others
        – Whole genome sequencing
          • A 30X coverage genome pair (tumor/normal) ~500 GB
          • 50 genome pairs ~ 25 TB
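The volumes quoted on this slide are simple products, and a short Python sketch makes the arithmetic explicit. Two assumptions are made here that the slide does not state: "~1Gb" is read as roughly one gigabase of sequence, and storage uses decimal units (1 TB = 1000 GB). The function names are illustrative only.

```python
# Back-of-the-envelope data volumes, using only the figures quoted on the slide.
# Assumptions (not from the slide): "~1Gb" means ~1 gigabase; 1 TB = 1000 GB.

def total_bases(n_reads: int, read_length_bp: int) -> int:
    """Total sequenced bases for a run of fixed-length reads."""
    return n_reads * read_length_bp


def cohort_storage_tb(n_pairs: int, gb_per_pair: float) -> float:
    """Storage needed for a cohort of genome pairs, in terabytes."""
    return n_pairs * gb_per_pair / 1000


if __name__ == "__main__":
    # 20 million reads of 50 bp each
    print(total_bases(20_000_000, 50))
    # 50 tumor/normal pairs at ~500 GB per pair
    print(cohort_storage_tb(50, 500))
```

Running the same two functions with a facility's own read counts and per-sample sizes gives a quick first estimate before committing to a project.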
  17. 17. So what?
      • In NGS we have to process really large amounts of data, which is not trivial in computing terms.
      • Big NGS projects require supercomputing infrastructures
      • Or put another way: it's not the case that anyone can do everything.
        – Small facilities must carefully choose their projects so that they scale with their computing capabilities.
  18. 18. Computational infrastructure for NGS
      • There is great variety, but a good starting point is:
        – Computing cluster
          • Multiple nodes (servers) with multiple cores
          • High performance storage (TB, PB level)
          • Fast networks (10Gb ethernet, infiniband)
        – Enough space and conditions for the equipment ("servers room")
        – Skilled people (sysadmin, developers)
          • CNAG, in Barcelona: 36 people, more than 50% of them informaticians
  19. 19. Alternatives (1): Cloud Computing
      • Pros
        – Flexibility.
        – You pay for what you use.
        – No need to maintain a data center.
      • Cons
        – Transferring big datasets over the internet is slow.
        – You pay for consumed bandwidth. That is a problem with big datasets.
        – Lower performance, especially in disk read/write.
        – Privacy/security concerns.
        – More expensive for big and long-term projects.
  20. 20. Alternatives (2): Grid Computing
      • Pros
        – Cheaper.
        – More resources available.
      • Cons
        – Heterogeneous environment.
        – Slow connectivity (especially in Spain).
        – Much time required to find good resources in the grid.
  21. 21. In summary?
      • “NGS” arrived 2007/8
      • No-one predicted NGS in 2001 (ten years ago)
      • Therefore we cannot predict what we will come up against
      • TGS presents specific challenges
        – Large data storage
        – Technology-aware software
        – Enables new assays and new science
      • We would have said the same about NGS…
      • These are not new problems, but will require new solutions
      • There is a lag between technology and software…
  22. 22. Bioinformatics and bioinformaticians
      • The term bioinformatician means many things
      • Some may require a wide range of skills
      • Others require a depth of specific skills
      • The best thing we can teach is the ability to learn and adapt
        • The spirit of adventure
        • There is a definite skills shortage
        • There always has been
  23. 23. Increasing importance of data analysis needs
  24. 24. NGS data analysis
  25. 25. NGS data analysis stages
  26. 26. Quality control and preprocessing of NGS data
  27. 27. Data types
  28. 28. Why QC and preprocessing
      • Sequencer output:
        – Reads + quality
      • Natural questions
        – Is the quality of my sequenced data OK?
        – If something is wrong, can I fix it?
      • Problem: HUGE files... What do they look like?
      • Files are flat files and big... tens of GB (hard even to browse)
  29. 29. Preprocessing sequences improves results
  30. 30. How is quality measured?
      • Sequencing systems usually assign a quality score to each peak.
      • Phred scores provide log10-transformed error probability values: if p is the probability that the base call is wrong, the Phred score is Q = -10·log10(p)
        – score = 20 corresponds to a 1% error rate
        – score = 30 corresponds to a 0.1% error rate
        – score = 40 corresponds to a 0.01% error rate
      • The base calling (A, T, G or C) is performed based on Phred scores.
      • Ambiguous positions with Phred scores <= 20 are labeled with N.
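The relation between Phred score and error probability can be checked with a few lines of Python. This is only a sketch of the standard definition Q = -10·log10(p) and its inverse; the function names are illustrative, not taken from any particular tool.

```python
import math


def phred_from_error(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def error_from_phred(q: float) -> float:
    """Error probability for a Phred score q."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # Reproduce the three correspondences listed on the slide.
    for q in (20, 30, 40):
        print(q, f"{error_from_phred(q):.4%}")
```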
  31. 31. Data formats
      • FastA format (everybody knows about it)
        – Header line starts with “>” followed by a sequence ID
        – Sequence (string of nt)
      • FastQ format
        – First comes the sequence ID line (like FastA, but starting with “@”), followed by the sequence
        – Then “+” and the sequence ID (optional), and on the following line the QVs encoded as single-byte ASCII codes
          • Different quality encoding variants exist
      • Nearly all downstream analyses take FastQ as input
  32. 32. The fastq format
      • A FASTQ file normally uses four lines per sequence.
        – Line 1 begins with a @ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
        – Line 2 is the raw sequence letters.
        – Line 3 begins with a + character and is optionally followed by the same sequence identifier (and any description) again.
        – Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
      • Different encodings are in use
      • Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
      Example:
        @Seq description
        GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
        +
        !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
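To make the four-line layout concrete, here is a minimal Python sketch of a FASTQ reader with Sanger (offset 33) quality decoding. It assumes strict four-line records with no line wrapping, which real files do not always guarantee. The names `read_fastq` and `decode_sanger` are hypothetical, and the example record follows the one on the slide, with its 60-character quality string being an assumption about the intended text.

```python
import io
from typing import Iterator, List, TextIO, Tuple

SANGER_OFFSET = 33  # Sanger encoding: Phred 0..93 stored as ASCII 33..126


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality_string) for each four-line record."""
    lines = (line.rstrip("\r\n") for line in handle)
    for header in lines:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(lines, None)
        plus = next(lines, None)
        qual = next(lines, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], seq, qual


def decode_sanger(qual: str) -> List[int]:
    """Convert a Sanger-encoded quality string into Phred scores."""
    return [ord(ch) - SANGER_OFFSET for ch in qual]


EXAMPLE = (
    "@Seq description\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

if __name__ == "__main__":
    for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
        print(name, len(seq), decode_sanger(qual)[:5])
```

Because the reader is a generator, it processes one record at a time and never holds the whole file in memory, which matters for files of tens of GB.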
  33. 33. Some tools to deal with QC
      • Use FastQC to see your starting state.
      • Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
      • Hints:
        – Trimming, clipping and filtering may improve quality
        – But beware of removing too many sequences…
      Go to the tutorial and try the exercises...
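As a rough illustration of what trimming and filtering mean, the Python sketch below cuts low-quality bases from the 3' end of a read and then discards reads that end up too short. It is a deliberately simple scheme, not the algorithm used by Fastx-toolkit or any other tool, and the thresholds (`min_q=20`, `min_len=20`) are arbitrary assumptions.

```python
from typing import List, Tuple


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq: str, min_len: int = 20) -> bool:
    """Filter step: keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len


if __name__ == "__main__":
    seq, quals = trim_3prime("ACGTAC", [30, 30, 25, 20, 10, 5])
    print(seq, quals, keep_read(seq, min_len=4))
```

Counting how many reads `keep_read` rejects at a given threshold is one way to notice when filtering is removing too many sequences.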
  34. 34. Applications
      • [1] Metagenomics
      • [2] De novo sequencing
      • [3] Amplicon analysis
      • [4] Variant discovery
      • [5] Transcriptome analysis
      • …and more…
  35. 35. [1] Metagenomics & other community-based “omics”
      Zoetendal E G et al. Gut 2008;57:1605-1615
  36. 36. [1] A metagenomics workflow
      Figure (workflow diagram), recoverable labels: Short reads (40-150 bp); Assembly; Contigs; Gene prediction; ORFs; Homology searching; Proteins, families, functions; Functional classification; Ontologies; Binning (sequences into species); Functional profiles
  37. 37. [1] Metagenomic Approaches
      SMALL-SCALE: 16S rRNA gene profiling
      The basic approach is to identify microbes in a complex community by exploiting universal and conserved targets, such as rRNA genes (Petrosini).
      Challenges and limitations: chimeric sequences caused by PCR amplification, and sequencing errors.
      LARGE-SCALE: Whole Genome Shotgun (WGS)
      Whole-genome approaches make it possible to identify and annotate microbial genes and their functions in the community.
      Challenges and limitations:
        • relatively large amounts of starting material required
        • potential contamination of metagenomic samples with host genetic material
        • high numbers of genes of unknown function
      Environmental Shotgun Sequencing (ESS).
      A primer on metagenomics. PLoS Comput Biol. 2010 Feb 26;6(2):e1000667.
  38. 38. [1] Comparative Metagenomics
      Comparing two or more metagenomes is necessary to understand how genomic differences affect, and are affected by, the abiotic environment.
      MEGAN can also be used to compare the OTU composition of two or more frequency-normalized samples.
      MG-RAST provides a comparative functional and sequence-based analysis for uploaded samples.
      Other software based on phylogenetic data includes UniFrac.
  39. 39. [1] Some Metagenomics projects
      • "Whole-genome shotgun sequencing" was applied to microbial populations. A total of 1.045 billion base pairs of nonredundant sequence were analyzed.
      • "Whole-genome shotgun sequencing": 78 million base pairs of unique DNA sequence were analyzed.
      • To date, 242 metagenomic projects are ongoing and 103 are completed.
  40. 40. [2] De novo sequencing
  41. 41. [3] Amplicon analysis
      Each amplicon (PCR product) is sequenced individually, allowing for the identification of rare variants and the assignment of haplotype information over the full sequence length.
      Some applications:
        ● Detection of low-frequency (<1%) variants in complex mixtures → rare somatic mutations, viral quasispecies... (ultra-deep amplicon sequencing)
        ● Identification of rare alleles associated with hereditary diseases, heterozygote SNP calling... (ultra-broad amplicon sequencing)
        ● Metabolic profiling of environmental habitats, bacterial taxonomy and phylogeny (16S rRNA amplicon sequencing)
  42. 42. [3] Example of raw data generation with GS-FLX...
  43. 43. [3] Data Workflow... Data Processing
  44. 44. [3] Final output examples...
      • NT substitution (error) matrices
      • Bar plots output example (with circular legend for the AA)
      • AA frequency tables
  45. 45. [4] Variant discovery
      Your aligner decides the type/amount of variants you can identify
      • Naive SNP calling: read counting
      • Statistically supported SNP calling: maximum likelihood, Bayesian
      • Quality score recalibration: recalibrate quality scores from the whole alignment
      • Local realignment around indels: realign reads
      • Known variants (limited species): dbSNP
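The "naive SNP calling by read counting" idea can be sketched in Python as follows: at one reference position, count the bases observed in the aligned reads and report the most frequent non-reference base if it is seen often enough. This is an illustration only, not how GATK or any production caller works; the function name and both thresholds (`min_depth`, `min_alt_fraction`) are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def naive_snp_call(
    ref_base: str,
    observed_bases: Iterable[str],
    min_depth: int = 8,
    min_alt_fraction: float = 0.2,
) -> Optional[str]:
    """Return the alternative base if read counts support a variant, else None."""
    bases = list(observed_bases)
    depth = len(bases)
    if depth < min_depth:
        return None  # too few reads to say anything
    alt_counts = Counter(b for b in bases if b != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth >= min_alt_fraction:
        return alt
    return None


if __name__ == "__main__":
    # Ten reads cover the position; four of them show G instead of the reference A.
    print(naive_snp_call("A", "AAAAGGGGAA"))
```

Pure counting ignores base qualities and mapping errors, which is why the slide lists statistically supported callers, recalibration and realignment as the next steps.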
  46. 46. [4] Example: Exome Variant Analysis
  47. 47. [4] Genotype calling tools
  48. 48. [4] GATK pipeline
  49. 49. [4]
  50. 50. [4] Many ongoing sequencing projects
  51. 51. [5] Transcriptome Analysis using NGS
      RNA-Seq, or "Whole Transcriptome Shotgun Sequencing" ("WTSS"), refers to the use of HTS technologies to sequence cDNA in order to get information about a sample's RNA content.
      • Reads are produced by sequencing
      • They are aligned to a reference genome to build transcriptome mappings
  52. 52. [5] Applications (1)
      Whole transcriptome analysis
      • Detects expression of known and novel mRNAs
      • Identification of alternative splicing events
      • Detects expressed SNPs or mutations
      • Identifies allele-specific expression patterns
      Figure labels: mRNA (AAAA); Fragmentation; RT; cDNA library; sequencing
  53. 53. [5] Applications (2)
      Differential expression
      1. Reads are mapped to the reference genome or transcriptome.
      2. Mapped reads are assembled into expression summaries (tables of counts, showing how many reads are in each coding region, exon, gene or junction).
      3. The data are normalized.
      4. Statistical testing of differential expression (DE) is performed, producing a list of genes with P-values and fold changes.
  54. 54. [5] RNA Seq data analysis - Mapping
      • Main issues:
        – Number of allowed mismatches
        – Number of multihits
        – Mates expected distance
        – Considering exon junctions
      • End up with a list of # of reads per transcript
      • These will be our (discrete) response variable
  55. 55. [5] RNA Seq data analysis - Normalization
      • Two main sources of bias
        – Influence of length: counts are proportional to the transcript length times the mRNA expression level.
        – Influence of sequencing depth: the higher the sequencing depth, the higher the counts.
      • How to deal with this
        – Normalize (correct) gene counts to minimize biases.
        – Use statistical models that take into account length and sequencing depth
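One common way to correct for both biases at once is RPKM (reads per kilobase of transcript per million mapped reads). The slide does not name a specific method, so choosing RPKM here is an assumption made for illustration; the gene names and numbers in the example are invented.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase per million mapped reads for each gene.

    RPKM = count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {gene: c * 1e9 / (lengths_bp[gene] * total) for gene, c in counts.items()}


if __name__ == "__main__":
    # geneB has three times the reads of geneA, but it is also three times longer,
    # so both end up with the same length-corrected value.
    counts = {"geneA": 100, "geneB": 300}
    lengths = {"geneA": 1000, "geneB": 3000}
    print(rpkm(counts, lengths))
```

Dividing by the library total handles sequencing depth, and dividing by the length handles the length effect; model-based approaches instead keep raw counts and include these factors as offsets.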
  56. 56. [5] RNA Seq - Differential expression methods
      • Fisher's exact test or similar approaches.
      • Use Generalized Linear Models and model counts using
        – Poisson distribution.
        – Negative binomial distribution.
      • Transform count data to use existing approaches for microarray data.
      • …
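For the first option, a two-sided Fisher's exact test on a 2x2 table can be written with the standard library alone. The sketch below assumes the table layout [[a, b], [c, d]], for example reads from one gene versus all other reads in two conditions; that layout is an assumption for illustration. Exact binomial coefficients become slow for realistic library sizes, where a statistics package would be the practical choice.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    print(fisher_exact_two_sided(8, 2, 1, 5))
```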
  57. 57. [5] Advantages of RNA-seq
      • Unlike hybridization approaches, it does not require existing genomic sequence
        – Expected to replace microarrays for transcriptomic studies
      • Very low background noise
        – Reads can be unambiguously mapped
      • Resolution up to 1 bp
      • High-throughput quantitative measurement of transcript abundance
        – Better than Sanger sequencing of cDNA or EST libraries
      • Cost decreasing all the time
        – Lower than traditional sequencing
      • Can reveal sequence variations (SNPs)
      • Automated pipelines available
  58. 58. Software for NGS preprocessing and analysis
  59. 59. Which software for NGS (data) analysis?
      • Answer is not straightforward.
      • Many possible classifications
        – Biological domains
          • SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
        – Bioinformatics methods
          • Mapping, Assembly, Alignment, Seq-QC, …
        – Technology
          • Illumina, 454, ABI SOLiD, Helicos, …
        – Operating system
          • Linux, Mac OS X, Windows, …
        – License type
          • GPLv3, GPL, Commercial, Free for academic use, …
        – Language
          • C++, Perl, Java, C, Python
        – Interface
          • Web based, Integrated solutions, command line tools, pipelines, …
  61. 61. Some popular tools and places
  62. 62. Site
  63. 63. • Obtain data from many data sources including the UCSC Table Browser, BioMart, WormBase, or your own data.
      • Prepare data for further analysis by rearranging or cutting data columns, filtering data and many other actions.
      • Analyze data by finding overlapping regions, determining statistics, phylogenetic analysis and much more
  64. 64. • User Register
      • Contains links to the downloading, pre-processing and analysis tools
      • Displays menus and data inputs
      • Shows the history of analysis steps, data and result viewing
  65. 65. Click Get Data
  66. 66. Get Data from Database
  67. 67. Upload File: File Format; Upload or paste file
  68. 68. NGS Data analysis
  69. 69. FASTQ file manipulation: format conversion, summary statistics, trimming reads, filtering reads by quality score…
  70. 70. Input: sanger FASTQ. Output: SAM format
  71. 71. Downstream analysis: SAM -> BAM
  72. 72. List saved histories and shared histories. Work on a current history, create new, share workflow. (Copyright OpenHelix. No use or reproduction without express written consent.)
  73. 73. Creates a workflow, allowing the user to repeat the analysis using different datasets.
  74. 74. DATA VISUALIZATION
  75. 75. Why is visualization important?
      • make large amounts of data more interpretable
      • glean patterns from the data
      • sanity check / visual debugging
      • more…
  76. 76. History of Genome Visualization (timeline: 1800s, 1900s, 2000s)
  77. 77. What is a “Genome Browser”
      • linear representation of a genome
      • position-based annotations, each called a track
        – continuous annotations: e.g. conservation
        – interval annotations: e.g. gene, read alignment
        – point annotations: e.g. SNPs
      • user specifies a subsection of genome to look at
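The core operation behind this description, selecting the annotations of a track that fall inside the region the user is viewing, can be sketched in Python. The `Feature` class, the 0-based half-open coordinates and the linear scan are all assumptions chosen for brevity; real browsers use indexed file formats and interval trees instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Feature:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive (assumed convention)
    name: str


def features_in_view(
    track: List[Feature], chrom: str, view_start: int, view_end: int
) -> List[Feature]:
    """Return the features of a track that overlap the viewed region."""
    return [
        f
        for f in track
        if f.chrom == chrom and f.start < view_end and f.end > view_start
    ]


if __name__ == "__main__":
    track = [
        Feature("chr1", 100, 200, "geneA"),  # interval annotation
        Feature("chr1", 500, 501, "snp1"),   # point annotation as a 1-bp interval
        Feature("chr2", 100, 200, "geneB"),
    ]
    print([f.name for f in features_in_view(track, "chr1", 150, 600)])
```

Treating a point annotation as a one-base interval lets a single overlap test serve both interval and point tracks.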
  78. 78. Server-side model (e.g. UCSC, Ensembl, GBrowse)
      • Server: central data store; renders images; sends to client
      • Client: requests images; displays images
  79. 79. Client-side model (e.g. Savant, IGV)
      • Server: stores data
      • Client: local HTS store; renders images; displays images
      Figure label: HTS machine
  80. 80. Rough comparison of Genome Browsers
      |       | UCSC   | Ensembl | GBrowse | Savant | IGV    |
      | Model | Server | Server  | Server  | Client | Client |
      Other rows compared: Interactive; HTS support; Database of tracks; Plugins (rated on the slide as No support / Some support / Good support)
  81. 81. Limitations of most genome browsers
      • do not support multiple genomes simultaneously
      • do not capture 3-dimensional conformation
      • do not capture spatial or temporal information
      • do not integrate well with analytics
      • cannot be customized
      The SAVANT GENOME BROWSER has been created to overcome these limitations
  82. 82. Integrative Genomics Viewer (IGV)
      The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated datasets. It supports a wide variety of data types including sequence alignments, microarrays, and genomic annotations.
  83. 83. Acknowledgements
      • Statistics and Bioinformatics Research Group, Department of Statistics, Universitat de Barcelona.
      • All the members of the Unitat d’Estadística i Bioinformàtica, VHIR (Vall d’Hebron Institut de Recerca)
      • Unitat de Serveis Científico Tècnics (UCTS), VHIR (Vall d’Hebron Institut de Recerca)
      • People whose materials have been borrowed or who have contributed with their work
        – Manel Comabella, Rosa Prieto, Paqui Gallego, Javier Santoyo, Ana Conesa, Thomas Girke and Silvia Cardona.…
  84. 84. Thank you for your attention and patience