Large Scale Resequencing: Approaches and
   Challenges


    Thomas Keane
    Vertebrate Resequencing Informatics group
    Wellcome Trust Sanger Institute
    Hinxton, Cambridge, UK

    thomas.keane@sanger.ac.uk



AGBT Tutorial Workshop   15th February, 2012
Sanger total sequence (2007-2009)
 [Chart: total sequence in Gbp]




Sanger total sequence to-date
 [Chart: total sequence in Gbp]




Vertebrate Resequencing Informatics Group

     Established in 2008 with Jim Stalker
         PIs: Richard Durbin and David Adams
     Initial projects
         1000 Genomes project (http://www.1000genomes.org)
               Data processing, releases, aligner evaluation, sequencing
               Pilot 2008-2009: ~5Tbp (Nature 2010;467)
               Phase 1 2009-2011: ~30Tbp
               Phase 2 2011-: ~36.9Tbp (low-coverage Illumina only)
         Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
               Sequencing 17 laboratory mouse strains
               SNPs, indels, SVs, de novo assembly
               ~1.2Tbp (Nature 2011;477)


UK10K

 Investigating the role of rare genetic variants in health and disease
 Whole genome cohorts: 4,000 individuals across two well-established and deeply
 phenotyped UK cohorts with ongoing longitudinal phenotype collection:
     TWINSUK – 2,000
     ALSPAC – 2,000
     6x (18Gbp) per sample

 Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
    Neurodevelopmental diseases – 3,000
        e.g. schizophrenia, autism spectrum disorders
    Obesity – 2,000
        e.g. severe childhood onset obesity
    Rare diseases – 1,000
        e.g. severe insulin resistance, congenital heart disease, ciliopathies
    5Gbp per sample

 Expect to generate ~100Tbp by end 2012
    ~40Tbp from BGI
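
 As a rough cross-check of the ~100Tbp figure, the per-sample targets can be multiplied out. The sketch below uses only the sample counts and per-sample yields stated on this slide; the variable and function names are illustrative and not part of any UK10K tooling.

```python
# Back-of-envelope check of the UK10K sequencing target.
# Sample counts and per-sample yields come from the slide;
# all names here are illustrative assumptions.
GBP_PER_TBP = 1000

cohorts = {
    "whole_genome": {"samples": 4000, "gbp_per_sample": 18},  # 6x per sample
    "exome": {"samples": 6000, "gbp_per_sample": 5},
}


def total_tbp(cohort):
    """Total planned sequence for one cohort, in Tbp."""
    return cohort["samples"] * cohort["gbp_per_sample"] / GBP_PER_TBP


totals = {name: total_tbp(cohort) for name, cohort in cohorts.items()}
grand_total = sum(totals.values())

for name, tbp in totals.items():
    print(f"{name}: {tbp:.0f} Tbp")
print(f"total: {grand_total:.0f} Tbp")
```

 Under these numbers the whole genomes account for 72 Tbp and the exomes for 30 Tbp, which is consistent with the ~100Tbp expectation.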


Current Status




 Recently passed the 1000 Genomes Project in terms of total Gbp
What are the challenges?



 NGS challenges
     Storage
     Software/Workflows
     Compute
     Power


Data Production Workflow


 [Diagram: lane-level data merged up to sample level]
   Import + Improvement (per lane)
       Fastq
       Alignment (bwa, smalt etc) to BAM
       BAM Improvement
   Merge Up
       Library merge: lane BAMs to one BAM per library
       Sample merge: library BAMs to one BAM per sample/platform (e.g. NA34842, NA87465)
   Freeze
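
 The merge-up hierarchy in this workflow (lane BAMs into libraries, libraries into samples) can be sketched as a grouping step. This is a minimal illustration, not the group's pipeline code: the lane file names and library names are hypothetical, and only the sample names come from the slide.

```python
from collections import defaultdict

# Hypothetical lane records: (lane BAM path, library, sample).
# Paths and library names are invented for illustration;
# the sample names are the ones shown on the slide.
lanes = [
    ("lane1.bam", "libA", "NA34842"),
    ("lane2.bam", "libA", "NA34842"),
    ("lane3.bam", "libB", "NA34842"),
    ("lane4.bam", "libC", "NA87465"),
]


def merge_plan(lane_records):
    """Group lane BAMs by library, then libraries by sample."""
    by_library = defaultdict(list)
    for bam, library, sample in lane_records:
        by_library[(sample, library)].append(bam)
    by_sample = defaultdict(list)
    for sample, library in by_library:
        by_sample[sample].append(library)
    return dict(by_library), dict(by_sample)


by_library, by_sample = merge_plan(lanes)
for (sample, library), bams in by_library.items():
    print(f"library merge {sample}/{library}: {bams}")
for sample, libraries in by_sample.items():
    print(f"sample merge {sample}: {libraries}")
```

 Each entry in the first mapping corresponds to one library-level merge, and each entry in the second to one sample-level merge.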



Data Production Workflow

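 [Diagram: merging across samples and variant calling]
   Merge across
       Sample BAMs (NA19294, NA18943, NA19305, ... NA19309) merged into
       cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...)
       One read group per sample (RG:NA19294, RG:NA18943, RG:NA19305)
   Variant Calling
       SNPs/indels: samtools, GATK
       Structural variants: Genome STRiP, SVMerge
       VQSR
       BEAGLE/Impute2
       VEP Annotation
       Final VCF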

Storage Challenges

 Expect ~200Tbp of sequence in 2011-2012
   Working storage estimate, including processing, release, and variant calling:
   10 bytes per bp
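
 Applying the 10 bytes per bp rule of thumb to the ~200Tbp expectation gives a feel for the storage footprint. A minimal sketch using only the two figures from this slide; the function name is illustrative and decimal units (1 PB = 10^15 bytes) are assumed.

```python
# Storage estimate from the slide's rule of thumb.
BYTES_PER_BP = 10    # working estimate from the slide
EXPECTED_TBP = 200   # expected sequence for 2011-2012, from the slide


def storage_bytes(tbp, bytes_per_bp=BYTES_PER_BP):
    """Bytes of storage needed for `tbp` terabases of sequence."""
    return tbp * 10**12 * bytes_per_bp


total = storage_bytes(EXPECTED_TBP)
print(f"{total / 10**15:.1f} PB")  # decimal petabytes (assumed unit)
```

 With these inputs the estimate comes to about 2 PB.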

 Storage considerations
   Scalability – can we easily add more storage units?
   Backup and disaster recovery – what do we really need to keep?
   Performance – sufficient I/O throughput to serve compute nodes
   Cost

 Data Formats
   Standardised formats – BAM & VCF 

 Minimise the number of copies
   Aim for two copies at most – original lanes + release (stripped) BAM

A Tiered Storage Solution


 [Diagram: three storage tiers connected to the CPU Farm, top to bottom]
 Cost    Size    Connection
 2       1       3Gb/sec to CPU Farm
 1       3       800Mb/sec to CPU Farm
 2       2       Off-site, Off-site
       Level 1
           Data: Current release vertical BAMs
           Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
       Level 2
           Data: Lane level BAMs
           Processes: Alignment, recalibration, local realignment
       Level 3
           Data: Previous release BAMs + variant calls backup

Data release + archiving: iRODS

 [Diagram: iRODS spanning nfs01, nfs02, nfs03, nfs20 and off-site]
 Rule-Oriented Data management system (iRODS)
     Open source – origins in particle physics world
     Most important feature of iRODS is the Rule Engine
     Akin to source control system
 Customise own application level metadata
     e.g. run, lane, plex, sample, library…
 Stores/searches key-value metadata on files:
            List all files from UK10K studies:
                     imeta -z seq qu -d study like 'UK10K_%'
                          /seq/5363/5363_1.bam
                          /seq/5363/5363_2.bam (.....and a whole lot more)
                Get metadata about a file:
                     imeta ls -d /seq/6534/6534_3#7.bam sample
                          attribute: sample
                          value: QTL191953
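
 Metadata queries like these are easy to wrap in a script. The sketch below only builds the argument list for the query shown above and parses `attribute:` / `value:` lines of the kind shown in the `imeta ls` example; it does not call iRODS. Both helper names are hypothetical, and the parser assumes one value per attribute.

```python
import shlex


def imeta_query_cmd(attribute, pattern, zone="seq"):
    """Build the argument list for an `imeta qu -d` query (hypothetical helper)."""
    return ["imeta", "-z", zone, "qu", "-d", attribute, "like", pattern]


def parse_imeta_ls(output):
    """Parse 'attribute:' / 'value:' lines into a dict (hypothetical helper).

    Lines with other keys, or without a colon, are ignored. If an attribute
    appears more than once, the last value wins.
    """
    metadata = {}
    attribute = None
    for line in output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "attribute":
            attribute = value.strip()
        elif key == "value" and attribute is not None:
            metadata[attribute] = value.strip()
            attribute = None
    return metadata


# Output text copied from the slide's example.
sample_output = """attribute: sample
value: QTL191953
"""

print(shlex.join(imeta_query_cmd("study", "UK10K_%")))
print(parse_imeta_ls(sample_output))
```

 In practice the argument list would be passed to something like `subprocess.run` on a host with the iRODS client tools installed.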

 Sanger production: BAM files from runs per lane per plex deposited
      BMC Bioinformatics 2011, 12:361

 Recently adopted for UK10K internal data release and archiving
      Users use meta-data queries to find their data
      Files can be part of multiple releases
                                                                              http://www.irods.org

Compute Pipeline Management: VRPipe

 VRPipe
   Managed and automated execution of sequences of arbitrary
     software against massive datasets across large compute clusters
   Error handling, optimal memory requests, batching of jobs, retrying
     failures, failure reporting, highly extendable, detailed job statistics
 1000 Genomes Phase 2 processed through VRPipe
   Tracked ~1 million jobs
   Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
   bwa_aln_fastq: ~2443 days total serial wall time
   Mean memory: 941MB/job (max 5637MB)
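
 These totals can be turned into per-job figures. The sketch below is simple arithmetic on the numbers quoted above; the job count and the bwa_aln_fastq time are both approximate on the slide, so the results are rough.

```python
from datetime import timedelta

# Figures quoted on the slide; the job count and bwa time are approximate.
total_wall = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_aln_wall = timedelta(days=2443)
jobs = 1_000_000

# Mean serial wall time per tracked job, in minutes.
mean_job_minutes = total_wall.total_seconds() / jobs / 60

# Share of all serial wall time spent in the bwa_aln_fastq step.
bwa_fraction = bwa_aln_wall / total_wall

print(f"mean wall time per job: {mean_job_minutes:.1f} min")
print(f"bwa_aln_fastq share of wall time: {bwa_fraction:.0%}")
```

 On these figures the average job runs for roughly a quarter of an hour, and the bwa alignment step accounts for about a quarter of the total serial wall time.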
 2012
   Fully migrate all NGS processes to VRPipe (data processing, SNP/
     indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
   Management front-ends
   Create distributable VM for cloud rollout
 http://www.github.com/VertebrateResequencing/vr-pipe/wiki
 Contact: sb10@sanger.ac.uk

Even more scale up in 2012 – HiSeq 2500

 Currently takes 1-2 weeks to sequence a human genome
   High depth human genomes in a single day – Illumina HiSeq
     2500
   Caucasian family with a severe T-cell deficiency in affected
     sibling
   Single run on HiSeq 2500 by Illumina per individual

   Sample      PF Yield (Gbp)   % Align   % ≥Q30   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
   Father      117.7            89        92.6     0.4               0.5               25.5
   Mother      125.7            90.2      92.8     0.4               0.5               25.5
   Affected    124.4            90.3      92.4     0.4               0.5               25.5
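
 The yields in this table can be converted into approximate depth of coverage. The sketch below assumes a genome size of 3 Gbp, which is the figure implied by "6x (18Gbp) per sample" on the UK10K slide; the aligned depth simply scales the PF yield by the alignment percentage and ignores duplicates and uneven coverage.

```python
# Assumption: ~3 Gbp genome, as implied by "6x (18Gbp)" on the UK10K slide.
GENOME_GBP = 3.0

# (sample, PF yield in Gbp, % aligned) copied from the table.
runs = [
    ("Father", 117.7, 89.0),
    ("Mother", 125.7, 90.2),
    ("Affected", 124.4, 90.3),
]


def depth(yield_gbp, pct_aligned=100.0, genome_gbp=GENOME_GBP):
    """Approximate fold coverage from yield and alignment rate."""
    return yield_gbp * pct_aligned / 100.0 / genome_gbp


depths = {
    name: (depth(gbp), depth(gbp, aligned)) for name, gbp, aligned in runs
}
for name, (raw, aligned) in depths.items():
    print(f"{name}: ~{raw:.0f}x raw, ~{aligned:.0f}x aligned")
```

 Under these assumptions each single run lands in the region of 35x to 42x, which is what "high depth" refers to here.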




What does the data look like?




Upcoming Changes in 2012

 We cannot keep all of the data
   2007-2008: Keep everything including images from runs
   2009: BAM/Fastq – all of the base quality information
   2010-2011: Stripping original qualities and other unused tags
   2012-: Current formats contain lots of repetition
       Reference based compression
       Reducing quality information e.g. quality binning or quality
       budgets
       Potential formats: CRAM and/or Reduced BAM
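
 Quality binning can be illustrated with a small lookup: each base quality is replaced by a representative value for its bin, so fewer distinct values remain and the data compresses better. The bin boundaries below are hypothetical, chosen only for illustration; they are not taken from the slides or from any specific tool.

```python
# Hypothetical bins: (lowest quality in bin, representative value).
# These boundaries are illustrative assumptions only.
BINS = [(0, 0), (2, 6), (10, 15), (20, 22), (25, 27), (30, 33), (35, 37), (40, 40)]


def bin_quality(q):
    """Map one Phred quality score to its bin's representative value."""
    representative = BINS[0][1]
    for low, value in BINS:
        if q >= low:
            representative = value
        else:
            break
    return representative


def bin_qualities(quals):
    """Bin a list of Phred quality scores."""
    return [bin_quality(q) for q in quals]


original = [2, 11, 24, 30, 38, 40, 31, 12, 39]
binned = bin_qualities(original)
print(binned)
print(f"distinct values: {len(set(original))} -> {len(set(binned))}")
```

 The transformation is lossy by design: the original scores cannot be recovered, which is the trade-off against storage.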




CRAM Format
 [Figure: CRAM models for compression]
     Example read TGAGCTCTAAGTACC with qualities 329183050298757
     Quality lossy, horizontal model: 002020010022212
     Quality lossy, vertical model:   -2---30---9---7
     Scale marked 100, 10, 1, 0.1: Do nothing, Lossless, Quality lossy (Horizontal, Vertical)

 [Figure: CRAM current performance]
     Untreated, CRAM lossless, CRAM combination model, CRAM substitutions/insertions model


    CRAM v0.6 released 13.2.12:
    •  Pairing information preservation regardless of distance
    •  Revised and improved lossless mode
    •  Option to preserve all unmapped reads
    •  Performance and bug fixes
    •  Arbitrary tags
                                  http://www.ebi.ac.uk/ena/about/cram_toolkit
                                                                                         Source: Ewan Birney/Guy Cochrane, EBI
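
 The idea behind reference-based compression is that a mapped read can be stored as its position plus only the bases that differ from the reference. The sketch below is a deliberately simplified illustration of that idea (substitutions only, no indels, no quality scores); it is not the CRAM format or the CRAM toolkit API. The function names and the toy reference are hypothetical.

```python
def encode_read(reference, position, read):
    """Store a mapped read as (position, length, [(offset, base), ...])."""
    segment = reference[position:position + len(read)]
    if len(segment) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(segment, read))
        if ref_base != base
    ]
    return position, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read from the reference and the stored differences."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Toy reference containing the slide's example sequence at position 4.
reference = "ACGT" + "TGAGCTCTAAGTACC" + "GGA"
read = "TGAGCTCTTAGTACC"  # one substitution relative to the reference

record = encode_read(reference, 4, read)
print(record)
print(decode_read(reference, record) == read)
```

 A read that matches the reference exactly needs no per-base storage at all, which is where most of the saving comes from.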

Any questions?




 Richard Durbin
 David Adams

 URLs
  •  VRPipe: https://github.com/VertebrateResequencing/vr-pipe
  •  iRODS@Sanger: BMC Bioinformatics 2011, 12:361
  •  http://www.slideshare.net/thomaskeane

AGBT Tutorial Workshop   15th February, 2012

More Related Content

Viewers also liked

Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
Torsten Seemann
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Torsten Seemann
 
The Best Way to Optimize Physician Workflow
The Best Way to Optimize Physician WorkflowThe Best Way to Optimize Physician Workflow
The Best Way to Optimize Physician Workflow
Health Catalyst
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Torsten Seemann
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
Torsten Seemann
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
Thomas Keane
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-Editing
Thomas Keane
 
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Игорь Шадеркин
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Torsten Seemann
 
Maternal Fetal Medicine 2017
Maternal Fetal Medicine 2017Maternal Fetal Medicine 2017
Maternal Fetal Medicine 2017
Valeriya Chesanova
 
The Real Opportunity of Precision Medicine and How to Not Miss Out
The Real Opportunity of Precision Medicine and How to Not Miss OutThe Real Opportunity of Precision Medicine and How to Not Miss Out
The Real Opportunity of Precision Medicine and How to Not Miss Out
Health Catalyst
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
Thomas Keane
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
Torsten Seemann
 
Key Issues on the Economics of Precision Medicine
Key Issues on the Economics of Precision MedicineKey Issues on the Economics of Precision Medicine
Key Issues on the Economics of Precision Medicine
HEHTAslides
 
The Scottish Ecosystem for Precision Medicine
The Scottish Ecosystem for Precision MedicineThe Scottish Ecosystem for Precision Medicine
The Scottish Ecosystem for Precision Medicine
HEHTAslides
 
Rheumatoid Arthritis: Too expensive to treat, too expensive to fail
Rheumatoid Arthritis: Too expensive to treat, too expensive to failRheumatoid Arthritis: Too expensive to treat, too expensive to fail
Rheumatoid Arthritis: Too expensive to treat, too expensive to fail
HEHTAslides
 
Stem cell personalized medicine 2017 plus
Stem cell personalized medicine 2017 plusStem cell personalized medicine 2017 plus
Stem cell personalized medicine 2017 plus
Avi Dey
 
Six secrets-to-closing-sale
Six secrets-to-closing-saleSix secrets-to-closing-sale
Six secrets-to-closing-sale
Benjamin Brown
 
ECCMID 2015 Meet-The-Expert: Bioinformatics Tools
ECCMID 2015 Meet-The-Expert: Bioinformatics ToolsECCMID 2015 Meet-The-Expert: Bioinformatics Tools
ECCMID 2015 Meet-The-Expert: Bioinformatics Tools
Nick Loman
 

Viewers also liked (19)

Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015Long read sequencing - LSCC lab talk - fri 5 june 2015
Long read sequencing - LSCC lab talk - fri 5 june 2015
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 
The Best Way to Optimize Physician Workflow
The Best Way to Optimize Physician WorkflowThe Best Way to Optimize Physician Workflow
The Best Way to Optimize Physician Workflow
 
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
Cleaning illumina reads - LSCC Lab Meeting - Fri 23 Nov 2012
 
Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012Assembling NGS Data - IMB Winter School - 3 July 2012
Assembling NGS Data - IMB Winter School - 3 July 2012
 
Multiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotationsMultiple mouse reference genomes and strain specific gene annotations
Multiple mouse reference genomes and strain specific gene annotations
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-Editing
 
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
Antimicrobial resistance (AMR) in N. gonorrhoeae (GC) - global problem but v...
 
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
Comparing bacterial isolates - T.Seemann - IMB winter school 2016 - fri 8 jul...
 
Maternal Fetal Medicine 2017
Maternal Fetal Medicine 2017Maternal Fetal Medicine 2017
Maternal Fetal Medicine 2017
 
The Real Opportunity of Precision Medicine and How to Not Miss Out
The Real Opportunity of Precision Medicine and How to Not Miss OutThe Real Opportunity of Precision Medicine and How to Not Miss Out
The Real Opportunity of Precision Medicine and How to Not Miss Out
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015De novo genome assembly - IMB Winter School - 7 July 2015
De novo genome assembly - IMB Winter School - 7 July 2015
 
Key Issues on the Economics of Precision Medicine
Key Issues on the Economics of Precision MedicineKey Issues on the Economics of Precision Medicine
Key Issues on the Economics of Precision Medicine
 
The Scottish Ecosystem for Precision Medicine
The Scottish Ecosystem for Precision MedicineThe Scottish Ecosystem for Precision Medicine
The Scottish Ecosystem for Precision Medicine
 
Rheumatoid Arthritis: Too expensive to treat, too expensive to fail
Rheumatoid Arthritis: Too expensive to treat, too expensive to failRheumatoid Arthritis: Too expensive to treat, too expensive to fail
Rheumatoid Arthritis: Too expensive to treat, too expensive to fail
 
Stem cell personalized medicine 2017 plus
Stem cell personalized medicine 2017 plusStem cell personalized medicine 2017 plus
Stem cell personalized medicine 2017 plus
 
Six secrets-to-closing-sale
Six secrets-to-closing-saleSix secrets-to-closing-sale
Six secrets-to-closing-sale
 
ECCMID 2015 Meet-The-Expert: Bioinformatics Tools
ECCMID 2015 Meet-The-Expert: Bioinformatics ToolsECCMID 2015 Meet-The-Expert: Bioinformatics Tools
ECCMID 2015 Meet-The-Expert: Bioinformatics Tools
 

Similar to Large Scale Resequencing: Approaches and Challenges

Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing era
Thomas Keane
 
Bobcat hotchips final 8 2 10
Bobcat hotchips final 8 2 10Bobcat hotchips final 8 2 10
Bobcat hotchips final 8 2 10
mbasford
 
Efficient Parallel Set-Similarity Joins Using MapReduce - Poster
Efficient Parallel Set-Similarity Joins Using MapReduce - PosterEfficient Parallel Set-Similarity Joins Using MapReduce - Poster
Efficient Parallel Set-Similarity Joins Using MapReduce - Poster
rvernica
 
Jaguar x86 Core Functional Verification
Jaguar x86 Core Functional VerificationJaguar x86 Core Functional Verification
Jaguar x86 Core Functional Verification
DVClub
 
Netgear ReadyNAS Comparison
Netgear ReadyNAS ComparisonNetgear ReadyNAS Comparison
Netgear ReadyNAS Comparison
Altaware, Inc.
 
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
Rob Shakir
 
Asml Euv Use Forecast
Asml Euv Use ForecastAsml Euv Use Forecast
Asml Euv Use Forecast
Knowledge_Broker
 
Public Presentation, ASML EUV forecast Jul 2010
Public Presentation, ASML EUV forecast Jul 2010Public Presentation, ASML EUV forecast Jul 2010
Public Presentation, ASML EUV forecast Jul 2010
JVervoort
 
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
Swiss Big Data User Group
 
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
ESS BILBAO
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
cursoNGS
 
AMD technologies for HPC
AMD technologies for HPCAMD technologies for HPC
AMD technologies for HPC
Joshua Mora
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
Benchmarker - A Good Friend for Performance
Benchmarker - A Good Friend for PerformanceBenchmarker - A Good Friend for Performance
Benchmarker - A Good Friend for Performance
kwatch
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み
Ryousei Takano
 
BGP Error Handling (NANOG 51)
BGP Error Handling (NANOG 51)BGP Error Handling (NANOG 51)
BGP Error Handling (NANOG 51)
Rob Shakir
 

Similar to Large Scale Resequencing: Approaches and Challenges (16)

Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing era
 
Bobcat hotchips final 8 2 10
Bobcat hotchips final 8 2 10Bobcat hotchips final 8 2 10
Bobcat hotchips final 8 2 10
 
Efficient Parallel Set-Similarity Joins Using MapReduce - Poster
Efficient Parallel Set-Similarity Joins Using MapReduce - PosterEfficient Parallel Set-Similarity Joins Using MapReduce - Poster
Efficient Parallel Set-Similarity Joins Using MapReduce - Poster
 
Jaguar x86 Core Functional Verification
Jaguar x86 Core Functional VerificationJaguar x86 Core Functional Verification
Jaguar x86 Core Functional Verification
 
Netgear ReadyNAS Comparison
Netgear ReadyNAS ComparisonNetgear ReadyNAS Comparison
Netgear ReadyNAS Comparison
 
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
BGP Error Handling - Developing an Operator-Led Approach in the IETF (UKNOF 18)
 
Asml Euv Use Forecast
Asml Euv Use ForecastAsml Euv Use Forecast
Asml Euv Use Forecast
 
Public Presentation, ASML EUV forecast Jul 2010
Public Presentation, ASML EUV forecast Jul 2010Public Presentation, ASML EUV forecast Jul 2010
Public Presentation, ASML EUV forecast Jul 2010
 
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
16.07.12 Analyzing Logs/Configs of 200'000 Systems with Hadoop (Christoph Sch...
 
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
ESS-Bilbao Initiative Workshop. Beam Dynamics Codes: Availability, Sophistica...
 
NGS Data Preprocessing
NGS Data PreprocessingNGS Data Preprocessing
NGS Data Preprocessing
 
AMD technologies for HPC
AMD technologies for HPCAMD technologies for HPC
AMD technologies for HPC
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Benchmarker - A Good Friend for Performance
Benchmarker - A Good Friend for PerformanceBenchmarker - A Good Friend for Performance
Benchmarker - A Good Friend for Performance
 
産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み産総研におけるプライベートクラウドへの取り組み
産総研におけるプライベートクラウドへの取り組み
 
BGP Error Handling (NANOG 51)
BGP Error Handling (NANOG 51)BGP Error Handling (NANOG 51)
BGP Error Handling (NANOG 51)
 

More from Thomas Keane

2014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture22014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture2
Thomas Keane
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Thomas Keane
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
Thomas Keane
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
Thomas Keane
 
Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010
Thomas Keane
 
Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010
Thomas Keane
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
Thomas Keane
 

More from Thomas Keane (7)

2014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture22014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture2
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
 
Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010
 
Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 


Large Scale Resequencing: Approaches and Challenges

  • 1. Large Scale Resequencing: Approaches and Challenges
       Thomas Keane
       Vertebrate Resequencing Informatics group
       Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
       thomas.keane@sanger.ac.uk
       AGBT Tutorial Workshop, 15th February, 2012
  • 2. Sanger total sequence (2007-2009) [chart; axis in Gbp]
  • 3. Sanger total sequence to-date [chart; axis in Gbp]
  • 4. Vertebrate Resequencing Informatics Group
       - Established in 2008 with Jim Stalker
           - PIs: Richard Durbin and David Adams
       - Initial projects
           - 1000 Genomes project (http://www.1000genomes.org)
               - Data processing, releases, aligner evaluation, sequencing
               - Pilot 2008-2009: ~5Tbp (Nature 2011;467)
               - Phase 1 2009-2011: ~30Tbp
               - Phase 2 2011-: ~36.9Tbp (LowCov ilmn only)
           - Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
               - Sequencing 17 laboratory mouse strains
               - SNPs, indels, SVs, de novo assembly
               - Approx. 1.2Tbp (Nature 2011;477)
  • 5. UK10K
       Investigating the role of rare genetic variants in health and disease
       Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection:
       - TWINSUK – 2,000
       - ALSPAC – 2,000
       - 6x (18Gbp) per sample
       Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
       - Neurodevelopmental diseases – 3,000
           - e.g. schizophrenia, autism spectrum disorders
       - Obesity – 2,000
           - e.g. severe childhood onset obesity
       - Rare diseases – 1,000
           - e.g. severe insulin resistance, congenital heart disease, ciliopathies
       - 5Gbp per sample
       Expect to generate ~100Tbp by end 2012
       - ~40Tbp from BGI
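As a quick sanity check, the ~100Tbp expectation follows from the cohort sizes and per-sample yields on this slide. The sketch below only redoes that arithmetic; the variable names are illustrative, and decimal units (1 Tbp = 1,000 Gbp) are assumed.

```python
# Back-of-envelope check of the UK10K yield figures quoted on the slide.
GBP_PER_TBP = 1_000  # assumption: decimal units

whole_genomes = 2_000 + 2_000        # TWINSUK + ALSPAC
exomes = 3_000 + 2_000 + 1_000       # neurodevelopmental + obesity + rare diseases

wgs_tbp = whole_genomes * 18 / GBP_PER_TBP   # 18 Gbp per 6x whole genome
exome_tbp = exomes * 5 / GBP_PER_TBP         # 5 Gbp per exome
total_tbp = wgs_tbp + exome_tbp

# Genome size implied by "6x (18Gbp)": 18 Gbp / 6 = 3 Gbp
implied_genome_gbp = 18 / 6

print(f"whole genomes: {wgs_tbp:.0f} Tbp")
print(f"exomes:        {exome_tbp:.0f} Tbp")
print(f"total:         {total_tbp:.0f} Tbp")
```

Whole genomes account for roughly 72 Tbp and exomes for roughly 30 Tbp, which is consistent with the ~100Tbp headline figure.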
  • 6. Current Status
       Recently passed 1000 Genomes in terms of total Gbp [chart]
  • 7. What are the challenges?
       [diagram: NGS, with Storage, Software/Workflows, and Compute Power]
  • 8. Data Production Workflow
       [diagram of the "merge up" from lanes to samples]
       - Import: Fastq files
       - Alignment (bwa, smalt etc) + Improvement: lane-level BAMs
       - Library merge: library-level BAMs
       - Sample/Platform merge: sample-level BAMs (e.g. NA34842, NA87465)
       - Freeze
  • 9. Data Production Workflow
       [diagram: sample BAMs (NA19294, NA18943, NA19305, NA19309, ...) merged across samples into cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...), with read groups RG:NA19294, RG:NA18943, RG:NA19305]
       - Variant Calling: SNPs/indels, SVMerge, samtools, GATK, Genome STRiP, VQSR, BEAGLE/Impute2
       - Annotation: VEP
       - Final VCF
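The merge-up order on slide 8 (lanes into libraries, libraries into samples) can be sketched as a simple grouping step. This is a minimal illustration, not the group's pipeline code: the sample names come from the slide, while the library names, file names, and the `merge_plan` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical lane-level records: (sample, library, lane BAM).
# Sample names are from the slide; library and file names are invented.
LANES = [
    ("NA34842", "libA", "lane1.bam"),
    ("NA34842", "libA", "lane2.bam"),
    ("NA34842", "libB", "lane3.bam"),
    ("NA87465", "libC", "lane4.bam"),
]

def merge_plan(lanes):
    """Group lane BAMs by library, and libraries by sample.

    Returns {sample: {library: [lane BAMs]}}: each inner list would be merged
    into one library BAM, and each sample's library BAMs into one sample BAM.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for sample, library, bam in lanes:
        plan[sample][library].append(bam)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {sample: dict(libs) for sample, libs in plan.items()}

for sample, libs in merge_plan(LANES).items():
    for library, bams in libs.items():
        print(sample, library, bams)
```

Grouping by library first matters because library-level steps such as duplicate marking are normally applied before libraries are combined into a sample BAM.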
  • 10. Storage Challenges
       Expect ~200Tbp of sequence in 2011-2012
       - Working estimate including processing, release, and variant calling
           - 10 bytes per bp
       Storage considerations
       - Scalability – can we easily add more storage units?
       - Backup and disaster recovery – what do we really need to keep?
       - Performance – sufficient I/O throughput to serve compute nodes
       - Cost
       Data Formats
       - Standardised formats – BAM & VCF
       - Minimise the number of copies
       - Aim for two copies at most – original lanes + release (stripped) BAM
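Taken together, the two numbers on this slide give a rough storage requirement. The sketch below is only that multiplication; it assumes decimal units for Tbp and reports the result in both decimal petabytes and binary pebibytes.

```python
# Rough storage requirement from the slide's working estimate.
expected_tbp = 200      # ~200 Tbp of sequence expected in 2011-2012
bytes_per_bp = 10       # working estimate: processing, release, variant calling

total_bytes = expected_tbp * 10**12 * bytes_per_bp
petabytes = total_bytes / 10**15    # decimal PB
pebibytes = total_bytes / 2**50     # binary PiB

print(f"{petabytes:.1f} PB ({pebibytes:.2f} PiB)")
```

At 10 bytes per bp, ~200Tbp works out to about 2 PB, which is why limiting the number of copies to two is such a large lever on cost.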
  • 11. A Tiered Storage Solution
       [diagram: storage levels 1-3 arranged by cost and size, connected to the CPU farm at 3Gb/sec and 800Mb/sec, with off-site copies]
       Level 1
       - Data: Current release vertical BAMs
       - Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
       Level 2
       - Data: Lane level BAMs
       - Processes: Alignment, recalibration, local realignment
       Level 3
       - Data: Previous release BAMs + variant calls backup
  • 12. Data release + archiving: iRODS
       Rule-Oriented Data management system: iRODS
       - Open source – origins in particle physics world
       - Most important feature of iRODS is the Rule Engine
       - Akin to source control system
       [diagram: nfs01, nfs02, nfs03, nfs20, off-site]
       Customise own application level metadata
       - e.g. run, lane, plex, sample, library...
       Stores/searches key-value metadata on files:
       - List all files from UK10K studies:
             imeta -z seq qu -d study like 'UK10K_%'
             /seq/5363/5363_1.bam
             /seq/5363/5363_2.bam (.....and a whole lot more)
       - Get metadata about a file:
             imeta ls -d /seq/6534/6534_3#7.bam sample
             attribute: sample
             value: QTL191953
       Sanger production: BAM files from runs per lane per plex deposited
       - BMC Bioinformatics 2011, 12:361
       Recently adopted for UK10K internal data release and archiving
       - Users use meta-data queries to find their data
       - Files can be part of multiple releases
       http://www.irods.org
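The key-value model behind the two `imeta` queries above can be mimicked with an in-memory toy catalogue. This is a sketch of the idea only and does not use the iRODS API: the file paths and the sample value are from the slide, while the study names and the helper functions are hypothetical. The `like` pattern is treated as SQL-style (`%` matches any run of characters, `_` any single character), which is an assumption about the matching rules.

```python
import re

# Toy catalogue: file path -> key-value metadata.
# Paths and the sample value are from the slide; study names are invented.
CATALOGUE = {
    "/seq/5363/5363_1.bam": {"study": "UK10K_EXAMPLE_A"},
    "/seq/5363/5363_2.bam": {"study": "UK10K_EXAMPLE_B"},
    "/seq/6534/6534_3#7.bam": {"study": "EXAMPLE_OTHER", "sample": "QTL191953"},
}

def like_to_regex(pattern):
    """Translate a SQL-style LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # any run of characters
        elif ch == "_":
            parts.append(".")       # any single character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def find_files(attribute, pattern):
    """Return sorted paths whose `attribute` value matches the LIKE pattern."""
    rx = like_to_regex(pattern)
    return sorted(
        path
        for path, meta in CATALOGUE.items()
        if attribute in meta and rx.fullmatch(meta[attribute])
    )

def get_metadata(path, attribute):
    """Return the value of `attribute` for one file, or None if unset."""
    return CATALOGUE[path].get(attribute)

print(find_files("study", "UK10K_%"))
print(get_metadata("/seq/6534/6534_3#7.bam", "sample"))
```

Because files are found by metadata rather than by path, the same file can appear in several releases without being copied, which matches the "files can be part of multiple releases" point.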
  • 13. Compute Pipeline Management: VRPipe
       VRPipe
       - Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
       - Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
       1000 Genomes Phase 2 processed through VRPipe
       - Tracked ~1 million jobs
       - Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
       - bwa_aln_fastq: ~2443 days total serial wall time
       - Mean memory: 941MB/job (max 5637)
       2012
       - Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
       - Management front-ends
       - Create distributable VM for cloud rollout
       sb10@sanger.ac.uk
       http://www.github.com/VertebrateResequencing/vr-pipe/wiki
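To put the Phase 2 job statistics in perspective, the totals can be converted into per-job and per-step figures. The sketch below just reworks the slide's numbers; the job count is the slide's approximate "~1 million", so the derived values are rough.

```python
from datetime import timedelta

total_jobs = 1_000_000  # "~1 million jobs" on the slide, so an approximation
total_wall = timedelta(days=9886, hours=3, minutes=43, seconds=25)
bwa_aln_wall = timedelta(days=2443)  # "~2443 days" for bwa_aln_fastq

total_seconds = int(total_wall.total_seconds())
mean_minutes_per_job = total_seconds / total_jobs / 60
bwa_fraction = bwa_aln_wall / total_wall   # timedelta / timedelta gives a float
serial_years = total_wall.days / 365.25

print(f"mean wall time per job: {mean_minutes_per_job:.1f} min")
print(f"bwa_aln_fastq share:    {bwa_fraction:.1%}")
print(f"serial equivalent:      {serial_years:.1f} years")
```

On these figures the average job runs for around a quarter of an hour, and the bwa alignment step alone accounts for roughly a quarter of all wall time.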
  • 14. Even more scale up in 2012 – HiSeq 2500
       Currently takes 1-2 weeks to sequence a human genome
       - High depth human genomes in a single day – Illumina HiSeq 2500
       - Caucasian family with a severe T-cell deficiency in affected sibling
       - Single run on HiSeq 2500 by Illumina per individual

       Sample     PF Yield (Gbp)   % ≥Q30 value   % Align   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
       Father     117.7            89             92.6      0.4               0.5               25.5
       Mother     125.7            90.2           92.8      0.4               0.5               25.5
       Affected   124.4            90.3           92.4      0.4               0.5               25.5
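The yields in the table translate into approximate raw depth per individual. This sketch assumes a ~3 Gbp genome (the size implied by "6x (18Gbp)" on slide 5) and ignores alignment rate and duplicates, so the results are upper-bound estimates rather than reported coverage.

```python
GENOME_GBP = 3.0  # assumption: ~3 Gbp human genome, implied by "6x (18Gbp)"

# Yield per individual in Gbp, taken from the table on the slide.
yields_gbp = {"Father": 117.7, "Mother": 125.7, "Affected": 124.4}

def depth(yield_gbp, genome_gbp=GENOME_GBP):
    """Raw sequencing depth: total bases divided by genome size."""
    return yield_gbp / genome_gbp

depths = {name: round(depth(y), 1) for name, y in yields_gbp.items()}

for name, x in depths.items():
    print(f"{name}: ~{x}x")
```

Each single run therefore gives roughly 40x raw depth, compared with the 6x per sample planned for the UK10K whole genome cohorts.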
  • 15. What does the data look like? [figure]
  • 16. Upcoming Changes in 2012
       We cannot keep all of the data
       - 2007-2008: Keep everything including images from runs
       - 2009: BAM/Fastq – all of the base quality information
       - 2010-2011: Stripping original qualities and other unused tags
       - 2012-: Current formats contain lots of repetition
           - Reference based compression
           - Reducing quality information e.g. quality binning or quality budgets
           - Potential formats: CRAM and/or Reduced BAM
  • 17. CRAM Format
       [figure: CRAM models for compression, showing the example read TGAGCTCTAAGTACC with its quality string under horizontal and vertical models (do nothing, lossless, quality lossy)]
       [chart: current CRAM performance on a log scale (100, 10, 1, 0.1), comparing untreated data with the CRAM lossless, substitutions/insertions model, and combination model]
       CRAM v0.6 released 13.2.12:
       - Option to preserve all unmapped reads
       - Pairing information preservation regardless of distance
       - Performance and bug fixes
       - Revised and improved lossless mode
       - Arbitrary tags
       http://www.ebi.ac.uk/ena/about/cram_toolkit
       Source: Ewan Birney/Guy Cochrane, EBI
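The two ideas named on slides 16 and 17, reference-based compression and quality binning, can be illustrated with a toy encoder. This is not the CRAM format or the CRAM toolkit API: the function names, the reference string, and the bin edges are hypothetical, and only substitutions are handled (no indels, clipping, or unmapped reads). The read is the example sequence from the slide.

```python
# Hypothetical reference; the slide's example read occurs at offset 4.
REFERENCE = "ACGTTGAGCTCTAAGTACCGGA"

def encode_read(reference, pos, read):
    """Store a read as (pos, length, [(offset, base), ...]) against the reference.

    Bases that match the reference are not stored at all; only mismatches are.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read runs off the end of the reference")
    diffs = [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]
    return pos, len(read), diffs

def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored mismatches."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

def bin_quality(q, edges=(10, 20, 30)):
    """Lossy quality binning: map a Phred score to the lower edge of its bin."""
    floor = 0
    for edge in edges:
        if q >= edge:
            floor = edge
    return floor

exact = encode_read(REFERENCE, 4, "TGAGCTCTAAGTACC")    # matches the reference
variant = encode_read(REFERENCE, 4, "TGAGCTCGAAGTACC")  # one substitution
print(exact)
print(variant)
print([bin_quality(q) for q in (32, 25, 9)])
```

A read that matches the reference costs only a position and a length, which is where the savings come from in deep resequencing data; binning qualities trades precision for a smaller alphabet that compresses better.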
  • 18. Any questions?
       Richard Durbin
       David Adams
       URLs
       - VRPipe: https://github.com/VertebrateResequencing/vr-pipe
       - iRODS@Sanger: BMC Bioinformatics 2011, 12:361
       - http://www.slideshare.net/thomaskeane