Making powerful science: an
introduction to NGS data analysis
Dr Adam Cribbs
Group leader in systems biology
Botnar Research Centre
Introduction
PhD
Prof Fionula Brennan
Tregs in Rheumatoid Arthritis
Postdoctoral scientist
Prof Sir Marc Feldmann
Prof Udo Oppermann
Epigenetics of T cells
MRC Career development fellowship
Prof Chris Ponting
Dr David Sims
Systems biology
MRC Career development fellowship
PI position
Current research
Purpose of this section
• Introduction to the concepts in NGS data analysis
• Data formats and quality control
• Challenges in data analysis
• Software and pipelines
Applications of NGS
Sequencing of
Genomic DNA
Sequencing of
DNA library
Sequencing of
cDNA library
Whole genome sequencing
• Genome re-sequencing
• de novo genome sequencing
• Metagenomics applications
Epigenetic profiling
• Methylation sequencing
• Nucleosome footprinting
Genomic footprinting
• ChIP sequencing
Targeted sequencing
• PCR-amplified regions
• Capture-enriched DNA
Transcriptome analysis
• Novel RNA classes (lncRNAs)
• Novel splice variants
Transcriptome expression
• mRNA
• Small RNA
RNA footprinting
• Ribosomal footprinting
• RNA-IP sequencing
Bioinformatic challenges
Now I have my data what do I do????
Bioinformatic challenges
• Cost of sequencing a human genome: from 2.7 billion
to hundreds of £
• NGS pushed the need for
bioinformatics and big
data analytics
• Need for power!!
Need for computation
• Need for computer power
• VERY large files (10s of millions of lines)
• Impossible to use familiar tools such as python
• Impossible memory usage and execution time
• Need for a large amount of compute power
• Compute clusters
• Parallel code and multi threading to speed up analysis
• Need for faster software
• Pipelines
• Bioinformatics power!
• Properly structured working
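The memory problem above can be illustrated with a minimal sketch, assuming a plain-text FASTQ file with four lines per record. The function name count_reads and the in-memory demo data are hypothetical. The point is only that iterating over a file handle reads one line at a time instead of loading the whole file.

```python
import io
from typing import TextIO


def count_reads(handle: TextIO) -> int:
    """Count FASTQ records by streaming the handle line by line.

    Assumes the common four-lines-per-record layout, so memory use
    stays roughly constant regardless of file size.
    """
    n_lines = 0
    for _ in handle:  # a file object yields one line at a time
        n_lines += 1
    return n_lines // 4


# Hypothetical two-read example held in memory for demonstration.
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nII#I\n"
)
print(count_reads(demo))
```

For a real file the same function could be called on open("sample.fastq"), or on gzip.open("sample.fastq.gz", "rt") for compressed data; both file names are placeholders.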
Data management issues
• How to store data – very large raw data
• Alternative data structures – e.g. binary storage (bam
files)
• Different study types need different amounts of storage
• RNA-seq – 2 GB per file
• WGS – 500 GB files
• Less of an issue now than it used to be 3-5 years ago
– hardware improvements
Computational clusters
• Multi-nodes (servers) with multi-cores
• High performance storage (expensive)
• In-line storage
• Fast networks (50Gb Ethernet between nodes)
• Located in a single data centre
• Need skilled data-admin staff to monitor and fix
issues
Cloud based analysis
• Pros
• Flexible
• Pay for what you use
• Don’t need to maintain a data centre
• Cons
• Transferring big data over the internet is slow
• You pay for bandwidth
• Lower performance – disk IO
• Privacy/data concerns
• More expensive for long term projects
The future
• NGS arrived in 2007/2008
• No-one predicted NGS in 2001
• How can we really predict the future?
• Problems will always remain:
• Software always lags behind hardware
Bioinformatics and computational biology
• The term bioinformatician can mean many things
• Usually little biology background but quantitative
skills
• Computational biologist is usually someone with a
biology and quantitative background
• There is definitely a massive skills shortage in both
How to learn computational skills
• Introduction to Next-gen data analysis
• EBI in Cambridge - https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/next-generation
• OBDI program
• 3 month short term training for a particular skill
• https://www.imm.ox.ac.uk/research/units-and-centres/mrc-wimm-centre-for-computational-biology/training
• Undertake part of your PhD in a computational group
NGS analysis
NGS data analysis
Raw reads from
sequencer
Quality assessment of
reads
Mapping
Pathway
analysis
Gene
networks
Data storage and
visualisation
Quality control of reads
• Sequencing output:
• Reads + quality
• Flat files – are very large – inefficient but it’s the
standard
• Question: is the quality of my sequencing data good?
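As a minimal sketch of what "reads + quality" means in practice: each base in a FASTQ record carries a Phred score encoded as one ASCII character. The example assumes the Phred+33 encoding used by current Illumina output; the record itself is made up for illustration.

```python
from statistics import mean
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical record: name, sequence, separator, quality string.
record = ["@read1", "ACGTAC", "+", "IIII5#"]
scores = phred_scores(record[3])
print(scores)                 # per-base Phred scores
print(round(mean(scores), 1)) # mean quality of the read
print(error_probability(20))  # chance that a Q20 base call is wrong
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000, which is why these values are common thresholds in QC reports.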
Quality control of reads
• FastQC – Babraham Institute
• https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tools to deal with read QC
• Fastx-toolkit to optimize different datasets
• FastQ Screen – check that your data is not
contaminated
• Trimming to improve quality
• Trimmomatic
• Cutadapt
• There are many many more!
• But beware of removing too many reads or trimming
too much
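The idea behind quality trimming can be sketched as follows. This is a deliberately simple rule for illustration and is not the algorithm used by Trimmomatic or Cutadapt; the function name, the threshold of 20 and the Phred+33 encoding are assumptions.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q.

    Real trimmers use more robust schemes (for example sliding
    windows), so treat this only as a sketch of the principle.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read whose last base has a very low quality score.
seq, qual = trim_3prime("ACGTAC", "IIII5#")
print(seq, qual)
```

Comparing read lengths before and after trimming is a cheap way to notice when a threshold is removing too much data.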
Mapping reads to genome/transcriptome
• It is very important to get the mapping right
• Many different mappers – make sure you use the
latest software
• Always treat your samples consistently
Mapping reads to genome/transcriptome
• Main issues:
• Number of mismatches
• Number of multi-hits
• Expected distance between mates
• Exon junctions
GTF file for mapping
• File format for the reference annotation (gene and transcript features)
Mapping reads to genome
• Which one to use???
• Depends on application
Mapping reads to transcriptome
• Which one to use???
• Depends on application
• Don’t use TopHat or HISAT – use TopHat2 or HISAT2
SAM/BAM format
• Standard mapping output
• Sequence alignment map (SAM)
• Tab delimited
• 11 mandatory fields
1. Read name
2. Flag
3. Reference name
4. Position
5. Mapping quality
6. CIGAR
7. Ref name of mate
8. Pos of mate
9. Template len
10. Seq
11. Qual
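A minimal sketch of reading the mandatory columns of one SAM line, using the field names from the SAM specification. The class name, function name and the alignment line are hypothetical; in practice a library such as pysam would normally do this work.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # 1. read name
    flag: int   # 2. bitwise flag
    rname: str  # 3. reference name
    pos: int    # 4. 1-based leftmost mapping position
    mapq: int   # 5. mapping quality
    cigar: str  # 6. CIGAR string
    rnext: str  # 7. reference name of the mate
    pnext: int  # 8. position of the mate
    tlen: int   # 9. template length
    seq: str    # 10. read sequence
    qual: str   # 11. base qualities


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory columns; optional tags are ignored here."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Made-up alignment line with one optional tag at the end.
line = "read1\t99\tchr1\t100\t60\t4M\t=\t250\t154\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar)
```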
SAM/BAM format
• FLAG
• CIGAR
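The FLAG and CIGAR columns are compact encodings, so a short sketch of decoding them may help. The bit values and operation letters follow the SAM specification; the function names and example values are hypothetical.

```python
import re
from typing import List, Tuple

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar: str) -> int:
    """Bases covered on the reference: M, D, N, = and X consume it."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


print(decode_flag(99))
print(parse_cigar("3M1I2M100N5M"))
print(reference_span("3M1I2M100N5M"))
```

The N operation is what a spliced aligner writes when a read crosses an exon junction, which is why the span on the reference can be much longer than the read itself.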
SAM/BAM tools
• Command line
• Samtools
• view
• index
• sort
• Picard
• MarkDuplicates
• Python
• Pysam – maintained and developed by CGAT (Andreas
Heger)
Workflows: RNA-seq
RNA-seq workflow for DEG
• Workflow 1:
• TopHat2 (align) -> Cufflinks
(transcript assembly) ->
Cuffmerge (merge assemblies) ->
Cuffdiff (DEG)
• Workflow 2:
• HISAT2 (or any spliced
mapper) -> featureCounts
(counting reads per
feature) -> DESeq2 or
edgeR (DEG)
Hisat2 alignment
DESeq2
featurecounts
Generalised linear model
based on the negative
binomial distribution
Count data
• Following featureCounts you are left with a counts table
A few genes have large counts;
most genes have small counts
DEG methods compared
• Which model to use????
• My preference is DESeq2
• Well written and better support
• edgeR may not control type I errors as well?
Microarray
RNA-seq
DESeq2 model
• Model overview:
• First fits a GLM to the data using a per-sample size factor
• Cook's distance for detecting count outliers
• Dispersion is estimated for each gene
• Zero-centred normal prior to shrink fold changes at the low-count end
• Wald test or likelihood ratio test (LRT)
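The size factor step can be sketched with the median-of-ratios idea described in the DESeq2 publications. This is not DESeq2's own code and covers only normalisation, not dispersion estimation, shrinkage or testing. The function name and the count table are hypothetical.

```python
import math
from statistics import median
from typing import Dict, List


def size_factors(counts: Dict[str, List[int]]) -> List[float]:
    """Median-of-ratios size factors for a genes x samples count table.

    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios: List[List[float]] = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) == 0:
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Made-up counts: the second sample was sequenced twice as deeply.
table = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [5, 10],
    "geneD": [0, 3],
}
factors = size_factors(table)
print([round(f, 3) for f in factors])
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so the doubling in sequencing depth is not mistaken for differential expression.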
Pathway analysis
• Pathway analysis helps to identify novel pathways that may be
disease relevant
• Skewed towards cancer
• Not always informative
• Paid vs public
Biological interpretation
• The most important part and most difficult
• Can be a problem when dealing with a company
• Language barrier between biologist and bioinformatician
• Visualising data helps overcome this
Developing pipelines
• To speed up your analysis and make your code
reproducible you need to write pipelines
https://github.com/Acribbs/scflow
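One idea behind pipeline frameworks is that a step is re-run only when its output is missing or out of date. The following is a minimal sketch of that idea using file timestamps. It is not the API of scflow or of any particular workflow tool; the function names and the toy step are hypothetical.

```python
import os
import tempfile
from typing import Callable


def run_if_stale(infile: str, outfile: str, step: Callable[[str, str], None]) -> bool:
    """Run step(infile, outfile) only if outfile is missing or older than infile."""
    if os.path.exists(outfile) and os.path.getmtime(outfile) >= os.path.getmtime(infile):
        return False  # output is up to date, so the step is skipped
    step(infile, outfile)
    return True


def uppercase(infile: str, outfile: str) -> None:
    """Stand-in for a real tool call such as an aligner or a read counter."""
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(src.read().upper())


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.txt")
    out = os.path.join(tmp, "processed.txt")
    with open(raw, "w") as fh:
        fh.write("acgt\n")
    first = run_if_stale(raw, out, uppercase)   # runs the step
    second = run_if_stale(raw, out, uppercase)  # skipped, output is current
    print(first, second)
```

Chaining several such steps gives a pipeline that can be restarted after a failure without repeating completed work, which is part of what makes an analysis reproducible.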
Further resources
• Please email me
• MOOCs:
• Coursera: https://www.coursera.org/learn/bioinformatics-methods-1
• edX: https://www.edx.org/micromasters/bioinformatics
• Programming skills:
• Codecademy
• EBI Introduction to Next-generation sequencing
course - competitive
