Data analysis patterns, tools and data types in genomics

Data analysis patterns, tools and data types in genomics for the uninitiated
BIMSB Sys. Bio. Lectures
Jan 2019
Altuna Akalin

Last week…
• Talked about mindset for research
• Most of the content is here as a blog post:
https://medium.com/@aakalin

This week
• Common data analysis patterns in genomics
• Which tools and data types are relevant in
which step ?
• Ideas on how to get started with learning
bioinformatics
• Programming languages used for data analysis
Slides will be at https://www.slideshare.net/altunaakalin

What can we do with high-throughput
assays
• Which genes are expressed and how much ?
• Where does a transcription factor bind ?
• Which bases are methylated in the genome ?
• Which transcripts are translated ?
• Where do RNA-binding proteins bind ?
• Which microRNAs are expressed ?
• Which parts of the genome are in contact with each other ?
• Where are the mutations in the genome located ?
• Which parts of the genome are nucleosome-free ?
• Many more…

The general idea behind high-throughput techniques
From http://compgenomr.github.io/book

High-throughput sequencing
• AKA massively parallel sequencing
• collection of many methods and technologies
• can sequence DNA, millions of fragments at a
time.

General genomics workflow

General data analysis workflow
Data collection → Quality check & cleaning → Processing → Exploratory analysis & modeling → Visualization & reporting

Data collection
• Where and how you get your data
• Includes publicly available data resources

Data collection for genomics
• Which sequencing technology are you using ?
• What kind of experiments are you doing ?
• How many samples ?
• How many replicates ?
• Which public data will you include in your analysis ?
Data collection → Fastq files

Quality check and clean up
In general,
• Data clean up starts with the data set you get
• Can include removing low quality data points
• Can include removing missing values or incomplete data sets

Quality check and clean up
In genomics,
• Quality check is mostly about checking read quality
• Can involve removing low quality bases from reads
• Can involve removing adapter/barcode sequences from reads
Fastq files → Quality check & cleaning → Fastq files
PS: You can also filter aligned reads based on how well they align; this is ignored here for simplicity
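
A minimal Python sketch of what "removing low quality bases from reads" can look like. It assumes Phred+33 encoded qualities, a made-up read and a hypothetical threshold of 20. Tools such as cutadapt or trimmomatic use more refined algorithms, so this only illustrates the idea.

```python
from statistics import mean


def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_low_quality_tail(sequence, quality_string, min_quality=20):
    """Drop bases from the 3' end until a base with score >= min_quality is found."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# A made-up FASTQ record: header, sequence, separator, quality string.
record = ["@read1", "ACGTACGTAC", "+", "IIIIIIII#!"]
header, sequence, _, quality = record

trimmed_seq, trimmed_qual = trim_low_quality_tail(sequence, quality)
print(header, trimmed_seq, mean(phred_scores(trimmed_qual)))
```

The last two bases of the example read have very low scores, so they are removed and only the high-quality prefix is kept.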

• Example tools:
– Trimming reads: TrimGalore, cutadapt, trimmomatic
– Read quality check: fastqc, multiQC
Fastq files → Quality check & cleaning → Fastq files

Data Processing
In general,
• Transforming raw data to a state where modeling or exploratory data analysis can start
• Can include making a tabular data structure from raw data
• Can include data transformations such as taking logs or normalization
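
As a small illustration of "taking logs or normalization", the Python sketch below scales made-up raw counts to counts per million and then log-transforms them. The gene names, values and the pseudocount of 1 are assumptions for the example, not a recommendation for any particular data set.

```python
import math


def counts_per_million(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def log2_transform(values, pseudocount=1.0):
    """Take log2 after adding a pseudocount so that zeros stay defined."""
    return [math.log2(v + pseudocount) for v in values]


raw_counts = {"geneA": 10, "geneB": 90, "geneC": 900}  # made-up values
genes = list(raw_counts)

cpm = counts_per_million([raw_counts[g] for g in genes])
log_cpm = log2_transform(cpm)

for gene, value in zip(genes, log_cpm):
    print(gene, round(value, 2))
```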

Data Processing
In genomics,
• Alignment + quantification
• Can include further processing/modeling such as calling peaks for ChIP-seq
Fastq files → Processing → SAM/BAM files, BED files, text files

SAM/BAM files
• These are produced by aligners such as, but not limited to, STAR, Bowtie and BWA
• SAM is a tab-delimited text format that contains alignment information
Pavlopoulos, BioData Mining 2013, 6:13
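
Because SAM is tab-delimited text, a single alignment line can be split into its 11 mandatory fields with plain string handling. The Python sketch below uses a made-up alignment line; in practice you would rely on samtools or a dedicated library rather than hand-written parsing.

```python
from collections import namedtuple

# The 11 mandatory SAM fields, in order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line (optional tags are ignored)."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


def is_unmapped(record):
    """Flag bit 0x4 marks a read that did not align."""
    return bool(record.flag & 0x4)


def is_reverse_strand(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Made-up alignment line with one optional tag at the end.
example_line = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
sam_record = parse_sam_line(example_line)
print(sam_record.rname, sam_record.pos, is_reverse_strand(sam_record))
```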

SAM/BAM files
• BAM is the compressed, binary version of SAM; a sorted BAM file can be indexed
• Indexing allows random access to the compressed file
• samtools and friends filter/manipulate BAM files
• More info @ http://samtools.github.io/hts-specs/

BED files
• BED files are produced by aligners or, more frequently, by post-alignment processing
• For example, ChIP-seq peak callers such as MACS2 report peaks in BED-like files
More info https://genome.ucsc.edu/FAQ/FAQformat.html#format1
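
A short Python sketch of how BED intervals can be read and compared. BED start coordinates are 0-based and end coordinates are exclusive, so the length of an interval is simply end minus start. The peak names and coordinates below are made up, and only the first four columns are parsed.

```python
def parse_bed_line(line):
    """Parse chrom, start, end and an optional name from one BED line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),  # 0-based, inclusive
        "end": int(fields[2]),    # exclusive
        "name": fields[3] if len(fields) > 3 else None,
    }


def interval_length(interval):
    """Length in bases; no +1 is needed because the end is exclusive."""
    return interval["end"] - interval["start"]


def overlaps(a, b):
    """True if two intervals on the same chromosome share at least one base."""
    return (a["chrom"] == b["chrom"]
            and a["start"] < b["end"]
            and b["start"] < a["end"])


# Made-up peaks.
peak1 = parse_bed_line("chr1\t100\t200\tpeak1")
peak2 = parse_bed_line("chr1\t150\t250\tpeak2")
peak3 = parse_bed_line("chr1\t200\t300\tpeak3")

print(interval_length(peak1), overlaps(peak1, peak2), overlaps(peak1, peak3))
```

Note that peak1 and peak3 touch but do not overlap, which follows from the exclusive end coordinate.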

Non-standard text files
• Alignment quantification tools such as featureCounts or htseq-count output plain text files
• These contain the number of reads per transcript or gene across samples
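
The Python sketch below reads a small, made-up count table (one row per gene, one column per sample) and sums the reads per sample. The layout is a simplified assumption: real featureCounts or htseq-count output has its own header and extra columns, so the parsing would need to be adapted.

```python
import csv
import io

# Made-up tab-delimited count table: gene name followed by one column per sample.
table_text = (
    "gene\tcontrol_1\tcontrol_2\ttreated_1\n"
    "geneA\t10\t12\t30\n"
    "geneB\t0\t1\t0\n"
    "geneC\t100\t95\t110\n"
)


def read_count_table(handle):
    """Return sample names and a dict mapping each gene to its list of counts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {row[0]: [int(value) for value in row[1:]] for row in reader}
    return samples, counts


def library_sizes(samples, counts):
    """Total number of counted reads per sample."""
    totals = [0] * len(samples)
    for values in counts.values():
        for i, value in enumerate(values):
            totals[i] += value
    return dict(zip(samples, totals))


samples, counts = read_count_table(io.StringIO(table_text))
sizes = library_sizes(samples, counts)
print(sizes)
```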

General trend in genomics file formats
• Text files
• Tab-delimited
• genomic location and other features such as
names (gene or feature names) and scores
• Many formats (such as BED and SAM) can be
compressed and indexed

Exploratory analysis and modeling
In general,
• How samples or variables relate to each other
– clustering & dimension reduction (PCA, etc.)
• Prediction of variable of interest: Y ~ X1 + X2 + X3
• Statistical models including hypothesis testing
In genomics,
• All of the above
• Annotation with gene sets/pathways
• Looking at genomics data with special browsers, such as UCSC genome browser or IGV

Final visualization and reporting
In general,
• Final figures, tables and text that describe the outcome of your analysis
• Jupyter notebooks or R Markdown are the go-to tools for compiling reports these days
In genomics,
• Same as above
• Example reports from RNA-seq analysis

Example RNA-seq workflow
• Data collection
• Quality check & cleaning: Fastqc, trimGalore
• Processing: STAR, featureCounts
• Exploratory analysis & modeling: DESeq2, gProfiler
• Visualization & reporting: rmarkdown

Example ChIP-seq workflow
• Data collection
• Quality check & cleaning: Fastqc, trimGalore
• Processing: Bowtie2, MACS2
• Exploratory analysis & modeling: genomation, clustering
• Visualization & reporting: Rmarkdown

First pass analysis
• Running through your workflow with default
or pre-defined parameters
• Gives you an idea about data set quality and
biology

Analysis/re-analysis cycle
• The first-pass analysis often has to be
repeated
Data collection → Quality check & cleaning → Processing → Exploratory analysis & modeling → Visualization & reporting

Can’t we automate all this ?
• Yes, to some extent
http://bioinformatics.mdc-berlin.de/pigx/
Wurmus et al. (2018) GigaScience

This is not the end
• More derivative analysis is required based on
the research questions
• This could lead to reprocessing data or
different modeling and visualization

The most important part of data
analysis is visualization
• In each step there is some visualization
involved
• Intermediate results are VERY important
Data collection → Quality check & cleaning → Processing → Exploratory analysis & modeling → Visualization & reporting

Importance of data exploration
with genome browsers
A walk through

Look at your genes or regions of
interest with processed data

Look through your genes or regions of
interest with processed data
Genes of interest Control genes

Form a hypothesis or observation
based on limited data points
Based on the limited data points I looked at:
• It seems my genes of interest have longer CpG
islands
• It seems my genes of interest have broader
transcription initiation

Test hypothesis/observation with all
the data
To be able to do that:
• We need to get the features of interest for all
genes (genes of interest and control genes)
• We need to calculate lengths and numbers of
features
• We need to do hypothesis testing
• We need to do visualization
The results of such an analysis are here: Akalin et al. 2009, Genome Biology
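
To make the hypothesis-testing step concrete, here is a minimal Python sketch of a permutation test that compares a feature length between genes of interest and control genes. The lengths are made up, and the permutation test is just one simple choice of test for illustration; it is not claimed to be the method used in the cited paper.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group labels and recompute the difference of means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up feature lengths (in bases) for the two gene sets.
genes_of_interest = [1200, 1500, 1350, 1600, 1450, 1700]
control_genes = [600, 750, 500, 800, 650, 700]

observed_diff, p_value = permutation_test(genes_of_interest, control_genes)
print(observed_diff, p_value)
```

A small p-value here says that a difference this large is rarely seen when the group labels are shuffled, which is the kind of evidence needed before trusting an impression formed in a genome browser.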

Give me some practical advice, how
can I start analyzing my own data ?
Programming
• Terminal (Bash)
• R
• Python
• Perl
• …
Click through (GUI)
• Galaxy
• KNIME
• …

Galaxy or other GUIs
• There is a tool for every step of the analysis, and you can chain them: https://usegalaxy.org/
• You still need to know how and where to use
each tool
• The only thing you are bypassing is the
terminal/command line
• GUIs are limited in their flexibility

Programming
• Learning programming diversifies your skillset
– Better for postdoc applications
– Can do stuff outside science or academia
• Learning a GUI does not give you the same
edge

Where do I begin ?
First, a motivating example

Where do I begin ?
Graduate A
• PhD – Genetics, Thesis: wet-lab genomics
• M.Sc. – Molecular Biology, Thesis: wet-lab genomics
• B.Sc. – Molecular Biology
Graduate B
• PhD – Genetics, Thesis: wet-lab genomics
• M.Sc. – Biology, Thesis: wet-lab
• Pharmacist in Training
• B.Sc. – Pharmacy
Guess first position after PhD ?

Why R ?
• All of exploratory analysis, modeling,
visualization and reporting can be done in R
• Bioconductor has thousands of specialized
bioinformatics algorithms/methods
– You can even do alignments & quality check

Where do I begin ?
• Learn how to read text or csv tables
• Learn how to manipulate data frames
• Learn how to make simple plots (plot(), hist(), barplot())
• Repeat until you are comfortable
Then,
• Write a function
• Learn loops and control structures
• Learn about other R data types

Get online/offline courses and
material
• Coursera courses:
https://www.coursera.org/learn/r-programming
• Computational genomics with R (book draft):
http://compgenomr.github.io/book
• RStudio resources:
https://www.rstudio.com/online-learning/#r-programming
• Datacamp interactive learning (some free stuff):
https://www.datacamp.com/courses/free-introduction-to-r

Buddy up
• Find a colleague from your lab or a neighboring lab so you can ask each other questions about programming

How do I get help ?
• Google it: most problems you will encounter have already been encountered by others. The answer is reachable with a well-formed Google query
• If that fails, come to the “bioinfo. Clinics” or book a consultation at http://iris.mdc-berlin.de

Don’t be a perfectionist
• Just do something that resembles what you want to do; you will iterate on it later and make it better
• Ex:
– Don’t worry about making the cutest plot, just make a plot
– Don’t worry about which mapping algorithm is the best, just use one and get some results

Python vs R
• If Python is the greatest thing that happened to general-purpose programming languages, R/Bioconductor is the greatest thing that happened to bioinformatics
• If you learn Python first, you will regularly have to drop into R for any kind of statistics developed for HT-seq.

Convergence of data analysis/science languages
Python        R
pandas        data frames
statsmodels   stats
seaborn       ggplot

Convergence of data analysis/science languages
https://ursalabs.org/

Convergence of data analysis/science languages
• Data & machine-learning models
• Interface for data access and manipulation

@AltunaAkalin
http://bioinformatics.mdc-berlin.de
http://github.com/BIMSBbioinfo
Slides will be at:
https://www.slideshare.net/altunaakalin

More References/Reading material
• RNA-seqlopedia https://rnaseq.uoregon.edu/
• Computational genomics with R,
http://compgenomr.github.io/book
• Biostars tutorials:
https://www.biostars.org/t/Tutorials/
