2. Bioinformatics is…
The development of computational methods
for studying the structure, function, and
evolution of genes, proteins, and whole
genomes;
The development of methods for the
management and analysis of biological
information arising from genomics and high-
throughput biological experiments.
3. 3
Why is there Bioinformatics?
Lots of new sequences being added
- Automated sequencers
- Genome Projects
- Metagenomics
- RNA sequencing, microarray studies, proteomics,…
Patterns in datasets that can be analyzed
using computers
Huge datasets
4. 4
Gramicidine S (Consden et al., 1947), partial insulin sequence
(Sanger and Tuppy, 1951)
1961: tRNA fragments
Francis Crick, Sydney Brenner, and colleagues propose the
existence of transfer RNA that uses a three base code and
mediates in the synthesis of proteins (Crick et al., 1961)
General nature of genetic code for proteins. Nature 192: 1227-
1232. In Microbiology: A Centenary Perspective, edited by
Wolfgang K. Joklik, ASM Press. 1999, p.384
First codon assignment UUU/phe (Nirenberg and Matthaei,
1961)
Need for informatics in biology: origins
5. 5
The key to the whole field of nucleic acid-based identification
of microorganisms…
…the introduction molecular systematics using proteins and
nucleic acids by the American Nobel laureate Linus Pauling.
Zuckerkandl, E., and L. Pauling. "Molecules as Documents of
Evolutionary History." 1965. Journal of Theoretical Biology
8:357-366
Another landmark: Nucleic acid sequencing (Sanger and
Coulson, 1975)
Need for informatics in biology: origins
6. 6
Need for informatics in biology: origins
• First genomes sequenced:
– 3.5 kb RNA bacteriophage MS2 (Fiers et al.,
1976)
– 5.4 kb bacteriophage X174 (Sanger et al.,
1977)
– 1.83 Mb First complete genome sequence of a free-living
organism: Haemophilus influenzae KW20 (Fleischmann
et al., 1995)
– First multicellular organism to be sequenced:
C. elegans (C. elegans sequencing consortium, 1998)
• Early databases: Dayhoff, 1972; Erdmann, 1978
• Early programs: restriction enzyme sites, promoters, etc…
circa 1978.
• 1978 – 1993: Nucleic Acids Research published supplemental
information
7. 7(from the National Centre for Biotechnology Information)
Genbank and associated
resources doubles faster
than Moore’s Law!
(< every 18 months)
http://en.wikipedia.org/wiki/Moore’s_law
8. 8
Today: So many genomes…
As of mid-August 2010, according to the GOLD GenomesOnline database….
Eukaryotic genome projects are in progress?
(Genome and ESTs)
1548 (517 - 5 years ago)
Prokaryote genome projects are in progress?
5006 (740 - 5 years ago)
Metagenome projects are in progress?
133 (Zero - 5 years ago)
TOTAL 6687 projects (As of Sept 2011: >10,000)
9.
10.
11.
12.
13.
14. 14
The Human Genome
The genome sequence is complete - almost!
approximately 3.5 billion base pairs.
15. 15
Work ongoing to locate all genes and
regulatory regions and describe their
functions… …bioinformatics plays a critical
role
17. 17
Bioinformatics helps with…….
Sequence Similarity Searching/Comparison
What is similar to my sequence?
Searching gets harder as the databases
get bigger - and quality changes
Tools: BLAST and FASTA = early time
saving heuristics (approximate methods)
Need better methods for SNP analysis!
Statistics + informed judgment of the
biologist
18. 18
Bioinformatics helps with…….
Structure-Function Relationships
Can we predict the function of protein
molecules from their sequence?
sequence > structure > function
Prediction of some simple 3-D structures
possible (a-helix, b-sheet, membrane
spanning, etc.)
19. 19
Can we define evolutionary
relationships between organisms
by comparing DNA sequences?
- Lots of methods and software,
what is the best analysis approach?
Bioinformatics helps with…….
Phylogenetics
21. Sanger (“dideoxy sequencing or chain
termination”) Sequencing
Single stranded DNA
from sample* extended
by polymerase from
primer then randomly
terminated by dideoxy
nucleotide (ddNTP)
Variable length DNA
fragments radiolabelled
or fluorescently
detected ddNTP
*sample derived from amplified cDNA, genomic clones or whole genome shotgun
22. Sanger Pro’s & Con’s
• Advantages
– Relatively accurate
– Relatively long (500 –
1500) bp reads
• Disadvantage
– Relatively costly in terms
of reagents and
relatively low
throughput
23. Next Generation Sequencing (NGS)
Sequence
Assembly
on HPC
Roche 454
Life Tech. Ion Torrent
Illumina HiSeq
Life Tech SOLiD Oxford
Nanopore
“GridION”
Polonator
HeliScope
Pacific
Biosciences
SMRT Cell
24. (General) NGS Pro’s & Con’s
• Advantages
– Very high throughput
– Very cheap data
production
• Disadvantages
– Relatively short reads
– Relatively higher error
rates
– Bioinformatics of
assembly is much more
challenging
27. Challenge
Assembling “next generation sequence” (NGS) data
requires a great deal of computing power and gigabytes
memory
Software often can execute in parallel on all available
computer processing unit (CPU) cores.
Many functional annotation processes (e.g. database
searching, gene expression statistical analyses) also
demand a lot of computing power
29. What is Cloud Computing?
Pooled resources: shared with many
users (remotely accessed)
Virtualization: high utilization of
hardware resources (no idling)
Elasticity: dynamic scaling without
capital expenditure and time delay
Automation: build, deploy, configure,
provision, and move without manual
intervention
Metered billing: “pay-as-you-go, only
for what you use
Cloud Computing
30. Cloud Bioinformatics Module
Raw Data/
Results/
Snapshots
Task-
Specialized
Server
Input Job
Message Queue
Output Job
Message Queue
Job Status
Notification
Customized
Machine
Image
Start-up
(w/parameters)
31. A More Complete Picture…
Raw Data + Results
Web
Portal
Project
Relational
Database
Database
Loader
32. Case Study in Bioinformatics on the Cloud
Used Amazon Web Services
http://aws.amazon.com
Assembled ~99 raw NGS transcriptome sequence
datasets from 83 species, on 16 Amazon EC2 instances
with 8 CPU cores, 68 GB of RAM, ~200 hours of
computer time, total run in less than one working day.
Each single machine of the required size would likely
have cost at least ~$10,000 (and time) to purchase,
and incur significant operating costs overhead
(machine room space, power supplies, networking, air
conditioning, staff salaries, etc.)
The above run could be started up in a few minutes
and cost ~ $500 to complete. Once done, no machines
left idling and unused…
33. Software for (NGS) Bioinformatics
Bundled with sequencing machines:
e.g. Newbler assembler with Roche 454
3rd party commercial:
DNA Star (www.dnastar.com)
Geneious (http://www.geneious.com/)
GeneWiz (http://www.genewiz.com)
And others…
Open Source:
Lots (selected examples to be covered in this
workshop)
34. What do I need to run bioinformatics software locally?
Some common bioinformatics software is
platform independent, hence will run equally
under Windows and UNIX (Linux, OSX)
Most other software targets Unix systems. If
you are running Microsoft Windows and want
to run such software locally, the easiest way to
do this(?) is to install some version of Linux
(suggest “Ubuntu”) as a dual boot or (less
intrusively) as a guest operating system in a
virtual machine, e.g.
http://www.vmware.com/products/player/
36. WestGrid @ SFU / IRMACS
WestGrid is a consortium member of
“Computer Canada”
https://computecanada.org/
“bugaboo” cluster: 4328 cores total: 1280
cores, 8 cores/node, 16 GB/node, x86_64, IB.
Plus 3048 cores, 12 cores/node, 24GB/node,
x86_64, IB. capability cluster, 40 Core Years
Access to other Westgrid resources through
LAN and WAN
More details from Brian Corrie tomorrow…
39. What is Bioinformatics?
Road Map
Annotation
Sequences
(Formats)
Visualization of
Sequence &
Annotation
Search &
Alignments
NGS
Sequence
Databases
Sequence
Assembly
40. Specific Applications
Sequence Assembly of Transcriptomes
Sequence Assembly of Whole Genomes
Annotation of de novo Assembled Sequences
Identification and Analysis of Sequence Variation
Comparative Genomic Analysis and Visualization
Meta-Analysis of Annotated Sequence Data
41. Survey: Workshop Expectations I
How to find significance in the huge amount of
data that Next Gen sequencing, but also
microarrays etc. generate.
A basic understanding of how to analyse next
generation sequencing data.
Learn some hands-on computer experience
learning to use software for analysing sequence
data; what can be done and how to do it.
genome assembly + meta-analysis
42. Survey: Workshop Expectations II
The basics of alignment and SNP calling with next-
gen sequencing, and what kind of programs are
out there to do these tasks and then analyze the
large datasets (I've been trying to figure this out
on my own through reading the literature and it's
quite time consuming so any info provided
through the workshop would be very helpful -
thanks)
The main workflow for processing sequence data
from the beginning to the more specific paths of
analyses. Also the concepts, significance of the
adjustable parameters behind the various
algorithms used in the workflow.
43. Survey: Workshop Expectations III
I expect to learn the basic bioinformatics tools.
Learn different sequence alignment
software/technologies (i.e. BWA, Abyss, etc.). Learn
more about the complexities of NGS sequencing
Next generation sequencing, data analysis etc.
Parameters regulating assembly of contigs. How to
take raw data to an assembly, control the main
parameters for assembly, mass analyze data for
annotation and SNPs
How to compare expression profiles using RNA
transcriptomes.
Want to learn new things
44. Survey: Operating System Being Used
Microsoft Windows on Intel/AMD – 14 (86.7%)
Most running Windows 7 (some XP & Vista)
One uses Linux through Westgrid and the IRMACS
cluster
Some of you also thinking of running Linux
Apple OS X – 2 (13.3%)
Snow Leopard Release
Apple Lion, running Windows 7 using Parallels
Linux on Intel - 2 (13.3%)
45. Looking Ahead…
What will you need for this workshop?
Mainly, just a laptop running a web browser
(Optional) access to Linux/Unix locally (VM Player)
Reading list:
Will give review citations for future lectures
For next week, suggest that you surf to
http://www.ncbi.nlm.nih.gov/