CLOUD BIOINFORMATICS
BY.A.ARPUTHA SELVARAJ
Bioinformatics is…
The development of computational methods
for studying the structure, function, and
evolution of genes,...
3
Why is there Bioinformatics?
 Lots of new sequences being added
- Automated sequencers
- Genome Projects
- Metagenomics...
4
 Gramicidine S (Consden et al., 1947), partial insulin sequence
(Sanger and Tuppy, 1951)
 1961: tRNA fragments
 Franc...
5
 The key to the whole field of nucleic acid-based identification
of microorganisms…
…the introduction molecular systema...
6
Need for informatics in biology: origins
• First genomes sequenced:
– 3.5 kb RNA bacteriophage MS2 (Fiers et al.,
1976)
...
7(from the National Centre for Biotechnology Information)
Genbank and associated
resources doubles faster
than Moore’s Law...
8
Today: So many genomes…
As of mid-August 2010, according to the GOLD GenomesOnline database….
 Eukaryotic genome projec...
14
The Human Genome
The genome sequence is complete - almost!
 approximately 3.5 billion base pairs.
15
Work ongoing to locate all genes and
regulatory regions and describe their
functions… …bioinformatics plays a critical
...
16
Identifying single nucleotide polymorphisms
(SNPs) and other changes between individuals
17
Bioinformatics helps with…….
Sequence Similarity Searching/Comparison
 What is similar to my sequence?
 Searching get...
18
Bioinformatics helps with…….
Structure-Function Relationships
 Can we predict the function of protein
molecules from t...
19
 Can we define evolutionary
relationships between organisms
by comparing DNA sequences?
- Lots of methods and software...
WHAT IS NEXT GENERATION
SEQUENCING (NGS)?
Sanger (“dideoxy sequencing or chain
termination”) Sequencing
 Single stranded DNA
from sample* extended
by polymerase fr...
Sanger Pro’s & Con’s
• Advantages
– Relatively accurate
– Relatively long (500 –
1500) bp reads
• Disadvantage
– Relativel...
Next Generation Sequencing (NGS)
Sequence
Assembly
on HPC
Roche 454
Life Tech. Ion Torrent
Illumina HiSeq
Life Tech SOLiD ...
(General) NGS Pro’s & Con’s
• Advantages
– Very high throughput
– Very cheap data
production
• Disadvantages
– Relatively ...
General Workflow
1. Template preparation
2. Sequencing & imaging
3. Genome alignment/assembly
COPING WITH THE
BIOINFORMATICS CHALLENGE
Challenge
 Assembling “next generation sequence” (NGS) data
requires a great deal of computing power and gigabytes
memory...
“High Performance
Computing” and “Cloud
Computing”
Computer
Nodes
Network
Storage
Your local
workstation/
laptop
What is Cloud Computing?
 Pooled resources: shared with many
users (remotely accessed)
 Virtualization: high utilization...
Cloud Bioinformatics Module
Raw Data/
Results/
Snapshots
Task-
Specialized
Server
Input Job
Message Queue
Output Job
Messa...
A More Complete Picture…
Raw Data + Results
Web
Portal
Project
Relational
Database
Database
Loader
Case Study in Bioinformatics on the Cloud
 Used Amazon Web Services
http://aws.amazon.com
 Assembled ~99 raw NGS transcr...
Software for (NGS) Bioinformatics
Bundled with sequencing machines:
e.g. Newbler assembler with Roche 454
3rd party com...
What do I need to run bioinformatics software locally?
Some common bioinformatics software is
platform independent, hence...
But, what are *we* going to use here?
WestGrid @ SFU / IRMACS
WestGrid is a consortium member of
“Computer Canada”
https://computecanada.org/
“bugaboo” cluste...
Galaxy Genomics Workbench
http://galaxy.psu.edu/
(also http://main.g2.bx.psu.edu/)
THE ROADMAP
What is Bioinformatics?
Road Map
Annotation
Sequences
(Formats)
Visualization of
Sequence &
Annotation
Search &
Alignments...
Specific Applications
 Sequence Assembly of Transcriptomes
 Sequence Assembly of Whole Genomes
 Annotation of de novo A...
Survey: Workshop Expectations I
 How to find significance in the huge amount of
data that Next Gen sequencing, but also
m...
Survey: Workshop Expectations II
 The basics of alignment and SNP calling with next-
gen sequencing, and what kind of pro...
Survey: Workshop Expectations III
 I expect to learn the basic bioinformatics tools.
 Learn different sequence alignment...
Survey: Operating System Being Used
Microsoft Windows on Intel/AMD – 14 (86.7%)
 Most running Windows 7 (some XP & Vista...
Looking Ahead…
What will you need for this workshop?
Mainly, just a laptop running a web browser
(Optional) access to L...
Thank You
Cloud bioinformatics 2
Cloud bioinformatics 2
Cloud bioinformatics 2
Cloud bioinformatics 2
Cloud bioinformatics 2
Upcoming SlideShare
Loading in...5
×

Cloud bioinformatics 2

1,407

Published on

GENE

Published in: Health & Medicine
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
1,407
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
6
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • Summer 2002
  • Information sources:
    (Rhesus macaque) Robert F. Service. Science 311: 5767. 1544-1546 (2006).
    454 press release, May 31, 2007. http://www.454.com/about-454/news/index.asp?display=detail&id=68
    Wellcome Trust Sanger Institute press release, July 2, 2008. http://www.sanger.ac.uk/Info/Press/2008/080702.shtml
    Complete Genomics article in Bio-IT World: http://www.bio-itworld.com/BioIT_Article.aspx?id=82058
    Applied Biosystems press release, October 1, 2008. http://phx.corporate-ir.net/phoenix.zhtml?c=61498&p=irol-abiNewsArticle&ID=1207598&highlight=
  • Summer 2002
  • Summer 2002
  • Summer 2002
  • Roadmap of the workshop (10 minutes, 3 slides - program revisited; + tech structure/flow diagram(?)
  • Cloud bioinformatics 2

    1. 1. CLOUD BIOINFORMATICS BY.A.ARPUTHA SELVARAJ
    2. 2. Bioinformatics is… The development of computational methods for studying the structure, function, and evolution of genes, proteins, and whole genomes; The development of methods for the management and analysis of biological information arising from genomics and high- throughput biological experiments.
    3. 3. 3 Why is there Bioinformatics?  Lots of new sequences being added - Automated sequencers - Genome Projects - Metagenomics - RNA sequencing, microarray studies, proteomics,… Patterns in datasets that can be analyzed using computers Huge datasets
    4. 4. 4  Gramicidine S (Consden et al., 1947), partial insulin sequence (Sanger and Tuppy, 1951)  1961: tRNA fragments  Francis Crick, Sydney Brenner, and colleagues propose the existence of transfer RNA that uses a three base code and mediates in the synthesis of proteins (Crick et al., 1961) General nature of genetic code for proteins. Nature 192: 1227- 1232. In Microbiology: A Centenary Perspective, edited by Wolfgang K. Joklik, ASM Press. 1999, p.384  First codon assignment UUU/phe (Nirenberg and Matthaei, 1961) Need for informatics in biology: origins
    5. 5. 5  The key to the whole field of nucleic acid-based identification of microorganisms… …the introduction molecular systematics using proteins and nucleic acids by the American Nobel laureate Linus Pauling. Zuckerkandl, E., and L. Pauling. "Molecules as Documents of Evolutionary History." 1965. Journal of Theoretical Biology 8:357-366  Another landmark: Nucleic acid sequencing (Sanger and Coulson, 1975) Need for informatics in biology: origins
    6. 6. 6 Need for informatics in biology: origins • First genomes sequenced: – 3.5 kb RNA bacteriophage MS2 (Fiers et al., 1976) – 5.4 kb bacteriophage X174 (Sanger et al., 1977) – 1.83 Mb First complete genome sequence of a free-living organism: Haemophilus influenzae KW20 (Fleischmann et al., 1995) – First multicellular organism to be sequenced: C. elegans (C. elegans sequencing consortium, 1998) • Early databases: Dayhoff, 1972; Erdmann, 1978 • Early programs: restriction enzyme sites, promoters, etc… circa 1978. • 1978 – 1993: Nucleic Acids Research published supplemental information
    7. 7. 7(from the National Centre for Biotechnology Information) Genbank and associated resources doubles faster than Moore’s Law! (< every 18 months) http://en.wikipedia.org/wiki/Moore’s_law
    8. 8. 8 Today: So many genomes… As of mid-August 2010, according to the GOLD GenomesOnline database….  Eukaryotic genome projects are in progress? (Genome and ESTs) 1548 (517 - 5 years ago)  Prokaryote genome projects are in progress? 5006 (740 - 5 years ago)  Metagenome projects are in progress? 133 (Zero - 5 years ago) TOTAL 6687 projects (As of Sept 2011: >10,000)
    9. 9. 14 The Human Genome The genome sequence is complete - almost!  approximately 3.5 billion base pairs.
    10. 10. 15 Work ongoing to locate all genes and regulatory regions and describe their functions… …bioinformatics plays a critical role
    11. 11. 16 Identifying single nucleotide polymorphisms (SNPs) and other changes between individuals
    12. 12. 17 Bioinformatics helps with……. Sequence Similarity Searching/Comparison  What is similar to my sequence?  Searching gets harder as the databases get bigger - and quality changes  Tools: BLAST and FASTA = early time saving heuristics (approximate methods)  Need better methods for SNP analysis!  Statistics + informed judgment of the biologist
    13. 13. 18 Bioinformatics helps with……. Structure-Function Relationships  Can we predict the function of protein molecules from their sequence? sequence > structure > function  Prediction of some simple 3-D structures possible (a-helix, b-sheet, membrane spanning, etc.)
    14. 14. 19  Can we define evolutionary relationships between organisms by comparing DNA sequences? - Lots of methods and software, what is the best analysis approach? Bioinformatics helps with……. Phylogenetics
    15. 15. WHAT IS NEXT GENERATION SEQUENCING (NGS)?
    16. 16. Sanger (“dideoxy sequencing or chain termination”) Sequencing  Single stranded DNA from sample* extended by polymerase from primer then randomly terminated by dideoxy nucleotide (ddNTP)  Variable length DNA fragments radiolabelled or fluorescently detected ddNTP *sample derived from amplified cDNA, genomic clones or whole genome shotgun
    17. 17. Sanger Pro’s & Con’s • Advantages – Relatively accurate – Relatively long (500 – 1500) bp reads • Disadvantage – Relatively costly in terms of reagents and relatively low throughput
    18. 18. Next Generation Sequencing (NGS) Sequence Assembly on HPC Roche 454 Life Tech. Ion Torrent Illumina HiSeq Life Tech SOLiD Oxford Nanopore “GridION” Polonator HeliScope Pacific Biosciences SMRT Cell
    19. 19. (General) NGS Pro’s & Con’s • Advantages – Very high throughput – Very cheap data production • Disadvantages – Relatively short reads – Relatively higher error rates – Bioinformatics of assembly is much more challenging
    20. 20. General Workflow 1. Template preparation 2. Sequencing & imaging 3. Genome alignment/assembly
    21. 21. COPING WITH THE BIOINFORMATICS CHALLENGE
    22. 22. Challenge  Assembling “next generation sequence” (NGS) data requires a great deal of computing power and gigabytes memory  Software often can execute in parallel on all available computer processing unit (CPU) cores.  Many functional annotation processes (e.g. database searching, gene expression statistical analyses) also demand a lot of computing power
    23. 23. “High Performance Computing” and “Cloud Computing” Computer Nodes Network Storage Your local workstation/ laptop
    24. 24. What is Cloud Computing?  Pooled resources: shared with many users (remotely accessed)  Virtualization: high utilization of hardware resources (no idling)  Elasticity: dynamic scaling without capital expenditure and time delay  Automation: build, deploy, configure, provision, and move without manual intervention  Metered billing: “pay-as-you-go, only for what you use Cloud Computing
    25. 25. Cloud Bioinformatics Module Raw Data/ Results/ Snapshots Task- Specialized Server Input Job Message Queue Output Job Message Queue Job Status Notification Customized Machine Image Start-up (w/parameters)
    26. 26. A More Complete Picture… Raw Data + Results Web Portal Project Relational Database Database Loader
    27. 27. Case Study in Bioinformatics on the Cloud  Used Amazon Web Services http://aws.amazon.com  Assembled ~99 raw NGS transcriptome sequence datasets from 83 species, on 16 Amazon EC2 instances with 8 CPU cores, 68 GB of RAM, ~200 hours of computer time, total run in less than one working day.  Each single machine of the required size would likely have cost at least ~$10,000 (and time) to purchase, and incur significant operating costs overhead (machine room space, power supplies, networking, air conditioning, staff salaries, etc.)  The above run could be started up in a few minutes and cost ~ $500 to complete. Once done, no machines left idling and unused…
    28. 28. Software for (NGS) Bioinformatics Bundled with sequencing machines: e.g. Newbler assembler with Roche 454 3rd party commercial: DNA Star (www.dnastar.com) Geneious (http://www.geneious.com/) GeneWiz (http://www.genewiz.com) And others… Open Source: Lots (selected examples to be covered in this workshop)
    29. 29. What do I need to run bioinformatics software locally? Some common bioinformatics software is platform independent, hence will run equally under Windows and UNIX (Linux, OSX) Most other software targets Unix systems. If you are running Microsoft Windows and want to run such software locally, the easiest way to do this(?) is to install some version of Linux (suggest “Ubuntu”) as a dual boot or (less intrusively) as a guest operating system in a virtual machine, e.g. http://www.vmware.com/products/player/
    30. 30. But, what are *we* going to use here?
    31. 31. WestGrid @ SFU / IRMACS WestGrid is a consortium member of “Computer Canada” https://computecanada.org/ “bugaboo” cluster: 4328 cores total: 1280 cores, 8 cores/node, 16 GB/node, x86_64, IB. Plus 3048 cores, 12 cores/node, 24GB/node, x86_64, IB. capability cluster, 40 Core Years Access to other Westgrid resources through LAN and WAN More details from Brian Corrie tomorrow…
    32. 32. Galaxy Genomics Workbench http://galaxy.psu.edu/ (also http://main.g2.bx.psu.edu/)
    33. 33. THE ROADMAP
    34. 34. What is Bioinformatics? Road Map Annotation Sequences (Formats) Visualization of Sequence & Annotation Search & Alignments NGS Sequence Databases Sequence Assembly
    35. 35. Specific Applications  Sequence Assembly of Transcriptomes  Sequence Assembly of Whole Genomes  Annotation of de novo Assembled Sequences  Identification and Analysis of Sequence Variation  Comparative Genomic Analysis and Visualization  Meta-Analysis of Annotated Sequence Data
    36. 36. Survey: Workshop Expectations I  How to find significance in the huge amount of data that Next Gen sequencing, but also microarrays etc. generate.  A basic understanding of how to analyse next generation sequencing data.  Learn some hands-on computer experience  learning to use software for analysing sequence data; what can be done and how to do it.  genome assembly + meta-analysis
    37. 37. Survey: Workshop Expectations II  The basics of alignment and SNP calling with next- gen sequencing, and what kind of programs are out there to do these tasks and then analyze the large datasets (I've been trying to figure this out on my own through reading the literature and it's quite time consuming so any info provided through the workshop would be very helpful - thanks)  The main workflow for processing sequence data from the beginning to the more specific paths of analyses. Also the concepts, significance of the adjustable parameters behind the various algorithms used in the workflow.
    38. 38. Survey: Workshop Expectations III  I expect to learn the basic bioinformatics tools.  Learn different sequence alignment software/technologies (i.e. BWA, Abyss, etc.). Learn more about the complexities of NGS sequencing  Next generation sequencing, data analysis etc.  Parameters regulating assembly of contigs. How to take raw data to an assembly, control the main parameters for assembly, mass analyze data for annotation and SNPs  How to compare expression profiles using RNA transcriptomes.  Want to learn new things
    39. 39. Survey: Operating System Being Used Microsoft Windows on Intel/AMD – 14 (86.7%)  Most running Windows 7 (some XP & Vista)  One uses Linux through Westgrid and the IRMACS cluster Some of you also thinking of running Linux Apple OS X – 2 (13.3%)  Snow Leopard Release  Apple Lion, running Windows 7 using Parallels Linux on Intel - 2 (13.3%)
    40. 40. Looking Ahead… What will you need for this workshop? Mainly, just a laptop running a web browser (Optional) access to Linux/Unix locally (VM Player) Reading list: Will give review citations for future lectures For next week, suggest that you surf to http://www.ncbi.nlm.nih.gov/
    41. 41. Thank You
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×