iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to microbes. Overview of work underway to add applications and computational analysis pipelines to iPlant for metagenomics and microbial ecology.

Published in: Science, Technology, Education

  1. Bonnie Hurwitz, PhD, Arizona Health Sciences Center. Extending the iPlant Cyberinfrastructure: From Plants to Microbes
  2. The iPlant Collaborative: Community Cyberinfrastructure for Life Science. http://www.iplantcollaborative.org
  3. iVirus and iMicrobe. Funding and Staff: Joaquin Ruiz, PhD (Dean, College of Science); Darren Boss; Devesh Chourasiya; Matt Sullivan, PhD; Shane Burgess, PhD (Dean, CALS)
  4. The iPlant Collaborative: Vision. Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems.
  5. How iPlant CI Enables Discovery. Challenge: Create an easy-to-use platform powerful enough to handle data-intensive biology. Many bioinformatics tools are “off limits” to those without specialized computational backgrounds.
  6. The iPlant Collaborative: Who makes up iPlant? iPlant is a collaborative virtual organization.
  7. The iPlant Collaborative: How is iPlant funded?
      • iPlant renewed by NSF
      • September 2013 begins next 5-year period
      • Scientific Advisory Board
      • Focus on Genotype-Phenotype science
      • NSF recommended expansion of scope beyond plants
  8. The iPlant Collaborative: Who does iPlant collaborate with? iPlant collaborates to enable access to the solutions that work the best for the community…
  9. How iPlant CI Enables Discovery: Overview of resources. Building a platform that can support diverse and constantly evolving needs.
      • Audience: End Users, Computational Users
      • Diagram labels: Teragrid, XSEDE
      ✓ Storage
      ✓ Computation
      ✓ Hosting
      ✓ Web Services
      ✓ Scalability
  10. iPlant Data Store: The resources you need to share and manage data with your lab, colleagues and community
      ✓ Initial 100 GB allocation; TB allocations available
      ✓ Automatic data backup
      ✓ Easy upload/download and sharing
  11. Discovery Environment: Hundreds of bioinformatics Apps in an easy-to-use interface
      ✓ A platform that can run almost any bioinformatics application
      ✓ Seamlessly integrated with data and high performance computing
      ✓ User extensible: add your own applications
  12. Agave API: Fully customize iPlant resources
      ✓ Science-as-a-service platform
      ✓ Define your own compute and storage resources (local and iPlant)
      ✓ Build your own app store of scientific code and workflows
  13. Atmosphere: Cloud computing for the life sciences
      ✓ Simple: One-click access to more than 100 virtual machine images
      ✓ Flexible: Fully customize your software setup
      ✓ Powerful: Integrated with iPlant computing and data resources
  14. DNA Subway: Educational workflows for Genomes, DNA Barcoding, RNA-Seq
      ✓ Commonly used bioinformatics tools in streamlined workflows
      ✓ Teach important concepts in biology and bioinformatics
      ✓ Inquiry-based experiments for novel discovery and publication of data
  15. Bisque: Image analysis, management, and metadata
      ✓ Secure image storage, analysis, and data management
      ✓ Integrate existing applications or create new ones
      ✓ Custom visualization and image handling routines and APIs
  16. iMicrobe and iVirus: Leverage the iPlant Cyberinfrastructure
      • Audience: Typical End Users, Computational Users
      • Diagram labels: Teragrid, XSEDE
      • Using iPlant for:
      ✓ Storage
      ✓ Computation
      ✓ Analysis
      ✓ App dev.
      ✓ Pipeline dev.
      ✓ Code distrib.
      ✓ Data Discoverability
  17. What’s Under the Hood? Stampede: High Level Overview
      • Base Cluster (Dell/Intel/Mellanox):
        – Intel Sandy Bridge processors
        – Dell dual-socket nodes w/32GB RAM (2GB/core)
        – 6,400 nodes
        – 56 Gb/s Mellanox FDR InfiniBand interconnect
        – More than 100,000 cores, 2.2 PF peak performance
      • Co-Processors:
        – Intel Xeon Phi “MIC” Many Integrated Core processors
        – Special release of “Knight’s Corner” (61 cores)
        – All MIC cards are on site at TACC: more than 6000 installed; final installation ongoing for formal summer acceptance
        – 7+ PF peak performance
      • Max Total Concurrency:
        – exceeds 500,000 cores
        – 1.8M threads
      • Entered production operations on January 7, 2013
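The base-cluster figures quoted for Stampede can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slide (6,400 nodes, 32 GB RAM per node, 2 GB per core); the variable names are illustrative, and the derived cores-per-node value is an inference from the RAM figures rather than something the slide states directly.

```python
# Figures quoted on the slide for the Stampede base cluster.
nodes = 6400
ram_per_node_gb = 32
ram_per_core_gb = 2

# 32 GB per node at 2 GB per core implies 16 cores per dual-socket node.
cores_per_node = ram_per_node_gb // ram_per_core_gb

# Total base-cluster cores, to compare with "more than 100,000 cores".
base_cores = nodes * cores_per_node

# Aggregate base-cluster memory, in GB.
total_ram_gb = nodes * ram_per_node_gb

print(f"cores per node:   {cores_per_node}")
print(f"base cores:       {base_cores:,}")
print(f"aggregate memory: {total_ram_gb:,} GB")
```

The result, 102,400 cores, agrees with the slide's "more than 100,000 cores" for the base cluster alone; the co-processor cores are counted separately.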
  18. iMicrobe/iVirus: New App Development. June 2013 to May 2014: 13 new Apps; 1 high-throughput analysis pipeline
  19. Forging Ahead with iPlant
      • Build a metagenomics toolkit
      • Streamline metagenomics workflows
      • Enable high-throughput computing
      • Provide key datasets for computation
  20. iPlant Data Store: The resources you need to share and manage data with your lab, colleagues and community
  21. Overview of the iPlant Data Store: Some Complications of Big Data
      • Difficult/slow transfers
      • Expense for storage/backup
      • Difficult to share and publish
      • Metadata
      • Analysis
  22. iPlant Supports the Life Cycle of Data
      • Stages: Store, Markup, Search, Transfer, Analyze, Visualize, Collaborate, Share
      • Diagram labels: Data, Results A, Results B, Algo1, Algo2
      • Phases: Pre-Publication, Post-Publication
  23. Overview of the iPlant Data Store: Scalable, Reliable, Redundant, High-performance
      • Access your data from multiple iPlant services
      • Automatic data backup (redundant between University of Arizona and University of Texas)
      • Multiple ways to share data with collaborators
      • Multi-threaded high speed transfers
      • Default 100GB allocation. >1TB allocations available with justification
      • Diagram labels: Teragrid, XSEDE
  24. Overview of the iPlant Data Store: Some important items we won’t see

      | Source          | Destination | Copy Method | Time (seconds) |
      |-----------------|-------------|-------------|----------------|
      | CD              | My Computer | cp          | 320            |
      | Berkeley Server | My Computer | scp         | 150            |
      | External Drive  | My Computer | cp          | 36             |
      | USB2.0 Flash    | My Computer | cp          | 30             |
      | iDS             | My Computer | iget        | 18             |
      | My Computer     | My Computer | cp          | 15             |

      Close to optimum conditions; transfer between Univ. of Arizona and UC Berkeley. 100GB: 29m15s (1 GB / 17.5 seconds)
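The transfer figures on this slide can be turned into a throughput estimate. The following is a minimal sketch that uses only the numbers quoted above (100 GB in 29m15s, and the per-method times in the table); the helper name `duration_to_seconds` is hypothetical, and decimal units (1 GB = 1000 MB) are an assumption, since the slide does not say which convention it uses.

```python
import re


def duration_to_seconds(text: str) -> int:
    """Convert a duration such as '29m15s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


# Bulk transfer quoted on the slide: 100 GB in 29m15s.
size_gb = 100
elapsed_s = duration_to_seconds("29m15s")
seconds_per_gb = elapsed_s / size_gb

# Assumption: decimal units, 1 GB = 1000 MB.
mb_per_s = size_gb * 1000 / elapsed_s

# Per-method times from the table, in seconds, for the same test copy.
times_s = {"CD / cp": 320, "Berkeley Server / scp": 150,
           "External Drive / cp": 36, "USB2.0 Flash / cp": 30,
           "iDS / iget": 18, "My Computer / cp": 15}

# How many times slower each method is than iget from the Data Store.
slowdown_vs_iget = {name: t / times_s["iDS / iget"] for name, t in times_s.items()}

print(f"{seconds_per_gb:.2f} s per GB, about {mb_per_s:.0f} MB/s")
for name, factor in slowdown_vs_iget.items():
    print(f"{name}: {factor:.1f}x the iget time")
```

The computed 17.55 seconds per GB matches the slide's rounded "1 GB / 17.5 seconds".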
  25. Discovery Environment: Hundreds of bioinformatics Apps in an easy-to-use interface
  26. Overview of the iPlant Discovery Environment. Through the Discovery Environment you have:
      • High-powered computing
      • iPlant data store
      • Easy to use interface
      • Virtually limitless apps
      • Analysis history (provenance)
  27. What can you do in the iPlant DE? Scalable platform for powerful computing, data, and application resources
      • Navigate the components of the DE
      • Access and manipulate data
      • Start and complete an analysis
      • Track your analysis and see your results
  28. Why is iPlant DE Scalable? Democratize your code
      • Rich platform for bioinformatics: ~400 apps (and counting)
      • Data co-localized with analysis
      • Easy to use interface, with access to support
      • Easy to integrate and customize your own tools
  29. Discovery Environment Example. Goal: Create a metagenomic assembly.
      • Task 1: Upload metagenomic fasta file to your personal data store
      • Task 2: Run quality control on your raw sequence reads
      • Task 3: Find and select an assembly tool (e.g. Metavelvet)
      • Task 4: Specify parameters and your input files. Run the assembly App.
      • Task 5: Monitor the progress of your analysis and save parameters.
      • Task 6: View your results.
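Before uploading a FASTA file (Task 1) and running quality control on it (Task 2), it can help to confirm locally that the file parses and to see how many reads it holds. The sketch below is a generic pre-upload sanity check, not one of the Discovery Environment QC Apps; the function names and the `min_length` cutoff are hypothetical and chosen only for illustration.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle: TextIO, min_length: int = 50) -> Dict[str, int]:
    """Count reads and bases, and flag reads shorter than min_length."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    return {
        "reads": len(lengths),
        "total_bases": sum(lengths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
        "below_min_length": sum(1 for n in lengths if n < min_length),
    }


# Tiny in-memory example; a real file would be opened with open(path).
example = io.StringIO(">read1\nACGTACGT\nACGT\n>read2\nACG\n")
summary = summarize(example, min_length=5)
print(summary)
```

A check like this only catches malformed or unexpectedly short input; base-quality filtering still belongs to the QC step run in the DE.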
  30. Sequence Quality Control in the iPlant DE
  31. Genome, Metagenome, and Transcriptome Assembly
      • Genome and Metagenome Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD
      • Transcriptome Assembly: De novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS. Reference-guided: Tophat, Cufflinks
      • Key: In the DE
  32. Where is the sample data?
  33. Where is the Assembly App?
  34. Specify Data and Assembly Parameters
  35. Specify Run Settings
  36. Track Analyses and Results
  37. What about Annotations?
      • Annotations are descriptions of features on contigs in a genome / metagenome
        – Ab initio gene predictions
        – Protein homology (GenBank nr, SIMAP)
        – Curated protein resources (COG, KEGG, …)
      • Secondary annotations
        – InterPro Scan (Pfam, PIR, Prosite, …)
        – GO and other ontologies
        – Pathway Mapping (KEGG, MetaCyc, EcoCyc)
  38. Assembly & Annotation at iPlant
      • Genome and Metagenome Assembly: ALLPATHS-LG, Newbler, SOAPdenovo, Velvet, MetaVelvet, ABySS, SPA, Digital Norm., IDBA-UD
      • Ab initio Gene Prediction: Glimmer, Prodigal, FragGeneScan, Metagene, MetaGenmark
      • Transcriptome Assembly: De novo: Trinity, SOAPdenovo-Trans, Velvet/Oasis, Trans-ABySS. Reference-guided: Tophat, Cufflinks
      • Conversion Tools: tophat2gff, cufflinks2gff
      • Annotation: Primary: BLAST, k-mer based. Secondary: InterProScan, InterPro2GO
      • Visualization: JBrowse, Web-Apollo
      • Data Commons: Genomes and Metagenomes; Proteins / Genes; Reference Annotations; Metadata (in irods)
      • Diagram labels: Meta-, Genome input, Evidence input
      • Key: At TACC, In the DE, Under Development
      ✓ Storage
      ✓ Computation
      ✓ Analysis
      ✓ Data Access
      ✓ Code Distr.
      ✓ Query by metadata
  39. The Louis Pasteur Method: We can’t “see” all bacteria using culture-based approaches. Razumov (1932) “The Great Plate Anomaly.”
  40. The Post-Genomic Era: from Pasteur to CSI. Diagram labels: Community, Genomics, Isolate, Metagenomics
  41. Making Sense of Metagenomes: Environmental Sample, Extract DNA, library creation, High throughput sequencing, Assemble reads, Gene Prediction, Compare to known proteins, then Function and Taxonomy
  42. Viromes are dominated by the Unknown. We need new tools!
      • Panels: Photic, Aphotic
      • Pie chart values: Bacteria 5%, Eukaryota 1%, Archaea 0%, Viruses 3%, Unknown 91%
      • Pie chart values: Viruses 7%, Bacteria 4%, Eukaryota 1%, Archaea 0%, Unknown 88%
      Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
  43. PcPipe: a Vignette in Viral Metagenomics. Phage Function based on Environment
  44. Organizing the Unknown. Diagram labels: Input reads, Assemble, Find Genes, Cluster Genes, Protein Clusters, BIN
      Yooseph S, et al. (2007) The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families. PLoS Biol 5(3):e16.
  45. 27K High-Confidence Viral Protein Clusters
      • GOS: 50%
      • POV + GOS: 22%
      • POV: 28%
      • Isolate Phage: 1%
      • 2X environmental viral protein clusters
      • 70% of data now included
      Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
  46. Ocean Microbial Communities Vary by Environmental Factors. Pacific Ocean Virome: Geographic Region, Location on a Transect, Season, Depth
      Hurwitz BL & Sullivan MB. (2013) The Pacific Ocean Virome (POV). PLoS One. 8: e57355.
  47. Protein Clusters group by photic zone
      • [Figure: pairwise sample comparison with sample IDs on both axes, grouped into Photic and Aphotic]
      • Panels: Photic vs Photic, Aphotic vs Photic, Aphotic vs Aphotic, Photic vs Aphotic
      • Legend: Many PCs shared, Some PCs shared, Few PCs shared
      Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
  48. Niche Defining Photic Core: Host Genes that Promote Viral Replication
      • Fe-S cluster biogenesis and function
      • DNA/Protein biosynthesis and repair
      • Host “wake-up”
      • Energy production in photosynthesis
      Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
      Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
  49. Niche Defining Aphotic Core: Adaptive for High Pressure Environments
      • DNA replication initiation
      • DNA repair
      • Motility
      • Energy production in the TCA cycle
      Hurwitz BL, Hallam S., Sullivan MB. (2013) Metabolic Reprogramming by Viruses in the Sunlit and Dark Ocean. Genome Biology, 14, R123.
      Hurwitz BL, Brum J. and Sullivan MB. Depth Stratified Functional and Taxonomic Niche Specialization in the ‘Core’ and ‘Flexible’ Pacific Ocean Virome. In Review.
  50. iPlant Discovery Environment: Automated Workflows. PCpipe: creating protein clusters for viral ecology
      • Input: New.fastq
      • QC sequences: FASTQ_shrinker
      • Assembly part 1: Velveth
      • Assembly part 2: Velvetg
      • Find Genes: Meta-Gene-Mark (output: New.a.faa)
      • pcpipe part 1: Cd-hit-2d (with POV PCs)
      • pcpipe part 2: Cd-hit (output: POV + Novel PCs)
      • Input to Analyses: Blastx to nr, QIIME, Rarefaction
  51. Creating Workflows: Easy as 1-2-3-4
      1. Select the Apps
      2. Order the Apps
      3. Map Outputs to Inputs
      4. Run the analysis
  52. Create a New Workflow
  53. Provide Workflow Information
  54. Select the Apps
  55. Add the Apps
  56. Remove an App
  57. Order the Apps
  58. Map Outputs to Inputs (New.a.faa, POV PCs)
  59. A New Workflow
  60. Run the Workflow (User’s ORFs, POV PCs)
  61. Automated workflows cannot use Apps that run on the HPC
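The four workflow steps (select the Apps, order them, map outputs to inputs, run) and the restriction that automated workflows cannot use Apps that run on the HPC can be modeled in a few lines. This is a toy sketch, not the Discovery Environment's implementation: the `App` class, `validate`, `run_workflow`, and the intermediate file name `unmatched ORFs` are all hypothetical, and the stand-in functions only build strings rather than running cd-hit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class App:
    name: str
    inputs: List[str]                                   # names this App consumes
    outputs: List[str]                                  # names this App produces
    run: Callable[[Dict[str, str]], Dict[str, str]]
    runs_on_hpc: bool = False


def validate(apps: List[App], workflow_inputs: Dict[str, str]) -> None:
    """Check the App order: every input must already exist, and no App may be HPC-only."""
    available = set(workflow_inputs)
    for app in apps:
        if app.runs_on_hpc:
            raise ValueError(f"{app.name} runs on the HPC and cannot be used in an automated workflow")
        missing = [name for name in app.inputs if name not in available]
        if missing:
            raise ValueError(f"{app.name} is missing inputs: {missing}")
        available.update(app.outputs)


def run_workflow(apps: List[App], workflow_inputs: Dict[str, str]) -> Dict[str, str]:
    """Run Apps in order, feeding each App the outputs mapped to its inputs."""
    validate(apps, workflow_inputs)
    data = dict(workflow_inputs)
    for app in apps:
        data.update(app.run({name: data[name] for name in app.inputs}))
    return data


# Stand-ins for the two pcpipe parts; they only label their inputs.
part1 = App("pcpipe part 1", ["New.a.faa", "POV PCs"], ["unmatched ORFs"],
            lambda f: {"unmatched ORFs": f"unmatched({f['New.a.faa']})"})
part2 = App("pcpipe part 2", ["unmatched ORFs"], ["Novel PCs"],
            lambda f: {"Novel PCs": f"clustered({f['unmatched ORFs']})"})

starting_files = {"New.a.faa": "user_orfs.faa", "POV PCs": "pov_pcs.faa"}
result = run_workflow([part1, part2], starting_files)
print(result["Novel PCs"])
```

Validating the whole chain before running anything mirrors the point of mapping outputs to inputs up front: a mis-ordered or HPC-only App is rejected before any step starts.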
  62. Gotchas in the PCpipe Workflow
      • Input: New.fastq
      • QC sequences: FASTQ_shrinker
      • Assembly part 1: Velveth
      • Assembly part 2: Velvetg
      • Find Genes: Meta-Gene-Mark (output: New.a.faa)
      • pcpipe part 1: Cd-hit-2d (with POV PCs)
      • pcpipe part 2: Cd-hit (output: POV + Novel PCs)
      • Annotation: Protein annotation, Secondary annotation
      • Diagram labels: pcpipe workflow; Foundation API, Runs on XSEDE; Foundation API, Runs on XSEDE (HPC), cannot be used in a workflow
  63. An Integrated PCPipe
      • Layers: iPlant App, iMicrobe adapter, iMicrobe condor node
      • Roles: Pipeline Management, Foundation Code, HPC Job distribution
      • Step labels (Step 1 to Step 5): BLAST vs SIMAP, cd-hit-2d, cd-hit, extract proteins in novel PCs, SIMAP Annotation
      • Execution labels: on condor, on condor, on condor, on TACC, on condor
      • Input 1: User ORFs
      • Input 2: Existing Protein Clusters
      • Output 1: ORFs in existing clusters
      • Output 2: ORFs in new clusters
      • Output 3: Annotation for new clusters
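The inputs and outputs of the integrated pipeline (user ORFs and existing protein clusters in; ORFs in existing clusters and ORFs in new clusters out) describe a two-stage partition. The toy sketch below illustrates that idea only. It does not call cd-hit or cd-hit-2d: the position-wise `identity` measure, the 0.8 threshold, and all function names are assumptions made for illustration, and the annotation output is left out because it depends on an external database.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Toy similarity: matching positions divided by the longer length."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def assign_to_existing(orfs: Dict[str, str], representatives: Dict[str, str],
                       threshold: float) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Stage 1: place each ORF in its best-matching existing cluster, if close enough."""
    assigned: Dict[str, str] = {}
    leftover: Dict[str, str] = {}
    for orf_id, seq in orfs.items():
        best = max(representatives,
                   key=lambda c: identity(seq, representatives[c]), default=None)
        if best is not None and identity(seq, representatives[best]) >= threshold:
            assigned[orf_id] = best
        else:
            leftover[orf_id] = seq
    return assigned, leftover


def cluster_leftovers(leftover: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Stage 2: greedy clustering of unassigned ORFs, longest sequences first."""
    clusters: Dict[str, List[str]] = {}  # representative ORF id -> member ids
    for orf_id in sorted(leftover, key=lambda k: (-len(leftover[k]), k)):
        for rep_id in clusters:
            if identity(leftover[orf_id], leftover[rep_id]) >= threshold:
                clusters[rep_id].append(orf_id)
                break
        else:
            clusters[orf_id] = [orf_id]
    return clusters


# Made-up sequences, for illustration only.
existing = {"PC1": "MKTAYIAKQR"}
user_orfs = {"orf1": "MKTAYIAKQR", "orf2": "MKTAYIAKQK",
             "orf3": "GGSLLPWWND", "orf4": "GGSLLPWWNE", "orf5": "TTTTT"}

in_existing, unassigned = assign_to_existing(user_orfs, existing, threshold=0.8)
new_clusters = cluster_leftovers(unassigned, threshold=0.8)
print(in_existing)
print(new_clusters)
```

Here `in_existing` plays the role of Output 1 and `new_clusters` the role of Output 2; a production run would rely on the real clustering tools and their own identity definitions.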
  64. PCPipe App: Existing PCs (POV); Directory of User defined ORFs
  65. Collaborating with iPlant
      • Solve computational bottlenecks
      • Make tools easier to use
      • Share Data
      • Provide community input
  66. Questions or Comments? Bonnie Hurwitz, PhD
  67. PCpipe: Protein Cluster Pipeline. Steps in iPlant DE (HPC or Cloud)
      • Input: New.fastq
      • QC sequences: FASTQ_shrinker
      • Assembly: Velvet
      • Find Genes: Prodigal (output: ORFs)
      • pcpipe part 1: Cd-hit-2d (with PCs)
      • pcpipe part 2: Cd-hit (output: PCs + Novel PCs)
      • Gene Annotation: SIMAP, GO, PFAM…
