The iPlant Collaborative:
A Cyberinfrastructure for the Life
            Sciences
          Naim Matasci
  BIO5 / The iPlant Collaborative
What is iPlant?
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Problem 1: Data Volume

• Cost of analysis follows Moore's Law:
  – 1 Student with 1 computer to analyze 1 Mb of
    data produced in 2001
  – 200 Students and 200 computers to analyze all
    data produced for the same cost today (10 Gb)
Problem 2: Fragmented Analytical Landscape
                      1. Tools separated by compute
                         platform, data format, integration
                         issues, and programming model.

                      2. Mixture of desktop, command
                         line, database, and web-based
                         tools

                      3. Labor intensive, fragile solutions
                         devised to reach scientific
                         objectives

                      4. Little ability to share results,
                         analytical methods

                      5. Lack of reproducibility
Scalability




ABI 3730 DNA Analyzer, Illumina Genome Analyzer, Joe Felsenstein ca. 1980, Ranger Cluster at TACC
Major Ways to Access iPlant
• Storing and sharing data large and small: iPlant Data
  Storage
• Integrated web-based analysis: The Discovery Environment
• Cloud computing: Atmosphere
• Applications: TNRS, TreeViewer, PhytoBisque, etc
• Scientific networking, knowledgebase and information
  exchange: My-Plant.org
• Educational tools: DNASubway
• Embedding iPlant CI capabilities into software: The
  Foundation API
• High Performance Computing for experts: TeraGrid/XSEDE

Why is the tree of life important?


“Knowledge of evolutionary relationships is
fundamental to biology, yielding new insights
across the plant sciences, from comparative
genomics and molecular evolution, to plant
development, to the study of adaptation,
speciation, community assembly, and
ecosystem functioning.”
Nothing in biology makes
sense except in the light
of evolution.

                T. G. Dobzhansky
C3 to C4 Photosynthesis




Xin-Guang et al. 2008
"We combined geospatial and molecular
sequence data from two public archives to
produce a 1,230-taxon phylogeny of the grasses
with accompanying climate data for all species,
extracted from more than 1.1 million herbarium
specimens."




 Edwards and Smith, 2010
"Here we show that grasses are ancestrally a
warm-adapted clade and that C4 evolution
was not correlated with shifts between
temperate and tropical biomes. Instead, 18
of 20 inferred C4 origins were correlated with
marked reductions in mean annual
precipitation."
New Possibilities



                                                                      Acer glabrum
                                                                      Acer saccharinum
                                                                      Acer rubrum
                                                                      Acer distylum
                                                                      Acer macrophyllum
                                                                      Acer nipponicum
                                                                      Acer spicatum
                                                                      Acer carpinifolium
                                                                      Acer diabolicum
                                                                      Acer circinatum
                                                                      Acer sieboldianum
                                                                      Acer palmatum
                                                                      Acer saccharum
                                                                      Acer tschonoskii
                                                                      Acer rufinerve
                                                                      Acer pensylvanicum
                                                                      Acer crataegifolium
                                                                      Acer mono




Illumina Genome Analyzer, Ranger Cluster at TACC, Acer phylogeny (Ackerly 2009), Green Plant ToL
Just Ask
Atmosphere
iPlant's APIs – The Foundation API
Service Endpoint  Role
IO                File storage, retrieval and management; database interoperability
DATA              File format conversion
APPS              Registration and discovery of HPC applications
JOB               Submission and management of compute jobs
SYSTEMS           Availability and info about XSEDE hosts
PROFILE           User profile discovery
AUTH              Token-based secure authentication
POSTIT            URL shortener
Consumer Applications




iPlant Data Store




Dramatization: Not the actual iPlant Data Store
Overview of the iPlant Data Store
      Some important items we won’t see in the demo
Source            Destination      Copy Method       Time (seconds)
CD                My Computer      cp                320
Berkeley Server   My Computer      scp               150
External Drive    My Computer      cp                36
USB2.0 Flash      My Computer      cp                30
iDS               My Computer      iget              18
My Computer       My Computer      cp                15

           Close to optimum conditions; transfer between
                  Univ. of Arizona and UC Berkeley
                           100 GB: 29m15s
                  (about 17.5 seconds per GB)
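
The per-gigabyte figure follows directly from the 100 GB timing. A minimal Python sketch of that arithmetic; the function names are illustrative, and the MB/s conversion assumes decimal units (1 GB = 1000 MB), which the slide does not state:

```python
def seconds_per_gb(total_gb: float, minutes: int, seconds: int) -> float:
    """Average number of seconds needed to move one gigabyte."""
    return (minutes * 60 + seconds) / total_gb


def throughput_mb_per_s(total_gb: float, minutes: int, seconds: int) -> float:
    """Average throughput in MB/s, assuming 1 GB = 1000 MB."""
    return total_gb * 1000 / (minutes * 60 + seconds)


# 100 GB moved in 29 min 15 s between Univ. of Arizona and UC Berkeley
per_gb = seconds_per_gb(100, 29, 15)
rate = throughput_mb_per_s(100, 29, 15)
print(f"{per_gb:.2f} s per GB, roughly {rate:.0f} MB/s")
```

This is an average over the whole transfer, so it says nothing about how steady the rate was along the way.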
Tree Visualization

•   > 500K Taxa
•   Fast
•   Web based, platform independent
•   Semantic zooming
•   Metadata driven display of information
iPlant Tree Viewer




http://portnoy.iplantcollaborative.org/
LIVE TREE VIEW DEMO
Obstacles
                                     Lobelia_kauaensis
                                     Lobelia_villosa
                                     Lobelia_gloria-montis
                                     Trematolobelia_kauaiensis
                                     Trematolobelia_macrostachys
                                     Lobelia_hypoleuca
                                     Lobelia_yuccoides
                                     Lobelia_niihauensis
                                     Brighamia_insignis
                                     Brighamia_rockii
                                     Delissea_rhytidosperma
                                     Delissea_subcordata
                                     Cyanea_pilosa
                                     Cyanea_acuminata
                                     Cyanea_hirtella
                                     Cyanea_coriacea
                                     Cyanea_leptostegia
                                     Clermontia_kakeana
                                     Clermontia_parviflora
                                     Clermontia_arborescens
                                     Clermontia_fauriei




Number of taxa               Taxa names
Taxonomic uncertainty

1. Non-existent names
  •   Misspellings
  •   Contamination
      •   Annotations
      •   Morphospecies
      •   Digitization issues (frame shifts, character
          encoding)
      •   Lexical variants (digitization conventions)
2. Synonymy
  •   Nomenclatural synonyms
  •   Taxonomic synonyms / concepts
3. Misidentifications, incomplete identifications
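
Misspellings are the easiest of these cases to illustrate. The sketch below uses approximate string matching from Python's standard `difflib` module against a short reference list. It is not the algorithm used by TNRS; the reference list (a few Acer names from the earlier slide), the function name and the 0.8 cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Small reference list for illustration; a real service would use a
# full taxonomic database.
REFERENCE_NAMES = [
    "Acer saccharum",
    "Acer saccharinum",
    "Acer rubrum",
    "Acer palmatum",
    "Acer macrophyllum",
]


def suggest_name(submitted, reference=REFERENCE_NAMES, cutoff=0.8):
    """Return the closest reference name, or None if nothing is similar enough."""
    # Tree files often use underscores instead of spaces in tip labels.
    cleaned = " ".join(submitted.replace("_", " ").split())
    matches = get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_name("Acer_sacharum"))   # misspelled tip label
print(suggest_name("Quercus alba"))    # not in the reference list
```

String similarity only addresses the first category. Synonymy and misidentification cannot be detected from spelling alone and need curated taxonomic data.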
a) Centaurium curvistamineum (Wittr.) Abrams (1951)
b) Centaurium minimum (Howell) Piper (1915)
c) Centaurium muhlenbergii (Griseb.) Wight ex Piper (1906)
d) Centaurium muhlenbergii (Griseb.) Wight ex Piper forma albiflorum
   (Suksd.) St. John (1937)
e) Centaurium muhlenbergii (Griseb.) Wight ex Piper var. albiflorum
   Suksd. (1927)
f) Centaurodes muhlenbergii (Griseb.) Kuntze (1891)
g) Erythraea curvistaminea Wittr. (1886)
h) Erythraea minima Howell (1901)
i) Erythraea muhlenbergii Griseb. (1839)



Image: Gordon Leppig & Andrea J. Pickart
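
To make the synonymy problem concrete, the nine name strings above can be split into a binomial and a publication year. This is a rough sketch only: the regular expression is written for the exact format on this slide (genus, epithet, authorship, year in parentheses) and is not a general scientific-name parser; the function names are hypothetical.

```python
import re

# Names a) to i) from the slide, authorship and year included.
SLIDE_NAMES = [
    "Centaurium curvistamineum (Wittr.) Abrams (1951)",
    "Centaurium minimum (Howell) Piper (1915)",
    "Centaurium muhlenbergii (Griseb.) Wight ex Piper (1906)",
    "Centaurium muhlenbergii (Griseb.) Wight ex Piper forma albiflorum (Suksd.) St. John (1937)",
    "Centaurium muhlenbergii (Griseb.) Wight ex Piper var. albiflorum Suksd. (1927)",
    "Centaurodes muhlenbergii (Griseb.) Kuntze (1891)",
    "Erythraea curvistaminea Wittr. (1886)",
    "Erythraea minima Howell (1901)",
    "Erythraea muhlenbergii Griseb. (1839)",
]

# Genus and epithet first, anything in between, a four-digit year last.
NAME_PATTERN = re.compile(r"^(?P<binomial>[A-Z][a-z]+ [a-z-]+) .*\((?P<year>\d{4})\)$")


def parse_name(text):
    """Split a name string into (binomial, year); raise if the format differs."""
    match = NAME_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized name format: {text!r}")
    return match.group("binomial"), int(match.group("year"))


def distinct_binomials(names):
    """Set of genus + epithet combinations, ignoring authorship and rank."""
    return {parse_name(name)[0] for name in names}


for binomial, year in sorted((parse_name(n) for n in SLIDE_NAMES), key=lambda p: p[1]):
    print(year, binomial)
print(len(distinct_binomials(SLIDE_NAMES)), "distinct binomials in", len(SLIDE_NAMES), "names")
```

Stripping authorship shows how many different genus and epithet combinations the list contains. Deciding which of them is the accepted name still requires a taxonomic reference.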
Request Tool Installation
        Apps -> Create -> New App




Create New -> Request Tool Installation




                                 Fill out forms and submit.
                                 Receive response in 2-5 days.

Editor's Notes

  • #3 Bringing a culture of computing to the Plant Sciences.
  • #7 The state of the art today. On the left are icons representing SOME of the ways we work with data. Tools are separated from one another by compute platform, data format, integration issues, programming model. Often a mixture of desktop, command line, database, and web-page based analyses. Labor intensive, fragile solutions devised to reach scientific objectives. Little ability to share results, analytical methods, or to work collaboratively. We can INVERT the language of the COMPLAINTS to form DESIGN PRINCIPLES. Going to focus on a couple of NGS cases in my talk.
  • #8 Left tree: Maple tree phylogeny from D. Ackerly. Left picture: Joe Felsenstein, ca. 1980. Right picture: Ranger cluster at TACC.
  • #12 Our understanding of the phylogeny of the half million known species of green plants has expanded dramatically over the past two decades. The task of assembling a comprehensive "tree of life" for them presents a Grand Challenge. Also part of the grand challenge is developing the necessary infrastructure to view and use the tree of life, to put it into the hands of plant biologists.
  • #16 Public archives: MAT = Mean Annual Temperature. Stephen Smith: iPlant-supported postdoc, now assistant professor at the U Michigan. Published in PNAS last year.
  • #18 Left tree: Maple tree phylogeny from D. Ackerly. Left picture: Joe Felsenstein, ca. 1980. Right picture: Ranger cluster at TACC. New sequencing technologies, computational power and simplified access to computational resources allow us to move from local to global scale. Climate change, nutrition: global scale.
  • #26 Highest level of abstraction. Exactly like we can embed recent tweets in our web page, portal builders can add tools and services to their portals, e.g. BioExtract and CIPRES.
  • #40 From the Apps catalog in the DE, select Create -> New App. This opens the Tool Integration interface. Select: Create New -> Request Tool Installation. Fill out the form and submit it. It takes 2-5 business days to deploy the tool.