Connecting Chipster genome browser
to the cloud




                      Aleksi Kallio
           CSC – IT Center for Science, Finland
Architecture of Chipster platform
   [Diagram: Clients, Brokers (message broker, file broker), Computing services, Authentication service, Management service]

   Loosely coupled, independent components
   Message-oriented communication
   Flexible, scalable, robust
   In other words, very cloud-like
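
To make "loosely coupled" and "message-oriented" concrete, here is a minimal sketch in Python. It is not Chipster code: the `MessageBroker` class, the topic names `jobs` and `results`, and the message fields are all hypothetical, and a real broker would run as a separate networked service rather than in-process.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessageBroker:
    """Toy in-process stand-in for a message broker (hypothetical API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Deliver to every current subscriber of the topic; the sender does
        # not know who (if anyone) is listening.
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


def make_compute_service(broker: MessageBroker, name: str) -> Handler:
    """Build a handler that 'computes' a job and publishes the result."""

    def on_job(message: dict) -> None:
        result = {
            "job_id": message["job_id"],
            "worker": name,
            "output": message["payload"].upper(),  # placeholder for real work
        }
        broker.publish("results", result)

    return on_job


broker = MessageBroker()
received: List[dict] = []

# The client only knows the topics, not which service handles the job.
broker.subscribe("results", received.append)
broker.subscribe("jobs", make_compute_service(broker, "compute-1"))

delivered = broker.publish("jobs", {"job_id": 1, "payload": "acgt"})
```

Because the client and the computing service share only topic names, a service can be added, removed or moved to another machine without changing the client.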
Chipster in the cloud

 1) Deploying compute nodes in the cloud
    •   Easy, because the architecture is already loosely coupled and
        based on message passing
 2) Running large parallel jobs in the cloud
    •   The architecture allows this easily
    •   Cloud-compatible tools can be integrated quickly
 3) Using the cloud as a back end for interactive
  visualisations
    •   Perhaps not so obvious
    •   So let's dig into this further...
Background: Chipster Genome Browser


   Interactive Swing-based GUI
   Shows reads and analysis results in genomic context
   Interactive zooming from chromosome down to nucleotide level
   Ensembl annotations for genes and transcripts
   Integrated with the rest of Chipster
   Parallel, and distributed to some extent
Basic idea

 ▪ Preprocess data with Hadoop / MapReduce
 ▪ Generate power-of-two summaries of the data, as in
  Google Earth
    •   Doubles the data size
 ▪ The current genome browser samples data to produce
  summaries
 ▪ With preprocessing, summaries can be read directly
    –   Accurate results, significantly fewer disk seeks
 ▪ Distribute data to scale to massive datasets
    •   Use messaging to query independent data providers
 ▪ Aggregate results in the visualiser as (and if) they arrive
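
The power-of-two summaries can be sketched as a pyramid of levels, where each level halves the number of bins of the level below it. The sketch below is plain Python rather than Hadoop code, and it assumes the summarised value is per-position coverage combined by summing; the function names `build_pyramid` and `query` are made up for illustration.

```python
from typing import List, Tuple


def build_pyramid(coverage: List[int]) -> List[List[int]]:
    """Level 0 is the raw per-position coverage; level k has bins of 2**k positions."""
    levels = [list(coverage)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Sum adjacent pairs; an unpaired last bin is carried over as is.
        nxt = [
            prev[i] + (prev[i + 1] if i + 1 < len(prev) else 0)
            for i in range(0, len(prev), 2)
        ]
        levels.append(nxt)
    return levels


def query(levels: List[List[int]], start: int, end: int,
          max_bins: int) -> Tuple[int, List[int]]:
    """Return (bin_width, bins) for [start, end) from the finest level
    that needs at most max_bins bins, e.g. one bin per screen pixel."""
    for level, data in enumerate(levels):
        width = 1 << level
        lo = start // width
        hi = -(-end // width)  # ceiling division
        if hi - lo <= max_bins:
            return width, data[lo:hi]
    raise ValueError("region cannot be shown with the given max_bins")


coverage = [3, 1, 4, 1, 5, 9, 2, 6]
levels = build_pyramid(coverage)

# Zoomed out: the whole region in two bins, read from a summary level.
zoomed_out = query(levels, 0, 8, max_bins=2)

# Zoomed in: few enough positions that the raw data is returned.
zoomed_in = query(levels, 2, 6, max_bins=4)

# All summary levels together hold fewer values than the raw data,
# which is why the total data size roughly doubles.
summary_size = sum(len(level) for level in levels[1:])
```

A zoomed-out view then reads a handful of precomputed bins instead of sampling the raw reads, so the answer is exact and needs fewer disk seeks.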
Work in progress...

 ▪ Genome browser up and
  running
 ▪ Hadoop-based data
  processing at a very early
  stage
 ▪ Currently trying to get it
  to scale well
What's the point?

 ▪ Besides items (e.g., reads), the visualiser can receive
  “superitems” (e.g., summaries of reads)
    •   A superitem summarises the coverage, quality, SNPs etc. of the original reads
 ▪ All kinds of advanced information can be generated in
  the preprocessing step
    –   Such as features that combine a large number of genomes
    –   Generators should be pluggable
 ▪ We spend resources on the server side to improve the user
  experience on the client side
    •   The server side requires CPU, memory and disk space
    •   But only for a short time (as in large batch jobs)
    •   Cheap commodity servers can be used
    •   And the experiment has already been expensive, so the extra
        computing cost is small in comparison
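
One way to picture superitems and pluggable generators is a registry of small functions, each of which computes one summary field for a genomic region. This is only a sketch under assumptions: the `Read` fields, the `register` decorator, the generator names and the dictionary layout of a superitem are all invented here and are not taken from Chipster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Read:
    start: int           # 0-based start position (assumed convention)
    length: int
    mean_quality: float


SummaryGenerator = Callable[[List[Read], int, int], float]
GENERATORS: Dict[str, SummaryGenerator] = {}


def register(name: str):
    """Plug a new summary generator in under the given field name."""
    def decorator(func: SummaryGenerator) -> SummaryGenerator:
        GENERATORS[name] = func
        return func
    return decorator


@register("coverage")
def mean_coverage(reads: List[Read], start: int, end: int) -> float:
    """Average number of reads covering each position in [start, end)."""
    covered = sum(
        max(0, min(end, r.start + r.length) - max(start, r.start))
        for r in reads
    )
    return covered / (end - start)


@register("quality")
def mean_read_quality(reads: List[Read], start: int, end: int) -> float:
    """Mean quality of the reads that overlap [start, end)."""
    overlapping = [r for r in reads
                   if r.start < end and r.start + r.length > start]
    if not overlapping:
        return 0.0
    return sum(r.mean_quality for r in overlapping) / len(overlapping)


def make_superitem(reads: List[Read], start: int, end: int) -> dict:
    """Run every registered generator over the region."""
    item = {"start": start, "end": end}
    for name, generator in GENERATORS.items():
        item[name] = generator(reads, start, end)
    return item


reads = [Read(0, 4, 30.0), Read(2, 4, 20.0), Read(10, 4, 40.0)]
superitem = make_superitem(reads, 0, 8)
```

Adding a new kind of summary, such as a SNP count, would then mean registering one more function, without touching the code that builds or displays superitems.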
Summary

 ▪ Use cheap server resources to enable a better user
  experience
 ▪ Goal: to make data analysis quicker (and more fun)
 ▪ Tackle server-side unreliability on the client side
 ▪ Future development
        –    If this works out, it could also be used in other Chipster
             visualisers
        –    Integrating HBase queries into interactive visualisations
        –    Optimising data summarising for visual truthfulness
 ▪ For more info: aleksi.kallio@csc.fi,
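
As an illustration of tackling server-side unreliability on the client side, the sketch below queries several independent data providers in parallel and hands results to the visualiser as they arrive, skipping providers that fail or are too slow. It uses Python threads in place of real messaging, and the provider functions, their delays and the `stream_results` name are assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout


def make_provider(name, delay, coverage=None, fail=False):
    """Build a fake data provider that answers after `delay` seconds."""
    def provider(region):
        time.sleep(delay)
        if fail:
            raise ConnectionError(f"{name} is unavailable")
        return {"provider": name, "region": region, "coverage": coverage}
    return provider


def stream_results(providers, region, timeout):
    """Yield provider answers as they arrive; skip failures and stragglers."""
    pool = ThreadPoolExecutor(max_workers=len(providers))
    futures = [pool.submit(p, region) for p in providers]
    try:
        for future in as_completed(futures, timeout=timeout):
            if future.exception() is None:
                yield future.result()
            # A failed provider is simply left out of the picture.
    except FuturesTimeout:
        pass  # stop waiting for providers that are too slow
    finally:
        pool.shutdown(wait=False)  # do not block the client on stragglers


providers = [
    make_provider("fast", delay=0.01, coverage=12),
    make_provider("broken", delay=0.05, fail=True),
    make_provider("slow", delay=1.5, coverage=7),
]

arrived = []
for result in stream_results(providers, region=("chr1", 0, 1024), timeout=0.5):
    arrived.append(result)  # a real visualiser would redraw here
```

The view is drawn from whatever has arrived, so a missing or slow provider degrades the picture instead of blocking it.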

Kallio bosc2010 chipster-cloud
