Clouds: All fluff and no substance?

Guy Coates
Wellcome Trust Sanger Institute
gmpc@sanger.ac.uk
Outline
About the Sanger Institute.
Experience with cloud to date.
Future Directions.
The Sanger Institute
Funded by the Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based on the Hinxton Genome Campus, Cambridge, UK.

Large scale genomic research.
• We have active cancer, malaria, pathogen and genomic variation / human health studies.
• 1000 Genomes, 10k UK Genomes and Cancer Genome projects.

All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.
Economic Trends:
The cost of sequencing halves every 12 months.
• cf Moore's Law
The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.
A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and
  10,000s of genomes.

Changes in sequencing technology are
going to continue this trend.
• “Next-next” generation sequencers are on their way.
• $500 genome is probable within 5 years.
The scary graph
[Chart: sequencing output over time; annotations mark instrument upgrades and the peak yearly capillary sequencing output.]
Managing Growth
We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.

Moore's law will not save us.
• Transistor/disk density: Td = 18 months.
• Sequencing cost: Td = 12 months.

[Chart: disk storage in terabytes, 1994-2009, growing from near zero to ~6,000 TB.]

My Job:
• Running the team who do the IT systems heavy-lifting to make it all work.
• Tech evaluations.
• Systems architecture.
• Day-to-day administration.
• All in conjunction with informaticians, programmers & investigators who are doing the science.
Cloud: Where are we at?
What is cloud?
Technical view:
• On demand, virtual machines.
• Root access, total ownership.
• Pay-as-you-go model.

Non-technical view:
• “Free” compute we can use to solve all of the hard problems thrown up by new sequencing.
   • (8 cents/hour is almost free, right...?)

• Web 2.0 / Friendface use it, so it must be good.
Hype Cycle
[Hype-cycle chart, annotated “Awesome!” and “Just works...”.]
Out of the trough of disillusionment... Victory!
Cloud Use-Cases
We currently have three areas of activity:
• Web presence
• HPC workload
• Data Warehousing
Ensembl
Ensembl is a system for genome annotation.
Data visualisation (Web Presence)
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute Pipeline (HPC Workload)
• Take a raw genome and run it through a compute pipeline to find genes
    and other features of interest.
•   Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate
    genomes.

• Software is Open Source (Apache license).
• Data is free for download.
We have done cloud experiments with both the web site
and pipeline.
Web presence
Web Presence
Ensembl has a worldwide audience.
Historically, web site performance was not great, especially for non-European institutes.
• Pages were quite heavyweight.
• Not properly cached etc.
Web team spent a lot of time re-designing the code to
make it more streamlined.
• Greatly improved performance.
Coding can only get you so far.
• “A canna' change the laws of physics.”
   • 150-240 ms round trip time from Europe to the US.
• We need a set of geographically dispersed mirrors.
uswest.ensembl.org
Traditional mirror: Real machines in a co-lo facility in
California.
Hardware was initially configured on site.
• 16 servers, SAN storage, SAN switches, SAN management appliance,
  Ethernet switches, firewall, out-of-band management etc.

Shipped to the co-lo for installation.
• Sent a person to California for 3 weeks.
• Spent 1 week getting stuff into/out of customs.
   •   ****ing FCC paperwork!

Additional infrastructure work.
• VPN between UK and US.
Incredibly time consuming.
• Really don't want to end up having to send someone on a plane to the US
  to fix things.
Usage
US-West currently takes ~1/3rd of total Ensembl web traffic.
• Much lower latency and improved site usability.
What has this got to do with clouds?
useast.ensembl.org
We want an east coast US mirror to complement our west
coast mirror.
Built the mirror in AWS.
• Initially a proof of concept / test-bed.
• Production-level in due course.
Gives us operational experience.
• We can compare to a “real” colo.
Building a mirror on AWS

Some software development / sysadmin work needed.
• Preparation of OS images, software stack configuration.
• West-coast was built as an extension of Sanger internal network via VPN.
• AWS images built as standalone systems.
Web code changes
• Significant code changes required to make the webcode “mirror aware”.
     •   Search, site login etc.
     •   We chose not to set up VPN into AWS.
     •   Work already done for the first mirror.

Significant amount of tuning required.
• Initial mysql performance was pretty bad, especially for the large Ensembl databases (~1 TB).
•   Lots of people doing Apache/mysql on AWS, so there is a good amount of
    best-practice etc available.
Does it work?
[Screenshot: the AWS-hosted mirror up and running, marked BETA.]
Is it better than the co-lo?
No physical hardware.
• Work can start as soon as we enter our credit card numbers...
• No US customs, Fedex etc.
Much simpler management infrastructure.
   • AWS gives you out-of-band management “for free”.
• Much simpler to deal with hardware problems.
   • And we do remote management all the time.


“Free” hardware upgrades.
• As faster machines become available we can take advantage of them
    immediately.
•   No need to get tin decommissioned /re-installed at Co-lo.
Is it cost effective?
Lots of misleading cost statements made about cloud.
• “Our analysis only cost $500.”
• CPU is only “$0.085 / hr”.
What are we comparing against?
• Doing the analysis once? Continually?
• Buying a $2000 server?
• Leasing a $2000 server for 3 years?
• Using $150 of time at your local supercomputing facility?
• Buying a $2000 server but having to build a $1M datacentre to put it in?

Requires the dreaded Total Cost of Ownership (TCO)
calculation.
• hardware + power + cooling + facilities + admin/developers etc
   •   Incredibly hard to do.
Let's do it anyway...
Comparing costs to the co-lo is simpler.
• Power and cooling costs are all included.
• Admin costs are the same, so we can ignore them.
   •   Same people responsible for both.

Cost for Co-location facility:
• $120,000 hardware + $51,000 /yr colo.
• $91,000 per year (3 years hardware lifetime).
Cost for AWS :
• $77,000 per year (estimated based on US-east traffic / IOPs)
Result: Estimated 16% cost saving.
• It is not free!
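As a sanity check on those numbers, a quick back-of-the-envelope calculation (a sketch only, using the figures quoted above; the AWS number is our traffic/IOPs-based estimate, not an actual bill):

    # Rough TCO comparison using the figures quoted above.
    hardware = 120_000        # co-lo hardware, amortised over 3 years
    colo_fee = 51_000         # co-lo facility fee per year
    colo_per_year = hardware / 3 + colo_fee     # = $91,000 / year

    aws_per_year = 77_000     # estimate based on US-east traffic / IOPs

    saving = (colo_per_year - aws_per_year) / colo_per_year
    print(f"co-lo ${colo_per_year:,.0f}/yr vs AWS ${aws_per_year:,.0f}/yr "
          f"-> ~{saving:.0%} saving")           # roughly 15-16%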
Additional Benefits
Website + code is packaged together.
• Can be conveniently given away to end users in a “ready-to-run” config.
• Simplifies configuration for other users wanting to run Ensembl sites.
• Configuring an Ensembl site is non-trivial for non-informaticians.
   • CVS, mysql setup, apache configuration etc.

Ensembl data is already available as an Amazon public
dataset.
• Makes a complete system.
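For illustration, attaching the public dataset to your own instance is only a few API calls; a minimal boto3 sketch, where the snapshot ID, instance ID and availability zone are placeholders (the real snapshot IDs are published with the AWS public dataset):

    import boto3

    # Placeholders -- substitute the published Ensembl public-dataset snapshot,
    # your availability zone and your running instance.
    SNAPSHOT_ID = "snap-xxxxxxxx"
    INSTANCE_ID = "i-xxxxxxxx"
    AZ = "us-east-1a"

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Create an EBS volume from the public snapshot...
    vol = ec2.create_volume(SnapshotId=SNAPSHOT_ID, AvailabilityZone=AZ)
    ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

    # ...attach it, then mount it from inside the VM as usual.
    ec2.attach_volume(VolumeId=vol["VolumeId"], InstanceId=INSTANCE_ID,
                      Device="/dev/sdf")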
Unknowns
What about scale-up?
Current installation is a minimal config.
• Single web / database nodes.
• Main site and us-west use multiple load-balanced servers.
AWS load-balancing architecture is different from what we
currently use.
• In theory there should be no problems...
• ...but we don't know until we try.
• Do we go for automatic scale-out?
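For what automatic scale-out could look like, here is a hedged boto3 sketch using an Auto Scaling group behind an existing load balancer; the AMI, group and ELB names are invented for the example and this is not our current configuration:

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    # Launch configuration based on a pre-built Ensembl web-node image.
    autoscaling.create_launch_configuration(
        LaunchConfigurationName="ensembl-web",
        ImageId="ami-xxxxxxxx",      # placeholder AMI
        InstanceType="m1.large",
    )

    # Scale between 1 and 8 web nodes behind the load balancer.
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="ensembl-web-asg",
        LaunchConfigurationName="ensembl-web",
        MinSize=1,
        MaxSize=8,
        AvailabilityZones=["us-east-1a"],
        LoadBalancerNames=["ensembl-web-elb"],   # placeholder ELB name
    )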
Downsides
Underestimated the time it would take to make the web code mirror-ready.
• Not a cloud-specific problem, but something to be aware of when you take big applications and move them outside your home institution.

Packaging OS images, code and data needs to be done for
every ensembl release.
• Ensembl team now has a dedicated person responsible for the cloud.
• Somebody has to look after the systems.
Management overhead does not necessarily go down.
• But it does change.
Going forward
useast.ensembl.org to go into production later this year.
• A Far East Amazon region is also of interest.
   • Likely to be next, assuming useast works.

“Virtual” Co-location concept will be useful for a number of
other projects.
• Other Sanger websites?
Disaster recovery.
• Eg replicate critical databases / storage into AWS.
• Currently all of Sanger data lives in a single datacentre.
• We have a small amount of co-lo space for mirroring critical data.
   • Same arguments apply as for the uswest mirror.
Hype Cycle
[Hype-cycle chart, with “Web services” marked on it.]
Ensembl Pipeline
HPC element of Ensembl.
• Takes raw genomes and performs automated annotation on them.
Compute Pipeline
TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG
GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA
TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA
TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC
AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC
TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA
ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG
AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC
TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG
AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG
AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA
GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT
ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
Raw Sequence → Something useful
Example annotation
Gene Finding
[Diagram: evidence tracks aligned against the genomic DNA:]
• HMM prediction
• Alignment with fragments recovered in vivo
• Alignment with known proteins
• Alignment with other genes and other species
Workflow
Compute Pipeline
Architecture:
• OO Perl pipeline manager.
• Core algorithms are C.
• 200 auxiliary binaries.
Workflow:
• Investigator describes analysis at high level.
• Pipeline manager splits the analysis into parallel chunks.
     • Typically 50k-100k jobs.
•   Sorts out the dependencies and then submits jobs to a DRM.
     • Typically LSF or SGE.
•   Pipeline state and results are stored in a mysql database.

Workflow is embarrassingly parallel.
• Integer, not floating point.
• 64 bit memory address is nice, but not required.
     • 64 bit file access is required.
•   Single threaded jobs.
•   Very IO intensive.
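The pipeline manager itself is OO Perl, but the chunk-and-submit pattern is simple enough to sketch; a hypothetical Python version, assuming an LSF-style bsub on the path (the script name and log paths are made up):

    import subprocess

    def submit_chunks(chunks, script="run_analysis.pl", queue="normal"):
        """Submit one single-threaded job per input chunk to the DRM (LSF here)."""
        job_ids = []
        for i, chunk in enumerate(chunks):
            cmd = ["bsub", "-q", queue,
                   "-o", f"logs/chunk_{i}.out",
                   script, chunk]
            out = subprocess.run(cmd, capture_output=True, text=True, check=True)
            job_ids.append(out.stdout)  # bsub prints "Job <id> is submitted..."
            # Pipeline state (job id, chunk, status) would be recorded in the
            # mysql database at this point.
        return job_ids

    # e.g. submit_chunks([f"chunk_{n}.fa" for n in range(50_000)])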
Running the pipeline in practice
Requires a significant amount of domain knowledge.
Software install is complicated.
• Lots of perl modules and dependencies.
Need a well tuned compute cluster.
• Pipeline takes ~500 CPU days for a moderate genome.
     • Ensembl chewed up 160k CPU days last year.
•   Code is IO bound in a number of places.
•   Typically need a high performance filesystem.
     • Lustre, GPFS, Isilon, Ibrix etc.
•   Need a large mysql database.
     • 100 GB to TB-scale mysql instances, with a very high query load generated from the cluster.
Why Cloud?
Proof of concept
• Is HPC even possible in cloud infrastructures?
Coping with the big increase in data
• Will we be able to provision new machines/datacentre space to keep up?
• What happens if we need to “out-source” our compute?
• Can we be in a position to shift peaks of demand to cloud facilities?
Expanding markets

There are going to be lots of new genomes that need
annotating.
• Sequencers moving into small labs, clinical settings.
• Limited informatics / systems experience.
     • Typically postdocs/PhD students who have a “real” job to do.
•   They may want to run the genebuild pipeline on their data, but they may
    not have the expertise to do so.

We have already done all the hard work on installing the
software and tuning it.
• Can we package up the pipeline, put it in the cloud?
Goal: End user should simply be able to upload their data,
insert their credit-card number, and press “GO”.
Porting HPC code to the cloud
Let's build a compute cluster in the cloud.
Software stack / machine image.
• Creating images with software is reasonably straightforward.
• No big surprises.
Queuing system
• Pipeline requires a queueing system: (LSF/SGE)
     • Licensing problems.
•   Getting them to run took a lot of fiddling.
     • Machines need to find each other once they are inside the cloud.
     • Building an automated “self discovering” cluster takes some hacking.
     • Hopefully others can re-use it.


Mysql databases
• Lots of best practice on how to do that on EC2.
It took time, even for experienced systems people.
• (You will not be firing your system-administrators just yet!).
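As an illustration of the “self discovering” part, one approach is to tag the instances and have each node look up its peers at boot; a hedged boto3 sketch (the tag name, value and output path are invented for the example):

    import boto3

    def cluster_hosts(tag_value, region="us-east-1"):
        """Return private hostnames of running instances tagged as part of our cluster."""
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instances(Filters=[
            {"Name": "tag:cluster", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ])
        return [inst["PrivateDnsName"]
                for res in resp["Reservations"]
                for inst in res["Instances"]]

    # Each node rebuilds the queueing system's host list from this at boot time.
    with open("/tmp/cluster_hosts", "w") as fh:
        fh.write("\n".join(cluster_hosts("ensembl-pipeline")))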
Did it work? NO!
“High performance computing is not facebook.”
                                               -- Chris Dagdigian


The big problem: data.
• Moving data into the cloud is hard.
• Doing stuff with data once it is in the cloud is also hard.
If you look closely, most successful cloud projects have
small amounts of data (10-100 Mbytes).
Genomics projects have Tbytes → Pbytes of data.
Moving data is hard

Commonly used tools (FTP,ssh/rsync) are not suited to
wide-area networks.
• Need to use specialised WAN tools: gridFTP/FDT/Aspera.
There is a lot of broken internet.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1TB to Dublin.
• 23 hours to move 1 TB to East coast.
What speed should we get?
• Once we leave JANET (UK academic network) finding out what the
    connectivity is and what we should expect is almost impossible.
•   Finding out who to talk to when you diagnose a troublesome link is also
    almost impossible.
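Those transfer times follow straight from the measured rates; a quick check (decimal units, sustained rates as above):

    def hours_to_move(terabytes, mbytes_per_sec):
        """Transfer time in hours at a sustained rate."""
        return terabytes * 1e12 / (mbytes_per_sec * 1e6) / 3600

    print(hours_to_move(1, 25))   # ~11 h  (Cambridge -> EC2 Dublin)
    print(hours_to_move(1, 12))   # ~23 h  (Cambridge -> EC2 East coast)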
Networking
“But the physicists do this
all the time.”
• No they don't.
• LHC Grid: dedicated networking between CERN and the T1 centres, which get all of the data.

Can we use this model?
• We have relatively short lived and
    fluid collaborations. (1-2 years,
    many institutions).
•   As more labs get sequencers, our
    potential collaborators also
    increase.
•   We need good connectivity to
    everywhere.
Using data within the cloud
Compute nodes need to have fast access to the data.
• We solve this with exotic and temperamental filesystems/storage.
No viable global filesystems on EC2.
• NFS has poor scaling at the best of times.
• EC2 has poor inter-node networking: with more than 8 NFS clients, everything stops.

Nasty hacks:
• Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3.
   • Interesting performance, and you are paying by the hour...
Compute architecture
[Diagram comparing two architectures:]
• Traditional cluster: batch scheduler, fat network, CPUs sharing a POSIX global filesystem in front of the data-store.
• hadoop/S3: thin network, CPUs each with local storage, pulling from the data-store.
Why not S3 / hadoop / map-reduce?
Not POSIX.
   • Lots of code expects files on a filesystem.
• Limitations: cannot store objects > 5 GB.
• Throw-away file formats?

Nobody wants to re-write existing applications.
• They already work on our compute farm.
• How do hadoop apps co-exist with non-hadoop ones?
   • Do we have to have two different types of infrastructure and move data between them?
• Barrier to entry seems much lower for filesystems.

Am I being a reactionary old fart?
• 15 years ago clusters of PCs were not “real” supercomputers.
     • ...then Beowulf took over the world.
•   Big difference: porting applications between the two architectures was
    easy.
•   MPI/PVM etc.

Will the market provide “traditional” compute clusters in
the cloud?
Hype cycle
[Hype-cycle chart, with “HPC” marked on it.]
HPC app summary
You cannot take an existing data-rich HPC app and expect
it to work.
• IO architectures are too different.
There is some re-factoring going on for the Ensembl pipeline to make it EC2-friendly.
• Currently on a case-by-case basis.
• For the less-data intensive parts.
Waiting for the market to deliver...
Shared data archives
Past Collaborations
[Diagram: several sequencing centres feeding data to one sequencing centre that also acts as the DCC and holds the data.]
Genomics Data
Data size per genome:
• Individual features (3 MB)
• Variation data (1 GB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Intensities / raw data (2 TB)
The small items at the top are structured data (databases), used by clinical researchers and non-informaticians; the large items at the bottom are unstructured data (flat files), used by sequencing informatics specialists.
The Problem With Current Archives
Data in current archives is “dark”.
• You can put/get data, but cannot compute across it.

Data is all in one place.
• Problematic if you are not the DCC: you have to pull the data down to do something with it.
• Holding data in one place is bad for disaster-recovery and network access.

Is data in an inaccessible archive really useful?
A real example...
“We want to run our pipeline across 100 TB of data currently in EGA/SRA.”


We will need to de-stage the data to Sanger, and then run
the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
• Download:
   • 46 days at 25 Mbytes/s (best transatlantic link).
   • 10 days at 1 Gbit/s (sling a cable across the datacentre to the EBI).
An easy problem to solve in PowerPoint:
Put data into a cloud.
• Big cloud providers already have replicated storage infrastructures.
Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.

[Diagram: VMs running on CPUs attached to replicated copies of the data.]
Practical Hurdles
How do you expose the data?
• Flat files? Database?
How do you make the compute efficient?
• Cloud IO problems still there.
   •   And you make the end user pay for them.

How do we deal with controlled access?
• Hard problem. Grid / delegated security mechanisms are complicated for
  a reason.
Whose Cloud?
Most of us are funded to hold data, not to fund everyone else's compute costs too.
• Now need to budget for raw compute power as well as disk.
• Implement virtualisation infrastructure, billing etc.
   •   Are you legally allowed to charge?
   •   Who underwrites it if nobody actually uses your service?

Strongly implies the data has to be held by a commercial provider.
Can it solve our networking problems?
Moving data across the internet is hard.
• Fixing the internet is not going to be cost effective for us.
Fixing the internet may be cost effective for big cloud
providers.
• Core to their business model.
• All we need to do is get data into Amazon, and then everyone else can get
  the data from there.

• Do we invest in fast links to Amazon?
• It changes the business dynamic.
• We have effectively tied ourselves to a single provider.
Where are we?
[Hype-cycle chart, with “Computable archives” marked on it.]
Summary
Cloud works well for web services.
Data rich HPC workloads are still hard.
Cloud based data archives look really interesting.
Acknowledgements
Phil Butcher
ISG Team
• James Beal
• Gen-Tao Chiang
• Pete Clapham
• Simon Kelley
Ensembl
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
• Glenn Proctor
• Stephen Keenan
Cancer Genome Project
• Adam Butler
• John Teague

More Related Content

What's hot

More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...Amazon Web Services
 
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster Recovery
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster RecoveryAWS Summit Tel Aviv - Enterprise Track - Backup and Disaster Recovery
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster RecoveryAmazon Web Services
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsAvere Systems
 
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...Amazon Web Services
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsAmazon Web Services
 
The challenges of live events scalability
The challenges of live events scalabilityThe challenges of live events scalability
The challenges of live events scalabilityGuy Tomer
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
 
Your Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveYour Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveIlya Ganelin
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindAvere Systems
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million UsersAmazon Web Services
 
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...Amazon Web Services
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11CloudExpoEurope
 
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS Amazon Web Services
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users Amazon Web Services
 
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014Amazon Web Services
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)Amazon Web Services
 

What's hot (20)

More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
More Nines for Your Dimes: Improving Availability and Lowering Costs using Au...
 
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster Recovery
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster RecoveryAWS Summit Tel Aviv - Enterprise Track - Backup and Disaster Recovery
AWS Summit Tel Aviv - Enterprise Track - Backup and Disaster Recovery
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for Analysts
 
Running Yarn at Scale
Running Yarn at Scale Running Yarn at Scale
Running Yarn at Scale
 
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve...
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data Workloads
 
The challenges of live events scalability
The challenges of live events scalabilityThe challenges of live events scalability
The challenges of live events scalability
 
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million UsersAWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Sydney 2014 | Scaling on AWS for the First 10 Million Users
 
(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud(ISM301) Engineering Netflix Global Operations In The Cloud
(ISM301) Engineering Netflix Global Operations In The Cloud
 
Your Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's PerspectiveYour Guide to Streaming - The Engineer's Perspective
Your Guide to Streaming - The Engineer's Perspective
 
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your MindDeliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
Deliver Best-in-Class HPC Cloud Solutions Without Losing Your Mind
 
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
AWS Summit Auckland 2014 | Scaling on AWS for the First 10 Million Users
 
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...
Developing for Your Target Market - Social, Games & Mobile - AWS India Summit...
 
Data storage for the cloud ce11
Data storage for the cloud ce11Data storage for the cloud ce11
Data storage for the cloud ce11
 
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS
Best Practices for Genomic and Bioinformatics Analysis Pipelines on AWS
 
Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users Scaling on AWS for the First 10 Million Users
Scaling on AWS for the First 10 Million Users
 
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014
(SPOT301) AWS Innovation at Scale | AWS re:Invent 2014
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
 
Machine Learning in Action
Machine Learning in ActionMachine Learning in Action
Machine Learning in Action
 

Viewers also liked

Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Naz Torabi
 
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!oxfordcollegelibrary
 
안드로이드스터디 6
안드로이드스터디 6안드로이드스터디 6
안드로이드스터디 6jangpd007
 
Prestige Magazine
Prestige MagazinePrestige Magazine
Prestige MagazineJay Lee
 
Yahoo! Local : Smart Ads With Localized Product
Yahoo! Local :  Smart Ads With Localized ProductYahoo! Local :  Smart Ads With Localized Product
Yahoo! Local : Smart Ads With Localized ProductDevan McCoy
 
Computo.ppt formatos.ppt1
Computo.ppt formatos.ppt1Computo.ppt formatos.ppt1
Computo.ppt formatos.ppt1cesar
 
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...Hee Young Shin
 
Chap014 pricing and negotiating for value
Chap014 pricing and negotiating for valueChap014 pricing and negotiating for value
Chap014 pricing and negotiating for valueHee Young Shin
 
никуда я не хочу идти
никуда я не хочу идти  никуда я не хочу идти
никуда я не хочу идти ko63ar
 
Bonnal bosc2010 bio_ruby
Bonnal bosc2010 bio_rubyBonnal bosc2010 bio_ruby
Bonnal bosc2010 bio_rubyBOSC 2010
 
Hemmerich bosc2010 isga_ergatis
Hemmerich bosc2010 isga_ergatisHemmerich bosc2010 isga_ergatis
Hemmerich bosc2010 isga_ergatisBOSC 2010
 
WeonTV at the EuroITV 2009
WeonTV at the EuroITV 2009WeonTV at the EuroITV 2009
WeonTV at the EuroITV 2009Social iTV
 
Robinson bosc2010 bio_hdf
Robinson bosc2010 bio_hdfRobinson bosc2010 bio_hdf
Robinson bosc2010 bio_hdfBOSC 2010
 

Viewers also liked (20)

Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share Library resources at your fingertips 2012 slide share
Library resources at your fingertips 2012 slide share
 
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!
Mary Moser: LOLcats, Celebrities, and (Red Panda) Bears -- Oh, My!!
 
Conclusions
ConclusionsConclusions
Conclusions
 
안드로이드스터디 6
안드로이드스터디 6안드로이드스터디 6
안드로이드스터디 6
 
Prestige Magazine
Prestige MagazinePrestige Magazine
Prestige Magazine
 
Offers Market Analysis
Offers Market AnalysisOffers Market Analysis
Offers Market Analysis
 
Yahoo! Local : Smart Ads With Localized Product
Yahoo! Local :  Smart Ads With Localized ProductYahoo! Local :  Smart Ads With Localized Product
Yahoo! Local : Smart Ads With Localized Product
 
Computo.ppt formatos.ppt1
Computo.ppt formatos.ppt1Computo.ppt formatos.ppt1
Computo.ppt formatos.ppt1
 
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...
đáNh giá và hoàn thiện chương trình tài trợ 'thời trang và cuộc sống' của nhã...
 
Chap014 pricing and negotiating for value
Chap014 pricing and negotiating for valueChap014 pricing and negotiating for value
Chap014 pricing and negotiating for value
 
никуда я не хочу идти
никуда я не хочу идти  никуда я не хочу идти
никуда я не хочу идти
 
Economy katalog
Economy katalogEconomy katalog
Economy katalog
 
Bonnal bosc2010 bio_ruby
Bonnal bosc2010 bio_rubyBonnal bosc2010 bio_ruby
Bonnal bosc2010 bio_ruby
 
Economic and Policy Impacts of Climate Change
Economic and Policy Impacts of Climate ChangeEconomic and Policy Impacts of Climate Change
Economic and Policy Impacts of Climate Change
 
Utube
UtubeUtube
Utube
 
Hemmerich bosc2010 isga_ergatis
Hemmerich bosc2010 isga_ergatisHemmerich bosc2010 isga_ergatis
Hemmerich bosc2010 isga_ergatis
 
Marcellus Shale
Marcellus ShaleMarcellus Shale
Marcellus Shale
 
WeonTV at the EuroITV 2009
WeonTV at the EuroITV 2009WeonTV at the EuroITV 2009
WeonTV at the EuroITV 2009
 
Robinson bosc2010 bio_hdf
Robinson bosc2010 bio_hdfRobinson bosc2010 bio_hdf
Robinson bosc2010 bio_hdf
 
Bicentenario
BicentenarioBicentenario
Bicentenario
 

Similar to Coates bosc2010 clouds-fluff-and-no-substance

Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Guy Coates
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud ExperiencesGuy Coates
 
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov Docker, Inc.
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented DesignRodrigo Campos
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneCeph Community
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapSrinath Perera
 
A scalable server environment for your applications
A scalable server environment for your applicationsA scalable server environment for your applications
A scalable server environment for your applicationsGigaSpaces
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...Amazon Web Services
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013Server Density
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics PlatformSantanu Dey
 
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEI
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEIDATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEI
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEIBig Data Week
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel ComputingRoshan Karunarathna
 
"The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You""The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You"Chris Dwan
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchTom Connor
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...OpenNebula Project
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Ankit Gupta
 

Similar to Coates bosc2010 clouds-fluff-and-no-substance (20)

Clouds: All fluff and no substance?
Clouds: All fluff and no substance?Clouds: All fluff and no substance?
Clouds: All fluff and no substance?
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov
Sharding Containers: Make Go Apps Computer-Friendly Again by Andrey Sibiryov
 
Performance Oriented Design
Performance Oriented DesignPerformance Oriented Design
Performance Oriented Design
 
Erasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William ByrneErasure Code at Scale - Thomas William Byrne
Erasure Code at Scale - Thomas William Byrne
 
Big Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and RoadmapBig Data Analytics Strategy and Roadmap
Big Data Analytics Strategy and Roadmap
 
A scalable server environment for your applications
A scalable server environment for your applicationsA scalable server environment for your applications
A scalable server environment for your applications
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013
 
DataCore Case Study on Hyperconverged
DataCore Case Study on HyperconvergedDataCore Case Study on Hyperconverged
DataCore Case Study on Hyperconverged
 
Building a High Performance Analytics Platform
Building a High Performance Analytics PlatformBuilding a High Performance Analytics Platform
Building a High Performance Analytics Platform
 
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEI
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEIDATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEI
DATA LAKE AND THE RISE OF THE MICROSERVICES - ALEX BORDEI
 
Introduction to Amazon EC2
Introduction to Amazon EC2Introduction to Amazon EC2
Introduction to Amazon EC2
 
Introduction to Parallel Computing
Introduction to Parallel ComputingIntroduction to Parallel Computing
Introduction to Parallel Computing
 
"The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You""The Cutting Edge Can Hurt You"
"The Cutting Edge Can Hurt You"
 
CLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB LaunchCLIMB System Introduction Talk - CLIMB Launch
CLIMB System Introduction Talk - CLIMB Launch
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentation
 
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...
OpenNebulaConf2015 1.06 Fermilab Virtual Facility: Data-Intensive Computing i...
 
Making Sense of Remote Sensing
Making Sense of Remote SensingMaking Sense of Remote Sensing
Making Sense of Remote Sensing
 
Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)Mod05lec24(resource mgmt i)
Mod05lec24(resource mgmt i)
 

More from BOSC 2010

Mercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkMercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkBOSC 2010
 
Langmead bosc2010 cloud-genomics
Langmead bosc2010 cloud-genomicsLangmead bosc2010 cloud-genomics
Langmead bosc2010 cloud-genomicsBOSC 2010
 
Schultheiss bosc2010 persistance-web-services
Schultheiss bosc2010 persistance-web-servicesSchultheiss bosc2010 persistance-web-services
Schultheiss bosc2010 persistance-web-servicesBOSC 2010
 
Swertz bosc2010 molgenis
Swertz bosc2010 molgenisSwertz bosc2010 molgenis
Swertz bosc2010 molgenisBOSC 2010
 
Rice bosc2010 emboss
Rice bosc2010 embossRice bosc2010 emboss
Rice bosc2010 embossBOSC 2010
 
Morris bosc2010 evoker
Morris bosc2010 evokerMorris bosc2010 evoker
Morris bosc2010 evokerBOSC 2010
 
Kono bosc2010 pathway_projector
Kono bosc2010 pathway_projectorKono bosc2010 pathway_projector
Kono bosc2010 pathway_projectorBOSC 2010
 
Kanterakis bosc2010 molgenis
Kanterakis bosc2010 molgenisKanterakis bosc2010 molgenis
Kanterakis bosc2010 molgenisBOSC 2010
 
Gautier bosc2010 pythonbioconductor
Gautier bosc2010 pythonbioconductorGautier bosc2010 pythonbioconductor
Gautier bosc2010 pythonbioconductorBOSC 2010
 
Gardler bosc2010 community_developmentattheasf
Gardler bosc2010 community_developmentattheasfGardler bosc2010 community_developmentattheasf
Gardler bosc2010 community_developmentattheasfBOSC 2010
 
Friedberg bosc2010 iprstats
Friedberg bosc2010 iprstatsFriedberg bosc2010 iprstats
Friedberg bosc2010 iprstatsBOSC 2010
 
Fields bosc2010 bio_perl
Fields bosc2010 bio_perlFields bosc2010 bio_perl
Fields bosc2010 bio_perlBOSC 2010
 
Chapman bosc2010 biopython
Chapman bosc2010 biopythonChapman bosc2010 biopython
Chapman bosc2010 biopythonBOSC 2010
 
Puton bosc2010 bio_python-modules-rna
Puton bosc2010 bio_python-modules-rnaPuton bosc2010 bio_python-modules-rna
Puton bosc2010 bio_python-modules-rnaBOSC 2010
 
Bader bosc2010 cytoweb
Bader bosc2010 cytowebBader bosc2010 cytoweb
Bader bosc2010 cytowebBOSC 2010
 
Talevich bosc2010 bio-phylo
Talevich bosc2010 bio-phyloTalevich bosc2010 bio-phylo
Talevich bosc2010 bio-phyloBOSC 2010
 
Zmasek bosc2010 aptx
Zmasek bosc2010 aptxZmasek bosc2010 aptx
Zmasek bosc2010 aptxBOSC 2010
 
Wilkinson bosc2010 moby-to-sadi
Wilkinson bosc2010 moby-to-sadiWilkinson bosc2010 moby-to-sadi
Wilkinson bosc2010 moby-to-sadiBOSC 2010
 
Venkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitVenkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitBOSC 2010
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 

More from BOSC 2010 (20)

Mercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_frameworkMercer bosc2010 microsoft_framework
Mercer bosc2010 microsoft_framework
 
Langmead bosc2010 cloud-genomics
Langmead bosc2010 cloud-genomicsLangmead bosc2010 cloud-genomics
Langmead bosc2010 cloud-genomics
 
Schultheiss bosc2010 persistance-web-services
Schultheiss bosc2010 persistance-web-servicesSchultheiss bosc2010 persistance-web-services
Schultheiss bosc2010 persistance-web-services
 
Swertz bosc2010 molgenis
Swertz bosc2010 molgenisSwertz bosc2010 molgenis
Swertz bosc2010 molgenis
 
Rice bosc2010 emboss
Rice bosc2010 embossRice bosc2010 emboss
Rice bosc2010 emboss
 
Morris bosc2010 evoker
Morris bosc2010 evokerMorris bosc2010 evoker
Morris bosc2010 evoker
 
Kono bosc2010 pathway_projector
Kono bosc2010 pathway_projectorKono bosc2010 pathway_projector
Kono bosc2010 pathway_projector
 
Kanterakis bosc2010 molgenis
Kanterakis bosc2010 molgenisKanterakis bosc2010 molgenis
Kanterakis bosc2010 molgenis
 
Gautier bosc2010 pythonbioconductor
Gautier bosc2010 pythonbioconductorGautier bosc2010 pythonbioconductor
Gautier bosc2010 pythonbioconductor
 
Gardler bosc2010 community_developmentattheasf
Gardler bosc2010 community_developmentattheasfGardler bosc2010 community_developmentattheasf
Gardler bosc2010 community_developmentattheasf
 
Friedberg bosc2010 iprstats
Friedberg bosc2010 iprstatsFriedberg bosc2010 iprstats
Friedberg bosc2010 iprstats
 
Fields bosc2010 bio_perl
Fields bosc2010 bio_perlFields bosc2010 bio_perl
Fields bosc2010 bio_perl
 
Chapman bosc2010 biopython
Chapman bosc2010 biopythonChapman bosc2010 biopython
Chapman bosc2010 biopython
 
Puton bosc2010 bio_python-modules-rna
Puton bosc2010 bio_python-modules-rnaPuton bosc2010 bio_python-modules-rna
Puton bosc2010 bio_python-modules-rna
 
Bader bosc2010 cytoweb
Bader bosc2010 cytowebBader bosc2010 cytoweb
Bader bosc2010 cytoweb
 
Talevich bosc2010 bio-phylo
Talevich bosc2010 bio-phyloTalevich bosc2010 bio-phylo
Talevich bosc2010 bio-phylo
 
Zmasek bosc2010 aptx
Zmasek bosc2010 aptxZmasek bosc2010 aptx
Zmasek bosc2010 aptx
 
Wilkinson bosc2010 moby-to-sadi
Wilkinson bosc2010 moby-to-sadiWilkinson bosc2010 moby-to-sadi
Wilkinson bosc2010 moby-to-sadi
 
Venkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitVenkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkit
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 

Recently uploaded

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 

Recently uploaded (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Coates bosc2010 clouds-fluff-and-no-substance

  • 1. Clouds: All fluff and no substance? Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk
  • 2. Outline About the Sanger Institute. Experience with cloud to date. Future Directions.
  • 3. The Sanger Institute Funded by Wellcome Trust. • 2nd largest research charity in the world. • ~700 employees. • Based on Hinxton Genome Campus, Cambridge, UK. Large scale genomic research. • We have active cancer, malaria, pathogen and genomic variation / human health studies. • 1k genomes, & 10k-UK Genomes, Cancer genome projects. All data is made publicly available. • Websites, ftp, direct database access, programmatic APIs.
  • 4. Economic Trends: As cost of sequencing halves every 12 months. • cf Moore's Law The Human genome project: • 13 years. • 23 labs. • $500 Million. A Human genome today: • 3 days. • 1 machine. • $10,000. • Large centres are now doing studies with 1000s and 10,000s of genomes. Changes in sequencing technology are going to continue this trend. • “Next-next” generation sequencers are on their way. • $500 genome is probable within 5 years.
  • 5. The scary graph Instrument upgrades Peak Yearly capillary sequencing
  • 6. Managing Growth We have exponential growth in storage and compute. • Storage /compute doubles every 12 Disk Storage months. 6000 • 2009 ~7 PB raw 5000 4000 Moore's law will not save us. • Transistor/disk density: Td=18 months Terabytes 3000 • Sequencing cost: Td=12 months 2000 My Job: 1000 • Running the team who do the IT 0 systems heavy-lifting to make it all work. 1995 1997 1999 2001 2003 2005 2007 2009 • 1994 1996 1998 2000 2002 2004 2006 2008 Tech evaluations. Year • Systems architecture. • Day-to-day administration. • All in conjunction with informaticians, programmers & investigators who are doing the science.
  • 8. What is cloud? Technical view: • On demand, virtual machines. • Root access, total ownership. • Pay-as-you-go model. Non-technical view: • “Free” compute we can use to solve all of the hard problems thrown up by new sequencing. • (8cents/hour is almost free, right...?) • Web 2.0 / Friendface use it, so it must be good.
  • 9. Hype Cycle Awesome! Just works...
  • 10. Out of the trough of disillusionment...
  • 12. Cloud Use-Cases We currently have three areas of activity: • Web presence • HPC workload • Data Warehousing
  • 13. Ensembl Ensembl is a system for genome Annotation. Data visualisation (Web Presence) • www.ensembl.org • Provides web / programmatic interfaces to genomic data. • 10k visitors / 126k page views per day. Compute Pipeline (HPC Workload) • Take a raw genome and run it through a compute pipeline to find genes and other features of interest. • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes. • Software is Open Source (apache license). • Data is free for download. We have done cloud experiments with both the web site and pipeline.
  • 15.
  • 16. Web Presence Ensembl has a worldwide audience. Historically, web site performance was not great, especially for non-European institutes. • Pages were quite heavyweight. • Not properly cached etc. The web team spent a lot of time re-designing the code to make it more streamlined. • Greatly improved performance. Coding can only get you so far. • “A canna' change the laws of physics.” • 150-240ms round trip time from Europe to the US. • We need a set of geographically dispersed mirrors.
  • 17. uswest.ensembl.org Traditional mirror: Real machines in a co-lo facility in California. Hardware was initially configured on site. • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc. Shipped to the co-lo for installation. • Sent a person to California for 3 weeks. • Spent 1 week getting stuff into/out of customs. • ****ing FCC paperwork! Additional infrastructure work. • VPN between UK and US. Incredibly time consuming. • Really don't want to end up having to send someone on a plane to the US to fix things.
  • 18. Usage US-West currently takes ~1/3rd of total Ensembl web traffic. • Much lower latency and improved site usability.
  • 19. What has this got to do with clouds?
  • 20. useast.ensembl.org We want an east coast US mirror to complement our west coast mirror. Built the mirror in AWS. • Initially a proof of concept / test-bed. • Production-level in due course. Gives us operational experience. • We can compare to a “real” colo.
  • 21. Building a mirror on AWS Some software development / sysadmin work needed. • Preparation of OS images, software stack configuration. • West-coast was built as an extension of the Sanger internal network via VPN. • AWS images built as standalone systems. Web code changes • Significant code changes required to make the webcode “mirror aware”. • Search, site login etc. • We chose not to set up VPN into AWS. • Work already done for the first mirror. Significant amount of tuning required. • Initial mysql performance was pretty bad, especially for the large ensembl databases (~1TB). • Lots of people doing Apache/mysql on AWS, so there is a good amount of best practice available.
  • 22. Does it work? BETA!
  • 23. Is it better than the co-lo? No physical hardware. • Work can start as soon as we enter our credit card numbers... • No US customs, Fedex etc. Much simpler management infrastructure. • AWS gives you out-of-band management “for free”. • Much simpler to deal with hardware problems. • And we do remote management all the time. “Free” hardware upgrades. • As faster machines become available we can take advantage of them immediately. • No need to get tin decommissioned/re-installed at the co-lo.
  • 24. Is it cost effective? Lots of misleading cost statements made about cloud. • “Our analysis only cost $500.” • CPU is only “$0.085 / hr”. What are we comparing against? • Doing the analysis once? Continually? • Buying a $2000 server? • Leasing a $2000 server for 3 years? • Using $150 of time at your local supercomputing facility? • Buying a $2000 server but having to build a $1M datacentre to put it in? Requires the dreaded Total Cost of Ownership (TCO) calculation. • hardware + power + cooling + facilities + admin/developers etc. • Incredibly hard to do.
  • 25. Let's do it anyway... Comparing costs to the co-lo is simpler. • Power and cooling costs are all included. • Admin costs are the same, so we can ignore them. • Same people responsible for both. Cost for co-location facility: • $120,000 hardware + $51,000/yr colo. • $91,000 per year (3 year hardware lifetime). Cost for AWS: • $77,000 per year (estimated based on US-east traffic / IOPs). Result: estimated 16% cost saving. • It is not free!
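The saving quoted above falls out of simple arithmetic. A minimal sketch, using only the figures on the slide (hardware amortised over its three-year lifetime plus annual co-lo fees, against the estimated AWS bill):

# Co-lo vs AWS annual cost, using the figures quoted on the slide.
COLO_HARDWARE = 120_000       # one-off hardware purchase ($)
HARDWARE_LIFETIME_YEARS = 3   # amortisation period
COLO_FEES_PER_YEAR = 51_000   # co-location facility fees ($/yr)
AWS_PER_YEAR = 77_000         # estimate based on US-east traffic / IOPs ($/yr)

colo_per_year = COLO_HARDWARE / HARDWARE_LIFETIME_YEARS + COLO_FEES_PER_YEAR
saving = (colo_per_year - AWS_PER_YEAR) / colo_per_year

print(f"co-lo ${colo_per_year:,.0f}/yr, AWS ${AWS_PER_YEAR:,.0f}/yr, saving {saving:.0%}")
# -> co-lo $91,000/yr, AWS $77,000/yr, saving 15% (the slide rounds to 16%)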
  • 26. Additional Benefits Website + code is packaged together. • Can be conveniently given away to end users in a “ready-to-run” config. • Simplifies configuration for other users wanting to run Ensembl sites. • Configuring an ensembl site is non-trivial for non-informaticians. • CVS, mysql setup, apache configuration etc. Ensembl data is already available as an Amazon public dataset. • Makes a complete system.
  • 27. Unknowns What about scale-up? Current installation is a minimal config. • Single web / database nodes. • Main site and us-east use multiple load balanced servers. AWS load-balancing architecture is different from what we currently use. • In theory there should be no problems... • ...but we don't know until we try. • Do we go for automatic scale-out?
  • 28. Downsides Underestimated the time it would take to make the web-code mirror-ready. • Not a cloud-specific problem, but something to be aware of when you take big applications and move them outside your home institution. Packaging OS images, code and data needs to be done for every ensembl release. • Ensembl team now has a dedicated person responsible for the cloud. • Somebody has to look after the systems. Management overhead does not necessarily go down. • But it does change.
  • 29. Going forward useast.ensembl.org to go into production later this year. • Far-east Amazon availability zone is also of interest. • Likely to be next, assuming useast works. The “virtual” co-location concept will be useful for a number of other projects. • Other Sanger websites? Disaster recovery. • E.g. replicate critical databases / storage into AWS. • Currently all of Sanger's data lives in a single datacentre. • We have a small amount of co-lo space for mirroring critical data. • The same arguments apply as for the uswest mirror.
  • 30. Hype Cycle Web services
  • 31. Ensembl Pipeline HPC element of Ensembl. • Takes raw genomes and performs automated annotation on them.
  • 32. Compute Pipeline TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG GAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAA TTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA TTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCC AAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC TTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAA ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG AAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCAC TGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG AACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAG AAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCA GAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATT ATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC
  • 33. Raw Sequence → Something useful
  • 35. Gene Finding (diagram): DNA → HMM prediction, alignment with fragments recovered in vivo, alignment with known proteins, alignment with other genes and other species.
  • 37. Compute Pipeline Architecture: • OO perl pipeline manager. • Core algorithms are C. • 200 auxiliary binaries. Workflow: • Investigator describes analysis at a high level. • Pipeline manager splits the analysis into parallel chunks. • Typically 50k-100k jobs. • Sorts out the dependencies and then submits jobs to a DRM. • Typically LSF or SGE. • Pipeline state and results are stored in a mysql database. Workflow is embarrassingly parallel. • Integer, not floating point. • 64-bit memory addressing is nice, but not required. • 64-bit file access is required. • Single-threaded jobs. • Very IO intensive.
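As a rough illustration of the split-and-submit pattern described above (not the actual Ensembl pipeline manager, which is OO Perl and keeps its state in mysql), here is a minimal Python sketch; the queue, worker script, input file and chunk count are assumptions:

# Sketch: split an embarrassingly parallel analysis into chunks and
# submit one LSF job per chunk. All names here are hypothetical.
import subprocess

INPUT = "genome.fa"           # assumed input data
N_CHUNKS = 1000               # real pipeline runs are 50k-100k jobs
WORKER = "./run_analysis.sh"  # assumed per-chunk worker script

for i in range(N_CHUNKS):
    subprocess.run(
        ["bsub",                       # LSF job submission
         "-q", "long",                 # queue name is site-specific
         "-o", f"logs/chunk_{i}.out",  # per-job log
         WORKER, INPUT, str(i), str(N_CHUNKS)],
        check=True,
    )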
  • 38. Running the pipeline in practice Requires a significant amount of domain knowledge. Software install is complicated. • Lots of perl modules and dependencies. Need a well-tuned compute cluster. • Pipeline takes ~500 CPU days for a moderate genome. • Ensembl chewed up 160k CPU days last year. • Code is IO bound in a number of places. • Typically need a high performance filesystem. • Lustre, GPFS, Isilon, Ibrix etc. • Need a large mysql database. • 100 GB to TB-scale mysql instances, with a very high query load generated from the cluster.
  • 39. Why Cloud? Proof of concept • Is HPC even possible on cloud infrastructure? Coping with the big increase in data • Will we be able to provision new machines/datacentre space to keep up? • What happens if we need to “out-source” our compute? • Can we be in a position to shift peaks of demand to cloud facilities?
  • 40. Expanding markets There are going to be lots of new genomes that need annotating. • Sequencers moving into small labs, clinical settings. • Limited informatics / systems experience. • Typically postdocs/PhD students who have a “real” job to do. • They may want to run the genebuild pipeline on their data, but they may not have the expertise to do so. We have already done all the hard work on installing the software and tuning it. • Can we package up the pipeline, put it in the cloud? Goal: the end user should simply be able to upload their data, insert their credit-card number, and press “GO”.
  • 41. Porting HPC code to the cloud Let's build a compute cluster in the cloud. Software stack / machine image. • Creating images with software is reasonably straightforward. • No big surprises. Queuing system • Pipeline requires a queueing system (LSF/SGE). • Licensing problems. • Getting them to run took a lot of fiddling. • Machines need to find each other once they are inside the cloud. • Building an automated “self discovering” cluster takes some hacking. • Hopefully others can re-use it. Mysql databases • Lots of best practice on how to do that on EC2. It took time, even for experienced systems people. • (You will not be firing your system administrators just yet!)
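A hedged sketch of the “self-discovering cluster” idea: on boot, a worker node looks up its own private address from the EC2 instance metadata service and reports it to a registration endpoint on the head node, which can then add it to the LSF/SGE host list. The head-node URL and the registration protocol are assumptions for illustration only:

# Sketch: worker node announces itself to a (hypothetical) head node at boot.
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/local-ipv4"  # EC2 instance metadata
HEAD_NODE_URL = "http://head-node.internal:8080/register"            # assumed endpoint

# Ask the metadata service for this instance's private IP address.
my_ip = urllib.request.urlopen(METADATA_URL, timeout=5).read().decode().strip()

# Tell the head node we exist; it can then add us as an execution host.
req = urllib.request.Request(HEAD_NODE_URL, data=my_ip.encode(), method="POST")
urllib.request.urlopen(req, timeout=5)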
  • 42. Did it work? NO! “High performance computing is not facebook.” -- Chris Dagdigian The big problem: data. • Moving data into the cloud is hard. • Doing stuff with data once it is in the cloud is also hard. If you look closely, most successful cloud projects have small amounts of data (10-100 Mbytes). Genomics projects have Tbytes → Pbytes of data.
  • 43. Moving data is hard Commonly used tools (FTP, ssh/rsync) are not suited to wide-area networks. • Need to use specialised WAN tools: gridFTP/FDT/Aspera. There is a lot of broken internet. Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link). • Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s) • Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s) • 11 hours to move 1 TB to Dublin. • 23 hours to move 1 TB to the East coast. What speed should we get? • Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible. • Finding out who to talk to when you need to diagnose a troublesome link is also almost impossible.
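The transfer times above follow directly from the measured rates; a quick worked check:

# Back-of-envelope transfer times at the measured rates above.
TB = 1e12  # bytes

def hours_to_move(size_bytes, rate_mbytes_per_s):
    return size_bytes / (rate_mbytes_per_s * 1e6) / 3600

print(f"{hours_to_move(1 * TB, 25):.0f} h")  # Cambridge -> EC2 Dublin at 25 MB/s: ~11 hours
print(f"{hours_to_move(1 * TB, 12):.0f} h")  # Cambridge -> EC2 East coast at 12 MB/s: ~23 hours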
  • 44. Networking “But the physicists do this all the time.” • No they don't. • LHC Grid: dedicated networking between CERN and the T1 centres, which get all of the data. Can we use this model? • We have relatively short-lived and fluid collaborations (1-2 years, many institutions). • As more labs get sequencers, our potential collaborators also increase. • We need good connectivity to everywhere.
  • 45. Using data within the cloud Compute nodes need to have fast access to the data. • We solve this with exotic and temperamental filesystems/storage. No viable global filesystems on EC2. • NFS has poor scaling at the best of times. • EC2 has poor inter-node networking; with > 8 NFS clients, everything stops. Nasty hacks: • Subcloud: a commercial product that allows you to run a POSIX filesystem on top of S3. • Interesting performance, and you are paying by the hour...
  • 46. Compute architecture (diagram): traditional HPC vs cloud. Traditional: a batch scheduler and a fat network, with every CPU node mounting a POSIX global filesystem backed by a central data store. Cloud: a thin network, CPU nodes with only local storage, and data held in hadoop/S3.
  • 47. Why not S3/hadoop/map-reduce? Not POSIX. • Lots of code expects files on a filesystem. • Limitations: cannot store objects > 5GB. • Throw-away file formats? Nobody wants to re-write existing applications. • They already work on our compute farm. • How do hadoop apps co-exist with non-hadoop ones? • Do we have to have two different types of infrastructure and move data between them? • Barrier to entry seems much lower for file-systems. Am I being a reactionary old fart? • 15 years ago clusters of PCs were not “real” supercomputers. • ...then beowulf took over the world. • Big difference: porting applications between the two architectures was easy. • MPI/PVM etc. Will the market provide “traditional” compute clusters in the cloud?
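One hedged illustration of working within the object-size limit mentioned above: split a large flat file into parts that each fit under 5GB before pushing them to S3. The file name and part size are made up, and the upload step is left out because it depends on the client tooling in use:

# Sketch: split a big flat file into parts below the 5GB S3 object limit.
# A production version would stream in smaller blocks rather than holding
# a whole 4 GiB part in memory.
CHUNK = 4 * 1024**3  # 4 GiB per part, comfortably under 5GB

def split_file(path):
    part_paths = []
    with open(path, "rb") as src:
        part = 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            part_path = f"{path}.part{part:04d}"
            with open(part_path, "wb") as dst:
                dst.write(data)
            part_paths.append(part_path)
            part += 1
    return part_paths  # each part then becomes its own S3 object

parts = split_file("alignments.bam")  # hypothetical input file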
  • 49. HPC app summary You cannot take an existing data-rich HPC app and expect it to work. • IO architectures are too different. There is some re-factoring going on for the ensembl pipeline to make it EC2 friendly. • Currently on a case-by-case basis. • For the less-data intensive parts. Waiting for the market to deliver...
  • 51. Past Collaborations (diagram): several sequencing centres each feeding their data into a single sequencing centre + DCC.
  • 52. Genomics Data (diagram): data size per genome, from small structured data (databases) used by clinical researchers and non-informaticians down to large unstructured flat files handled by sequencing informatics specialists. • Individual features: 3 MB. • Variation data: 1 GB. • Alignments: 200 GB. • Sequence + quality data: 500 GB. • Intensities / raw data: 2 TB.
  • 53. The Problem With Current Archives Data in current archives is “dark”. • You can put/get data, but cannot compute across it. Data is all in one place. • Problematic if you are not the DCC: • You have to pull the data down to do something with it. • Holding data in one place is bad for disaster recovery and network access. Is data in an inaccessible archive really useful?
  • 54. A real example... “We want to run our pipeline across 100TB of data currently in EGA/SRA.” We will need to de-stage the data to Sanger, and then run the compute. • Extra 0.5 PB of storage, 1000 cores of compute. • 3 month lead time. • ~$1.5M capex. • Download: • 46 days at 25 Mbytes/s (best transatlantic link). • 10 days at 1 Gbit/s (sling a cable across the datacentre to EBI).
  • 55. An easy problem to solve in PowerPoint: put the data into a cloud. • Big cloud providers already have replicated storage infrastructures. Upload the workload onto VMs. • Put VMs on compute that is “attached” to the data. (Diagram: VMs placed onto CPUs attached to replicated data stores.)
  • 56. Practical Hurdles How do you expose the data? • Flat files? Database? How do you make the compute efficient? • Cloud IO problems still there. • And you make the end user pay for them. How do we deal with controlled access? • Hard problem. Grid / delegated security mechanisms are complicated for a reason.
  • 57. Whose Cloud? Most of us are funded to hold data, not to fund everyone else's compute costs too. • Now need to budget for raw compute power as well as disk. • Implement virtualisation infrastructure, billing etc. • Are you legally allowed to charge? • Who underwrites it if nobody actually uses your service? Strongly implies the data has to be held on a commercial provider.
  • 58. Can it solve our networking problems? Moving data across the internet is hard. • Fixing the internet is not going to be cost effective for us. Fixing the internet may be cost effective for big cloud providers. • Core to their business model. • All we need to do is get data into Amazon, and then everyone else can get the data from there. Do we invest in fast links to Amazon? • It changes the business dynamic. • We have effectively tied ourselves to a single provider.
  • 60. Summary Cloud works well for web services. Data-rich HPC workloads are still hard. Cloud-based data archives look really interesting.
  • 61. Acknowledgements Phil Butcher ISG Team • James Beal • Gen-Tao Chiang • Pete Clapham • Simon Kelley Ensembl • Steve Searle • Jan-Hinnerk Vogel • Bronwen Aken • Glenn Proctor • Stephen Keenan Cancer Genome Project • Adam Butler • John Teague