The Cutting Edge Can Hurt You
Stories from real-world adopters of next generation
sequencing technology
Christopher Dwan
The Bioteam
Bioteam
• Consultancy, with a software business
• Vendor Neutral, Technology Agnostic
• “Bridge the gap” between high performance computing and life
sciences
• Founded 2003
Shameless Plugs:
• We’re in Booth 113
• Next Generation Sequencing Workshop (yesterday, plus next year)
• http://bioteam.net
• We’re hiring.
cdwan@bioteam.net
Disclaimer
• Most BioTeam clients don’t have 7 figure IT
budgets, petabyte SANs, dedicated datacenters,
and so on.
• Many of these problems:
– Are quite different for the largest Bio-HPC centers
– Simply don’t matter to the nationally funded
projects.
I offer no answers…
Review: 2007 Predictions
• Multi-core commodity processors
– Workstations are already insanely powerful.
• Virtualization on the workstation
– Why port code, when you can make a new machine?
• Reconfigurable computing goes mainstream
– Partnerships, Collaborations, and re-seller agreements.
• Next generation DNA sequencing
– Data tsunami
2007 Predictions - Reviewed
• Multi-core commodity processors
– I touched a 16 core workstation with 10TB of disk. It “just worked.”
• Virtualization on the workstation: Everywhere! Including data!
– I saw a workstation replace 6 legacy OS’s in one shop.
• Reconfigurable computing goes mainstream: seemingly not yet
– Talk to the folks on the show floor to get strong contradictions.
• Next generation DNA sequencing. Yup.
– Data tsunami
NEXT GENERATION SEQUENCERS
$5M: If you get one of these …
You probably know about these
Next Generation Sequencing
• Costs
– Instrument cost: $5 × 10^5 to $10^6
– Reagent Cost: $3k - $10k
– ~ 1 TB / machine / day
– 4 or 5 vendors
• Cost imbalance with IT
components
– $7k for an experiment
– $3k for a new server
• Opens high-throughput to much
smaller labs.
Next Generation Sequencing
• Naming is annoying:
– “Next gen”, “new gen”, “now gen”
• High Throughput DNA Sequencing
– Helicos, Roche, Illumina, ABI,
Church Lab
– Also other domains: Confocal
microscopes, mass spec, …
“The old days, which were about two
years ago”
Fundamentals
• Standard facilities questions:
– Heat / Power / Floor capacity, just as much as ever
• Network:
– Moving 1TB of data from instrument to the next
room, much less to a collaborator
– Still with the sneakernet.
• Security
• Data
• Lab information management
Information Management
• Day 0
– Simply catching the data and not dropping it is a
challenge.
• 3 months
– Postdocs carrying data around on firewire disks
– Data management with post-it notes.
• 6 months
– Instrument vendor updates their software.
– Re-analysis?
• 1 year
– New machine from a different vendor.
Networking / Data Motion
• Data motion can interfere with data acquisition
(go on, ask the instrument vendor)
• Software updates can interfere with attempts to
automate data motion
• Move 1TB of data from lab to network closet
– Old building network, instrument offline for hours
– Small $200 4 port gigabit switch (see “security”)
– Excuse for a building-wide upgrade, 1 year horizon
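The "instrument offline for hours" point can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 100 Mbit/s figure for an old building network and the 70% link efficiency are assumptions, not numbers from the talk.

```python
def transfer_hours(size_tb: float, link_mbit_per_s: float, efficiency: float = 0.7) -> float:
    """Hours needed to move size_tb decimal terabytes over a network link.

    efficiency is an assumed fraction of the nominal bandwidth that is
    actually achieved after protocol overhead and contention.
    """
    size_bits = size_tb * 1e12 * 8                      # terabytes -> bits
    effective_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return size_bits / effective_bits_per_s / 3600      # seconds -> hours


if __name__ == "__main__":
    links = [
        ("100 Mbit/s (assumed old building network)", 100),
        ("1 Gbit/s (small gigabit switch)", 1000),
    ]
    for label, mbit in links:
        print(f"{label}: {transfer_hours(1.0, mbit):.1f} h per TB")
```

Under these assumptions a 1TB run takes more than a day to move at 100 Mbit/s and a few hours at gigabit speed, so at roughly 1 TB / machine / day the slower link cannot keep up with a single instrument.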
Security
• Network and IT Security: serious job.
• Labs must propose workable solutions, not
wait for security staff to provide them.
• Common observations:
– Mess with building wide network = security audit
– No solution = system offline.
“Stay out of the news”
Humans are the problem
• Systems: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation
• Management: Throughput? Status? Schedule? Security?
• Lab Staff: Availability, quality, “did it work?”
• IT Staff: Throughput? Status? Schedule
• Bioinformaticians: Access to data, metadata,
Automation is the solution
• Systems: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation
• Management: Machines need to write web pages
• IT Staff: Offer levels of service, negotiate directly with scientists
• Bioinformaticians: Access to data, metadata,
• LIMS
WIKILIMS
Wikipedia/ UC Berkeley
Automatic data capture - Ra
Most structured content can be captured and recorded by
programs as it is generated
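A minimal sketch of what such automatic capture might look like, assuming each instrument run lands in its own directory. This is not WikiLIMS code: the function name `describe_run` and the record fields are hypothetical, chosen only to show metadata being recorded by a program rather than on post-it notes.

```python
import hashlib
import json
import os
import time


def describe_run(run_dir: str) -> dict:
    """Collect per-file metadata for one (hypothetical) instrument run directory."""
    files = []
    for root, _dirs, names in os.walk(run_dir):
        for name in sorted(names):
            path = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Hash in 1 MiB chunks so large data files never sit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            files.append({
                "path": os.path.relpath(path, run_dir),
                "bytes": os.path.getsize(path),
                "sha256": digest.hexdigest(),
            })
    return {
        "run_dir": os.path.abspath(run_dir),
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "file_count": len(files),
        "total_bytes": sum(entry["bytes"] for entry in files),
        "files": files,
    }


if __name__ == "__main__":
    # The resulting JSON record is the kind of structured content a program
    # could post to a wiki or LIMS page while the file data stays on disk.
    print(json.dumps(describe_run("."), indent=2))
```

The checksums make it possible to tell later whether a copy carried around on an external disk still matches what the instrument wrote.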
WikiLIMS: Next Gen Data Store
• File data: RAID
• Metadata: wiki
Version Differences
Launch an assembly
Launch an assembly on the cluster
WikiLIMS
• Still sold as a custom service
– No long term license
– Full source code access
– Highly customizable
• Variety of customers:
– Navy Medical Research Labs
– Cold Spring Harbor
– Emory University
– National Cancer Institute
– …
• Booth 113, we’ll talk your ear off.
This is the semantic web
All updates happening at once
STORAGE AND BACKUPS
“On their way to becoming a sick joke”
Storage
• Storage: Same in 2008 as in 2006
– Unhappy technology tradeoffs
– ‘Exotic’ vendors offer blazing speed and a few features
– ‘Mainstream’ vendors exclusively focused on enterprise
– What I need: Massive scaling, decent speed & grab bag of
enterprise features
• Real World Solution, early 2008:
– 100TB disk, backup, small cluster, plus all infrastructure
– Price range: $225k - $998k
Cut the problem into pieces.
Archive vs. Resequence
• 2007
– Shocking suggestion to delete primary data
– Sanger suggested the MAID (Massive Array of Idle Disks)
– Novartis reported that 97% of their files are never
accessed 3 months after generation.
• 2008
– Instrument vendors deleting large volumes of data inside
the box.
– Less shock, more “data lifecycle”
“New instrument data would be different anyway”
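A "data lifecycle" policy needs to know which files have gone cold. The sketch below lists files not accessed within a cutoff, defaulting to 90 days to echo the three-month figure above. It assumes the filesystem records access times; volumes mounted with `noatime` or `relatime` will give misleading results, so treat the output as a starting point for discussion, not a deletion list.

```python
import os
import time


def stale_files(root, max_idle_days=90.0, now=None):
    """Yield (path, size_bytes) for files last accessed before the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400     # days -> seconds
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            if info.st_atime < cutoff:
                yield path, info.st_size


if __name__ == "__main__":
    idle = list(stale_files("."))
    total_gb = sum(size for _path, size in idle) / 1e9
    print(f"{len(idle)} files idle for 90+ days, {total_gb:.2f} GB in total")
```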
Data Storage
• 1 TB
– $200 @ Savers ($0.20 / GB)
• 24 – 48TB
– Commodity solutions, many vendors ($0.70 /GB)
• 100TB+
– Interesting architectural tradeoffs
– Decision should be based on support expectations
– Below $3 / GB really scares me
• 1 – 2PB
– “Large”
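The per-GB figures above translate into purchase prices as follows. This is a rough sketch using decimal units (1 TB = 1000 GB); the 48TB and 100TB capacities are picked from the ranges on the slide, and pairing 100TB with $3 / GB is only an illustration of the stated comfort floor.

```python
def cost_usd(capacity_tb: float, usd_per_gb: float) -> float:
    """Purchase price for capacity_tb decimal terabytes at a per-GB price."""
    return capacity_tb * 1000 * usd_per_gb


# (label, capacity in TB, price in USD per GB), taken from the tiers above.
PRICE_POINTS = [
    ("1 TB retail disk", 1, 0.20),
    ("48 TB commodity array", 48, 0.70),
    ("100 TB at the $3 / GB comfort floor", 100, 3.00),
]

if __name__ == "__main__":
    for label, capacity_tb, usd_per_gb in PRICE_POINTS:
        print(f"{label}: ${cost_usd(capacity_tb, usd_per_gb):,.0f}")
```

At $3 / GB, 100TB comes to about $300k, which sits inside the $225k - $998k range quoted earlier for a real-world 100TB solution (that range also covers backup, a small cluster, and infrastructure).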
• 100+TB SAN
• 50+ compute nodes
• One rather warm closet.
Backups are legion
• Archive:
– 1TB Firewire disk - $150
– 800GB LTO4 Tape - $90 (plus a sizable machine)
• Disaster Recovery:
– Failover, redundancy, etc.
– Just buy two of everything.
• Incremental Rollback
– Traditional “backups”
– Daily, weekly, differentials
Talk to Finance people about backups.
Legacy Storage Architecture
• Data Ingest (instruments)
• 4PB Tape Archive
• 24TB “hot” disk for analysis
• SGI SMP Machines
• Linux Cluster
• Caching problem
• Workstations
• Web / FTP access
New Realities Allow Simplification
• Data Ingest (instruments)
• 4PB Tape Backup
• 1PB “hot” disk for analysis
• SGI SMP Machines
• Linux Cluster
• Workstations
• Web / FTP access
NEW, COOL STUFF
Amazon Web Services
• EC2 for virtualized computing
– The economics are compelling
• One month of serious experimentation:
– $9.00 USD billed to credit card
– Various money making approaches
• Flexible pricing allows reselling & revenue sharing
• Create an EC2 image and add my own fees on top to cover
development and support costs
– As a developer, I don’t need your credit card
• Amazon handles all transactions & billing
Bioteam and Amazon EC2
• This is the grid:
– Every Bioteam consultant independently deployed an EC2 solution in
2008.
• Inquiry
– Since 2004 - “bioinformatics on a cluster”
– Apple, Microsoft CCS, Linux, etc.
– May 1, 2008: Inquiry on Amazon EC2
– CPU Cost to customer: $10 / node day
• Data service: 500GB, constantly updated:
– $1400 / yr: downloads, maintenance, and storage
– $17 / yr: cost to Bioteam to support a customer
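To make the $10 / node day figure concrete, the sketch below compares it with the $3k server price mentioned earlier in the talk. It is deliberately naive: it ignores power, cooling, administration, storage, and data transfer charges, and the function names are illustrative.

```python
def ec2_cost(nodes: int, days: float, usd_per_node_day: float = 10.0) -> float:
    """Customer CPU cost for running `nodes` nodes for `days` days."""
    return nodes * days * usd_per_node_day


def break_even_node_days(server_price_usd: float = 3000.0,
                         usd_per_node_day: float = 10.0) -> float:
    """Node days of rental that cost the same as buying one server outright."""
    return server_price_usd / usd_per_node_day


if __name__ == "__main__":
    print(f"4 nodes for one week: ${ec2_cost(4, 7):,.0f}")
    print(f"Break-even against a $3k server: {break_even_node_days():.0f} node days")
```

For bursty workloads that need many nodes for a few days at a time, renting stays well under the purchase price; a node that runs constantly passes the break-even point in under a year on this simple model.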
Conclusion
If scientists are wasting a bunch of time on IT,
we’ve got more work to do.
Disturbing Observation
I seem to have presented both a functional
“grid” and an instance of a “semantic web” in
the same talk.
Thank You
• Cambridge Healthtech Institute
– Cindy Crowninshield, Kevin Davies
• Bioteam Customers
– Ed Delong (MIT), Tim Read (NMRC), Yuri Kotliari (NIH),
CSHL
• Bioteam
– Mike Cariaso, Chris Dagdigian, Stan Gloss, Brian
Osborne, Bill Van Etten, Jiesheng Zhang
• Community
– Bioclusters, Sun Grid Engine, Bioinformatics.org
Questions