The document summarizes challenges faced by early adopters of next generation DNA sequencing technology and potential solutions. It discusses issues such as high upfront costs of sequencers, data storage and management difficulties due to the large amount of data generated, networking and data transfer problems, and lack of laboratory information management systems. Potential solutions proposed include using virtualization and cloud computing through Amazon Web Services, developing a wiki-based laboratory information management system, simplifying storage architectures, and automated data capture and management.
"The Cutting Edge Can Hurt You"
1. The Cutting Edge Can Hurt You
Stories from real-world adopters of next generation
sequencing technology
Christopher Dwan
The Bioteam
2. Bioteam
• Consultancy, with a software business
• Vendor Neutral, Technology Agnostic
• “Bridge the gap” between high performance computing and life
sciences
• Founded 2003
Shameless Plugs:
• We’re in Booth 113
• Next Generation Sequencing Workshop (yesterday, plus next year)
• http://bioteam.net
• We’re hiring.
3. cdwan@bioteam.net
Disclaimer
• Most BioTeam clients don’t have 7 figure IT
budgets, Petabyte SANs, dedicated datacenters,
and so on.
• Many of these problems:
– Are quite different for the largest Bio-HPC centers
– Simply don’t matter to the nationally funded
projects.
5. Review: 2007 Predictions
• Multi-core commodity processors
– Workstations are already insanely powerful.
• Virtualization on the workstation
– Why port code, when you can make a new machine?
• Reconfigurable computing goes mainstream
– Partnerships, Collaborations, and re-seller agreements.
• Next generation DNA sequencing
– Data tsunami
6. 2007 Predictions - Reviewed
• Multi-core commodity processors
– I touched a 16 core workstation with 10TB of disk. It “just worked.”
• Virtualization on the workstation: Everywhere! Including data!
– I saw a workstation replace 6 legacy OS’s in one shop.
• Reconfigurable computing goes mainstream: Seemingly not yet
– Talk to the folks on the show floor to get strong contradictions.
• Next generation DNA sequencing. Yup.
– Data tsunami
10. Next Generation Sequencing
• Costs
– Instrument cost: $500k to $1M
– Reagent Cost: $3k - $10k
– ~ 1 TB / machine / day
– 4 or 5 vendors
• Cost imbalance with IT
components
– $7k for an experiment
– $3k for a new server
• Opens high-throughput to much
smaller labs.
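The "~ 1 TB / machine / day" figure above can be turned into a rough fill-time estimate. This is a minimal sketch: the function name, the 100 TB store and the two-instrument lab are hypothetical, and it ignores deletion, compression and downtime.

```python
def days_until_full(capacity_tb: float, machines: int,
                    tb_per_machine_day: float = 1.0) -> float:
    """Days before a store of capacity_tb fills at a steady ingest rate.

    tb_per_machine_day defaults to the ~1 TB / machine / day quoted on the slide.
    """
    daily_tb = machines * tb_per_machine_day
    if daily_tb <= 0:
        raise ValueError("need at least one machine producing data")
    return capacity_tb / daily_tb


if __name__ == "__main__":
    # Hypothetical lab: 2 instruments writing into 100 TB of disk.
    print(f"{days_until_full(100, 2):.0f} days until the disk is full")
```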
11. Next Generation Sequencing
• Naming is annoying:
– “Next gen”, “new gen”, “now gen”
• High Throughput DNA Sequencing
– Helicos, Roche, Illumina, ABI,
Church Lab
– Also other domains: Confocal
microscopes, mass spec, …
“The old days, which were about two
years ago”
12. Fundamentals
• Standard facilities questions:
– Heat / Power / Floor capacity, just as much as ever
• Network:
– Moving 1TB of data from instrument to the next
room, much less to a collaborator
– Still with the sneakernet.
• Security
• Data
• Lab information management
14. Information Management
• Day 0
– Simply catching the data and not dropping it is a
challenge.
• 3 months
– Postdocs carrying data around on firewire disks
– Data management with post-it notes.
• 6 months
– Instrument vendor updates their software.
– Re-analysis?
• 1 year
– New machine from a different vendor.
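One way to approach the Day 0 problem ("catching the data and not dropping it") is to verify every copy with a checksum before trusting it. The sketch below is illustrative only: `copy_verified` and `sha256_of` are hypothetical helper names, and a real instrument pipeline would also need to know when a run has finished writing.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large run files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(source: Path, dest_dir: Path) -> Path:
    """Copy source into dest_dir and fail loudly if the copy does not match."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        target.unlink()
        raise IOError(f"checksum mismatch copying {source}")
    return target
```

Recording the checksum next to the file would also give the "3 months" and "6 months" stages something better than post-it notes to check against.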
15. Networking / Data Motion
• Data motion can interfere with data acquisition
(go on, ask the instrument vendor)
• Software updates can interfere with attempts to
automate data motion
• Move 1TB of data from lab to network closet
– Old building network, instrument offline for hours
– Small $200 4 port gigabit switch (see “security”)
– Excuse for a building-wide upgrade, 1 year horizon
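A back-of-envelope calculation shows why moving 1TB keeps an instrument offline for hours. This is a sketch under stated assumptions: the 100 Mbit/s figure for the "old building network" and the 70% link efficiency are guesses, not numbers from the talk.

```python
def transfer_hours(size_tb: float, link_mbit_per_s: float,
                   efficiency: float = 0.7) -> float:
    """Hours needed to move size_tb (decimal terabytes) over a network link.

    efficiency is an assumed fraction of nominal bandwidth actually achieved.
    """
    size_bits = size_tb * 1e12 * 8
    effective_bits_per_s = link_mbit_per_s * 1e6 * efficiency
    return size_bits / effective_bits_per_s / 3600


if __name__ == "__main__":
    links = [("100 Mbit/s building network", 100), ("1 Gbit/s switch", 1000)]
    for label, mbit in links:
        print(f"{label}: {transfer_hours(1.0, mbit):.1f} hours per TB")
```

Under these assumptions the older network needs more than a day per terabyte, while the gigabit switch needs a few hours.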
16. Security
• Network and IT Security: serious job.
• Labs must propose workable solutions, not
wait for security staff to provide them.
• Common observations:
– Mess with building wide network = security audit
– No solution = system offline.
“Stay out of the news”
17. Humans are the problem
• Infrastructure: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation
• Management: Throughput? Status? Schedule? Security?
• Lab Staff: Availability, quality, “did it work?”
• IT Staff: Throughput? Status? Schedule
• Bioinformaticians: Access to data, metadata
18. Automation is the solution
• Infrastructure: Lab Instruments, Tape Backup, Compute Cluster, One-off Linux / Windows Machine, RAID / Data Store, Dev Workstation
• Management: Machines need to write web pages
• IT Staff: Offer levels of service, negotiate directly with scientists
• Bioinformaticians: Access to data, metadata
• LIMS
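"Machines need to write web pages" could look something like the sketch below: a script that turns run records into a status page that management, lab staff and IT can all read. The record keys (`instrument`, `run_id`, `status`) and the example values are hypothetical, not taken from any real LIMS.

```python
import html
from datetime import datetime, timezone


def render_status_page(runs):
    """Return an HTML status page for a list of run records.

    Each record is a dict with hypothetical keys: instrument, run_id, status.
    """
    rows = "\n".join(
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            html.escape(str(run["instrument"])),
            html.escape(str(run["run_id"])),
            html.escape(str(run["status"])),
        )
        for run in runs
    )
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        "<html><body>\n"
        "<h1>Instrument run status</h1>\n"
        f"<p>Generated {stamp}</p>\n"
        "<table>\n"
        "<tr><th>Instrument</th><th>Run</th><th>Status</th></tr>\n"
        f"{rows}\n"
        "</table>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    example_runs = [
        {"instrument": "sequencer-1", "run_id": "run-001", "status": "copy verified"},
        {"instrument": "sequencer-2", "run_id": "run-002", "status": "acquiring"},
    ]
    print(render_status_page(example_runs))
```

Run from a scheduler and written to a web-served directory, a page like this answers "did it work?" without anyone having to ask.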
25. WikiLIMS
• Still sold as a custom service
– No long term license
– Full source code access
– Highly customizable
• Variety of customers:
– Navy Medical Research Labs
– Cold Spring Harbor
– Emory University
– National Cancer Institute
– …
• Booth 113, we’ll talk your ear off.
This is the semantic web
28. Storage
• Storage: Same in 2008 as in 2006
– Unhappy technology tradeoffs
– ‘Exotic’ vendors offer blazing speed and a few features
– ‘Mainstream’ vendors exclusively focused on enterprise
– What I need: Massive scaling, decent speed & grab bag of
enterprise features
• Real World Solution, early 2008:
– 100TB disk, backup, small cluster, plus all infrastructure
– Price range: $225k - $998k
Cut the problem into pieces.
29. Archive vs. Resequence
• 2007
– Shocking suggestion to delete primary data
– Sanger suggested the MAID (Massive Array of Idle Disks)
– Novartis reported that 97% of their files are never
accessed 3 months after generation.
• 2008
– Instrument vendors deleting large volumes of data inside
the box.
– Less shock, more “data lifecycle”
“New instrument data would be different anyway”
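A "data lifecycle" policy starts with knowing which files nobody touches. The sketch below lists files whose last access time is older than a threshold; the 90-day default mirrors the 3-month figure above. It assumes the filesystem records access times, which is often not true on volumes mounted with `noatime`, and `stale_files` is a hypothetical helper name.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def stale_files(root: Path, max_idle_days: float = 90.0,
                now: Optional[float] = None) -> List[Path]:
    """Return files under root not accessed in the last max_idle_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

The output is a candidate list for archiving or deletion, not a decision; whether to delete primary data is the policy question the slide raises.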
30. Data Storage
• 1 TB
– $200 @ Savers ($0.20 / GB)
• 24 – 48TB
– Commodity solutions, many vendors ($0.70 /GB)
• 100TB+
– Interesting architectural tradeoffs
– Decision should be based on support expectations
– Below $3 / GB really scares me
• 1 – 2PB
– “Large”
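The per-GB figures above scale as follows. This is simple arithmetic with decimal units (1 TB = 1000 GB); the capacities picked for each tier are illustrative, and the $3 / GB line uses the slide's comfort floor rather than a quoted price.

```python
def storage_cost_usd(capacity_tb: float, usd_per_gb: float) -> float:
    """Purchase cost of capacity_tb terabytes at usd_per_gb (1 TB = 1000 GB)."""
    return capacity_tb * 1000 * usd_per_gb


if __name__ == "__main__":
    tiers = [
        ("1 TB consumer disk", 1, 0.20),
        ("48 TB commodity array", 48, 0.70),
        ("100 TB at the $3 / GB floor", 100, 3.00),
    ]
    for label, tb, rate in tiers:
        print(f"{label}: ${storage_cost_usd(tb, rate):,.0f}")
```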
32. Backups are legion
• Archive:
– 1TB Firewire disk - $150
– 800GB LTO4 Tape - $90 (plus a sizable machine)
• Disaster Recovery:
– Failover, redundancy, etc.
– Just buy two of everything.
• Incremental Rollback
– Traditional “backups”
– Daily, weekly, differentials
Talk to Finance people about backups.
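For the archive case, the media prices above can be compared directly. A rough sketch: the 100 TB archive size is hypothetical, and the cost of the tape drive (the "sizable machine") is deliberately left out.

```python
def media_needed(archive_gb: int, media_gb: int) -> int:
    """Whole units of media needed to hold archive_gb (ceiling division)."""
    return -(-archive_gb // media_gb)


def archive_media_cost(archive_gb: int, media_gb: int, usd_per_unit: float) -> float:
    """Media cost only; drives, libraries and labor are not included."""
    return media_needed(archive_gb, media_gb) * usd_per_unit


if __name__ == "__main__":
    archive_gb = 100_000  # hypothetical 100 TB archive
    options = [
        ("1TB Firewire disk", 1000, 150.0),
        ("800GB LTO4 tape", 800, 90.0),
    ]
    for label, size_gb, price in options:
        units = media_needed(archive_gb, size_gb)
        cost = archive_media_cost(archive_gb, size_gb, price)
        print(f"{label}: {units} units, ${cost:,.0f}")
```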
35. Legacy Storage Architecture
• Data Ingest (instruments)
• 4PB Tape Archive
• 24TB “hot” disk for analysis
• SGI SMP Machines
• Linux Cluster
• Caching problem
• Workstations
• Web / FTP access
36. New Realities Allow Simplification
• Data Ingest (instruments)
• 4PB Tape Backup
• 1PB “hot” disk for analysis
• SGI SMP Machines
• Linux Cluster
• Workstations
• Web / FTP access
38. Amazon Web Services
• EC2 for virtualized computing
– The economics are compelling
• One month of serious experimentation:
– $9.00 USD billed to credit card
– Various money making approaches
• Flexible pricing allows reselling & revenue sharing
• Create an EC2 image and add my own fees on top to cover
development and support costs
– As a developer, I don’t need your credit card
• Amazon handles all transactions & billing
39. Bioteam and Amazon EC2
• This is the grid:
– Every Bioteam consultant independently deployed an EC2 solution in
2008.
• Inquiry
– Since 2004 - “bioinformatics on a cluster”
– Apple, Microsoft CCS, Linux, etc.
– May 1, 2008: Inquiry on Amazon EC2
– CPU Cost to customer: $10 / node day
• Data service: 500GB, constantly updated:
– $1400 / yr: downloads, maintenance, and storage
– $17 / yr: cost to Bioteam to support a customer
41. Disturbing Observation
I seem to have presented both a functional
“grid” and an instance of a “semantic web” in
the same talk.
42. Thank You
• Cambridge Healthtech Institute
– Cindy Crowninshield, Kevin Davies
• Bioteam Customers
– Ed Delong (MIT), Tim Read (NMRC), Yuri Kotliari (NIH),
CSHL
• Bioteam
– Mike Cariaso, Chris Dagdigian, Stan Gloss, Brian
Osborne, Bill Van Etten, Jiesheng Zhang
• Community
– Bioclusters, Sun Grid Engine, Bioinformatics.org