BioPig: Hadoop-based Analytic Toolkit
for Next-Generation Sequence Data
Zhong Wang, Ph.D.
Computational Biology Staff Scientist
Cellulase
The deep metagenome approach to discover
cellulases for biofuel research
Large data, large reward
http://www.cazy.org/
Only 1% shared
(>=95% identity)
50% validated activity
Science. 2011 Jan 28;331(6016):463-7.
Sequence data
More data would be even better
Rumen (2009): 17 Gb
Rumen (2010): 250 Gb
Rumen (2012): 1000 Gb
But can analysis keep up with data growth?
Ideal solutions for the terabase problem
1. Scalable to 1 Tb?
2. Performance (within hours)?
High-Mem cluster
Input/Output (IO)
Memory
MP/MPI solution: k-mer counting
[Figure: Raw Data; Data slices 1 to 4; each node/core has data and table slices; Count table]
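The slicing idea on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the MP/MPI code behind the talk: the function names, the CRC32-based owner rule, and the toy reads are all hypothetical, and the "nodes" are simulated in one process.

```python
from collections import Counter
from zlib import crc32


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def owner(kmer, n_nodes):
    """Pick the node whose table slice holds this k-mer.

    Assumption: a stable hash (CRC32) modulo the node count.
    """
    return crc32(kmer.encode()) % n_nodes


def partitioned_count(reads, k, n_nodes):
    # Cut the raw reads into one data slice per (simulated) node.
    data_slices = [reads[i::n_nodes] for i in range(n_nodes)]
    # Each node owns one slice of the count table.
    table_slices = [Counter() for _ in range(n_nodes)]
    # Every node scans its data slice and routes each k-mer to its owner.
    for data in data_slices:
        for read in data:
            for kmer in kmers(read, k):
                table_slices[owner(kmer, n_nodes)][kmer] += 1
    return table_slices


# Toy reads, made up for illustration.
reads = ["attagc", "gttagg", "attagg"]
slices = partitioned_count(reads, k=4, n_nodes=4)
total = sum(slices, Counter())  # the full count table
```

Because a k-mer always hashes to the same owner, no two table slices share a key, so the full count table is just the union of the slices.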
MP/MPI performance
MPI version
412 Gb, 4.5B reads
2.7 hours on 128x24 cores
NERSC Hopper II
MP Threaded version
268 Gb, 3B reads
5 days on 32 cores
High-Mem Cluster
Fast, scalable
Problems:
• Requires experienced software engineers
• Six months of development time
• One node fails, all fail
Hadoop/MapReduce framework
• Google MapReduce
– Data-parallel programming model for processing petabyte-scale data
– Generally has a map and a reduce step
• Apache Hadoop
– Distributed file system (HDFS) and job handling for
scalability and robustness
– Data locality to bring compute to data, avoiding network
transfer bottleneck
Programmability: Hadoop vs Pig
Task: find the top 5 websites that young people visit
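For readers who have not seen this comparison task, the logic is: filter users by age, join with page visits, group by site, count, sort, and keep the top five. A minimal Python sketch of that logic follows. The age range of 18 to 25, the function name, and the sample records are assumptions made for illustration; none of them come from the slide.

```python
from collections import Counter

# Hypothetical sample data: (user, age) and (user, url).
users = [("amy", 22), ("bob", 41), ("cat", 19), ("dan", 25)]
visits = [
    ("amy", "a.example"), ("bob", "a.example"), ("cat", "b.example"),
    ("amy", "b.example"), ("dan", "b.example"), ("cat", "c.example"),
]


def top_sites(users, visits, min_age=18, max_age=25, n=5):
    """Return the n most visited sites among users in the age range."""
    # Filter: keep only users in the assumed "young" age range.
    young = {name for name, age in users if min_age <= age <= max_age}
    # Join + group + count: tally visits made by those users.
    hits = Counter(url for name, url in visits if name in young)
    # Order + limit: most visited first, at most n sites.
    return hits.most_common(n)


top = top_sites(users, visits)
```

On one machine this is a few lines. The point of the slide's comparison is how much code the same steps take once they must run in parallel across a cluster.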
BioPig: design goals
• Flexible
– Every dataset is unique, and data analysts have domain knowledge that is essential to optimize the analysis
– Pluggable modules that analysts can use to build custom analytic pipelines
• High-level
– A domain-specific language enables data analysts to create custom pipelines
– Hides the details of parallelism (too complex for most people)
• Scalable
– Leverages data parallelism to speed up analytics
– Integrates external tools and applications where necessary
– Scales from one to hundreds of compute nodes with minimal effort and linear scalability
• Robust
– Data and computation are replicated across nodes to combat failures
BioPIG
Runs on any hardware supporting Hadoop
• JGI Titanium (commodity Hadoop cluster)
– Up to 20 nodes, each with 16 cores, 32 GB RAM, 1.799 GHz, 1G Ethernet
• NERSC Magellan Cloud Testbed
– Up to 200 nodes, each with 8 cores, 24 GB RAM, 2.67 GHz Nehalem processors, 10 Gbit InfiniBand, GPFS
• Amazon AWS
– Elastic MapReduce with cluster compute nodes (23 GB of memory, 2 x Intel quad-core “Nehalem” architecture, 1690 GB of instance storage, 10G Ethernet)
BioPig Modules
• Blast
• Input/Output (FASTA, FASTQ)
• K-mer Counter
• Assembly
How k-mer count is implemented
Load:
<id1, header, ‘attagc’>
<id2, header, ‘gttagg’>
Mapper:
<id1, ‘atta’>, <id1, ‘ttag’>
<id2, ‘gtta’>, <id2, ‘ttag’>
Shuffle/sort:
<‘atta’, id1>, <‘ttag’, id1, id2>
<‘gtta’, id2>, <‘tagg’, id2>
Reducer:
<‘atta’, 1>, <‘ttag’, 2>
<‘gtta’, 1>, <‘tagg’, 1>
Merge:
<‘atta’, 3>, <‘ttag’, 2>
<‘gtta’, 2>, <‘tagg’, 1>
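The five stages above can be imitated in plain Python to make the data flow concrete. This is a single-process sketch under assumptions, not BioPig or Hadoop code: the function names are hypothetical, k = 4 is inferred from the example tuples, and the second partial result passed to the merge step is invented to stand in for another reducer's output. The sketch emits every window, so it also produces ‘tagc’, which the slide's figure leaves out.

```python
from collections import defaultdict

K = 4  # k-mer length inferred from the slide's example tuples


def load(records):
    """Load: keep (id, sequence) from (id, header, sequence) records."""
    return [(rid, seq) for rid, _header, seq in records]


def mapper(rid, seq, k=K):
    """Map: emit one (id, k-mer) pair per overlapping window."""
    return [(rid, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def shuffle(pairs):
    """Shuffle/sort: group read ids under each k-mer key, sorted by key."""
    groups = defaultdict(list)
    for rid, kmer in pairs:
        groups[kmer].append(rid)
    return dict(sorted(groups.items()))


def reducer(groups):
    """Reduce: replace each id list with its length."""
    return {kmer: len(ids) for kmer, ids in groups.items()}


def merge(partials):
    """Merge: add up the counts produced by separate reducers."""
    total = defaultdict(int)
    for part in partials:
        for kmer, n in part.items():
            total[kmer] += n
    return dict(total)


records = [("id1", "header", "attagc"), ("id2", "header", "gttagg")]
pairs = [p for rid, seq in load(records) for p in mapper(rid, seq)]
counts = reducer(shuffle(pairs))
# Hypothetical output of a second reducer, made up for illustration.
merged = merge([counts, {"atta": 2, "gtta": 1}])
```

In a real Hadoop job the mapper and reducer run on many nodes and the framework performs the shuffle. Here every stage is an ordinary function call so the intermediate tuples can be inspected.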
A 7-liner BioPig script for k-mer counting
Rumen metagenome gene discovery pipeline
Read preprocess (remove artifacts)
pigBlast (blast reads against known cellulases)
pigAssembler (assemble reads into contigs)
pigExtender (extend contigs into full-length enzymes)
Cloud solution to large data
BioPig-Blaster
BioPig-Assembler
BioPig-Extender
BioPIG
BioPig: 61 lines of code
MPI-extender: ~12,000 lines (vs 31 in BioPig)
Flexibility
Programmability
Scalability
Conclusions
Hadoop-based BioPig shows great potential for scalable analysis of very large sequence data. It is robust and easy to use.
Challenges in application
• IO optimization, e.g., reducing data copying
• Some problems do not fit easily into the map/reduce framework, e.g., graph-based algorithms
• Integration into existing frameworks, e.g., Galaxy
Acknowledgement
• Karan Bhatia
• Henrik Nordberg
• Kai Wang
• Rob Egan
• Alex Sczyrba
• Jeremy Brand @JGI/NERSC
• Shane Cannon @NERSC
BioPIG

BioPig for scalable analysis of big sequencing data
