This document summarizes a presentation about using AWS Batch and AWS Step Functions for genomic analysis workflows. It discusses:
- AWS Batch for running containerized jobs on EC2 instances in a managed way. Jobs are run based on definitions, queues, and compute environments.
- AWS Step Functions for visualizing and coordinating the components of distributed applications using state machines and workflows.
- An example architecture using AWS Batch for the job execution layer and AWS Step Functions to orchestrate the workflow, providing flexibility, ease of deployment, and integration with non-Batch applications.
- Potential considerations for data sharing, multitenancy, and volume reuse when using AWS Batch for genomic analysis jobs.
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. I might agree with that, while remembering that everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode (very often with delayed ignition)? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency? And many other interesting topics that you might face when experimenting with ES.
Andrzej Ludwikowski - Event Sourcing - what could possibly go wrong? (Codemotion)
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. True, but everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency?
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Apache Spark Streaming: Architecture and Fault Tolerance (Sachin Aggarwal)
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Distributed Real-Time Stream Processing: Why and How 2.0 (Petr Zapletal)
The demand for stream processing is increasing rapidly these days. Immense amounts of data have to be processed quickly from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs and their intended use-cases. Apart from that, I'm going to speak about Fast Data, the theory of streaming, framework evaluation and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use-case.
Greg Hogan – To Petascale and Beyond: Apache Flink in the Clouds (Flink Forward)
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Introduction to Apache Beam & No Shard Left Behind: APIs for Massive Parallel... (Dan Halperin)
Apache Beam (incubating) is a unified batch and streaming data processing programming model that is efficient and portable. Beam evolved from a decade of system-building at Google, and Beam pipelines run today on both open source (Apache Flink, Apache Spark) and proprietary (Google Cloud Dataflow) runners. This talk will focus on I/O and connectors in Apache Beam, specifically its APIs for efficient, parallel, adaptive I/O. Google will discuss how these APIs enable a Beam data processing pipeline runner to dynamically rebalance work at runtime, to work around stragglers, and to automatically scale up and down cluster size as a job’s workload changes. Together these APIs and techniques enable Apache Beam runners to efficiently use computing resources without compromising on performance or correctness. Practical examples and a demonstration of Beam will be included.
Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to create better scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
Dependency Injection in Apache Spark Applications (Databricks)
Dependency Injection is a programming paradigm that allows for cleaner, reusable, and more easily extensible code. Though Dependency injection has existed for a while now, its use for wiring dependencies in Apache Spark applications is relatively new. In this talk, we present our adventures writing testable Spark applications with dependency injection and explain why it is different than wiring dependencies for web applications due to Spark’s unique programming model.
Flink Forward SF 2017: David Hardwick, Sean Hester & David Brelloch - Dynami... (Flink Forward)
We have built a Flink-based system to allow our business users to configure processing rules on a Kafka stream dynamically. Additionally it allows the state to be built dynamically using replay of targeted messages from a long term storage system. This allows for new rules to deliver results based on prior data or to re-run existing rules that had breaking changes or a defect. Why we submitted this talk: We developed a unique solution that allows us to handle on the fly changes of business rules for stateful stream processing. This challenge required us to solve several problems -- data coming in from separate topics synchronized on a tracer-bullet, rebuilding state from events that are no longer on Kafka, and processing rule changes without interrupting the stream.
Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Ka... (Reactivesummit)
Akka Streams and its amazing handling of stream back-pressure should be no surprise to anyone. But it takes a couple of use cases to really see it in action - especially use cases where the amount of work increases as you process, making you really value the back-pressure.
This talk takes a sample web crawler use case where each processing pass expands to a larger and larger workload to process, and discusses how we use the buffering capabilities in Kafka and the back-pressure with asynchronous processing in Akka Streams to handle such bursts.
In addition, we will also provide some constructive “rants” about the architectural components, the maturity, or immaturity you’ll expect, and tidbits and open source goodies like memory-mapped stream buffers that can be helpful in other Akka Streams and/or Kafka use cases.
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache... (Lightbend)
Things were easier when all our data used to be offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is an urgent need for tools and applications that can deliver real-time (or near real-time) streaming ETL capabilities.
In this session by Konrad Malawski, author, speaker and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
Meet Up - Spark Stream Processing + Kafka (Knoldus Inc.)
Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.
It treats data not as static tables or files, but as a continuous infinite stream of data integrated from both live and historical sources.
In these slides we'll be looking into Spark Stream Processing with Kafka.
Big Data Day LA 2016 / Big Data Track - Portable Stream and Batch Processing w... (Data Con LA)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Event sourcing - what could possibly go wrong? Devoxx PL 2021 (Andrzej Ludwikowski)
Yet another presentation about Event Sourcing? Yes and no. Event Sourcing is a really great concept. Some could say it’s a Holy Grail of the software architecture. I might agree with that, while remembering that everything comes with a price. This session is a summary of my experience with ES gathered while working on 3 different commercial products. Instead of theoretical aspects, I will focus on possible challenges with ES implementation. What could explode (very often with delayed ignition)? How and where to store events effectively? What are possible schema evolution solutions? How to achieve the highest level of scalability and live with eventual consistency? And many other interesting topics that you might face when experimenting with ES.
Understanding Akka Streams, Back Pressure, and Asynchronous Architectures (Lightbend)
The term 'streams' has been getting pretty overloaded recently–it's hard to know where to best use different technologies with streams in the name. In this talk by noted hAkker Konrad Malawski, we'll disambiguate what streams are and what they aren't, taking a deeper look into Akka Streams (the implementation) and Reactive Streams (the standard).
You'll be introduced to a number of real life scenarios where applying back-pressure helps to keep your systems fast and healthy at the same time. While the focus is mainly on the Akka Streams implementation, the general principles apply to any kind of asynchronous, message-driven architectures.
Spark real world use cases and optimizations (Gal Marder)
Using Spark for Big Data has become the standard in the industry. The internet is full of "hello world" examples, but when your Spark job meets production, all hell breaks loose. We will cover real-world use cases, how they were designed, why they didn't work, and how we made them run fast.
S3, Cassandra or outer space? Dumping time series data using Spark (Demi Ben-Ari)
A vast volume of our processed data is time series data, and once you start working with distributed systems you start tackling many scale and performance problems. Many questions arise:
How to handle missing data?
Should my system handle both serving and backend processing, or should they be separated?
Which one of the solutions will be cheaper? Best Performance for Money?
In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, show some of the problems we’ve handled, review the multiple data persistency layers like: S3, MongoDB, Apache Cassandra, MySQL.
And I’ll try my best NOT to answer the question “Which one of them is the Best?”
Sharing our Pain and Lessons learned is promised!
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience building various systems, from near-real-time applications to Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
I’m a software development groupie, interested in tackling cutting-edge technologies.
Kenneth Knowles - Apache Beam - A Unified Model for Batch and Streaming Data... (Flink Forward)
http://flink-forward.org/kb_sessions/apache-beam-a-unified-model-for-batch-and-streaming-data-processing/
Unbounded, unordered, global-scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam (incubating) defines a new data processing programming model that evolved from more than a decade of experience within Google, including MapReduce, FlumeJava, MillWheel, and Cloud Dataflow. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, et al.) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, touch on its evolution, describe main concepts in the programming model, and compare with similar systems. We’ll go from a simple scenario to a relatively complex data processing pipeline, and finally demonstrate execution of that pipeline on multiple runtimes.
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t... (Spark Summit)
Clustering is often an essential first step in data mining, intended to reduce redundancy or define data categories. Hierarchical clustering, a widely used clustering technique, can offer a richer representation by suggesting potential group structures. However, parallelization of such an algorithm is challenging, as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a parallel implementation of single-linkage hierarchical clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for parallelizing the single-linkage clustering algorithm due to its natural expression of iterative processes. Our algorithm can be deployed easily in Amazon's cloud environment, and a thorough performance evaluation on Amazon EC2 verifies that the scalability of our algorithm is sustained as the datasets scale up.
AWS Batch is a service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads at scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads, allowing you to focus on analyzing results and solving problems.
In this session, led by the AWS Batch service team, you will learn core concepts behind AWS Batch and details of how the service functions. We will cover multiple patterns used by customers to leverage storage and GPUs as part of their batch workloads. We will also cover how to integrate AWS Batch with other services such as AWS Step Functions for decision based workloads or Amazon CloudWatch Events to trigger batch jobs based on events or schedules.
AWS re:Invent 2016: IoT Visualizations and Analytics (IOT306) (Amazon Web Services)
In this workshop, we focus on visualizations of IoT data using ELK (Amazon Elasticsearch Service, Logstash, and Kibana) or Amazon Kinesis. We will dive into how these visualizations can give you new capabilities and understanding when interacting with your device data, using the context they provide on the world around them.
Announcing AWS Batch - Run Batch Jobs At Scale - December 2016 Monthly Webina... (Amazon Web Services)
AWS Batch enables developers, scientists, and engineers to easily and efficiently run hundreds of thousands of batch computing jobs on AWS. AWS Batch dynamically provisions the optimal quantity and type of compute resources (e.g., CPU or memory optimized instances) based on the volume and specific resource requirements of the batch jobs submitted. With AWS Batch, there is no need to install and manage batch computing software or server clusters that you use to run your jobs, allowing you to focus on analyzing results and solving problems.
Learning Objectives:
• Learn about the capabilities and features of AWS Batch
• Learn about the benefits of AWS Batch
• Learn about the different use cases
• Learn how to get started using AWS Batch
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, you will learn core concepts behind AWS Batch and details of how the service functions.
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing on Amaz... (Amazon Web Services)
AWS Batch is a fully-managed service that enables developers, scientists, and engineers to easily and efficiently run batch computing workloads of any scale on AWS. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there is no need to install or manage batch computing software, allowing you to focus on analyzing results and solving problems. AWS Batch plans, schedules, and executes your batch computing workloads across the full range of AWS compute services and features, such as Amazon EC2, Spot Instances, and AWS Lambda. AWS Batch reduces operational complexities, saving time and reducing costs. In this session, Principal Product Managers Jamie Kinney and Dougal Ballantyne describe the core concepts behind AWS Batch and details of how the service functions. The presentation concludes with relevant use cases and sample code.
Microservices, Continuous Delivery, and Elasticsearch at Capital One (Noriaki Tatsumi)
This presentation focuses on the implementation of Continuous Delivery and Microservices principles in Capital One's cybersecurity data platform, which ingests ~6 TB of data every day and in which Elasticsearch is a core component.
The Boss: A Petascale Database for Large-Scale Neuroscience, Powered by Serve... (Amazon Web Services)
The IARPA Machine Intelligence from Cortical Networks (MICrONS) program is a research endeavor created to improve neurally-plausible machine-learning algorithms by understanding data representations and learning rules used by the brain through structurally and functionally interrogating a cubic millimeter of mammalian neocortex. This effort requires efficiently storing, visualizing, and processing petabytes of neuroimaging data. The Johns Hopkins University Applied Physics Laboratory (APL) has developed an open-source, highly available service to manage these data, called the Boss. The Boss uses AWS to provide a cloud-native spatial database with an innovative storage hierarchy and auto-scaling capability to balance cost and performance. This system extensively uses serverless components to meet both scalability and cost requirements. In this session, we provide an overview of the Boss, and we focus on how the APL used Amazon DynamoDB, AWS Lambda, and AWS Step Functions for several high-throughput components of the system. We discuss both the challenges and successes with serverless technologies.
Amazon Elastic Kubernetes Service (EKS) is fully compatible with applications running in standard Kubernetes environments. This hands-on session covers creating a Kubernetes cluster on AWS; deploying, managing, and scaling containerized applications; logging and monitoring; and implementing the recently released capability of assigning AWS IAM permissions to Pods on Amazon EKS.
Batch Processing with Containers on AWS - June 2017 AWS Online Tech Talks (Amazon Web Services)
Learning Objectives:
- Learn about the options for running batch workloads on AWS
- Learn how to architect a containerized batch processing service on Amazon ECS
- Learn best practices for optimizing and scaling complex batch workload requirements
Batch processing is useful when you need to periodically analyze large amounts of data, but configuring and scaling a cluster of virtual machines to process complex batch jobs can be difficult. Containers provide a great solution for running batch jobs by providing easily managed, scalable, and portable code environments.
In this tech talk, we’ll show you how to use containers on AWS for batch processing jobs that can scale quickly and cost-effectively. We’ll discuss AWS Batch, our fully managed batch-processing service, and show you how to architect your own batch processing service using the Amazon EC2 Container Service. We’ll also discuss best practices for ensuring efficient and opportunistic scheduling, fine-grained monitoring, compute resource auto-scaling, and security for your batch jobs.
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service (Amazon Web Services)
Everything generates logs. Applications, infrastructure, security ... everything. Keeping track of the flood of log data is a big challenge, yet critical to your ability to understand your systems and troubleshoot (or prevent) issues. In this session, we will use both Amazon CloudWatch and application logs to show you how to build an end-to-end log analytics solution. First, we cover how to configure an Amazon Elasticsearch Service domain and ingest data into it using Amazon Kinesis Firehose, demonstrating how easy it is to transform data with Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data and configure a secure analytics environment. We demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we dive deep into the Elasticsearch query DSL and review approaches for generating custom, ad-hoc reports.
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work... (Amazon Web Services)
Learning Objectives:
- How to simply scale out your batch workflows on AWS
- How to think about container/job management within managed, high-throughput workflows
- How to build a scalable orchestration framework within AWS Step Functions
2. Big Data Genomics NYC
• We meet quarterly
• We are passionate about Big-Data technologies, the human genome and personalized medicine
• We have 775 genomies (members) #GenomicsNYC
3. Past MeetUps include:
• Dipping into Guacamole – a Spark-powered Somatic Variant Caller
• Next Generation Tools and Strategies for Genomic Analysis
• Leverage ADAM and Spark for Genomic Analysis
4. Building Genomics Pipelines in the Cloud: Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Workflows
with Angel Pizarro
5. What we will cover
• Some context
• Service Overview – AWS Batch
• Service Overview – AWS Step Functions
• Architecture Deep Dive
6. We see similar data analysis patterns
Life Sciences
Financial Services
9. Introducing AWS Batch
• Fully Managed: No software to install or servers to manage. AWS Batch provisions and scales your infrastructure.
• Integrated with AWS: AWS Batch jobs can easily and securely interact with services such as Amazon S3, DynamoDB, and Rekognition.
• Cost-Efficient: AWS Batch launches compute resources tailored to your jobs and can provision Amazon EC2 and EC2 Spot instances.
11. Example AWS Batch Job Architecture
[Architecture diagram: S3 events trigger a Lambda function that submits an AWS Batch job. The job definition supplies the application image, the IAM role for the Batch job, resource requirements, and other parameters. Submitted jobs wait in a queue of runnable jobs until the AWS Batch scheduler places them into compute environments for execution; input files are read from, and output written to, Amazon S3.]
12. Job Definitions
Similar to ECS Task Definitions, AWS Batch Job Definitions specify how jobs are to be run. While each job must reference a job definition, many parameters can be overridden.
Some of the attributes specified in a job definition:
• IAM role associated with the job
• vCPU and memory requirements
• Mount points
• Container properties
• Environment variables
$ aws batch register-job-definition --job-definition-name gatk --container-properties ...
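For concreteness, here is a hedged sketch of what the elided --container-properties JSON might contain; the ECR image URI, role ARN, resource sizes, and environment variable are illustrative assumptions, not values from the talk:

$ aws batch register-job-definition \
    --job-definition-name gatk \
    --type container \
    --container-properties '{
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/gatk:latest",
        "vcpus": 8,
        "memory": 32000,
        "command": ["HaplotypeCaller", "Ref::sampleArgs"],
        "jobRoleArn": "arn:aws:iam::123456789012:role/BatchJobRole",
        "environment": [{"name": "REF_GENOME", "value": "s3://example-bucket/hg38.fa"}],
        "mountPoints": [{"sourceVolume": "scratch", "containerPath": "/scratch"}],
        "volumes": [{"name": "scratch", "host": {"sourcePath": "/scratch"}}]
    }'

The Ref:: prefix marks a placeholder that can be substituted with a parameter value at job submission time.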
13. Jobs
Jobs are the unit of work executed by AWS Batch as containerized applications running on Amazon EC2.
As your job starts, AWS Batch creates a container using the command and parameters specified in your job definition. You can optionally override properties such as CPU and memory requirements.
$ aws batch submit-job --job-name variant-calling --job-definition gatk:12 --job-queue genomics
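As a hedged illustration of those per-job overrides (the values and the SAMPLE_ID variable are assumptions made for this sketch):

$ aws batch submit-job --job-name variant-calling \
    --job-definition gatk:12 --job-queue genomics \
    --container-overrides '{
        "vcpus": 16,
        "memory": 64000,
        "environment": [{"name": "SAMPLE_ID", "value": "NA12878"}]
    }'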
14. Job Queues
Jobs are submitted to a Job Queue, where they reside until they are able to be scheduled onto a compute resource. Information related to completed jobs persists in the queue for 24 hours.
$ aws batch create-job-queue --job-queue-name genomics --priority 500 --compute-environment-order ...
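A hedged sketch of the elided compute-environment ordering, mapping the queue to an On-Demand environment first and a Spot environment second (both environment names are hypothetical):

$ aws batch create-job-queue --job-queue-name genomics \
    --priority 500 --state ENABLED \
    --compute-environment-order order=1,computeEnvironment=genomics-ondemand \
                                order=2,computeEnvironment=genomics-spot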
15. Compute Environments
Job queues are mapped to one or more Compute Environments, which contain the EC2 instances used to run your AWS Batch jobs.
Managed compute environments enable you to describe your business requirements (instance types, min/max/desired vCPUs, and EC2 Spot bid as a % of On-Demand). AWS Batch will then launch an elastic quantity of instances from a range of instance types based on your jobs’ requirements.
You can select specific instance types (e.g. c4.8xlarge), instance families (e.g. C4, M4, R4), or simply choose “optimal” and AWS Batch will launch appropriately sized instances from our more-modern instance families.
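A hedged sketch of creating such a managed Spot compute environment; the subnet IDs, security group, IAM roles, and 40% bid are placeholder assumptions:

$ aws batch create-compute-environment \
    --compute-environment-name genomics-spot \
    --type MANAGED \
    --state ENABLED \
    --compute-resources '{
        "type": "SPOT",
        "bidPercentage": 40,
        "minvCpus": 0,
        "maxvCpus": 256,
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-11111111", "subnet-22222222"],
        "securityGroupIds": ["sg-33333333"],
        "instanceRole": "ecsInstanceRole",
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/AmazonEC2SpotFleetRole"
    }' \
    --service-role arn:aws:iam::123456789012:role/AWSBatchServiceRole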
16. AWS Batch Concepts
The Scheduler evaluates when, where, and how to run jobs that have been submitted to a job queue.
Jobs run in approximately the order in which they are submitted, as long as all dependencies on other jobs have been met.
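Dependencies are declared at submission time, and the scheduler holds a job until the jobs it depends on have succeeded. A hedged sketch (the parent job ID is a hypothetical placeholder):

$ aws batch submit-job --job-name joint-genotyping \
    --job-definition gatk:12 --job-queue genomics \
    --depends-on jobId=11111111-2222-3333-4444-555555555555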
19. AWS Step Functions…
…makes it easy to coordinate the components of distributed applications using visual workflows.
20. Application Lifecycle in AWS Step Functions
Define in JSON → Visualize in the Console → Monitor Executions
21. Seven State Types
• Task: A single unit of work
• Choice: Adds branching logic
• Parallel: Fork and join the data across tasks
• Wait: Delay for a specified time
• Fail: Stops an execution and marks it as a failure
• Succeed: Stops an execution successfully
• Pass: Passes its input to its output
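Tying this to the "define in JSON" step on slide 20, here is a hedged, minimal Amazon States Language sketch that drives an AWS Batch job with Task, Wait, Choice, Succeed, and Fail states. The Lambda function ARNs and the $.status field are assumptions made for illustration:

{
  "Comment": "Hypothetical sketch: submit a Batch job, then poll until it finishes",
  "StartAt": "SubmitJob",
  "States": {
    "SubmitJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:SubmitBatchJob",
      "Next": "WaitForJob"
    },
    "WaitForJob": { "Type": "Wait", "Seconds": 30, "Next": "CheckJobStatus" },
    "CheckJobStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:CheckBatchJobStatus",
      "Next": "JobComplete?"
    },
    "JobComplete?": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "Done" },
        { "Variable": "$.status", "StringEquals": "FAILED", "Next": "JobFailed" }
      ],
      "Default": "WaitForJob"
    },
    "Done": { "Type": "Succeed" },
    "JobFailed": { "Type": "Fail", "Error": "BatchJobFailed", "Cause": "The AWS Batch job did not succeed" }
  }
}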
22. Build Visual Workflows Using State Types
[Slide graphic: an example AWS Step Functions visual workflow combining Task, Choice, Parallel, and Fail states to classify images of mountains, people, and snow.]
27. Considerations for Batch Layer: Data Sharing
Consideration: Jobs are managed at the container, not the instance, level. You cannot guarantee that consecutive containers in a workflow will run on the same instance.
Solution: Stage all data in Amazon S3, and read and write everything from there. This is also important for traceability, logging, etc.
28. Considerations for Batch Layer: Multitenancy
Consideration: Multiple containers may run batch processes on the same instance in the same base working directory.
Solution: Within the scratch directory, each batch process creates a subfolder with a unique ID. All scratch data is written to this subdirectory.
29. Considerations for Batch Layer: Volume Reuse
Consideration: Scratch data should live only as long as the job using it, in order to optimize instance and Amazon EBS storage costs.
Solution: Within the scratch directory, each batch process creates a subfolder with a unique ID. All scratch data is written to this subdirectory, which is deleted at the end of the job.
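The data sharing, multitenancy, and volume reuse patterns from slides 27-29 can be combined in the container's entrypoint script. A minimal sketch, assuming a /scratch mount point and input/output locations passed in as environment variables; the bucket variables and the run_analysis step are hypothetical:

#!/bin/bash
# Isolate each job's scratch data in a uniquely named subdirectory, and delete it on exit.
set -euo pipefail
SCRATCH_ROOT=/scratch
# AWS Batch injects AWS_BATCH_JOB_ID into the container environment;
# fall back to a random UUID when running outside of Batch.
JOB_ID="${AWS_BATCH_JOB_ID:-$(uuidgen)}"
JOB_SCRATCH="${SCRATCH_ROOT}/${JOB_ID}"
mkdir -p "${JOB_SCRATCH}/output"
trap 'rm -rf "${JOB_SCRATCH}"' EXIT  # volume reuse: remove scratch subdirectory at end of job

# Data sharing: stage inputs from S3 and write all results back to S3.
aws s3 cp "s3://${INPUT_BUCKET}/${INPUT_KEY}" "${JOB_SCRATCH}/"
run_analysis "${JOB_SCRATCH}"        # hypothetical placeholder for the actual pipeline step
aws s3 cp "${JOB_SCRATCH}/output/" "s3://${OUTPUT_BUCKET}/${JOB_ID}/" --recursive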
32. A Flexible Workflow Deployment Model
• Decouple batch engine and workflow orchestration
• Workflow creation now done as JSON
• Easier to deploy
• Easier to automate
• Easier to test
• Can integrate non-Batch applications as well
38. Control Plane for Other Infrastructure
Human Microbiome Project Public Data Set
Targeted 16S sequencing of 300 healthy adults at 18 specific sites (oral cavity, airways, urogenital tract, skin, and gut)
https://s3-us-west-2.amazonaws.com/human-microbiome-project
40. IARPA MICrONS
Intelligence Advanced Research Projects Activity: Machine Intelligence from Cortical Networks
• MICrONS seeks to revolutionize machine learning by understanding the representations, transformations, and learning rules employed by the brain
• The program is expressly designed as a dialogue between computer science, data science, and neuroscience
[Slide diagram: a loop connecting a neurally-plausible machine learning framework with behavior experiments, functional imaging, structural imaging, and data analysis.]
41. Why Is This Different?
• Current neural networks are “neurally inspired” but not considered biofidelic or neurally plausible
• Previous projects to build algorithms based on the brain exist, but have been focused on macro and micro information, or lower-fidelity statistics
• Little is known about the brain at the mesoscale
• A “cortical column” is theorized to be of order ~1mm3
• In this program, structure and function co-registration provides a uniquely rich picture of computing circuits
• Researchers are directly measuring mesoscale activity and circuits
[Slide diagram: scales of brain measurement, from the microscale (1-100s of neurons) through the mesoscale (1k – 1M neurons) to the macroscale (brain regions), with the Human Connectome Project noted and the mesoscale marked “?”.]
42. Why Is This Different: Functional Imaging
Video Credit: Tianyu Wang (Xu Lab, Cornell University) & Jacob Reimer (Tolias Lab, Baylor College of Medicine)
43. Why Is This Different: Structural Imaging
• Peta-scale structural imaging
• A 1mm3 region is large enough to contain meaningful circuits never before observed
  • ~50k-100k neurons
  • ~100,000,000 synapses
  • ~4x4x30nm voxels
  • ~2 – 2.5 PB
• Three different techniques
  • Scanning Electron Microscopy (SEM)
  • Transmission Electron Microscopy (TEM)
  • Fluorescent in situ sequencing (FISSEQ) barcoding
Video Credit: Kasthuri, et al. - Cell 2015
Bobby Kasthuri, Daniel Berger, Jeff Lichtman
44. Why Is This Different: Co-registered Data
• Co-registration links structure to function
• For the first time, researchers will measure in the same sample at scale:
  • Stimulus (”input”)
  • Behavior (“output”)
  • Connectome (“circuit diagram”)
  • Neuronal Activity (“voltages”)
Calcium Imaging Data – Tolias Lab, Baylor College of Medicine
X-ray Tomography and co-registration – Allen Institute for Brain Science
45. Why Can We Succeed Now?
• New imaging techniques and engineering capabilities can interrogate mesoscale circuits
• Increased computing power has enabled automated analysis with machine learning
• Reduced storage costs have made collection and analysis of many petabytes of data possible
• Use of the cloud has provided the ability to scale when needed and facilitates sharing and collaboration
We can directly observe and reconstruct mesoscale neuronal circuits in vivo for the first time.
https://www.karlrupp.net
46. The Boss: Block and Object Storage Service
The Boss is a multi-dimensional spatial database, provided as a managed service on AWS.
The Boss stores annotation data co-registered to image data.
• An annotation is a unique 64-bit identifier applied to a set of voxels, representing its spatial distribution
[Slide graphic: example annotation IDs (1267, 345345, 534534799) applied to voxel regions.]