Source: Jamie Kinney, Principal Product Manager, AWS Batch -- https://aws.amazon.com/blogs/aws/aws-batch-run-batch-computing-jobs-on-aws/
• High-throughput: process as many concurrent genomic workflows as needed (>1,000/day).
• Flexible: you define your containers, dependencies, and resource requirements; Batch takes care of the rest.
• Elastic and scalable: treat each workflow as burst compute, and pay only for what you need, when you need it.
• Cost-optimized: runs on EC2 Spot fleets to significantly reduce the cost of genomic analysis.
New AWS Services for Bioinformatics
• Useful for scaling bioinformatics pipelines
• Announced at re:Invent (Nov 2016)
• Step Functions, Athena, Batch, Glue, and QuickSight (resources at the end)
About AWS Athena
Serverless SQL queries on S3 data
AWS Athena Information
• Add table (structure) to database via DDL from input file(s)
• Write and execute SQL query
• Optionally save query
• Optionally review query history
• View results
• Optionally download result set to .csv
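To make the flow concrete, here is a minimal sketch using the AWS CLI (the same statements can be run in the Athena console editor); the table, column, and bucket names are all assumptions:

$ aws athena start-query-execution \
      --query-string "CREATE EXTERNAL TABLE variants (chrom string, pos int, id string) \
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' \
          LOCATION 's3://my-genomics-bucket/variants-tsv/'" \
      --result-configuration OutputLocation=s3://my-genomics-bucket/athena-results/
$ aws athena start-query-execution \
      --query-string "SELECT chrom, COUNT(*) AS n FROM variants GROUP BY chrom" \
      --result-configuration OutputLocation=s3://my-genomics-bucket/athena-results/

Query results land in the OutputLocation bucket as .csv, which matches the download step above.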
About AWS Batch
Fully managed batch processing at scale
What is batch computing?
Run jobs asynchronously and automatically across one or more computers.
Jobs may have dependencies, making the sequencing and scheduling of multiple jobs complex and challenging.
What is AWS Batch?
• Fully managed: no software to install or servers to manage.
• Integrated with AWS: Batch jobs can easily and securely interact with services such as Amazon S3, DynamoDB, and Rekognition.
• Cost-optimized provisioning: automatically provisions compute resources tailored to the needs of the job using EC2 and EC2 Spot.
AWS Batch Concepts
1. Job Definitions
2. Job Queues
3. Compute Environments
4. Job States
Short Video -- here
Jobs are the unit of work executed by AWS Batch as containerized
applications running on Amazon EC2.
Containerized jobs can reference a container image, command, and parameters, or users can simply provide a .zip containing their application and AWS Batch will run it on a default Amazon Linux container.
$ aws batch submit-job --job-name variant-calling \
      --job-definition gatk --job-queue genomics
Massively parallel jobs
• Now - users can submit a large number of independent “simple jobs.”
• Soon - AWS will add support for “array jobs” that run many copies of an application against an array of elements.
Array jobs are an efficient way to run:
• Parametric sweeps
• Monte Carlo simulations
• Processing a large collection of objects
NOTE: These use cases are possible today: simply submit more jobs, as sketched below.
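A minimal sketch of the submit-more-jobs approach for a parametric sweep, reusing the gatk job definition and genomics queue from this deck; the bucket path is a placeholder:

# Submit one independent job per input object in S3.
# Passing the object key as a job parameter assumes the job
# definition references it via Ref::inputFile.
$ for sample in $(aws s3 ls s3://my-genomics-bucket/samples/ | awk '{print $4}'); do
      aws batch submit-job --job-name "variant-calling-${sample%.*}" \
          --job-definition gatk --job-queue genomics \
          --parameters inputFile="${sample}"
  done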
Workflows, Pipelines, and Job Dependencies
Jobs can express a dependency on the successful completion of other jobs, or on specific elements of an array job.
Use your preferred workflow engine and language to submit jobs. Flow-based systems simply submit jobs serially, while DAG-based systems submit many jobs at once, identifying inter-job dependencies.
$ aws batch submit-job --depends-on jobId=606b3ad1-aa31-48d8-92ec-f154bfc8215f ...
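A minimal sketch of wiring a two-step dependency by hand, again reusing the gatk job definition and genomics queue; the job names are illustrative:

# Capture the first job's ID, then submit a job that depends on it.
$ ALIGN_ID=$(aws batch submit-job --job-name align --job-definition gatk \
      --job-queue genomics --query jobId --output text)
$ aws batch submit-job --job-name call-variants --job-definition gatk \
      --job-queue genomics --depends-on jobId=${ALIGN_ID}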
Batch Job Definitions specify how jobs are to be run. While each job
must reference a job definition, many parameters can be overridden.
Some of the attributes specified in a job definition:
• IAM role associated with the job
• vCPU and memory requirements
• Mount points
• Container properties
• Environment variables
$ aws batch register-job-definition --job-definition-name gatk
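The one-liner above omits the required container details; a fuller (still hypothetical) registration might look like the following, where the image name, command, and vCPU/memory values are assumptions:

# Register a container job definition with resource requirements.
$ aws batch register-job-definition --job-definition-name gatk \
      --type container \
      --container-properties '{"image": "myregistry/gatk:latest", "vcpus": 8, "memory": 32000, "command": ["run-pipeline.sh"]}'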
Jobs are submitted to a Job Queue, where they reside until they are
able to be scheduled to a compute resource. Information related to
completed jobs persists in the queue for 24 hours.
$ aws batch create-job-queue --job-queue-name genomics \
      --priority 500 --compute-environment-order ...
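Spelled out, the elided option maps the queue to one or more compute environments in priority order; genomics-ce is a placeholder name for a CE created beforehand:

$ aws batch create-job-queue --job-queue-name genomics --priority 500 \
      --compute-environment-order order=1,computeEnvironment=genomics-ce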
Compute Environments (CEs) are mapped from job queues and run containerized batch jobs.
• Managed CEs - you describe your requirements (instance types, min/max/desired vCPUs, and EC2 Spot bid as a % of On-Demand) and AWS launches and scales resources for you. Pick specific instance types, instance families, or simply choose “optimal”.
• Unmanaged CEs - you launch and manage your own resources. Your instances need to include the ECS agent and run supported versions of Linux and Docker. AWS Batch will then create an Amazon ECS cluster which can accept the instances you launch. Jobs can be scheduled to your Compute Environment as soon as your instances are healthy and register with the ECS cluster.
$ aws batch create-compute-environment \
      --compute-environment-name unmanagedce --type UNMANAGED ...
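For comparison, a managed, Spot-backed CE might be created as follows; the CE name, subnet, security group, roles, and the 40% bid (the "% of On-Demand" mentioned above) are all placeholders:

# Create a managed CE that bids on Spot at 40% of On-Demand.
$ aws batch create-compute-environment --compute-environment-name genomics-ce \
      --type MANAGED --service-role AWSBatchServiceRole \
      --compute-resources type=SPOT,bidPercentage=40,minvCpus=0,maxvCpus=256,instanceTypes=optimal,subnets=subnet-xxxxxxxx,securityGroupIds=sg-xxxxxxxx,instanceRole=ecsInstanceRole,spotIamFleetRole=SpotFleetRole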
AWS Batch Scheduler
The Scheduler evaluates when, where, and
how to run jobs that have been submitted to
a job queue.
Jobs run in approximately the order in which
they are submitted as long as all
dependencies on other jobs have been met.
Queued Job States
• SUBMITTED: Accepted into the queue, but not yet evaluated for execution
• PENDING: Your job has dependencies on other jobs which have not yet completed
• RUNNABLE: Your job has been evaluated by the scheduler and is ready to run
• STARTING: Your job is in the process of being scheduled to a compute resource
• RUNNING: Your job is currently running
• SUCCEEDED: Your job has finished with exit code 0
• FAILED: Your job finished with a non-zero exit code, or was cancelled or terminated
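To see where a job sits in this lifecycle from the CLI (reusing the genomics queue and the job ID shown earlier as illustrations):

$ aws batch list-jobs --job-queue genomics --job-status RUNNABLE
$ aws batch describe-jobs --jobs 606b3ad1-aa31-48d8-92ec-f154bfc8215f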
AWS Batch Actions
• CancelJob: Marks jobs that are not yet STARTING as FAILED.
• TerminateJob: Cancels jobs that are currently waiting in the
queue. Stops jobs that are in a STARTING or RUNNING state
and transitions them to FAILED.
NOTE: Requires a “reason” which is viewable via DescribeJobs
$ aws batch cancel-job --job-id 606b3ad1-aa31-48d8-92ec-f154bfc8215f --reason "Submitted to wrong queue"
AWS Batch Pricing and Functionality
There is no charge for AWS Batch; you only pay for the
underlying resources that you consume!
NOTE: Support for Array Jobs, retries, and jobs executed as AWS Lambda
functions coming soon!
Use the Right Tool for the Job
Not all batch workloads are the same…
• ETL and Big Data processing/analytics? Consider EMR, Data Pipeline, Redshift, and related services.
• Lots of small cron jobs? AWS Batch is a great way to execute these jobs, but you will likely want a workflow or job-scheduling system to orchestrate job submission.
• Efficiently run lots of big and small compute jobs on heterogeneous compute resources? Use AWS Batch.
About AWS Glue
Serverless, managed, scalable ETL
1. Build a data catalog
1. Discover and use your datasets via a Hive-compatible metastore
2. Store versions, connection and credential info
3. Use crawlers to auto-generate schema from S3 data & partitions
2. Generate and edit transforms using PySpark
3. Schedule and run your jobs
1. Trigger on a schedule, on an event, or from a Lambda function
NOTE: Glue is announced, but there is no beta as of yet… video from re:Invent -- here
About AWS QuickSight
Quick and easy data dashboards
Resources for new AWS Services
• Athena (SQL query on S3) -- here
• Batch (Optimized, chained EC2 batches) -- here
• Glue (Scaled ETL) -- here
• Step Functions (Lambda workflows) -- here
• QuickSight (Data Dashboards) -- here
• Full list of AWS services announced at re:Invent 2016 -- here