Intro to AWS Batch & How AQR Capital leverages AWS to Identify New Investment Signals (CMP372) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Batch: Easy & Efficient Batch
Computing on Amazon Web Services
Michael Raposa
Head of Platform Engineering
AQR Capital
CMP372
Rey Wang
Senior Product Manager
AWS Batch and HPC

Agenda
Intro to AWS Batch
Summary of recent AWS Batch launches
Glimpse into our roadmap
Real world use case: how AQR Capital
leverages AWS Batch to identify new
investment signals
Q&A

AWS Batch
Fully managed
No software to install or servers to manage
Integrated with AWS
Natively integrated with the AWS platform
Cost-optimized resource provisioning
Automatically provisions compute resources tailored to the needs of your jobs using Amazon EC2 On-Demand and Spot pricing.

Pricing
There is no additional charge for AWS Batch
You only pay for the AWS resources (for example, Amazon Elastic
Compute Cloud [Amazon EC2] instances) you create to store and run
your batch jobs

AWS Batch regional expansion
AWS Batch is available in 15 regions:
us-east-1 (N. Virginia)
us-east-2 (Ohio)
us-west-1 (N. California)
us-west-2 (Oregon)
eu-west-1 (Ireland)
eu-west-2 (London)
eu-west-3 (Paris)
eu-central-1 (Frankfurt)
ap-south-1 (Mumbai)
ap-northeast-1 (Tokyo)
ap-northeast-2 (Seoul)
ap-southeast-1 (Singapore)
ap-southeast-2 (Sydney)
ca-central-1 (Canada Central)
sa-east-1 (São Paulo)

Manageability & performance improvements
• AWS Batch as an Amazon CloudWatch Events target: automate your
workflow; submit a job to AWS Batch in response to an event pattern
or on a schedule
• Job execution timeout: control cost by automatically terminating your
job once the job has been running for the specified duration
• AWS CloudTrail audit calls to AWS Batch APIs: ensure compliance with
internal policies and regulatory standards
• Scheduling & throughput enhancements

Improved managed compute environments
• Use Amazon EC2 launch templates in your compute environment
• Example use cases include:
  • Increase size/encrypt container volume
  • Support custom user-data: mount Amazon Elastic File System (Amazon EFS) on instance launch, without needing to create a custom AMI
  • Override Amazon Elastic Container Service (Amazon ECS) image cleanup
• New instance types
  • Z1d
  • R5
  • R5d
  • M5d
  • C5d
  • X1e

[Diagram: containers (Container 1, Container 2)]

Support for multi-node parallel jobs
• Run single jobs which require multiple EC2 instances.
• Designed for distributed computing needs:
  • Tightly coupled High Performance Computing (HPC) applications
  • Distributed deep learning training

What can you expect in the next 12 months?
• Support for Elastic Fabric Adapter
• Significant improvements to the AWS Batch console
• Batch to emit CloudWatch metrics for monitoring
• Better GPU support
• More scheduling and performance improvements
• New instance types
• Further regional expansion

How AQR Capital leverages AWS to
research new investment signals
Michael Raposa
November 28, 2018
Head of Platform Engineering, AQR
Not intended for the sale or marketing of financial products or services.

Disclosures
The information set forth herein has been obtained or derived from sources believed by AQR Capital Management, LLC (“AQR”) to be reliable. However, AQR does not make any representation or warranty, express or implied, as to the information’s
accuracy or completeness. This presentation does not represent a formal or official view of AQR. No services or applications referenced are specifically endorsed by AQR.
The information contained herein is only as current as of the date indicated, and may be superseded by subsequent market events or for other reasons. Charts and graphs provided herein are for illustrative purposes only. The information in this
presentation has been developed internally and/or obtained from sources believed to be reliable; however, neither AQR nor the speaker guarantees the accuracy, adequacy or completeness of such information.
Neither AQR nor the speaker assumes any duty to, nor undertakes to update forward looking statements. No representation or warranty, express or implied, is made or given by or on behalf of AQR, the speaker or any other person as to the accuracy
and completeness or fairness of the information contained in this presentation, and no responsibility or liability is accepted for any such information. By accepting this presentation in its entirety, the recipient acknowledges its understanding and
acceptance of the foregoing statement.

Agenda
• About AQR
• Business problem
• Solution summary
• Lessons learned
• Takeaways

Our Firm
AQR is a global investment management firm built at the intersection of financial theory and practical application. We strive to deliver superior, long-term results for our
clients by looking past market noise to identify and isolate what matters most, and by developing ideas that stand up to rigorous testing. Our focus on practical
insights and analysis has made us leaders in alternative and traditional strategies since 1998.
At a glance
• AQR takes a systematic, research-driven approach to managing alternative and traditional strategies
• We apply quantitative tools to process fundamental information and manage risk
• Our clients include institutional investors, such as pension funds, defined contribution plans, insurance companies, endowments, foundations, family offices, and
sovereign wealth funds, as well as RIAs, private banks, and financial advisors
• The firm has 36 principals and 1,025 employees; over half of employees hold advanced degrees
• AQR is based in Greenwich, Connecticut, with offices in Boston, Chicago, Hong Kong, London, Los Angeles, and Sydney
• Approximately $226 billion in assets under management as of September 30, 2018*
*Approximate as of 9/30/2018, includes assets managed by AQR and its advisory affiliates.

Problem statement
Background
• Quantitative asset manager
• Investment decisions based on numerical models and
systematic trading
• Researchers develop models and “back test” ideas
over many years
• Never-ending appetite for more data and non-obvious
data sets
Source: AQR. For illustrative purposes only.

Researcher workflow
Idea -> Gather data -> Build model -> Back test -> Analyze result

Problem statement
Background
• Quantitative asset manager
• Investment decisions based on numerical models and
systematic trading
• Researchers develop models and “back test” ideas over
many years
• Never-ending appetite for more data and non-obvious
data sets
Problem
• On-premises compute grid can’t keep up with demand
• CAPEX locked into grid
• Researchers wait for grid resources
• Researchers need job results as quickly as possible
• New experimental use cases, such as GPU, require
a significant upfront investment of time and money

Design considerations
1. Scalable both in compute and memory
2. Fast without long queue times
3. Don’t want to manage a job scheduler, for example, Condor or Sun Grid Engine
4. Easy to use
5. Secure

Solution summary
Leverage cloud
• Burstability
• Leverage building-block application services, for example, Amazon Simple Storage Service (Amazon S3), Amazon EC2, Amazon DynamoDB
• Reduce costs: Spot

Solution summary
Seamless interface
• Based on Sun Grid Engine
• Submit jobs via AWS Command Line Interface (AWS CLI) or API: unlimited compute at your fingertips
• Short feedback loop: immediate results to researcher
• Backend grid matches researcher workstation: no “It works in DEV but not on the GRID”

Solution summary
Fast
• Automate everything
• “Unlimited” compute
• Short start times: no “queues”

Solution summary
Secure
• Encryption everywhere
• Leverage AWS security tools
• Security is a cloud engineering problem

Solution summary—Job submission (AWS CLI)

Step 1

Step 2

Step 3

Step 4

Solution summary—DAGs
[Diagram: job hierarchy]
Client: “Backtest: 20,000 Equities 1998-2018”
Jobs: “AAPL: 1998-2018”, “MSFT: 1998-2018”
Child jobs: “AAPL: 1998”, “AAPL: 1999”, “AAPL: 2018”, “MSFT: 1998”, “MSFT: 1999”, “MSFT: 2018”
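
The fan-out in this diagram can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not AQR's implementation: `JobSpec` and `build_backtest_dag` are hypothetical names, and the direction of the dependencies (each per-ticker job waits for its per-year child jobs, and the top-level backtest waits for every per-ticker job) is assumed rather than taken from the slides. With AWS Batch, each spec would presumably become one job submission, with the dependencies passed through the job's `dependsOn` setting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobSpec:
    """Hypothetical description of one job and the jobs it waits for."""
    name: str
    depends_on: List[str] = field(default_factory=list)


def build_backtest_dag(tickers, first_year, last_year):
    """Return child jobs per (ticker, year), one job per ticker, and one root job."""
    specs = []
    ticker_job_names = []
    for ticker in tickers:
        child_names = []
        for year in range(first_year, last_year + 1):
            child = JobSpec(name=f"backtest-{ticker}-{year}")
            specs.append(child)
            child_names.append(child.name)
        # Assumption: the per-ticker job aggregates its per-year children.
        ticker_job = JobSpec(name=f"backtest-{ticker}", depends_on=child_names)
        specs.append(ticker_job)
        ticker_job_names.append(ticker_job.name)
    # Assumption: the root job combines the per-ticker results.
    specs.append(JobSpec(name="backtest", depends_on=ticker_job_names))
    return specs


dag = build_backtest_dag(["AAPL", "MSFT"], 1998, 2018)
```

For two tickers over 1998-2018 this yields 42 child jobs, 2 per-ticker jobs, and 1 root job; at 20,000 equities the same shape grows to hundreds of thousands of child jobs, which is where the scale lessons later in the talk apply.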

Lessons learned—General
• Use Spot
• Use as many instance types as you can
• Use as many AZs as you can
Drive the lowest cost: $15 per 1,000 vCPUs per hour
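
As a rough sketch of what this advice looks like in configuration, the function below builds a request for a managed Spot compute environment that spans several instance families and several subnets (one per Availability Zone). The field names follow the AWS Batch `CreateComputeEnvironment` request as I understand it and should be checked against the current API reference; the function name, the chosen instance families, the bid percentage, and every identifier in the usage example are placeholders, not values from the talk.

```python
def spot_compute_environment(name, subnets, security_group_ids,
                             instance_role, spot_fleet_role, service_role,
                             max_vcpus):
    """Build (but do not send) a request for a managed Spot compute environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "state": "ENABLED",
        "serviceRole": service_role,
        "computeResources": {
            "type": "SPOT",
            # Assumption: pay at most the On-Demand price.
            "bidPercentage": 100,
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            # Several instance families widen the pool of Spot capacity.
            "instanceTypes": ["c5", "m5", "r5"],
            # One subnet per Availability Zone spreads jobs across AZs.
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
            "spotIamFleetRole": spot_fleet_role,
        },
    }


# Placeholder identifiers for illustration only.
request = spot_compute_environment(
    name="research-spot",
    subnets=["subnet-aaa", "subnet-bbb", "subnet-ccc"],
    security_group_ids=["sg-example"],
    instance_role="ecsInstanceRole",
    spot_fleet_role="arn:aws:iam::123456789012:role/example-spot-fleet-role",
    service_role="arn:aws:iam::123456789012:role/example-batch-service-role",
    max_vcpus=1024,
)
```

The resulting dictionary could then be passed to an SDK call such as boto3's `create_compute_environment(**request)`; the sketch stops short of that so it can run without AWS credentials.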

Log everything!
• ECS logs
• Job output
• Host logs
• … more

Monitor everything!
• Job runtime
• Job start-up times
• Job cost by user
• vCPU consumption by user
• High priority queue consumption by user
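
Two of these metrics, job start-up time and job runtime, can be derived from job timestamps. The sketch below assumes a job record shaped like an AWS Batch `DescribeJobs` entry, with `createdAt`, `startedAt`, and `stoppedAt` given in milliseconds since the epoch; `job_timings` is a hypothetical helper and the sample record is made up.

```python
def job_timings(job):
    """Return start-up latency and runtime in seconds, or None when not yet known."""
    created = job.get("createdAt")
    started = job.get("startedAt")
    stopped = job.get("stoppedAt")
    startup = (started - created) / 1000.0 if created is not None and started is not None else None
    runtime = (stopped - started) / 1000.0 if started is not None and stopped is not None else None
    return {"startup_seconds": startup, "runtime_seconds": runtime}


# Made-up job record: submitted, started 4 minutes later, ran for 30 minutes.
sample_job = {
    "jobId": "example-job-id",
    "createdAt": 1543400000000,
    "startedAt": 1543400240000,
    "stoppedAt": 1543402040000,
}
timings = job_timings(sample_job)
```

Tagging each record with the submitting user before aggregating would give the per-user views listed on the slide.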

Lessons learned
As we added more users …
1. TooManyRequestsException: DescribeJobs API call failing
2. Providing governance “guard rails”

Lessons learned—Event-based pipeline
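
The slides for this section are images, so the pipeline itself is not reproduced here. As a hedged sketch of the general idea (react to job state change events instead of polling DescribeJobs), the handler below filters events shaped like the "Batch Job State Change" events that AWS Batch sends to CloudWatch Events, as I understand that format. `handle_event` and the `on_finished` callback are hypothetical names, and the sample event is made up.

```python
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def handle_event(event, on_finished):
    """Invoke on_finished(job_id, status) for terminal Batch job events.

    Returns True when the event was handled, False when it was ignored.
    """
    if event.get("source") != "aws.batch":
        return False
    if event.get("detail-type") != "Batch Job State Change":
        return False
    detail = event.get("detail", {})
    status = detail.get("status")
    if status not in TERMINAL_STATES:
        return False
    on_finished(detail["jobId"], status)
    return True


finished = []
sample_event = {
    "source": "aws.batch",
    "detail-type": "Batch Job State Change",
    "detail": {"jobId": "example-job-id", "jobName": "backtest-AAPL-1998", "status": "SUCCEEDED"},
}
handled = handle_event(sample_event, lambda job_id, status: finished.append((job_id, status)))
```

Because status arrives as a push, the number of API calls no longer grows with the number of users watching their jobs.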

Lessons learned
As we added more compute …
1. TooManyRequestsException … API calls in the container
2. Job state storage woes
3. Job start times
4. Job costs

Lessons learned—API in containers
[Diagram: ECS container -> Get Secret -> Parameter Store]
At scale
[Diagram: five ECS containers -> Parameter Store. Exception: too many requests]

At scale
[Diagram: five ECS containers -> Amazon S3]
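
The slides show the throttled call being replaced by a service that scales. Where an API call has to stay in the container, a common complementary mitigation, which is not something the talk itself describes, is to retry throttled calls with exponential backoff and jitter. A minimal sketch, in which `ThrottledError` is a hypothetical stand-in for an SDK throttling error such as TooManyRequestsException:

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error raised by an SDK call."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise
            # The cap doubles on every attempt; the actual wait is a random
            # fraction of it, so containers do not retry in lockstep.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)


# Usage with a fake call that is throttled twice before succeeding.
attempts = {"count": 0}


def flaky_get_secret():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("too many requests")
    return "secret-value"


delays = []
secret = call_with_backoff(flaky_get_secret, sleep=delays.append)
```

Backoff only smooths bursts; with thousands of containers starting at once, removing the call altogether remains the more robust fix.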

What is shared job state?
• Shared memory across containers
• Job assignment & orchestration
• Job input and output

Lessons learned—Job state backend
[Diagram, built up over three slides: Amazon EFS; then Amazon EFS and Redis; then Amazon EFS, Amazon S3, and Redis]

Lessons learned—Job start times

Lessons learned—Set limits

Takeaways
Follow best practices
• Spot
• Multi-AZ
• Log everything
• Monitor everything

Takeaways
Watch for scale issues
• Eliminate API calls in your containers. Only use services that scale, such as DynamoDB.
• Switch from poll-based to event/message-based status.
• Choose a job state backend that fits your use case.

Takeaways
Make it easy and secure
• Write a lightweight wrapper around AWS Batch for non-technical users
• Have a “governator” function to ensure compliance
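
One way to picture the wrapper and the “governator” together is a single function that checks a request against policy before building the job submission. Everything below is a sketch under assumptions: the limits, the default queue and job definition names, and the function and exception names are hypothetical, and the request shape follows the AWS Batch `SubmitJob` request as I understand it (verify the field names against the current API reference).

```python
class PolicyViolation(ValueError):
    """Raised when a request breaks a governance rule (hypothetical)."""


# Hypothetical guard rails; real limits would come from firm policy.
LIMITS = {"max_vcpus": 16, "max_memory_mib": 65536, "max_timeout_seconds": 6 * 3600}


def build_submit_request(user, command, vcpus, memory_mib, timeout_seconds,
                         job_queue="research-queue", job_definition="research-job"):
    """Validate a request against LIMITS and return a SubmitJob-style dictionary."""
    if not 0 < vcpus <= LIMITS["max_vcpus"]:
        raise PolicyViolation(f"vcpus must be between 1 and {LIMITS['max_vcpus']}")
    if not 0 < memory_mib <= LIMITS["max_memory_mib"]:
        raise PolicyViolation(f"memory must be at most {LIMITS['max_memory_mib']} MiB")
    if not 0 < timeout_seconds <= LIMITS["max_timeout_seconds"]:
        raise PolicyViolation("every job needs a timeout within the allowed maximum")
    return {
        # The user name in the job name supports per-user cost reporting.
        "jobName": f"{user}-job",
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": list(command),
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
        # Job execution timeout, so a stuck job is terminated automatically.
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


request = build_submit_request("alice", ["python", "backtest.py"], 4, 8192, 3600)
```

Researchers call one function with a handful of arguments, and requests that break the rules fail before they reach AWS Batch.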

Takeaways
Reduce your start times
• Use Packer to bake AMIs
• Pre-warm the cluster during active times of the day
• Give yourself an SLA: 75% of jobs start in 10 mins and 90% in 15 mins
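
The start-time SLA quoted here can be checked with a few lines of arithmetic. The sketch below takes a list of observed start-up latencies in seconds and reports, for each target, the fraction of jobs that started within the limit. `sla_report` is a hypothetical helper and the sample latencies are made up.

```python
def sla_report(startup_seconds, targets=((600, 0.75), (900, 0.90))):
    """For each (limit_seconds, required_fraction) target, report whether it was met."""
    total = len(startup_seconds)
    report = []
    for limit, required in targets:
        within = sum(1 for s in startup_seconds if s <= limit)
        fraction = within / total if total else 0.0
        report.append({
            "limit_seconds": limit,
            "required": required,
            "fraction": fraction,
            "met": fraction >= required,
        })
    return report


# Made-up latencies: eight fast starts and two slower ones.
latencies = [120] * 8 + [700, 800]
report = sla_report(latencies)
```

With these sample values, 80% of jobs start within 10 minutes and all of them within 15 minutes, so both targets are met.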

Takeaways
Control runaway costs
• Alarm for large long-running jobs -> $$$
• Set limits on your compute environment
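
A cost alarm of this kind reduces to a small calculation. The sketch below reads the "$15/1000 vCPU/hour" figure quoted earlier in the talk as $0.015 per vCPU-hour, which is an interpretation, and flags jobs whose estimated cost exceeds a budget. The function names and the $100 budget are hypothetical.

```python
# Assumption: "$15/1000 vCPU/hour" means $15 per hour for 1,000 vCPUs.
RATE_PER_VCPU_HOUR = 15.0 / 1000


def estimated_cost(vcpus, runtime_seconds, rate=RATE_PER_VCPU_HOUR):
    """Estimate the cost in dollars of a job from its vCPU count and runtime."""
    return vcpus * (runtime_seconds / 3600.0) * rate


def should_alarm(vcpus, runtime_seconds, budget_dollars):
    """Return True when the estimated cost so far exceeds the budget."""
    return estimated_cost(vcpus, runtime_seconds) > budget_dollars


# A 64-vCPU job that has been running for five days, against a $100 budget.
cost = estimated_cost(64, 5 * 24 * 3600)
alarm = should_alarm(64, 5 * 24 * 3600, budget_dollars=100.0)
```

Evaluating this periodically for running jobs catches large, long-running work before it finishes, and a maximum vCPU limit on the compute environment bounds the worst case.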

Thank you!
Michael Raposa
Head of Platform Engineering
AQR Capital
Rey Wang
Senior Product Manager
AWS Batch and HPC
