AWS Batch is a fully managed service that enables developers to easily and efficiently run batch computing workloads of any scale on AWS. In this session, Senior Product Manager Rey Wang describes the core concepts behind AWS Batch and details how the service functions, its latest features, and upcoming roadmap items. Afterwards, we dive deep into AQR Capital’s high-performance computing use case for AWS Batch to develop investment signals. AQR researchers can package and submit a job to evaluate a signal without worrying about the compute resources needed, cost, security, or timing. Given the intelligent use of Amazon EC2 instances and Spot by AWS Batch, AQR has processed more than 75 years of compute workload at a very low cost. Learn how to use AWS Batch and containers to run HPC workloads without having to manage, schedule, or scale the underlying Amazon EC2 instances.
15. How AQR Capital leverages AWS to research new investment signals
Michael Raposa
November 28, 2018
Head of Platform Engineering, AQR
Not intended for the sale or marketing of financial products or services.
16. Disclosures
The information set forth herein has been obtained or derived from sources believed by AQR Capital Management, LLC (“AQR”) to be reliable. However, AQR does not make any representation or warranty, express or implied, as to the information’s
accuracy or completeness. This presentation does not represent a formal or official view of AQR. No services or applications referenced are specifically endorsed by AQR.
The information contained herein is only as current as of the date indicated, and may be superseded by subsequent market events or for other reasons. Charts and graphs provided herein are for illustrative purposes only. The information in this
presentation has been developed internally and/or obtained from sources believed to be reliable; however, neither AQR nor the speaker guarantees the accuracy, adequacy or completeness of such information.
Neither AQR nor the speaker assumes any duty to, nor undertakes to update forward looking statements. No representation or warranty, express or implied, is made or given by or on behalf of AQR, the speaker or any other person as to the accuracy
and completeness or fairness of the information contained in this presentation, and no responsibility or liability is accepted for any such information. By accepting this presentation in its entirety, the recipient acknowledges its understanding and
acceptance of the foregoing statement.
17. Agenda
• About AQR
• Business problem
• Solution summary
• Lessons learned
• Takeaways
18. Our Firm
AQR is a global investment management firm built at the intersection of financial theory and practical application. We strive to deliver superior, long-term results for our
clients by looking past market noise to identify and isolate what matters most, and by developing ideas that stand up to rigorous testing. Our focus on practical
insights and analysis has made us leaders in alternative and traditional strategies since 1998.
At a glance
• AQR takes a systematic, research-driven approach to managing alternative and traditional strategies
• We apply quantitative tools to process fundamental information and manage risk
• Our clients include institutional investors, such as pension funds, defined contribution plans, insurance companies, endowments, foundations, family offices, and
sovereign wealth funds, as well as RIAs, private banks, and financial advisors
• The firm has 36 principals and 1,025 employees; over half of employees hold advanced degrees
• AQR is based in Greenwich, Connecticut, with offices in Boston, Chicago, Hong Kong, London, Los Angeles, and Sydney
• Approximately $226 billion in assets under management as of September 30, 2018*
*Approximate as of 9/30/2018, includes assets managed by AQR and its advisory affiliates.
19. Problem statement
Background
• Quantitative asset manager
• Investment decisions based on numerical models and
systematic trading
• Researchers develop models and “back test” ideas
over many years
• Never-ending appetite for more data and non-obvious
data sets
Source: AQR. For illustrative purposes only.
21. Problem statement
Problem
• On-premises compute grid can’t keep up with demand
• CAPEX locked into grid
• Researchers wait for grid resources
• Researchers need job results as quickly as possible
• New experimental use cases, such as GPU, require a
significant upfront investment of time and money
22. Design considerations
1. Scalable both in compute and memory
2. Fast without long queue times
3. Don’t want to manage a job scheduler, for example, Condor
or Sun Grid Engine
4. Easy to use
5. Secure
23. Solution summary
Burstability
Leverage building-block
application services, for
example, Amazon Simple
Storage Service (Amazon S3),
EC2, Amazon DynamoDB
Reduce costs—Spot
Leverage cloud
24. Solution summary
Based on Sun Grid Engine
Submit jobs via the AWS Command
Line Interface (AWS CLI) or
API: unlimited compute at
fingertips
Short feedback loop—immediate
results to researcher
Backend grid matches researcher
workstation—no “It works in DEV
but not on the GRID”
Seamless interface
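The API submission path mentioned on this slide can be sketched roughly as follows. This is a minimal illustration, not AQR's wrapper: the queue name, job definition, and command are hypothetical placeholders, and the request dictionary simply mirrors the arguments of the boto3 Batch client's submit_job call.

```python
def build_submit_request(job_name, job_queue, job_definition, command,
                         vcpus=1, memory_mib=2048):
    """Build keyword arguments for the AWS Batch SubmitJob API call."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": list(command),
            # Resource values are passed as strings in this structure.
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},
            ],
        },
    }


def submit(request):
    """Send the request to AWS Batch and return the new job ID."""
    # Third-party AWS SDK, imported lazily so the builder runs without it.
    import boto3
    return boto3.client("batch").submit_job(**request)["jobId"]


# Hypothetical queue, job definition, and command, for illustration only.
request = build_submit_request(
    "signal-backtest", "research-queue", "backtest:1",
    ["python", "evaluate_signal.py"], vcpus=2, memory_mib=4096)
```

Calling submit(request) requires AWS credentials and an existing job queue and job definition; building the request does not.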
41. Lessons learned—General
Use Spot
Use as many instance
types as you can
Use as many AZs as
you can
Drive the lowest cost: $15 per 1,000 vCPUs per hour
Spot Instance
Availability Zones
Instance Families
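One way to express the three recommendations above is in the computeResources section passed when creating a managed compute environment. The sketch below only builds that dictionary. The subnet IDs, security group, instance role, and instance families are hypothetical, and the allocation strategy shown is an assumption on our part (it was added to AWS Batch after this talk), not something the slides specify.

```python
def spot_compute_resources(subnet_ids, instance_types, max_vcpus,
                           instance_role, security_group_ids):
    """Build a computeResources dict for a Spot-backed compute environment."""
    if not subnet_ids or not instance_types:
        raise ValueError("need at least one subnet and one instance type")
    return {
        "type": "SPOT",                                 # use Spot
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",  # assumption, see lead-in
        "minvCpus": 0,                                  # scale to zero when idle
        "maxvCpus": max_vcpus,                          # hard ceiling on capacity
        "instanceTypes": list(instance_types),          # many families
        "subnets": list(subnet_ids),                    # one subnet per AZ
        "securityGroupIds": list(security_group_ids),
        "instanceRole": instance_role,
    }


# Hypothetical identifiers, for illustration only.
resources = spot_compute_resources(
    subnet_ids=["subnet-aaa", "subnet-bbb", "subnet-ccc"],
    instance_types=["m5", "c5", "r5"],
    max_vcpus=1000,
    instance_role="ecsInstanceRole",
    security_group_ids=["sg-123"],
)
```

Spreading subnets across Availability Zones and listing several instance families gives the scheduler more Spot pools to draw from, which is the point the slide makes.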
43. Lessons learned—General
Job runtime
Job start-up times
Job cost by user
vCPU consumption by user
High priority queue
consumption by user
Monitor everything!
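Several of these metrics can be derived from job timestamps. The sketch below assumes each job is a plain dictionary with createdAt, startedAt, and stoppedAt in epoch milliseconds (the field names AWS Batch uses when describing jobs), plus hypothetical "user" and "vcpus" keys that a submission wrapper would have to record itself.

```python
from collections import defaultdict


def job_timings(job):
    """Return (startup_seconds, runtime_seconds) for a finished job."""
    startup = (job["startedAt"] - job["createdAt"]) / 1000.0
    runtime = (job["stoppedAt"] - job["startedAt"]) / 1000.0
    return startup, runtime


def vcpu_hours_by_user(jobs):
    """Sum vCPU-hours consumed per user across finished jobs."""
    totals = defaultdict(float)
    for job in jobs:
        _, runtime = job_timings(job)
        totals[job["user"]] += job["vcpus"] * runtime / 3600.0
    return dict(totals)


# Hypothetical records: one-minute start-up, then one hour and half an hour of runtime.
jobs = [
    {"user": "alice", "vcpus": 4,
     "createdAt": 0, "startedAt": 60_000, "stoppedAt": 3_660_000},
    {"user": "bob", "vcpus": 2,
     "createdAt": 0, "startedAt": 60_000, "stoppedAt": 1_860_000},
]
usage = vcpu_hours_by_user(jobs)
```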
51. Lessons learned
As we added more compute …
1. TooManyRequestsException … API calls in the container
2. Job state storage woes
3. Job start times
4. Job costs
53. Lessons learned—API in containers
At scale
[Diagram: five ECS containers connected to Parameter Store. Exception: too many requests]
54. Lessons learned—API in containers
At scale
[Diagram: five ECS containers connected to Amazon S3]
55. What is shared job state?
Shared memory
across containers
Job assignment &
orchestration
Job input and output
62. Takeaways
Eliminate API calls in your
containers. Only use services
that scale, such as
DynamoDB.
Switch to event/message-based
status from poll-based.
Choose a job state backend
that fits your use case.
Watch for scale issues
63. Takeaways
Write a lightweight
wrapper around AWS
Batch for non-technical
users
Have a “governator”
function to ensure
compliance
Make it easy and secure
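The slides do not describe how the “governator” works. As a rough illustration of the idea only, a wrapper could check every job request against a policy before submitting it. The policy fields, request keys, and limits below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical compliance policy enforced before submission."""
    allowed_queues: frozenset
    max_vcpus: int
    required_tags: frozenset


def violations(request, policy):
    """Return a list of human-readable policy violations (empty if compliant)."""
    problems = []
    if request.get("queue") not in policy.allowed_queues:
        problems.append(f"queue not allowed: {request.get('queue')}")
    if request.get("vcpus", 0) > policy.max_vcpus:
        problems.append(f"vcpus exceed limit of {policy.max_vcpus}")
    missing = policy.required_tags - set(request.get("tags", {}))
    if missing:
        problems.append("missing tags: " + ", ".join(sorted(missing)))
    return problems


policy = Policy(
    allowed_queues=frozenset({"research", "high-priority"}),
    max_vcpus=64,
    required_tags=frozenset({"owner", "project"}),
)
ok = {"queue": "research", "vcpus": 8,
      "tags": {"owner": "alice", "project": "signals"}}
bad = {"queue": "adhoc", "vcpus": 128, "tags": {"owner": "bob"}}
```

A wrapper would refuse to submit any request for which violations() returns a non-empty list.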
64. Takeaways
Use Packer to bake AMIs
Pre-warm the cluster
during active times of the
day
Give yourself an SLA—
75% of jobs start in 10
mins and 90% in 15 mins
Reduce your start times
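The SLA on this slide can be checked mechanically against observed start-up latencies. The sketch below assumes that “start in 10 mins” means within 10 minutes of submission and that latencies are already available in minutes; the sample values are made up.

```python
def fraction_within(latencies_min, limit_min):
    """Fraction of jobs whose start-up latency is at most limit_min minutes."""
    if not latencies_min:
        raise ValueError("no latencies to evaluate")
    hits = sum(1 for value in latencies_min if value <= limit_min)
    return hits / len(latencies_min)


def meets_sla(latencies_min, targets=((10, 0.75), (15, 0.90))):
    """True if every (limit in minutes, required share) target is satisfied."""
    return all(fraction_within(latencies_min, limit) >= share
               for limit, share in targets)


# Made-up samples: start-up latency in minutes for ten jobs.
good_day = [2, 4, 6, 8, 9, 9, 10, 10, 12, 20]
bad_day = [5, 12, 14, 30]
```

For good_day, 8 of 10 jobs start within 10 minutes and 9 of 10 within 15, so both targets hold; bad_day misses the first target.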
65. Takeaways
Alarm for large long-running
jobs -> $$$
Set limits on your
compute environment
Control runaway costs
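A simple form of the alarm described above estimates the cost accrued so far by each running job and flags those over a budget. The rate reuses the $15/1000 vCPU/hour figure quoted earlier in the deck. The job records, the "vcpus" key, and the budget are hypothetical, and actual Spot prices vary.

```python
RATE_PER_VCPU_HOUR = 15.0 / 1000  # $15 per 1,000 vCPUs per hour, from the deck
MS_PER_HOUR = 3_600_000


def running_cost(vcpus, started_at_ms, now_ms):
    """Estimated dollars spent so far by a job that is still running."""
    hours = max(0, now_ms - started_at_ms) / MS_PER_HOUR
    return vcpus * hours * RATE_PER_VCPU_HOUR


def jobs_to_alarm(jobs, now_ms, budget_usd):
    """Return IDs of running jobs whose estimated cost exceeds the budget."""
    return [job["jobId"] for job in jobs
            if running_cost(job["vcpus"], job["startedAt"], now_ms) > budget_usd]


# Hypothetical running jobs: 64 vCPUs for 48 hours versus 2 vCPUs for 1 hour.
now = 48 * MS_PER_HOUR
running = [
    {"jobId": "big", "vcpus": 64, "startedAt": 0},
    {"jobId": "small", "vcpus": 2, "startedAt": 47 * MS_PER_HOUR},
]
flagged = jobs_to_alarm(running, now, budget_usd=25.0)
```

The separate limit on the compute environment (its maximum vCPU count) caps total spend even if an alarm is missed.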