SlideShare a Scribd company logo
1 of 17
Download to read offline
APL-CUDA LIBRE
A pioneering approach to parallel array processing in
quantitative and mathematical finance
Yigal Jhirad
September 16, 2009
Dyalog ’09
Princeton, New Jersey
1
APL-CUDA: The Need For Speed
 In Portfolio Management and Trading technology speed in processing is a requirement
— Must Have
— High Frequency Trading
— Risk Management, Market Impact
 APL and NVIDIA GPU based physics modeling exploit hardware efficiently
— Array Based Processing paradigm Matrix/Vector thought process is key
— APL leverages CPU’s ability and provides for rapid development of software applications
— CUDA leverages GPU Hardware
 Application in Econometrics and Applied Mathematics
— Monte Carlo Simulations — Optimization
— Cluster Analysis — Cointegration
— Principal Components
— Fourier Analysis
 Neural Networks
— Rapid application development and testing of idea thesis and innovation
2
APL-CUDA: Example - Cluster Analysis
 Cluster Analysis: A multivariate technique designed to identify relationships and cohesion
— Factor Analysis
— Risk Model Development
 Correlation Analysis: Pairwise analysis of data across assets. Each pairwise comparison can be run in
parallel
— Apply proprietary signal filter to remove selected data and reduce spurious correlations
3
APL-CUDA: Speed Enhancements
 Speed Enhancements of CUDA are significant as the matrix size gets larger
0
10
20
30
40
50
60
100x100
200x100
300x100
400x100
500x100
600x100
700x100
800x100
900x100
1000x100
1000x200
1000x300
1000x400
1000x500
1000x600
1000x700
1000x800
1000x900
1000x1000
2000x1000
3000x1000
4000x1000
5000x1000
5000x2000
Cuda Speed Multiple
MATRIX SIZE
CUDA/GPU vs Non-CUDA C
CUDA Performance Enhancement
Cuda/GPU C1060 Tesla vs Non-Cuda C
Cuda/GPU GTX-295 vs Non-Cuda C
Cuda about50 Times Faster on
a5000x2000 Matrix
Windows XP 64. Dual Xeon 3.2 Ghz .
4
APL-CUDA: Application
Application example– Correlation function removing N/A values
 Given an N x M matrix A in which each row is a list of returns for a particular equity, return an N x
N matrix R in which each element is the scalar result of correlating each row of A to every other
— Each element of A may be an N/A value
— When processing a pair of rows, the calculation must include neither each N/A value nor the
corresponding element in the other row
— This prevents a fast, uniform algorithm in APL
5
APL-CUDA: Compute Unified Device Architecture
 Host (CPU) initiates program
 Copy data from host memory to device memory
 Launches Kernels on Graphical Processing Unit (GPU)
 Copy result data from device memory back into host memory
Source: NVIDIA
6
APL-CUDA: Device Architecture
 Global, Constant, Texture
Memory
 Graphical Processing Unit (GPU)
 Streaming Multiprocessors (SM)
— Shared memory
— Register memory
— 8 Scalar Processors (SP)
— Runs blocks of threads
— Schedules Blocks/Warps
Source: NVIDIA
7
APL-CUDA: Execution Model
 Launch Kernel on GPU
— Threads run kernel Code
in parallel
 Thread Blocks are arrays of
threads grouped into warps
— A warp is a group of 32
threads which execute
simultaneously
— Warps/blocks run in
unpredictable order or
interleaved
— Thread Blocks
Automatically scheduled
by the GPU
— Global vs. Shared
memory
Source: NVIDIA
8
APL-CUDA: Execution Configuration
 Problem must be broken down into logical blocks (thread blocks) that will all be handled as similarly
as possible yet independently
 Host launches a very large one or two dimensional grid of thread blocks, which are automatically
scheduled by the GPU
— Blocks are small but many can be launched
— Each block should be designed to be as independent as possible from all the others
— Thread block is a small one, two, or three dimensional array of threads, grouped into warps
— Each block should be designed to be small enough in terms of the number of threads and
memory it uses to fit several within the resources provided by each SM yet employ enough
threads to take advantage of simultaneous processing (normally 64 – 256 threads)
9
APL-CUDA: Warps
 Warp is a group of 32 threads which all execute simultaneously by SIMT (Single Instruction, Multiple
Thread) process
— Optimal to program warps to read and write to contiguous global memory locations to enable
coalescing
– Coalescing is the process of reading or writing 16 global memory elements simultaneously
in a single instruction
— Optimal to program warps to read and write to contiguous shared memory locations to
eliminate bank conflicts
– Banks are 16 divisions in shared memory to which data can be read or written
simultaneously in a single instruction
— Optimal to minimize code divergence within a warp
– Code diverges when two threads in the same warp take different paths via program
control statements (if, switch, for, etc.)
10
APL-CUDA: Application – Correlation Kernel
Application example– Correlation function removing N/A values
 Given an N x M matrix A in which each row is a list of returns for a particular equity, return an N x
N matrix R in which each element is the scalar result of correlating each row of A to every other
— Each element of A may be an N/A value
— When processing a pair of rows, the calculation must include neither each N/A value nor the
corresponding element in the other row
— This prevents a fast, uniform algorithm in APL
11
APL-CUDA: Host/GPU Program Structure
12
APL-CUDA: Host/GPU Program Structure
13
APL-CUDA: Host/GPU Program Structure
Code
 Host writes A to GPU global memory
 Host launches N x (ceil(N / 16)) grid of thread blocks
— Each thread block is 16 x 16 threads
— Each row of the thread block processes a single pair of rows of A
– Each row of the thread block reads 16 contiguous elements of A simultaneously (twice) using
coalescing
– Each row of the thread block processes 16 elements of a pair of rows of A, in a loop
– Each row of the thread block will loop ceil(M / 16) times
– Each thread processes every 16th column of A
– Each row of the thread block reads and writes to 16 contiguous elements of shared
memory with no bank conflicts
— Each thread block yields 16 scalar results which it writes to 16 contiguous elements of R
simultaneously using coalescing
 Host reads R from GPU global memory
14
APL-CUDA: Example of Parallel Thinking
Procedural language algorithm
Sum = 0
For (i = 0 to 15) Sum += Array[i]
Sum now contains sum of Array
 This algorithm is inefficient in parallel computing
because each iteration (read: thread) writes to the
same memory location, causing a bottleneck
necessitating 16 instructions to complete
Parallel algorithm
Array[current thread] += Array[current thread + 8]
Array[current thread] += Array[current thread + 4]
Array[current thread] += Array[current thread + 2]
Array[current thread] += Array[current thread + 1]
Array[0] now contains sum of Array
 This algorithm is more efficient because it writes to
contiguous banks simultaneously and thus takes
only 4 instructions to complete
15
APL-CUDA: Summary
 APL-CUDA provides powerful capabilities in quantitative finance
— Powerful synergies in array processing provide an effective framework for amplifying the
impact of APL with CUDA
— As APL is evolving to exploit multi-core CPU’s another potential important direction would
be to integrate and leverage parallelism in GPU processing
 Speed is more of a necessity and requirement in quantitative applications that drive portfolio
management and trading.
— Those technologies that deliver will reap the benefits to a user community that is savvy and
demanding
16
Biography
 Yigal D. Jhirad is a senior vice president at Cohen and Steers. He is a portfolio manager and
director of quantitative and derivatives strategies. Prior to joining Cohen and Steers, he was an
executive director and head of Quantitative and Derivatives Strategies at Morgan Stanley. He was
responsible for developing, implementing and marketing quantitative and derivatives products to
hedge funds, active and passive funds, pension funds and endowments.

More Related Content

What's hot

Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O Sri Ambati
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationHao Xu
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRDatabricks
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...waqarnabi
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMark Lefevre, CQF
 
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceBlock Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceVasia Kalavri
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationRevolution Analytics
 
Circuit analysis i with matlab computing and simulink sim powersystems modeling
Circuit analysis i with matlab computing and simulink sim powersystems modelingCircuit analysis i with matlab computing and simulink sim powersystems modeling
Circuit analysis i with matlab computing and simulink sim powersystems modelingIndra S Wahyudi
 
I/O-Efficient Techniques for Computing Pagerank
I/O-Efficient Techniques for Computing PagerankI/O-Efficient Techniques for Computing Pagerank
I/O-Efficient Techniques for Computing PagerankYen-Yu Chen
 
Life in the Fast Lane: A Line-Rate Linear Road
Life in the Fast Lane: A Line-Rate Linear RoadLife in the Fast Lane: A Line-Rate Linear Road
Life in the Fast Lane: A Line-Rate Linear RoadAJAY KHARAT
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...Tiziano De Matteis
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsJen Aman
 

What's hot (20)

Generalized Linear Models with H2O
Generalized Linear Models with H2O Generalized Linear Models with H2O
Generalized Linear Models with H2O
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Project ppt
Project pptProject ppt
Project ppt
 
Machine Learning - Principles
Machine Learning - PrinciplesMachine Learning - Principles
Machine Learning - Principles
 
Software Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale AutomationSoftware Design Practices for Large-Scale Automation
Software Design Practices for Large-Scale Automation
 
Generalized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkRGeneralized Linear Models in Spark MLlib and SparkR
Generalized Linear Models in Spark MLlib and SparkR
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
 
Quantitative finance in q
Quantitative finance in qQuantitative finance in q
Quantitative finance in q
 
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read JapaneseMachine Learning in q/kdb+ - Teaching KDB to Read Japanese
Machine Learning in q/kdb+ - Teaching KDB to Read Japanese
 
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduceBlock Sampling: Efficient Accurate Online Aggregation in MapReduce
Block Sampling: Efficient Accurate Online Aggregation in MapReduce
 
Chap2 slides
Chap2 slidesChap2 slides
Chap2 slides
 
Scalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar PresentationScalable Data Analysis in R Webinar Presentation
Scalable Data Analysis in R Webinar Presentation
 
Circuit analysis i with matlab computing and simulink sim powersystems modeling
Circuit analysis i with matlab computing and simulink sim powersystems modelingCircuit analysis i with matlab computing and simulink sim powersystems modeling
Circuit analysis i with matlab computing and simulink sim powersystems modeling
 
I/O-Efficient Techniques for Computing Pagerank
I/O-Efficient Techniques for Computing PagerankI/O-Efficient Techniques for Computing Pagerank
I/O-Efficient Techniques for Computing Pagerank
 
Chap5 slides
Chap5 slidesChap5 slides
Chap5 slides
 
Life in the Fast Lane: A Line-Rate Linear Road
Life in the Fast Lane: A Line-Rate Linear RoadLife in the Fast Lane: A Line-Rate Linear Road
Life in the Fast Lane: A Line-Rate Linear Road
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Eff...
 
Enhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable StatisticsEnhancing Spark SQL Optimizer with Reliable Statistics
Enhancing Spark SQL Optimizer with Reliable Statistics
 

Similar to A Pioneering Approach to Parallel Array Processing in Quantitative and Mathematical Finance

Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsZvi Avraham
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET Journal
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelMartin Zapletal
 
Architecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUArchitecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUGlobalLogic Ukraine
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithmsDanish Javed
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...AMD Developer Central
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsnARUNACHALAM468781
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisShah Zaib
 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETssusere19c741
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad ranaData Con LA
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSKathirvel Ayyaswamy
 
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...Sara Alvarez
 

Similar to A Pioneering Approach to Parallel Array Processing in Quantitative and Mathematical Finance (20)

unit 2 hpc.pptx
unit 2 hpc.pptxunit 2 hpc.pptx
unit 2 hpc.pptx
 
Migration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming ModelsMigration To Multi Core - Parallel Programming Models
Migration To Multi Core - Parallel Programming Models
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
 
Apache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming modelApache spark - Spark's distributed programming model
Apache spark - Spark's distributed programming model
 
Architecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPUArchitecture of TPU, GPU and CPU
Architecture of TPU, GPU and CPU
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 
Ef35745749
Ef35745749Ef35745749
Ef35745749
 
System mldl meetup
System mldl meetupSystem mldl meetup
System mldl meetup
 
Parallel algorithms
Parallel algorithmsParallel algorithms
Parallel algorithms
 
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
PT-4102, Simulation, Compilation and Debugging of OpenCL on the AMD Southern ...
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
 
main
mainmain
main
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
 
Asynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NETAsynchronous and Parallel Programming in .NET
Asynchronous and Parallel Programming in .NET
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 
CS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMSCS9222 ADVANCED OPERATING SYSTEMS
CS9222 ADVANCED OPERATING SYSTEMS
 
SparkNet presentation
SparkNet presentationSparkNet presentation
SparkNet presentation
 
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
Automatic Compilation Of MATLAB Programs For Synergistic Execution On Heterog...
 

Recently uploaded

Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communicationskarancommunications
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayNZSG
 
Understanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key InsightsUnderstanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key Insightsseri bangash
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Neil Kimberley
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyEthan lee
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMRavindra Nath Shukla
 
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...Any kyc Account
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLSeo
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Dave Litwiller
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
Cash Payment 9602870969 Escort Service in Udaipur Call Girls
Cash Payment 9602870969 Escort Service in Udaipur Call GirlsCash Payment 9602870969 Escort Service in Udaipur Call Girls
Cash Payment 9602870969 Escort Service in Udaipur Call GirlsApsara Of India
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageMatteo Carbone
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsP&CO
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdfRenandantas16
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdftbatkhuu1
 
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth MarketingShawn Pang
 

Recently uploaded (20)

Pharma Works Profile of Karan Communications
Pharma Works Profile of Karan CommunicationsPharma Works Profile of Karan Communications
Pharma Works Profile of Karan Communications
 
It will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 MayIt will be International Nurses' Day on 12 May
It will be International Nurses' Day on 12 May
 
Understanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key InsightsUnderstanding the Pakistan Budgeting Process: Basics and Key Insights
Understanding the Pakistan Budgeting Process: Basics and Key Insights
 
Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023Mondelez State of Snacking and Future Trends 2023
Mondelez State of Snacking and Future Trends 2023
 
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Greater Kailash ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Forklift Operations: Safety through Cartoons
Forklift Operations: Safety through CartoonsForklift Operations: Safety through Cartoons
Forklift Operations: Safety through Cartoons
 
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case studyThe Coffee Bean & Tea Leaf(CBTL), Business strategy case study
The Coffee Bean & Tea Leaf(CBTL), Business strategy case study
 
Monte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSMMonte Carlo simulation : Simulation using MCSM
Monte Carlo simulation : Simulation using MCSM
 
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
KYC-Verified Accounts: Helping Companies Handle Challenging Regulatory Enviro...
 
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRLMONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
MONA 98765-12871 CALL GIRLS IN LUDHIANA LUDHIANA CALL GIRL
 
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
Enhancing and Restoring Safety & Quality Cultures - Dave Litwiller - May 2024...
 
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Navi Mumbai Just Call 9907093804 Top Class Call Girl Service Avail...
 
Cash Payment 9602870969 Escort Service in Udaipur Call Girls
Cash Payment 9602870969 Escort Service in Udaipur Call GirlsCash Payment 9602870969 Escort Service in Udaipur Call Girls
Cash Payment 9602870969 Escort Service in Udaipur Call Girls
 
Insurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usageInsurers' journeys to build a mastery in the IoT usage
Insurers' journeys to build a mastery in the IoT usage
 
Value Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and painsValue Proposition canvas- Customer needs and pains
Value Proposition canvas- Customer needs and pains
 
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf0183760ssssssssssssssssssssssssssss00101011 (27).pdf
0183760ssssssssssssssssssssssssssss00101011 (27).pdf
 
Nepali Escort Girl Kakori \ 9548273370 Indian Call Girls Service Lucknow ₹,9517
Nepali Escort Girl Kakori \ 9548273370 Indian Call Girls Service Lucknow ₹,9517Nepali Escort Girl Kakori \ 9548273370 Indian Call Girls Service Lucknow ₹,9517
Nepali Escort Girl Kakori \ 9548273370 Indian Call Girls Service Lucknow ₹,9517
 
Event mailer assignment progress report .pdf
Event mailer assignment progress report .pdfEvent mailer assignment progress report .pdf
Event mailer assignment progress report .pdf
 
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Pune Just Call 9907093804 Top Class Call Girl Service Available
 
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
Tech Startup Growth Hacking 101  - Basics on Growth MarketingTech Startup Growth Hacking 101  - Basics on Growth Marketing
Tech Startup Growth Hacking 101 - Basics on Growth Marketing
 

A Pioneering Approach to Parallel Array Processing in Quantitative and Mathematical Finance

  • 1. APL-CUDA LIBRE A pioneering approach to parallel array processing in quantitative and mathematical finance Yigal Jhirad September 16, 2009 Dyalog ’09 Princeton, New Jersey
  • 2. 1 APL-CUDA: The Need For Speed  In Portfolio Management and Trading technology speed in processing is a requirement — Must Have — High Frequency Trading — Risk Management, Market Impact  APL and NVIDIA GPU based physics modeling exploit hardware efficiently — Array Based Processing paradigm Matrix/Vector thought process is key — APL leverages CPU’s ability and provides for rapid development of software applications — CUDA leverages GPU Hardware  Application in Econometrics and Applied Mathematics — Monte Carlo Simulations — Optimization — Cluster Analysis — Cointegration — Principal Components — Fourier Analysis  Neural Networks — Rapid application development and testing of idea thesis and innovation
  • 3. 2 APL-CUDA: Example - Cluster Analysis  Cluster Analysis: A multivariate technique designed to identify relationships and cohesion — Factor Analysis — Risk Model Development  Correlation Analysis: Pairwise analysis of data across assets. Each pairwise comparison can be run in parallel — Apply proprietary signal filter to remove selected data and reduce spurious correlations
  • 4. 3 APL-CUDA: Speed Enhancements  Speed Enhancements of CUDA are significant as the matrix size gets larger 0 10 20 30 40 50 60 100x100 200x100 300x100 400x100 500x100 600x100 700x100 800x100 900x100 1000x100 1000x200 1000x300 1000x400 1000x500 1000x600 1000x700 1000x800 1000x900 1000x1000 2000x1000 3000x1000 4000x1000 5000x1000 5000x2000 Cuda Speed Multiple MATRIX SIZE CUDA/GPU vs Non-CUDA C CUDA Performance Enhancement Cuda/GPU C1060 Tesla vs Non-Cuda C Cuda/GPU GTX-295 vs Non-Cuda C Cuda about50 Times Faster on a5000x2000 Matrix Windows XP 64. Dual Xeon 3.2 Ghz .
  • 5. 4 APL-CUDA: Application Application example– Correlation function removing N/A values  Given an N x M matrix A in which each row is a list of returns for a particular equity, return an N x N matrix R in which each element is the scalar result of correlating each row of A to every other — Each element of A may be an N/A value — When processing a pair of rows, the calculation must include neither each N/A value nor the corresponding element in the other row — This prevents a fast, uniform algorithm in APL
  • 6. 5 APL-CUDA: Compute Unified Device Architecture  Host (CPU) initiates program  Copy data from host memory to device memory  Launches Kernels on Graphical Processing Unit (GPU)  Copy result data from device memory back into host memory Source: NVIDIA
  • 7. 6 APL-CUDA: Device Architecture  Global, Constant, Texture Memory  Graphical Processing Unit (GPU)  Streaming Multiprocessors (SM) — Shared memory — Register memory — 8 Scalar Processors (SP) — Runs blocks of threads — Schedules Blocks/Warps Source: NVIDIA
  • 8. 7 APL-CUDA: Execution Model  Launch Kernel on GPU — Threads run kernel Code in parallel  Thread Blocks are arrays of threads grouped into warps — A warp is a group of 32 threads which execute simultaneously — Warps/blocks run in unpredictable order or interleaved — Thread Blocks Automatically scheduled by the GPU — Global vs. Shared memory Source: NVIDIA
  • 9. 8 APL-CUDA: Execution Configuration  Problem must be broken down into logical blocks (thread blocks) that will all be handled as similarly as possible yet independently  Host launches a very large one or two dimensional grid of thread blocks, which are automatically scheduled by the GPU — Blocks are small but many can be launched — Each block should be designed to be as independent as possible from all the others — Thread block is a small one, two, or three dimensional array of threads, grouped into warps — Each block should be designed to be small enough in terms of the number of threads and memory it uses to fit several within the resources provided by each SM yet employ enough threads to take advantage of simultaneous processing (normally 64 – 256 threads)
  • 10. 9 APL-CUDA: Warps  Warp is a group of 32 threads which all execute simultaneously by SIMT (Single Instruction, Multiple Thread) process — Optimal to program warps to read and write to contiguous global memory locations to enable coalescing – Coalescing is the process of reading or writing 16 global memory elements simultaneously in a single instruction — Optimal to program warps to read and write to contiguous shared memory locations to eliminate bank conflicts – Banks are 16 divisions in shared memory to which data can be read or written simultaneously in a single instruction — Optimal to minimize code divergence within a warp – Code diverges when two threads in the same warp take different paths via program control statements (if, switch, for, etc.)
  • 11. 10 APL-CUDA: Application – Correlation Kernel Application example– Correlation function removing N/A values  Given an N x M matrix A in which each row is a list of returns for a particular equity, return an N x N matrix R in which each element is the scalar result of correlating each row of A to every other — Each element of A may be an N/A value — When processing a pair of rows, the calculation must include neither each N/A value nor the corresponding element in the other row — This prevents a fast, uniform algorithm in APL
  • 14. 13 APL-CUDA: Host/GPU Program Structure Code  Host writes A to GPU global memory  Host launches N x (ceil(N / 16)) grid of thread blocks — Each thread block is 16 x 16 threads — Each row of the thread block processes a single pair of rows of A – Each row of the thread block reads 16 contiguous elements of A simultaneously (twice) using coalescing – Each row of the thread block processes 16 elements of a pair of rows of A, in a loop – Each row of the thread block will loop ceil(M / 16) times – Each thread processes every 16th column of A – Each row of the thread block reads and writes to 16 contiguous elements of shared memory with no bank conflicts — Each thread block yields 16 scalar results which it writes to 16 contiguous elements of R simultaneously using coalescing  Host reads R from GPU global memory
  • 15. 14 APL-CUDA: Example of Parallel Thinking Procedural language algorithm Sum = 0 For (i = 0 to 15) Sum += Array[i] Sum now contains sum of Array  This algorithm is inefficient in parallel computing because each iteration (read: thread) writes to the same memory location, causing a bottleneck necessitating 16 instructions to complete Parallel algorithm Array[current thread] += Array[current thread + 8] Array[current thread] += Array[current thread + 4] Array[current thread] += Array[current thread + 2] Array[current thread] += Array[current thread + 1] Array[0] now contains sum of Array  This algorithm is more efficient because it writes to contiguous banks simultaneously and thus takes only 4 instructions to complete
  • 16. 15 APL-CUDA: Summary  APL-CUDA provides powerful capabilities in quantitative finance — Powerful synergies in array processing provide an effective framework for amplifying the impact of APL with CUDA — As APL is evolving to exploit multi-core CPU’s another potential important direction would be to integrate and leverage parallelism in GPU processing  Speed is more of a necessity and requirement in quantitative applications that drive portfolio management and trading. — Those technologies that deliver will reap the benefits to a user community that is savvy and demanding
  • 17. 16 Biography  Yigal D. Jhirad is a senior vice president at Cohen and Steers. He is a portfolio manager and director of quantitative and derivatives strategies. Prior to joining Cohen and Steers, he was an executive director and head of Quantitative and Derivatives Strategies at Morgan Stanley. He was responsible for developing, implementing and marketing quantitative and derivatives products to hedge funds, active and passive funds, pension funds and endowments.