20110620 amst rdam_kpb

Introduction
Computing in databases
Conclusion
Computing near the data:
let someone else do the heavy lifting for you
Konrad Banachewicz
AmstRdam, June 20th 2011

“We’re drowning in data and starving for information”

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1 exchange → 400 MB in pure ASCII
different parameters → “clones” of the same instrument
{ exchanges } x { instruments } x { days } ...
= A LOT

Problems:
memory
bandwidth

Model 1: regression
Model 2: correlation
Model 3: VaR
Typical approach:
read the data to memory
analyze there
save the results

But is it really necessary?

In many cases what we really need is aggregate info:
Example: linear regression

classic estimator:
$\hat{\beta} = (X^T X)^{-1} X^T y$
come to think of it, what we really need are sums, sums of squares and cross-products
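To make the remark about sums concrete, here is a worked case. It assumes, for illustration only, a model with an intercept and a single regressor $x_t$, observed for $t = 1, \dots, n$; the two building blocks of the estimator then reduce to five aggregates.

```latex
X^T X =
\begin{pmatrix}
  n            & \sum_t x_t   \\
  \sum_t x_t   & \sum_t x_t^2
\end{pmatrix},
\qquad
X^T y =
\begin{pmatrix}
  \sum_t y_t \\
  \sum_t x_t y_t
\end{pmatrix}

\hat{\beta}_2 = \frac{n \sum_t x_t y_t - \sum_t x_t \sum_t y_t}
                     {n \sum_t x_t^2 - \left(\sum_t x_t\right)^2},
\qquad
\hat{\beta}_1 = \frac{\sum_t y_t - \hat{\beta}_2 \sum_t x_t}{n}
```

Only $n$, $\sum x_t$, $\sum y_t$, $\sum x_t^2$ and $\sum x_t y_t$ are needed, however many rows the data set has.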

Two possible approaches:
1 Ripley and Chen: extra interface, pure R
2 R + SQL

Ripley and Chen:
R(user) → CORBA → R(servant)
                      |
                      DB

Alternative:
R(user) ⇄ DB
Two scenarios:
1 pure R processing
2 computations partially in DB

Model 1: regression
base model:
$Y_t = \beta_1 + \beta_2 X_t + \epsilon_t$
estimator:
$\hat{\beta} = (X^T X)^{-1} X^T Y$
in the DB: arithmetic operations on a limited set of columns
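The "computations partially in DB" scenario for this model might look roughly like the sketch below. The talk works with R and SQL over ODBC; Python with an in-memory SQLite database is used here purely as a stand-in, and the table name `obs`, the column names `x` and `y`, and the function name are hypothetical.

```python
import sqlite3


def fit_line_in_db(conn):
    """Estimate beta1 (intercept) and beta2 (slope) from SQL aggregates only."""
    # The database does the heavy lifting: five aggregates come back,
    # regardless of how many rows the table holds.
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
    ).fetchone()
    # Closed-form solution of the 2x2 normal equations (X^T X) beta = X^T Y.
    beta2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    beta1 = (sy - beta2 * sx) / n
    return beta1, beta2


# Toy data lying exactly on y = 1 + 2x (hypothetical, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(float(t), 1.0 + 2.0 * t) for t in range(1, 6)],
)

beta1, beta2 = fit_line_in_db(conn)
print(beta1, beta2)
```

Only the five aggregates cross the connection, which is the point of the approach: memory and bandwidth on the client stay constant as the table grows.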

Pure R processing
[Figure: Case study 1, method 1. Execution time (seconds, 0 to 30) against dataset size (number of rows, 200,000 to 1,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

Computations partially in DB
[Figure: x-axis ticks from 200,000 to 1,000,000, y-axis ticks from 0 to 30; series: Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

Model 2: correlation
base model:
$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$
estimator:
$\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)$
in the DB: large queries
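Why the queries become large can be seen in a small sketch: with k columns, a single SELECT has to carry k means and k(k+1)/2 averaged cross-products. The code below is an illustration under assumptions, not the talk's implementation: Python and SQLite stand in for R and an ODBC-accessible DBMS, and the table `returns`, its columns and the function name are hypothetical.

```python
import sqlite3
from itertools import combinations_with_replacement


def covariance_in_db(conn, table, columns):
    """Covariance matrix (1/n version, as in the estimator above) from one query."""
    pairs = list(combinations_with_replacement(columns, 2))
    # Identifiers are interpolated directly, so they must come from a trusted source.
    select = [f"AVG({c})" for c in columns]
    select += [f"AVG({a} * {b})" for a, b in pairs]
    row = conn.execute(f"SELECT {', '.join(select)} FROM {table}").fetchone()

    means = dict(zip(columns, row[: len(columns)]))
    cross = dict(zip(pairs, row[len(columns):]))
    cov = {}
    for a, b in pairs:
        # mean of products minus product of means; the matrix is symmetric
        cov[(a, b)] = cov[(b, a)] = cross[(a, b)] - means[a] * means[b]
    return cov


# Toy data (hypothetical, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (a REAL, b REAL, c REAL)")
conn.executemany(
    "INSERT INTO returns VALUES (?, ?, ?)",
    [(1, 2, 1), (2, 4, 0), (3, 6, 1), (4, 8, 0)],
)

cov = covariance_in_db(conn, "returns", ["a", "b", "c"])
print(cov[("a", "b")])
```

The number of expressions in the SELECT list grows quadratically with the number of columns, so the query text itself becomes the scaling factor.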

Pure R processing
[Figure: x axis: dataset size (columns, 15 to 35); y-axis ticks from 0 to 60; series: Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

[Figure: x axis: dataset size (columns, 15 to 35); y-axis ticks from 0 to 60; series: Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

Model 3: VaR
calculate a quantile of the portfolio PnL
$V_p = \inf\{u : F(u) \ge 1 - p\}$
estimator:
$\hat{V}_p = X_{[n(1-p)]+1}$
in the DB: sorting
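A minimal sketch of letting the database do the sorting follows. It assumes that $X_{k}$ denotes the k-th smallest observation and that $[\cdot]$ is the integer part; Python and SQLite again stand in for R and an ODBC-accessible DBMS, and the table `pnl`, its column `value` and the function name are hypothetical.

```python
import sqlite3
from fractions import Fraction


def var_in_db(conn, p):
    """Return the ([n(1-p)] + 1)-th smallest PnL value, sorted by the database."""
    assert 0 < p < 1
    n = conn.execute("SELECT COUNT(*) FROM pnl").fetchone()[0]
    # Exact rational arithmetic avoids float surprises such as 10 * (1 - 0.9) < 1.
    k = int(n * (1 - Fraction(str(p))))
    # OFFSET k skips the k smallest rows, so the row returned is number k + 1.
    row = conn.execute(
        "SELECT value FROM pnl ORDER BY value LIMIT 1 OFFSET ?", (k,)
    ).fetchone()
    return row[0]


# Toy data (hypothetical): ten PnL values from -5 to 4, inserted unsorted.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pnl (value REAL)")
conn.executemany(
    "INSERT INTO pnl VALUES (?)",
    [(float(v),) for v in (3, -5, 0, 4, -2, 1, -4, 2, -1, -3)],
)

var_90 = var_in_db(conn, 0.9)
print(var_90)
```

A single value is transferred to the client; the full PnL column never leaves the database.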

Pure R processing
[Figure: x-axis ticks from 2,000,000 to 10,000,000, y-axis ticks from 0 to 100; series: Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

[Figure: x-axis ticks from 200,000 to 1,000,000, y-axis ticks from 0 to 100; series: Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]

Conclusion
1 with minimal effort, significant speedups are possible
2 ODBC as minimal requirement
3 extensions: parallel computing...
