Introduction
Computing in databases
Conclusion
Data coming in from the market:
1 liquid instrument (front-month DAX Future), 1 day, 1 exchange → 400 MB in pure ASCII
Konrad Banachewicz Computing near the data
different parameters → "clones" of the same instrument
{ exchanges } x { instruments } x { days }...
= A LOT
Computing in databases
Model 1: regression
Model 2: correlation
Model 3: VaR
Typical approach
read the data into memory
analyze there
save the results
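The three steps above might look like the following minimal sketch, with Python's standard-library `sqlite3` standing in for the database. The table `ticks`, its column `price` and the `results` table are hypothetical names chosen only for illustration.

```python
import sqlite3

def mean_price_in_memory(conn):
    """Typical approach: pull every row to the client, then analyze there."""
    # 1. read the data into memory (the whole column crosses the connection)
    prices = [row[0] for row in conn.execute("SELECT price FROM ticks")]
    # 2. analyze there
    mean = sum(prices) / len(prices)
    # 3. save the results
    conn.execute("CREATE TABLE IF NOT EXISTS results (name TEXT, value REAL)")
    conn.execute("INSERT INTO results VALUES (?, ?)", ("mean_price", mean))
    conn.commit()
    return mean

# Hypothetical toy data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (price REAL)")
conn.executemany("INSERT INTO ticks VALUES (?)", [(100.0,), (101.0,), (102.0,)])
mean_price = mean_price_in_memory(conn)
```

The cost of step 1 grows with the number of rows, even though the final result is a single number.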
In many cases what we really need is aggregate info:
Example: linear regression
classic estimator
$\hat{\beta} = (X^T X)^{-1} X^T y$
come to think of it, all we really need are sums, sums of squares and cross-products
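To spell this out (the notation is assumed here, not taken from the slides): write $x_i$ for the $i$-th row of $X$ as a column vector. Both ingredients of the estimator are then plain sums over rows, and for one regressor $z$ with an intercept they reduce to five scalars.

```latex
X^T X = \sum_{i=1}^{n} x_i x_i^T, \qquad X^T y = \sum_{i=1}^{n} x_i y_i

% One regressor z with an intercept, x_i = (1, z_i)^T:
X^T X = \begin{pmatrix} n & \sum_i z_i \\ \sum_i z_i & \sum_i z_i^2 \end{pmatrix},
\qquad
X^T y = \begin{pmatrix} \sum_i y_i \\ \sum_i z_i y_i \end{pmatrix}
```

Each sum can be accumulated one row at a time, so the full matrix $X$ never has to sit in memory.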
Two possible approaches:
1. Ripley and Chen: extra interface, pure R
2. R + SQL
Alternative
R (user) ⇄ DB
Two scenarios:
1. pure R processing
2. computations partially in DB
Model 1: regression
base model:
$Y_t = \beta_1 + \beta_2 X_t + \epsilon_t$
estimator:
$\hat{\beta} = (X^T X)^{-1} X^T Y$
in the DB: arithmetic operations on a limited set of columns
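A minimal sketch of the "partially in DB" variant for this model, using Python's standard-library `sqlite3` as a stand-in for the DBMS. The table `obs` and its columns `x` and `y` are hypothetical. The database computes five aggregates and the client only solves the 2x2 normal equations.

```python
import sqlite3

def ols_from_db_aggregates(conn, table="obs"):
    """Fit Y = b1 + b2*X using only five aggregates computed inside the DB."""
    n, sx, sy, sxx, sxy = conn.execute(
        f"SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM {table}"
    ).fetchone()
    # Closed-form solution of the 2x2 normal equations (X^T X) b = X^T Y
    b2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b1 = (sy - b2 * sx) / n
    return b1, b2

# Hypothetical toy data lying exactly on y = 1 + 2x
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(0, 1), (1, 3), (2, 5), (3, 7)])
b1, b2 = ols_from_db_aggregates(conn)
```

Only one row of five numbers travels from the database to the client, whatever the size of `obs`.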
Pure R processing
[Figure: Case study 1, method 1. Execution time (seconds, 0 to 30) against dataset size (number of rows, 200,000 to 1,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB
[Figure: Case study 1, method 2. Execution time (seconds, 0 to 30) against dataset size (number of rows, 200,000 to 1,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Model 2: correlation
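base model:
$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$
estimator:
$\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)$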
in the DB: large queries
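One way such a large query could be generated, sketched with Python's standard-library `sqlite3`. The table `asset_returns` and the column names are hypothetical. The query text grows quadratically with the number of columns, which is presumably what makes the queries large.

```python
import sqlite3
from itertools import combinations_with_replacement

def covariance_in_db(conn, table, columns):
    """Build one wide aggregate query and turn its single result row into covariances."""
    pairs = list(combinations_with_replacement(columns, 2))
    # One AVG per column plus one AVG per pair of columns
    select = [f"AVG({c})" for c in columns] + [f"AVG({a} * {b})" for a, b in pairs]
    row = conn.execute(f"SELECT {', '.join(select)} FROM {table}").fetchone()
    means = dict(zip(columns, row[: len(columns)]))
    cross = dict(zip(pairs, row[len(columns):]))
    # Cov(X, Y) = E[XY] - E[X]E[Y], with the 1/n normalization of the estimator
    return {(a, b): cross[(a, b)] - means[a] * means[b] for a, b in pairs}

# Hypothetical toy data with b = 2a
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_returns (a REAL, b REAL)")
conn.executemany("INSERT INTO asset_returns VALUES (?, ?)", [(1, 2), (2, 4), (3, 6)])
cov = covariance_in_db(conn, "asset_returns", ["a", "b"])
```

The difference of averages can lose precision when the means are large relative to the spread, so a production version may prefer centered sums.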
Pure R processing
[Figure: Case study 1, method 1. Execution time (seconds, 0 to 60) against dataset size (columns, 15 to 35) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB
[Figure: Case study 1, method 1. Execution time (seconds, 0 to 60) against dataset size (columns, 15 to 35) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Model 3: VaR
calculate a quantile of the portfolio PnL
$V_p = \inf\{u : F(u) \ge 1 - p\}$
estimator:
$\hat{V}_p = X_{[n(1-p)]+1}$
in the DB: sorting
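A minimal sketch of pushing the sort into the database, using Python's standard-library `sqlite3`. It assumes that $[\cdot]$ in the estimator denotes the integer part and that the subscript picks the corresponding element of the sorted sample. The table `pnl` and column `amount` are hypothetical.

```python
import math
import sqlite3

def var_in_db(conn, p, table="pnl", column="amount"):
    """Return the ([n(1-p)]+1)-th smallest value, letting the DB do the sorting."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    # [.] read as the integer part; a 0-based OFFSET of k selects the (k+1)-th smallest row
    k = math.floor(n * (1 - p))
    row = conn.execute(
        f"SELECT {column} FROM {table} ORDER BY {column} LIMIT 1 OFFSET ?", (k,)
    ).fetchone()
    return row[0]

# Hypothetical toy data: the values 1..10 in arbitrary order
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pnl (amount REAL)")
conn.executemany(
    "INSERT INTO pnl VALUES (?)", [(v,) for v in (7, 2, 9, 4, 1, 10, 3, 8, 6, 5)]
)
v = var_in_db(conn, p=0.75)  # n = 10, [10 * 0.25] + 1 = 3, so the third smallest value
```

Only a single value is returned to the client. Whether the sort is cheap depends on the engine and on whether the column is indexed.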
Pure R processing
[Figure: Case study 3, method 1. Execution time (seconds, 0 to 100) against dataset size (number of rows, 2,000,000 to 10,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Computations partially in DB
[Figure: Case study 3, method 2. Execution time (seconds, 0 to 100) against dataset size (number of rows, 200,000 to 1,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]
Conclusion
1. with minimal effort, significant speedups are possible
2. ODBC as the minimal requirement
3. extensions: parallel computing...