4. Use case 1
The figures are:
10 000 articles in total!
50% are books
50% are DVDs
50% of products are in English
50% of products are in French
So what is the fraction of rows when
the language is English and the product
is a DVD?
5. Use case 1
select * from test_orders where language='english' and product='books';
select * from test_orders where language='english' and product='DVD';
select * from test_orders where language='french' and product='DVD';
select * from test_orders where language='french' and product='books';
6. Use case 1 (Oracle)
So, for the optimizer, the estimation
of the fraction of 10000 rows when
querying both the language and the
product is simply:
P(language) x P(product) = ?
50% x 50% = 25%
Simple !!! 2500 rows !!!
But … WRONG !!!
7. Use case 1 (SQL Server)
So, for the optimizer, the estimation
of the fraction of 10000 rows when
querying both the language and the
product is simply:
P(language) x P(product) = ?
50% x 50% = 25%
Simple !!! 2500 rows !!!
8. Use case 1 (Oracle)
With Oracle, we can use extended
statistics or dynamic sampling to
solve this problem. We used
dynamic sampling in our example,
and the estimation is much better
for the small fraction (about 100
rows)
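As a rough sketch of what those two options can look like in Oracle (the exact statement forms are assumptions; only the test_orders table and its language/product columns come from the earlier queries):

-- Option 1 (assumed sketch): extended statistics on the column group,
-- so the optimizer learns the combined selectivity of (language, product).
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(user, 'TEST_ORDERS', '(LANGUAGE, PRODUCT)') FROM dual;
EXEC DBMS_STATS.GATHER_TABLE_STATS(user, 'TEST_ORDERS');

-- Option 2 (assumed sketch): let the optimizer sample the table at parse time.
SELECT /*+ dynamic_sampling(t 4) */ *
FROM test_orders t
WHERE language = 'english' AND product = 'DVD';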
9. Use case 1 (Oracle)
Much better estimation for the
big fraction as well (4900 rows)
10. Use case 1 (SQL Server)
Good estimation with a suitable
index (with a where clause!!!) for
the big fraction
11. Use case 1 (SQL Server)
Better estimation with a suitable
index (with a where clause!!!) for
the small fraction
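A minimal sketch of such a filtered index in SQL Server (the index name and filter predicate are assumptions; the slides only say it is an index with a where clause):

-- Assumed filtered index: statistics are built only for the filtered subset,
-- so estimates for that subset no longer rely on the independence assumption.
CREATE NONCLUSTERED INDEX ix_test_orders_english
ON dbo.test_orders (product)
WHERE language = 'english';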
12. Use case 1, one NoSQL example:
MongoDB
MongoDB always uses an index if
the index exists, in spite of the
good estimate
13. Use case 1, one NoSQL
example: MongoDB
Good estimate, but (most
probably) the wrong plan
14. Use case 1, one NoSQL example:
MongoDB
The solution shown is to use a
hint
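A minimal sketch in the mongo shell, assuming a test_orders collection and a compound index on (language, product):

// Assumed index; hint() forces MongoDB to use this specific plan.
db.test_orders.createIndex({ language: 1, product: 1 })
db.test_orders.find({ language: "english", product: "DVD" }).hint({ language: 1, product: 1 })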
15. The way SQL Server did it…
The histograms and statistics
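For example, that histogram and density information can be inspected with DBCC SHOW_STATISTICS (the statistics object name below is an assumption, reusing the filtered index sketched earlier):

-- Show the statistics header, density vector and histogram SQL Server relies on.
DBCC SHOW_STATISTICS ('dbo.test_orders', ix_test_orders_english);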
16. What could be a data scientist way of
thinking about this?
P(product), P(language), P(product) x
P(language) ???
We have dependent variables, so why not use
Bayes' theorem!
P(A|B) = P(B|A) * P(A) / P(B)
P(product|language) = P(language|product) *
P(product) / P(language)
P(DVD|french) = P(french|DVD) * P(DVD) / P(french)
17. What could be a data scientist way of
thinking about this?
P(DVD|french) = P(french|DVD) * P(DVD) / P(french)
P(french|DVD) = 10%, P(DVD) = 50%, P(french) = 50%
P(DVD|french) = 10% x 50% / 50% = 10%
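In practice those conditional fractions do not have to be guessed; a minimal sketch (same test_orders table as in the earlier queries) simply measures the joint distribution:

-- Measure the real joint distribution instead of assuming independence;
-- P(DVD|french) is then n(french, DVD) divided by n(french).
SELECT language, product, COUNT(*) AS n,
       100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS pct_of_total
FROM test_orders
GROUP BY language, product;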
18. Database optimizers and machine
learning ?
Mostly standard statistics are still used …
The DB2 intelligent optimizer, Oracle 20c: it's only a beginning.
So while waiting for optimizers to become more intelligent and fully use
machine learning ….
19. The classical way of thinking when
tuning
Oracle: adjust SGA, PGA, parallelism, create indexes, create materialized
views …
SQL Server: adjust parameters with sp_configure, adjust parallelism,
create/rebuild indexes
Etc. Every database has its own parameters to tune memory, disks, I/O,
CPUs …
Those techniques are of course still needed but….
If you start tuning by really understanding your data, understanding a)
cardinalities, b) correlation, c) dispersion and even d) causalities inside your
data, then…
20. …you will be able to tune almost every
database !!!
SQL or NoSQL !!!
All of them share similar principles, so once you learn them, you will be able to
tune them…
21. Lyticsware
Lyticsware is a young, innovative
company that can help you
tune your databases.
We are also a partner of Amazon
Web Services, and we are
helping our clients migrate
their databases / information
systems to cloud architectures.