Approximate Aggregates in Oracle 12c
Hong Su, Mohamed Zait, Vladimir Barrière, Joseph Torres, Andre Menck
CIKM, Oct 27, 2016, Indianapolis
Challenges
• Larger data volume
  • Traditional data sources: data warehouses grow from terabytes to petabytes
  • New data sources: Internet of Things (IoT), Internet of cars...
• Stricter response-time requirements
  • Real-time business intelligence
  • Visualization tools supporting explorative/interactive queries
  • Traditional applications on new pay-as-you-go cloud infrastructure
Approximate Results Are Acceptable
• Visualization tools: slightly inaccurate answers make no visual difference
• Analytic queries: summary reports only need to reveal patterns or trends
• Explorative queries: show low-resolution answers before drilling down
• Data science queries: early iterations of machine learning queries tolerate approximation
Hardware Solution vs. Software Solution
[Figure: the hardware solution scales the system up or out as data grows; the software solution keeps the same hardware and trades result exactness for response time by returning approximate answers]
Approximate Query Solutions
• Use samples
• Use summaries (sketches, wavelets…)
• Run more efficient operators
• Online aggregation
Oracle 12c Solution
• Of the four approaches above, Oracle 12c focuses on running more efficient operators, the approach that fits most easily into the existing database engine
Approximate SQL Operators in Oracle 12c
• Efficient alternatives for expensive aggregates
  • Exact: count(distinct), percentile/median
    • Memory usage grows with data size
  • Approximate: approx_count_distinct, approx_percentile
    • Memory usage bounded regardless of data size
    • Highly accurate (~2% error rate with high confidence)
    • Often an order of magnitude faster than the exact answer
• Good usability for easy adoption
  • Automatic conversion from exact to approximate aggregates (see the sketch below)
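
A minimal sketch of enabling the automatic conversion at the session level, assuming the Oracle 12c Release 2 parameter names approx_for_count_distinct and approx_for_percentile:

-- Transparently rewrite exact COUNT(DISTINCT ...) to APPROX_COUNT_DISTINCT
ALTER SESSION SET approx_for_count_distinct = TRUE;
-- Transparently rewrite exact percentile/median aggregates to approximate forms
ALTER SESSION SET approx_for_percentile = 'ALL';

-- An unchanged exact query now executes with approximate aggregates:
SELECT region, COUNT(DISTINCT household) FROM sales GROUP BY region;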
Approximate Count Distinct
• Based on the HyperLogLog probabilistic counting algorithm [FFG+07]
• Uses bounded memory (~4 KB) per group-by key
  • Optimized to use even less memory when the group is small
• Significantly reduces the chance of spilling to disk

SELECT APPROX_COUNT_DISTINCT(household),
       APPROX_COUNT_DISTINCT(creditcard)
FROM sales
GROUP BY region;

[FFG+07] HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Flajolet, Fusy, Gandouet, and Meunier. Proceedings of the 2007 International Conference on Analysis of Algorithms.
Approximate Percentile
• Offers both deterministic and non-deterministic alternatives
  • Deterministic: count-min sketch [C09]; results immune to data arrival order
  • Non-deterministic: RANDOM [WLYC13]; generally faster (the default)
• Bounded memory (~8 KB) per group-by key; 2% error rate with 95% confidence

SELECT APPROX_MEDIAN(volume [DETERMINISTIC])
FROM sales
GROUP BY region;

[WLYC13] Quantiles over data streams: an experimental study. Wang, Luo, Yi, and Cormode. SIGMOD 2013.
[C09] Count-min sketch. Cormode. Encyclopedia of Database Systems, 2009.
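
The median is just the 0.5 percentile; other percentiles are useful for spotting outliers. A hedged sketch, assuming the 12c APPROX_PERCENTILE inverse-distribution syntax, that reports the 99th-percentile sales volume per region:

SELECT region,
       APPROX_PERCENTILE(0.99) WITHIN GROUP (ORDER BY volume) AS volume_p99
FROM sales
GROUP BY region;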
Exact Aggregates Are Not Additive
[Figure: the sales table split into two household partitions, e.g., January and February]
• count(distinct household) on partition 1 = 200K
• count(distinct household) on partition 2 = 150K
• count(distinct household) on the union = ? The partial counts cannot be combined; the total can fall anywhere between 200K and 350K.
Approximate SQL Operators Are Additive
[Figure: one sketch (a hash table of hash-key/data-field pairs) per household partition; the per-partition sketches merge into a single sketch that yields the overall approximate count distinct]
• Highly parallelizable: partitions can be processed concurrently and their sketches merged in a final phase (see the SQL sketch below)
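
How the additive form surfaces in SQL, assuming the 12c Release 2 functions APPROX_COUNT_DISTINCT_DETAIL (returns the mergeable sketch), APPROX_COUNT_DISTINCT_AGG (merges sketches), and TO_APPROX_COUNT_DISTINCT (extracts the final count):

-- Build one mergeable sketch per time period, then merge the sketches
-- into a single overall distinct-household count
SELECT TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(hh_detail)) AS hh_count
FROM (SELECT APPROX_COUNT_DISTINCT_DETAIL(household) AS hh_detail
      FROM sales
      GROUP BY time);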
Materialized View Support
• Target scenario: OLAP rollup
• Solution: materialized view (MV) for approximate aggregates
  • Pre-compute and store approximate aggregates at low granularity
  • Roll up the approximate aggregates at query execution time
  • Refresh the MV with approximate aggregates incrementally
[Figure: sales cube over Time, Product, and Region dimensions; the product manager view and the regional manager view roll the same data up along different dimensions]
Materialized View Example

CREATE MATERIALIZED VIEW sales_volume_mv
ENABLE QUERY REWRITE AS
SELECT product, region, time,
       APPROX_PERCENTILE_DETAIL(volume) AS volume_detail
FROM sales
GROUP BY product, region, time;

User query:
SELECT product, APPROX_MEDIAN(volume) AS median_volume
FROM sales
GROUP BY product;

Query rewritten with the MV:
SELECT product, TO_APPROX_PERCENTILE(
       APPROX_PERCENTILE_AGG(volume_detail), 0.5)
FROM sales_volume_mv
GROUP BY product;
Materialized View Example (cont.)
The same MV (sales_volume_mv, defined above) answers a rollup along a different dimension.

User query:
SELECT region, APPROX_MEDIAN(volume) AS median_volume
FROM sales
GROUP BY region;

Query rewritten with the MV:
SELECT region, TO_APPROX_PERCENTILE(
       APPROX_PERCENTILE_AGG(volume_detail), 0.5)
FROM sales_volume_mv
GROUP BY region;
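
Because the detail sketches are additive, the MV can be refreshed incrementally rather than rebuilt: sketches are computed on the delta rows and merged with the stored sketches. A sketch using the standard DBMS_MVIEW API (fast refresh assumes a materialized view log exists on sales):

-- 'F' requests a fast (incremental) refresh of the MV
EXEC DBMS_MVIEW.REFRESH('sales_volume_mv', 'F');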
Experimental Results

Exact vs. Approximate: Performance
[Figure: elapsed time of a query with two count(distinct) aggregates, exact vs. approximate; the approximate version is 44X faster]
Exact vs. Approximate: Accuracy
• Synthetic data: 105 million rows, columns with different distributions (uniform, normal, Zipfian, …)

Column | Distinct values | Approx. count distinct error rate | Approx. percentile error rate (non-det.) | Approx. percentile error rate (det.)
C1     | 70K             | -1.53%                            | 0%                                       | -0.3%
C2     | 1M              | 3.44%                             | 0.3%                                     | 0.3%
C3     | 54M             | 1.16%                             | 0.3%                                     | 5%
C4     | 103M            | 1.68%                             | -0.1% ~ -1.2%                            | -2.1%
Exact vs. Approximate and Materialized Views
• Real customer data: 200M rows, 958 columns, 220 GB compressed
• Materialized view created on group-by keys ProductLevel1, …, ProductLevel4, Product

Group-by keys       | Exact percentile (CPU s) | Approx. percentile without MV (CPU s) | Approx. percentile with MV (CPU s)
ProductLevel1,2,3,4 | 1122                     | 390                                   | 3.8
ProductLevel1,2     | 846                      | 204                                   | 3.3
ProductLevel1       | 750                      | 168                                   | 3.3
No group by         | 450                      | 114                                   | 1.4
Summary
• Approximate query answering is a software solution to data analytics on large data volumes
• Approximate aggregates:
  • Use bounded memory
  • Are highly parallelizable
  • Can be rolled up
  • Support materialized view rewrite
  • Are highly accurate
  • Run much faster than their exact counterparts
• Substantial room for further work!


Editor's Notes

  • #3 First, what are the challenges we are trying to address? They have two aspects. First, data volume is larger than ever. Traditional data sources have been growing at an unprecedented rate: as of 2013, eBay maintained a 90-petabyte data warehouse of customer transactions and behaviors (8 PB in Teradata, 40 PB in Hadoop, and 40 PB on a custom system). On top of that, new data sources can be even bigger than traditional ones because the data is machine-generated, such as by the smart devices deployed in your house. Imagine when driverless cars hit the market: we will see a surge of data from the internet of cars. The second aspect is that data analytics applications pose stricter response-time requirements. Real-time business intelligence, such as application performance monitoring, dynamic pricing, or fraud detection, demands an immediate response rather than a postmortem one. Visualization tools let users issue explorative or interactive queries. Even traditional applications, now deployed on pay-as-you-go cloud infrastructure, have higher response-time requirements so as to reduce their cost. The combination of these two aspects makes the problem harder: how do you answer queries faster on a larger data volume?
  • #4 Fortunately, we have observed that in many applications approximate results are acceptable. For example, in visualization tools, slightly inaccurate answers make no visual difference. Analytic queries and summary reports care about patterns and trends: if you know a certain website has about a million distinct users per month, that is good enough to make a business decision; it does not matter whether the actual number is 1.01 million. With this in mind, we can seek a solution from a different angle.
  • #5 In the old days, you configured your hardware based on your data size. Now there is many times more data to process, so you can add more hardware to get performance similar to before, scaling the system up or out. That is a hardware solution. Alternatively, you can keep the same hardware but return approximate results instead of exact results; in other words, you trade query quality for faster response time. That turns an expensive hardware solution into a much more economical software solution.
  • #6 Different categories of techniques can be used for approximate query processing. In the first category, people extract samples from the original data, run the query on the sample, and extrapolate the results to the original data. In the second category, people create summaries of the original data, usually concise representations such as sketches, histograms, and wavelets. The third category provides more efficient operators: the same amount of input is consumed, but the operators are efficient enough to produce results faster. The fourth category uses a completely different execution model called online aggregation, which produces results as it reads the data. At the beginning, little data has been processed and the results carry a large error rate; as more data is processed, the results are refined and the error rate decreases. If you let the query run to completion, the results converge to the exact answer.
  • #7 So, what is the focus of our first release of approximate query features? We focus on providing more efficient aggregate operators, the approach that fits into the existing database engine most easily.
  • #8 More specifically, we provide alternatives for expensive aggregates. count(distinct) and percentile are two commonly used aggregates: count(distinct) can find the distinct users or IP addresses of a website, and percentile can find outliers, such as the value at the 99th or 1st percentile. Median is a special percentile, the value at the 50th percentile. Both aggregates are expensive: their memory consumption grows with the number of distinct values or the data size. When the data no longer fits in memory, it has to spill to disk, and if spilling happens often, performance takes a huge hit. In contrast, the approximate alternatives use bounded memory regardless of the data size.
  • #9 Now we show the syntax and prominent features of these approximate aggregates. This is an example of using approx_count_distinct; it can appear in any place that accepts count(distinct) except in window functions. This is a simple version: you can have multiple tables in the FROM clause and any other complicated constructs. approx_count_distinct is based on the state-of-the-art HyperLogLog probabilistic counting algorithm. Each aggregate operator uses about 4 KB of memory per group-by key, with a hash table as the core structure. We ran experiments on various data types and found that 4 KB is a good setting: it achieves a reasonably low error rate of about 2%, and settings above 4 KB give no observable accuracy improvement. When the group size is small, for example when few rows fall into a certain region, we optimize further so that memory usage can be even smaller than 4 KB. Because we use so little memory, we significantly reduce the chance of spilling to disk, and we relieve pressure on system memory so the system can run far more queries at the same time.
  • #10 Now let's see how we use approximate percentile. This query returns the median sales volume per region. A prominent feature of approximate percentile is that we provide two versions: one deterministic, the other non-deterministic. The deterministic version is guaranteed to return exactly the same results from run to run, regardless of data arrival order or whether the query executes in serial or in parallel. The non-deterministic version is generally faster and is the default; the deterministic version is often about 2X slower. Users have to decide which version fits their requirements best. In this query, the DETERMINISTIC keyword is optional.
  • #11 Besides bounded memory usage, approximate aggregates have another very important property: they are additive. What does that mean? Say we have a sales table divided into two parts, one storing the sales data for January and the other for February. You apply count(distinct) to January's data and find that 200K distinct households made purchases; similarly, you apply it to February's data and find 150K. Then what is the total number of distinct households? You can't tell, except that the number falls somewhere between 200K and 350K. We say such aggregates are not additive.
  • #12 In contrast, approximate SQL operators are additive. As mentioned, the core data structure of approx_count_distinct is a hash table, and the approx_count_distinct_detail function returns this hash table. You can apply approx_count_distinct_detail to January's data and get one hash table, apply the same function to February's data and get a second, and then merge the two into one. The merged hash table is the same as the one you would get by applying approx_count_distinct_detail directly to the whole sales table, and you can derive the approximate count distinct from it. This key property makes approximate aggregates highly parallelizable: you can process different parts of the data at the same time and merge them in a final phase.
  • #13 The additive property also makes an important feature possible: materialized views, mainly used in OLAP environments. In OLAP applications, users often request data along different dimensions. For example, a sales table can have three major dimensions: when the sale was made, what product was sold, and where it was sold. A product manager's view cares about the sales volume per product, while a regional manager's cares about the sales volume per region. In such cases, materialized views are very valuable: they precompute and store approximate aggregates at a lower granularity, then roll the approximate aggregates up to the higher granularity at query execution time.
  • #14 You aggregate the approximate-percentile internal structures that belong to the same product. Then, on top of that, you apply a scalar function to extract the value at the 50th percentile from the merged internal structure.
  • #15 Now a different query is issued by the regional manager, asking for the median sales volume per region. The query is different, but you can still use the same materialized view; this time you aggregate all the internal structures that belong to the same region. Because of the additive property, the materialized view can also be refreshed incrementally: when data changes, you don't have to recreate the materialized view from scratch. Instead, you compute the approximate-percentile internal structure on the delta changes and combine it with the existing internal structure.
  • #17 Three-stage plan when there are multiple count(distinct) aggregates: lower stage, partial duplicate elimination (PIV); intermediate stage, final duplicate elimination (TIV); final stage, aggregation (CIV). The first experiment showcases the performance gains of approximate aggregates over exact; the query has two count(distinct) aggregates.
  • #18 We test a variety of columns with different distributions.