3. Improving patient outcomes
LEADING HEALTH DATA | LEADING CONSUMER DATA
Lifestyle
• Magazine subscriptions
• Catalog purchases
Psychographics
• Animal lover
• Fisherman
Demographics
• Property records
• Internet transactions
• 300+M unique US patients
• 8+ years longitudinal data
• De-identified, HIPAA-safe
• 1st Party Data: proprietary tech to integrate data
• NPI Data: attributed to the patient
• Claims: ICD-9 or ICD-10, CPT, Rx, and J codes
• 300+M US Consumers
• 3,500+ consumer attributes
• De-identified, privacy-safe
Petabyte scale privacy-preserving ML/AI
13. root
|-- date: date
|-- generic: string
|-- brand: string
|-- product: string
|-- patient_id: long
|-- doctor_id: long
Demo system: COVID prescriptions
• Narrow sample
• 10.1 billion rows / 200Gb
• Small Spark 3.0 cluster
• 80 cores, 600Gb RAM
• Delta Lake, fully cached
14. select * from prescriptions
Brand name | Generic name | National Drug Code (NDC)
15. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescriptions
group by 1
order by 1
Count scripts, generics & brands by month
Time: 193 secs
Input: 10.1B rows / 1.1Gb
Shuffle: 75M rows / 2.3Gb
16. Pre-aggregate by generic & brand by month
create table prescription_counts_by_month
select
cast(date_trunc("month", date) as date) as date,
generic,
brand,
count(*) as scripts
from prescriptions
group by 1, 2, 3
17. select
date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescription_counts_by_month
group by 1
order by 1
Count scripts, generics & brands by month v2
Time: 9 secs (21x faster)
Input: 12M rows / 118Mb
Shuffle: 12M rows / 435Mb
18. Effects of pre-aggregation
• Row count reduced by 850x
• Shuffle size reduced by 5x
• Execution time reduced by 21x (would be 100x in RDBMS)
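Why pre-aggregation works for plain counts can be sketched in a few lines of Python (a toy analogue, not from the talk; the data is hypothetical): count(*) is additive, so partial counts per (month, generic) can be summed back into exact per-month totals, exactly what the v2 query does.

```python
from collections import Counter

# Raw fact rows: (month, generic) pairs. In the talk these are 10.1B
# prescription rows; here a toy sample.
rows = [("2020-03", "a"), ("2020-03", "a"), ("2020-03", "b"),
        ("2020-04", "a"), ("2020-04", "b"), ("2020-04", "b")]

# Pre-aggregate: one row per (month, generic) with a partial count.
pre = Counter(rows)

# Reaggregate: counts are additive, so summing partial counts per month
# reproduces the per-month script totals exactly -- no error introduced.
scripts_by_month = Counter()
for (month, _generic), n in pre.items():
    scripts_by_month[month] += n

assert scripts_by_month == Counter(month for month, _ in rows)
```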
19. high row reduction and small shuffles
are only possible when pre-aggregating
low cardinality dimensions
The curse of high cardinality
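A small illustration (hypothetical data) of why the curse bites: distinct counts are not additive, so per-group distinct counts cannot be reaggregated; summing them over-counts any patient who appears in more than one group.

```python
# Per-month sets of patient IDs; patients p2 and p3 appear in both months.
groups = {
    "2020-03": {"p1", "p2", "p3"},
    "2020-04": {"p2", "p3", "p4"},
}

# Summing per-group distinct counts over-counts the repeated patients...
sum_of_distincts = sum(len(s) for s in groups.values())   # 6

# ...while the true distinct count needs the raw values (or a sketch).
true_distinct = len(set().union(*groups.values()))        # 4

assert sum_of_distincts != true_distinct
```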
20. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(distinct patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Adding a high-cardinality distinct count
Time: 464 secs :(
Input: 10.1B rows / 112Gb
Shuffle: 8.8B rows / 147Gb
22. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
approx_count_distinct(patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Approximate counting, default 5% error
Time: 227 secs (2x faster)
Input: 10.1B rows / 112Gb
Shuffle: 75M rows / 2.8Gb
23. Effects of approx_count_distinct()
• Row count remains the same (big problem)
• Shuffle size reduced by 53x (shuffle HyperLogLog sketches!)
• Execution time reduced by 2x (not good enough)
24. 1. Pre-aggregate: get big row count reductions
Create a HyperLogLog (HLL) sketch from data for distinct counts
2. Reaggregate: get big shuffle size reductions
Merge HLL sketches (into HLL sketches)
3. Present
Compute cardinality of HLL sketches
spark-alchemy to the rescue
https://github.com/swoop-inc/spark-alchemy
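The three steps above can be sketched in plain Python with a toy HyperLogLog (illustrative only; spark-alchemy's actual register encoding and bias corrections differ in detail). A sketch supports add, a lossless register-wise-max merge, and a cardinality estimate, mirroring hll_init_agg, hll_merge, and hll_cardinality.

```python
import hashlib
import math

class HLLSketch:
    """Toy HyperLogLog: add / merge / cardinality, the three operations
    spark-alchemy exposes as hll_init_agg / hll_merge / hll_cardinality."""

    def __init__(self, p=12):
        self.p = p              # 2^p registers; std error ~1.04 / sqrt(2^p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        w = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        if self.registers[idx] < rank:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise max: merging sketches adds no estimation error.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:           # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

# 1. Pre-aggregate: one sketch per group.  2. Reaggregate: merge sketches.
# 3. Present: compute cardinality of the merged sketch.
a, b = HLLSketch(), HLLSketch()
for patient in range(6_000):
    a.add(patient)
for patient in range(4_000, 10_000):
    b.add(patient)
a.merge(b)
estimate = a.cardinality()   # ~10,000 distinct patients across both groups
```

Because merging is just a register-wise max, sketches can be merged in any order and at any level of the aggregation tree, which is what makes the pre-aggregate/reaggregate/present split possible.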
26. Pre-aggregate with HLL sketches
create table prescription_counts_by_month_hll
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts,
hll_init_agg(patient_id) as patient_ids
from prescriptions
group by 1, 2, 3
https://github.com/swoop-inc/spark-alchemy
27. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_merge(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
Reaggregate and present with HLL sketches
Time: 7 secs (66x faster)
Input: 12M rows / 12Gb
Shuffle: 12M rows / 430Mb
28. Effects of spark-alchemy pre-aggregation
• Row count reduced by 850x
• Shuffle size reduced by 340x
• Execution time reduced by 66x (in RDBMS, <1 sec)
https://github.com/swoop-inc/spark-alchemy
30. • Better privacy
• HLL sketches contain no identifiable information
• Unions across columns
• No added error
• Intersections across columns
• Use inclusion/exclusion principle; increases estimate error
• High-performance interactive analytics
• Pre-aggregate in Spark, push to Postgres / Citus, reaggregate there
Other spark-alchemy HLL benefits
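The union/intersection distinction above can be shown with exact sets (a toy illustration, hypothetical data): unions merge directly with no added error, while intersections must be derived via inclusion/exclusion, so with sketches each estimated term contributes its own error.

```python
# Two overlapping patient populations.
a = set(range(0, 700))        # 700 patients
b = set(range(400, 1000))     # 600 patients

# Union merges directly (for HLL sketches: register-wise max, no added error).
union = len(a | b)                        # 1000

# Intersection via inclusion/exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
# With sketches, each term is an estimate, so the errors compound.
intersection = len(a) + len(b) - union    # 300

assert intersection == len(a & b)
```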
31. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_union_agg(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
spark-alchemy Postgres/Citus interop
https://github.com/citusdata/postgresql-hll
32. • Experiment with the HLL functions in spark-alchemy.
• Keep big data in Spark only and interop with HLL sketches.
Do you want to make Spark great while improving millions of lives?
Let’s talk.
Calls to Action
sim at swoop dot com