3. Improving patient outcomes
LEADING HEALTH DATA | LEADING CONSUMER DATA
Lifestyle
• Magazine subscriptions
• Catalog purchases
Psychographics
• Animal lover
• Fisherman
Demographics
• Property records
• Internet transactions
• 300+M unique US patients
• 8+ years longitudinal data
• De-identified, HIPAA-safe
• 1st Party Data: proprietary tech to integrate data
• NPI Data: attributed to the patient
• Claims: ICD-9 or ICD-10, CPT, Rx, and J codes
• 300+M US Consumers
• 3,500+ consumer attributes
• De-identified, privacy-safe
Petabyte scale privacy-preserving ML/AI
13. root
|-- date: date
|-- generic: string
|-- brand: string
|-- product: string
|-- patient_id: long
|-- doctor_id: long
Demo system: COVID prescriptions
• Narrow sample
• 10.1 billion rows / 200Gb
• Small Spark 3.0 cluster
• 80 cores, 600Gb RAM
• Delta Lake, fully cached
14. select * from prescriptions
Brand name | Generic name | National Drug Code (NDC)
15. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescriptions
group by 1
order by 1
Count scripts, generics & brands by month
Time: 193 secs
Input: 10.1B rows / 1.1Gb
Shuffle: 75M rows / 2.3Gb
16. Pre-aggregate by generic & brand by month
create table prescription_counts_by_month
select
cast(date_trunc("month", date) as date) as date,
generic,
brand,
count(*) as scripts
from prescriptions
group by 1, 2, 3
17. select
date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(*) as scripts
from prescription_counts_by_month
group by 1
order by 1
Count scripts, generics & brands by month v2
Time: 9 secs (21x faster)
Input: 12M rows / 118Mb
Shuffle: 12M rows / 435Mb
18. Effects of pre-aggregation
• Row count reduced by 850x
• Shuffle size reduced by 5x
• Execution time reduced by 21x (would be 100x in RDBMS)
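Why pre-aggregation works for plain counts can be sketched in a few lines of Python (a toy analogue, not from the talk; the data is hypothetical): count(*) is additive, so partial counts per (month, generic) can be summed back into exact per-month totals, exactly what the v2 query does.

```python
from collections import Counter

# Raw fact rows: (month, generic) pairs. In the talk these are 10.1B
# prescription rows; here a toy sample.
rows = [("2020-03", "a"), ("2020-03", "a"), ("2020-03", "b"),
        ("2020-04", "a"), ("2020-04", "b"), ("2020-04", "b")]

# Pre-aggregate: one row per (month, generic) with a partial count.
pre = Counter(rows)

# Reaggregate: counts are additive, so summing partial counts per month
# reproduces the per-month script totals exactly -- no error introduced.
scripts_by_month = Counter()
for (month, _generic), n in pre.items():
    scripts_by_month[month] += n

assert scripts_by_month == Counter(month for month, _ in rows)
```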
19. high row reduction and small shuffles
are only possible when pre-aggregating
low cardinality dimensions
The curse of high cardinality
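A small illustration (hypothetical data) of why the curse bites: distinct counts are not additive, so per-group distinct counts cannot be reaggregated; summing them over-counts any patient who appears in more than one group.

```python
# Per-month sets of patient IDs; patients p2 and p3 appear in both months.
groups = {
    "2020-03": {"p1", "p2", "p3"},
    "2020-04": {"p2", "p3", "p4"},
}

# Summing per-group distinct counts over-counts the repeated patients...
sum_of_distincts = sum(len(s) for s in groups.values())   # 6

# ...while the true distinct count needs the raw values (or a sketch).
true_distinct = len(set().union(*groups.values()))        # 4

assert sum_of_distincts != true_distinct
```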
20. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
count(distinct patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Adding a high-cardinality distinct count
Time: 464 secs :(
Input: 10.1B rows / 112Gb
Shuffle: 8.8B rows / 147Gb
22. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
approx_count_distinct(patient_id) as patients,
count(*) as scripts
from prescriptions
group by 1
order by 1
Approximate counting, default 5% error
Time: 227 secs (2x faster)
Input: 10.1B rows / 112Gb
Shuffle: 75M rows / 2.8Gb
23. Effects of approx_count_distinct()
• Row count remains the same (big problem)
• Shuffle size reduced by 53x (shuffle HyperLogLog sketches!)
• Execution time reduced by 2x (not good enough)
24. 1. Pre-aggregate: get big row count reductions
Create a HyperLogLog (HLL) sketch from data for distinct counts
2. Reaggregate: get big shuffle size reductions
Merge HLL sketches (into HLL sketches)
3. Present
Compute cardinality of HLL sketches
spark-alchemy to the rescue
https://github.com/swoop-inc/spark-alchemy
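The three steps above can be sketched in plain Python with a toy HyperLogLog (illustrative only; spark-alchemy's actual register encoding and bias corrections differ in detail). A sketch supports add, a lossless register-wise-max merge, and a cardinality estimate, mirroring hll_init_agg, hll_merge, and hll_cardinality.

```python
import hashlib
import math

class HLLSketch:
    """Toy HyperLogLog: add / merge / cardinality, the three operations
    spark-alchemy exposes as hll_init_agg / hll_merge / hll_cardinality."""

    def __init__(self, p=12):
        self.p = p              # 2^p registers; std error ~1.04 / sqrt(2^p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # 64-bit hash: first p bits pick a register, the rest give the rank.
        h = int.from_bytes(hashlib.sha256(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        w = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - w.bit_length() + 1   # leading zeros + 1
        if self.registers[idx] < rank:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise max: merging sketches adds no estimation error.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:           # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

# 1. Pre-aggregate: one sketch per group.  2. Reaggregate: merge sketches.
# 3. Present: compute cardinality of the merged sketch.
a, b = HLLSketch(), HLLSketch()
for patient in range(6_000):
    a.add(patient)
for patient in range(4_000, 10_000):
    b.add(patient)
a.merge(b)
estimate = a.cardinality()   # ~10,000 distinct patients across both groups
```

Because merging is just a register-wise max, sketches can be merged in any order and at any level of the aggregation tree, which is what makes the pre-aggregate/reaggregate/present split possible.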
26. Pre-aggregate with HLL sketches
create table prescription_counts_by_month_hll
select
to_date(date_trunc("month", date)) as date,
generic,
brand,
count(*) as scripts,
hll_init_agg(patient_id) as patient_ids
from prescriptions
group by 1, 2, 3
https://github.com/swoop-inc/spark-alchemy
27. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_merge(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
Reaggregate and present with HLL sketches
Time: 7 secs (66x faster)
Input: 12M rows / 12Gb
Shuffle: 12M rows / 430Mb
28. Effects of spark-alchemy pre-aggregation
• Row count reduced by 850x
• Shuffle size reduced by 340x
• Execution time reduced by 66x (in RDBMS, <1 sec)
https://github.com/swoop-inc/spark-alchemy
30. • Better privacy
• HLL sketches contain no identifiable information
• Unions across columns
• No added error
• Intersections across columns
• Use inclusion/exclusion principle; increases estimate error
• High-performance interactive analytics
• Pre-aggregate in Spark, push to Postgres / Citus, reaggregate there
Other spark-alchemy HLL benefits
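The union/intersection distinction above can be shown with exact sets (a toy illustration, hypothetical data): unions merge directly with no added error, while intersections must be derived via inclusion/exclusion, so with sketches each estimated term contributes its own error.

```python
# Two overlapping patient populations.
a = set(range(0, 700))        # 700 patients
b = set(range(400, 1000))     # 600 patients

# Union merges directly (for HLL sketches: register-wise max, no added error).
union = len(a | b)                        # 1000

# Intersection via inclusion/exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.
# With sketches, each term is an estimate, so the errors compound.
intersection = len(a) + len(b) - union    # 300

assert intersection == len(a & b)
```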
31. select
cast(date_trunc("month", date) as date) as date,
count(distinct generic) as generics,
count(distinct brand) as brands,
hll_cardinality(hll_union_agg(patient_ids)) as patients,
count(*) as scripts
from prescription_counts_by_month_hll
group by 1
order by 1
spark-alchemy Postgres/Citus interop
https://github.com/citusdata/postgresql-hll
32. • Experiment with the HLL functions in spark-alchemy.
• Keep big data in Spark only and interop with HLL sketches.
Do you want to make Spark great while improving millions of lives?
Let’s talk.
Calls to Action
sim at swoop dot com