High-Performance Advanced Analytics with Spark-Alchemy

Pre-aggregation is a powerful analytics technique as long as the measures being computed are reaggregable. Counts reaggregate with SUM, minimums with MIN, maximums with MAX, etc. The odd one out is distinct counts, which are not reaggregable.
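For instance, counts preaggregated by day reaggregate to any coarser grain with a simple SUM, while distinct counts do not; the table and column names below are illustrative only, not from the talk:

    -- Illustrative schema. Preaggregate once: one row per day.
    create table daily_visits as
    select to_date(ts) as visit_date, count(*) as visits
    from raw_events
    group by 1;

    -- Reaggregate many times: monthly counts are plain SUMs of the daily rows.
    select date_trunc('month', visit_date) as month, sum(visits) as visits
    from daily_visits
    group by 1;

    -- There is no analogous trick for count(distinct user_id): summing daily
    -- distinct counts over-counts users who are active on more than one day.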

Traditionally, the non-reaggregability of distinct counts leads to an implicit restriction: whichever system computes distinct counts has to have access to the most granular data and touch every row at query time. Because of this, in typical analytics architectures, where fast query response times are required, raw data has to be duplicated between Spark and another system such as an RDBMS. This talk is for everyone who computes or consumes distinct counts and for everyone who doesn’t understand the magical power of HyperLogLog (HLL) sketches.

We will break through the limits of traditional analytics architectures using the advanced HLL functionality and cross-system interoperability of the spark-alchemy open-source library, whose capabilities go beyond what is possible with OSS Spark, Redshift or even BigQuery. We will uncover patterns for 1000x gains in analytic query performance without data duplication and with significantly less capacity.

We will explore real-world use cases from Swoop’s petabyte-scale systems, improve data privacy when running analytics over sensitive data, and even see how a real-time analytics frontend running in a browser can be provisioned with data directly from Spark.

High-Performance Advanced Analytics with Spark-Alchemy

  1. High-Performance Analytics with spark-alchemy
     Sim Simeonov, Founder & CTO, Swoop
     @simeons / sim at swoop dot com
  2. Improving patient outcomes: petabyte-scale, privacy-preserving ML/AI
     Leading health data: 280M unique US patients; 7 years of longitudinal
     data; de-identified, HIPAA-safe; claims (ICD 9 or 10, CPT, Rx and J
     codes) attributed to the patient via NPI data.
     Leading consumer data: 300M US consumers; 3,500+ consumer attributes;
     de-identified, privacy-safe. Lifestyle (magazine subscriptions, catalog
     purchases), psychographics (animal lover, fisherman), demographics,
     property records, internet transactions.
     1st-party data integrated with proprietary technology.
  3. http://bit.ly/spark-records
  4. http://bit.ly/spark-alchemy
  5. The key to high-performance analytics: process fewer rows of data.
  6. The most important attribute of a high-performance analytics system is
     the reaggregatability of its data.
  7. count(distinct …) is the bane of high-performance analytics because it
     is not reaggregatable.
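     A minimal illustration of why, using a hypothetical two-row dataset: the
     same patient appearing in two preaggregated groups is counted once per
     group, so summing per-group distinct counts over-counts.

       -- illustrative data, not from the talk
       with scripts as (
         select * from values ('2018-01', 42), ('2018-02', 42)
           as t(month, patient_id)
       )
       select month, count(distinct patient_id) as patients
       from scripts
       group by month
       -- Jan -> 1, Feb -> 1; the sum is 2, but the true distinct count is 1.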
  8. Reaggregatability
  9. Demo system: prescriptions in 2018

       root
        |-- date: date
        |-- generic: string
        |-- brand: string
        |-- product: string
        |-- patient_id: long
        |-- doctor_id: long

     • Narrow sample
     • 10.7 billion rows / 150Gb
     • Small-ish Spark 2.4 cluster: 80 cores, 600Gb RAM
     • Delta Lake, fully cached
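     To follow along without the 10.7B-row dataset, a tiny table with the
     same schema can be synthesized; this stand-in is purely illustrative and
     not part of the talk:

       -- hypothetical toy stand-in for the demo dataset, same columns
       create table prescriptions as
       select date_add(to_date('2018-01-01'), cast(rand() * 364 as int)) as date,
              concat('g', cast(rand() * 7000 as int))   as generic,
              concat('b', cast(rand() * 20000 as int))  as brand,
              concat('p', cast(rand() * 350000 as int)) as product,
              cast(rand() * 300000000 as long) as patient_id,
              cast(rand() * 1000000 as long)   as doctor_id
       from range(1000000)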
  10. select * from prescriptions
      (sample rows shown on the slide, annotated with generic name, brand
      name, and National Drug Code (NDC))
  11. Count scripts, generics & brands by month

        select to_date(date_trunc("month", date)) as date,
               count(distinct generic) as generics,
               count(distinct brand) as brands,
               count(*) as scripts
        from prescriptions
        group by 1
        order by 1

      Time: 145 secs. Input: 10.7B rows / 10Gb. Shuffle: 39M rows / 1Gb.
  12. Divide & conquer: decompose aggregate(…) into
      reaggregate(preaggregate(…)). Preaggregate once; reaggregate many times.
  13. Preaggregate by generic & brand by month

        create table prescription_counts_by_month as
        select to_date(date_trunc("month", date)) as date,
               generic,
               brand,
               count(*) as scripts
        from prescriptions
        group by 1, 2, 3
  14. Count scripts, generics & brands by month v2

        select to_date(date_trunc("month", date)) as date,
               count(distinct generic) as generics,
               count(distinct brand) as brands,
               count(*) as scripts
        from prescription_counts_by_month
        group by 1
        order by 1

      Time: 3 secs (50x faster). Input: 2.6M rows / 100Mb. Shuffle: 2.6M rows / 100Mb.
  15. Only 50x faster because of job startup cost

        select *, raw_count / agg_count as row_reduction
        from (select count(*) as raw_count from prescriptions)
        cross join (select count(*) as agg_count from prescription_counts_by_month)
  16. The curse of high cardinality (1 of 2): high row reduction is only
      possible when preaggregating low-cardinality dimensions, such as
      generic (7K) and brand (20K), but not product (350K) or patient_id (300+M).
  17. The curse of high cardinality (2 of 2): small shuffles are only
      possible with low-cardinality count(distinct …).
  18. Adding a high-cardinality distinct count

        select to_date(date_trunc("month", date)) as date,
               count(distinct generic) as generics,
               count(distinct brand) as brands,
               count(distinct patient_id) as patients,
               count(*) as scripts
        from prescriptions
        group by 1
        order by 1

      Time: 370 secs :( Input: 10.7B rows / 21Gb. Shuffle: 7.5B rows / 102Gb.
  19. Maybe approximate counting can help?
  20. Approximate counting, default 5% error

        select to_date(date_trunc("month", date)) as date,
               approx_count_distinct(generic) as generics,
               approx_count_distinct(brand) as brands,
               approx_count_distinct(patient_id) as patients,
               count(*) as scripts
        from prescriptions
        group by 1
        order by 1

      Time: 120 secs (3x faster). Input: 10.7B rows / 21Gb. Shuffle: 6K rows / 7Mb.
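      Spark's approx_count_distinct also accepts an explicit maximum relative
      error, trading a larger internal HLL++ sketch for tighter estimates in
      the same single pass:

        -- target 1% relative error instead of the default 5%
        select approx_count_distinct(patient_id, 0.01) as patients
        from prescriptions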
  21. 3x faster is not good enough: approx_count_distinct() still has to
      look at every row of data.
  22. How do we preaggregate high-cardinality data to compute distinct counts?
  23. Divide & conquer using HyperLogLog
      1. Preaggregate: create an HLL sketch from the data for distinct counts.
      2. Reaggregate: merge HLL sketches (into HLL sketches).
      3. Present: compute the cardinality of HLL sketches.
  24. HLL in spark-alchemy: https://github.com/swoop-inc/spark-alchemy
  25. Preaggregate with HLL sketches

        create table prescription_counts_by_month_hll as
        select to_date(date_trunc("month", date)) as date,
               generic,
               brand,
               count(*) as scripts,
               hll_init_agg(patient_id) as patient_ids
        from prescriptions
        group by 1, 2, 3
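      spark-alchemy's HLL functions also appear to accept a target relative
      error as a second argument; treat the exact signature as an assumption
      and check the library's docs. Note that sketches can generally only be
      merged with sketches created at the same precision:

        -- assumes hll_init_agg(column, relativeSD); verify against the docs
        create table prescription_counts_by_month_hll_1pct as
        select to_date(date_trunc("month", date)) as date,
               generic,
               brand,
               count(*) as scripts,
               hll_init_agg(patient_id, 0.01) as patient_ids
        from prescriptions
        group by 1, 2, 3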
  26. Reaggregate and present with HLL sketches

        select to_date(date_trunc("month", date)) as date,
               count(distinct generic) as generics,
               count(distinct brand) as brands,
               hll_cardinality(hll_merge(patient_ids)) as patients,
               count(*) as scripts
        from prescription_counts_by_month_hll
        group by 1
        order by 1

      Time: 7 secs (50x faster). Input: 2.6M rows / 200Mb. Shuffle: 2.6M rows / 100Mb.
  27. the intuition behind HyperLogLog
  28. Distribute n items randomly in k buckets. Within each bucket (rescaled
      to the full range), E(distance) ≅ k/n and E(min) ≅ k/n, so
      n ≅ k / E(min). More buckets == greater precision.
  29. HLL sketch ≅ a distribution of mins (the chart on the slide marks the
      true mean).
  30. HyperLogLog sketches are reaggregatable because min reaggregates with min.
  31. Making it work in the real world
      • Data is not uniformly distributed… Hash it!
      • How do we get many “samples” from one set of hashes? Partition them!
      • Can we get a good estimate for the mean? Yes, with some fancy math &
        empirical corrections.
      • Do we actually have to keep the minimums? No, just keep the number of
        0s before the first 1 in binary form.
      https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
  32. my boss wants me to count precisely
  33. Sketch sizes affect estimation errors
  34. spark-alchemy & HLL interoperability: hll_convert(hll_sketch, from, to)
      • ClearSpring HLL++ (https://github.com/addthis/stream-lib): no known
        interoperability
      • Neustar (Aggregate Knowledge) HLL
        (https://github.com/aggregateknowledge/java-hll): Postgres &
        JavaScript interop
      • BigQuery HLL++ (https://github.com/google/zetasketch): BigQuery
        interop (PRs welcome :)
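      A sketch of how conversion could slot into the presentation step; the
      from/to identifiers below are placeholders, not the library's actual
      constants:

        -- hypothetical: re-encode merged sketches for a Postgres-compatible HLL
        select date,
               hll_convert(hll_merge(patient_ids), 'spark', 'postgres') as patient_ids_pg
        from prescription_counts_by_month_hll
        group by date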
  35. Other benefits of using HLL sketches
      • High-performance interactive analytics: preaggregate in Spark, push
        to Postgres / Citus, reaggregate there
      • Better privacy: HLL sketches contain no identifiable information
      • Unions across columns: no added error
      • Intersections across columns: use the inclusion/exclusion principle;
        increases estimate error
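      On the Postgres side this works because the postgresql-hll extension
      reaggregates sketches natively; a sketch of the serving query, with the
      table and column names assumed to match the Spark output:

        -- Postgres: hll_union_agg merges sketches, # returns cardinality
        select date_trunc('month', date) as month,
               # hll_union_agg(patient_ids) as patients
        from prescription_counts_hll
        group by 1
        order by 1;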
  36. Calls to action
      • Experiment with the HLL functions in spark-alchemy
      • Can you keep big data in Spark only and interop with HLL sketches?
      • We’d welcome a PR that adds BigQuery support to spark-alchemy
      • Last but not least, do you want to build tools to make Spark great
        while improving the lives of millions of patients?
      sim at swoop dot com
