Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

A Production Quality Sketching Library for the Analysis of Big Data


Published on

In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

A Production Quality Sketching Library for the Analysis of Big Data

  1. 1. A Production Quality Sketching Library for the Analysis of Big Data Lee Rhodes Distinguished Architect, Verizon Media (Yahoo), Inc.
  2. 2. Outline Problematic Queries of Big Data Where traditional analysis methods don’t work well Approximate Analysis Using Sketches How using stochastic processes and probabilistic analysis wins in a systems architecture context The Open Source Apache DataSketches Library A quick overview of this unique library dedicated to production systems that process big data
  3. 3. The Data Analysis Challenge … … Analyze This Data in Near-Real Time Time Stamp User ID Device ID Site Time Spent Sec Items Viewed 9:00 AM U1 D1 Apps 59 5 9:30 AM U2 D2 Apps 179 15 10:00 AM U3 D3 Music 29 3 1:00 PM U1 D4 Music 89 10 Billions of Rows or Key, Value Pairs … Example: Web Site Logs
  4. 4. Some Very Common Queries … Unique Identifiers with Set Expressions: (AUB) ∩ (CUD) E Quantiles, CDFs Histograms, PMFs Frequent Items / Heavy Hitters Vector & Matrix Operations: SVD, etc. Graph Analysis Uniform Weighted Reservoir Sampling 5 ⋯ 2 ⋮ ⋱ ⋮ 4 ⋯ 3 All Non-Additive
  5. 5. When The Data Gets Large Or Resources Are Limited, All Of These Queries Become Problematic, Because the Aggregations are Non-Additive. A B Result A Result B + Combined Result Current Result New Item Updated Result+
  6. 6. Col1, ..., Item, ..., Coln … Billions of rows ... Ω(u) size: ~ Big Data Exact Results Difficult Query Local Item Copies Query processing often requires sorting …which is very slow. Query Engine Traditional, Exact Analysis Methods Require Local Copies Note: Micro-batch “Streaming Platforms”, e.g., Storm Do Not Solve The Fundamental Problem! Big Data
  7. 7. Parallelization Does Not Help Much Because of Non-Additivity. You have to keep the copies somewhere! Col1, ..., Item, ..., Coln … Billions of rows ... 𝝨 Exact Results Expensive Shuffle Copy Copy Copy Copy Copy Copy Example: Map-Reduce
  8. 8. Every dataset is processed N times for a rolling N-day window! Traditional, Exact Time Windowing Requires Multiple Touches of Every Item
  9. 9. Let’s challenge a fundamental premise: … that our results must be exact! If we can allow for approximation, along with some accuracy guarantees, we can achieve orders-of-magnitude improvement in • speed and • reduction of resources.
  10. 10. Introducing the Sketch (a.k.a., Stochastic Streaming Algorithm) Stream Processor Data Structure Size = f(k) Query Processor Stochastic Process Merge / Set Operations Results +/- ε 𝛆 = f(1/k) Probabilistic Analysis Result Sketch Sketch Stream Data Stream Query Model the Problem as a Stochastic Process Analyze using Probability & Statistics Random Selection Sizing, Storing
  11. 11. How & Why Sketches Achieve Superior Performance For Systems Processing Massive Data
  12. 12. Major Sketch Properties • Small Stored Size • Sub-linear in Space • Single-pass, “One-Touch” • Data Insensitive • Mergeable • Approximate, Probabilistic • Mathematically Proven Error Properties Sub-linear Stream Size Linear SketchSize Sketching Sampling Sketches Overlap with Sampling Based on the Specific Sketch
  13. 13. Win #1: Small Query Space O(k) size: ~ Kilobytes Approximate Answer ± ε Difficult Query Minimal or no sorting required! Ideal for Streaming & Batch Query Engine Sketch Col1, ..., Item, ..., Coln … Billions of rows ... Sketches Start Small Sublinear Means they Stay Small Single Pass Simplifies Processing
  14. 14. Win #2: Mergeability Sketch Approx. ± ε Sketch Sketch … , Item … many rows ... … , Item … many rows ... Query Query Partitions Merge Mergeability Enables Parallelism … With No Additional Loss of Accuracy! Sketches Transform Non-Additive Metrics Into Additive Objects The Result of a Sketch Merge is Another Sketch … Enabling Set Expressions for Selected Sketches Col1, ..., Item, ..., Coln … Billions of rows ...
  15. 15. Intermediate Hyper-Cube Staging Enables Query Speed Additivity Enables Simpler Architecture Win 3: Near-Real Time Query Speed Win 4: Simpler Architecture
  16. 16. Win #5: Simplified Time Windowing Plus Late Data Processing
  17. 17. Near-Real time Results, with History!
  18. 18. Win #6: Lower System Cost ($) Case Study: Real-time Flurry, Before and After • Customers: >250K Mobile App Developers • Data: 40-50 TB per day • Platform: 2 clusters X 80 Nodes = 160 Nodes – Node: 24 CPUs, 250GB RAM Before Sketches After Sketches VCS* / Mo. ~80B ~20B Result Freshness Daily: 2 to 8 hours; Weekly: ~3 days Real-time Unique Counts Not Feasible 15 seconds! Big Wins! Near-Real Time Lower System $ * VCS: Virtual Core Seconds
  19. 19. Introducing
  20. 20. The DataSketches Team Core Team / Committers • Lee Rhodes, Distinguished Architect, Yahoo/VM*. Started internal DataSketches project 2012 • Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015 • Jon Malkin, Ph.D., Research Engineer, Developer, Yahoo/VM, joined 2016 • Edo Liberty, Ph.D., Founder, HyperCube Technologies. Joined 2015 • Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science. Joined 2015 • Roman Leventov, Systems Developer for Apache Druid, Metamarkets, joined 2018 • Eshcar Hillel, Ph.D., Sr Scientist, Yahoo/VM, Israel, joined 2018 Extended Team & Consultants • Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017 • Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019 • Daniel Ting, Ph.D., Sr Scientist, Tableau / Salesforce, joined 2019 … And our Community is Growing! * VM = Verizon Media
  21. 21. Our Mission… Combine Deep Science with Exceptional Engineering To Develop Production Quality Sketches That Address These Difficult Queries
  22. 22. Cardinality, 4 Families • HLL (on/off Heap): A very high performing implementation of this well-known sketch • CPC: The best accuracy per space • Theta Sketches: Set Expressions (e.g., Union, Intersection, Difference), on/off Heap • Tuple Sketches: Generic, Associative Theta Sketches, multiple derived sketches: Quantiles Sketches, 2 Families • Quantiles, Histograms, PMF’s and CDF’s of streams of comparable objects, on/off Heap. KLL, highly optimized for accuracy-space. • Relative Error Quantiles (under development) Frequent Items (Heavy-Hitters) Sketches, 2 Families • Frequent Items: Weighted or Unweighted • Frequent Directions: Approximate SVD (a Vector Sketch) Sampling: Reservoir and VarOpt (Edith Cohen) Sketches, 2 Families • Uniform and weighted sampling to fixed-k sized buckets Specialty Sketches • Customer Engagement, Frequent Distinct Tuples, Maps, etc. The Apache DataSketches Library Languages Supported: • Java, C++, Python • Binary Compatibility
  23. 23. Bright Future for Sketching Technology & Solutions Items (words, IDs, events, clicks, …) • Count Distinct • Frequent Items, Heavy-Hitters, etc • Quantiles, Ranks, PMFs, CDFs, Histograms • Set Operations • Sampling • Mobile (IoT) • Moment and Entropy Estimation Graphs (Social Networks, Communications, …) • Connectivity • Cut Sparsification • Weighted Matching • … Areas where we have sketch implementations Areas of research (World-wide) Vectors (text docs, images, features, …) & Matrices (text corpora, recommendations, …) • Dimensionality Reduction (SVD) • Covariance Estimation • Low Rank Approximation • Sparsification • Clustering (k-means, k-median, …) • Linear Regression • Machine Learning (in some areas) • Density Estimation
  24. 24. THANK YOU! Open Invitation for Collaboration Learn More About Apache DataSketches Come And Visit Us!