
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on Stream Processing"

Data analytics took off early with the development of SQL. Yet at web scale, even simple analytics queries can prove challenging in (distributed) stream processing environments. Two such examples are Count and Count Distinct. Because these queries are key-oriented, exact evaluation traditionally requires memory that keeps growing with the number of distinct keys in the stream. Approximation techniques with fixed-size memory consumption make these tasks feasible and potentially more resource-efficient within streaming systems. This is demonstrated by integrating Yahoo Data Sketches with Apache Flink. The evaluation highlights the resource efficiency as well as the challenges of approximation techniques (e.g. varying accuracy) and the potential for tuning depending on the dataset. Furthermore, challenges in integrating the components with the existing streaming interfaces (e.g. Table API) and with stateful processing are presented.
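
The contrast between exact, key-oriented counting and a fixed-size summary can be illustrated with a small, self-contained sketch. It uses the classic Misra-Gries frequent-items summary, which is a swapped-in textbook technique and not the Yahoo Data Sketches algorithm used in the talk. The stream generator, the key names, and the capacity of 64 are hypothetical choices made only for this illustration.

```python
from collections import Counter


class MisraGries:
    """Frequent-items summary that never holds more than `capacity` counters."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.counters = {}

    def update(self, key):
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.capacity:
            self.counters[key] = 1
        else:
            # Table is full: decrement every counter, drop those that hit zero.
            for k in list(self.counters):
                self.counters[k] -= 1
                if self.counters[k] == 0:
                    del self.counters[k]

    def estimate(self, key):
        # Never overestimates; undercounts by at most n / (capacity + 1),
        # where n is the number of stream elements seen so far.
        return self.counters.get(key, 0)


def demo_stream(n_distinct=10_000, heavy_repeats=5_000):
    """Hypothetical skewed stream: one hot key mixed with many one-off keys."""
    for i in range(n_distinct):
        yield f"url_{i}"
        if i < heavy_repeats:
            yield "url_hot"


exact = Counter()                 # exact state: one entry per distinct key
sketch = MisraGries(capacity=64)  # approximate state: at most 64 entries
n = 0
for key in demo_stream():
    exact[key] += 1
    sketch.update(key)
    n += 1

print("exact entries:", len(exact))
print("sketch entries:", len(sketch.counters))
print("hot key, exact vs. estimate:", exact["url_hot"], sketch.estimate("url_hot"))
```

The exact table grows with every new key, while the summary stays within its capacity and still reports the heavy hitter within the stated error bound. A larger capacity tightens the bound at the cost of more memory.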


  1. 1. Approximate Standing Queries on Apache Flink. @tobiaslindener, lindener@kth.se. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 688191.
  2. 2. Overview: 1. Introduction 2. Background 3. Design & Implementation 4. Results. 04.09.2018
  3. 3. “It is better to use a crude approximation and know the truth, plus or minus 10 percent, than demand an exact solution and know nothing at all.” In Arthur Bloch, The Complete Murphy's Law: A Definitive Collection (1991), 126.
  4. 4. Unbounded Stream. Stream elements over time: ID_A P_A. Key/Count table: A = 1. Annotation: Infinite Growth.
  5. 5. Unbounded Stream. Stream elements over time: ID_A P_A, ID_B P_A. Key/Count table: A = 1, B = 1. Annotation: Infinite Growth.
  6. 6. Unbounded Stream. Stream elements over time: ID_A P_A, ID_C P_A, ID_B P_A. Key/Count table: A = 1, B = 1, C = 1. Annotation: Infinite Growth.
  7. 7. Unbounded Stream. Stream elements over time: ID_A P_A, ID_C P_A, ID_D P_A, ID_B P_A. Key/Count table: A = 1, B = 1, C = 1, D = 1. Annotation: Infinite Growth.
  8. 8. Unbounded Stream. Stream elements over time: ID_A P_A, ID_C P_A, ID_D P_A, ID_C P_A, ID_B P_A. Key/Count table: A = 1, B = 1, C = 2, D = 1. Annotation: Infinite Growth.
  9. 9. Unbounded Stream. Stream elements over time: ID_A P_A, ID_C P_A, ID_D P_A, ID_C P_A, ID_B P_A. Key/Count table: A = 1, B = 1, C = 2, D = 1. Annotation: Infinite Growth.
  10. 10. Approximation Algorithms. Use cases: identify heavy hitters (Count); cardinality estimation (Count Distinct). Algorithms: Frequent Item Estimation; HyperLogLog; Quantile Estimation.
  11. 11. Processing. Figure: Flink Architecture (Apache Software Foundation, 2018), annotated with Approximate Queries.
  12. 12. Processing. Figure: Flink Architecture (Apache Software Foundation, 2018), annotated with Approximate Queries.
  13. 13. Approximate Query API (1)
  14. 14. Approximate Query API (2)
  15. 15. Sketch Capacity
  16. 16. Estimate Deviation - Frequent Items. Panels: WikiTrace dataset, top 6000 URLs; Amazon dataset, top 1000 reviewers.
  17. 17. Estimate Deviation - HyperLogLog
  18. 18. Future Work. Native Implementation: potentially increased efficiency; reduced overhead; stateful processing. Integration with Table API: CQL with Approximate Queries; support for Count Distinct. Queryable State: reduced data handling.
  19. 19. Queryable State. Diagram labels: Stream Elements, Sketch Function, Query Results, Time.
  20. 20. Queryable State. Diagram labels: Stream Elements, Sketch Function, Query Results, Time; additionally Query Results via REST API.
  21. 21. Conclusion. Challenges: efficient grouping (HLL); stateful native implementation; skewed datasets. Learnings: importance of data distribution; performance advantages; algorithm parameters.
  22. 22. Team. Tobias Lindener, Consultant @ Netlight; Theodore Vasiloudis, Researcher @ RISE; Paris Carbone, PhD candidate @ KTH. Slides: bit.ly/2LULyoZ
  23. 23. References & Links. https://github.com/tlindener/ApproximateQueries; https://datasketches.github.io/; Daniel Anderson, Pryce Bevan, Kevin Lang, Edo Liberty, Lee Rhodes, Justin Thaler. A High-Performance Algorithm for Identifying Frequent Items in Data Streams; Kevin Lang. Back to the Future: an Even More Nearly Optimal Cardinality Estimation Algorithm.
  24. 24. Evaluation Environment. Datasets: WikiTrace Dataset (9 GB), Address (6,708,723 distinct URLs); Amazon Rating Dataset (3 GB), ProductId (9,874,210 distinct items), ReviewerId (21,176,521 distinct users). Hardware: Ryzen 1600 (6C/12T), 16 GB RAM. Software: Ubuntu 18.04; OpenJDK 8, JVM tuned for max 10 GB heap; Flink 1.4.2, standalone mode. Evaluation through Python scripts.
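
The Count Distinct use case from slide 10 relies on HyperLogLog, which can be sketched in a few lines. The following is a minimal textbook version (after Flajolet et al.) with the small-range linear-counting correction. It is an assumption-laden illustration, not the Data Sketches implementation evaluated in the talk, which follows the Kevin Lang paper listed in the references. The precision parameter `p`, the BLAKE2b hash, and the `user_…` keys are choices made only for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Textbook HyperLogLog with 2**p registers (illustrative only)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            # The alpha formula below is only valid for m >= 128 registers.
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    @staticmethod
    def _hash64(item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def update(self, item):
        x = self._hash64(item)
        index = x >> (64 - self.p)                # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=12)
true_distinct = 100_000
for i in range(true_distinct):
    key = f"user_{i}"   # hypothetical reviewer id
    hll.update(key)
    hll.update(key)     # duplicates do not change the sketch

estimate = hll.estimate()
relative_error = abs(estimate - true_distinct) / true_distinct
print(f"estimate={estimate:.0f} relative_error={relative_error:.4f}")
print("registers held:", len(hll.registers))
```

The state is a fixed array of 4096 small registers regardless of stream length, and the expected relative standard error is roughly 1.04 / sqrt(m), about 1.6 percent for `p=12`. Raising `p` trades memory for accuracy, which is one example of the algorithm parameters named among the learnings on slide 21.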
