Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics

Badrish Chandramouli, Jonathan Goldstein, Yinan Li
Microsoft Research

Disordered Data Processing
• Big data systems frequently collect log and telemetry data from
machines, sensors, devices, apps, browsers, …
• Disorder is common in such logs, due to:
• Network delays
• Intermittent machine failures
• Periods of poor connectivity
• Race conditions during log aggregation
• …
• Increasing demand for real-time analysis on such streams
• Examples: Microsoft Trill, Spark Streaming, Google Cloud Dataflow,
Apache Flink

Real Workload Analysis
• Event time: the logical time at which the event occurs.
• Processing time: the time at which the event is ingested into a streaming engine.
[Figure: two panels. CloudLog: chaotic at fine granularity. AndroidLog: chaotic at coarse granularity.]

Disordered Data Processing in Trill
• Trill is a high-performance query processor for streaming analytics
• Widely used in Microsoft products (Azure Stream Analytics, Bing, Office, Halo)
• Highly optimized implementation (columnar storage, code generation, etc.)
• All operators are in-order operators
• Side note: you can now download Trill binaries at http://aka.ms/trill
• Our Goal: make Trill efficiently process out-of-order streams
• Keep using high-performance implementation of in-order operators
• High throughput, low latency, low memory usage

Key Challenges and Solutions
• How to sort streams efficiently?
• Impatience sort: online Patience sort
• How to produce good streaming query plans with sorting operators?
• Sort-as-needed execution strategy: push down order-insensitive operators
• How to cope more flexibly with the latency-completeness tradeoff?
• Impatience framework: deliver early results without losing late events

Impatience Sort:
Problem Definition and Performance Requirements
• Online sorting operator
• Data stream consists of data events and punctuations
• When receiving a punctuation with a timestamp T, sort all events whose
timestamps are less than or equal to T and output the sorted stream.
• Performance requirements:
• Adaptive to sortedness
• Efficient incremental sorting
Existing sorting algorithms fall short of at least one of the two requirements.
[Figure: online sorting with Impatience sort.
Input stream: 2 6 5 1 2 4 3 7 4 8 ∞
Output stream: 1 2 2 3 4 4 5 6 7 8 ∞]

Background on Patience Sort
• Offline sort inspired by the British card game of Patience (Solitaire)
• Two phases
• Partition phase: for each element, place it into the first sorted run whose last
element is less than or equal to the current element,
or if such a run does not exist, create a new run
• Merge phase: merge all sorted runs
Example: the input 2 6 5 1 4 3 7 8 is partitioned into four sorted runs (Run 1 to Run 4).
# Runs: k = O(√n)
Run selection cost: O(log k)
Partition cost: O(n log k)
Sorting cost: O(n log k)
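The two phases above can be sketched in Python as follows. This is a minimal illustration rather than the Trill implementation (which is in C#); the function names are hypothetical. It relies on the observation that, under the first-fit rule, the last elements of the runs stay in strictly decreasing order, so run selection can use binary search.

```python
import heapq
from bisect import bisect_left


def partition_into_runs(items):
    """Partition phase: append each element to the first run whose last
    element is <= the element, or start a new run if none qualifies."""
    runs = []       # each run is an ascending list
    neg_tails = []  # negated last element of each run; ascending, so bisect applies
    for x in items:
        i = bisect_left(neg_tails, -x)  # index of the first run with tail <= x
        if i == len(runs):
            runs.append([x])
            neg_tails.append(-x)
        else:
            runs[i].append(x)
            neg_tails[i] = -x
    return runs


def patience_sort(items):
    """Merge phase: k-way merge of the sorted runs."""
    return list(heapq.merge(*partition_into_runs(items)))


runs = partition_into_runs([2, 6, 5, 1, 4, 3, 7, 8])
# runs == [[2, 6, 7, 8], [5], [1, 4], [3]], i.e. four runs
print(patience_sort([2, 6, 5, 1, 4, 3, 7, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Negating the tails is only a convenience for numeric timestamps; a comparison-based search over the decreasing tails would work equally well.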

Why Patience Sort?
• Reason 1: Patience sort is naturally adaptive to many common out-of-order patterns appearing in logs
• If input array is generated by interleaving d sorted runs, we have k ≤ d.
• If there are d natural runs in an input array, we have k ≤ d.
• If there are d distinct values of timestamps in input array, we have k ≤ d.
• Reason 2: its merge-based nature implies a potential solution for
incremental sorting
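The three bounds in Reason 1 can be checked empirically with a small sketch. The helper names (`count_runs`, `interleave`) and the test data are assumptions made for illustration; only the run-selection rule comes from the slides.

```python
import random
from bisect import bisect_left


def count_runs(items):
    """Number of runs k created by the Patience sort partition phase."""
    neg_tails = []  # negated last element of each run, kept ascending
    for x in items:
        i = bisect_left(neg_tails, -x)  # first run whose tail is <= x
        if i == len(neg_tails):
            neg_tails.append(-x)
        else:
            neg_tails[i] = -x
    return len(neg_tails)


def interleave(sorted_runs, rng):
    """Randomly interleave sorted lists, preserving each list's own order."""
    order = [i for i, run in enumerate(sorted_runs) for _ in run]
    rng.shuffle(order)
    cursors = [0] * len(sorted_runs)
    out = []
    for i in order:
        out.append(sorted_runs[i][cursors[i]])
        cursors[i] += 1
    return out


rng = random.Random(0)
d = 3
sources = [sorted(rng.randrange(1000) for _ in range(50)) for _ in range(d)]

k_interleaved = count_runs(interleave(sources, rng))          # d interleaved sorted runs
k_natural = count_runs(sources[0] + sources[1] + sources[2])  # at most d natural runs
k_distinct = count_runs([rng.randrange(5) for _ in range(1000)])  # at most 5 distinct values

assert k_interleaved <= d and k_natural <= d and k_distinct <= 5
```

By contrast, a strictly descending input such as 5 4 3 2 1 produces one run per element, which is the case the adaptivity argument does not cover.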

Impatience Sort
• A variant of Patience sort that supports online (incremental) sorting
• Create sorted runs as we receive data
• When we receive a punctuation with timestamp T
• For each sorted run, remove all events whose timestamps ≤ T.
• Merge all removed subsequences and output the merged results.
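[Figure: Impatience sort on the input stream 2 6 5 1 2 4 3 7 4 8 ∞. Events are placed into Run 1 to Run 4; the punctuations ≤2, ≤4, and ≤∞ each release sorted events to the output stream.]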
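The steps above can be sketched in Python as follows. This is a simplified illustration, not the Trill operator: the class and method names are hypothetical, the stream encoding as `("event", ts)` and `("punct", ts)` tuples is an assumption, and the sketch assumes no event arrives with a timestamp at or below an earlier punctuation. The example stream reads the second 2, the second 4, and ∞ in the slide's input as punctuations, which is consistent with its ≤2, ≤4, ≤∞ markers.

```python
import heapq
from bisect import bisect_left, bisect_right
from math import inf


class ImpatienceSorter:
    def __init__(self):
        self.runs = []       # ascending lists of timestamps
        self.neg_tails = []  # negated last element of each run, ascending

    def on_event(self, ts):
        # Same first-fit rule as the Patience sort partition phase.
        i = bisect_left(self.neg_tails, -ts)
        if i == len(self.runs):
            self.runs.append([ts])
            self.neg_tails.append(-ts)
        else:
            self.runs[i].append(ts)
            self.neg_tails[i] = -ts

    def on_punctuation(self, t):
        # Cut the prefix with timestamps <= t off every run, then merge the prefixes.
        heads, kept_runs, kept_tails = [], [], []
        for run in self.runs:
            cut = bisect_right(run, t)
            heads.append(run[:cut])
            if cut < len(run):  # drop runs that became empty
                kept_runs.append(run[cut:])
                kept_tails.append(-run[-1])
        self.runs, self.neg_tails = kept_runs, kept_tails
        return list(heapq.merge(*heads))


def impatience_sort(stream):
    sorter = ImpatienceSorter()
    for kind, ts in stream:
        if kind == "event":
            sorter.on_event(ts)
        else:
            for out in sorter.on_punctuation(ts):
                yield ("event", out)
            yield ("punct", ts)


stream = [("event", 2), ("event", 6), ("event", 5), ("event", 1), ("punct", 2),
          ("event", 4), ("event", 3), ("event", 7), ("punct", 4),
          ("event", 8), ("punct", inf)]
output = [ts for _, ts in impatience_sort(stream)]
# output == [1, 2, 2, 3, 4, 4, 5, 6, 7, 8, inf]
```

Dropping emptied runs at each punctuation is what keeps the number of live runs small; in this example the sorter never holds more than four runs.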

Impatience Sort (continued)
• Impatience sort can gradually clean up sorted runs created by
severely delayed events → fewer sorted runs → better performance.
[Figure: the number of sorted runs in Patience and Impatience sort when sorting the CloudLog dataset.]
More optimizations in the paper!

Performance Evaluation: offline data
• Implemented all sort algorithms in Trill (in C#)
• Preloaded data in memory
• Single thread execution
[Figure: throughput (million events/sec) on CloudLog and AndroidLog for Impatience sort, Quicksort, Timsort, and Heapsort.]
Impatience sort takes better advantage of existing order in input data

Performance Evaluation: Online Data
[Figure: two panels, CloudLog and AndroidLog. X-axis: gap between punctuations, log scale.]
Impatience sort is less sensitive to frequent punctuations
More results are in the paper!

Outline
• Impatience sort
• Sort-as-needed execution
• Impatience framework

Optimizations on Query Plans
• Idea: sort data “only as needed” for a given query.
• Solution: push down order-insensitive operators
• Selection and projection operators
• Window operators
• Example: a hopping (sliding) window query that computes over a
one-minute window every second.
• In Trill, this is performed by adjusting timestamps:
eventTime - eventTime % hop-size
• Reduces the number of distinct values and the number of natural runs → better sorting
performance of Impatience sort.
• Performance: up to 7X speedup
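A small sketch of why pushing the window operator below the sort helps. The timestamp formula is the one from the slide; the function names, the millisecond unit, and the sample timestamps are assumptions chosen for illustration.

```python
from bisect import bisect_left


def count_runs(items):
    """Number of sorted runs the Patience/Impatience partition step creates."""
    neg_tails = []  # negated last element of each run, kept ascending
    for x in items:
        i = bisect_left(neg_tails, -x)  # first run whose tail is <= x
        if i == len(neg_tails):
            neg_tails.append(-x)
        else:
            neg_tails[i] = -x
    return len(neg_tails)


def align_to_hop(event_time, hop_size):
    """Timestamp adjustment from the slide: eventTime - eventTime % hop-size."""
    return event_time - event_time % hop_size


# Event times in milliseconds, badly disordered within each second.
raw = [1900, 1700, 1500, 1300, 1100, 2800, 2600, 2400, 2200]
aligned = [align_to_hop(t, 1000) for t in raw]  # hop size of one second

print(count_runs(raw))      # 5 runs before the adjustment
print(count_runs(aligned))  # 1 run: the aligned stream is already sorted
```

Because the adjustment is monotone, it can never increase the number of runs, and the run count after alignment is bounded by the number of distinct aligned timestamps.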

Outline
• Impatience sort
• Sort-as-needed execution
• Impatience framework

Impatience framework
• Adds support for a user-specified set of reorder latencies (e.g., {1 sec, 1 min, 1 hour})
• Delivers early results without losing late-arriving events
• Reduces memory usage
[Figure: latency vs. completeness. A 1 sec reorder latency gives 98% completeness; 1 hour gives 100%.]
• Pitfalls of sort-based out-of-order data processing
• Users are forced to make a tradeoff between completeness and latency
• High memory usage

• Partition events based on delay, e.g., {< 1 sec, < 1 min, < 1 hour}
• Inject user-provided Trill operators into the framework
• Low overhead in throughput
• Reduces memory usage in certain cases
• Unmodified in-order Trill operators
[Figure: example query plan. A partition operator splits the out-of-order stream into 1 sec, 1 min, and 1 hour paths; operators shown: window, filter, sort, count, union, sum. Legend: user-provided operator, out-of-order stream, in-order stream.]
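The partition step can be sketched as follows. This covers only the routing of events by delay, not the per-path sort, union, or user-provided operators. The function name is hypothetical, and two choices are assumptions rather than details from the slides: delay is measured against the highest event time seen so far, and events later than the largest reorder latency are returned separately.

```python
def partition_by_delay(event_times, latencies):
    """Route each event to the smallest reorder latency that tolerates its delay.

    event_times: event timestamps in arrival order (seconds).
    latencies:   ascending reorder latencies in seconds, e.g. [1, 60, 3600].
    """
    buckets = [[] for _ in latencies]
    too_late = []  # assumption: events beyond the largest latency are set aside
    high_watermark = float("-inf")
    for ts in event_times:
        high_watermark = max(high_watermark, ts)
        delay = high_watermark - ts  # assumption: delay relative to the high watermark
        for i, limit in enumerate(latencies):
            if delay < limit:
                buckets[i].append(ts)
                break
        else:
            too_late.append(ts)
    return buckets, too_late


buckets, too_late = partition_by_delay(
    [100, 101, 100, 130, 90, 4000, 3000, 300], [1, 60, 3600])
# buckets  == [[100, 101, 130, 4000], [100, 90], [3000]]
# too_late == [300]
```

In this example the first path receives only events that arrive in order, so sorting it is trivial, while the rarer, badly delayed events are confined to the slower paths.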

Performance of Impatience framework
Impatience framework:
High completeness, low latency, high throughput, low memory usage!

Reorder latencies               Completeness   Latency
{1 sec, 1 min, 1 hour}          100%           ~1 sec
{1 sec}                         98%            ~1 sec
{1 hour}                        100%           ~1 hour
{1 sec} + {1 min} + {1 hour}    100%           ~1 sec

[Figure: Throughput (million/sec) for Count, SmallGroupBy, LargeGroupBy, and TopK under the four configurations above.]
[Figure: Memory usage (MB) for Count, SmallGroupBy, LargeGroupBy, and TopK under the same four configurations.]

Conclusion
• End-to-end sort-based solution for processing disordered
streams
• Impatience sort: an efficient streaming sort operator that can take
advantage of existing order in input stream
• Sort-as-needed execution: push down order-insensitive operators
• Impatience framework: deliver early results without losing late
events
• High completeness, low latency, high throughput, low memory usage
http://aka.ms/trill

• Adds support for a set of reorder latencies (e.g., {1 sec, 1 min, 1 hour})
• Delivers early results without losing late-arriving events
[Figure: latency vs. completeness (1 sec, 98%; 1 hour, 100%); five charts with a 0 to 100 axis, labeled "1 min" and "Refresh every second".]
