Building scalable systems that process streams of data requires developers to take advantage of the parallelism offered by today's computer architectures. Existing imperative programming languages provide programmers with low-level primitives such as threads, locks, and semaphores. However, programs developed using these primitives tend to be plagued by race conditions and deadlocks, whose non-deterministic behaviors make them hard to understand and debug. Acknowledging these limitations, programming language extensions and libraries (e.g., OpenMPI, OpenMP) have been proposed to simplify parallel programming. Nevertheless, all these options still burden the programmer with annotating parallelism and specifying data-sharing attributes to ensure data consistency.
In recent years, dataflows have attracted significant attention as a model for building highly parallel stream-processing applications. In this model, an application is defined as a graph of processing elements connected by communication channels. The processing elements may execute in parallel as long as they have sufficient data to process. A key feature of the dataflow model is that it explicitly captures parallelism and the data dependencies between processing elements.
Even though dataflows provide a simple computational model, using this model to build scalable systems is challenging: naive implementations introduce unexpected runtime scheduling overhead, consume significant memory, and are not energy efficient. Consequently, our goal is to develop compiler optimizations and efficient runtime environments for scalable dataflow systems. In the following, we go through the model of computation, memory optimizations, energy efficiency, and stream processing at the scale of cloud computing.
High Performance Stream Processing and Optimizations
1. High Performance Stream Processing and Optimizations
University of Iowa | Mobile Sensing Laboratory
May 8, 2015
Farley Lai
Advisor: Octav Chipara
Department of Computer Science
2. Stream Applications
• A class of applications that process continuous input data streams and may produce continuous output streams
– High performance for real-time processing
– Long-term efficient resource management
[Figure: Speaker Identifier pipeline — Speech Recording → VAD → Feature Extraction (with Speaker Models) → HTTP Upload]
Develop compiler optimizations and efficient runtime environments for scalable streaming systems
Introduction
3. Stream Processing on Modern Architectures
• Challenges
– Multi-core architectures
– High and variable memory and I/O workloads
– Energy efficiency
• Traditional approach: programmer-specified parallelization
– Imperative languages expose limited parallelism to exploit
– Error-prone concurrency primitives
– Energy efficiency with aggressive power management
Introduction
4. Model of Computation
• Synchronous Data Flow (SDF)
– Programs are described as directed graphs
• Nodes → processes with sequential code
• Edges → FIFO communication channels
– Pipeline and data parallelism are explicitly exposed
– The amount of data consumed/produced by a process is fixed and known
• Limited expressivity, which enables:
– Periodic static schedules
– Bounded memory requirements
– Characterizing the memory behavior of the entire program
5. Synchronous Data Flow
• Static schedules
– Specify the order and number of invocations of each process in one period
– Two phases: an initialization phase + a steady phase
– The existence of a static schedule can be determined from the production and consumption rates of all processes
• Sequential schedules
– Built by simulating the process executions iteratively and tracking the channel buffer sizes:
• A process is schedulable if there is sufficient data to consume
• The simulation continues until the initial buffer sizes are restored
– Memory requirements are determined during scheduling
• Parallel schedules
– Derived by partitioning sequential schedules
Model of Computation
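The two steps above can be sketched in a few lines of Python. This is our own illustration, not StreamIt's scheduler: it solves the balance equations for the repetition vector, then simulates firings to produce a sequential schedule and the peak buffer sizes. The three-actor graph and its rates are made up for the example.

```python
from fractions import Fraction
from math import lcm

# A made-up three-actor SDF graph: (producer, consumer, push, pop) rates.
EDGES = [("src", "fir", 2, 3), ("fir", "sink", 1, 2)]

def repetition_vector(edges):
    """Solve r[p] * push == r[c] * pop for the smallest integer solution."""
    r = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for p, c, push, pop in edges:
            if p in r and c not in r:
                r[c] = r[p] * push / pop
                changed = True
            elif c in r and p not in r:
                r[p] = r[c] * pop / push
                changed = True
    scale = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * scale) for a, f in r.items()}

def sequential_schedule(edges, reps):
    """Fire any process with enough input until every process has reached
    its repetition count; track the peak channel buffer sizes."""
    actors = sorted(reps)                      # deterministic tie-breaking
    buf = {(p, c): 0 for p, c, _, _ in edges}
    peak = dict(buf)
    fired = {a: 0 for a in actors}
    schedule = []
    while any(fired[a] < reps[a] for a in actors):
        for a in actors:
            if fired[a] == reps[a]:
                continue
            # schedulable: every input channel holds enough tokens
            if all(buf[(p, c)] >= pop for p, c, _, pop in edges if c == a):
                for p, c, push, pop in edges:
                    if c == a:
                        buf[(p, c)] -= pop
                    if p == a:
                        buf[(p, c)] += push
                        peak[(p, c)] = max(peak[(p, c)], buf[(p, c)])
                fired[a] += 1
                schedule.append(a)
                break
        else:
            raise RuntimeError("inconsistent graph: no schedulable process")
    return schedule, peak

reps = repetition_vector(EDGES)      # {'src': 3, 'fir': 2, 'sink': 1}
sched, peak = sequential_schedule(EDGES, reps)
# One period: ['src', 'src', 'fir', 'src', 'fir', 'sink']; buffers return to 0.
```

The peak buffer sizes computed during the simulation are exactly the memory requirements mentioned on the slide.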
6. Synchronous Data Flow
• Potential memory inefficiency
– Per-channel buffer allocations and pass-by-value semantics
– Pass-by-reference only works on read-only chunks of data
• The aggregate update problem [Haven85]
• What if a process tries to insert new samples in the middle?
• Can we still prevent copying unchanged data portions by capturing the semantics of memory operations at compile time?
Model of Computation
7. MY RESEARCH
ESMS: Efficient Static Memory Analysis on Stream Programs
CSense: A Stream-Processing Toolkit for Mobile Sensing Applications
8. Memory Optimizations: ESMS
• StreamIt: an SDF language
• Per-channel allocation and pass-by-value semantics
→ a significant amount of memory usage and copying
• One single global allocation
– Reuses memory and reduces memory requirements
– Avoids unnecessary memory copies
My Research: ESMS
9. Static Analysis
• Component analysis on filter work functions in the logical space of each filter
– peek(i), pop(), push(v) are supposed to access contiguous memory allocations in FIFO channels
– Interprets each push(v) in a filter work function as
• PASS: moves an unchanged data sample between channels
• UPDATE: otherwise, a new value is pushed
→ Splitters, joiners, and reordering filters are pass-only

int->int filter appender() {
  work pop 4 push 8 {
    for (int i = 0; i < 4; i++) push(pop());      // PASS
    for (int i = 0; i < 4; i++) push(compute(i)); // UPDATE
  }
}

My Research: ESMS
10. Whole Program Analysis
• Relates logical positions to physical locations in the global allocation
– Remaps peek(i), pop(), push(v) to access possibly non-contiguous memory
– Live-range analysis by reference counting over one schedule period
• The layout starts with size zero and expands when necessary
• Each location represents a live variable with its live range
• The live range begins when the location receives a pushed value for the first time
– A splitter pushes a value multiple times for sharing
• The live range ends when its value is popped for the last time
• A location is free if its live variable is out of range
– Complete memory behaviors and a sound approximation
• No pointer aliasing
• Terminates in one schedule period
My Research: ESMS
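The reference-counting idea above can be sketched as a toy model. The `Layout` and `Location` names are ours, not ESMS's, and the sketch only models slot reuse in the global allocation, not the full remapping of peek/pop/push.

```python
# Toy model of a global layout with reference-counted live ranges.
class Location:
    def __init__(self):
        self.refs = 0              # outstanding pops for the current value

class Layout:
    def __init__(self):
        self.slots = []            # starts at size zero, grows on demand

    def alloc(self, refs):
        """Place a pushed value: reuse any free slot, else expand."""
        for i, slot in enumerate(self.slots):
            if slot.refs == 0:     # its live variable is out of range
                slot.refs = refs
                return i
        self.slots.append(Location())
        self.slots[-1].refs = refs
        return len(self.slots) - 1

    def pop(self, i):
        self.slots[i].refs -= 1    # the live range ends on the last pop

layout = Layout()
a = layout.alloc(refs=2)           # a splitter shares this value twice
b = layout.alloc(refs=1)
layout.pop(a)
layout.pop(b)
layout.pop(a)                      # last pop: slot a becomes free
c = layout.alloc(refs=1)           # reuses slot a instead of growing
```

After the three pops both slots are free, so the final allocation reuses slot `a` and the layout never grows past two locations.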
11. Layout Stitching
• Simulate one period of the static schedule
– case PASS: reuse memory locations in the layout
– case UPDATE: follow one of three strategies
• Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP)

Example 1 — input layout with 2 updates; D0 and D1 both out of range (MEM[location, live-range count]: variable):
  Input:    MEM[0, 0]: D0 | MEM[1, 0]: D1
  AA:       MEM[0, 0]: D0 | MEM[1, 0]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
  AoC & IP: MEM[0, 0]: I0 | MEM[1, 0]: I1

Example 2 — input layout with 2 updates; D1 still live:
  Input:    MEM[0, 0]: D0 | MEM[1, 1]: D1
  AA & AoC: MEM[0, 0]: D0 | MEM[1, 1]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
  IP:       MEM[0, 1]: I0 | MEM[1, 1]: I1 | MEM[2, 1]: D1

My Research: ESMS
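A minimal sketch of the three UPDATE strategies, modeling a layout as just a list of (variable, live) pairs. This is our reading of the slide's two examples, not ESMS code.

```python
# A layout is a list of (variable, live) pairs; `updates` are new values
# pushed by one filter invocation that need physical slots.
def always_append(layout, updates):
    return layout + [(u, True) for u in updates]

def append_on_conflict(layout, updates):
    # Reuse the leading slots only if all of them are already dead.
    if all(not live for _, live in layout[:len(updates)]):
        return [(u, True) for u in updates] + layout[len(updates):]
    return layout + [(u, True) for u in updates]

def insert_in_place(layout, updates):
    # Reuse each dead slot; insert (shifting the tail) when it is live.
    out = list(layout)
    for i, u in enumerate(updates):
        if i < len(out) and not out[i][1]:
            out[i] = (u, True)
        else:
            out.insert(i, (u, True))
    return out

# Example 1: D0 and D1 both out of range -> AoC and IP reuse both slots.
dead = [("D0", False), ("D1", False)]
assert append_on_conflict(dead, ["I0", "I1"]) == [("I0", True), ("I1", True)]
# Example 2: D1 still live -> AoC appends; IP reuses slot 0 and shifts D1.
mixed = [("D0", False), ("D1", True)]
assert insert_in_place(mixed, ["I0", "I1"]) == \
    [("I0", True), ("I1", True), ("D1", True)]
```

The asserts reproduce the two layout examples on the slide: IP reuses most aggressively, at the price of shifting live variables.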
12. Memory Usage Reductions
– ESMS reduces both channel buffer sizes and the number of memory operations for reordering and duplicating data streams (splitters, joiners, reordering filters)
– Code size: 73% reduction on average; data size: 45% to 96% reductions
My Research: ESMS
13. Speedup
– The average speedups of AA, AoC, and IP are 3, 3.1, and 3, while the average speedup of CacheOpt is merely 1.07
– ESMS improves performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set
My Research: ESMS
14. Mobile Sensing Applications: CSense
• Challenges
– Mobile sensing applications are difficult to implement on Android devices
• High frame rates
• Concurrency
• Robustness
• Energy efficiency
– Resource limitations and the Java VM worsen these problems
• Additional cost of virtualization
• Significant overhead of garbage collection
• CSense integrates SDF with dynamic scheduling
– Conditional dataflow paths by partitioning the SDF
– Asynchronous event processing, e.g., network access and UI
– Android-specific power management
My Research: CSense
15. Example Application
• Speaker Identifier
– Conditional dataflow paths result in SDF subgraphs
– Bounded memory requirements

// create components
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
...
// wire components
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...
My Research: CSense
16. Concurrency Model

getComponent("audio").setThreading(Threading.NEW_DOMAIN);
getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);

Compiler transformation
My Research: CSense
17. CSense Compiler
• Static analysis
– composition errors, memory usage errors, race conditions
• Flow analysis
– whole-application configuration and optimization
• Stream Flow Graph transformations
– domain partitioning, type conversions, MATLAB component coalescing
• Code generation
– Android application/service, MATLAB (C code + JNI stubs)
My Research: CSense
18. CSense Runtime
• Components exchange data using push/pull semantics
• The runtime includes a scheduler for each domain
– task queue + event queue
– wake lock for power management
[Figure: two domain schedulers, each with a task queue and an event queue, sharing a memory pool]
My Research: CSense
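A toy sketch of such a per-domain scheduler. This is our own Python stand-in, not the CSense runtime (which is Java on Android); the wake lock here is just a boolean flag standing in for Android's PowerManager wake lock.

```python
import queue

class DomainScheduler:
    """Task queue for component invocations, event queue for asynchronous
    completions, and a flag standing in for an Android wake lock."""
    def __init__(self):
        self.tasks = queue.Queue()
        self.events = queue.Queue()
        self.wake_lock_held = False

    def post(self, task):
        self.wake_lock_held = True        # stay awake while work is pending
        self.tasks.put(task)

    def signal(self, event):
        self.wake_lock_held = True
        self.events.put(event)

    def run_once(self):
        while not self.events.empty():    # drain asynchronous events first
            self.events.get()()
        if not self.tasks.empty():
            self.tasks.get()()            # run one component task in order
        if self.tasks.empty() and self.events.empty():
            self.wake_lock_held = False   # idle: let the device sleep

out = []
scheduler = DomainScheduler()
scheduler.post(lambda: out.append("vad"))
scheduler.run_once()                      # runs the task, releases the lock
```

Releasing the lock only when both queues are empty mirrors the slide's point: an idle scheduler must let Android's power management put the device to sleep.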
19. Producer-Consumer Throughput
• Garbage collection overhead limits scalability
• Concurrency primitives have a significant impact on performance
– Message pooling improves throughput by 13.8x; CSense spin locks add another 30% (up to 19x)
My Research: CSense
20. MY RESEARCH IN CONTEXT
More energy savings in the dataflow model
Dynamic optimizations in cloud stream processing
21. Energy Efficiency and Stream Processing
• Energy consumption in mobile sensing applications
– Energy bugs introduced by power management primitives
→ Program analysis on code paths and potential races
– Improper usage of I/O components elongates tail power states
→ Defer and batch I/O operations to execute in a short interval
– Intensive computations
→ Offload code to the cloud, AMP cores, or GPGPUs
Research of Interest: Energy Efficiency
22. DVFS: GreenStreams [Bartenstein2013]
• Inconsistent throughputs
– Critical path with bottleneck processes
• Dynamic Voltage and Frequency Scaling (DVFS)
– Dispatch partitions to cores at different frequencies
→ Save energy while maintaining performance
Research of Interest: Energy Efficiency
23. Cloud Stream Processing
• New challenges
– Changing input structure
• Tweets → mention graph → computations
• Distributed storage of the graph
• Consistent graph representation
– Communication overhead
• Conventional stream processing focuses on local computations and is less concerned with large-scale communication
– Changing performance criteria
• Decisions made statically cannot remain optimal the whole time
→ Better performance requires making decisions dynamically
– Fault tolerance
Research of Interest: Cloud Computing
24. Changing Graph: Kineograph
• Goal
– Compute timely properties on the changing graph input
• Challenges
– High rate of graph updates
– Consistent graph structure
– Static graph-mining algorithms
• Global progress-tracking protocol
– Graph updates are queued and progress is tracked in a global table
– Progress snapshots are taken and distributed to perform transactions of graph updates and the associated computations
• Pros
– Decouples graph updates from graph computations
• Cons
– Centralized progress tracking
– No analysis of updates that may cancel each other, or aggregation of potential communication propagation
Research of Interest: Cloud Computing
25. Timely Dataflow: Naiad
• Goal
– High-throughput, low-latency timely processing
• Challenges
– Communication overhead dominates local computing resources
– Massive communication causes media contention and high latency
• Efficient batching and localized communication
– Users decide whether to process input synchronously or asynchronously
– Distributed progress tracking based on partially ordered timestamps
• Pros
– Effective aggregation of communication
• Cons
– Flow control might be a concern for asynchronous delivery
Research of Interest: Cloud Computing
26. Execution Modes: PowerSwitch
• Goal
– Efficiently switch between sync and async modes for better performance and early termination
• Predicts the better execution mode online
• Pros
– Frees programmers from coding the execution modes explicitly
• Cons
– Separate snapshots and checkpointing under both modes
• Additional space usage
• Unclear performance impact
Research of Interest: Cloud Computing
27. Conclusions and Future Work
• Energy efficiency in stream processing
– Dataflow paths as code paths facilitate program analysis
– Graph manipulation for better aggregation of I/O component accesses
– Precise workload requests for computing resources
– Adapting to runtime information, e.g., user activity predictions
• Incorporating dynamic optimizations
– Static optimizations assume fixed resources
– Periodic reevaluation, e.g., execution mode switching
– Changing input structures and distributed state sharing
• Information flow paths that change less often, for aggregating communication
• Localized multicast for reconfiguration and termination in partial order
28. Conclusions and Future Work
• Potential limitation on stream processing optimizations
– Non-trivial transformation between SDF graphs at different granularities
[Figure: coarse-grained vs. fine-grained FFT stream graphs — FFTTestSource, FFTReorderSimple, CombineDFT, split/join, FloatPrinter nodes vs. FloatSource, Butterfly, BitReverse, FloatPrinter nodes, annotated with their rates]
29. Thank You — Any Questions?
Editor's Notes
Today, I'm going to present stream processing and its optimizations, ranging from memory and energy efficiency to new challenges at the scale of cloud computing.
Stream applications are a class of applications that process continuous input data streams and may produce continuous output streams.
Here is a sample application Speaker Identifier running on a mobile device.
It reads continuous audio samples from a microphone sensor, performs computationally intensive feature extraction and uploads to a remote server for identification.
This application is supposed to run in the background indefinitely.
This implies high performance is required for real-time processing.
Long-term efficient resource management is also important.
However, there are some challenges to stream processing we need to address.
First, multi-core architectures require programs to expose parallelism.
Second, high and variable memory and I/O workloads may incur significant runtime resource management overhead.
Third, energy efficiency is essential for long-term executions.
Traditional imperative languages such as C and Java, only expose limited parallelism to exploit.
Programmers need to manually code concurrency primitives such as threads and locks, which are error-prone.
On mobile devices, aggressive power management tends to burden programmers with error-prone APIs.
So, we are in search of models of computation that fully expose program parallelism.
Eventually, we found the synchronous data flow is promising.
In this model, programs are described as directed graphs.
The nodes represent processes with sequential code.
The edges represent FIFO communication channels.
It is straightforward to see the pipeline parallelism if there is a pipeline of processes.
And data parallelism if there is a branch in the graph.
Besides, a specific requirement of SDF is that the production and consumption rates of processes are fixed and known at compile-time.
In this way, SDF actually limits the expressivity of general dataflow models.
Such that periodic static schedules can be derived at compile-time
Memory requirements are bounded.
And even the memory behavior of the entire program can be characterized.
So, how does a static schedule look like?
It basically specifies the order and the number of invocations of a process in a period.
It consists of an optional initialization phase and a steady state that executes repetitively.
The good thing is that its existence can be determined based on the production and consumption rates of all processes.
Next, we derive such a schedule for a single processor.
The way is to simulate process executions iteratively and track the channel buffer sizes.
A process is schedulable if there is sufficient input to consume.
That means its invocation won't cause a negative buffer size.
If there are multiple schedulable processes, just randomly pick one.
The simulation continues until the same buffer size as the initial one is restored.
The maximum buffer size during the simulation is the memory requirement.
To adapt the schedule for multi-cores, simply partition it to run in parallel.
However, the semantics of the memory model may cause potential inefficiency.
It assumes per channel buffer allocations and pass-by-value semantics.
That means even when a data item passes through the channel without being modified, memory copies are necessary across channels.
How about pass-by-reference?
It basically works on read-only chunks of data.
What if a process tries to insert new samples in the middle?
Can we still prevent copying unchanged data portions?
This induces my research on the memory optimizations of stream programs.
A recent work ESMS: Efficient Static Memory Analysis on Stream Programs addresses this.
Another earlier developed stream-processing toolkit also takes care of the memory performance of mobile sensing applications.
I first introduce the memory optimization ESMS.
Itโs a compiler optimization based on a well-defined SDF language, StreamIt.
In StreamIt, stream is the first order construct which can be a filter representing a process in SDF.
Or one of the three hierarchical constructs.
A pipeline is a line of connected streams.
A splitjoin is a branch of streams that are joined later.
A feedbackloop is a loop processing structure which we may support in the near future.
StreamIt adopts per channel allocations and pass-by-value semantics.
That implies a significant amount of memory usage.
As you can see in splitjoins, the input needs to be copied to each of the output branches.
Our idea to improve this is using one single global allocation.
to reuse memory whenever possible and avoid unnecessary copies
The first step is to perform component analysis on filter work functions in the logical space.
The logical space is the positions referenced by peek(i), pop(), and push(v) to access contiguous memory in FIFO channels.
Their semantics are just as the name suggests.
There is no direct memory access to channels in StreamIt.
So, the goal is to interpret the memory operations of pushes.
It can be a PASS if the push moves a value from the input channel without being modified.
Otherwise, it is an UPDATE, meaning to push a new value.
Here is a sample filter work function.
As you can see, the first 4 pushes are interpreted as passes because the values are not modified.
The following 4 pushes are interpreted as updates because they push new values.
In this sense, the splitters and joiners of a splitjoin are pass-only.
Other filters that only reorder the input are also pass-only.
Apparently, the pass exposes the opportunity to reuse memory.
As for updates, whether they can reuse memory or not, we don't know at this time.
So we need to figure it out with a whole-program analysis.
The goal is to relate logical positions to physical locations in the global allocation.
This essentially means remapping peek(i), pop() and push(v) to access possibly non-contiguous memory.
It is based on a live-range analysis over one schedule period.
The mapping is stored in a layout which starts with size zero and expands when necessary.
Each location in the layout represents a live variable with its reference counted live range.
The live range begins when receiving a pushed value for the 1st time.
In this sense, a Splitjoin passes a value to multiple output channels with a live range greater than one, indicating the sharing.
The live range ends when its value is popped for the last time.
A location is free if its live variable is out of range.
In this way, we can derive complete memory behaviors and sound approximation because
There is no pointer aliasing and
The analysis will terminate in one schedule period.
Next, we simply simulate one schedule period to stitch the layout based on the live range information.
During the simulation, a memory location is reused on a PASS.
If it is an UPDATE, reuse the memory location only when it's free.
Otherwise, we need to find a free location using one of the three strategies.
Always-Append (AA), Append-on-Conflict (AoC), and Insert-in-Place (IP).
To understand the semantics, let's take a look at the examples.
Suppose the current input memory layout looks like this and there are two updates in the filter invocation.
The first number is the physical location and the 2nd number is the live range count.
D0 and D1 are the live variables.
In the first example, both live variables are out of range.
The AA strategy always appends at the end and expands the layout size.
The AoC strategy checks if all the updates can reuse.
The IP reuses whenever possible.
Clearly, both reuse the input layout in the 1st example.
In the 2nd example, D1 is still alive.
In this case, both AA and AoC append the updates at the end and expand the layout size.
The IP reuses the first location and inserts to expand the layout. D1 is shifted by one location.
So, in general, IP aggressively reuses the space.
However, a side effect is the potentially increasing code complexity due to channel access in loops.
Consider peek, pop and push in loops.
If the memory layout is contiguous, there is no problem.
Otherwise, additional code needs to be generated to ensure correct mappings across loop iterations.
Therefore, this optimization involves a trade-off between the allocation size and code complexity.
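The code-generation issue can be illustrated with a toy remapping table (our illustration, not ESMS's generated code): when the stitched layout is non-contiguous, each logical channel position goes through a physical-location table instead of plain offset arithmetic.

```python
MEM = [0] * 8                 # the single global allocation
phys = [5, 2, 3, 7]           # physical slot of each logical position 0..3

def push(log_pos, value):     # remapped channel write
    MEM[phys[log_pos]] = value

def peek(log_pos):            # remapped channel read
    return MEM[phys[log_pos]]

for i in range(4):
    push(i, i * i)
assert [peek(i) for i in range(4)] == [0, 1, 4, 9]
# With a contiguous layout the table disappears and accesses compile
# down to plain base + offset arithmetic.
```

The extra indirection is exactly the added code complexity the trade-off refers to: the more a strategy scatters a channel's values, the more such tables the generated loops need.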
Nonetheless, the ESMS evaluation shows significant memory usage reductions on both code size and data size for the 14 benchmarks.
We compare with baseline StreamIt and their cache optimization
which essentially increases the buffer size to execute a filter multiple times to exploit the instruction cache locality.
The right figure shows the expected reductions on the data size including filter states range from 45% to 96%.
The left figure shows the reductions on the code size are 73% on average.
This is because by capturing the memory behaviors, the memory copies by splitjoins and reordering filters are eliminated.
Hence, the overall memory usage is reduced significantly.
We also evaluate the speedup over the baseline StreamIt. The average is 3.
For some complex computations such as FFT, the speedup can be up to 7 or more.
So, instead of trading space for performance as the StreamIt cache optimization,
ESMS improves the performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set.
Besides the specific memory optimization, we also proposed a stream-processing toolkit called CSense
that allows developers to build full-fledged mobile sensing applications.
It addresses the following challenges, high frame rates, concurrency complexities, long-term robustness and energy efficiency.
It also prevents the overhead induced by the Java virtual machine.
Based on the SDF, CSense adopts dynamic scheduling to support conditional dataflows,
asynchronous event processing and the Android-specific power management.
A sample application, the Speaker Identifier, is composed as a pipeline.
The programmer simply creates and wires the components in the code fragment.
What is interesting is that the scheduling follows a depth first traversal of the stream flow graph.
Each component is allowed to direct the scheduling by pushing through its output ports.
In this app, the RMS classifier which detects if the input audio sample is voiced or not,
conditionally pushes the sample to either of its output ports.
This essentially partitions the SFG into SDF subgraphs
This implies the CSense compiler can derive the bounded memory requirements of the entire graph
To free programmers from the concurrency primitives,
CSense allows programmers to annotate concurrent I/O operations.
In this app, the audio and httpPost components are specified.
The specified components will be partitioned into different domains.
The other components just follow the domain of the upstream component.
Each domain is executed on a separate thread.
A synchronous queue is inserted to ensure safety for data exchange between domains.
In summary, the CSense compiler performs static analysis to prevent
composition errors due to incompatible types and
memory usage errors due to leaks,
and to identify race conditions for safety.
It also performs the flow analysis to decide the message buffer sizes between each pair of neighboring components
such that in one invocation the producer component can fill the buffer with all the consumer component needs to process
This simplifies the dynamic scheduling.
In the following, the CSense compiler performs graph transformations for
domain partitioning as mentioned before, type conversion and MATLAB component coalescing.
The CSense toolkit allows programmers to write components in MATLAB for ease of signal processing.
Finally, the CSense compiler generates a complete Android application and compiles MATLAB code if necessary.
After the application is installed on the target device, it is executed by the CSense runtime.
The runtime includes a scheduler for each domain.
A scheduler maintains a task queue, an event queue and an Android wakelock.
The task queue schedules components to execute in order.
The event queue allows to process asynchronous events.
The wakelock is associated with the Android power management.
Whenever no wakelock is acquired by any application, the Android device is soon put to sleep.
For our schedulers, if the task queue is empty, the scheduler determines whether to release the wakelock and goes to sleep.
Despite dynamic scheduling, CSense ensures high throughput by improving the memory and synchronization performance.
We evaluate the throughput of a producer-consumer benchmark.
The CSense memory management is based on message pooling, denoted by MP.
The CSense synchronization primitives are based on spin locks, denoted by C.
The baseline is the default Java garbage collection GC and reentrant lock L.
In this figure, the x-axis represents the production rate while the y-axis represents the consumption rate.
Ideally, both should agree.
However, GC overhead limits the scalability.
With MP, the performance is improved by 13.8X
With the CSense spinlocks, we got additional 30% improvement.
After exploring specific memory optimizations of stream programs and developing a stream processing toolkit for mobile sensing applications,
I would like to further investigate more energy savings specific to the dataflow model and dynamic optimizations in cloud stream processing.
We surveyed major energy consumption issues and the methodologies to address them.
First, the energy bugs introduced by power management primitives may be identified by program analysis on code paths and potential races.
This is because the power management primitives follow the concurrency primitives and cause similar bugs.
Next, improper usage of I/O components elongates tail power states.
The tail state is a state that consumes a significant amount of energy after performing a high power I/O operation.
Letโs look at this figure which shows the 3G module power states.
Normally, in the idle state, no significant energy consumption.
After the application makes connection and send requests, it goes to the high power state that consumes the most energy.
After that, this application computes something else while the 3G module goes to the tail state and still consumes a significant amount of energy.
Then, the application sends another message and causes the module to go to the high power state again.
Therefore, the energy saving strategy is to defer and batch I/O operations as much as possible to perform in a short interval.
Such that the tail energy consumption can be minimized.
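The defer-and-batch strategy can be sketched in a few lines (our illustration; `send` stands in for the real network call): queue uploads and flush them together so the radio pays the tail-energy cost once per batch instead of once per item.

```python
class BatchingUploader:
    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send       # stands in for the real network call
        self.pending = []

    def upload(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(list(self.pending))  # one radio wake-up for N items
            self.pending = []

sent = []
uploader = BatchingUploader(3, sent.append)
for sample in range(7):
    uploader.upload(sample)
uploader.flush()               # flush the remainder before idling
# sent == [[0, 1, 2], [3, 4, 5], [6]]
```

With batching, seven uploads trigger three radio wake-ups (and three tail periods) instead of seven.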
On the other hand, intensive computations may be unavoidable and consume a lot of energy.
Code offloading delegates the computation either to a remote server or to other CPU or GPU cores.
This approach needs to weigh the data transfer cost against local computations.
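That weighing can be sketched as a back-of-the-envelope break-even test (our model; the numbers are made up and real systems would also account for energy, not just time):

```python
def should_offload(data_bytes, bandwidth_bps, local_s, remote_s):
    """Offload only if transfer plus remote compute beats local compute."""
    transfer_s = data_bytes * 8 / bandwidth_bps
    return transfer_s + remote_s < local_s

# 1 MB over 8 Mbit/s costs 1 s of transfer time.
assert should_offload(1_000_000, 8_000_000, local_s=5.0, remote_s=0.5)
assert not should_offload(1_000_000, 8_000_000, local_s=1.2, remote_s=0.5)
```

The same inequality explains why offloading pays off for compute-heavy, data-light stages (feature extraction) but not for data-heavy, compute-light ones.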
As for the dataflow model of stream processing, there is an implicit energy consumption due to inconsistent real-time throughputs.
Here is the example. The stream flow graph is partitioned to run on 3 CPU-cores.
However, the partition AB is the bottleneck and slows down the entire throughput.
Apparently, there is no improvement no matter how fast partitions C and DE can run.
Instead, the core dispatched to partition AB should run at its highest frequency.
The other cores dispatched to C and DE can run at lower frequencies to save energy.
This is based on the dynamic voltage frequency scaling technique.
Here is the evaluation based on partitioning StreamIt programs.
The GreenStreams configuration sets the frequencies to match the partition throughputs.
Compared with the Fast configuration that sets the highest frequency for each CPU core,
It saves up to 28% energy while maintaining comparable performance.
This implies stream applications may save more energy by making precise requests to computing resources.
In addition to energy efficiency, stream processing at the scale of cloud computing is worth exploration
because of the following new challenges.
The first is about the changing input structure.
Suppose we want to compute the influence of users in a social network like Twitter.
The input is a continuous stream of tweets but the computation is based on the mention graph induced from the tweets.
The graph changes all the time and is so large that distributed storage is necessary.
The key is to track the progress of the structure updates in a distributed setting.
The second challenge is the communication overhead that contributes to significant latency.
The key is to aggregate and localize communications.
The third challenge is changing criteria.
That means the once optimal static decisions become suboptimal over time.
The key is to reevaluate when appropriate.
Last but not least is the fault-tolerance which is essential due to the distributed nature.
A related work that deals with changing graph input is Kineograph.
Its goal is to compute timely properties on a continuously changing graph.
The graph algorithm itself is static and operates on a consistent graph structure.
Since it would be complex to interleave graph updates with computations directly,
Kineograph proposes a global progress-tracking protocol:
graph updates are queued, and their progress is tracked in a global table.
Progress snapshots are taken periodically to apply the queued graph updates transactionally and run the associated computations.
The advantage of this approach is that it decouples graph updates from algorithm-specific computations.
Nonetheless, the centralized progress tracking may become a performance concern.
Moreover, there is no analysis of graph updates that may cancel each other out, nor of aggregating the communications that updates may propagate.
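The queue-then-snapshot pattern can be sketched as below. This is a heavily simplified single-process illustration of epoch-based progress tracking, not Kineograph's actual protocol; the node names and table layout are assumptions.

```python
# Sketch of epoch-based progress tracking: updates are queued, a global
# table records how far each ingest node has progressed, and a periodic
# snapshot applies all updates up to the agreed epoch before computing.

from collections import deque

update_queue = deque()
progress_table = {"ingest-0": 0}   # global table: ingest node -> last seq no
graph = {}                          # adjacency: node -> set of neighbors

def enqueue_update(node, edge, seq):
    """Queue an edge insertion and record its sequence number."""
    update_queue.append((edge, seq))
    progress_table[node] = seq

def take_snapshot():
    """Apply every queued update up to the globally agreed epoch."""
    epoch = min(progress_table.values())
    while update_queue and update_queue[0][1] <= epoch:
        (src, dst), _ = update_queue.popleft()
        graph.setdefault(src, set()).add(dst)
    return epoch  # computations now run on this consistent snapshot

enqueue_update("ingest-0", ("a", "b"), 1)
enqueue_update("ingest-0", ("b", "c"), 2)
take_snapshot()  # the graph now reflects updates 1 and 2
```

The point of the structure is visible even in this toy version: the computation only ever sees the graph as of a snapshot epoch, never a half-applied update.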
In contrast with Kineograph, Naiad focuses on reducing communication overhead and avoiding massive medium contention.
To this end, Naiad provides synchronous notifications and asynchronous message delivery,
which facilitates efficient batching.
Naiad also proposes distributed progress tracking that notifies only the relevant processes;
there is no broadcast in Naiad systems.
Therefore, Naiad effectively aggregates communications.
On the other hand,
flow control may be a concern for asynchronous delivery,
and such fine-grained message control may not be easy to use when building large stream applications.
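The two primitives can be illustrated with a toy channel. This is a simplification I am sketching under stated assumptions, not Naiad's API: the class name, batch size, and delivery representation are all illustrative.

```python
# Sketch of the two communication primitives: asynchronous messages are
# buffered and flushed in batches, while a synchronous notification signals
# that no more messages for a given logical time will arrive.

class Channel:
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = []      # pending asynchronous messages
        self.delivered = []   # batched deliveries seen by the receiver

    def send_async(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.delivered.append(list(self.buffer))  # one batched delivery
            self.buffer.clear()

    def notify(self, time):
        # Synchronous notification: flush pending messages, then signal
        # that logical time `time` is complete for the receiver.
        self.flush()
        self.delivered.append(("notify", time))

ch = Channel()
for i in range(4):
    ch.send_async(i)
ch.notify(0)
# delivered: [[0, 1, 2], [3], ("notify", 0)]
```

Batching the asynchronous traffic while keeping the notification synchronous is what lets the receiver know exactly when a logical time is finished without a broadcast.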
Finally, consider deciding the execution mode for algorithms such as PageRank.
In general, there are two execution modes.
In the synchronous mode, all processes are scheduled in iterations, and the input data must be ready for every process at the beginning of an iteration.
In the asynchronous mode, each process may execute as soon as its input is ready.
It is not trivial to make an optimal choice from static information alone, because the relative performance of the two modes may change across execution stages.
Therefore, PowerSwitch predicts the throughput of both the current mode and the alternate mode online.
Based on the predictions, PowerSwitch changes the execution mode whenever doing so is expected to improve performance.
This frees programmers from coding the execution modes explicitly.
However, keeping separate snapshots and checkpoints for both modes uses more space,
and the performance impact of doing so is unclear.
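The switching decision can be sketched as a simple comparison with a safety margin. The predictor inputs and the margin value are stand-ins I am assuming for illustration; the real system builds throughput models for both modes from sampled statistics.

```python
# Sketch of online mode switching: every epoch, compare the predicted
# throughput of the alternate mode against the current one, and switch
# only when the gain exceeds a margin (to amortize switching overhead).

def choose_mode(current, predict, margin=0.05):
    """predict: mode -> predicted throughput in items/sec."""
    alternate = "async" if current == "sync" else "sync"
    if predict[alternate] > predict[current] * (1 + margin):
        return alternate  # switching is predicted to pay off
    return current        # otherwise stay to avoid switching overhead

mode = "sync"
mode = choose_mode(mode, {"sync": 900.0, "async": 1200.0})
# the predicted gain exceeds the 5% margin, so execution switches to async
```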
So far, we have gone through the stream processing model,
memory optimizations,
potential improvements in energy efficiency, and
opportunities for dynamic optimizations.
I believe it is promising to further improve energy efficiency in stream processing.
First, dataflow paths, as code paths, should facilitate program analysis.
Second, graph manipulation allows better aggregation of I/O component accesses.
Third, it is possible to make precise workload requests for computing resources.
Fourth, it is necessary to adapt to runtime information such as user activity predictions.
For example, a stream program should batch content in proportion to the user's changing interest.
Finally, at the scale of cloud computing, it is essential to incorporate dynamic optimizations.
There are several levels of dynamism.
First, static optimizations based on fixed resources are still necessary.
Second, some decisions, such as the switching of execution modes, need periodic reevaluation.
Ultimately, the input data structure is stored in distributed storage and changes over time.
We may try to maintain a relatively stable information flow path for aggregating communications,
and utilize localized multicast in some partial order for notifying reconfiguration and termination.
In closing, we found that static stream-processing optimizations may be limited to one aspect or to local features of the stream graph.
To the best of our knowledge, none of the existing optimizations can transform a coarse-grained FFT into an equivalent fine-grained FFT.
We believe this deserves further investigation.