Building scalable systems that process streams of data requires developers to take advantage of the parallelism offered by today's computer architectures. Existing imperative programming languages provide programmers with low-level primitives such as threads, locks, and semaphores. However, programs developed using these primitives tend to be plagued by race conditions and deadlocks, whose non-deterministic behaviors make them hard to understand and debug. Acknowledging these limitations, programming language extensions and libraries (e.g., OpenMPI, OpenMP) have been proposed to simplify parallel programming. Nevertheless, all these options still burden the programmer with annotating parallelism and specifying data-sharing attributes to ensure data consistency.
In recent years, dataflows have attracted significant attention as a model for building highly parallel stream-processing applications. In this model, an application is defined as a graph of processing elements connected by communication channels. The processing elements may execute in parallel as long as they have sufficient data to process. A key feature of the dataflow model is that it explicitly captures parallelism and the data dependencies between processing elements.
Even though dataflows provide a simple computational model, using this model to build scalable systems is challenging: naive implementations introduce unexpected runtime scheduling overhead, consume significant memory, and are not energy efficient. Consequently, our goal is to develop compiler optimizations and efficient runtime environments for scalable dataflow systems. In the following, we go through the model of computation, memory optimizations, energy efficiency, and stream processing at the scale of cloud computing.
High Performance Stream Processing and Optimizations
1. High Performance Stream Processing and Optimizations
University of Iowa | Mobile Sensing Laboratory
May 8, 2015
Farley Lai
Advisor: Octav Chipara
Department of Computer Science
2. Stream Applications
• A class of applications that process continuous input data streams and may produce continuous output streams
– High performance for real-time processing
– Long-term efficient resource management
[Figure: Speaker Identifier pipeline — Speech Recording → VAD → Feature Extraction (with Speaker Models) → HTTP Upload]
Develop compiler optimizations and efficient runtime environments for scalable streaming systems
Introduction
3. Stream Processing on Modern Architectures
• Challenges
– Multi-core architectures
– High and variable memory and I/O workloads
– Energy efficiency
• Traditional approach: programmer-specified parallelization
– Imperative languages expose limited parallelism to exploit
– Error-prone concurrency primitives
– Energy efficiency with aggressive power management
Introduction
4. Model of Computation
• Synchronous Data Flow (SDF)
– Programs are described as directed graphs
• Nodes → processes with sequential code
• Edges → FIFO communication channels
– Pipeline and data parallelism are explicitly exposed
– The amount of data consumed/produced by a process is fixed and known
• Limited expressivity, which enables:
– Periodic static schedules
– Bounded memory requirements
– Characterizing the memory behavior of the entire program
5. Synchronous Data Flow
• Static schedules
– Specify the order and number of invocations of each process in one period
– Two phases: an initialization phase + a steady phase
– The existence of a static schedule can be determined from the production and consumption rates of all processes
• Sequential schedules
– Built by simulating the process executions iteratively and tracking the channel buffer sizes:
• A process is schedulable if there is sufficient data to consume
• The simulation continues until the initial buffer sizes are restored
– Memory requirements are determined during scheduling
• Parallel schedules
– Derived by partitioning sequential schedules
Model of Computation
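The two steps above can be sketched in a few lines of Python. This is our own illustration, not StreamIt's scheduler: it solves the balance equations for the repetition vector, then simulates firings to produce a sequential schedule and the peak buffer sizes. The three-actor graph and its rates are made up for the example.

```python
from fractions import Fraction
from math import lcm

# A made-up three-actor SDF graph: (producer, consumer, push, pop) rates.
EDGES = [("src", "fir", 2, 3), ("fir", "sink", 1, 2)]

def repetition_vector(edges):
    """Solve r[p] * push == r[c] * pop for the smallest integer solution."""
    r = {edges[0][0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for p, c, push, pop in edges:
            if p in r and c not in r:
                r[c] = r[p] * push / pop
                changed = True
            elif c in r and p not in r:
                r[p] = r[c] * pop / push
                changed = True
    scale = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * scale) for a, f in r.items()}

def sequential_schedule(edges, reps):
    """Fire any process with enough input until every process has reached
    its repetition count; track the peak channel buffer sizes."""
    actors = sorted(reps)                      # deterministic tie-breaking
    buf = {(p, c): 0 for p, c, _, _ in edges}
    peak = dict(buf)
    fired = {a: 0 for a in actors}
    schedule = []
    while any(fired[a] < reps[a] for a in actors):
        for a in actors:
            if fired[a] == reps[a]:
                continue
            # schedulable: every input channel holds enough tokens
            if all(buf[(p, c)] >= pop for p, c, _, pop in edges if c == a):
                for p, c, push, pop in edges:
                    if c == a:
                        buf[(p, c)] -= pop
                    if p == a:
                        buf[(p, c)] += push
                        peak[(p, c)] = max(peak[(p, c)], buf[(p, c)])
                fired[a] += 1
                schedule.append(a)
                break
        else:
            raise RuntimeError("inconsistent graph: no schedulable process")
    return schedule, peak

reps = repetition_vector(EDGES)      # {'src': 3, 'fir': 2, 'sink': 1}
sched, peak = sequential_schedule(EDGES, reps)
# One period: ['src', 'src', 'fir', 'src', 'fir', 'sink']; buffers return to 0.
```

The peak buffer sizes computed during the simulation are exactly the memory requirements mentioned on the slide.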
6. Synchronous Data Flow
• Potential memory inefficiency
– Per-channel buffer allocations and pass-by-value semantics
– Pass-by-reference only works on read-only chunks of data
• The aggregate update problem [Haven85]
• What if a process tries to insert new samples in the middle?
• Can we still prevent copying unchanged data portions by capturing the semantics of memory operations at compile time?
Model of Computation
7. MY RESEARCH
ESMS: Efficient Static Memory Analysis on Stream Programs
CSense: A Stream-Processing Toolkit for Mobile Sensing Applications
8. Memory Optimizations: ESMS
• StreamIt: an SDF language
• Per-channel allocation and pass-by-value semantics
→ a significant amount of memory usage and copying
• One single global allocation
– Reuses memory and reduces memory requirements
– Avoids unnecessary memory copies
My Research: ESMS
9. Static Analysis
• Component analysis on filter work functions in the logical space of each filter
– peek(i), pop(), push(v) are supposed to access contiguous memory allocations in FIFO channels
– Interprets each push(v) in a filter work function as
• PASS: moves an unchanged data sample between channels
• UPDATE: otherwise, a new value is pushed
→ Splitters, joiners, and reordering filters are pass-only

int->int filter appender() {
  work pop 4 push 8 {
    for (int i = 0; i < 4; i++) push(pop());      // PASS
    for (int i = 0; i < 4; i++) push(compute(i)); // UPDATE
  }
}

My Research: ESMS
10. Whole Program Analysis
• Relates logical positions to physical locations in the global allocation
– Remaps peek(i), pop(), push(v) to access possibly non-contiguous memory
– Live-range analysis by reference counting over one schedule period
• The layout starts with size zero and expands when necessary
• Each location represents a live variable with its live range
• The live range begins when the location receives a pushed value for the first time
– A splitter pushes a value multiple times for sharing
• The live range ends when its value is popped for the last time
• A location is free if its live variable is out of range
– Complete memory behaviors and a sound approximation
• No pointer aliasing
• Terminates in one schedule period
My Research: ESMS
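The reference-counting idea above can be sketched as a toy model. The `Layout` and `Location` names are ours, not ESMS's, and the sketch only models slot reuse in the global allocation, not the full remapping of peek/pop/push.

```python
# Toy model of a global layout with reference-counted live ranges.
class Location:
    def __init__(self):
        self.refs = 0              # outstanding pops for the current value

class Layout:
    def __init__(self):
        self.slots = []            # starts at size zero, grows on demand

    def alloc(self, refs):
        """Place a pushed value: reuse any free slot, else expand."""
        for i, slot in enumerate(self.slots):
            if slot.refs == 0:     # its live variable is out of range
                slot.refs = refs
                return i
        self.slots.append(Location())
        self.slots[-1].refs = refs
        return len(self.slots) - 1

    def pop(self, i):
        self.slots[i].refs -= 1    # the live range ends on the last pop

layout = Layout()
a = layout.alloc(refs=2)           # a splitter shares this value twice
b = layout.alloc(refs=1)
layout.pop(a)
layout.pop(b)
layout.pop(a)                      # last pop: slot a becomes free
c = layout.alloc(refs=1)           # reuses slot a instead of growing
```

After the three pops both slots are free, so the final allocation reuses slot `a` and the layout never grows past two locations.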
11. Layout Stitching
• Simulate one period of the static schedule
– case PASS: reuse memory locations in the layout
– case UPDATE: follow one of three strategies
• Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP)

Example 1 — input layout with 2 updates; D0 and D1 both out of range (MEM[location, live-range count]: variable):
  Input:    MEM[0, 0]: D0 | MEM[1, 0]: D1
  AA:       MEM[0, 0]: D0 | MEM[1, 0]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
  AoC & IP: MEM[0, 0]: I0 | MEM[1, 0]: I1

Example 2 — input layout with 2 updates; D1 still live:
  Input:    MEM[0, 0]: D0 | MEM[1, 1]: D1
  AA & AoC: MEM[0, 0]: D0 | MEM[1, 1]: D1 | MEM[2, 1]: I0 | MEM[3, 1]: I1
  IP:       MEM[0, 1]: I0 | MEM[1, 1]: I1 | MEM[2, 1]: D1

My Research: ESMS
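A minimal sketch of the three UPDATE strategies, modeling a layout as just a list of (variable, live) pairs. This is our reading of the slide's two examples, not ESMS code.

```python
# A layout is a list of (variable, live) pairs; `updates` are new values
# pushed by one filter invocation that need physical slots.
def always_append(layout, updates):
    return layout + [(u, True) for u in updates]

def append_on_conflict(layout, updates):
    # Reuse the leading slots only if all of them are already dead.
    if all(not live for _, live in layout[:len(updates)]):
        return [(u, True) for u in updates] + layout[len(updates):]
    return layout + [(u, True) for u in updates]

def insert_in_place(layout, updates):
    # Reuse each dead slot; insert (shifting the tail) when it is live.
    out = list(layout)
    for i, u in enumerate(updates):
        if i < len(out) and not out[i][1]:
            out[i] = (u, True)
        else:
            out.insert(i, (u, True))
    return out

# Example 1: D0 and D1 both out of range -> AoC and IP reuse both slots.
dead = [("D0", False), ("D1", False)]
assert append_on_conflict(dead, ["I0", "I1"]) == [("I0", True), ("I1", True)]
# Example 2: D1 still live -> AoC appends; IP reuses slot 0 and shifts D1.
mixed = [("D0", False), ("D1", True)]
assert insert_in_place(mixed, ["I0", "I1"]) == \
    [("I0", True), ("I1", True), ("D1", True)]
```

The asserts reproduce the two layout examples on the slide: IP reuses most aggressively, at the price of shifting live variables.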
12. Memory Usage Reductions
– ESMS reduces both channel buffer sizes and the number of memory operations for reordering and duplicating data streams (splitters, joiners, reordering filters)
– Code size: 73% reduction on average; data size: 45% to 96% reductions
My Research: ESMS
13. Speedup
– The average speedups of AA, AoC, and IP are 3, 3.1, and 3, while the average speedup of CacheOpt is merely 1.07
– ESMS improves performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set
My Research: ESMS
14. Mobile Sensing Applications: CSense
• Challenges
– Mobile sensing applications are difficult to implement on Android devices
• High frame rates
• Concurrency
• Robustness
• Energy efficiency
– Resource limitations and the Java VM worsen these problems
• Additional cost of virtualization
• Significant overhead of garbage collection
• CSense integrates SDF with dynamic scheduling
– Conditional dataflow paths by partitioning the SDF
– Asynchronous event processing, e.g., network access and UI
– Android-specific power management
My Research: CSense
15. Example Application
• Speaker Identifier
– Conditional dataflow paths result in SDF subgraphs
– Bounded memory requirements

// create components
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
...
// wire components
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...
My Research: CSense
16. Concurrency Model

getComponent("audio").setThreading(Threading.NEW_DOMAIN);
getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);

Compiler transformation
My Research: CSense
17. CSense Compiler
• Static analysis
– composition errors, memory usage errors, race conditions
• Flow analysis
– whole-application configuration and optimization
• Stream Flow Graph transformations
– domain partitioning, type conversions, MATLAB component coalescing
• Code generation
– Android application/service, MATLAB (C code + JNI stubs)
My Research: CSense
18. CSense Runtime
• Components exchange data using push/pull semantics
• The runtime includes a scheduler for each domain
– task queue + event queue
– wake lock for power management
[Figure: two domain schedulers, each with a task queue and an event queue, sharing a memory pool]
My Research: CSense
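A toy sketch of such a per-domain scheduler. This is our own Python stand-in, not the CSense runtime (which is Java on Android); the wake lock here is just a boolean flag standing in for Android's PowerManager wake lock.

```python
import queue

class DomainScheduler:
    """Task queue for component invocations, event queue for asynchronous
    completions, and a flag standing in for an Android wake lock."""
    def __init__(self):
        self.tasks = queue.Queue()
        self.events = queue.Queue()
        self.wake_lock_held = False

    def post(self, task):
        self.wake_lock_held = True        # stay awake while work is pending
        self.tasks.put(task)

    def signal(self, event):
        self.wake_lock_held = True
        self.events.put(event)

    def run_once(self):
        while not self.events.empty():    # drain asynchronous events first
            self.events.get()()
        if not self.tasks.empty():
            self.tasks.get()()            # run one component task in order
        if self.tasks.empty() and self.events.empty():
            self.wake_lock_held = False   # idle: let the device sleep

out = []
scheduler = DomainScheduler()
scheduler.post(lambda: out.append("vad"))
scheduler.run_once()                      # runs the task, releases the lock
```

Releasing the lock only when both queues are empty mirrors the slide's point: an idle scheduler must let Android's power management put the device to sleep.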
19. Producer-Consumer Throughput
• Garbage collection overhead limits scalability
• Concurrency primitives have a significant impact on performance
– Message pooling improves throughput by 13.8x; CSense spin locks add another 30% (up to 19x)
My Research: CSense
20. MY RESEARCH IN CONTEXT
More energy savings in the dataflow model
Dynamic optimizations in cloud stream processing
21. Energy Efficiency and Stream Processing
• Energy consumption in mobile sensing applications
– Energy bugs introduced by power management primitives
→ Program analysis on code paths and potential races
– Improper usage of I/O components elongates tail power states
→ Defer and batch I/O operations to execute in a short interval
– Intensive computations
→ Offload code to the cloud, AMP cores, or GPGPUs
Research of Interest: Energy Efficiency
22. DVFS: GreenStreams [Bartenstein2013]
• Inconsistent throughputs
– Critical path with bottleneck processes
• Dynamic Voltage and Frequency Scaling (DVFS)
– Dispatch partitions to cores at different frequencies
→ Save energy while maintaining performance
Research of Interest: Energy Efficiency
23. Cloud Stream Processing
• New challenges
– Changing input structure
• Tweets → mention graph → computations
• Distributed storage of the graph
• Consistent graph representation
– Communication overhead
• Conventional stream processing focuses on local computations and is less concerned with large-scale communication
– Changing performance criteria
• Decisions made statically cannot remain optimal the whole time
→ Better performance requires making decisions dynamically
– Fault tolerance
Research of Interest: Cloud Computing
24. Changing Graph: Kineograph
• Goal
– Compute timely properties on the changing graph input
• Challenges
– High rate of graph updates
– Consistent graph structure
– Static graph-mining algorithms
• Global progress-tracking protocol
– Graph updates are queued and progress is tracked in a global table
– Progress snapshots are taken and distributed to perform transactions of graph updates and the associated computations
• Pros
– Decouples graph updates from graph computations
• Cons
– Centralized progress tracking
– No analysis of updates that may cancel each other, or aggregation of potential communication propagation
Research of Interest: Cloud Computing
25. Timely Dataflow: Naiad
• Goal
– High-throughput, low-latency timely processing
• Challenges
– Communication overhead dominates local computing resources
– Massive communication causes media contention and high latency
• Efficient batching and localized communication
– Users decide whether to process input synchronously or asynchronously
– Distributed progress tracking based on partially ordered timestamps
• Pros
– Effective aggregation of communication
• Cons
– Flow control might be a concern for asynchronous delivery
Research of Interest: Cloud Computing
26. Execution Modes: PowerSwitch
• Goal
– Efficiently switch between sync and async modes for better performance and early termination
• Predicts the better execution mode online
• Pros
– Frees programmers from coding the execution modes explicitly
• Cons
– Separate snapshots and checkpointing under both modes
• Additional space usage
• Unclear performance impact
Research of Interest: Cloud Computing
27. Conclusions and Future Work
• Energy efficiency in stream processing
– Dataflow paths as code paths facilitate program analysis
– Graph manipulation for better aggregation of I/O component accesses
– Precise workload requests for computing resources
– Adapting to runtime information, e.g., user activity predictions
• Incorporating dynamic optimizations
– Static optimizations assume fixed resources
– Periodic reevaluation, e.g., execution mode switching
– Changing input structures and distributed state sharing
• Information flow paths that change less often, for aggregating communication
• Localized multicast for reconfiguration and termination in partial order
28. Conclusions and Future Work
• Potential limitation on stream processing optimizations
– Non-trivial transformation between SDF graphs at different granularities
[Figure: coarse-grained vs. fine-grained FFT stream graphs — FFTTestSource, FFTReorderSimple, CombineDFT, split/join, FloatPrinter nodes vs. FloatSource, Butterfly, BitReverse, FloatPrinter nodes, annotated with their rates]
29. Thank You — Any Questions?
Editor's Notes
Today, I'm going to present stream processing and its optimizations, ranging from memory and energy efficiency to new challenges at the scale of cloud computing.
Stream applications are a class of applications that process continuous input data streams and may produce continuous output streams.
Here is a sample application Speaker Identifier running on a mobile device.
It reads continuous audio samples from a microphone sensor, performs computationally intensive feature extraction and uploads to a remote server for identification.
This application is supposed to run in the background indefinitely.
This implies high performance is required for real-time processing.
Long-term efficient resource management is also important.
However, there are some challenges to stream processing we need to address.
First, multi-core architectures require programs to expose parallelism.
Second, high and variable memory and I/O workloads may incur significant runtime resource management overhead.
Third, energy efficiency is essential for long-term executions.
Traditional imperative languages such as C and Java, only expose limited parallelism to exploit.
Programmers need to manually code concurrency primitives such as threads and locks, which are error-prone.
On mobile devices, aggressive power management tends to burden programmers with error-prone APIs.
So, we are in search of models of computation that fully expose program parallelism.
Eventually, we found the synchronous data flow is promising.
In this model, programs are described as directed graphs.
The nodes represent processes with sequential code.
The edges represent FIFO communication channels.
It is straightforward to see the pipeline parallelism if there is a pipeline of processes.
And data parallelism if there is a branch in the graph.
Besides, a specific requirement of SDF is that the production and consumption rates of processes are fixed and known at compile-time.
In this way, SDF actually limits the expressivity of general dataflow models.
Such that periodic static schedules can be derived at compile-time
Memory requirements are bounded.
And even the memory behavior of the entire program can be characterized.
So, how does a static schedule look like?
It basically specifies the order and the number of invocations of a process in a period.
It consists of an optional initialization phase and a steady state that executes repetitively.
The good thing is that its existence can be determined based on the production and consumption rates of all processes.
Next, we derive such a schedule for a single processor.
The way is to simulate process executions iteratively and track the channel buffer sizes.
A process is schedulable if there is sufficient input to consume.
That means its invocation won't cause a negative buffer size.
If there are multiple schedulable processes, just randomly pick one.
The simulation continues until the same buffer size as the initial one is restored.
The maximum buffer size during the simulation is the memory requirement.
To adapt the schedule for multi-cores, simply partition it to run in parallel.
However, the semantics of the memory model may cause potential inefficiency.
It assumes per channel buffer allocations and pass-by-value semantics.
That means even when a data item passes through the channel without being modified, memory copies are necessary across channels.
How about pass-by-reference?
It basically works on read-only chunks of data.
What if a process tries to insert new samples in the middle?
Can we still prevent copying unchanged data portions?
This induces my research on the memory optimizations of stream programs.
A recent work ESMS: Efficient Static Memory Analysis on Stream Programs addresses this.
Another earlier developed stream-processing toolkit also takes care of the memory performance of mobile sensing applications.
I first introduce the memory optimization ESMS.
Itโs a compiler optimization based on a well-defined SDF language, StreamIt.
In StreamIt, stream is the first order construct which can be a filter representing a process in SDF.
Or one of the three hierarchical constructs.
A pipeline is a line of connected streams.
A splitjoin is a branch of streams that are joined later.
A feedbackloop is a loop processing structure which we may support in the near future.
StreamIt adopts per channel allocations and pass-by-value semantics.
That implies a significant amount of memory usage.
As you can see in splitjoins, the input needs to be copied to each of the output branches.
Our idea to improve this is using one single global allocation.
to reuse memory whenever possible and avoid unnecessary copies
The first step is to perform component analysis on filter work functions in the logical space.
The logical space is the positions referenced by peek(i), pop(), and push(v) to access contiguous memory in FIFO channels.
Their semantics are just as the name suggests.
There is no direct memory access to channels in StreamIt.
So, the goal is to interpret the memory operations of pushes.
It can be a PASS if the push moves a value from the input channel without being modified.
Otherwise, it is an UPDATE, meaning to push a new value.
Here is a sample filter work function.
As you can see, the first 4 pushes are interpreted as passes because the values are not modified.
The following 4 pushes are interpreted as updates because they push new values.
In this sense, the splitters and joiners of a splitjoin are pass-only.
Other filters that only reorder the input are also pass-only.
Apparently, the pass exposes the opportunity to reuse memory.
As for updates, whether they can reuse memory or not, we don't know at this time.
So we need to figure it out with a whole-program analysis.
The goal is to relate logical positions to physical locations in the global allocation.
This essentially means remapping peek(i), pop() and push(v) to access possibly non-contiguous memory.
It is based on a live-range analysis over one schedule period.
The mapping is stored in a layout which starts with size zero and expands when necessary.
Each location in the layout represents a live variable with its reference counted live range.
The live range begins when receiving a pushed value for the 1st time.
In this sense, a Splitjoin passes a value to multiple output channels with a live range greater than one, indicating the sharing.
The live range ends when its value is popped for the last time.
A location is free if its live variable is out of range.
In this way, we can derive complete memory behaviors and sound approximation because
There is no pointer aliasing and
The analysis will terminate in one schedule period.
Next, we simply simulate one schedule period to stitch the layout based on the live range information.
During the simulation, a memory location is reused on a PASS.
If it is an UPDATE, reuse the memory location only when it's free.
Otherwise, we need to find a free location using one of the three strategies.
Always-Append (AA), Append-on-Conflict (AoC), and Insert-in-Place (IP).
To understand the semantics, let's take a look at the examples.
Suppose the current input memory layout looks like this and there are two updates in the filter invocation.
The first number is the physical location and the 2nd number is the live range count.
D0 and D1 are the live variables.
In the first example, both live variables are out of range.
The AA strategy always appends at the end and expands the layout size.
The AoC strategy checks if all the updates can reuse.
The IP reuses whenever possible.
Clearly, both reuse the input layout in the 1st example.
In the 2nd example, D1 is still alive.
In this case, both AA and AoC append the updates at the end and expand the layout size.
The IP reuses the first location and inserts to expand the layout. D1 is shifted by one location.
So, in general, IP aggressively reuses the space.
However, a side effect is the potentially increasing code complexity due to channel access in loops.
Consider peek, pop and push in loops.
If the memory layout is contiguous, there is no problem.
Otherwise, additional code needs to be generated to ensure correct mappings across loop iterations.
Therefore, this optimization involves a trade-off between the allocation size and code complexity.
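The code-generation issue can be illustrated with a toy remapping table (our illustration, not ESMS's generated code): when the stitched layout is non-contiguous, each logical channel position goes through a physical-location table instead of plain offset arithmetic.

```python
MEM = [0] * 8                 # the single global allocation
phys = [5, 2, 3, 7]           # physical slot of each logical position 0..3

def push(log_pos, value):     # remapped channel write
    MEM[phys[log_pos]] = value

def peek(log_pos):            # remapped channel read
    return MEM[phys[log_pos]]

for i in range(4):
    push(i, i * i)
assert [peek(i) for i in range(4)] == [0, 1, 4, 9]
# With a contiguous layout the table disappears and accesses compile
# down to plain base + offset arithmetic.
```

The extra indirection is exactly the added code complexity the trade-off refers to: the more a strategy scatters a channel's values, the more such tables the generated loops need.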
Nonetheless, the ESMS evaluation shows significant memory usage reductions on both code size and data size for the 14 benchmarks.
We compare with baseline StreamIt and their cache optimization
which essentially increases the buffer size to execute a filter multiple times to exploit the instruction cache locality.
The right figure shows the expected reductions on the data size including filter states range from 45% to 96%.
The left figure shows the reductions on the code size are 73% on average.
This is because by capturing the memory behaviors, the memory copies by splitjoins and reordering filters are eliminated.
Hence, the overall memory usage is reduced significantly.
We also evaluate the speedup over the baseline StreamIt. The average is 3.
For some complex computations such as FFT, the speedup can be up to 7 or more.
So, instead of trading space for performance as the StreamIt cache optimization,
ESMS improves the performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set.
Besides the specific memory optimization, we also proposed a stream-processing toolkit called CSense
that allows developers to build full-fledged mobile sensing applications.
It addresses the following challenges, high frame rates, concurrency complexities, long-term robustness and energy efficiency.
It also prevents the overhead induced by the Java virtual machine.
Based on the SDF, CSense adopts dynamic scheduling to support conditional dataflows,
asynchronous event processing and the Android-specific power management.
A sample application, the Speaker Identifier, is composed as a pipeline.
The programmer simply creates and wires the components in the code fragment.
What is interesting is that the scheduling follows a depth first traversal of the stream flow graph.
Each component is allowed to direct the scheduling by pushing through its output ports.
In this app, the RMS classifier which detects if the input audio sample is voiced or not,
conditionally pushes the sample to either of its output ports.
This essentially partitions the SFG into SDF subgraphs
This implies the CSense compiler can derive the bounded memory requirements of the entire graph
To free programmers from the concurrency primitives,
CSense allows programmers to annotate concurrent I/O operations.
In this app, the audio and httpPost components are specified.
The specified components will be partitioned into different domains.
The other components just follow the domain of the upstream component.
Each domain is executed on a separate thread.
A synchronous queue is inserted to ensure safety for data exchange between domains.
In summary, the CSense compiler performs static analysis to prevent
composition errors due to incompatible types and
memory usage errors due to leaks,
and to identify race conditions for safety.
It also performs the flow analysis to decide the message buffer sizes between each pair of neighboring components
such that in one invocation the producer component can fill the buffer with all the consumer component needs to process
This simplifies the dynamic scheduling.
In the following, the CSense compiler performs graph transformations for
domain partitioning as mentioned before, type conversion and MATLAB component coalescing.
The CSense toolkit allows programmers to write components in MATLAB for ease of signal processing.
Finally, the CSense compiler generates a complete Android application and compiles MATLAB code if necessary.
After the application is installed on the target device, it is executed by the CSense runtime.
The runtime includes a scheduler for each domain.
A scheduler maintains a task queue, an event queue and an Android wakelock.
The task queue schedules components to execute in order.
The event queue allows to process asynchronous events.
The wakelock is associated with the Android power management.
Whenever no wakelock is acquired by any application, the Android device is soon put to sleep.
For our schedulers, if the task queue is empty, the scheduler determines whether to release the wakelock and goes to sleep.
Despite dynamic scheduling, CSense ensures high throughput by improving the memory and synchronization performance.
We evaluate the throughput of a producer-consumer benchmark.
The CSense memory management is based on message pooling, denoted by MP.
The CSense synchronization primitives are based on spin locks, denoted by C.
The baseline is the default Java garbage collection GC and reentrant lock L.
In this figure, the x-axis represents the production rate while the y-axis represents the consumption rate.
Ideally, both should agree.
However, GC overhead limits the scalability.
With MP, the performance is improved by 13.8X
With the CSense spinlocks, we got additional 30% improvement.
After exploring specific memory optimizations of stream programs and developing a stream processing toolkit for mobile sensing applications,
I would like to further investigate more energy savings specific to the dataflow model and dynamic optimizations in cloud stream processing.
We surveyed major energy consumption issues and the methodologies to address them.
First, the energy bugs introduced by power management primitives may be identified by program analysis on code paths and potential races.
This is because the power management primitives follow the concurrency primitives and cause similar bugs.
Next, improper usage of I/O components elongates tail power states.
The tail state is a state that consumes a significant amount of energy after performing a high power I/O operation.
Letโs look at this figure which shows the 3G module power states.
Normally, in the idle state, no significant energy consumption.
After the application makes connection and send requests, it goes to the high power state that consumes the most energy.
After that, this application computes something else while the 3G module goes to the tail state and still consumes a significant amount of energy.
Then, the application sends another message and causes the module to go to the high power state again.
Therefore, the energy saving strategy is to defer and batch I/O operations as much as possible to perform in a short interval.
Such that the tail energy consumption can be minimized.
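The defer-and-batch strategy can be sketched in a few lines (our illustration; `send` stands in for the real network call): queue uploads and flush them together so the radio pays the tail-energy cost once per batch instead of once per item.

```python
class BatchingUploader:
    def __init__(self, batch_size, send):
        self.batch_size = batch_size
        self.send = send       # stands in for the real network call
        self.pending = []

    def upload(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.send(list(self.pending))  # one radio wake-up for N items
            self.pending = []

sent = []
uploader = BatchingUploader(3, sent.append)
for sample in range(7):
    uploader.upload(sample)
uploader.flush()               # flush the remainder before idling
# sent == [[0, 1, 2], [3, 4, 5], [6]]
```

With batching, seven uploads trigger three radio wake-ups (and three tail periods) instead of seven.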
On the other hand, intensive computations may be unavoidable and consume a lot of energy.
Code offloading delegates the computation either to a remote server or to other CPU or GPU cores.
This approach needs to weigh the data transfer cost against local computations.
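That weighing can be sketched as a back-of-the-envelope break-even test (our model; the numbers are made up and real systems would also account for energy, not just time):

```python
def should_offload(data_bytes, bandwidth_bps, local_s, remote_s):
    """Offload only if transfer plus remote compute beats local compute."""
    transfer_s = data_bytes * 8 / bandwidth_bps
    return transfer_s + remote_s < local_s

# 1 MB over 8 Mbit/s costs 1 s of transfer time.
assert should_offload(1_000_000, 8_000_000, local_s=5.0, remote_s=0.5)
assert not should_offload(1_000_000, 8_000_000, local_s=1.2, remote_s=0.5)
```

The same inequality explains why offloading pays off for compute-heavy, data-light stages (feature extraction) but not for data-heavy, compute-light ones.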
As for the dataflow model of stream processing, there is an implicit energy consumption due to inconsistent real-time throughputs.
Here is the example. The stream flow graph is partitioned to run on 3 CPU-cores.
However, the partition AB is the bottleneck and slows down the entire throughput.
Apparently, there is no improvement no matter how fast partitions C and DE can run.
Instead, the core dispatched to partition AB should run at its highest frequency.
The other cores dispatched to C and DE can run at lower frequencies to save energy.
This is based on the dynamic voltage frequency scaling technique.
Here is the evaluation based on partitioning StreamIt programs.
The GreenStreams configuration sets the frequencies to match the partition throughputs.
Compared with the Fast configuration that sets the highest frequency for each CPU core,
It saves up to 28% energy while maintaining comparable performance.
This implies stream applications may save more energy by making precise requests to computing resources.
In addition to energy efficiency, stream processing at the scale of cloud computing is worth exploration
because of the following new challenges.
The first is about the changing input structure.
Suppose we want to compute the influence of users in a social network like Twitter.
The input is a continuous stream of tweets but the computation is based on the mention graph induced from the tweets.
The graph changes all the time and is so large that distributed storage is necessary.
The key is to track the progress of the structure updates in a distributed setting.
The second challenge is the communication overhead that contributes to significant latency.
The key is to aggregate and localize communications.
The third challenge is changing criteria.
That means the once optimal static decisions become suboptimal over time.
The key is to reevaluate when appropriate.
Last but not least is the fault-tolerance which is essential due to the distributed nature.
A related work that deals with changing graph input is Kineograph.
Its goal is to compute timely properties on a continuously changing graph.
The graph algorithm itself is static and operates on a consistent graph structure.
Since it would be complex to interleave graph updates with computations directly,
Kineograph proposes a global progress-tracking protocol:
graph updates are queued, and their progress is tracked in a global table.
Progress snapshots are taken periodically to apply the queued graph updates transactionally and run the associated computations.
The advantage of this approach is that it decouples graph updates from algorithm-specific computations.
Nonetheless, the centralized progress tracking may become a performance concern.
Moreover, there is no analysis of graph updates that may cancel each other out, nor of aggregating the communications that updates may propagate.
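The queue-then-snapshot pattern can be sketched as below. This is a heavily simplified single-process illustration of epoch-based progress tracking, not Kineograph's actual protocol; the node names and table layout are assumptions.

```python
# Sketch of epoch-based progress tracking: updates are queued, a global
# table records how far each ingest node has progressed, and a periodic
# snapshot applies all updates up to the agreed epoch before computing.

from collections import deque

update_queue = deque()
progress_table = {"ingest-0": 0}   # global table: ingest node -> last seq no
graph = {}                          # adjacency: node -> set of neighbors

def enqueue_update(node, edge, seq):
    """Queue an edge insertion and record its sequence number."""
    update_queue.append((edge, seq))
    progress_table[node] = seq

def take_snapshot():
    """Apply every queued update up to the globally agreed epoch."""
    epoch = min(progress_table.values())
    while update_queue and update_queue[0][1] <= epoch:
        (src, dst), _ = update_queue.popleft()
        graph.setdefault(src, set()).add(dst)
    return epoch  # computations now run on this consistent snapshot

enqueue_update("ingest-0", ("a", "b"), 1)
enqueue_update("ingest-0", ("b", "c"), 2)
take_snapshot()  # the graph now reflects updates 1 and 2
```

The point of the structure is visible even in this toy version: the computation only ever sees the graph as of a snapshot epoch, never a half-applied update.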
In contrast with Kineograph, Naiad focuses on reducing communication overhead and avoiding massive medium contention.
To this end, Naiad provides synchronous notifications and asynchronous message delivery,
which facilitates efficient batching.
Naiad also proposes distributed progress tracking that notifies only the relevant processes;
there is no broadcast in Naiad systems.
Therefore, Naiad effectively aggregates communications.
On the other hand,
flow control may be a concern for asynchronous delivery,
and such fine-grained message control may not be easy to use when building large stream applications.
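The two primitives can be illustrated with a toy channel. This is a simplification I am sketching under stated assumptions, not Naiad's API: the class name, batch size, and delivery representation are all illustrative.

```python
# Sketch of the two communication primitives: asynchronous messages are
# buffered and flushed in batches, while a synchronous notification signals
# that no more messages for a given logical time will arrive.

class Channel:
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = []      # pending asynchronous messages
        self.delivered = []   # batched deliveries seen by the receiver

    def send_async(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.delivered.append(list(self.buffer))  # one batched delivery
            self.buffer.clear()

    def notify(self, time):
        # Synchronous notification: flush pending messages, then signal
        # that logical time `time` is complete for the receiver.
        self.flush()
        self.delivered.append(("notify", time))

ch = Channel()
for i in range(4):
    ch.send_async(i)
ch.notify(0)
# delivered: [[0, 1, 2], [3], ("notify", 0)]
```

Batching the asynchronous traffic while keeping the notification synchronous is what lets the receiver know exactly when a logical time is finished without a broadcast.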
Finally, consider deciding the execution mode for algorithms such as PageRank.
In general, there are two execution modes.
In the synchronous mode, all processes are scheduled in iterations, and the input data must be ready for every process at the beginning of an iteration.
In the asynchronous mode, each process may execute as soon as its input is ready.
It is not trivial to make an optimal choice from static information alone, because the relative performance of the two modes may change across execution stages.
Therefore, PowerSwitch predicts the throughput of both the current mode and the alternate mode online.
Based on the predictions, PowerSwitch changes the execution mode whenever doing so is expected to improve performance.
This frees programmers from coding the execution modes explicitly.
However, keeping separate snapshots and checkpoints for both modes uses more space,
and the performance impact of doing so is unclear.
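The switching decision can be sketched as a simple comparison with a safety margin. The predictor inputs and the margin value are stand-ins I am assuming for illustration; the real system builds throughput models for both modes from sampled statistics.

```python
# Sketch of online mode switching: every epoch, compare the predicted
# throughput of the alternate mode against the current one, and switch
# only when the gain exceeds a margin (to amortize switching overhead).

def choose_mode(current, predict, margin=0.05):
    """predict: mode -> predicted throughput in items/sec."""
    alternate = "async" if current == "sync" else "sync"
    if predict[alternate] > predict[current] * (1 + margin):
        return alternate  # switching is predicted to pay off
    return current        # otherwise stay to avoid switching overhead

mode = "sync"
mode = choose_mode(mode, {"sync": 900.0, "async": 1200.0})
# the predicted gain exceeds the 5% margin, so execution switches to async
```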
So far, we have gone through the stream processing model,
memory optimizations,
potential improvements in energy efficiency, and
opportunities for dynamic optimizations.
I believe it is promising to further improve energy efficiency in stream processing.
First, dataflow paths, as code paths, should facilitate program analysis.
Second, graph manipulation allows better aggregation of I/O component accesses.
Third, it is possible to make precise workload requests for computing resources.
Fourth, it is necessary to adapt to runtime information such as user activity predictions.
For example, a stream program should batch content in proportion to the user's changing interest.
Finally, at the scale of cloud computing, it is essential to incorporate dynamic optimizations.
There are several levels of dynamism.
First, static optimizations based on fixed resources are still necessary.
Second, some decisions, such as the switching of execution modes, need periodic reevaluation.
Ultimately, the input data structure is stored in distributed storage and changes over time.
We may try to maintain a relatively stable information flow path for aggregating communications,
and utilize localized multicast in some partial order for notifying reconfiguration and termination.
In closing, we found that static stream-processing optimizations may be limited to one aspect or to local features of the stream graph.
To the best of our knowledge, none of the existing optimizations can transform a coarse-grained FFT into an equivalent fine-grained FFT.
We believe this deserves further investigation.