SlideShare a Scribd company logo
1 of 29
University of Iowa | Mobile Sensing Laboratory
High Performance Stream
Processing and Optimizations
May 8, 2015
Farley Lai
Advisor: Octav Chipara
Department of Computer Science
University of Iowa | Mobile Sensing Laboratory
โ€ข A class of applications that process continuous input data
streams and may produce continuous output streams
โ€“ High performance for real-time processing
โ€“ Long-term efficient resource management
Stream Applications
2
Speaker
Models
Speech
Recording
VAD
Feature
Extraction
HTTP
Upload
Speaker Identifier
Develop compiler optimizations and efficient runtime
environments for scalable streaming systems
Introduction
University of Iowa | Mobile Sensing Laboratory
โ€ข Challenges
โ€“ Multi-core architectures
โ€“ High and variable memory and I/O workloads
โ€“ Energy efficiency
โ€ข Traditional approach: programmer specified parallelization
โ€“ Imperative languages expose limited parallelism to exploit
โ€“ Error-prone concurrency primitives
โ€“ Energy efficiency with aggressive power management
Stream Processing on Modern Architectures
3
Introduction
University of Iowa | Mobile Sensing Laboratory
โ€ข Synchronous Data Flow (SDF)
โ€“ Programs are described as directed graphs
โ€ข Nodes ๏ƒจ processes with sequential code
โ€ข Edges ๏ƒจ FIFO communication channels
โ€“ Pipeline and data parallelism are explicitly exposed
โ€“ Amount of data consumed/produced by a process is fixed &
known
โ€ข Limited expressivity
โ€“ Periodic static schedules
โ€“ Memory requirements are bounded
โ€“ Memory behavior of entire program may be characterized
Model of Computation
4
University of Iowa | Mobile Sensing Laboratory
โ€ข Static schedules:
โ€“ Order and num. invocations of a process in one period
โ€“ Two phases: initialization phase + steady phase
โ€“ Existence of a static schedule can be determined based on the
production and consumption rates of all processes
โ€ข Sequential schedules
โ€“ Built by simulating the process executions iteratively and
tracking the channel buffer sizes:
โ€ข A process is schedulable if there is sufficient data to consume
โ€ข The simulation continue until the initial buffer size is restored
โ€“ Memory requirements are determined during scheduling
โ€ข Parallel schedules
โ€“ derived by partitioning sequential schedules
Synchronous Data Flow
5
Model of Computation
University of Iowa | Mobile Sensing Laboratory
โ€ข Potential memory inefficiency
โ€“ Per channel buffer allocations and pass-by-value semantics
โ€“ Pass-by-reference only works on read-only chunks of data
โ€ข Aggregate update problem [Haven85]
โ€ข What if a process tries to insert new samples in the middle?
โ€ข Can we still prevent copying unchanged data portions by capturing
the semantics of memory operations at compile-time?
Synchronous Data Flow
6
Model of Computation
MY RESEARCH
ESMS: Efficient Static Memory Analysis on Stream Programs
CSense: A Stream-Processing Toolkit for Mobile Sensing Applications
7
University of Iowa | Mobile Sensing Laboratory
โ€ข StreamIt: a SDF language
โ€ข Per channel allocation and pass-by-value semantics
๏ƒจ significant amount of memory usage and copies
โ€ข One single global allocation
โ€“ Reuses memory and reduce memory requirements
โ€“ Avoids unnecessary memory copies
Memory Optimizations: ESMS
8
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€ข Component analysis on filter work functions in the logical
space for each filter
โ€“ peek(i), pop(), push(v) are supposed to access
contiguous memory allocations in FIFO channels
โ€“ Interprets push(v) in filter work functions as
โ€ข PASS: moves a unchanged data sample between channels
โ€ข UPDATE: otherwise, a new value is pushed
๏ƒจ Splitters, joiners and reordering filters are pass-only
Static Analysis
9
int->int filter appender() {
work pop 4 push 8{
for(int i=0; i<4; i++) push(pop()); // pass
for(int i=0; i<4; i++) push(compute(i)); // update
}
}
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€ข Relate logical positions to physical locations in the global
allocation
โ€“ Remaps peek(i), pop(), push(v) to access possibly
non-contiguous memory
โ€“ Live range analysis by reference counting in a schedule period
โ€ข Layout starts with size zero and expands when necessary
โ€ข Each location represents a live variable with its live range
โ€ข The live range begins when receiving the 1st time pushed value
โ€“ A splitter pushes a value multiple times for sharing
โ€ข The live range ends when it value is last time popped
โ€ข A location is free if its live variable is out of range
โ€“ Complete memory behaviors and sound approximation
โ€ข No pointer aliasing
โ€ข Terminates in one schedule period
Whole Program Analysis
10
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€ข Simulate one period of the static schedule
โ€“ case PASS: reuse memory locations in the layout
โ€“ case UPDATE: follow one of the three strategies
โ€ข Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP)
Layout Stitching
11
MEM Layout
MEM[0, 0]: D0
MEM[1, 0]: D1
MEM Layout
MEM[0, 0]: I0
MEM[1, 0]: I1
MEM Layout
MEM[0, 0]: D0
MEM[1, 0]: D1
MEM[2, 1]: I0
MEM[3, 1]: I1
AA AoC & IPInput (2 updates)
MEM Layout
MEM[0, 0]: D0
MEM[1, 1]: D1
MEM Layout
MEM[0, 1]: I0
MEM[1, 1]: I1
MEM[2, 1]: D1
MEM Layout
MEM[0, 0]: D0
MEM[1, 1]: D1
MEM[2, 1]: I0
MEM[3, 1]: I1
AA & AoC IPInput (2 updates)
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€“ ESMS reduces both channel buffer sizes and the number
memory operations for reordering and duplicating data streams.
(splitters, joiners, reordering filters)
Memory Usage Reductions
12
45% to 96% reductions73% reductions on average
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€“ The average speedup of AA, AoC, and IP are 3, 3.1, and 3 while the average
speedup of CacheOpt is merely 1.07.
โ€“ ESMS improves the performance by eliminating unnecessary memory
operations and fits in the cache with a smaller working set.
Speedup
13
My Research: ESMS
University of Iowa | Mobile Sensing Laboratory
โ€ข Challenges
โ€“ Mobile sensing applications are difficult to implement on Android
devices
โ€ข High frame rates
โ€ข Concurrency
โ€ข Robustness
โ€ข Energy efficiency
โ€“ Resource limitations and Java VM worsen these problems
โ€ข Additional cost of virtualization
โ€ข Significant overhead of garbage collection
โ€ข Integrates SDF with dynamic scheduling
โ€“ Conditional dataflow paths by partitioning the SDF
โ€“ Asynchronous event processing, i.e., network access and UI
โ€“ Android-specific power management
Mobile Sensing Applications: CSense
14
My Research: CSense
University of Iowa | Mobile Sensing Laboratory
โ€ข Speaker Identifier
โ€“ Conditional dataflow paths result in SDF subgraphs
โ€“ Bounded memory requirements
Example Application
15
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT))
...
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...
create
components
wire
components
My Research: CSense
University of Iowa | Mobile Sensing Laboratory
Concurrency Model
16
My Research: CSense
getComponent("audio").setThreading(Threading.NEW_DOMAIN);
getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);
Compiler transformation
University of Iowa | Mobile Sensing Laboratory
โ€ข Static analysis
โ€“ composition errors, memory usage errors, race conditions
โ€ข Flow analysis
โ€“ whole-application configuration and optimization
โ€ข Stream Flow Graph transformations
โ€“ domain partitioning, type conversions, MATLAB component
coalescing
โ€ข Code generation
โ€“ Android application/service, MATLAB (C code + JNI stubs)
CSense Compiler
17
My Research: CSense
University of Iowa | Mobile Sensing Laboratory
โ€ข Components exchange data using push/pull semantics
โ€ข Runtime includes a scheduler for each domain
โ€“ task queue + event queue
โ€“ wake lock โ€“ for power management
CSense Runtime
18
Scheduler1Task Queue
Event Queue
Scheduler2 Task Queue
Event Queue
Memory Pool
My Research: CSense
University of Iowa | Mobile Sensing Laboratory
โ€ข Garbage collection overhead limits scalability
โ€ข Concurrency primitives have a significant impact on performance
Producer-Consumer Throughput
19
My Research: CSense
30%
13.8x
19x
MY RESEARCH IN CONTEXT
More energy savings in the dataflow model
Dynamic optimizations in cloud stream processing
20
University of Iowa | Mobile Sensing Laboratory
โ€ข Energy consumption in mobile sensing applications
โ€“ Energy bugs introduced by power management primitives
๏ƒจ Program analysis on code paths and potential races
โ€“ Improper usage of I/O components elongates tail power states
๏ƒจ Defer and batch I/O operations to execute in a short interval
โ€“ Intensive computations
๏ƒจ Code offloading to cloud, AMP cores or GPGPU
Energy Efficiency and Stream Processing
21
Research of Interest: Energy Efficiency
University of Iowa | Mobile Sensing Laboratory
โ€ข Inconsistent throughputs
โ€“ Critical path with bottleneck
processes
โ€ข Dynamic Voltage Frequency
Scaling (DVFS)
โ€“ Dispatch partitions to cores
at different frequencies
๏ƒจ Save energy while maintaining
performance
DVFS: GreenStreams
22
Research of Interest: Energy Efficiency
[Bartenstein2013]
University of Iowa | Mobile Sensing Laboratory
โ€ข New challenges
โ€“ Changing input structure
โ€ข Tweets ๏ƒจ mention graph ๏ƒจ computations
โ€ข Distributed storage of the graph
โ€ข Consistent graph representation
โ€“ Communication overhead
โ€ข Conventional stream processing less concerns large scale
communications but focuses on local computations
โ€“ Changing performance criteria
โ€ข Statically made decisions cannot be optimal the whole time
๏ƒจBetter performance requires to make decisions dynamically
โ€“ Fault-tolerance
Cloud Stream Processing
23
Research of Interest: Cloud Computing
University of Iowa | Mobile Sensing Laboratory
โ€ข Goal
โ€“ Compute timely properties on the changing graph input
โ€ข Challenges
โ€“ High rate of graph updates
โ€“ Consistent graph structure
โ€“ Static graph mining algorithms
โ€ข Global progress tracking protocol
โ€“ Graph updates are queued and progress is tracked in a global table
โ€“ Progress snapshots are taken and distributed to perform transactions of
graph updates and associated computations
โ€ข Pros
โ€“ Decouples graph updates from graph computations
โ€ข Cons
โ€“ Centralized progress tracking
โ€“ No analysis on updates that may cancel each other and aggregation of
potential propagation of communications
Changing Graph: Kineograph
24
Research of Interest: Cloud Computing
University of Iowa | Mobile Sensing Laboratory
โ€ข Goal
โ€“ High throughput timely and low latency processing
โ€ข Challenges
โ€“ Communication overhead dominates local computing resources
โ€“ Massive communications cause media contentions and high latency
โ€ข Efficient batching and localized communications
โ€“ User decide to process input synchronously or asynchronously
โ€“ Distributed progress tracking based on partially ordered timestamps
โ€ข Pros
โ€“ Effective aggregation of communications
โ€ข Cons
โ€“ Flow control might be a concern for asynchronous delivery
Timely Dataflow: Naiad
25
Research of Interest: Cloud Computing
University of Iowa | Mobile Sensing Laboratory
โ€ข Goal
โ€“ Efficiently switching between Sync and Async modes for better
performance and early termination
โ€ข Predict the next better execution mode online
โ€ข Pros
โ€“ Frees programmers from coding the execution modes explicitly
โ€ข Cons
โ€“ Separate snapshots and checkpointing under both modes
โ€ข Additional space usages
โ€ข Unclear performance impact
Execution Modes: PowerSwitch
26
Research of Interest: Cloud Computing
University of Iowa | Mobile Sensing Laboratory
โ€ข Energy efficiency in stream processing
โ€“ Dataflow paths as code paths facilitate program analysis
โ€“ Graph manipulation for better I/O component access aggregations
โ€“ Precise workload requests for computing resources
โ€“ Adapting to runtime information e.g., user activity predictions
โ€ข Incorporate dynamic optimizations
โ€“ Static optimizations based on fixed resources
โ€“ Reevaluations once in a while, e.g., execution mode switching
โ€“ Changing input structures and distributed state sharing
โ€ข Less changing information flow paths for aggregating communications
โ€ข Localized multicast for reconfiguration and termination in partial order
Conclusions and Future Work
27
University of Iowa | Mobile Sensing Laboratory
โ€ข Potential limitation on stream processing optimizations
โ€“ Non-trivial transformation between SDF graphs at different
granularities
Conclusions and Future Work
28
FFTTestSource0 split08,16,16
FFTReorderSimple0
8,8,8
FFTReorderSimple1
8,8,8
CombineDFT0
8,4,4
CombineDFT28,4,4
CombineDFT1
4,8,8
join0
8,8,8
FloatPrinter0
16,1,1
CombineDFT34,8,8
8,8,8
FloatSource0 split0
2,4,4
Butter๏ฌ‚y0
2,4,4
Butter๏ฌ‚y1
2,4,4
join0
4,2,2
4,2,2
split1
4,8,8
Butter๏ฌ‚y2
4,4,4
Butter๏ฌ‚y3
4,4,4
join1
4,4,4
4,4,4
BitReverse0
8,8,8
FloatPrinter0
8,2,2
Coarse-grained FFT
Fine-grained FFT
University of Iowa | Mobile Sensing Laboratory
Any Questions?
Thank You
29

More Related Content

Similar to High Performance Stream Processing and Optimizations

CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...
CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...
CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...Farley Lai
ย 
A methodology for full system power modeling in heterogeneous data centers
A methodology for full system power modeling in  heterogeneous data centersA methodology for full system power modeling in  heterogeneous data centers
A methodology for full system power modeling in heterogeneous data centersRaimon Bosch
ย 
Scientific
Scientific Scientific
Scientific marpierc
ย 
QoE-Aware Traffic Steering using OpenFlow
QoE-Aware Traffic Steering using OpenFlowQoE-Aware Traffic Steering using OpenFlow
QoE-Aware Traffic Steering using OpenFlowUS-Ignite
ย 
Improving Resource Utilization in Cloud using Application Placement Heuristics
Improving Resource Utilization in Cloud using Application Placement HeuristicsImproving Resource Utilization in Cloud using Application Placement Heuristics
Improving Resource Utilization in Cloud using Application Placement HeuristicsAtakanAral
ย 
Presentation
PresentationPresentation
Presentationbutest
ย 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralPaolo Missier
ย 
Automated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise ApplicationsAutomated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise ApplicationsSAIL_QU
ย 
071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephenSteve Feldman
ย 
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...Chih-Chuan Cheng
ย 
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Joel Saltz
ย 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsCarole Goble
ย 
2015 04-15 research seminar
2015 04-15 research seminar2015 04-15 research seminar
2015 04-15 research seminarifi8106tlu
ย 
Semantics in Sensor Networks
Semantics in Sensor NetworksSemantics in Sensor Networks
Semantics in Sensor NetworksOscar Corcho
ย 
Practical Considerations for Deploying a Java Active Networking Platform
Practical Considerations for Deploying a Java Active Networking PlatformPractical Considerations for Deploying a Java Active Networking Platform
Practical Considerations for Deploying a Java Active Networking PlatformTal Lavian Ph.D.
ย 
Exascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataExascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataJoel Saltz
ย 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017Nisha Talagala
ย 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2Mohit Garg
ย 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated OverviewRalph Castain
ย 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
ย 

Similar to High Performance Stream Processing and Optimizations (20)

CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...
CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...
CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing A...
ย 
A methodology for full system power modeling in heterogeneous data centers
A methodology for full system power modeling in  heterogeneous data centersA methodology for full system power modeling in  heterogeneous data centers
A methodology for full system power modeling in heterogeneous data centers
ย 
Scientific
Scientific Scientific
Scientific
ย 
QoE-Aware Traffic Steering using OpenFlow
QoE-Aware Traffic Steering using OpenFlowQoE-Aware Traffic Steering using OpenFlow
QoE-Aware Traffic Steering using OpenFlow
ย 
Improving Resource Utilization in Cloud using Application Placement Heuristics
Improving Resource Utilization in Cloud using Application Placement HeuristicsImproving Resource Utilization in Cloud using Application Placement Heuristics
Improving Resource Utilization in Cloud using Application Placement Heuristics
ย 
Presentation
PresentationPresentation
Presentation
ย 
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science CentralCloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
Cloud e-Genome: NGS Workflows on the Cloud Using e-Science Central
ย 
Automated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise ApplicationsAutomated Discovery of Performance Regressions in Enterprise Applications
Automated Discovery of Performance Regressions in Enterprise Applications
ย 
071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen071310 sun d_0930_feldman_stephen
071310 sun d_0930_feldman_stephen
ย 
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...
Extend Your Journey: Considering Signal Strength and Fluctuation in Location-...
ย 
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
Exascale Challenges: Space, Time, Experimental Science and Self Driving Cars
ย 
Advances in Scientific Workflow Environments
Advances in Scientific Workflow EnvironmentsAdvances in Scientific Workflow Environments
Advances in Scientific Workflow Environments
ย 
2015 04-15 research seminar
2015 04-15 research seminar2015 04-15 research seminar
2015 04-15 research seminar
ย 
Semantics in Sensor Networks
Semantics in Sensor NetworksSemantics in Sensor Networks
Semantics in Sensor Networks
ย 
Practical Considerations for Deploying a Java Active Networking Platform
Practical Considerations for Deploying a Java Active Networking PlatformPractical Considerations for Deploying a Java Active Networking Platform
Practical Considerations for Deploying a Java Active Networking Platform
ย 
Exascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor DataExascale Computing and Experimental Sensor Data
Exascale Computing and Experimental Sensor Data
ย 
Parallel machines flinkforward2017
Parallel machines flinkforward2017Parallel machines flinkforward2017
Parallel machines flinkforward2017
ย 
Big learning 1.2
Big learning   1.2Big learning   1.2
Big learning 1.2
ย 
PMIx Updated Overview
PMIx Updated OverviewPMIx Updated Overview
PMIx Updated Overview
ย 
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...
ย 

Recently uploaded

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
ย 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
ย 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
ย 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
ย 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
ย 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfWilly Marroquin (WillyDevNET)
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE9953056974 Low Rate Call Girls In Saket, Delhi NCR
ย 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto Gonzรกlez Trastoy
ย 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
ย 
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธcall girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธDelhi Call girls
ย 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
ย 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
ย 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
ย 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
ย 
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...Steffen Staab
ย 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
ย 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
ย 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
ย 

Recently uploaded (20)

Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
ย 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
ย 
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female serviceCALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
CALL ON โžฅ8923113531 ๐Ÿ”Call Girls Badshah Nagar Lucknow best Female service
ย 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
ย 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
ย 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
ย 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
ย 
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )๐Ÿ” 9953056974๐Ÿ”(=)/CALL GIRLS SERVICE
ย 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
ย 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
ย 
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธcall girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
call girls in Vaishali (Ghaziabad) ๐Ÿ” >เผ’8448380779 ๐Ÿ” genuine Escort Service ๐Ÿ”โœ”๏ธโœ”๏ธ
ย 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
ย 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
ย 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
ย 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
ย 
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spacesย - and Epistemic Querying of RDF-...
ย 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
ย 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
ย 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
ย 
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS LiveVip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida โžก๏ธ Delhi โžก๏ธ 9999965857 No Advance 24HRS Live
ย 

High Performance Stream Processing and Optimizations

  • 1. University of Iowa | Mobile Sensing Laboratory High Performance Stream Processing and Optimizations May 8, 2015 Farley Lai Advisor: Octav Chipara Department of Computer Science
  • 2. University of Iowa | Mobile Sensing Laboratory โ€ข A class of applications that process continuous input data streams and may produce continuous output streams โ€“ High performance for real-time processing โ€“ Long-term efficient resource management Stream Applications 2 Speaker Models Speech Recording VAD Feature Extraction HTTP Upload Speaker Identifier Develop compiler optimizations and efficient runtime environments for scalable streaming systems Introduction
  • 3. University of Iowa | Mobile Sensing Laboratory โ€ข Challenges โ€“ Multi-core architectures โ€“ High and variable memory and I/O workloads โ€“ Energy efficiency โ€ข Traditional approach: programmer specified parallelization โ€“ Imperative languages expose limited parallelism to exploit โ€“ Error-prone concurrency primitives โ€“ Energy efficiency with aggressive power management Stream Processing on Modern Architectures 3 Introduction
  • 4. University of Iowa | Mobile Sensing Laboratory โ€ข Synchronous Data Flow (SDF) โ€“ Programs are described as directed graphs โ€ข Nodes ๏ƒจ processes with sequential code โ€ข Edges ๏ƒจ FIFO communication channels โ€“ Pipeline and data parallelism are explicitly exposed โ€“ Amount of data consumed/produced by a process is fixed & known โ€ข Limited expressivity โ€“ Periodic static schedules โ€“ Memory requirements are bounded โ€“ Memory behavior of entire program may be characterized Model of Computation 4
  • 5. University of Iowa | Mobile Sensing Laboratory โ€ข Static schedules: โ€“ Order and num. invocations of a process in one period โ€“ Two phases: initialization phase + steady phase โ€“ Existence of a static schedule can be determined based on the production and consumption rates of all processes โ€ข Sequential schedules โ€“ Built by simulating the process executions iteratively and tracking the channel buffer sizes: โ€ข A process is schedulable if there is sufficient data to consume โ€ข The simulation continue until the initial buffer size is restored โ€“ Memory requirements are determined during scheduling โ€ข Parallel schedules โ€“ derived by partitioning sequential schedules Synchronous Data Flow 5 Model of Computation
  • 6. University of Iowa | Mobile Sensing Laboratory โ€ข Potential memory inefficiency โ€“ Per channel buffer allocations and pass-by-value semantics โ€“ Pass-by-reference only works on read-only chunks of data โ€ข Aggregate update problem [Haven85] โ€ข What if a process tries to insert new samples in the middle? โ€ข Can we still prevent copying unchanged data portions by capturing the semantics of memory operations at compile-time? Synchronous Data Flow 6 Model of Computation
  • 7. MY RESEARCH ESMS: Efficient Static Memory Analysis on Stream Programs CSense: A Stream-Processing Toolkit for Mobile Sensing Applications 7
  • 8. University of Iowa | Mobile Sensing Laboratory โ€ข StreamIt: a SDF language โ€ข Per channel allocation and pass-by-value semantics ๏ƒจ significant amount of memory usage and copies โ€ข One single global allocation โ€“ Reuses memory and reduce memory requirements โ€“ Avoids unnecessary memory copies Memory Optimizations: ESMS 8 My Research: ESMS
  • 9. University of Iowa | Mobile Sensing Laboratory โ€ข Component analysis on filter work functions in the logical space for each filter โ€“ peek(i), pop(), push(v) are supposed to access contiguous memory allocations in FIFO channels โ€“ Interprets push(v) in filter work functions as โ€ข PASS: moves a unchanged data sample between channels โ€ข UPDATE: otherwise, a new value is pushed ๏ƒจ Splitters, joiners and reordering filters are pass-only Static Analysis 9 int->int filter appender() { work pop 4 push 8{ for(int i=0; i<4; i++) push(pop()); // pass for(int i=0; i<4; i++) push(compute(i)); // update } } My Research: ESMS
  • 10. University of Iowa | Mobile Sensing Laboratory โ€ข Relate logical positions to physical locations in the global allocation โ€“ Remaps peek(i), pop(), push(v) to access possibly non-contiguous memory โ€“ Live range analysis by reference counting in a schedule period โ€ข Layout starts with size zero and expands when necessary โ€ข Each location represents a live variable with its live range โ€ข The live range begins when receiving the 1st time pushed value โ€“ A splitter pushes a value multiple times for sharing โ€ข The live range ends when it value is last time popped โ€ข A location is free if its live variable is out of range โ€“ Complete memory behaviors and sound approximation โ€ข No pointer aliasing โ€ข Terminates in one schedule period Whole Program Analysis 10 My Research: ESMS
  • 11. University of Iowa | Mobile Sensing Laboratory โ€ข Simulate one period of the static schedule โ€“ case PASS: reuse memory locations in the layout โ€“ case UPDATE: follow one of the three strategies โ€ข Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP) Layout Stitching 11 MEM Layout MEM[0, 0]: D0 MEM[1, 0]: D1 MEM Layout MEM[0, 0]: I0 MEM[1, 0]: I1 MEM Layout MEM[0, 0]: D0 MEM[1, 0]: D1 MEM[2, 1]: I0 MEM[3, 1]: I1 AA AoC & IPInput (2 updates) MEM Layout MEM[0, 0]: D0 MEM[1, 1]: D1 MEM Layout MEM[0, 1]: I0 MEM[1, 1]: I1 MEM[2, 1]: D1 MEM Layout MEM[0, 0]: D0 MEM[1, 1]: D1 MEM[2, 1]: I0 MEM[3, 1]: I1 AA & AoC IPInput (2 updates) My Research: ESMS
  • 12. University of Iowa | Mobile Sensing Laboratory โ€“ ESMS reduces both channel buffer sizes and the number memory operations for reordering and duplicating data streams. (splitters, joiners, reordering filters) Memory Usage Reductions 12 45% to 96% reductions73% reductions on average My Research: ESMS
  • 13. University of Iowa | Mobile Sensing Laboratory โ€“ The average speedup of AA, AoC, and IP are 3, 3.1, and 3 while the average speedup of CacheOpt is merely 1.07. โ€“ ESMS improves the performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set. Speedup 13 My Research: ESMS
  • 14. University of Iowa | Mobile Sensing Laboratory โ€ข Challenges โ€“ Mobile sensing applications are difficult to implement on Android devices โ€ข High frame rates โ€ข Concurrency โ€ข Robustness โ€ข Energy efficiency โ€“ Resource limitations and Java VM worsen these problems โ€ข Additional cost of virtualization โ€ข Significant overhead of garbage collection โ€ข Integrates SDF with dynamic scheduling โ€“ Conditional dataflow paths by partitioning the SDF โ€“ Asynchronous event processing, i.e., network access and UI โ€“ Android-specific power management Mobile Sensing Applications: CSense 14 My Research: CSense
  • 15. University of Iowa | Mobile Sensing Laboratory โ€ข Speaker Identifier โ€“ Conditional dataflow paths result in SDF subgraphs โ€“ Bounded memory requirements Example Application 15 addComponent("audio", new AudioComponentC(rateInHz, 16)); addComponent("rmsClassifier", new RMSClassifierC(rms)); addComponent("mfcc", new MFCCFeaturesG(speechT, featureT)) ... link("audio", "rmsClassifier"); toTap("rmsClassifier::below"); link("rmsClassifier::above", "mfcc::sin"); fromMemory("mfcc::fin"); ... create components wire components My Research: CSense
  • 16. University of Iowa | Mobile Sensing Laboratory Concurrency Model 16 My Research: CSense getComponent("audio").setThreading(Threading.NEW_DOMAIN); getComponent("httpPost").setThreading(Threading.NEW_DOMAIN); getComponent("mfcc").setThreading(Threading.SAME_DOMAIN); Compiler transformation
  • 17. University of Iowa | Mobile Sensing Laboratory โ€ข Static analysis โ€“ composition errors, memory usage errors, race conditions โ€ข Flow analysis โ€“ whole-application configuration and optimization โ€ข Stream Flow Graph transformations โ€“ domain partitioning, type conversions, MATLAB component coalescing โ€ข Code generation โ€“ Android application/service, MATLAB (C code + JNI stubs) CSense Compiler 17 My Research: CSense
  • 18. University of Iowa | Mobile Sensing Laboratory โ€ข Components exchange data using push/pull semantics โ€ข Runtime includes a scheduler for each domain โ€“ task queue + event queue โ€“ wake lock โ€“ for power management CSense Runtime 18 Scheduler1Task Queue Event Queue Scheduler2 Task Queue Event Queue Memory Pool My Research: CSense
  • 19. University of Iowa | Mobile Sensing Laboratory โ€ข Garbage collection overhead limits scalability โ€ข Concurrency primitives have a significant impact on performance Producer-Consumer Throughput 19 My Research: CSense 30% 13.8x 19x
  • 20. MY RESEARCH IN CONTEXT More energy savings in the dataflow model Dynamic optimizations in cloud stream processing 20
  • 21. University of Iowa | Mobile Sensing Laboratory โ€ข Energy consumption in mobile sensing applications โ€“ Energy bugs introduced by power management primitives ๏ƒจ Program analysis on code paths and potential races โ€“ Improper usage of I/O components elongates tail power states ๏ƒจ Defer and batch I/O operations to execute in a short interval โ€“ Intensive computations ๏ƒจ Code offloading to cloud, AMP cores or GPGPU Energy Efficiency and Stream Processing 21 Research of Interest: Energy Efficiency
  • 22. University of Iowa | Mobile Sensing Laboratory โ€ข Inconsistent throughputs โ€“ Critical path with bottleneck processes โ€ข Dynamic Voltage Frequency Scaling (DVFS) โ€“ Dispatch partitions to cores at different frequencies ๏ƒจ Save energy while maintaining performance DVFS: GreenStreams 22 Research of Interest: Energy Efficiency [Bartenstein2013]
  • 23. University of Iowa | Mobile Sensing Laboratory โ€ข New challenges โ€“ Changing input structure โ€ข Tweets ๏ƒจ mention graph ๏ƒจ computations โ€ข Distributed storage of the graph โ€ข Consistent graph representation โ€“ Communication overhead โ€ข Conventional stream processing less concerns large scale communications but focuses on local computations โ€“ Changing performance criteria โ€ข Statically made decisions cannot be optimal the whole time ๏ƒจBetter performance requires to make decisions dynamically โ€“ Fault-tolerance Cloud Stream Processing 23 Research of Interest: Cloud Computing
  • 24. University of Iowa | Mobile Sensing Laboratory โ€ข Goal โ€“ Compute timely properties on the changing graph input โ€ข Challenges โ€“ High rate of graph updates โ€“ Consistent graph structure โ€“ Static graph mining algorithms โ€ข Global progress tracking protocol โ€“ Graph updates are queued and progress is tracked in a global table โ€“ Progress snapshots are taken and distributed to perform transactions of graph updates and associated computations โ€ข Pros โ€“ Decouples graph updates from graph computations โ€ข Cons โ€“ Centralized progress tracking โ€“ No analysis on updates that may cancel each other and aggregation of potential propagation of communications Changing Graph: Kineograph 24 Research of Interest: Cloud Computing
  • 25. University of Iowa | Mobile Sensing Laboratory โ€ข Goal โ€“ High throughput timely and low latency processing โ€ข Challenges โ€“ Communication overhead dominates local computing resources โ€“ Massive communications cause media contentions and high latency โ€ข Efficient batching and localized communications โ€“ User decide to process input synchronously or asynchronously โ€“ Distributed progress tracking based on partially ordered timestamps โ€ข Pros โ€“ Effective aggregation of communications โ€ข Cons โ€“ Flow control might be a concern for asynchronous delivery Timely Dataflow: Naiad 25 Research of Interest: Cloud Computing
  • 26. University of Iowa | Mobile Sensing Laboratory โ€ข Goal โ€“ Efficiently switching between Sync and Async modes for better performance and early termination โ€ข Predict the next better execution mode online โ€ข Pros โ€“ Frees programmers from coding the execution modes explicitly โ€ข Cons โ€“ Separate snapshots and checkpointing under both modes โ€ข Additional space usages โ€ข Unclear performance impact Execution Modes: PowerSwitch 26 Research of Interest: Cloud Computing
  • 27. University of Iowa | Mobile Sensing Laboratory โ€ข Energy efficiency in stream processing โ€“ Dataflow paths as code paths facilitate program analysis โ€“ Graph manipulation for better I/O component access aggregations โ€“ Precise workload requests for computing resources โ€“ Adapting to runtime information e.g., user activity predictions โ€ข Incorporate dynamic optimizations โ€“ Static optimizations based on fixed resources โ€“ Reevaluations once in a while, e.g., execution mode switching โ€“ Changing input structures and distributed state sharing โ€ข Less changing information flow paths for aggregating communications โ€ข Localized multicast for reconfiguration and termination in partial order Conclusions and Future Work 27
  • 28. University of Iowa | Mobile Sensing Laboratory โ€ข Potential limitation on stream processing optimizations โ€“ Non-trivial transformation between SDF graphs at different granularities Conclusions and Future Work 28 FFTTestSource0 split08,16,16 FFTReorderSimple0 8,8,8 FFTReorderSimple1 8,8,8 CombineDFT0 8,4,4 CombineDFT28,4,4 CombineDFT1 4,8,8 join0 8,8,8 FloatPrinter0 16,1,1 CombineDFT34,8,8 8,8,8 FloatSource0 split0 2,4,4 Butter๏ฌ‚y0 2,4,4 Butter๏ฌ‚y1 2,4,4 join0 4,2,2 4,2,2 split1 4,8,8 Butter๏ฌ‚y2 4,4,4 Butter๏ฌ‚y3 4,4,4 join1 4,4,4 4,4,4 BitReverse0 8,8,8 FloatPrinter0 8,2,2 Coarse-grained FFT Fine-grained FFT
  • 29. University of Iowa | Mobile Sensing Laboratory Any Questions? Thank You 29

Editor's Notes

  1. Today, Iโ€™m going to present stream processing and its optimizations ranging from memory, energy efficiency and new challenges at the scale of cloud computing. Stream applications are a class applications that process continuous input data streams and may produce continuous output streams. Here is a sample application Speaker Identifier running on a mobile device. It reads continuous audio samples from a microphone sensor, performs computationally intensive feature extraction and uploads to a remote server for identification. This application is supposed to run in the background indefinitely. This implies high performance is required for real-time processing. Long-term efficient resource management is also important.
  2. However, there are some challenges to stream processing we need to address. First, multi-core architectures require programs to expose parallelism. Second, high and variable memory and I/O workloads may incur significant runtime resource management overhead. Third, energy efficiency is essential for long-term executions. Traditional imperative languages such as C and Java, only expose limited parallelism to exploit. Programmers need to manually code concurrency primitive such as threads and locks which are error-prone. On mobile devices, aggressive power management tends to burden programmers with error-prone APIs.
  3. So, we are in search of models of computation that fully expose program parallelism. Eventually, we found the synchronous data flow is promising. In this model, programs are describe as directed graphs. The nodes represent processes with sequential code. The edges represent FIFO communication channels. It is straightforward to see the pipeline parallelism if there is a pipeline of processes. And data parallelism if there is a branch in the graph. Besides, a specific requirement of SDF is that the production and consumption rates of processes are fixed and known at compile-time. In this way, SDF actually limits the expressivity of general dataflow models. Such that periodic static schedules can be derived at compile-time Memory requirements are bounded. And even the memory behavior of the entire program can be characterized.
  4. So, how does a static schedule look like? It basically specifies the order and the number of invocations of a process in a period. It consists of an optional initialization phase and a steady state that executes repetitively. The good thing is that its existence can be determined based on the production and consumption rates of all processes. Next, we derive such a schedule for a single processor. The way is to simulate process executions iteratively and track the channel buffer sizes. A process is schedulable if there is sufficient input to consume. That means its invocation wonโ€™t cause negative buffer size. If there are multiple schedulable processes, just randomly pick one. The simulation continues until the same buffer size as the initial one is restored. The maximum buffer size during the simulation is the memory requirement. To adapt the schedule for multi-cores, simply partition it to run in parallel.
  5. However, the semantics of the memory model may cause potential inefficiency. It assumes per channel buffer allocations and pass-by-value semantics. That means even when a data item passes through the channel without being modified, memory copies are necessary across channels. How about pass-by-reference? It basically works on read-only chunks of data. What if a process tries to insert new samples in the middle? Can we still prevent copying unchanged data portions?
  6. This induces my research on the memory optimizations of stream programs. A recent work ESMS: Efficient Static Memory Analysis on Stream Programs addresses this. Another earlier developed stream-processing toolkit also takes care of the memory performance of mobile sensing applications.
  7. I first introduce the memory optimization ESMS. Itโ€™s a compiler optimization based on a well-defined SDF language, StreamIt. In StreamIt, stream is the first order construct which can be a filter representing a process in SDF. Or one of the three hierarchical constructs. A pipeline is a line of connected streams. A splitjoin is a branch of streams that are joined later. A feedbackloop is a loop processing structure which we may support in near future. StreamIt adopts per channel allocations and pass-by-value semantics. That implied a significant amount of memory usage. As you can see in splitjoins, the input needs to be copied to each of the output braches. Our idea to improve this is using one single global allocation. to reuse memory whenever possible and avoid unnecessary copies
  8. The first step is to perform component analysis on filter work functions in the logical space. The logical space is the positions referenced by peek(i), pop(), and push(v) to access contiguous memory in FIFO channels. Their semantics are just as the name suggests. There is no direct memory access to channels in StreamIt. So, the goal is to interpret the memory operations of pushes. It can be a PASS if the push moves a value from the input channel without being modified. Otherwise, it is an UPDATE, meaning to push a new value. Here is a sample filter work function. As you can see, the first 4 pushes are interpreted as passes because the values are not modified. The following 4 pushes are interpreted as updates because they push new values. In this sense, splitters and joiners of a splitjoins are pass-only. Other filters that only reorder the input are also pass-only. Apparently, the pass exposes the opportunity to reuse memory. As for updates, whether it can reuse or not, we donโ€™t know at this time. So we need to figure out with a whole program analysis.
  9. The goal is relate logical positions to physical locations in the global allocation. Which essentially means to remap peek(i), pop() and push(v) to access possibly non-contiguous memory. This is based the live range analysis on one schedule period. The mapping is stored in a layout which starts with size zero and expands when necessary. Each location in the layout represents a live variable with its reference counted live range. The live range begins when receiving a pushed value for the 1st time. In this sense, a Splitjoin passes a value to multiple output channels with a live range greater than one, indicating the sharing. Then live range ends when its value is popped for the last time. A location is free if its live variable is out of range. In this way, we can derive complete memory behaviors and sound approximation because There is no pointer aliasing and The analysis will terminate in one schedule period.
  10. Next, we simply simulate one schedule period to stitch the layout based on the live range information. During the simulation, a memory location is reused on a PASS. If it is an UPDATE, reuse the memory location only when itโ€™s free. Otherwise, we need to find a free location using one of the three strategies. Always-Append (AA), Append-on-Conflict (AoC), and Insert-in-Place (IP). To understand the semantics, letโ€™s take a look at the examples. Suppose the current input memory layout looks like this and there are two updates in the filter invocation. The first number is the physical location and the 2nd number is the live range count. D0 and D1 are the live variables. In the first example, both live variables are out of range. The AA strategy always appends at the end and expands the layout size. The AoC strategy checks if all the updates can reuse. The IP reuses whenever possible. Clearly, both reuse the input layout in the 1st example. In the 2nd example, D1 is still alive. In this case, both AA and AoC append the updates at the end and expand the layout size. The IP reuses the first location and inserts to expand the layout. D1 is shifted by one location. So, in general, IP aggressively reuses the space. However, a side effect is the potentially increasing code complexity due to channel access in loops. Consider peek, pop and push in loops. If the memory layout is contiguous, there is no problem. Otherwise, additional code needs to be generated to ensure correct mappings across loop iterations. Therefore, this optimization has trade-off between the allocation size and code complexity.
  11. Nonetheless, the ESMS evaluation shows significant memory usage reductions on both code size and data size for the 14 benchmarks. We compare with baseline StreamIt and their cache optimization which essentially increases the buffer size to execute a filter multiple times to exploit the instruction cache locality. The right figure shows the expected reductions on the data size including filter states range from 45% to 96%. The left figure shows the reductions on the code size are 73% on average. This is because by capturing the memory behaviors, the memory copies by splitjoins and reordering filters are eliminated. Hence, the overall memory usage is reduced significantly.
  12. We also evaluate the speedup over the baseline StreamIt. The average is 3. For some complex computations such as FFT, the speedup can be up to 7 or more. So, instead of trading space for performance as the StreamIt cache optimization, ESMS improves the performance by eliminating unnecessary memory operations and fits in the cache with a smaller working set.
  13. Beside the specific memory optimization, we also proposed a stream processing toolkit called CSense that allows to develop a full-fledged mobile sensing application. It addresses the following challenges, high frame rates, concurrency complexities, long-term robustness and energy efficiency. It also prevents the overhead induced by the Java virtual machine. Based on the SDF, CSense adopts dynamic scheduling to support conditional dataflows, asynchronous event processing and the Android-specific power management.
  14. A sample application, the Speaker Identifier, is composed as a pipeline. The programmer simply creates and wires the components in the code fragment. What is interesting is that the scheduling follows a depth first traversal of the stream flow graph. Each component is allowed to direct the scheduling by pushing through its output ports. In this app, the RMS classifier which detects if the input audio sample is voiced or not, conditionally pushes the sample to either of its output ports. This essentially partitions the SFG into SDF subgraphs This implies the CSense compiler can derive the bounded memory requirements of the entire graph
  15. To free programmers from the concurrency primitives, CSense allows programmers to annotate concurrent I/O operations. In this app, the audio and httpPost components are specified. The specified components will be partitioned into different domains. The other components just follow the domain of the upstream component. Each domain is executed on a separate thread. A synchronous queue is inserted to ensure safety for data exchange between domains.
  16. In summary, the CSense compiler performs static analysis to prevent composition errors due to incompatible types, memory usage errors due to leaks, and identify race conditions for safety. It also perform the flow analysis to deicide the message buffer sizes between each pair of neighboring components such that in one invocation the producer component can fill the buffer with all the consumer component needs to process This simplifies the dynamic scheduling. In the following, the CSense compiler performs graph transformations for domain partitioning as mentioned before, type conversion and MATLAB component coalescing. The CSense toolkit allows programmers to write components in MATLAB for ease of signal processing. Finally, the Csense compiler generates a complete Android application and compiles MATLAB code if necessary.
  17. After the application is installed on the target device, it is executed by the CSense runtime. The runtime includes a scheduler for each domain. A scheduler maintains a task queue, an event queue and an Android wakelock. The task queue schedules components to execute in order. The event queue allows to process asynchronous events. The wakelock is associated with the Android power management. Whenever there is no wakelock acquired by any application, the Android device is put to sleep soon. For our schedulers, if the task queue is empty, the scheduler determines whether to release the wakelock and goes to sleep.
  18. Despite dynamic scheduling, CSense ensures high throughput by improving the memory and synchronization performance. We evaluate the throughput of a producer-consumer benchmark. The CSense memory management is based on message pooling, denoted by MP. The CSense synchronization primitives are based on spin locks, denoted by C. The baseline is the default Java garbage collection GC and reentrant lock L. In this figure, the x-axis represent the production rate while the y-axis represent the consumption rate. Ideally, both should agree. However, GC overhead limits the scalability. With MP, the performance is improved by 13.8X With the CSense spinlocks, we got additional 30% improvement.
  19. After exploring specific memory optimizations of stream programs and developing a stream processing toolkit for mobile sensing applications, I would like to further investigate more energy savings specific to the dataflow model and dynamic optimizations in cloud stream processing.
  20. We surveyed major energy consumption issues and those methodologies to address it. First, the energy bugs introduced by power management primitives may be identified by program analysis on code paths and potential races. The is because the power management primitives follow the currency primitives and cause similar bugs. Next, improper usage of I/O components elongates tail power states. The tail state is a state that consumes a significant amount of energy after performing a high power I/O operation. Letโ€™s look at this figure which shows the 3G module power states. Normally, in the idle state, no significant energy consumption. After the application makes connection and send requests, it goes to the high power state that consumes the most energy. After that, this application computes something else while the 3G module goes to the tail state and still consumes a significant amount of energy. Then, the application sends another message and causes the module to go to the high power state again. Therefore, the energy saving strategy is to defer and batch I/O operations as much as possible to perform in a short interval. Such that the tail energy consumption can be minimized. On the other hand, intensive computations may be unavoidable and consume a lot of energy. Code offloading considers to delegate the computation either to a remote server, some other CPU or GPU cores. This approach needs to weigh the data transfer cost against local computations.
  21. As for the dataflow model of stream processing, there is an implicit energy consumption due to inconsistent real-time throughputs. Here is the example. The stream flow graph is partitioned to run on 3 CPU-cores. However, the partition AB is the bottleneck and slows down the entire throughput. Apparently, there is no improvement no matter how fast partitions C and DE can run. Instead, the core dispatched to partition AB should run at its highest frequency. The other cores dispatched to C and DE can run at lower frequencies to save energy. This is based on the dynamic voltage frequency scaling technique. Here is the evaluation based partitioning StreamIt programs. The GreenStreams configuration sets the frequencies to match the partition throughputs. Compared with the Fast configuration that set the highest frequencies for each CPU core, It saves up to 28% energy while maintaining comparable performance. This implies stream applications may save more energy by making precise requests to computing resources.
  22. In addition to energy efficiency, stream processing at the scale of cloud computing is worth exploration because of the following new challenges. The first is about the changing input structure. Suppose we want to compute the influence of users in a social network like Twitter. The input is a continuous stream of tweets but the computation is based on the mention graph induced from the tweets. The graph changes all the time and is so large that distributed storage is necessary. The key is to track the progress of the structure updates in a distributed setting. The second challenge is the communication overhead that contributes to significant latency. The key is to aggregate and localize communications. The third challenge is changing criteria. That means the once optimal static decisions become suboptimal over time. The key is to reevaluate when appropriately. Last but not least is the fault-tolerance which is essential due to the distributed nature.
  23. A related work to deal with the changing graph input is Kineograph. The goal is to compute timely properties on the changing graph input. The graph algorithm is static and based on a consistent graph structure. However, it would be complex to combine graph updates and computations together. So Kineograph proposed a global progress tracking protocol. Such that graph updates are queued and the progress is tracked in a global table. The progress snapshots are taken periodically to perform transactions of graph updates and the associated computations. The advantage of this approach is to decouple graph updates from algorithm-specific computations. Nonetheless, the centralized progress tracking may be a performance concern. Besides, there is no analysis on graph updates that may cancel each other, and aggregation of potential propagation of communications.
  24. In contrast with Kineograph, Naiad focuses on reducing communication overhead and avoiding massive media contentions. In this sense, Naiad provides synchronous notifications and asynchronous message delivery. This facilitates efficient batching. Besides, Naiad proposed the distributed progress tracking to notify only relevant processes. There is no broadcast in Naiad systems. Therefore, Naiad effectively aggregates communications. On the other hand, Flow control might a concern for asynchronous delivery. Too fine-grained message control may not be easy to use for building large stream applications.
  25. Finally, consider making a decision about the execution modes for algorithms such as PageRank. In general, there are two execution modes. In the synchronous mode, all the processes are scheduled in iterations. The input data is supposed to be ready for all in the beginning of an iteration. In the asynchronous mode, each process is allowed to execute as soon as its input is ready. It is not trivial to make an optimal decision based on static information because the performance of execution modes may change over different execution stages. Therefore, PowerSwitch proposed to predict the throughput of the current mode and the alternate mode online. Based on the predictions, PowerSwitch changes the execution mode whenever itโ€™s possible to improve the performance. Apparently, PowerSwitch frees programmers from coding the execution modes explicitly. However, separate snapshots and checkpointing under both modes use more space And its performance impact is unclear.
  26. So far, we have gone through the stream processing model, memory optimizations, potential improvement on energy efficiency and opportunities for dynamic optimizations. I believe itโ€™s promising to further improve the energy efficiency in stream processing. First, dataflow paths as code paths should facilitate program analysis Second, graph manipulation allows for better I/O component access aggregations Third, itโ€™s possible to make precise workload requests for computing resources Fourth, it is necessary to adapt to runtime information such as the user activity predictions For example, a stream program should batch contents in proportion to the userโ€™s changing interest. Finally, at the scale of cloud computing, it is essential to incorporate dynamic optimizations. There are levels of dynamism. First , static optimizations based on fixed resources are still necessary. Some decisions need reevaluations once in a while such as the switching of execution modes. Ultimately, the input data structure is stored in distributed storage and change over time. We may try to maintain a less changing information flow path for aggregation of communications. and utilize localized multicast in some partial order for notifying reconfiguration and termination.
  27. Finally, we found static stream processing optimizations may be limited to one aspect or some local features of the stream graph. To our best knowledge, none of the existing optimizations can transform a coarse-grained FFT into an equivalent but fine-grained FFT. We are aware this should deserve further investigations.
  28. Error-prone concurrency primitives Races, deadlocks, non-determinism Imperative languages expose limited parallelism to exploit Limited parallelism granularity: e.g., function calls and loops Complex data dependencies to annotate correctly Energy efficiency with power management primitives Continuous sensing applications on mobile devices Extremely high rates of incoming data streams Runtime scheduling overhead Excessive memory usages may lead to low performance Energy inefficiency in mobile sensing applications (MSAs) ----- Meeting Notes (5/4/15 16:40) ----- Intro New computing platfomrs 2points: parallel architectures challenges SDF def exposed parallelism schduleing static dynamic bounded channels static access patterns