CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing Applications
  • With the popularity of smart devices, there is increasing demand for mobile sensing applications that capture and analyze physical activities, social interactions, and ambient information from rich sensors. Here are two typical mobile sensing applications. The top one is Speaker Identification and the bottom one is Activity Recognition. Both applications work in a similar way. First, they collect sensor data, which can be local or remote. Next, features are extracted from the sensor data. Finally, the features may be used to perform real-time classification or uploaded to a remote server for offline recognition.
  • Though these applications are conceptually straightforward, they are not trivial for programmers to implement efficiently, due to the following challenges. The first challenge is concurrency. Mobile sensing applications (MSAs) are inherently multi-threaded because sensor reading, network communication, and interaction with users and the environment may happen concurrently. However, multi-threading is error-prone due to data races and even deadlocks. The next challenge is high frame rates. For example, audio and video sources tend to produce a large amount of data constantly, which stresses memory management. The third challenge is robustness. Mobile sensing applications are usually expected to run long-term data collection in the background. It would be unacceptable to bother users with crashes or restarts. So far, our main target is the Android platform. However, the underlying Java virtual machine worsens these problems because of higher computational overhead and non-deterministic garbage collection. Therefore, we propose the CSense toolkit to address the challenges without sacrificing performance.
  • Before introducing the design of CSense, I would like to go through the related work. First, in terms of support for MSAs, prior work like SeeMon, Coordinator, and JigSaw requires programmers to use special constructs to develop specific types of sensing applications. CSense, on the other hand, provides a high-level stream programming abstraction that is general and suitable for a broad range of MSAs. Second, CSense builds on data flow models. There are two categories. One is synchronous data flow, like StreamIt and Lustre, which enforces static scheduling and optimizations. However, if you need to process asynchronous events, you are on your own to adapt it. The other is asynchronous data flow, like Click and XStream, which provides asynchronous constructs but sacrifices some performance. Our CSense toolkit adopts the asynchronous data flow model but improves performance with compile-time analysis.
  • For the remainder of the talk, I will introduce the CSense programming model, compiler, and runtime environment. Evaluation results will be presented later.
  • Here is the programming model. An MSA is represented as a stream flow graph (SFG), which is a directed acyclic graph with nodes implemented as components connected through input and output ports. The Speaker Identification application is shown as an example. The following Java code segment shows how we create and wire the components.
  • What differs between CSense and previous work is the focus on memory management. The goal here is to reduce the memory overhead introduced by garbage collection and copy operations. The CSense programming model not only adopts pass-by-reference semantics to facilitate data sharing between components but also explicitly includes memory management in an SFG. This focuses programmers' attention on memory operations, tracks data exchange globally, and allows for an efficient implementation.
  • Let's take a look at memory management in the Speaker Identification example. In an SFG, only two special kinds of components, called sources and taps, are allowed to perform memory management. The sources implement memory pools and pre-allocate frames. During execution, frames are taken from the pools and flow from sources, through links, to taps. A tap puts the frame back into its memory pool to ensure there are no leaks. In this example, there are three sources: the audio component, S1, and S2. The data flows follow the colored links and reach the corresponding taps. If a frame is shared between components, its associated reference counter is incremented. When the frame reaches a tap, the reference counter is decremented. If the counter is zero, the frame is put back into its memory pool.
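The source/tap discipline described above can be illustrated with a minimal, single-threaded sketch. This is not the CSense implementation: the class and method names (`FramePool`, `Frame`, `take`, `share`, `tap`) are hypothetical, and thread safety is deliberately left out.

```java
import java.util.ArrayDeque;

/** Hypothetical, single-threaded sketch of a source-side pool with reference-counted frames. */
final class FramePool {
    /** A pre-allocated buffer plus the number of components currently holding it. */
    static final class Frame {
        final float[] data;
        int refs;
        Frame(int size) { data = new float[size]; }
    }

    private final ArrayDeque<Frame> free = new ArrayDeque<>();

    /** All frames are allocated once, up front, so steady-state execution allocates nothing. */
    FramePool(int frameCount, int frameSize) {
        for (int i = 0; i < frameCount; i++) free.push(new Frame(frameSize));
    }

    /** Source side: hand out a pooled frame with one reference, or null if the pool is empty. */
    Frame take() {
        Frame f = free.poll();
        if (f != null) f.refs = 1;
        return f;
    }

    /** Called when a frame is shared with one more component. */
    void share(Frame f) { f.refs++; }

    /** Tap side: drop one reference and recycle the frame once nobody holds it. */
    void tap(Frame f) {
        if (--f.refs == 0) free.push(f);
    }

    int available() { return free.size(); }

    public static void main(String[] args) {
        FramePool pool = new FramePool(2, 128);
        Frame f = pool.take();
        pool.share(f);                         // two components now hold the frame
        pool.tap(f);                           // first tap: frame is still referenced
        System.out.println(pool.available());  // prints 1
        pool.tap(f);                           // second tap: frame returns to the pool
        System.out.println(pool.available());  // prints 2
    }
}
```

Because every frame leaves the pool in `take` and can only return through `tap`, a frame that never reaches a tap shows up as a pool that gradually empties, which is the kind of leak the explicit source/tap structure is meant to expose.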
  • As for the concurrency challenge, the goal is to expose a concurrency model that may be analyzed statically. The idea is to partition the components in an SFG into execution domains. A domain is a connected subgraph of components executed on a single thread. Any frame exchange between domains is mediated by a shared queue. Other data sharing between components uses a tuple space. Currently, the CSense programming model provides several concurrency constraints, such as NEW_DOMAIN and SAME_DOMAIN. Based on the domain partition information, compiler analysis can identify data races.
  • Here is the same example. The audio and httpPost components declare new domains. The domain partitioning starts with these two components and expands by including other components in the downstream direction. The other components are added to the domains of adjacent components. After partitioning the SFG, the first four components, along with S1, S2, and T1 to T3, are in one domain, while httpPost and T4 are in the other domain. With this information, the compiler is able to transform the graph by inserting a shared queue between the two domains for data exchange. Another special concurrency option is SAME_DOMAIN. This annotation is used for a group component that is composed of several related subcomponents. It makes sense that those components should be partitioned into the same domain to avoid cross-domain data exchange overhead.
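The downstream expansion described above can be sketched as a breadth-first walk from each NEW_DOMAIN component. This is only an illustration under stated assumptions: the graph below is a simplified, hypothetical four-link chain rather than the full Speaker Identification SFG, the class and method names are invented, and the actual CSense heuristic (which also tries to minimize cross-domain exchanges) is not reproduced.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical sketch: grow execution domains downstream from NEW_DOMAIN components. */
final class DomainPartition {
    /** Returns a domain id for every component reachable from one of the roots. */
    static Map<String, Integer> partition(Map<String, List<String>> links, List<String> roots) {
        Map<String, Integer> domain = new LinkedHashMap<>();
        // Each NEW_DOMAIN component claims its own domain before any expansion starts.
        for (int i = 0; i < roots.size(); i++) domain.put(roots.get(i), i);
        for (String root : roots) {
            Deque<String> work = new ArrayDeque<>();
            work.add(root);
            while (!work.isEmpty()) {
                String current = work.poll();
                for (String next : links.getOrDefault(current, List.of())) {
                    // An unassigned downstream component joins the domain of its upstream neighbor.
                    if (domain.putIfAbsent(next, domain.get(current)) == null) work.add(next);
                }
            }
        }
        return domain;
    }

    /** Links whose endpoints sit in different domains are where a shared queue would be inserted. */
    static List<String> crossDomainLinks(Map<String, List<String>> links, Map<String, Integer> domain) {
        List<String> crossing = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : links.entrySet())
            for (String to : e.getValue())
                if (!Objects.equals(domain.get(e.getKey()), domain.get(to)))
                    crossing.add(e.getKey() + "->" + to);
        return crossing;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("audio", List.of("rmsClassifier"));
        links.put("rmsClassifier", List.of("mfcc"));
        links.put("mfcc", List.of("httpPost"));
        links.put("httpPost", List.of("tap"));
        Map<String, Integer> domain = partition(links, List.of("audio", "httpPost"));
        System.out.println(domain);
        System.out.println(crossDomainLinks(links, domain));
    }
}
```

In this simplified chain the only link that crosses domains is the one into httpPost, which corresponds to the single shared queue the compiler inserts in the talk's example.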
  • Next, we introduce the type system, which extends Java generic types. It is designed to ensure the correctness of component composition and to facilitate efficient component reuse in different applications. In an SFG, all the input and output ports are typed and allow programmers to specify frame size constraints. The frame size is the amount of data produced or consumed at once by a component through a port. Here is the code segment showing how to specify the type constraints.
  • Clearly, there are many frame size configurations satisfying the constraints. However, not all the configurations can be implemented efficiently. In this example, let's focus on the output port of energyT and the input port of speechT. energyT constrains the output frame size to be greater than 8,000 and less than 24,000. speechT constrains the input frame size to be exactly 128. Now, consider the following two configurations. The first configuration sets the energyT frame size to 10,000, which is not a multiple of 128. That is, no efficient frame size conversion exists, because each conversion leaves a frame remainder that causes an additional memory copy. In contrast, the second configuration sets the energyT frame size to 10,240, which can be divided into 80 frames of speechT. This allows for an efficient frame size conversion.
  • Now, to make it general, we introduce the concept of a multiplier, which is the number of executions for a component to produce or consume the entire input or output. What the flow analysis does is to find the constrained frame sizes and multipliers that result in a common multiple. The common multiple is the resulting frame size to allocate and is represented as the equality constraint, which is implicitly added by the compiler.
  • Next, to apply the flow analysis to the entire SFG, the compiler formulates an integer program by adding all the constraints for each pair of input/output ports. Then the compiler calls an external solver to derive a solution for the frame sizes and multipliers. The objective is to minimize the total memory usage. If there is no such solution, the compiler may return an inefficient configuration and show a warning. The programmers may then relax the constraints to obtain an efficient solution.
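The arithmetic behind the energyT/speechT example can be checked with a few lines of Java. This is a hedged illustration only: it covers a single producer/consumer pair, the method names are hypothetical, and a direct search stands in for the integer program and external solver that the compiler actually uses over the whole graph.

```java
/** Hypothetical sketch of the frame-size check for one producer/consumer pair. */
final class FrameSizeSearch {
    /** A conversion is efficient when the producer frame splits into whole consumer frames. */
    static boolean convertsWithoutRemainder(int producerSize, int consumerSize) {
        return producerSize % consumerSize == 0;
    }

    /** Number of consumer executions needed to drain one producer frame (the consumer's multiplier). */
    static int multiplier(int producerSize, int consumerSize) {
        return producerSize / consumerSize;
    }

    /** Smallest size strictly between lo and hi that is a whole multiple of unit, or -1 if none. */
    static int smallestMultiple(int lo, int hi, int unit) {
        int candidate = (lo / unit + 1) * unit; // first multiple strictly greater than lo
        return candidate < hi ? candidate : -1;
    }

    public static void main(String[] args) {
        System.out.println(convertsWithoutRemainder(10_000, 128)); // false: 10,000 = 78 * 128 + 16
        System.out.println(convertsWithoutRemainder(10_240, 128)); // true
        System.out.println(multiplier(10_240, 128));               // 80
        System.out.println(smallestMultiple(8_000, 24_000, 128));  // 8064
    }
}
```

Taken in isolation, these two constraints admit sizes smaller than 10,240 (the smallest is 8,064 = 63 * 128); the configuration chosen for a real application depends on all the constraints in the graph, which is why the problem is handed to a solver.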
  • So, in summary, given all the information about the SFG of an MSA, the CSense compiler first performs static analysis to prevent composition errors, memory usage errors, and race conditions. Second, the compiler applies flow analysis to derive whole-application frame size configurations of components. Third, the compiler may transform the SFG by inserting shared queues between domains and type converters between pairs of input/output ports that are not compatible. In addition, connected MATLAB components may be coalesced. A MATLAB component is created by wrapping the C code of the MATLAB function generated by MATLAB Coder. The coalescing simply combines the MATLAB functions first and then generates a single component to reduce the data exchange overhead between the Java space and the native space. Finally, the compiler generates the target Android application code that links to the native MATLAB functions.
  • After the application is installed on the target device, it is executed by the CSense runtime to drive the data flow from sources, through components, to taps. The runtime includes a scheduler for each domain. A scheduler maintains a task queue, an event queue, and an Android wake lock. The task queue allows a component to schedule itself for execution as soon as possible. The event queue allows a component to schedule a delayed event to be processed at a specified time. The wake lock is associated with Android power management. Whenever no application holds a wake lock, the Android device is soon put into deep sleep. For our schedulers, if the task queue is empty, the scheduler determines whether to release the wake lock and go to sleep.
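A per-domain scheduler of this shape can be sketched in plain Java. This is an assumption-laden illustration, not the CSense runtime: the class and method names are hypothetical, time is passed in explicitly so the sketch runs off-device, and the Android wake lock is represented only by a comment at the point where the scheduler becomes idle.

```java
import java.util.ArrayDeque;
import java.util.PriorityQueue;

/** Hypothetical sketch of a single-threaded domain scheduler with a task queue and an event queue. */
final class DomainScheduler {
    private static final class TimedEvent {
        final long dueMs;
        final Runnable action;
        TimedEvent(long dueMs, Runnable action) { this.dueMs = dueMs; this.action = action; }
    }

    private final ArrayDeque<Runnable> tasks = new ArrayDeque<>();
    private final PriorityQueue<TimedEvent> events =
            new PriorityQueue<>((a, b) -> Long.compare(a.dueMs, b.dueMs));

    /** Task queue: run as soon as possible. */
    void post(Runnable task) { tasks.add(task); }

    /** Event queue: run once the given time has been reached. */
    void postAt(long dueMs, Runnable action) { events.add(new TimedEvent(dueMs, action)); }

    /**
     * Runs all due events and queued tasks for the time nowMs.
     * Returns the delay until the next pending event, or -1 when nothing is pending.
     */
    long runUntilIdle(long nowMs) {
        while (true) {
            // Promote events whose time has come into the task queue.
            while (!events.isEmpty() && events.peek().dueMs <= nowMs) tasks.add(events.poll().action);
            if (tasks.isEmpty()) break;
            Runnable task;
            while ((task = tasks.poll()) != null) task.run();
        }
        // The task queue is empty here; on Android this is where a scheduler could
        // decide to release its wake lock and sleep until the next event is due.
        return events.isEmpty() ? -1 : events.peek().dueMs - nowMs;
    }

    public static void main(String[] args) {
        DomainScheduler scheduler = new DomainScheduler();
        StringBuilder log = new StringBuilder();
        scheduler.post(() -> log.append("task "));
        scheduler.postAt(50, () -> log.append("event "));
        System.out.println(scheduler.runUntilIdle(0));   // 50: only the delayed event remains
        System.out.println(scheduler.runUntilIdle(50));  // -1: nothing left to run
        System.out.println(log);
    }
}
```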
  • Next, I am going to present the CSense runtime performance evaluation based on several benchmarks. In addition, we have implemented three MSAs to validate CSense: speaker identification, activity recognition, and a hearing aid survey application for audiology that combines subjective questionnaires and objective data collection to capture the listening context. Here is the experimental setup. We use a Galaxy Nexus, Android 4.2, MATLAB, and MATLAB Coder.
  • Our first benchmark, a producer-consumer benchmark, evaluates the performance of data exchange between two domains via a shared queue. We are especially interested in the impact of different memory management options and synchronization primitives. For memory management, there are two configurations for allocating frames. GC stands for garbage collection: frames are created when needed. MP stands for memory pool: frames are pre-allocated and reused. As for concurrent access to the shared queue and memory pool, configuration L stands for the Java reentrant lock. Configuration C stands for the CSense synchronization primitives based on atomic variables, which use the hardware compare-and-swap instructions. They are designed so that a thread retries acquiring access to a shared resource without being suspended on failure.
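The idea of retrying with compare-and-swap instead of suspending the thread can be shown with a small spin lock built on `java.util.concurrent.atomic.AtomicBoolean`. This is a generic sketch of the technique, not the CSense primitives themselves, whose design is not detailed in the talk; the class names `SpinLockSketch` and `SpinLock` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: a compare-and-swap spin lock guarding a shared counter. */
final class SpinLockSketch {
    /** A failed acquisition retries immediately instead of parking the thread. */
    static final class SpinLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void lock() {
            // compareAndSet succeeds for exactly one thread at a time.
            while (!held.compareAndSet(false, true)) Thread.onSpinWait();
        }

        void unlock() { held.set(false); }
    }

    /** Two threads each increment a shared counter perThread times under the spin lock. */
    static int countWithTwoThreads(int perThread) {
        SpinLock lock = new SpinLock();
        int[] counter = new int[1];
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                lock.lock();
                try {
                    counter[0]++;
                } finally {
                    lock.unlock();
                }
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithTwoThreads(100_000)); // 200000 when no update is lost
    }
}
```

Spinning suits short critical sections such as a queue or pool operation; for long critical sections a blocking lock wastes fewer CPU cycles, so this is a trade-off rather than a universal replacement for `ReentrantLock`.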
  • Here we show the throughput in this figure. The x-axis represents the production rate while the y-axis represents the consumption rate. Ideally, both rates should be equal. As you can see, GC and L lead to the lowest throughput. Replacing GC with memory pools improves the throughput by 13.8 times. Replacing the Java reentrant lock with the CSense synchronization primitives further improves the throughput by 30%. So, the total throughput improvement is about 19 times. This is mainly because GC and the Java reentrant lock cause frequent thread suspensions and switching. In summary, garbage collection overhead limits scalability, and concurrency primitives have a significant impact on performance.
  • Next, we want to further understand the garbage collection overhead. In this figure, the x-axis stands for the production rate and the y-axis stands for the time spent in garbage collection. As you can see, with memory pools and the CSense synchronization primitives, it is possible to achieve zero garbage collection. If only memory pools are used, the Java reentrant lock still incurs garbage collection because of implicit object creation. In summary, the CSense runtime incurs little garbage collection overhead.
  • Next, we evaluate the benefits of flow analysis and the runtime overhead in the MFCC benchmark, which is the Speaker Identification application simplified by removing the httpPost component.
  • Here, we show the benefits of flow analysis as a reduction in CPU usage. In the left figure, the x-axis stands for the audio sampling rate. The y-axis stands for the total CPU usage of the benchmark. As you can see, with flow analysis, the total CPU usage can be reduced by up to 45% at the highest sampling rate. To further understand the reduction in CPU usage, we break down the total CPU usage into per-component CPU usage in the right figure. The x-axis stands for the components. The y-axis stands for the component CPU usage. For the MFCC component, flow analysis eliminates unnecessary memory copies and increases cache locality during execution. For the other components, flow analysis leads to larger but efficient frame allocations that reduce the number of component invocations and the disk I/O overhead, especially for those components writing to storage.
  • Finally, we want to understand the CSense runtime overhead. The overhead is computed by subtracting the sum of the component CPU times from the total application CPU time. In the figure, the x-axis represents the sampling rate. The y-axis of the bottom figure shows the overhead as a percentage of the total CPU time. As you can see, the percentage of overhead is low and does not grow with the workload. In the top figure, we further decompose the runtime overhead into the scheduler overhead and the sleep overhead. The sleep overhead is incurred when the scheduler calls sleep(), and it should be small. The scheduler overhead is spent passing frames between components and accessing memory pools. Clearly, the scheduler overhead is even smaller than the sleep overhead. Therefore, we conclude that the runtime overhead is low for a wide range of data rates.
  • Alright, I have introduced the main design of the CSense toolkit. In conclusion, the CSense programming model provides efficient memory management, a flexible concurrency model, and a rich type system. The CSense compiler performs whole-application optimization based on the static and flow analyses. The CSense runtime is efficient, with low overhead, and is integrated with Android wake locks. We have implemented three typical MSAs to validate CSense. The benchmarks indicate significant performance improvements with memory pools, CSense synchronization primitives, and flow analysis.
  • We especially thank and acknowledge our funding sources. Now, I think it's time to take your questions.
  • Accelerometer pipelines involve intensive operations. Domain CPU usage grows with sampling rates and the length of pipelines. Shimmer pipelines involve more components and thus more overhead. Making predictions every second induces a smaller superframe size. Domain CPU time: phone at 60 Hz, Shimmer at 50 Hz.
  • Electronic surveys, ambient sound samples, and GPS. Deployed for six months as part of a clinical study. Reliability = uploaded / collected. A reliability of 0 corresponds to the server being offline due to power outages; a reliability below 100% corresponds to moving out of wireless signal coverage in the study area. The figure shows reliability during weeklong deployments.
  • Mature enough to support long-term deployments.

Presentation Transcript

  • CSense: A Stream-Processing Toolkit for Robust and High-Rate Mobile Sensing Applications
    IPSN 2014
    Farley Lai, Syed Shabih Hasan, Austin Laugesen, Octav Chipara
    Department of Computer Science, University of Iowa | Mobile Sensing Laboratory
  • Mobile Sensing Applications (MSAs)
      • Speaker Identification (diagram labels): Speech Recording, VAD, Feature Extraction, HTTP Upload, Speaker Models
      • Activity Recognition (diagram labels): Bluetooth Data Collection, Feature Extraction, Activity Classification (Sitting, Standing, Walking, Running, Climbing Stairs, …)
  • Challenges
      • Mobile sensing applications are difficult to implement on Android devices
          – concurrency
          – high frame rates
          – robustness
      • Resource limitations and the Java VM worsen these problems
          – additional cost of virtualization
          – significant overhead of garbage collection
  • Related Work
      • Support for MSAs
          – SeeMon, Coordinator: constrained queries
          – JigSaw: customized pipelines
          → CSense provides a high-level stream programming abstraction that is general and suitable for a broad range of MSAs
      • CSense builds on prior data flow models
          – Synchronous data flows: static scheduling and optimizations (e.g., StreamIt, Lustre)
          – Asynchronous data flows: more flexible but have lower performance (e.g., Click, XStream/Wavescript)
  • CSense Toolkit
      • Programming model
      • Compiler
      • Run-time environment
      • Evaluation
  • Programming Model
      • Applications modeled as Stream Flow Graphs (SFG)
          – builds on prior work on asynchronous data flow graphs
          – incorporates novel features to support MSAs
      • Create components:
          addComponent("audio", new AudioComponentC(rateInHz, 16));
          addComponent("rmsClassifier", new RMSClassifierC(rms));
          addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
          ...
      • Wire components:
          link("audio", "rmsClassifier");
          toTap("rmsClassifier::below");
          link("rmsClassifier::above", "mfcc::sin");
          fromMemory("mfcc::fin");
          ...
  • Memory Management
      • Goal: reduce memory overhead introduced by garbage collection and copy operations
      • Pass-by-reference semantics
          – allows for sharing data between components
      • Explicit inclusion of memory management in SFGs
          – focuses programmer's attention on memory operations
          – enables static analysis by tracking data exchanges globally
          – allows for efficient implementation
  • Memory Management
      • Data flows from sources, through links, to taps
      • Implementation:
          – sources implement memory pools that hold several frames
          – reference counters are used to track sharing of frames
          – taps decrement reference counters
      • Figure legend: audio data, MFCCs, filenames
  • Concurrency Model
      • Goal: expressive concurrency model that may be analyzed statically
      • Components are partitioned into execution domains
          – components in the same domain are executed on a single thread
          – frame exchanges between domains are mediated using shared queues
      • Other data sharing between components uses a tuple space
      • Concurrency is specified as constraints
          – NEW_DOMAIN / SAME_DOMAIN
          – heuristic assignment of components to domains to minimize data exchanges between domains
      • Static analysis may identify some data races
  • Concurrency Model
          getComponent("audio").setThreading(Threading.NEW_DOMAIN);
          getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);
          getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);
      • Figure: compiler transformation
  • Type System
      • Goal: promote component reuse across MSAs
      • A rich type system that extends Java's type system
          – most components use generic type systems
          – insight: frame sizes are essential in configuring components
          – detect configuration errors / optimization opportunities
          VectorC energyT = TypeC.newFloatVector();
          energyT.addConstraint(Constraint.GT(8000));
          energyT.addConstraint(Constraint.LT(24000));
          VectorC speechT = TypeC.newFloatVector(128);
          VectorC featureT = TypeC.newFloatVector(11);
  • Flow Analysis
      • Not all configurations may be implemented efficiently
      • Constraints: energyT > 8000, energyT < 24000, speechT = 128, featureT = 11

          Configuration   energyT              speechT
          Inefficient     10,000               128
          Efficient       10,240 (128 * 80)    128

      • Multipliers: M_rms = 1, M_mfcc = 80
      • An efficient implementation exists when M_rms * energyT = M_mfcc * speechT
  • Flow Analysis
      • Goal: determine configurations that have efficient frame conversions
      • Problem may be formulated as an integer linear program
          – constraints: generated from type constraints
          – optimization: minimize total memory usage
          – solution: specifies frame sizes and multipliers for the application
      • An efficient frame conversion may not exist
          – the compiler relaxes conversion rules
  • CSense Compiler
      • Static analysis:
          – composition errors, memory usage errors, race conditions
      • Flow analysis:
          – whole-application configuration and optimization
      • Stream Flow Graph transformations:
          – domain partitioning, type conversions, MATLAB component coalescing
      • Code generation:
          – Android application/service, MATLAB (C code + JNI stubs)
  • CSense Runtime
      • Components exchange data using push/pull semantics
      • Runtime includes a scheduler for each domain
          – task queue + event queue
          – wake lock for power management
      • Figure: Scheduler 1 and Scheduler 2, each with a task queue and an event queue, and a memory pool
  • Evaluation
      • Micro benchmarks evaluate the runtime performance
          – synchronization primitives + memory management
      • Implemented MSAs using CSense
          – Speaker identification
          – Activity recognition
          – Audiology application
      • Setup
          – Galaxy Nexus, TI OMAP 4460 ARM A9 @ 1.2 GHz, 1 GB
          – Android 4.2
          – MATLAB 2012b and MATLAB Coder 2.3
  • Producer-Consumer Benchmark
      • Scheduler: memory management + synchronization primitives
      • Memory management options
          – GC: garbage collection
          – MP: memory pool
      • Concurrent access to queues and memory pools
          – L: Java reentrant lock
          – C: CSense atomic variable based synchronization primitives
  • Producer-Consumer Throughput
      • Garbage collection overhead limits scalability
      • Concurrency primitives have a significant impact on performance
      • Figure annotations: 13.8x, 30%, 19x
  • Producer-Consumer GC Overhead
      • Reentrant locks incur GC due to implicit allocations
      • CSense runtime has low garbage collection overhead
      • Figure annotation: no garbage collection (in this benchmark)
  • MFCC Benchmark
      • Benefits of flow analysis
      • Runtime overhead
  • MFCC Benchmark CPU Usage
      • Flow analysis eliminates unnecessary memory copies
      • Benefits of larger but efficient frame allocations
          – reduced number of component invocations and disk I/O overhead
          – increased cache locality
      • Figure annotation: 45% decrease
  • MFCC Runtime Overhead
      • Runtime overhead is low for a wide range of data rates
      • Figure annotations: 1.83%, 2.39%
  • Conclusions
      • Programming model
          – efficient memory management
          – flexible concurrency model
          – rich type system
      • Compiler
          – whole-application configuration & optimization
          – static and flow analyses
      • Efficient runtime environment
      • Evaluation
          – implemented three typical MSAs
          – benchmarks indicate significant performance improvements
          – 19x throughput boost compared with a naïve Java baseline
          – 45% reduction in CPU time with flow analysis
          – low garbage collection overhead
  • Acknowledgements
      • National Science Foundation (NeTs grant #1144664)
      • Carver Foundation (grant #14-43555)
  • ActiSense Benchmark
      • Runtime scheduler overhead of a complex 6-domain application that accesses both phone sensors and remote Shimmer motes over Bluetooth
  • ActiSense CPU Usage
      • Overall domain scheduler overhead is small despite a longer pipeline
      • Figure annotations: 50 Hz, 60 Hz
  • AudioSense