Department of Measurement and Information Systems
Budapest University of Technology and Economics, Hungary
Measurement-Driven Resilience Design of Cloud-Based Cyber-Physical Systems 
Imre Kocsis 
ikocsis@mit.bme.hu 
SERENE’14 Autumn School
2014.10.14.
A View of Cyber-Physical Systems
Cyber-Physical Systems (CPSs)
Ubiquitous embedded and networked systems that can monitor and control the physical world with a high level of intelligence and dependability
Networked embedded systems everywhere
Clouds, „infusable” analytics, Big Data
From embedded to CPS
Direct manual control,
„closed world” engineering
Highly autonomous,
„cyber” backend, environment, swarms, …
Cyber-Physical Systems
Different flavors
o NSF, EU, academia, industry…
Still: it is here
o From smart cities & IoT to self-driving cars
o Scalable, reconfigurable backend is a must
Health Care, Transportation, Energy
„Classical” case for cloud computing: a brain for a CPS
Video surveillance
Citizen devices
Env. sensors
…
Traffic control
Situational awareness
Deep analytics
Normal day
Disaster
Elastic, reconfigurable computing
Reconfiguration
See: Naphade et al. (IBM), „Smarter Cities and Their Innovation Challenges”, Computer, 2011
Converging domains
CPS
Cloud computing
Big Data
Detour 1: Cloud Computing
Cloud computing: leased resources
Source: http://cloud.dzone.com/articles/introduction-cloud-computing
Definition? 
NIST 800-145 
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Properties 
On-demand self-service
Broad network access
Resource pooling
Rapid elasticity
Measured service
On the provider side… ~?
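Why is it good for the provider?
(Without CLT)
$X_i$: independent random variables, each with mean $\mu$ and variance $\sigma^2$
Coefficient of variation: $CV(X_i) = \sigma / \mu$
Exp. value of sum: sum of exp. values
Variance of sum: sum of variances
$CV(X_{sum}) = \frac{\sqrt{n\sigma^2}}{n\mu} = \frac{1}{\sqrt{n}} \cdot \frac{\sigma}{\mu} = \frac{1}{\sqrt{n}} CV(X_i)$
„Statistical multiplexing”
Variance w.r.t. mean gets smaller
$1/\sqrt{n}$: quick – smaller private clouds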
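The statistical multiplexing argument can be checked numerically. The following is a minimal sketch, not part of the original slides: it compares the analytical coefficient of variation of a pooled demand with a seeded simulation. The normal distribution and the values of the mean, standard deviation and trial count are assumptions chosen for illustration; the derivation itself only needs independence and a common mean and variance.

```python
import math
import random
import statistics


def cv_of_sum(cv_single: float, n: int) -> float:
    """Analytical CV of the sum of n independent demands that share
    the same mean and variance: CV(X_sum) = CV(X_i) / sqrt(n)."""
    return cv_single / math.sqrt(n)


def simulated_cv_of_sum(mu: float, sigma: float, n: int,
                        trials: int = 5000, seed: int = 42) -> float:
    """Estimate the CV of the pooled demand by simulation.

    Assumption: each tenant's demand is drawn from a normal distribution;
    this is only an illustrative choice.
    """
    rng = random.Random(seed)
    sums = [sum(rng.gauss(mu, sigma) for _ in range(n)) for _ in range(trials)]
    return statistics.pstdev(sums) / statistics.mean(sums)


if __name__ == "__main__":
    # Hypothetical single-tenant demand: mean 10, standard deviation 5 (CV = 0.5).
    for n in (1, 4, 16, 100):
        print(n, cv_of_sum(0.5, n), simulated_cv_of_sum(10.0, 5.0, n))
```

Running the loop shows how quickly the relative variability falls for small pool sizes, which is the point behind the remark on smaller private clouds.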
Reality is a bit different
Source: http://en.wikipedia.org/wiki/Central_limit_theorem
Gartner, 2013 
„For larger businesses with existing internal data centers, well-managed virtualized infrastructure and efficient IT operations teams, IaaS for steady-state workloads is often no less expensive, and may be more expensive, than an internal private cloud.”
„I need it now, and need it fast…”?
Parallelizable loads
More and more embarrassingly parallel, „scale-out” application categories exist
NYT TimesMachine: public domain archive
o Conversion to web-friendly format: Apache Hadoop, a few hundred VMs, 36 hours
In the cloud: costs the same as with one VM
Practically: „speedup for free”
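The „speedup for free” claim is simple arithmetic under pay-per-use pricing. The sketch below assumes an ideally parallel job, a price that is linear in VM-hours, and no billing granularity or coordination overhead; the price and the VM count of 100 are hypothetical illustration values, not figures from the NYT case.

```python
def job_cost(total_vm_hours: float, n_vms: int, price_per_vm_hour: float):
    """Return (wall_clock_hours, cost) for a perfectly parallel job.

    Assumptions: work splits evenly over the VMs and billing is
    strictly proportional to the VM-hours consumed.
    """
    if n_vms <= 0:
        raise ValueError("n_vms must be positive")
    wall_clock_hours = total_vm_hours / n_vms
    cost = n_vms * wall_clock_hours * price_per_vm_hour
    return wall_clock_hours, cost


if __name__ == "__main__":
    work = 3600.0   # hypothetical total work in VM-hours
    price = 0.10    # hypothetical price per VM-hour
    for n in (1, 100):
        hours, cost = job_cost(work, n, price)
        print(f"{n} VM(s): {hours} h wall clock, cost {cost}")
```

The cost term does not depend on `n_vms`, while the wall-clock time shrinks in proportion to it; real jobs deviate from this once scheduling overhead and hourly rounding matter.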
Scaling resources
„Scale up”
„Scale out”
o Algorithmics?
o „web scale” technologies
Detour 2: Big Data
1.) Big Data at Rest
Distributed storage
„Computation to data”
„At rest Big Data”
o No update
o No sampling
„Not true, but a very, very good lie!”
(T. Pratchett, Night Watch)
MapReduce (Apache Hadoop)
[Figure: MapReduce data flow. Key-value records read from the Distributed File System pass through Map, are grouped by key in SHUFFLE, and are combined by Reduce into output key-value pairs.]
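The Map, Shuffle and Reduce stages can be illustrated with a single-process word count. This is a minimal in-memory sketch of the programming model only; it does not use the Apache Hadoop API, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    """Sum the partial counts collected for one word."""
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)


if __name__ == "__main__":
    print(word_count(["big data at rest", "Big Data in motion"]))
```

In a real deployment the mapper and reducer run on many nodes next to the data blocks, and the shuffle moves data over the network; only the two user-supplied functions stay the same.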
2.) „Big Data in Motion”
Stream processing
Inherently scalable the same way
Streaming data
Sensor data
o From smart grid to turbine testing
Images
o Satellites: n TB/day
Web services 
Network traffic 
Trading 
…
The stream processor model
Source: Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets. Cambridge: Cambridge University Press. p130
Design & composition 
Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p76
When we have a WCET constraint…
Emphasis in „plain” Big Data: keeping step with ingress
o But largely the same for direct timeliness
No (direct) disk access
Memory: bounded
Per-tuple processing: bounded
Algorithmic patterns:
o Per-tuple processing
o Sliding window storage and processing
o Specialized sampling
• Gets ugly fast
o Various heuristics
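The sliding window pattern from the list above can be sketched as follows. This is an illustrative example with hypothetical names, not code from any stream processing product: it keeps a fixed-size window in memory and does a constant amount of work per tuple, which is what the bounded-memory and bounded per-tuple processing constraints ask for.

```python
from collections import deque


class SlidingWindowMean:
    """Mean over the last `size` values with O(1) work per tuple
    and memory bounded by the window size."""

    def __init__(self, size: int):
        if size <= 0:
            raise ValueError("window size must be positive")
        self._window = deque(maxlen=size)
        self._total = 0.0

    def update(self, value: float) -> float:
        """Consume one tuple and return the current window mean."""
        if len(self._window) == self._window.maxlen:
            # The oldest value is about to be evicted by append().
            self._total -= self._window[0]
        self._window.append(value)
        self._total += value
        return self._total / len(self._window)


if __name__ == "__main__":
    window = SlidingWindowMean(3)
    for reading in [1.0, 2.0, 3.0, 4.0]:
        print(window.update(reading))
```

Keeping a running total avoids rescanning the window on every tuple; for very long runs with floating-point data a periodic recomputation of the total may be needed to limit rounding drift.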
Application classes
Source: International Technical Support Organization. IBM InfoSphere Streams: Harnessing Data in Motion. September 2010, p80
Takes on cyber-physical clouds: Cloud-in-CPS…
Converging domains
CPS
Cloud computing
Big Data
standard link
Intelligence
Reconfigurability
Clouds in CPS – reality, not promise
SENSORS
ACTUATORS
Architectural landscape
Takes on cyber-physical clouds: …CPS-in-cloud
Extending Apache VCL for CPS
Apache VCL
Virtualized Data Center
...
Virtual machines
Internet/CAN/LAN
Remote client
Reservation
Establishing connection
Remote desktop or terminal access
Proof of Concept
Time-shareable arrangements
Cloud-on-Cloud
Apache VCL
VCL management network
VCL public network
Cloud instance
Network-attached phys. devices
Experiment video stream
„Cloud on Cloud” capability
Apache VCL
VCL management network
VCL public network
Apache VCL/OpenStack/...
CoC virtual networks
Bootstrap & capture XaaS
Hypervisors
„Cloud on Cloud” (CoC)
With nested virtualization
We have…
o virtual ESXi
o VCL over VCL on that
Some restrictions apply; in VCL, no…
o storage virtualization
o network virtualization
o dynamic reservations
Integrating a field device: Raspberry Pi
Surprisingly popular
o In the target demographic
Almost a lab PC: rpi VCL module
Linux
o gentler learning curve
o In reservation: SSH access
Useful set of interfaces
ASM
C
scripting
Java
Wolfram
Integrating field devices?
Other device types: adapter computer needed
o E.g. a Raspberry Pi for an Arduino
o Scopes/spectrometers/…: already there
o Autonomous cameras/mesh GWs/…: already inside
Lab.pm: starting point, needs rework
o Field devices: „sanitization” is a stronger concept
o Harder work - Pi: reset + read-only SD netboot
Future: field devices as true cloud hosts
Real-time/embedded virtualization is maturing
o Check out: Siemens Jailhouse
o Xen for ARM
o …
Also see: carrier clouds
Raspberry Pi already has containers!
Container/VM
Educational prototype
Immediate applications: cloud engineering
CoC: teaching virt. & cloud
o E.g. we use it for an ESXi lab;
o support for local VCL devel in progress
Real-life: faults, errors, failures
o CPS: performance!
Virtualization in the loop
o There are existing SWIFI tools…
o … and VCL can be a harness
Immediate applications: people & labs 
Internet/CAN/LAN
Remote client
We have EE/CE in view; 
chemistry, biology, 
physics, …?
Trusting your cloud with deadlines - is it a good idea?
Clouds for demanding applications?
Standard infrastructure 
vs 
demanding application?
Clouds for demanding applications?
Virtual Desktop Infrastructure
Telecommunications 
Extra-functional reqs: 
throughput, timeliness, availability 
„Small problems” have high impact 
(soft real time)
Test automation 
Hypervisor 
Interference 
Lab 
OS and hypervisor metrics
LLO 
HHII 
Experimental setup
Short transient faults – long recovery
8 sec platform overload 
30 sec service outage 
120 sec SLA violation 
As if you unplug your desktop for a second...
Deterministic (?!) run-time in the public cloud... 
Variance tolerable by overcapacity 
Performance outage intolerable by overcapacity
The noisy neighbour problem 
Hypervisor 
Tenant Neighbor
Tenant-side measurability and observability 
Hypervisor 
Tenant Neighbor
Characterizing IaaS performance
IaaS performance
HW not necessarily known
Unknown / uncontrollable deployment
Unknown / uncontrollable scheduling
„Noisy neighbors”
Also: management action performance?
IaaS performance
Deployment decisions
o Should I use this cloud?
Capacity planning
o Type and amount of res.
Perf. prediction
o QoS to be expected
o And its deviances
Benchmarking!
Benchmarking (a pragmatic take on)
(De-facto) standard applications
with well defined execution metrics
that may exercise specific subsystems
to compare IT systems via said metrics.
Popular benchmarks: e.g. Phoronix Test Suite
Benchmarking as a Service: cloudharmony.com
Why traditional benchmarking is not enough
Stability
Homogeneity
Rare events
Repeatability?
o Provider/tenant
Micro/component benchmarks?
o Application sensitivity?
o Cloud functions (scale in and out)?
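A single benchmark score says little about stability or rare events. The sketch below is an illustrative assumption about how repeated benchmark results could be summarized, not a method taken from the slides: it reports the coefficient of variation and tail values next to the mean. The nearest-rank percentile and all names are choices made for this example.

```python
import math
import statistics


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile (p in 0..100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def stability_report(samples):
    """Summarize repeated runs of one benchmark on one cloud resource.

    A high CV or a large gap between median and maximum hints at
    instability that a single mean value would hide.
    """
    mean = statistics.mean(samples)
    return {
        "mean": mean,
        "cv": statistics.pstdev(samples) / mean,
        "median": percentile(samples, 50),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical run times in seconds: 99 normal runs and one slow outlier.
    runs = [10.0] * 99 + [100.0]
    print(stability_report(runs))
```

With only 100 runs the single outlier does not even move the 99th percentile, which illustrates why rare events need long measurement campaigns rather than one-off benchmark sessions.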
Towards Measurement-Driven Resilience Design for Clouds
A performance feature model
+ exp. behavior, homogeneity, stability
Li, Z., O'Brien, L., Cai, R., & Zhang, H. (2012). Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services.
In 2012 IEEE Fifth International Conference on Cloud Computing (pp. 344–351). IEEE. doi:10.1109/CLOUD.2012.74
Modeling IaaS performance experiments
Li, Z., O'Brien, L., Cai, R., & Zhang, H. (2012). Towards a Taxonomy of Performance Evaluation of Commercial Cloud Services.
In 2012 IEEE Fifth International Conference on Cloud Computing (pp. 344–351). IEEE. doi:10.1109/CLOUD.2012.74
„Cloud metrology” and its application
Full stack instrumentation
Full adaptive data acquisition
Fine-grained storage
Exploratory Data Analysis
Confirmatory Data Analysis
Mystery shoppers and routine exercises
Application sensitivity model
(Platform) fault model
Performance/capacity model
Structural defenses
Dynamic defenses
MONITORING
BENCHMARKING
Example: characterizing VDI „CPU Ready Time”
„Ready”: VM ready to run, but not scheduled
o VDI: „stutter”
Rare events
o Sampling
Needs fine granularity!
+ at least a few months
Very „wide” data
Result: ~QoE capacity + load
Big Data tooling
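The need for fine granularity can be shown with a toy series. In this sketch all numbers are hypothetical: short spikes in a ready-time metric are clearly visible at the original sampling rate but disappear once the samples are averaged into coarser intervals, as long-term monitoring archives typically do.

```python
def downsample_mean(samples, factor: int):
    """Average consecutive groups of `factor` samples (the last group may be shorter)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    result = []
    for i in range(0, len(samples), factor):
        chunk = samples[i:i + factor]
        result.append(sum(chunk) / len(chunk))
    return result


def count_violations(samples, threshold: float) -> int:
    """Count samples that exceed the threshold."""
    return sum(1 for s in samples if s > threshold)


if __name__ == "__main__":
    # Hypothetical ready-time percentages: 14 quiet samples, then one spike, repeated.
    fine = ([1.0] * 14 + [40.0]) * 4
    coarse = downsample_mean(fine, 15)
    print(count_violations(fine, 10.0), count_violations(coarse, 10.0))
```

Each coarse value is the mean of fourteen quiet samples and one spike, so it stays well below the threshold even though every interval contained a user-visible stutter.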
EDA: hypotheses from „visual tours” of the data
Cloud response time ~ nw delay
client ID ~ loc
Client locations
Does not scale for Big Data (yet)
Workflow? (As of now)
Classical tools
Slow EDA on Big Data
Interactive EDA on samples
Statistics on samples
Big Data statistics
Hadoop, Storm, Cassandra, …
The effect of CPS cloud backend instability
Experimental environment
Host1 
Host2 
Workstation 
Workstation 
OS_contr 
OS_compute 
nimbus 
OS_network 
CollectD 
replay 
superv2 
superv1 
Application
Application topology
Redis spout
Gatherer1
Gatherer2
Aggregator
Timer spout
Sweeper
<ts, city, delay> 
<city, delay>
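The data flow of the topology can be sketched in plain Python. The slide only names the components and the two tuple shapes, so the roles below are assumptions: the gatherers are taken to drop the timestamp, the aggregator to keep a per-city mean delay, and the sweeper to flush that state on each timer tick. The code does not use the Apache Storm API, and all names are hypothetical.

```python
from collections import defaultdict


def gatherer(event):
    """Assumed gatherer role: turn <ts, city, delay> into <city, delay>."""
    _ts, city, delay = event
    return city, delay


class Aggregator:
    """Assumed aggregator role: per-city mean delay since the last sweep."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def on_tuple(self, city: str, delay: float) -> None:
        self._sum[city] += delay
        self._count[city] += 1

    def mean_delays(self):
        return {city: self._sum[city] / self._count[city] for city in self._sum}

    def sweep(self):
        """Assumed sweeper action on a timer tick: emit a snapshot and reset."""
        snapshot = self.mean_delays()
        self._sum.clear()
        self._count.clear()
        return snapshot


if __name__ == "__main__":
    aggregator = Aggregator()
    # Hypothetical events standing in for what the Redis spout would emit.
    events = [(1, "Budapest", 10.0), (2, "Budapest", 20.0), (3, "Vienna", 5.0)]
    for event in events:
        aggregator.on_tuple(*gatherer(event))
    print(aggregator.sweep())
```

In the measured system these stages run as separate distributed components, so queueing between them is where backend instability would show up as process latency.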
Workload
Baseline workload
Start of stress 
End of stress
CPU utilization
Process latency
Relationship with guest resource usage?
Correlation: 0.890
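A correlation coefficient like the one above can be computed from two aligned time series. The slide does not state which coefficient or which exact series were used; the sketch below assumes the Pearson coefficient and uses hypothetical sample values for guest CPU utilization and process latency.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples taken at the same instants.
    cpu_utilization = [20.0, 35.0, 50.0, 80.0, 95.0]
    process_latency = [12.0, 15.0, 21.0, 40.0, 38.0]
    print(pearson(cpu_utilization, process_latency))
```

A high coefficient only indicates a linear association between the two metrics over the measured period; it does not by itself show that guest resource usage causes the latency.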
Acknowledgements
Special thanks go for the experimental environment and data to our OpenStack Measurement „task force”:
Ágnes Salánki, Dávid Zilahi, Tamás Nádudvari, György Nádudvari, Gábor Kiss (BME) and
Gábor Urbanics (Quanopt Ltd, our spinoff)

SERENE 2014 School: Measurement-Driven Resilience Design of Cloud-Based Cyber-Physical Systems