1. Advanced service-based data analytics:
concepts and designs
Hong-Linh Truong
Distributed Systems Group,
Vienna University of Technology
truong@dsg.tuwien.ac.at
dsg.tuwien.ac.at/staff/truong
1ASE Summer 2014
Advanced Services Engineering,
Summer 2014, Lecture 7
Advanced Services Engineering,
Summer 2014, Lecture 7
2. Outline
Principles of elasticity for advanced service-
based data analytics
Data analytics within a single system
Data analytics across multiple systems
Composable cost evaluation
ASE Summer 2014 2
4. Advanced service-based data
analytics (1)
ASE Summer 2014 4
Cities, e.g. including:
10000+ buildings
1000000+ sensors
Near
realtime
analytics
Near
realtime
analytics
Predictive
data
analytics
Visual
Analytics
Enterprise
Resource
Planning
Enterprise
Resource
Planning
Emergency
Management
Emergency
Management
Internet/public cloud
boundary
Organization-specific
boundary
Tracking/Log
istics
Tracking/Log
istics
Infrastructure
Monitoring
Infrastructure
Monitoring
Infrastructure/Internet of Things
......
5. Advanced service-based data
analytics (2)
ASE Summer 2014 5
A lot of input data (L0):
~2.7 TB per day
A lot of results (L1, L2):
e.g., L1 has ~140 MB per
day for a grid of
1kmx1km
Soil
moisture
analysis for
Sentinel-1
Michael Hornacek,Wolfgang Wagner, Daniel Sabel, Hong-Linh Truong, Paul Snoeij, Thomas Hahmann, Erhard Diedrich, Marcela Doubkova,
Potential for High Resolution Systematic Global Surface Soil Moisture Retrieval Via Change Detection Using Sentinel-1, IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, April, 2012
Michael Hornacek,Wolfgang Wagner, Daniel Sabel, Hong-Linh Truong, Paul Snoeij, Thomas Hahmann, Erhard Diedrich, Marcela Doubkova,
Potential for High Resolution Systematic Global Surface Soil Moisture Retrieval Via Change Detection Using Sentinel-1, IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, April, 2012
Data-as-a-Service
and Platform-as-a-
Service in clouds
Data-as-a-Service
and Platform-as-a-
Service in clouds
6. Advanced service-based data
analytics -- fundamental concepts
ASE Summer 2014 6
Part A Part B ...... Part N
Cluster Grid Local Cloud Public cloud/Sky
Applications
System
infrastructures
Domain 1 Domain 2 Domain n
7. Design questions
Which system infrastructures are used?
Which are the functions of units?
Which interfaces are suitable for units?
Which programming models are used within units?
Which are fundamental units to be used?
How do different units interact?
Which non-functional parameters are important and
how to measure them?
ASE Summer 2014 7
Part = a (composite) service unitPart = a (composite) service unit
8. Fundamental concepts – system
infrastructure unit
ASE Summer 2014 8
System
infrastructures
Cloud
Software-
based Cloud
system
Human-based
Cloud system
Cluster Grid
High
Performance
Server
9. Fundamental concepts – unit
functions
ASE Summer 2014 9
Function
Front-
end/Presentation
Data Analytics
Service
Visualization Service Middleware
Enterprise Service
Bus
Publish/Subscription Messaging/Queuing Data Transfer
10. Fundamental concepts –
programming model within units
ASE Summer 2014 10
Programming
model
MapReduce MPI
Parallel
Database
Workflow
Other
solutions
11. Fundamental concepts – interfaces
between units
ASE Summer 2014 11
Interface
Standard
REST SOAP
APIs
Specific APIs
Standard APIs
(e.g.
OpenStack)
Interaction
Pull Push
12. Fundamental concepts – services
and data concerns
ASE Summer 2014 12
Service and
data
concerns
Data
concerns
Quality of
data
Pricing Data Right ...
Service
Concerns
QoS Pricing ...
13. Complex dependencies in (big)
data analytics
More data more
computational resources
(e.g. more VMs)
More types of data
more computational models
more analytics
processes
Change quality of results
Change quality of data
Change response time
Change cost
Change types of result
(form of the data
output, e.g. tree, visual,
story, etc.)
More data more
computational resources
(e.g. more VMs)
More types of data
more computational models
more analytics
processes
Change quality of results
Change quality of data
Change response time
Change cost
Change types of result
(form of the data
output, e.g. tree, visual,
story, etc.)
Data
Computational
Model
Analytics
Process
Analytics Result
Data
Data
DataxDatax
DatayDatay
DatazDataz
Computational
Model
Computational
ModelComputational
Model
Computational
ModelComputational
Model
Computational
Model
Analytics
Process
Analytics
ProcessAnalytics
Process
Analytics
ProcessAnalytics
Process
Analytics
Process
Quality of
Result
ASE Summer 2014 13
Hong-Linh Truong, Schahram Dustdar, "Principles of
Software-defined Elastic Systems for Big Data
Analytics", (c) IEEE Computer Society, IEEE
International Workshop on Software Defined
Systems, 2014 IEEE International Conference on
Cloud Engineering (IC2E 2014), Boston,
Massachusetts, USA, 10-14 March 2014
Hong-Linh Truong, Schahram Dustdar, "Principles of
Software-defined Elastic Systems for Big Data
Analytics", (c) IEEE Computer Society, IEEE
International Workshop on Software Defined
Systems, 2014 IEEE International Conference on
Cloud Engineering (IC2E 2014), Boston,
Massachusetts, USA, 10-14 March 2014
14. Complex dependencies in (big)
data analytics
Elasticity principles
can be used to
support this!
Elasticity principles
can be used to
support this!
ASE Summer 2014 14
15. Elasticity Principles: Elasticity of
data and computational models
Multiple types of objects from different sources with
complex dependencies, relevancies, and quality
Different data and computational models the same
analytics subject
New analytics subjects can be defined and analytics
goals can be changed
Decide/select/define/compose not only computational
models for analytics subjects but also data models
based on existing ones
Management and modeling of elasticity of data and computational
model during the analytics
Management and modeling of elasticity of data and computational
model during the analytics
ASE Summer 2014 15
16. Elasticity Principles: Elasticity of
data resources
Data provided, managed and shared by different
providers
Data associated with different concerns (cost, quality
of data, privacy, contract, etc.
Static data, open data, data-as-a-service, opportunistic
data (from sensors and human sensing)
Not just centralized big data and total data ownership
Data resources can be taken into account in an elastic
manger: similar to VMs, based on their quality,
relevancy, pricing, etc.
Data resources can be taken into account in an elastic
manger: similar to VMs, based on their quality,
relevancy, pricing, etc.
ASE Summer 2014 16
17. Elasticity Principles: Elasticity of
humans and software as computing
units
Human in the loop to solve analytics tasks that software
cannot solve
Human-based compute units can be scaled up/down
with different cost, availability, performance models
Human-based compute units + software-based
compute units for executing computational models
Elasticity controls can be also done by humans
Provisioning hybrid compute units in an elastic way for
computational/data/network tasks as well as for
monitoring/control tasks in the analytics process
Provisioning hybrid compute units in an elastic way for
computational/data/network tasks as well as for
monitoring/control tasks in the analytics process
ASE Summer 2014 17
18. Elasticity Principles: Elasticity of
quality of results
Definition of quality of results
Trade-offs of time, cost, quality of data, forms of
output
Using quality of results to select suitable
computational models, data resources,
computing units
Multi-level control for the elasticity based on
quality of results
Able to cope with changes in quality of data,
performance, cost and types of results at runtime
Able to cope with changes in quality of data,
performance, cost and types of results at runtime
ASE Summer 2014 18
19. WE NEED TO START FROM
DATA ANALYTICS WITHIN A
SINGLE SYSTEM
ASE Summer 2014 19
20. Domain ADomain A
Data analytics within a single
system
They are complex enough but
do not meet all requirements
In a single domain
Tightly coupled computing
infrastructures
E.g., in the same
cloud
Computation and data are
close
Several concerns can be
by-passed
ASE Summer 2014 20
Data
service
unit
Data Analytics
Unit
Not always provisioned under the „Service Unit“ model
21. Data analytics within a single
system
ASE Summer 2014 21
1. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael
Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM
SIGMOD International Conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville
(Eds.). ACM, New York, NY, USA, 165-178. DOI=10.1145/1559845.1559865
http://doi.acm.org/10.1145/1559845.1559865
2. Leonardo Neumeyer, Bruce Robbins, Anish Nair, Anand Kesari: S4: Distributed Stream Computing Platform. ICDM
Workshops 2010: 170-177
3. Jerry Chou, Mark Howison, Brian Austin, Kesheng Wu, Ji Qiang, E. Wes Bethel, Arie Shoshani, Oliver Rübel,
Prabhat, and Rob D. Ryne. 2011. Parallel index and query for large scale data analysis. In Proceedings of 2011
International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New
York, NY, USA, , Article 30 , 11 pages. DOI=10.1145/2063384.2063424 http://doi.acm.org/10.1145/2063384.2063424
4. Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant J. Shenoy: A platform for scalable one-pass
analytics using MapReduce. SIGMOD Conference 2011: 985-996
5. Fabrizio Marozzo, Domenico Talia, Paolo Trunfio: A Cloud Framework for Parameter Sweeping Data Mining
Applications. CloudCom 2011: 367-374
6. Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst: HaLoop: Efficient Iterative Data Processing on Large
Clusters. PVLDB 3(1): 285-296 (2010)
1. Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael
Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM
SIGMOD International Conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville
(Eds.). ACM, New York, NY, USA, 165-178. DOI=10.1145/1559845.1559865
http://doi.acm.org/10.1145/1559845.1559865
2. Leonardo Neumeyer, Bruce Robbins, Anish Nair, Anand Kesari: S4: Distributed Stream Computing Platform. ICDM
Workshops 2010: 170-177
3. Jerry Chou, Mark Howison, Brian Austin, Kesheng Wu, Ji Qiang, E. Wes Bethel, Arie Shoshani, Oliver Rübel,
Prabhat, and Rob D. Ryne. 2011. Parallel index and query for large scale data analysis. In Proceedings of 2011
International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11). ACM, New
York, NY, USA, , Article 30 , 11 pages. DOI=10.1145/2063384.2063424 http://doi.acm.org/10.1145/2063384.2063424
4. Boduo Li, Edward Mazur, Yanlei Diao, Andrew McGregor, Prashant J. Shenoy: A platform for scalable one-pass
analytics using MapReduce. SIGMOD Conference 2011: 985-996
5. Fabrizio Marozzo, Domenico Talia, Paolo Trunfio: A Cloud Framework for Parameter Sweeping Data Mining
Applications. CloudCom 2011: 367-374
6. Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst: HaLoop: Efficient Iterative Data Processing on Large
Clusters. PVLDB 3(1): 285-296 (2010)
Some papers
22. Data analytics within a single
system – some examples
ASE Summer 2014 22
Message Passing
Interface (MPI) + Cluster-
based File system
Message Passing
Interface (MPI) + Cluster-
based File system
MapReduce + Google
File System
MapReduce + Google
File System
Hadoop + HDFSHadoop + HDFS
Dryad+LINQDryad+LINQ
Parallel Database
(SQL/NonSQL)
Parallel Database
(SQL/NonSQL)
Yahoo S4Yahoo S4
WorkflowWorkflow
A short, good overview in Chapter 6: Cloud Programming and Software Environments, Book: Distributed and Cloud
Computing – from Parallel Processing to the Internet of Things, Kai Hwang, Geoffrey C. Fox and Jack J Dongarra,
Morgan Kaufmann, 2012
A short, good overview in Chapter 6: Cloud Programming and Software Environments, Book: Distributed and Cloud
Computing – from Parallel Processing to the Internet of Things, Kai Hwang, Geoffrey C. Fox and Jack J Dongarra,
Morgan Kaufmann, 2012
23. Discussion time
ASE Summer 2014 23
WHY ANALYTICS UNITS SHOULD BE
„CLOSED“ TO DATA UNITS?
WHICH CONCERNS COULD BE IGNORE IN
SINGLE SYSTEM DATA ANALYTICS?
WHICH ISSUES WE NEED TO CONSIDER
WHEN OUR DATA UNITS ARE IN
DIFFERENT SYSTEMS?
24. Data analytics across multiple
systems – design choice
Programming models
for data analytics
service
Data service units
Supporting middleware
units
ASE Summer 2014 24
Programming model
System
Infrastrucure
Interface
25. Data analytics across multiple systems
– programming models (1)
Static data
ASE Summer 2014 25
Local
input
data
Analytics
Results
MapReduce/Hadoop
Workflow
MPI
Other solutions
Servers/Cloud/Cluster
What are our design concerns?What are our design concerns?
Input
data
Output
data
26. Stockmarket
Social media
M2M
Stockmarket
Social media
M2M
Data analytics across multiple systems
– programming models (2)
Near-realtime data
ASE Summer 2014 26
Analytics
Results
Complex event processing
Stream data analysis
Other solutions
Servers/Cloud/Cluster
Input
data
Output
data
What are our design concerns?What are our design concerns?
27. Big data (e.g.,
satellite images)
Big data (e.g.,
satellite images)
Data analytics across multiple systems
– programming models (3)
Near-realtime data
ASE Summer 2014 27
Analytics
Results
MPI
Workflow
Other solutions
Servers/Cloud/Cluster
Input
data
Output
data
What are our design concerns?What are our design concerns?
28. Data analytics across multiple
systems – data service units
ASE Summer 2014 28
Cluster file
LustreLustreNFSNFS
Data
Analytics Unit
• Read/write data via direct ,
low-level read/write via IO
Interface
• Cluster or cluster of clusters
• Can be very large
System
• Usually parallel processing
Programming model
Hadoop File SystemHadoop File System Google file systemGoogle file system
Read/write data
29. Data analytics across multiple
systems – data service units
ASE Summer 2014 29
Storage-as-a-
Service
Google Storage Service
(REST API)
Google Storage Service
(REST API)
Amazon S3
(SOAP/REST API)
Amazon S3
(SOAP/REST API)
Data
Analytics Unit • Direct data transfer via REST/SOAP APIs
Interface
• Decouple between analytics and storage
System
• May require middleware for data transfer
• Request via SOAP/REST
• Real data transfer done by external
middleware
• A rich set of programming models can be used
Programming model
commands data
30. Data analytics across multiple
systems – data service units
ASE Summer 2014 30
Database-as-a-
Service
SkySQL
Amazon RDS
Microsoft SQL Azure
Clustrix DBaaS
SkySQL
Amazon RDS
Microsoft SQL Azure
Clustrix DBaaS
MongoDB/MongoLab
Amazon DynamoDB
Amazon SimpleDB
Cloudant Data
MongoDB/MongoLab
Amazon DynamoDB
Amazon SimpleDB
Cloudant Data
Data
Analytics Unit
• REST/SOAP APIs
• Mainly for commands and results
Interface
• Decouple between analytics unit and database
• Database as a sevice can be very large
System
• Analytics can be done at both sides
• Analytic units can use any programming models
• Database-as-a-service can perform a lot of
analytics
• Parallel database operations
Programming model
Technology
queries data
31. Data analytics across multiple
systems – data service units
ASE Summer 2014 31
DaaS
Infochimps
Microsoft Azure
Xively
GNIP
Infochimps
Microsoft Azure
Xively
GNIP
Data
Analytics Unit
• Data transfer can be uni or bi-
direction
• REST/SOAP APIs
Interface
• Both systems for DaaS and for
analytics units can be very large
System
• Can be any
Programming model
Technology
32. Middleware service unit for
transfering large data -- GlobusOnline
ASE Summer 2014 32
Source: Bryce Allen, John Bresnahan, Lisa Childers, Ian Foster, Gopi Kandaswamy, Raj Kettimuthu, Jack Kordas, Mike
Link, Stuart Martin, Karl Pickett, and Steven Tuecke. 2012. Software as a service for data scientists. Commun. ACM 55,
2 (February 2012), 81-88. DOI=10.1145/2076450.2076468 http://doi.acm.org/10.1145/2076450.2076468
Source: Bryce Allen, John Bresnahan, Lisa Childers, Ian Foster, Gopi Kandaswamy, Raj Kettimuthu, Jack Kordas, Mike
Link, Stuart Martin, Karl Pickett, and Steven Tuecke. 2012. Software as a service for data scientists. Commun. ACM 55,
2 (February 2012), 81-88. DOI=10.1145/2076450.2076468 http://doi.acm.org/10.1145/2076450.2076468
33. Middleware service unit for
transfering large data -- ProxyWS
ASE Summer 2014 33
Spiros Koulouzis, Reginald Cushing, K. A. Karasavvas, Adam Belloum, Marian Bubak: Enabling Web Services to
Consume and Produce Large Datasets. IEEE Internet Computing 16(1): 52-60 (2012)
Spiros Koulouzis, Reginald Cushing, K. A. Karasavvas, Adam Belloum, Marian Bubak: Enabling Web Services to
Consume and Produce Large Datasets. IEEE Internet Computing 16(1): 52-60 (2012)
34. Middleware service units for
messages/queuing
Advanced Message Queuing Protocol (AMQP)
Simple (or Streaming) Text Orientated
Messaging Protocol (STOMP)
Specific protocols/APIs
ASE Summer 2014 34
Amazon SQSAmazon SQSStormMQStormMQ RabbitMQRabbitMQ
37. Different possibilities: Grids and/or
clouds
Raw images stored in archival: iRODS, HTTP server, or
Amazon S3
Notification
Any queuing system: on-premise or cloud-based
service
Reference images:
Local/pre-deployed or deployed on demand
Computation: set of workstations, cluster, EC2, etc.
Sentinel-1 images and SSM storage:
Local files, cloud storage, iRODs, etc.
Result notification and sharing: to whom? At which
scale?
37
The choices are also strongly dependent on “collaboration
needs” and money! But how easy data sharing is?
The choices are also strongly dependent on “collaboration
needs” and money! But how easy data sharing is?
ASE Summer 2014
40. Sustainability governance analysis
ASE Summer 2014 40
Cities, e.g. including:
10000+ buildings
1000000+ sensors
Near
realtime
analytics
Near
realtime
analytics
Predictive
data
analytics
Visual
Analytics
Enterprise
Resource
Planning
Enterprise
Resource
Planning
Emergency
Management
Emergency
Management
Internet/public cloud
boundary
Organization-specific
boundary
Tracking/Log
istics
Tracking/Log
istics
Infrastructure
Monitoring
Infrastructure
Monitoring
Infrastructure/Internet of Things
......
41. DaaS for sustainability governance
Monitoring data DaaS
Domain-specific knowledge DaaS
ASE Summer 2014 41
Hong-Linh Truong, Schahram Dustdar , Sustainability Data and Analytics in Cloud-Based M2M Systems, Big Data
and Internet of Things: A Roadmap for Smart Environments Studies in Computational Intelligence Volume 546, 2014, pp
343-365
Hong-Linh Truong, Schahram Dustdar , Sustainability Data and Analytics in Cloud-Based M2M Systems, Big Data
and Internet of Things: A Roadmap for Smart Environments Studies in Computational Intelligence Volume 546, 2014, pp
343-365
42. Platform-as-a-Service for
Sustainability Governance
• different types of analytics application models,
such as batch, workflow and stream applications
and intelligent bots
different programming models and languages
For analytics of large-scale data but also bot-as-
a-service
ASE Summer 2014 42
45. HOW TO DEAL WITH COST
AND QUALITY OF COMPLEX
SERVICES?
Discussion time
ASE Summer 2014 45
46. 46 46
Examples of our complex data
analytics
Bio-mechanic applications
Simulate the stiffness of human bones
Data and computation intensive applications
Sequential and parallel programs (e.g., parfe and paraview),
Complex software installation: Parmetis, Trilinos, Parfe, Paraview, and HDF5
run under batch and interactive modes
ASE Summer 2014
48. Dealing with performance and cost
of complex applications in clouds
Application complexity
Elastic high performance applications on multiple clouds: libraries, software
services, virtual machines, etc.
Cost and performance are needed for determining which parts of the application
should be excuted in the clouds and when
Cost/performance model complexity
Coarse- and fine-grained cost models of clouds at different layers:
Too coarse-grained (networks, storages, machines) or too fine-grained (IO
calls)
Software-, data-, human-specific cost/performance models
Cost models for individual parts (workflow, MPI, OpenMP, etc.)
Tran Vu Pham, Hong-Linh Truong, Schahram Dustdar "Elastic High Performance Applications - A Composition
Framework", The 2011 Asia-Pacific Services Computing Conference (IEEE APSCC 2011), (c) IEEE Computer Society,
December 12 - 15, 2011, Jeju, Korea
Hong Linh Truong, Schahram Dustdar: Composable cost estimation and monitoring for computational applications in
cloud computing environments. Procedia CS 1(1): 2175-2184 (2010)
Tran Vu Pham, Hong-Linh Truong, Schahram Dustdar "Elastic High Performance Applications - A Composition
Framework", The 2011 Asia-Pacific Services Computing Conference (IEEE APSCC 2011), (c) IEEE Computer Society,
December 12 - 15, 2011, Jeju, Korea
Hong Linh Truong, Schahram Dustdar: Composable cost estimation and monitoring for computational applications in
cloud computing environments. Procedia CS 1(1): 2175-2184 (2010)
ASE Summer 2014 48
49. Composable cost evaluation
Part A Part B Part C
Cost/performance
model i
Cost/performance
model j
Cost/performance
model k
Runtime:
Elastic
processes
Elastic high performance applications on multiple clouds:
libraries, software services, virtual machines, etc.
Utilize different
performance and
dependencies models
for sequential, parallel,
workflows, etc.
ASE Summer 2014 49
50. Composable cost evaluation:
Estimation and Monitoring
Leverage our previous knowledge on event representations,
application monitoring, performance analysis, dependability
analysis
Employ service-oriented approach
RESTful service, JSON and XML event data
50 50ASE Summer 2014
51. Event Representations and
Instrumentation
Captured monitoring events based on a well-defined
specification
– Well-known instrumentation techniques can be
reused
Consider different application execution models (e.g.,
MPI, workflows, etc.)
ASE Summer 2014 51
55. Online Cost Monitoring - Examples
One experiment of a
bioinformatic
workflow in EC2
Support runtime
cost-based
composition and
execution
55ASE Summer 2014 55
56. Exercises
Read mentioned papers
Analyze the relationships between programming
models and system infrastructures for data
analytics across multiple domains
Examine http://cloudcomputingpatterns.org and
see how it supports data analytics patterns
Develop some patterns for data analytics across
multiple systems
Work on composable cost evaluation for
complex data analytics
ASE Summer 2014 56
57. 57
Thanks for
your attention
Hong-Linh Truong
Distributed Systems Group
Vienna University of Technology
truong@dsg.tuwien.ac.at
dsg.tuwien.ac.at/staff/truong
ASE Summer 2014