Towards Autonomic e-Science Ecosystems
1. Towards autonomic e‐science
ecosystems
Cécile Germain‐Renaud
Laboratoire de Recherche en Informatique
Université Paris‐Sud ‐ CNRS ‐ INRIA
2. Outline
Computational ecosystems
The Clouds
Challenges
Autonomics
30/01/11 ASSYST meeting: Opening the Cloud
3. The requirements of e‐science
“Cyberinfrastructure integrates hardware for computing, data and networks, digitally-enabled sensors, observatories and experimental facilities, and an interoperable suite of software and middleware services and tools…”
NSF’s Cyberinfrastructure Vision for 21st Century Discovery
4. An old dream
UCLA press release on the creation of ARPANET, 1969
« A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities. » I. Foster, C. Kesselman, The Grid, 1998
6. The EGEE/EGI grid
LHC is the
• Largest (26 km),
• Fastest (14 TeV),
• Coldest (1.9 K),
• Emptiest (10^-13 atm)
machine.
EGEE/EGI is the
• Largest (40K CPUs),
• Most distributed (250 sites),
• Most used (300K jobs/day)
computer system.
The Atlas Collaboration (one in four)
• 3000 scientists
• 38 countries
• 174 universities and labs
11. Experience with the EGEE/EGI grid
[Figure: EGEE CPU usage (%), logarithmic scale from 0.10% to 100.00%, for years Y0, Y1 and Y2, by category: AA, CC, ES, F, HEP, INF, LS, MV, OTH, UNK]
Source: Report on Utilization of EGEE support services and infrastructure, May 2010
12. e‐science ecosystems
• A major requirement is pervasiveness: on-demand, integrated, transparent
• Continuity, not revolution: we must learn from experience
• Organized scientific communities are committed to globalized homogeneous systems. Individualized science is not (yet?). Heterogeneous high-level systems are still in the design stage.
16. SaaS: Software as a Service
How to deliver/consume/manage such services
« The bottom line is that any distinction between SaaS and POWA (Plain Old Web Applications) is at worst arbitrary and at best concerned with the business relationship between the provider and the consumer rather than technical aspects of the application. » Same source
• Cloud provides increased infrastructure flexibility: excellent, but not the bottleneck
• Application or user-oriented flexibility
• Control and orchestration of the holistic applications across specialized and heterogeneous components, whether local, in a grid or in a cloud
• Agility as the capacity to reconfigure and reorganize the internal processes
17. The Grid experience
« Grids are defined by coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The sharing is, necessarily, highly controlled, with resource providers and consumers defining clearly and carefully just what is shared, who is allowed to share, and the conditions under which sharing occurs » Ian Foster, 2000
« A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high computational capabilities. » I. Foster, C. Kesselman, The Grid, 1998
18. Consumers
Different users and requirements across and within the collaborations
21. The message
• DEFINITELY NOT “Cloud is a buzzword”
• A technology, not a silver bullet
• Both e-science and business require
  • Efficient integration of large datasets with computing
  • Pervasiveness
• e-science has specific requirements
  • Organized sharing: data and funding, technical and political issues
  • Performance: not always, but a strong cultural bias/feature.
23. The complexity crisis
source: IDC 2008, retrieved from
http://www.vmware.com/files/pdf/Virtualization-application-based-cost-model-WP-EN.pdf
24. The complexity crisis in action
Source: http://www.teach-ict.com/news/news_stories/news_computer_failures.htm
31. Complexity AND uncertainty
• As a distributed system
  • Components and communications come and go
  • For dynamic (P2P) systems, but for managed systems as well
  • CAP (Brewer’s) theorem: at most two of Consistency, Availability and Partition tolerance can be guaranteed
• As a dynamic(al) system
  • Entities change behavior as an effect of unexpected feedbacks, emergent behavior
  • Self-organized criticality, minority games, ...
• Lack of complete and common knowledge: information uncertainty
  • Monitoring is distributed too
  • Resolution and calibration
  • Semantics and ontologies
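The CAP trade-off listed above can be illustrated with a toy model. This is only a sketch under stated assumptions, not part of the talk: the class `Replica`, the function `write` and the `mode` flag are hypothetical names, and a single boolean stands in for a network partition. In "CP" mode a write during a partition is refused (consistency kept, availability lost); in "AP" mode it is accepted locally (availability kept, replicas diverge).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A toy replica holding one integer value (hypothetical, for illustration)."""
    value: int = 0


def write(replicas, target, new_value, partitioned, mode):
    """Try to write new_value through replicas[target].

    mode == "CP": refuse the write when the replicas cannot reach each other.
    mode == "AP": accept the write locally even if it cannot be propagated.
    Returns True if the write was accepted.
    """
    if not partitioned:
        # Healthy network: every replica is updated, so both properties hold.
        for r in replicas:
            r.value = new_value
        return True
    if mode == "CP":
        # Consistency preferred: the system is unavailable for writes.
        return False
    # Availability preferred: only the reachable replica is updated.
    replicas[target].value = new_value
    return True


def consistent(replicas):
    """True when all replicas hold the same value."""
    return len({r.value for r in replicas}) == 1


cp = [Replica(), Replica()]
cp_accepted = write(cp, 0, 42, partitioned=True, mode="CP")

ap = [Replica(), Replica()]
ap_accepted = write(ap, 0, 42, partitioned=True, mode="AP")
```

Under a partition, the "CP" pair stays consistent but rejects the write, while the "AP" pair accepts it and ends up inconsistent.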
32. Complexity AND uncertainty
For applications too
• Opportunistic behaviors
  • Space-time, accuracy, and more generally objective adaptivity
  • Context-awareness as required by a CAP-prone environment
• Dynamic and complex coupling and interactions
  • multi-physics, multi-model, multi-resolution, …
• Trust in data and software
  • Not only for P2P systems
33. Challenges Summary
• Current levels of scale, complexity and dynamism make it infeasible for humans to effectively manage and control systems and applications
• Computing ecosystems, with their very large numbers of hardware and software components interacting with very large data, are complex systems that are currently very difficult to program
• Computing ecosystems are difficult to manage because of the heterogeneity of workflows, data sets and operating environments.
• The ability of an application to self-adapt by incorporating dynamic inputs along its execution needs to be formulated through a general and principled programming model
35. What is Autonomic Computing?
“Computing systems that manage themselves in accordance with high-level objectives from humans”
Kephart and Chess, The Vision of Autonomic Computing, IEEE Computer, 2003
36. Milestones
• IBM Vision and Manifesto 2001
• J. O. Kephart and D. M. Chess. The vision of autonomic computing. IEEE Computer, 36(1), 2003
• IEEE International Conference on Autonomic Computing series since 2004
• IEEE Task Force on Autonomous and Autonomic Systems 2006
• ECML PKDD 2006 Tutorial/Workshop: Autonomic Computing: A New Challenge for Machine Learning, I. Rish and G. Tesauro
• ACM Transactions on Autonomous and Adaptive Systems (TAAS), 2006
• Autonomic Computing: Concepts, Infrastructure and Applications, M. Parashar and S. Hariri (Eds.), CRC Press, 2006
• The NSF Center for Autonomic Computing, 2008
• International Journal of Autonomic Computing (IJAC), Inderscience Publishers, 2009
• Panel at the 1st GMAC workshop: The convergence of Grids, Clouds and Autonomics, 2009
37. Self‐management
• Self-Configuration: automated configuration of components and systems according to high-level policies; the rest of the system adjusts seamlessly.
• Self-Healing: automated detection, diagnosis, and repair of localized software/hardware problems.
• Self-Optimization: automatic and continual adaptive tuning of hundreds of parameters (database parameters, server parameters, …) affecting performance and efficiency.
• Self-Protection: automated defense against malicious attacks or cascading failures; use early warning to anticipate and prevent system-wide failures.
38. The Autonomic Nervous System
• The most sophisticated example of autonomic behavior.
• Regulates and maintains homeostasis: maintains structure and functions by means of a multiplicity of dynamic equilibria that are rigorously controlled by interdependent regulation mechanisms.
• Not all parameters have the same urgency; essential parameters are monitored more closely.
39. Ashby’s Ultrastable System
A control theory vision
Source: M. Parashar and S. Hariri, “Autonomic Computing: An Overview,” UPP 2004, Mont Saint-Michel, France, J.-P. Banâtre et al. (Eds.), LNCS Vol. 3566, Springer Verlag, pp. 247-259, 2005.
41. The MAPE‐K loop
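[Figure: MAPE-K loop. An Autonomic Manager (Monitor, Analyze, Plan, Execute, organized around shared Knowledge) is connected to a Managed Element through sensor (S) and effector (E) interfaces.]
Inputs: environment sensors, network instrumentation, users' context, application requirements
High-dimensional, high-volume ‘raw’ data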
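The Monitor, Analyze, Plan, Execute cycle around shared Knowledge can be sketched in a few lines. This is a minimal illustration under assumptions that are not in the slides: the managed element is a hypothetical server pool (`ManagedElement`), each server is assumed to handle 100 requests per second, and the high-level objective is a target utilisation of 0.7.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedElement:
    """Toy managed element: a server pool with a sensor and an effector."""
    servers: int = 2
    load: float = 300.0  # offered load in requests per second (assumed)

    def sense(self):
        # Sensor (S): per-server utilisation, assuming 100 req/s per server.
        return {"utilisation": self.load / (100.0 * self.servers)}

    def effect(self, delta):
        # Effector (E): add or remove servers, never going below one.
        self.servers = max(1, self.servers + delta)


@dataclass
class AutonomicManager:
    """Monitor, Analyze, Plan, Execute around a shared Knowledge store."""
    target: float = 0.7       # high-level objective: desired utilisation
    tolerance: float = 0.15   # accepted deviation before acting
    knowledge: dict = field(default_factory=lambda: {"history": []})

    def monitor(self, element):
        reading = element.sense()
        self.knowledge["history"].append(reading["utilisation"])
        return reading

    def analyze(self, reading):
        # Symptom: signed deviation from the objective.
        return reading["utilisation"] - self.target

    def plan(self, deviation):
        if deviation > self.tolerance:
            return +1   # overloaded: add a server
        if deviation < -self.tolerance:
            return -1   # underused: remove a server
        return 0        # within tolerance: do nothing

    def execute(self, element, delta):
        if delta:
            element.effect(delta)

    def step(self, element):
        self.execute(element, self.plan(self.analyze(self.monitor(element))))


element = ManagedElement()
manager = AutonomicManager()
for _ in range(10):
    manager.step(element)
```

With these assumed numbers the loop grows the pool from 2 to 4 servers and then holds, since a utilisation of 0.75 is within tolerance of the target.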
42. The MAPE‐K loop
[Figure: MAPE-K loop diagram (Autonomic Manager: Monitor, Analyze, Plan, Execute, Knowledge; Managed Element; S/E interfaces)]
High-dimensional, high-volume ‘raw’ data
State-Space and Data Abstraction
Streaming: on-line data mining, clustering, ...
Dimensionality reduction
Active learning
Ontological inference
Compressed, ‘informative’ data
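As one possible instance of streaming abstraction, the sketch below compresses a stream of monitoring points into a few centroids in a single pass. It uses sequential (MacQueen-style) k-means, chosen here for brevity and not named in the slides; the two-cluster synthetic stream and all names are assumptions for illustration.

```python
import random


def online_kmeans(stream, k):
    """Sequential (MacQueen-style) k-means: one pass over the stream.

    The first k points seed the centroids; each later point moves its nearest
    centroid by a step of 1/count, so every centroid is the running mean of
    the points assigned to it.
    """
    centroids, counts = [], []
    for point in stream:
        if len(centroids) < k:
            centroids.append(list(point))
            counts.append(1)
            continue
        # Nearest centroid by squared Euclidean distance.
        j = min(range(k),
                key=lambda i: sum((c - x) ** 2 for c, x in zip(centroids[i], point)))
        counts[j] += 1
        step = 1.0 / counts[j]
        centroids[j] = [c + step * (x - c) for c, x in zip(centroids[j], point)]
    return centroids, counts


# Synthetic stream (assumed): points alternate between two noisy operating regimes.
rng = random.Random(0)
centres = [(0.0, 0.0), (10.0, 10.0)]
stream = ((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for i in range(1000)
          for cx, cy in [centres[i % 2]])
summary, sizes = online_kmeans(stream, k=2)
```

The 1000 raw points are reduced to two centroids and two counts, which is the kind of compressed, informative summary the Analyze stage can work with.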
43. The MAPE‐K loop
[Figure: MAPE-K loop diagram (Autonomic Manager: Monitor, Analyze, Plan, Execute, Knowledge; Managed Element; S/E interfaces)]
Compressed, ‘informative’ data
Learn predictive models
Classification, regression, time series, MCMC
Decision-making
Exploration vs exploitation
Game theory, risk analysis
Reinforcement learning, bandits
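The exploration vs exploitation trade-off can be made concrete with a multi-armed bandit. The sketch below uses UCB1, one standard bandit strategy picked here as an example rather than taken from the talk; the three "arms" stand for hypothetical configurations with assumed success probabilities.

```python
import math
import random


def ucb1(pull, n_arms, rounds):
    """UCB1: play the arm with the highest optimistic estimate of its mean reward."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise the estimates
        else:
            # Exploitation (means) plus an exploration bonus that shrinks
            # as an arm is played more often.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean
    return counts, means


rng = random.Random(1)
success_prob = [0.2, 0.5, 0.8]   # hypothetical per-configuration success rates
counts, means = ucb1(lambda a: 1.0 if rng.random() < success_prob[a] else 0.0,
                     n_arms=3, rounds=5000)
```

Every arm keeps being sampled occasionally, but most plays concentrate on the configuration with the best observed reward.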
44. The MAPE‐K loop
[Figure: MAPE-K loop diagram (Autonomic Manager: Monitor, Analyze, Plan, Execute, Knowledge; Managed Element; S/E interfaces)]
Knowledge-based, e.g. ontologies, a-priori models, intelligent initialisation
Or
Tabula-rasa knowledge
- Avoids knowledge-intensive model building
Criteria
- Independent knowledge and learning
- Theoretical guarantees of improvement
45. Technical issues: example for RL
Vanilla Reinforcement Learning needs enhancements
• Observation uncertainty
• Historical dependencies may exist: an MDP might not be an exact model
• Convergence is not guaranteed
• Lack of stationarity
• Continuous state-action space requires approximations
• Local vs global learning, because of the curse of dimensionality
• Exploration penalties might be excessive
An in-depth exploration of these issues: Gerald Tesauro et al. On the Use of Hybrid Reinforcement Learning for Autonomic Resource Allocation. Cluster Computing, 10(3):287-299, 2007.
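For reference, "vanilla" RL here can be read as tabular Q-learning on a small, fully observed, stationary MDP. The sketch below is such a baseline on a toy allocation problem, and is not the hybrid method of Tesauro et al.; the environment (1 to 5 servers, a fixed demand of 3, reward equal to minus the provisioning error) and all parameter values are assumptions for illustration.

```python
import random

ACTIONS = (-1, 0, +1)          # remove a server, keep, add a server
STATES = range(1, 6)           # 1..5 servers allocated
DEMAND = 3                     # hypothetical fixed demand, in servers


def step(servers, action):
    """Deterministic toy environment: reward penalises over- and under-provisioning."""
    nxt = min(5, max(1, servers + action))
    return nxt, -abs(nxt - DEMAND)


def q_learning(episodes=2000, horizon=20, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = rng.choice(list(STATES))                         # random start state
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)                      # explore
            else:
                a = max(ACTIONS, key=lambda b: q[(s, b)])    # exploit
            s2, r = step(s, a)
            target = r + gamma * max(q[(s2, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (target - q[(s, a)])        # temporal-difference update
            s = s2
    return q


q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}
```

The learned greedy policy moves the allocation toward the demand and then holds it. The toy works because the state space is tiny, discrete, fully observed and stationary, which are exactly the assumptions the list above says break down in real systems.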
46. Transversal issues
• Limits
  • Biological self-* (awareness, healing…) may/will ultimately fail, plus unforeseen threats
• Overheads
  • Designing, programming, executing, provisioning
• Validation
  • Extreme events: revisit traditional criteria, e.g. RMSE
  • Benchmarking under uncertainty
  • Availability of reference datasets
C. Germain-Renaud et al. The Grid Observatory, to appear in IEEE/ACM CCGRID'11
www.grid-observatory.org
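A small worked example of why RMSE may need revisiting when extreme events matter. The two error series are made-up numbers, not measurements: one predictor is always slightly wrong, the other is perfect except for a single large miss.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def max_abs(errors):
    """Worst-case absolute error."""
    return max(abs(e) for e in errors)


# Two hypothetical predictors evaluated on 100 points.
steady = [1.0] * 100                 # always off by 1
bursty = [0.0] * 99 + [10.0]         # perfect except for one extreme miss

rmse_steady, rmse_bursty = rmse(steady), rmse(bursty)
worst_steady, worst_bursty = max_abs(steady), max_abs(bursty)
```

Both series have the same RMSE, while their worst-case errors differ by a factor of ten, so RMSE alone cannot tell the two behaviors apart.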