Scalable, Fault-tolerant
Management of Grid Services:
Application to Messaging
Middleware
Harshawardhan Gadgil
hgadgil@cs.indiana.edu
Ph.D. Defense Exam (Practice Talk) Advisor: Prof. Geoffrey Fox
2
Talk Outline
 Motivation
 Architecture
 Service-oriented Management
 Performance Evaluation
 Application: Managing Grid Messaging
Middleware
 Related Work
 Conclusion
3
Motivation:
Characteristics of today’s (GRID) Applications
 Increasing Application Complexity
 Applications distributed and composed of multiple resources
 In future, much larger systems will be built
 Components widely dispersed and disparate in nature and
access
 Span different administrative domains
 Under differing network / security policies
 Limited access to resources due to presence of firewalls, NATs
etc…
 Components in flux
 Components (Nodes, network, processes) may fail
 Services must meet
 General QoS and Life-cycle features
 (User defined) Application specific criteria
 Need to “manage” services to provide these
capabilities
4
Motivation:
Key Challenges in Management of Resources
 Scalability
 With growing application complexity, the number
of resources that require management increases
 E.g. the LHC Grid consists of a large number of CPUs, disks
and mass storage servers
 Web Service based applications such as Amazon’s EC2
dynamically resize compute capacity, so the number of
resources is NOT ONLY large BUT ALSO in a constant
state of flux.
 Management framework MUST cope with a large
number of resources in terms of
 Additional components required
5
Motivation:
Key Challenges in Management of Resources
 Scalability
 Performance
 Performance Important in terms of
 Initialization Cost
 Recovery from failure
 Responding to run-time events
 Performance should not suffer with
increasing resources and additional system
components
6
Motivation:
Key Challenges in Management of Resources
 Scalability
 Performance
 Fault – tolerance
 Failures are Normal
 Resources may fail, and so may components of
the management framework.
 Framework MUST recover from failure
 Recovery period must not increase drastically
with increasing number of resources.
7
Motivation:
Key Challenges in Management of Resources
 Scalability
 Performance
 Fault – tolerance
 Interoperability
 Resources exist on different platforms
 Written in different languages
 Typically managed using system specific protocols and
hence not INTEROPERABLE
 Investigate the use of a service-oriented architecture for
management
8
Motivation:
Key Challenges in Management of Resources
 Scalability
 Performance
 Fault – tolerance
 Interoperability
 Generality
 The management framework must be generic.
 Should manage any type of resource (hardware / software)
9
Motivation:
Key Challenges in Management of Resources
 Scalability
 Performance
 Fault – tolerance
 Interoperability
 Generality
 Usability
 Simple to deploy. Built in terms of simple components
(services)
 Autonomous operation (as much as possible)
10
Summary:
Research Issues
 Building a Fault-tolerant Management
Architecture
 Making the architecture SCALABLE
 Investigate the overhead in terms of
 Additional Components Required
 Typical response time
 Recovery Time
 Interoperable and Extensible Management
framework
 General and usable system
11
Architecture:
Assumptions
 For our discussion: RESOURCE = SERVICE
 External state required by resources is small
 Can be captured using a message-based
interaction interface
 Resource may maintain internal state and can
bootstrap itself
 E.g.: Shopping Cart (Internal State = Contents of cart,
External state = Location of DB and access credentials
where contents persist)
 Assume a scalable, fault-tolerant database
store that acts as a registry to maintain
resource specific external state
12
Definition:
The process of Management
 E.g. Consider Printer as a
resource that will be managed
 Generates Events (E.g. Low
Ink Level, Out of paper)
 Resource Manager appropriately
handles these events as defined
by Resource specific policies
 Job Queue Management is
NOT the responsibility of our
management architecture.
 We assume the existence of a
separate Job Management
Process, which itself can be
managed by our framework (E.g.
make sure it is always up and
running)
[Figure: a printer exposes Job Queue, Ink Level and Feeder Tray management and emits LowInkLevelEvent / OutOfPaperEvent; a Resource Specific Manager manages the printer, while a separate Job Queue Management process (itself manageable by the framework) manages the job queue]
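To make the event-handling flow on this slide concrete, the following is a minimal Java sketch of a resource-specific management thread for the printer example. It is an illustration only: PrinterEvent, Policy and the method names are hypothetical stand-ins, not the framework's actual classes.

// Hypothetical sketch of a resource-specific management thread for the
// printer example: events arrive from the managee and are handled
// according to a user-defined, resource-specific policy.
public class PrinterManagerThread implements Runnable {

    // Names below are illustrative, not part of the real framework API.
    interface PrinterEvent { String name(); }          // e.g. "LowInkLevel", "OutOfPaper"
    interface Policy { void handle(PrinterEvent e); }  // resource-specific policy

    private final java.util.concurrent.BlockingQueue<PrinterEvent> events =
            new java.util.concurrent.LinkedBlockingQueue<PrinterEvent>();
    private final Policy policy;
    private volatile boolean running = true;

    public PrinterManagerThread(Policy policy) { this.policy = policy; }

    /** Called by the service adapter when the printer emits an event. */
    public void deliver(PrinterEvent e) { events.offer(e); }

    public void run() {
        while (running) {
            try {
                PrinterEvent e = events.take();   // block until an event arrives
                policy.handle(e);                 // e.g. notify the user, re-order ink, ...
            } catch (InterruptedException ie) {
                running = false;
            }
        }
    }
}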
13
Management Architecture built in
terms of
 Hierarchical Bootstrap System – Robust itself by Replication
 Resources in different domains can be managed with separate policies for each domain
 Periodically spawns a System Health Check that ensures components are up and
running
 Registry for metadata (distributed database) – Robust by standard database
techniques and our system itself for Service Interfaces
 Stores resource specific information (User-defined configuration / policies, external
state required to properly manage a resource)
 Messaging Nodes form a scalable messaging substrate
 Message delivery between managers and managees
 Provides Secure delivery of messages
 Managers – Active stateless agents that manage resources.
 Resource specific management thread performs actual management
 Multi-threaded to improve scalability
 Managees – what you are managing (the Resource / Service to manage), which our
system makes robust
 There is NO assumption that Managed system uses Messaging nodes
 Wrapped by a Service Adapter which provides a Web Service interface
14
Architecture:
Scalability: Hierarchical distribution
[Figure: bootstrap hierarchy with ROOT at the top, US and EUROPE below it, and leaf domains such as FSU, CGL and CARDIFF (e.g. /ROOT/EUROPE/CARDIFF)]
Active Bootstrap Nodes (e.g. /ROOT/EUROPE/CARDIFF)
•Responsible for maintaining a working set of management components in the domain
•Always the leaf nodes in the hierarchy
Passive Bootstrap Nodes
•Only ensure that all child bootstrap nodes are always up and running (spawn them if not present)
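As an illustration of the active/passive split above, here is a minimal Java sketch of a passive bootstrap node's check loop. The BootstrapNode abstraction and its isAlive()/spawn() calls are assumptions for the sketch, not the framework's real API.

// Hypothetical sketch: a passive bootstrap node only keeps its child
// bootstrap nodes alive; active (leaf) nodes maintain the management
// components (registry, messaging node, managers) in their domain.
import java.util.List;

public class PassiveBootstrapLoop {

    interface BootstrapNode {          // illustrative abstraction
        String path();                 // e.g. "/ROOT/EUROPE/CARDIFF"
        boolean isAlive();             // heartbeat / ping check
        void spawn();                  // (re)start the child bootstrap process
    }

    private final List<BootstrapNode> children;
    private final long checkIntervalMillis;

    public PassiveBootstrapLoop(List<BootstrapNode> children, long checkIntervalMillis) {
        this.children = children;
        this.checkIntervalMillis = checkIntervalMillis;
    }

    public void run() throws InterruptedException {
        while (true) {
            for (BootstrapNode child : children) {
                if (!child.isAlive()) {
                    child.spawn();     // spawn if not present / not running
                }
            }
            Thread.sleep(checkIntervalMillis);   // periodic health check
        }
    }
}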
15
Architecture:
Conceptual Idea (Internals)
[Figure: several Resources to Manage (Managees), each wrapped by a Service Adapter, and one or more Managers all connect to a Messaging Node for sending and receiving messages. The user writes the system configuration to the Registry; Manager processes periodically check the Registry for available resources to manage and read/write resource-specific external state from/to it. The Bootstrap Service periodically spawns a System Health Check that ensures the other components (Messaging Node, Registry, Managers) are always up and running.]
16
Architecture:
User Component
 Characteristics are determined by the user.
 Events generated by the Managees are handled by
the manager
 Event processing is determined via WS-Policy constructs
 E.g. Wait for user’s decision on handling specific conditions
 If no event handler has been specified, execute the default policy,
etc…
 Note Managers will set up services if registry indicates
that is appropriate; so writing information to registry
can be used to start up a set of services
 Generic and application-specific policies are written to the
registry, where they will be picked up by a manager process (as sketched below)
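A minimal Java sketch of the behaviour described above, in which a manager process periodically scans the registry and starts a resource-specific management thread for each newly registered resource. Registry, ResourceEntry and the method names are hypothetical placeholders, not the actual framework interfaces.

// Hypothetical sketch: a manager process periodically scans the registry
// for resources that need management and spawns a management thread per
// resource, so writing configuration to the registry starts services.
import java.util.HashMap;
import java.util.Map;

public class ManagerPollLoop {

    interface Registry {                                   // illustrative
        java.util.List<ResourceEntry> listUnmanagedResources();
    }
    interface ResourceEntry {                              // illustrative
        String resourceId();
        Runnable createManagementThread();                 // built from stored config / policies
    }

    private final Registry registry;
    private final Map<String, Thread> active = new HashMap<String, Thread>();

    public ManagerPollLoop(Registry registry) { this.registry = registry; }

    public void pollOnce() {
        for (ResourceEntry entry : registry.listUnmanagedResources()) {
            if (!active.containsKey(entry.resourceId())) {
                // Writing a resource's configuration to the registry is enough
                // to get it managed: the manager picks it up here.
                Thread t = new Thread(entry.createManagementThread(),
                                      "mgmt-" + entry.resourceId());
                t.start();
                active.put(entry.resourceId(), t);
            }
        }
    }
}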
17
Issues:
Issues in the distributed system
 Consistency – Management architecture must provide
consistency of management process
 Examples of inconsistent behaviour
 Two or more managers managing the same resource
 Old messages / requests reaching after new requests
 Multiple copies of resources existing at the same time leading to
inconsistent system state
 Use a Registry generated monotonically increasing Unique Instance
ID to distinguish between new and old instances
 Security – Provide secure communication between communicating
parties (e.g. Manager <-> Resource)
 Leverage NaradaBrokering’s Topic Creation and discovery and
Security framework to provide:
 Provenance, Secure Discovery, Authorization & Authentication
 Prevent unauthorized users from accessing the resource
 Prevent malicious users from modifying message (Thus message
interactions are secure when passing through insecure intermediaries)
18
Consistency
 All new entities get a unique InstanceID (IID), generated through the registry. All
interactions from that entity use this ID as a prefix. We assume this to be a
monotonically increasing number (E.g. an NTP timestamp)
 Thus, when a Manager Thread starts, it is assigned a unique IID by the registry. A
newer instance has a higher IID, so OLD manager threads can be distinguished
 The resource ALWAYS treats the manager thread with the highest known IID as
the current one
 Thus requests from manager thread A are considered obsolete IF IID(A) <
IID(B), where B is another known manager thread
 This IID may also be used as a prefix for interactions
 Message ID = [X:N] where X is the registry assigned IID and N is a monotonically
increasing number generated by an instance with IID as X
 Service Adapter stores the last known MessageID allowing it to differentiate
between duplicates AND obsolete messages
 A similar principle applies to auto-instantiated resources. IF a resource is considered
DEAD (unreachable) and a new resource is spawned, the new resource has
the same ResourceID (which allows us to identify the type of resource) but a
higher InstanceID.
 Later, if the old resource joins back in, it can be distinguished by checking its
InstanceID and taking appropriate action
 E.g. IF IID(ResourceInstance_1) < IID(ResourceInstance_2), then ResourceInstance_1
was previously deemed OBSOLETE, hence ResourceInstance_2 exists, SO instruct
ResourceInstance_1 to silently shut down
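The staleness rules above fit in a few lines of code. The following is a minimal Java sketch of the check a Service Adapter might perform; the class and field names are hypothetical, not the thesis implementation.

// Hypothetical sketch of the consistency check described above:
// a Service Adapter remembers the highest manager IID and message
// counter it has seen, and rejects obsolete or duplicate messages.
public class ConsistencyGuard {

    private long highestManagerIid = Long.MIN_VALUE;  // registry-assigned, monotonically increasing
    private long lastMessageCounter = Long.MIN_VALUE; // N in MessageID = [X:N]

    /**
     * @param managerIid     X: registry-assigned InstanceID of the sending manager thread
     * @param messageCounter N: per-instance, monotonically increasing counter
     * @return true if the message should be processed, false if it is obsolete or a duplicate
     */
    public synchronized boolean accept(long managerIid, long messageCounter) {
        if (managerIid < highestManagerIid) {
            return false;                         // request from an OLD manager instance
        }
        if (managerIid > highestManagerIid) {
            highestManagerIid = managerIid;       // a newer manager instance took over
            lastMessageCounter = Long.MIN_VALUE;  // its counter restarts
        }
        if (messageCounter <= lastMessageCounter) {
            return false;                         // duplicate or out-of-order message
        }
        lastMessageCounter = messageCounter;
        return true;
    }
}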
19
Interoperability:
Service-Oriented Management
 Existing systems
 Platforms, languages
 SNMP, JMX, WMI
 Quite successful, but not interoperable
 Move to Web Service based service-
oriented architecture that uses
 XML based interactions that facilitate
implementation in different languages,
running on different platforms and over
multiple transports.
20
Interoperability:
WS – Distributed Management vs. WS-Management
 Both systems provide Web service model for building application
management solutions
 WSDM – MOWS (Mgmt. Of Web Services) & MUWS (Mgmt. Using Web
Services)
 MUWS: unifying layer on top of existing management specifications such as
SNMP, OMI (Object Management Interface)
 MOWS: Provide support for management framework such as deployment,
auditing, metering, SLA management, life cycle management etc…
 WS Management identifies core set of specification and usage
requirements
 E.g. CREATE, DELETE, GET / PUT, ENUMERATE + any number of resource
specific management methods (if applicable)
 Selected WS-Management primarily due to its simplicity and also to
leverage WS-Eventing implementation recently added for Web Service
support in NaradaBrokering
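To illustrate the core operation set listed above, here is a hedged Java sketch of the kind of interface a WS-Management-style service adapter could expose. The interface and method signatures are illustrative assumptions, not the WS-Management specification's wire protocol or the actual implementation.

// Hypothetical sketch of the WS-Management core operations
// (Create, Delete, Get, Put, Enumerate) plus room for
// resource-specific management methods, as listed on this slide.
import java.util.Iterator;
import java.util.Map;

public interface ManageableResource {

    /** GET: return the current resource representation (e.g. an XML fragment). */
    String get();

    /** PUT: replace the resource representation; returns the stored version. */
    String put(String representation);

    /** CREATE: create a new resource instance, returning its endpoint reference / ID. */
    String create(String initialRepresentation);

    /** DELETE: remove the resource instance. */
    void delete();

    /** ENUMERATE: iterate over a collection associated with the resource (e.g. links). */
    Iterator<String> enumerate(String filter);

    /** Any number of resource-specific management methods, if applicable. */
    String invoke(String customAction, Map<String, String> parameters);
}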
21
Implemented:
 WS – Specifications
 WS – Management (June 2005) parts (WS – Transfer [Sep 2004], WS
– Enumeration [Sep 2004] and WS – Eventing) (could use WS-DM)
 WS – Eventing (Leveraged from the WS – Eventing capability
implemented in OMII)
 WS – Addressing [Aug 2004] and SOAP v 1.2 used (needed for WS-
Management)
 Used XmlBeans 2.0.0 for manipulating XML in custom
container.
 Security Framework for NB
 Provides secure end-to-end delivery of messages
 Broker Discovery mechanism
 May be leveraged to discover Messaging Nodes
 Currently implemented using JDK 1.4.2 (expect better
performance moving to JDK 1.5 or better)
22
Performance Evaluation
Measurement Model – Test Setup
 Multithreaded manager process - Spawns a Resource
specific management thread (A single manager can
manage multiple different types of resources)
 Limit on maximum resources that can be managed
 Limited by Response time obtained
 Limited by maximum threads per JVM possible (memory
constraints)
[Test setup: sets of managees run on GF1–GF4 and connect via TCP to a Messaging Node and Registry hosted on GF5; the Manager process(es) run on GF6 and a Benchmark Accumulator runs on GF7]
23
Performance Evaluation
Results
 Response time increases
with increasing number of
resources
 Response time is
RESOURCE-DEPENDENT
and the shown times are
typical
 MAY involve 1 or more
Registry accesses, which will
increase the overall response
time
 Increases rapidly as no.
of resources > (150 –
200)
24
Performance Evaluation
Results: Increasing Managers on Same machine
25
Performance Evaluation
Results: Increasing Managers on different machines
26
How to scale locally
[Figure: within the /ROOT/US/CGL leaf domain, the single messaging node is replaced by a cluster of messaging nodes (Node-1, Node-2, …, Node-N) to scale locally]
27
Performance Evaluation
Research Question:
How much infrastructure is required to manage N resources ?
 N = Number of resources to manage
 Z = Max. no. of entities connected to a single messaging node
 D = Max. no of resources managed by a single manager process
 K = min. no. of registry database instances required to provide fault-
tolerance
 Assume every leaf domain has 1 messaging node. Hence we have N/Z
leaf domains.
 Further, No. of managers required per leaf domain is Z/D
 Thus total components at lowest level
= Components Per domain * No. of Domains
= (K + 1 Messaging Node + 1 Bootstrap Node + Z/D Managers) * N/Z
= (2 + K + Z/D) * N/Z
 Note: Other passive bootstrap nodes are not counted here since (No. of
Passive Nodes) << N
 E.g.: If the registry is shared, then K = 1 for each domain, representing the
registry’s service interface
28
Performance Evaluation
Research Question:
How much infrastructure is required to manage N resources ?
 Thus for N resources we require an additional (2 + K + Z/D) * N/Z resources
 Thus the percentage of additional infrastructure is
= { [(2 + K + Z/D) * N/Z] / [N + (2 + K + Z/D) * N/Z] } * 100 %
= [1 – 1/(1 + 2/Z + K/Z + 1/D)] * 100 %
 A Few Cases
 Typical values of D and Z are 200 and 800 and assuming K = 4, then Additional
Infrastructure
= [1 – 1/(1 + 2/800 + 4/800 + 1/200)] * 100 %
≈ 1.23 %
 When Registry is shared and there is one registry interface per domain, K = 1, then
Additional Infrastructure
= [1 – 1/(1 + 2/800 + 1/800 + 1/200)] * 100 %
≈ 0.87 %
 If the resource manager can only manage 1 resource at any given instance, then D =
1, then Additional Infrastructure
= [1 – 1/(1 + 2/800 + 4/800 + 1/1)] * 100 %
≈ 50%
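A small Java sketch that evaluates the closed-form expression above for the three cases on this slide (Z = 800, D = 200 or 1, K = 4 or 1); it simply reproduces the ≈ 1.23 %, ≈ 0.87 % and ≈ 50 % figures.

// Evaluate the additional-infrastructure percentage:
//   overhead(Z, D, K) = [1 - 1/(1 + 2/Z + K/Z + 1/D)] * 100 %
public class InfrastructureOverhead {

    static double overheadPercent(double z, double d, double k) {
        return (1.0 - 1.0 / (1.0 + 2.0 / z + k / z + 1.0 / d)) * 100.0;
    }

    public static void main(String[] args) {
        // Typical values: Z = 800 entities per messaging node, D = 200 resources per manager
        System.out.printf("K=4, registry per domain:      %.2f%%%n", overheadPercent(800, 200, 4)); // ~1.23 %
        System.out.printf("K=1, shared registry:          %.2f%%%n", overheadPercent(800, 200, 1)); // ~0.87 %
        System.out.printf("D=1, one resource per manager: %.2f%%%n", overheadPercent(800, 1, 4));   // ~50 %
    }
}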
29
Performance Evaluation
XML Processing Overhead
 XML Processing overhead is measured as the total
marshalling and un-marshalling time required.
 In case of Broker Management interactions, typical
processing time (includes validation against schema)
≈ 5 ms
 Broker Management operations are invoked only during
initialization and recovery from failure
 Reading Broker State using a GET operation involves the 5 ms
overhead and is invoked periodically (E.g. every 1 minute,
depending on policy)
 Further, for most operations dealing with changing broker
state, the actual operation processing time is >> 5 ms, and hence
the XML overhead of 5 ms is acceptable.
30
Prototype:
Managing Grid Messaging Middleware
 We illustrate the architecture by managing the distributed messaging
middleware: NaradaBrokering
 This example is motivated by the presence of a large number of
dynamic peers (brokers) that need configuration and deployment in
specific topologies
 Runtime metrics provide dynamic hints on improving routing, which
leads to redeployment of the messaging system (possibly) using a
different configuration and topology
 Brokers can use (dynamically) optimized protocols (UDP vs. TCP vs. Parallel
TCP) and go through firewalls, but there is no good way to make these choices
dynamically
 Broker Service Adapter
 Note NB illustrates an electronic entity that didn’t start off with an
administrative Service interface
 So we add a wrapper over the basic NB BrokerNode object that provides
a WS-Management front-end
 Allows CREATION, CONFIGURATION and MODIFICATION of broker
topologies
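A hedged Java sketch of what such a Broker Service Adapter might look like: a thin wrapper that translates WS-Management-style operations into calls on the underlying broker. The BrokerControl interface is an illustrative stand-in for the real NaradaBrokering BrokerNode API, whose actual method signatures are not shown in this talk.

// Hypothetical sketch of a Broker Service Adapter: a thin wrapper over a
// broker instance that exposes WS-Management-style Get / Put / Delete and
// resource-specific operations for creating and modifying broker links.
import java.util.Iterator;
import java.util.Map;

public class BrokerServiceAdapter {

    /** Illustrative stand-in for the underlying NB BrokerNode operations. */
    interface BrokerControl {
        void applyConfiguration(String xmlConfig);
        void createLink(String remoteBrokerAddress);
        void deleteLink(String remoteBrokerAddress);
        void shutdown();
        String describeAsXml();
        Iterator<String> listLinks();
    }

    private final BrokerControl broker;

    public BrokerServiceAdapter(BrokerControl broker) { this.broker = broker; }

    /** GET: return the broker's current configuration / state as XML. */
    public String get() { return broker.describeAsXml(); }

    /** PUT: apply a new configuration to the broker. */
    public String put(String xmlConfig) {
        broker.applyConfiguration(xmlConfig);
        return broker.describeAsXml();
    }

    /** DELETE: shut the broker down. */
    public void delete() { broker.shutdown(); }

    /** ENUMERATE: list the broker's links. */
    public Iterator<String> enumerate() { return broker.listLinks(); }

    /** Resource-specific operations, e.g. CreateLink / DeleteLink. */
    public String invoke(String action, Map<String, String> params) {
        if ("CreateLink".equals(action)) {
            broker.createLink(params.get("address"));
        } else if ("DeleteLink".equals(action)) {
            broker.deleteLink(params.get("address"));
        }
        return broker.describeAsXml();
    }
}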
31
Prototype:
Use Case
 Use case I: Audio – Video Conferencing:
the GlobalMMCS project, which uses
NaradaBrokering as an event delivery
substrate
 Consider a scenario where there is a
teacher and 10,000 students. One would
typically form a TREE shaped hierarchy of
brokers
 One broker can support up to 400
simultaneous video clients and 1500
simultaneous audio clients with acceptable
quality*. So one would need (10000 / 400
≈ 25 broker nodes).
 May also require additional links between
brokers for fault-tolerance purposes
 Use Case II: Sensor Network
 Both use cases need high QoS streams of
messages
 Use Case III: Management System itself
* “Scalable Service Oriented Architecture for
Audio/Video Conferencing”, Ahmet Uyar, Ph.D.
Thesis, May 2005
[Figure: a single participant sends audio / video into a tree of brokers, each of which serves up to 400 participants]
32
Failure Handling
WS - Policy
 Policy defines resource failure handling
 Implemented 2 policies (based on WS-
Policy)
 Require User Input: No action is taken
on failure. User interaction is
required to handle it
 Auto Instantiate: Attempts auto-instantiation of the
failed broker.
 The location of a fork process is required.
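A minimal Java sketch of how the two failure-handling policies above could be dispatched by a manager. The enum values mirror the slide; ForkLauncher and the console notification are hypothetical helpers, not the actual WS-Policy-driven implementation.

// Hypothetical sketch: dispatch on the failure-handling policy stored in
// the registry for a resource. "Require User Input" leaves the failure to a
// human; "Auto Instantiate" respawns the failed broker from a fork process.
public class FailurePolicyHandler {

    enum FailurePolicy { REQUIRE_USER_INPUT, AUTO_INSTANTIATE }

    /** Illustrative helper that starts a broker process at a known location. */
    interface ForkLauncher { void spawnBroker(String forkProcessLocation); }

    private final ForkLauncher launcher;

    public FailurePolicyHandler(ForkLauncher launcher) { this.launcher = launcher; }

    public void onResourceFailure(String resourceId, FailurePolicy policy,
                                  String forkProcessLocation) {
        switch (policy) {
            case REQUIRE_USER_INPUT:
                // No automatic action: record the failure and wait for the user.
                System.out.println("Resource " + resourceId
                        + " failed; waiting for user input.");
                break;
            case AUTO_INSTANTIATE:
                // The location of a fork process is required for this policy.
                launcher.spawnBroker(forkProcessLocation);
                break;
        }
    }
}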
33
Prototype:
Costs (Individual Resources – Brokers)
Operation            Time (msec, average values)
                     Un-Initialized (First time)    Initialized (Later modifications)
Set Configuration    777                            46
Create Broker        459                            132
Create Link          175                            43
Delete Link          109                            35
Delete Broker        110                            187
34
Recovery time:

Recovery Time = T(Read State From Registry) + T(Bring Resource up to speed)
              = T(Read State) + T[SetConfig + Create Broker + CreateLink(s)]

Topology                              Resource-specific configuration entries   Recovery Time
Ring: N nodes, N links                2 Resource Objects per node               10 + (777 + 459 + 175) ≈ 1.4 sec
  (1 outgoing link per node)
Cluster: N nodes, links per broker    1 – 4 Resource Objects per node           Min: 5 + (777 + 459) ≈ 1.2 sec
  vary from 0 – 3                                                               Max: 20 + {777 + 459 + (175*1 + 43*2)} ≈ 1.5 sec

 Assuming 5 ms read time from registry per resource object
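The recovery-time rows above combine the per-operation costs from the previous slide with the assumed 5 ms registry read per resource object; the following small Java sketch recomputes the same estimates. The mapping of configuration entries to registry reads follows my reading of the table above.

// Recompute the recovery-time estimates from the table above, using the
// operation costs (msec) from the "Costs" slide and an assumed 5 ms
// registry read per resource object.
public class RecoveryTimeEstimate {

    static final int SET_CONFIG = 777, CREATE_BROKER = 459,
                     CREATE_LINK_FIRST = 175, CREATE_LINK_LATER = 43, READ_STATE = 5;

    public static void main(String[] args) {
        // Ring: 2 resource objects per node, 1 outgoing link per node
        int ring = 2 * READ_STATE + SET_CONFIG + CREATE_BROKER + CREATE_LINK_FIRST;
        // Cluster, minimum: 1 resource object, broker with no links
        int clusterMin = 1 * READ_STATE + SET_CONFIG + CREATE_BROKER;
        // Cluster, maximum: 4 resource objects, first link plus 2 later link creations
        int clusterMax = 4 * READ_STATE + SET_CONFIG + CREATE_BROKER
                         + CREATE_LINK_FIRST * 1 + CREATE_LINK_LATER * 2;

        System.out.println("Ring        ~ " + ring + " ms");        // ~1421 ms ≈ 1.4 sec
        System.out.println("Cluster min ~ " + clusterMin + " ms");  // ~1241 ms ≈ 1.2 sec
        System.out.println("Cluster max ~ " + clusterMax + " ms");  // ~1517 ms ≈ 1.5 sec
    }
}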
35
Prototype:
Observed Recovery Cost per Resource
Operation                       Average (msec)
*Spawn Process                  2362 ± 18
Read State                      8 ± 1
Restore (1 Broker + 1 Link)     1421 ± 9
Restore (1 Broker + 3 Links)    1616 ± 82
Time for Create Broker depends on the number & type of transports opened by
the broker
E.g. SSL transport requires negotiation of keys and would require more
time than simply opening a TCP connection
If brokers connect to other brokers, the destination broker MUST be ready to
accept connections, else topology recovery takes more time.
36
Management Console:
Creating Nodes and Setting Properties
37
Management Console:
Creating Links
38
Management Console:
Policies
39
Management Console:
Creating Topologies
40
Related work
 Fault-Tolerance Strategies
 Replication
 Provide transfer of control to a new or existing
backup service instance on failure
 Passive (primary / backup) OR Active
 E.g. Distributed databases, agents-based
systems
41
Related work
 Fault-Tolerance Strategies
 Replication
 Check-pointing
 Allow computation to continue from point of failure OR
for process migration
 E.g. MPI-based systems (Open MPI)
 Can be done independently (easier to do but complicates
recovery) OR co-ordinated (performance issue but
recovery is easy)
42
Related work
 Fault-Tolerance Strategies
 Replication
 Check-pointing
 Request-Retry
 Logging
 Checksum
43
Related work
 Fault-Tolerance Strategies
 Failure Detection
 Via periodic Heartbeats (E.g. Globus Heartbeat
Monitor)
 Scalability
 Hierarchical organization (E.g. DNS)
 Resource Management (Monitoring /
Scheduling)
 E.g. MonALISA, Globus GRAM
44
Conclusion
 We have presented a scalable, fault-tolerant
management framework that
 Adds acceptable cost in terms of extra resources
required (about 1%)
 Provides a general framework for management of
distributed resources
 Is compatible with existing Web Service standards
 We have applied our framework to manage
resources that have modest external state
 This assumption is important to improve scalability
of management process
45
Summary Of Contributions
 Designed and implemented a Resource Management Framework:
 Tolerant to failures in management framework as well as resource
failures by implementing resource specific policies
 Scalable - In terms of number of additional resources required to
provide fault-tolerance and performance
 Implements Web Service Management to manage resources
 Implements global management by leveraging a
scalable messaging substrate to traverse firewalls
 Detailed evaluation of the system components to show that the
proposed architecture has acceptable costs
 The architecture adds (approx.) 1% extra resources
 Implemented Prototype to illustrate management of a distributed
messaging middleware system: NaradaBrokering
46
Future Work
 Current work assumes SMALL runtime
state that needs to be maintained.
 Apply management framework and
evaluate the system when this assumption
does not hold true
 More messages / Higher sized messages
 XML processing overhead becomes significant
 Apply the framework to broader
domains
47
Publications
 On the proposed work:
 Scalable, Fault-tolerant Management in a Service Oriented Architecture
Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce
Submitted to IPDPS 2007
 Managing Grid Messaging Middleware
Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce
In Proceedings of “Challenges of Large Applications in Distributed Environments” (CLADE), pp. 83 - 91,
June 19, 2006, Paris, France
 Relevant to the proposed work:
 A Scripting based Architecture for Management of Streams and Services in Real-time Grid
Applications
Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Robert Granat
In Proceedings of the IEEE/ACM Cluster Computing and Grid 2005 Conference, CCGrid 2005, Vol. 2, pp.
710-717, Cardiff, UK
 On the Discovery of Brokers in Distributed Messaging Infrastructure
Shrideep Pallickara, Harshawardhan Gadgil, Geoffrey Fox
In Proceedings of the IEEE Cluster 2005 Conference. Boston, MA
 On the Discovery of Topics in Distributed Publish/Subscribe systems
Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil
In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing Grid 2005 Conference,
pp. 25-32, Seattle, WA (Selected as one of six Best Papers)
 A Framework for Secure End-to-End Delivery of Messages in Publish/Subscribe Systems
Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil, Geoffrey Fox, Yan Yan, Yi Huang
(To Appear) In Proceedings of “The 7th IEEE/ACM International Conference on Grid Computing” (Grid
2006), Barcelona, September 28th-29th, 2006
48
Publications:
Others
 On the Secure Creation, Organization and Discovery of Topics in Distributed
Publish/Subscribe systems
Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil
(To Appear) International Journal of High Performance Computing and Networking
(IJHPCN), 2006. Special Issue of extended versions of the 6 best papers at the
ACM/IEEE Grid 2005 Workshop
 Building Messaging Substrates for Web and Grid Applications
Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil
In special Issue on Scientific Applications of Grid Computing in Philosophical
Transactions of the Royal Society, London, Volume 363, Number 1833, pp 1757-1773,
August 2005
 Management of Real-Time Streaming Data Grid Services
Geoffrey Fox, Galip Aydin, Harshawardhan Gadgil, Shrideep Pallickara, Marlon Pierce,
and Wenjun Wu
Invited talk at Fourth International Conference on Grid and Cooperative Computing
(GCC2005), Beijing, China Nov 30-Dec 3, 2005, Lecture Notes in Computer Science,
Volume 3795, Nov 2005, Pages 3 -12
 SERVOGrid Complexity Computational Environments(CCE) Integrated
Performance Analysis
Galip Aydin, Mehmet S. Aktas, Geoffrey C. Fox, Harshawardhan Gadgil, Marlon Pierce,
Ahmet Sayar
As poster and In Proceedings of the 6th IEEE/ACM International Workshop on Grid
Computing Grid2005 Conference, pp. 256 - 261, Seattle, WA, Nov 13 - 14, 2005

More Related Content

Similar to defenseTalk.ppt

History Of Database Technology
History Of Database TechnologyHistory Of Database Technology
History Of Database TechnologyJacqueline Thomas
 
Cs556 section1
Cs556 section1Cs556 section1
Cs556 section1farshad33
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS systembenosteen
 
database introductoin optimization1-app6891.pdf
database introductoin optimization1-app6891.pdfdatabase introductoin optimization1-app6891.pdf
database introductoin optimization1-app6891.pdfparveen204931475
 
Introduction to Database
Introduction to DatabaseIntroduction to Database
Introduction to DatabaseSiti Ismail
 
Chapter 2 - Enterprise Application Integration.pdf
Chapter 2 - Enterprise Application Integration.pdfChapter 2 - Enterprise Application Integration.pdf
Chapter 2 - Enterprise Application Integration.pdfKhairul Anwar Sedek
 
Database Systems.ppt
Database Systems.pptDatabase Systems.ppt
Database Systems.pptArbazAli27
 
OPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONOPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONSUMIT KUMAR
 
Lecture 1 to 3intro to normalization in database
Lecture 1 to 3intro to  normalization in databaseLecture 1 to 3intro to  normalization in database
Lecture 1 to 3intro to normalization in databasemaqsoodahmedbscsfkhp
 
DSpace: Technical Basics
DSpace: Technical BasicsDSpace: Technical Basics
DSpace: Technical BasicsIryna Kuchma
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1Sonia Mim
 
Implementation of Agent Based Dynamic Distributed Service
Implementation of Agent Based Dynamic Distributed ServiceImplementation of Agent Based Dynamic Distributed Service
Implementation of Agent Based Dynamic Distributed ServiceCSCJournals
 
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...IJMER
 

Similar to defenseTalk.ppt (20)

Software Engineering 101
Software Engineering 101Software Engineering 101
Software Engineering 101
 
Basic of Networking
Basic of NetworkingBasic of Networking
Basic of Networking
 
History Of Database Technology
History Of Database TechnologyHistory Of Database Technology
History Of Database Technology
 
1_DBMS_Introduction.pdf
1_DBMS_Introduction.pdf1_DBMS_Introduction.pdf
1_DBMS_Introduction.pdf
 
Cs556 section1
Cs556 section1Cs556 section1
Cs556 section1
 
Bodleian Library's DAMS system
Bodleian Library's DAMS systemBodleian Library's DAMS system
Bodleian Library's DAMS system
 
database introductoin optimization1-app6891.pdf
database introductoin optimization1-app6891.pdfdatabase introductoin optimization1-app6891.pdf
database introductoin optimization1-app6891.pdf
 
Introduction to Database
Introduction to DatabaseIntroduction to Database
Introduction to Database
 
Chapter 2 - Enterprise Application Integration.pdf
Chapter 2 - Enterprise Application Integration.pdfChapter 2 - Enterprise Application Integration.pdf
Chapter 2 - Enterprise Application Integration.pdf
 
Ch12
Ch12Ch12
Ch12
 
Database Systems.ppt
Database Systems.pptDatabase Systems.ppt
Database Systems.ppt
 
OPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATIONOPEN TEXT ADMINISTRATION
OPEN TEXT ADMINISTRATION
 
Lecture 1 to 3intro to normalization in database
Lecture 1 to 3intro to  normalization in databaseLecture 1 to 3intro to  normalization in database
Lecture 1 to 3intro to normalization in database
 
DSpace: Technical Basics
DSpace: Technical BasicsDSpace: Technical Basics
DSpace: Technical Basics
 
Database Management System, Lecture-1
Database Management System, Lecture-1Database Management System, Lecture-1
Database Management System, Lecture-1
 
Implementation of Agent Based Dynamic Distributed Service
Implementation of Agent Based Dynamic Distributed ServiceImplementation of Agent Based Dynamic Distributed Service
Implementation of Agent Based Dynamic Distributed Service
 
Dbms unit01
Dbms unit01Dbms unit01
Dbms unit01
 
DISTRIBUTED SYSTEM 16M.docx
DISTRIBUTED SYSTEM 16M.docxDISTRIBUTED SYSTEM 16M.docx
DISTRIBUTED SYSTEM 16M.docx
 
Database Systems Concepts, 5th Ed
Database Systems Concepts, 5th EdDatabase Systems Concepts, 5th Ed
Database Systems Concepts, 5th Ed
 
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...
Cooperative Schedule Data Possession for Integrity Verification in Multi-Clou...
 

More from vasuhisrinivasan

2. Dispersion Understanding the effects of dispersion in optical fibers is qu...
2. Dispersion Understanding the effects of dispersion in optical fibers is qu...2. Dispersion Understanding the effects of dispersion in optical fibers is qu...
2. Dispersion Understanding the effects of dispersion in optical fibers is qu...vasuhisrinivasan
 
AntBrief123A12-6-07.pptMaxwell’s Equations & EM Waves
AntBrief123A12-6-07.pptMaxwell’s Equations & EM WavesAntBrief123A12-6-07.pptMaxwell’s Equations & EM Waves
AntBrief123A12-6-07.pptMaxwell’s Equations & EM Wavesvasuhisrinivasan
 
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...Helical.pptan antenna consisting of a conducting wire wound in the form of a ...
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...vasuhisrinivasan
 
cis595_03_IMAGE_FUNDAMENTALS.ppt
cis595_03_IMAGE_FUNDAMENTALS.pptcis595_03_IMAGE_FUNDAMENTALS.ppt
cis595_03_IMAGE_FUNDAMENTALS.pptvasuhisrinivasan
 
Radiation from an Oscillating Electric Dipole.ppt
Radiation from an Oscillating Electric Dipole.pptRadiation from an Oscillating Electric Dipole.ppt
Radiation from an Oscillating Electric Dipole.pptvasuhisrinivasan
 
Human Detection and Tracking Using Apparent Features under.pdf
Human Detection and Tracking Using Apparent Features under.pdfHuman Detection and Tracking Using Apparent Features under.pdf
Human Detection and Tracking Using Apparent Features under.pdfvasuhisrinivasan
 

More from vasuhisrinivasan (14)

2. Dispersion Understanding the effects of dispersion in optical fibers is qu...
2. Dispersion Understanding the effects of dispersion in optical fibers is qu...2. Dispersion Understanding the effects of dispersion in optical fibers is qu...
2. Dispersion Understanding the effects of dispersion in optical fibers is qu...
 
AntBrief123A12-6-07.pptMaxwell’s Equations & EM Waves
AntBrief123A12-6-07.pptMaxwell’s Equations & EM WavesAntBrief123A12-6-07.pptMaxwell’s Equations & EM Waves
AntBrief123A12-6-07.pptMaxwell’s Equations & EM Waves
 
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...Helical.pptan antenna consisting of a conducting wire wound in the form of a ...
Helical.pptan antenna consisting of a conducting wire wound in the form of a ...
 
surveillance.ppt
surveillance.pptsurveillance.ppt
surveillance.ppt
 
Aerial photo.ppt
Aerial photo.pptAerial photo.ppt
Aerial photo.ppt
 
cis595_03_IMAGE_FUNDAMENTALS.ppt
cis595_03_IMAGE_FUNDAMENTALS.pptcis595_03_IMAGE_FUNDAMENTALS.ppt
cis595_03_IMAGE_FUNDAMENTALS.ppt
 
rmsip98.ppt
rmsip98.pptrmsip98.ppt
rmsip98.ppt
 
IP_Fundamentals.ppt
IP_Fundamentals.pptIP_Fundamentals.ppt
IP_Fundamentals.ppt
 
Ch24 fiber optics.pptx
Ch24 fiber optics.pptxCh24 fiber optics.pptx
Ch24 fiber optics.pptx
 
Radiation from an Oscillating Electric Dipole.ppt
Radiation from an Oscillating Electric Dipole.pptRadiation from an Oscillating Electric Dipole.ppt
Radiation from an Oscillating Electric Dipole.ppt
 
Aperture ant.ppt
Aperture ant.pptAperture ant.ppt
Aperture ant.ppt
 
Spiral antenna.pptx
Spiral antenna.pptxSpiral antenna.pptx
Spiral antenna.pptx
 
Antennas-p-3.ppt
Antennas-p-3.pptAntennas-p-3.ppt
Antennas-p-3.ppt
 
Human Detection and Tracking Using Apparent Features under.pdf
Human Detection and Tracking Using Apparent Features under.pdfHuman Detection and Tracking Using Apparent Features under.pdf
Human Detection and Tracking Using Apparent Features under.pdf
 

Recently uploaded

UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxkalpana413121
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information SystemsAnge Felix NSANZIYERA
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network DevicesChandrakantDivate1
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxpritamlangde
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelDrAjayKumarYadav4
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptamrabdallah9
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...Amil baba
 
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...Payal Garg #K09
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessorAshwiniTodkar4
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...ppkakm
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptjigup7320
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxSCMS School of Architecture
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxMustafa Ahmed
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...ronahami
 

Recently uploaded (20)

UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information Systems
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Digital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptxDigital Communication Essentials: DPCM, DM, and ADM .pptx
Digital Communication Essentials: DPCM, DM, and ADM .pptx
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.ppt
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
NO1 Top No1 Amil Baba In Azad Kashmir, Kashmir Black Magic Specialist Expert ...
 
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...
Unsatisfied Bhabhi ℂall Girls Ahmedabad Book Esha 6378878445 Top Class ℂall G...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor8086 Microprocessor Architecture: 16-bit microprocessor
8086 Microprocessor Architecture: 16-bit microprocessor
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) ppt
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Augmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptxAugmented Reality (AR) with Augin Software.pptx
Augmented Reality (AR) with Augin Software.pptx
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 

defenseTalk.ppt

  • 1. Scalable, Fault-tolerant Management of Grid Services: Application to Messaging Middleware Harshawardhan Gadgil hgadgil@cs.indiana.edu Ph.D. Defense Exam (Practice Talk) Advisor: Prof. Geoffrey Fox
  • 2. 2 Talk Outline  Motivation  Architecture  Service-oriented Management  Performance Evaluation  Application: Managing Grid Messaging Middleware  Related Work  Conclusion
  • 3. 3 Motivation: Characteristics of today’s (GRID) Applications  Increasing Application Complexity  Applications distributed and composed of multiple resources  In future, much larger systems will be built  Components widely dispersed and disparate in nature and access  Span different administrative domains  Under differing network / security policies  Limited access to resources due to presence of firewalls, NATs etc…  Components in flux  Components (Nodes, network, processes) may fail  Services must meet  General QoS and Life-cycle features  (User defined) Application specific criteria  Need to “manage” services to provide these capabilities
  • 4. 4 Motivation: Key Challenges in Management of Resources  Scalability  With Growing Complexity of application, number of resources that require management increases  E.g. LHC Grid consists of a large number of CPUs, disks and mass storage servers  Web Service based applications such as Amazon’s EC2 dynamically resizes compute capacity, so number of resources is NOT ONLY large BUT ALSO in a constant state of flux.  Management framework MUST cope with large number of resources in terms of  Additional components Required
  • 5. 5 Motivation: Key Challenges in Management of Resources  Scalability  Performance  Performance Important in terms of  Initialization Cost  Recovery from failure  Responding to run-time events  Performance should not suffer with increasing resources and additional system components
  • 6. 6 Motivation: Key Challenges in Management of Resources  Scalability  Performance  Fault – tolerance  Failures are Normal  Resources may fail, but so also components of the management framework.  Framework MUST recover from failure  Recovery period must not increase drastically with increasing number of resources.
  • 7. 7 Motivation: Key Challenges in Management of Resources  Scalability  Performance  Fault – tolerance  Interoperability  Resources exist on different platforms  Written in different languages  Typically managed using system specific protocols and hence not INTEROPERABLE  Investigate the use of a service-oriented architecture for management
  • 8. 8 Motivation: Key Challenges in Management of Resources  Scalability  Performance  Fault – tolerance  Interoperability  Generality  Management framework must be a generic framework.  Should manage any type of resource (hardware / software)
  • 9. 9 Motivation: Key Challenges in Management of Resources  Scalability  Performance  Fault – tolerance  Interoperability  Generality  Usability  Simple to deploy. Built in terms of simple components (services)  Autonomous operation (as much as possible)
  • 10. 10 Summary: Research Issues  Building a Fault-tolerant Management Architecture  Making the architecture SCALABLE  Investigate the overhead in terms of  Additional Components Required  Typical response time  Recovery Time  Interoperable and Extensible Management framework  General and usable system
  • 11. 11 Architecture: Assumptions  For our discussion: RESOURCE = SERVICE  External state required by resources is small  Can be captured using a message-based interaction interface  Resource may maintain internal state and can bootstrap itself  E.g.: Shopping Cart (Internal State = Contents of cart, External state = Location of DB and access credentials where contents persist)  Assume a scalable, fault-tolerant database store that acts as a registry to maintain resource specific external state
  • 12. 12 Definition: The process of Management  E.g. Consider Printer as a resource that will be managed  Generates Events (E.g. Low Ink Level, Out of paper)  Resource Manager appropriately handles these events as defined by Resource specific policies  Job Queue Management is NOT the responsibility of our management architecture.  We imagine existence of a separate Job Management Process which itself can be managed by our framework (E.g. Make sure it is always up and running) Job Queue Ink Level Feeder Tray Management LowInkLevelEvent OutOfPaperEvent ` Resource Specific Manager Job Queue Management MANAGE MANAGE
  • 13. 13 Management Architecture built in terms of  Hierarchical Bootstrap System – Robust itself by Replication  Resources in different domains can be managed with separate policies for each domain  Periodically spawns a System Health Check that ensures components are up and running  Registry for metadata (distributed database) – Robust by standard database techniques and our system itself for Service Interfaces  Stores resource specific information (User-defined configuration / policies, external state required to properly manage a resource)  Messaging Nodes form a scalable messaging substrate  Message delivery between managers and managees  Provides Secure delivery of messages  Managers – Active stateless agents that manage resources.  Resource specific management thread performs actual management  Multi-threaded to improve scalability  Managees – what you are managing (Resource / Service to manage) – Our system makes robust  There is NO assumption that Managed system uses Messaging nodes  Wrapped by a Service Adapter which provides a Web Service interface
  • 14. 14 Architecture: Scalability: Hierarchical distribution ROOT US EUROPE FSU CARDIFF CGL Active Bootstrap Nodes /ROOT/EUROPE/CARDIFF •Responsible for maintaining a working set of management components in the domain •Always the leaf nodes in the hierarchy Passive Bootstrap Nodes •Only ensure that all child bootstrap nodes are always up and running … Spawns if not present and ensure up and running …
  • 15. 15 Architecture: Conceptual Idea (Internals) Resource to Manage (Managee) Service Adapter Bootstrap Service System Health Check Manager Resource to Manage (Managee) Service Adapter Resource to Manage (Managee) Service Adapter Manager Messaging Node Registry Manager Manager ... ... Connect to Messaging Node for sending and receiving messages User writes system configuration to registry Manager processes periodically checks available resources to manage. Also Read/Write resource specific external state from/to registry Always ensure up and running Always ensure up and running Periodically Spawn
  • 16. 16 Architecture: User Component  Characteristics are determined by the user.  Events generated by the Managees are handled by the manager  Event processing is determined by via WS-Policy constructs  E.g. Wait for user’s decision on handling specific conditions  The event handler has been specified, so execute default policy, etc…  Note Managers will set up services if registry indicates that is appropriate; so writing information to registry can be used to start up a set of services  Generic and Application specific policies are written to the registry where it will be picked up by a manager process.
  • 17. 17 Issues: Issues in the distributed system  Consistency – Management architecture must provide consistency of management process  Examples of inconsistent behaviour  Two or more managers managing the same resource  Old messages / requests reaching after new requests  Multiple copies of resources existing at the same time leading to inconsistent system state  Use a Registry generated monotonically increasing Unique Instance ID to distinguish between new and old instances  Security – Provide secure communication between communicating parties (e.g. Manager <-> Resource)  Leverage NaradaBrokering’s Topic Creation and discovery and Security framework to provide:  Provenance, Secure Discovery, Authorization & Authentication  Prevent unauthorized users from accessing the resource  Prevent malicious users from modifying message (Thus message interactions are secure when passing through insecure intermediaries)
  • 18. 18 Consistency  All new entities get a unique InstanceID (IID), generated thru registry. All interaction from that entity use this id as a prefix. We assume this to be a monotonically increasing number (E.g. an NTP timestamp)  Thus If a Manager Thread starts, it is assigned a unique ID by the registry. A newer instance has a higher id, thus OLD manager threads can be distinguished  The resource ALWAYS assumes the manager thread to be current with the highest known ID  Thus requests from manager thread A are considered obsolete IF IID(A) < IID(B)  This IID may also be used as a prefix for interactions  Message ID = [X:N] where X is the registry assigned IID and N is a monotonically increasing number generated by an instance with IID as X  Service Adapter stores the last known MessageID allowing it to differentiate between duplicates AND obsolete messages  Similar principle for auto-instantiated resources. IF a resource is considered DEAD (Unreachable) and a new resource is spawned this new resource has the same ResourceID (allows us to identify the type of resource) but a higher InstanceID.  Later if the old resource joins back in, it can be distinguished by checking its InstanceID and appropriately taking action  E.g. IF IID(ResourceInstance_1) <IID(ResourceInstance_2), then ResourceInstance_1 was previously deemed OBSOLETE hence ResourceInstance_2 exists SO instruct ResourceInstance_1 to silently shutdown
  • 19. 19 Interoperability: Service-Oriented Management  Existing systems  Platforms, languages  SNMP, JMX, WMI  Quite successful, but not interoperable  Move to Web Service based service- oriented architecture that uses  XML based interactions that facilitate implementation in different languages, running on different platforms and over multiple transports.
  • 20. 20 Interoperability: WS – Distributed Management vs. WS-Management  Both systems provide Web service model for building application management solutions  WSDM – MOWS (Mgmt. Of Web Services) & MUWS (Mgmt. Using Web Services)  MUWS: unifying layer on top of existing management specifications such as SNMP, OMI (Object Management Interface)  MOWS: Provide support for management framework such as deployment, auditing, metering, SLA management, life cycle management etc…  WS Management identifies core set of specification and usage requirements  E.g. CREATE, DELETE, GET / PUT, ENUMERATE + any number of resource specific management methods (if applicable)  Selected WS-Management primarily due to its simplicity and also to leverage WS-Eventing implementation recently added for Web Service support in NaradaBrokering
  • 21. 21 Implemented:  WS – Specifications  WS – Management (June 2005) parts (WS – Transfer [Sep 2004], WS – Enumeration [Sep 2004] and WS – Eventing) (could use WS-DM)  WS – Eventing (Leveraged from the WS – Eventing capability implemented in OMII)  WS – Addressing [Aug 2004] and SOAP v 1.2 used (needed for WS- Management)  Used XmlBeans 2.0.0 for manipulating XML in custom container.  Security Framework for NB  Provides secure end-to-end delivery of messages  Broker Discovery mechanism  May be leveraged to discover Messaging Nodes  Currently implemented using JDK 1.4.2 (expect better performance moving to JDK 1.5 or better)
  • 22. 22 Performance Evaluation Measurement Model – Test Setup  Multithreaded manager process - Spawns a Resource specific management thread (A single manager can manage multiple different types of resources)  Limit on maximum resources that can be managed  Limited by Response time obtained  Limited by maximum threads per JVM possible (memory constraints) Messaging Node (GF5) GF1 GF3 GF4 GF2 Registry (GF5) Manager Process(es) (GF6) Set of Managees Benchmark Accumulator (GF7) TCP Connection
  • 23. 23 Performance Evaluation Results  Response time increases with increasing number of resources  Response time is RESOURCE-DEPENDENT and the shown times are typical  MAY involve 1 or more Registry access which will increase overall response time  Increases rapidly as no. of resources > (150 – 200)
  • 25. 25 Performance Evaluation Results: Increasing Managers on different machines
  • 26. 26 How to scale locally ROOT US Node-1 CGL Node-2 Node-N … … Cluster of Messaging Nodes
  • 27. 27 Performance Evaluation Research Question: How much infrastructure is required to manage N resources ?  N = Number of resources to manage  Z = Max. no. of entities connected to a single messaging node  D = Max. no of resources managed by a single manager process  K = min. no. of registry database instances required to provide fault- tolerance  Assume every leaf domain has 1 messaging node. Hence we have N/Z leaf domains.  Further, No. of managers required per leaf domain is Z/D  Thus total components at lowest level = Components Per domain * No. of Domains = (K + 1 Messaging Node + 1 Bootstrap Node + Z/D Managers) * N/Z = (2 + K + Z/D) * N/Z  Note: Other passive bootstrap nodes are not counted here since (No. of Passive Nodes) << N  E.g.: If it’s a shared registry, then the value of K = 1 for each domain which represents the service interface
  • 28. 28 Performance Evaluation Research Question: How much infrastructure is required to manage N resources ?  Thus for N resources we require an additional (2 + K + Z/D) * N/Z resources  Thus percentage of additional infrastructure is = [(2 + K + Z/D)*N/Z] * 100 % N + (2 + K + Z/D)*N/Z = [1 – 1/(1+2/Z+ K/Z + 1/D)] * 100 %  A Few Cases  Typical values of D and Z are 200 and 800 and assuming K = 4, then Additional Infrastructure = [1 – 1/(1 + 2/800 + 4/800 + 1/200)] * 100 % ≈ 1.23 %  When Registry is shared and there is one registry interface per domain, K = 1, then Additional Infrastructure = [1 – 1/(1 + 2/800 + 1/800 + 1/200)] * 100 % ≈ 0.87 %  If the resource manager can only manage 1 resource at any given instance, then D = 1, then Additional Infrastructure = [1 – 1/(1 + 2/800 + 4/800 + 1/1)] * 100 % ≈ 50%
  • 29. 29 Performance Evaluation XML Processing Overhead  XML Processing overhead is measured as the total marshalling and un-marshalling time required.  In case of Broker Management interactions, typical processing time (includes validation against schema) ≈ 5 ms  Broker Management operations invoked only during initialization and failure from recovery  Reading Broker State using a GET operation involves 5ms overhead and is invoked periodically (E.g. every 1 minute, depending on policy)  Further, for most operation dealing with changing broker state, actual operation processing time >> 5ms and hence the XML overhead of 5 ms is acceptable.
  • 30. 30 Prototype: Managing Grid Messaging Middleware  We illustrate the architecture by managing the distributed messaging middleware: NaradaBrokering  This example motivated by the presence of large number of dynamic peers (brokers) that need configuration and deployment in specific topologies  Runtime metrics provide dynamic hints on improving routing which leads to redeployment of messaging system (possibly) using a different configuration and topology  Can use (dynamically) optimized protocols (UDP v TCP v Parallel TCP) and go through firewalls but no good way to make choices dynamically  Broker Service Adapter  Note NB illustrates an electronic entity that didn’t start off with an administrative Service interface  So add wrapper over the basic NB BrokerNode object that provides WS – Management front-end  Allows CREATION, CONFIGURATION and MODIFICATION of broker topologies
  • 31. 31 Prototype: Use Case
   Use Case I: Audio–Video Conferencing: the GlobalMMCS project, which uses NaradaBrokering as an event delivery substrate
   Consider a scenario with one teacher and 10,000 students. One would typically form a TREE-shaped hierarchy of brokers
   One broker can support up to 400 simultaneous video clients and 1500 simultaneous audio clients with acceptable quality*. So one would need 10000 / 400 ≈ 25 broker nodes (see the sizing sketch below).
   May also require additional links between brokers for fault-tolerance purposes
   Use Case II: Sensor Network
   Both use cases need high-QoS streams of messages
   Use Case III: The Management System itself
   [Diagram: a single participant sends audio/video, which the broker tree delivers to groups of 400 participants]
   * "Scalable Service Oriented Architecture for Audio/Video Conferencing", Ahmet Uyar, Ph.D. Thesis, May 2005
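As a quick sanity check of the sizing argument, the sketch below (a hypothetical helper, not part of the prototype) computes the number of leaf brokers needed for a given audience size and per-broker client capacity:

public class BrokerTreeSizing {

    // Leaf brokers needed when each broker can serve clientsPerBroker participants
    static int leafBrokersNeeded(int participants, int clientsPerBroker) {
        return (int) Math.ceil((double) participants / clientsPerBroker);
    }

    public static void main(String[] args) {
        // 10,000 students at ~400 simultaneous video clients per broker -> 25 leaf brokers
        System.out.println(leafBrokersNeeded(10_000, 400));
    }
}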
  • 32. 32 Failure Handling WS-Policy
   A policy defines how resource failure is handled
   Implemented 2 policies (based on WS-Policy), sketched below:
   Require User Input: No automatic action is taken on failure; user interaction is required to handle it
   Auto Instantiate: Attempts automatic re-instantiation of the failed broker. The location of a fork process is required.
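A minimal sketch of how these two policies might be dispatched is shown below. The enum and method names are hypothetical, and the fork-process address is a placeholder, not a real endpoint.

public class FailurePolicyExample {

    enum FailurePolicy { REQUIRE_USER_INPUT, AUTO_INSTANTIATE }

    static void onResourceFailure(FailurePolicy policy, String forkProcessLocation) {
        switch (policy) {
            case REQUIRE_USER_INPUT:
                // No automatic action: flag the failure and wait for an administrator
                System.out.println("Resource failed; awaiting user decision.");
                break;
            case AUTO_INSTANTIATE:
                // Ask the fork (spawner) process at the configured location to
                // re-instantiate the failed broker, then restore its saved state
                System.out.println("Requesting respawn via " + forkProcessLocation);
                break;
        }
    }

    public static void main(String[] args) {
        onResourceFailure(FailurePolicy.AUTO_INSTANTIATE, "fork-host.example.org:5050");
    }
}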
  • 33. 33 Prototype: Costs (Individual Resources – Brokers)  Average times in msec
   Operation           Un-Initialized (First time)   Initialized (Later modifications)
   Set Configuration   777                           46
   Create Broker       459                           132
   Create Link         175                           43
   Delete Link         109                           35
   Delete Broker       110                           187
  • 34. 34 Recovery Time
   Recovery Time = T(Read State From Registry) + T(Bring Resource up to speed) = T(Read State) + T[SetConfig + Create Broker + CreateLink(s)]
   Ring topology: N nodes, N links (1 outgoing link per node); 2 resource objects per node; Recovery Time = 10 + (777 + 459 + 175) ms ≈ 1.4 sec
   Cluster topology: N nodes, links per broker vary from 0 – 3; 1 – 4 resource objects per node; Recovery Time: Min = 5 + (777 + 459) ms ≈ 1.2 sec, Max = 20 + {777 + 459 + (175*1 + 43*2)} ms ≈ 1.5 sec
   Assuming a 5 ms read time from the registry per resource object
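These estimates can be reproduced with a small helper. The sketch below is hypothetical; following the formula in the table, it assumes the first link costs the un-initialized Create Link time (175 ms), each subsequent link the initialized time (43 ms), plus 5 ms of registry read time per resource object.

public class RecoveryEstimate {

    // Average per-operation costs (msec) taken from the measured first-time values above
    static final double READ_STATE_PER_OBJECT = 5, SET_CONFIG = 777, CREATE_BROKER = 459,
                        CREATE_FIRST_LINK = 175, CREATE_EXTRA_LINK = 43;

    // Estimated recovery time (msec) for one broker with the given state size and links
    static double recoveryMillis(int resourceObjects, int links) {
        double linkCost = links == 0 ? 0
                : CREATE_FIRST_LINK + (links - 1) * CREATE_EXTRA_LINK;
        return resourceObjects * READ_STATE_PER_OBJECT + SET_CONFIG + CREATE_BROKER + linkCost;
    }

    public static void main(String[] args) {
        System.out.println(recoveryMillis(2, 1)); // ring node:    ~1421 ms (~1.4 s)
        System.out.println(recoveryMillis(1, 0)); // cluster, min: ~1241 ms (~1.2 s)
        System.out.println(recoveryMillis(4, 3)); // cluster, max: ~1517 ms (~1.5 s)
    }
}

Running it yields roughly 1421, 1241 and 1517 ms, matching the ≈1.4 s, ≈1.2 s and ≈1.5 s entries above.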
  • 35. 35 Prototype: Observed Recovery Cost per Resource
   Operation                       Average (msec)
   *Spawn Process                  2362 ± 18
   Read State                      8 ± 1
   Restore (1 Broker + 1 Link)     1421 ± 9
   Restore (1 Broker + 3 Links)    1616 ± 82
   The time for Create Broker depends on the number and type of transports opened by the broker; e.g. an SSL transport requires key negotiation and therefore takes longer than simply opening a TCP connection.
   If brokers connect to other brokers, the destination broker MUST be ready to accept connections, otherwise topology recovery takes more time.
  • 36. 36 Management Console: Creating Nodes and Setting Properties
  • 40. 40 Related work  Fault-Tolerance Strategies  Replication  Provides transfer of control to a new or existing backup service instance on failure  Passive (primary / backup) OR Active  E.g. Distributed databases, agent-based systems
  • 41. 41 Related work  Fault-Tolerance Strategies  Replication  Check-pointing  Allows computation to continue from the point of failure OR enables process migration  E.g. MPI-based systems (Open MPI)  Can be done independently (easier to do but complicates recovery) OR coordinated (a performance issue, but recovery is easy)
  • 42. 42 Related work  Fault-Tolerance Strategies  Replication  Check-pointing  Request-Retry  Logging  Checksum
  • 43. 43 Related work  Fault-Tolerance Strategies  Failure Detection  Via periodic Heartbeats (E.g. Globus Heartbeat Monitor)  Scalability  Hierarchical organization (E.g. DNS)  Resource Management (Monitoring / Scheduling)  E.g. MonALISA, Globus GRAM
  • 44. 44 Conclusion  We have presented a scalable, fault-tolerant management framework that  Adds acceptable cost in terms of the extra resources required (about 1%)  Provides a general framework for management of distributed resources  Is compatible with existing Web Service standards  We have applied our framework to manage resources that have modest external state  This assumption is important for improving the scalability of the management process
  • 45. 45 Summary Of Contributions  Designed and implemented a Resource Management Framework that is  Tolerant to failures in the management framework as well as resource failures, by implementing resource-specific policies  Scalable, in terms of the number of additional resources required to provide fault-tolerance, and in performance  Uses WS-Management to manage resources  Implements global management by leveraging a scalable messaging substrate to traverse firewalls  Detailed evaluation of the system components shows that the proposed architecture has acceptable costs  The architecture adds approx. 1% extra resources  Implemented a prototype to illustrate management of a distributed messaging middleware system: NaradaBrokering
  • 46. 46 Future Work  Current work assumes a SMALL runtime state that needs to be maintained.  Apply the management framework and evaluate the system when this assumption does not hold  More messages / larger messages  XML processing overhead becomes significant  Apply the framework to broader domains
  • 47. 47 Publications
   On the proposed work:
   Scalable, Fault-tolerant Management in a Service Oriented Architecture. Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. Submitted to IPDPS 2007.
   Managing Grid Messaging Middleware. Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce. In Proceedings of "Challenges of Large Applications in Distributed Environments" (CLADE), pp. 83-91, June 19, 2006, Paris, France.
   Relevant to the proposed work:
   A Scripting based Architecture for Management of Streams and Services in Real-time Grid Applications. Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Robert Granat. In Proceedings of the IEEE/ACM Cluster Computing and Grid 2005 Conference (CCGrid 2005), Vol. 2, pp. 710-717, Cardiff, UK.
   On the Discovery of Brokers in Distributed Messaging Infrastructure. Shrideep Pallickara, Harshawardhan Gadgil, Geoffrey Fox. In Proceedings of the IEEE Cluster 2005 Conference, Boston, MA.
   On the Discovery of Topics in Distributed Publish/Subscribe systems. Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil. In Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), pp. 25-32, Seattle, WA. (Selected as one of six Best Papers)
   A Framework for Secure End-to-End Delivery of Messages in Publish/Subscribe Systems. Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil, Geoffrey Fox, Yan Yan, Yi Huang. (To Appear) In Proceedings of "The 7th IEEE/ACM International Conference on Grid Computing" (Grid 2006), Barcelona, September 28-29, 2006.
  • 48. 48 Publications: Others
   On the Secure Creation, Organization and Discovery of Topics in Distributed Publish/Subscribe systems. Shrideep Pallickara, Geoffrey Fox, Harshawardhan Gadgil. (To Appear) International Journal of High Performance Computing and Networking (IJHPCN), 2006. Special issue of extended versions of the 6 best papers at the ACM/IEEE Grid 2005 Workshop.
   Building Messaging Substrates for Web and Grid Applications. Geoffrey Fox, Shrideep Pallickara, Marlon Pierce, Harshawardhan Gadgil. In the special issue on Scientific Applications of Grid Computing, Philosophical Transactions of the Royal Society, London, Volume 363, Number 1833, pp. 1757-1773, August 2005.
   Management of Real-Time Streaming Data Grid Services. Geoffrey Fox, Galip Aydin, Harshawardhan Gadgil, Shrideep Pallickara, Marlon Pierce, and Wenjun Wu. Invited talk at the Fourth International Conference on Grid and Cooperative Computing (GCC2005), Beijing, China, Nov 30 - Dec 3, 2005. Lecture Notes in Computer Science, Volume 3795, Nov 2005, pp. 3-12.
   SERVOGrid Complexity Computational Environments (CCE) Integrated Performance Analysis. Galip Aydin, Mehmet S. Aktas, Geoffrey C. Fox, Harshawardhan Gadgil, Marlon Pierce, Ahmet Sayar. As poster and in Proceedings of the 6th IEEE/ACM International Workshop on Grid Computing (Grid 2005), pp. 256-261, Seattle, WA, Nov 13-14, 2005.