My day job is internet services manager at a medium-sized UK company that builds proprietary solutions involving web services and XML technologies. My primary interests initially revolved around XSLT and XML technologies. I started up EXSLT along with Dave Pawson and Jeni Tennison, and it now enjoys widespread adoption and implementation. I have done quite a bit of technical reviewing and co-authoring for the now defunct WROX and, as with many of their authors, I have since moved over to their benefactors: I am now writing a tome for APRESS, provisionally titled Ant Gems, which is due to go to print in March next year. I am a typical application developer, having refreshed my skill set as the times have changed.
This talk outlines a journey that started with a desire to answer XSLT list members’ questions and led to a personal rediscovery of the genetic algorithm. Over the years of answering people’s questions on the XSLT list, I noticed that most of them were simple mapping transformations, e.g. ‘I have this source XML and want to transform it into this target XML’. If the user knows the source and target XML, what automated methods could be brought to bear? I won’t be talking about REST, not because I don’t think it important or that it can’t be considered valid; I see both REST and SOAP based web services existing quite happily together. xml technologies and some
Specifications are being finalized to handle complex messaging: orchestration, coordination, and routing
Due to the lack of widely adopted WS orchestration, composition and coordination standards, many are finding the MVC architecture approach a good match. Instead of exposing a large variety of web services, expose one controller web service, which places the importance on the message body: simple interactions, complex messages. List MVC examples
The SOAP response could be styled by XSLT. This technique lies between typical Web Services and REST.
I personally use Systinet WASP server; it takes care of everything a developer would rather not deal with, especially security. The folks who made WASP have a deep heritage with CORBA. It instantly solves some problems, e.g. stateful web services. ‘Anchor’ is usually used with negative connotations, e.g. the boat anchor antipattern. With many specifications still being worked out, having an anchor in a storm is useful.
By integrating Amazon Web Services, Amazon.com Research Services for Microsoft Office System will provide Microsoft Office System users with convenient and seamless access to Amazon.com from within Microsoft productivity applications via the Research Task Pane. Users will be able to access Amazon.com information and make purchases without launching a browser or leaving their document, e-mail message or presentation. For example, a customer reading a bibliography in a Word document could easily click on a book title and purchase it from within the Research Task Pane without having to leave the Word document. Alternatively, a user will be able to add a footnote, bibliography entry and even cover art for books without needing to manually enter the information into a document. The Research Task Pane, a feature in the Microsoft Office 2003 Edition desktop applications (Word, Excel, the Outlook messaging and collaboration client, the PowerPoint presentation graphics program and Access) and in Microsoft Office System products OneNote (TM) note-taking program, Publisher and Visio drawing and diagramming software, uses industry-standard XML to enable users to retrieve and navigate relevant internal or external Web-based information, all from within Office programs. "Amazon.com is breaking new ground in its use of XML-enabled Web Services that connect data from disparate systems, allow greater access to content, and create a more valuable experience for Web users," said Gytis Barzdukas, director of Office Product Management at Microsoft. "By using the advancements of the Microsoft Office System, Microsoft and Amazon.com are transforming the desktop into a dynamic interface for Office customers everywhere." "We are excited to help make this new service available to our customers," said Jeff Barr, Web services technical program manager at Amazon.com. 
"This Microsoft Office System solution adds significant convenience for our Microsoft users and Amazon.com customers in finding and discovering products. We look forward to receiving feedback from users and to adding more features in the future."
Google (file:wsdl, file:wsil). Look for inspection.wsil. Refer to xmethods or well known UDDI registries. The importance of a human understandable description of a web service should not be underestimated. What if the human description is in a different language? Is the interface enough for automatic composition methods?
With unlimited processing power and network bandwidth, random search is fine. Intelligent software agents must have knowledge of the problem domain, either gained via learning (neural network) or through experts embedding knowledge. As you will find out, GA does not need any specialist knowledge to solve a problem, and is quicker than random/linear search of large problem domains.
This approach is not specific to any problem domain; it can be applied to anything. Different partially effective gene combinations, or “schemata”, are searched in a parallel manner. Analogies are good in computing, but can be dangerous and can cloud over some of the more subtle aspects. It may be that what actually happens in nature is completely irrelevant; it just so happens that for some groups of problems this technique is potentially useful. Just because an analogy ‘feels right’ does not mean it explains ‘how’ something works. Analogies are good for illustration purposes, not for explanation purposes.
where M(H, t) is the number of strings in population 't' with the schema 'H'; f(H) is the average fitness of the strings with the schema 'H'; F is the average fitness of the entire population; p1 is the probability of the schema being destroyed by crossover; p2 is the probability of the schema being destroyed by mutation. There are many variations.
There are primary and secondary operations in the genetic operation
Fitness usually encompasses domain specific factors. Primary operation: reproduction / recombination. Secondary operation: mutation / editing / encapsulation.
I was reviewing a book by WROX, called Beginning Databases….and since I was xml through and through I was forced to re-examine the differences between hierarchical data models with relational, etc… . Somehow this investigation led me to S BOX structures in LISP…..which re-introduced me to the genetic algorithm…and the idea of partial schemata being used to solve problems. The xslt guru David Carlisle probably didn’t know it, but him and his lot at XSLT UK caused me to investigate the fp approach using XSLT
LISP symbolic expressions contain lists or atoms and use Polish notation. What LISP is good at: programs and data have the same form; a LISP program is its own parse tree; the EVAL function gives LISP an easy way to chain execution; LISP facilitates the programming of hierarchical structures. LISP is not a special GA language; in my opinion, working with hierarchical computer programs is more expressive.
Most programming languages internally convert to a parse tree, xml and especially xslt is akin to LISP in that we have direct access to the ‘tree’. Since XSLT is xml, we can easily manipulate computer programs as if it were data, this is important in the genetic operations. Since XSLT is the language for transforming XML, we could use it to transform XSLT programs. In practice there is a performance hit to this approach. In any event, this talk focuses on the strategy, and not the precise implementation method.
The reasons why I use Ant will be revealed later on; for one, it was a natural choice as this talk is the final chapter in the previously mentioned book. Ant is a natural for dealing with lots of files and, as we will be generating lots of XSLT populations and applying transformations and various processes to them, it was a no-brainer. I have been using SAXON from the beginning; it is the only XSLT processor that implements XSLT 2.0.
This type of equivalency problem was chosen to make the prototype’s output easy to validate. In addition, I am looking for logical equivalency and am not worried about whitespace at the moment.
* Comes from processing specific xslt individual with source.xml
500 xslt documents Going to generate 51 generations
The generator can be supplied with parameters to define node depth and repeats, to supply a random seed, and to weight the odds for certain elements or attributes to be generated. It uses a DTD to define allowable elements. As you can see, the example template really does nothing useful; it is typical that starting populations consistently have a low fitness for their ultimate purpose.
I wanted to reduce complexity in my early experiments, so I avoided what I call early taxonomisation.
We indirectly measure the fitness of an XSLT program by checking its output against a desired target xml. The transformation is run with each xslt individual in the population. Best fitness for our purposes is defined as an exact match between result and target xml. Fitness does not have to be the result of a single metric; we could have multiple tests for the fitness of an individual. Source and target xml were supplied as part of the problem formulation.
Note that we have added a element, this is to ensure that XSLT that returned nothing, at least returned a valid xml document with one root node. There were situations where logically the fitness metric was not sufficient for certain special cases, in actuality having a number of source and target xml solved this issue.
IBM’s is based on some novel thinking, though I have not used it ( commercial ) Microsoft’s is fine and fast
The same individual can be chosen for multiple operations, any number of times. Better fitness individuals have a larger slice of the pie, so they will be selected more often. There can be some additional fitness penalties; for example, in generation 0 many xslt files may be invalid and not process at all.
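The ‘slice of the pie’ selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the file names and the 1 / (1 + f) inversion (needed because in these experiments a fitness of 0 is best) are hypothetical and not taken from the prototype.

```python
import random

def roulette_select(population, fitness, k, rng=None):
    """Pick k individuals with replacement, weighted by adjusted fitness.

    `fitness` maps each individual to a standardized fitness where 0 is
    best, so it is inverted with 1 / (1 + f) before building the wheel.
    """
    rng = rng or random.Random()
    weights = [1.0 / (1.0 + fitness[ind]) for ind in population]
    # random.choices samples with replacement, so the same individual
    # can be picked any number of times.
    return rng.choices(population, weights=weights, k=k)

# Hypothetical individuals and fitness values, for illustration only.
population = ["a.xsl", "b.xsl", "c.xsl"]
fitness = {"a.xsl": 0, "b.xsl": 4, "c.xsl": 9}
picked = roulette_select(population, fitness, k=1000, rng=random.Random(42))
counts = {name: picked.count(name) for name in population}
```

An individual that fails to process at all could simply be given a very large fitness value, which shrinks its slice of the wheel towards zero.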
Raw fitness is a metric in terms of the problem, for example if you are trying to optimise some business process that sells products. The number of products sold could be the fitness ranking ( more the better ). Fitness could be calculated over a series of values and event outcomes, e.g. we could have multiple source and target xmls and the overall ranking of an individual would be its ranking
From the selected population an individual is selected to be perfectly reproduced into the new generation
Normally creates 2 offspring, though in nature this is not the case.
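A rough sketch of subtree crossover on two XML trees, producing two offspring. It assumes Python's standard ElementTree rather than the Ant/XSLT machinery used in the prototype, and the function name and sample documents are made up for illustration.

```python
import copy
import random
import xml.etree.ElementTree as ET

def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen subtree between copies of two parents."""
    child_a, child_b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    # Candidate crossover points: (element, index of one of its children).
    points_a = [(p, i) for p in child_a.iter() for i in range(len(p))]
    points_b = [(p, i) for p in child_b.iter() for i in range(len(p))]
    if not points_a or not points_b:
        return child_a, child_b  # nothing below the root to swap
    (pa, ia), (pb, ib) = rng.choice(points_a), rng.choice(points_b)
    pa[ia], pb[ib] = pb[ib], pa[ia]
    return child_a, child_b

# Hypothetical parents, for illustration only.
mom = ET.fromstring("<a><b/><c/></a>")
dad = ET.fromstring("<x><y><z/></y></x>")
kid_1, kid_2 = crossover(mom, dad, random.Random(1))
```

The parents are deep-copied first, so the same individual can take part in further operations unchanged.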
Secondary operations tend to speed up convergence towards a solution, though if used too much they will prevent convergence from ever occurring.
Pick a point and randomly mutate. Asexual. In xslt this must run the XML generator again to obtain a nodeset to augment; a form of crossover.
A random node is selected and its arguments are reorganized. Since ordering in xml is rarely important, this operation has been omitted from our process. Asexual.
If a function has no side effects, is not context dependent and has only constant atoms as arguments, the editing operation will evaluate that function and replace it with a value. should always return the same amount if the source xml remains the same, so editing would resolve this and replace the xsl instruction with a 1.
Identify useful subtrees by searching high fitness individuals for common subtrees. The effect of encapsulation is that the selected subtree is no longer subject to the potentially disruptive effects of crossover.
Variety in a population drops quickly after generation 0, because GA focuses on marginally better fitness. To improve genetic diversity apply decimation, a set of rules which removes very poor fitness individuals. The example shows a 1 node XSLT, which is indeed very poor for solving our problem. An empty stylesheet is no use to us.
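One possible reading of decimation as a filter, sketched in Python. The rules shown (drop stylesheets that are not well-formed, drop empty ones, drop anything worse than a cut-off) and all names and sample data are assumptions for illustration, not the rule set used in the prototype.

```python
import xml.etree.ElementTree as ET

def decimate(population, fitness, worst_allowed):
    """Return the individuals that survive decimation.

    population: dict of name -> stylesheet source text
    fitness: dict of name -> standardized fitness (0 is best)
    """
    survivors = {}
    for name, source in population.items():
        try:
            root = ET.fromstring(source)
        except ET.ParseError:
            continue  # not well-formed, so it cannot be run at all
        if len(root) == 0:
            continue  # empty stylesheet: no templates to evolve
        if fitness[name] > worst_allowed:
            continue  # very poor fitness
        survivors[name] = source
    return survivors

# Hypothetical population, for illustration only.
XSL = 'xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"'
population = {
    "empty.xsl": f"<xsl:stylesheet {XSL}></xsl:stylesheet>",
    "good.xsl": f'<xsl:stylesheet {XSL}><xsl:template match="/"><a/></xsl:template></xsl:stylesheet>',
    "poor.xsl": f'<xsl:stylesheet {XSL}><xsl:template match="b"/></xsl:stylesheet>',
    "broken.xsl": "<xsl:stylesheet>",
}
fitness = {"empty.xsl": 4, "good.xsl": 3, "poor.xsl": 9, "broken.xsl": 9}
survivors = decimate(population, fitness, worst_allowed=5)
```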
There are situations where convergence around a single version never occurs
Compiling xslt templates
It’s hard to apply genetic operations to languages that do not have any discreteness, such as xml has with angle brackets demarcating each instruction. This is why s-boxes and the functional approach were the AI choice, because it was easy to
Harvesting program is found at www.semantic-web.co.uk wsil solved UDDI/WSDL umbrella
SOAP 1.1 would have these HTTP headers:
Content-Type: text/xml
SOAPAction: "http://example.com/ticker"
A SOAP 1.2 message would have the following:
Content-Type: application/soap+xml; action="http://example.com/ticker"
Moving all of the metadata into the one place where it should be is also a good thing.
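The difference between the two versions can be made concrete with a small helper. This is a sketch only: the function name is hypothetical and the action URL is the example one from the notes above.

```python
def soap_headers(version, action):
    """Return the HTTP headers that carry the SOAP action for a version."""
    if version == "1.1":
        # SOAP 1.1: the action travels in its own SOAPAction header.
        return {"Content-Type": "text/xml", "SOAPAction": f'"{action}"'}
    if version == "1.2":
        # SOAP 1.2: the action is a parameter of the media type itself.
        return {"Content-Type": f'application/soap+xml; action="{action}"'}
    raise ValueError(f"unsupported SOAP version: {version}")

old = soap_headers("1.1", "http://example.com/ticker")
new = soap_headers("1.2", "http://example.com/ticker")
```

In the 1.2 case there is a single header to inspect, which is the ‘one place’ the note refers to.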
top level element defines namespaces used contains a service referencedNamespace, location of wsdl, UDDI specific stuff and may contain other elements, known as extensibility elements
Shows how we can use with both UDDI and WSDL Link element imports more wsil service definitions 2 conventions of usage; place inspection.wsil in root web directory of web server or under current dir of the webservice itself with the root level wsil containing links to these encapsulated wsil docs. avoid the 2nd convention of using a meta tag and use a RDDL doc to describe xml schema exists
Notice extension mechanism Very easy to extend, any description or link element can have extension element
Higher order orchestration standards are striving to become established. Supporting standards for SOA should stabilize by Q2 2004, with heavy commercial uptake for Q4 2004. XML, XSLT, and XPATH are successful. XML Schema, RELAX NG and DTD are the primary forms of schema languages. UDDI is struggling to make an impact with developers. There are some key differences, though, between SOA and CORBA/DCOM/RMI that developers and architects are getting confused by. We are possibly occupying that no man’s land between white box reuse and true black box components.
Does a car build itself based on a set of criteria ? Do we expect it ? Nano technology ……. Allowing problem domain experts to formulate problems assists in direct requirements capture Will a functional approach be the true path to black box reuse ? In a world of unlimited processing, who cares if a computer program is elegantly constructed ? In a world of unlimited bandwidth who cares if we use XML as the preferred over the wire format ? Successful programmatic methods are useful because they assist in modeling the problem. If that model is then used to generate a million line program…..focus on model-led development
Implementing the Genetic Algorithm in XSLT: PoC - Presentation Transcript
Proof of Concept: SOA Application Composition using the Genetic Algorithm Jim Fuller http://www.ruminate.co.uk http://www.slgchorus.com
Introduction
Technical Director / Internet Services Manager for Stuart Lawrence Group companies
on-IDLE ltd sponsored the 1st XSLT conference in the world, XSLT UK 2001, along with Dave Pawson
co-founder of the EXSLT effort, along with Dave Pawson, Jeni Tennison, Uche Ogbuji, et al.
Technical reviewer and author for now defunct WROX, on books dealing with XML, XSLT and web services
Lecture Overview
How we use WS today
XSLT and S-expressions
Genetic Algorithm refresher
Early Genetic Experiments with XSLT
Application composition using Genetic Algorithm
Conclusions
How we use WS in today's applications
Indirectly consume web services via WSDL / UDDI and subsequent generation of stub code
Direct Consumption of SOAP via manual crafting of HTTP Request headers + SOAP envelope
Primary use cases: Integration and Interoperability
Emerging use cases: orchestration, higher level business processes, and automated application composition
MVC type architectures are popular
[Figure: tiered architecture diagram. Tiers: Client, Presentation, Business, Integration, Resource. Other labels: Model, View, Controller; Data Repository, XML Binding, Persistence; External web services, Internal web services]
WS MVC with the Browser
[Figure: diagram with the labels Internet Explorer Client, HTTP GET, HTTP POST REQUEST, HTTP RESPONSE, Controller (EventHandler, SOAPEventHandler), Internal web services, External web services]
Model: The Model receives events from the Controller and updates itself, sending data which gets transformed by our view components.
View: IE web service client side processing, XSLT templates, CSS, Global.xml, Global.xsl
SOA Anchor
Stability via web service server : BEA Weblogic, IBM Websphere, Systinet WASP, .NET, ColdFusionMX
versioning control of web services
Easy to deploy same web service through multiple transports
Smooth out learning curve for many of the underlying XML technologies ( SAML )
security integration with underlying PKI
Instant solution to some problems
Deploy existing code as web service, no need for ‘special’ web service code embedded in your own code
Bazaar not opened yet
Currently developers ask how can *I* use them in *my* applications.
Web services live behind the firewall and solve integration problems; extraprise.
Google, Amazon and Microsoft are all examples of monolithic web services.
Many deployed web services are highly specific to a certain problem domain.
Who will bind a specific public web service with their precious application ? (Amazon in research pane).
The world of ‘millions of web services’
The question is not ‘how will a developer find a web service?’ but how will a machine find and use the right web service ?
How will the developer/machine know it’s the right one ? That its stable, correct version, and it can be trusted…
The promise of SOA is real time application composition generating applications or components, based on a set of general evolving criteria
Automatic application composition methods
One approach, not linked to any problem domain is to use the Genetic Algorithm…though there are obvious constraints using these methods
Random search of the problem domain
AI / intelligent Software agent methods
Genetic Algorithm Refresher
The Genetic Algorithm ( GA ) is a model of the evolution of a population of artificial individuals.
Each individual is a chromosome which contains discrete units of information; in computers this can be a string, binary numbers, etc… .
With each generation the best fitness individuals are selected for genetic operations to create a new generation
The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem
Abridged Genetic Algorithm
The Fundamental Theorem of Genetic Algorithms
M(H, t) : number of individuals in population 't' with the schema 'H'.
f(H) : average fitness of the individuals with the schema 'H'.
F : average fitness of the entire population.
p1 : probability of the schema being destroyed by crossover.
p2 : probability of the schema being destroyed by mutation.
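In its commonly quoted form, and using the symbols defined above (with p_1 and p_2 standing for p1 and p2), the theorem reads roughly as follows. This is the usual textbook statement, assumed here rather than taken from the slide:

```latex
M(H, t+1) \;\ge\; M(H, t)\,\frac{f(H)}{F}\,\bigl(1 - p_1 - p_2\bigr)
```

Read it as: a schema whose average fitness f(H) is above the population average F is expected to appear in more individuals in the next generation, provided crossover and mutation do not destroy it too often.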
GA operations
Reproduction : An individual is perfectly replicated to a new population
Crossover ( Recombination ) : Parental material is recombined to create offspring to join new population
Mutation : random changes
Permutation : reordering
Editing : evaluation to a terminal
Encapsulation : single indivisible function
Decimation : removal of individuals
Genetic Programming Process
Step 0 . Create a random initial population of individuals
Step 1 . Evaluate the fitness of each individual
Step 2 . Select individuals according to their fitness, which will participate in generating offspring (moms+dads)
Step 3 . Apply primary and secondary genetic operations to generate new offspring population
Step 4 . Repeat the steps 1,2,3, to generate X number of generations
Step 5 . choose best fit individual
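Steps 0 to 5 can be sketched as a toy loop in Python. To keep it short it evolves plain strings towards a target string, not XSLT, and it uses simple truncation selection (the better half become parents) in place of fitness-proportionate selection; the function name, parameters and rates are all illustrative assumptions.

```python
import random

def run_ga(target, pop_size=60, generations=40, seed=0):
    rng = random.Random(seed)
    alphabet = sorted(set(target))

    def fitness(ind):
        # Standardized fitness: mismatches against the target, 0 is best.
        return sum(1 for a, b in zip(ind, target) if a != b)

    # Step 0: random initial population.
    pop = ["".join(rng.choice(alphabet) for _ in target) for _ in range(pop_size)]
    for _ in range(generations):
        # Step 1: evaluate fitness. Step 2: select the better half as parents.
        ranked = sorted(pop, key=fitness)
        parents = ranked[: pop_size // 2]
        # Step 3: reproduction (keep the best), then crossover and mutation.
        offspring = [ranked[0]]
        while len(offspring) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, len(target))
            child = mom[:cut] + dad[cut:]
            if rng.random() < 0.1:
                i = rng.randrange(len(child))
                child = child[:i] + rng.choice(alphabet) + child[i + 1:]
            offspring.append(child)
        pop = offspring  # Step 4: repeat with the new generation.
    # Step 5: choose the best-fit individual of the last generation.
    best = min(pop, key=fitness)
    return best, fitness(best)

best, score = run_ga("abcdabcd")
```

Because the best individual is always copied into the next generation, the best fitness never gets worse from one generation to the next, although a run is not guaranteed to reach 0.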
Symbolic expressions and XSLT
XSLT List questions….I originally wanted to solve ‘I want to transform source xml to target xml using XSLT’. Could use generic templates or some other automated process.
Vestigial lisp memories of s expressions are similar to xslt / xml: data and programming in one
XSLT guru David Carlisle presence at XSLT UK 2001 opened my eyes to functional programming
My work with EXSLT defined the limitations of XSLT…which led me to build frameworks to implement complex MVC architectures
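Simplest Lisp Example
(+ (* 2 3) 4) evaluates to 10, and the symbolic expression looks like:
[Figure: expression tree with + at the root, whose children are the subtree * (with leaves 2 and 3) and the leaf 4]
Hierarchical computer programs are more expressive than manipulating linear strings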
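The same expression can be written as nested data and evaluated in Python, which shows the ‘program is its own parse tree’ idea. The tuple encoding and the tiny evaluator are illustrative assumptions, not part of the talk's prototype.

```python
import operator

# Only the two operators used in the example are supported.
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(expr):
    """Evaluate a prefix expression written as nested tuples."""
    if not isinstance(expr, tuple):
        return expr  # an atom evaluates to itself
    op, *args = expr
    values = [evaluate(arg) for arg in args]
    result = values[0]
    for value in values[1:]:
        result = OPS[op](result, value)
    return result

program = ("+", ("*", 2, 3), 4)  # (+ (* 2 3) 4)
answer = evaluate(program)
```

Since `program` is ordinary data, a subtree such as `program[1]` can be cut out or swapped in exactly the way the genetic operations require.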
XSLT programs are also general hierarchical computer programs
[Figure: XSLT tree containing the nodes <xsl:stylesheet/>, <xsl:template/>, <c/> and <d/>]
There are some differences, e.g. there are a variety of node types within XML
Problem definition
Create a GA process that will discover an XSLT program which, given a source.xml, generates a target.xml
Prototype uses ASF ANT to control the whole process
Michael Kay’s excellent SAXON xslt processor; XSLT 2.0 simplified the situation by removing the need to deal with RTFs and node-set usage
Initially create a simple problem, e.g. that of transforming a source xml into a copy of itself
Source XML
<a>
<b>
<c>
<d></d>
</c>
</b>
</a>
Target XML
<a>
<b>
<c>
<d></d>
</c>
</b>
</a>
Early Genetic Experiment
Step 0 . Randomly generate initial population of xslt documents
Step 1 . evaluate fitness via xml diff of target.xml to result.xml
Step 2 . select individuals according to their fitness which can be used by step 3
Step 3 . Apply primary and secondary genetic operations to generate new offspring population from selected individuals
Step 4 . Repeat steps 1,2,3, to generate X number of generations
Step 5 . choose best fit individual of last generation
Objective: Generate an xslt program that transforms source xml into result xml which is equivalent to target xml
Terminal Set: <a/> <b/> <c/> <d/>
Function Set: Subset of xslt instructions
Raw fitness: Node count on xmldiff patch file difference between result xml and target xml
Standardized fitness: Same as raw fitness, approaching 0 is better fitness
Fitness Cases: One fitness case
Parameters: M=500, G=51
Step 0. Generate Initial Population
Used IBM xml generator: com.ibm.XMLGenerator.XMLGenerator to generate a population of xslt documents.
<?xml version='1.0'?>
<!-- Created by IBM XML Generator
numberLevels=10, maxRepeats=3, Random seed=1060890913224
fixedOdds=1, impliedOdds=4, defaultOdds=4
maxIdRefs=3, maxEntities=3, maxNMTokens=3
isExplicitRoot=true, root element name is 'xsl:stylesheet'
Step 1: Evaluate Fitness
[Figure: flow diagram with the labels XSLT generation, xslt, Source.xml, transformation, result.xml, Target.xml, xml diff, evaluate fitness]
Each individual is ranked by testing the xslt program against a source xml
Step 1. evaluate fitness (cont)
Could have chosen multiple source and target xml to use in fitness assessment
Output of transformation (result.xml) is xmldiff’ed with target xml
I used an extremely simple xml diff tool that just output xml patch
Converted Diff patch file into a number, which is the number of nodes contained in the patch file
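As a stand-in for the ‘diff, then count nodes in the patch’ step, here is a rough Python sketch that compares two XML trees directly and counts the nodes that differ. It is not the xml diff tool used in the prototype; the comparison rules and function names are assumptions chosen to keep the example small, and text and whitespace are deliberately ignored.

```python
import xml.etree.ElementTree as ET

def count_nodes(elem):
    """Number of elements in the subtree rooted at elem."""
    return sum(1 for _ in elem.iter())

def tree_distance(result, target):
    """Count nodes that would have to be added, removed or renamed."""
    if result.tag != target.tag:
        # Treat a renamed node as replacing the whole subtree.
        return count_nodes(result) + count_nodes(target)
    r_kids, t_kids = list(result), list(target)
    distance = sum(tree_distance(r, t) for r, t in zip(r_kids, t_kids))
    # Unpaired children on either side count in full.
    for extra in r_kids[len(t_kids):] + t_kids[len(r_kids):]:
        distance += count_nodes(extra)
    return distance

def raw_fitness(result_xml, target_xml):
    """Raw fitness of a result document; 0 means an exact structural match."""
    return tree_distance(ET.fromstring(result_xml), ET.fromstring(target_xml))

target = "<a><b><c><d></d></c></b></a>"
exact = raw_fitness(target, target)
missing_d = raw_fitness("<a><b><c/></b></a>", target)
root_only = raw_fitness("<a/>", target)
```

With this measure a perfect individual scores 0 and every missing or extra node adds 1, which matches the ‘approaching 0 is better’ convention used for standardized fitness.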
Step 3. Secondary Genetic Operations: encapsulation
Identify useful subtrees and encapsulate by defining new function
[Figure: ‘selected XSLT’ becomes ‘XSLT’ via ‘define new function’]
Step 3. Secondary Genetic Operations: decimation
Identify very poor fitness individuals and remove from population
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> </xsl:stylesheet>
<xsl:stylesheet/>
Mutation seeding ws:invoke statement vastly speeded up process
New timeout factors necessary
GA process significantly slowed down due to inclusion of web services
GA process was more effective with better fitness evaluation; e.g. ranking fitness consisted of 3 source and targets
Objective: Generate an xslt program that multiplies 2 numbers, converts to Celsius and returns number in Chinese
Terminal Set: <a/>, <b/> ( 2 numbers )
Function Set: Subset of xslt instructions + ws:invoke
Raw fitness: Node count on xmldiff patch file difference between result xml and target xml
Fitness Cases: three fitness cases
Parameters: M=1000, G=51
Results
Multiply 2 numbers convert to Celsius and result should be in Chinese: average 2 hours
Tried a variety of more complicated problems, with many runs never converging to a solution; It is apparent that there is not enough ‘genetic material’ online yet
Prototype proved that GA can be applied
Assisting GA always speeded up the process
Many optimization opportunities
Enhancement
Could have used Dimitre Novatchev’s FXSL, though this would have imposed a pure fp viewpoint on process
Use UDDI as web services repository
Apply GA to ANT or an xml pipeline, or even to BPEL, WS-CAF or any xml vocabulary
Prototyping with ANT was successful, but eventually will embed in a software framework
The Internet as a maturing Software Framework
Inheritance versus composition reuse mechanism
Hierarchical versus relational data models
Synchronous versus asynchronous
Stateful versus stateless
Declarative versus OO versus procedural
Coarse grained versus RPC versus Object based web services
Conclusion
In 5 years time will there be advances in hardware processing to make GA techniques viable?
problem domain experts can formulate representation of a problem to be solved using simple xml
Coders become farmers
It’s counter intuitive to generate a million line ‘messy’ program to solve a problem
Are there any amends/changes to key specifications that will assist or restrict the GA method ?
Thank you, any questions ?
References
JOHN R KOZA, Genetic Programming , MIT Press 1992
W3C, SOAP Version 1.2
W3C, XML Version 1
W3C, XSLT Version 2:
W3C, WSDL Version 1:
WSIL Version 1
J. W. Hunt and M. D. McIlroy , An Algorithm for Differential File Comparison published in 1976
SAXON XSLT PROCESSOR by Michael Kay, http://saxon.sourceforge.net