Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Results of an experimental approach of using MarkLogic/Hadoop to generate source code using map reduce methods.

Published in: Technology

  1. MarkLogic and Hadoop – Genetic Algorithm
     Jim Fuller, Senior Engineer, Europe
     email: jim.fuller@marklogic.com
     twitter: @xquery
     19/09/12
  2. Senior engineer
     http://jim.fuller.name
     http://exslt.org
     @xquery
     XSLT UK 2001
     http://www.xmlprague.cz
     @perl6
  3. Overview
     • Genetic Algorithm refresher
     • MarkLogic/Hadoop architecture for implementing GA
     • Installing Hadoop
     • Installing MarkLogic Connector
     • Problem statement
     • Review of GA process runs
     • Summary
  4. What's the Problem?
     • Big data breathes life into older algorithmic approaches
     • I thought it would be interesting to turn the 'big data' problem on its head (code versus data)
     • Demonstrate Hadoop with MarkLogic, each playing to the other's strengths
  5. Get out of your comfort zone
     • This talk is slightly different than the description … 150 slides! Part I.
     • It has Hadoop/MarkLogic and the genetic algorithm, but I have focused on the process and early results
     • Doing data science means pushing yourself out of your comfort zone
     • Start simple, then iterate
  6. Genetic Algorithm Refresher
     • The Genetic Algorithm (GA) is a model of the evolution of a population of artificial individuals emulating Darwinian selection.
     • Each individual is a chromosome, which contains discrete units of information (genes).
     • The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem.
  7. Abridged Genetic Algorithm
     • The Fundamental Theorem of Genetic Algorithms
       M(H, t): number of individuals in population t with the schema H.
       f(H): average fitness of the individuals with the schema H.
       F: average fitness of the entire population.
       p1: probability of the schema being destroyed by crossover.
       p2: probability of the schema being destroyed by mutation.
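
The formula itself did not survive in this transcript. Assuming the slide showed the usual statement of Holland's schema theorem, it can be written with the symbols defined above as a lower bound on the expected number of individuals carrying schema H in the next population. The product form and its first-order simplification are a reconstruction, not necessarily the author's exact notation.

```latex
\mathbb{E}\left[ M(H, t+1) \right]
  \;\ge\; M(H, t)\,\frac{f(H)}{F}\,(1 - p_1)(1 - p_2)
  \;\ge\; M(H, t)\,\frac{f(H)}{F}\,(1 - p_1 - p_2)
```

Read this way, a schema whose average fitness f(H) exceeds the population average F, and which is rarely destroyed by crossover or mutation, is expected to gain representation from one generation to the next.
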
  8. GA operations
     • Reproduction: an individual is perfectly replicated to a new population
     • Crossover (Recombination): parental material is recombined to create offspring to join the new population
     • Mutation: random changes (key for pushing past local optima)
     • Permutation: reordering
     • Editing: evaluation to a terminal
     • Encapsulation: single indivisible function
     • Decimation: removal of individuals
  9. Typical GA Process
     Step 0. Create a random initial population of individuals
     Step 1. Evaluate the fitness of each individual
     Step 2. Select individuals, according to their fitness, to participate in generating offspring (moms + dads)
     Step 3. Apply primary and secondary genetic operations to generate the new offspring population
     Step 4. Repeat steps 1, 2 and 3 for X generations
     Step 5. Choose the fittest individual of the last generation, based on the stop criteria
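
The steps above can be sketched as a small, generic loop. This is an illustrative sketch rather than the code from the talk: the names `evolve`, `init`, `fitness` and `breed` are hypothetical, the toy problem evolves bit strings instead of XSLT, and parents are taken from the better half of the population as a simplification of probabilistic selection.

```python
import random

def evolve(init, fitness, breed, pop_size=20, generations=30, seed=0):
    """Generic GA loop; lower fitness is better, 0 means 'matches the target'."""
    rng = random.Random(seed)
    # Step 0: random initial population
    population = [init(rng) for _ in range(pop_size)]
    for _ in range(generations):               # Step 4: repeat steps 1-3
        # Step 1: evaluate fitness (best individuals first)
        scored = sorted(population, key=fitness)
        if fitness(scored[0]) == 0:            # stop criterion met early
            break
        # Step 2: select parents (better half, a simplification)
        parents = scored[: pop_size // 2]
        # Step 3: genetic operations produce the next generation
        population = [breed(rng.choice(parents), rng.choice(parents), rng)
                      for _ in range(pop_size)]
    # Step 5: fittest individual of the last generation
    return min(population, key=fitness)

# Toy problem (an assumption for illustration): evolve a string of all ones.
TARGET_LEN = 16

def init(rng):
    return [rng.randint(0, 1) for _ in range(TARGET_LEN)]

def fitness(bits):
    return bits.count(0)                       # positions that are still wrong

def breed(mom, dad, rng):
    cut = rng.randrange(1, TARGET_LEN)         # one-point crossover
    child = mom[:cut] + dad[cut:]
    if rng.random() < 0.1:                     # occasional mutation
        child[rng.randrange(TARGET_LEN)] ^= 1
    return child

best = evolve(init, fitness, breed)
print(fitness(best), best)
```

In the talk's setting the individuals would be XSLT documents and the fitness call would be the Hadoop step described later, but the control flow stays the same.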
  10. Endemic GA Problems
      • Finding the optimal solution to complex, high-dimensional, multimodal problems often requires a very expensive fitness function
      • It is hard to pose the problem statement, e.g. the stop criteria are not clear in every problem
      • Premature convergence on local optima
  11. Bit strings vs Lisp Parse Trees
      (+ (* 2 3) 4) evaluates to 10, and as a symbolic expression tree it looks like:
            +
           / \
          *   4
         / \
        2   3
      Manipulating hierarchical computer programs is more expressive than manipulating linear strings.
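
To make the parse-tree idea concrete, here is a minimal sketch that stores the expression as nested tuples and evaluates it recursively. It reads the slide's expression as (+ (* 2 3) 4), which is an assumption based on the stated result of 10. The tuple layout and the `evaluate` helper are illustrative only.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(node):
    """Evaluate a parse tree given as nested tuples: (op, child, child, ...)."""
    if not isinstance(node, tuple):
        return node                       # terminal: a plain number
    op, *children = node
    values = [evaluate(child) for child in children]
    result = values[0]
    for value in values[1:]:
        result = OPS[op](result, value)
    return result

tree = ("+", ("*", 2, 3), 4)              # (+ (* 2 3) 4)
print(evaluate(tree))                     # 10
```

Because every subtree is itself a valid expression, genetic operations can swap or replace subtrees and still produce something that can be evaluated, which is much harder to guarantee with flat bit strings.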
  12. XSLT – markup is useful!
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
        <xsl:template match="a">
          <d/>
          <c/>
        </xsl:template>
      </xsl:stylesheet>
      As a tree: <xsl:stylesheet/> contains <xsl:template/>, which contains <d/> and <c/>.
      Obvious difficulties to address: different node types and XPath.
  13. Objective: Generate an xslt program that transforms source xml into result xml which is equivalent to target xml
      Terminal set: <a/> <b/> <c/> <d/>
      Function set: Subset of xslt instructions
      Fitness cases: One fitness case
      Raw fitness: Treediffmerge result, node count + standard diff
      Standardized fitness: Same as raw fitness; approaching 0 is better fitness
      Parameters: M=500, G=51
  14. Source XML
      <a>
        <b>
          <c>
            <d></d>
          </c>
        </b>
      </a>
  15. Target XML – clear stop criteria
      <a>
        <b>
          <c>
            <d></d>
          </c>
        </b>
      </a>
  16. Generation zero
      • XML Instance Generator, which is part of the Sun Multi-Schema Validator
      • The following can also do it
        – OxygenXML
        – Visual Studio
        – Eclipse
      • Ended up using IBM XML Generate: very old, you supply it a schema and it generates example xml
  17. Step 1a: Evaluate against Input
      [Diagram: XSLT generation; Source.xml goes through the xslt transformation to give result.xml]
      MarkLogic evaluates each XSLT and places the result into a property of the XSLT document itself.
  18. Step 1b: Evaluate Fitness
      [Diagram: XSLT generation; Source.xml goes through the xslt transformation to give result.xml; Hadoop evaluates fitness]
      Fitness is evaluated with treediffmerge + standard diff.
  19. XML Diff issues
      • Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison
      • XML has a structure; text-based diff programs do not take this into account
      • Simple example: <footie/> versus <footie></footie>; logically these are equal
      • XML canonicalization helps!
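
The <footie/> example can be checked with a few lines of Python. This is only an illustration of the idea using the standard library's C14N 2.0 support (Python 3.8 or later), not the canonicalizer used in the talk. The helper name `same_xml` is hypothetical.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

def same_xml(left, right):
    """Compare two XML strings after canonicalization, ignoring indentation."""
    return (canonicalize(xml_data=left, strip_text=True)
            == canonicalize(xml_data=right, strip_text=True))

short_form = "<footie/>"
long_form = "<footie></footie>"

print(short_form == long_form)            # False: a text diff sees a change
print(same_xml(short_form, long_form))    # True: logically the same document
```

Canonicalizing both sides before diffing means that differences reported afterwards reflect structure rather than serialization details such as empty-element syntax or indentation.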
  20. XML Canonize + TreeDiffMerge
      Example 1
      Treediffmerge difference:
        <?xml version="1.0" encoding="UTF-8"?>
        <diff xmlns:diff=http://diff.org>
          <diff:insert dst="1">
            <a> <b> <c> <d /> </c> </b> </a>
          </diff:insert>
        </diff>
      Results:
        <?xml version="1.0" encoding="UTF-8"?>
        <root/>
      Example 2
      Treediffmerge difference:
        <?xml version="1.0" encoding="UTF-8"?>
        <diff xmlns:diff=http://diff.org>
          <diff:copy src="2" dst="1">
            <diff:copy src="16" dst="2" />
          </diff:copy>
        </diff>
      Results:
        <?xml version="1.0" encoding="utf-8"?>
        <root><a/><a><a><c/><c><a><d/></a><c/></c></a><b><b/><a/><c/><b><c><d/></c></b></b><a/></a><d><a><c/><a/><a/></a><c/></d><c/>
  21. Simple if we match: we are done!
      Treediffmerge difference:
        <?xml version="1.0" encoding="UTF-8"?>
        <diff />
      Results:
        <?xml version="1.0" encoding="utf-8"?>
        <root><a> <b> <c> <d/> </c> </b> </a></root>
  22. MarkLogic/Hadoop Architecture Interlude
      [Diagram: MarkLogic Connector API via XDBC]
  23. From the Hadoop point of view
  24. Hadoop Installation Recipe
      • Installing Hadoop (setting up a single-node cluster)
        – brew install hadoop
        – make sure ssh is set up properly
        – generate id_rsa and id_rsa.pub
        – append the public key to the authorized keys
          • cat id_rsa.pub >> authorized_keys
        – enable remote login on Mac OS X
      • Configure Hadoop
        – edit core-site.xml
        – edit mapred-site.xml
      • ssh localhost
        – format hdfs
          • hadoop namenode -format
      • bin/start-all.sh
        – if it asks for a password, there is a problem with your ssh setup
      • To check that all is well
        – run jps
        – ps ax | grep hadoop | wc -l
        – check
          • http://localhost:50030/jobtracker.jsp
          • http://localhost:50060/tasktracker.jsp
          • http://localhost:50070/dfshealth.jsp
  25. Installing ML Hadoop Connector
      • Copy the latest xcc and connector jars to hadoop lib
      • Copy the ml-examples jar as well
      • Copy the ml hadoop conf to hadoop conf
  26. Starting it all Up
      • Start MarkLogic
      • Create a database
      • Create an xdbc connection (how Hadoop and MarkLogic communicate)
      • Edit marklogic-hello-world.xml
      • Make sure Hadoop is started
  27. Starting it all Up
      • Load test data via query console
        xquery version "1.0-ml";
        let $hello := <data><child>hello mom</child></data>
        let $world := <data><child>world event</child></data>
        return (
          xdmp:document-insert("hello.xml", $hello),
          xdmp:document-insert("world.xml", $world)
        )
  28. Run hello world example
      • bin/start-all.sh
      • hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld
      • Review https://gist.github.com/2484318
  29. Fitness (hadoop) step
      • Applies XML canonicalization
      • Performs treediffmerge and writes the output to an xml property of the original xslt document
      • Performs a text diff and writes the output to an xml property of the original xslt document
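
As a rough illustration of how a diff patch can be turned into a number, the sketch below counts the element nodes inside a patch document. The patch layout is modeled on the examples shown in this talk, the namespace URI is quoted here so that the patch parses, and the counting rule (every element below the root counts as one difference) is an assumption, not the talk's exact scoring.

```python
import xml.etree.ElementTree as ET

def raw_fitness(patch_xml):
    """Count element nodes below the root of a diff patch (0 = no differences)."""
    root = ET.fromstring(patch_xml)
    return sum(1 for _ in root.iter()) - 1    # iter() includes the root itself

match = '<diff xmlns:diff="http://diff.org"/>'
insert = ('<diff xmlns:diff="http://diff.org">'
          '<diff:insert dst="1"><a><b><c><d/></c></b></a></diff:insert>'
          '</diff>')

print(raw_fitness(match))     # 0
print(raw_fitness(insert))    # 5
```

An empty patch scores 0, which lines up with the "if we match, we are done" stop criterion, and larger patches score progressively worse.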
  30. Step 2. Select individuals
      • Probabilistic selection to choose which individuals participate in genetic operations
      [Diagram: selected XSLT population]
      Select individuals for genetic operations, based on their fitness.
  31. About fitness
      • Raw fitness: the natural representation in terms of the specific problem (here, a primitive count of the nodes in the treediffmerge patch)
      • Standardized fitness: the lower the better
      • Adjusted fitness: lies between 0 and 1
      • Normalized fitness: lies between 0 and 1, with the sum of fitness values = 1
      • In our case, the lower the number of 'different' nodes the better, so use standardized fitness
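
The fitness measures above follow Koza's definitions, where adjusted fitness is 1/(1+s) for a standardized fitness s, and normalized fitness divides each adjusted value by their sum. The sketch below computes both and uses the normalized values as weights for probabilistic selection. The function names are illustrative and the selection scheme (fitness-proportionate) is an assumption about what "probabilistic selection" means in this talk.

```python
import random

def adjusted(standardized):
    """Adjusted fitness in (0, 1]; 1 means a perfect individual."""
    return [1.0 / (1.0 + s) for s in standardized]

def normalized(standardized):
    """Normalized fitness: adjusted values rescaled so they sum to 1."""
    adj = adjusted(standardized)
    total = sum(adj)
    return [a / total for a in adj]

def select(population, standardized, k, rng):
    """Fitness-proportionate selection using normalized fitness as weights."""
    return rng.choices(population, weights=normalized(standardized), k=k)

scores = [0, 1, 3]                       # standardized fitness: differing nodes
print(adjusted(scores))                  # [1.0, 0.5, 0.25]
print(normalized(scores))

rng = random.Random(1)
picks = select(["x.xsl", "y.xsl", "z.xsl"], scores, 1000, rng)
print({name: picks.count(name) for name in ("x.xsl", "y.xsl", "z.xsl")})
```

Individuals with fewer differing nodes are picked more often, but weaker ones still have a chance, which helps keep some diversity in the population.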
  32. Step 3. Apply Primary Genetic Operations: Reproduction
      [Diagram: selected XSLT population, new generation]
      The individual is reproduced into the new generation.
  33. Step 3. Primary Genetic Operations: Crossover (Recombination)
      [Diagram: selected XSLT population, 'Mom' and 'Dad', new generation]
      Select parents, then crossover creates 2 offspring.
  34. Step 3. Primary Genetic Operations: Crossover (Recombination)
      [Diagram: 'Mom XSLT' and 'Dad XSLT' produce two 'offspring xslt' for the new generation]
      Swap nodes between the selected parent xslt.
  35. Crossover with xquery
      xquery version "1.0-ml";
      import module namespace mem = "http://xqdev.com/in-mem-update"
        at "/MarkLogic/appservices/utils/in-mem-update.xqy";

      let $mom :=
        <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="/" as="item()*">
            <bar>help</bar>
          </xsl:template>
          <xsl:template match="text()" as="item()*"/>
        </xsl:stylesheet>
      let $dad :=
        <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
          <xsl:template match="/" as="item()*">
            <a><b><c>test</c></b></a>
          </xsl:template>
        </xsl:stylesheet>
      let $momCount := fn:count($mom//.)
      let $dadCount := fn:count($dad//.)
      (: never want root node :)
      let $momRdm := xdmp:random($momCount - 2) + 2
      let $dadRdm := xdmp:random($dadCount - 2) + 2
      (: node selection :)
      let $momNode := ($mom//.)[$momRdm]
      let $dadNode := ($dad//.)[$dadRdm]
      (: crossover :)
      let $newMom := mem:node-replace($momNode, $dadNode)
      let $newDad := mem:node-replace($dadNode, $momNode)
      return
        <result>
          <newMom>{$newMom}</newMom>
          <newDad>{$newDad}</newDad>
        </result>
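
For readers who do not have a MarkLogic instance at hand, the same subtree-swap idea can be sketched with Python's standard library. This is not a port of the XQuery above: it only swaps element nodes (the XQuery also considers text nodes), the helper names are hypothetical, and it assumes both parents have at least one child element.

```python
import copy
import random
import xml.etree.ElementTree as ET

def slots(root):
    """All (parent, index) positions that hold a non-root element."""
    return [(parent, i) for parent in root.iter() for i in range(len(parent))]

def crossover(mom, dad, rng):
    """Return two offspring made by swapping one random subtree of each parent."""
    mom, dad = copy.deepcopy(mom), copy.deepcopy(dad)    # leave parents intact
    m_parent, m_i = rng.choice(slots(mom))               # never the root node
    d_parent, d_i = rng.choice(slots(dad))
    m_parent[m_i], d_parent[d_i] = d_parent[d_i], m_parent[m_i]
    return mom, dad

mom = ET.fromstring("<s><t><bar>help</bar></t><t/></s>")
dad = ET.fromstring("<s><t><a><b><c>test</c></b></a></t></s>")
kid1, kid2 = crossover(mom, dad, random.Random(7))
print(ET.tostring(kid1, encoding="unicode"))
print(ET.tostring(kid2, encoding="unicode"))
```

Working on deep copies keeps the parents available for further operations in the same generation, which matters because one individual may be selected more than once.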
  36. Step 3. Secondary Genetic Operations
      • Mutation: a form of random crossover
      • Permutation: reorganize nodes
      • Editing: evaluate a set of nodes
      • Encapsulation: takes a branch and replaces it with 1 indivisible node
      • Decimation: removes individuals based on domain-specific criteria
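
Two of these operations are easy to sketch on an element tree. The code below is a simplified illustration, not the talk's implementation: mutation here replaces a randomly chosen subtree with a single random terminal from the terminal set, and permutation shuffles the children of one randomly chosen node. Both functions modify the tree in place.

```python
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]          # the terminal set used in this talk

def mutate(root, rng):
    """Replace one randomly chosen non-root element with a random terminal."""
    spots = [(parent, i) for parent in root.iter() for i in range(len(parent))]
    if spots:                             # a lone root has nothing to mutate
        parent, i = rng.choice(spots)
        parent[i] = ET.Element(rng.choice(TERMINALS))
    return root

def permute(root, rng):
    """Shuffle the child order of one randomly chosen element."""
    node = rng.choice(list(root.iter()))
    children = list(node)
    rng.shuffle(children)
    node[:] = children                    # write the new order back
    return root

rng = random.Random(3)
tree = ET.fromstring("<r><a/><b/><c/></r>")
print(ET.tostring(permute(tree, rng), encoding="unicode"))
print(ET.tostring(mutate(tree, rng), encoding="unicode"))
```

A richer mutation would generate a whole new random subtree of instructions rather than a single terminal, which is closer to the "completely new set of instructions" idea.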
  37. Step 3. Secondary Genetic Operations: mutation
      [Diagram: 'selected XSLT' becomes 'offspring xslt' with a completely new set of instructions]
      Pick a node and randomly mutate.
  38. Step 3. Secondary Genetic Operations: permutation
      [Diagram: 'selected XSLT' becomes 'offspring xslt']
      Permuted node order.
  39. Step 3. Secondary Genetic Operations: editing
      [Diagram: 'selected XSLT' becomes 'offspring xslt']
      Replace a node with the evaluated expression.
  40. Step 3. Secondary Genetic Operations: encapsulation
      [Diagram: 'selected XSLT', 'define new function', 'XSLT']
      Identify useful subtrees and encapsulate them by defining a new function.
  41. Step 3. Secondary Genetic Operations: decimation
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      </xsl:stylesheet>
      As a tree: <xsl:stylesheet/> on its own.
      Identify individuals with very poor fitness and remove them from the population.
  42. Initial tests
      • Initial population = 500, generations = 51
      • Set initial genetic operation probabilities:
        – 90% crossover on selected individuals
        – 10% reproduction on selected individuals
        – 0% secondary operations on selected individuals
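
One simple way to apply these probabilities is a weighted draw per selected individual. The sketch below is an assumption about how the percentages might be used, with hypothetical names; the talk does not show this code.

```python
import random

# Operation probabilities from the initial tests.
OPERATIONS = {"crossover": 0.90, "reproduction": 0.10, "secondary": 0.00}

def pick_operation(rng):
    """Choose the genetic operation for the next selected individual."""
    names = list(OPERATIONS)
    weights = [OPERATIONS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(42)
counts = {name: 0 for name in OPERATIONS}
for _ in range(500):                      # one draw per individual, M = 500
    counts[pick_operation(rng)] += 1
print(counts)
```

Keeping the probabilities in one table makes it easy to raise the share of secondary operations such as mutation in later runs.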
  43. Results
      • Runs faster with more servers … extreme scale-out, which is unusual for GA
      • Arrived quickly at a 'correct' solution
      • In some runs, though, the local optimum was a 'wrong solution', e.g. an embedded literal
      • Need to constrain xpath (baby steps)
      • Need to constrain the terminal set
      • Enhance the fitness definition
  44. Source XML
      <a>
        <b>
          <c>
            <d></d>
          </c>
        </b>
      </a>
  45. Target XML
      <a>
        <b/>
        <c/>
        <d/>
      </a>
  46. Results
      • Needed more generations and more individuals
      • The mutation operation was needed to kick runs out of local optima
  47. Summary
      • This approach can be applied to any language parse tree (xquery with xqueryparser.xq)
      • Difficulties with little languages being embedded
      • Today, commercially applicable to generating mapping solutions; more research required
      • Illustrates applying the strengths of ML and Hadoop together
      • Will place code and results on github soon …
  48. References
      • John R. Koza, Genetic Programming, MIT Press, 1992
      • J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976
