Hadoop and Marklogic: Using the Genetic Algorithm to generate Source Code

Results of an experimental approach of using MarkLogic/Hadoop to generate source code using map reduce methods.

Presentation Transcript

  • MarkLogic and Hadoop – Genetic Algorithm
    Jim Fuller, Senior Engineer, Europe
    email: jim.fuller@marklogic.com
    twitter: @xquery
    19/09/12
  • Senior engineer
    http://jim.fuller.name
    http://exslt.org
    @xquery
    XSLT UK 2001
    http://www.xmlprague.cz
    @perl6
  • Overview
    • Genetic Algorithm Refresher
    • MarkLogic/Hadoop architecture for implementing GA
    • Installing Hadoop
    • Installing MarkLogic Connector
    • Problem Statement
    • Review of GA process runs
    • Summary
  • What's the Problem?
    • Big data breathes life into older algorithmic approaches
    • I thought it would be interesting to turn the ‘bigdata’ problem on its head (code versus data)
    • Demonstrate Hadoop with MarkLogic, each working to the other's strengths
  • Get out of your comfort zone
    • This talk is slightly different than the description … 150 slides! Part I.
    • It has Hadoop/MarkLogic and the genetic algorithm, but I have focused on the process and early results
    • Doing data science means pushing yourself out of your comfort zone
    • Start simple, then iterate
  • Genetic Algorithm Refresher
    • The Genetic Algorithm (GA) is a model of the evolution of a population of artificial individuals, emulating Darwinian selection.
    • Each individual is a chromosome which contains discrete units of information (genes).
    • The driving force behind the search for new and better solutions is the retention and combination of good partial solutions to a problem.
  • Abridged Genetic Algorithm
    • The Fundamental Theorem of Genetic Algorithms
      M(H, t): number of individuals in population t with the schema H
      f(H): average fitness of the individuals with the schema H
      F: average fitness of the entire population
      p1: probability of the schema being destroyed by crossover
      p2: probability of the schema being destroyed by mutation
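The formula that goes with these definitions did not survive in the transcript. A common textbook statement of the schema theorem, written here with the slide's own symbols, is sketched below. The exact form shown on the original slide is an assumption.

```latex
M(H, t+1) \;\ge\; M(H, t)\,\frac{f(H)}{F}\,\bigl(1 - p_1 - p_2\bigr)
```

Read this as: a schema whose average fitness f(H) is above the population average F, and which is rarely broken up by crossover (p1) or mutation (p2), is expected to appear in more individuals in the next generation.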
  • GA operations
    • Reproduction: an individual is perfectly replicated to a new population
    • Crossover (Recombination): parental material is recombined to create offspring to join the new population
    • Mutation: random changes (key for pushing past local optima)
    • Permutation: reordering
    • Editing: evaluation to a terminal
    • Encapsulation: single indivisible function
    • Decimation: removal of individuals
  • Typical GA Process
    Step 0. Create a random initial population of individuals
    Step 1. Evaluate the fitness of each individual
    Step 2. Select individuals according to their fitness, which will participate in generating offspring (moms + dads)
    Step 3. Apply primary and secondary genetic operations to generate a new offspring population
    Step 4. Repeat steps 1, 2 and 3 to generate X number of generations
    Step 5. Choose the fittest individual of the last generation based on stop criteria
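The steps above can be sketched as a short loop. This is a minimal illustration and not the code used in the talk. The function names, the toy bit-string problem, and the choice to keep the fitter half as parents are all assumptions made for the example. Lower fitness is treated as better, matching the standardized fitness used later in the deck.

```python
import random

def evolve(fitness, random_individual, crossover,
           pop_size=50, generations=20, seed=0):
    """Minimal GA loop; lower fitness is better and 0 means solved."""
    rng = random.Random(seed)
    # Step 0: create a random initial population
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):          # Step 4: repeat for X generations
        # Step 1: evaluate fitness and rank (best first)
        ranked = sorted(population, key=fitness)
        if fitness(ranked[0]) == 0:       # stop criterion met
            break
        # Step 2: select parents (assumption: keep the fitter half)
        parents = ranked[: pop_size // 2]
        # Step 3: reproduction (copy the best unchanged) plus crossover
        population = [ranked[0]]
        while len(population) < pop_size:
            mom, dad = rng.sample(parents, 2)
            population.append(crossover(mom, dad, rng))
    # Step 5: choose the fittest individual of the last generation
    return min(population, key=fitness)

# Toy problem (hypothetical): evolve a 16-bit string of all ones.
def count_zeros(bits):
    return bits.count(0)

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]

def one_point(mom, dad, rng):
    cut = rng.randint(1, len(mom) - 1)
    return mom[:cut] + dad[cut:]

best = evolve(count_zeros, random_bits, one_point)
print(best, count_zeros(best))
```

Because the best individual is copied unchanged into every new generation, the best fitness can never get worse from one generation to the next.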
  • Endemic GA Problems
    • Finding the optimal solution to complex, high-dimensional, multimodal problems often requires a very expensive fitness function
    • Hard to pose the problem statement, e.g. stop criteria are not clear in every problem
    • Premature convergence on local optima
  • Bit strings vs Lisp Parse Trees
    (+ (* 2 3) 4) evaluates to 10, and the symbolic expression looks like this as a tree:
          +
         / \
        *   4
       / \
      2   3
    Hierarchical computer programs are more expressive than manipulating linear strings
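To make the parse-tree idea concrete, here is a small sketch that stores the expression as nested tuples and evaluates it recursively. It assumes the slide's expression is (+ (* 2 3) 4), which is the reading that evaluates to 10. The tuple encoding and the `evaluate` helper are illustrative choices.

```python
import operator

# Operators allowed in the toy expression language (an assumption).
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(expr):
    """Evaluate a Lisp-style parse tree stored as nested tuples."""
    if isinstance(expr, tuple):
        op, *args = expr
        values = [evaluate(arg) for arg in args]
        result = values[0]
        for value in values[1:]:
            result = OPS[op](result, value)
        return result
    return expr  # a leaf is a plain number

tree = ("+", ("*", 2, 3), 4)
print(evaluate(tree))  # 10
```

Swapping the subtree `("*", 2, 3)` for any other subtree still yields a valid program, which is why trees are convenient to recombine.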
  • XSLT – markup is useful!
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
      <xsl:template match="a">
        <d/>
        <c/>
      </xsl:template>
    </xsl:stylesheet>
    As a tree:
      <xsl:stylesheet/>
        <xsl:template/>
          <d/>
          <c/>
    Obvious difficulties to address: different node types and XPath
  • Problem statement
    Objective: Generate an XSLT program that transforms source XML into result XML which is equivalent to target XML
    Terminal set: <a/> <b/> <c/> <d/>
    Function set: Subset of XSLT instructions
    Fitness cases: One fitness case
    Raw fitness: Treediffmerge result, node count + standard diff
    Standardized fitness: Same as raw fitness; approaching 0 is better fitness
    Parameters: M=500, G=51
  • Source XML
    <a>
      <b>
        <c>
          <d></d>
        </c>
      </b>
    </a>
  • Target XML – clear stop criteria
    <a>
      <b>
        <c>
          <d></d>
        </c>
      </b>
    </a>
  • Generation zero
    • XML Instance Generator, which is part of the Sun Multi-Schema Validator
    • The following can also do it
      - OxygenXML
      - Visual Studio
      - Eclipse
    • Ended up using IBM XML Generate: very old; supply it a schema and it generates example XML
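As a rough stand-in for a schema-driven instance generator, the sketch below builds random element trees from the terminal set a, b, c, d. The talk used IBM XML Generate with a schema, not this code. The `random_tree` helper and its depth and width limits are assumptions made for illustration.

```python
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]  # terminal set from the problem statement

def random_tree(rng, max_depth=3, max_children=2):
    """Build a random element tree no deeper than max_depth."""
    node = ET.Element(rng.choice(TERMINALS))
    if max_depth > 1:
        for _ in range(rng.randint(0, max_children)):
            node.append(random_tree(rng, max_depth - 1, max_children))
    return node

rng = random.Random(42)
population = [random_tree(rng) for _ in range(5)]
for individual in population:
    print(ET.tostring(individual, encoding="unicode"))
```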
  • Step 1a: Evaluate against Input
    [Diagram: XSLT generation + Source.xml → transformation → result.xml]
    MarkLogic evaluates each XSLT and places the result into the property for the XSLT itself
  • Step 1b: Evaluate Fitness
    [Diagram: XSLT generation + Source.xml → transformation → result.xml → Hadoop evaluates fitness]
    Fitness evaluation is performed with treediffmerge + standard diff
  • XML Diff issues
    • Many diff algorithms are based on a paper published in 1976 by J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison
    • XML has a structure; text-based diff programs do not take this into account
    • Simple example: <footie/> versus <footie></footie>; logically these are equal
    • XML canonicalization helps!
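The footie example can be checked with Python's standard library, which ships a canonicalizer (`xml.etree.ElementTree.canonicalize`, Python 3.8 or later). This is only an illustration of the idea. The talk does not say which canonicalization tool was used in the Hadoop step.

```python
from xml.etree.ElementTree import canonicalize

short_form = "<footie/>"
long_form = "<footie></footie>"

# A text comparison sees two different strings ...
print(short_form == long_form)
# ... but the canonical forms are identical.
print(canonicalize(short_form) == canonicalize(long_form))
```

Canonicalizing both documents before running a text diff removes differences that are purely lexical.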
  • XML Canonize + TreeDiffMerge
    Example 1, TreeDiffMerge difference:
      <?xml version="1.0" encoding="UTF-8"?>
      <diff xmlns:diff="http://diff.org">
        <diff:insert dst="1">
          <a> <b> <c> <d /> </c> </b> </a>
        </diff:insert>
      </diff>
    Example 1, result:
      <?xml version="1.0" encoding="UTF-8"?>
      <root/>
    Example 2, TreeDiffMerge difference:
      <?xml version="1.0" encoding="UTF-8"?>
      <diff xmlns:diff="http://diff.org">
        <diff:copy src="2" dst="1">
          <b> <c> <diff:copy src="16" dst="2" /> </c> </b>
        </diff:copy>
      </diff>
    Example 2, result (as far as it is legible in the transcript):
      <?xml version="1.0" encoding="utf-8"?>
      <root><a/><a><a><c/><c><a><d/></a><c/></c></a><b><b/><a/><c/><d/></b><a/></a><d><a><c/><a/><a/></a><c/></d><c/>
  • Simple if we match: we are done!
    TreeDiffMerge difference:
      <?xml version="1.0" encoding="UTF-8"?>
      <diff />
    Result:
      <?xml version="1.0" encoding="utf-8"?>
      <root><a> <b> <c> <d/> </c> </b> </a></root>
  • MarkLogic/Hadoop Architecture Interlude
    [Diagram: Hadoop and MarkLogic communicate through the MarkLogic Connector API via XDBC]
  • From Hadoop's point of view
  • Hadoop Installation Recipe
    • Installing Hadoop (setting up a single node cluster)
      - brew install hadoop
      - make sure ssh is set up properly
      - generate id_rsa and id_rsa.pub
      - append pub to auth keys: cat id_rsa.pub >> authorized_keys
      - enable remote login on Mac OS X
    • Configure Hadoop
      - edit core-site.xml
      - edit mapred-site.xml
    • ssh localhost
      - format hdfs: hadoop namenode -format
    • bin/start-all.sh
      - if it asks for a password, you have a problem with your ssh setup
    • To check that all is well
      - run jps
      - ps ax | grep hadoop | wc -l
      - check
          http://localhost:50030/jobtracker.jsp
          http://localhost:50060/tasktracker.jsp
          http://localhost:50070/dfshealth.jsp
  • Installing ML Hadoop Connector
    • Copy latest xcc and connector jars to hadoop lib
    • Copy ml-examples jar as well
    • Copy ml hadoop conf to hadoop conf
  • Starting it all Up
    • Start MarkLogic
    • Create database
    • Create xdbc connection (how hadoop/ml communicate)
    • Edit marklogic-hello-world.xml
    • Make sure hadoop is started
  • Starting it all Up
    • Load test data via query console
    xquery version "1.0-ml";
    let $hello := <data><child>hello mom</child></data>
    let $world := <data><child>world event</child></data>
    return (
      xdmp:document-insert("hello.xml", $hello),
      xdmp:document-insert("world.xml", $world)
    )
  • Run hello world example
    • bin/start-all.sh
    • hadoop jar lib/marklogic-xcc-examples-6.0.20120914.jar com.marklogic.mapreduce.examples.HelloWorld
    • Review https://gist.github.com/2484318
  • Fitness (hadoop) step
    • Applies XML canonicalization
    • Performs treediffmerge and writes the output to the original xslt document's xml property
    • Performs text diff and writes the output to the original xslt document's xml property
  • Step 2. Select individuals
    • Probabilistic selection to choose which individuals participate in genetic operations
    [Diagram: XSLT population → selected individuals]
    Select individuals for genetic operations, based on their fitness
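One way to do probabilistic selection is to draw individuals with a probability that grows as their fitness improves. The sketch below is an assumption about how this could look, not the selection code from the talk. It weights each individual by 1 / (1 + d), where d is its count of differing nodes, so a perfect match gets the largest weight.

```python
import random

def select_parents(population, diff_counts, k, rng):
    """Draw k individuals with replacement; fewer differing nodes means more likely."""
    weights = [1.0 / (1.0 + d) for d in diff_counts]
    return rng.choices(population, weights=weights, k=k)

# Hypothetical population: document names and their diff node counts.
population = ["xslt-0", "xslt-1", "xslt-2", "xslt-3"]
diff_counts = [0, 2, 5, 40]

rng = random.Random(1)
parents = select_parents(population, diff_counts, k=6, rng=rng)
print(parents)
```

Weak individuals can still be chosen occasionally, which keeps some diversity in the gene pool.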
  • About fitness
    • Raw fitness: the natural representation in terms of the specific problem (here, a primitive count of the nodes in the treediffmerge patch)
    • Standardized fitness: the lower the better
    • Adjusted fitness: lies between 0 and 1
    • Normalized fitness: lies between 0 and 1, with the sum of fitness values = 1
    • In our case, the lower the number of ‘different’ nodes the better, so we use standardized fitness
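The four measures can be computed as follows, using the definitions from Koza's Genetic Programming (listed in the references): adjusted fitness is 1 / (1 + standardized), and normalized fitness is each adjusted value divided by the sum of all adjusted values. The sample counts are made up for illustration.

```python
def adjusted(standardized):
    """Map standardized fitness (0 is best) into the range (0, 1]."""
    return 1.0 / (1.0 + standardized)

def normalized(standardized_values):
    """Scale adjusted fitness so the values across the population sum to 1."""
    adj = [adjusted(s) for s in standardized_values]
    total = sum(adj)
    return [a / total for a in adj]

# Raw fitness = number of differing nodes; here it is already standardized.
diff_counts = [0, 1, 3]
adjusted_values = [adjusted(s) for s in diff_counts]
normalized_values = normalized(diff_counts)
print(adjusted_values)    # [1.0, 0.5, 0.25]
print(normalized_values)
```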
  • Step 3. Apply Primary Genetic Operations: Reproduction
    [Diagram: selected XSLT population → new generation]
    Individual reproduced into new generation
  • Step 3. Primary Genetic Operations: Crossover (Recombination)
    [Diagram: selected XSLT population → ‘Mom’ and ‘Dad’ → creates 2 offspring → new generation]
    Select parents, then crossover creates 2 offspring
  • Step 3. Primary Genetic Operations: Crossover (Recombination)
    [Diagram: ‘Mom XSLT’ and ‘Dad XSLT’ → two ‘offspring xslt’ → new generation]
    Swap nodes between selected parent xslt
  • Crossover with xquery
    xquery version "1.0-ml";
    import module namespace mem = "http://xqdev.com/in-mem-update"
      at "/MarkLogic/appservices/utils/in-mem-update.xqy";
    let $mom :=
      <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="/" as="item()*">
          <bar>help</bar>
        </xsl:template>
        <xsl:template match="text()" as="item()*"/>
      </xsl:stylesheet>
    let $dad :=
      <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <xsl:template match="/" as="item()*">
          <a><b><c>test</c></b></a>
        </xsl:template>
      </xsl:stylesheet>
    let $momCount := fn:count($mom//.)
    let $dadCount := fn:count($dad//.)
    (: never want root node :)
    let $momRdm := xdmp:random($momCount - 2) + 2
    let $dadRdm := xdmp:random($dadCount - 2) + 2
    (: node selection :)
    let $momNode := ($mom//.)[$momRdm]
    let $dadNode := ($dad//.)[$dadRdm]
    (: crossover :)
    let $newMom := mem:node-replace($momNode, $dadNode)
    let $newDad := mem:node-replace($dadNode, $momNode)
    return
      <result>
        <newMom>{$newMom}</newMom>
        <newDad>{$newDad}</newDad>
      </result>
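For readers without a MarkLogic instance, the same subtree swap can be sketched with Python's ElementTree. This is an illustrative port and not the talk's code. Unlike the XQuery, which picks from every node including text nodes, this version only swaps elements, and the small sample trees are made up.

```python
import copy
import random
import xml.etree.ElementTree as ET

def non_root_slots(root):
    """List (parent, index) for every element except the root."""
    return [(parent, i) for parent in root.iter() for i, _ in enumerate(parent)]

def crossover(mom, dad, rng):
    """Swap one random non-root subtree between copies of mom and dad."""
    new_mom, new_dad = copy.deepcopy(mom), copy.deepcopy(dad)
    # Both parents must have at least one child element.
    mom_parent, mi = rng.choice(non_root_slots(new_mom))
    dad_parent, di = rng.choice(non_root_slots(new_dad))
    mom_parent[mi], dad_parent[di] = dad_parent[di], mom_parent[mi]
    return new_mom, new_dad

mom = ET.fromstring("<a><b/><c/></a>")
dad = ET.fromstring("<x><y><z/></y></x>")
new_mom, new_dad = crossover(mom, dad, random.Random(7))
print(ET.tostring(new_mom, encoding="unicode"))
print(ET.tostring(new_dad, encoding="unicode"))
```

Working on deep copies leaves the parents intact, so they can also be reproduced unchanged into the next generation.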
  • Step 3. Secondary Genetic Operations
    • Mutation: a form of random crossover
    • Permutation: reorganize nodes
    • Editing: evaluate a set of nodes
    • Encapsulation: takes a branch and replaces it with 1 indivisible node
    • Decimation: removes individuals based on domain-specific criteria
  • Step 3. Secondary Genetic Operations: mutation
    [Diagram: ‘selected XSLT’ → ‘offspring xslt’ with a completely new set of instructions]
    Pick a node and randomly mutate
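Mutation can be sketched as replacing one randomly chosen subtree with a freshly generated one. The code below is an assumption about how that could be done with ElementTree, using the a, b, c, d terminal set. The helper names and size limits are hypothetical.

```python
import copy
import random
import xml.etree.ElementTree as ET

TERMINALS = ["a", "b", "c", "d"]

def random_subtree(rng, max_depth=2):
    """Generate a small random element tree from the terminal set."""
    node = ET.Element(rng.choice(TERMINALS))
    if max_depth > 1:
        for _ in range(rng.randint(0, 2)):
            node.append(random_subtree(rng, max_depth - 1))
    return node

def mutate(individual, rng):
    """Return a copy with one random non-root subtree replaced."""
    mutant = copy.deepcopy(individual)
    slots = [(parent, i) for parent in mutant.iter() for i, _ in enumerate(parent)]
    if not slots:          # nothing below the root to mutate
        return mutant
    parent, index = rng.choice(slots)
    parent[index] = random_subtree(rng)
    return mutant

individual = ET.fromstring("<a><b><c><d/></c></b></a>")
before = ET.tostring(individual, encoding="unicode")
mutant = mutate(individual, random.Random(3))
print(ET.tostring(mutant, encoding="unicode"))
```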
  • Step 3. Secondary Genetic Operations: permutation
    [Diagram: ‘selected XSLT’ → ‘offspring xslt’]
    Permuted node order
  • Step 3. Secondary Genetic Operations: editing
    [Diagram: ‘selected XSLT’ → ‘offspring xslt’]
    Replace node with evaluated expression
  • Step 3. Secondary Genetic Operations: encapsulation
    [Diagram: ‘selected XSLT’ → ‘define new function’ → ‘XSLT’]
    Identify useful subtrees and encapsulate by defining a new function
  • Step 3. Secondary Genetic Operations: decimation
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    </xsl:stylesheet>
    As a tree: <xsl:stylesheet/>
    Identify individuals with very poor fitness and remove them from the population
  • Initial tests
    • Initial population = 500, generations = 51
    • Set initial genetic operation probabilities:
      - 90% crossover on selected individuals
      - 10% reproduction on selected individuals
      - 0% secondary operations on selected individuals
  • Results
    • Runs faster with more servers … extreme scale out, which is unusual for GA
    • Arrived quickly at a ‘correct’ solution
    • In some runs, though, the local optimum was a ‘wrong solution’, e.g. an embedded literal
    • Need to constrain XPath (baby steps)
    • Need to constrain the terminal set
    • Enhance the fitness definition
  • Source XML
    <a>
      <b>
        <c>
          <d></d>
        </c>
      </b>
    </a>
  • Target XML
    <a>
      <b/>
      <c/>
      <d/>
    </a>
  • Results
    • Needed more generations and more individuals
    • Mutation operation needed to kick out of local optima
  • Summary
    • This approach can be applied to any language parse tree (xquery with xqueryparser.xq)
    • Difficulties with embedded little languages
    • Today, commercially applicable to generating mapping solutions; more research required
    • Illustrates applying the strengths of ML/Hadoop together
    • Will place code and results on github soon …
  • References
    • John R. Koza, Genetic Programming, MIT Press, 1992
    • J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison, 1976