# Experiments in genetic programming

Published in: Technology

### Experiments in genetic programming

1. Experiments in genetic programming
   - Bouvet BigOne, 2012-03-29
   - Lars Marius Garshol, <larsga@bouvet.no>
   - http://twitter.com/larsga
2. The background
   - Duke: open source data matching engine (Java)
     - can find near-duplicate database records
     - probabilistic configuration
     - http://code.google.com/p/duke/
   - People find making configurations difficult
     - can we help them?

   | Field     | Record 1  | Record 2    | Probability |
   |-----------|-----------|-------------|-------------|
   | Name      | acme inc  | acme inc    | 0.9         |
   | Assoc no  | 177477707 |             | 0.5         |
   | Zip code  | 9161      | 9161        | 0.6         |
   | Country   | norway    | norway      | 0.51        |
   | Address 1 | mb 113    | mailbox 113 | 0.49        |
   | Address 2 |           |             | 0.5         |
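To make the probability column above concrete: one common way to fold per-field probabilities into a single match probability is a naive Bayes combination. The slide does not say which formula Duke applies, so the formula and the names `combine` and `overall` below are assumptions for illustration only.

```python
from functools import reduce

def combine(p1, p2):
    """Naive Bayes combination of two independent match probabilities."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

# Per-field probabilities from the slide's example
# (Address 2 is empty in both records, so it is taken as a neutral 0.5).
field_probs = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]

# Start from 0.5 (no evidence either way) and fold in each field.
overall = reduce(combine, field_probs, 0.5)
print(round(overall, 3))
```

Under this assumed formula a value of 0.5 is neutral and leaves the running result unchanged, which is why empty fields do not count for or against a match.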
3. The idea
   - Given
     - a test file showing the correct linkages
   - can we
     - evolve a configuration
   - using
     - genetic algorithms?
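Evolving a configuration needs a way to score it against the test file. The slides report "accuracy" without defining it, so the F-measure below is an assumption, as are the names `f_measure` and `link` and the record IDs.

```python
def f_measure(found, correct):
    """Score a set of found links against the known-correct links."""
    true_positives = len(found & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)

def link(a, b):
    """An unordered pair of record IDs."""
    return frozenset((a, b))

# Hypothetical test file contents: pairs that really are the same entity.
correct = {link("r1", "r2"), link("r3", "r4"), link("r5", "r6")}
# Hypothetical output of running one configuration over the data.
found = {link("r2", "r1"), link("r3", "r4"), link("r7", "r8")}

score = f_measure(found, correct)
print(score)
```

Using unordered pairs means a link found as (r2, r1) still counts as the correct link (r1, r2).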
4. What a configuration looks like
   - Threshold for accepting matches
     - a number between 0.0 and 1.0
   - For each property
     - a comparator function (Exact, Levenshtein, numeric...)
     - a low probability (0.0-0.5)
     - a high probability (0.5-1.0)
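A configuration as described above could be modeled roughly like this. The class and function names (`PropertyConfig`, `Configuration`, `random_configuration`) are hypothetical and not Duke's API, and the comparator list is only the subset named on the slide.

```python
import random
from dataclasses import dataclass, field

# Comparator names taken from the slide; this is not a complete list.
COMPARATORS = ["Exact", "Levenshtein", "Numeric"]

@dataclass
class PropertyConfig:
    name: str
    comparator: str
    low: float   # low probability, 0.0-0.5
    high: float  # high probability, 0.5-1.0

@dataclass
class Configuration:
    threshold: float  # threshold for accepting matches, 0.0-1.0
    properties: list = field(default_factory=list)

def random_configuration(property_names, rng=random):
    """Draw every setting uniformly from its allowed range."""
    props = [
        PropertyConfig(
            name,
            rng.choice(COMPARATORS),
            rng.uniform(0.0, 0.5),
            rng.uniform(0.5, 1.0),
        )
        for name in property_names
    ]
    return Configuration(rng.uniform(0.0, 1.0), props)

config = random_configuration(["NAME", "CAPITAL", "AREA"])
print(config)
```

Passing a seeded `random.Random` instance as `rng` makes the draw repeatable, which helps when comparing runs.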
5. The hill-climbing problem
6. How it works
   1. Generate a population of 100 random configurations
   2. Evaluate the population
   3. Throw away the 25 worst, duplicate the 25 best
   4. Randomly modify the entire population
   5. Go back to 2
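The steps above can be tried out on a toy problem. In this sketch a "configuration" is just one number and `fitness` is a made-up stand-in for evaluating a real configuration; only the population size, the quartile handling, and the loop structure come from the slide.

```python
import random

POPULATION_SIZE = 100
QUARTILE = 25

def fitness(candidate):
    """Toy stand-in for evaluating a configuration: best at 0.7."""
    return 1.0 - abs(candidate - 0.7)

def mutate(candidate, rng):
    """Nudge the candidate a little, staying within 0.0-1.0."""
    return min(1.0, max(0.0, candidate + rng.uniform(-0.05, 0.05)))

def evolve(generations, rng):
    # 1. generate a population of random candidates
    population = [rng.random() for _ in range(POPULATION_SIZE)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 2. evaluate the population: sort best-first
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) > fitness(best):
            best = population[0]
        # 3. throw away the 25 worst, duplicate the 25 best
        population = population[:-QUARTILE]
        population = population[:QUARTILE] + population
        # 4. randomly modify the entire population
        population = [mutate(c, rng) for c in population]
        # 5. go back to step 2
    return best

best = evolve(generations=30, rng=random.Random(42))
print(best, fitness(best))
```

Dropping 25 and re-adding 25 keeps the population at 100, so each generation costs the same number of evaluations.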
7. Actual code

   ```python
   # Not shown on the slide: POPULATIONS, population, index, evaluate(),
   # show_best(), and the initial values of best and highest.
   for generation in range(POPULATIONS):
       print("===== GENERATION %s ================================" % generation)

       for c in population:
           f = evaluate(c)
           if f > highest:
               best = c
               highest = f
               show_best(best, False)

       # make new generation: sort best-first
       # (index maps each configuration to its score, presumably filled in by evaluate())
       population = sorted(population, key = lambda c: 1.0 - index[c])

       # ditch lower quartile
       population = population[ : -25]
       # double upper quartile
       population = population[ : 25] + population

       # mutate
       population = [c.make_new(population) for c in population]
   ```
8. Actual code #2

   ```python
   import random

   # Not shown on the slide: the aspects list, the Property class, and the
   # set/get threshold and add/get properties methods.
   class GeneticConfiguration:
       def __init__(self):
           self._props = []
           self._threshold = 0.0

       # set/get threshold, add/get properties

       def make_new(self, population):
           # either we make a number of random modifications, or we mate.
           # draw a number, if 0 modifications, we mate.
           mods = random.randint(0, 3)
           if mods:
               return self._mutate(mods)
           else:
               return self._mate(random.choice(population))

       def _mutate(self, mods):
           c = self._copy()
           for ix in range(mods):
               aspect = random.choice(aspects)
               aspect.modify(c)
           return c

       def _mate(self, other):
           # each aspect is taken from one of the two parents at random
           c = self._copy()
           for aspect in aspects:
               aspect.set(c, aspect.get(random.choice([self, other])))
           return c

       def _copy(self):
           c = GeneticConfiguration()
           c.set_threshold(self._threshold)
           for prop in self.get_properties():
               if prop.getName() == "ID":
                   c.add_property(Property(prop.getName()))
               else:
                   c.add_property(Property(prop.getName(),
                                           prop.getComparator(),
                                           prop.getLowProbability(),
                                           prop.getHighProbability()))
           return c
   ```
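The `aspects` list that `_mutate` and `_mate` rely on is not shown on the slide. A minimal sketch of what one aspect might look like, assuming only the `get`/`set`/`modify` interface the code above calls; `ToyConfig` and `ThresholdAspect` are hypothetical names, and a real list would presumably also hold aspects for each property's comparator and probabilities.

```python
import random

class ToyConfig:
    """Minimal stand-in for GeneticConfiguration: just a threshold."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold

class ThresholdAspect:
    """One tunable part of a configuration (hypothetical class)."""
    def get(self, config):
        return config.threshold

    def set(self, config, value):
        config.threshold = value

    def modify(self, config):
        # Replace the threshold with a fresh random value in 0.0-1.0.
        config.threshold = random.uniform(0.0, 1.0)

aspects = [ThresholdAspect()]

# Mating: the child takes each aspect from a randomly chosen parent.
mother, father = ToyConfig(0.6), ToyConfig(0.91)
child = ToyConfig()
for aspect in aspects:
    aspect.set(child, aspect.get(random.choice([mother, father])))

# Mutation: copy a parent, then modify one randomly chosen aspect.
mutant = ToyConfig(mother.threshold)
random.choice(aspects).modify(mutant)

print(child.threshold, mutant.threshold)
```

Keeping each tunable setting behind the same three methods lets mutation and mating loop over the list without knowing what any individual aspect controls.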
9. But ... does it work?!?
10. Linking countries
    - Linking countries from DBpedia and Mondial
      - no common identifiers
    - Manually I manage 95.4% accuracy
      - genetic script manages 95.7% in first generation
      - then improves to 98.9%
      - this was too easy...

    | Field   | DBPEDIA                           | MONDIAL       |
    |---------|-----------------------------------|---------------|
    | Id      | http://dbpedia.org/resource/Samoa | 17019         |
    | Name    | Samoa                             | Western Samoa |
    | Capital | Apia                              | Apia, Samoa   |
    | Area    | 2831                              | 2860          |
11. The actual configuration

    Threshold: 0.6

    | PROPERTY | COMPARATOR | LOW  | HIGH |
    |----------|------------|------|------|
    | NAME     | Exact      | 0.19 | 0.91 |
    | CAPITAL  | Exact      | 0.25 | 0.86 |
    | AREA     | Numeric    | 0.36 | 0.72 |

    Confusing. Why exact name comparisons? Why is area comparison given such weight? Who knows. There’s nobody to ask.
12. Semantic dogfood
    - Data about papers presented at semantic web conferences
      - has duplicate speakers
      - about 7,000 records, many long string values
    - Manually I get 88% accuracy
      - after two weeks, the script gets 82% accuracy
      - but it’s only half-way

    | Field       | Record 1                                 | Record 2                                           |
    |-------------|------------------------------------------|----------------------------------------------------|
    | Name        | Grigorios Antoniou                       | Grigoris Antoniou                                  |
    | Homepage    | http://www.ics.forth.gr/~antoniou        | http://www.ics.forth.gr/~antoniou                  |
    | Mbox_Sha1   | f44cd7769f416e96864ac43498b082155196829e | f44cd7769f416e96864ac43498b082155196829e           |
    | Affiliation |                                          | http://data.semanticweb.org/organization/forth-ics |
13. The configuration

    Threshold: 0.91

    | PROPERTY    | COMPARATOR           | LOW  | HIGH |
    |-------------|----------------------|------|------|
    | NAME        | JaroWinklerTokenized | 0.2  | 0.9  |
    | AFFILIATION | DiceCoefficient      | 0.49 | 0.61 |
    | HOMEPAGE    | Exact                | 0.09 | 0.67 |
    | MBOX_HASH   | PersonNameComparator | 0.42 | 0.87 |

    Some strange choices of comparator. PersonNameComparator?!? DiceCoefficient is essentially same as Exact, for those values. Otherwise as expected.
14. Hafslund
    - I took a subset of customer data from Hafslund
      - roughly 3000 records
      - then made a difficult manual test file, where different parts of organizations are treated as different
      - so NSB Logistikk != NSB Bane
      - then made another subset for testing
    - Manually I can do no better than 64% on this data set
      - interestingly, on the test data set I score 84%
    - With a cut-down data set, I could run the script overnight, and have a result in the morning
15. The progress of evolution
    - 1st generation
      - best scores: 0.47, 0.43, 0.3
    - 2nd generation
      - mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
      - best scores: 0.49, 0.467, 0.4, and 0.38
    - 3rd generation
      - mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
      - best scores: 0.49, 0.46, 0.45, and 0.42
    - 4th generation
      - we hit 0.525 (modified from 0.21)
16. The progress of evolution #2
    - 5th generation
      - we hit 0.568 (modified from 0.479)
    - 6th generation
      - 0.602
    - 7th generation
      - 0.702
    - ...
    - 60th generation
      - 0.765
      - I’d done no better than 0.64 manually
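17. Evaluation

    | CONFIGURATION | TRAINING | TEST  |
    |---------------|----------|-------|
    | Genetic #1    | 0.766    | 0.881 |
    | Genetic #2    | 0.776    | 0.859 |
    | Manual #1     | 0.57     | 0.838 |
    | Manual #2     | 0.64     | 0.803 |

    Threshold: 0.98

    | PROPERTY       | COMPARATOR      | LOW  | HIGH |
    |----------------|-----------------|------|------|
    | NAME           | Levenshtein     | 0.17 | 0.95 |
    | ASSOCIATION_NO | Exact           | 0.06 | 0.69 |
    | ADDRESS1       | Numeric         | 0.02 | 0.92 |
    | ADDRESS2       | PersonName      | 0.18 | 0.76 |
    | ZIP_CODE       | DiceCoefficient | 0.47 | 0.79 |
    | COUNTRY        | Levenshtein     | 0.12 | 0.64 |

    Threshold: 0.95

    | PROPERTY       | COMPARATOR      | LOW  | HIGH |
    |----------------|-----------------|------|------|
    | NAME           | Levenshtein     | 0.42 | 0.96 |
    | ASSOCIATION_NO | DiceCoefficient | 0.0  | 0.67 |
    | ADDRESS1       | Numeric         | 0.1  | 0.61 |
    | ADDRESS2       | Levenshtein     | 0.03 | 0.8  |
    | ZIP_CODE       | DiceCoefficient | 0.35 | 0.69 |
    | COUNTRY        | JaroWinklerT.   | 0.44 | 0.68 |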
18. Does it find the best configuration?
    - We don’t know
    - The experts say genetic algorithms tend to get stuck at local maxima
      - they also point out that well-known techniques for dealing with this are described in the literature
    - Rerunning tends to produce similar configurations
19. The literature
    - http://www.cleveralgorithms.com/
    - http://www.gp-field-guide.org.uk/
20. Conclusion
    - Easy to implement
      - you don’t need a GP library
    - Requires reliable test data
    - It actually works
    - Configurations may not be very tweakable
      - because they don’t necessarily make any sense
    - This is a big field, with lots to learn

    http://www.garshol.priv.no/blog/225.html