Experiments in genetic programming

Bouvet BigOne, 2012-03-29
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga




The background

• Duke
    –   open source data matching engine (Java)
    –   can find near-duplicate database records
    –   probabilistic configuration
    –   http://code.google.com/p/duke/
• People find making configurations difficult
    – can we help them?

Field       Record 1    Record 2      Probability
Name        acme inc    acme inc      0.9
Assoc no    177477707                 0.5
Zip code    9161        9161          0.6
Country     norway      norway        0.51
Address 1   mb 113      mailbox 113   0.49
Address 2                             0.5
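
The table gives one probability per field, with 0.5 meaning "no evidence either way". The slide does not say how these are merged into a single score. The sketch below assumes a naive Bayes combination, which is a common choice for this kind of evidence; the function names are made up for illustration.

```python
from functools import reduce

def combine(p1, p2):
    """Combine two independent match probabilities with Bayes' rule (assumed)."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))

def overall_probability(field_probabilities):
    # Start from 0.5 (neutral) and fold in the evidence from each field.
    return reduce(combine, field_probabilities, 0.5)

# Per-field probabilities from the table above.
fields = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
score = overall_probability(fields)
print(round(score, 3))  # prints 0.931
```

Under this assumption the 0.5 entries change nothing and 0.51 cancels against 0.49, so the result is driven by the name and zip code fields. The pair would count as a match if the score exceeds the configured threshold.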

The idea

• Given
    – a test file showing the correct linkages
• can we
    – evolve a configuration
• using
    – genetic algorithms?
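
For a genetic algorithm to use a test file, the correct linkages have to be reduced to one fitness number per configuration. The slides report "accuracy" without defining it. The sketch below assumes an F-measure over record pairs; `f_measure`, `pair` and the record ids are illustrative and not part of Duke.

```python
def f_measure(found, correct):
    """Score a set of found links against the known-correct links.

    Both arguments are sets of frozenset({id1, id2}) pairs, so the order
    of the two record ids in a link does not matter.
    """
    if not found or not correct:
        return 0.0
    true_positives = len(found & correct)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(found)
    recall = true_positives / len(correct)
    return 2 * precision * recall / (precision + recall)

def pair(a, b):
    return frozenset((a, b))

# Hypothetical test file contents and the links one configuration produced.
correct = {pair("db:1", "mo:7"), pair("db:2", "mo:9"), pair("db:3", "mo:4")}
found = {pair("mo:7", "db:1"), pair("db:2", "mo:9"), pair("db:5", "mo:4")}
result = f_measure(found, correct)
print(result)
```

Here two of the three found links are correct and two of the three correct links were found, so precision, recall and the F-measure all come out at two thirds.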




What a configuration looks like

• Threshold for accepting matches
    – a number between 0.0 and 1.0
• For each property
    – a comparator function (Exact, Levenshtein, numeric...)
    – a low probability (0.0-0.5)
    – a high probability (0.5-1.0)
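
A configuration of this shape is small enough to model directly. The sketch below generates a random one within the ranges listed above. The class names, the short comparator list and the property names are placeholders for illustration, not Duke's own classes.

```python
import random
from dataclasses import dataclass, field

# Comparator names mentioned on the slide; the real set is larger.
COMPARATORS = ["Exact", "Levenshtein", "Numeric"]

@dataclass
class PropertyConfig:
    name: str
    comparator: str
    low: float   # low probability, 0.0-0.5
    high: float  # high probability, 0.5-1.0

@dataclass
class Configuration:
    threshold: float  # threshold for accepting matches, 0.0-1.0
    properties: list = field(default_factory=list)

def random_configuration(property_names, rng=random):
    """Build a configuration with every value drawn uniformly at random."""
    config = Configuration(threshold=rng.uniform(0.0, 1.0))
    for name in property_names:
        config.properties.append(PropertyConfig(
            name=name,
            comparator=rng.choice(COMPARATORS),
            low=rng.uniform(0.0, 0.5),
            high=rng.uniform(0.5, 1.0),
        ))
    return config

config = random_configuration(["NAME", "CAPITAL", "AREA"])
print("Threshold", round(config.threshold, 2))
for prop in config.properties:
    print(prop.name, prop.comparator, round(prop.low, 2), round(prop.high, 2))
```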




The hill-climbing problem




How it works

1. Generate a population of 100 random
   configurations
2. Evaluate the population
3. Throw away the 25 worst, duplicate the 25
   best
4. Randomly modify the entire population
5. Go back to 2
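
The five steps can be tried out on a toy problem before wiring them to a real evaluator. In the sketch below a "configuration" is just a list of three numbers and the fitness function is a stand-in (closeness to a made-up target). Only the population size of 100 and the 25/25 cut come from the steps above; everything else is an assumption for illustration.

```python
import random

TARGET = [0.6, 0.2, 0.9]  # made-up optimum for the toy fitness function

def fitness(config):
    # Higher is better; 1.0 means the configuration equals TARGET.
    return 1.0 - sum(abs(a - b) for a, b in zip(config, TARGET)) / len(TARGET)

def mutate(config, rng):
    # Nudge one randomly chosen value, clamped to [0.0, 1.0].
    new = list(config)
    ix = rng.randrange(len(new))
    new[ix] = min(1.0, max(0.0, new[ix] + rng.uniform(-0.1, 0.1)))
    return new

def evolve(generations=30, size=100, seed=1):
    rng = random.Random(seed)
    # 1. generate a population of random configurations
    population = [[rng.random() for _ in TARGET] for _ in range(size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # 2. evaluate the population, best first
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) > fitness(best):
            best = population[0]
        # 3. throw away the 25 worst, duplicate the 25 best
        population = population[:25] + population[:-25]
        # 4. randomly modify the entire population
        population = [mutate(c, rng) for c in population]
        # 5. the loop takes us back to step 2
    return best

best = evolve()
print(best, round(fitness(best), 3))
```

Because step 4 modifies every member, including the current champion, the sketch keeps the best configuration seen so far in a separate variable rather than relying on it surviving in the population.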



Actual code
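# initialisation is not shown on the slide; assumed here
highest = 0.0   # best score seen so far
best = None     # configuration that achieved it

# POPULATIONS is the number of generations to run
for generation in range(POPULATIONS):
    print("===== GENERATION %s ================================" % generation)

    for c in population:
        f = evaluate(c)

        if f > highest:
            best = c
            highest = f
            show_best(best, False)

    # make new generation: sort best-first
    # (index maps configuration -> score, presumably filled in by evaluate)
    population = sorted(population, key=lambda c: index[c], reverse=True)

    # ditch lower quartile
    population = population[:-25]
    # double upper quartile
    population = population[:25] + population

    # mutate
    population = [c.make_new(population) for c in population]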



Actual code #2
import random

# Property and the aspects list are defined elsewhere; not shown on the slide.

class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    # set/get threshold, add/get properties

    def make_new(self, population):
        # either we make a number of random modifications, or we mate.
        # draw a number: if it is 0 we mate, otherwise we make that many
        # modifications.
        mods = random.randint(0, 3)
        if mods:
            return self._mutate(mods)
        else:
            return self._mate(random.choice(population))

    def _mutate(self, mods):
        c = self._copy()
        for ix in range(mods):
            aspect = random.choice(aspects)
            aspect.modify(c)
        return c

    def _mate(self, other):
        c = self._copy()
        for aspect in aspects:
            aspect.set(c, aspect.get(random.choice([self, other])))
        return c

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(),
                                        prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
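
The class relies on a list called `aspects` that the slide does not show. Judging only from the calls made on it, each aspect needs `modify`, `get` and `set` methods. The following is a guess at what a threshold aspect could look like, with a minimal stand-in configuration class so that it runs on its own. None of these names are confirmed by the slides.

```python
import random

class ToyConfiguration:
    """Minimal stand-in for GeneticConfiguration, threshold only."""
    def __init__(self, threshold=0.5):
        self._threshold = threshold

    def get_threshold(self):
        return self._threshold

    def set_threshold(self, threshold):
        self._threshold = threshold

class ThresholdAspect:
    """One tunable part of a configuration: the match threshold."""
    def modify(self, config):
        # Mutation: replace the threshold with a new random value.
        config.set_threshold(random.uniform(0.0, 1.0))

    def get(self, config):
        return config.get_threshold()

    def set(self, config, value):
        config.set_threshold(value)

aspects = [ThresholdAspect()]

# Mating as in _mate: each aspect is inherited from a randomly chosen parent.
mother, father = ToyConfiguration(0.6), ToyConfiguration(0.91)
child = ToyConfiguration()
for aspect in aspects:
    aspect.set(child, aspect.get(random.choice([mother, father])))
print(child.get_threshold())
```

A full version would presumably add further aspects per property (comparator, low probability, high probability), so that `_mutate` can pick any one of them at random.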

But ... does it work?!?




Linking countries

 • Linking countries from DBpedia and Mondial
          – no common identifiers
 • Manually I manage 95.4% accuracy
          – genetic script manages 95.7% in first generation
          – then improves to 98.9%
          – this was too easy...
Field     DBpedia                             Mondial
Id        http://dbpedia.org/resource/Samoa   17019
Name      Samoa                               Western Samoa
Capital   Apia                                Apia, Samoa
Area      2831                                2860

The actual configuration

 Threshold 0.6

 PROPERTY        COMPARATOR      LOW             HIGH
 NAME            Exact           0.19            0.91
 CAPITAL         Exact           0.25            0.86
 AREA            Numeric         0.36            0.72



                 Confusing.

                 Why exact name comparisons?

                 Why is area comparison given such weight?

                 Who knows. There’s nobody to ask.

Semantic dogfood

  • Data about papers presented at semantic web
    conferences
         – has duplicate speakers
         – about 7,000 records, many long string values
  • Manually I get 88% accuracy
         – after two weeks, the script gets 82% accuracy
         – but it’s only half-way
Field         Record 1                                   Record 2
Name          Grigorios Antoniou                         Grigoris Antoniou
Homepage      http://www.ics.forth.gr/~antoniou          http://www.ics.forth.gr/~antoniou
Mbox_Sha1     f44cd7769f416e96864ac43498b082155196829e   f44cd7769f416e96864ac43498b082155196829e
Affiliation                                              http://data.semanticweb.org/organization/forth-ics

The configuration

 Threshold 0.91

 PROPERTY           COMPARATOR                   LOW      HIGH
 NAME               JaroWinklerTokenized         0.2      0.9
 AFFILIATION        DiceCoefficient              0.49     0.61
 HOMEPAGE           Exact                        0.09     0.67
 MBOX_HASH          PersonNameComparator         0.42     0.87


                  Some strange choices of comparator.

                  PersonNameComparator?!?

                  DiceCoefficient is essentially the same as Exact for those values.

                  Otherwise as expected.

Hafslund

• I took a subset of customer data from Hafslund
     – roughly 3000 records
     – then made a difficult manual test file, where different
       parts of organizations are treated as different
     – so NSB Logistikk != NSB Bane
     – then made another subset for testing
• Manually I can do no better than 64% on this
  data set
     – interestingly, on the test data set I score 84%
• With a cut-down data set, I could run the
  script overnight, and have a result in the
  morning

The progress of evolution

• 1st generation
     – best scores: 0.47, 0.43, 0.3
• 2nd generation
     – mutated 0.47 configuration scores 0.136, 0.467, 0.002,
       and 0.49
     – best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
     – mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
     – best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
     – we hit 0.525 (modified from 0.21)


The progress of evolution #2

• 5th generation
     – we hit 0.568 (modified from 0.479)
• 6th generation
     – 0.602
• 7th generation
     – 0.702
• ...
• 60th generation
     – 0.765
     – I’d done no better than 0.64 manually

Evaluation

CONFIGURATION   TRAINING   TEST
Genetic #1      0.766      0.881
Genetic #2      0.776      0.859
Manual #1       0.57       0.838
Manual #2       0.64       0.803

Threshold: 0.98

PROPERTY         COMPARATOR        LOW    HIGH
NAME             Levenshtein       0.17   0.95
ASSOCIATION_NO   Exact             0.06   0.69
ADDRESS1         Numeric           0.02   0.92
ADDRESS2         PersonName        0.18   0.76
ZIP_CODE         DiceCoefficient   0.47   0.79
COUNTRY          Levenshtein       0.12   0.64

Threshold: 0.95

PROPERTY         COMPARATOR        LOW    HIGH
NAME             Levenshtein       0.42   0.96
ASSOCIATION_NO   DiceCoefficient   0.0    0.67
ADDRESS1         Numeric           0.1    0.61
ADDRESS2         Levenshtein       0.03   0.8
ZIP_CODE         DiceCoefficient   0.35   0.69
COUNTRY          JaroWinklerT.     0.44   0.68

Does it find the best configuration?

• We don’t know
• The experts say genetic algorithms tend to get
  stuck at local maxima
     – they also point out that well-known techniques for
       dealing with this are described in the literature
• Rerunning tends to produce similar
  configurations




The literature




     http://www.cleveralgorithms.com/   http://www.gp-field-guide.org.uk/

Conclusion

• Easy to implement
     – you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
     – because they don’t necessarily make any sense
• This is a big field, with lots to learn



http://www.garshol.priv.no/blog/225.html

Experiments in genetic programming

  • 1. Experiments in genetic programming
      Bouvet BigOne, 2012-03-29
      Lars Marius Garshol, <larsga@bouvet.no>
      http://twitter.com/larsga
  • 2. The background
      • Duke
          – open source data matching engine (Java)
          – can find near-duplicate database records
          – probabilistic configuration
          – http://code.google.com/p/duke/
      • People find making configurations difficult
          – can we help them?

      Field      Record 1   Record 2     Probability
      Name       acme inc   acme inc     0.9
      Assoc no   177477707               0.5
      Zip code   9161       9161         0.6
      Country    norway     norway       0.51
      Address 1  mb 113     mailbox 113  0.49
      Address 2                          0.5
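The slide lists one probability per field but does not say how these become a single match probability for the record pair. Here is a minimal sketch, assuming the per-field values are combined with the naive Bayes rule; that formula is an assumption for illustration, and `combine` and `record_probability` are hypothetical names.

```python
from functools import reduce


def combine(p1, p2):
    """Naive Bayes combination of two probabilities (assumed formula)."""
    return (p1 * p2) / ((p1 * p2) + (1.0 - p1) * (1.0 - p2))


def record_probability(field_probabilities):
    # 0.5 is neutral under this rule: combine(0.5, p) == p,
    # so fields scoring 0.5 do not move the result either way.
    return reduce(combine, field_probabilities, 0.5)


# Per-field probabilities from the table on the slide.
field_probabilities = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = record_probability(field_probabilities)
print(round(overall, 3))
```

Under this assumption, values above 0.5 push the result up and values below 0.5 pull it down, which would explain why each property has both a low and a high probability.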
  • 3. The idea
      • Given
          – a test file showing the correct linkages
      • can we
          – evolve a configuration
      • using
          – genetic algorithms?
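To evolve anything, each configuration has to be scored against the test file. The slides do not define the score, so this sketch assumes an F-measure over record pairs; the metric, the function name `f_measure`, and the pair representation are all assumptions made for illustration.

```python
def f_measure(found_links, correct_links):
    """Harmonic mean of precision and recall over sets of record pairs."""
    # Store each pair as a frozenset so (a, b) and (b, a) count as the same link.
    found = {frozenset(pair) for pair in found_links}
    correct = {frozenset(pair) for pair in correct_links}
    if not found or not correct:
        return 0.0
    hits = len(found & correct)
    if hits == 0:
        return 0.0
    precision = hits / len(found)
    recall = hits / len(correct)
    return 2 * precision * recall / (precision + recall)


# Hypothetical record ids: the test file says 1=2, 3=4 and 5=6.
correct_links = [(1, 2), (3, 4), (5, 6)]
# A configuration that finds two of them plus one wrong link.
found_links = [(2, 1), (3, 4), (7, 8)]
score = f_measure(found_links, correct_links)
```

Any score works for the genetic approach as long as a higher value means a better configuration, since it is only used to rank the population.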
  • 4. What a configuration looks like
      • Threshold for accepting matches
          – a number between 0.0 and 1.0
      • For each property
          – a comparator function (Exact, Levenshtein, numeric...)
          – a low probability (0.0-0.5)
          – a high probability (0.5-1.0)
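The structure above can be sketched as a small data model with a generator for random starting configurations. The class names, the `random_configuration` helper, and the comparator strings are hypothetical stand-ins, not Duke's actual API.

```python
import random
from dataclasses import dataclass, field

# Comparator names taken from the slide; here they are plain labels only.
COMPARATORS = ["Exact", "Levenshtein", "Numeric"]


@dataclass
class PropertyConfig:
    name: str
    comparator: str
    low: float   # probability when values differ, 0.0-0.5
    high: float  # probability when values match, 0.5-1.0


@dataclass
class Configuration:
    threshold: float  # accept a match above this, 0.0-1.0
    properties: list = field(default_factory=list)


def random_configuration(property_names, rng=random):
    """Draw every setting uniformly from its allowed range."""
    config = Configuration(threshold=rng.uniform(0.0, 1.0))
    for name in property_names:
        config.properties.append(PropertyConfig(
            name=name,
            comparator=rng.choice(COMPARATORS),
            low=rng.uniform(0.0, 0.5),
            high=rng.uniform(0.5, 1.0)))
    return config


config = random_configuration(["NAME", "CAPITAL", "AREA"], random.Random(1))
```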
  • 6. How it works
      1. Generate a population of 100 random configurations
      2. Evaluate the population
      3. Throw away the 25 worst, duplicate the 25 best
      4. Randomly modify the entire population
      5. Go back to 2
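The five steps can be tried out on a toy problem before touching real data. In this sketch an individual is a single number and the fitness function rewards closeness to 0.7; the `evolve` function and the three toy helpers are hypothetical, and only the 100/25 numbers come from the slide.

```python
import random

POPULATION_SIZE = 100
QUARTILE = 25


def evolve(random_individual, mutate, fitness, generations, rng):
    # Step 1: random starting population.
    population = [random_individual(rng) for _ in range(POPULATION_SIZE)]
    best, highest = None, float("-inf")
    for _ in range(generations):
        # Step 2: evaluate, best first.
        scored = [(fitness(c), c) for c in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > highest:
            highest, best = scored[0]
        ranked = [c for _, c in scored]
        # Step 3: drop the 25 worst, duplicate the 25 best.
        population = ranked[:QUARTILE] + ranked[:-QUARTILE]
        # Step 4: mutate everyone (the best is remembered separately).
        population = [mutate(c, rng) for c in population]
        # Step 5: loop.
    return best, highest


def random_individual(rng):
    return rng.uniform(0.0, 1.0)


def mutate(x, rng):
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.05)))


def fitness(x):
    return 1.0 - abs(x - 0.7)


best, score = evolve(random_individual, mutate, fitness, 20, random.Random(42))
print(best, score)
```

Because step 4 mutates the whole population, including the current best, the best individual seen so far has to be stored outside the population or it can be lost.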
  • 7. Actual code
      for generation in range(POPULATIONS):
          print("===== GENERATION %s ================================" % generation)

          for c in population:
              f = evaluate(c)
              index[c] = f  # assumed: the slide never shows where index is filled

              if f > highest:
                  best = c
                  highest = f
                  show_best(best, False)

          # make new generation: sort so the highest scores come first
          population = sorted(population, key = lambda c: 1.0 - index[c])

          # ditch lower quartile
          population = population[ : -25]
          # double upper quartile
          population = population[ : 25] + population
          # mutate
          population = [c.make_new(population) for c in population]
  • 8. Actual code #2
      import random

      class GeneticConfiguration:
          def __init__(self):
              self._props = []
              self._threshold = 0.0

          # set/get threshold, add/get properties

          def make_new(self, population):
              # either we make a number of random modifications, or we mate.
              # draw a number; if 0 modifications, we mate.
              mods = random.randint(0, 3)
              if mods:
                  return self._mutate(mods)
              else:
                  return self._mate(random.choice(population))

          def _mutate(self, mods):
              # apply `mods` random changes to a copy of this configuration
              c = self._copy()
              for ix in range(mods):
                  aspect = random.choice(aspects)
                  aspect.modify(c)
              return c

          def _mate(self, other):
              # for each aspect, take the value from a randomly chosen parent
              c = self._copy()
              for aspect in aspects:
                  aspect.set(c, aspect.get(random.choice([self, other])))
              return c

          def _copy(self):
              c = GeneticConfiguration()
              c.set_threshold(self._threshold)
              for prop in self.get_properties():
                  if prop.getName() == "ID":
                      c.add_property(Property(prop.getName()))
                  else:
                      c.add_property(Property(prop.getName(),
                                              prop.getComparator(),
                                              prop.getLowProbability(),
                                              prop.getHighProbability()))
              return c
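The code above relies on a list called `aspects` whose elements offer `get`, `set` and `modify`, but the slides do not show them. This is a guess at what such aspects could look like, using plain dictionaries as configurations so the example runs on its own; the class names and the dictionary layout are hypothetical.

```python
import copy
import random


class ThresholdAspect:
    """One tweakable setting: the match threshold."""
    def get(self, config):
        return config["threshold"]

    def set(self, config, value):
        config["threshold"] = value

    def modify(self, config):
        self.set(config, random.uniform(0.0, 1.0))


class ProbabilityAspect:
    """The low or high probability of one property."""
    def __init__(self, prop, key, lowest, highest):
        self.prop, self.key = prop, key
        self.lowest, self.highest = lowest, highest

    def get(self, config):
        return config["props"][self.prop][self.key]

    def set(self, config, value):
        config["props"][self.prop][self.key] = value

    def modify(self, config):
        self.set(config, random.uniform(self.lowest, self.highest))


def mate(mother, father, aspects):
    # Same idea as _mate: each aspect comes from a randomly chosen parent.
    child = copy.deepcopy(mother)
    for aspect in aspects:
        aspect.set(child, aspect.get(random.choice([mother, father])))
    return child


aspects = [ThresholdAspect(),
           ProbabilityAspect("NAME", "low", 0.0, 0.5),
           ProbabilityAspect("NAME", "high", 0.5, 1.0)]
a = {"threshold": 0.6, "props": {"NAME": {"low": 0.19, "high": 0.91}}}
b = {"threshold": 0.91, "props": {"NAME": {"low": 0.2, "high": 0.9}}}
child = mate(a, b, aspects)
```

Treating every setting as an aspect keeps mutation and mating generic: neither needs to know whether it is changing a threshold, a comparator or a probability.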
  • 9. But ... does it work?!?
  • 10. Linking countries
      • Linking countries from DBpedia and Mondial
          – no common identifiers
      • Manually I manage 95.4% accuracy
          – genetic script manages 95.7% in first generation
          – then improves to 98.9%
          – this was too easy...

      DBPEDIA                                     MONDIAL
      Id       http://dbpedia.org/resource/Samoa  Id       17019
      Name     Samoa                              Name     Western Samoa
      Capital  Apia                               Capital  Apia, Samoa
      Area     2831                               Area     2860
  • 11. The actual configuration
      Threshold: 0.6

      PROPERTY  COMPARATOR  LOW   HIGH
      NAME      Exact       0.19  0.91
      CAPITAL   Exact       0.25  0.86
      AREA      Numeric     0.36  0.72

      Confusing. Why exact name comparisons? Why is area comparison given such weight? Who knows. There’s nobody to ask.
  • 12. Semantic dogfood
      • Data about papers presented at semantic web conferences
          – has duplicate speakers
          – about 7,000 records, many long string values
      • Manually I get 88% accuracy
          – after two weeks, the script gets 82% accuracy
          – but it’s only half-way

      Record 1
      Name         Grigorios Antoniou
      Homepage     http://www.ics.forth.gr/~antoniou
      Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e
      Affiliation

      Record 2
      Name         Grigoris Antoniou
      Homepage     http://www.ics.forth.gr/~antoniou
      Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e
      Affiliation  http://data.semanticweb.org/organization/forth-ics
  • 13. The configuration
      Threshold: 0.91

      PROPERTY     COMPARATOR            LOW   HIGH
      NAME         JaroWinklerTokenized  0.2   0.9
      AFFILIATION  DiceCoefficient       0.49  0.61
      HOMEPAGE     Exact                 0.09  0.67
      MBOX_HASH    PersonNameComparator  0.42  0.87

      Some strange choices of comparator. PersonNameComparator?!? DiceCoefficient is essentially same as Exact, for those values. Otherwise as expected.
  • 14. Hafslund
      • I took a subset of customer data from Hafslund
          – roughly 3000 records
          – then made a difficult manual test file, where different parts of organizations are treated as different
          – so NSB Logistikk != NSB Bane
          – then made another subset for testing
      • Manually I can do no better than 64% on this data set
          – interestingly, on the test data set I score 84%
      • With a cut-down data set, I could run the script overnight, and have a result in the morning
  • 15. The progress of evolution
      • 1st generation
          – best scores: 0.47, 0.43, 0.3
      • 2nd generation
          – mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
          – best scores: 0.49, 0.467, 0.4, and 0.38
      • 3rd generation
          – mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
          – best scores: 0.49, 0.46, 0.45, and 0.42
      • 4th generation
          – we hit 0.525 (modified from 0.21)
  • 16. The progress of evolution #2
      • 5th generation
          – we hit 0.568 (modified from 0.479)
      • 6th generation
          – 0.602
      • 7th generation
          – 0.702
      • ...
      • 60th generation
          – 0.765
          – I’d done no better than 0.64 manually
  • 17. Evaluation
      CONFIGURATION  TRAINING  TEST
      Genetic #1     0.766     0.881
      Genetic #2     0.776     0.859
      Manual #1      0.57      0.838
      Manual #2      0.64      0.803

      Threshold: 0.98
      PROPERTY        COMPARATOR       LOW   HIGH
      NAME            Levenshtein      0.17  0.95
      ASSOCIATION_NO  Exact            0.06  0.69
      ADDRESS1        Numeric          0.02  0.92
      ADDRESS2        PersonName       0.18  0.76
      ZIP_CODE        DiceCoefficient  0.47  0.79
      COUNTRY         Levenshtein      0.12  0.64

      Threshold: 0.95
      PROPERTY        COMPARATOR       LOW   HIGH
      NAME            Levenshtein      0.42  0.96
      ASSOCIATION_NO  DiceCoefficient  0.0   0.67
      ADDRESS1        Numeric          0.1   0.61
      ADDRESS2        Levenshtein      0.03  0.8
      ZIP_CODE        DiceCoefficient  0.35  0.69
      COUNTRY         JaroWinklerT.    0.44  0.68
  • 18. Does it find the best configuration?
      • We don’t know
      • The experts say genetic algorithms tend to get stuck at local maxima
          – they also point out that well-known techniques for dealing with this are described in the literature
      • Rerunning tends to produce similar configurations
  • 19. The literature
      http://www.cleveralgorithms.com/
      http://www.gp-field-guide.org.uk/
  • 20. Conclusion
      • Easy to implement
          – you don’t need a GP library
      • Requires reliable test data
      • It actually works
      • Configurations may not be very tweakable
          – because they don’t necessarily make any sense
      • This is a big field, with lots to learn

      http://www.garshol.priv.no/blog/225.html