Genetic programming is used to evolve data matching configurations that maximize accuracy on test data. The algorithm generates random initial configurations, evaluates them on the test data, and uses genetic operations of selection, crossover and mutation to evolve better configurations over generations. On several datasets, the genetic algorithm is able to find configurations that improve accuracy over manual configurations. However, the evolved configurations are not always intuitive and may represent local optima rather than global optima. More techniques from genetic programming literature could help address these issues.
Experiments in genetic programming
1. Experiments in genetic programming
Bouvet BigOne, 2012-03-29
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga
2. The background
• Duke
– open source data matching engine (Java)
– can find near-duplicate database records
– probabilistic configuration
– http://code.google.com/p/duke/
• People find making configurations difficult
– can we help them?

Field      Record 1   Record 2     Probability
Name       acme inc   acme inc     0.9
Assoc no   177477707               0.5
Zip code   9161       9161         0.6
Country    norway     norway       0.51
Address 1  mb 113     mailbox 113  0.49
Address 2                          0.5
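The slide does not say how the per-property probabilities become a single match probability. A minimal sketch, assuming they are combined pairwise with a naive Bayes rule (an assumption made here for illustration, not something stated on the slide), applied to the numbers in the table above. The function names `combine` and `match_probability` are hypothetical.

```python
from functools import reduce


def combine(p1, p2):
    """Combine two independent match probabilities with Bayes' rule."""
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(probabilities):
    """Fold all per-property probabilities into one; 0.5 is the neutral start."""
    return reduce(combine, probabilities, 0.5)


# Probabilities from the example table: name, assoc no, zip code,
# country, address 1, address 2.
example = [0.9, 0.5, 0.6, 0.51, 0.49, 0.5]
overall = match_probability(example)
print(round(overall, 3))
```

Under this rule a probability of 0.5 leaves the result unchanged, which would fit the table's use of 0.5 for properties where a value is missing.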
3. The idea
• Given
– a test file showing the correct linkages
• can we
– evolve a configuration
• using
– genetic algorithms?
4. What a configuration looks like
• Threshold for accepting matches
– a number between 0.0 and 1.0
• For each property
– a comparator function (Exact, Levenshtein, numeric...)
– a low probability (0.0-0.5)
– a high probability (0.5-1.0)
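The structure described above could be represented roughly as follows. This is a sketch only: the class and function names are hypothetical, the comparator list contains just the three names mentioned on the slide, and the reading of "low" and "high" in the comments is an assumption.

```python
import random
from dataclasses import dataclass, field

# Comparator names mentioned on the slide; the real list is longer.
COMPARATORS = ["Exact", "Levenshtein", "Numeric"]


@dataclass
class PropertyConfig:
    name: str
    comparator: str
    low: float   # assumed: probability used when the values differ (0.0-0.5)
    high: float  # assumed: probability used when the values match (0.5-1.0)


@dataclass
class Configuration:
    threshold: float  # matches scoring above this are accepted
    properties: list = field(default_factory=list)


def random_configuration(property_names, rng=random):
    """Build one random configuration within the ranges listed above."""
    config = Configuration(threshold=rng.uniform(0.0, 1.0))
    for name in property_names:
        config.properties.append(PropertyConfig(
            name=name,
            comparator=rng.choice(COMPARATORS),
            low=rng.uniform(0.0, 0.5),
            high=rng.uniform(0.5, 1.0),
        ))
    return config


config = random_configuration(["NAME", "CAPITAL", "AREA"], random.Random(42))
print(config)
```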
6. How it works
1. Generate a population of 100 random configurations
2. Evaluate the population
3. Throw away the 25 worst, duplicate the 25 best
4. Randomly modify the entire population
5. Go back to 2
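The five steps can be tried out on a toy problem. The sketch below is not the Duke script: the candidates are plain floats, and `TARGET`, `fitness` and `mutate` are hypothetical stand-ins for a real configuration and its evaluation against a test file.

```python
import random

POPULATION_SIZE = 100
GENERATIONS = 50
QUARTILE = POPULATION_SIZE // 4
TARGET = 0.73  # hypothetical optimum the toy search should find


def fitness(candidate):
    """Toy fitness: higher is better, the best possible value is 0.0."""
    return -abs(candidate - TARGET)


def mutate(candidate, rng):
    """Nudge the candidate slightly, keeping it inside 0.0-1.0."""
    return min(1.0, max(0.0, candidate + rng.gauss(0.0, 0.05)))


def evolve(rng):
    # 1. Generate a population of random candidates.
    population = [rng.random() for _ in range(POPULATION_SIZE)]
    best = max(population, key=fitness)
    for _ in range(GENERATIONS):
        # 2. Evaluate the population and sort it best-first.
        population.sort(key=fitness, reverse=True)
        if fitness(population[0]) > fitness(best):
            best = population[0]
        # 3. Throw away the 25 worst, duplicate the 25 best.
        population = population[:QUARTILE] + population[:-QUARTILE]
        # 4. Randomly modify the entire population.
        population = [mutate(c, rng) for c in population]
        # 5. Go back to 2.
    return best


best = evolve(random.Random(1))
print(best, fitness(best))
```

Because every member is mutated in step 4, the best candidate is remembered separately; otherwise a good solution could be lost to a bad mutation.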
7. Actual code
for generation in range(POPULATIONS):
    print("===== GENERATION %s ================================" % generation)
    for c in population:
        f = evaluate(c)
        if f > highest:
            best = c
            highest = f
            show_best(best, False)

    # make new generation: sort best-first
    # (index is assumed to map each configuration to its score)
    population = sorted(population, key = lambda c: 1.0 - index[c])
    # ditch lower quartile
    population = population[ : -25]
    # double upper quartile
    population = population[ : 25] + population
    # mutate
    population = [c.make_new(population) for c in population]
8. Actual code #2
import random

class GeneticConfiguration:
    def __init__(self):
        self._props = []
        self._threshold = 0.0

    # set/get threshold, add/get properties

    def make_new(self, population):
        # either we make a number of random modifications, or we mate.
        # draw a number; if 0 modifications, we mate.
        mods = random.randint(0, 3)
        if mods:
            return self._mutate(mods)
        else:
            return self._mate(random.choice(population))

    def _mutate(self, mods):
        # apply `mods` random changes to a copy of this configuration
        c = self._copy()
        for ix in range(mods):
            aspect = random.choice(aspects)
            aspect.modify(c)
        return c

    def _mate(self, other):
        # take each aspect from one of the two parents, chosen at random
        c = self._copy()
        for aspect in aspects:
            aspect.set(c, aspect.get(random.choice([self, other])))
        return c

    def _copy(self):
        c = GeneticConfiguration()
        c.set_threshold(self._threshold)
        for prop in self.get_properties():
            if prop.getName() == "ID":
                c.add_property(Property(prop.getName()))
            else:
                c.add_property(Property(prop.getName(), prop.getComparator(),
                                        prop.getLowProbability(),
                                        prop.getHighProbability()))
        return c
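The `aspects` list used by `_mutate` and `_mate` is not shown on the slide. A minimal sketch of what such aspect objects might look like, assuming each one knows how to `get`, `set` and randomly `modify` a single part of a configuration. The two aspect classes and the dict-based configuration are hypothetical simplifications, not the original code.

```python
import random


class ThresholdAspect:
    """The match threshold, treated as one evolvable aspect."""

    def get(self, config):
        return config["threshold"]

    def set(self, config, value):
        config["threshold"] = value

    def modify(self, config):
        # Replace the threshold with a fresh random value.
        config["threshold"] = random.uniform(0.0, 1.0)


class HighProbabilityAspect:
    """The high probability of a single named property."""

    def __init__(self, prop_name):
        self.prop_name = prop_name

    def get(self, config):
        return config["high"][self.prop_name]

    def set(self, config, value):
        config["high"][self.prop_name] = value

    def modify(self, config):
        config["high"][self.prop_name] = random.uniform(0.5, 1.0)


aspects = [ThresholdAspect(), HighProbabilityAspect("NAME")]


def mate(a, b):
    """Copy each aspect from a randomly chosen parent into a new child."""
    child = {"threshold": a["threshold"], "high": dict(a["high"])}
    for aspect in aspects:
        aspect.set(child, aspect.get(random.choice([a, b])))
    return child


mum = {"threshold": 0.6, "high": {"NAME": 0.91}}
dad = {"threshold": 0.9, "high": {"NAME": 0.75}}
child = mate(mum, dad)
print(child)
```

Treating every tunable part as an aspect keeps mutation and mating generic: adding a new kind of parameter only requires a new aspect class.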
10. Linking countries
• Linking countries from DBpedia and Mondial
– no common identifiers
• Manually I manage 95.4% accuracy
– genetic script manages 95.7% in first generation
– then improves to 98.9%
– this was too easy...

FIELD    DBPEDIA                            MONDIAL
Id       http://dbpedia.org/resource/Samoa  17019
Name     Samoa                              Western Samoa
Capital  Apia                               Apia, Samoa
Area     2831                               2860
11. The actual configuration
Threshold 0.6

PROPERTY  COMPARATOR  LOW   HIGH
NAME      Exact       0.19  0.91
CAPITAL   Exact       0.25  0.86
AREA      Numeric     0.36  0.72

Confusing.
Why exact name comparisons?
Why is area comparison given such weight?
Who knows. There’s nobody to ask.
12. Semantic dogfood
• Data about papers presented at semantic web conferences
– has duplicate speakers
– about 7,000 records, many long string values
• Manually I get 88% accuracy
– after two weeks, the script gets 82% accuracy
– but it’s only half-way

FIELD        RECORD 1                                  RECORD 2
Name         Grigorios Antoniou                        Grigoris Antoniou
Homepage     http://www.ics.forth.gr/~antoniou         http://www.ics.forth.gr/~antoniou
Mbox_Sha1    f44cd7769f416e96864ac43498b082155196829e  f44cd7769f416e96864ac43498b082155196829e
Affiliation                                            http://data.semanticweb.org/organization/forth-ics
13. The configuration
Threshold 0.91

PROPERTY     COMPARATOR            LOW   HIGH
NAME         JaroWinklerTokenized  0.2   0.9
AFFILIATION  DiceCoefficient       0.49  0.61
HOMEPAGE     Exact                 0.09  0.67
MBOX_HASH    PersonNameComparator  0.42  0.87

Some strange choices of comparator.
PersonNameComparator?!?
DiceCoefficient is essentially same as Exact, for those values.
Otherwise as expected.
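The remark about DiceCoefficient can be illustrated with a small sketch. It assumes a token-based Dice coefficient that splits values on whitespace and compares tokens exactly (an assumption, since the slide does not define the comparator). Affiliation values are URIs with no whitespace, so each one is a single token. The function name and the example.org URL are hypothetical.

```python
def dice_coefficient(s1, s2):
    """Dice coefficient over whitespace-separated tokens, compared exactly."""
    tokens1 = set(s1.split())
    tokens2 = set(s2.split())
    if not tokens1 or not tokens2:
        return 0.0
    common = len(tokens1 & tokens2)
    return 2.0 * common / (len(tokens1) + len(tokens2))


url = "http://data.semanticweb.org/organization/forth-ics"
same = dice_coefficient(url, url)                        # 1.0
different = dice_coefficient(url, "http://example.org/x")  # 0.0
partial = dice_coefficient("acme inc", "acme ltd")        # 0.5
print(same, different, partial)
```

With single-token values the result can only be 1.0 or 0.0, which is the same behaviour as an exact comparison; partial scores only appear for multi-token values.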
14. Hafslund
• I took a subset of customer data from Hafslund
– roughly 3000 records
– then made a difficult manual test file, where different parts of organizations are treated as different
– so NSB Logistikk != NSB Bane
– then made another subset for testing
• Manually I can do no better than 64% on this data set
– interestingly, on the test data set I score 84%
• With a cut-down data set, I could run the script overnight, and have a result in the morning
15. The progress of evolution
• 1st generation
– best scores: 0.47, 0.43, 0.3
• 2nd generation
– mutated 0.47 configuration scores 0.136, 0.467, 0.002, and 0.49
– best scores: 0.49, 0.467, 0.4, and 0.38
• 3rd generation
– mutated 0.49 scores 0.001, 0.49, 0.46, and 0.25
– best scores: 0.49, 0.46, 0.45, and 0.42
• 4th generation
– we hit 0.525 (modified from 0.21)
16. The progress of evolution #2
• 5th generation
– we hit 0.568 (modified from 0.479)
• 6th generation
– 0.602
• 7th generation
– 0.702
• ...
• 60th generation
– 0.765
– I’d done no better than 0.64 manually
17. Evaluation
CONFIGURATION  TRAINING  TEST
Genetic #1     0.766     0.881
Genetic #2     0.776     0.859
Manual #1      0.57      0.838
Manual #2      0.64      0.803

Threshold: 0.98
PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.17  0.95
ASSOCIATION_NO  Exact            0.06  0.69
ADDRESS1        Numeric          0.02  0.92
ADDRESS2        PersonName       0.18  0.76
ZIP_CODE        DiceCoefficient  0.47  0.79
COUNTRY         Levenshtein      0.12  0.64

Threshold: 0.95
PROPERTY        COMPARATOR       LOW   HIGH
NAME            Levenshtein      0.42  0.96
ASSOCIATION_NO  DiceCoefficient  0.0   0.67
ADDRESS1        Numeric          0.1   0.61
ADDRESS2        Levenshtein      0.03  0.8
ZIP_CODE        DiceCoefficient  0.35  0.69
COUNTRY         JaroWinklerT.    0.44  0.68
18. Does it find the best configuration?
• We don’t know
• The experts say genetic algorithms tend to get stuck at local maxima
– they also point out that well-known techniques for dealing with this are described in the literature
• Rerunning tends to produce similar configurations
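The slide does not name the techniques the experts had in mind. As one illustration only, a commonly described way to keep diversity in a population is to replace part of it with fresh random members each generation (sometimes called random immigrants). The sketch below assumes a population already sorted best-first; `inject_immigrants` and `make_random` are hypothetical names.

```python
import random


def inject_immigrants(population, make_random, fraction=0.1, rng=random):
    """Replace the tail of a best-first population with fresh random members."""
    count = int(len(population) * fraction)
    if count == 0:
        return list(population)
    survivors = population[:-count]
    return survivors + [make_random(rng) for _ in range(count)]


rng = random.Random(7)
# Toy population of floats, sorted best-first (higher is better here).
population = sorted((rng.random() for _ in range(20)), reverse=True)
refreshed = inject_immigrants(population, lambda r: r.random(), 0.25, rng)
print(len(refreshed))
```

New random members give the search a chance to leave a local maximum, at the cost of spending some evaluations on candidates that are usually poor.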
19. The literature
http://www.cleveralgorithms.com/
http://www.gp-field-guide.org.uk/
20. Conclusion
• Easy to implement
– you don’t need a GP library
• Requires reliable test data
• It actually works
• Configurations may not be very tweakable
– because they don’t necessarily make any sense
• This is a big field, with lots to learn
http://www.garshol.priv.no/blog/225.html