Monte Carlo Simulation∗
Greg Kochanski
http://kochanski.org/gpk
2005/03/10 10:45:56 UTC
1 Introduction
The idea behind Monte-Carlo simulations gained its name and its first major use in 1944 [Pllana, 2000], in the
research work to develop the first atomic bomb. The scientists working on the Manhattan Project had intractably
difficult equations to solve in order to calculate the probability with which a neutron from one fissioning Uranium1
atom would cause another to fission. The equations were complicated because they had to mirror the complicated
geometry of the actual bomb, and the answer had to be right because, if the first test failed, it would be months
before there was enough Uranium for another attempt.
They solved the problem with the realization that they could follow the trajectories of individual neutrons,
one at a time, using teams of humans implementing the calculation with mechanical calculators [Feynman, 1985,
Man, 2004]. At each step, they could compute the probabilities that a neutron was absorbed, that it escaped from
the bomb, or that it started another fission reaction. They would pick random numbers and, with the appropriate
probabilities at each step, stop their simulated neutron or start new chains from the fission reaction.
The brilliant insight was that the simulated trajectories would have identical statistical properties to the
real neutron trajectories, so that you could compute reliable answers for the important question, which was
the probability that a neutron would cause another fission reaction. All you had to do was simulate enough
trajectories.
When Simulation is Valuable: Q: In a free fall, how long would it take to reach the ground from a height
of 1,000 feet? A: I have never performed this experiment.
2 Simple Example
2.1 Birthday Problem – Classical Approach
Simple examples of Monte-Carlo simulation are almost embarrassingly simple. Suppose we want to find out the
probability that, out of a group of thirty people, two people share a birthday. It’s a classic problem in probability,
with a surprisingly large answer.
Classically, you approach it like this: Pick people (and their birthdays) randomly, one at a time. We will keep
track of the probability that there are no shared birthdays.
• The first person can have any birthday, and there is still a 100% chance of no shared birthdays.
• The second person has one chance of overlapping with the first person, so there is a 364/365 chance of
placing him/her without an overlap. The probability of no shared birthdays is 364/365.
∗ This work is licensed under the Creative Commons Attribution License. To view a copy of this license, visit
http://creativecommons.org/licenses/by/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford,
California 94305, USA. This work is available under http://kochanski.org/gpk/teaching/0401Oxford.
1 Or, Plutonium, of course.
• The third person has two chances of overlapping with the first two people, so there is a 363/365 chance
of placing him/her without overlaps (two days are taken). The probability of no shared birthdays is now
(364/365) · (363/365).
• The fourth person has three chances of overlapping with the first three people, so there is a
362/365 chance of placing him/her without overlaps. The probability of no shared birthdays is now
(364/365) · (363/365) · (362/365).
• ...
• The thirtieth person has 29 chances of overlapping with the first twenty-nine people, so there is a 336/365
chance of placing him/her without overlaps. The probability of having no shared birthdays is now
(364/365) · (363/365) · (362/365) · . . . · (336/365).
The overall probability of no overlapping birthdays is then 0.294, giving a 71% chance that at least one pair of
people have overlapping birthdays. It’s not too complex if you see the trick of keeping track of the probability of zero overlaps, rather than trying to add up the probability of one or more overlaps. It also takes some thought to realize that the probabilities are conditioned properly, so that multiplying together all the various P(Nth person doesn’t overlap | first N−1 people don’t overlap) factors gives the overall probability of no shared birthdays.
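The classical product is easy to check numerically. The following few lines (not part of the original lecture code) simply multiply the factors:

#!/usr/bin/env python
# Exact (non-Monte-Carlo) check of the classical birthday calculation above.
p_none = 1.0                        # probability of no shared birthdays so far
for k in range(1, 30):              # the 2nd through 30th person added
    p_none *= (365.0 - k) / 365.0   # k birthdays are already taken
print 'P(no shared birthday) =', p_none         # about 0.294
print 'P(at least one match) =', 1.0 - p_none   # about 0.706, i.e. roughly 71%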
2.2 Birthday Problem – Monte-Carlo Approach
The solution here is conceptually very simple:
1. Pick 30 random numbers in the range [1,365]. Each number represents one day of the year.
2. Check to see if any of the thirty are equal.
3. Go back to step 1 and repeat 10,000 times.
4. Report the fraction of trials that have matching birthdays.
A computer program in Python to do this calculation is quite simple:
#!/usr/bin/env python
import random                 # Get a random number generator.

NTRIALS = 10000               # Enough trials to get a reasonably accurate answer.
NPEOPLE = 30                  # How many people in the group?
matches = 0                   # Keep track of how many trials have matching birthdays.
for trial in range(NTRIALS):              # Do a bunch of trials...
    taken = {}                            # A place to keep track of which birthdays
                                          # are already taken on this trial.
    for person in range(NPEOPLE):         # Put the people's birthdays down, one at a time...
        day = random.randint(1, 365)      # On a randomly chosen day.
        if day in taken:
            matches += 1                  # A match!
            break                         # No need to look for more than one.
        taken[day] = 1                    # Mark the day as taken.
print 'The fraction of trials that have matching birthdays is', float(matches)/NTRIALS
And the answer is:
The fraction of trials that have matching birthdays is 0.7129
3 Example in Class
• How many raisins do you add to a batch of dough to make M cookies to make sure (with probability P )
that a random cookie has at least N raisins?
• How about that 99% of the cookies will have at least one raisin?
• How about that all the cookies will (with probability P ) have at least one raisin?
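These questions can be attacked with the same machinery as the birthday problem. Here is a minimal sketch for the first question; the values M = 24 cookies, N = 1 raisin and P = 0.95 are arbitrary choices for illustration (they are not given in the text), and raisins are assumed to land in cookies independently and uniformly at random:

#!/usr/bin/env python
# Monte-Carlo sketch for the raisin question; all specific numbers are illustrative assumptions.
import random

def prob_random_cookie_has_n(nraisins, M, N, ntrials=2000):
    """Estimate P(a randomly chosen cookie has at least N raisins)."""
    successes = 0
    for trial in range(ntrials):
        counts = [0] * M
        for r in range(nraisins):
            counts[random.randint(0, M - 1)] += 1   # each raisin lands in a random cookie
        if counts[random.randint(0, M - 1)] >= N:   # inspect one cookie at random
            successes += 1
    return successes / float(ntrials)

# Increase the number of raisins (in coarse steps) until the estimate reaches P = 0.95.
nraisins = 0
while prob_random_cookie_has_n(nraisins, 24, 1) < 0.95:
    nraisins += 10
print 'Roughly', nraisins, 'raisins are needed for 24 cookies.'

For the second and third questions one would check either 99% of the cookies or all M cookies in each trial, rather than a single randomly chosen one, and use more trials near the threshold because the estimate itself is noisy.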
4 A Linguistic Example
Let’s try an artificial model with a certain amount of linguistic reality, to see how Monte-Carlo techniques might
be applied to bigger problems.
Imagine we are studying English names, and wish to understand their origins. We hypothesise that names
were generated by the operation of five processes:
1. A name could be prepended by a random descriptive nickname: (e.g., “speedy”).
2. A name could be lengthened by appending a place name or an occupation.
3. A name could be shortened by dropping any sequence of adjacent syllables, so long as two syllables remain.
4. If a new name is identical to a place name, reject the change that led to this and try again.
5. If a new name is identical to a common name, possibly reject the change that led to this and try again.
How can we test this hypothesis? How can we even understand what it will do? How can we compare it to
recorded names?
We will build a little universe that generates names, and then compare the simulated names to the real
records that we have. Specifically, we will run the simulation to produce a good-sized sample of data, then
compute P(Data | Model_i). We can do this several times with several different models and compare them by using (you guessed it!) Bayes’ Theorem to compute P(Model_i | Data). We can also do various standard hypothesis tests
to see how well any of our attempted models match the data. Thus, we can test the model by simulating it, and
seeing how well it reproduces the names that we have collected.
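For reference, Bayes’ Theorem here takes its usual form, P(Model_i | Data) = P(Data | Model_i) P(Model_i) / Σ_j P(Data | Model_j) P(Model_j), where the prior P(Model_i) expresses how plausible model i is before any data are seen.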
Now, the above description of the hypothesis is not complete. It is the framework for a family of related
models (models in the sense of Bayes’ Theorem). In order to work with this, we need to make this hypothesis
specific enough so that we can actually compute one or more lists of names to compare to reality. To complete the
hypothesis, we need to know how often these five processes operate. We need to know things like a probability
distribution for the nicknames in process 1. We need to know the distribution of place names or occupations in
process 2. We need to know the probability distribution of the number of syllables to drop in process 3.
Some of these things we can guess from existing documents: lists of place names and occupations, along with
their relative frequencies. Others, we can’t guess, not without looking at our data, the names in the documents.
Consequently, we don’t guess; we will put an adjustable parameter in the model for these numbers, and we will eventually let the combination of the data and the model tell us which values of the adjustable parameters work and which don’t.
What is an adjustable parameter?: An adjustable parameter is a place-holder in a family of models. Every value of the parameter (or each combination of the parameters, if you have more than one) creates a new model. Often, you look for the value(s) of the adjustable parameter(s) that give the best fit to some data.
4.1 How do we compare the simulation to reality?
Given that there are so many possible names, it might be too much to ask our little simulation to produce the
exact set of names that exist in the documents. Not that it would be intrinsically wrong to ask the simulation to
produce real names, but you could easily run into either of two practical problems:
1. You might have to run far too many simulations before you find one that produces more than a few of the
names that actual Englishmen had. It could well be that the set of possible names was far larger than the
set of Englishmen, for instance. If so, even a correct simulation might well produce the names of Englishmen
in some alternate universe – names that they might have had, had their mothers married someone different.
This could mean that you run out of time or money before you find a good solution.
2. You might find that you need to know too many things to get the solution to match in enough detail.
Suppose you needed to know the probability that each possible nickname could end up in a name. One
cannot deduce that information without looking at some data; if looking at descriptive nicknames in other
languages is not sufficient, your only hope would be to put an adjustable parameter in for the probability
of each nickname, and hope that the data will actually allow you to choose these values.
By doing that, you will probably end up with a family of models that has too many adjustable parameters for
the available data. Aside from taking longer and costing more, you will find that many of these models in the
family are equally good fits to the available data, and consequently, there are many possible combinations
of adjustable parameters that are equally likely, given the data.
Degeneracy: The technical term for a situation where the data allow many (nearly-) equally good hypotheses
is “degeneracy”. A family of models is degenerate with respect to a set of data if the data are not sufficient to
select a single model (or a small group of closely related models) out of the family.
The real problem is that, along with many possible values of these parameters that you might not really
care about, any adjustable parameters that you actually do care about may well become quite uncertain,
too.
Consequently, it might be more practical, rather than matching names in detail, to match classes of names,
and to see how well the simulation generates names in the right class, even if the names are not exactly the same.
For our example here, our classes will be names of one, two, three, . . . syllables, and names that contain the
syllable “smith”.
4.2 Implementing the Model
The complete model, as implemented, assumes that the probability of process 3 (shortening) is proportional to the length of the name, and that the probability of process 5 (rejection) is proportional to the frequency of
the name you bump into. We assume that nicknames, place-names, and occupations are simply drawn from
an equally-probable list. (Needless to say, this shouldn’t be taken too seriously as an actual model of name
generation.)
Thus, there are four adjustable parameters: the probability that processes 1 and 2 will operate on a name in
each generation, and the constants of proportionality for processes 5 and 3. We initialise the names from place
and occupational names2. I run the model for a constant population of 10,000 people, for 20 generations.
Now, with the parameters set to (0.25, 0.25, 0.25, and 5), the simulation runs in 30 seconds,
and yields the following names (these are the most common names, each followed by their frequency):
2 Initialising the names from place and occupational names was an arbitrary choice, which was made on the assumption that it
wouldn’t matter. That is indeed the case with the set of adjustable parameters that I started with. With the initial set of parameters,
most names would be heavily modified within 20 generations: syllables would be added and deleted, and the initial name would
typically have been removed at one generation or another. The processes (with the initial values) would lose all memory of the
original names fairly rapidly. However, as it turns out, the best-fit parameters that we will derive later do not modify the names so
rapidly. If this were a real analysis instead of a tutorial, we’d have to go back and think carefully about what the initial set of names
might be.
Name Occurrences Name Occurrences Name Occurrences Name Occurrences
far-rier 49 butch-er 49 black-smith 49 sai-er 47
por-ter 45 weav-er 44 farm-er 43 gold-smith 42
mer-town 41 ding-ton 40 bar-ber 39 push-er 38
ling-ton 37 scho-lar 36 red-er 36 brew-er 36
goat-herd 35 shep-herd 34 tan-er 33 con-er 33
book-er 32 ba-ker 32 spee-ter 31 cow-herd 31
tai-lor 30 cow-man 30 stink-er 29 spee-er 29
bur-ry 29 black-ter 29 black-er 29 big-er 29
pen-ter 28 ing-don 27 arch-er 27 scot-er 26
red-ter 25 mas-ter 25 stink-smith 21 ... ...
The names have some relation to recent English names, but there are lots of names that simply don’t often
occur in the real world (e.g. “penter”, “stinksmith”, “speeter”). And, finally, because it is built from a small
number of occupational names, place names, and nicknames, the simulation will not cover all possible syllables,
so it misses many English names. Some of these failings are intrinsic to the family of models; some could be
improved by adjusting the adjustable parameters, and some could be improved by getting a complete set of place
names3 . However, we anticipated some of these problems, and we are not making a word-by-word comparison.
We are going to see if it gets some of the statistical properties of names correct: specifically the distribution of
lengths and the fraction of names that contain “smith”.
We will compare it to statistics from a list of 27882 names of New Jersey employees of Lucent Technologies.
Of these, 27041 had given names, rather than an initial in the database. We will approximate a syllable count
for each name by assuming that you get one syllable per three letters, and rounding to the nearest integer.
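Written as code, that approximation is a one-line helper (mine, not part of the script in Appendix A):

def approx_syllables(name):
    # One syllable per three letters, rounded to the nearest integer.
    return int(round(len(name) / 3.0))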
In all, 10.6% of the names that the simulation produces contain “smith”, compared to 0.6% in our reference data. Looking at the length histogram, we get the following distribution of lengths:

Length   Fraction (simulation)   Fraction (data)
1        0                       0
2        43%                     3.6%
3        20%                     20%
4        12%                     46%
5        10%                     23%
6        7%                      5%
7        4%                      0.9%
8        2%                      0.2%
9        1%                      0.02%
...      ...                     ...
No single-syllable names: The fact that we generate no one-syllable names is interesting. That lack comes primarily from process 3, which never shortens a name down to one syllable, combined with processes 1 and 2, which will, over the course of many generations, extend single-syllable names by either adding a prefix nickname or a place or occupation name as a suffix.
As you can see, this isn’t a very good match. Our simulation gives far too many Smiths, far too many
two-syllable names, and too many names with seven or more syllables. If we do a hypothesis test, comparing our
statistics, using (for instance) a chi-squared test on the difference between the simulated results and the actual
data, we will find that we can reject this model at the 99.99% confidence level or beyond.
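The chi-squared statistic itself is simple to compute. The sketch below (again mine, not the author’s script) compares simulated and expected counts for the length classes in the table above, treating the reference fractions as exact expected proportions, which is a simplification:

# Rough chi-squared comparison over name-length classes 2-9.
sim_frac  = [0.43, 0.20, 0.12, 0.10, 0.07, 0.04, 0.02, 0.01]       # simulation
data_frac = [0.036, 0.20, 0.46, 0.23, 0.05, 0.009, 0.002, 0.0002]  # reference data
N = 10000                                       # number of simulated names
chi2 = 0.0
for s, d in zip(sim_frac, data_frac):
    observed = s * N                            # simulated count in this class
    expected = d * N                            # count expected from the data
    chi2 += (observed - expected) ** 2 / expected
print 'chi-squared statistic:', chi2            # enormous, so the model is rejected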
However, this is not the only model in the family. We have four adjustable parameters we shall adjust, in
hopes of getting a better representation of the data.
3 And, we should probably weight place names by their population. After all, there were a lot more people from York than from
Wytham.
MAP and Hypothesis testing: Note that you may want to know which is the best of N different models.
That would involve a MAP (Maximum A-Posteriori) comparison between models. This question of best-fit
is different from asking whether a particular model is acceptable by a hypothesis test. While one prefers a
situation where the best fit model of a family is not rejected, it happens sometimes. If so, it means that the
family is not an accurate description of the data. Even so, the best fit model could still be interesting either as
an approximation to the data or as an important clue toward a future, better description of the data.
If one changes the adjustable parameters to improve the match between the simulation and the data, one can obtain a considerably better match to the data, as you can see from the table below. We still have far too many Smiths (8% vs. 0.6%), but most of the frequencies of lengths of simulated names are now within a factor of two of the data. Overall, it is still not a good fit to the data, even though we have reduced the chi-squared statistic by a factor of three.

Length   Fraction (best-fit simulation)   Fraction (data)   Fraction (first simulation)
1        0                                0                 0
2        19%                              3.6%              43%
3        31%                              20%               20%
4        27%                              46%               12%
5        14%                              23%               10%
6        6%                               5%                7%
7        2%                               0.9%              4%
8        0.5%                             0.2%              2%
9        0.2%                             0.02%             1%
...      ...                              ...               ...
A sample of the most common names produced by this best-fit model (within its family) follows. Some of the
names look quite plausible:
Name Occurrences Name Occurrences Name Occurrences Name Occurrences
scot-er 29 blond-er 26 shrimp-er 24 ee-er 24
red-er 22 push-ter 22 stink-ee-er 21 big-er 21
spee-er 20 book-worm-er 20 book-er 19 shrimp-ton 18
shrimp-smith 18 spee-dee-ter 17 red-smith 17 red-ee 17
big-ter 17 stink-ee 16 shrimp-ter 16 ee-smith 16
black-er 16 stink-ter 15 stink-smith 14 push-ee-er 14
black-smith 14 stink-don 13 spee-ter 13 spee-smith 13
sai-ee 13 book-ter 13 black-ter 13 big-don 13
spee-ley 12 scot-ton 12 scot-smith 12 scot-ee 12
sai-lor-er 12 dee-er 12 blond-ee-er 12 big-smith 12
stink-ee-ee-er 11 shrimp-herd 11 sai-ter 11 sai-lor-ter 11
sai-ley 11 ee-ter 11 ... ... ... ...
So, what did we do to the adjustable parameters to improve things?
• We increased the probability of the first process (prepending a nickname) from 25% to 44% per generation.
• We reduced the probability of appending a place or occupational name from 25% to 0.01%. To all prac-
tical purposes, we have completely ceased to include place and occupational signifiers into names. Any
occupational names that persist (e.g. “Smith”) are survivals from the initial set of names.
• We have reduced the probability of syllable deletion: it is now 0.14 times the length of the name, rather
than 0.25 times the length of the name.
• Finally, we have dramatically reduced the extent to which new names avoid existing names. To a good
approximation, a name is now generated without consideration of whether or not other people have the
same name.
If this were a real paper, or if the fit were better, we would conclude that, to the extent that this model is valid, names last for many generations, and that nicknames have recently been4 more important as a source of syllables than occupational or place names. However, this is just a toy model, and should not be taken seriously, except to point out a path that could be followed.
4 Over the last 10 or 20 generations.
The Python source code for the relevant parts of the model is found in Appendix A.
References
Richard Feynman. Surely You’re Joking, Mr. Feynman! Bantam, 1985.
History – Los Alamos – Oversight Committee Formed. The Manhattan Project Heritage Preservation Association, 2004. URL http://www.childrenofthemanhattanproject.org/HISTORY/H-06c12.htm.
Sabri Pllana. History of Monte Carlo Method. August 2000. URL http://www.geocities.com/CollegePark/Quad/2435/index.html.
A Computer Implementation of the Model
#!/usr/bin/env python
"""This script simulates the generation of English names.
It is also used to find the set of parameters that does
the best job of generating names.
This work is available under http://kochanski.org/gpk/teaching/0401Oxford ,
part of the lecture titled ‘‘Monte Carlo Simulations,’’
from the Hilary Term 2004 course.
"""
# This work is licensed under the Creative Commons Attribution License.
# To view a copy of this license,
# visit http://creativecommons.org/licenses/by/1.0/
# or send a letter to Creative Commons,
# 559 Nathan Abbott Way, Stanford, California 94305, USA.
# HISTORY
# Written and copyright by Greg Kochanski, 2004.
import random # Random number generators.
import Numeric # Math on vectors and matrices.
import math # Other maths functions.
# A list of nicknames. All are assumed to be equally probable.
Nicknames = [
    ['red'],
    ['spee', 'dee'],
    ['big'],
    ['push', 'ee'],
    ['blond', 'ee'],
    ['shrimp'],
    ['stink', 'ee'],
    ['book', 'worm'],
    ['sai', 'lor'],
    ['scot'],
    ['black']
    ]

# A list of occupational names:
Occ = [
    ['smith'], ['butch', 'er'], ['farm', 'er'],
    ['cow', 'man'], ['weav', 'er'],
    ['tai', 'lor'], ['tan', 'er'],
    ['brew', 'er'], ['vel', 'lum', 'mak', 'er'],
    ['car', 'pen', 'ter'], ['groom'],
    ['far', 'rier'], ['black', 'smith'],
    ['bar', 'ber'], ['gold', 'smith'],
    ['arch', 'er'], ['cook'], ['ba', 'ker'],
    ['cow', 'herd'], ['shep', 'herd'],
    ['goat', 'herd'], ['fal', 'con', 'er'],
    ['scho', 'lar'], ['mas', 'ter'],
    ['por', 'ter']
    ]

# A list of place names:
Place = [
    ['ox', 'ford'], ['hink', 'sey'],
    ['wy', 'tham'], ['thame'],
    ['wynch', 'wood'], ['bot', 'ley'],
    ['sum', 'mer', 'town'],
    ['lon', 'don'], ['york'],
    ['ches', 'ter'], ['read', 'ing'],
    ['bath'], ['ave', 'bur', 'ry'],
    ['dor', 'ches', 'ter'], ['mar', 'ston'],
    ['hea', 'ding', 'ton'], ['cow', 'ley'],
    ['cum', 'nor'], ['kid', 'ling', 'ton'],
    ['saint', 'giles'],
    ['ab', 'ing', 'don']
    ]
class ModelParameters:
    __doc__ = """This class contains the adjustable parameters
    for the family of models."""

    def __init__(self, prms=None):
        """This function creates an instance of the class."""
        if prms is None:              # Default parameters
            self.p1 = 0.25
            self.p2 = 0.25
            self.p3 = 0.25
            self.pdup = 5.0
        else:
            # Take parameters from an array on the argument list.
            self.p1, self.p2, self.p3, self.pdup = prms

    def not_ok(self):
        """This function tests if the adjustable parameters are silly
        or not.
        """
        if self.p1 < 0 or self.p1 > 1:
            return 'p1'
        if self.p2 < 0 or self.p2 > 1:
            return 'p2'
        if self.p3 < 0 or self.p3 > 1:
            return 'p3'
        if self.pdup < 0:
            return 'pdup'

def xp(old, new, operation):
    """Print the individual operations that transform
    one name into a new one.
    """
    print old, ' (%s) ->' % operation, new
class Name:
    __doc__ = """This class represents a single name,
    and for convenience, it also stores
    the parameters that control
    the processes that transform names."""

    def __init__(self, syllablelist, nprm):
        """Create a name from a list of its syllables (syllablelist)
        and the adjustable parameters (nprm)."""
        self.sl = syllablelist
        self.np = nprm

    def __str__(self):
        """Represent the name as a string."""
        return ' '.join(self.sl)

    __repr__ = __str__

    def p1(self):
        """Process 1: Prepend a nickname."""
        o = Name(random.choice(Nicknames) + self.sl, self.np)
        # xp(self, o, 'prepend')
        return o

    def p2(self):
        """Process 2: Append a placename or occupation."""
        o = Name(self.sl + random.choice(PlaceOcc), self.np)
        # xp(self, o, 'append')
        return o

    def p3(self):
        """Process 3: Drop syllables."""
        ns = len(self.sl)
        if ns <= 2:
            # If the name is already short, just return a copy.
            return Name(self.sl, self.np)
        while 1:
            # Try to delete a range of syllables, and see if
            # it leaves at least two syllables.
            dropstart = random.randint(0, ns-1)
            dropend = random.randint(1, ns-1)
            if dropstart <= dropend and dropstart + (ns-dropend) >= 2:
                break    # Yes! An acceptable drop.
        o = Name(self.sl[:dropstart] + self.sl[dropend:], self.np)
        # xp(self, o, 'drop')
        return o
    def evolve(self, namedict):
        """This generates the next generation's form of the
        current name."""
        # print 'NN :', self.sl
        while 1:
            x = random.random()
            tmp = Name(self.sl, self.np)
            if x < self.np.p1:
                tmp = tmp.p1()
            if x < self.np.p2:
                tmp = tmp.p2()
            if x < self.np.p3 * len(self.sl):
                tmp = tmp.p3()
            # Check to see if the new name duplicates other names already
            # out in the population. If so, how many? Also, does
            # it duplicate a place or occupational name?
            dups = namedict.get(str(tmp), 0) + 100000*Placedict.get(str(tmp), 0)
            if random.random() > self.np.pdup * float(dups)/float(len(namedict)):
                break    # Good enough!
            # print ' TOO COMMON'
        return tmp

    def __cmp__(self, other):
        """Compare two names."""
        return cmp(self.sl, other.sl)

def generation(namelist):
    """Computes the names in generation N+1 from the array that is passed
    into it (generation N)."""
    namedict = {}
    for t in namelist:
        namedict[str(t)] = namedict.get(str(t), 0) + 1
    N = len(namelist)
    nnl = []
    for i in range(N):
        # We randomly choose names to breed.
        # Some names will therefore have no descendents;
        # some will have more than one.
        parent = random.choice(namelist)
        nn = parent.evolve(namedict)
        nnl.append(nn)
    return nnl

def print_statistics(namelist):
    """A helper function to let you watch the evolution
    of the statistical distribution of names. It prints
    out some summary statistics; run() calls it every generation.
    """
    N = len(namelist)
    lenhist = {}
    smith = 0
    for name in namelist:
        ln = len(name.sl)
        lenhist[ln] = lenhist.get(ln, 0) + 1
        smith += 'smith' in name.sl
    for l in range(10):
        print '#LEN:', l, '%.3f' % (lenhist.get(l, 0)/float(N))
    print '#SMITH:', smith/float(N)
def write_histogram(namelist, nfd):
    """A helper function: it writes a histogram of names
    to a file, to allow debugging.
    """
    hist = {}
    for n in namelist:
        sn = str(n)
        hist[sn] = hist.get(sn, 0) + 1
    histlist = [ (v, k) for (k, v) in hist.items() ]
    histlist.sort()
    histlist.reverse()
    for (v, k) in histlist:
        if v > 1:
            nfd.writelines('%s %d\n' % (k, v))

def run(N, prms=None):
    """This function runs and prints 20 generations of statistics."""
    np = ModelParameters(prms)
    # Set up the names in the first generation:
    namelist = []
    for i in range(N):
        namelist.append( Name(random.choice(Nicknames+PlaceOcc), np) )
    # Run the simulation:
    for t in range(20):
        print '# GENERATION ', t
        namelist = generation(namelist)
        print_statistics(namelist)
    # Print the most common names in the final generation:
    import sys
    write_histogram(namelist, sys.stdout)
def resid(x, N):
    """This function is used to find the best fit values
    of the adjustable parameters.
    It is called by an external script (not supplied)
    that adjusts the parameters, calls resid(),
    and looks to see whether or not the new model
    (based on the adjusted parameters) is a better
    or worse fit to our data.

    Argument x is an array of parameters to control the name generation
    process.
    Argument N is the number of names to simulate.
    """
    print 'prms=', x
    np = ModelParameters(x)
    if np.not_ok():
        return None    # Give up, if parameters are silly.
    # Compute a list of names:
    namelist = []
    NNPO = Nicknames + PlaceOcc
    for i in range(N):
        namelist.append( Name(random.choice(NNPO), np) )
    for t in range(20):
        namelist = generation(namelist)
    # Compute statistics from the list of names:
    lenhist = {}
    smith = 0
    for n in namelist:
        ln = len(n.sl)
        lenhist[ln] = lenhist.get(ln, 0) + 1
        smith += 'smith' in n.sl
    # The data:
    datasmith = 0.006
    data = [None, 0, 0.036, 0.20, 0.46, 0.23, 0.05, 0.009, 0.002, 0.0002]
    # Compare the data to the statistics from the simulation:
    o = [ math.log((smith/float(N))/datasmith) ]
    for l in range(2, 10):
        o.append(math.log( (lenhist.get(l, 0)/float(N)) / data[l]))
    write_histogram(namelist, open('names.txt', 'w'))
    print 'r=', o
    # Return an array of the 'differences' between the model
    # statistics and the data.
    return [10*r for r in o]

def start(arglist):
    """Sets the starting position for the search to find
    the best fit adjustable parameters. This is used
    when optimizing the parameters; it is called by an external
    script."""
    return Numeric.array([0.25, 0.25, 0.25, 2], Numeric.Float)
def V(start):
    """Sets the initial region over which to search for the
    best fit adjustable parameters. This is used
    when optimizing the parameters; it is called by an external
    script."""
    return Numeric.array([[1, 0, 0, 0], [0, 0.1, 0, 0], [0, 0, 1, 0],
                          [0, 0, 0, 1]], Numeric.Float)

NI = 1000    # Used for finding best fit adjustable parameters.
c = 10000    # Used for finding best fit adjustable parameters.

# Next, we compute a few things that will speed up the computation.
# First, a dictionary of place names, to allow us to rapidly decide
# whether or not a newly generated name matches a place name:
Placedict = {}
for p in Place:
    Placedict[str(Name(p, None))] = 1

# Second, we need an array of place or occupational names:
PlaceOcc = Place + Occ

if __name__ == '__main__':
    # Begin the computation. The first argument is the
    # size of the population; the second argument
    # (an array) are the values of the adjustable parameters
    # that we want to use.
    run(10000, [0.43172252, 0.00283817, 0.13898237, 0.28479306])