- 1. Lecture 14: Artificial intelligence and machine learning Dr. Martin Chapman Principles of Health Informatics (7MPE1000). https://martinchapman.co.uk/teaching
- 2. Preamble: Setting expectations… Studying artificial intelligence is less of this… Ex Machina, 2015
- 3. Preamble: Setting expectations… And more of this…
- 4. Preamble: Simplifications Artificial intelligence and machine learning could easily take up an entire course (or 10), and are very hard to describe properly within a single lecture. As such, I will make some fairly significant simplifications, particularly when discussing machine learning. While the explanations here are enough to understand the concepts at a high level, further resources will be required for a full and proper understanding.
- 5. Preamble: You’ll find a million (literally) similar explanations online… We’re not trying to compete with these resources, and they may even be useful given the simplifications discussed previously.
- 6. What is (medical) AI? There are lots of different ways to define AI, particularly in the healthcare domain. Perhaps one of the first was: Medical artificial intelligence is primarily concerned with the construction of AI programs that perform diagnosis and make therapy recommendations… medical AI programs are based on symbolic models of disease entities and their relationship to patient factors and clinical manifestations. William J. Clancey and Edward H. Shortliffe. Readings in medical artificial intelligence: the first decade. 1984. We’ll see how this definition might not apply exactly anymore.
- 7. What is (medical) AI? We can/will argue that, like humans, a (medical) AI is able to represent knowledge in the world and use that knowledge to reason, and, before all of this, is able to gain that knowledge from somewhere. Hence the focus of this AI (and machine learning) lecture is…
- 8. Lecture structure 1. Computational representation and reasoning 2. Model building
- 9. Learning outcomes 1. Understand and critique different approaches to storing and applying knowledge. 2. Understand the connection between modelling and these approaches. 3. Be able to list different computational discovery techniques and how they support the collection of knowledge.
- 10. Relationship with CDSS Lecture 13 introduced a fairly compelling example of an AI – a clinical decision support system (CDSS). Therefore, the representation, reasoning and knowledge acquisition techniques discussed here very much explain the ‘how’ of Lecture 13: how do CDSSs deliver the benefits we saw there? We’ll provide specific examples of this where possible.
- 11. Computational representation and reasoning How can we represent knowledge, and what approaches can we use to automatically take that knowledge and arrive at a conclusion?
- 12. Recall: Knowledge structure Knowledge, from our model: If a plane does not have pontoons, then it will sink. We refer to this as an inference procedure. When determining whether a plane sinks, we are applying knowledge that actually consists of four distinct entities: the inference procedure itself, supported by three other (sub-) entities: (1) a language, (2) a knowledge base and (3) an ontology.
- 13. Computational representation and reasoning In the first half of this lecture, we’ll be asking how we build this knowledge base (representation) and define the associated inference procedures so that reasoning can occur. We’ll look at three representation and reasoning tools in the following slides: (1) Rules, (2) Networks and (3) Systems dynamics. For each, we should be able to answer the following: 1. How can we structure our knowledge? 2. How can we apply that structured knowledge to new data to gain information?
- 15. Logic rules The first way we can structure knowledge is as a set of rules. We saw a basic version of a rule in Lecture 2: RuleP1 – If plane not pontoons then conclude sink. We can formalise this as: no pontoons ⇒ plane sinks. The ⇒ symbol means implies; if A is true we are free to conclude that B is true. Importantly, rules like this are a subset of formal logic, so we’re free to use some of the Boolean logic we saw in Lecture 3.
- 16. Reasoning with logic rules How do we now draw conclusions from these rules? We actually saw the answer to this back in Lecture 4…
- 17. Recall: Deduction, abduction and induction
- 18. Reasoning with logic rules How do we now draw conclusions from these rules? We actually saw the answer to this back in Lecture 3… Using deduction, for example, in combination with this rule, we can conclude that if we encounter a plane without pontoons it will sink. RuleP1 If plane not pontoons then conclude sink This would support the kind of ‘symbolic AI’ mentioned in our original definition.
- 19. Reasoning with logic rules How do we now draw conclusions from these rules? We actually saw the answer to this back in Lecture 3… From the perspective of logic (the core of our rules), we are saying if a plane doesn’t have pontoons, and not having pontoons implies that a plane sinks, then the plane must sink. This is an inference procedure known as modus ponens and adds weight to our conclusions. It is a formalisation of our original inference procedure from Lecture 2.
- 20. Reasoning with logic rules How do we now draw conclusions from these rules? We actually saw the answer to this back in Lecture 3… We could also work backwards and use a rule of logic called modus tollens to tell us that if a plane isn’t sinking then it must have pontoons. We need to be careful with modus tollens though as it creates a very strict association. If we want to have multiple potential reasons for an outcome, then abduction alone might be a better way to draw conclusions.
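As a minimal, illustrative sketch (not lecture code, with hypothetical names throughout), applying RuleP1 to new data via modus ponens might look like this:

```python
# A minimal sketch of rule-based reasoning with modus ponens: if a
# rule's antecedent is among our facts, we may conclude its consequent.

def modus_ponens(facts, rule):
    """Add the rule's consequent if its antecedent holds in the facts."""
    antecedent, consequent = rule
    if antecedent in facts:
        facts.add(consequent)
    return facts

# RuleP1: "no pontoons" implies "sinks"
rule_p1 = ("no pontoons", "sinks")

facts = {"no pontoons"}            # new data: a plane without pontoons
facts = modus_ponens(facts, rule_p1)
print(facts)                       # {'no pontoons', 'sinks'} - deduction
```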
- 21. Statistical rules Logic rules allow us to express that one event definitely implies another, but as we’ve seen throughout the course it’s often the case that certain outcomes have an element of probability attached to them. When rules account for probability, we call them statistical rules. Many statistical rules are based on something we’ve already seen…
- 22. Recall: A question You are told that a person named Alex has the following personality traits: ‘Quiet’, ‘Reserved’, ‘Good at maths’ Do you think it’s more likely that Alex: (A) Works in retail (e.g. a shop assistant) or (B) Is a computer programmer? This is our new data
- 23. Recall: Bayes’ theorem: Formula We can write this process out as a formula:

Posterior = (Programmer likelihood × Programmer prior) / ((Programmer likelihood × Programmer prior) + (Retail likelihood × Retail prior))

or, with our numbers:

(75% × 25%) / ((75% × 25%) + (25% × 75%)) = 50%
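A minimal sketch of this calculation in Python, using the numbers above (variable names are illustrative):

```python
# A minimal sketch of the Bayes' calculation: posterior probability that
# Alex is a programmer given their traits, from the lecture's numbers.

prior_programmer = 0.25        # anyone in the population being a programmer
prior_retail = 0.75
likelihood_programmer = 0.75   # P(traits | programmer)
likelihood_retail = 0.25       # P(traits | retail)

posterior = (likelihood_programmer * prior_programmer) / (
    likelihood_programmer * prior_programmer
    + likelihood_retail * prior_retail
)
print(posterior)  # 0.5, i.e. a 50% chance Alex is a programmer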
- 24. Statistical rules – Bayes’ theorem But we can represent, and thus reason with, even more complex scenarios using the principles of Bayes’ theorem… RuleB1 – If person is ‘Quiet’, ‘Reserved’, ‘Good at maths’ then conclude programmer with probability 0.5. As a statistical rule, this is what the output from our previous Bayes’ calculation would look like.
- 25. Networks Bayesian Belief Networks, Markov chains and Neural Networks
- 26. Preamble: Beyond statistical rules Relying on sets of rules alone – which are relatively simple in form – and their associated inference procedures to represent and reason with knowledge may be somewhat limiting. Therefore, we can look instead to creating more complex structures to store and link our knowledge, and facilitate reasoning: networks. We saw a basic example of a network containing linked knowledge back in Lecture 4:
- 27. Recall: ‘Decision trees’ We want to know the chance that Alex could fill a Python programmer role. [Tree: Alex turns up → Alex is a programmer (50%) or is not a programmer (50%); a programmer knows Python (80%) or does not (20%); a non-programmer knows Python (1%) or does not (99%); knowing Python makes Alex suitable for the job.] Suitable for the job: (50% × 80%) + (50% × 1%) = 40.5%. Not suitable for the job: (50% × 20%) + (50% × 99%) = 59.5%.
- 28. Bayesian belief networks We can take things a step further by explicitly combining decision trees with concepts from Bayes’ theorem to introduce a more nuanced representation of probabilistic knowledge – a Bayesian belief network – that allows us to reason with more complex scenarios. First, we can do this simply…
- 29. Bayesian belief networks We lay out (part of) the information from our instantiation of Bayes’ theorem in a network structure. [Diagram: a parent node ‘Alex is a programmer’ (prior = 25%, the probability that anyone in the population is a programmer) with an arrow to a child node ‘Alex is quiet’ (likelihood = 75%, the probability that someone who is a programmer has Alex’s traits).]
- 30. Bayesian belief networks We can take things a step further by explicitly combining decision trees with concepts from Bayes’ theorem to introduce a more nuanced representation of probabilistic knowledge – a Bayesian belief network – that allows us to reason with more complex scenarios. First, we can do this simply… reformatting our existing Bayes’ equation graphically, capturing that if a parent node is true (initially based on a prior probability) then a child node is also true, with a given conditional probability (likelihood). We can then make this more complex…
- 31. Bayesian belief networks [Diagram: the ‘Alex is a programmer’ (prior = 25%) → ‘Alex is quiet’ (likelihood = 75%) pair embedded in a larger network of parent and child nodes, each with its own prior or likelihood.]
- 32. Bayesian belief networks We can take things a step further by explicitly combining decision trees with concepts from Bayes’ theorem to introduce a more nuanced representation of probabilistic knowledge – a Bayesian belief network – that allows us to reason with more complex scenarios. First, we can do this simply… reformatting our existing Bayes’ equation graphically, showing that if a parent node is true (initially based on a prior probability) then a child node is also true, with a given conditional probability (likelihood). We can then make this more complex… adding additional dependencies. Complex questions can then be asked of this network, and a form of Bayes’ rule applied iteratively to infer the answer.
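To make this concrete, here is a minimal, illustrative sketch of the simple two-node network as a conditional probability table, answering a query by enumerating the parent's states (the structure and names are assumptions, not lecture code):

```python
# A minimal sketch of the two-node network: a parent node with a prior
# and a child node with a CPT, queried via Bayes' rule by enumeration.

p_programmer = 0.25                        # prior on the parent node
p_quiet_given = {True: 0.75, False: 0.25}  # CPT: P(quiet | programmer?)

# Query: P(programmer | quiet), by enumerating the parent's two states
joint_true = p_programmer * p_quiet_given[True]
joint_false = (1 - p_programmer) * p_quiet_given[False]
print(joint_true / (joint_true + joint_false))  # 0.5
```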
- 33. Networks Bayesian Belief Networks, Markov chains and Neural Networks
- 34. Markov chains A Bayesian belief network stores probabilistic knowledge and allows us to reason with it. Another type of network, a Markov Chain, does something very similar. Here, probabilities represent the chance of transitioning from a current state to another state when those states are connected. An example of a current state might be the state of the weather today, and another, connected state might be the state of the weather tomorrow:
- 35. Markov chains Here we are representing, for example, that if today is sunny, the chance of it being sunny tomorrow is 20%. We’ll talk about where these numbers are likely to come from later. [Diagram: states Sunny and Rainy; Sunny → Sunny 20%, Sunny → Rainy 80%, Rainy → Sunny 75%, Rainy → Rainy 25%.]
- 36. Markov chains - inference Is predicting the weather tomorrow (inference) therefore as simple as reading these probabilities, or should we look at past patterns (i.e. combine probabilities from previous states)? Something called the Markov property, which underpins Markov chains, says that, to balance simplicity with correctness, we should not do this. We should use only the current state, interpreting the probabilities on the chain directly.
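A minimal sketch of the chain and this direct style of inference (the dictionary layout and names are illustrative):

```python
# A minimal sketch of the weather Markov chain. Following the Markov
# property, inference uses only the current state: we simply read off
# that state's transition probabilities.

transitions = {
    "sunny": {"sunny": 0.20, "rainy": 0.80},
    "rainy": {"sunny": 0.75, "rainy": 0.25},
}

today = "sunny"
print(transitions[today])  # {'sunny': 0.2, 'rainy': 0.8} - tomorrow's forecast
```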
- 37. Hidden Markov chains Let’s imagine we know our weather state transition probabilities (seen previously), but for some reason we can’t see what the weather is like (e.g. we are in a room with no windows). We can, however, see evidence of the weather (e.g. people coming into the room with or without an umbrella). This would introduce a second set of probabilities (emission probabilities), between the hidden state (the weather) and our observations (the presence of an umbrella):
- 38. Hidden Markov chains We have much more expressive power in a Hidden Markov chain. [Diagram: hidden states Sunny and Rainy with the transition probabilities from before (20%, 80%, 75%, 25%), each linked to the observations Umbrella and No umbrella by emission probabilities (90% and 10%).]
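A minimal sketch of this structure in Python. How the 90%/10% emission figures attach to each state is an assumption (umbrellas being far more likely on rainy days), as is the starting belief:

```python
# A minimal sketch of a hidden Markov chain: transitions between hidden
# weather states, plus emission probabilities linking each hidden state
# to what we can actually observe.

transition = {
    "sunny": {"sunny": 0.20, "rainy": 0.80},
    "rainy": {"sunny": 0.75, "rainy": 0.25},
}
emission = {
    "sunny": {"umbrella": 0.10, "no umbrella": 0.90},  # assumed mapping
    "rainy": {"umbrella": 0.90, "no umbrella": 0.10},
}

# One step of reasoning: given a belief over the hidden states, how
# likely are we to see an umbrella?
belief = {"sunny": 0.5, "rainy": 0.5}   # illustrative starting belief
p_umbrella = sum(belief[s] * emission[s]["umbrella"] for s in belief)
print(round(p_umbrella, 2))  # 0.5
```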
- 39. Recall: Part-of-speech (POS) tagging Our bag of words approach doesn’t allow us to appreciate that words appear in sentences, and each have a different grammatical role (e.g. verbs and nouns). The process of determining the role each word plays in a sentence is known as part-of-speech (POS) tagging. It is important to understand whether a word is a noun or a verb, for example, so we can correctly label entities from our terminology. We’ll look at Markov models, which support automatic POS tagging, later in the course.
- 40. POS tagging and Hidden Markov chains In POS tagging, we encounter a similar set of observations and hidden states as we do in our (slightly contrived) weather example. Our observations are our words (e.g. ‘land’), and our hidden states – which these observations might be evidence of – are the different grammatical roles each word might have (e.g. ‘verb’ or ‘noun’). Transition probabilities are the chance of, for example, a verb following a noun, and our emission probabilities are the chance of a word, in general, holding a certain grammatical role:
- 41. POS tagging and Hidden Markov chains How inference would be applied to a network like this for POS tagging is outside the scope of this course, but we will come back to where our probabilities come from. [Diagram: hidden states Noun and Verb linked by transition probabilities (how often one type of word is followed by another) and linked to the observations ‘Land’ and ‘Seaplane’ by emission probabilities (e.g. the overall chance of ‘land’ being a verb); all values unknown (?%).]
- 42. Networks Bayesian Belief Networks, Markov chains and Neural Networks
- 43. Background: Human thinking The human brain consists of a set of neurons, connected by a set of synapses. If the inputs (e.g. observations in the world) to a given neuron, via a set of synapses, are sufficient (i.e. we see enough of them), that neuron will fire. This is the basis of thought. [Diagram: a neuron and its synapses. Caption: ‘If I see this wing, these feet and this beak, the neuron fires and I can conclude this is a penguin.’]
- 44. Background: Human thinking But of course some input features are more or less correlated with a given output than others. To represent this, we implicitly weight certain inputs in our mind. These weights thus reflect our knowledge. [Diagram: the same neuron with inputs weighted 0.25, 0.5 and 0.5. Caption: ‘Because other animals might have wings like this, I give this input a lower weight as it alone is less indicative of this being a penguin.’]
- 45. Neural networks A very similar structure is used to store knowledge – in the form of weights – in a third type of network, a neural network. Reasoning then occurs by seeing if the inputs are, given the weights, sufficient for a neuron to fire. [Diagram: Input 1, Input 2 and Input 3 connected by Weight A, Weight B and Weight C to a single output neuron.]
- 46. Neural networks The network represents whether inputs are sufficient for a neuron to fire using something called an activation function. The simplest activation function (f), a binary step function, has a straightforward threshold for firing the neuron: whether, once weighted and summed, there is any (positive) input at all. Summation: x = Input 1 × Weight A + Input 2 × Weight B + Input 3 × Weight C. Activation: if x > 0, f(x) = 1; otherwise (x ≤ 0), f(x) = 0.
- 47. Neural networks The network represents whether inputs are sufficient for a neuron to fire using something called an activation function. The simplest activation function (f), a binary step function, has a straightforward threshold for firing the neuron: whether, once weighted and summed, there is any (positive) input at all. Worked example: x = 0.5 × 1 + 0.1 × 1 + 0 × 1 = 0.6, and f(0.6) = 1, where an output of 1 could indicate ‘true’ (e.g. is a penguin).
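A minimal sketch of this forward pass in Python, reading the worked example (as an assumption) as inputs of 0.5, 0.1 and 0 with weights of 1:

```python
# A minimal sketch of a single-neuron forward pass: summation followed
# by a binary step activation.

def binary_step(x):
    return 1 if x > 0 else 0   # fires only on positive summed input

inputs = [0.5, 0.1, 0.0]       # assumed reading of the worked example
weights = [1.0, 1.0, 1.0]

x = sum(i * w for i, w in zip(inputs, weights))
print(round(x, 1), binary_step(x))  # 0.6 1 -> the neuron fires
```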
- 48. Systems dynamics
- 49. Preamble: Rule and network limitations We will talk more about how we acquire the knowledge we’ve seen thus far (e.g. the percentages for Bayes’ rule) shortly, but informally we know it is based on past experiences. We can estimate, for example, the portion of programmers in a population fitting a description using our experiences with programmers. But what if we’ve never met any programmers or seen their traits before? In other words, what if we’ve never seen a particular pattern before?
- 50. Background: Foundational knowledge In these situations, all we can do is rely on foundational knowledge – things we already know to be true of the world – and reason from that. Blueprints are a good example of foundational knowledge. They represent, at a very low level, exactly what we know to be true of a device, so that when new problems are encountered they can be reasoned upon to generate new information in respect of that problem.
- 51. Systems dynamics An entity that can capture a greater range of foundational knowledge than a blueprint is something called systems dynamics. This representation uses a variety of (diagrammatic) formalisms – stocks, flows, converters and connectors – to represent complex systems.
- 52. Systems dynamics – an example Let’s say we want to represent foundational knowledge about the impact of birth rate on the size of a population. This isn’t as straightforward as connecting birth rate and population, as, for example, a decrease in birth rate doesn’t directly decrease the population. Instead, it causes it to increase at a slower rate. Similarly, there is a reverse connection between population size and birth rate (a larger population means, in turn, more births). We can capture all this using systems dynamics:
- 53. Systems dynamics [Diagram: Birth rate → Births → Population.] Population is an example of a stock: a value that can be increased or decreased. Births is an example of an (in)flow: an entity that increases the value of a stock; only flows can impact stocks. Birth rate is an example of a converter: an entity whose value is not determined directly by the system. The links are examples of connectors: converters can impact flows, and stocks can also, in turn, impact flows. By separating out the concepts of birth rate, births and population we can capture this nuance of the world.
- 54. Systems dynamics An entity that can capture a greater range of foundational knowledge than a blueprint is something called systems dynamics. This representation uses a variety of (diagrammatic) formalisms – stocks, flows, converters and connectors – to represent complex systems. Inference is then done by digitalising these formalisms, providing initial values and then simulating how flows are likely to impact these stocks over time.
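A minimal sketch of that inference step, under illustrative starting values: the population stock is changed only by the births flow, which the connectors tie back to the stock and the birth-rate converter:

```python
# A minimal sketch of simulating the stock-and-flow model: each step,
# the births flow is computed from the stock and the converter, and
# only that flow changes the stock. Values are illustrative.

population = 1000.0   # stock
birth_rate = 0.02     # converter (births per person per year)

for year in range(10):
    births = population * birth_rate   # flow, set via the two connectors
    population += births               # only flows change the stock
    print(year + 1, round(population))
```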
- 55. Summary: Computational representation and reasoning How can we structure our knowledge, and how can we apply that structured knowledge to new data to gain information? Logic rules: see which rules match the data; what these rules tell us is definitely true then becomes our new information. Statistical rules: see which rules match the data; what these rules tell us might be true then becomes our new information.
- 56. Summary: Computational representation and reasoning Bayesian networks: our data tells us what is true in the network, and then by applying a form of Bayes’ rule we gain new information about what else may also be true. Markov chains: our data tells us the current state, and then we use the chain to gain information about which state(s) may be transitioned to. Neural networks: our data is used as input to an (artificial) neuron, the activation (or not) of which is information.
- 57. Summary: Computational representation and reasoning Systems dynamics: simulations are run on computational versions of systems dynamics models, and the state of the system at the end of the simulation is our information. These reasoning processes can be done by a machine (an AI), hence the term ‘computational’.
- 58. Model building Where does the knowledge that we represent and then reason with come from?
- 59. Model building So far, we’ve been representing knowledge and applying it back to the world (reasoning). This might be familiar to you…
- 60. Recall: Knowledge acquisition and application In this way, we see how models capture knowledge that has been gained from the world, and in turn apply that knowledge to the world to gain new information. Abstraction + Knowledge Acquisition Instantiation + Knowledge Application
- 61. Model building So far, we’ve been representing knowledge and applying it back to the world. This might be familiar to you… We have thus, in the first half of this lecture, ticked off the right-hand side of this diagram, which we saw back in Lecture 2. This diagram also tells us that – whether we are constructing rules, networks or systems dynamics – what we are doing is building models.
- 62. Model building The left-hand side is now of interest: how do we get the knowledge needed to build these models? We will use a generalised form of this diagram to visualise the techniques we look at in the remainder of the lecture, each of which tells us how we acquire knowledge.
- 63. Model building [Diagram: ? → representation as rules, networks or systems dynamics → model (e.g. for clinical decision support) → reasoning. The ‘?’ marks where the knowledge needed to build the model comes from.]
- 64. Running example We know that there is knowledge represented within an instantiation of Bayes’ theorem, which can then be used to reason about the world (e.g. whether an individual is a programmer based on their traits). It is a model. But where does this knowledge (the likelihoods and priors) come from? We’ll answer this question when exploring various model-building methods, but these methods could be used to construct other models too (other rules, networks and systems dynamics).
- 65. Model building - asking humans
- 66. Model building from humans Probably the most straightforward way to obtain the knowledge for a machine to represent and reason with is to ask humans for it. This is often called knowledge engineering. We could, for example, ask humans to estimate the probabilities for our Bayes’ theorem (similar to what we did in Lecture 4 for our programmer vs. retail likelihoods).
- 67. Model building from humans [Diagram as before, with the ‘?’ – the source of the knowledge – still to be filled in.]
- 68. Model building from humans Probably the most straightforward way to obtain the knowledge for a machine to represent and reason with is to ask humans for it. This is often called knowledge engineering. We could, for example, ask humans to estimate the probabilities for our Bayes’ rule (similar to what we did in Lecture 4 for our initial programmer vs. retail likelihoods). Knowledge like this can be obtained from humans using several different mechanisms:
- 69. Model building from humans - mechanisms ‘Think-aloud’: ask users to verbalise their thoughts while they complete tasks. Interviews. Observation.
- 70. Model building from humans - mechanisms Systematic review: as we saw in Lecture 4, search is one way we can systematically look through data to obtain knowledge, such as when examining literature. Repertory grid: ask users to discuss the similarities and differences between different concepts to capture relationships. Crowdsourcing.
- 71. Summary: Model building from humans [Diagram as before, with interviews (and the other human-based mechanisms) filling the ‘?’: knowledge acquired from humans is represented as rules, networks or systems dynamics to form a model, which is then reasoned with.]
- 72. Model building – computational discovery Data Mining, Clustering (Unsupervised machine learning) and Naïve Bayes Classifier (Supervised machine learning)
- 73. Model building from data (computational discovery) Our final example of model building from humans, a systematic literature review, suggests that if human knowledge is embedded in data, we may not need to consult humans at all. In other words, we might be able to build our models (acquire knowledge) automatically from data.
- 74. Model building from data (computational discovery) [Diagram as before, with the ‘?’ still to be filled in.]
- 75. Model building from data (computational discovery) Our final example of model building from humans, a systematic literature review, suggests that if human knowledge is embedded in data, we may not need to consult humans at all. In other words, we might be able to build our models (acquire knowledge) automatically. Automatic model building is known as computational discovery. We will look at three computational discovery mechanisms: (1) Data mining, (2) Unsupervised learning (briefly) and (3) Supervised learning.
- 76. Data Mining A broad approach to computational discovery is something called data mining. If the data being examined is unstructured (e.g. free text) this may also be called text mining. Here, we look for specific patterns within data. The more often we find the same pattern, the surer we can be that it is knowledge. These patterns may be rules (e.g. quiet ⇒ programmer), or sequences of events (known as process mining).
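As a minimal, illustrative sketch (the records and the confidence measure are assumptions, not lecture data), mining a rule like quiet ⇒ programmer can amount to counting how often the pattern holds:

```python
# A minimal sketch of mining a rule from data: the more often the same
# pattern co-occurs, the surer we can be that it is knowledge.

records = [
    {"trait": "quiet", "job": "programmer"},
    {"trait": "quiet", "job": "programmer"},
    {"trait": "quiet", "job": "retail"},
    {"trait": "loud", "job": "retail"},
]

quiet = [r for r in records if r["trait"] == "quiet"]
support = len(quiet)   # how often the pattern's antecedent appears
confidence = sum(r["job"] == "programmer" for r in quiet) / len(quiet)
print(support, round(confidence, 2))  # rule holds in ~67% of matches
```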
- 77. Model building – computational discovery Data Mining, Clustering (Unsupervised machine learning) and Naïve Bayes Classifier (Supervised machine learning)
- 78. Unsupervised machine learning One way to conceptualise data mining is as a system (or machine) learning new knowledge (e.g. rules) from patterns in the data. Therefore, if data mining is conducted without any human support, as is often the case, we can describe what is being done as unsupervised machine learning.
- 79. Clustering A common form of unsupervised machine learning is known as clustering. Rather than identifying, for example, rules from the data, the goal of clustering is broader: to identify groups (classes) in the data. The features of those groups and the value of those features then become new knowledge that can be used to better understand the data or reason with new data. We could apply clustering to our programmer and retail worker population to try and learn the values needed for Bayes’ theorem…
- 80. Recall: Our programmer and retail worker population [Diagram: one (existing) cluster or class containing programmers, and another (existing) cluster or class containing retail workers.]
- 81. Clustering A common form of unsupervised machine learning is known as clustering. Rather than identifying, for example, rules from the data, the goal of clustering is broader: to identify groups (classes) in the data. The features of those groups and the value of those features then become new knowledge that can be used to better understand the data or reason with new data. We could apply clustering to our programmer and retail worker population to try and learn the values needed for Bayes’ theorem… but this doesn’t tell us much more than we already know. We already know the classes that would have been discovered through the clustering process.
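To make the idea concrete, here is a minimal, illustrative sketch of clustering (a 2-means pass over a single feature); the ages and starting centres are assumptions, not lecture data:

```python
# A minimal sketch of unsupervised clustering (2-means on one feature).
# No labels are given: the algorithm simply discovers two groups.

ages = [19, 25, 32, 35, 38, 40, 42, 46, 51, 53, 56, 59, 61, 76, 77, 80]
centres = [30.0, 70.0]   # illustrative initial guesses for the centres

for _ in range(10):
    groups = [[], []]
    for a in ages:
        # assign each point to its nearest centre
        groups[0 if abs(a - centres[0]) < abs(a - centres[1]) else 1].append(a)
    centres = [sum(g) / len(g) for g in groups]   # move the centres

print(centres)  # e.g. [34.625, 64.125] - the discovered group centres
```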
- 82. Model building – computational discovery Data Mining, Clustering (Unsupervised machine learning) and Naïve Bayes Classifier (Supervised machine learning)
- 83. Supervised Machine Learning We’ve been talking a lot about learning without talking about a related concept: teaching. Rather than leaving a machine unsupervised to learn, we can supervise (teach) it by providing it with examples that it can learn relationships from. To understand this better, we can again ask how we might learn the probabilities in Bayes’ theorem:
- 84. Recall: Our programmer and retail worker population
- 85. Population features Let’s imagine we now have information on the traits of people within our population. [Diagram: the 16 people labelled with their traits – Quiet ×6 and Loud ×10.]
- 86. Prelude: Populations as data Let’s translate this graphic to a set of data (and add age as well to help our example). How would we now calculate our Bayes’ values?

ID | Trait | Age | Occupation
1 | Quiet | 42 | Programmer
2 | Quiet | 38 | Programmer
3 | Quiet | 76 | Programmer
4 | Loud | 51 | Programmer
5 | Quiet | 19 | Retail
6 | Loud | 53 | Retail
7 | Loud | 59 | Retail
8 | Loud | 80 | Retail
9 | Loud | 77 | Retail
10 | Loud | 61 | Retail
11 | Quiet | 40 | Retail
12 | Loud | 35 | Retail
13 | Loud | 32 | Retail
14 | Loud | 46 | Retail
15 | Quiet | 56 | Retail
16 | Loud | 25 | Retail
- 87. Learning Bayes’: The prior [Table as before.] Calculating the prior(s) is much the same as we did before: 4 of the 16 people are programmers, giving a prior of 25%. However…
- 88. Learning Bayes’: The likelihood [Table as before.] …we can now calculate the likelihood directly, as we have actual data we can use: 3 of the 4 programmers are quiet, giving a likelihood of 75%. Previously, we had estimated these values, based upon what portion of a group of programmers (the number of which was based on our prior) we thought would have the traits of interest.
- 89. Learning Bayes’: The likelihood [Table as before.] We can do a similar thing for retail workers: 3 of the 12 are quiet, giving a likelihood of 25%.
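A minimal sketch of these three calculations run over the table above (the row encoding is illustrative; the counts match the lecture's table):

```python
# A minimal sketch of 'learning' the Bayes' values from data: the prior
# is the fraction of programmers, and each likelihood is the fraction
# of a class that is quiet.

rows = [("Quiet", "Programmer")] * 3 + [("Loud", "Programmer")] \
     + [("Quiet", "Retail")] * 3 + [("Loud", "Retail")] * 9

programmers = [t for t, job in rows if job == "Programmer"]
retail = [t for t, job in rows if job == "Retail"]

prior = len(programmers) / len(rows)                      # 4/16 = 25%
lik_prog = programmers.count("Quiet") / len(programmers)  # 3/4  = 75%
lik_retail = retail.count("Quiet") / len(retail)          # 3/12 = 25%
print(prior, lik_prog, lik_retail)
```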
- 90. The process we have been through here is the process a machine goes through automatically when undertaking what is known as supervised learning. Let’s formalise slightly some of the things we did… Supervised learning
- 91. Training data, features and classes [Table as before.] We call this training data: Trait and Age are features, from which Trait is selected as the predictor, and the Occupation column provides the classes (or labels). What is the relationship between the predictor and each label?
- 92. Supervised learning The process we have been through here is the process a machine goes through automatically when undertaking what is known as supervised learning. Let’s formalise slightly some of the things we did… Our training data supplied us with features, from which we selected trait as a predictor, and added labels to form positive examples for different classes (e.g. programmer). From these examples, we learned the relationship between, for example, someone’s traits and them being a programmer (e.g. a 75% chance of being quiet if a programmer). Much like our initial guess about the portions of programmers and retail workers, we hope our training data generalises (more later). This contrasts with unsupervised learning, where we didn’t have any labelled examples to learn from.
- 93. Learning Bayes’: Final model We now have all the numbers we need to instantiate Bayes’ theorem. Specifically, the simple model built from this type of supervised learning is called a Naïve Bayes classifier. It can now make future predictions when new data is encountered. This classifier is naïve because it makes several assumptions, such as there being no connection between different features. In reality, this might not be the case (e.g. might someone’s traits be impacted by their age?).
- 94. Other supervised learning-based models The kind of simplification within a Naïve Bayes classifier is the same kind of simplification we have identified in models throughout the course. There are, therefore, several other examples of supervised machine learning-based models, some of which you may have heard of, such as logistic regression. These models, of course, also rely on the presence of labelled data. If we don’t have this data, the best we can do is rely on unsupervised learning.
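A minimal sketch of the resulting classifier making a prediction for new data, using the values learned above (the function and its normalisation step are illustrative):

```python
# A minimal sketch of a Naive Bayes prediction for a new, unlabelled
# person, using the prior and likelihoods learned from the table.

def classify(trait):
    priors = {"Programmer": 0.25, "Retail": 0.75}
    lik_quiet = {"Programmer": 0.75, "Retail": 0.25}
    scores = {}
    for job in priors:
        p_trait = lik_quiet[job] if trait == "Quiet" else 1 - lik_quiet[job]
        scores[job] = p_trait * priors[job]     # likelihood x prior
    total = sum(scores.values())
    return {job: s / total for job, s in scores.items()}

print(classify("Quiet"))  # {'Programmer': 0.5, 'Retail': 0.5}
```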
- 95. Recall: Markov chains Knowing that machine learning uses data to acquire knowledge, we can now answer where the transition probabilities in our Markov chain might come from:

P(state i → state j) = (count of transitions from state i to state j) / (total count of transitions from state i)

[Table: Days 1–10 with Weather: Sunny, Rainy, Sunny, Rainy, Sunny, Rainy, Sunny, Sunny, Rainy, Rainy. Diagram: the resulting chain – Sunny → Sunny 20%, Sunny → Rainy 80%, Rainy → Sunny 75%, Rainy → Rainy 25%.]
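A minimal sketch of this counting in Python, run over the ten-day log above (names are illustrative):

```python
# A minimal sketch of learning transition probabilities by counting
# transitions from each state in the weather log.

from collections import Counter

weather = ["Sunny", "Rainy", "Sunny", "Rainy", "Sunny",
           "Rainy", "Sunny", "Sunny", "Rainy", "Rainy"]

pairs = Counter(zip(weather, weather[1:]))  # counts of i -> j
totals = Counter(weather[:-1])              # counts of transitions from i

for (i, j), n in sorted(pairs.items()):
    print(f"P({i} -> {j}) = {n}/{totals[i]} = {n / totals[i]:.0%}")
# P(Rainy -> Rainy) = 1/4 = 25%, P(Rainy -> Sunny) = 3/4 = 75%,
# P(Sunny -> Rainy) = 4/5 = 80%, P(Sunny -> Sunny) = 1/5 = 20%
```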
- 96. Recall: Neural Networks We can also answer where the weights (knowledge) in our neural network might come from: given a set of inputs (features) and outputs (labels), we experiment with different weights until the output of the activation function matches the output in the data. [Table: Input A = 0.5, Input B = 0.1, Input C = 0, Output = 1; further rows elided.] This is obviously a very simple training process. When we have more complex neural networks – those with different activation functions or potentially those where we have neurons organised into different groups or layers – then our training approaches also become more complex, often referred to as ‘deep learning’.
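As a minimal, illustrative sketch, one classic way to do this ‘experimenting’ automatically is the perceptron update rule, which nudges each weight whenever the activation’s output disagrees with the labelled output (the second data row and the learning rate are assumptions):

```python
# A minimal sketch of perceptron-style training: adjust weights until
# the binary step activation reproduces the labelled outputs.

def binary_step(x):
    return 1 if x > 0 else 0

data = [([0.5, 0.1, 0.0], 1),    # the row from the table above
        ([0.0, 0.0, 0.0], 0)]    # an assumed extra labelled example
weights = [0.0, 0.0, 0.0]

for _ in range(10):                       # a few passes over the data
    for inputs, label in data:
        out = binary_step(sum(i * w for i, w in zip(inputs, weights)))
        error = label - out
        weights = [w + 0.1 * error * i for w, i in zip(weights, inputs)]

print(weights)  # weights that now reproduce the labelled outputs
```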
- 97. Aside: What about Generative AI? We can’t really talk about AI and machine learning without talking about ChatGPT. The more formal way to refer to ChatGPT is as a Large Language Model-backed (LLM-backed) chatbot; the chatbot uses a trained LLM (a Generative Pre-trained Transformer (GPT)) to support its operation. LLMs are trained on large amounts of existing text using a mixture of unsupervised and supervised processes.
- 98. Aside: What about Generative AI? LLMs in the context of what we’ve seen so far… LLMs are a form of generative AI as they generate new text content: based on a prompt, they generate the next most likely sequence of words, based on the information about the nature of text gained during training. LLMs could, therefore, conduct the Named Entity Recognition (NER) process we saw in Lecture 11, taking the string to ‘make computable’ as a prompt and generating the labels. They are thus also part of the field of Natural Language Processing (NLP). Because Markov chains also support POS tagging, part of the NER process, they can be viewed as a simple form of language model.
- 99. Aside: It’s all statistics in the end… In many situations, the relationship supervised machine learning tries to identify can just be viewed as a line (or an equation that describes that line) that separates the examples given. With just one or two classes this is simple and only requires two dimensions, but with more classes this becomes more complex and requires more sophisticated methods such as support vector machines. [Scatter plot: Retail and Programmer examples separated by a line.]
- 100. Summary: Model building from data (computational discovery) [Diagram as before, with machine learning filling the ‘?’: knowledge acquired from data is represented as rules, networks or systems dynamics to form a model, which is then reasoned with.]
- 101. Summary Artificial intelligence effectively equates to automatically gaining knowledge, representing that knowledge, and then automatically applying it to new data to obtain information. It is thus akin to the modelling processes we have seen. Knowledge can be gained from humans, or from existing data through data mining, unsupervised or supervised learning. The applicability of each depends upon the data available for learning. Knowledge can be represented and applied using rules (logical or statistical), networks (Bayes’, Markov or Neural) or Systems Dynamics.
- 102. Recall: It all comes back to public health interventions… If we can automate the application of knowledge to health data, then we can automate (the introduction of) interventions. If we can’t fully use information systems to automate this application, then they can assist clinicians in the delivery of interventions. [Example: a computer, as an information system, could automatically determine whether a patient has diabetes and act accordingly.]
- 104. Model suitability The success of the (simple) supervised learning process we’ve seen depends upon how representative our data is of the true situation in the world. If it is representative, then the relationships we capture in our model are likely to be correct. If it is not, then those relationships are unlikely to be accurate. The smaller a dataset is, the less likely it is to be representative. The accuracy of the relationships captured can also depend on the features selected, the accuracy of the labels and even the learning algorithm itself.
- 105. Model evaluation For these reasons, it’s important to check how well our model performs before we use it to make future predictions. To do this, we run our model against test data and evaluate the quality of the predictions it makes. We can base this evaluation on techniques we’ve already seen…
- 106. Recall: Questioning: Evaluation (Query Performance Measures – Sensitivity) Q: ‘Health Informatics’ [Diagram: the query run over the items ‘KCL Principles of Health Inf’, ‘UCL Institute of Health Inf’, ‘KCL History’ and ‘KCL Physics’.] The items that we want that are included in our search are called True Positives (TP). Any items that are missed by our search (i.e. we wanted them but they were not retrieved) are called False Negatives (FN). We can calculate a true-positive rate (or sensitivity) as the proportion of true positives from amongst those results that we wanted, or TP/(TP+FN). Here: 1/2 = 50%.
- 107. Recall: Questioning: Evaluation (Query Performance Measures – Specificity) Q: ‘Health Informatics’ [Diagram: the query run over the same items.] The items that we don’t want that are not included in our search are called True Negatives (TN). Any unwanted items that are included by our search (i.e. we didn’t want them but they were retrieved anyway) are called False Positives (FP). We can calculate a true-negative rate (or specificity) as the proportion of true negatives from amongst those results that we did not want, or TN/(TN+FP). Here: 1/2 = 50%.
- 108. Model evaluation For these reasons, it’s important to check how well our model performs before we use it to make future predictions. To do this, we run our model against test data and evaluate the quality of the predictions it makes. We can base this evaluation on techniques we’ve already seen… calculating sensitivity and specificity, for example. Specifically, we show the testing data to our model (e.g. our Naïve Bayes classifier) hiding the true labels when doing so. The labels the classifier comes up with can thus be used to determine its performance:
- 109. Model evaluation 1. True labels (hidden):

ID | Trait | Age | Occupation (hidden)
1 | Quiet | 42 | Programmer
2 | Quiet | 38 | Programmer
3 | Quiet | 76 | Programmer
4 | Loud | 51 | Programmer

2. Classifier labels: Programmer, Programmer, Programmer, Retail. 3. Evaluate the classifier labels in the way we’ve seen. 4. Calculate sensitivity: TP/(TP+FN) = 3/4 = 75%.
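A minimal sketch of steps 2–4 in Python, using the labels above (names are illustrative):

```python
# A minimal sketch of evaluating a classifier: compare its labels
# against the hidden true labels and compute sensitivity.

true_labels = ["Programmer", "Programmer", "Programmer", "Programmer"]
predicted   = ["Programmer", "Programmer", "Programmer", "Retail"]

tp = sum(t == "Programmer" == p for t, p in zip(true_labels, predicted))
fn = sum(t == "Programmer" != p for t, p in zip(true_labels, predicted))

print(tp / (tp + fn))  # 3 / (3 + 1) = 0.75, i.e. 75% sensitivity
```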
- 110. References and Images Enrico Coiera. Guide to Health Informatics (3rd ed.). CRC Press, 2015. Bella Martin. Universal methods of design 100 ways to research complex problems, develop innovative ideas, and design effective solutions. Rockport Publishers, 2012. https://www.gettyimages.co.uk/ https://etn-sas.eu/2020/09/23/part-of-speech-tagging-using-hidden-markov-models/ https://towardsdatascience.com/deep-learning-with-python-neural-networks-complete-tutorial-6b53c0b06af0 https://www.pinterest.co.uk/pin/2040762311279389/ https://thesystemsthinker.com/step-by-step-stocks-and-flows-improving-the-rigor-of-your-thinking