45-minute talk given at iOSCon London on 3/22/19 at SkillsMatter. *Some of the slides did not convert as expected from Keynote to PowerPoint when they were uploaded here to SlideShare.
iOSCon 2019: Generate a Song from Markov Models in Swift
1. SWIFT AS A
COURSING
RIVER
GENERATE A SONG FROM MARKOV
MODELS WITH SWIFT
LIZZIE SIEGLE
@LIZZIEPIKA
7. • Pair each key with the possible tokens that can follow it
@LIZZIEPIKA
START* [Never]
Never [gonna]
Gonna [give, let]
Give [you]
Let [you]
You [up, down]
23. DEFINITION
• “In probability theory, a Markov model
is a stochastic model used to model
randomly changing systems where it is
assumed that future states depend
only on the current state, not on the
events that occurred before it (that is, it
assumes the Markov property)”. -
Wikipedia @LIZZIEPIKA
27. • Transition data: probability of transitioning to a new state from the current one
• Emission data: probability of transitioning to an observed state from a hidden one
• Initial state data: the initial (prior) probability of each state
@LIZZIEPIKA
46. Let’s get down to death
Hope he doesn’t see right through
Mister, I’ll make a great typhoon
Be a man out of a man out of a great typhoon
Be a spineless, pale, pathetic man
And you find your center,
you can bet before we’re through
Mister I’ll make a man With all the moon
Time is racing towards us till the dark side of the war
So pack up, go home, you’re through me daughters
@LIZZIEPIKA
47. @LIZZIEPIKA
Let’s get down to those who knew how to those who knew
me daughters, when I really wish that I a spineless, pale, pathetic lot
And you might survive You are sure to business,
to win You’re the strength of you
Tranquil as the dark side of you can bet before we’re through Mister,
I’ll make a man We must be swift as the strength of a man
With all the force of you I’m never gonna catch a breath
Say goodbye to those who knew how to business, to swim
Be a forest but on fire Mysterious as the Huns
Be a man With all the force of you?
Twilio evangelist in SF, first time in London; I love Swift, Disney, and Pikachu, so send recs! Also, here's a warning: I could be more serious, and so could this talk.
Agenda
Fun example, like forced fun, like when your boss makes you do icebreakers. A metaphor, almost. Raise your hand if you've been Rickrolled. If you're not raising your hand, this is how it works: "Never Gonna Give You Up" is a meme song.
ANYWAYS so we have this great, weird, funny example. But it’s time to get professional here.
It's a fairly repetitive sentence. And with repetition comes greater potential for prediction.
Each word has a color. Tokens can repeat; keys cannot. Every occurrence of the same key shares a color: "never" is purple and "you" is pink. These words can also be states. The states are generated from some input corpus of text, in this case the sentence.
Here, each key is paired with an array of the possible tokens that can follow it.
Now, each state can point to another state. One state hopping to another state is called a transition. How can we best predict what word comes next? With probabilities.
The probability of one state hopping to another is called a transition probability.
Let’s calculate those. So what do we know?
10 total words and 7 unique words. Tokens can repeat (there can be multiple of them); keys cannot.
Weighted distribution = the percentage chance that a key will appear. It is the number of times one token shows up divided by the total number of tokens.
Histograms are a simple way to represent weighted distributions.
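As a quick sketch in plain Swift (the variable names are my own, not from the talk's code), the weighted distribution comes straight from a token count:

```swift
// Compute the weighted distribution for the example sentence:
// count each token, then divide by the total number of tokens.
let sentence = "never gonna give you up never gonna let you down"
let tokens = sentence.split(separator: " ").map(String.init)

var counts: [String: Int] = [:]
for token in tokens {
    counts[token, default: 0] += 1
}

let total = Double(tokens.count)
let distribution = counts.mapValues { Double($0) / total }
// "you" appears 2 times out of 10 tokens → 0.2
```

The `counts` dictionary is exactly the histogram from the slide; dividing by the total turns it into percentages.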
We’re going to start off with a fun example. A metaphor, almost. How many of you know about rickrolling? Or maybe you know about Rick Astley?
Each word represents a key, or a state, and the arrows point to potential states that can follow it
Let’s revisit the distribution of keys and redo it as probabilities
Again, each word represents a key, or a state, and the arrows point to potential states that can follow it. "Gonna" will always follow "never", so it's at 1.00. 50% of the time "let" will follow "gonna", and the other 50% of the time "give" will follow "gonna". "Never" will always follow "up", so it's at 100%. "Down" or "up" will follow "you", so both are at 50%. Etc.! Given that the current word (or state) is "never", what is the probability that the next one will be "gonna"? 100%!! You can read the slides; you get it. This is easy math because it's such a short, repetitive sentence. Don't worry, it will get more complicated.
One more time, let’s walk through the Markov chain and view it differently.
- Let’s see the probability of each transition next to their corresponding tokens, or states
- Here, so we’ve added the weighted distributions so that each arrow has the probability that it will be selected as the transition to the next state.
MMaaS: like SaaS. It's going to be the next big buzzword, mark my words.
Besides word prediction, you can do more general prediction or estimation
Learn statistics of sequential data.
Recognize patterns
And more! With Markov models
What is the motivation for this talk? We finish each other’s …
We finish each other’s…(sandwiches/sentences)
What is the motivation for this talk? Ultimately, it’s word prediction.
The most basic Markov model: a state machine where each transition has a corresponding probability. It's a mathematical system that models the state of a system with a random variable that changes through time. In other words, the probability of transitioning to any particular state depends only on the current state and time elapsed. The state space, or set of all possible states, can be anything: letters, numbers, weather conditions, baseball scores, or stock performances. Or words, as we've seen.
In addition to prediction and AI, they have applications in data compression, computational molecular biology, and more.
Ex: predicting tomorrow's weather by looking only at today's weather, not yesterday's. Predicting words, of course. Anyone have any other examples?
Predict whether or not a sports team will win tomorrow based on today’s game
Memoryless: the probability distribution for the next state is independent of the history; only the current state matters.
What does this mean? A stochastic model is a way to estimate probability distributions of potential outcomes using one or more random variables as input. A Markov model is more complex than a Markov chain: it models sequences with discrete states.
What are Hidden Markov models, or HMMs? They predict a sequence of unknown (hidden) variables from a set of observed variables.
A Markov chain where the states are only partially observable. These internal state changes are invisible (hidden) to someone viewing them from outside the system
Each state (of which are all finite) can generate a set of external events or output which can be observed. These are called observations.
Markov process holds: the current state is always dependent on the immediate previous state only, hence the name
Let's say you see Baymax go get someone a bandaid. That is observable: you can see that physical action happen.
However, there is something hidden: you don’t know what illness or injury Baymax is trying to treat. The illness or injury is the hidden variable.
Another example is predicting the weather (hidden) based on someone’s clothes they wear inside
Transition data — the probability of transitioning to a new state conditioned on a present state.
Emission data — the probability of transitioning to an observed state conditioned on a hidden state.
Initial state information — This can also be looked at as the prior probability.
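These three ingredients can be written down as a plain Swift struct. This is an illustrative sketch only; the weather/clothing states and all the numbers are made up for the example, not taken from any library:

```swift
// A minimal hidden Markov model parameterization: hidden weather
// states, observed clothing choices. All values are illustrative.
struct HiddenMarkovModel {
    let hiddenStates: [String]     // what we can't observe directly
    let observations: [String]     // what we can observe
    let initial: [Double]          // prior probability of each hidden state
    let transition: [[Double]]     // transition[i][j]: hidden state i → hidden state j
    let emission: [[Double]]       // emission[i][k]: hidden state i emits observation k
}

let weatherHMM = HiddenMarkovModel(
    hiddenStates: ["rainy", "sunny"],
    observations: ["coat", "t-shirt"],
    initial: [0.5, 0.5],
    transition: [[0.7, 0.3],
                 [0.4, 0.6]],
    emission: [[0.9, 0.1],
               [0.2, 0.8]]
)
// Each row of `transition` and `emission` sums to 1.
```

The emission matrix is what makes it "hidden": we never see the weather state directly, only the clothing it tends to produce.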
Andrey Markov was a Russian mathematician who, in the early 1900s, disagreed with the commonly held belief that independence was needed for the weak law of large numbers to hold. He published his first paper on Markov chains in 1906, showing that under certain conditions the average outcomes of a Markov chain converge to a series of fixed values, thus proving a weak law of large numbers without assuming independence. He also used his chains to study vowel distribution and proved a central limit theorem for such chains. And now he has a slide in my first iOS conference talk, right here at iOSCon!
Back to the present. How do we generate text? By walking through a chain or model and outputting the word for each state based on probability. Let’s do this with code.
Here, let’s make a Word class which is made up of a String for whatever the word is as well as a set of transitions to other words. The transitions are represented by an array of Words that follow it. If a transition appears more than one time, each duplicate still stays in the array to increase the chances that the next word would be that transition that appears multiple times. Then we choose a random element with the corresponding weight by selecting a random index in the Transitions array.
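A sketch of that Word class might look like this (the talk's actual code may differ in details; `randomNext()` follows the description above):

```swift
// Each Word keeps the words that followed it in the corpus. Duplicates
// stay in the array, so a follower that appeared twice is twice as
// likely to be picked by a uniformly random selection.
class Word {
    let text: String?            // nil marks the start/end of the chain
    var transitions: [Word] = []

    init(text: String?) {
        self.text = text
    }

    // Choose a random element; the duplicates act as weights.
    func randomNext() -> Word? {
        return transitions.randomElement()
    }
}
```

Leaving duplicates in `transitions` is a simple alternative to storing explicit counts: the array itself is the weighted distribution.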
Next, a Chain class will help manage all of the Words in a chain so memory isn’t leaked. Any circular references are removed in `deinit`.
Next, let's see how words are added to the chain. This method accepts an array of Strings, each of which contains a word (or any other unit the caller wants to work with). If there aren't actually any words, we return early. We want to iterate over pairs of words, where the second element in the pair is the word that follows the first element.
For example, in our example sentence, “Never gonna give you up, never gonna let you down” we want to iterate over these. See? It was important to have a start and end.
To accommodate the nil values used to represent the start and end, our array's contents need to be optional.
Then we make two arrays: one with nil prepended and one with nil appended. We zip them together, and for each pair we access the corresponding Word objects with a handy-dandy helper function. Now all we have to do is add secondWord into the transitions of firstWord.
We're going to skip the helper method for time because it doesn't contribute to how we represent Markov models. Now we generate the words one by one, accumulating them in the result String array. We use an endless loop (`while true`) because the exit condition is difficult to express as a loop condition: we stop when the transitions run out.
We then fetch the Word instance for the last string in result, which cleanly handles the first iteration where result is empty: `last` produces nil, which indicates the start word: `let currentWord = word(result.last)`. The `word` helper checks whether there's already a Word for a given string, so the rest of the code doesn't have to.
We get a random transition with `let nextWord = currentWord.randomNext()`. If that isn't the end, it’s added to result. Else, the loop ends and the result is returned.
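Putting the walk-through together, here is a self-contained sketch of the Chain (the Word class is included so it compiles on its own; the helper bodies are my reconstruction, not the talk's verbatim code):

```swift
class Word {
    let text: String?                  // nil marks the start/end
    var transitions: [Word] = []
    init(text: String?) { self.text = text }
    func randomNext() -> Word? { return transitions.randomElement() }
}

class Chain {
    private var words: [String?: Word] = [:]

    // The helper the talk skips: fetch or create the Word for a string,
    // so the rest of the code never checks for missing entries.
    private func word(_ text: String?) -> Word {
        if let existing = words[text] { return existing }
        let new = Word(text: text)
        words[text] = new
        return new
    }

    func add(_ input: [String]) {
        guard !input.isEmpty else { return }
        let optionals: [String?] = input
        let firsts: [String?] = [nil] + optionals    // nil prepended: start
        let seconds: [String?] = optionals + [nil]   // nil appended: end
        // Zip the two arrays so each word is paired with its follower.
        for (first, second) in zip(firsts, seconds) {
            word(first).transitions.append(word(second))
        }
    }

    func generate() -> [String] {
        var result: [String] = []
        while true {
            // `result.last` is nil on the first pass, which maps to the
            // start-of-chain Word.
            let currentWord = word(result.last)
            guard let next = currentWord.randomNext(), let text = next.text else {
                return result    // hit the end marker (or a dead end)
            }
            result.append(text)
        }
    }

    // Break the Word-to-Word reference cycles so memory isn't leaked.
    deinit {
        for word in words.values {
            word.transitions.removeAll()
        }
    }
}
```

With a deterministic corpus (each word has exactly one follower), `generate()` simply replays the input; with a repetitive one like our sentence, the 50/50 transitions produce new variations.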
Initially I used "thank u, next" as the example song sentence because it also has a lot of repetition, but "Never Gonna Give You Up" has a special place in my heart, and my head, so I had to use it, and we're coming back to it. That song gets stuck in my head, so I hope it gets stuck in yours. Similarly, as someone who was an avid hackathon attendee and now attends as a sponsor for a SaaS company, I have a special place in my heart for external APIs and libraries. Which is the second way we'll look at training a Markov model in Swift.
Here, we create a Markov model to train and pass it an array of the tokens from our sentence. We calculate a future state by calling the library's built-in method `next`. With this library we have three possible decision process options: predict, random, and weightedRandom. The default is predict, since we want to predict the next state; a random state would not be the same!
You can also print out the weighted distributions of each token with print(markovModel). Again, each word represents a key, or a state, and the arrows point to potential states that can follow it. This top row shows percentages of each state pointing to the next word.
“let” will always point to “you”. “gonna” will either point to “let” or “give”. … “Never” will always point to “gonna”
We'll make a .txt file with "Let's get down to business" from the Disney movie Mulan, which has been a running theme throughout this talk. So I hope you've guessed that we would either use it, or that I like Disney, or that it'd appear later in the talk.
Read the text file: our corpus. Whatever data a Model is trained on is called a corpus. It’s the input our Markov model is trained on.
Helper method takes in the first word of the text file we want to read, the length of the file, and "chain", which is the start of the generated text.
- Calculate the next state by calling the built-in method `next` on the given word we are on in the text. This method loops through until the end of the text, generating the new words that will comprise our song based on previous words.
Alas, this is not enough. Before the text file can be read, the input must be cleaned. BuildWords is called in this next method that reads the file from the Resources Xcode playground directory, replacing new lines with spaces. It then calls the MarkovModel library's process method to train the model on the text.
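The reading-and-cleaning step might look like this in plain Swift (a sketch: the function name and URL-based file handling are my own; the talk reads from the Xcode playground's Resources directory and then hands the text to the MarkovModel library's `process` method):

```swift
import Foundation

// Load the corpus: read the lyrics file, replace new lines with
// spaces, and split the result into tokens for training.
func loadCorpus(from url: URL) throws -> [String] {
    let raw = try String(contentsOf: url, encoding: .utf8)
    let cleaned = raw.replacingOccurrences(of: "\n", with: " ")
    return cleaned.split(separator: " ").map(String.init)
}
```

`split(separator:)` drops empty subsequences by default, so runs of whitespace left over from blank lines don't become empty tokens.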
The model trains and works on the text file all at once in the above .process closure and then calls buildText to generate a new song on the trained model.
This is my favorite line from the multiple songs I generated using "Let's Get Down to Business" as the input to train the Markov model, shown as the line was printed in Xcode. I trained the model multiple times, and it generated a longer song with more lines than this one. This can look prettier on a slide, and while we're at it, let's also see ONE OF the complete generated songs.
This was the text or content of the printed output to the console, I just made it look pretty for this talk. I hope you like the animations and shapes and colors, I think they really reflect the text. The content. The output that was generated by the Markov model.
Condensed version of one of many generated songs
You can see how it can be silly, but also how some of it makes sense.
“You’re the strength of you” honestly that could be a motivational speaker line. “All the force of you” also sounds fairly inspiring, and I really could “never catch my breath” with all the force of you. I’m not sure how I would be a forest but on fire, but indeed, a forest can be on fire!
We used a sample sentence (rickrolling); you could also use a similar repetitive sentence ("thank u, next", etc.) to learn vocabulary like keys (unique words, which we used to represent states), tokens (words that can repeat), and weighted distributions.
We went over state transitions of said weighted distributions and how Markov models can be used to represent them.
We went over Markov chains, Markov models, hidden Markov models.
We also trained Markov models in Swift first with plain Swift code and then again with a library to generate lyrics, or text, in Swift. It was weird and was both sensible and nonsensical at the same time.
Then you, the audience, created the corpus of text that we trained the model on.
Will a current bill be passed in Parliament based on past bills? See what I did there? That’s a UK thing. I’m very proud of it.
The algorithm Google uses to display search results is a modified (read: more advanced) form of the Markov chain algorithm. The higher the "fixed probability" of arriving at a certain webpage, the higher its PageRank. This is because a higher fixed probability implies the page has a lot of incoming links from other pages, and Google assumes that if a page has a lot of incoming links, it must be valuable. More incoming links, more valuable.
Marketing: will a consumer use a different product next month? What is the probability of someone switching to Kellogg cereal next month if they are eating Quaker cereal this month?
And more! The sky is the limit!
Resources: How can you learn more about what to do next with Markov models?
This book by Olivier Cappé: Inference in Hidden Markov Models. It's more theoretical than this talk. Yes, I included the link in the talk slides.
An interactive explanation of Markov models by Uber SWE Victor Powell. I had a lot of fun playing with the weights, speeds, and transitions, and trying different values. I don't know about you, but I learn best by doing and love interactive tutorials.
College lectures. I was first introduced to Markov models in my Speech Synthesis and Recognition elective at Haverford College, where I recently learned fellow speaker Daniel Steinberg attended. There are many PDFs from lectures online I recommend.