This document discusses building a brand tracking product using Bayesian methods. It explains how the product uses multilevel regression and poststratification (MRP) to more accurately estimate brand awareness for small target audiences by using all available survey data rather than just a small subsample. It also describes how Bayesian models allow the product to quantify the probability of changes in brand metrics, learn from prior survey information, and run in a production environment using PyMC3 with variational inference for fast, stable results.
5. But we discovered traditional brand tracking wasn't working very well.
Marketeers are interested in niche audiences! Most brands do not actively market to everyone. Rather, they have smaller target groups they focus on, be it in terms of age, gender, location, interest, or other demographic or psychographic criteria.
6. And with these traditional brand trackers, if one wants to boil it down to one of these target groups, the sample sizes become even smaller, the margin of error skyrockets out of control, and the insights become entirely unactionable.
Not good.
7. We concluded that we needed to fundamentally rethink brand tracking if we want to truly solve brands' problems and help them grow.
Their problem is straightforward: they need reliable insights to understand how their brand is performing in the real world, for their various audiences, and how this is changing over time.
So we chose to use MRP (multilevel regression and poststratification).
8. To explain how MRP works, we first need to compare it with a more traditional way of doing things. Let's take the example of measuring opinion in a very specific, small target audience.
Imagine a brand that wants to run a campaign targeting young females who use Twitter and also like American football. They want to find out what this specific group of people thinks of their brand.
The traditional brand tracker creates a sample of 1,000+ respondents and then zooms in on young females who use Twitter and like American football. In the end, there are only 20 respondents who fit the target audience. The brand tracker takes the average opinion of this group, but because the number of respondents is so small, the margin of error is large.
10. Latana is able to fix this problem.
Instead of narrowing the sample down to just 20 respondents, MRP estimates the target audience's opinion by using ALL the information available in the 1,000+ respondent sample. This means it looks at ALL the young people, ALL the females, ALL the people who use Twitter, and ALL the people who like American football. Because we use all the information from the sample, the estimate for a small group is much more reliable.
Therefore, the magic potion isn't really magic at all. It's as simple as this: instead of focusing on a tiny group in a target audience, MRP builds a model. This model is used to calculate the opinion of a brand by looking at the respondents' individual characteristics and how they relate to the brand.
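The second half of MRP, poststratification, can be sketched in a few lines: the model's per-cell predictions are weighted by each cell's share of the target population. The lookup tables below are hypothetical illustrations, not Latana's actual numbers.

```python
# Toy poststratification sketch (the "P" in MRP). Assumes we already have a
# model that returns P(person knows the brand) from their characteristics;
# here a hypothetical table of fitted cell probabilities stands in for it.
cell_probability = {
    # (age_group, uses_twitter, likes_football): modelled awareness
    ("18-25", True,  True):  0.12,
    ("18-25", True,  False): 0.09,
    ("18-25", False, True):  0.07,
    ("18-25", False, False): 0.05,
}

# Share of each cell in the target audience, e.g. taken from census data
cell_share = {
    ("18-25", True,  True):  0.10,
    ("18-25", True,  False): 0.30,
    ("18-25", False, True):  0.20,
    ("18-25", False, False): 0.40,
}

# Poststratified estimate: cell predictions weighted by population shares
estimate = sum(cell_probability[c] * cell_share[c] for c in cell_share)
print(f"estimated awareness: {estimate:.3f}")
```

Because the per-cell probabilities are fitted on the whole sample, even a cell with only a handful of respondents gets a stable prediction.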
12. So, essentially, MRP can be used as a model-driven approach to brand tracking.
While the method was originally designed around a hierarchical Bayesian model, one is free to choose any binary classifier that returns some estimate of the probability that a person knows a brand.
So if you use Python, you could pick your favourite library, scikit-learn, and try all kinds of classifiers.
We did that in the beginning, just used a simple logistic regression, and were good to go!
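A minimal sketch of that "any binary classifier" idea with scikit-learn; the feature names and the simulated data are made up for illustration, not our actual survey schema.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Hypothetical respondent features: age (scaled), uses_twitter, likes_football
X = rng.random((500, 3))
# Hypothetical label: 1 if the respondent knows the brand
y = (X @ np.array([0.5, 1.0, 0.8]) + rng.normal(0, 0.3, 500) > 1.2).astype(int)

clf = LogisticRegression().fit(X, y)

# What MRP needs from the classifier: an estimated probability
# that a given person knows the brand
p_knows = clf.predict_proba(X[:5])[:, 1]
print(p_knows)
```

Any model with a `predict_proba`-style output would slot in the same way.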
13. Introducing Latana
The first brand tracking tool to use data
science to ensure reliable and accurate
brand insights.
16. #1: Learning from prior information
Blinkist is an up-and-coming startup that has built a reading app that condenses non-fiction books into 15-minute audio summaries. Latana monitored Blinkist's levels of brand awareness in Germany before, during and after Blinkist's TV campaign by surveying 2,000 people. They then used the MRP model to predict brand awareness levels for hundreds of niche target audiences.
17. #1: Learning from prior information
So, how does using a Bayesian model with prior information help us?
What we soon discovered is that the real world isn't always as rosy as it seems, and sometimes even single characteristics are hard to reach. One may end up collecting a sample of 2,000 people, but only 200 of those fall into a certain category.
18. #1: Learning from prior information
In the Blinkist test, this was the case for people between 56 and 65 years old, who are on average less tech-savvy and thus less likely to fill out our mobile surveys.
To estimate brand awareness for the small group of respondents aged 56-65 (approximately 11% of the sample, or 220 people), using prior information from past surveys is crucial. In the graph below, it can be seen that if prior information is not used, the brand awareness estimate for this group is essentially the same as the overall brand awareness of 7.5%.
19. #1: Learning from prior information
This happens because the MRP model doesn't have enough information from respondents aged 56-65 in the sample to find any differences between them and the rest of the sample.
However, if the MRP model is allowed to use information from the past (i.e. the survey data collected before and during the campaign), this helps the model find a stronger signal. With prior information, the result changes: the MRP model estimates that brand awareness for 56-65-year-olds is 5.5%.
Therefore, without prior information, MRP would not be able to detect a difference between the general population and 56-65-year-olds and would simply assign the niche audience the overall average of 7.5%, even if the full sample of 2,000 respondents was used.
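The stabilising effect of prior information can be illustrated with a tiny conjugate Beta-Binomial sketch. The counts below are hypothetical and the real model is MRP, not a plain Beta-Binomial, but the shrinkage mechanism is the same: past waves act like extra pseudo-observations for the small group.

```python
def posterior_mean(successes, n, prior_a=1.0, prior_b=1.0):
    """Posterior mean of a Beta(prior_a, prior_b) prior after observing
    `successes` out of `n` Bernoulli trials (aware / not aware)."""
    return (prior_a + successes) / (prior_a + prior_b + n)

n = 220            # respondents aged 56-65 in the current wave
successes = 15     # hypothetical number who know the brand

# Flat prior: the estimate leans entirely on the noisy small sample
flat = posterior_mean(successes, n)

# Informative prior built from past waves: equivalent to having already
# seen ~55 aware respondents out of ~1,000 (5.5% awareness)
informed = posterior_mean(successes, n, prior_a=55.0, prior_b=945.0)

print(f"flat prior:     {flat:.3f}")
print(f"informed prior: {informed:.3f}")
```

The informed estimate is pulled toward the 5.5% level seen in past waves instead of floating with the noise of 220 respondents.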
20. #1: Learning from prior information
In this case, the "low education" niche audience is defined as people who don't have higher education. Again, we see a similar pattern to the previous example.
The model that uses prior information detects a lower level of brand awareness even with small sample sizes. On the flip side, the model that doesn't use prior information only starts to detect the lower awareness level at a sample size of 800 respondents or more.
21. #2: Uncertainty quantification
Let's assume one of our clients runs a marketing campaign between October and December.
Then in December they look at the Latana dashboard and see that brand awareness increased in some niche audience from 5% to 8%.
Now the question is: how likely is that increase to be real?
Well, in a frequentist world one would just come up with some t-test or bootstrap confidence bounds and then give a YES or NO. So 'YES', this change happened and isn't just random noise, or 'NO', it did not.
Well, we figured out that marketeers don't really like showing their boss that their campaign actually had no effect.
So is there a better way to frame that?
22. #2: Uncertainty quantification
Well, with a Bayesian model one always gets the full posterior distribution of estimates. This is nice, since then one can just compare the probability masses.
23. #2: Uncertainty quantification
So if you, for example, have two estimates, one before and one after the campaign, just look at the overlap of their posteriors and you will be able to say:
"With a probability of 80%, our campaign had a positive effect on the awareness of our brand."
Which also means that if they mess up, they would still get some weak change probability of whatever 30-60%, which is better than a definite NO.
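Computing such a change probability from two sets of posterior draws is a one-liner. The Beta posteriors below are stand-ins for illustration (e.g. roughly 5% and 8% awareness out of 1,000 respondents each); in practice the draws would come from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws for awareness before and after the campaign
before = rng.beta(1 + 50, 1 + 950, size=10_000)
after = rng.beta(1 + 80, 1 + 920, size=10_000)

# "Change probability": the posterior probability that awareness increased,
# i.e. the fraction of paired draws where the after-estimate is higher
p_increase = (after > before).mean()
print(f"P(awareness increased) = {p_increase:.2f}")
```

This is the number the dashboard can report instead of a binary significant / not-significant verdict.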
24. #2: Uncertainty quantification
So what does this look like in our dashboard? Whenever you want to compare two estimates, the dashboard also shows you the change probability with colour coding. This is something our clients find really helpful.
25. Using Bayesian models in production
The results looked really good, but now to the hands-on part.
For coding the Bayesian model we used PyMC3 and started off with full Bayesian inference; the most advanced algorithm currently is Hamiltonian Monte Carlo (NUTS). The advantage is that it covers all complex posterior distributions, even when they are multimodal and so on.
However, this solver was highly unstable for our models, took several hours, and was just not practicable in production. There is also another, much lighter approach: approximate Bayesian inference (variational inference).
This algorithm basically assumes a smooth family of distributions and then just finds the member that best fits the data. It is stable and fast, but the disadvantage is that it does not cover complex distributions.
We ran tests comparing the two and chose the second one because it gave us much better results.
27. Using Bayesian models in production
What may be interesting for people here is how that looks in production.
Well, it is actually not so different from using other machine learning libraries in production. We wrote our model in PyMC3, packed it into a Django web service, and deployed the web service on AWS.
Now our survey engine generates survey responses in real time and writes them to our database; our web service picks them up, calculates the results in a reasonable time, and writes them back to the database, and the Latana dashboard updates from there.
28. Summary
Bayesian methods added a whole new layer of value to our product:
Quantify the probability of change in brand KPIs
Use prior information to uncover hard-to-reach audiences
Bayesian methods in production are no magic
Editor's Notes
As a young company we also wanted to enter this market and launched a product called BrandTracker in 2017.
With BrandTracker, we offered a leaner, lower-cost version that focused on a set of standardised KPIs (around 5-10) and smaller sample sizes (usually 500). We delivered insights to our clients on a regular basis, usually monthly or quarterly, through an easy-to-use dashboard.
Some aspects of BrandTracker were received really well by our clients. The dashboard was intuitive and a big improvement over the industry-typical PowerPoint presentations or PDF documents.
Also, the speed of BrandTracker was a big plus. We were able to deliver results within 1-2 weeks, in an industry where clients often wait months to get the first insights. Lastly, our low-touch approach allowed us to keep the prices low. Our clients were surprised how much value they could get for their money, especially those that had previous experience with brand tracking.
After countless conversations with our clients, we concluded that we needed to fundamentally rethink our approach if we want to truly solve their problems and help them build a thriving brand. Their problem is straightforward: they need reliable insights to understand how their brand is performing in the real world, for their various audiences, and how this is changing over time.
After months of conceptualising and prototyping, we concluded that a recent innovation in data science, multilevel regression and poststratification (MRP), could be a tool to solve this problem. It recently gained popularity in election prediction with great success, so we decided to adapt and further develop it for the benefit of consumer brands.
Our engineers developed an even fancier and more intuitive dashboard. This time we used MRP in the backend.
However, back to the premise of the talk: why would we want to switch from a working product that uses an easy-to-understand model to a far more complicated Bayesian framework?
When we talk about Bayesian methods in an academic context, two big advantages are usually mentioned:
Using prior information in your model
Having a probabilistic way to quantify uncertainty in our estimates
But how does this add value to our product?
Let's focus on the first one for now and take one of our clients, Blinkist, as a showcase.