SEO split tests
you should be running
Will Critchlow / @willcritchlow
BrightonSEO
Robert Liston famously:
Carried out an operation with
a 300% mortality rate
Via: reddit
He amputated:
1. The patient’s leg
2. His assistant’s fingers
Both of them died,
along with a bystander
Before germ theory, 25-50% of
patients died from infections
(Speed also used to be a prized
surgical skill pre-anaesthetic)
It wasn’t always confidence-inspiring
Liston could amputate a leg
in 2 ½ minutes
(but in his enthusiasm he once cut off the patient’s testicles too)
“Welcome.
I’ll be your doctor today.”
Confidence-inspiring stuff
The “Liston” of site migrations
Step 1: fail to put redirects in place
Step 2: rel=canonical every page to the homepage
Good for the patient / Bad for the patient
Accidental / Deliberate
Mercury for syphilis
Not washing hands
Garlic + Onion
Of course a lot of deliberate things were
neither harmful nor beneficial
Cargo cult:
During WW2, Pacific islanders who had never seen
manufactured equipment saw modern military
planes bring cargo to their remote islands.
After the war, cults developed that tried to recreate
the conditions that “brought” the planes
(runways, control towers, military uniforms)
without understanding what had really happened.
Read Richard Feynman’s speech
Do we have our own cargo cults?
Do you recommend changing h2 to h1?
Do you have a good reason why?
Even if it does help, does it help enough to be worth it?
Let’s start washing our hands
The scientific method
Step 1:
Generate hypotheses
In medicine, old wives’ tales are a great place to start. Things that appear to work, even though we have
no idea why, are a good source of hypotheses.
A great example is this 1,000-year-old “spell” that included garlic, onion and cow’s stomach, and turned
out to kill MRSA.
I guess the SEO equivalent is to ask the old timers.
In all areas of science, you are at an advantage if you can reason from first principles. Richard Feynman
famously used to draw what became known as “Feynman diagrams” to understand sub-atomic
interactions through thought experiments alone.
The SEO equivalent is to stay abreast of information retrieval and ML papers and formulate hypotheses
based on an understanding of how the algorithm likely works.
Finally, you can go mining the data.
The obvious SEO equivalent is the various correlation studies into ranking factors.
In both medicine and SEO, you obviously have to be wary of spurious correlations. Blindly mining data
can get arbitrarily high correlations (the example above has a correlation of 0.993!).
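A minimal sketch of why blind mining is risky: compare one short series against enough unrelated candidates and some will correlate strongly by pure chance. Everything here is random noise, and the function names (`pearson`, `best_spurious_match`) are made up for illustration.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def best_spurious_match(n_candidates=2000, n_points=10, seed=0):
    """Correlate one random series against many unrelated random series
    and return the strongest absolute correlation found."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print(f"strongest correlation with pure noise: {best_spurious_match():.3f}")
```

The more candidates you mine and the fewer data points each has, the higher the best "match" tends to be, which is why a correlation study is a source of hypotheses rather than a result.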
The scientific method
Step 2:
Try things in the lab
Problem: results may not hold, or may come with new side effects
In the SEO space, this is work like that done by
IMEC labs.
It involves attempting to run controlled
experiments on test domains and / or with
volunteer participants. The outcomes measured
are not normally rankings or traffic that the
participants care about.
Potential pitfalls
What works in tests may not work in the real world
Source: National Institutes of Health
Do you recommend http → https migrations?
All else being equal, secure is better. All else is never equal.
Side effects may include ranking fluctuations, traffic drops, difficult conversations with your boss.
Side effects
may include headache, nausea, vomiting, death, dizziness, dysentery, cardiac
arrhythmia, mild heart explosions, varicose veins, darkened stool, darkened
soul, lycanthropy, trucanthropy, more vomiting, arteriosclerosis,
hemorrhoids, mild discomfort, vampirism, spontaneous dental hydroplosion,
sugar high, even more vomiting, and mild rash.
The scientific method
Step 3:
Gold standard scientific trials
TL;DR scurvy bad, science hard
You should read the story of one of the first controlled scientific experiments that proved lemons could
cure scurvy (in 1747!). It is the incredible story of how the discovery supported British naval supremacy, and
then how compounding errors involving the colonial supply-chain, faster steam-powered ships, and polar
bear offal led to the loss of the knowledge, the death of polar explorers, and the eventual rediscovery of
vitamin C.
Source: idlewords
How SEO split tests work
You might have seen @TomAnthonySEO tweeting
about the platform we’ve built to make this easy
Excuse a brief diversion into geeky
details
Instead of comparing the performance of the control pages directly with the variant pages, we build a
forecast of what’s called the counterfactual, which is an estimate of what would have happened if we hadn’t
made the change. We use the control group to make a counterfactual forecast that takes into account
seasonality and site-wide changes.
The black line on the chart above is the actual organic traffic to the variant pages. The blue line is the
counterfactual.
More: Distilled blog post and free forecasting tool
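To make the counterfactual idea concrete, here is a deliberately simplified sketch. It swaps the Bayesian structural time series model from the further reading for a plain least-squares line: learn how variant traffic relates to control traffic before the test, then use control traffic during the test to forecast what the variant pages would have done. The function names and all the numbers are made up for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def counterfactual(control_pre, variant_pre, control_test):
    """Forecast what the variant pages would have done without the change,
    using the relationship to the control pages learned before the test."""
    intercept, slope = fit_line(control_pre, variant_pre)
    return [intercept + slope * c for c in control_test]


# Hypothetical daily organic sessions (made-up numbers for illustration).
control_pre = [100, 120, 90, 110, 130]
variant_pre = [50, 60, 45, 55, 65]   # variant bucket runs at half of control
control_test = [140, 100, 120]       # site-wide swings during the test
variant_test = [77, 55, 66]          # actual variant traffic after the change

forecast = counterfactual(control_pre, variant_pre, control_test)
uplift = [actual - expected for actual, expected in zip(variant_test, forecast)]
print(forecast)
print(uplift)
```

Because the forecast follows the control pages, a site-wide dip or seasonal peak moves the counterfactual too, so it does not get mistaken for an effect of the change.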
It’s easiest to analyse the results by looking at the cumulative difference over time between the actual
organic traffic and the counterfactual.
The pale blue area is the 95% confidence interval.
We can see a (statistically) zero effect for an initial time while Google crawls and indexes the test,
followed by steady growth. A couple of weeks in, the confidence interval goes above zero and we have a
winning test.
More: Distilled blog
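A rough sketch of the cumulative view, under a loudly stated assumption: daily forecast errors are treated as independent with a known standard deviation, so the band is a simple normal approximation and not the model-based interval described in the further reading. Names and numbers are hypothetical.

```python
from itertools import accumulate
from math import sqrt


def cumulative_effect(actual, counterfactual, daily_sigma, z=1.96):
    """Return (cumulative difference, lower bound, upper bound) per day.

    Assumes daily forecast errors are independent with standard deviation
    daily_sigma, so the error of a k-day sum grows like sqrt(k).
    """
    diffs = [a - c for a, c in zip(actual, counterfactual)]
    rows = []
    for day, total in enumerate(accumulate(diffs), start=1):
        margin = z * daily_sigma * sqrt(day)
        rows.append((total, total - margin, total + margin))
    return rows


def first_winning_day(rows):
    """First day (1-based) on which the whole interval sits above zero."""
    for day, (_, lower, _) in enumerate(rows, start=1):
        if lower > 0:
            return day
    return None


# Hypothetical numbers: no effect at first, then a steady daily uplift.
actual = [100, 101, 99, 110, 112, 111, 113, 112]
expected = [100, 100, 100, 100, 100, 100, 100, 100]
rows = cumulative_effect(actual, expected, daily_sigma=3.0)
print(first_winning_day(rows))
```

The shape matches the description above: the interval straddles zero while nothing has been crawled, and the test is only called once the lower bound clears zero.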
Hashtag winning
Further reading for those interested:
● Predicting the present with Bayesian structural time series [PDF]
● Inferring causal impact using Bayesian structural time series [PDF]
● CausalImpact R package
● Finding the ROI of title tag changes
More: Distilled blog
What should you be testing?
Add
structured
data
One of the easiest tests to run is
the addition of structured data -
we recommend schema.org via
JSON-LD.
smokymountains.com
We got one of the fastest and clearest uplifts we have seen so far with
the addition of structured data to detail pages. This chart shows the
uplift from adding location-based data to individual property pages.
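For anyone who hasn't produced JSON-LD before, a minimal sketch of generating a schema.org block for a property page. The deck doesn't show the markup used in this test, so the choice of `LodgingBusiness` and every value below are placeholders, and `json_ld_script` is a hypothetical helper.

```python
import json


def json_ld_script(data):
    """Wrap a schema.org dictionary in a JSON-LD <script> tag."""
    # Escape "</" so a string value cannot close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical property listing; every value below is a placeholder.
listing = {
    "@context": "https://schema.org",
    "@type": "LodgingBusiness",
    "name": "Example Cabin",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Gatlinburg",
        "addressRegion": "TN",
        "addressCountry": "US",
    },
    "geo": {"@type": "GeoCoordinates", "latitude": 35.71, "longitude": -83.51},
}

print(json_ld_script(listing))
```

Because the block sits in its own script tag, it can be added to the variant pages without touching the visible template, which is part of what makes this an easy first test.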
Improve
your organic
“adverts”
Advert testing plays a huge part
in PPC. Looking at typical meta
descriptions, it appears it’s
rarely a priority in organic.
More: Distilled blog
This is the chart I showed you earlier when I was describing the statistics.
It’s actually an uplift from improved clickthrough rate. We didn’t detect
an accompanying ranking improvement during this experiment.
Make your
site mobile
friendly
I’ve spent a lot of time trying to
persuade people to do this
without data to back me up.
Now I’m going to carry on with
data.
More: @TomAnthonySEO
This chart shows the uplift from making a bunch of category pages
mobile-friendly (with some simple responsiveness) on a holiday site.
Just to help prove that these are real uplifts, we ran a “null” test
designed to have no impact
...and there are tons of tests where we
don’t have pretty charts we can share yet
Tabbed
versus flat
We know Google in particular is
paying more attention to CSS
and JS. How much difference
does it make if content is visible
initially on page load?
Additional
content
You might want to test both
adding and removing additional
content on category pages.
This would test the benefit of
additional text vs. increased
focus and possibly-improved
usage metrics.
Breadcrumbs
How much difference does it
make if you add breadcrumbs to
product pages?
Note: this introduces the
complexity of testing internal
linking. I’ll come back to this.
Canonicals
vs. noindex
We’ve often argued about the
best ways of keeping certain
pages and page-types out of the
index.
Argue with data.
We have all kinds of keyword-targeting test ideas
● Simpler messaging
○ (what happens if you have less keyword targeting?)
● Timely keywords
○ (what happens if you add "2016" in appropriate places?)
Argue with data
We’re running tests like these right now
Follow @distilled to hear the results first
If you’re going to implement split-testing,
there are some things you should know
You can’t assume traffic equality
between “buckets” of pages
This is why we build a counterfactual comparison using control pages.
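One way to create the buckets, as a sketch only: the deck doesn't say how pages are assigned, so hashing the URL is an assumption, and `bucket_for`, `split_pages` and the `salt` parameter are hypothetical names.

```python
import hashlib


def bucket_for(url, salt="test-1"):
    """Deterministically assign a page to 'control' or 'variant'.

    Hashing the URL (plus a per-test salt) gives a stable, roughly even
    split by page count without storing any state.
    """
    digest = hashlib.sha256(f"{salt}:{url}".encode("utf-8")).digest()
    return "variant" if digest[0] % 2 else "control"


def split_pages(urls, salt="test-1"):
    """Group URLs into the two buckets."""
    buckets = {"control": [], "variant": []}
    for url in urls:
        buckets[bucket_for(url, salt)].append(url)
    return buckets


# Hypothetical URLs for illustration.
pages = [f"/category/{i}" for i in range(1000)]
buckets = split_pages(pages)
print({name: len(members) for name, members in buckets.items()})
```

Note that an even split by page count says nothing about traffic: a handful of head pages landing in one bucket can make the two groups very unequal, which is exactly why the comparison goes through a counterfactual.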
Different pages can have different
seasonality
For example, “roses” pages on Valentine’s Day. You need to cut outliers.
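A sketch of one possible outlier cut. The deck doesn't specify a method, so the median absolute deviation rule, the threshold of 3 and the `cut_outliers` name are all assumptions.

```python
from statistics import median


def cut_outliers(ratios, threshold=3.0):
    """Split pages into (kept, dropped) by how unusual their traffic ratio is.

    ratios maps URL -> test-period traffic / baseline traffic. A page is
    dropped when it is more than `threshold` scaled MADs from the median.
    """
    values = list(ratios.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    kept, dropped = {}, {}
    for url, value in ratios.items():
        is_outlier = scale > 0 and abs(value - centre) / scale > threshold
        (dropped if is_outlier else kept)[url] = value
    return kept, dropped


# Hypothetical ratios: the roses page spikes around Valentine's Day.
ratios = {"/tulips": 1.0, "/lilies": 1.1, "/orchids": 0.9,
          "/daisies": 1.05, "/roses": 4.0}
kept, dropped = cut_outliers(ratios)
print(sorted(dropped))
```

Using the median rather than the mean keeps one extreme page from dragging the baseline towards itself and hiding its own spike.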
One site I looked at had 72 <html> tags on
a single page
You’ll find some of your work more sensitive to amusingly broken
HTML
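A quick sanity check you could run on templates before modifying them, sketched with Python's standard-library HTML parser. The `TagCounter` class and `count_tag` helper are hypothetical names, and the sample page is invented.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each opening tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of opening `tag` elements found in `html`."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical broken page: a widget has pasted in a whole second document.
page = "<html><body><p>Hi</p><html><body><p>Again</p></body></html></body></html>"
print(count_tag(page, "html"))
```

Anything above one for `html`, `head` or `body` is a hint that changes targeting those elements may not land where you expect.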
We’re not quite sure how to model
cross-section impacts
This will be needed for testing internal linking structures, for example.
You may detect unexplained
phenomena
In medicine, this would be things like the placebo effect with no known
pathway.
We may find that things that “shouldn’t” work, in fact do drive uplifts.
We can speculate that the continuing benefit of changing 302s to 301s (despite Google’s insistence that
302s don’t lose PageRank) is to do with them losing other link signals, but we don’t really know.
I’m not sure this matters.
It’s changing the way we make
recommendations
The big one:
Business cases
I wrote more about this in my better business documents post
But I’m also seeing more subtle impacts on my recommendations:
● You can recommend small tweaks and see the benefits compound
● You can test wild hypotheses with unknown upsides
● You can try things that might have a downside (more focused targeting, less copy, etc.)
And that’s even before you get the benefits of testing clickthrough rate, and the benefits of pretty charts
to show the boss highlighting the impact of your work!
More: blog post
Our work is so much easier than
theirs
But still, let’s move past
“cut the leg off as fast as you can”
@willcritchlow
PS - We’re hiring: distilled.net/jobs
Image credits
● Surgeons - Phalinn Ooi
● Operating chair - Peter Pelisek
● Potions - Sam Simpson
● Old operating theatre - Uglix
● Searchmetrics rankings drop - img_eisy
● Air drop - Wikipedia
● Test tubes - ironpoison
● Syringes - ad-vantage
● Pills - ashleyrosex
● Stethoscope - proimos
● Old wives’ tale - Jon Bunting, pgillard and John Davey
● Richard Feynman - Juana la loca, dullhunk and jkannenberg
● Spurious correlation - Tyler Vigen
● Lemon, lime, polar bear - abhijittembhekar, libraryman, ucumari
● Blackboard - arenamontanus
● Facepalm - brandongrasley
● Buckets - mamarazzi
● Rose - alicelingching
● Girders - JFB119
● Ghost - daveallday
● Bezos - jurvetson
