Presence in news? Revenue opportunity? Impact on the world? ... Let's talk numbers. This will be pretty dry and numbers-oriented. I want to show you that digging for numbers pays off in the end, and to present my own path as a case study. I will assume you know the basics: if you have ever seen WordPress or Blogger, or if you have a web site, you should be OK. Back when the economy was invented, doing business was easy: you made a pot and you exchanged it for wine. Then we invented factories and supermarkets, and suddenly you had to predict supply and demand to keep your business running smoothly. Then we invented computers, and suddenly everything became countable and measurable. And then someone invented the web and forgot to build into it everything we had learned along the way.
So today, when someone asks how big the web is, the usual answer is: it's big! The same goes for everything else online. The fact is, measuring stuff online is often pretty darn hard. If you have your own site, the amount of data can be overwhelming, and if you want to know something about other people's sites, there is no really legal way to find out. Yet these things are important to know if you want to project your business. I'll be talking about many numbers that were never published before, because nobody wants to hear them.
Lots of people work with bloggers, and all of them should know the numbers: every blogger blogs somewhere, researchers exist, and Google sees everything. But each of them has their own interests, and none of those interests is to make your projections accurate.
Of the 12 top blogging platforms, these are the only ones that have published their stats. You can see how diverse the numbers are, but they look believable, right? We have 150M of something right here, so there must be 300M altogether, right? But 150M of what: bloggers, publishers, people, journals, users?
Bottom-up. The most famous catalog is probably Technorati, and then there are others, each with a different approach and purpose. BlogCatalog is a user-generated general catalogue, Alltop is an editorial list of the best reading on a wide range of subjects, Federated Media is a list of the absolute top influencers, and so forth. I was interested in finding out the total reach of these catalogs. I wanted to see whether they could provide a comprehensive list of, say, 'wine bloggers'. So I crawled and indexed the publicly published information from 14 different catalogs. I don't know whether their internal indexes are different; my uber-list has only the blogs I could find if I were a normal marketer browsing their sites. For technical reasons I didn't include Yahoo and DMOZ.
This chart is rather complex, but essential. It demonstrates two findings. First, the bubble sizes show each catalog's unique contribution to our combined index, that is, the number of blogs that were found only in that catalog. As you can see, the higher the bubbles are, the larger they are, which makes sense: larger catalogs contribute more new blogs. The horizontal axis shows the percentage of unique blogs in each catalog. This tells us how special the catalogue is and how much it is worth looking at. So we can see that BlogCatalog contributes almost 90% of its volume and Technorati almost 80%. But the truly surprising finding is that even the rest of the catalogues list more than 60% new blogs, and that even the very smallest catalogs, which never intended to be a reference for the blogosphere, like Loud3r, still contribute 30% of their size. This basically means that if you create your own list without looking at other catalogs, chances are that 30% of it will not be listed anywhere else. Let me repeat this: if you throw a rock at a blogger, there is a reasonable chance you'll be the first. Second, let's look at the vertical axis: the scale ends at 200k. This means that the largest catalog out there publishes merely 200k bloggers, and that all of them combined see only 200k bloggers.
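The two quantities plotted in the chart (blogs found only in one catalog, and that count as a share of the catalog's size) can be computed with plain set operations. The following is a minimal sketch, not the crawler actually used for the talk; the catalog names and blog hosts are toy data invented for illustration.

```python
from typing import Dict, Set


def unique_contribution(catalogs: Dict[str, Set[str]]) -> Dict[str, dict]:
    """For each catalog, count the blogs that no other catalog lists."""
    stats = {}
    for name, blogs in catalogs.items():
        # Union of the listings of every other catalog.
        others = set().union(*(b for n, b in catalogs.items() if n != name))
        only_here = blogs - others
        stats[name] = {
            "size": len(blogs),            # vertical axis: catalog size
            "unique": len(only_here),      # bubble size: unique contribution
            # horizontal axis: share of the catalog found nowhere else
            "unique_share": len(only_here) / len(blogs) if blogs else 0.0,
        }
    return stats


# Toy data (hypothetical), standing in for the crawled catalog listings.
catalogs = {
    "catalog_a": {"blog1.example", "blog2.example", "blog3.example", "blog4.example"},
    "catalog_b": {"blog3.example", "blog5.example"},
    "catalog_c": {"blog4.example", "blog5.example", "blog6.example"},
}

stats = unique_contribution(catalogs)
combined = set().union(*catalogs.values())  # the combined "uber-list"

for name, row in stats.items():
    print(name, row)
print("combined index size:", len(combined))
```

In practice the hard part is normalizing blog URLs before comparing them, so that the same blog listed under slightly different addresses is counted once.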
Professionals estimate anything between 200k and 200M, and the enablers are aligning on the upper margin. We are on our own.
Traffic is the most reliable and most measured metric online. Compete makes it possible to measure traffic on other people's domains by sampling from browser plugins. It turns out we can use this to count bloggers, because we have one very special platform in the ecosystem.
Blogger publishes everything on blogspot.com and keeps all dashboards on blogger.com. This means that unique visitors on the first are readers and on the second are bloggers! If we assume this ratio stays fixed, we can estimate the size of every other platform in the ecosystem!
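The arithmetic behind this interpolation is a single proportion. Here is a rough sketch under the stated assumption that the bloggers-to-readers ratio observed on Blogger holds for other hosted platforms. The function name and all visitor counts below are hypothetical placeholders, not figures from Compete or from the talk.

```python
def estimate_bloggers(readers: float, ref_bloggers: float, ref_readers: float) -> float:
    """Scale a platform's reader count by the reference bloggers/readers ratio.

    ref_bloggers: unique visitors on the reference dashboard domain (blogger.com)
    ref_readers:  unique visitors on the reference publishing domain (blogspot.com)
    """
    if ref_readers <= 0:
        raise ValueError("reference reader count must be positive")
    # Multiply before dividing to keep round inputs exact.
    return readers * ref_bloggers / ref_readers


# Hypothetical monthly unique visitors, for illustration only.
blogspot_readers = 50_000_000
blogger_dashboard_visitors = 5_000_000
other_platform_readers = 20_000_000

estimate = estimate_bloggers(
    other_platform_readers, blogger_dashboard_visitors, blogspot_readers
)
print(f"estimated bloggers on the other platform: {estimate:,.0f}")
```

The estimate is only as good as the fixed-ratio assumption: a platform whose users read far more than they write, or the reverse, would be misjudged.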
So now we can estimate the sizes of several platforms, but not all of them, and we can't know which ones we are missing. This is useful for estimating the reach of hosted platforms, but there are two problems: self-hosted blogs remain a mystery, and custom domains are missing. Google sees everyone, so if there is a way to get data from them...
Now we should cross-check the previous numbers, and maybe tap into custom domains and self-hosted platforms. We take a random sample of blog posts in English and use heuristics to guess which platform each one runs on. This is pretty reliable sampling.
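The talk does not spell out the heuristics, so the sketch below uses the simplest one imaginable: matching the hostname of each sampled post against known hosted-platform suffixes. The suffix table, function names, and sample URLs are assumptions made for illustration. Blogs on custom domains or self-hosted software fall into the "unidentified" bucket here and would need further signals, such as markers in the page source.

```python
from collections import Counter
from typing import Dict, Iterable, Optional
from urllib.parse import urlparse

# Illustrative, incomplete table of hosted-platform hostname suffixes.
HOSTED_SUFFIXES = {
    ".blogspot.com": "Blogger",
    ".wordpress.com": "WordPress.com",
    ".tumblr.com": "Tumblr",
    ".livejournal.com": "LiveJournal",
}


def guess_platform(url: str) -> Optional[str]:
    """Guess the platform from the hostname; None if no suffix matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, platform in HOSTED_SUFFIXES.items():
        if host.endswith(suffix):
            return platform
    return None


def platform_shares(urls: Iterable[str]) -> Dict[str, float]:
    """Share of sampled posts per guessed platform."""
    counts = Counter(guess_platform(u) or "unidentified" for u in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {platform: n / total for platform, n in counts.items()}


# Hypothetical sample of post URLs, standing in for blog search results.
sample = [
    "http://alice.blogspot.com/2010/11/post.html",
    "http://bob.wordpress.com/2010/11/01/hello/",
    "http://carol.example.org/blog/first-post",
    "http://dave.blogspot.com/2010/10/another.html",
]

shares = platform_shares(sample)
print(shares)
```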
Measuring the Blogosphere ... or How Big Is It Really? Boštjan Špetič, How To Web Conference, November 2010
1. Official data
- Wordpress.com
- Wordpress.org
- Netlog
- Tumblr.com
- Livejournal.com
2. Professional Research
Sources: Technorati, Nielsen, Accenture, eMarketer
Figures: 200M (2008); 33% USA; 150M; 85M in USA; 30M in USA
3. Catalogs
Alltop, Technorati, Blog Catalog, Federated Media, Best Of The Web, Networked Blogs, My Blog Log, Postrank, Loud3r, Toprank, Adage, Wikio
5. Google Blog Search
a) Get a random sample of blog posts
b) Use heuristics to guess platforms
c) Compare with previous numbers
- This way we identify 61% of blogs.
- Better for non-hosted platforms.
- Works only in English.
Bringing It All Together
- Official data where available
- Compete interpolation for hosted platforms
- Google Blog Search interpolation for non-hosted platforms
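One way to read the three rules above is as a per-platform selection: prefer official data, otherwise fall back to the interpolation that suits the platform type. The sketch below encodes that reading. The class, field names, and numbers are hypothetical, and the talk does not say how conflicting estimates were reconciled.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PlatformEstimates:
    hosted: bool                          # hosted platform vs. self-hosted software
    official: Optional[float] = None      # figure published by the platform itself
    compete: Optional[float] = None       # traffic-ratio interpolation
    blog_search: Optional[float] = None   # blog-search sample interpolation


def pick_estimate(p: PlatformEstimates) -> Optional[Tuple[str, float]]:
    """Return (source, value) following the three rules, or None if nothing applies."""
    if p.official is not None:
        return ("official", p.official)
    if p.hosted and p.compete is not None:
        return ("compete", p.compete)
    if not p.hosted and p.blog_search is not None:
        return ("blog_search", p.blog_search)
    return None


# Hypothetical platforms and figures, for illustration only.
hosted_with_stats = PlatformEstimates(hosted=True, official=1_000_000, compete=800_000)
hosted_no_stats = PlatformEstimates(hosted=True, compete=300_000, blog_search=250_000)
self_hosted = PlatformEstimates(hosted=False, blog_search=400_000)

for p in (hosted_with_stats, hosted_no_stats, self_hosted):
    print(pick_estimate(p))
```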
6. Bonus: Internal Data
Blog search measurement for a well-known blogging add-on + their own unpublished internal number of active users = best estimate for the active, monthly, English blogosphere: 10,000,000
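The slide does not show the calculation. A plausible reading, offered here only as an assumption, is that the add-on's share of the blog-search sample is used to scale its known active-user count up to the whole blogosphere. The function name and inputs below are hypothetical and deliberately do not reproduce the talk's figure.

```python
def scale_to_population(known_active_users: int, sample_hits: int, sample_size: int) -> float:
    """Estimate the total population from one subgroup of known size.

    If the add-on shows up on sample_hits of sample_size sampled blogs, its
    share is sample_hits / sample_size, and the total is the known count
    divided by that share.
    """
    if sample_hits <= 0 or sample_size <= 0:
        raise ValueError("sample counts must be positive")
    # Equivalent to known_active_users / (sample_hits / sample_size).
    return known_active_users * sample_size / sample_hits


# Hypothetical inputs: add-on seen on 200 of 4,000 sampled blogs (5%),
# with 100,000 active users reported internally.
total = scale_to_population(known_active_users=100_000, sample_hits=200, sample_size=4_000)
print(f"estimated active blogs: {total:,.0f}")
```

This only works if the sample is representative and the add-on can be detected reliably; a small number of hits makes the estimate very noisy.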
Bottom Line
- Trust No One!
  - Blogosphere 3x-5x overestimated
- Don't take no for an answer
  - Knowing your numbers is a competitive advantage
- If you can't measure it, it doesn't exist