Search Startups are Dead
Entrepreneurs tend to think that there’s always a way to innovate out of a problem. In this case, however, I’m going to show you that there are systematic reasons why no general-purpose search engine can compete with Google and Bing.
I’ve worked for three search startups – SideStep, Kosmix, and Powerset – and I still don’t have a Gulfstream. This is sort of an exercise in apologetics: it’s really not my fault that I don’t have mountains of cash from my stock options.
There are good reasons, such as switching costs and marketing, why a new search engine can’t just pop up, but that’s not what I’ll focus on. It’s all about the mighty greenback: building a search engine is a really expensive proposition.
It goes without saying that the numbers herein are not the opinion of my employer and are speculative, but they are informed by experience. I’ve made a lot of estimates in Excel to come up with these numbers and I’m pretty confident that I’m in the right ballpark.
The equation has two major components: hardware and people. In the following slides, I’ll explain the components going into hardware and people and, in the process, show you how complicated and expensive a search engine is to build.
Last year, Google estimated that the Web contains over 1 trillion documents. That’s really expensive to store.
It’s not just the Web page you have to store. There are links, anchor text, and, since you’re a smarty-pants startup, you’ll probably be extracting all kinds of smart metadata from every page.
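To make the storage argument concrete, here is a rough back-of-the-envelope sketch. Only the 1 trillion document count comes from the talk; the average page size, the metadata overhead, and the replication factor are placeholder assumptions chosen for illustration, so swap in your own estimates.

```python
def raw_storage_petabytes(num_docs, avg_doc_kb, metadata_overhead, replicas):
    """Estimate raw storage in petabytes (1 PB = 10**15 bytes).

    metadata_overhead is a fraction of page size (0.5 means links, anchor
    text, and extracted metadata add 50% on top of the page itself).
    """
    bytes_per_doc = avg_doc_kb * 1_000 * (1 + metadata_overhead)
    total_bytes = num_docs * bytes_per_doc * replicas
    return total_bytes / 1e15


estimate = raw_storage_petabytes(
    num_docs=1e12,          # from the talk: over 1T documents
    avg_doc_kb=50,          # ASSUMPTION: placeholder average page size
    metadata_overhead=0.5,  # ASSUMPTION: placeholder for links, anchors, metadata
    replicas=3,             # ASSUMPTION: placeholder replication factor
)
print(f"{estimate:,.0f} PB of raw storage under these assumptions")
```

The point is less the exact figure than how quickly each multiplier compounds at Web scale.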
Keep in mind that the Web is constantly changing. New pages are being added, pages already crawled are changing, and making sure you have the latest copy of the Web on hand is really important.
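The freshness requirement can be turned into a crawl rate with one division. This is a sketch only: the 30-day refresh window is an assumption for illustration and not a figure from the talk.

```python
SECONDS_PER_DAY = 86_400


def required_fetch_rate(num_docs, refresh_days):
    """Pages per second needed to re-fetch every document once per window."""
    return num_docs / (refresh_days * SECONDS_PER_DAY)


rate = required_fetch_rate(
    num_docs=1e12,    # from the talk: over 1T documents
    refresh_days=30,  # ASSUMPTION: placeholder refresh window
)
print(f"{rate:,.0f} pages per second, sustained")
```

A real crawler would refresh popular pages far more often than obscure ones, so a uniform window like this is only a first approximation.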
At a bare minimum, you need results that are as relevant as Bing’s or Google’s. To do that, you’ll need lots of servers to run relevance experiments. You’ll need lots of storage for huge amounts of clickstream data.
I know there aren’t any black hat SEO folks in this crowd, but there’s a constant battle with site-owners who don’t have the best interests of users at heart and are willing to do things to game search results.
No search engine is complete without lots of ancillary data: weather, stock quotes, images, maps, Twitter, Facebook. Licensing the content or building the vertical is very expensive and you’re not a true replacement without it.
One of the most expensive components of a search engine is runtime. When you do a search in Bing, results come back from thousands, or possibly billions, of Web pages in less than a second. How does that happen? Lots, and lots, and lots of servers.
All search engines use some kind of divide and conquer algorithm that federates your search to thousands of machines. That means that for any query, there are thousands of machines involved. When you have millions of users, serving search results gets very expensive.
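The divide-and-conquer pattern described here is often called scatter-gather: send the query to every index shard, let each return its local top results, then merge. The toy sketch below assumes in-memory dictionaries standing in for shards and a crude term-count score; the names `search_shard` and `federated_search` are hypothetical, and this is not how Bing or any production engine is implemented.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, query, k):
    """Return the top-k (score, doc_id) pairs from one shard.

    Toy scoring: how many times the query terms occur in the document.
    """
    terms = query.lower().split()
    scored = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def federated_search(shards, query, k=3):
    """Scatter the query to every shard in parallel, then gather and merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, query, k), shards)
        merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Two tiny in-memory "shards"; a real index would span thousands of machines.
shards = [
    {"a": "cheap flights to seattle", "b": "seattle weather today"},
    {"c": "seattle seattle coffee guide", "d": "stock quotes"},
]
results = federated_search(shards, "seattle coffee")
print(results)
```

Every query touches every shard, which is why serving cost grows with both index size and query volume.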
At Powerset, we estimated that our index was 10-20 times the size of a typical keyword index. The Johnson coefficient represents the tax on storage, relevance, and runtime that you’d pay at an innovative search engine.
250 people for 2 years.
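Putting the pieces together, the equation from the "An Equation" slide, Ĵ (SC + Rel + RT) + (P * t), can be sketched as a small function. The 250 people and 2 years come from the talk; every dollar figure below, including the per-person cost and all three hardware components, is a placeholder assumption and not the author's estimate. P is interpreted here as yearly people cost (headcount times cost per person), which is also an assumption.

```python
def search_engine_cost(johnson, storage_crawl, relevance, runtime,
                       people_cost_per_year, years):
    """Total cost in $M: J * (SC + Rel + RT) + (P * t)."""
    hardware = johnson * (storage_crawl + relevance + runtime)
    people = people_cost_per_year * years
    return hardware + people


headcount = 250           # from the talk
years = 2                 # from the talk
cost_per_person_m = 0.2   # ASSUMPTION: placeholder loaded cost, $M per person-year

total = search_engine_cost(
    johnson=1.0,          # ASSUMPTION: 1.0 means no "innovation tax"
    storage_crawl=10.0,   # PLACEHOLDER hardware cost in $M, not from the talk
    relevance=10.0,       # PLACEHOLDER hardware cost in $M, not from the talk
    runtime=10.0,         # PLACEHOLDER hardware cost in $M, not from the talk
    people_cost_per_year=headcount * cost_per_person_m,
    years=years,
)
print(f"${total:,.0f}M under these placeholder inputs")
```

Raising `johnson` toward the 10-20 range mentioned for Powerset's index scales the whole hardware term, which is the "tax" the coefficient is meant to capture.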
Death of the Search Startup
The Death of the Search Startup <ul><li>Mark Johnson ( @philosophygeek ) </li></ul><ul><li>Bing Lead Program Manager </li></ul>
Search Startups are Dead <ul><li>“After Buddha was dead, his shadow was still shown for centuries in a cave – a tremendous, gruesome shadow. God is dead; but given the way of men, there may still be caves for thousands of years in which his shadow will be shown. And we – we still have to vanquish his shadow, too.” </li></ul><ul><li> - Nietzsche, Die fröhliche Wissenschaft </li></ul>
Standard Disclaimer <ul><li>The numbers that you are about to see are the opinion of the author and calculated using speculative (but informed by experience) numbers pulled out of his ass. There are probably numerous cases of over- and under-estimation, but I’m not trying to create a definitive “cost of a search engine.” What I’m trying to do is to give you ammunition to call bullshit on any entrepreneur who tells you he can build a full-scale, general-purpose search engine for under $100M. My employer, Microsoft, was not involved in the creation of any of the following dubious dollar values. </li></ul>
An Equation <ul><li>Ĵ (SC + Rel + RT) + (P * t) = [a number too big to invest in] </li></ul><ul><li>Ĵ = Johnson coefficient </li></ul><ul><li>SC = Storage/Crawling </li></ul><ul><li>Rel = Relevance </li></ul><ul><li>RT = Runtime </li></ul><ul><li>P = People </li></ul><ul><li>t = Time (in years) </li></ul><ul><li>Hardware: Ĵ (SC + Rel + RT) </li></ul><ul><li>People: P * t </li></ul>
Thanks! <ul><li>Follow me on Twitter: @philosophygeek </li></ul><ul><li>Blog post to follow: deliberateambiguity.typepad.com </li></ul><ul><li>Comments welcome/encouraged: [email_address] </li></ul>