Death of the Search Startup


My argument that there cannot be another search startup because of the extreme expense of building a search engine. I worked at three search startups: SideStep, Kosmix, and Powerset.

Published in: Technology
  • Search startups are dead. Entrepreneurs tend to think that there’s always a way to innovate out of a problem. In this case, however, I’m going to show you that there are systemic reasons why there cannot be a general-purpose search engine that competes with Google and Bing.
  • I’ve worked for three search startups – SideStep, Kosmix, and Powerset – and I still don’t have a Gulfstream. This is sort of an exercise in apologetics: it’s really not my fault that I don’t have mountains of cash from my stock options.
  • There are good reasons, rooted in switching costs and marketing, why a new search engine can’t just pop up, but that’s not what I’ll focus on. It’s all about the mighty greenback: building a search engine is a really expensive proposition.
  • It goes without saying that the numbers herein are not the opinion of my employer and are speculative, but they are informed by experience. I’ve made a lot of estimates in Excel to come up with these numbers, and I’m pretty confident that I’m in the right ballpark.
  • The equation has two major components: hardware and people. In the following slides, I’ll explain what goes into each and, in the process, show you how complicated and expensive a search engine is to build.
  • Last year, Google estimated that the Web contains over 1 trillion documents. That’s really expensive to store.
  • It’s not just the Web page you have to store. There are links, anchor text, and, since you’re a smarty-pants startup, you’ll probably be extracting all kinds of smart metadata from every page.
  • Keep in mind that the Web is constantly changing. New pages are being added, pages already crawled are changing, and making sure you have the latest copy of the Web on hand is really important.
  • At bare minimum, you need results that are as relevant as Bing’s or Google’s. To do that, you’ll need lots of servers to run relevance experiments and lots of storage for huge amounts of clickstream data.
  • I know there aren’t any black hat SEO folks in this crowd, but there’s a constant battle with site-owners who don’t have the best interests of users at heart and are willing to do things to game search results.
  • No search engine is complete without lots of ancillary data: weather, stock quotes, images, maps, Twitter, Facebook. Licensing the content or building the vertical is very expensive and you’re not a true replacement without it.
  • One of the most expensive components of a search engine is runtime. When you do a search in Bing, results come back from thousands, or possibly billions, of Web pages in less than a second. How does that happen? Lots and lots and lots of servers.
  • All search engines use some kind of divide and conquer algorithm that federates your search to thousands of machines. That means that for any query, there are thousands of machines involved. When you have millions of users, serving search results gets very expensive.
  • At Powerset, we estimated that our index was 10-20 times the size of a typical keyword index. The Johnson coefficient represents the tax on storage, relevance, and runtime that you’d pay at an innovative search engine.
  • 250 people for 2 years.
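
The divide-and-conquer idea in the notes above (federate one query to many machines, then merge what comes back) can be sketched as a toy scatter-gather in Python. This is only an illustration under loud assumptions: `SHARDS`, `search_shard`, and `federated_search` are hypothetical names, three in-memory dictionaries stand in for thousands of machines, and the term-count scoring is a placeholder rather than anything a real engine uses.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy index: each dict stands in for one machine's slice of the Web.
SHARDS = [
    {"doc1": "cheap flights to boston", "doc2": "boston weather today"},
    {"doc3": "flights and hotels", "doc4": "stock quotes"},
    {"doc5": "boston flights deals", "doc6": "maps of seattle"},
]


def search_shard(shard, query, k):
    """Score every document in one shard and return that shard's top k hits."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)  # naive placeholder score
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def federated_search(shards, query, k=3):
    """Scatter the query to every shard in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _score, doc_id in merged]


print(federated_search(SHARDS, "boston flights"))
```

Every shard does work for every query, which is the cost point the notes make: the per-query bill scales with the number of machines holding the index, not just with the number of results returned.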

    1. The Death of the Search Startup
         • Mark Johnson (@philosophygeek)
         • Bing Lead Program Manager
    2. Search Startups are Dead
         • “After Buddha was dead, his shadow was still shown for centuries in a cave – a tremendous, gruesome shadow. God is dead; but given the way of men, there may still be caves for thousands of years in which his shadow will be shown. And we – we still have to vanquish his shadow, too.”
         • - Nietzsche, Die fröhliche Wissenschaft
    3. ∑ ≠ Photo Credit
    4.  
    5. Standard Disclaimer
         • The numbers that you are about to see are the opinion of the author and calculated using speculative (but informed by experience) numbers pulled out of his ass. There are probably numerous cases of over- and under-estimation, but I’m not trying to create a definitive “cost of a search engine.” What I’m trying to do is to give you ammunition to call bullshit on any entrepreneur who tells you he can build a full-scale, general-purpose search engine for under $100M. My employer, Microsoft, was not involved in the creation of any of the following dubious dollar values.
    6. An Equation
         • Ĵ (SC + Rel + RT) + (P * t) = [a number too big to invest in]
         • Ĵ = Johnson coefficient
         • SC = Storage/Crawling
         • Rel = Relevance
         • RT = Runtime
         • P = People
         • t = Time (in years)
         • Hardware: Ĵ (SC + Rel + RT)
         • People: P * t
    7. The Web is big
    8. It’s (even) bigger than you think
    9. Freshness: Things change quickly
    10. Serving relevant results
    11. Out with the bad
    12. And then all the other stuff
    13. Speed is everything
    14. Divide and conquer
    15. * The Johnson Coefficient Ĵ
    16. People
    17. Adding it all up
         • Ĵ (SC + Rel + RT) + (P * t) =
         • $100M on the lower end
         • >$300M on the high end
    18. Case Study 1: SearchMe
    19. Case Study 2: Powerset
    20. Keep Hope Alive
    21. The Moral of the Story
         • [don’t invest in general purpose search engines]
    22. Thanks!
         • Follow me on Twitter: @philosophygeek
         • Blog post to follow: deliberateambiguity.typepad.com
         • Comments welcome/encouraged: [email_address]
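
The equation from slides 6 and 17 can be written as a small Python function to show how the terms combine. This is a sketch, not the author's spreadsheet. The function name `search_engine_cost` and the `cost_per_person_year` parameter are assumptions introduced here, because the slides do not say how P is priced. The only inputs taken from the deck are the 250 people, the 2 years, and the 10-20 range quoted for Powerset's index. Every dollar figure below is an arbitrary placeholder chosen for easy arithmetic.

```python
def search_engine_cost(johnson, storage_crawl, relevance, runtime,
                       people, years, cost_per_person_year):
    """Evaluate J * (SC + Rel + RT) + (P * t), with all costs in dollars.

    cost_per_person_year is an assumption of this sketch: the slides leave
    the price of one person-year unstated.
    """
    hardware = johnson * (storage_crawl + relevance + runtime)
    staff = people * years * cost_per_person_year
    return hardware + staff


# Placeholder inputs only, NOT the author's estimates.
PLACEHOLDER_HARDWARE = dict(storage_crawl=1_000_000, relevance=1_000_000, runtime=1_000_000)
PLACEHOLDER_STAFF = dict(people=250, years=2, cost_per_person_year=100_000)

for johnson in (10, 20):  # the 10-20x index-size range mentioned for Powerset
    total = search_engine_cost(johnson, **PLACEHOLDER_HARDWARE, **PLACEHOLDER_STAFF)
    print(f"Johnson coefficient {johnson}: ${total:,}")
```

Because the coefficient multiplies the whole hardware term, doubling it doubles storage, relevance, and runtime costs together, while the people term is unaffected.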
