PARC Forum 2009: Adventures in SearchLand


Last year I left PARC after almost nine years in residence to join Cuil, a start-up company then in stealth mode. Three months later, Cuil launched with a lot of buzz and a product that is innovative to the hilt. This was the beginning of an exciting (and bumpy) journey over the uncharted territory of SearchLand, part of the larger and (to me) mysterious continent of StartUpLand. In this talk I will discuss this journey and its highlights so far, with the help of a little guide.

Published in: Technology, Design


  1. Adventures in SearchLand
     Valeria de Paiva, July 2009, PARC
  2. Outline
     ● Personal background
     ● What is a search engine?
     ● How do they work?
     ● SearchLand?
     ● Cuil!
     ● Adventures...
     ● and Opportunities
  3. Yours truly...
     ● Pure mathematics in Cambridge
     ● Work on Category Theory
     ● Programming languages
     ● Natural language & KR in PARC
     ● Search... BRIDGE
  4. Search engines...
     ● Until last year my understanding of search engines was like my understanding of telephones or cars...
     ● I know when they're working and how to use them.
     ● I have no idea why or how they work...
     ● Assuming you're like this too, some tidbits...
  5. Search Engines are like Librarians
     ● Have to have loads of documents a pesky user might want to see.
     ● Need to know the contents of the documents, to give the appropriate document.
     ● Need to aggregate the records of the contents of the documents in the index.
     ● When the user asks for a document, the librarian has to consult its index, decide on the most appropriate answers (the hits), find and deliver them in a timely and pleasant manner.
  6. Metaphor continued...
     ● There is a building up step: collecting and indexing documents
     ● There is a serving up process: reading the query in, massaging it, finding the results, ranking results and serving results.
     ● These correspond to the modules of the search engine: crawler, indexer, query analyzer, finding and ranking algorithms, webserver magic
  7. Metaphor gone too far...
     ● Books don't arrive at a library in tens of thousands every day. Search engines crawl the web all the time (and freshness is a real problem)
     ● Libraries get rid of books once a year. Search engines would re-index every five minutes if they could
     ● Libraries simply hand off their goods; search engines differentiate themselves by how they deliver their goods
  8. Search Engine Basics
     A search engine has modules:
     – Crawler
     – Indexer
     – Query analyzer
     – Searcher
     – Ranking
     – Webserver
     References:
     – Why writing your own search engine is hard, Patterson, ACM Queue, 2004
     – Building Nutch: Open Source, Cafarella and Cutting, 2004
     – Search technologies for the internet, Henzinger, Science, 2007
  9. Search Engine Scheme
     [Diagram with components: WEB (users), WEB (data), web server, crawler, mining, indexer, ranking, index server, query server]
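
To make the modules above concrete, here is a minimal sketch in Python of the building-up step (indexing) and the serving-up step (analyzing a query, finding hits, ranking them). It is a toy under stated assumptions, not how Cuil or any real engine works: the `PAGES` dictionary stands in for a crawler's output, the URLs are made up, and the score is a plain term count chosen only to keep the example short.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for what a crawler would fetch: URL -> page text.
PAGES = {
    "a.example/alice": "Alice follows the white rabbit down the rabbit hole",
    "b.example/search": "A search engine crawls, indexes and ranks web pages",
    "c.example/rabbit": "The rabbit is late",
}

def analyze(text):
    """Analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Indexer: map each term to {url: term count} (an inverted index)."""
    index = defaultdict(Counter)
    for url, text in pages.items():
        for term in analyze(text):
            index[term][url] += 1
    return index

def search(index, query, limit=10):
    """Searcher and ranking: keep pages that contain every query term,
    then order them by total term count (a deliberately crude score)."""
    terms = analyze(query)
    if not terms or any(term not in index for term in terms):
        return []
    hits = set(index[terms[0]])
    for term in terms[1:]:
        hits &= set(index[term])
    scored = [(sum(index[t][url] for t in terms), url) for url in hits]
    # Highest score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored[:limit]]

INDEX = build_index(PAGES)
```

Calling `search(INDEX, "rabbit")` consults only the index, never the pages themselves, which is the librarian's trick from the earlier slide. A real engine would replace each function with a distributed service and a far richer ranking.
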
  10. SearchLand...
      ● So far, so good.
      ● Like Alice in Wonderland, in the Oxford meadows with her sister
      ● Then she follows the rabbit into the hole and things begin to change...
  11. Getting there
      ● PARC: a big change from academia. There are things that you cannot tell your friends about your industrial research
      ● Timing is an art: you cannot publish too early, as IP has to be protected. Wait too long and there's nothing to publish.
      ● But PARC is still much closer to academia than I realized. It's research! It must become a product. Pretty soon. But it isn't one to begin with.
  12. Are we there yet?
      ● Start-up landscape is different: no offices, an open plan with individual desks and machines
      ● No book shelves, no work phones
      ● Not four All Hands per year, but one every week.
      ● Release of new code once a week, usually more
      ● Life moves fast...
  13. SearchLand: Cool Cuil!
      ● How did I get there? Anna Patterson and Tom Costello are friends of many years. How did they get there?
      ● They did a search start-up called Xift in 1999. Then Anna designed, wrote and sold Recall, the largest search engine in 2004, to Google. Also architect of Google’s TeraGoogle in early 2006.
      ● Tom worked at IBM on the prototype of WebFountain and on Storage Systems Strategy worldwide
      ● Then they decided to work together at Cuil
  14. The reasons for Cuil
      ● There are many search engines. But their results tend to be very similar. Are we seeing everything?
      ● Reports estimate we can see only 15% of the existing web. This is decreasing
      ● Probing the web is mostly popularity based. You're likely to see what others have seen before. But your seeing increases the popularity of what you saw, thereby reducing the pool of available stuff.
      ● Deep Web too?...
  15. The reasons for Cuil
      ● Much rubbish on the web.
      ● Some say all we don't see is web rot: web spam, porn, mindless duplication of non-content...
      ● Cuil says let's check it out, let's analyze the contents of the pages.
      ● People want to find information important to them, even when it's not popular. [e.g. vanity search yields long lost brother]
  16. The reasons for Cuil
      ● Cost and natural resources
      ● Users don't pay directly for using search engines and their server farms
      ● But costs to the environment should be part of the equation
      ● Cuil can serve a bigger index using a small fraction of the number of machines
      ● Cheaper for the environment and for the company
  17. The reasons for Cuil
      ● Cuil doesn't need to know your search history and habits.
      ● So we don't.
      ● No names, no IP addresses, and no cookies
      ● Your search history is your business, not ours.
  18. The reasons for Cuil
      ● There is (too much) information on the web.
      ● Cuil organizes the web so that you can find information that you didn't know you wanted...
  19. Organizing the web...
      ● Images can help.
      ● Longer snippets help.
      ● Tabs and categories show new stuff.
      ● Definitions: easier than going to a dictionary
      ● Timelines: show you the evolution of your concept
      ● Maplines: new connections
      ● Videos from Hulu, maps from Mapquest.
  20. Organization is fundamental
      ● Definitions: easier than going to a dictionary
      ● Timelines: show the evolution of your concept
      ● Maplines: new connections
      ● Videos from Hulu, maps from Mapquest.
  21. Adventures
      ● There are many.
      ● Talking about three:
      ● Launch! And the blogosphere...
      ● Timelines
      ● Languages
  22. Launching a product
      ● It's different from anything I had ever done before.
      ● Launched July 28th, less than three months from my start.
      ● Hoped for a “soft” launch in the middle of the summer...
      ● Unbelievable “flood” of interest
  23. After the hype, the blogs...
      ● Hadn't realized how much the Valley runs on blogs
      ● Didn't know about tech celebrities or Valleywag...
      ● Had no idea how many people make a living doing SEO
      ● Unbelievable that people went to the trouble of “faking” bad results.
  24. Timelines
      ● Launched in March 2009
      ● Dynamic timelines, not pre-computed for a few subjects
      ● Project completed in less than six weeks
      ● Too many? Algorithm still needs improvement
      ● But a personal battle won...
  25. Multiple Languages
      ● Launched in May 2009
      ● Infrastructure in place, took less than a month to release
      ● Seven languages so far
      ● Evaluation hardly started
      ● But loads of offers to help
      ● All of this organization with a team of fewer than thirty...
  26. Opportunities...
      ● There are many.
      ● Quality evaluation
      ● Relevance improvement
      ● More services...
  27. More Opportunities...
      ● Three banes of my life:
      ● Spam, spam, spam
      ● (Economics of) malware
      ● Attacking pornography
  28. Summing up
      ● Life in SearchLand is very different
      ● And lots of fun!
      ● As Patterson says in “Why Writing Your Own Search Engine is Hard”, ACM Queue, 2004: “[...] once the search bug gets you, you'll be back. The problem isn't getting any easier, and it needs all the experience anyone can muster.”
  29. And ever, as the story drained
        The wells of fancy dry,
      And faintly strove that weary one
        To put the subject by,
      “The rest next time--” “It is next time!”
        The happy voices cry.
      Lewis Carroll -- Proem
      Thank You!