Hadoop World 2011 Keynote: Ebay - Hugh Williams


Hugh Williams will discuss building Cassini, a new search engine at eBay which processes over 250 million search queries and serves more than 2 billion page views each day. Hugh will trace the genesis and building of Cassini as well as highlight and demonstrate the key features of this new search platform. He will discuss some of the challenges in scaling arguably the world’s largest real-time search problem, including the unique considerations associated with e-commerce and eBay’s domain, and how Hadoop and HBase are used to solve these problems

  • Great to be here – it’s a privilege to speak to you all. Today, I’m going to talk to you about eBay, our new search engine Cassini, and how Hadoop and HBase are used in search. Highlight title – and mention that I work on Marketplaces (ebay.com and its sister sites all over the world). Let me begin by giving you a brief overview of eBay…
  • We’re 16 years old. Here is a shot of the original site – called AuctionWeb – that eBay’s founder, Pierre Omidyar, launched over Labor Day weekend in 1995 … as an “experiment.” I’ve circled some text on this page – not sure if you can read it – but it says “There are always SEVERAL HUNDRED auctions underway, so you’re bound to find something interesting.” “Several hundred” … those were our humble beginnings, though pretty impressive at the time. The only thing that’s remained the same since 1995 is that eBay has always connected buyers and sellers.
  • In 2010, we sold $62 billion in merchandise.
  • We’re one of the Web’s largest properties, and the pace of change is being driven largely by our customers and their new and increasingly sophisticated shopping expectations … <read slide>
  • We are fast becoming a data company, where our engineers use data every day to inform what they do. And we have a lot of data, as you can imagine from our 97 million users, 200+ million listings, 250 million search queries, and 2 billion page views each day.
  • Before I move on to talk about search, I want to let you know that things are becoming more interesting at eBay: customers are changing how they shop, and we’re at the center of this revolution. Nearly half of all offline purchases have an online component. The offline and online worlds are merging, and this is THE NEW RETAIL landscape. It’s being driven by consumers who are using their smartphones and mobile devices to change the way they shop. eBay and mobile commerce are at the center of this shift – more change is going to happen in commerce in the next year or two than in the past ten.
  • I’ve set the context on eBay. Now, I want to introduce you to project Cassini, our most ambitious engineering project at eBay. We are completely rewriting our search engine, and Hadoop and HBase are key to this rewrite. But first, let me tell you something about our current search engine, Voyager.
  • Voyager is named after the NASA Voyager space probes, launched in 1977 to explore the outer planets.
  • It’s been driving the search experience on eBay since the early 2000s. Improvements to Voyager have been critical to improving the buyer experience and driving our sellers’ businesses.
  • However, Voyager is behind the times: a lot has happened in search since 2002. Our Best Match ranking function uses only tens of factors. It only allows search of item titles by default – we don’t rank using the great information that’s in the descriptions and elsewhere. Search is very literal – it finds almost exactly what you type; it doesn’t always understand what you mean.
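A factor-based ranking function of the kind described above can be sketched as a weighted sum over item features. The factor names and weights below are invented for illustration only; eBay's actual Best Match function is not public.

```python
# Hypothetical factors and weights -- purely illustrative, not eBay's.
WEIGHTS = {
    "title_match": 3.0,    # how well the query matches the item title
    "seller_rating": 1.5,  # seller feedback, normalized to [0, 1]
    "recency": 0.5,        # newer listings score slightly higher
}

def best_match_score(factors):
    """Combine an item's factor values into a single relevance score."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

score = best_match_score({"title_match": 0.9, "seller_rating": 0.8, "recency": 0.2})
print(score)  # 3.0*0.9 + 1.5*0.8 + 0.5*0.2 = 4.0
```

A "tens of factors" function like Voyager's is just a longer version of this sum; the talk's point is that Cassini will draw its factors from far more data sources.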
  • Voyager is a challenge to manage and run as an engineering team. It’s very manual, so deployments of software and data take time. Troubleshooting is slow. We decided in late 2010 that Voyager needed to be replaced, and that began project Cassini.
  • Cassini is named after the space probe launched in 1997, a nod to it being many years ahead of Voyager.
  • <read and click>
  • We’re probably the only major web property that’s completely rewriting its search engine from scratch. You can see many of the features of Cassini, and I’ll just talk about a couple briefly. First, it will use all data by default – all that great data in descriptions, information in images, data about our buyers and sellers, and the signals that come from 2 billion page views each day will be used in Cassini to compute its Best Match. Our users are going to see world-class results, and it’ll be a much more powerful tool to connect buyers and sellers. Second, automation is key. There’ll be no more manual operation of the search engine – rolling out code and data, monitoring, alerting, remediation, and more are fully automated. Third, it’s a major engineering undertaking: we have over 100 engineers working across four parallel tracks to deliver Cassini in less than 18 months from start to finish.
  • We’ve hit a few major internal milestones, and internal users can already use Cassini if they’d like. <read slide>
  • To understand how Hadoop and HBase play a role in Cassini, let me explain some of the fundamentals of building a search engine. <first point> Scanning 200 million items would take about 30 seconds, even if we could process one document every 10 milliseconds and had 1,000 machines working concurrently. <second point> An inverted index is an auxiliary data structure that allows fast calculation of the best-matching search results. A typical query takes ten milliseconds using the same 1,000 machines and an inverted index. <third point> Walk through using the index in the back of a book…
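The inverted index idea above can be shown in a few lines of Python: map each term to a postings list of (document id, word position), then answer a query by intersecting the documents for each term. The item data is hypothetical; a production index is orders of magnitude larger.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    result = None
    for term in query.lower().split():
        docs = {doc_id for doc_id, _ in index.get(term, [])}
        result = docs if result is None else result & docs
    return result or set()

# Toy "items" standing in for eBay listings.
items = {
    1: "vintage camera lens",
    2: "camera bag",
    3: "vintage poster",
}
index = build_inverted_index(items)
print(search(index, "vintage camera"))  # {1}
```

As with the book-index analogy, the query never scans the documents themselves: it looks up each term's postings list and combines them.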
  • It isn’t possible to create an index for over 200 million items on a single machine – we can’t keep in memory the terms and all of their positions in the documents. What we do at scale is distributed index construction; it is classic map-reduce (and has been so from well before the phrase was coined). We build an inverted index for a small part of the document collection on one machine, and do the same on hundreds of other machines. We merge the small inverted indexes into larger inverted indexes that are distributed to our query-serving grid. This is a technical graphic from our team; it shows the seven high-level stages of creating all the index pieces we need in Cassini.
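The build-small-then-merge scheme described above is a map/reduce pattern: each "mapper" builds a partial inverted index over its shard of items, and a "reducer" merges the postings lists per term. In production this would run as Hadoop jobs across hundreds of machines; the sketch below runs both phases in-process for illustration, with invented shard data.

```python
from collections import defaultdict
from itertools import chain

def map_shard(shard):
    """Mapper: build a partial index (term -> doc ids) for one shard."""
    partial = defaultdict(list)
    for doc_id, text in shard:
        for term in set(text.lower().split()):
            partial[term].append(doc_id)
    return partial

def reduce_partials(partials):
    """Reducer: merge per-shard postings into one sorted index."""
    merged = defaultdict(list)
    for term, postings in chain.from_iterable(p.items() for p in partials):
        merged[term].extend(postings)
    return {term: sorted(postings) for term, postings in merged.items()}

shards = [
    [(1, "fat cat"), (2, "cat on the mat")],  # machine 1's slice
    [(3, "wild cat"), (4, "dog house")],      # machine 2's slice
]
index = reduce_partials([map_shard(s) for s in shards])
print(index["cat"])  # [1, 2, 3]
```

The merge step is embarrassingly parallel per term, which is why the pattern scales out so naturally on Hadoop.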
  • Let’s talk about why indexing in Cassini is more challenging than in Voyager, and why we changed the architecture dramatically to include Hadoop and HBase. First reason: the completed-items pool covers 14 days in Voyager versus 90 days in Cassini. Second reason: we refresh indexes on an hourly basis – this helps improve ranking, for example by updating item and seller information. Third reason: we give full power to our ranking team to make fast-twitch changes.
  • Hadoop is the platform for our index construction and index maintenance in Cassini. It’s ideal because it gives us fault tolerance and smart utilization of our hardware – without Hadoop, we’d probably have small pools of machines running custom code for different stages of our index construction. Our Hadoop clusters for analytics are much larger, but this is our major use of Hadoop in driving a customer experience. It’s pretty large scale too: while we have over 200 million active items at any time, we also maintain a “completed index” that covers over 1 billion items.
  • We use HBase to store eBay’s items for index construction and maintenance. HBase, as you know, is a column-oriented data store built on top of HDFS that is tightly integrated with the Hadoop map/reduce framework. It has no schema, which is great for us – it means what we store can evolve. HBase supports fast item lookups and scans, both of which are necessary for index construction. Incremental writes are what we normally do: about 10 million items enter eBay each day, and we need them in the searchable index within a couple of minutes. Bulk writes are necessary when our ranking team wants to rescore all our items.
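The two write patterns mentioned above can be sketched as follows, using a plain dict as a stand-in for an HBase table (real code would go through an HBase client such as happybase; the row keys and column names here are invented for illustration).

```python
# Mock column store: row key -> {column: value}. Stands in for an HBase table.
items_table = {}

def incremental_put(row_key, columns):
    """Incremental write: merge new columns into a row, as when a newly
    listed item must become searchable within a couple of minutes."""
    items_table.setdefault(row_key, {}).update(columns)

def bulk_rescore(score_fn):
    """Bulk write: scan every row and rewrite its score column, as when
    the ranking team rescores the entire item inventory."""
    for row_key, columns in items_table.items():
        columns["rank:score"] = score_fn(columns)

incremental_put("item:1001", {"item:title": "vintage camera", "item:price": 50})
incremental_put("item:1002", {"item:title": "camera bag", "item:price": 20})
bulk_rescore(lambda cols: cols["item:price"] * 2)  # toy scoring function
print(items_table["item:1001"]["rank:score"])  # 100
```

The point of the sketch is the access-pattern split: incremental puts touch one row at a time, while a bulk rescore is a full-table scan plus write, which is exactly the kind of job that runs well as map/reduce over HBase.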
  • We’ve got running Hadoop at scale mostly down, but we have challenges with HBase. First issue: Ops and Dev are both new to HBase – lots of learning through failures. Second issue: we test using a mini Hadoop cluster plus local HBase. Third issue: getting the hardware tuned just right. Fourth issue: HBase stability – unstable region servers and HBase master, regions stuck in transition, etc. Fifth issue: monitoring – a lot of the time we don’t recognize there are issues until jobs begin to fail. Sixth issue: workflow – our index chains have around 20 stages. But it’s not all doom and gloom: we’ve recently had a couple of weeks of stability, and we’re getting more confident each week… Before I finish today, I want to show you a couple of pictures of our data center that houses Cassini…
  • This is our new data center that we opened in Salt Lake City, Utah in May last year. It’s one of the most efficient data centers ever built, and makes clever use of power and cooling technologies.
  • And here are the machines inside the data center that run Cassini.
  • Before I conclude, I want to let you know that we’re hiring in the search team, and right across all the teams that use and maintain Hadoop and HBase. If you’re a Hadoop or HBase committer, I’d especially love to talk to you… And with that, I want to thank you all for listening, and I hope you enjoy a great conference.

    1. Project Cassini: eBay’s New Search Engine. Hugh Williams, Vice President of Search, Experience, and Platforms, eBay Marketplaces
    2. $2.63 million for a lunch with Warren Buffett
    3. $40,668 for Justin Bieber’s just-cut hair
    4. $130K for Princess Beatrice’s hat
    5. $62 billion in merchandise sold in 2010
    6. 97 million active buyers and sellers worldwide; 250 million queries each day to our search engine; 200+ million items live in more than 50,000 categories
    7. 9 petabytes of data in our Hadoop and Teradata clusters; 2 billion page views each day; 75 billion database calls each day
    8. Huge Opportunity: Taking the “e” out of ecommerce. [Chart: Yesterday (2008, online = $325B): online 4%, offline 96%. Today: online 6%, web-influenced offline 37%, remainder offline. Tomorrow (2013): online + offline = $10T. Sources: Forrester, Euromonitor, Economist Intelligence Unit]
    9. Voyager: our current search engine
    10. Voyager: our current search engine ► Reliable, critical, proven workhorse
    11. Voyager: our current search engine ► Circa-2002 textbook design ► Basic ranking functionality ► Title-only match by default ► Very literal search
    12. Voyager: our current search engine ► Inflexible & manual ► The next wave of innovation requires a new search platform…
    13. Project Cassini at eBay: our new search engine
    14. Project Cassini at eBay: our most ambitious core engineering project
    15. Project Cassini at eBay: our most ambitious core engineering project ► Entirely new codebase ► World-class, from a world-class team ► Platform for ranking innovation ► Uses all data by default ► Flexible ► Automated ► Four major tracks, 100+ engineers ► Complete in less than 18 months
    16. Project Cassini at eBay: beginning tests, likely launch in 2012
    17. A Short Primer on Indexing. When a user types a query, it isn’t practical to exhaustively scan 200+ million items. Instead, we create an inverted index, and use it to rank the items and find the best matches. An inverted index is similar to the index in the back of a book:  a set of searchable terms;  for each term, a list of locations
    18. An Inverted Index. [Diagram: the term “cat” with postings list “3: 1, 2, 7”, pointing to its occurrences in the phrases “cat on the mat”, “fat cat”, and “wild cat”, whose words are numbered 1–8]
    19. Distributed Index Construction
    20.  Larger index than Voyager: descriptions, seller data, other metadata, …  Much more history in our indexes  More computationally expensive work at index-time (and less at query-time)  Ability to rescore or reclassify entire site inventory
    21.  Hadoop:  Distributed indexing – platform for hourly index refreshes  Fault tolerance through HDFS replication  Better utilization of hardware – can generate different index types with one cluster
    22.  HBase:  Column-oriented data store on top of HDFS  Used to store eBay’s items  Bulk and incremental item writes  Fast item reads for index construction  Fast item reads and writes for item annotation
    23.  Everyone is still learning  Some issues only appear at scale  Production cluster configuration is challenging  Hardware issues  Tuning cluster configuration to our workloads  HBase stability  Monitoring health of HBase  Managing workflows – many-step map/reduce jobs