
HBaseCon 2012 | HBase, the Use Case in eBay Cassini



eBay Marketplaces has been working hard on its next-generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistence layer built on Hadoop that supports billions of updates per day. Its easy sharding, fast writes and table scans, very fast bulk data load, and natural integration with Hadoop provide the cornerstones for successful continuous index builds. We will share the technical details with the audience, along with the difficulties and challenges that we have gone through and are still facing in the process.

Published in: Technology


  1. HBase, the Use Case in eBay Cassini. Thomas Pan, Principal Software Engineer, eBay Marketplaces
  2. eBay Marketplaces: 97 million active buyers and sellers worldwide; 200+ million items in more than 50,000 categories; 2 billion page views each day; 9 petabytes of data in our Hadoop and Teradata clusters; 250 million queries each day to our search engine
  3. Cassini: eBay’s new search engine; entirely new codebase; world-class, from a world-class team; platform for ranking innovation; four major tracks, 100+ engineers; likely launch in 2012
  4. Indexing in Cassini: index with more data and more history; more computationally expensive work at index time (and less at query time); ability to rescore and reclassify the entire site inventory; the entire site inventory is stored in HBase; indexes are built via MapReduce jobs and stored in HDFS; build the entire site inventory in hours
  5. HBase Table Data Import. Bulk load: batch processing on demand or every couple of hours; loads a large amount of data quickly. PUT: near-real-time updates; better for updating small amounts of data; read after PUT for better random read performance
  6. HBase Tables: 3 major tables (active items, completed items and sellers); 15 TB of data; 3600 pre-split regions per table with auto-split disabled; 3 column families with a maximum of 200 columns; automatic major compaction disabled; RowKey is the bit reversal of the document id (unsigned 64-bit integer)
  7. Indexing Job Pipeline: full table scan; run every couple of hours
  8. 8. Numbers Data import  Bulk data import: 30 minutes for 500 million full rows  Random write: ~ 200,000,000 rows per day  1.2 TB data daily import Scan Performance  Scan speed: 2004 rows per second per region server (average version 3), 465 rows per second per region server (average version 10)  Scan speed with filters: 325~353 rows per second per region server
  9. 9. Operations Monitoring  Ganglia  Nagios  OpenTSDB Testing  Unit test and regression test  HBaseTestingUtility for unit test  Standalone Hbase for regression test (mvn verify)  Cluster level  Fault Injection Tests [HBASE-4925] Region balancer Manual major compaction
  10. Operations (Cont’d). Disable swap. Substantially increase the file descriptor limit and xciever count. Metrics to watch for: jvm.DataNode.metrics.threadRunnable (connection leakage, with netstat); hbase.regionserver.compactionQueueSize (major/minor compactions); dfs.datanode.blockReports_avg_time (data block reporting, for too many data blocks); network_report (network bandwidth usage, for data locality)
  11. Community Acknowledgement: Eli Collins, Kannan Muthukkaruppan, Karthik Ranganathan, Konstantin Shvachko, Lars George, Michael Stack, Ted Yu, Todd Lipcon
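
The HBase Tables slide states that the RowKey is the bit reversal of an unsigned 64-bit document id and that each table has 3600 pre-split regions. The sketch below illustrates both ideas in plain Python. It is not the Cassini implementation: the 8-byte big-endian encoding, the evenly spaced split points, and the names `reverse_bits64`, `row_key` and `split_points` are all assumptions made for illustration. The slides do not say why bit reversal was chosen; one plausible reason is that it scatters sequentially assigned ids across the whole key space, so writes do not pile up on a single region.

```python
import struct
from typing import List

MASK64 = (1 << 64) - 1  # largest unsigned 64-bit value


def reverse_bits64(doc_id: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= doc_id <= MASK64:
        raise ValueError("doc_id must fit in an unsigned 64-bit integer")
    # Render as a fixed-width 64-character binary string, reverse it, parse it back.
    return int(format(doc_id, "064b")[::-1], 2)


def row_key(doc_id: int) -> bytes:
    """Encode the bit-reversed id as 8 big-endian bytes (assumed encoding)."""
    return struct.pack(">Q", reverse_bits64(doc_id))


def split_points(num_regions: int) -> List[bytes]:
    """Evenly spaced region boundaries over the 64-bit key space (assumed scheme)."""
    step = (1 << 64) // num_regions
    return [struct.pack(">Q", i * step) for i in range(1, num_regions)]


# Consecutive document ids end up far apart in the key space.
for doc_id in (1, 2, 3, 4):
    print(doc_id, row_key(doc_id).hex())

# 3600 regions need 3599 boundaries between them.
print(len(split_points(3600)))
```

Because big-endian byte strings sort in the same order as the integers they encode, boundaries computed this way divide the reversed-id space into equal ranges. Reversing a key a second time recovers the original document id.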
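
The figures on the Numbers slide can be put on a common per-second scale with some simple arithmetic. The sketch below uses only the numbers quoted on that slide. It assumes that the random-write load is spread evenly over 24 hours and that 1.2 TB means 1.2 × 10^12 bytes. The results are therefore cluster-wide averages, not peak rates, and the slides do not give the number of region servers needed to derive per-server figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput in rows per second."""
    return rows / seconds


# Bulk load: 500 million full rows in 30 minutes.
bulk_rate = rows_per_second(500_000_000, 30 * 60)

# Random writes (PUT): ~200 million rows per day, assumed evenly spread.
put_rate = rows_per_second(200_000_000, SECONDS_PER_DAY)

# Daily import volume: 1.2 TB per day, assuming decimal terabytes.
import_mb_per_s = 1.2e12 / SECONDS_PER_DAY / 1e6

# Scan slowdown when the average version count rises from 3 to 10.
scan_slowdown = 2004 / 465

print(f"bulk load:    {bulk_rate:,.0f} rows/s")
print(f"random write: {put_rate:,.0f} rows/s")
print(f"bulk / put:   {bulk_rate / put_rate:.0f}x")
print(f"daily import: {import_mb_per_s:.1f} MB/s")
print(f"scan slowdown: {scan_slowdown:.2f}x")
```

Under these assumptions the bulk load sustains roughly 278,000 rows per second against an average of about 2,300 rows per second for random writes, a factor of 120. That fits the division of labour on the data import slide: bulk load for large batches, PUT for small near-real-time updates.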