HBaseCon 2012 | HBase, the Use Case in eBay Cassini

  1. HBase, the Use Case in eBay Cassini. Thomas Pan, Principal Software Engineer, eBay Marketplaces
  2. eBay Marketplaces
     - 97 million active buyers and sellers worldwide
     - 200+ million items in more than 50,000 categories
     - 2 billion page views each day
     - 9 petabytes of data in our Hadoop and Teradata clusters
     - 250 million queries each day to our search engine
  3. Cassini: eBay’s new search engine
     - Entirely new codebase
     - World-class, from a world-class team
     - Platform for ranking innovation
     - Four major tracks, 100+ engineers
     - Likely to launch in 2012
  4. Indexing in Cassini
     - Index with more data and more history
     - More computationally expensive work at index time (and less at query time)
     - Ability to rescore and reclassify the entire site inventory
     - The entire site inventory is stored in HBase
     - Indexes are built via MapReduce jobs and stored in HDFS
     - Build the entire site inventory in hours
  5. HBase Table Data Import
     - Bulk load
       - Batch processing, on demand or every couple of hours
       - Loads a large amount of data quickly
     - PUT
       - Near-real-time updates
       - Better for updating small amounts of data
       - Read after PUT for better random read performance
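Slide 5 contrasts the two import paths. The sketch below shows roughly what the PUT path could look like with the 2012-era HBase Java client (HTable/Put); the table name "active_items" and the column family/qualifier names are illustrative assumptions, not eBay's actual schema, and the bulk-load alternative is only indicated in a comment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class NearRealTimeUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "active_items", "cf1" and "title" are illustrative, not eBay's actual schema.
    HTable table = new HTable(conf, "active_items");

    long docId = 1234567890L;
    // Bit-reversed document id as the row key (the scheme described on slide 6).
    byte[] rowKey = Bytes.toBytes(Long.reverse(docId));

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("title"), Bytes.toBytes("vintage camera"));
    table.put(put);

    // "Read after PUT", as the slide recommends for better random read performance.
    Result r = table.get(new Get(rowKey));
    System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("title"))));

    // The bulk-load path would instead write HFiles from a MapReduce job
    // (HFileOutputFormat) and hand them to LoadIncrementalHFiles#doBulkLoad.
    table.close();
  }
}
```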
  6. HBase Tables
     - 3 major tables: active items, completed items, and sellers
     - 15 TB of data
     - 3,600 pre-split regions per table, with auto-split disabled
     - 3 column families with a maximum of 200 columns
     - Automatic major compaction disabled
     - Row key is the bit reversal of the document id (unsigned 64-bit integer)
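A minimal sketch of how a table like this might be created with the 2012-era admin API. The table and family names, the use of setMaxFileSize to suppress auto-splitting, and the trivial main method are assumptions for illustration; only the 3,600-region pre-split count and the bit-reversed key scheme come from the slide.

```java
import java.math.BigInteger;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // "active_items" and "cf1" are illustrative names, not eBay's actual schema.
    HTableDescriptor desc = new HTableDescriptor("active_items");
    HColumnDescriptor cf = new HColumnDescriptor("cf1");
    cf.setMaxVersions(10);                    // multiple versions are kept (see slide 8)
    desc.addFamily(cf);
    desc.setMaxFileSize(Long.MAX_VALUE);      // one way to effectively disable auto-split
    // Time-based major compaction can be disabled by setting
    // hbase.hregion.majorcompaction=0 in hbase-site.xml; compactions then run manually.

    // Bit-reversed row keys are uniformly distributed over the unsigned 64-bit space,
    // so the table can be pre-split into equal-width regions.
    int numRegions = 3600;
    byte[][] splits = new byte[numRegions - 1][];
    BigInteger keySpace = BigInteger.ONE.shiftLeft(64);
    for (int i = 1; i < numRegions; i++) {
      long boundary = keySpace.multiply(BigInteger.valueOf(i))
                              .divide(BigInteger.valueOf(numRegions)).longValue();
      splits[i - 1] = Bytes.toBytes(boundary); // big-endian bytes sort as unsigned values
    }
    admin.createTable(desc, splits);

    // Row key for a document id, matching the scheme on this slide:
    long docId = 1234567890L;
    byte[] rowKey = Bytes.toBytes(Long.reverse(docId));
    System.out.println(Bytes.toStringBinary(rowKey));
    admin.close();
  }
}
```

Reversing the bits of a monotonically increasing document id scatters consecutive ids across the whole key space, which avoids hot-spotting a single region during writes while keeping the key a fixed 8 bytes.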
  7. Indexing Job Pipeline
     - Full table scan
     - Run every couple of hours
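Slide 7's pipeline starts from a full table scan feeding MapReduce, with the resulting index stored in HDFS (slide 4). Below is a bare-bones sketch of what such a scan-driven job could look like with TableMapReduceUtil and 2012-era APIs; the table name, output path handling, and the trivial mapper body are placeholders, not the actual Cassini indexing code.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class IndexBuildJob {

  // Hypothetical mapper: turns each HBase row into an (id, index fragment) pair.
  static class IndexMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // Real logic would emit mergeable index fragments; here we just emit the row key.
      context.write(new Text(Bytes.toStringBinary(row.get())), new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "cassini-index-build");   // pre-YARN style, 2012-era API
    job.setJarByClass(IndexBuildJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch rows in batches to speed up the full scan
    scan.setCacheBlocks(false);  // don't pollute the block cache during MR scans

    TableMapReduceUtil.initTableMapperJob(
        "active_items", scan, IndexMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0);    // map-only here; real merge jobs run downstream
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0])); // index output lands in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```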
  8. Numbers
     - Data import
       - Bulk data import: 30 minutes for 500 million full rows
       - Random write: ~200,000,000 rows per day
       - 1.2 TB of data imported daily
     - Scan performance
       - Scan speed: 2,004 rows per second per region server (average of 3 versions), 465 rows per second per region server (average of 10 versions)
       - Scan speed with filters: 325-353 rows per second per region server
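To make the "scan speed with filters" line concrete, here is a hedged client-side sketch of how such a scan might be issued and timed; the table, column, and filter value are hypothetical, and the deck's numbers are per region server rather than from a single client like this.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanThroughput {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "active_items");  // illustrative table name

    Scan scan = new Scan();
    scan.setCaching(1000);       // rows per RPC; larger batches help sequential scans
    scan.setMaxVersions(3);      // read multiple versions, as in the numbers above
    // Server-side filtering ships less data back but slows the per-row scan rate:
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf1"), Bytes.toBytes("status"),  // hypothetical column
        CompareOp.EQUAL, Bytes.toBytes("ACTIVE")));

    long rows = 0;
    long start = System.currentTimeMillis();
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      rows++;
    }
    scanner.close();
    long elapsedMs = System.currentTimeMillis() - start;
    System.out.printf("%d rows in %d ms (%.0f rows/sec)%n",
        rows, elapsedMs, rows * 1000.0 / Math.max(elapsedMs, 1));
    table.close();
  }
}
```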
  9. Operations
     - Monitoring
       - Ganglia
       - Nagios
       - OpenTSDB
     - Testing
       - Unit tests and regression tests
       - HBaseTestingUtility for unit tests
       - Standalone HBase for regression tests (mvn verify)
       - Cluster level: Fault Injection Tests [HBASE-4925]
     - Region balancer
     - Manual major compaction
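For the unit-testing bullet, a minimal JUnit 4 sketch of HBaseTestingUtility's in-process mini-cluster is shown below; the table name, column family, and the round-trip assertion are illustrative only.

```java
import static org.junit.Assert.assertArrayEquals;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class ItemTableTest {
  private static final HBaseTestingUtility UTIL = new HBaseTestingUtility();

  @BeforeClass
  public static void setUp() throws Exception {
    UTIL.startMiniCluster();   // in-process ZooKeeper, HDFS and HBase
  }

  @AfterClass
  public static void tearDown() throws Exception {
    UTIL.shutdownMiniCluster();
  }

  @Test
  public void putThenGet() throws Exception {
    HTable table = UTIL.createTable(Bytes.toBytes("test_items"), Bytes.toBytes("cf1"));
    byte[] row = Bytes.toBytes(Long.reverse(42L));   // same bit-reversed key scheme
    Put put = new Put(row);
    put.add(Bytes.toBytes("cf1"), Bytes.toBytes("title"), Bytes.toBytes("widget"));
    table.put(put);
    byte[] value = table.get(new Get(row)).getValue(Bytes.toBytes("cf1"), Bytes.toBytes("title"));
    assertArrayEquals(Bytes.toBytes("widget"), value);
  }
}
```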
  10. Operations (Cont’d)
     - Disable swap
     - Greatly increase the file descriptor limit and xceiver count
     - Metrics to watch
       - jvm.DataNode.metrics.threadRunnable
       - Connection leakage (check with netstat)
       - hbase.regionserver.compactionQueueSize: major/minor compactions
       - dfs.datanode.blockReports_avg_time: data block reporting (for too many data blocks)
       - network_report: network bandwidth usage (for data locality)
  11. Community Acknowledgement
     - Eli Collins, Kannan Muthukkaruppan, Karthik Ranganathan, Konstantin Shvachko, Lars George, Michael Stack, Ted Yu, Todd Lipcon

Editor's Notes

  1. 45 nodes per rack, with 5 racks of data nodes in total. Each node has 12 x 2 TB of disk space, 72 GB of RAM, and 24 cores with hyper-threading. Each node runs a region server, a task tracker, and a data node, with 8 open slots for mappers and 6 open slots for reducers. Enterprise nodes are dual-powered and dual-homed with active-active TORs (top-of-rack switches), and are backed up by a NetApp Filer. There is no TOR redundancy on the data node racks. Why share HMaster with the ZooKeeper nodes? Meeting notes (1/26/12 14:02): TORs lack redundancy; racks are shared among different clusters, so network bandwidth on the TORs could become an issue. With the extra 5 racks, the impact is much smaller.
  2. MapReduce is used to slice and dice the data, leveraging the large-scale cluster. The indexing job converts raw data into pieces that are easy to merge, in index format and grouped under query-node columns. The merge jobs run in parallel; among them, the posting-list merge job is the most expensive and will become more so. Column-group data is copied 4 times and posting-list data 5 times in the pipeline. Meeting notes (1/26/12 14:02): Nick: Why not collapse all three merge/packing/packaging phases together?