HBaseCon 2012 | HBase, the Use Case in eBay Cassini

The eBay marketplace has been working hard on its next-generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistence layer built on Hadoop that supports billions of updates per day. Its easy sharding, fast writes and table scans, very fast bulk data loads, and natural integration with Hadoop provide the cornerstones for successful continuous index builds. We will share the technical details, along with the difficulties and challenges that we have gone through and that we still face in the process.

Slide notes:
  • 45 nodes per rack, with 5 racks of data nodes in total. Each node has 12 × 2 TB of disk, 72 GB of RAM, and 24 cores with hyper-threading. Each node runs a region server, a task tracker, and a data node, with 8 open mapper slots and 6 open reducer slots. Enterprise nodes are dual-powered, dual-homed with active-active top-of-rack switches (TORs), and backed up by a NetApp Filer. There is no TOR redundancy on the data node racks. Why share the HMaster with the ZooKeeper nodes? ----- Meeting Notes (1/26/12 14:02) ----- The TORs lack redundancy. Racks could be shared among different clusters, but then network bandwidth on the TORs could become an issue. With the extra 5 racks, the impact is much smaller.
  • MapReduce slices and dices the data, leveraging the large-scale cluster. The indexing job converts raw data into pieces that are easy to merge, in index format, and grouped under query-node columns. Merge jobs run in parallel; among them, the posting-list merge job is the most expensive and will become more so. Column-group data is copied 4 times and posting-list data 5 times in the pipeline. ----- Meeting Notes (1/26/12 14:02) ----- Nick: Why not collapse all three merge/packing/packaging phases together?

    1. HBase, the Use Case in eBay Cassini
       Thomas Pan, Principal Software Engineer, eBay Marketplaces
    2. eBay Marketplaces
       • 97 million active buyers and sellers worldwide
       • 200+ million items in more than 50,000 categories
       • 2 billion page views each day
       • 9 petabytes of data in our Hadoop and Teradata clusters
       • 250 million queries each day to our search engine
    3. Cassini
       • eBay's new search engine
       • Entirely new codebase
       • World-class, from a world-class team
       • Platform for ranking innovation
       • Four major tracks, 100+ engineers
       • Likely launch in 2012
    4. Indexing in Cassini
       • Index with more data and more history
       • More computationally expensive work at index time (and less at query time)
       • Ability to rescore and reclassify the entire site inventory
       • The entire site inventory is stored in HBase
       • Indexes are built via MapReduce jobs and stored in HDFS
       • Build the entire site inventory in hours
    5. HBase Table Data Import
       • Bulk load
         – Batch processing, on demand or every couple of hours
         – Loads a large amount of data quickly
       • PUT
         – Near-real-time updates
         – Better for updating small amounts of data
         – Read after PUT for better random read performance
    6. HBase Tables
       • 3 major tables: active items, completed items, and sellers
       • 15 TB of data
       • 3,600 pre-split regions per table, with auto-split disabled
       • 3 column families with a maximum of 200 columns
       • Automatic major compaction disabled
       • Rowkey is the bit reversal of the document id (unsigned 64-bit integer)
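The bit-reversed rowkey scheme on this slide can be sketched in plain Java. The reversal spreads sequentially assigned document ids evenly across the 3,600 pre-split regions instead of hot-spotting the last one. This is an illustrative sketch, not eBay's actual code; the class and method names are assumptions, and the id is carried in a Java `long` since Java has no unsigned 64-bit type.

```java
public class RowKeys {
    // Reverse the bit order of a 64-bit document id and serialize it
    // big-endian, so that consecutive ids land far apart in the key space.
    static byte[] toRowKey(long docId) {
        long reversed = Long.reverse(docId);
        byte[] key = new byte[8];
        for (int i = 0; i < 8; i++) {
            key[i] = (byte) (reversed >>> (56 - 8 * i));
        }
        return key;
    }

    public static void main(String[] args) {
        // Two consecutive ids map to opposite halves of the key space:
        // id 1 -> first byte 0x80, id 2 -> first byte 0x40.
        System.out.printf("%02x %02x%n",
                toRowKey(1L)[0] & 0xff, toRowKey(2L)[0] & 0xff); // prints "80 40"
    }
}
```

With 3,600 uniform pre-splits over this reversed key space, every region receives a roughly equal share of newly assigned document ids, which keeps bulk loads and random writes balanced.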
    7. Indexing Job Pipeline
       • Full table scan
       • Runs every couple of hours
    8. Numbers
       • Data import
         – Bulk data import: 30 minutes for 500 million full rows
         – Random writes: ~200,000,000 rows per day
         – 1.2 TB of data imported daily
       • Scan performance
         – Scan speed: 2,004 rows per second per region server (average of 3 versions), 465 rows per second per region server (average of 10 versions)
         – Scan speed with filters: 325 to 353 rows per second per region server
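As a back-of-the-envelope check on the bulk-import figure above (our arithmetic, not a number from the slides), 500 million rows in 30 minutes works out to roughly 278,000 rows per second aggregated across the cluster:

```java
public class ImportRate {
    public static void main(String[] args) {
        long rows = 500_000_000L;   // full rows per bulk import (from the slide)
        long seconds = 30L * 60L;   // 30-minute import window
        long rowsPerSecond = rows / seconds;
        System.out.println(rowsPerSecond + " rows/s aggregate"); // prints "277777 rows/s aggregate"
    }
}
```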
    9. Operations
       • Monitoring
         – Ganglia
         – Nagios
         – OpenTSDB
       • Testing
         – Unit tests and regression tests
         – HBaseTestingUtility for unit tests
         – Standalone HBase for regression tests (mvn verify)
         – Cluster level: fault injection tests [HBASE-4925]
       • Region balancer
       • Manual major compaction
    10. Operations (Cont'd)
       • Disable swap
       • Largely increase the file descriptor limit and xciever count
       • Metrics to watch
         – jvm.DataNode.metrics.threadRunnable: connection leakage (check with netstat)
         – hbase.regionserver.compactionQueueSize: major/minor compactions
         – dfs.datanode.blockReports_avg_time: data block reporting (for too many data blocks)
         – network_report: network bandwidth usage (for data locality)
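The swap and file-descriptor tuning above typically looks like the following on each node. This is a sketch: the values follow the HBase/Hadoop guidance of that era, and the right limits are deployment-specific.

```shell
# Disable swap so region servers and data nodes are never paged out:
swapoff -a

# Raise the per-process file descriptor limit (persist it in
# /etc/security/limits.conf, e.g. "hadoop - nofile 32768"):
ulimit -n 32768

# Raise the DataNode transceiver ("xciever") count in hdfs-site.xml --
# note Hadoop's historical spelling of the property name:
#   <property>
#     <name>dfs.datanode.max.xcievers</name>
#     <value>4096</value>
#   </property>
```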
    11. Community Acknowledgements
       • Eli Collins
       • Kannan Muthukkaruppan
       • Karthik Ranganathan
       • Konstantin Shvachko
       • Lars George
       • Michael Stack
       • Ted Yu
       • Todd Lipcon
