eBay Marketplaces has been working hard on its next-generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistence layer built on Hadoop that supports billions of updates per day. Its easy sharding, fast writes and table scans, very fast bulk data loads, and natural integration with Hadoop provide the cornerstones for successful continuous index builds. We will share the technical details, as well as the difficulties and challenges we have gone through and are still facing in the process.
HBaseCon 2012 | HBase, the Use Case in eBay Cassini
1. HBase, the Use Case in eBay Cassini
Thomas Pan
Principal Software Engineer
eBay Marketplaces
2. eBay Marketplaces
97 million
active buyers and sellers worldwide
200+ million items
in more than 50,000 categories
2 billion page views
each day
9 petabytes of data
in our Hadoop and Teradata clusters
250 million queries
each day to our search engine
3. Cassini
eBay’s new Search Engine
Entirely new codebase
World-class, from a world-class team
Platform for ranking innovation
Four major tracks, 100+ engineers
Likely launch in 2012
4. Indexing in Cassini
Index with more data and more history
More computationally expensive work at index time (and less at query time)
Ability to rescore and reclassify entire site inventory
The entire site inventory is stored in HBase
Indexes are built via MapReduce jobs and stored in HDFS (see the sketch below)
Build the entire site inventory in hours
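The index-build flow outlined above can be pictured as a table-scan MapReduce job over the inventory stored in HBase. The sketch below is illustrative only, not Cassini's actual indexing code; the table name ("active_items"), column family ("d"), and column ("title") are assumptions.

```java
// Illustrative sketch: scan an HBase inventory table with MapReduce and write
// index records to HDFS. Table/family/column names are hypothetical.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class InventoryIndexJob {

  // Maps each HBase row to a (docId, indexable text) pair.
  static class IndexMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      byte[] title = row.getValue(Bytes.toBytes("d"), Bytes.toBytes("title"));
      if (title != null) {
        ctx.write(new Text(Bytes.toStringBinary(rowKey.get())),
                  new Text(Bytes.toString(title)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "inventory-index-build");
    job.setJarByClass(InventoryIndexJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching for full-table scans
    scan.setCacheBlocks(false);  // don't pollute the block cache with MR scans

    TableMapReduceUtil.initTableMapperJob(
        "active_items", scan, IndexMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(0);    // map-only: write index records straight to HDFS
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```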
5. (image slide; no text captured)
6. (image slide; no text captured)
7. HBase Table Data Import
Bulk Load
Batch processing on demand or every couple of hours
Load a large amount of data quickly
PUT
Near-real-time updates
Better for updating small amounts of data
Read after PUT for better random read performance
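As a rough illustration of the PUT path above (not the production code), the sketch below writes one row with the HBase client API and then reads it back to warm the block cache; the "active_items" table and "d" family are hypothetical names. Bulk load, by contrast, writes HFiles (for example via HFileOutputFormat) and hands them to the region servers in one step.

```java
// Illustrative sketch of a near-real-time update followed by a read-after-PUT.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NearRealTimeUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "active_items");   // hypothetical table name

    byte[] rowKey = Bytes.toBytes(Long.reverse(1234567890L)); // bit-reversed doc id
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("price"), Bytes.toBytes("19.99"));
    table.put(put);

    // "Read after PUT": touch the row once so its block is pulled into the
    // region server's block cache, improving subsequent random reads.
    table.get(new Get(rowKey));

    table.close();
  }
}
```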
8. HBase Tables
3 major tables: active items, completed items and sellers
15TB data
3600 pre-split regions per table with auto-split disabled
3 column families with a maximum of 200 columns
Automatic major compaction disabled
RowKey is the bit reversal of the document id (unsigned 64-bit integer)
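A minimal sketch of how such a pre-split table might be created with the client API, assuming hypothetical family names and a uniform split of the 64-bit key space; the production schema is certainly more involved.

```java
// Illustrative sketch: a pre-split table keyed by the bit-reversed document id,
// with auto-splitting effectively disabled. Names and values are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePreSplitTable {

  // Bit reversal spreads mostly-sequential document ids evenly across the key
  // space, so new writes don't all land on the last region.
  static byte[] rowKeyFor(long docId) {
    return Bytes.toBytes(Long.reverse(docId));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    HTableDescriptor desc = new HTableDescriptor("active_items");
    desc.addFamily(new HColumnDescriptor("d"));  // family names are hypothetical
    desc.addFamily(new HColumnDescriptor("r"));
    desc.addFamily(new HColumnDescriptor("m"));
    // A very large max region size effectively disables automatic splitting.
    desc.setMaxFileSize(100L * 1024 * 1024 * 1024);

    // Pre-split into 3600 regions spread uniformly over the 64-bit key space.
    byte[] startKey = Bytes.toBytes(1L);
    byte[] endKey = Bytes.toBytes(-1L);  // eight 0xFF bytes: top of the unsigned range
    admin.createTable(desc, startKey, endKey, 3600);
    admin.close();
  }
}
```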
10. Numbers
Data import
Bulk data import: 30 minutes for 500 million full rows
Random writes: ~200,000,000 rows per day
1.2 TB of data imported daily
Scan Performance
Scan speed: 2,004 rows per second per region server (with an average of 3 versions), 465 rows per second per region server (with an average of 10 versions)
Scan speed with filters: 325 to 353 rows per second per region server
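For context on what such scans look like, here is a minimal, hypothetical sketch of a multi-version scan with a server-side filter; the table, family, and column names are assumptions, and the tuning values are illustrative rather than the settings used for the numbers above.

```java
// Illustrative sketch: a full-table, multi-version scan with an optional filter.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "active_items");   // hypothetical table name

    Scan scan = new Scan();
    scan.setMaxVersions(10);     // read up to 10 versions per cell
    scan.setCaching(1000);       // fetch rows in large batches per RPC
    scan.setCacheBlocks(false);  // full scans should not evict the block cache

    // Server-side filtering reduces data on the wire but costs scan throughput.
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("d"), Bytes.toBytes("site"),
        CompareOp.EQUAL, Bytes.toBytes("US")));

    long rows = 0;
    long start = System.currentTimeMillis();
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      rows++;
    }
    scanner.close();
    System.out.printf("scanned %d rows in %d ms%n",
        rows, System.currentTimeMillis() - start);
    table.close();
  }
}
```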
11. Operations
Monitoring
Ganglia
Nagios
OpenTSDB
Testing
Unit tests and regression tests
HBaseTestingUtility for unit tests (see the sketch after this slide)
Standalone HBase for regression tests (mvn verify)
Cluster level
Fault Injection Tests [HBASE-4925]
Region balancer
Manual major compaction
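A minimal sketch of a unit test built on HBaseTestingUtility's in-process mini cluster, as referenced above; the table name, family, and assertion are illustrative only.

```java
// Illustrative sketch: put/get round trip against a mini cluster in a unit test.
import static org.junit.Assert.assertArrayEquals;

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class ItemTableTest {
  private static final HBaseTestingUtility UTIL = new HBaseTestingUtility();

  @BeforeClass
  public static void setUp() throws Exception {
    UTIL.startMiniCluster();  // spins up ZooKeeper, HDFS and a region server in-process
  }

  @AfterClass
  public static void tearDown() throws Exception {
    UTIL.shutdownMiniCluster();
  }

  @Test
  public void putThenGetRoundTrips() throws Exception {
    HTable table = UTIL.createTable(Bytes.toBytes("active_items"), Bytes.toBytes("d"));
    byte[] row = Bytes.toBytes(Long.reverse(42L));  // bit-reversed doc id

    Put put = new Put(row);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("title"), Bytes.toBytes("vintage camera"));
    table.put(put);

    byte[] stored = table.get(new Get(row))
        .getValue(Bytes.toBytes("d"), Bytes.toBytes("title"));
    assertArrayEquals(Bytes.toBytes("vintage camera"), stored);
  }
}
```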
12. Operations (Cont’d)
Disable swap
Significantly increase the file descriptor limit and the DataNode xceiver count
Metrics to watch
jvm.DataNode.metrics.threadRunnable: connection leakage (check with netstat)
hbase.regionserver.compactionQueueSize: major/minor compactions
dfs.datanode.blockReports_avg_time: data block reporting (for too many data blocks)
network_report: network bandwidth usage (for data locality)
13. Community
Acknowledgement
Eli Collins
Kannan Muthukkaruppan
Karthik Ranganathan
Konstantin Shvachko
Lars George
Michael Stack
Ted Yu
Todd Lipcon
Editor's Notes
45 nodes per rack, with 5 racks of data nodes in total. Each node has 12 * 2TB of disk space, 72GB RAM and 24 cores with hyper-threading. Each node runs a region server, task tracker and data node, with 8 open slots for mappers and 6 open slots for reducers. Enterprise nodes are dual-powered and dual-homed with active-active top-of-rack switches (TORs), and backed up by a NetApp filer. No TOR redundancy on the data node racks. Why share HMaster with the ZooKeeper nodes?
----- Meeting Notes (1/26/12 14:02) -----
# TOR lack of redundancy: share racks among different clusters. Then network bandwidth on the TORs could become an issue. With the extra 5 racks, the impact is much smaller.
MapReduce is used to slice and dice the data, leveraging the large-scale cluster. The indexing job converts raw data into pieces of data that are easy to merge, in index format and grouped under query node columns. The merge jobs run in parallel; among them, the posting list merge job is the most expensive and will become more expensive. Column group data is copied 4 times and posting list data is copied 5 times in the pipeline.
----- Meeting Notes (1/26/12 14:02) -----
Nick: Why not collapse all three merge/packing/packaging phases together?