Bass Chorng is a principal capacity engineer at eBay who specializes in database performance, availability, and scalability. He established eBay's database capacity team in 2003. eBay uses both NoSQL and RDBMS databases including Cassandra, MongoDB, CouchBase, and Oracle. eBay sees over 400 billion database calls per day across 2000 NoSQL nodes and 450 Oracle nodes while hosting 800 million active items and 120 million active users. Capacity planning involves analyzing traffic, utilization, forecasting growth, and converting resource needs into costs. It requires knowledge of the platform, bottlenecks, and new technologies.
eBay: DB Capacity Planning at eBay
1. Feng Qu, Sr MTS
Bass Chorng, Principal Capacity Engineer
DB Capacity Planning at eBay
#CassandraSummit2015
2. Who Am I?
Bass Chorng – Principal Capacity Engineer @ eBay
Specializes in database performance, availability & scalability for a large website.
Established DB capacity team at eBay in 2003.
Loves mountain biking.
3. eBay Site DB Traffic At A Glance
NoSQL Total – 52 B/Day
Cassandra – 15 B
Mongo – 15 B
CouchBase – 12 B
PushVM – 10B
RDBMS Total – 350 B
MySQL – 10 B
Oracle – 340 B
Peak Traffic – 8M/sec
Site Total DB Calls – 400B/Day across 2,000 NoSQL Nodes + 450 Oracle Nodes
Hosting 800M Active items & 120M Active Users
Y-o-Y Growth – 30% ~ 35%
[Chart: Billion SQL Calls per Day by database. Cassandra 15, Mongo 15, CouchBase 12, PushVM 10, MySQL 10, Oracle 340]
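The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above. The `project` helper and its assumption that growth compounds once per year at a constant rate are illustrative, not something the slide states.

```python
# Figures quoted on the slide, in billions of calls per day.
nosql = {"Cassandra": 15, "Mongo": 15, "CouchBase": 12, "PushVM": 10}
rdbms = {"MySQL": 10, "Oracle": 340}

nosql_total = sum(nosql.values())        # matches "NoSQL Total - 52 B/Day"
rdbms_total = sum(rdbms.values())        # matches "RDBMS Total - 350 B"
site_total = nosql_total + rdbms_total   # 402, quoted as 400B on the slide

SECONDS_PER_DAY = 86_400
avg_calls_per_sec = site_total * 1e9 / SECONDS_PER_DAY
peak_calls_per_sec = 8e6                 # "Peak Traffic - 8M/sec"
peak_to_avg = peak_calls_per_sec / avg_calls_per_sec


def project(total, yearly_growth, years):
    """Assumption: growth compounds once per year at a constant rate."""
    return total * (1 + yearly_growth) ** years


print(f"NoSQL {nosql_total}B, RDBMS {rdbms_total}B, site {site_total}B per day")
print(f"average {avg_calls_per_sec / 1e6:.2f}M calls/sec, "
      f"peak/average ratio {peak_to_avg:.2f}")
for growth in (0.30, 0.35):              # "Y-o-Y Growth - 30% ~ 35%"
    print(f"{growth:.0%} growth: {project(site_total, growth, 1):.0f}B/day after one year")
```

The peak-to-average ratio is worth tracking alongside the daily totals, because capacity has to cover the peak rather than the daily average.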
4. Capacity Planning - Simply Put
Ø Analyze Traffic
o Data
Ø Analyze Utilization
o Data
Ø Analyze The Relationship Of The Above Two
o Same Data
Ø Forecast Growth
o Simple Models, Then Impress Your Boss.
Ø Convert Resource Need into $
o A Calculator, Then Impress Your CIO
BTW, You Also Need To Know …
• Platform Domain Knowledge – Server, DB Engine, IO Subsystem, Networks …
• Relationship Between System Overhead & Utilization
• Seasonality & Workload Characteristics
• Bottlenecks – Components, Systems, Platforms, Architecture, Site & Apps
• New Technologies
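"Analyze the relationship" between traffic and utilization can be as simple as fitting a straight line through paired samples. The following is a minimal sketch, not the speakers' tooling: the sample values, the `fit_line` helper, and the choice of CPU as the utilization metric are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical paired samples for one database host:
# traffic in million calls/hour, utilization as average CPU %.
traffic = [100, 150, 200, 250, 300]
cpu_pct = [12.0, 17.0, 22.0, 27.0, 32.0]

slope, intercept = fit_line(traffic, cpu_pct)

# The intercept approximates utilization at zero traffic (system overhead);
# the slope is the CPU cost of each additional million calls/hour.
cpu_at_400 = slope * 400 + intercept
print(f"overhead ~{intercept:.1f}%, {slope:.3f}% per million calls/hour, "
      f"projected CPU at 400M calls/hour ~{cpu_at_400:.1f}%")
```

Once the slope and intercept are known, a traffic forecast converts directly into a utilization forecast, which is the step the later "Forecast" slide relies on.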
6. Data
Ø What To Collect?
Apps, Database, Sessions, CPU, Memory, Connections, IOPS,
IO Time, NIC, HBA, Array
Ø How To Collect?
Time Resolution, Aggregation Level, Retention
Ø How To Use It?
Average, Max, 95th percentile, Dashboard, Reporting, Trending
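To make the "how to use it" options concrete, here is a small sketch that computes average, max, and 95th percentile for one metric, then repeats the summary after aggregating to a coarser time resolution. The sample data is invented, and the nearest-rank percentile definition is one of several in common use.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    return {
        "avg": sum(values) / len(values),
        "max": max(values),
        "p95": percentile(values, 95),
    }


def downsample(values, bucket):
    """Average consecutive samples into coarser buckets (e.g. 1-minute to 1-hour)."""
    return [
        sum(values[i:i + bucket]) / len(values[i:i + bucket])
        for i in range(0, len(values), bucket)
    ]


# Hypothetical 1-minute CPU % samples for one hour, ending with a 5-minute spike.
cpu_1min = [20.0] * 55 + [90.0] * 5

fine = summarize(cpu_1min)                    # spike visible in max and p95
coarse = summarize(downsample(cpu_1min, 60))  # one hourly average: spike is gone
print("1-minute resolution:", fine)
print("1-hour resolution:  ", coarse)
```

The same hour of data looks very different at the two resolutions, which is why resolution, aggregation level, and retention have to be decided together when choosing how to collect.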
[Two time-series charts; only the axis ticks survive. First chart: daily dates 5/1/2015 to 5/27/2015 on a 0.0 to 4.0 scale. Second chart: dates 1/26/2015 to 3/1/2015 on a 0 to 40,000,000 scale.]
7. Forecast
Ø Model Traffic, Not Resources
Ø Need One Year Trend
Ø Forecast At Daily Level
Ø Eliminate Outliers
Ø No Data Is Better Than Wrong Data
Ø Convert Traffic To Resource Usage
Ø Linear Extrapolation Only (CPU Utilization, not IO Time)
Ø Simple Excel Formula Works Well
Ø For Long Term Resource Planning Only
Ø Use Average, Not Max
Ø Not All Workloads Are Predictable
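The bullets above can be strung together into a small pipeline: take a year of daily traffic, drop outliers, fit a straight line, extrapolate, and only then convert traffic into resource usage. The sketch below is one way to do that under stated assumptions. The synthetic data, the residual-based outlier rule (3 scaled MADs), and the two CPU conversion constants are all hypothetical; the slide itself only says a simple Excel formula works well.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_without_outliers(xs, ys, k=3.0):
    """Fit once, drop points whose residual is more than k scaled MADs from the median, refit."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    med = statistics.median(residuals)
    mad = statistics.median([abs(r - med) for r in residuals])
    limit = k * 1.4826 * mad
    kept = [(x, y) for x, y, r in zip(xs, ys, residuals)
            if mad == 0 or abs(r - med) <= limit]
    refit = fit_line([x for x, _ in kept], [y for _, y in kept])
    return refit, len(xs) - len(kept)


# Hypothetical year of daily average traffic (billion calls/day) with steady growth.
days = list(range(365))
calls = [10 + 0.01 * d for d in days]
calls[100] = 0.0  # a collection gap recorded as zero: wrong data, not "no data"

(slope, intercept), dropped = fit_without_outliers(days, calls)

# Linear extrapolation one year past the last observation.
forecast_day = days[-1] + 365
forecast_calls = slope * forecast_day + intercept

# Convert traffic to resource usage with a linear CPU model (hypothetical constants).
CPU_OVERHEAD_PCT = 2.0      # utilization at zero traffic
CPU_PCT_PER_BILLION = 3.0   # CPU % per billion calls/day
forecast_cpu = CPU_OVERHEAD_PCT + CPU_PCT_PER_BILLION * forecast_calls

print(f"dropped {dropped} outlier(s); forecast {forecast_calls:.2f}B calls/day, "
      f"~{forecast_cpu:.1f}% CPU")
```

The model is fitted to traffic rather than to CPU directly, following "Model Traffic, Not Resources": a hardware change shifts the conversion constants without invalidating the traffic trend.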
[Chart: CATY Traffic Forecast. Y-axis: billion calls, 0 to 70. X-axis: 01/01/2012 to 01/01/2015. Series: Forecast, Actual, Capacity.]
8. Things To Watch For
Myths
Ø More CPU Makes Apps Run Faster
Ø More Data Makes Apps Run Slower
Ø Apps Run Twice As Fast On CPU Twice The Speed
Ø High Session = High Load
Pitfalls
Ø Cause VS. Symptom
Ø Time Resolution Masks Issues
Ø Look At The Whole Picture
Ø Slow Down In Order To Go Faster < Throttle >
Challenges
Ø Data Quality – Data Missing, Data Source Changes, F/O Data Residency, Data Errors …
Ø Varieties of Data Formats & Resolutions
Ø Data Collection In Secured Zones
9. Me: Everything NoSQL
Ø Prior to 2011: Worked on Oracle at DoubleClick/Yahoo/Intuit
Ø Worked on NoSQL at eBay Database Infrastructure team:
o Cassandra since 2011
o MongoDB since 2012
o Couchbase since 2014
Ø Cassandra Summit speaker for 2013, 2014, 2015
Ø DataStax Cassandra MVP for 2014, 2015
11. Benchmarking
Ø Benchmarking for different hardware
o High I/O SKU
o High memory SKU
o High storage SKU
o Bare metal or cloud
Ø Benchmarking for different software releases
Ø Benchmarking for different workloads
o 100% Writes
o 50% Writes, 50% Reads
o 5% Writes, 95% Reads
o 100% Reads
Ø Benchmarking Tools
o YCSB
o cassandra-stress
Ø A proactive, repeated process using near-real-time traffic in a prod-like environment
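Because hardware, software release, and workload vary independently, the benchmark plan is effectively a matrix. The sketch below enumerates that matrix so no combination is skipped. The workload mixes come from the slide; the SKU identifiers, release labels, and workload names are placeholders.

```python
from itertools import product

# Hardware SKUs named on the slide (identifiers are illustrative).
hardware = ["high-io", "high-memory", "high-storage"]

# Hypothetical release labels: the version in production and the upgrade candidate.
releases = ["current", "candidate"]

# Workload mixes from the slide: name -> (write %, read %). Names are illustrative.
workloads = {
    "write-only": (100, 0),
    "balanced": (50, 50),
    "read-heavy": (5, 95),
    "read-only": (0, 100),
}

matrix = [
    {"hardware": hw, "release": rel, "workload": name,
     "write_pct": write_pct, "read_pct": read_pct}
    for hw, rel, (name, (write_pct, read_pct))
    in product(hardware, releases, workloads.items())
]

print(len(matrix), "benchmark runs to schedule")
for run in matrix[:3]:
    print(run)
```

Each entry would then be handed to whichever tool is in use (YCSB or cassandra-stress) and its measured throughput and latency stored as the performance baseline for that combination.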
12. Capacity Planning
Ø Key to avoiding surprises in production
Ø The concept behind capacity planning is simple, but the mechanics are harder.
Ø Business requirements may grow, so forecast how much resource must be added to the system to keep the user experience uninterrupted
Ø Input: a clearly defined capacity goal from business requirements, plus a performance baseline from benchmark tests
Ø Output: the resources to be added, such as memory, CPU, storage, I/O, network
Ø Always prepare for peak + headroom
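The input/output framing above reduces to a short calculation: divide the peak demand by the usable per-node throughput, where "usable" is the benchmark baseline minus headroom. This is a minimal sketch; the function name, the 30% headroom default, and every number in the example are assumptions.

```python
import math


def nodes_needed(peak_ops_per_sec, per_node_ops_per_sec, headroom=0.3):
    """Nodes required to serve peak traffic while keeping `headroom` of each node's capacity spare."""
    usable_per_node = per_node_ops_per_sec * (1 - headroom)
    return math.ceil(peak_ops_per_sec / usable_per_node)


# Hypothetical inputs.
current_nodes = 12
required = nodes_needed(
    peak_ops_per_sec=250_000,     # capacity goal from business requirements
    per_node_ops_per_sec=20_000,  # performance baseline from a benchmark run
    headroom=0.3,
)
to_add = max(0, required - current_nodes)
print(f"required {required} nodes, add {to_add}")
```

The same shape works for any single bottleneck resource (storage, I/O, network); the cluster size is then the largest of the per-resource answers.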
13. Capacity Planning Process
Ø Initial Sizing
o Storage size vs. data size
o Compaction overhead, compression ratio, RF, indexes
o Cost-effective configuration to meet capacity/latency SLA
Ø Routine Review
o System utilization on I/O, storage, network, CPU, memory, etc.
o Cassandra metrics on GC, compaction, latency, throughput, etc.
o nodetool compactionstats, cfhistograms, tpstats, etc.
Ø Forecasting
o Historical comparison
o Traffic projection
o Flex up or flex down
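For initial sizing, the gap between data size and storage size comes from the factors listed above. The sketch below combines them multiplicatively, which is a common rule-of-thumb model rather than anything stated on the slide. All default values are assumptions, including the 50% free-space reserve for compaction.

```python
import math


def cluster_disk_tb(raw_data_tb, replication_factor=3, compression_ratio=0.5,
                    index_overhead=0.1, compaction_free_fraction=0.5):
    """Estimate provisioned disk for a cluster from its raw (uncompressed, unreplicated) data size.

    compression_ratio        : compressed size / uncompressed size
    index_overhead           : extra space for indexes, as a fraction of data size
    compaction_free_fraction : share of each disk kept free for compaction
    """
    on_disk = raw_data_tb * (1 + index_overhead) * compression_ratio * replication_factor
    return on_disk / (1 - compaction_free_fraction)


# Hypothetical example: 10 TB of raw data on nodes with 2 TB of disk each.
provisioned_tb = cluster_disk_tb(raw_data_tb=10)
nodes_for_storage = math.ceil(provisioned_tb / 2)
print(f"provision ~{provisioned_tb:.1f} TB, i.e. {nodes_for_storage} nodes at 2 TB each")
```

This gives only the storage-driven node count. The cost-effective configuration is the one that also meets the throughput and latency SLA, so the result should be compared against the benchmark-driven estimate.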
14. Scale Up vs. Scale Out
Ø Scale Up (vertical)
o Pros
• Smaller data center footprint, such as space, power, cooling
• Lower license cost
o Cons
• Likely costs more when using proprietary hardware
• Less fault tolerant
• Limited upgradability in the future
Ø Scale Out (horizontal)
o Pros
• Cheaper when using commodity hardware
• More fault tolerant
• (Unlimited) upgradability
o Cons
• Bigger data center footprint
• Higher license cost
• Likely needs more network equipment
15. Questions?
eBay is hiring experienced NoSQL professionals. Please send your resume to fengqu@ebay.com