Cassandra Best Practices at eBay Inc
Feng Qu, principal database engineer, eBay Inc
September 11, 2014
CassandraSummit2014 | #CassandraSummit

Agenda
• eBay Inc Cassandra footprints
• NoSQL life cycle
• Cassandra best practices
• Q&A

eBay Inc

eBay Inc Database Platforms
• We manage thousands of databases powering eBay and PayPal

Why NoSQL?
• Challenges of traditional RDBMS
  • Performance penalty to maintain ACID features
  • Lack of native sharding and replication features
  • Lack of linear scalability
  • Cost of software/hardware
  • Higher cost of commit
• NoSQL used at eBay Inc
  • Cassandra, Couchbase, MongoDB managed by DBAs
  • HBase, Redis, OpenTSDB managed by developers

Cassandra @ eBay Inc
• Started in 2011 at eBay and later expanded to PayPal
• Started with Apache Cassandra 0.8, now using Apache Cassandra 2.0 and DataStax Enterprise 4.0
• Over a dozen production clusters on hundreds of servers across 3 data centers
• Choice between a dedicated cluster for large/critical use cases and a multi-tenant cluster for small use cases
• Over 20 billion daily reads/writes to Cassandra
• Cluster size varies from 4 nodes to 80 nodes
• 100TB+ user data on HDD, local SSD and SSD array
• One cluster is estimated to grow beyond a few PB

NoSQL Life Cycle
[Diagram of life cycle stages: Use Case Analysis, Data Modeling, Capacity Planning, Deployment, Operation]

Data Modeling Phase
• The development team requests a review meeting with a data architect for a new use case
• Once the data architect understands the requirements, they recommend a suitable data store: either an RDBMS or one of the NoSQL products we support
• Both parties work on data modeling together
• The output of the engagement is a set of tickets, kept for tracking purposes, that capture project information and the data configuration for the chosen data store

Data Modeling Best Practices
• Data modeling for Cassandra is quite different from data modeling for a traditional RDBMS
  • Model around the query pattern, not the entity
  • De-normalize to improve read performance
  • Separate read-heavy data from write-heavy data
  • Store values in column names, since names are already physically sorted
• Former eBay architect Jay Patel published a few technical blog posts on Cassandra data modeling
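
The "store values in column names" point can be illustrated without a cluster. The sketch below models a single wide row in plain Python and keeps column names sorted, the way Cassandra keeps them within a row, so a range query becomes one contiguous slice. The `WideRow` class and the event data are hypothetical and exist only for illustration; this is not Cassandra's API.

```python
import bisect


class WideRow:
    """Toy model of one wide row: column names are kept in sorted order."""

    def __init__(self):
        self._names = []   # sorted column names
        self._values = {}  # column name -> value

    def insert(self, name, value=b""):
        # Place a new name at its sorted position; overwrite existing values.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        """Return (name, value) pairs with start <= name < end, in order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_left(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


# Hypothetical query pattern: "events for one user within a date range".
# The date and event type are encoded in the column name itself, so the
# column value can stay empty.
row = WideRow()
row.insert((20140911, "bid"))
row.insert((20140909, "view"))
row.insert((20140910, "watch"))

september_10_onward = row.slice((20140910,), (20140912,))
```

Because the names are already ordered, the range lookup needs no separate index and no sort at read time, which is the property the bullet relies on.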

Data Modeling Best Practices - indexing
• Secondary index
  + Less overhead as it is built in
  + Data and index are changed atomically
  - Does not scale well with high-cardinality data
• Column family as index
  + No hot spot
  - Index is maintained manually by the application
  - Index change is not atomic
• Avoid secondary indexes and use a column family as the index if possible
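
The trade-off of a column family used as an index can be sketched with two in-memory dictionaries standing in for two column families. Everything here (`users`, `users_by_email`, `save_user`, `find_by_email`) is a hypothetical illustration, not a driver API. The point is that the application issues two separate writes, so the index can briefly disagree with the data, and reads should tolerate that.

```python
users = {}           # primary column family: user_id -> row
users_by_email = {}  # index column family: email -> set of user_ids


def save_user(user_id, email):
    """Write the row, then update the index; the two writes are separate."""
    old = users.get(user_id)
    users[user_id] = {"email": email}
    # Application-maintained index: drop the stale entry, add the new one.
    if old is not None and old["email"] != email:
        ids = users_by_email.get(old["email"], set())
        ids.discard(user_id)
        if not ids:
            users_by_email.pop(old["email"], None)
    users_by_email.setdefault(email, set()).add(user_id)


def find_by_email(email):
    """Look up through the index, then verify against the primary rows."""
    ids = users_by_email.get(email, set())
    # Verifying guards against index entries left behind by a failed update.
    return sorted(i for i in ids if users.get(i, {}).get("email") == email)


save_user("u1", "a@example.com")
save_user("u2", "a@example.com")
save_user("u1", "b@example.com")
```

After these three calls, `find_by_email("a@example.com")` returns only `u2`, because the application removed the stale entry for `u1` itself; nothing in the data store would have done so automatically.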

Benchmark Testing
• Benchmark testing is key to capacity planning
• Establish a performance baseline with near-real traffic in a production-size environment
  • for different types of hardware
  • for different software releases
  • for different use cases or workloads
• A proactive and repetitive process

Capacity Planning Phase
• Key to avoiding surprises in production
• The concept behind capacity planning is simple, but the mechanics are harder
• Business requirements may grow, so forecast how much resource must be added to the system to keep the user experience uninterrupted
• Input: a clearly defined capacity goal from business requirements and a performance baseline from benchmark tests
• Output: the resources to be added, such as memory, CPU, storage, I/O, network
• Always prepare for peak + headroom
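
The "peak + headroom" rule reduces to simple arithmetic once the inputs are known. The sketch below is one way to turn a capacity goal and a benchmark baseline into a node count; the function names, the 25% default headroom, and every number are assumptions chosen for illustration, not figures from the talk.

```python
import math


def nodes_for_throughput(peak_ops_per_sec, ops_per_node, headroom=0.25):
    """Nodes needed to serve peak traffic plus a safety margin."""
    target = peak_ops_per_sec * (1.0 + headroom)
    return math.ceil(target / ops_per_node)


def nodes_for_storage(data_tb, replication_factor, usable_tb_per_node):
    """Nodes needed to hold every replica of the data set."""
    return math.ceil(data_tb * replication_factor / usable_tb_per_node)


# Hypothetical inputs: the capacity goal comes from the business requirement,
# the per-node figures come from a benchmark baseline.
by_throughput = nodes_for_throughput(400_000, 10_000)  # 500,000 / 10,000
by_storage = nodes_for_storage(30, 3, 1.5)             # 90 TB / 1.5 TB

# Size the cluster for whichever resource runs out first.
cluster_size = max(by_throughput, by_storage)
```

In this made-up case storage, not throughput, determines the cluster size, which is why each resource is estimated separately before taking the maximum.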

Deployment Best Practices
• Software packages with customized optimization
  • kernel, JVM heap, compaction
• Deployment automation for efficiency
• Multi data center deployment for load balancing and disaster recovery
• Vnodes are a must for manageability
• SSD as default storage requires additional OS-level tuning

Operation Best Practices
• Collect system and database metrics
• Monitoring and alerting
  • Event-driven and metrics-driven alerts
• Operation runbook
  • Reduce human error
• Performance tuning runbook
  • nodetool tpstats for dropped requests
  • nodetool cfhistograms for latency distribution
• Troubleshooting runbook
  • Document previous incidents as future reference
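
A metrics-driven alert on dropped requests can be built by parsing the text that `nodetool tpstats` prints. The sketch below assumes the layout used around Cassandra 2.0, where a "Message type / Dropped" table follows the thread pool table; the layout varies between versions, and the sample text and its numbers are made up for illustration. The function name `dropped_messages` is hypothetical.

```python
def dropped_messages(tpstats_output):
    """Return {message_type: dropped_count} for every non-zero count."""
    dropped = {}
    in_dropped_section = False
    for line in tpstats_output.splitlines():
        fields = line.split()
        # The dropped-message table starts at the "Message type" header.
        if fields[:2] == ["Message", "type"]:
            in_dropped_section = True
            continue
        if in_dropped_section and len(fields) == 2 and fields[1].isdigit():
            count = int(fields[1])
            if count > 0:
                dropped[fields[0]] = count
    return dropped


# Illustrative sample only; real output has more pools and message types.
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         103429         0                 0
MutationStage                     0         0         558721         0                 0

Message type           Dropped
RANGE_SLICE                  0
READ                        12
MUTATION                   340
"""

alerts = dropped_messages(SAMPLE)
```

In practice the input would come from running the command on each node, and any non-empty result would feed the alerting system described above.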

Operation Best Practices (continued)
• Routine repair is not really needed if there are no deletes. You still need to run repair after bringing a node back up if it has been down for a while
• Use a CNAME in client configuration to avoid client configuration changes when hardware is replaced and gets a new IP/name
• Reduce gc_grace to reduce overall data size
• Disable row cache, unless you have <100K rows
• Collect statistics, real-time or historical, to monitor overall system performance
• Disable swap to avoid a slow node
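
How long "a while" is depends on two settings: the hinted handoff window (`max_hint_window_in_ms`, 3 hours by default) and `gc_grace_seconds` (10 days by default). The sketch below is one possible runbook rule built on those defaults, not a rule stated in the talk; the function name and the three action labels are hypothetical.

```python
HINT_WINDOW_SECONDS = 3 * 60 * 60     # default max_hint_window_in_ms (3 hours)
GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # default gc_grace_seconds (10 days)


def action_after_outage(down_seconds,
                        hint_window=HINT_WINDOW_SECONDS,
                        gc_grace=GC_GRACE_SECONDS):
    """Suggest what to do when a node comes back after being down."""
    if down_seconds <= hint_window:
        # Other nodes stored hints for the whole outage and will replay them.
        return "none"
    if down_seconds < gc_grace:
        # Hints stopped being stored, so the node missed writes: run repair.
        return "repair"
    # Tombstones may already be purged elsewhere; repairing could bring
    # deleted data back, so rebuilding the node is the safer choice.
    return "rebuild"


after_one_hour = action_after_outage(60 * 60)
after_one_day = action_after_outage(24 * 60 * 60)
after_two_weeks = action_after_outage(14 * 24 * 60 * 60)
```

The same rule shows the cost of reducing gc_grace: a smaller value saves space but also shortens the period during which a returning node can safely be repaired.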

Capacity Review
• Routine capacity review and adjustment
• When to scale up and when to scale out
  • In general, with NoSQL, scale out by adding nodes to increase capacity
  • Sometimes it's cost efficient to scale up at the component level by identifying the scaling bottleneck, then resolving it accordingly
    • Network bandwidth: upgrade to 10 Gbps network
    • I/O latency: upgrade to (better) SSD
    • Storage: add/expand data volume

Typical Use Cases
• Write intensive: metrics collection, logging
  • Collecting metrics from tens of thousands of devices periodically
• Read intensive: home page feeds
  • Recommendation backend to generate dynamic taste graph
• Mixed workload: personalization, classification
  • Data is loaded from the data warehouse periodically in bulk and from user events consistently
  • Data is retrieved in real time when a user visits the eBay site

Metrics Collection Application

The End
• We are hiring for NoSQL talent.
• Contact:
  • fengqu@ebay.com
  • www.linkedin.com/in/fengqu/
• Q&A

Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay
