Data driven testing: Case study with Apache Helix

Data Driven Testing for Distributed Systems

Case study with Apache Helix

Kishore Gopalakrishna, @kishoreg1980

http://www.linkedin.com/in/kgopalak

Outline

•  Intro to Helix
•  Use case: Distributed data store
•  Traditional approach
•  Data driven testing
•  Q & A

What is Helix

•  Generic cluster management framework
    –  Partition management
    –  Failure detection and handling
    –  Elasticity

Terminologies

Node: A single machine
Cluster: Set of nodes
Resource: A logical entity, e.g. database, index, task
Partition: Subset of the resource
Replica: Copy of a partition
State: Status of a partition replica, e.g. Master, Slave
Transition: Action that lets replicas change status, e.g. Slave -> Master

Core concept: Augmented finite state machine

State Machine
•  States: S1, S2, S3
•  Transitions: S1->S2, S2->S1, S2->S3, S3->S1

Constraints
•  States: S1: max=1, S2: min=2
•  Transitions: Concurrent(S1->S2) across cluster < 5

Objectives
•  Partition placement
•  Failure semantics

Helix usage at LinkedIn

Espresso

Use case: Distributed data store

•  Timeline consistent partitioned data store
•  One master replica per partition
•  Even distribution of master/slave
•  On failure: promote slave to master

[Figure: replicas of partitions P.1 to P.12 placed across Node 1, Node 2 and Node 3]

Helix based solution

State Machine
•  States: Offline, Slave, Master
•  Transitions: O->S, S->M, M->S

Constraints
•  States: M=1, S=2
•  Transitions: concurrent(O->S) < 5

Objectives
•  Partition placement
•  Failure semantics

[Figure: state diagram over O, S and M with transitions t1 to t4, annotated as follows]
S: COUNT=2, minimize(max_{n_j ∈ N} S(n_j))
M: COUNT=1, minimize(max_{n_j ∈ N} M(n_j))
t1 ≤ 5
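The state machine and count constraints on this slide can be written down as plain data and checked against a snapshot of one partition. The following is a minimal sketch, not the Helix API: the names `ALLOWED_TRANSITIONS`, `MAX_PER_STATE`, `is_legal_transition` and `count_violations` are hypothetical, and the slide's M=1, S=2 is treated here as an upper bound per partition.

```python
from collections import Counter

# Transitions listed on the slide (O = OFFLINE, S = SLAVE, M = MASTER).
ALLOWED_TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
}

# Per-partition state counts from the slide (M=1, S=2), assumed to be upper bounds.
MAX_PER_STATE = {"MASTER": 1, "SLAVE": 2}


def is_legal_transition(from_state, to_state):
    """Return True if the state machine allows from_state -> to_state."""
    return (from_state, to_state) in ALLOWED_TRANSITIONS


def count_violations(replica_states):
    """Return {state: count} for every state that exceeds its limit.

    replica_states maps a node name to the state of that node's replica
    of a single partition.
    """
    counts = Counter(replica_states.values())
    return {
        state: n
        for state, n in counts.items()
        if n > MAX_PER_STATE.get(state, float("inf"))
    }
```

For example, `count_violations({"node1": "MASTER", "node2": "MASTER", "node3": "SLAVE"})` returns `{"MASTER": 2}`, while a snapshot with one master and two slaves returns an empty dict.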

Testing

•  Happy path functionality
    –  Meet SLA
        •  99th percentile latency etc.
    –  Writes to master
•  Non happy path
    –  System failures
    –  Application failures
    –  How does the system behave in such scenarios?

Non happy path - Traditional approach

•  Identify scenarios of interest
    –  Node failure
    –  System upgrade
•  Tested each scenario in isolation via test case
    –  All tests passed :)
•  Deployed in alpha
    –  First software upgrade failed … but we tested it

What was missing

•  Failures don’t happen in isolation
•  Induction principle does not work
    –  If something works once, that does not mean it will always work
•  Lack of tools to debug issues
    –  Could not identify the cause from one log file
•  Poor coverage
    –  Impossible to think of all possible test cases

What we learnt

•  Test with all components integrated
•  Simulate real production environment
    –  Generate load
    –  Random failures of multiple components
•  Better debugging tools
    –  Need to correlate messages from multiple logs
    –  A failure is a symptom; the actual reason is in past logs of a different machine.

Data driven testing

•  Instrument – Zookeeper, controller, participant logs
•  Simulate – Chaos monkey
•  Analyze – Invariants are
    •  Respect state transition constraints
    •  Respect state count constraints
    •  And so on
•  Debugging made easy
    •  Reproduce exact sequence of events

Chaos monkey

•  Select random component(s) to fail
•  How should it fail
    –  Hard/soft failure
    –  Network partition
    –  Garbage collection
    –  Process freeze

Automation of chaos monkey

•  Helix agent on each node
•  Modify the behavior of each service using Helix
    –  Component 1
        •  Node1: RUNNING
        •  Node2: STOPPED
        •  Node3: KILLED
    –  Component 2
        •  Node1: STOPPED

[Figure: STATE MACHINE with states STOPPED, RUNNING, FREEZED and KILLED and transitions START, STOP, PAUSE, UNPAUSE and KILL]

Pseudo test case

setup cluster
generate load
do
    (c,t) = components to fail and type of failure
    simulate failure
    verify system_is_stable
    restart failed components
while(verify system_is_stable)

Test case failed & here is the sequence of events
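The loop above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the test harness used in the talk: `run_chaos_test` and its callbacks `inject`, `restart` and `is_stable` are hypothetical names standing in for whatever drives the real cluster, and `max_rounds` is added so the sketch terminates when the system stays stable.

```python
import random


def run_chaos_test(components, failure_types, inject, restart, is_stable,
                   max_rounds=100, seed=None):
    """Repeatedly fail random components, verify, restart, verify again.

    Returns (passed, events). events is the ordered list of actions taken,
    so a failing run comes with the exact sequence that led to it.
    """
    rng = random.Random(seed)
    events = []
    for _ in range(max_rounds):
        # (c, t) = components to fail and type of failure
        victims = rng.sample(components, rng.randint(1, len(components)))
        failure = rng.choice(failure_types)
        for component in victims:
            inject(component, failure)
            events.append((component, failure))
        if not is_stable():
            return False, events
        for component in victims:
            restart(component)
            events.append((component, "RESTART"))
        if not is_stable():
            return False, events
    return True, events
```

Passing a fixed `seed` makes the random choices repeatable, which is one simple way to rerun the same failure sequence after a failed round.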

Cluster verification

•  Verify all constraints are satisfied
    –  Is there a master for every partition?
    –  Is the slave replicating?
    –  Node/component down should not matter
    –  Validate every action, not just the end result
        •  Having a master is not good enough if two nodes became master and one of them later died.

Log analysis

•  Log important events
    –  Becoming master from slave for this partition at this time
•  Tools to collect, merge & analyze logs
    –  Parsed Zookeeper transaction logs
    –  Gathered Helix controller, participant logs
    –  Sorted on time.
•  Helix provides these tools out of the box

Structured Log File – sample

timestamp      partition   instanceName       sessionId                            state
1323312236368  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  OFFLINE
1323312236561  TestDB_123  express1-md_16918  ef172fe9-09ca-4d77b05e-15a414478ccc  SLAVE
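Lines in this five-column format are easy to parse and merge across sources. The sketch below is an assumption-laden illustration rather than the tooling Helix ships: `StateEvent`, `parse_line`, `is_data_line` and `merge_logs` are hypothetical names, and each input log is assumed to be already sorted by timestamp.

```python
import heapq
from typing import NamedTuple


class StateEvent(NamedTuple):
    timestamp: int
    partition: str
    instance: str
    session_id: str
    state: str


def is_data_line(line):
    """True for a five-field line whose first field is a numeric timestamp."""
    fields = line.split()
    return len(fields) == 5 and fields[0].isdigit()


def parse_line(line):
    """Turn one whitespace-separated log line into a StateEvent."""
    ts, partition, instance, session_id, state = line.split()
    return StateEvent(int(ts), partition, instance, session_id, state)


def merge_logs(*logs):
    """Merge several logs (each sorted by time) into one time-ordered list.

    Header and blank lines are skipped.
    """
    parsed = [[parse_line(l) for l in log if is_data_line(l)] for log in logs]
    return list(heapq.merge(*parsed, key=lambda e: e.timestamp))
```

Merging the per-machine logs into one stream is what lets a failure on one node be traced back to an earlier event on another.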









Benefits

•  Test case stops as soon as system is unstable
    –  The cluster is available for debugging
•  Provides exact sequence of events
    –  Makes it easy to debug and reproduce
    –  Best part: We auto generated test case.

Reproduce the issue

Start state
•  Helix brings the system to start state.

{
  "id" : "MyDataStore",
  "simpleFields" : {
    "IDEAL_STATE_MODE" : "CUSTOM",
    "NUM_PARTITIONS" : "2",
    "REPLICAS" : "3",
    "STATE_MODEL_DEF_REF" : "MasterSlave"
  },
  "mapFields" : {
    "MyDataStore_0" : {
      "node1" : "MASTER",
      "node2" : "OFFLINE",
      "node3" : "SLAVE"
    },
    "MyDataStore_1" : {
      "node1" : "SLAVE",
      "node2" : "OFFLINE",
      "node3" : "MASTER"
    }
  }
}

Orchestrate the sequence
•  Use Helix messaging API to replay the events

1. Node1:MyDataStore_0: Master-Slave
2. Node1:HARD KILL
3. Node2:START

Constraint violation

No more than R=2 slaves

Time   State    Number of Slaves  Instance
42632  OFFLINE  0                 10.117.58.247_12918
42796  SLAVE    1                 10.117.58.247_12918
43124  OFFLINE  1                 10.202.187.155_12918
43131  OFFLINE  1                 10.220.225.153_12918
43275  SLAVE    2                 10.220.225.153_12918
43323  SLAVE    3                 10.202.187.155_12918
85795  MASTER   2                 10.220.225.153_12918
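The slave-count column can be derived by replaying the state events in time order and counting, after each event, how many instances are currently in the SLAVE state. A minimal sketch, using the rows of the table as input; the function names `slave_counts` and `violations` are hypothetical.

```python
# (time, state, instance) rows taken from the table above.
EVENTS = [
    (42632, "OFFLINE", "10.117.58.247_12918"),
    (42796, "SLAVE", "10.117.58.247_12918"),
    (43124, "OFFLINE", "10.202.187.155_12918"),
    (43131, "OFFLINE", "10.220.225.153_12918"),
    (43275, "SLAVE", "10.220.225.153_12918"),
    (43323, "SLAVE", "10.202.187.155_12918"),
    (85795, "MASTER", "10.220.225.153_12918"),
]


def slave_counts(events):
    """Replay events in order; return (time, instance, slaves_after_event)."""
    current = {}  # instance -> latest known state
    rows = []
    for time, state, instance in events:
        current[instance] = state
        slaves = sum(1 for s in current.values() if s == "SLAVE")
        rows.append((time, instance, slaves))
    return rows


def violations(events, max_slaves=2):
    """Return the rows where the slave count exceeds max_slaves."""
    return [row for row in slave_counts(events) if row[2] > max_slaves]
```

On these rows, `violations(EVENTS)` returns `[(43323, "10.202.187.155_12918", 3)]`: the constraint R=2 is broken at time 43323 and holds again once one slave becomes master at 85795.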

How long was it out of whack?

Number of Slaves  Time       Percentage
0                 1082319    0.5
1                 35578388   16.46
2                 179417802  82.99
3                 118863     0.05

83% of the time, there were 2 slaves to a partition

93% of the time, there was 1 master to a partition

Number of Masters  Time       Percentage
0                  15490456   7.164960359
1                  200706916  92.83503964
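The percentages are time-weighted: each replica count is weighted by how long the partition stayed at that count. A minimal sketch of that arithmetic, assuming a time-ordered list of (time, count) samples in which each count holds until the next sample; `time_per_value` and `percentages` are hypothetical names.

```python
from collections import defaultdict


def time_per_value(samples, end_time):
    """Sum the time spent at each value.

    samples is a time-ordered list of (time, value); the last value is
    held until end_time.
    """
    durations = defaultdict(int)
    following = samples[1:] + [(end_time, None)]
    for (start, value), (stop, _) in zip(samples, following):
        durations[value] += stop - start
    return dict(durations)


def percentages(durations):
    """Convert {value: duration} into {value: percent of total time}."""
    total = sum(durations.values())
    return {v: round(100.0 * d / total, 2) for v, d in durations.items()}
```

Applied to the durations in the slaves table, `percentages({0: 1082319, 1: 35578388, 2: 179417802, 3: 118863})` gives `{0: 0.5, 1: 16.46, 2: 82.99, 3: 0.05}`, matching the Percentage column.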

Invariant 2: State Transitions

FROM     TO       COUNT
MASTER   SLAVE    55
OFFLINE  DROPPED  0
OFFLINE  SLAVE    298
SLAVE    MASTER   155
SLAVE    OFFLINE  0
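A table like this can be produced by walking the merged log and counting each change of state per partition replica. The following sketch assumes records of the form (timestamp, partition, instance, state); `count_transitions` is a hypothetical name, not a Helix tool.

```python
from collections import Counter


def count_transitions(records):
    """Count (from_state, to_state) pairs per partition replica.

    records is an iterable of (timestamp, partition, instance, state)
    tuples in any order; they are sorted by timestamp before replay.
    """
    last_state = {}  # (partition, instance) -> most recent state
    counts = Counter()
    for _, partition, instance, state in sorted(records):
        key = (partition, instance)
        previous = last_state.get(key)
        if previous is not None and previous != state:
            counts[(previous, state)] += 1
        last_state[key] = state
    return counts
```

Any key in the result that is not one of the FROM/TO pairs the state machine allows, for example `("OFFLINE", "MASTER")`, points to a violated transition constraint.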

Fun facts

•  For almost a month, the test failed to run successfully for even one night
•  Most issues were found using one test case
•  Reproduced almost all failures

Conclusion

•  Traditional approach is not good enough
•  Data driven testing is the way to go
    –  Focus on workload and analysis
    –  Production system always in test mode
    –  Leverage tools built for testing to debug production issues

website: helix.incubator.apache.org
users: user@helix.incubator.apache.org
dev: dev@helix.incubator.apache.org
twitter: @apachehelix
