Project Gemini
Roy Dahan, QA manager
Presenter
Roy Dahan, QA manager
Testing and managing the Scylla QA group for the last 3 years.
Managing testing teams in the field of data and storage for the last
10 years.
Usually delivers the bad news during the release process.
What is project Gemini?
Project Gemini
Testing tool designed to detect data integrity issues like data loss and
data corruption.
Gemini accomplishes this by applying random testing to a system under
test and validating the results against a test oracle.
Started by Pekka Enberg, Larisa Ustalov, Henrik Johansson, and Alex
Bykov in 2017.
Implemented in the Go programming language.
The Need
Data integrity issues are rare and hard to find and debug.
■ Existing tools focused on availability, stress, load & performance.
■ Limited to certain types of schemas with specific field types.
■ Fragile to schema changes during testing.
■ Hard to debug or reproduce once an issue is detected.
How Does Gemini Work?
[Diagram: the System Under Test next to the Test Oracle, which runs Scylla OR Cassandra]
1. Generate a schema to be used during the test.
2. Generate random CQL operations on both clusters at the same time.
3. Query both clusters and compare the results of each query.
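The three steps above can be sketched as a small differential test. This is only an illustration, not Gemini's code: `InMemoryStore`, `run_test`, and the `drop_every` fault injection are hypothetical names, and two in-memory dictionaries stand in for the real clusters.

```python
import random


class InMemoryStore:
    """Hypothetical stand-in for a cluster: maps (pk, ck) -> value."""

    def __init__(self, drop_every=0):
        self.rows = {}
        self.writes = 0
        self.drop_every = drop_every  # >0 simulates a bug that loses every Nth write

    def write(self, pk, ck, value):
        self.writes += 1
        if self.drop_every and self.writes % self.drop_every == 0:
            return  # injected data loss
        self.rows[(pk, ck)] = value

    def delete(self, pk, ck):
        self.rows.pop((pk, ck), None)

    def read_partition(self, pk):
        return sorted((k, v) for k, v in self.rows.items() if k[0] == pk)


def run_test(sut, oracle, seed, steps=1000):
    """Apply the same random operations to both stores.

    Returns a description of the first mismatch, or None if none is found.
    """
    rng = random.Random(seed)  # same seed -> same sequence of operations
    for step in range(steps):
        pk, ck = rng.randrange(8), rng.randrange(16)
        op = rng.choice(["write", "delete", "read"])
        if op == "write":
            value = rng.randrange(1_000_000)
            sut.write(pk, ck, value)
            oracle.write(pk, ck, value)
        elif op == "delete":
            sut.delete(pk, ck)
            oracle.delete(pk, ck)
        else:
            got, expected = sut.read_partition(pk), oracle.read_partition(pk)
            if got != expected:
                return {"step": step, "pk": pk, "test": got, "oracle": expected}
    return None
```

Running `run_test(InMemoryStore(drop_every=7), InMemoryStore(), seed=25)` twice reports the same mismatch both times, which is the property that makes a seeded random test reproducible.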
How Does Gemini Work?
■ Schema generation is random (supports a “seed” for repeating a test).
■ Generates random values for every column in every table according to the schema.
■ Many threads run in parallel, each responsible for a specific partition key range.
■ Each thread generates either a write operation or a read operation.
■ Write operations are fairly simple - INSERT / UPDATE / DELETE
■ Read operations, which are used to validate the data, are more complex.
For example:
● SELECT a, b FROM tab WHERE pk = ? AND ck = ?
● SELECT a, b FROM tab WHERE pk = ? AND ck > ? LIMIT ?
● SELECT b FROM tab WHERE token(pk) >= ? LIMIT ?
● SELECT b FROM tab WHERE token(pk) >= ? AND c = ? LIMIT ? ALLOW FILTERING
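A minimal sketch of two ideas from this slide: a schema derived entirely from a seed, and partition keys split into one contiguous range per worker. The function names, the type lists, and the table name `table1` are assumptions made for illustration; the slides do not show how Gemini implements either part.

```python
import random

# Illustrative subsets of CQL types; not Gemini's actual lists.
KEY_TYPES = ["int", "bigint", "text", "timestamp"]
VALUE_TYPES = KEY_TYPES + ["decimal", "boolean"]


def generate_schema(seed, max_pk=4, max_ck=2, max_cols=5):
    """Build a random table description that depends only on the seed."""
    rng = random.Random(seed)

    def cols(prefix, count, types):
        return [(f"{prefix}{i}", rng.choice(types)) for i in range(count)]

    return {
        "table": "table1",
        "partition_keys": cols("pk", rng.randint(1, max_pk), KEY_TYPES),
        "clustering_keys": cols("ck", rng.randint(1, max_ck), KEY_TYPES),
        "columns": cols("col", rng.randint(1, max_cols), VALUE_TYPES),
    }


def create_table_cql(schema):
    """Render the schema description as a CREATE TABLE statement."""
    all_cols = schema["partition_keys"] + schema["clustering_keys"] + schema["columns"]
    col_defs = ", ".join(f"{name} {ctype}" for name, ctype in all_cols)
    pk = ", ".join(name for name, _ in schema["partition_keys"])
    ck = ", ".join(name for name, _ in schema["clustering_keys"])
    return f"CREATE TABLE {schema['table']} ({col_defs}, PRIMARY KEY (({pk}), {ck}))"


def split_partition_ranges(total_partitions, workers):
    """Give each worker a contiguous, non-overlapping range of partition indexes."""
    base, extra = divmod(total_partitions, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

Because each worker owns its own range, no two workers write to the same partition, so a read by one worker is not racing against a write from another.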
Usage Example
gemini -d --duration 10800s --warmup 1800s -c 100 -m mixed -f \
  --non-interactive --cql-features normal --test-cluster=10.0.180.52 \
  --outfile /tmp/gemini-l0-c0d89088-f15f-436b-acf2-73fbab0b7f55.log \
  --seed 25 --oracle-cluster=10.0.60.205
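A test harness would typically assemble this command from its own settings. The sketch below does that using only the flags that appear in the command above; `gemini_command` is a hypothetical helper, the slides do not explain what each short flag means, and the values of `-c`, `-m`, and `--cql-features` are simply copied from the example.

```python
import shlex


def gemini_command(test_cluster, oracle_cluster, seed, outfile,
                   duration="10800s", warmup="1800s"):
    """Return the argv list for a Gemini run, mirroring the example on the slide."""
    return [
        "gemini", "-d",
        "--duration", duration,
        "--warmup", warmup,
        "-c", "100", "-m", "mixed", "-f",
        "--non-interactive",
        "--cql-features", "normal",
        f"--test-cluster={test_cluster}",
        "--outfile", outfile,
        "--seed", str(seed),
        f"--oracle-cluster={oracle_cluster}",
    ]


# The outfile path here is a made-up example value.
cmd = gemini_command("10.0.180.52", "10.0.60.205", seed=25, outfile="/tmp/gemini.log")
print(shlex.join(cmd))  # pass `cmd` to subprocess.run to actually launch it
```

Recording the seed next to the log file is what allows a failed run to be repeated with the same generated schema.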
Usage by QA
Integrated with Scylla-Cluster-Tests (aka SCT)
- Deployment of clusters (SUT & test Oracle)
- Deployment of a client running Gemini.
- Triggering “Nemesis” on the SUT.
- Searching the nodes for errors, coredumps, stalls, etc.
- Analyzing the Gemini final output.
- Sending a full report.
In case Gemini detects any difference between the SUT & the Test Oracle,
it stops and leaves both systems for further investigation.
Sample of Test Report
Nemesis Test Result Details
Status        FAILED
read_ops      117
write_errors  0
errors        Validation failed: row count differ (test has 2100 rows, oracle has 2201 rows, test is missing rows: [
  pk0=76, pk1=419622209, pk2=31, pk3=922835259, ck0=6644405980324451.754, ck1=1998-01-15 18:02:25 +0000 UTC
  pk0=111, pk1=632366503, pk2=-48, pk3=1222579647, ck0=3345274792944728.080, ck1=2018-02-08 01:22:14 +0000 UTC
  pk0=108, pk1=1643207977, pk2=-30, pk3=1878114379, ck0=3722122018497478.686, ck1=1974-07-21 15:32:05 +0000 UTC
  pk0=90, pk1=278797784, pk2=-69, pk3=1755546, ck0=7809203802197026.969, ck1=2021-03-10 07:40:04 +0000 UTC
  pk0=-49, pk1=1911330670, pk2=110, pk3=640637430, ck0=4592421251461013.628, ck1=1995-11-22 21:00:36 +0000 UTC
  pk0=-118, pk1=554193011, pk2=-38, pk3=292494436, ck0=3338362084821289.559, ck1=1970-02-01 19:24:15 +0000 UTC
  pk0=-26, pk1=1180095760, pk2=115, pk3=1114905090, ck0=4413492767842183.832, ck1=2017-05-11 09:33:37 +0000 UTC
  pk0=-28, pk1=449180670, pk2=-120, pk3=1733204278, ck0=7825973161922347.653, ck1=2023-03-03 17:34:43 +0000 UTC
  pk0=109, pk1=1404802328, pk2=116, pk3=1207752519, ck0=4901967462462222.815, ck1=1992-10-03 05:38:14 +0000 UTC
  pk0=100, pk1=351103930, pk2=20, pk3=956746865, ck0=1678423284211121.674, ck1=2021-10-12 07:13:52 +0000 UTC
  pk0=-47, pk1=658485119, pk2=19, pk3=968667022, ck0=3345274792944728.080, ck1=2018-02-08 01:22:14 +0000 UTC
  pk0=71, pk1=1718518478, pk2=-57, pk3=720416914, ck0=3338362084821289.559, ck1=1970-02-01
Failed Test Example
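A report like the one above comes down to comparing the rows returned by the two clusters for the same query. The following is a hedged sketch of such a comparison, not Gemini's actual validation code: `compare_results` and `format_failure` are hypothetical names, rows are modelled as plain tuples, and the message only loosely imitates the wording of the sample.

```python
def compare_results(test_rows, oracle_rows):
    """Summarise how the test cluster's result set differs from the oracle's."""
    test_set, oracle_set = set(test_rows), set(oracle_rows)
    return {
        "test_count": len(test_rows),
        "oracle_count": len(oracle_rows),
        "missing_in_test": sorted(oracle_set - test_set),
        "unexpected_in_test": sorted(test_set - oracle_set),
    }


def format_failure(diff):
    """Return None when the result sets match, otherwise a readable message."""
    if not diff["missing_in_test"] and not diff["unexpected_in_test"]:
        return None
    return (
        f"Validation failed: test has {diff['test_count']} rows, "
        f"oracle has {diff['oracle_count']} rows, "
        f"test is missing rows: {diff['missing_in_test']}, "
        f"test has unexpected rows: {diff['unexpected_in_test']}"
    )


# Rows are (pk0, ck0) pairs with made-up values.
oracle = [(76, 1), (111, 2), (108, 3)]
test = [(76, 1), (108, 3)]
print(format_failure(compare_results(test, oracle)))
```

Listing the full key of every missing row, as the real report does, gives the investigator concrete partitions to query on both clusters once the test has stopped.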
Sample of Test Monitor
Thank you
Stay in touch
Any questions?
Roy Dahan
roy@scylladb.com

Project Gemini - a fuzzing tool used by Scylla to guarantee that data, once written, is always safe and sound


Editor's Notes

  • #6 The need came from the fact that we had good tooling to test availability, stress, load, and performance, but we lacked a good tool to detect data integrity issues. The tools we were using were limited to certain types of schemas, fragile to schema changes, and hard, sometimes very hard, to debug. We wanted a tool that detects those rare issues and lets us reproduce an issue once it is detected.
  • #7 So, how does it work? We have two systems running in parallel. One is the system under test, running the version we would like to test. The other is the “test oracle”, running a “safe” version of Scylla or Cassandra. “Safe” means a sufficiently tested release in which complex components such as cache and compactions are stripped to the minimum. The “test oracle” is the “source of truth” for the test, holding the “expected result”.
  • #11 How we use it in QA. It is integrated with our main repository for Scylla testing, called Scylla Cluster Tests, or SCT for short. If you want to learn more about SCT and how we test Scylla, I invite you to watch a video of my session from Scylla Summit 2017. So, SCT is responsible for: Deployment of the systems, which includes both the system under test and the Test Oracle. Deployment of a client that runs the Gemini tool. Then, during the test, at a constant interval, it triggers what we call “Nemesis”. A Nemesis can be either a disruptive operation, like crashing a node, or a non-disruptive operation, like an administrative command. During the entire test, which can last from several hours to several days, SCT searches for any possible issue that may happen on the SUT. This includes errors, coredumps, stalls, etc. Finally, when the Gemini load is complete, it analyzes the output and sends a full report for the test. In case Gemini detects any difference between the SUT and the Test Oracle, it stops and leaves both systems for further investigation.
  • #15 A monitor snippet from a Gemini run. We can see the total request throughput