3. Early Experiences
The persistent sessions (aka PESSIONs)
Replaced Memcached + MySQL
Recommendations
Replaced MongoDB
ZMON (https://demo.zmon.io)
KairosDB on top of Cassandra
All those are still alive and kicking
4. Mistakes
Obviously, we made "some" mistakes:
Bad cluster planning
Poor choices on compaction
Wrong data models
etc...
But we learned a lot from them!
5. FU#01 - Over-sized nodes
We started with a relatively small number of super-sized nodes. Each node had lots of
CPU and lots of storage.
The first cluster had 10 nodes, evenly distributed across 2 data centers
Recovering or repairing a node took a very long time
Losing a single node had a noticeable performance impact on the rest of the cluster
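A rough back-of-the-envelope calculation shows why dense nodes hurt recovery (the streaming throughput and node sizes below are illustrative assumptions, not figures from our clusters):

```python
# Back-of-the-envelope estimate of node recovery time: replacing a dense
# node means re-streaming all of its data from the surviving replicas.
# The numbers are hypothetical, for illustration only.
def recovery_hours(data_tb: float, stream_mb_s: float = 200.0) -> float:
    """Hours to stream `data_tb` terabytes at `stream_mb_s` MB/s."""
    data_mb = data_tb * 1_000_000  # TB -> MB (decimal units)
    return data_mb / stream_mb_s / 3600

# A 10 TB node at 200 MB/s needs ~14 hours of uninterrupted streaming;
# a 2 TB node would be back in under 3.
big = recovery_hours(10)
small = recovery_hours(2)
```

More, smaller nodes also mean each node loss removes a smaller share of the cluster's capacity.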
6. FU#02 - Bad storage planning (1/2)
We did estimations on required storage and provisioned each node accordingly.
We used RF=3 and thought we would be totally safe up to 80% usage
7. Bad storage planning (2/2)
Depending on the compaction strategy and number of SSTables, you may need up to
double the amount of disk space for a full compaction.
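The headroom problem can be made concrete with a small calculation (the 2x overhead is the worst case mentioned above; the helper itself is just an illustration):

```python
# Why 80% disk usage is not safe: a full compaction may need to rewrite
# the SSTables it is merging, temporarily requiring up to the same
# amount of disk space again.
def max_safe_usage(compaction_overhead: float = 2.0) -> float:
    """Fraction of the disk that can hold live data if compaction may
    temporarily need `compaction_overhead` times that data on disk."""
    return 1.0 / compaction_overhead

# With the worst-case 2x overhead, only ~50% of the disk should hold
# live data; at 80% usage a major compaction can fill the disk.
```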
8. FU#03 - Wrong compaction strategies
The default SizeTieredCompactionStrategy seemed fine for all cases
But for data that is continuously expiring, it's NOT a good fit.
LeveledCompactionStrategy is more appropriate (http://www.datastax.com/dev/blog/when-to-use-leveled-compaction).
DateTieredCompactionStrategy is tricky (CASSANDRA-9666, https://issues.apache.org/jira/browse/CASSANDRA-9666).
The original author created an alternative TimeWindowCompactionStrategy
(https://github.com/jeffjirsa/twcs/)
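For illustration, switching a table of continuously expiring data to TWCS would look something like this (the keyspace, table name, and window settings are made-up examples, not recommendations):

```sql
-- Hypothetical table of expiring time-series data; the daily window
-- below is only an example, tune it to your TTL and query patterns.
ALTER TABLE metrics.datapoints
WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': 1
};
```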
9. FU#04 - Cassandra configuration
We tried to be smart and tuned the amount of concurrent_reads and concurrent_writes.
The defaults were carefully chosen and they will do the Right Thing™.
If, and when, you know each and every corner of Cassandra, you can venture into tuning them.
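For reference, the defaults we should have left alone look like this in cassandra.yaml (values from the Cassandra versions of that era; check your own version's file):

```yaml
# cassandra.yaml -- the shipped defaults.
# concurrent_reads is sized for the disks (reads may block on I/O);
# concurrent_writes is sized for the CPUs (writes are rarely I/O bound).
concurrent_reads: 32
concurrent_writes: 32
```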
11. RDBMS
In the relational world it's common practice:
think about the data
model it
build the application
1st: Data → 2nd: Normalization → 3rd: Build Application
This works fine because the data can always be joined and aggregated however the application needs.
12. Cassandra
With Cassandra it's the other way around. You MUST start with 'How will I access the
data'?
1st: Identify your queries → 2nd: De-normalization → 3rd: Data
Knowing your queries in advance is NOT optional
This is different from an RDBMS because you can't just JOIN or create new indexes to
support new queries
14. An example
Let's consider an application where videos are published and users can comment on
them.
create table video (
id uuid,
description text,
tags set<text>,
primary key(id)
);
create table user (
id text,
password text,
first_name text,
last_name text,
primary key (id)
);
This looks like a reasonable data model for videos and users
15. Commenting videos
create table comment (
video_id uuid,
user_id text,
comment_date timestamp,
content text,
primary key (video_id, user_id, comment_date)
);
How do we get the comments for a given video?
select * from comment where video_id = 7ede2c5e-8814-4516-a20d-bf01d4da381c;
What about for a given user?
select * from comment where user_id = 'lmineiro';
16. You wish!
You should get the infamous error:
Cannot execute this query as it might involve data filtering and thus may have unpredictable
performance. If you want to execute this query despite the performance unpredictability, use
ALLOW FILTERING
17. Getting comments for a given user
Let's try the suggestion
select * from comment where user_id = 'lmineiro' allow filtering;
video_id | user_id | comment_date | content
--------------------------------------+----------+--------------------------+--------------------
7ede2c5e-8814-4516-a20d-bf01d4da381c | lmineiro | 2016-03-02 13:18:05+0000 | Some dummy comment
It seems to work, but this query has to contact all the nodes and filter their data, so it won't be efficient.
We could still add an index:
create index comment_user_id on comment(user_id);
It would make the previous query somewhat cheaper, but a secondary index on a column like user_id still has to be consulted on every node, so it's far from ideal.
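The comment_by_user and comment_by_video query tables used below are not defined on these slides; following the query-first approach, their definitions would look roughly like this (a sketch — the exact schema is an assumption):

```sql
-- One table per query: the same comment is written to both,
-- partitioned by whichever key the query filters on.
create table comment_by_user (
    user_id text,
    video_id uuid,
    comment_date timestamp,
    content text,
    primary key (user_id, video_id, comment_date)
);

create table comment_by_video (
    video_id uuid,
    user_id text,
    comment_date timestamp,
    content text,
    primary key (video_id, user_id, comment_date)
);
```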
20. Think again
We DON'T have transactions. At least not as we're used to.
We can batch (https://docs.datastax.com/en/developer/java-driver/2.1/java-driver/reference/batch-statements.html) the insert statements,
though.
begin batch using timestamp 123456789
    insert into comment_by_user (user_id, video_id, comment_date, content)
    values ('lmineiro', 7ede2c5e-8814-4516-a20d-bf01d4da381c,
            dateof(now()), 'Dummy comment');
    insert into comment_by_video (video_id, user_id, comment_date, content)
    values (7ede2c5e-8814-4516-a20d-bf01d4da381c, 'lmineiro',
            dateof(now()), 'Dummy comment');
apply batch;
21. Finally
Let's try to repeat the query for comments from a given user:
select * from comment_by_user where user_id='lmineiro';
user_id | video_id | comment_date | content
----------+--------------------------------------+--------------------------+----------------
lmineiro | 7ede2c5e-8814-4516-a20d-bf01d4da381c | 2016-03-02 13:43:49+0000 | Dummy comment
No error anymore. If we need to query the comments for a given video:
select * from comment_by_video where video_id = 7ede2c5e-8814-4516-a20d-bf01d4da381c;
video_id | user_id | comment_date | content
--------------------------------------+----------+--------------------------+----------------
7ede2c5e-8814-4516-a20d-bf01d4da381c | lmineiro | 2016-03-02 13:43:49+0000 | Dummy comment
22. Future
We continued to invest in Cassandra and we have a lot more teams and applications
using it.
Some of them, in no particular order:
Cart and Checkout
IAM PlanB (JSON Web Token Provider)
The Platform
Many others...