Building an Activity Feed with Cassandra

Mark Dunphy, Software Engineer
Behance/Adobe
@dunphtastic

Disclaimer
Not an operations person.
Will pretend to be one for the purpose of this talk.

Quick Overview
What is the Behance Activity Feed?

• Actions
• Comments, Appreciations, Etc
• Entities
• Projects, Works in Progress
• Actors
• Users

Project Entity
Actions taken
by actors

User A publishes
a new project
Write to Follower A’s feed
Write to Follower B’s feed
Write to Follower C’s feed
Write to Follower D’s feed
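
The fan-out on write shown above can be sketched in a few lines of Python. This is a minimal illustration, not the Behance implementation: the `followers` map, the `feeds` store, and the `fan_out` function are hypothetical in-memory stand-ins.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follower graph and the feeds.
followers = {"user_a": ["follower_a", "follower_b", "follower_c", "follower_d"]}
feeds = defaultdict(list)


def fan_out(actor, verb, entity):
    """Write one activity into the feed of every follower of `actor`."""
    activity = {"actor": actor, "verb": verb, "entity": entity}
    targets = followers.get(actor, [])
    for follower in targets:
        feeds[follower].append(activity)
    # Number of feed writes this single action caused.
    return len(targets)


written = fan_out("user_a", "published", "project_1")
```

One action by User A becomes one write per follower, which is why write volume grows with the size of the follower graph.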

• Smaller user base (~340,000).
• Built very quickly. Worked well at the time.
• Not well researched.

• Frequent node failures
• Heavy disk fragmentation caused by deletes
• Slow reads from disk. Started storing in RAM.
• Primary -> Secondary failover caused downtime for
some users.
• Scaled out vertically and horizontally.

• Riak
• Very close. Community seemed lacking.
• Redis
• No native cluster. Too much maintenance.
• Memcached/MySQL
• Too much complex app logic.

• Fantastic community. #cassandra on Twitter
• Easy to read documentation
• Linearly scalable. Easy to grow cluster.
• Low maintenance overhead for ops team.
• Handles time series data very well.

• Cassandra Summit 2014
• Other team in Adobe
• Long nights reading documentation

• Ephemeral
• “Source of truth” lives in a MySQL database
• Okay with *some* data loss

• A user’s feed is made up of entities, each with
one set of actions
• A user’s feed contains any given entity only once
• An entity’s set of actions contains up to seven of
the most recent actions taken by that user’s
network
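
A minimal sketch of these feed rules, assuming a feed is modeled as a plain dict from entity id to a newest-first list of actions. `add_action` and `MAX_ACTIONS` are illustrative names, not taken from the talk.

```python
MAX_ACTIONS = 7  # from the slide: up to seven most recent actions per entity


def add_action(feed, entity_id, action):
    """Record `action` on `entity_id`, newest first, keeping the feed rules."""
    actions = feed.get(entity_id, [])
    # One dict key per entity means an entity can appear in the feed only once.
    feed[entity_id] = ([action] + actions)[:MAX_ACTIONS]
    return feed


feed = {}
for n in range(9):
    add_action(feed, entity_id=42, action={"actor_id": n, "verb_id": 1})
```

After nine actions on the same entity, the feed still holds a single entry for it, and the two oldest actions have been dropped.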

Language Support
• Most services on Behance are PHP
• No official DataStax PHP driver

–Mark Dunphy, 2014
“Looks like I’m learning python.”

Go to Production
No, nothing is working yet. I didn’t skip a slide.

• App/cluster in production before anything works
• Test real life load
• Fail spectacularly without anybody noticing
• Deploy risky changes without fear
• Run alongside MongoDB

Query Patterns
• “Create your data models based on the queries
you want to run” - Basically Everybody
• Wanted to…
• Read a user’s feed entities by type and time of
most recent action…separately.
• Write/Update a user’s feed entities with new
actions while knowing only user id and entity id

–Mark Dunphy, January 2015
“An UPDATE in Cassandra works like an
UPSERT! Let’s store the user’s entire feed in a
single row in a table! It’s so simple!”
First Data Model

CREATE TYPE activity.action (
created_on timestamp,
secondary_entity_id int,
actor_id int,
verb_id int
);
CREATE TYPE activity.entity (
entity_type_id int,
entity_id int
);

CREATE TABLE activity.project_actions (
modified_on timestamp,
entity_id int,
user_id int,
actions list<frozen<action>>,
PRIMARY KEY(user_id, entity_id)
);

CREATE TABLE activity.feeds (
modified_entities list<frozen<entity>>,
modified_on timestamp,
project_ids list<int>,
user_id int,
wip_revision_ids list<int>,
PRIMARY KEY(user_id)
);

First Data Model
Moments Before Everything Exploded

–Mark Dunphy, January 2015
“Okay let’s keep nearly the same model, but
use INSERT and DELETE instead of always
UPDATE. Just use batch statements.”
Second Data Model

Second Data Model
This was also a very very bad idea.

• Lose the benefit of Cassandra being distributed
• All queries go through the same coordinator
which puts a lot of stress and responsibility on
one node.
• Use concurrency and prepared statements
instead. DataStax drivers make this easy.
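
A hedged sketch of the alternative to batches: prepare the statement once, then keep a bounded window of asynchronous writes in flight. It relies only on `session.prepare()`, `session.execute_async()`, and the returned future's `result()`, which the DataStax Python driver provides. The `write_feed_rows` name, the `window` size, and the exact INSERT text are assumptions for illustration.

```python
# Assumed statement shape; the column list follows the projects table.
INSERT_CQL = (
    "INSERT INTO activity.projects (user_id, created_on, entity_id, actions) "
    "VALUES (?, ?, ?, ?)"
)


def write_feed_rows(session, rows, window=100):
    """Insert rows with one prepared statement, `window` requests in flight."""
    prepared = session.prepare(INSERT_CQL)  # parsed once, reused for every row
    written = 0
    for start in range(0, len(rows), window):
        # Each request is sent on its own, so no single coordinator has to
        # carry one large batch.
        futures = [
            session.execute_async(prepared, row)
            for row in rows[start:start + window]
        ]
        for future in futures:
            future.result()  # blocks until done; raises if that write failed
            written += 1
    return written
```

The driver also ships a helper, `cassandra.concurrent.execute_concurrent_with_args`, that manages the in-flight window for you; the loop above only spells out the idea.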
Second Data Model

CREATE TYPE activity.action (
secondary_entity_id int,
actor_id int,
verb_id int
);

CREATE TABLE activity.projects (
user_id int,
created_on timestamp,
entity_id int,
actions list<frozen<action>>,
PRIMARY KEY(user_id, created_on, entity_id)
);

Write Strategy
• “User A comments on Project A. User B follows
User A.”
• Request out to add the comment action to User
B’s feed
• Read existing actions for that entity (Project A) in
B’s feed. Push new action on top.
• Write new actions list into new “row” in projects
table
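
The write steps above, sketched against an in-memory stand-in for one user's partition of the projects table. `write_action` is a hypothetical helper, and removing the superseded row is an assumption made here so that the feed keeps only one row per entity.

```python
MAX_ACTIONS = 7  # cap from the feed rules stated earlier in the talk


def write_action(feed_rows, entity_id, action, now):
    """feed_rows maps (created_on, entity_id) -> newest-first action list."""
    # Read the existing actions for this entity in the follower's feed.
    old_key = next((key for key in feed_rows if key[1] == entity_id), None)
    existing = feed_rows.pop(old_key) if old_key is not None else []
    # Push the new action on top and write the list as a new "row" keyed by
    # the current time; the pop above removed the row it replaces.
    feed_rows[(now, entity_id)] = ([action] + existing)[:MAX_ACTIONS]
    return feed_rows


feed_b = {(100, "project_a"): [{"actor": "user_c", "verb": "appreciate"}]}
write_action(feed_b, "project_a", {"actor": "user_a", "verb": "comment"}, now=200)
```

Because every write produces a fresh row with a new `created_on`, reads can stay a simple time-ordered scan of the partition.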

Read Strategy
• SELECT * FROM projects WHERE user_id = 123 AND created_on > 123214373
• Optimized for quick/easy reads. More important
that a user’s feed loads quickly than it updating
quickly.
• Use timestamp to “page” through data.
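
A small sketch of timestamp paging, using a list of dicts in place of query results. `read_feed` and its `before` cursor are illustrative. The slide's query filters with `created_on >`; this sketch assumes a newest-first feed and therefore pages backwards with `<`.

```python
def read_feed(rows, page_size, before=None):
    """Return one page of rows, newest first, plus the cursor for the next page."""
    newest_first = sorted(rows, key=lambda r: r["created_on"], reverse=True)
    if before is not None:
        # Only rows strictly older than the last row of the previous page.
        newest_first = [r for r in newest_first if r["created_on"] < before]
    page = newest_first[:page_size]
    # A short page means the feed is exhausted, so no cursor is returned.
    cursor = page[-1]["created_on"] if len(page) == page_size else None
    return page, cursor


rows = [{"created_on": t} for t in (10, 20, 30, 40, 50)]
first_page, cursor = read_feed(rows, page_size=2)
second_page, cursor2 = read_feed(rows, page_size=2, before=cursor)
```

Note that a strict comparison skips rows sharing the cursor's exact timestamp; a real implementation would need a tie-breaker such as the entity id.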

Lessons Learned
• Duplicate your data to achieve desired queries.
Storage is cheap. Writes are cheap.
• Think outside the box. Cassandra is not
relational.
• Never ever ever ignore inserts/deletes in favor of
an update-only workflow. Never. It is literally
insane.

Final Specs
• 16 node cluster on AWS EC2 c3.8xlarge
• Mix of SizeTieredCompactionStrategy and
DateTieredCompactionStrategy
• NetworkTopologyStrategy
• Replication factor 3
• ConsistencyLevel = ONE for most requests

Final Specs
• Bursty write volume. Consistent read volume.
• 5k to 80k writes per second
• 2k to 4k reads per second

Questions?
I might have answers.

Thank you!
Mark Dunphy, Software Engineer
Behance/Adobe
@dunphtastic
