A description of Fashiolista's accidental growth and the history of our feed systems.
It explains what we learned about Cassandra and gives you an introduction to our open source module Feedly.
Some links:
https://github.com/tschellenbach/Feedly
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
CQLEngine fork using Python-Driver
https://github.com/tbarbugli/cqlengine
16. Feed History
1. PostgreSQL
2. Redis – Feedly 0.1
3. Cassandra – Feedly 0.9
More details in this highscalability post:
http://bit.ly/hsfeedly
17. PostgreSQL - Pull
1. Smooth till we reached ~100M activities
2. Spikes in performance due to the query planner
18. Redis - Push
1. Fast, Easy to setup and maintain
2. Becomes expensive really quickly
115K Followers
19. Cassandra - Feedly 0.9
1. Few moving components
2. Supported by Datastax
3. Used by Instagram
4. Easy to add capacity
5. Cost effective
20. We open sourced Feedly!
• Github/tschellenbach/Feedly
• Python library which allows you to build newsfeed and notification systems using Cassandra and/or Redis
21. Feedly – What can you build?
Newsfeeds
Notification systems
22. Cassandra Challenges
1. Which Python library to choose?
• Pycassa
• CQLEngine (using the old CQL module)
• Python-Driver (beta)
• Fork CQLEngine to support Python-Driver
– Github/tbarbugli/cqlengine
23. Cassandra Challenges
2. Importing data
(300M loves * 1000 followers = 300 billion activities)
• High CPU load
• Nodes going down
• Start with many nodes, scale down afterwards
25. Cassandra Challenges
4. Data model denormalization
CREATE TABLE fashiolista_feedly.timeline_flat (
    feed_id ascii,
    activity_id varint,
    actor int,
    extra_context blob,
    object int,
    target int,
    time timestamp,
    verb int,
    PRIMARY KEY (feed_id, activity_id)
) WITH CLUSTERING ORDER BY (activity_id ASC)
    AND bloom_filter_fp_chance=0.010000
    AND caching='KEYS_ONLY'
    AND dclocal_read_repair_chance=0.000000
    AND gc_grace_seconds=864000
    AND read_repair_chance=0.100000
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};
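The table above stores one row per (feed_id, activity_id) pair, so a single love is copied to every follower's partition. A minimal sketch of that denormalization step in plain Python — the `Activity` fields mirror the table columns, but the helper names and the `user:<id>` feed-key scheme are our illustration, not Feedly's actual API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Activity:
    activity_id: int
    actor: int
    verb: int
    object: int
    target: int

def flatten(activity: Activity, follower_ids: List[int]) -> List[Dict]:
    """Denormalize one activity into one timeline_flat row per follower."""
    return [
        {
            "feed_id": f"user:{follower_id}",     # partition key
            "activity_id": activity.activity_id,  # clustering key
            "actor": activity.actor,
            "verb": activity.verb,
            "object": activity.object,
            "target": activity.target,
        }
        for follower_id in follower_ids
    ]

rows = flatten(Activity(1, actor=7, verb=1, object=42, target=0), [10, 11, 12])
# one row per follower, all carrying the same activity columns
```

Reading a feed then hits a single partition, at the cost of writing (and storing) the activity once per follower.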
29. Current Challenges
How do we limit the storage for feeds?
Trimming? (Not supported)
DELETE FROM timeline_flat WHERE activity_id < 5000
Use a TTL on the rows?
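Since range deletes on a clustering column aren't supported, one alternative is to write every row with a TTL so Cassandra expires it on its own. A sketch of how the statement could be built — the 30-day retention and the helper are our illustration, not something from the Feedly codebase:

```python
THIRTY_DAYS = 30 * 24 * 3600  # retention in seconds; an assumed value, tune per product

def insert_with_ttl(table: str, row: dict, ttl: int = THIRTY_DAYS) -> str:
    """Build an INSERT ... USING TTL statement; Cassandra drops the row after `ttl` seconds."""
    cols = ", ".join(row)
    placeholders = ", ".join("%s" for _ in row)
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) USING TTL {ttl}"

stmt = insert_with_ttl("timeline_flat", {"feed_id": None, "activity_id": None})
# -> "INSERT INTO timeline_flat (feed_id, activity_id) VALUES (%s, %s) USING TTL 2592000"
```

The trade-off: TTLs bound storage without explicit deletes, but expired rows leave tombstones until compaction clears them.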
30. Fork Feedly
This is our first time using Cassandra, so let us know how we can further speed up our implementation:
http://bit.ly/feedlycassandra
31. Check out Feedly at
Github.com/tschellenbach/Feedly
Ask questions and give tips to these guys:
Thierry Schellenbach
Tommaso Barbugli
Guyon Morée
Editor's Notes
Follow me on Twitter and Github
Today I’ll give a quick introduction to Fashiolista and our growth over the past years. Afterwards I’ll explain how our feed systems worked prior to Cassandra. But most importantly, we’ve open sourced all the code which we’ll be discussing during this talk. I’ll start by explaining some of our Cassandra learnings. There are many people in this room with Cassandra expertise, so we definitely encourage you to have a look on Github. It’s quite possible you’ll find something which can be improved.
Fashiolista started out as a hobby project: four guys working on product comparison. We were doing OK, but growth wasn’t spectacular. We noticed the rapidly growing fashion segment and tried to incorporate it. The first iteration on YouTellMe was a massive fail. Fortunately a few girls from the Amsterdam Fashion Institute helped us cover up our lack of fashion sense. We started with an empty sheet and designed a product around inspiration instead of search. At this point Fashiolista was just a hobby project, which we spent a few weeks on before launching it at TNW.
So we launched with a bang at TNW. We organized a mini fashion show on stage and clearly stood out from the other startups. But at this point Fashiolista was just a side project. We got a few hundred users and went back to work on our product comparison site.
The next week though, my co-founder Thijs called while I was shopping at the AH. All the graphs looked off, and the growth over the past days had completely disappeared. All that remained on the graphs was a spike showing the current day. It turned out several Brazilian blogs and the teen magazine Capricho had posted about Fashiolista. Within a few hours tens of thousands of users signed up for Fashiolista.
Over the past two years things have moved along rapidly. Currently we’re the second largest fashion community worldwide, with close to 1.5M members and massive monthly engagement.
And the team has also grown considerably
Users of Fashiolista install the so-called “love button”. While browsing around the web they can use this button to add their favourite fashion finds to Fashiolista.
Once they click the button, we figure out the relevant image on the page and allow them to add it to their profile.
The find is added to their profile, and other people can follow the items they love.
So, a quick interlude about what we run. We’re a pretty standard Python/Django stack, similar to sites like Instagram and Pinterest.
This talk will focus on this page, the feed page. It shows the content posted by people you follow. When scaling a social site this is quite a tough problem to solve, since there is no easy way to shard the data.
Our feed setup went through three generations. We started out with PostgreSQL, moved to Redis and eventually settled on Cassandra. The topic of scaling feed systems is something we could talk about for days. Today I won’t go into much detail, but definitely have a look at my post on High Scalability if you are building something similar.
Our first setup with PostgreSQL was really easy. It took five minutes to develop and kept running smoothly till we had about 100M activities in the database.
We were using Redis for our caching needs, and building a push-based feed system with Redis was really easy. It took only a few weeks to develop, and it was fast, easy to set up and maintain. The push approach works by storing a small list for every user. When Kayture loves something, this love is stored on the feed of all the people who follow her. The Redis approach worked really well, but storing everything in memory can become expensive really quickly.
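The push approach described here can be sketched with plain in-memory structures — in production, Redis lists would replace the deques, and all names below are illustrative rather than Feedly's API:

```python
from collections import defaultdict, deque

MAX_FEED_LENGTH = 100  # keep feeds bounded, like a Redis LTRIM would

followers = defaultdict(set)  # user id -> set of follower ids
feeds = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))  # follower id -> activities

def follow(follower: int, user: int) -> None:
    followers[user].add(follower)

def love(user: int, item: int) -> None:
    """Fan the new love out to every follower's feed (the "push")."""
    for follower in followers[user]:
        feeds[follower].appendleft((user, item))

follow(2, 1)
follow(3, 1)
love(1, 42)
# feeds[2] and feeds[3] each hold a copy of (1, 42): one write per follower,
# which is why a user with 115K followers makes push fan-out expensive
```

This makes reads trivially fast (each feed is precomputed), while the cost shifts to write-time fan-out and memory.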
We evaluated several options for replacing our Redis-based approach. We looked at Cassandra, HBase and DynamoDB. We chose Cassandra because it has fewer moving parts, is supported by Datastax and is used by at least one other large startup (Instagram) for their feed system. In addition it’s trivial to add more capacity and the storage is very cost effective.
We’ve open sourced Feedly, which you can find on Github. This is great, because solving the scalability of your feed system is a lot of work, and it’s better to share that work across multiple companies.
You can build newsfeed systems: examples are your Facebook news feed, Twitter stream, Pinterest content, etc. Alternatively you can also build notification systems, which are basically a simpler version of the newsfeed problem.
Which language are you guys using? Java? Python? Ruby? Node? PHP? Pycassa is reliable — almost all examples still use it — but it uses the old Thrift API, doesn’t support CQL and isn’t very future compatible. CQLEngine is an ORM for writing CQL; it’s a great piece of code, but it relies on the old CQL adapter module. Python-Driver is where all the development effort of the Datastax guys is going. They say it’s not ready for production, but it’s already a really good beta:
- It uses the native binary protocol
- The client is smart, saving you a few roundtrips
- You can use prepared statements
- You can run your queries async
We forked CQLEngine and added support for Python-Driver; have a look on Github: https://github.com/tbarbugli/cqlengine
Another thing we didn’t expect was the high CPU load Cassandra generates when importing data.When we tried the import with only a few nodes, they would often go down.The solution was to run a huge number of nodes during import and subsequently scale back down.
When importing the 300M loves we used four techniques to import as fast as possible:
- First of all, we used Python-Driver, which has excellent performance
- Secondly, we used batch queries
- Batch queries on their own can actually be slower than regular queries, due to their atomic-by-default behaviour. To further improve speeds you want to use UNLOGGED batch queries
- Last, we used prepared statements to remove a bit of query parsing overhead
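The batching step can be sketched by grouping inserts into UNLOGGED batches. This builds the CQL as plain text for illustration; the real import would bind rows to a prepared statement via Python-Driver's `BatchStatement`, and the batch size here is an assumed tuning knob:

```python
from typing import Iterable, List, Tuple

# Illustrative prepared-statement text; in practice prepared once, then bound per row
INSERT_CQL = "INSERT INTO timeline_flat (feed_id, activity_id) VALUES (?, ?)"

def unlogged_batches(rows: Iterable[Tuple], size: int = 100) -> List[str]:
    """Group rows into UNLOGGED batch statements (non-atomic, hence faster than LOGGED)."""
    batches, current = [], []
    def close(stmts):
        return "BEGIN UNLOGGED BATCH\n" + ";\n".join(stmts) + ";\nAPPLY BATCH;"
    for row in rows:
        current.append(INSERT_CQL)  # in practice: bind `row` to the prepared statement
        if len(current) == size:
            batches.append(close(current))
            current = []
    if current:
        batches.append(close(current))
    return batches

batches = unlogged_batches([("user:1", i) for i in range(250)], size=100)
# -> 3 batches: 100, 100 and 50 inserts
```

UNLOGGED skips the batch log that makes LOGGED batches atomic, which is exactly the overhead a bulk import doesn't need.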
We went with a completely denormalized approach. We evaluated a more normalized approach, but: the performance is worse, as you’ll often hit many nodes; and it doesn’t fit as naturally with Cassandra, as there are no transactions.