Accidental scaling issues
From a hobby project to one of the
largest online fashion communities
About Me
• Thierry Schellenbach
• Founder/ CTO Fashiolista
• Github/tschellenbach
• Feedly & Django Facebook
• Blog: mel...
Today
• Fashiolista’s growth
• Pre Cassandra feed systems
• Github/tschellenbach/Feedly
– Cassandra learnings
– Remaining ...
A long time ago

Rick, Joost, Thierry & Thijs
Launched Fashiolista at TNW
Got a few hundred users
And went back to work
Brazil?!
• Blogs
• Twitter
• Capricho (Teen
magazine with
1.8M followers)
Growth
2nd largest fashion community
• 1.5M members
• 17M loves/month
• 94M pageviews (google analytics)
5.000.000+
14.000.000+
The team
Global Fashion Discovery
Our Stack
• Django/Python
• PostgreSQL/ Pgbouncer
• Cassandra
• Redis
• Solr
• Celery/ RabbitMQ
• AWS/ Ubuntu
• Nginx/ ...
Feed History
1. PostgreSQL
2. Redis – Feedly 0.1
3. Cassandra – Feedly 0.9
More details in this highscalability post:
http...
PostgreSQL - Pull
1. Smooth till we reached ~100M activities
2. Spikes in performance due to the query
planner
Redis - Push
1. Fast, Easy to setup and maintain
2. Becomes expensive really quickly

115K Followers
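The push approach on this slide can be sketched in a few lines of plain Python. This is only an illustration, not Feedly's code: `FOLLOWERS`, `FEEDS`, `fan_out`, `read_feed` and `MAX_FEED_LENGTH` are hypothetical names, the follower data is made up, and in-memory deques stand in for the per-user Redis lists.

```python
from collections import defaultdict, deque

# Hypothetical cap on the number of items kept per feed (tiny for the demo).
MAX_FEED_LENGTH = 3

# Illustrative follower graph: user -> users who follow them.
FOLLOWERS = {"kayture": {"alice", "bob"}, "alice": {"bob"}}

# One bounded list per user, newest first. Stands in for a Redis list per feed.
FEEDS = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))


def fan_out(actor, activity_id):
    """Copy one activity onto the feed of every follower (fan-out on write)."""
    for follower in FOLLOWERS.get(actor, ()):
        # appendleft on a full deque silently drops the oldest item on the right.
        FEEDS[follower].appendleft(activity_id)


def read_feed(user):
    """Reading a feed is a single lookup of a precomputed list."""
    return list(FEEDS[user])


# Four loves in chronological order: (activity_id, actor).
for activity_id, actor in enumerate(["kayture", "alice", "kayture", "kayture"], start=1):
    fan_out(actor, activity_id)
```

Reads stay cheap because the work happens at write time. The cost is one stored copy per follower, which is why an account with 115K followers makes an in-memory store expensive.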
Cassandra - Feedly 0.9
1. Few moving components
2. Supported by Datastax
3. Instagram
4. Easy to add capacity
5. Cost eff...
We open sourced Feedly!
• Github/tschellenbach/Feedly
• Python library, which allows you to build
newsfeed and notificatio...
Feedly – What can you build?
Newsfeeds

Notification systems
Cassandra Challenges
1. Which Python library to choose?
• Pycassa
• CQLEngine (using the old CQL module)
• Python-Driver...
Cassandra Challenges
2. Importing data
(300M loves * 1000 followers = 300 billion activities)

• High CPU load
• Nodes goi...
Cassandra Challenges
3. Optimizing import speed
(300M loves * 1000 followers = 300 billion activities)

• Python-Dr...
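As a rough sketch of the unlogged-batch idea, the helper below groups rows into chunks and builds the CQL text for one `UNLOGGED` batch per chunk, with `?` bind markers so the statement could be prepared. It is not the import script used at Fashiolista: `chunked`, `build_unlogged_batch`, the batch size and the sample rows are hypothetical, and only a subset of the `timeline_flat` columns is used. How the statement is sent depends on the client library.

```python
from itertools import islice

# Single-row insert with bind markers. The column subset is an assumption.
INSERT_CQL = (
    "INSERT INTO timeline_flat (feed_id, activity_id, actor, verb, object) "
    "VALUES (?, ?, ?, ?, ?)"
)


def chunked(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_unlogged_batch(rows):
    """Return (cql, params) for one UNLOGGED batch covering `rows`."""
    lines = ["BEGIN UNLOGGED BATCH"]
    params = []
    for row in rows:
        lines.append("  " + INSERT_CQL + ";")
        params.extend(row)  # flat parameter list, in statement order
    lines.append("APPLY BATCH;")
    return "\n".join(lines), params


# Made-up rows: (feed_id, activity_id, actor, verb, object).
rows = [("user:%d" % n, 1000 + n, 42, 1, 500) for n in range(5)]
batches = [build_unlogged_batch(chunk) for chunk in chunked(rows, 2)]
```

`UNLOGGED` skips the batch log that makes a default batch atomic, which is the overhead the slide refers to as "non-atomic".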
Cassandra Challenges
4. Data model denormalization
CREATE TABLE fashiolista_feedly.timeline_flat (
feed_id ascii,
activity...
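To illustrate the trade-off behind the denormalized table, the sketch below compares the number of lookups needed to render a feed in each layout. Plain dicts stand in for Cassandra tables. The data and the names `DENORMALIZED`, `NORMALIZED_FEEDS`, `ACTIVITIES`, `read_denormalized` and `read_normalized` are all hypothetical.

```python
# Denormalized: every feed row carries the full activity.
DENORMALIZED = {
    "feed:bob": [
        {"activity_id": 2, "actor": 11, "verb": 1, "object": 501},
        {"activity_id": 1, "actor": 10, "verb": 1, "object": 500},
    ],
}

# Normalized: feed rows hold only ids, activities live in a separate table.
NORMALIZED_FEEDS = {"feed:bob": [2, 1]}
ACTIVITIES = {
    1: {"activity_id": 1, "actor": 10, "verb": 1, "object": 500},
    2: {"activity_id": 2, "actor": 11, "verb": 1, "object": 501},
}


def read_denormalized(feed_id):
    """One lookup by feed_id returns everything needed to render the feed."""
    return DENORMALIZED[feed_id], 1


def read_normalized(feed_id):
    """One lookup for the ids, then one more lookup per activity."""
    activity_ids = NORMALIZED_FEEDS[feed_id]
    lookups = 1
    activities = []
    for activity_id in activity_ids:
        activities.append(ACTIVITIES[activity_id])
        lookups += 1
    return activities, lookups
```

In a cluster the per-activity lookups can land on different nodes, so the normalized read grows with the length of the feed while the denormalized read stays within one partition.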
Opscenter is great
Opscenter & Datastax AMI are great
For startups Enterprise is also Free
Evaluation
7 instances, m1.xlarge, 2.59 TB
Cassandra 2.0.0, CQL3, Python-driver
(Would have been one expensive Redis clust...
Current challenges
Average load times are good, but 99th percentile
sometimes spikes
Current Challenges
How do we limit the storage for feeds?
Trimming?

(Not supported)
DELETE from timeline_flat WHERE activ...
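Since a range delete on `activity_id` is marked as not supported here, one possible workaround is to read the ids of a feed, work out which ones fall outside the newest N, and delete those rows one at a time. The sketch below shows only that selection step and is not necessarily how Feedly trims: `ids_to_trim` and `trim_statements` are hypothetical names, and it assumes a larger `activity_id` means a newer activity.

```python
def ids_to_trim(activity_ids, max_length):
    """Return the ids that fall outside the newest `max_length` entries."""
    # Assumption: ids grow over time, so sorting descending puts newest first.
    newest_first = sorted(activity_ids, reverse=True)
    return newest_first[max_length:]


def trim_statements(feed_id, activity_ids, max_length):
    """Build one single-row DELETE per surplus activity as (cql, params) pairs."""
    cql = "DELETE FROM timeline_flat WHERE feed_id = ? AND activity_id = ?"
    return [
        (cql, (feed_id, activity_id))
        for activity_id in ids_to_trim(activity_ids, max_length)
    ]


# Example: keep the 3 newest of 5 activities in one feed.
statements = trim_statements("user:1", [5, 1, 9, 3, 7], max_length=3)
```

Each of these deletes writes a tombstone in Cassandra, so trimming this way on every write would add read overhead. Running it only occasionally per feed keeps that in check.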
Fork Feedly
This is our first time using Cassandra, let us
know how we can further speedup our
implementation:

http://bit...
Check out Feedly at
Github.com/tschellenbach/Feedly
Ask questions, Give tips to these guys:

Thierry Schellenbach

Tommaso...
Feedly & Cassandra at Fashiolista
A description of Fashiolista's accidental growth and the history of our feed systems.
It explains what we learned about Cassandra and gives you an introduction to our open source module Feedly.

Some links:
https://github.com/tschellenbach/Feedly
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
CQLEngine fork using Python-Driver
https://github.com/tbarbugli/cqlengine

Published in: Technology
5 Comments
  • Yeay we thought about that, only problem is that we limit on a number of items in the feed. Say 3600, so we would need a counter to keep track when to switch the keyspace and drop the old one. Not sure if that will work well.
  • If you're dealing with temporary feeds, why not shard writes of new stories into a new keyspace, and then later drop the old keyspace? This way, you won't have to deal with tombstones and you can rotate clusters with isolated keyspaces in and out.
  • For those, who needs C* trimming, there is interesting commit in Feedly which helped us a lot: https://github.com/tschellenbach/Feedly/commit/ab1679c1ef7e84e60328e9d709f288de6c498c25
  • Yeay well the latest version is a lot better, not perfect though
  • Guys! Did you solve your problem with trimming?
  • Follow me on Twitter and Github
  • Today I’ll give a quick introduction to Fashiolista and our growth over the past years. Afterwards I’ll explain how our feed systems worked prior to Cassandra. But most importantly, we’ve open sourced all the code which we’ll be discussing during this talk. I’ll start by explaining some of our Cassandra learnings. There are many people in this room with Cassandra expertise, so we definitely encourage you to have a look on Github. It’s quite possible you’ll find something which can be improved.
  • Fashiolista started out as a hobby project. We were 4 guys working on product comparison. We were doing ok, but growth wasn’t spectacular. We noticed the rapidly growing fashion segment and tried to incorporate it. The first iteration on YouTellMe was a massive fail. Fortunately a few girls from the Amsterdam fashion institute helped us cover up our lack of fashion sense. We started with an empty sheet and designed a product around inspiration instead of search. Now at this point Fashiolista was just a hobby project, which we spent a few weeks on before launching it at TNW.
  • So we launched with a bang at TNW. We organized a mini fashion show on stage and clearly stood out from the other startups. But at this point Fashiolista was just a side project. We got a few hundred users and went back to work on our product comparison site.
  • The next week though, my co-founder Thijs called while I was shopping at the AH. All the graphs looked off, and the growth over the past days completely disappeared. All that remained on the graphs was a spike showing the current day. Turns out several Brazilian blogs and the teen magazine Capricho posted about Fashiolista. Within a few hours tens of thousands of users signed up for Fashiolista.
  • Over the past 2 years things have moved along rapidly. Currently we’re the second largest fashion community worldwide, with close to 1.5M members and massive monthly engagement.
  • And the team has also grown considerably
  • Users of Fashiolista install the so called “love button”. While browsing around the web they can use this button to add their favourite fashion finds to Fashiolista.
  • Once they click the button, we figure out the relevant image on the page and allow you to add it to your profile.
  • The find is added to your profile and other people can follow the items you love.
  • So a quick interlude about what we run. We’re a pretty standard Python/ Django stack, similar to sites like Instagram and Pinterest.
  • This talk will focus on this page, the feed page. It shows the content by people you follow. When scaling a social site this is quite a tough problem to solve, since there is no easy way to shard the data.
  • Our feed setup went through 3 generations. We started out with PostgreSQL, moved to Redis and eventually settled on Cassandra. The topic of scaling feed systems is something which we can talk about for days. Today I won’t go into much detail, but definitely have a look at my post on Highscalability if you are building something similar.
  • Our first setup with PostgreSQL was really easy. It took 5 minutes to develop and kept on running smoothly till we had about 100M activities in the database.
  • We were using Redis for our caching needs. Building a push based feed system with Redis was really easy. It took only a few weeks to develop. It was fast, easy to set up and maintain. The push approach works by storing a small list for every user. When kayture loves something, this love is stored on the feed of all the people who follow her. The Redis approach worked really well, but storing everything in memory can become expensive really quickly.
  • We evaluated several options for replacing our Redis based approach. We looked at Cassandra, HBase and DynamoDB. We chose Cassandra because it has fewer moving parts, is supported by Datastax and is used by at least one other large startup for their feed system. In addition it’s trivial to add more capacity and the storage is very cost effective.
  • We’ve open sourced Feedly, which you can find on Github. This is great, because solving the scalability of your feed system is a lot of work, and it’s better to share this across multiple companies.
  • You can build newsfeed systems. Examples are your Facebook news feed, Twitter stream, Pinterest content etc. Alternatively you can also build notification systems, which are basically a simpler version of the newsfeed problem.
  • Which language are you guys using? Java? Python? Ruby? Node? PHP? Pycassa is reliable and almost all examples still use it, but it uses the old Thrift API, doesn’t support CQL and is not very future compatible. CQLEngine is an ORM for writing CQL. It’s a great piece of code, but it relies on the old CQL adapter module. Python-Driver is where all the development effort of the Datastax guys is. They say it’s not ready for production, but it’s already a really good beta: it uses the native binary protocol, the client is smart (saving you a few roundtrips), you can use prepared statements, and you can run your queries async. We forked CQLEngine and added support for Python-Driver, have a look at Github: https://github.com/tbarbugli/cqlengine
  • Another thing we didn’t expect was the high CPU load Cassandra generates when importing data. When we tried the import with only a few nodes, they would often go down. The solution was to run a huge number of nodes during import and subsequently scale back down.
  • When importing the 300M loves we used 4 techniques to import as fast as possible. First of all, we’re using Python-Driver, which has excellent performance. Secondly, we used batch queries. Batch queries on their own can actually be slower than regular queries, due to their atomic-by-default behaviour, so to further improve speeds you want to use UNLOGGED batch queries. Lastly, we used prepared statements to remove a bit of query parsing overhead.
  • Completely denormalized approach. We evaluated a more normalized approach, but the performance is worse as you’ll often hit many nodes, and it doesn’t fit as naturally with Cassandra as there are no transactions.
  • https://github.com/tbarbugli/cqlengine

    1. Accidental scaling issues: From a hobby project to one of the largest online fashion communities
    2. About Me • Thierry Schellenbach • Founder/ CTO Fashiolista • Github/tschellenbach • Feedly & Django Facebook • Blog: mellowmorning.com • @tschellenbach
    3. Today • Fashiolista’s growth • Pre Cassandra feed systems • Github/tschellenbach/Feedly – Cassandra learnings – Remaining challenges
    4. A long time ago: Rick, Joost, Thierry & Thijs
    5. Launched Fashiolista at TNW. Got a few hundred users. And went back to work
    6. Brazil?! • Blogs • Twitter • Capricho (Teen magazine with 1.8M followers)
    7. Growth: 2nd largest fashion community • 1.5M members • 17M loves/month • 94M pageviews (google analytics)
    8. 5.000.000+ 14.000.000+
    9. The team
    10. Global Fashion Discovery
    11. Our Stack • Django/Python • PostgreSQL/ Pgbouncer • Cassandra • Redis • Solr • Celery/ RabbitMQ • AWS/ Ubuntu • Nginx/ Gunicorn/ Supervisor • Newrelic, Datadog & Sentry
    12. Feed History 1. PostgreSQL 2. Redis – Feedly 0.1 3. Cassandra – Feedly 0.9 More details in this highscalability post: http://bit.ly/hsfeedly
    13. PostgreSQL - Pull 1. Smooth till we reached ~100M activities 2. Spikes in performance due to the query planner
    14. Redis - Push 1. Fast, Easy to setup and maintain 2. Becomes expensive really quickly. 115K Followers
    15. Cassandra - Feedly 0.9 1. Few moving components 2. Supported by Datastax 3. Instagram 4. Easy to add capacity 5. Cost effective
    16. We open sourced Feedly! • Github/tschellenbach/Feedly • Python library, which allows you to build newsfeed and notification systems using Cassandra and/or Redis
    17. Feedly – What can you build? Newsfeeds, Notification systems
    18. Cassandra Challenges 1. Which Python library to choose? • Pycassa • CQLEngine (using the old CQL module) • Python-Driver (beta) • Fork CQLEngine to support Python-Driver – Github/tbarbugli/cqlengine
    19. Cassandra Challenges 2. Importing data (300M loves * 1000 followers = 300 billion activities) • High CPU load • Nodes going down • Start with many nodes, scale down afterwards
    20. Cassandra Challenges 3. Optimizing import speed (300M loves * 1000 followers = 300 billion activities) • Python-Driver • Batch queries • Non-Atomic (unlogged) batch queries • Prepared statements
    21. Cassandra Challenges 4. Data model denormalization
        CREATE TABLE fashiolista_feedly.timeline_flat (
          feed_id ascii,
          activity_id varint,
          actor int,
          extra_context blob,
          object int,
          target int,
          time timestamp,
          verb int,
          PRIMARY KEY (feed_id, activity_id)
        ) WITH CLUSTERING ORDER BY (activity_id ASC)
          AND bloom_filter_fp_chance=0.010000
          AND caching='KEYS_ONLY'
          AND dclocal_read_repair_chance=0.000000
          AND gc_grace_seconds=864000
          AND read_repair_chance=0.100000
          AND replicate_on_write='true'
          AND populate_io_cache_on_flush='false'
          AND compaction={'class': 'SizeTieredCompactionStrategy'}
          AND compression={'sstable_compression': 'LZ4Compressor'};
    22. Opscenter is great. Opscenter & Datastax AMI are great. For startups Enterprise is also Free
    23. Evaluation: 7 instances, m1.xlarge, 2.59 TB. Cassandra 2.0.0, CQL3, Python-driver (Would have been one expensive Redis cluster)
    24. Current challenges: Average load times are good, but 99th percentile sometimes spikes
    25. Current Challenges: How do we limit the storage for feeds? Trimming? (Not supported) DELETE from timeline_flat WHERE activity_id < 5000 Use a TTL on the rows?
    26. Fork Feedly: This is our first time using Cassandra, let us know how we can further speedup our implementation: http://bit.ly/feedlycassandra
    27. Check out Feedly at Github.com/tschellenbach/Feedly. Ask questions, Give tips to these guys: Thierry Schellenbach, Tommaso Barbugli, Guyon Morée