Feedly & Cassandra at Fashiolista
 


A description of Fashiolista's accidental growth and the history of our feed systems.
It explains what we learned about Cassandra and gives you an introduction to our open source module Feedly.

Some links:
https://github.com/tschellenbach/Feedly
http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html
CQLEngine fork using Python-Driver
https://github.com/tbarbugli/cqlengine

  • Yeah, we thought about that. The only problem is that we limit the feed to a fixed number of items, say 3600, so we would need a counter to keep track of when to switch the keyspace and drop the old one. Not sure if that will work well.
  • If you're dealing with temporary feeds, why not shard writes of new stories into a new keyspace, and then later drop the old keyspace? This way, you won't have to deal with tombstones and you can rotate clusters with isolated keyspaces in and out.
  • For those who need C* trimming, there is an interesting commit in Feedly which helped us a lot: https://github.com/tschellenbach/Feedly/commit/ab1679c1ef7e84e60328e9d709f288de6c498c25
  • Yeah, well, the latest version is a lot better, though not perfect.
  • Guys! Did you solve your problem with trimming?
  • Follow me on Twitter and Github
  • Today I’ll give a quick introduction to Fashiolista and our growth over the past years. Afterwards I’ll explain how our feed systems worked prior to Cassandra. But most importantly, we’ve open sourced all the code which we’ll be discussing during this talk. I’ll start by explaining some of our Cassandra learnings. There are many people in this room with Cassandra expertise, so we definitely encourage you to have a look on Github. It’s quite possible you’ll find something which can be improved.
  • Fashiolista started out as a hobby project. We were 4 guys working on product comparison. We were doing OK, but growth wasn’t spectacular. We noticed the rapidly growing fashion segment and tried to incorporate it. The first iteration on YouTellMe was a massive fail. Fortunately a few girls from the Amsterdam fashion institute helped us cover up our lack of fashion sense. We started with an empty sheet and designed a product around inspiration instead of search. At this point Fashiolista was just a hobby project, which we spent a few weeks on before launching it at TNW.
  • So we launched with a bang at TNW. We organized a mini fashion show on stage and clearly stood out from the other startups. But at this point Fashiolista was just a side project. We got a few hundred users and went back to work on our product comparison site.
  • The next week though, my co-founder Thijs called while I was shopping at the AH. All the graphs looked off, and the growth over the past days had completely disappeared. All that remained on the graphs was a spike showing the current day. It turned out several Brazilian blogs and the teen magazine Capricho had posted about Fashiolista. Within a few hours tens of thousands of users signed up for Fashiolista.
  • Over the past 2 years things have moved along rapidly. Currently we’re the second largest fashion community worldwide, with close to 1.5M members and massive monthly engagement.
  • And the team has also grown considerably.
  • Users of Fashiolista install the so-called “love button”. While browsing around the web they can use this button to add their favourite fashion finds to Fashiolista.
  • Once they click the button, we figure out the relevant image on the page and allow them to add it to their profile.
  • The find is added to their profile and other people can follow the items they love.
  • So a quick interlude about what we run. We’re a pretty standard Python/Django stack, similar to sites like Instagram and Pinterest.
  • This talk will focus on this page, the feed page. It shows the content by people you follow. When scaling a social site this is quite a tough problem to solve, since there is no easy way to shard the data.
  • Our feed setup went through 3 generations. We started out with PostgreSQL, moved to Redis and eventually settled on Cassandra. The topic of scaling feed systems is something which we can talk about for days. Today I won’t go into much detail, but definitely have a look at my post on High Scalability if you are building something similar.
  • Our first setup with PostgreSQL was really easy. It took 5 minutes to develop and kept on running smoothly till we had about 100M activities in the database.
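A minimal sketch of what a pull-based feed looks like: the feed is computed at read time by joining the follow relationships to the activities. The table and column names here are hypothetical, and SQLite stands in for PostgreSQL only so that the example runs on its own; it is not the actual Fashiolista schema.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical follow and activity tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE follow (user_id INTEGER, target_id INTEGER);
        CREATE TABLE activity (id INTEGER PRIMARY KEY, actor_id INTEGER, item TEXT);
    """)
    conn.executemany("INSERT INTO follow VALUES (?, ?)", [(1, 2), (1, 3)])
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?)",
        [(1, 2, "dress"), (2, 4, "boots"), (3, 3, "scarf"), (4, 2, "bag")],
    )
    return conn

def pull_feed(conn, user_id, limit=20):
    """Read a feed at request time by joining follows to activities (pull)."""
    rows = conn.execute(
        """
        SELECT a.id, a.actor_id, a.item
        FROM activity AS a
        JOIN follow AS f ON f.target_id = a.actor_id
        WHERE f.user_id = ?
        ORDER BY a.id DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()

conn = build_demo_db()
feed = pull_feed(conn, user_id=1)
print(feed)
```

Nothing is stored per reader, which is why this is quick to build; the cost is that every page view runs the join over the full activity table.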
  • We were using Redis for our caching needs. Building a push based feed system with Redis was really easy. It took only a few weeks to develop. It was fast, easy to setup and maintain. The push approach works by storing a small list for every user. When kayture loves something, this love is stored on the feed of all the people who follow her. The Redis approach worked really well, but storing everything in memory can become expensive really quickly.
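The push approach described above can be sketched as follows. This is an in-memory stand-in, not Feedly's code: plain Python containers play the role of the per-user Redis lists (with Redis itself one would typically use LPUSH followed by LTRIM), and the function names and the tiny feed limit are assumptions made for the demo.

```python
from collections import defaultdict, deque

MAX_FEED_LENGTH = 3  # tiny limit for the demo; a real feed would keep far more

followers = defaultdict(set)  # user -> set of users following them
feeds = defaultdict(lambda: deque(maxlen=MAX_FEED_LENGTH))  # user -> newest-first feed

def follow(user, target):
    """Record that `user` follows `target`."""
    followers[target].add(user)

def add_love(actor, item):
    """Fan out on write: copy the activity onto every follower's feed."""
    activity = (actor, item)
    for follower in followers[actor]:
        feeds[follower].appendleft(activity)  # maxlen drops the oldest entry

def read_feed(user):
    """Reading is a single lookup of a precomputed list."""
    return list(feeds[user])

follow("alice", "kayture")
follow("bob", "kayture")
for item in ["dress", "boots", "scarf", "bag"]:
    add_love("kayture", item)
print(read_feed("alice"))
```

Reads become cheap, but every love is written once per follower, so storage grows with the number of followers rather than the number of loves.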
  • We evaluated several options for replacing our Redis based approach. We looked at Cassandra, HBase and DynamoDB. We chose Cassandra because it has fewer moving parts, is supported by Datastax and is used by at least one other large startup for their feed system. In addition it’s trivial to add more capacity and the storage is very cost effective.
  • We’ve open sourced Feedly, which you can find on Github. This is great, because solving the scalability of your feed system is a lot of work, and it’s better to share this across multiple companies.
  • You can build newsfeed systems. Examples are your Facebook news feed, Twitter stream, Pinterest content etc. Alternatively you can also build notification systems, which are basically a simpler version of the newsfeed problem.
  • Which language are you guys using? Java? Python? Ruby? Node? PHP? Pycassa is reliable and almost all examples still use it, but it uses the old Thrift API, doesn’t support CQL and is not very future compatible. CQLEngine is an ORM for writing CQL. It’s a great piece of code, but it relies on the old CQL adapter module. Python-Driver is where all the development effort of the Datastax guys is. They say it’s not ready for production, but it’s already a really good beta:
    - It uses the native binary protocol
    - The client is smart, saving you a few roundtrips
    - You can use prepared statements
    - You can run your queries async
    We forked CQLEngine and added support for Python-Driver, have a look at Github: https://github.com/tbarbugli/cqlengine
  • Another thing we didn’t expect was the high CPU load Cassandra generates when importing data. When we tried the import with only a few nodes, they would often go down. The solution was to run a huge number of nodes during import and subsequently scale back down.
  • When importing the 300M loves we used 4 techniques to import as fast as possible:
    - First of all, we’re using Python-Driver, which has excellent performance.
    - Secondly, we used batch queries.
    - Batch queries on their own can actually be slower than regular queries, due to their atomic-by-default behaviour. To further improve speed you want to use UNLOGGED batch queries.
    - Lastly, we used prepared statements to remove a bit of query parsing overhead.
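A rough sketch of the batching idea from the list above: rows are grouped into chunks, and each chunk becomes one UNLOGGED batch whose inserts use `?` bind markers, as a prepared statement would. The helper names, the batch size and the chosen subset of columns are assumptions for illustration. How the resulting pairs are prepared and executed depends on the driver and is not shown here.

```python
from itertools import islice

# Bind markers (?) are filled in by the driver when the statement is executed.
INSERT_CQL = (
    "INSERT INTO timeline_flat (feed_id, activity_id, actor, verb, object) "
    "VALUES (?, ?, ?, ?, ?)"
)

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def unlogged_batches(rows, batch_size=100):
    """Yield (cql, params) pairs, one UNLOGGED batch per chunk of rows."""
    for chunk in chunked(rows, batch_size):
        statements = "\n".join(f"  {INSERT_CQL};" for _ in chunk)
        cql = f"BEGIN UNLOGGED BATCH\n{statements}\nAPPLY BATCH;"
        params = [value for row in chunk for value in row]
        yield cql, params

# Hypothetical activities: (feed_id, activity_id, actor, verb, object)
rows = [(f"user:{n}", 1000 + n, 7, 1, 42) for n in range(5)]
batches = list(unlogged_batches(rows, batch_size=2))
print(len(batches))
print(batches[0][0])
```

UNLOGGED skips the batch log that makes a regular batch atomic, which is the overhead the note above refers to.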
  • We use a completely denormalized approach. We evaluated a more normalized approach, but:
    - The performance is worse, as you’ll often hit many nodes
    - It doesn’t fit as naturally with Cassandra, as there are no transactions
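To illustrate why the normalized variant tends to touch more nodes, here is a toy model that only counts how many partitions a feed read has to visit. The dictionaries stand in for Cassandra tables, and the function names and data are hypothetical; it says nothing about real latencies.

```python
def read_denormalized(timeline, feed_id, limit):
    """Each feed row carries the full activity, so one partition is read."""
    rows = sorted(timeline[feed_id].items(), reverse=True)[:limit]
    partitions_read = {feed_id}
    return [activity for _, activity in rows], len(partitions_read)

def read_normalized(timeline_ids, activities, feed_id, limit):
    """The feed holds only ids; every activity is a separate partition lookup."""
    ids = sorted(timeline_ids[feed_id], reverse=True)[:limit]
    partitions_read = {feed_id} | {("activity", i) for i in ids}
    return [activities[i] for i in ids], len(partitions_read)

activities = {1: "kayture loves dress", 2: "kayture loves boots", 3: "thijs loves bag"}
timeline = {"user:1": dict(activities)}      # full copies stored per feed
timeline_ids = {"user:1": list(activities)}  # references only

print(read_denormalized(timeline, "user:1", limit=3))
print(read_normalized(timeline_ids, activities, "user:1", limit=3))
```

Both reads return the same activities, but the normalized one visits one extra partition per activity, and those partitions may live on different nodes.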
  • https://github.com/tbarbugli/cqlengine

Feedly & Cassandra at Fashiolista: Presentation Transcript

  • Accidental scaling issues From a hobby project to one of the largest online fashion communities
  • About Me • Thierry Schellenbach • Founder/CTO Fashiolista • Github/tschellenbach • Feedly & Django Facebook • Blog: mellowmorning.com • @tschellenbach
  • Today • Fashiolista’s growth • Pre Cassandra feed systems • Github/tschellenbach/Feedly – Cassandra learnings – Remaining challenges
  • A long time ago Rick, Joost, Thierry & Thijs
  • Launched Fashiolista at TNW Got a few hundred users And went back to work
  • Brazil?! • Blogs • Twitter • Capricho (Teen magazine with 1.8M followers)
  • Growth 2nd largest fashion community • 1.5M members • 17M loves/month • 94M pageviews (google analytics)
  • 5.000.000+ 14.000.000+
  • The team
  • Global Fashion Discovery
  • Our Stack • Django/Python • PostgreSQL/Pgbouncer • Cassandra • Redis • Solr • Celery/RabbitMQ • AWS/Ubuntu • Nginx/Gunicorn/Supervisor • Newrelic, Datadog & Sentry
  • Feed History 1. PostgreSQL 2. Redis – Feedly 0.1 3. Cassandra – Feedly 0.9 More details in this highscalability post: http://bit.ly/hsfeedly
  • PostgreSQL - Pull 1. Smooth till we reached ~100M activities 2. Spikes in performance due to the query planner
  • Redis - Push 1. Fast, Easy to setup and maintain 2. Becomes expensive really quickly 115K Followers
  • Cassandra - Feedly 0.9 1. Few moving components 2. Supported by Datastax 3. Instagram 4. Easy to add capacity 5. Cost effective
  • We open sourced Feedly! • Github/tschellenbach/Feedly • Python library, which allows you to build newsfeed and notification systems using Cassandra and/or Redis
  • Feedly – What can you build? Newsfeeds Notification systems
  • Cassandra Challenges 1. Which Python library to choose? • Pycassa • CQLEngine (using the old CQL module) • Python-Driver (beta) • Fork CQLEngine to support Python-Driver – Github/tbarbugli/cqlengine
  • Cassandra Challenges 2. Importing data (300M loves * 1000 followers = 300 billion activities) • High CPU load • Nodes going down • Start with many nodes, scale down afterwards
  • Cassandra Challenges 3. Optimizing import speed (300M loves * 1000 followers = 300 billion activities) • Python-Driver • Batch queries • Non-atomic (unlogged) batch queries • Prepared statements
  • Cassandra Challenges 4. Data model denormalization
    CREATE TABLE fashiolista_feedly.timeline_flat (
      feed_id ascii,
      activity_id varint,
      actor int,
      extra_context blob,
      object int,
      target int,
      time timestamp,
      verb int,
      PRIMARY KEY (feed_id, activity_id)
    ) WITH CLUSTERING ORDER BY (activity_id ASC)
      AND bloom_filter_fp_chance=0.010000
      AND caching='KEYS_ONLY'
      AND dclocal_read_repair_chance=0.000000
      AND gc_grace_seconds=864000
      AND read_repair_chance=0.100000
      AND replicate_on_write='true'
      AND populate_io_cache_on_flush='false'
      AND compaction={'class': 'SizeTieredCompactionStrategy'}
      AND compression={'sstable_compression': 'LZ4Compressor'};
  • Opscenter is great • Opscenter & Datastax AMI are great • For startups Enterprise is also free
  • Evaluation • 7 instances, m1.xlarge, 2.59 TB • Cassandra 2.0.0, CQL3, Python-driver • (Would have been one expensive Redis cluster)
  • Current challenges Average load times are good, but 99th percentile sometimes spikes
  • Current Challenges • How do we limit the storage for feeds? • Trimming? (Not supported): DELETE FROM timeline_flat WHERE activity_id < 5000 • Use a TTL on the rows?
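Since the range DELETE on the slide is listed as not supported, one possible workaround is to read the ids of a feed and delete the surplus rows one by one using the full primary key. The sketch below only builds those statements; the helper name is hypothetical, it assumes that a larger activity_id means a newer activity, and it is not necessarily how Feedly handles trimming.

```python
def trim_statements(feed_id, activity_ids, max_length):
    """Return (cql, params) deletes for everything past the newest max_length rows."""
    # Assumption: a larger activity_id means a newer activity.
    newest_first = sorted(activity_ids, reverse=True)
    to_delete = newest_first[max_length:]
    cql = "DELETE FROM timeline_flat WHERE feed_id = ? AND activity_id = ?"
    return [(cql, (feed_id, activity_id)) for activity_id in to_delete]

statements = trim_statements("user:1", [5001, 4998, 5003, 4999, 5002], max_length=3)
for cql, params in statements:
    print(cql, params)
```

Each of these deletes leaves a tombstone behind, which is part of why trimming remains an open question here.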
  • Fork Feedly • This is our first time using Cassandra, let us know how we can further speed up our implementation: http://bit.ly/feedlycassandra
  • Check out Feedly at Github.com/tschellenbach/Feedly Ask questions, Give tips to these guys: Thierry Schellenbach Tommaso Barbugli Guyon Morée