Feedly & Cassandra at Fashiolista


A description of Fashiolista's accidental growth and the history of our feed systems. It explains what we learned about Cassandra and gives you an introduction to our open source module Feedly. Some links: https://github.com/tschellenbach/Feedly http://highscalability.com/blog/2013/10/28/design-decisions-for-scaling-your-high-traffic-feeds.html CQLEngine fork using Python-Driver https://github.com/tbarbugli/cqlengine


Accidental scaling issues
From a hobby project to one of the
largest online fashion communities

About Me
• Thierry Schellenbach
• Founder/CTO Fashiolista
• Github/tschellenbach
• Feedly & Django Facebook
• Blog: mellowmorning.com
• @tschellenbach

Today
• Fashiolista’s growth
• Pre Cassandra feed systems
• Github/tschellenbach/Feedly
– Cassandra learnings
– Remaining challenges

A long time ago

Rick, Joost, Thierry & Thijs

Launched Fashiolista at TNW
Got a few hundred users
And went back to work

Brazil?!
• Blogs
• Twitter
• Capricho (Teen
magazine with
1.8M followers)

Growth
2nd largest fashion community
• 1.5M members
• 17M loves/month
• 94M pageviews (google analytics)

Our Stack
• Django/Python
• PostgreSQL/Pgbouncer
• Cassandra
• Redis
• Solr
• Celery/RabbitMQ
• AWS/Ubuntu
• Nginx/Gunicorn/Supervisor
• Newrelic, Datadog & Sentry

Feed History
1. PostgreSQL
2. Redis – Feedly 0.1
3. Cassandra – Feedly 0.9
More details in this highscalability post:
http://bit.ly/hsfeedly

PostgreSQL - Pull
1. Smooth till we reached ~100M activities
2. Spikes in performance due to the query
planner
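
A pull-based feed is assembled at read time by joining the follow graph against the activity table. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; the table and column names (follow, activity, user_id, target_id, actor_id, created_at) are assumptions for illustration, not Fashiolista's actual schema.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical minimal schema: who follows whom, and who did what.
    conn.executescript(
        """
        CREATE TABLE follow (user_id INTEGER, target_id INTEGER);
        CREATE TABLE activity (
            id INTEGER PRIMARY KEY,
            actor_id INTEGER,
            object_id INTEGER,
            created_at INTEGER
        );
        CREATE INDEX activity_actor_time ON activity (actor_id, created_at);
        """
    )


def pull_feed(conn, user_id, limit=20):
    # Pull model: nothing is precomputed, the feed is built per request.
    rows = conn.execute(
        """
        SELECT a.id, a.actor_id, a.object_id
        FROM activity AS a
        JOIN follow AS f ON f.target_id = a.actor_id
        WHERE f.user_id = ?
        ORDER BY a.created_at DESC
        LIMIT ?
        """,
        (user_id, limit),
    )
    return rows.fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # User 1 follows users 2 and 3, but not user 4.
    conn.executemany("INSERT INTO follow VALUES (?, ?)", [(1, 2), (1, 3)])
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?, ?, ?)",
        [(10, 2, 500, 100), (11, 3, 501, 200), (12, 4, 502, 300)],
    )
    return pull_feed(conn, user_id=1)
```

Every page view runs the join and the sort again, which is why this approach depends so heavily on the query planner picking a good plan as the activity table grows.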

Redis - Push
1. Fast, easy to set up and maintain
2. Becomes expensive really quickly

115K Followers

Cassandra - Feedly 0.9
1. Few moving components
2. Supported by Datastax
3. Instagram
4. Easy to add capacity
5. Cost effective

We open sourced Feedly!
• Github/tschellenbach/Feedly
• Python library, which allows you to build
newsfeed and notification systems using
Cassandra and/or Redis

Feedly – What can you build?
Newsfeeds

Notification systems

Cassandra Challenges
1. Which Python library to choose?
• Pycassa
• CQLEngine (using the old CQL module)
• Python-Driver (beta)
• Fork CQLEngine to support Python-Driver
– Github/tbarbugli/cqlengine

Cassandra Challenges
2. Importing data
(300M loves * 1000 followers = 300 billion activities)

• High CPU load
• Nodes going down
• Start with many nodes, scale down afterwards

Cassandra Challenges
3. Optimizing import speed
(300M loves * 1000 followers = 300 billion activities)

• Python-Driver
• Batch queries
• Non-Atomic (unlogged) batch queries
• Prepared statements

Cassandra Challenges
4. Data model denormalization

CREATE TABLE fashiolista_feedly.timeline_flat (
  feed_id ascii,
  activity_id varint,
  actor int,
  extra_context blob,
  object int,
  target int,
  time timestamp,
  verb int,
  PRIMARY KEY (feed_id, activity_id)
) WITH CLUSTERING ORDER BY (activity_id ASC) AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

Opscenter is great
Opscenter & Datastax AMI are great
For startups Enterprise is also Free

Evaluation
7 instances, m1.xlarge, 2.59 TB
Cassandra 2.0.0, CQL3, Python-driver
(Would have been one expensive Redis cluster)

Current challenges
Average load times are good, but 99th percentile
sometimes spikes

Current Challenges
How do we limit the storage for feeds?
Trimming?

(Not supported)
DELETE from timeline_flat WHERE activity_id < 5000

Use a TTL on the rows?
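
One way to explore the TTL idea is to attach an expiry to every insert with CQL's USING TTL clause. The helper below only builds the statement text; the 30-day retention used in the usage note is a hypothetical value, and the column list is taken from the timeline_flat table shown earlier. Note that a TTL limits feeds by age rather than by length, so it is not a drop-in replacement for trimming a feed to a fixed number of items.

```python
from datetime import timedelta


def insert_with_ttl_cql(max_age):
    # CQL expects the TTL as a whole number of seconds.
    ttl_seconds = int(max_age.total_seconds())
    return (
        "INSERT INTO timeline_flat (feed_id, activity_id, actor, verb, object) "
        "VALUES (?, ?, ?, ?, ?) "
        f"USING TTL {ttl_seconds}"
    )
```

Calling insert_with_ttl_cql(timedelta(days=30)) gives a statement that can be prepared once and reused for every write.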

Fork Feedly
This is our first time using Cassandra, let us
know how we can further speed up our
implementation:

http://bit.ly/feedlycassandra

Check out Feedly at
Github.com/tschellenbach/Feedly
Ask questions, Give tips to these guys:

Thierry Schellenbach

Tommaso Barbugli

Guyon Morée



8. 5.000.000+ 14.000.000+

9. The team

10. Global Fashion Discovery


Editor's Notes

Follow me on Twitter and Github
Today I’ll give a quick introduction to Fashiolista and our growth over the past years. Afterwards I’ll explain how our feed systems worked prior to Cassandra. But most importantly, we’ve open sourced all the code which we’ll be discussing during this talk. I’ll start by explaining some of our Cassandra learnings. There are many people in this room with Cassandra expertise, so we definitely encourage you to have a look on Github. It’s quite possible you’ll find something which can be improved.
Fashiolista started out as a hobby project. We were 4 guys working on product comparison. We were doing OK, but growth wasn’t spectacular. We noticed the rapidly growing fashion segment and tried to incorporate it. The first iteration on YouTellMe was a massive fail. Fortunately a few girls from the Amsterdam Fashion Institute helped us cover up our lack of fashion sense. We started with an empty sheet and designed a product around inspiration instead of search. Now at this point Fashiolista was just a hobby project, which we spent a few weeks on before launching it at TNW.
So we launched with a bang at TNW. We organized a mini fashion show on stage and clearly stood out from the other startups. But at this point Fashiolista was just a side project. We got a few hundred users and went back to work on our product comparison site.
The next week though, my co-founder Thijs called while I was shopping at the AH. All the graphs looked off, and the growth over the past days completely disappeared. All that remained on the graphs was a spike showing the current day. Turns out several Brazilian blogs and the teen magazine Capricho posted about Fashiolista. Within a few hours tens of thousands of users signed up for Fashiolista.
Over the past 2 years things have moved along rapidly. Currently we’re the second largest fashion community worldwide, with close to 1.5M members and massive monthly engagement.
And the team has also grown considerably
Users of Fashiolista install the so called “love button”. While browsing around the web they can use this button to add their favourite fashion finds to Fashiolista.
Once they click the button, we figure out the relevant image on the page and allow you to add it to your profile.
The find is added to your profile and other people can follow the items you love.
So a quick interlude about what we run. We’re a pretty standard Python/Django stack, similar to sites like Instagram and Pinterest.
This talk will focus on this page, the feed page. It shows the content by people you follow. When scaling a social site this is quite a tough problem to solve, since there is no easy way to shard the data.
Our feed setup went through 3 generations. We started out with PostgreSQL, moved to Redis and eventually settled on Cassandra. The topic of scaling feed systems is something which we can talk about for days. Today I won’t go into much detail, but definitely have a look at my post on Highscalability if you are building something similar.
Our first setup with PostgreSQL was really easy. It took 5 minutes to develop and kept on running smoothly till we had about 100M activities in the database.
We were using Redis for our caching needs. Building a push based feed system with Redis was really easy. It took only a few weeks to develop. It was fast, easy to set up and maintain. The push approach works by storing a small list for every user. When kayture loves something, this love is stored on the feed of all the people who follow her. The Redis approach worked really well, but storing everything in memory can become expensive really quickly.
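
The push approach described above can be sketched in plain Python. This is an in-memory illustration of fan-out on write, not Feedly's Redis implementation; the class name, the method names and the tiny feed cap are all assumptions made for the example.

```python
from collections import defaultdict, deque

MAX_FEED_LENGTH = 3  # hypothetical cap, kept tiny for the example


class PushFeeds:
    def __init__(self, max_length=MAX_FEED_LENGTH):
        # target user -> set of users following them
        self.followers = defaultdict(set)
        # user -> bounded list of activity ids, newest first
        self.feeds = defaultdict(lambda: deque(maxlen=max_length))

    def follow(self, user, target):
        self.followers[target].add(user)

    def add_activity(self, actor, activity_id):
        # Fan-out on write: one small write per follower.
        for follower in self.followers[actor]:
            # When the deque is full, the oldest entry falls off the end.
            self.feeds[follower].appendleft(activity_id)

    def read_feed(self, user):
        # Reading is a single lookup, no joins or sorting.
        return list(self.feeds[user])
```

Reads become trivial, but an actor with 115K followers triggers 115K writes per love, and every one of those small lists has to live in memory.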
We evaluated several options for replacing our Redis based approach. We looked at Cassandra, HBase and DynamoDB. We chose Cassandra because it has fewer moving parts, is supported by Datastax and is used by at least one other large startup for their feed system. In addition it’s trivial to add more capacity and the storage is very cost effective.
We’ve open sourced Feedly, which you can find on Github. This is great, because solving the scalability of your feed system is a lot of work, and it’s better to share this across multiple companies.
You can build newsfeed systems. Examples are your Facebook news feed, Twitter stream, Pinterest content etc. Alternatively you can also build notification systems, which are basically a simpler version of the newsfeed problem.
Which language are you guys using? Java? Python? Ruby? Node? PHP?
Pycassa is reliable, but uses the old Thrift API and doesn’t support CQL. Almost all examples still use Pycassa, but it’s not very future compatible.
CQLEngine is an ORM for writing CQL. It’s a great piece of code, but it relies on the old CQL adapter module.
Python-Driver is where all the development effort of the Datastax guys is. They say it’s not ready for production, but it’s already a really good beta.
- It uses the native binary protocol
- The client is smart, saving you a few roundtrips
- You can use prepared statements
- You can run your queries async
We forked CQLEngine and added support for Python-Driver, have a look at Github: https://github.com/tbarbugli/cqlengine
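
As a rough sketch of how prepared statements and async queries combine, the function below reads several feeds concurrently. It relies on the prepare and execute_async methods of the DataStax Python driver's Session as documented for current releases; the beta discussed in the talk may have differed. The function name and the limit default are assumptions, and the session is assumed to be connected to the keyspace holding the timeline_flat table shown on the slides.

```python
def fetch_feeds(session, feed_ids, limit=20):
    # Prepare once so the server parses the query a single time.
    select = session.prepare(
        "SELECT activity_id, actor, verb, object FROM timeline_flat "
        "WHERE feed_id = ? ORDER BY activity_id DESC LIMIT ?"
    )
    # Send every request without waiting for the previous one to finish.
    futures = {
        feed_id: session.execute_async(select, (feed_id, limit))
        for feed_id in feed_ids
    }
    # Block only when collecting the results.
    return {feed_id: list(future.result()) for feed_id, future in futures.items()}
```

Because all requests are in flight at once, total latency is closer to that of the slowest single query than to the sum of all of them.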
Another thing we didn’t expect was the high CPU load Cassandra generates when importing data. When we tried the import with only a few nodes, they would often go down. The solution was to run a huge number of nodes during import and subsequently scale back down.
When importing the 300M loves we used 4 techniques to import as fast as possible.
- First of all we’re using Python-Driver, which has excellent performance
- Secondly we used batch queries
- Batch queries on their own can actually be slower than regular queries, due to their atomic by default behaviour. To further improve speeds you want to use UNLOGGED batch queries
- Last we used prepared statements to remove a bit of query parsing overhead
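
The import techniques listed above could look roughly like this with the DataStax Python driver. BatchStatement, BatchType.UNLOGGED and Session.prepare are taken from the driver's current documentation; the function names, the batch size of 100 and the column list are assumptions for illustration, not the settings used for the actual import.

```python
from itertools import islice


def chunked(iterable, size):
    # Yield lists of at most `size` items until the input is exhausted.
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def import_activities(session, rows, batch_size=100):
    # Imported lazily so the chunking helper works without the driver installed.
    from cassandra.query import BatchStatement, BatchType

    # Prepared once, so each row only sends its bound values.
    insert = session.prepare(
        "INSERT INTO timeline_flat (feed_id, activity_id, actor, verb, object) "
        "VALUES (?, ?, ?, ?, ?)"
    )
    for chunk in chunked(rows, batch_size):
        # UNLOGGED skips the batch log that makes default batches atomic.
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for row in chunk:
            batch.add(insert, row)
        session.execute(batch)
```

Each row is expected to be a 5-tuple matching the column order of the INSERT statement.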
Completely denormalized approach. We evaluated a more normalized approach, but:
- The performance is worse as you’ll often hit many nodes
- It doesn’t fit as naturally with Cassandra as there are no transactions
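
To make the trade-off concrete, here is an in-memory analogy for a table keyed by PRIMARY KEY (feed_id, activity_id). It is a teaching sketch, not how Cassandra stores data internally; the class and method names are assumptions. Every feed holds a full copy of each activity, so a read touches exactly one partition.

```python
from collections import defaultdict


class TimelineFlat:
    def __init__(self):
        # feed_id (partition key) -> {activity_id (clustering key): columns}
        self.partitions = defaultdict(dict)

    def insert(self, feed_id, activity_id, **columns):
        # Writing the same primary key again overwrites, like a Cassandra upsert.
        self.partitions[feed_id][activity_id] = columns

    def latest(self, feed_id, limit=20):
        # A feed read only looks inside a single partition.
        partition = self.partitions.get(feed_id, {})
        newest_first = sorted(partition, reverse=True)[:limit]
        return [(activity_id, partition[activity_id]) for activity_id in newest_first]
```

A normalized variant would store only activity ids per feed and look up the activity bodies separately, which is where the extra node hops mentioned above come from.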
https://github.com/tbarbugli/cqlengine
