CASSANDRA @ INSTAGRAM 2016
Dikang Gu
Software Engineer @ Facebook
ABOUT ME
• @dikanggu
• Software Engineer
• Instagram core infra, 2014 — present
• Facebook Data Infra, 2012 — 2014
AGENDA
1 Overview
2 Improvements
3 Challenges
OVERVIEW
OVERVIEW
Cluster Deployment
• Cassandra Nodes: 1,000+
• Data Size: 100s of TeraBytes
• Ops/sec: in the millions
• Largest Cluster: 100+
• Regions: multiple
OVERVIEW
• Client: Python/C++/Java/PHP
• Protocol: mostly thrift, some CQL
• Versions: 2.0.x - 2.2.x
• Use LCS for most tables.
TEAM
USE CASE 1
Feed
PUSH
When posting, we push the media information to the followers' feed store.
When reading, we fetch the feed ids from the viewer's feed store.
USE CASE 1
Feed
• Write QPS: 1M+
• Avg/P99 Read Latency: 20ms/100ms
• Data Model:

user_id —> List(media_id)
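
The feed mapping above, expressed as a minimal CQL sketch for illustration only (the deck notes most access goes through Thrift, and the table and column names here are hypothetical):

    -- Hypothetical CQL shape of the feed store: one partition per follower,
    -- clustered by media_id so a viewer's feed can be read as a single slice.
    CREATE TABLE user_feed (
        user_id  bigint,
        media_id bigint,
        PRIMARY KEY (user_id, media_id)
    ) WITH CLUSTERING ORDER BY (media_id DESC)
      AND compaction = {'class': 'LeveledCompactionStrategy'};

    -- Push on write: fan the new media id out to each follower's partition.
    INSERT INTO user_feed (user_id, media_id) VALUES (?, ?);

    -- On read: fetch the feed ids for the viewer.
    SELECT media_id FROM user_feed WHERE user_id = ? LIMIT 100;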
USE CASE 2
Metadata store
Applications use C* as a key-value store: they store a list of blobs associated with a key, and do point or range queries at read time.
USE CASE 2
Metadata store
• Read/Write QPS: 100K+
• Avg read size: 50KB
• Avg/P99 Read Latency: 7ms/50ms
• Data Model:

user_id —> List(Blob)
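
A minimal CQL sketch of that shape, assuming a sortable per-blob item id (names are illustrative, not the production schema):

    -- Hypothetical CQL shape of the metadata store: a list of blobs per key,
    -- supporting both point and range reads within the same partition.
    CREATE TABLE metadata (
        user_id bigint,
        item_id timeuuid,
        payload blob,
        PRIMARY KEY (user_id, item_id)
    ) WITH compaction = {'class': 'LeveledCompactionStrategy'};

    -- Point query: a single blob under the key.
    SELECT payload FROM metadata WHERE user_id = ? AND item_id = ?;

    -- Range query: a slice of blobs under the key.
    SELECT payload FROM metadata WHERE user_id = ? AND item_id >= ? AND item_id < ?;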
USE CASE 3
Counter
Applications issue bump/get counter operations for each user request.
USE CASE 3
Counter
• Read/Write QPS: 50K+
• Avg/P99 Read Latency: 3ms/50ms
• C* 2.2
• Data Model:

some_id —> Counter
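
As a sketch, the same model in CQL counter-table form (C* 2.2 counters; the table and column names are made up for illustration):

    -- Hypothetical counter table: every non-key column must be a counter.
    CREATE TABLE counters (
        some_id bigint PRIMARY KEY,
        value   counter
    );

    -- Bump: counters are only modified through UPDATE.
    UPDATE counters SET value = value + 1 WHERE some_id = ?;

    -- Get.
    SELECT value FROM counters WHERE some_id = ?;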
IMPROVEMENTS
1. PROXY NODES
PROXY NODE
Problem
• Thrift client, NOT token aware
• Data node coordinates the requests
• High latency and timeouts when a data node is hot.
PROXY NODE
Solution
• join_ring: false
• act as coordinator
• do not store data locally
• client only talks to proxy node
• 2X latency drop
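
For reference, a coordinator-only node is typically brought up with the join_ring startup property disabled; a minimal sketch (the cassandra-env.sh placement is an assumption, and operational details vary by version):

    # cassandra-env.sh on the proxy/coordinator node (sketch):
    # the node gossips and coordinates client requests, but never joins the
    # ring, owns no token ranges, and stores no data locally.
    JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"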
(CASSANDRA-9258)
2. PENDING RANGES
PENDING RANGES
Problem
• CPU usage +30% when bootstrapping new nodes.
• Client request latency jumps and timeouts
• Multimap<Range<Token>, InetAddress> PendingRange
• Inefficient O(n) pendingRanges lookup per request
PENDING RANGES
Solution
• CASSANDRA-9258
• Use two NavigableMaps to implement the pending ranges
• We can expand or shrink the cluster without affecting requests
• Thanks to Branimir Lambov for patch review and feedback
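
To illustrate the idea (a standalone sketch, not the actual Cassandra patch): keying pending (start, end] token ranges by their end token in a NavigableMap turns the per-request lookup into an O(log n) ceiling search. The sketch assumes non-wrapping, non-overlapping ranges; the real change also has to cover wrap-around ranges, hence the second map.

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch of the CASSANDRA-9258 idea: index pending (start, end] token
    // ranges by end token, so the endpoint owning a token is found with one
    // ceilingEntry() call instead of scanning every pending range.
    public class PendingRangeIndex {
        static final class PendingRange {
            final long start, end;      // (start, end]
            final String endpoint;
            PendingRange(long start, long end, String endpoint) {
                this.start = start; this.end = end; this.endpoint = endpoint;
            }
        }

        private final NavigableMap<Long, PendingRange> byEnd = new TreeMap<>();

        public void add(long start, long end, String endpoint) {
            byEnd.put(end, new PendingRange(start, end, endpoint));
        }

        // Returns the pending endpoint whose range contains the token, or null.
        public String pendingEndpointFor(long token) {
            Map.Entry<Long, PendingRange> e = byEnd.ceilingEntry(token);
            if (e == null) return null;
            PendingRange r = e.getValue();
            // A (start, end] range contains the token iff start < token <= end.
            return (r.start < token && token <= r.end) ? r.endpoint : null;
        }

        public static void main(String[] args) {
            PendingRangeIndex idx = new PendingRangeIndex();
            idx.add(0, 100, "10.0.0.1");
            idx.add(100, 200, "10.0.0.2");
            System.out.println(idx.pendingEndpointFor(150)); // 10.0.0.2
            System.out.println(idx.pendingEndpointFor(250)); // null
        }
    }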
(CASSANDRA-6908)
3. DYNAMIC SNITCH
DYNAMIC SNITCH
• High read latency during peak time.
• Unnecessary cross region requests.
• dynamic_snitch_badness_threshold: 50
• 10X P99 latency drop
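
A sketch of the corresponding cassandra.yaml setting: raising the badness threshold keeps reads pinned to the preferred (local) replicas unless they score dramatically worse, which cuts the unnecessary cross-region requests.

    # cassandra.yaml (sketch)
    # Default is 0.1; a much larger value effectively stops the dynamic
    # snitch from rerouting reads away from the closest replicas.
    dynamic_snitch_badness_threshold: 50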
4. COMPACTION
COMPACTION IMPROVEMENTS
• Track the write amplification. (CASSANDRA-11420)
• Optimize the overlapping lookup. (CASSANDRA-11571)
• Optimize the isEOF() checking. (CASSANDRA-12013)
• Avoid searching for column index. (CASSANDRA-11450)
• Persist last compacted key per level. (CASSANDRA-6216)
• Compact tables before making available in L0. (CASSANDRA-10862)
5. BIG HEAP SIZE
BIG HEAP SIZE
• 64G max heap size
• 16G new gen size
• -XX:MaxTenuringThreshold=6
• Young GC every 10 seconds
• Avoid full GC
• 2X P99 latency drop
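
In cassandra-env.sh terms, the slide's settings look roughly like the following sketch (only the values called out above; the remaining GC flags depend on the collector in use):

    # cassandra-env.sh (sketch)
    MAX_HEAP_SIZE="64G"
    HEAP_NEWSIZE="16G"
    # Let objects age through six young collections before promotion, so
    # short-lived request state dies in the new gen instead of filling the
    # old gen and triggering full GCs.
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=6"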
(CASSANDRA-10406)
6. NODETOOL REBUILD RANGE
NODETOOL REBUILD
• rebuild may fail for nodes with TBs of data
• CASSANDRA-10406
• support rebuilding the failed token ranges
• Thanks to Yuki Morishita for reviewing
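
A usage sketch of the post-CASSANDRA-10406 interface; the flag names and range syntax below are quoted from memory of the patched nodetool help and should be checked against `nodetool help rebuild` on your version, and the keyspace, ranges, and DC name are purely illustrative:

    # Re-stream only the token ranges that failed, instead of restarting the
    # whole multi-TB rebuild from scratch.
    nodetool rebuild -ks my_keyspace -ts "(1000,2000],(5000,6000]" -- us-east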
CHALLENGES
PERFORMANCE
P99 Read Latency
High P99 read latency on the C* nodes, and even higher on the client side.
PERFORMANCE
Compaction has difficulty catching up,
which impacts read latency.
PERFORMANCE
Compaction uses too much CPU (40%+)
PERFORMANCE
Tombstone
SCALABILITY
Gossip: nodes see an inconsistent ring
(CASSANDRA-11709, CASSANDRA-11740)
FEATURES
Counter: problems with repair
(CASSANDRA-11432, CASSANDRA-10862)
SSTables in each level: [966/4, 20/10, 152/100, 33, 0, 0, 0, 0, 0]
CLIENT
Access C* from different languages
(Diagram: multiple services in front of the Cassandra cluster)
OPERATIONS
Cluster expansion takes a long time:
15 days to bootstrap 30 nodes
RECAP
Improvements
• Proxy Node
• Pending Ranges
• Dynamic Snitch
• Compaction
• Big heap size
• Nodetool rebuild token range
Challenges
• P99 Read latency
• Compaction
• Tombstone
• Gossip
• Counter
• Client
• Cluster expansion
QUESTIONS?
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016

At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly, which creates a lot of challenging data needs. We use Cassandra heavily, as a general key-value store. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements/patches we made to make sure Cassandra can meet our low-latency, high-scalability requirements; and some pain points we have.

About the Speaker
Dikang Gu, Software Engineer, Facebook

I'm a software engineer on the Instagram core infra team, working on scaling Instagram's infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS at Facebook. I received my master's degree in Computer Science from Shanghai Jiao Tong University in China.
