Decomposing Twitter: Adventures in Service-Oriented Architecture

Video and slides synchronized, mp3 and slide download available at http://bit.ly/15yZeLf.

Jeremy Cloud discusses SOA at Twitter, approaches taken for maintaining high levels of concurrency, and briefly touches on some functional design patterns used to manage code complexity. Filmed at qconnewyork.com.

Jeremy Cloud is the Tech Lead of the Tweet Service team at @twitter, and oversees one of Twitter's core infrastructure services. His team is responsible for ensuring the scalability and robustness of the tweet read and write paths.

Decomposing Twitter: Adventures in Service-Oriented Architecture

1. Decomposing Twitter - QCon NY 2013
2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month. Watch the video with slide synchronization on InfoQ.com: http://www.infoq.com/presentations/twitter-soa
3. Presented at QCon New York (www.qconnewyork.com). Purpose of QCon: to empower software development by facilitating the spread of knowledge and innovation. Strategy: practitioner-driven conference designed for YOU, influencers of change and innovation in your teams; speakers and topics driving the evolution and innovation; connecting and catalyzing the influencers and innovators. Highlights: attended by more than 12,000 delegates since 2007; held in 9 cities worldwide.
4. Tech Lead of Tweet Service Team
5. Backstory ⇢Exponential growth in traffic ⇢Cost of failure has gone up ⇢Success rate constantly improving ⇢Company has grown 10x over the last 3 years
6. Tweets Per Day [chart: 2006-2013, with y-axis marks at 200,000,000 and 400,000,000 tweets per day]
7. Site Success Rate [chart: 2008-2013, climbing from 99._% toward 100%; annotations: "not a lot of traffic", World Cup, "off the monorail"]
8. [diagram: Routing / Presentation / Logic / Storage layers - the Monorail, backed by MySQL]
9. Challenges ⇢storage I/O bottlenecks ⇢poor concurrency, runtime performance ⇢brittle ⇢too many cooks in the same kitchen ⇢lack of clear ownership ⇢leaky abstractions / tight coupling
10. Goals of SOA ⇢not just a web-stack ⇢isolate responsibilities and concerns ⇢site speed ⇢reliability, isolate failures ⇢developer productivity
11. API Boundaries
12. [diagram: Routing / Presentation / Logic / Storage - the Monorail and MySQL, with Tweet, Flock, Redis, and a Memcache cache split out alongside]
13. [diagram: the decomposed stack - Routing: TFE; Presentation: Monorail, API, Web, Search, Feature-X, Feature-Y; Logic: Tweet Service, User Service, Timeline Service, SocialGraph Service, DirectMessage Service; Storage: MySQL, Tweet Store, User Store, Flock, Redis, Memcached cache; protocols: HTTP, Thrift, Thrift*]
14. Finagle ⇢service discovery ⇢load balancing ⇢retrying ⇢thread/connection pooling ⇢stats collection ⇢distributed tracing
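Finagle's core abstraction is a `Service[Req, Rep]`, essentially a function from a request to a `Future` of a response, with concerns like retries, stats, and tracing layered on as filters. The sketch below illustrates that composition only; the `GetTweet`/`Tweet` types and the hand-rolled retry filter are hypothetical, and Finagle's built-in retry, load-balancing, and service-discovery modules are far more capable than this.

```scala
import com.twitter.finagle.{Service, SimpleFilter}
import com.twitter.util.Future

object FinagleSketch {
  // Hypothetical request/response types, for illustration only.
  case class GetTweet(id: Long)
  case class Tweet(id: Long, text: String)

  // A Finagle service is just a function: Req => Future[Rep].
  val tweetStore: Service[GetTweet, Tweet] = new Service[GetTweet, Tweet] {
    def apply(req: GetTweet): Future[Tweet] =
      Future.value(Tweet(req.id, "hello world")) // stand-in for a real backend call
  }

  // Cross-cutting behavior composes in as filters. This hand-rolled retry is a
  // sketch; Finagle ships its own retry, stats, and tracing filters.
  class NaiveRetryFilter[Req, Rep](retries: Int) extends SimpleFilter[Req, Rep] {
    def apply(req: Req, service: Service[Req, Rep]): Future[Rep] = {
      def attempt(remaining: Int): Future[Rep] =
        service(req).rescue { case _ if remaining > 0 => attempt(remaining - 1) }
      attempt(retries)
    }
  }

  // Stacking a filter in front of a service yields another Service, so callers
  // are oblivious to the machinery between them and the backend.
  val robustTweetStore: Service[GetTweet, Tweet] =
    new NaiveRetryFilter[GetTweet, Tweet](retries = 2).andThen(tweetStore)
}
```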
15. Future[T] ⇢async, non-blocking IO ⇢fixed number of threads ⇢very high levels of concurrency
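The point of `Future[T]` is that composition registers callbacks instead of blocking threads, which is how a fixed-size thread pool can sustain very high concurrency. A minimal sketch using Twitter's `com.twitter.util.Future`, with hypothetical `fetchTweet`/`fetchAuthor` lookups standing in for real service calls:

```scala
import com.twitter.util.Future

object FutureSketch {
  // Hypothetical lookups standing in for real async service calls.
  def fetchTweet(id: Long): Future[String]       = Future.value("tweet " + id)
  def fetchAuthor(tweet: String): Future[String] = Future.value("@someone")

  // flatMap chains dependent calls without ever blocking a thread: each step
  // registers a callback and returns immediately, so a fixed-size pool can
  // drive a very large number of in-flight requests.
  val authorOfTweet: Future[String] =
    fetchTweet(42L).flatMap(tweet => fetchAuthor(tweet))

  // Independent calls run concurrently and are joined when both complete.
  val both: Future[(String, String)] =
    fetchTweet(1L).join(fetchTweet(2L))
}
```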
16. Tweet Service Writes ⇢daily: > 400 MM tweets ⇢avg: > 5000 TPS ⇢daily peak: 9000 TPS ⇢max peak: 33,388 TPS
17. Tweet Service Reads ⇢daily: 350 billion tweets ⇢avg: 4 MM TPS ⇢median latency: 7 ms ⇢p99 latency: 120 ms (bulk requests) ⇢server count: low hundreds ⇢cache: ~15 TB
18. [diagram: data attached to a tweet - user metadata, mentioned-user metadata, url metadata, media metadata, card metadata, geo metadata, counts]
19. [diagram: tweet fields and annotations - mentions, urls, pctd, rt/rp/fv, perspectivals, client-app, cards, conversation, spam, language, geo, media, reply]
20. [diagram: Tweet Service reads/writes - upstream services (Url Service, Media, Cards, Geo, Spam, User Service, Language Service); storage (Memcached cache, Tweets, TFlock, MySQL, Snowflake); write-only outputs (Firehose, Search, Mail, Timeline Service)]
21. [diagram: Tweet Writes - HTTP API into the Tweet Service; the Compose phase uses Snowflake, Url Service, Media, Geo, Spam, User Service, Language, and MySQL; the Store/Deliver phase writes to TFlock, Timeline, Tweet Store, Memcached, Hosebird, Search, Mail, and Replication]
22. [diagram: Tweet Service Store/Deliver - Store: TFlock, Timeline, Tweet Store, Memcached; Deliver (via DeferredRPC): Hosebird, Search, Mail, Replication]
23. Timeline Fanout [diagram: Tweet Service -> Fanout -> multiple Redis instances forming the Timeline Cache, using follower lists from the Social Graph Service]
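The fanout slide suggests a simple shape: ask the social graph for a user's followers, then write the new tweet id into each follower's cached timeline. The sketch below captures that shape only; the `SocialGraphService` and `TimelineCache` interfaces are hypothetical stand-ins, since the real internal APIs are not public.

```scala
import com.twitter.util.Future

// Hypothetical client interfaces; the real social-graph and timeline-cache
// APIs are internal to Twitter.
trait SocialGraphService { def followerIds(userId: Long): Future[Seq[Long]] }
trait TimelineCache      { def prepend(userId: Long, tweetId: Long): Future[Unit] }

object FanoutSketch {
  // Look up followers, then push the new tweet id into each follower's cached
  // timeline (one Redis-backed list per user in the talk's diagram).
  def fanout(socialGraph: SocialGraphService, timelines: TimelineCache,
             authorId: Long, tweetId: Long): Future[Unit] =
    socialGraph.followerIds(authorId).flatMap { followers =>
      Future.collect(followers.map(f => timelines.prepend(f, tweetId))).unit
    }
}
```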
24. Hosebird [diagram: the Tweet Service and Social Graph feed Hosebird nodes serving Firehose, User Stream, and Track/Follow streams]
25. Top 3 Lessons Learned
26. Lessons Learned #1: Incremental change eventually wins
27. Step 1: Make the smallest possible change. Step 2: Verify/tweak. Step 3: GOTO Step 1
28. Deploy Often and Incrementally ⇢Keep the changes in each deploy small ⇢Validate on a canary server
29. Enabling a New Feature ⇢Dynamic configuration ⇢Deploy with feature turned off ⇢Turn on to 1%, verify, increment
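A percentage-based gate backed by dynamic configuration is what makes "turn on to 1%, verify, increment" possible without redeploying. The sketch below is a hypothetical, hard-coded stand-in for such a gate (the real mechanism reloads values at runtime); `composeViaNewService` and `composeViaMonorail` are placeholder call sites.

```scala
import scala.util.Random

object FeatureGateSketch {
  // Hypothetical stand-in for dynamic configuration: in a real system the
  // percentage is reloaded at runtime rather than hard-coded here.
  @volatile var percentages: Map[String, Int] = Map("new_write_path" -> 0)

  def enabled(feature: String): Boolean =
    Random.nextInt(100) < percentages.getOrElse(feature, 0)

  // Placeholder call sites: deploy with the gate at 0%, then ramp
  // 1% -> 10% -> 100% while watching success rate and latency.
  def composeTweet(text: String): Unit =
    if (enabled("new_write_path")) composeViaNewService(text)
    else composeViaMonorail(text)

  private def composeViaNewService(text: String): Unit = ()
  private def composeViaMonorail(text: String): Unit   = ()
}
```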
30. Bringing up a New Service ⇢Find something small to break off ⇢Validate correctness and capacity ⇢Slowly turn up ⇢Repeat
31. Tweet Service Read-Path ⇢Reads from secondary services ⇢Reads from new API services ⇢Reads from Monorail
32. Tweet Service Write-Path ⇢Moved store/deliver effects one-by-one ⇢Moved compose phase ⇢Switched from Monorail to new API service
33. Moving an Effect ⇢Decide in calling service ⇢Pass flag with RPC call ⇢Slowly turn up
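Moving an effect one call site at a time works because the caller decides and the decision travels with the RPC. A sketch with hypothetical request and service types:

```scala
import com.twitter.util.Future

// Hypothetical request shape: the caller decides (behind a ramped-up gate)
// whether this service should perform the "deliver to search" effect, and the
// decision travels with the RPC.
case class WriteTweetRequest(text: String, deliverToSearch: Boolean)

trait SearchIndex { def index(text: String): Future[Unit] }

class TweetWriteSketch(search: SearchIndex) {
  def write(req: WriteTweetRequest): Future[Unit] = {
    val stored = store(req.text)
    // Perform the effect only when the caller opted in; otherwise the old
    // path (the Monorail) is still responsible for it.
    if (req.deliverToSearch) stored.flatMap(_ => search.index(req.text))
    else stored
  }

  private def store(text: String): Future[Unit] = Future.Done // stand-in
}
```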
34. If Reliability Matters... Make big changes in small steps
35. Top 3 Lessons Learned ⇢1) Incremental change eventually wins
36. Lessons Learned #2: Integration testing is hard
37. Scenario #1: Testing a change within a single service. Difficulty: Easy to Medium
38. Tap Compare [diagram: TFE sends each request to Production and to a Dark instance; results are logged for comparison]
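Tap compare answers "does the new code return the same thing as the old code?" on live traffic without letting the new code affect users. A sketch of the idea with hypothetical wiring (the production and dark instances, and the mismatch log, are placeholders):

```scala
import com.twitter.finagle.Service
import com.twitter.util.Future

// Sketch of the tap-compare idea: every request is answered by production, a
// copy goes to a "dark" instance running the new code, and mismatches are
// logged for offline inspection.
class TapCompare[Req, Rep](production: Service[Req, Rep],
                           dark: Service[Req, Rep],
                           logMismatch: (Req, Rep, Rep) => Unit)
    extends Service[Req, Rep] {

  def apply(req: Req): Future[Rep] = {
    val prodRep = production(req)
    val darkRep = dark(req) // fired and forgotten; never affects the caller
    prodRep.join(darkRep).onSuccess { case (p, d) =>
      if (p != d) logMismatch(req, p, d)
    }
    prodRep
  }
}
```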
39. Capture and Replay [diagram: Client -> Production, with requests logged via Scribe; Iago replays the log against Staging]
40. Forked Traffic [diagram: Client -> Production; traffic is forwarded to Staging and Dev, with a Verifier comparing responses]
41. Scenario #2: Testing a change that spans multiple services. Difficulty: Medium to Hard
42. Challenges ⇢Need to launch multiple services with change ⇢Need to point those services at each other ⇢A single service may have many dependencies (memcache, kestrel, MySQL) ⇢Can't run everything on one box
43. Testing Changes [diagram: API / Logic / Storage stack - changes at the API layer are Easy to test, changes in the Logic layer are Hard, changes in the Storage layer are Hardest]
44. Unsolved Problem: Manual and tedious process
45. Moral of the story: Plan and build your integration testing framework early
46. Top 3 Lessons Learned ⇢1) Incremental change eventually wins ⇢2) Plan integration testing strategy early
47. Lessons Learned #3: Failure is always an option
48. More Hops = More Failures
49. Retries can be tricky ⇢Dogpile! ⇢Use a library like Finagle if possible ⇢Route around problems ⇢Idempotency is critical
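The "dogpile" warning is about unbounded retries: when a backend slows down, naive retries multiply its load exactly when it can least afford it, and retrying non-idempotent writes can duplicate them. A minimal sketch of a bounded, idempotency-aware retry; real systems (and Finagle itself) additionally use backoff and global retry budgets.

```scala
import com.twitter.util.Future

object RetrySketch {
  // Bounded retries, applied only to operations the caller has marked
  // idempotent. Unbounded retries are what turn a slow backend into a
  // dogpile; retrying non-idempotent writes can duplicate them.
  def withRetries[A](op: () => Future[A], idempotent: Boolean, budget: Int = 2): Future[A] =
    op().rescue {
      case _ if idempotent && budget > 0 => withRetries(op, idempotent, budget - 1)
    }
}
```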
50. Plan for entire rack failure ⇢Over-provision ⇢Mix services within rack ⇢Test this scenario in production
51. Degrade gracefully ⇢Avoid SPOFs ⇢Overprovision, add redundancy ⇢Have a fallback strategy ⇢Dark-mode a feature instead of total failure
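"Dark-mode a feature instead of total failure" translates directly into a fallback on the Future: if an optional decoration fails, drop it and keep serving the core content. A sketch with a hypothetical `cardFor` lookup:

```scala
import com.twitter.util.Future

object DegradeSketch {
  // Hypothetical lookup for an optional "card" decoration on a tweet.
  def cardFor(tweetId: Long): Future[Option[String]] = Future.value(Some("card"))

  // If the decoration fails, render without it instead of failing the request.
  def renderTweet(tweetId: Long, text: String): Future[String] =
    cardFor(tweetId)
      .handle { case _ => None } // degrade: drop the card on any error
      .map {
        case Some(card) => text + " [" + card + "]"
        case None       => text
      }
}
```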
52. Partial Failures ⇢More common ⇢Harder to handle ⇢Harder to imagine
53. If Reliability Matters... Spend most of your time preparing for failures
54. Top 3 Lessons Learned ⇢1) Incremental change eventually wins ⇢2) Integration testing is hard ⇢3) Failure is always an option
55. @JoinTheFlock
