Decomposing Twitter: Adventures in Service-Oriented Architecture

Video and slides synchronized, mp3 and slide download available at http://bit.ly/15yZeLf.

Jeremy Cloud discusses SOA at Twitter, approaches taken for maintaining high levels of concurrency, and briefly touches on some functional design patterns used to manage code complexity. Filmed at qconnewyork.com.

Jeremy Cloud is the Tech Lead of the Tweet Service team at @twitter, and oversees one of Twitter's core infrastructure services. His team is responsible for ensuring the scalability and robustness of the tweet read and write paths.

Transcript

  • 1. Decomposing Twitter QConNY 2013
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/twitter-soa
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Tech Lead of Tweet Service Team
  • 5. Backstory ⇢Exponential growth in traffic ⇢Cost of failure has gone up ⇢Success rate constantly improving ⇢Company has grown 10x over the last 3 years
  • 6. Tweets Per Day [chart: 2006 to 2013; y-axis gridlines at 200,000,000 and 400,000,000]
  • 7. Site Success Rate [chart: 2008 to 2013; y-axis from 99._% to 100%; annotations: World Cup, “off the monorail”, not a lot of traffic]
  • 8. [architecture diagram, labels: Routing, Presentation, Logic, Storage, Monorail, MySQL]
  • 9. Challenges ⇢storage I/O bottlenecks ⇢poor concurrency, runtime performance ⇢brittle ⇢too many cooks in the same kitchen ⇢lack of clear ownership ⇢leaky abstractions / tight-coupling
  • 10. Goals of SOA ⇢not just a web-stack ⇢isolate responsibilities and concerns ⇢site speed ⇢reliability, isolate failures ⇢developer productivity
  • 11. API Boundaries
  • 12. [architecture diagram, labels: Routing, Presentation, Logic, Storage, Monorail, MySQL, Tweet, Flock, Redis, Memcache Cache]
  • 13. [architecture diagram, labels: Routing, Presentation, Logic, Storage, MySQL, Tweet Store, Flock, Redis, Memcached Cache, TFE, Monorail, Tweet Service, User Service, Timeline Service, SocialGraph Service, DirectMessage Service, User Store, API, Web, Search, Feature-X, Feature-Y, HTTP, THRIFT, THRIFT*]
  • 14. Finagle ⇢service discovery ⇢load balancing ⇢retrying ⇢thread/connection pooling ⇢stats collection ⇢distributed tracing
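
Finagle wraps all of that behind a single Service[Req, Rep] abstraction. A minimal sketch using the current open-source Finagle API (the deck does not show code, and Twitter's internal clients were Thrift rather than HTTP; the host, label, and path below are placeholders):

```scala
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Method, Request, Response}
import com.twitter.util.{Await, Future}

object FinagleClientSketch {
  def main(args: Array[String]): Unit = {
    // A Service[Req, Rep] is just a function Req => Future[Rep]. Behind
    // newService, Finagle layers in name resolution for the destination
    // (which can point at a service-discovery system), load balancing,
    // connection pooling, stats collection, and distributed tracing.
    val client: Service[Request, Response] =
      Http.client.newService("api.example.com:80", "example-client")

    val request = Request(Method.Get, "/tweets/123")  // placeholder path
    val response: Future[Response] = client(request)  // non-blocking call

    // Block only at the very edge of the program; service code composes
    // the Future instead.
    println(Await.result(response).statusCode)
  }
}
```
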
  • 15. Future[T] ⇢async, non-blocking IO ⇢fixed number of threads ⇢very high levels of concurrency
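
In that model a request handler never blocks a thread on I/O; it composes Futures. A small illustration of the style (the types and lookup functions are invented for the example, not Twitter's actual interfaces):

```scala
import com.twitter.util.Future

object FutureSketch {
  case class Tweet(id: Long, userId: Long, text: String)
  case class User(id: Long, screenName: String)

  // Stand-ins for downstream calls; in the real system these would be
  // Finagle clients returning Futures from other services.
  def getTweet(id: Long): Future[Tweet] = Future.value(Tweet(id, 42L, "hello"))
  def getUser(id: Long): Future[User] = Future.value(User(id, "someone"))
  def getCounts(tweetId: Long): Future[Map[String, Int]] =
    Future.value(Map("retweets" -> 1, "favorites" -> 2))

  // Sequential dependency via flatMap (the tweet is needed before we know
  // which user to load); independent lookups run concurrently and are joined.
  def hydrate(tweetId: Long): Future[(Tweet, User, Map[String, Int])] =
    getTweet(tweetId).flatMap { tweet =>
      val userF = getUser(tweet.userId)
      val countsF = getCounts(tweet.id)
      userF.join(countsF).map { case (user, counts) => (tweet, user, counts) }
    }
}
```

Because nothing blocks while downstream calls are in flight, a fixed pool of threads can keep a very large number of such requests in progress at once.
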
  • 16. Tweet Service Writes ⇢daily: > 400 MM tweets ⇢avg: > 5000 TPS ⇢daily peak: 9000 TPS ⇢max peak: 33,388 TPS
  • 17. Tweet Service Reads ⇢daily: 350 Billion tweets ⇢avg: 4 MM TPS ⇢median latency: 7 ms ⇢p99 latency: 120 ms (bulk requests) ⇢server count: low hundreds ⇢cache: ~ 15 TB
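
As a sanity check, those read numbers are self-consistent: 350 billion reads per day divided by 86,400 seconds per day is roughly 4.05 million per second, matching the quoted 4 MM TPS average.
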
  • 18. [slide: user metadata, mentioned-user metadata, url metadata, media metadata, card metadata, geo metadata, counts]
  • 19. [slide: mentions, urls, pctd, rt/rp/fv, perspectivals, client-app, cards, conversation, spam, language, geo, media, reply]
  • 20. [architecture diagram, labels: Storage, Memcached Cache, Tweet Service, Upstream Services, Url Service, Media, Cards, Geo, Spam, User Service, Language Service, Tweets, TFlock, MySQL, Snowflake, Firehose, Search, Write-Only, Mail, Timeline Service, Tweet Reads/Writes]
  • 21. [diagram, labels: Tweet Service, Url Service, Media, Geo, Spam, User Service, Language, MySQL, HTTP API, Snowflake, Tweet Writes, Compose, TFlock, Timeline, Tweet Store, Memcached, Hosebird, Search, Mail, Replication, Store/Deliver]
  • 22. [diagram, labels: Tweet Service, TFlock, Timeline, Tweet Store, Memcached, Store, Hosebird, Search, Mail, Replication, Deliver, DeferredRPC]
  • 23. Timeline Fanout [diagram, labels: Tweet Service, Fanout, Redis (×3), Social Graph Service, Timeline Cache]
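
The diagram implies a fanout-on-write flow: when a tweet is stored, a fanout job looks up the author's followers in the Social Graph Service and prepends the tweet id to each follower's cached timeline in Redis. A hypothetical sketch of just that step (the interfaces are invented for illustration; a production fanout would be batched and bounded):

```scala
import com.twitter.util.Future

// Invented stand-ins for the Social Graph Service and the Redis-backed
// timeline cache shown in the diagram.
trait SocialGraphClient { def followerIds(userId: Long): Future[Seq[Long]] }
trait TimelineCache { def prepend(timelineOwner: Long, tweetId: Long): Future[Unit] }

class FanoutSketch(graph: SocialGraphClient, timelines: TimelineCache) {
  // Push the new tweet id onto every follower's home timeline cache.
  def fanout(authorId: Long, tweetId: Long): Future[Unit] =
    graph.followerIds(authorId).flatMap { followers =>
      Future.join(followers.map(f => timelines.prepend(f, tweetId)))
    }
}
```
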
  • 24. Hosebird [diagram, labels: Tweet Service, Streaming, Social Graph, Hosebird, Firehose (×3), User Stream (×3), Track/Follow (×3)]
  • 25. Top 3 Lessons Learned
  • 26. Lessons Learned #1 Incremental change eventually wins
  • 27. Step 1: Make the smallest possible change Step 2: Verify/tweak Step 3: GOTO Step 1
  • 28. Deploy Often and Incrementally ⇢Keep the changes in each deploy small ⇢Validate on a canary server
  • 29. Enabling a New Feature ⇢Dynamic configuration ⇢Deploy with feature turned off ⇢Turn on to 1%, verify, increment
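
The pattern behind that list is a runtime-adjustable gate consulted on every request, so the rollout percentage can move from 0 to 1 to 100 without a deploy. The deck only says "dynamic configuration"; the sketch below is a generic, hypothetical stand-in for whatever mechanism backed it, not any specific library's API:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.util.hashing.MurmurHash3

// Hypothetical feature gate; `percentage` would be refreshed at runtime from
// a dynamic configuration source rather than set by hand.
class FeatureGate(name: String) {
  private val percentage = new AtomicInteger(0)  // deploy with the feature off

  def setPercentage(p: Int): Unit = percentage.set(p.max(0).min(100))

  // Bucket by hashing the user id so the same users stay enabled as the
  // percentage grows, instead of flapping on every request.
  def enabledFor(userId: Long): Boolean = {
    val hash = MurmurHash3.stringHash(s"$name:$userId")
    val bucket = ((hash % 100) + 100) % 100
    bucket < percentage.get
  }
}

object Rollout {
  val newRenderPath = new FeatureGate("new_render_path")

  // Turn on to 1%, verify, then increment.
  def render(userId: Long): String =
    if (newRenderPath.enabledFor(userId)) renderNew(userId) else renderOld(userId)

  private def renderNew(userId: Long): String = "new path"
  private def renderOld(userId: Long): String = "old path"
}
```
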
  • 30. Bringing up a New Service ⇢Find something small to break off ⇢Validate correctness and capacity ⇢Slowly turn up ⇢Repeat
  • 31. Tweet Service Read-Path ⇢Reads from secondary services ⇢Reads from new API services ⇢Reads from Monorail
  • 32. Tweet Service Write-Path ⇢Moved store/deliver effects one-by-one ⇢Moved compose phase ⇢Switched from Monorail to new API service
  • 33. Moving an Effect ⇢Decide in calling service ⇢Pass flag with RPC call ⇢Slowly turn up
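
Concretely, the calling service consults a gate and passes its decision along with the RPC, so the downstream service performs the side effect only for requests that have been migrated. A hypothetical sketch (field and service names invented for illustration):

```scala
import com.twitter.util.Future

// The caller decides who owns the side effect and says so in the request.
case class PostTweetRequest(userId: Long, text: String, deliverToSearch: Boolean)

trait TweetService { def postTweet(req: PostTweetRequest): Future[Long] }

class TweetWriteCaller(tweets: TweetService, gate: Long => Boolean) {
  def post(userId: Long, text: String): Future[Long] = {
    // As the gate is turned up, more requests ask the new service to own the
    // effect, and the old code path stops performing it for those requests.
    val newServiceOwnsEffect = gate(userId)
    tweets.postTweet(PostTweetRequest(userId, text, deliverToSearch = newServiceOwnsEffect))
  }
}
```
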
  • 34. If Reliability Matters... Make big changes in small steps
  • 35. Top 3 Lessons Learned ⇢1) Incremental change eventually wins
  • 36. Lessons Learned #2 Integration testing is hard
  • 37. Scenario #1 Testing a change within a single service Difficulty: Easy to Medium
  • 38. Tap Compare [diagram, labels: TFE, Production, Dark, Log]
  • 39. Capture and Replay [diagram, labels: Client, Production, Staging, Scribe, Iago, Log]
  • 40. Forked Traffic [diagram, labels: Client, Production, Staging, Forward, Verifier, Dev]
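
All three setups come down to sending the same request to a trusted instance and to the instance under test, serving the trusted response, and comparing the two out of band. A minimal, hypothetical sketch of that comparison step using Finagle's Service type:

```scala
import com.twitter.finagle.Service
import com.twitter.util.Future

// Send each request to production and to a dark/staging instance; return the
// production response and log any mismatch asynchronously.
class TapCompare[Req, Rep](
    production: Service[Req, Rep],
    dark: Service[Req, Rep],
    logMismatch: (Req, Rep, Rep) => Unit)
  extends Service[Req, Rep] {

  def apply(req: Req): Future[Rep] = {
    val prodRep = production(req)
    val darkRep = dark(req)
    prodRep.onSuccess { p =>
      darkRep.onSuccess { d => if (p != d) logMismatch(req, p, d) }
    }
    prodRep // the dark instance never affects what the client sees
  }
}
```
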
  • 41. Scenario #2 Testing a change that spans multiple services Difficulty: Medium to Hard
  • 42. Challenges ⇢Need to launch multiple services with change ⇢Need to point those services at each other ⇢A single service may have many dependencies (memcache, kestrel, MySQL) ⇢Can’t run everything on one box
  • 43. Testing Changes [diagram: stacks API/Logic/Storage, Logic/Storage, Storage, labeled Easy, Hard, Hardest: the deeper the layer a change touches, the harder it is to test]
  • 44. Unsolved Problem Manual and tedious process
  • 45. Moral of the story Plan and build your integration testing framework early
  • 46. Top 3 Lessons Learned ⇢1) Incremental change eventually wins ⇢2) Plan integration testing strategy early
  • 47. Lessons Learned #3 Failure is always an option
  • 48. More Hops = More Failures
  • 49. Retries can be tricky ⇢Dogpile! ⇢Use a library like Finagle if possible ⇢Route around problems ⇢Idempotency is critical
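
Finagle has retry support built in, which is why the slide says to lean on a library; the hand-rolled sketch below just makes the two cautions concrete: retries must be bounded and backed off (otherwise they pile onto an already-struggling backend, the dogpile), and only idempotent operations are safe to replay. Names and signatures here are illustrative, not Finagle's retry API:

```scala
import com.twitter.util.{Duration, Future, JavaTimer, Timer}

object RetrySketch {
  implicit val timer: Timer = new JavaTimer(isDaemon = true)

  // Retry a bounded number of times with growing backoff, and only if the
  // caller has marked the operation idempotent: replaying a non-idempotent
  // write (e.g. "post this tweet") can duplicate it.
  def retry[A](attempts: Int, backoff: Duration, idempotent: Boolean)
              (op: () => Future[A]): Future[A] =
    op().rescue {
      case _ if idempotent && attempts > 1 =>
        Future.sleep(backoff).flatMap { _ =>
          retry(attempts - 1, backoff + backoff, idempotent)(op)
        }
    }

  // e.g. retry(3, Duration.fromMilliseconds(50), idempotent = true)(() => client(request))
}
```
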
  • 50. Plan for entire rack failure ⇢Over provision ⇢Mix services within rack ⇢Test this scenario in production
  • 51. Degrade gracefully ⇢Avoid SPOFs ⇢Overprovision, add redundancy ⇢Have a fallback strategy ⇢Dark-mode a feature instead of total failure
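
"Dark-mode a feature instead of total failure" maps naturally onto Future composition: give the optional dependency a tight timeout and a fallback, so when it misbehaves the response goes out degraded rather than erroring. A small sketch, with invented names and an illustrative timeout:

```scala
import com.twitter.util.{Duration, Future, JavaTimer, Timer}
import scala.util.control.NonFatal

object DegradeGracefully {
  implicit val timer: Timer = new JavaTimer(isDaemon = true)

  case class RenderedTweet(id: Long, text: String, counts: Option[Map[String, Int]])

  // Stand-in for an optional downstream dependency (e.g. a counts service).
  def fetchCounts(tweetId: Long): Future[Map[String, Int]] = ???

  // If counts are slow or failing, render the tweet without them rather than
  // failing the whole read path.
  def render(id: Long, text: String): Future[RenderedTweet] =
    fetchCounts(id)
      .within(Duration.fromMilliseconds(50))
      .map(c => RenderedTweet(id, text, Some(c)))
      .handle { case NonFatal(_) => RenderedTweet(id, text, counts = None) }
}
```
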
  • 52. Partial Failures ⇢More common ⇢Harder to handle ⇢Harder to imagine
  • 53. If Reliability Matters... Spend most of your time preparing for failures
  • 54. Top 3 Lessons Learned ⇢1) Incremental change eventually wins ⇢2) Integration testing is hard ⇢3) Failure is always an option
  • 55. @JoinTheFlock