Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Skype's Journey from P2P: It's Not Just about the Services

49 views

Published on

Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2wzjaD2.

Bruce Lowekamp discusses the evolution of Skype's architecture and tradeoffs in design made along the way. He discusses the lessons learned and the improvements that are still in process as Skype continues to evolve. Filmed at qconnewyork.com.

Bruce Lowekamp is Principal Architect for Skype's Cloud Infrastructure Portfolio at Microsoft. He enjoys the challenge of building highly available, low-latency global systems using commodity cloud services and has even more fun understanding failures. He is co-author of several IETF RFCs in the area of P2PSIP and NAT traversal.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Skype's Journey from P2P: It's Not Just about the Services

  1. 1. Skype’s Journey From P2P: It’s Not Just About the Services Bruce Lowekamp People and Connections June 27, 2018 Microsoft’s Intelligent Conversations and Communications Cloud (IC3) Powering Skype, Teams, and O365
  2. 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ skype-p2p
  3. 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  4. 4. Skype History • First released in 2003 • P2P, based on Global Index originally used for KaZaa file sharing • Chat, audio, video, file sharing, contact invites all over P2P • Acquired by Microsoft in 2011 • Supernodes moved to datacenters • Chat moved to (evolution of) Messenger chat service • Calling, file-transfer, contacts, etc moved to new services • P2P network officially being decommissioned in Fall 2018
  5. 5. Outline • Original P2P architecture • P2P compared to Modern service architecture • Why not P2P? • Migrating from old to new architectures • Doing it well: Experimentation at massive client scale
  6. 6. Skype P2P Architecture P2P Network formed by clients Backend team running mostly DB-based services Shared Library with clients (data structures, etc) Services were thin shim on top of sharded PG SQL PG bouncer: Transparently sharded stored procedures LUX + DUB
  7. 7. P2P Contact Invites Search for users across SNs Send invite (signed) to target via P2P Receive signed ack with secret. Update local and feed to other nodes Lazy sync to backend
  8. 8. High Availability in P2P P2P Network implements HA • Invites easily sent when both clients online Backend forwards P2P invite • When invitee offline Operation completed by clients Changes to contact list lazily synced to DB AP CPCAP Theorem • P2P Network is AP • BE DBs are CP
  9. 9. Breaking apart P2P Contact Changes “Changes lazily synced to DB” Sequence of changes sent to clients and DB DB syncs to clients Eventually all DB and clients see same result AP CP CRDT? “J” in JCS was for Journaled
  10. 10. Distributed Service vs P2P Architecture Clients Service Distributed DB Storage Contacts Contacts Contacts Contacts CP CP
  11. 11. Why not P2P? Desktop apps no longer dominant Servers cheap Need to support mobile Offline messaging, suggestions, server-side search, browser state Business logic (and service implementation) in clients, not services Can still do P2P media and E2E encryption in service-based systems
  12. 12. Migrations Supernodes->Dedicated Supernodes->Trouter Chat: P2P -> P2P+Griffin -> Messenger -> New Chat Service Contacts: CBL->JCS->ABCH->PCS->EXO Calling: P2P -> NGC Login: Skype -> MSA
  13. 13. Dual-head vs Gateway Contacts P2P Calling New Calling P2P Calling New Contacts Contacts Service Contacts Gateway
  14. 14. Dual-Stack: Calling and Chat P2P Calling New Calling P2P Calling Calling Service Call Alice Call Alice Chat Service Alice: Hi Bob! Bob: Hi Alice! New Chat Alice: Hi Bob! Bob: Hi Alice!
  15. 15. Technology gateway: Dual-headed with Help P2P requires clients running continuously Mobile devices don’t… P2P Calling P2P GW P2P Calling New Calling Call Alice Call Alice Push Notifications
  16. 16. Gateway for Contact migrations Contacts New Contacts Contacts Service Contacts Gateway Contacts Contacts Migration 1 Move Contact Data Migration 2 Update Client
  17. 17. Contact migrations Contacts New Contacts Contacts Service Contacts Gateway Get Contacts Get Contacts? Get Contacts? Flags: Migration in Progress Migrated Flag: Is Master Write Blobs to Cache
  18. 18. When to migrate? Contacts New Contacts Contacts Service Contacts Gateway Contacts Migration 1 Move Contact Data Migration 2 Update Client
  19. 19. Need for Online Experimentation Even objective metrics are a function of • Product quality • Seasonal/weekly effects • User population • Device population • Usage scenario These aren’t stable across new client releases Need robust online experimentation to separate new calling implementation from other factors. 6/26/2018 MICROSOFT CONFIDENTIAL Lync + Skype 17 Early Adopter Bias Seasonality, Overall Trends
  20. 20. Experimentation – When to use A/B Testing? When to use A/B testing: • Making a data-driven decision about the impact of a change in the product How is A/B testing different from "monitoring metrics before and after a change": • A/B testing is the only valid method to draw causal inference – i.e. the changes in metric behavior cannot be attributed to any particular change in code unless in a randomized treatment assignment (A/B testing) setting Why set up automated scorecards vs manually aggregating data into test statistics: • To make sure the results are trustworthy – it is easy to be misled by data! • To scale the experimentation so you don’t need a data scientist for every single experiment analysis • To have a standard procedure that controls the rates of false positive/negative in long run over the entire org First step for getting started on experimentation: • Data! • Decide about which metrics are to be used for tracking the improvements - they should be aligned with T0 KPIs of the org • Make the data available for querying with experimentation labels (e.g. knowing which each calls fell into) • Link data to a validated scorecard
  21. 21. Experimentation Lifecycle
  22. 22. Experimentation Lifecycle, Client Edition
  23. 23. Experimentation Requirements Many teams • Self-service • Structured Config Configuration-centric • Long-lived clients know what, not why High-quality scorecards • A&E Experimentation team evolved out of Bing Experimentation and Configuration Service (ECS) was built to address the flighting and configuration portion of experimentation.
  24. 24. Configuration-Centric View Straightforward approach gives the client configuration describing its situation, and client decides what to do. ECSApplication Presents Client Context Relevant Configurations ClientLib ConfigValueA = ClientLib.GetSettings(“Shutdown. A”) ?? ClientLib.GetSettings(“Region.A”) ?? ClientLib.GetSettings(“Rollout.A”)
  25. 25. Configuration-Centric View But reasons to change behavior interact Resolving these collision manually and statically is not scalable IF Ver>2.0 && 80% THEN A=5 IF Version>1.0 THEN A=3 IF Country=Australia THEN A=4 IF Shutdown THEN A=0 IF NOT Shutdown AND Country != Australia AND Version>2.0 && 80% THEN A=5 IF NOT Shutdown AND Country != Australia AND !(Version > 2.0 && 80%) AND Version>1.0 THEN A=3 IF NOT Shutdown AND Country=Australia THEN A=4 IF Shutdown THEN A=0
  26. 26. Configuration-Centric View ...becomes a Live-site issue What if the Australia setup needs to be turned off? It is more manageable to disable the precise setup IF Ver>2.0 && 80% THEN A=5 IF Version>1.0 THEN A=3 IF Country=Australia THEN A=4 IF Shutdown THEN A=0 IF NOT Shutdown AND Country != Australia AND Version>2.0 && 80% THEN A=5 IF NOT Shutdown AND Country != Australia AND !(Version > 2.0 && 80%) AND Version>1.0 THEN A=3 IF NOT Shutdown AND Country=Australia THEN A=4 IF Shutdown THEN A=0
  27. 27. Configuration-Centric View Applications are made to be Configurable Applications should only be concerned on What it should be configured to, not Why ECSApplication Presents Client Context Relevant Configurations ClientLib ConfigValueA = ClientLib.GetSettings(“A”)
  28. 28. Configuration-Centric View And the reason to configure will be many As the number of reasons scale, the reasons will collide Need Tie-breaking Rules ECSApplication Presents Client Context Relevant Configurations ClientLib ConfigValueA = ClientLib.GetSettings(“A”) Many Reasons to Configure: • Experimentation • Feature Rollouts to X% • Regional Settings • Exposure to User/Tenants (Murphy/Rings) • Live-site assistance • Traffic Routing • Sampling • Any combinations (e.g. 5% of Ring 2 in Europe) • and many more…
  29. 29. Configuration-Centric View ECS configuration approach is to provide a set of Tie-breaking rules for users, but let the service resolve the collision dynamically ECSApplication Presents Client Context Relevant Configurations ClientLib ConfigValueA = ClientLib.GetSettings(“A”) Value of A INPUT: Version = 3.0, Country = US, Shutdown = false, UserID=myuser OUTPUT: 5 IF Ver>2.0 and 80% THEN A=5 IF Version>1.0 THEN A=3 IF Country=Australia THEN A=4 IF Shutdown THEN A=0
  30. 30. Configuration-Centric View ECS Application Presents Client Context Relevant Configurations ClientLib ConfigValueA = ClientLib.GetSettings(“A”) Experiment Rollout Ring-Based Sampling Configuration Prioritization (Config Merge, Layer Order, Priority Order) Shutdown Default External ……
  31. 31. No Client-Service Contract Change Example: Configuration with Rings • Decoupling who the user is from how the application is configured • No Client-Server contract change. No Mobile re-deployment for Rings Application ECS Resolves User to Ring X Presents Client Context (UserID, TenantID) Relevant Configurations for Ring X ClientLib +Cache Ring Definition (ECS) Ring Definition (Partner) Translate to empower ECS Ring Filters
  32. 32. ConfigID Identify each experiment, rollout, default Needed for debugging and analysis <Type-ExpID-TreatID-Iteration>
  33. 33. Configuration Merge Experiment Config “SkypeAndroid": { "ShortCircuit": true } Rollout Config “SkypeAndroid": { "PhoneVerification": false, “ShortCircuit”: false } Merged Config “SkypeAndroid": { "ShortCircuit": true, "PhoneVerification": false }
  34. 34. ETag ETag is a hash of the set of ConfigIDs being served ETag-ConfigID mapping is forwarded to data pipeline by ECS service Client Telemetry is logged with the ETag Data Analysis to associate telemetry with an iteration of the treatment • Client Telemetry.Etag • Data Analysis.ConfigID • Service log: Etag-ConfigID Map Also useful for debugging client implementation
  35. 35. Impression-based vs Sticky Conventional experimentation: • Numberline assigns user to experiment. Experiment is sticky. • Analyze impact over time What if your experiment is more risky? • Next-gen code frequently known not to be better (yet) • Still need to get real-world experiment • “Impression-based” assign at random each fetch • No one gets broken experience for more than an hour/restart
  36. 36. Importance of Scorecards Changes in important metrics ALL metrics, not just intended by experiment P-values of changes to confirm caused by experiment Unanswered call UX experiment - Higher ratio of established calls - BUT, Call Drop Ratio is up by 0.07% overall, caused by PSTN drops Likely explanation: retrying a failed call on PSTN isn’t useful on a bad network Experiments can have unexpected consequences on other scenarios. A scorecard capturing important metrics across all scenarios is needed to find unintended consequences. PSTN Calls only
  37. 37. ECS Today Scale (as of 6/8/18) 479 Project Teams Currently running: Experiments 388 Rollouts 2.74K Defaults 701 12.69K Complex Configs 3.83K layers (uniquely salted numberline) ~140K RPS (daily peak) Used by Skype & Teams clients and services. Most Office apps, etc…
  38. 38. Lessons from Skype’s evolution P2P • Architecture is different, but same HA principles can be achieved • Solved problems originally, but became a bottleneck over time Migrations • Config support plans for your next migration in advance • Pick strategy based on complexity of transition Experimentation • Migration (and other changes) require robust experimentation • Don’t bake in experiments: What not Why!
  39. 39. Acknowledgements Many, many people at Skype and Microsoft built the systems described here and implemented the strategies to migrate users to newer systems. Special thanks to Eric Lau, Michael Rubin, Daniel Schneider, and the ECS team. The E2E experimentation pipeline includes major components developed by the Aria, A&E EXP, IC3 Media, and other partner teams. bruce.lowekamp@skype.net https://linkedin.com/in/brucelowekamp
  40. 40. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ skype-p2p

×