Scaling Wix with microservices and multi-cloud - 2015

Many small startups build their systems on top of a traditional toolset like Tomcat, Hibernate, and MySQL. These tools are chosen because they make development easy and progress fast, but the resulting systems are often monolithic and have limited scalability. So as a startup grows, the team is confronted with the problem of how to evolve the system and make it scalable. Facing the same dilemma, Wix grew from 0 to 70 million users in just a few years and met some interesting challenges along the way, notably performance and availability. Traditional performance solutions, such as caching, would not help because of a very long tail problem, which makes caching highly inefficient. And because every minute of downtime means customers lose money, the product needed to have near 100% availability. Solving these issues required some out-of-the-box thinking, and this talk discusses some of the strategies: building a highly performant, highly available and highly scalable system, and leveraging microservices architecture and multi-cloud platforms to help build a very efficient and cost-effective system.

  1. 1. @aviranm Aviran Mordo, Head of Engineering. Scaling with Microservices Architecture & Multi-cloud Platforms
  2. 2. @aviranm
  3. 3. @aviranm Wix in Numbers Over 72M users (website builders) Static storage is >2 PB of data 3 data centers + 3 clouds (Google, Amazon, Azure) 2B HTTP requests/day 1000 people work at Wix
  4. 4. @aviranm Initial Architecture Built for fast development Stateful login (Tomcat session), Ehcache, file uploads No consideration for performance, scalability and testing Intended for short-term use Tomcat, Hibernate, custom web framework Lighttpd (file serving) MySQL DB Wix (Tomcat)
  5. 5. @aviranm The Monolithic Giant One monolithic server that handled everything Dependency between features Changes in unrelated areas of the system caused deployment of the whole system Failure in unrelated areas caused system-wide downtime
  6. 6. @aviranm Breaking the System Apart
  7. 7. @aviranm Concerns and SLA Data validation Security / Authentication Data consistency Lots of data Edit websites High availability High performance Lots of static files Very high traffic volume Viewport optimization Long tail (immutable) Serving Media High availability High performance High traffic volume Long tail (mutable) View sites, created by Wix editor
  8. 8. @aviranm Wix Segmentation 1. Editor Segment 2. Media Segment 3. Public Segment Networking
  9. 9. @aviranm HTML Editor Flash Editor MSM Private Media Public Media Editor Segment Public Segment Premium Services eCommerce List DB App Builder App Store App Market Dashboard Statics/media Mailer TimeZone Public HTML API Public API (Flash) MSP Public Server HTML Renderer HTML SEO Renderer Flash Renderer Flash SEO Renderer Sitemap Renderer Robots.txt Renderer User Server Template Viewer ContactsHUB Activity Site Members Provided Mailing Service Comments Snapshoter User Pref Feed Me Shout-out Hotels PETRI Site Pref Dist Logger Slicer eCom Renderer eCom Cart eCom Checkout eCom Catalog eCom Orders Payment Facade Account Info HTML API HTML Embeder Blog Mobile
  10. 10. @aviranm It is all about
  11. 11. @aviranm Microservices Guidelines Each service has its own DB schema (if one is needed) Only one service should write to a specific DB table(s) There may be additional read-only services that directly access the DB (for performance reasons) Services are stateless No DB transactions Cache is not a building block, but an optimization
  12. 12. @aviranm Microservices Tradeoffs. Each service has its own DB schema (if one is needed). Gain: easy to scale microservices based on service-level concerns. Tradeoff: system complexity, performance. Only one service should write to a specific DB table(s). Gain: decoupled architecture, faster development. Tradeoff: system complexity, performance. There may be additional read-only services that access the DB. Gain: performance. Tradeoff: coupling. Services are stateless. Gain: easy to scale out (just add more servers). Tradeoff: performance, consistency. No DB transactions. Gain: better DB performance, easier to scale. Tradeoff: system complexity.
  13. 13. @aviranm 1. Editor Segment
  14. 14. @aviranm Editor Server Immutable JSON pages (~3M / day) Site revisions Active – standby MySQL cross datacenters Editor Server MySQL Active Sites MySQL Archive
  15. 15. @aviranm
  16. 16. @aviranm Protect The Data DB outage with fast recovery = replication Data poisoning/corruption = revisions / backup Make the data available at all times = data distribution to multiple locations / providers
  17. 17. @aviranm Browser Editor Server GCS MySQL Active Sites MySQL Archive Saving Editor Data WixMedia (Amazon) WixMedia (Google) Save Page(s) 200 OK Upload Save Page DC replication Notify MySQL Archive MySQL Active Sites S3 WixMedia (DC-1)
  18. 18. @aviranm Browser Editor Server GCS MySQL Active Sites MySQL Archive WixMedia (Amazon) WixMedia (Google) Save Page(s) 200 OK Upload Save Page DC replication Notify MySQL Archive MySQL Active Sites S3 WixMedia (DC-1) Self Healing Process
  19. 19. @aviranm No DB Transactions Save each page (JSON) as an atomic operation Page ID is a content based hash (immutable/idempotent) Finalize transaction by sending site header (list of pages) Can generate orphaned pages, not a problem in practice
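The save flow on slide 19 can be sketched roughly as follows. This is a minimal illustration, not the Editor Server's actual code: the in-memory dictionaries, the function names, and the choice of SHA-256 over a canonical JSON encoding are all assumptions made for the example. It shows why content-based IDs make each page save idempotent, and why writing the site header last acts as the commit step.

```python
import hashlib
import json

# Hypothetical in-memory stores standing in for the page and site-header tables.
pages = {}         # page_id -> serialized page JSON
site_headers = {}  # site_id -> ordered list of page ids


def save_page(page: dict) -> str:
    """Store one page under a hash of its content; safe to retry."""
    # Canonical encoding so equal pages always produce equal bytes.
    body = json.dumps(page, sort_keys=True, separators=(",", ":"))
    page_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    pages[page_id] = body  # same content always maps to the same key
    return page_id


def save_site(site_id: str, site_pages: list) -> list:
    """Save every page first, then write the header as the final step."""
    page_ids = [save_page(p) for p in site_pages]
    # Readers only follow ids listed in the header, so a failure before this
    # line leaves at most orphaned pages, never a half-updated site.
    site_headers[site_id] = page_ids
    return page_ids
```

If a save is interrupted before the header is written, the already stored pages are simply never referenced, which matches the slide's note that orphaned pages are not a problem in practice.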
  20. 20. @aviranm 2. Media Segment (WixMP)
  21. 21. @aviranm Wix Media Platform (WixMP) Eventually consistent distributed file system (2PB user media files) Dynamic media processing Multi datacenter aware Automatic fallback cross DC Runs on commodity servers & cloud
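  22. 22. @aviranm [Diagram] Prospero – Wix Media Manager: get image.jpg from the CDN; if not in CDN, the request goes to the media servers, with a first fallback and a second fallback across the Austin, Google Cloud and Amazon locations (diagram labels: x36, x36, x32)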
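The "automatic fallback cross DC" idea from the media slides can be sketched as an ordered list of locations that are tried one after another. This is only an illustration under stated assumptions: the function name, the source names, and the rule that an unreachable location counts as a miss are hypothetical, and the slides do not say in which order the real locations are tried.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher returns the file bytes, or None when the location does not have it.
Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(path: str,
                        sources: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each source in order and return the first one that has the file."""
    for name, fetch in sources:
        try:
            data = fetch(path)
        except Exception:
            data = None  # treat an unreachable location like a miss
        if data is not None:
            return name, data
    raise FileNotFoundError(path)
```

A caller would pass the locations in preference order, for example the primary data center first and the cloud copies after it. Because the system is eventually consistent, a miss in one location does not mean the file is gone, only that another copy should be tried.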
  23. 23. @aviranm 3. Public Segment
  24. 24. @aviranm Public Segment Roles Routing (resolve URLs) Dispatching (to a renderer) Rendering (HTML,XML,TXT) Public Server HTML Renderer HTML SEO Renderer Flash Renderer Sitemap Renderer Robots.txt Renderer Flash SEO Renderer
  25. 25. @aviranm Public SLA Our goal: 99% response time <100ms at peak traffic
  26. 26. @aviranm Publish Site Publish site header (a map of pages for a site) Publish routing table Publish site header / routes (CQRS) Editor Segment Public Segment
  27. 27. @aviranm Built For Speed Minimize out-of-process hops (2 DB, 1 RPC) Lookup tables are cached in memory, updated every few minutes Denormalized data – optimize for read by primary key (MySQL) Minimize business logic
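One way to picture "lookup tables are cached in memory, updated every few minutes" is a small wrapper that reloads a table once it is older than a fixed interval. The sketch below is an assumption-laden illustration: the class name, the lazy reload on read, and the injectable clock are choices made for the example, and the slides do not describe how Wix actually refreshes its tables (a background refresh would work just as well).

```python
import time
from typing import Callable, Dict, Optional


class CachedLookup:
    """Keep a lookup table in memory and reload it after a fixed interval."""

    def __init__(self, load: Callable[[], Dict[str, str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load          # hypothetical loader, e.g. one DB query
        self._ttl = ttl_seconds
        self._clock = clock
        self._table = load()
        self._loaded_at = clock()

    def get(self, key: str) -> Optional[str]:
        # Reload lazily once the table is older than the refresh interval;
        # all other reads are served from memory with no out-of-process hop.
        if self._clock() - self._loaded_at >= self._ttl:
            self._table = self._load()
            self._loaded_at = self._clock()
        return self._table.get(key)
```

The tradeoff is staleness: a routing change can take up to one refresh interval to become visible, which is acceptable only for data that changes rarely.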
  28. 28. @aviranm How a Page Gets Rendered: bootstrap HTML template that contains only data; only JavaScript imports; JSON data (site-header + dynamic data); no “real” HTML view
  29. 29. @aviranm Offload rendering work to the browser
  30. 30. @aviranm The average Intel Core i750 can push up to 7 GFLOPS without overclocking
  31. 31. @aviranm Why JSON? Easy to parse in JavaScript and Java/Scala Fairly compact text format Highly compressible (5:1 even for small payloads) Easy to fix rendering bugs and cross-browser issues (just deploy new code)
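The compressibility claim is easy to check on your own payloads. The snippet below is a small measuring sketch, not a reproduction of the 5:1 figure from the slide: the sample document shape is invented for illustration, zlib stands in for whatever compression the HTTP layer applies, and the ratio you get depends entirely on the data.

```python
import json
import zlib


def compression_ratio(payload: dict) -> float:
    """Return uncompressed size divided by zlib-compressed size."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    packed = zlib.compress(raw)
    return len(raw) / len(packed)


# Hypothetical site-header-like document with repetitive structure.
sample = {
    "pages": [
        {"id": "page-%d" % i, "title": "Page %d" % i,
         "components": ["header", "text", "footer"]}
        for i in range(20)
    ]
}

print("ratio: %.1f:1" % compression_ratio(sample))
```

JSON tends to compress well because key names and structural characters repeat across every object in the document.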
  32. 32. @aviranm Minimum Number of Public Servers Needed to Serve 66M Sites: 4
  33. 33. @aviranm Public SLA Be Available 99.999%
  34. 34. @aviranm Serving a Site – Sunny Day Archive CDN WixMP Browser Store HTML to cache HTTP Request Notify site view LB Public Renderer HTML Resources / Media HTTP Request
  35. 35. @aviranm Serving a Site – DC Lost Archive CDN WixMP Browser LB Public Renderer LB Public Renderer Change DNS HTTP Request
  36. 36. @aviranm Serving a Site – Public Lost Archive Browser LB Public Renderer Get Cached HTML Version HTML HTTP Request LB Public Renderer Fallback to 2nd DC
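The "sunny day" and "public lost" slides together suggest a pattern: store the rendered HTML while things work, and serve the stored copy when the renderer is unavailable. The sketch below illustrates that pattern only. The function name, the dictionary standing in for the Archive, and the decision to fall back on any exception are assumptions; the slides do not say which component performs the fallback.

```python
from typing import Callable, Dict


def serve_page(url: str, render: Callable[[str], str],
               archive: Dict[str, str]) -> str:
    """Render a page, keeping the last good HTML as a fallback copy."""
    try:
        html = render(url)
    except Exception:
        # Renderer unavailable: fall back to the archived copy if one exists.
        if url in archive:
            return archive[url]
        raise
    archive[url] = html  # "sunny day" path: remember the fresh HTML
    return html
```

A visitor may see a slightly stale page during an outage, which is usually a better outcome than an error page for a published site.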
  37. 37. @aviranm Living in the Browser CDN WixMP Browser LB Public Renderer Editor Pages Fallback JSON / Media HTML HTTP Request Fallback
  38. 38. @aviranm Summary Identify concerns and SLA for different parts of the system Build redundancy in critical path (for availability) De-normalize data (for performance) Minimize out-of-process hops (for performance) Take advantage of client’s CPU power
  39. 39. @aviranm
  40. 40. @aviranm @WixEng We’re hiring Q&A