DOs and DONTs on the way to 10M users

1,992 views

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,992
On SlideShare
0
From Embeds
0
Number of Embeds
75
Actions
Shares
0
Downloads
36
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

DOs and DONTs on the way to 10M users

  1. 1. DOs and DON'Ts <br />on the way to over 10M users…<br />IGT – <br />Event<br />July 2011<br />
  2. 2. Wix Initial Architecture<br />Server<br />Public Sites Serving<br />Sites Editing API<br />User Authentication<br />User Media upload, searches<br />Public Media upload, management, searches<br />Template Galleries management<br />Database<br />Wix Sites<br />WixML Pages<br />User Authentication<br />Users Media<br />Public Media<br />Template Galleries<br />
  3. 3. Wix Initial Architecture<br />Tomcat, Hibernate, Custom web framework<br />Everything generated from HBM files<br />Built for fast development<br />Statefull login (tomcat session), EHCache, File uploads<br />Not considering performance, scalability, fast feature rollout, testing<br />It reflected the fact that we didn’t really know what is our business<br />Yoav A. - “it is great for a first stage start-up, but you will have to replace it within 2 years”<br />Nadav A, after two years - “you were right, however, you failed to mention how hard it’s gonna be”<br />
  4. 4. Wix Initial Architecture<br />What we have learned <br />Don’t worry about ‘building it right from the start’ – you won’t<br />You are going to replace stuff you are building in the initial stages of a startup or any project<br />Be ready to do it<br />Get it up to customers as fast as you can. Get feedback. Evolve.<br />Our mistake was not planning for gradual re-write<br />Build for gradual re-write as you learn the problems and find the right solutions<br />
  5. 5. Distributed Cache<br />Next we added EHCache as Hibernate 2nd-level cache<br />Why? <br />Cause it is in the design<br />How was it?<br />Black Box cache<br />How do we know what is the state of our system?<br />How to invalidate the cache?<br />When to invalidate it?<br />How does operations manage the cache?<br />Did we really need it?<br />No<br />We eventually dropped it<br />
  6. 6. Distributed Cache<br />What we have learned <br />You don’t need a Cache<br />Really, you don’t<br />Cache is not part of an architecture. It is a means to make something more efficient<br />Architect while ignoring the caching. Introduce caching only as needed to solve real performance problems.<br />When introducing a cache, think about <br />cache management – how do you know what is in the cache? How do you find invalid data in the cache?<br />Invalidation – who invalidates the cache? When? Why?<br />Cache Reset – can your architecture stand a cache restart?<br />
  7. 7. Two years passed<br />We learned what our business is – building websites<br />We started selling premium websites<br />
  8. 8. Two years passed<br />Our architecture evolved<br />We added a separate Billing segment<br />We moved static file storage and serving to a separate instance<br />But we started seeing problems<br />Updates to our server imposed complete wix downtime<br />Our static storage reached 500 GByte of small files, the limit of Bash scripts<br />The server codebase became large and entangled. Feature rollout became harder over time, requiring longer and longer manual regression<br />Strange full-table scans queries generated by Hibernate, which we still have no idea what code is responsible for…<br />Statefull user sessions required a lot of memory and a statefull load balancer<br />BillingDB<br />Billing<br />(Tomcat)<br />Server<br />(Tomcat)<br />DB<br />statics<br />(Lighttpd)<br />Media storage<br />
  9. 9. Editor & Public Segments<br />The Challenge - Updates to our Server imposed downtime for our customer’s websites<br />Any Server or Database update has the potential of bringing down all Wix sites<br />Is a symptom of a larger issue<br />The Server served two different concerns<br />Wix Users editing websites<br />Viewing Wix Sites, the sites created by the Wix editor<br />The two concerns require a different SLA<br />Wix Sites should never ever have a downtime!<br />Wix Sites should work as fast as possible, always! <br />However, an editing system does not require this level of SLA. <br />The two concerns evolve independently <br />Releases of Editing feature should have no impact on existing Wix sites operations!<br />
  10. 10. Editor & Public Segments<br />Our Solution<br />Split the Server into two Segments – Public and Editor<br />The Public segment targets serving websites for Wix Users<br />Has mostly read-only usage pattern – only updated on when a site is published<br />Simple publishing system<br />Simple means it is easier to have higher SLA <br />Simple and read-only means it is easier to make DRP<br />Built using Spring MVC 2.5 and JDBC (without Hibernate)<br />MySQL used as NoSQL – single large table with XML text fields<br />The Editor segment <br />Exposes the Wix Editing APIs, as well as user account and galleries management APIs.<br />Has different release schedule compared to the Public segment<br />Public<br />(Tomcat)<br />Public <br />DB<br />Editor<br />(Tomcat)<br />Editor <br />DB<br />
  11. 11. Editor & Public Segments<br />What we have learned<br />Architecture is inspired by aspects such as<br />SLA<br />Release Cycles – deployment flexibility<br />Separate Segments for discrete concerns<br />Editing (Editor Segment)<br />Publishing (Public Segment)<br />modularity – in terms of breaking the architecture to services / modules / sub-systems (or any other name)<br />Enabler for gradual re-write<br />Enabler for continues delivery<br />Simplifies QA, Operations & Release Cycles<br />Public<br />(Tomcat)<br />Public <br />DB<br />Editor<br />(Tomcat)<br />Editor <br />DB<br />
  12. 12. Editor & Public Segments<br />What we have learned<br />MySQL is a damn good NoSQLengine<br />Our public DB is (mainly) one table with 10M entries<br />Queries are by primary key<br />Updates are by primary key<br />Instead of relations, we use text/xml columns<br />Immutable Data<br />Architect your data to be immutable<br />MySql sequential keys cause problems<br />Locks on key generation<br />Require a single instance to generate keys<br />We use GUID keys<br />Can be generated by any client<br />No locks in key value generation<br />Enabler for Master-Master replication<br />Public<br />(Tomcat)<br />Public <br />DB<br />Editor<br />(Tomcat)<br />Editor <br />DB<br />
  13. 13. Authentication Segment<br />The Challenge – Statefull user sessions<br />Makes it harder to separate software to different segments<br />Requires shared library or single sign-on<br />Requires statefull load balancer<br />Takes more memory<br />Our solution<br />Dedicated authentication segment – authentication is a specific concern<br />Cookie based login – solves all the above problems<br />Implement stronger security solution only where really required<br />What we have learned<br />Do not use statefull sessions – they don’t scale (easily)<br />Use cookie based login – this is the standard on the web today<br />
  14. 14. The Challenge – Our static storage reached over 500 GByte of small files<br />The “upload to app server, copy to static server, serve by file server” pattern proved inefficient, slow and error prone<br />Disk IO became slow and inefficient as the number of files increased<br />We needed a solution we can grow with – <br />HTTP connections<br />number of files<br />We needed control over caching and Http headers<br />Our Solution<br />Prospero<br />You’ve heard about it in a previous presentation <br />
  15. 15. What we have learned<br />It was hard to make Prospero work right<br />But it was the right choice for us<br />Media files caching headers are critical<br />Max-age, ETag, if-modified-since, etc.<br />Think how to tune those parameters for media files, as per your specific needs<br />We tried using Amazon S3 as secondary storage<br />However, they proved unreliable<br />The S3 service had connection reliability issues<br />S3 have “lost” a 200M files, 30 Tbyte volume<br />We found that using a CDN in front of Prospero is very affective<br />
  16. 16. CDN<br />What we have learned<br />Use a CDN!<br />CDN acts as a great connection manager<br />We have CDN hit ratio’s of over 99.9%<br />Use the “Cache Killer” pattern<br />http://static.wix.com/client/css/viewer.css?v=327<br />Makes flushing files from the CDN redundant<br />Enabler for longer caching periods<br />There are many vendors<br />We started with 1 CDN vendor<br />We are now “playing” with multiple CDN vendors<br />Different CDN vendors have advantages at different geo<br />Tune HTTP Headers per CDN Vendor<br />CDN Vendors interpret HTTP heads differently<br />
  17. 17. Wix-Framework / TDD / DevOps<br />The Challenge<br />The server codebase became large and entangled<br />Feature rollout became harder over time, requiring longer and longer manual regression<br />The Solution<br />The Wix Framework + TDD / CI / CD / DevOps<br />TDD – Developer responsible for all automated tests<br />With QA Developers<br />CI / CD – automate build & deployment<br />Solve the build / test / deployment / rollback issues first<br />Get the project to production with minimal functionality<br />DevOps - Developer is responsible from Productto 10,000 active users<br />
  18. 18. Wix-Framework<br />The Wix Framework<br />Java, Spring, Jetty, MySQL, MongoDB<br />Spring MVC based<br />Adjustments for Flash<br />Flash imposes some restrictions on HTTP which require special handling<br />DevOps support<br />Built-in support for monitoring dashboard, configuration, usage tracking, profiling and Self-Test in every app server<br />Automated Deployment – using Chef and SouceChef<br />TDD Support<br />Unit-Testing helpers<br />Multi-browser Javascript Unit-Testing support, integrated with IDE<br />Integration Testing framework, allows running the same tests in embedded mode (for developer) and deployed in production like env<br />Embedded MySQL & Embedded MongoDB<br />
  19. 19. TDD / DevOps<br />Our development process<br />Developer writes code and Unit-Tests<br />Developer and QA Developer write integration tests (IT)<br />Build on developer machine<br />On check-in<br />Mark RC<br />Deploy to QA env<br />Mark GA<br />Deploy to production<br />Compile<br />Unit Tests<br />Embedded ITs<br />Run ITs<br />Compile<br />Unit Tests<br />Self Tests<br />Deploy – IT Env<br />RC<br />QA<br />Self Tests<br />Deploy – QA Env<br />Monitor<br />GA<br />Self Tests<br />Deploy – Production<br />Monitor<br />
  20. 20. Wix-Framework / TDD / DevOps<br />What we have learned<br />TDD / CI / CD / DevOps works<br />It enables us to release frequently<br />We release 2-3 times a day<br />With higher confidence in release quality<br />With reduced risk<br />It takes time and effort<br />Requires management commitment<br />Transforms the way R&D, Product and operations work<br />Requires investment in supporting tools – the Wix Framework<br />Infrastructure – Build Server, Artifactory, Maven (+Plugins), Chef, SourceChef, Testing environments (virtual boxes), Server Dashboard, New Relic.<br />It is worth the effort!<br />A Wix developer – “never again shell I work without automated testing”<br />
  21. 21. Some of our Tools<br />

×