DOs and DON'Ts on the way to over 10M users…IGT – EventJuly 2011
Wix Initial ArchitectureServerPublic Sites ServingSites Editing APIUser AuthenticationUser Media upload, searchesPublic Media upload,   management, searchesTemplate Galleries   managementDatabaseWix SitesWixML PagesUser AuthenticationUsers MediaPublic MediaTemplate Galleries
Wix Initial ArchitectureTomcat, Hibernate, Custom web frameworkEverything generated from HBM filesBuilt for fast developmentStatefull login (tomcat session), EHCache, File uploadsNot considering performance, scalability, fast feature rollout, testingIt reflected the fact that we didn’t really know what is our businessYoav A. - “it is great for a first stage start-up, but you will have to replace it within 2 years”Nadav A, after two years - “you were right, however, you failed to mention how hard it’s gonna be”
Wix Initial ArchitectureWhat we have learned Don’t worry about ‘building it right from the start’ – you won’tYou are going to replace stuff you are building in the initial stages of a startup or any projectBe ready to do itGet it up to customers as fast as you can. Get feedback. Evolve.Our mistake was not planning for gradual re-writeBuild for gradual re-write as you learn the problems and find the right solutions
Distributed CacheNext we added EHCache as Hibernate 2nd-level cacheWhy? Cause it is in the designHow was it?Black Box cacheHow do we know what is the state of our system?How to invalidate the cache?When to invalidate it?How does operations manage the cache?Did we really need it?NoWe eventually dropped it
Distributed CacheWhat we have learned You don’t need a CacheReally, you don’tCache is not part of an architecture. It is a means to make something more efficientArchitect while ignoring the caching. Introduce caching only as needed to solve real performance problems.When introducing a cache, think about cache management – how do you know what is in the cache? How do you find invalid data in the cache?Invalidation – who invalidates the cache? When? Why?Cache Reset – can your architecture stand a cache restart?
Two years passedWe learned what our business is – building websitesWe started selling premium websites
Two years passedOur architecture evolvedWe added a separate Billing segmentWe moved static file storage and serving to a separate instanceBut we started seeing problemsUpdates to our server imposed complete wix downtimeOur static storage reached 500 GByte of small files, the limit of Bash scriptsThe server codebase became large and entangled. Feature rollout became harder over time, requiring longer and longer manual regressionStrange full-table scans queries generated by Hibernate, which we still have no idea what code is responsible for…Statefull user sessions required a lot of memory and a statefull load balancerBillingDBBilling(Tomcat)Server(Tomcat)DBstatics(Lighttpd)Media storage
Editor & Public SegmentsThe Challenge - Updates to our Server imposed downtime for our customer’s websitesAny Server or Database update has the potential of bringing down all Wix sitesIs a symptom of a larger issueThe Server served two different concernsWix Users editing websitesViewing Wix Sites, the sites created by the Wix editorThe two concerns require a different SLAWix Sites should never ever have a downtime!Wix Sites should work as fast as possible, always! However, an editing system does not require this level of SLA. The two concerns evolve independently Releases of Editing feature should have no impact on existing Wix sites operations!
Editor & Public SegmentsOur SolutionSplit the Server into two Segments – Public and EditorThe Public segment targets serving websites for Wix UsersHas mostly read-only usage pattern – only updated on when a site is publishedSimple publishing systemSimple means it is easier to have higher SLA Simple and read-only means it is easier to make DRPBuilt using Spring MVC 2.5 and JDBC (without Hibernate)MySQL used as NoSQL – single large table with XML text fieldsThe Editor segment Exposes the Wix Editing APIs, as well as user account and galleries management APIs.Has different release schedule compared to the Public segmentPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
Editor & Public SegmentsWhat we have learnedArchitecture is inspired by aspects such asSLARelease Cycles – deployment flexibilitySeparate Segments for discrete concernsEditing (Editor Segment)Publishing (Public Segment)modularity – in terms of breaking the architecture to services / modules / sub-systems (or any other name)Enabler for gradual re-writeEnabler for continues deliverySimplifies QA, Operations & Release CyclesPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
Editor & Public SegmentsWhat we have learnedMySQL is a damn good NoSQLengineOur public DB is (mainly) one table with 10M entriesQueries are by primary keyUpdates are by primary keyInstead of relations, we use text/xml columnsImmutable DataArchitect your data to be immutableMySql sequential keys cause problemsLocks on key generationRequire a single instance to generate keysWe use GUID keysCan be generated by any clientNo locks in key value generationEnabler for Master-Master replicationPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
Authentication SegmentThe Challenge – Statefull user sessionsMakes it harder to separate software to different segmentsRequires shared library or single sign-onRequires statefull load balancerTakes more memoryOur solutionDedicated authentication segment – authentication is a specific concernCookie based login – solves all the above problemsImplement stronger security solution only where really requiredWhat we have learnedDo not use statefull sessions – they don’t scale (easily)Use cookie based login – this is the standard on the web today
The Challenge – Our static storage reached over 500 GByte of small filesThe “upload to app server, copy to static server, serve by file server” pattern proved inefficient, slow and error proneDisk IO became slow and inefficient as the number of files increasedWe needed a solution we can grow with – HTTP connectionsnumber of filesWe needed control over caching and Http headersOur SolutionProsperoYou’ve heard about it in a previous presentation 
What we have learnedIt was hard to make Prospero work rightBut it was the right choice for usMedia files caching headers are criticalMax-age, ETag, if-modified-since, etc.Think how to tune those parameters for media files, as per your specific needsWe tried using Amazon S3 as secondary storageHowever, they proved unreliableThe S3 service had connection reliability issuesS3 have “lost” a 200M files, 30 Tbyte volumeWe found that using a CDN in front of Prospero is very affective
CDNWhat we have learnedUse a CDN!CDN acts as a great connection managerWe have CDN hit ratio’s of over 99.9%Use the “Cache Killer” patternhttp://static.wix.com/client/css/viewer.css?v=327Makes flushing files from the CDN redundantEnabler for longer caching periodsThere are many vendorsWe started with 1 CDN vendorWe are now “playing” with multiple CDN vendorsDifferent CDN vendors have advantages at different geoTune HTTP Headers per CDN VendorCDN Vendors interpret HTTP heads differently
Wix-Framework / TDD / DevOpsThe ChallengeThe server codebase became large and entangledFeature rollout became harder over time, requiring longer and longer manual regressionThe SolutionThe Wix Framework + TDD / CI / CD / DevOpsTDD – Developer responsible for all automated testsWith QA DevelopersCI / CD – automate build & deploymentSolve the build / test / deployment / rollback issues firstGet the project to production with minimal functionalityDevOps - Developer is responsible from Productto 10,000 active users
Wix-FrameworkThe Wix FrameworkJava, Spring, Jetty, MySQL, MongoDBSpring MVC basedAdjustments for FlashFlash imposes some restrictions on HTTP which require special handlingDevOps supportBuilt-in support for monitoring dashboard, configuration, usage tracking, profiling and Self-Test in every app serverAutomated Deployment – using Chef and SouceChefTDD SupportUnit-Testing helpersMulti-browser Javascript Unit-Testing support, integrated with IDEIntegration Testing framework, allows running the same tests in embedded mode (for developer) and deployed in production like envEmbedded MySQL & Embedded MongoDB
TDD / DevOpsOur development processDeveloper writes code and Unit-TestsDeveloper and QA Developer write integration tests (IT)Build on developer machineOn check-inMark RCDeploy to QA envMark GADeploy to productionCompileUnit TestsEmbedded ITsRun ITsCompileUnit TestsSelf TestsDeploy – IT EnvRCQASelf TestsDeploy – QA EnvMonitorGASelf TestsDeploy – ProductionMonitor
Wix-Framework / TDD / DevOpsWhat we have learnedTDD / CI / CD / DevOps worksIt enables us to release frequentlyWe release 2-3 times a dayWith higher confidence in release qualityWith reduced riskIt takes time and effortRequires management commitmentTransforms the way R&D, Product and operations workRequires investment in supporting tools – the Wix FrameworkInfrastructure – Build Server, Artifactory, Maven (+Plugins), Chef, SourceChef, Testing environments (virtual boxes), Server Dashboard, New Relic.It is worth the effort!A Wix developer – “never again shell I work without automated testing”
Some of our Tools
DOs and DONTs on the way to 10M users

DOs and DONTs on the way to 10M users

  • 1.
    DOs and DON'Ts onthe way to over 10M users…IGT – EventJuly 2011
  • 2.
    Wix Initial ArchitectureServerPublicSites ServingSites Editing APIUser AuthenticationUser Media upload, searchesPublic Media upload, management, searchesTemplate Galleries managementDatabaseWix SitesWixML PagesUser AuthenticationUsers MediaPublic MediaTemplate Galleries
  • 3.
    Wix Initial ArchitectureTomcat,Hibernate, Custom web frameworkEverything generated from HBM filesBuilt for fast developmentStatefull login (tomcat session), EHCache, File uploadsNot considering performance, scalability, fast feature rollout, testingIt reflected the fact that we didn’t really know what is our businessYoav A. - “it is great for a first stage start-up, but you will have to replace it within 2 years”Nadav A, after two years - “you were right, however, you failed to mention how hard it’s gonna be”
  • 4.
    Wix Initial ArchitectureWhatwe have learned Don’t worry about ‘building it right from the start’ – you won’tYou are going to replace stuff you are building in the initial stages of a startup or any projectBe ready to do itGet it up to customers as fast as you can. Get feedback. Evolve.Our mistake was not planning for gradual re-writeBuild for gradual re-write as you learn the problems and find the right solutions
  • 5.
    Distributed CacheNext weadded EHCache as Hibernate 2nd-level cacheWhy? Cause it is in the designHow was it?Black Box cacheHow do we know what is the state of our system?How to invalidate the cache?When to invalidate it?How does operations manage the cache?Did we really need it?NoWe eventually dropped it
  • 6.
    Distributed CacheWhat wehave learned You don’t need a CacheReally, you don’tCache is not part of an architecture. It is a means to make something more efficientArchitect while ignoring the caching. Introduce caching only as needed to solve real performance problems.When introducing a cache, think about cache management – how do you know what is in the cache? How do you find invalid data in the cache?Invalidation – who invalidates the cache? When? Why?Cache Reset – can your architecture stand a cache restart?
  • 7.
    Two years passedWelearned what our business is – building websitesWe started selling premium websites
  • 8.
    Two years passedOurarchitecture evolvedWe added a separate Billing segmentWe moved static file storage and serving to a separate instanceBut we started seeing problemsUpdates to our server imposed complete wix downtimeOur static storage reached 500 GByte of small files, the limit of Bash scriptsThe server codebase became large and entangled. Feature rollout became harder over time, requiring longer and longer manual regressionStrange full-table scans queries generated by Hibernate, which we still have no idea what code is responsible for…Statefull user sessions required a lot of memory and a statefull load balancerBillingDBBilling(Tomcat)Server(Tomcat)DBstatics(Lighttpd)Media storage
  • 9.
    Editor & PublicSegmentsThe Challenge - Updates to our Server imposed downtime for our customer’s websitesAny Server or Database update has the potential of bringing down all Wix sitesIs a symptom of a larger issueThe Server served two different concernsWix Users editing websitesViewing Wix Sites, the sites created by the Wix editorThe two concerns require a different SLAWix Sites should never ever have a downtime!Wix Sites should work as fast as possible, always! However, an editing system does not require this level of SLA. The two concerns evolve independently Releases of Editing feature should have no impact on existing Wix sites operations!
  • 10.
    Editor & PublicSegmentsOur SolutionSplit the Server into two Segments – Public and EditorThe Public segment targets serving websites for Wix UsersHas mostly read-only usage pattern – only updated on when a site is publishedSimple publishing systemSimple means it is easier to have higher SLA Simple and read-only means it is easier to make DRPBuilt using Spring MVC 2.5 and JDBC (without Hibernate)MySQL used as NoSQL – single large table with XML text fieldsThe Editor segment Exposes the Wix Editing APIs, as well as user account and galleries management APIs.Has different release schedule compared to the Public segmentPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
  • 11.
    Editor & PublicSegmentsWhat we have learnedArchitecture is inspired by aspects such asSLARelease Cycles – deployment flexibilitySeparate Segments for discrete concernsEditing (Editor Segment)Publishing (Public Segment)modularity – in terms of breaking the architecture to services / modules / sub-systems (or any other name)Enabler for gradual re-writeEnabler for continues deliverySimplifies QA, Operations & Release CyclesPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
  • 12.
    Editor & PublicSegmentsWhat we have learnedMySQL is a damn good NoSQLengineOur public DB is (mainly) one table with 10M entriesQueries are by primary keyUpdates are by primary keyInstead of relations, we use text/xml columnsImmutable DataArchitect your data to be immutableMySql sequential keys cause problemsLocks on key generationRequire a single instance to generate keysWe use GUID keysCan be generated by any clientNo locks in key value generationEnabler for Master-Master replicationPublic(Tomcat)Public DBEditor(Tomcat)Editor DB
  • 13.
    Authentication SegmentThe Challenge– Statefull user sessionsMakes it harder to separate software to different segmentsRequires shared library or single sign-onRequires statefull load balancerTakes more memoryOur solutionDedicated authentication segment – authentication is a specific concernCookie based login – solves all the above problemsImplement stronger security solution only where really requiredWhat we have learnedDo not use statefull sessions – they don’t scale (easily)Use cookie based login – this is the standard on the web today
  • 14.
    The Challenge –Our static storage reached over 500 GByte of small filesThe “upload to app server, copy to static server, serve by file server” pattern proved inefficient, slow and error proneDisk IO became slow and inefficient as the number of files increasedWe needed a solution we can grow with – HTTP connectionsnumber of filesWe needed control over caching and Http headersOur SolutionProsperoYou’ve heard about it in a previous presentation 
  • 15.
    What we havelearnedIt was hard to make Prospero work rightBut it was the right choice for usMedia files caching headers are criticalMax-age, ETag, if-modified-since, etc.Think how to tune those parameters for media files, as per your specific needsWe tried using Amazon S3 as secondary storageHowever, they proved unreliableThe S3 service had connection reliability issuesS3 have “lost” a 200M files, 30 Tbyte volumeWe found that using a CDN in front of Prospero is very affective
  • 16.
    CDNWhat we havelearnedUse a CDN!CDN acts as a great connection managerWe have CDN hit ratio’s of over 99.9%Use the “Cache Killer” patternhttp://static.wix.com/client/css/viewer.css?v=327Makes flushing files from the CDN redundantEnabler for longer caching periodsThere are many vendorsWe started with 1 CDN vendorWe are now “playing” with multiple CDN vendorsDifferent CDN vendors have advantages at different geoTune HTTP Headers per CDN VendorCDN Vendors interpret HTTP heads differently
  • 17.
    Wix-Framework / TDD/ DevOpsThe ChallengeThe server codebase became large and entangledFeature rollout became harder over time, requiring longer and longer manual regressionThe SolutionThe Wix Framework + TDD / CI / CD / DevOpsTDD – Developer responsible for all automated testsWith QA DevelopersCI / CD – automate build & deploymentSolve the build / test / deployment / rollback issues firstGet the project to production with minimal functionalityDevOps - Developer is responsible from Productto 10,000 active users
  • 18.
    Wix-FrameworkThe Wix FrameworkJava,Spring, Jetty, MySQL, MongoDBSpring MVC basedAdjustments for FlashFlash imposes some restrictions on HTTP which require special handlingDevOps supportBuilt-in support for monitoring dashboard, configuration, usage tracking, profiling and Self-Test in every app serverAutomated Deployment – using Chef and SouceChefTDD SupportUnit-Testing helpersMulti-browser Javascript Unit-Testing support, integrated with IDEIntegration Testing framework, allows running the same tests in embedded mode (for developer) and deployed in production like envEmbedded MySQL & Embedded MongoDB
  • 19.
    TDD / DevOpsOurdevelopment processDeveloper writes code and Unit-TestsDeveloper and QA Developer write integration tests (IT)Build on developer machineOn check-inMark RCDeploy to QA envMark GADeploy to productionCompileUnit TestsEmbedded ITsRun ITsCompileUnit TestsSelf TestsDeploy – IT EnvRCQASelf TestsDeploy – QA EnvMonitorGASelf TestsDeploy – ProductionMonitor
  • 20.
    Wix-Framework / TDD/ DevOpsWhat we have learnedTDD / CI / CD / DevOps worksIt enables us to release frequentlyWe release 2-3 times a dayWith higher confidence in release qualityWith reduced riskIt takes time and effortRequires management commitmentTransforms the way R&D, Product and operations workRequires investment in supporting tools – the Wix FrameworkInfrastructure – Build Server, Artifactory, Maven (+Plugins), Chef, SourceChef, Testing environments (virtual boxes), Server Dashboard, New Relic.It is worth the effort!A Wix developer – “never again shell I work without automated testing”
  • 21.