Silver Linings - North Queensland IT Industry Conference
A presentation by Tom Dance and Tristan Davey from SafetyCulture Pty Ltd about their organisation's successful migration from Google App Engine to Amazon Web Services, changing databases, environments, languages and software architecture along the way.

Published in: Technology

  1. Silver Linings: Our journey in migrating a cloud
  2. Tom Dance, Head of Platform Development; Tristan Davey, Senior Software Engineer
  3. Moving Clouds: Relocating from Google to Amazon
  4. SafetyCloud: Google App Engine
     Large monolith - 500,000+ LOC
     We outgrew Google App Engine
     ➔ Poor documentation and support
     ➔ Immature APIs (everything is beta!)
     ➔ Bumping into limitations
     ➔ Proprietary technology
     Feature development and fixes slow
     Needed more flexibility
     Scaling the engineering team was hard
  5. SafetyCloud Monolithic Architecture
     Diagram labels: Google Cloud Platform; SafetyCloud ‘The Monolith’; Web Frontend; Web API; Email; Client API; Binary Data
  6. SafetyCulture: The Goal
     Improve our product
     ➔ More reliable and performant syncing
     ➔ New modern user interface
     ➔ Feature equivalent
     ➔ Full backwards compatibility with iAuditor
     Address the problems a large monolithic codebase brings
     Scalable, flexible, open technologies
     Strong partner for infrastructure
  7. SafetyCulture: The Solution
     10 Microservices built with Node.js
     Single Page App built with Ember.js
     Document store with Couchbase
     Document indexing with ElasticSearch
     Scalable cloud based infrastructure with Amazon Web Services
  8. SafetyCulture Microservice Architecture
     Diagram labels: Amazon Web Services; Web API; Email; Client API; Binary Data; Web Frontend
  9. Rebuilding an API: Reconstructing our client API in Node.js
  10. Client API
      HTTP API for SafetyCulture iOS and Android Applications
      ● Authentication
      ● Document Synchronisation
      ● User Management
      ● Document Permissions
  11. Client API: Change Considerations
      Consumed by over 500,000 devices
      Many users on legacy versions of consuming clients:
      ● 2% of users on a version older than 1 year
      ● 8.5% of users on a version older than 6 months
      ● 25.3% of users on a version older than 1 month
      Consuming clients relied on undocumented quirks and edge cases to function, so these needed to be maintained
  12. Client API Rebuild
                        Original API           Rebuilt API
      Language          Python                 CoffeeScript
      Server Framework  WebApp2                Hapi.js
      Database          Google Datastore       Couchbase Server
      Query Engine      SQL-Like Queries       MapReduce Indexes
      Binary Storage    Google Blobstore       Amazon S3
      Scaling           Vertical + Horizontal  Horizontal
  13. Client API: Maintaining Backwards Compatibility
      API Specification-based implementation: External specification of original API became the internal implementation specification of the new API.
      Manual and Automated testing: Automated unit and integration tests. Production device testing with large scale, multi-hour real-world tests.
      Replay-based Regression testing: Production device traffic was observed, recorded and replayed with a custom-built tool. Allowed us to identify request/response behaviours.
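The replay-based regression testing on this slide could be sketched roughly as follows. This is a minimal Python illustration and not the custom-built tool the talk mentions: `RecordedExchange`, `replay` and the toy `rebuilt_api` handler are hypothetical names, and a real tool would send recorded requests over HTTP rather than to an in-process function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (status code, JSON body) as returned by an API handler
Response = Tuple[int, Dict[str, object]]


@dataclass
class RecordedExchange:
    """One request/response pair observed from the original API (hypothetical shape)."""
    method: str
    path: str
    body: Dict[str, object]
    status: int
    response_body: Dict[str, object]


def replay(recorded: List[RecordedExchange],
           handler: Callable[[str, str, Dict[str, object]], Response]) -> List[RecordedExchange]:
    """Send each recorded request to the new handler and collect the exchanges
    whose status or body differ from what the original API returned."""
    mismatches = []
    for exchange in recorded:
        status, body = handler(exchange.method, exchange.path, exchange.body)
        if (status, body) != (exchange.status, exchange.response_body):
            mismatches.append(exchange)
    return mismatches


def rebuilt_api(method: str, path: str, body: Dict[str, object]) -> Response:
    """Toy stand-in for the rebuilt API; it does not know about the legacy quirk."""
    if path == "/ping":
        return 200, {"ok": True}
    return 404, {"error": "not found"}


recorded = [
    RecordedExchange("GET", "/ping", {}, 200, {"ok": True}),
    RecordedExchange("GET", "/legacy-quirk", {}, 200, {"ok": True}),
]
mismatches = replay(recorded, rebuilt_api)
```

Here the second recorded exchange ends up in `mismatches`, which is the kind of undocumented behaviour a replay run is meant to surface before legacy clients hit it.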
  14. Client API: Rebuild Outcomes
      Built, tested, deployed in under 9 engineering-months
      Client API Codebase: 10,000+ LOC
      Regression Test Codebase: 22,000+ LOC
      Seamlessly continued working with legacy clients
      Horizontally scales to easily meet peak demand
      Now serves 1,200,000 requests/day
  15. Data Migration: Google Datastore to Couchbase Server
  16. PRODUCTION
      Google Datastore                Couchbase Server
      Non-relational key-value store  Non-relational document store
      Proprietary software            Open-source project
      Eventually consistent           Eventually consistent
      1MB value limit                 Configurable document limit
      Basic indexing and querying     MapReduce-based indexing
  17. Database Migration
      Diagram labels: PRODUCTION (Google Datastore, Couchbase Server); MIGRATION (LevelDB, Couchbase, Stage 1 Migration, Stage 2 Migration)
  18. MIGRATION
      Stage 1 Migration: LevelDB ➜ Couchbase
      ● 1 Key/Value Pair = 1 Document
      ● JSON Serialise Data
      ● Index data for Stage 2
      Stage 2 Migration: Couchbase ➜ Couchbase
      ● Many Key/Value Pairs = 1 Document
      ● Data denormalisation
      ● Transform data structures
      ● Reformat data types
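The Stage 2 step of folding many key/value records into one document might look something like the sketch below. It is only an illustration in Python: the record shape (`kind`, `id`, `parent_id`, `label`, `position`) and the `denormalise` function are assumptions made for the example and do not come from the talk.

```python
from collections import defaultdict
from typing import Dict, List


def denormalise(stage1_docs: List[dict]) -> Dict[str, dict]:
    """Fold child records into their parent so that many key/value
    records become one document per parent id."""
    merged: Dict[str, dict] = {}
    children = defaultdict(list)

    for doc in stage1_docs:
        if doc["kind"] == "parent":
            merged[doc["id"]] = {"id": doc["id"], "name": doc["name"], "items": []}
        else:
            # Reformat data types: positions arrive as strings in this example.
            children[doc["parent_id"]].append(
                {"label": doc["label"], "position": int(doc["position"])}
            )

    for parent_id, items in children.items():
        if parent_id in merged:
            merged[parent_id]["items"] = sorted(items, key=lambda item: item["position"])
    return merged


stage1_docs = [
    {"kind": "parent", "id": "a1", "name": "Site inspection"},
    {"kind": "child", "parent_id": "a1", "label": "Exits clear", "position": "2"},
    {"kind": "child", "parent_id": "a1", "label": "Signage visible", "position": "1"},
]
documents = denormalise(stage1_docs)
```

Three Stage 1 records become a single document whose `items` list is ordered by position, which is the "many pairs to one document" shape the slide describes.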
  19. MIGRATION: Migration Process
      40+ Quad Core Machines
      320+ documents migrated concurrently
      Concurrency and tasks controlled by Amazon Simple Queue Service
      6.5 hours from start to finish
      80+ test migrations
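Queue-controlled concurrency of this kind can be sketched with a local work queue and a fixed pool of workers. The Python below uses the standard library `queue.Queue` purely as a stand-in for Amazon Simple Queue Service; `run_migration` and `migrate_one` are hypothetical names, and the default of 8 workers per machine is only an inference from 320 concurrent documents across 40 machines, not a figure stated in the talk.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_migration(keys: Iterable, migrate_one: Callable, workers: int = 8) -> List:
    """Drain a task queue with a fixed number of worker threads.
    The queue bounds concurrency: at most `workers` items are in flight."""
    tasks: queue.Queue = queue.Queue()
    for key in keys:
        tasks.put(key)

    results: List = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                key = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            migrated = migrate_one(key)
            with lock:
                results.append(migrated)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Toy "migration" that just doubles the key.
migrated = run_migration(range(100), lambda key: key * 2)
```

Because workers pull tasks rather than being assigned them, adding machines only means adding more consumers of the same queue.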
  20. Diagram labels: PRODUCTION (Google Datastore, Couchbase Server); MIGRATION (Stage 1 Migration, Stage 2 Migration); VALIDATION (Clean-room Validation, Migration Validation, Process Validation)
  21. VALIDATION
      Clean-room Validation
      ● Rebuilt transforms from specification
      ● Compared random subsamples of original data with result data
      Migration Validation
      ● SQS Queue validation
      ● Numerical validation of entities and sizes
      ● Strict error monitoring
      Process Validation
      ● Issue management
      ● Documented procedures
      ● Checklists & audits
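The clean-room idea of re-implementing the transform independently and checking a random subsample can be illustrated as below. This Python sketch rests on assumptions: `validate_sample`, `reference_transform` and the toy data are hypothetical, and the real validation compared Datastore source data with Couchbase results rather than in-memory dictionaries.

```python
import random
from typing import Callable, Dict, List


def validate_sample(source: Dict[str, dict],
                    migrated: Dict[str, dict],
                    reference_transform: Callable[[dict], dict],
                    sample_size: int,
                    seed: int = 0) -> List[str]:
    """Pick a random subsample of source keys and return those whose migrated
    document differs from the independently rebuilt transform's output."""
    rng = random.Random(seed)
    keys = rng.sample(sorted(source), min(sample_size, len(source)))
    return [key for key in keys if migrated.get(key) != reference_transform(source[key])]


def reference_transform(record: dict) -> dict:
    """Clean-room re-implementation written from the specification (toy version)."""
    return {"n": int(record["n"])}


source = {f"k{i}": {"n": str(i)} for i in range(50)}
migrated = {key: reference_transform(record) for key, record in source.items()}
migrated["k7"] = {"n": -1}  # simulate one document the migration got wrong

failures = validate_sample(source, migrated, reference_transform, sample_size=50)
counts_match = len(source) == len(migrated)  # simple numerical validation
```

With the sample size equal to the whole data set the corrupted key is always found; with a smaller sample the check is probabilistic, which is why it sits alongside the count and size checks on the slide.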
  22. PRODUCTION
      Google Datastore         Couchbase Server
      129,607,422 KV Entities  2,596,011 Documents
      121 Query Indexes        25 MapReduce Indexes
      1900 Ops/sec average     260 Ops/sec average
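As a quick piece of arithmetic on these figures, the ratios work out to just under 50 Datastore entities per Couchbase document and roughly 7 times fewer operations per second. The short Python below only restates the slide's numbers; whether the two ops/sec averages were measured under comparable load is not stated, so the second ratio is indicative at best.

```python
# Figures taken directly from the slide.
datastore_entities = 129_607_422
couchbase_documents = 2_596_011
datastore_ops_per_sec = 1900
couchbase_ops_per_sec = 260

# Average number of key/value entities folded into each document.
entities_per_document = datastore_entities / couchbase_documents

# How many times fewer operations per second the new store averages.
ops_reduction = datastore_ops_per_sec / couchbase_ops_per_sec

print(f"{entities_per_document:.1f} entities per document")
print(f"{ops_reduction:.1f}x fewer ops/sec")
```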
  23. SafetyCulture: Moving into our new cloud...
  24. SafetyCulture: Moving Clouds
      Instant Switchover
      “Google App Engine one day, Amazon Web Services the next”
      28 Hour Switchover Process
      ➔ Downtime required
      ➔ Minimum Load Period - Saturday to Sunday
      ➔ Required 15 engineering staff
      ➔ Additional support staff
  25. SafetyCulture: The Infrastructure
      12 microservices
      ➔ Unique scaling requirements for each
      ➔ Stateless and fault tolerant
      Infrastructure
      ➔ 30+ Virtual machines serving simultaneously
      ➔ 14 Load balancers
      Use AWS services where possible
      ➔ DynamoDB, ELBs, ASGs, CloudWatch, Route53...
  26. SafetyCulture: Development
      Continuous integration and delivery
      ➔ ~500 deploys in under five months
      ➔ Zero downtime deploys
      Better team workflow
      ➔ Agile development methodology
      ➔ Every pull request gets reviewed and tested
      ➔ Microservices allow for faster and isolated development
      ➔ Features hidden behind feature flags
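Hiding a feature behind a flag, as the last point mentions, can be as simple as the sketch below. It is a generic Python illustration and not SafetyCulture's implementation: `FeatureFlags`, `render_dashboard` and the flag name `new_reports` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class FeatureFlags:
    """Holds the set of flag names that are currently switched on."""
    enabled: Set[str] = field(default_factory=set)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled


def render_dashboard(flags: FeatureFlags) -> List[str]:
    """Deployed code contains the new widget, but it stays hidden
    until the flag is turned on."""
    widgets = ["audit_list"]
    if flags.is_enabled("new_reports"):
        widgets.append("reports")
    return widgets


hidden = render_dashboard(FeatureFlags())
visible = render_dashboard(FeatureFlags(enabled={"new_reports"}))
```

This separation of deploying code from releasing a feature is one way frequent deploys can coexist with unfinished work.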
  27. SafetyCulture: The Business
      A better product for customers
      ➔ Faster and more reliable
      ➔ Clean and modern UI
      ➔ More features and fixes being released
      In the five months since launch
      ➔ 100% growth in database records
      ➔ 50% user growth
      ➔ 40% saving in infrastructure costs
  28. May the safe be with you... safetyculture.io @safetycultrehq
