Building a public cloud is a challenging task. It becomes more challenging when you are equipped with a set of open source products. WSO2 Public Cloud is built using the WSO2 middleware stack which is completely open source. There have been many challenges during the effort and Amila shared the experience of building and maintaing it, which includes:
Customizing and extending the existing products
Configuration management
Ensuring security
Frequent bug fixes, upgrades and migrations
Collecting stats
Getting feedback and helping the users
4. Motivation
o History
o StratosLive
o Stratos -> Donated to Apache
o Wanted to provide a better user experience
o WSO2 API Manager was becoming a hot product
o WSO2 AppFactory was in the making
5. Beginning
o Two clouds
o App Cloud
■Powered by WSO2 AppFactory
o API Cloud
■Powered by WSO2 API Manager
oIn beta for nearly two years
oAPI Cloud is commercial now
7. Why it is a war :)
o It is not a war, but
o More than 100 instances
o Can’t let a bug to live too long
o Need to upgrade frequently
o Customer issues/questions
o We depend on other WSO2 products, but...
8. Will be sharing the experience on..
o Customizations and new developments
o Planning the deployment
o Configuration management
o Monitoring and alerts
o Bug fixes, upgrades and migrations
o Security
o Backups and restoration
o Statistics
o Feedback and customer support
o Performance issue
o Processes
9. Customizations and new developments
o New user model
o Requirement of plugging the wso2.com
userstore
o Ease of registering and working in
organizations
o Wrote our own userstore implementation
oManagement app
o Two clouds (and more in the future) to be
centrally managed
10. Customizations and new developments
user1@tenant1
user2@tenant2
user3@tenant3
email
Tenant 1
Tenant 2
Tenant 3
Old Model New Model
11. Planning the deployment
o AWS as the IaaS
o Previous experience in running a cloud in our
infrastructure
o We are not specialized in maintaining data
centers
o So, why waste our time
o EC2, VPC, RDS, R53
o High availability
12. Configuration Management
o We had experienced our own solution
previously
o We were also playing with Puppet
o Some facts to consider
o AWS instances are shutdown for maintenance
o Necessity of scaling
o Setting up multiple environments
o Decided to go with puppet
o We manage more than 100 nodes now
13. Monitoring & Alerting
o Three types of monitoring were needed
o Health of the instances and JVM processes
■SNMP, Nagios
■Emails, Phone alerts etc.
o Functionality health
■Our own heartbeat monitoring tool
■Improved to track the uptime as well
■Keeps adding more tests
o Logs
■To smell trouble
■Logstash and Kibana from https://www.elastic.co/
15. Bug fixes, Upgrades & Migration
o Bug fixing
o We can’t let a bug to exist in the live system
o We are a customer of WSO2 :-)
o Get patches from WSO2 Support
oUpgrades & Migrations
o Deploy AppFactory milestones every 2 or 3
weeks
o Some ends up needing migrations
o Have our own ways
oTarget
o Continuous deployment
16. Bug fixes, Upgrades & Migration
Load Balancer
Serving the traffic Doing the migration
17. Security
o Access to infrastructure
o Public/Private access for services
o Which service/product should be exposed
publicly/privately
oSecuring users’ data
o Native multi-tenancy support
o Java security manager enabled
■Very strict at the moment
18. Backups & Restoration
o Any user artifact is in Git or SVN
o Snapshots taken
o RDS
o EBS
o Automated via AWS facilities
oLDAP backups
19. Statistics
oWe need to know what is happening
o How many users are using this daily
o How far they go
o Understand about our UX
oPublish stats to WSO2 BAM on various user
activities
o Run analytic scripts
oUptime tracking
oAlso refer logstash stats
20. Feedback and customer support
oThere was no way for the users to contact us
oProvided few methods
o Contact us menu at the top
■Via StackOverflow
■Via email - which will automatically create a jira
oImproved our customer service
o Monitor the dashboard
o Keep the user informed regularly
21. Performance
o Identified and fixed several performance
issues
o Changed some architectures as well
o Several issues
o Registry related issues
o File system related issues
oHad problems when the number of tenants
were growing
o Now we have cleanup mechanisms in place
22. Processes
o If same mistake happens more than once, its
negligence.
o But, people do make mistakes
o Processes are the best way to minimize them
o Applying patches
o Making a config change
o Monitoring logs for errors
o Supporting users
oChecklists
o In upgrades