When small problems become big problems


Challenges met while developing a multi-tenant PaaS runtime, summarized from interviews with the CloudHub.io development and product teams.


Slide notes:
  • photo credit http://wallpoper.com/wallpaper/movies-godzilla-271845
  • Not going to focus on normal dev/ops except in the context of multi-tenancy
  • Keep each tenant safe, protect all tenants, stay in business http://whosjack.wpengine.netdna-cdn.com/wp-content/uploads/2012/05/terraced_houses_manchester_298792.jpg
  • Integration customers range from non-technical users to developers and architects; high-value features are hard to price on a per-message basis, especially when some services run only 150 messages/month. Find the right pricing model: users who just want Mule by itself can use EC2 or Heroku instead.
  • An LDAP infrastructure already existed for the MuleSoft community
  • How do you handle lockout of users? System keys, etc.: who’s building them?! System integrators (SIs) need access to create apps for other users, and account conflation leads to N accounts per person. Some users just want access to download Mule and the docs and will never become a CloudHub user.
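One way out of the shared-password problem raised above is to issue system integrators revocable tokens scoped to the tenants they manage, instead of Basic Auth credentials. A minimal sketch (all names and the token format are hypothetical, not CloudHub's actual auth design):

```python
import hashlib
import hmac
import secrets

# Hypothetical: in practice this key would come from a secret store, not be
# generated at import time.
SERVER_SECRET = secrets.token_bytes(32)

def issue_token(integrator_id: str, tenant: str, scope: str) -> str:
    """Return a signed token granting `integrator_id` `scope` on `tenant`."""
    payload = f"{integrator_id}:{tenant}:{scope}"
    sig = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{sig}"

def verify_token(token: str):
    """Validate the signature; return the claims, or None if tampered."""
    integrator_id, tenant, scope, sig = token.rsplit(":", 3)
    payload = f"{integrator_id}:{tenant}:{scope}"
    expected = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered or signed with a different key
    return {"integrator": integrator_id, "tenant": tenant, "scope": scope}
```

Because each token names the integrator, cleanup and revocation stay possible even when one person touches N tenant accounts.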
  • Also follows patterns like S3, which is the largest cloud service
  • x.tenant.cloudhub.io lets you do more with a tenant
  • Also introduces more complexity in the release process, and more opportunity for laptop != prod
  • e.g. Dynamo claims full reliability, but if Dynamo is out all you can do is “wait”
  • Some problems are much harder than they seem. Searching, indexing, and chronology are difficult, and emerging products can suffer from reliability issues. Logging is also a core means to troubleshoot problems, so if logging is a problem, it is a big problem; it takes significant effort and expertise to nail. A cluster *will* eventually become split-brain; how long does it take to restore service by rebooting?
  • e.g. Marketo has no sandbox, and neither does our billing system. Even if you reproduce production apps, how do you reproduce the behavior of those dependencies? *Corner cases are the ones likely to stress your platform out*
  • Desire for SSH access can thwart your firewall rules. Each ipsec.conf has tunnels configured for each of its peers, and needs to recognize one side of the tunnel as itself. This makes each host’s ipsec.conf unique to that instance, so you can’t collapse the hosts into a class; you have to manage each one separately, using configuration management to assign roles. We moved off VPC due to many problems with IPsec, yet still have the inter-region problem. Do you know a solution?
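The per-host uniqueness described above is the kind of thing configuration-management templating absorbs: derive each host's ipsec.conf from one shared peer list, so only the list is edited by hand. A minimal sketch (hypothetical host names and a simplified conn format, not CloudHub's actual tooling):

```python
# Hypothetical peer list: host role name -> tunnel endpoint IP.
HOSTS = {"gw-us-east": "10.0.1.5", "gw-us-west": "10.1.1.5", "gw-eu": "10.2.1.5"}

def render_ipsec_conf(self_name: str) -> str:
    """Render this host's ipsec.conf: it is 'left' in every tunnel,
    and every other host in the peer list becomes a 'right' peer."""
    self_ip = HOSTS[self_name]
    sections = []
    for peer, peer_ip in sorted(HOSTS.items()):
        if peer == self_name:
            continue  # a host never tunnels to itself
        sections.append(
            f"conn {self_name}-to-{peer}\n"
            f"    left={self_ip}\n"
            f"    right={peer_ip}\n"
            f"    auto=start\n"
        )
    return "\n".join(sections)
```

A CM tool (Chef, Puppet, etc.) would run this per role assignment, turning "vi ipsec.conf in a for loop" into a one-line peer-list change.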
  • e.g. failures in the management layer can lead to conservative healing and scaling policies, or mandatory user intervention
  • Sometimes people code around a bug (effectively asserting that true = false) rather than report it. API design can go lovable-first or hateable-first; a hateable-first MVP approach may backfire when you are dealing with a *public* service. Keep it as simple as possible, and expose conservatively.
  • Know your customer, and whether they are likely to be savvy enough to recover from system failures
  • http://www.slideshare.net/benjchristensen/fault-tolerance-in-a-high-volume-distributed-system
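The linked talk covers fault-tolerance patterns such as circuit breakers: fail fast while a dependency is down instead of queueing callers behind it. A minimal, illustrative breaker (a toy sketch, not the Hystrix implementation):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors, reject calls until
    `reset_after` seconds pass, then let one probe call through."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None   # half-open: allow one probe call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0           # success closes the circuit again
        return result
```

The point for a multi-tenant platform: one tenant's broken dependency should degrade that tenant's calls quickly, not exhaust shared threads for everyone.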
  • e.g. notification vs. activity feed vs. streaming? Once terminology escapes your network, it can haunt you with clashing instructions, Stack Overflow answers, etc.
  • e.g. what does “streaming” even mean?
  • MySQL doesn’t support full-text search on partitioned tables. Real-time tools are young: e.g. Druid was just released yesterday; Twitter Storm is only a year old. We move the problem from the customer to us, which includes the technical profile and the migration. A small problem becomes a big problem when a customer desires a capability that is unsupported, or difficult to support, with the existing datastore, infrastructure, or indexing strategy.
  • Consistent backup wasn’t possible until very recently (ext4). Mongo is immature by DB standards, though older than its years. Devs love it, but the lack of tools is a problem: you basically have to use Navicat, etc.; for data ops you’re stuck with the command line and visualization problems. There is no way to do analytics on top of Mongo without telling people to write some JavaScript, and the answer might be to transform to PostgreSQL.
  • Should we use Mongo or RDS? Mongo isn’t being used correctly: it’s being used as a relational database, so it holds transactional data. We now have event tracking, but we don’t have a document for the event configuration; you have to store the whole thing or you will have transactional-data problems. The main problem is that it is not being used correctly. Even though it seems you can store relational data in a NoSQL store like Mongo, that doesn’t mean you should. Two (or more) types of datastores may be the most supportable answer to your data problem: A. we use it transactionally (so we get clashing or overlapping writes); B. it doesn’t give us mature features (like consistent backup).
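One concrete way to avoid the "no document for the event configuration" problem above is to snapshot the configuration into each event document at write time, instead of referencing a mutable config record relational-style. A sketch with hypothetical field names (the actual CloudHub event schema is not shown in this deck):

```python
import copy
import datetime

def build_event_document(event_type: str, payload: dict, config: dict) -> dict:
    """Embed a copy of the current configuration in the event document,
    so later config changes cannot rewrite the history of stored events."""
    return {
        "type": event_type,
        "payload": payload,
        "config": copy.deepcopy(config),  # embedded snapshot, not a reference
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

config = {"retention_days": 30, "index_fields": ["tenant"]}
doc = build_event_document("deploy", {"tenant": "acme"}, config)
config["retention_days"] = 7  # a later config change...
# ...does not affect the already-stored document.
```

This trades storage for read-time independence, which is the document-store idiom; data that genuinely needs overlapping transactional writes belongs in the relational store instead.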
  • Pretty much everyone has to be DevOps. Designs can be refactored, but are tough to change at scale. Your job changes often, so the right tool for the job also changes. Realize that choices made now can be difficult to change a year on.
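The deck's "when 5 retries becomes a DDoS attack" point (slides 36-38) comes down to naive fixed retry loops amplifying outages: synchronized clients all retry at once against a recovering dependency. A minimal sketch of capped exponential backoff with full jitter (illustrative; not CloudHub's actual client code):

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry `fn` with capped exponential backoff plus full jitter,
    so a herd of clients spreads its retries out instead of stampeding."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            delay = min(base_delay * (2 ** attempt), 30.0)  # cap the wait
            sleep(random.uniform(0, delay))  # "full jitter" de-synchronizes clients
```

When the API exposes a Retry-After hint, that server-supplied delay should take precedence over the computed one, which is exactly the "APIs can include data about when to retry" point from slide 38.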

    1. When small problems become big problems (@adrianfcole)
    2. Agenda
       • Introduction to CloudHub
       • Challenges we faced building a multi-tenant architecture
       • Q/A
    3. Ego slide: Adrian Cole (@jclouds), founded jclouds in March 2009; architect at cloudhub.io
    4. (no slide text)
    5. Platform as a Service
       • Automated Provisioning
       • Event Tracking
       • Centralized Logging
       • Secure Data Gateway
    6. The landlord’s dilemma
    7. When you’ve priced yourself out of business
    8. Cloud is a utility, but your service may be more
       • Measurement-based pricing exists in the infrastructure tier
       • Know your customer: who they are and where in the value chain you act
       • Don’t get into a race to the bottom
    9. When 200 users becomes 2,000 accounts
    10. Choosing a BASIC starting point
       • Already had an LDAP infrastructure
       • Straightforward integration with the console and other access tools
       • Easy to do BASIC authentication
    11. Remember users (and API users)
       • Basic Auth is not a good choice for an API over time
       • System integrators need delegated access
       • Hard to clean up accounts when there are multiple owners
    12. When myapp.cloudhub.io becomes myapp001.cloudhub.io
    13. How to present the iApps
       • x.cloudhub.io
       • DNS is flexible to deal with
       • Clear branding
    14. x.cloudhub.io woes
       • Namespace contention
       • qa.cloudhub.io isn’t really an iApp
       • Need to maintain a blacklist
    15. When Mule isn’t Mule
    16. PaaS is more than java -jar mule.jar
       • CloudHub adds services integration to Mule
       • Logging, Event Tracking, Replay, etc.
    17. appstack -> platform is tricky
       • Can features be transparent and also compatible?
       • Dealing with network streams that could be more brittle
       • Matching serialization/marshalling with cloud features like streaming
    18. When SLA turns into refund
    19. Desire to rely on more services
       • Cloud Infrastructure
       • Cloud Search
       • Cloud Scaling
    20. Reality of relying on more services
       • Uptime drops with every service dependency you add
       • Services may underperform their SLAs with little financial impact
       • You may need to deal with service outages manually
    21. When logging turns into a big data problem
    22. Customers desire real-time search
       • Need to centralize and index logs
       • Using ElasticSearch can avoid service fees or license fees
       • With a custom logging plugin, we can redirect output to the cluster
    23. Logging is always a big problem
       • Clusters can fail for reasons beyond the servers deployed
       • API design for logging is different
       • What happens if your disk fails or your cluster fails?
       • What happens when you replace a worker?
    24. Real men test in production
    25. Testability is crucial
       • Each dependency needs to be testable and mockable
       • Devs need a local environment that matches, or your test cases will suffer
       • Creation of new tenants means more money... test it!
    26. Platform testing is really hard
       • Some external deps don’t have sandboxes
       • Can you try 500 applications?
       • Can you maintain a quiet production “neighborhood” while testing QA?
    27. When security updates = vi ipsec.conf in a for loop
    28. Security in a public service is hard
       • Assume the user is infinitely clever and malicious
       • Deny by default vs. service simplicity
       • Maintain segregation and availability of tenants
       • Asset value can vary widely across tenants
    29. Security design touches everything
       • IPsec is hard to maintain without proper CM, and wasn’t built for a noisy network
       • Deny by default means higher maintenance, and not all products support it
       • It is easy to violate tenancy segregation in a platform
       • You may have to hire consultants
    30. When your management service goes haywire
    31. Automation, automation, automation
       • A myriad of technologies to automate scaling and availability
       • Policies can be fine-tuned to relaunch or scale out based on system feedback or API
    32. What about network splits?
       • Will your management server “heal” something that is already around?
       • Is your management server on the same failure plane as your managed servers?
       • Will you end up with manual intervention controls (aka a red button)?
    33. When your API design haunts you
    34. Put an API on everything
       • Allows automation and GUIs beyond what you’ve invented
       • Simplifies testing
       • Eat your own dogfood
    35. Design redo is a big problem
       • GUIs can change more easily, since humans drive them
       • Maintaining old APIs may not be worth it
       • People may depend on bugs or semantic gaps
       • Versioning practices in REST are not uniform
       • Remember: understanding the state machine is a prerequisite for HATEOAS
    36. When 5 retries becomes a DDoS attack
    37. We want to build resilient apps
       • Recovery is part of the service you provide, more important as you go up the value chain
       • Connections should assume failure and be able to reconnect to dependencies
       • Recovery is non-trivial
    38. 5 retries is a code smell
       • Things that back up or fail can get worse with naive error retry loops
       • APIs can often be made to include data about when to retry, or that you need to slow down
       • Treat resilience as a requirement, not a feature
    39. When your users ask the same questions
    40. Wrong words suck
       • Some terms seem sensible in design discussions, but the public uses something else
       • Changing requires retraining and thorough doc review
       • What goes online lingers
    41. When a feature request implies new architecture
    42. Platform changes
       • Customers are looking for service, not explanations of why it is hard
       • Adding value implies tough decisions on new features
       • As the world turns, expectations rise
       • Know your customer
    43. Real-time, full-text search, streaming... oh my!
       • Not all databases support full-text search, especially with partitioning
       • Some data is better stored in S3; how does that affect indexing strategy?
       • Real-time tools are emerging but immature
    44. When you end up with a “lock” table in Mongo
    45. Datastore diversity!
       • NoSQL datastores like Mongo are attractive and energize developers
       • Cloud provisioners like RDS-driven MySQL are also attractive
       • Specialized stores like CloudWatch for statistics
    46. Don’t expect Mongo to do magic
       • Database engines mature over time
       • Consistent backups are tricky and only recently supported
       • Data ops and visualization tools are still emerging
       • There are type-safe bridges like Morphia
    47. Hammers and screwdrivers
       • In a pinch, you can knock in a screw with a hammer, but you can’t screw in a nail with a screwdriver
       • Don’t throw data into whatever store happens to be easy to grab, even if you can
       • Rechecking data assumptions at T1 is better than at T3; at T6, you may have a disaster
    48. Summary
    49. Multi-tenant platform
       • Own your dependencies or they will own you
       • Add time for entropy
       • Repeatedly remind yourself you are a landlord
    50. Architecture as iterative development
       • Forethought
       • Critical debate
       • Decision review
    51.
       ‣ @adrianfcole
       ‣ adrian.cole@mulesoft.com
       ‣ www.cloudhub.io