When small problems
               become big problems




@adrianfcole
Agenda

• Introduction to CloudHub
• Challenges we faced building multi-tenant
  architecture
• Q/A
Ego slide


Adrian Cole (@jclouds)
 founded jclouds March 2009
  cloudhub.io architect at
Platform as a Service

Automated Provisioning
Event Tracking
Centralized Logging
Secure Data Gateway




The landlord’s dilemma
When you’ve priced
  yourself out of
    business
Cloud is utility, but
  your service may be
         more
• Measurement based pricing exists in
  infrastructure tier
• Know your customer, who are they and
  where in the value chain you act
• Don’t get into race to the bottom
When 200 users
becomes 2000
  accounts
Choosing a BASIC
     starting point

• Already had an LDAP infrastructure
• Straightforward integration with console
  and other access tools
• Easy to do BASIC authentication
Remember users
     (and api users)
• Basic Auth is not a good choice for an API
  over time
• System integrators need delegated access
• Hard to clean up accounts when there are
  multiple owners
When
 myapp.cloudhub.io
     becomes
myapp001.cloudhub.io
How to present the
        iApps

• X.cloudhub.io
• DNS is flexible to deal with
• clear branding
X.cloudhub.io woes

• Namespace contention
• qa.cloudhub.io isn’t really an iApp
• need to maintain blacklist
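
The blacklist idea above can be sketched as a name check at claim time. This is only an illustration: the RESERVED set, the label rule, and the is_available helper are assumptions made for the example, not CloudHub's actual implementation.

```python
import re

# Hypothetical reserved labels: names the platform itself needs under the shared domain.
RESERVED = {"www", "qa", "api", "admin", "mail", "status"}

# One DNS label: 1-63 chars, lowercase letters, digits, hyphens,
# with no leading or trailing hyphen.
LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def is_available(name, taken):
    """Return True if `name` may be claimed as <name>.cloudhub.io."""
    label = name.lower()
    if not LABEL.fullmatch(label):
        return False          # not a valid DNS label
    if label in RESERVED:
        return False          # platform-owned name, e.g. qa
    return label not in taken  # namespace contention with another tenant
```

A tenant who loses the race for myapp ends up asking for myapp001, which is the contention problem the slide title describes; the check only reports the collision, it does not solve it.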
When mule isn’t mule
PaaS is more than java
     -jar mule.jar

• CloudHub adds services integration to
  Mule
• Logging, Event Tracking, Replay, etc.
appstack -> platform is
        tricky
• transparent features and also compatible?
• dealing with network streams that could be
  more brittle
• matching serialization/marshalling w/ cloud
  features like streaming
When SLA turns into
      refund
Desire to rely on more
       services

• Cloud Infrastructure
• Cloud Search
• Cloud Scaling
Reality of relying on
    more services
• uptime drops with every service
  dependency you add
• services may underperform their SLAs with
  little financial impact
• you may need to manually deal with service
  outages
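
The uptime point can be made concrete with a little arithmetic. The sketch below assumes every dependency must be up at the same time and that failures are independent, which is a simplification; the function names are made up for the example.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a service that needs every dependency up at once.

    Assumes failures are independent, which is optimistic in practice.
    """
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    three_deps = composite_availability([0.999, 0.999, 0.999])
    print(round(three_deps, 6), round(downtime_hours_per_year(three_deps), 2))
```

Under these assumptions, three dependencies at 99.9% each leave the platform at about 99.7%, or roughly 26 hours of downtime a year against under 9 hours for a single dependency.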
When logging turns
 into a big data
     problem
Customers desire real
    time search
• need to centralize and index logs
• using ElasticSearch can avoid service fees or
  license fees
• with a custom logging plugin, we can
  redirect output to the cluster
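
A logging plugin of this kind might look roughly like the sketch below, written against Python's standard logging module rather than Mule's logging stack. The ClusterHandler class and its ship callable are hypothetical; ship stands in for whatever client writes a batch of documents to the search cluster.

```python
import logging


class ClusterHandler(logging.Handler):
    """Buffers log records and hands them to `ship` in batches.

    `ship` is a hypothetical callable standing in for the client that
    writes a batch of documents to the search cluster.
    """

    def __init__(self, ship, tenant, batch_size=100):
        super().__init__()
        self.ship = ship
        self.tenant = tenant
        self.batch_size = batch_size
        self.buffer = []

    def emit(self, record):
        # Tag every document with its tenant so searches can be scoped.
        self.buffer.append({
            "tenant": self.tenant,
            "level": record.levelname,
            "logger": record.name,
            "timestamp": record.created,
            "message": record.getMessage(),
        })
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        try:
            self.ship(batch)
        except Exception:
            # Cluster unavailable: keep the batch so a later flush can retry.
            # A real plugin would bound this buffer or spill to disk.
            self.buffer = batch + self.buffer
```

The except branch is where the questions on the next slide start: an unbounded in-memory buffer is lost when the worker is replaced.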
Logging is always a big
      problem
• Clusters can fail for reasons beyond
  servers deployed
• API design for logging is different
• What happens if your disk fails or your
  cluster fails?
• What happens when you replace a worker?
Real men test in
  production
Testability is crucial

• each dependency needs to be testable and
  mockable
• devs need a local environment that
  matches, or your test cases will suffer
• creation of new tenants means more
  money.. test it!
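
One way to keep a dependency testable and mockable is to inject it. The sketch below assumes a billing dependency with a single create_account call; FakeBilling and TenantService are hypothetical names for the example, not part of any real billing API.

```python
class FakeBilling:
    """In-memory stand-in for an external billing service with no sandbox."""

    def __init__(self):
        self.accounts = {}

    def create_account(self, tenant_id, plan):
        if tenant_id in self.accounts:
            raise ValueError("duplicate tenant: " + tenant_id)
        self.accounts[tenant_id] = plan


class TenantService:
    """Creates tenants; the billing client is injected so tests can swap it."""

    def __init__(self, billing):
        self.billing = billing
        self.tenants = set()

    def create_tenant(self, tenant_id, plan="basic"):
        # Bill first: a tenant that exists without a billing account is lost revenue.
        self.billing.create_account(tenant_id, plan)
        self.tenants.add(tenant_id)
        return tenant_id
```

In production the same TenantService would receive the real billing client; the test suite exercises tenant creation against the fake on every build.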
Platform testing is really
          hard

• Some external deps don’t have sandboxes
• Can you try 500 applications?
• Can you maintain a quiet production
  “neighborhood” while testing QA?
When security updates
 = vi ipsec.conf in for
          loop
Security in a public
     service is hard
• assume user is infinitely clever and
  malicious
• deny by default vs service simplicity
• maintain segregation and availability of
  tenants
• Asset value can vary widely across tenants
Security design touches
       everything
• ipsec is hard to maintain without proper
  CM, and wasn’t built for noisy network
• deny by default means higher maintenance,
  and not all products support it
• it is easy to violate tenancy segregation in a
  platform
• you may have to hire consultants
When your
management service
   goes haywire
automation automation
     automation

• myriad of technology to automate scaling
  and availability
• policies can be fine tuned to relaunch or
  scale out based on system feedback or api
What about network
        splits
• Will your management server “heal”
  something that is already around?
• Is your management server on the same
  failure plane as your managed servers?
• Will you end up with manual intervention
  controls (aka red button)?
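
One conservative guard against "healing" a worker that is still around is to ask several independent observers and act only when a majority agree. The sketch below is an assumption-laden illustration: should_relaunch and the shape of reports are invented for the example, and a real system would also need to fence the old worker.

```python
def should_relaunch(reports, quorum=None):
    """Decide whether to replace a worker.

    `reports` maps observer name -> True (worker reachable),
    False (worker unreachable), or None (observer did not answer).
    """
    votes = [r for r in reports.values() if r is not None]
    needed = quorum if quorum is not None else len(reports) // 2 + 1
    if len(votes) < needed:
        # Too few observers answered: more likely our own network is split.
        return False
    down = sum(1 for r in votes if r is False)
    return down >= needed
```

When the management server cannot reach its observers it does nothing, which trades slower recovery for fewer duplicate workers; that trade is where a manual red button tends to come from.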
When your api design
    haunts you
Put an API on
         everything

• Allows automation and guis besides what
  you’ve invented
• simplifies testing
• eat your own dogfood
Design redo is a big
       problem
• GUIs can change more easily as humans drive
  them
• Maintaining old apis may not be worth it
• People may depend on bugs or semantic
  gaps
• Version practices in REST are not uniform
• remember, understanding the state machine
  is a prerequisite for HATEOAS
When 5 retries
becomes a DDoS
    attack
We want to build
      resilient apps
• recovery is a part of the service you
  provide, more important as you go up in
  value chain
• connections should assume failure and be
  able to reconnect to dependencies
• recovery is non-trivial
5 retries is code smell
• things that back up or fail can get worse
  with naive error retry loops
• APIs often can be made to include data
  about when to retry or that you need to
  slow down
• Treat resilience as a requirement, not a
  feature
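
A less naive retry loop could look like the sketch below: exponential backoff with full jitter, deferring to the server's hint when one is given. RetryableError and its retry_after field are hypothetical; they stand in for however a given API signals "slow down" (for example an HTTP Retry-After header).

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type; `retry_after` carries the server's hint in seconds."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def call_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call `operation`, backing off exponentially with full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            if err.retry_after is not None:
                delay = err.retry_after  # the server told us when to come back
            else:
                # Full jitter spreads clients out so they do not retry in lockstep.
                delay = rng() * min(cap, base * 2 ** attempt)
            sleep(delay)
```

The sleep and rng parameters exist so the loop can be tested without waiting; the attempt count and delays are placeholders to tune per dependency.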
When your users ask
the same questions
Wrong words suck

• Some terms seem sensible in design
  discussions, but the public uses something else
• Changing requires retraining, and thorough
  doc review
• What goes online lingers
When a feature
request implies new
    architecture
Platform changes
• Customers are looking for service, not
  explanations of why it is hard
• Adding value implies tough decisions on
  new features
• As the world turns, expectations rise
• Know your customer
Real-time, full-text
search, streaming.. oh
         my!
• Not all databases support full-text search,
    esp with partitioning
• Some data is better stored in S3, how does
    that affect indexing strategy?
• Real-time tools are emerging but immature
When you end up with
  a “lock” table in
       mongo
Datastore diversity!

• NoSQL datastores like Mongo are
  attractive and energize developers
• Cloud provisioners like RDS-driven MySQL
  are also attractive
• Specialized stores like CloudWatch for
  statistics
Don’t expect mongo to
       do magic
• Database Engines Mature
• Consistent backups are tricky and only
  recently supported
• Data Ops and visualization tools are
  emerging
• There are type safe bridges like Morphia
Hammers and
        screwdrivers
• In a pinch, you can knock in a screw with a
  hammer, but you can’t screw in a nail with a
  screwdriver
• Don’t throw data into whatever store
  happens to be easy to grab, even if you can.
• Rechecking data assumptions at T1 is better
  than T3. At T6, you may have a disaster
Summary
multi-tenant platform


• Own your dependencies or they will own
  you
• Add time for entropy
• Repeatedly remind yourself you are a
  landlord
Architecture as
iterative development

• Forethought
• Critical debate
• Decision review
‣ @adrianfcole
‣ adrian.cole@mulesoft.com
‣ www.cloudhub.io


Editor's Notes

  • #2 photo credit http://wallpoper.com/wallpaper/movies-godzilla-271845
  • #3 Not going to focus on normal dev/ops except in context to multi-tenancy
  • #7 Keep tenant safe, protect all tenants, stay in business http://whosjack.wpengine.netdna-cdn.com/wp-content/uploads/2012/05/terraced_houses_manchester_298792.jpg
  • #9 Integration customers range from potato to dev/architect; high value features are not easy to pay for on a per-message basis, esp when some services run only 150 messages/month; find the right pricing model. Users who want to just use mule by itself can use ec2 or heroku
  • #11 LDAP infrastructure existed for Mulesoft community
  • #12 how do you handle lockout of users? system keys, etc. Who’s building them?! SIs need access to create apps for other users, and account conflation leads to N accounts. Some users just want access to download mule and docs and will never become a cloudhub user
  • #14 also follows patterns like s3 which is largest cloud service
  • #15 x.tenant.cloudhub.io can do more w/ a tenant
  • #18 also introduces another complexity in release process and opportunity for laptop != prod
  • #21 ex. dynamo says fully reliable, but if dynamo is out you can only “wait”
  • #24 some problems are much harder than they seem. Searching, indexing, chronology are difficult and emerging products can suffer from reliability. Logging is also a core means to troubleshoot problems, so if logging is a problem, it is a big problem… significant effort and expertise to nail. cluster *will* eventually become split-brain; how long to restore service from rebooting?
  • #27 ex. marketo has no sandbox, neither does billing system. Even if you reproduce prod apps, how do you reproduce behavior of them? *corner cases are the ones likely to stress your platform out*
  • #30 desire for ssh access can thwart your firewall rules. Each ipsec.conf has tunnels configured for each of its peers, and needs to recognize one side of the tunnel as itself. This results in each host's ipsec.conf being unique to that instance, so you can't collapse the hosts into a class, but have to manage each one separately. You need to use config management to role-assign this. We moved off VPC due to many problems with using ipsec, yet still have the inter-region problem. Do you know a solution?
  • #33 ex. failures in management can lead to conservative healing and scaling policies or mandatory user intervention
  • #36 sometimes people code that true = false as opposed to reporting a bug. api design can go loveable then hateable, or hateable first. MVP approach may backfire when you are dealing with a *public* service. Keep it as simple as possible, and expose conservatively
  • #38 know your customer and if they are likely to be savvy enough to recover from system failures
  • #39 http://www.slideshare.net/benjchristensen/fault-tolerance-in-a-high-volume-distributed-system
  • #41 ex. notification vs activity feed, streaming? once information escapes your network, it can haunt you with clashing instructions, stackoverflow, etc.
  • #43 ex. what is streaming?
  • #44 mysql doesn’t support full text search on partitioned tables. ex. druid just released yesterday, twitter storm only a year old. Move the problem from the customer to us, which includes the technical profile and migration. A small problem becomes a big problem when a customer desires a capability that is unsupported or difficult to support with the existing (datastore|infrastructure) or indexing strategy
  • #47 consistent backup wasn't possible until very recently (ext4). immature by DB standards, though older than its years. devs love it, and lack of tools is a problem; basically have to use navicat, etc. For DataOps, stuck with command line and visualization problems. No way to do analytics on top of mongo without telling them to write some javascript, and the answer might be to transform to postgresql
  • #48 we should use mongo or RDS; mongo isn't being used correctly: it is used as a relational database, so it has transactional data. We now have event tracking, but we don't have a document for the event configuration. Have to store the whole thing or you will have tx data problems; main problem is that it is not being used correctly. Even though it seems you can store relational data in a nosql store like mongo, that doesn’t mean you should. 2 (or more) types of datastores may be the most supportable answer to your data problem. A. we use it as Tx (so clashing or overlapping writes). B. it doesn't give you mature features (like consistent backup)
  • #51 pretty much everyone has to be DevOps. design can be refactored, but is tough to change at scale. your job changes often, so the right tool for the job also changes. realize choices made now can be difficult to change a year on