Jax Devops 2017 Succeeding in the Cloud – the guidebook of Fail


Many have gone before you along this path. Many have failed. A few have succeeded. All have scars. Although the journey is different for everyone there are common aspects to them all. In this talk we will cover our experiences in moving applications into the Cloud. What you must do. What you must not. What matters, what doesn’t.

In moving to the cloud there is no try.

In this talk:

* We’ll cover the core aspects of how the cloud differs from local data centers in terms of application design, runtime characteristics and operational considerations.
* We’ll explain, through various real-life examples, where things worked and where they didn’t.
* We end with a summary of the key elements to success and the major pitfalls to avoid.

Published in: Technology

  1. 1. Succeeding in the Cloud: The guidebook of Fail
  2. 2. Steve Poole – IBM • Making Java Real Since Version 0.9 • DevOps Practitioner • @spoole167
  3. 3. This talk • Comes from personal and team experiences as the leader of a DevOps team • Comes from weekly consultancy etc. with product teams and external customers
  4. 4. Agenda of Fail • Fail 0 – Believing Migration to Cloud is easy • Fail 1 – No Clarity of Purpose • Fail 2 – Lack of education • Fail 3 – Not kicking the tires enough first • Fail 4 – Ignoring unpleasant discoveries • Fail 5 – Fudging the hard decisions • Fail 6 – Lack of preparation • Fail 7 – Not enough exercise • Fail 8 – Too much excitement • Fail 9 - Big bang deployment • Fail A – A few other things
  5. 5. Fail 0.0 : Believing Migration to Cloud is Easy • ‘Cloud’ is not easy • It may be self-service but don’t be fooled • It may look like a nice walk into the forest to grandma’s house... • Get yourself together for a large and painful exercise. • Ever moved a Data Centre? • Experience is key. • Staff. Who’s going to do this – are they qualified? • Prepare to change your plans • Most migrations require architectural design changes within the first 6 months • Half of all projects fail • Half of all projects will need significant increases in budget • Think it through • Projects fail later on when new objectives get added, so be clear on your ultimate goal • Emigration not Migration (Migration suggests it’s something you want to do annually)
  6. 6. Fail 1.0: No Clarity of Purpose • There are many reasons for moving applications to the ‘Cloud’ • There are many types of application • There are many ‘Clouds’ to move to • What’s the chance of you getting it right first time? • What’s the consequence of failure? • Will you even know it has failed in time to recover? • Clarity of purpose reduces your risk • Clarity of purpose gives you focus
  7. 7. Fail 2.0 : Lack of Education or RTFM! • Not understanding the communications process • How do they talk to you? • What’s the ticketing system? • How do you get told of a problem? • How do you get told of planned outages? • How much notice do you get for planned outages? • How do you raise a problem? • How do you ESCALATE? • What is the communications SLA here? • Know your rights • DOH! Ask me about passwords
  8. 8. Fail 2.1 : Lack of Education or RTFM! • I was a single point of failure and I didn’t even know it • I thought I was in control of my account until I needed my password reset • I had no idea where the reset email was going • Cloud support could trigger the reset but wouldn’t/couldn’t tell me more • They suggested I go to my Admin, which I thought was me • Turns out there’s a corporate owner of the accounts • Took me days to resolve
  9. 9. Fail 2.2 : Lack of Education or RTFM! • The one thing you should remember from this talk • We’re techies. We get excited about APIs. We understand APIs • Moving to the Cloud means giving your data, applications, security etc. to a 3rd party. That means the ‘API’ extends into the human world • The contract and its SLA define what you can and cannot do when using Cloud services • Cloud providers benefit from economies of scale and have large numbers of customers, just like the more usual service providers you use at home: gas, electricity, broadband, satellite TV • You know how that can work at home. Cloud provisioning is much more complicated
  10. 10. Fail 2.3 : Lack of Education or RTFM! • Not understanding the Service Level Agreement • Does it have location-specific differences? • How is the SLA measured? • How well defined are the criteria? • How are issues resolved? • What are your responsibilities? • If you don’t know your SLA you will fail • Example: Can you assess free capacity? If a location is at capacity, what happens?
  11. 11. Fail 2.4 : Lack of Education or RTFM! • Not understanding the Service Level Agreement (2) • True story • Go to a service provider SLA dashboard • Service says SLA availability of 99.5% • I think that means downtime of: Daily: 43.2s, Weekly: 5m 2.4s, Monthly: 21m 54.9s, Yearly: 4h 22m 58.5s (these figures correspond to 99.95% availability) • Turns out that actual availability is 95.8%, which means downtime of: Daily: 1h 0m 28.8s, Weekly: 7h 3m 21.6s, Monthly: 1d 6h 40m 49.3s, Yearly: 15d 8h 9m 52.0s
  12. 12. Fail 2.5 : Lack of Education or RTFM! • Not understanding the Service Level Agreement (3) • True story • The difference is because the provider has a planned daily outage of 1hr • They still claim 99.5% • Gets worse • Outages beyond their control don’t ‘count’ either • At 85% availability, downtime is: Daily: 3h 36m 0.0s, Weekly: 1d 1h 12m 0.0s, Monthly: 4d 13h 34m 21.9s, Yearly: 54d 18h 52m 22.8s
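
The downtime figures on the SLA slides follow from simple arithmetic: allowed downtime is the period length multiplied by (100% minus availability). A minimal Python sketch of that calculation follows; the function names are illustrative and not from the talk. Note that 43.2s per day works out to 99.95% availability, whereas 99.5% allows 7m 12s per day.

```python
DAY_SECONDS = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Seconds of downtime permitted in a period at the given availability."""
    return period_seconds * (100.0 - availability_percent) / 100.0


def format_duration(seconds: float) -> str:
    """Render seconds in the 'Xd Xh Xm X.Xs' style used on the slides."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(int(minutes), 60)
    days, hours = divmod(hours, 24)
    parts = []
    if days:
        parts.append(f"{days}d")
    if days or hours:
        parts.append(f"{hours}h")
    if days or hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs:.1f}s")
    return " ".join(parts)


if __name__ == "__main__":
    # Availability levels mentioned on the slides, plus 99.95% for comparison.
    for pct in (99.95, 99.5, 95.8, 85.0):
        daily = allowed_downtime_seconds(pct, DAY_SECONDS)
        print(f"{pct}% -> daily downtime {format_duration(daily)}")
```

Running the same function with a week or a year as the period gives the other rows. The point of the slides stands either way: read how the provider measures availability before trusting the headline number.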
  13. 13. Fail 2.6 : Lack of Education or RTFM! • Not understanding the cost model • Units of cost: CPU / RAM / Network / Storage / IP Addresses ... • Penalty costs if you overrun? • When does the time start and end? • Costs change by location? • DOH! Ask me about GPUs
  14. 14. Fail 2.7 : Lack of Education or RTFM! • Not understanding the cost model • I’m testing new GPU support in IBM’s JVM 8.0 • IBM has GPU support in SoftLayer • Amazon has GPU support in AWS • I want to do some scale performance testing • Got my VirtualBox and Ansible config • Point it at AWS. Deploy < 1hr x 2 • Instance: p2.16xlarge, 16 GPU, 64 vCPU, 732 GB RAM, $14/hr • Costs me $39? • Other charges included
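
The GPU story can be checked with a few lines of arithmetic. The sketch below is a rough Python illustration, not the provider's billing logic: it assumes each run is billed in whole hours (an assumption about the billing model) and uses hypothetical run lengths of 0.9 hours, since the slide only says "< 1hr x 2". The $14/hr rate and the $39 bill come from the slide.

```python
import math


def compute_charge(hourly_rate: float, run_hours: list[float], round_up: bool = True) -> float:
    """Instance charge for a set of runs, optionally rounding each run up to a whole hour."""
    total = 0.0
    for hours in run_hours:
        billed_hours = math.ceil(hours) if round_up else hours
        total += billed_hours * hourly_rate
    return total


HOURLY_RATE = 14.0   # p2.16xlarge rate quoted on the slide
RUNS = [0.9, 0.9]    # hypothetical durations: two deployments of just under an hour
BILLED_TOTAL = 39.0  # total bill quoted on the slide

instance_cost = compute_charge(HOURLY_RATE, RUNS)
other_charges = BILLED_TOTAL - instance_cost

print(f"Instance time: ${instance_cost:.2f}")
print(f"Other charges: ${other_charges:.2f}")
```

Under these assumptions the instance time accounts for $28, leaving $11 from other billable units such as the network, storage and IP addresses listed on the previous slide.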
  15. 15. Fail 2.8 : Lack of Education or RTFM! • Not understanding how security and compliance are managed • What are the security, compliance and image update policies? • How did they handle the last pervasive vulnerability? • Firewalls – do you get one for free? Can you configure it? What’s the default policy for firewalls with deployments? • SSL certificates – do you own and manage them or do they offer a service? • How do you access your VMs? (ssh, telnet, web?) • Passwords vs keys? • Where are the keys kept? • Can you retrieve the keys in an emergency? • You do understand penetration attack vectors?
  16. 16. Fail 2.9 : Lack of Education or RTFM! • Misunderstanding what APIs exist • Are there APIs for all the actions you want to perform? • Are they symmetrical? • Do any need human interaction to complete? • Are the APIs proprietary or standard? • Are there plugins for IaC tools? • DOH! Ask me about VM termination APIs
  17. 17. Fail 2.A : Lack of Education or RTFM! • Lack of a Community • What do others think of this Cloud? • Is there an active DevOps community? • Do you see active participation from the Cloud provider?
  18. 18. Fail 3.0 : Not Kicking the tires enough first • Poor assumptions about ‘how things work’ • For instance: • “I don’t need a public IP address for my VM as I have a private gateway” • “Now I can’t do apt-get update!” • “What do you mean I have to buy public IP addresses?”
  19. 19. Fail 3.1 : Not Kicking the tires enough first • “The human touch” • If you don’t start with IaC techniques from Day 1 you will fail. • Environments are all different • Is your memory good enough? • You must encode. • Trying by hand and then encoding into IaC helps you learn about your target environments (APIs, anyone?) and builds up an IaC asset base you’ll need in the future.
  20. 20. Fail 3.2 : Not Kicking the tires enough first • Get a buddy - “Extreme Deployment” • Install VirtualBox and Vagrant • Build a Vagrantfile for an environment you care about • Provision locally: “vagrant up --provider=virtualbox” • Pick a Cloud. (Use the ‘free tier’!) • Try to deploy a VM by hand. • Now do “vagrant up --provider=XXXXXXX” • Examine the differences • Add more and repeat • Look for how IP addresses are allocated • Look at the options for memory size, networking, disk space, disk types (IO speeds) • What CPUs can you get? • What OSes can you provision? • What architectures are available? • What’s the cost?
  21. 21. Fail 3.3 : Not Kicking the tires enough first • Try another Cloud • Try someone’s IaC pattern • Ansible script to deploy a Docker swarm • Go wild: try to deploy OpenStack on your laptop (with 32GB) • Now do it all again with Docker
  22. 22. Fail 3.4 : Not Kicking the tires enough first • Not understanding that your initial deploys are the least secure • How long until your newly deployed VM is attacked? 20 seconds to 40 minutes • So deploying and then adding vulnerability patches is not the right answer • War story: • Customer deploys a VM to Cloud. • VM gets hacked immediately • Customer patches the VM. • Customer keeps the VM and uses it in production • Customer gets bill for $500,000 network traffic. VM is now being used to host warez
  23. 23. Fail 3.5 : Not Kicking the tires enough first • Time to think about security • If you don’t get your security posture defined before you deploy you’ll fail and possibly get some interesting bills • Maybe you’ll go out of business. • Worst case (maybe) is you have provided a gateway into your company network • Regular Vulnerability scanning & fixing. • Keys not passwords • Specific IP address access for VMs • Whitelisted access to internal systems (inside your firewall) • Whitelisted access to remote systems (on the internet) …
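
To make "specific IP address access" and "whitelisted access" concrete, here is a minimal Python sketch of an allowlist check using the standard `ipaddress` module. The networks listed are documentation-only example ranges and the names `ALLOWED_SOURCES` and `is_allowed` are hypothetical; in practice this policy would normally live in the provider's firewall or security-group configuration rather than in application code.

```python
import ipaddress

# Hypothetical allowlist: one office range and one single jump host.
ALLOWED_SOURCES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]


def is_allowed(source_ip: str, allowed=ALLOWED_SOURCES) -> bool:
    """Return True only if source_ip falls inside one of the allowed networks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)


if __name__ == "__main__":
    for candidate in ("203.0.113.42", "198.51.100.7", "198.51.100.8"):
        print(candidate, "allowed" if is_allowed(candidate) else "denied")
```

The design choice worth copying is the default: anything not explicitly listed is denied, which is the posture the slide argues for before the first VM is deployed.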
  24. 24. Fail 4.0 – Ignoring unpleasant discoveries • Not all the OSes you want are there • Performance of the Cloud is less than you expected • Now you know what multi-tenancy means. • Managing VMs in the Cloud is complicated • Keeping systems secure and compliant is hard • Deployment times vary (and deployments fail unexpectedly) • Debugging problems remotely is difficult • It costs more than you realized. • Cost is your responsibility. (No one is going to help you save money!) • Clouds fill up • So now you know some of those ‘unexpected’ restrictions • Initial cloud deployments are juicy targets for the bad guys
  25. 25. Fail 4.1 – Ignoring unpleasant discoveries • Deploy anyway. • Just run with a smaller JVM heap • OK, I get it won’t scale – deploy anyway and we’ll fit scaling later • You’ll just have to deploy with a small budget for VMs • Use the public multitenancy option – it’s cheaper. • Can’t you add some sort of cache? • I’m impressed by the number of customers who can change the rules of physics
  26. 26. Fail 5.0 – Fudging the hard decisions • Not realizing Clouds are sticky • Many of my consultancy discussions started with a company saying to itself: “It’s ok. If Cloud XXX is too expensive we’ll just move over to YYY” • You have to pick one. Changing your mind later is going to be expensive and complicated • IaC is critical but it’s not magic.
  27. 27. Fail 5.1 – Fudging the hard decisions • Compromising the architecture because of cost • Unexpected expensive items (such as network costs) can drive you to weird hybrid configurations that increase complexity and ultimately fail • For instance: a large rich-client application used in-house in multiple locations. • Plan was to consolidate into the Cloud. • Network traffic between client and servers measured in TBs/day • To reduce costs, plan was to create special proxies/data caches on-prem • Consequence: increased complexity of design, poor performance, untried new system -> fail. • Should have spent the money on replacing the rich client with a web-based one.
  28. 28. Fail 5.2 – Fudging the hard decisions • Not understanding the cost projections • Old data for example only
      Offering | RAM Cost (2015) | CPUs
      IBM Bluemix (CF) | $24.15 GB/Month | 4 vCPUs per instance
      IBM Bluemix (Containers) | $9.94 GB/Month | 4 vCPUs per GB
      [offering name not shown] | $21.60 GB/Month | 4 vCPUs per instance
      Heroku (Hobby) | $14.00 GB/Month | 1 "CPU share" per 512MB in an instance
      Heroku (Professional) | $50.00 GB/Month | 1 "CPU share" per 512MB in an instance
      Amazon EC2 (SLES) | $16.56 GB/Month | 1 vCPU per 4GB in an instance
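
A cost projection from figures like these is a straightforward multiplication, which makes it easy to do before committing to a provider. The Python sketch below uses the 2015 per-GB prices quoted on the slide (old data, for example only) for the offerings whose names are given; the function name and the 4 GB workload are hypothetical.

```python
# RAM price in USD per GB per month, as quoted on the slide (2015 data).
RAM_PRICE_PER_GB_MONTH = {
    "IBM Bluemix (CF)": 24.15,
    "IBM Bluemix (Containers)": 9.94,
    "Heroku (Hobby)": 14.00,
    "Heroku (Professional)": 50.00,
    "Amazon EC2 (SLES)": 16.56,
}


def monthly_ram_cost(ram_gb: float, instances: int = 1) -> dict[str, float]:
    """Projected monthly RAM cost per offering for a given footprint."""
    return {
        offering: round(price * ram_gb * instances, 2)
        for offering, price in RAM_PRICE_PER_GB_MONTH.items()
    }


if __name__ == "__main__":
    # Hypothetical workload: a single instance with 4 GB of RAM.
    projection = monthly_ram_cost(ram_gb=4)
    for offering, cost in sorted(projection.items(), key=lambda item: item[1]):
        print(f"{offering:28s} ${cost:8.2f}/month")
```

RAM price alone is not a full comparison: the CPU column differs per offering, and network and storage charges are not in this table.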
  29. 29. Fail 6.0 – Lack of preparation • Driving straight into live deployment • Premature deployment based on the happy path will ultimately fail • It is critical that you have exercised an end-to-end deployment and support model before you go live • So many projects fail because of problems later. • Even simple applications need security, logging and monitoring
  30. 30. Fail 6.1 – Lack of preparation • Not having a solid monitoring and diagnostics solution • Most successful cloud applications consider their monitoring solution to be the most critical part of their system • If your monitoring solution fails – you’re running blind • Build the monitoring system and then exercise it • Break things, scale things, build runaway jobs • Figure out what is important and monitor it • Now build dashboards • Do you get the events you need when you need them? • Are you measuring end user response times?
  31. 31. Fail 6.2 – Lack of preparation • Not having enough dashboards! • My team was a traditional IT one. Responded to tickets – so customers always found the problem first • We added dashboards and an objective: “First to Know” • We moved from being last to know to being the one to tell the customer. • Dashboards allowed my team to see issues clearly when there was a failure and when trends showed bad things were going to happen. • Dashboards changed my team’s attitudes. Makes automation and monitoring more acceptable
  32. 32. Fail 6.3 – Lack of preparation • Not having a robust and automated deployment solution • After your application goes live things will go wrong • It’s not just about having a robust application design. • How quickly you can remediate issues is dependent on your ability to deliver those fixes • Design for Failure. "Everything fails, all the time". Werner Vogels, CTO • Your deployment solution is your disaster recovery solution
  33. 33. Fail 6.4 – Lack of preparation • If your deployment solution is not your disaster recovery solution • Cloud location goes off-line -> can you fail over to a new location? • What happens if your database gets corrupted? • Where is your data backed up to? • Can you get the data back into the Cloud fast enough? • Who does the backups? • When was the last backup taken?
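
"Can you get the data back into the Cloud fast enough?" can be estimated before a disaster rather than during one. The Python sketch below computes a best-case transfer time from backup size and link speed; the 1 TB backup, 100 Mbit/s link and 8-hour recovery target are hypothetical numbers, and the `efficiency` parameter is an assumed fudge factor for protocol overhead and contention.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move data_gb (decimal gigabytes) over a link of bandwidth_mbps.

    efficiency is the fraction of nominal bandwidth actually achieved (0 < efficiency <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits_to_move = data_gb * 8 * 1000**3
    bits_per_second = bandwidth_mbps * 1000**2 * efficiency
    return bits_to_move / bits_per_second / 3600


def meets_recovery_target(data_gb: float, bandwidth_mbps: float, target_hours: float) -> bool:
    """True if the best-case restore time fits inside the recovery time target."""
    return transfer_hours(data_gb, bandwidth_mbps) <= target_hours


if __name__ == "__main__":
    # Hypothetical: 1 TB backup, 100 Mbit/s uplink, 8-hour recovery target.
    hours = transfer_hours(data_gb=1000, bandwidth_mbps=100)
    print(f"Best-case restore time: {hours:.1f} hours")
    print("Meets 8h target:", meets_recovery_target(1000, 100, target_hours=8))
```

With these numbers the upload alone takes roughly 22 hours, so the recovery target is missed before any database restore or verification has started.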
  34. 34. Fail 7.0 – Not enough exercise • Not testing how your application scales • Scale testing reveals bottlenecks • Even just running two instances can be revealing • Break things too (chaos monkey) • Your aim is to understand how well your application can react to demand • Scale across Cloud locations: Data costs increase? Response times get worse? Timeouts occur? • Scale testing reveals design issues in application and infrastructure. Things you want to know about before you go live. • And it tells you if your monitoring is going to be any use
  35. 35. Fail 7.1 – Not enough exercise • Failing to scale appropriately costs money • [Chart: Demand vs. Provisioned capacity over periods a to j, y-axis 0 to 120]
  36. 36. Fail 7.2 – Not enough exercise • Failing to scale appropriately costs money • [Chart: Demand vs. Provisioned capacity over periods a to j, y-axis 0 to 120]
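
The Demand vs. Provisioned charts make two kinds of cost visible: capacity you pay for but do not use, and demand you cannot serve. A minimal Python sketch of that comparison follows. The series values are hypothetical, since the chart data did not survive in the transcript, and the function name is illustrative.

```python
def provisioning_gap(demand: list[float], provisioned: list[float]) -> tuple[float, float]:
    """Return (wasted, unmet) capacity units summed over all periods.

    wasted: provisioned capacity above demand (paid for, unused).
    unmet:  demand above provisioned capacity (slow or failed requests).
    """
    if len(demand) != len(provisioned):
        raise ValueError("demand and provisioned must cover the same periods")
    wasted = sum(max(p - d, 0) for d, p in zip(demand, provisioned))
    unmet = sum(max(d - p, 0) for d, p in zip(demand, provisioned))
    return wasted, unmet


if __name__ == "__main__":
    # Hypothetical demand curve against a fixed, never-scaled allocation.
    demand = [20, 40, 60, 100, 80, 30]
    fixed = [60] * len(demand)
    wasted, unmet = provisioning_gap(demand, fixed)
    print(f"Fixed allocation: wasted={wasted}, unmet={unmet}")
```

A fixed allocation loses on both counts in this example; scale testing is how you find out whether your application can actually follow the demand curve instead.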
  37. 37. Fail 8.0 – Too much excitement • Projects can fail because of an excess of enthusiasm • “Let’s take the opportunity to rewrite the application” • “Let’s use this new tech” • Often fails due to a lack of situational awareness of the state of play in the industry • It’s easy to get carried away.
  38. 38. Fail 9.0 – Staged deployment • Going from Lift and Shift to what? • You can lift and shift. Probably going to bite you. • Unexpected dependencies on local items such as C:/ or local services and servers (authentication servers etc.) • Consider your options • The “strangler pattern” – staged conversion to microservices • Time for a rewrite? • Look at new options - “serverless”? • BTW – adding in sufficient debug capability can be just as expensive and increase risk • How far into the woods do you want to go?
  39. 39. Fail A.0 - A few other things • Cloud providers often offer additional services • Why build your own when you can use a provided one? • Skill sets • We have lots of tech experts but not that many systems experts. • Take a look at your team. Do they have the skills and experience you need? • IaC & DevOps skills? • Some parts of your process are going to become more critical than before • Who’s doing the data backups? • Who owns your build and test infrastructure? • Deployment process • How long does it take to deploy a change? • Does your team understand the importance of the process?
  40. 40. Wrap up • Moving anything into a cloud environment is always a challenge • Lack of clarity around why you want to do this will cost you money, sleep and probably doom the project • Be sure your team is skilled and committed. It’s their sleep too • Most of the projects that fail – fail because of the approach. Not the technology • But not understanding the economic drivers on systems will also lead to fail
  41. 41. Fail to adapt -> Fail • How you design, code, deploy, debug, support etc. will be affected by the metrics and limits imposed on you. • Financial metrics and limits always change behavior. They also create opportunity • You will have to learn new techniques and tools • Applications have to get leaner and meaner
  42. 42. Thank you