Chaos is a ladder !

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

Published in: Technology

  1. FULLSTACK TECH RADAR DAY | CHAOS is a Ladder | Haggai Philip Zagury (hagzag), DevOps Group & Tech Lead @ Tikal Knowledge
  2. Haggai Philip Zagury, DevOps Group & Tech Lead, 10+ years @ Tikal. My open thinking and open techniques ideology is driven by Open Source technologies and the collaborative manner that defines my M.O. My solution-driven approach is strongly based on hands-on work and a deep understanding of operating systems, application stacks and software languages, networking, cloud in general and, today, more and more cloud native solutions. @hagzag
  3. What is Chaos Engineering? The philosophy behind Chaos Engineering
  4. Chaos means many different things to different people… http://bit.ly/2VQGCup
  5. In 1 Sentence ‣ Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Building Trust
  6. Building Resilient Trust in systems is hard! Backend, DevOps, Frontend & Mobile
  7. Frontend
  8. Backend
  9. DevOps
  10. DevOps
  11. Building confidence in computer systems is hard!
      ● Systems fail (some “Design to Fail”)
      ● “Best Effort” infra
      ● *aaS
      ● Cloud
      ● Cloud native
      ● Hybrid Cloud
      ● …
  12. Experiment in Production!
  13. In addition to “Traditional Testing”
      ● Chaos Engineering goes beyond traditional (failure) testing in that it's not only about verifying assumptions. It also helps us explore the many unpredictable things that could happen and discover new properties of our inherently chaotic systems.
  14. Hypothesis-Driven Experiments
      ● Hypothesis: define your steady state
  15. Hypothesis-Driven Experiments
      ● Hypothesis: define your steady state
      ● Experiment by challenging it
  16. Hypothesis-Driven Experiments
      ● Hypothesis: define your steady state
      ● Experiment by challenging it
      ● Analyse your findings and spread the word
  17. Hypothesis-Driven Experiments
      ● Hypothesis: define your steady state
      ● Experiment by challenging it
      ● Analyse your findings and spread the word
      ● Note the action items
      ● Perhaps run another round with other limits / variables
      ● Immunize your system (eventually)
      Immune
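The hypothesis-driven loop on the slides above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `FakeService`, `steady_state`, `run_experiment` and the "at least 2 healthy replicas" threshold are hypothetical names and values, not taken from the talk or from any chaos tool.

```python
from dataclasses import dataclass


@dataclass
class FakeService:
    """Hypothetical in-memory stand-in for a replicated service."""
    replicas: int = 3
    healthy: int = 3

    def kill_replica(self) -> None:
        # The "turbulent condition": one replica stops responding.
        self.healthy = max(0, self.healthy - 1)

    def recover(self) -> None:
        # Rollback: bring every replica back.
        self.healthy = self.replicas


def steady_state(service: FakeService, min_healthy: int = 2) -> bool:
    """Hypothesis: the service is fine while `min_healthy` replicas respond."""
    return service.healthy >= min_healthy


def run_experiment(service: FakeService, failures: int) -> dict:
    """Check the steady state, challenge it, check again, then roll back."""
    findings = {"steady_before": steady_state(service)}
    if not findings["steady_before"]:
        # Do not experiment on a system that is already unhealthy.
        return findings
    for _ in range(failures):
        service.kill_replica()
    findings["steady_during"] = steady_state(service)
    service.recover()
    findings["steady_after"] = steady_state(service)
    return findings


if __name__ == "__main__":
    # First round with one failure, then another round with a harsher limit.
    for failures in (1, 2):
        print(failures, run_experiment(FakeService(), failures))
```

A `steady_during` value of `False` is the finding to analyse and share: it marks the point where the hypothesis stops holding and an action item is due.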
  18. Chaos engineering is:
      ● Like injecting a vaccine to immunize yourself.
      ● Increasing system resilience by discovering vulnerabilities
      ● Identifying failure before it becomes an outage
      ● Better defining your steady state (iteratively) and constantly challenging it.
  19. Chaos engineering isn’t:
      ● Breaking down production on purpose.
      ● A (new) blame mechanism
      ● Surprising partial outages.
      ● Taking down the whole system at the same time.
  20. Chaos Engineering Origins? How did we get here?
  21. DevOps 2010
  22. DevOps 2010
  23. DevOps 2010 2011 FaaS
  24. DevOps 2010 2011 1998 How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New … 25 years Resilience partitionist
  25. DevOps 2010 2011 1998 How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New … 25 years Resilience partitionist http://erikhollnagel.com/ideas/resilience-engineering.html “A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and …” Resilience Engineering
  26. Unleash the Army DevOps 2010 2011 2014 Chaos Engineer Role Announced
  27. DevOps 2010 2011 2014 Chaos Engineer Role Announced gremlin.com Failure as a service Unleash the Army 2015
  28. DevOps 2010 2011 2014 Chaos Engineer Role Announced gremlin.com Failure as a service 2017 Unleash the Army 2015 “A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and …” Resilience Engineering
  29. DevOps 2010 2014 2011 http://erikhollnagel.com/ideas/resilience-engineering.html 2015 “A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and …” Resilience Engineering 2017 2016 Building trust in Chaos Engineering 1998 Chaos Engineer Role Announced
  30. Where we meet Chaos. How did we get here?
  31. Where we meet Chaos. Chaos starts here
  32. In 1 Sentence
      ‣ Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
      ‣ Preparing for the unknown …
      Building Trust
  33. Turbulent condition: failing node in a cluster (cluster diagram: default, services a and b)
      ● 2 services in a 3-node cluster
  34. Turbulent conditions (cluster diagram: default, services a and b)
      ● What’s my application going to suffer from?
  35. Turbulent conditions (cluster diagram: default, services a and b)
      ● 2 services in a 3-node cluster
      ● What’s my application going to suffer from?
      ● Is this OK?
  36. Turbulent conditions (cluster diagram: default, services a and b)
      ● Back to Normal
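To make the "2 services in a 3-node cluster" question concrete, the following sketch counts how many replicas of each service survive when any single node fails. The placement in `PLACEMENT` and the node names are assumptions for illustration only; the slides do not say which replica runs on which node.

```python
from collections import Counter

# Hypothetical placement of replicas of services "a" and "b" on three nodes.
PLACEMENT = {
    "node-1": ["a", "b"],
    "node-2": ["a", "b"],
    "node-3": ["a", "a"],
}


def surviving_replicas(placement: dict, failed_node: str) -> Counter:
    """Count the replicas per service that remain when `failed_node` is gone."""
    counts = Counter()
    for node, pods in placement.items():
        if node != failed_node:
            counts.update(pods)
    return counts


def impact_report(placement: dict) -> dict:
    """For every possible single-node failure, list surviving replicas per service."""
    services = sorted({svc for pods in placement.values() for svc in pods})
    report = {}
    for node in placement:
        left = surviving_replicas(placement, node)
        report[node] = {svc: left.get(svc, 0) for svc in services}
    return report


if __name__ == "__main__":
    for node, left in impact_report(PLACEMENT).items():
        print(f"{node} fails -> surviving replicas: {left}")
```

With this assumed placement, losing node-1 or node-2 leaves service b with a single replica. Whether that is OK depends on how the steady state was defined.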
  37. Turbulents
  38. How to practice Chaos Engineering? Prerequisites + Tools of Chaos Engineering
  39. Practice
      ● You should have:
      ● GameDays
      ● ChaosDays
      ● Controlled & scheduled drills / experiments
  40. Practice & Collaborate
      ● You should have:
      ● GameDays
      ● ChaosDays
      ● Controlled & scheduled drills / experiments
  41. It’s slowly becoming a culture https://github.com/dastergon/awesome-chaos-engineering
  42. Automation is key!
  43. Monitoring (ROI) Observability DevOps
  44. Not just graphs and logs (that too)
      ● RCAs: record them and make sure they can be reached!
      ● Document, Document, Document - great resources on how to do that.
      ● We don’t Chaos everything …
      ● Only what makes sense / repeats
      ● Game / Chaos Days -> keep experiment definitions for GameDay / ChaosDay to define
  45. SLA … is innovation driven - how fast did you go without failing? https://cloudplatformonline.com/rs/248-TPC-286/images/DORA-State%20of%20DevOps.pdf
  46. SLA … is innovation driven - how fast did you go without failing? https://cloudplatformonline.com/rs/248-TPC-286/images/DORA-State%20of%20DevOps.pdf
  47. Experiment!
  48. What layer? All! Application, Caching, Database, Hardware, Network
  49. The ultimate chaos: “Butterfly Effect” / “Domino Effect”
      ● How will my application do
        ● without cache?
        ● without a certain API available?
        ● with n sessions?
  50. The ultimate chaos: “Butterfly Effect” / “Domino Effect”
      ● How will my application do
        ● without cache?
        ● without a certain API available?
        ● with n sessions?
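One way to explore the "without cache?" question before touching a real system is to simulate the outage in code. The sketch below is an assumption-laden illustration: `FlakyCache`, `CacheDown`, `slow_lookup` and `fetch` are hypothetical names, and the fallback policy (go straight to the backend when the cache is unreachable) is one possible design, not something the slides prescribe.

```python
class CacheDown(Exception):
    """Raised by the fake cache when it is switched off."""


class FlakyCache:
    """Hypothetical in-memory cache that can be made unavailable on demand."""

    def __init__(self):
        self.data = {}
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheDown("cache unreachable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheDown("cache unreachable")
        self.data[key] = value


def slow_lookup(key):
    """Stand-in for the database or API behind the cache."""
    return f"value-for-{key}"


def fetch(key, cache, backend=slow_lookup):
    """Return (value, source); degrade to the backend if the cache is down."""
    try:
        cached = cache.get(key)
    except CacheDown:
        return backend(key), "backend (cache down)"
    if cached is not None:
        return cached, "cache"
    value = backend(key)
    try:
        cache.set(key, value)
    except CacheDown:
        pass  # a failed cache write must not fail the request
    return value, "backend"


if __name__ == "__main__":
    cache = FlakyCache()
    print(fetch("user-1", cache))   # first call goes to the backend
    print(fetch("user-1", cache))   # second call is served from the cache
    cache.available = False         # inject the failure
    print(fetch("user-1", cache))   # still answers, only slower
```

The same pattern extends to the other questions on the slide, for example replacing `slow_lookup` with a function that raises to simulate an unavailable API.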
  51. Applying Chaos Engineering practices: Log | Measure, Monitor, Break Things & Auto Recover, Experiment, Full Cycle - Chaos, Immune. Application, Caching, Database, Hardware, Network, Security
  52. Where is Chaos going? "the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production."
  53. Toolz
  54. Failure as a service
  55. Game-day resources https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/ Planning your GameDay? Feel free to contact me directly, we’d be happy to help -> hagzag@tikalk.com
  56. Hypothesis - steady state
          {
            "name": "all-our-microservices-should-be-healthy",
            "type": "probe",
            "tolerance": true,
            "provider": {
              "type": "python",
              "module": "chaosk8s.probes",
              "func": "microservice_available_and_healthy",
              "arguments": {
                "name": "myapp",
                "ns": "default"
              }
            }
          }
  57. Experiment: terminate a pod!
      ● What to do
      ● When to do it
          {
            "type": "action",
            "name": "terminate-db-pod",
            "provider": {
              "type": "python",
              "module": "chaosk8s.pod.actions",
              "func": "terminate_pods",
              "arguments": {
                "label_selector": "app=my-app",
                "name_pattern": "my-app-[0-9]$",
                "rand": true,
                "ns": "default"
              }
            },
            "pauses": {
              "after": 5
            }
          }
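The probe and the action on the two slides above are fragments; on their own they are not a runnable experiment. The sketch below assembles them into one document, assuming the Chaos Toolkit experiment layout (`version`, `title`, `description`, `steady-state-hypothesis`, `method`, `rollbacks`), which is what the `chaosk8s` module names suggest. The title and description strings are made up for illustration, and the tolerance is written as a boolean on the assumption that the probe returns `True` when the service is healthy.

```python
import json

# Steady-state probe, taken from the "Hypothesis - steady state" slide.
steady_state_probe = {
    "name": "all-our-microservices-should-be-healthy",
    "type": "probe",
    "tolerance": True,  # assumption: the probe returns a boolean
    "provider": {
        "type": "python",
        "module": "chaosk8s.probes",
        "func": "microservice_available_and_healthy",
        "arguments": {"name": "myapp", "ns": "default"},
    },
}

# Action, taken from the "Terminate a pod" slide.
terminate_action = {
    "type": "action",
    "name": "terminate-db-pod",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
            "label_selector": "app=my-app",
            "name_pattern": "my-app-[0-9]$",
            "rand": True,
            "ns": "default",
        },
    },
    "pauses": {"after": 5},
}

# Hypothetical wrapper: title and description are illustrative only.
experiment = {
    "version": "1.0.0",
    "title": "Does the application survive losing one pod?",
    "description": "Terminate a random matching pod and re-check the steady state.",
    "steady-state-hypothesis": {
        "title": "Services are available and healthy",
        "probes": [steady_state_probe],
    },
    "method": [terminate_action],
    "rollbacks": [],
}


def to_json(document: dict) -> str:
    """Serialise the experiment so it can be saved as experiment.json."""
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    print(to_json(experiment))
```

Saved to a file, a document like this would typically be run with `chaos run experiment.json`, provided the Chaos Toolkit CLI and its Kubernetes extension are installed and pointed at a cluster where terminating pods is acceptable.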
  58. If you’re just peeking / evaluating
  59. Chaoskube
      ● chaoskube is a “chaos-monkey lite”: it takes down pods on a schedule to test your resilience (with some tweaks available via configuration)
      ● use --dry-run
      https://github.com/linki/chaoskube
  60. kube-bench: find vulnerabilities, check configuration flags, define your own policies.
  61. kube-hunter (Security)
      1. Remote scanning: to specify remote machines for hunting, select option 1 or use the --remote option. Example: ./kube-hunter.py --remote some.node.com
      2. Internal scanning: to specify internal scanning, use the --internal option (this will scan all of the machine's network interfaces). Example: ./kube-hunter.py --internal
      3. Network scanning: to specify a specific CIDR to scan, use the --cidr option. Example: ./kube-hunter.py --cidr 192.168.0.0/24

  62. Many, many more…
      ● Stay tuned for more stuff about Chaos Engineering
      ● https://www.tikalk.com/community
  63. Thank you for joining us. Haggai Philip Zagury, DevOps Group & Tech Lead @ Tikal