Octo and the DevSecOps Evolution at Oracle by Ian Van Hoven

The transition from 40 years of successful licensed software development to an agile-based SaaS business involves many challenges. Octo, a real-time streaming metrics framework built around the InfluxDB time series database, is aimed specifically at one of them: simplifying the collection and visualization of mission-critical operational data to enable a culture change toward metrics immersion and product ownership. Learn more by viewing this InfluxDays NYC 2019 presentation.

Published in: Technology

  1. Copyright © 2017, Oracle and/or its affiliates. All rights reserved. | Octo AND THE DEVOPS EVOLUTION @ ORACLE. Ian Van Hoven, Oracle Engineering Services. InfluxDays NYC ✣ March 13, 2019
  2. Let me tell you a (user) story
  3. Oracle aims to achieve 99.99%+ availability and industry-leading performance across the Cloud portfolio, to delight existing customers, attract new ones, and enable Oracle teams to innovate
  4. Achieving our SLA goals requires evolutionary change in how we monitor and measure our products and, especially, a cultural shift toward metrics immersion, supported and driven by ubiquitous, user-friendly tooling
  5. Octo will leverage proven industry-standard technology demonstrably scalable to Oracle's global footprint and volume, while complementing existing M&M tech … and allowing us to leverage innovation and attract/retain top talent
  6. The 1st order priorities are to improve Oracle Cloud customer satisfaction & protect revenue. Commercialization of Octo is explicitly a 2nd order priority
  7. SOUNDS EASY
  8. Scene from the Past
  9. Scene from the Past
  10. Looks good. Let’s do that.
  11. Modern Day: Silos Aplenty (Teams, Tools, Data)
  12. Modern Day: I have that data. Also, you don’t get that data.
  13. Modern Day: PreDevOps Monitoring Alert
  14. Modern Day: PreDevOps Math (Legacy Mindset): $fear{abstract} != $fear{concrete}
  15. Modern Day: PreDevOps Mindset: Why would devs ever want to be on-call? It would eliminate our incentive to ship.
  16. If you don’t like something…
  17. Octo Is: a stunningly innovative product acronym… Oracle Cloud Telemetry (Oracle)
  18. Octo Also Is: a real-time streaming metrics framework...
      • Conceptualized at May 2017 leadership offsite; GA since Thanksgiving 2018
      • Aimed directly at enabling culture change of product ownership & metrics immersion across Oracle
      • Aggregator & cross-correlator of discrete time series data: metrics from diverse sources in one view
      • Sharing and collaboration tool to connect colleagues from teams across all of Oracle
      • Easy to...
          • FIND: goto/octo (browser alias), /go octo (Slack command)
          • USE: up & running in seconds (read) or minutes (write)
          • SHARE: include graph URLs & images in Slack, email, ticket, preso, wall-screens, etc.
      • Common UX: Octo for Fusion looks/works like Octo for TOA, RNT, Eloqua, Taleo, etc.
      • Simple, secure, declarative, RESTful write API, abstracted from data pipeline & b/e data tier
      • Built with proven, scalable industry-standard tech: nginx, memcache, Kafka, InfluxDB/TICK, Grafana, etc.
      • Integrated with standard Cloud-wide tools: Slack, PagerDuty, Akamai, Dyn, CatchPoint, 1000 Eyes, Jira, etc.
      • Data source & product/property agnostic: collect data any way/where, then visualize & share w/ Octo
      • Imperfect, which was the plan all along
  19. Octo Is Not
      • Monitoring tool or alerting framework
      • Replacement for [insert monitoring solution here]
      • Troubleshooting cure-all
      • Data analytics warehouse
      • Auto-remediation framework
      • Application profiler / testbench
      • Asset management system
      • Runtime dependency
      • Log processor/aggregator/analytics
      • Big data (map/reduce) platform
      • Reporting platform (canned or ad-hoc)
  20. Octo Use Cases
      • General Golden Signal Awareness: Business KPIs, Availability, RPS, Latency, Errors, Consumption
          Every SaaS engineer (Dev, SRE, NOC, CS) should know peak weekday FE RPS, and be aware of (and responsive to) any WoW divergence
      • Troubleshooting Accelerator: quickly narrow down outage scope, identify next step/s & escalation PoC/s
          “Back End” dashboard shows sharp DB read/write RPS plunges; anyone can infer something amiss at DB tier (net? sys? sw?); pivot to specialized tool/s to discover precise failure RC/s
          “SaaS Cloud Web FE” dashboard shows sharp increase in LB 4xx error RPS, sharp drop in LB 2xx OK RPS, drop to 0 RPS in HTTPD RPS; pivot to LB forensics & log analysis to identify IP whitelist RC
      • ChM Validation, Incremental Rollout A|B Test: ethereal/point-in-time dashboarding
          NetEng, SvcOps & GNC monitor one dashboard showing LB VIPs/throughput/RPS, app RPS/latency, business KPIs, APM availability, etc. to confirm GBIC swap successful
          Time shift (HoH, DoD, WoW, MoM) allows quick analysis of impact/delta for new code/infra/etc.
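The time-shift comparison on slide 20 (HoH, DoD, WoW, MoM) amounts to comparing each point with the point one period earlier. A minimal sketch of that arithmetic follows, assuming the series is a plain mapping of epoch-second timestamps to values. The function names, the sample data, and the 20% threshold are illustrative assumptions, not part of Octo.

```python
WEEK = 7 * 24 * 3600  # seconds in one week, for a WoW comparison


def shifted_delta(series, shift_seconds):
    """Percent change of each point versus the point `shift_seconds` earlier.

    `series` maps epoch-second timestamps to values. Points with no
    baseline, or with a zero baseline, are skipped.
    """
    deltas = {}
    for ts, value in series.items():
        baseline = series.get(ts - shift_seconds)
        if baseline is None or baseline == 0:
            continue
        deltas[ts] = (value - baseline) / baseline * 100.0
    return deltas


def divergent(deltas, threshold_pct):
    """Timestamps whose absolute percent change exceeds the threshold."""
    return sorted(ts for ts, pct in deltas.items() if abs(pct) > threshold_pct)


# Hypothetical front-end RPS samples: two from last week, two from this week.
rps = {0: 1000.0, 60: 1100.0, WEEK: 1050.0, WEEK + 60: 550.0}
wow = shifted_delta(rps, WEEK)
flagged = divergent(wow, threshold_pct=20.0)
print(wow, flagged)
```

Swapping `WEEK` for an hour, a day, or a month of seconds gives the HoH, DoD, and MoM variants of the same comparison.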
  21. Octo Use Cases (cont’d)
      • Presentations
      • Product usage awareness (features)
      • Pre-emptive capacity/scaling strategic planning (optimize or add capacity)
      • Reactive troubleshooting (post-outage start)
      • Slow degradation catch
      • Corpus of “detailed” data for after-the-fact analysis (not necessarily on permanent graphs/dashboards)
      • Per-customer usage/performance/availability
  22. So…?
  23. Octo Today
      The Good…
      • 8 “customers” onboarded; steady trickle of inbound activation requests; live in 5 PoPs (NorAm & EMEA)
      • FE API latency dropping as RPS increases (~35ms at 50th %ile)
      • “Shoot me the Octo graph” & “Put it on the Octo dash” heard more & more often in meetings/hallways
      • Octo acquiring/retaining real estate on Oracle SaaS Cloud NOC “big board” & on hallway TVs
      • Bi-weekly releases live to site, mostly error free
      • Continued investments in o4o (“Octo for Octo”), a separate parallel ecosystem to customer Octo (thanks, Paul)
      • Energized & ambitious team, collaborating closely with customers
      The Imperfect…
      • No customer self-onboarding; minimal onboarding automation; no read API; no DWH integration
      • Time-shift CPU overhead driving investments in downsampling via CQ (near term) and Kapacitor (long term)
      • Code and config living together in sin; general config messiness; documentation gaps
      • Still in PIPD phase (evolutionary precursor to CICD)
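Slide 23 mentions downsampling via continuous queries (CQ) as the near-term answer to time-shift CPU overhead. The deck does not show the actual queries, so the sketch below only illustrates the kind of rollup such a query typically computes: the mean of raw points per fixed time bucket. The function name, bucket size, and sample data are assumptions made for illustration.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Roll raw (timestamp, value) points up to one mean per time bucket.

    Buckets are aligned to multiples of `bucket_seconds` since the epoch.
    Returns a list of (bucket_start, mean) pairs sorted by bucket start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical raw samples (epoch seconds, value) rolled up to 60-second means.
raw = [(0, 10.0), (20, 20.0), (59, 30.0), (60, 40.0), (119, 60.0)]
rollup = downsample_mean(raw, bucket_seconds=60)
print(rollup)
```

Querying a precomputed rollup like this for a week-old or month-old window reads far fewer points than scanning the raw series, which is the usual motivation for downsampling.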
  24. Why Influx[DB|Data]? (Non-Exhaustive Un-Sorted List)
      1. Purpose built for our use case (streaming metrics)
      2. Open source, with a strong/focused/visionary company & active community behind it (à la Elastic, Mongo, etc.)
      3. Seamless integration with leading visualization & alerting/messaging platforms
      4. So intuitive/easy even a VP can use it
      5. TICK stack simplifies key considerations around collection/ingest & aggregation/downsampling
      6. Horizontal scaling via TSDB clustering when needed
      7. Not jerks; easy to work/collab with; have seen almost everything; very responsive; super patient *
      8. Givers of sage advice ☛ Paul Dix: “Don’t monitor your monitoring system with your monitoring system”
      9. See #1
  25. Questions?