Driving TAS Enterprise Fitness

SpringOne 2020
Seth Jones: Solution Owner, Slalom LLC;
Ishaan Khurana: Data Scientist / Analyst, Ford Motor Company;
Tom Woods: Platform Services Analytics and Billing Supervisor, Ford Motor Company;
Kyle Hinton: Solution Architect, Slalom Detroit

  1. 1. About Us
      Tom Woods: Platform Services Analytics and Billing Supervisor, Ford Motor Company
      Ishaan Khurana: Data Scientist / Analyst, Platform Services Analytics and Billing team, Ford Motor Company
      The Platform Services Analytics Billing Automation team at Ford measures, drives, and benchmarks the value of Ford Strategic Cloud Platform investments.
  2. 2. Problem Statement
  3. 3. PCF Billing Unit Calculation
  4. 4. Measure Investment Value
  5. 5. Fitness Actions
      Metrics Store For PCF Tile
      • Fitness Reports identify:
        o Lazy AIs
        o Bloated AIs
        o Overstaffed AIs
        o Misplaced AIs
        o Orphaned AIs
        o Unhealthy AIs
      • Forecasting
        o Predict AI tipping points
      Tagging
      • Supports automation
      • Improves audits
      • More granularity
  6. 6. Who are we?
      We are site reliability engineers (SREs) from Slalom, based in Detroit. We are passionate about open source and SRE. Our goal is to use our expertise and knowledge to improve our clients’ products and platforms.
      Seth Jones: Solution Owner, Slalom LLC (Seth.Jones@slalom.com)
      Kyle Hinton: Solution Architect, Slalom LLC (Kyle.Hinton@slalom.com)
  7. 7. How we got involved
      Ford is an organization in transition, migrating its infrastructure from traditional ops to more cloud-native solutions. Here we present some methods that we have found successful in helping product teams evolve.
  8. 8. Why SRE Matters to Ford
      “Ford’s Future: Evolving to Become Most Trusted Mobility Company, Designing Smart Vehicles for a Smart World” (Ford, Oct. 3rd, 2017)
  9. 9. Observability
      • 15,000+ PCF applications
      • 20,000+ application instances
      • 95% of applications in Java/Kotlin
      • 500+ product teams
      • Global product teams
      • AWS / Azure / on-prem PCF foundries
      • Global data centers
  10. 10. Ease of Adoption
      "Every team should be able to develop in whatever method they want." - Jonathan Schneider
      How do we easily monitor 20k TAS applications?
      Observability Goals
      1. Provide various levels of insight into each team's platform to allow for troubleshooting, optimization, and improved management
      2. Provide a solution that requires a one-time setup and removes ongoing platform management toil from product development teams
  11. 11. Solution - Lenses
      • Gateway Metrics
      • Application Metrics
      • Infrastructure Metrics
  12. 12. Right Sizing
      Scaling an application or platform so that it properly utilizes resources to achieve its intended capacity.
  13. 13. Assumptions / Considerations
      • Provide insights without requiring Product team effort
      • Most Product teams have little experience with capacity planning
      • Product teams control their own infrastructure and resources
      • Nudge teams towards change with metrics
      • Leverage open source technology to limit third-party dependencies and maximize customization
      • Be able to measure the impact of our “Right Sizing” efforts
  14. 14. How / Designs
  15. 15. Cloud Agnostic Ecosystem
      • On Premise: Multiple data centers run and maintained by Ford. In general these are egress-only environments.
      • Microsoft Azure: Several foundries running in the Azure cloud to support applications that need to be exposed directly to the public internet.
      • Amazon Web Services: New foundries are being stood up in AWS in support of some of Ford’s most important initiatives.
      • Observability Platform: The Ford Observability Platform has been designed to collect and aggregate data from all of these sources.
  16. 16. Getting Metrics From PCF
      1) The Prometheus BOSH Release is installed in all foundries.
      2) Currently we focus on the metrics exposed by the cf and firehose exporters.
      3) Other metrics are available about nodes, the BOSH system, etc.
  17. 17. Aggregation at Global Scale
      1) Our Observability Platform has multiple Prometheus instances which federate metrics from the Prometheus instances in the foundries.
      2) We utilize the CNCF Sandbox Project Thanos to give us a global overview of all collected metrics.
      3) Foundries in the Azure and AWS environments are tied into Thanos via the Thanos Sidecar service, while the egress-only on-prem instances utilize Thanos Receive for remote write.
  18. 18. Crunching The Numbers
      • Goal: get teams thinking about capacity management by providing memory quota recommendations in real time.
      • There is too much raw data to process it all at request time.
      • We use recording rules in Prometheus to process the raw data as it comes in.
      • Occasional missing data points cause extra headaches.
      • We can now provide suggestions for thousands of applications in seconds.
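
The idea behind the recording rules on this slide, aggregating raw samples as they arrive so that a request only reads a small precomputed summary, can be illustrated in plain Python. This is an analogy only, not Prometheus rule syntax and not the team's actual rules; `MemoryAggregator`, `RunningStats`, and the app name are hypothetical, and skipping missing samples is just one assumed way to handle the gaps the slide mentions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RunningStats:
    """Pre-aggregated memory statistics for one application."""
    count: int = 0
    total_mb: float = 0.0
    peak_mb: float = 0.0

    @property
    def average_mb(self) -> Optional[float]:
        # None means "no usable samples yet" rather than a misleading 0.
        return self.total_mb / self.count if self.count else None


class MemoryAggregator:
    """Folds raw samples into per-app summaries at ingest time."""

    def __init__(self) -> None:
        self._stats: Dict[str, RunningStats] = {}

    def ingest(self, app: str, used_mb: Optional[float]) -> None:
        # A missing data point (None) is skipped so it cannot skew the average.
        if used_mb is None:
            return
        stats = self._stats.setdefault(app, RunningStats())
        stats.count += 1
        stats.total_mb += used_mb
        stats.peak_mb = max(stats.peak_mb, used_mb)

    def summary(self, app: str) -> RunningStats:
        # Dictionary lookup only: no raw samples are scanned at request time.
        return self._stats.get(app, RunningStats())


agg = MemoryAggregator()
for sample in [400.0, None, 500.0, 600.0]:  # one scrape is missing
    agg.ingest("hypothetical-app", sample)

result = agg.summary("hypothetical-app")
print(result.average_mb, result.peak_mb)
```

Because each summary is updated per sample, answering a request costs the same whether an application has ten samples or ten million, which is the property that makes recommendations for thousands of applications feasible in seconds.
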
  19. 19. Utilizing The Data
      • We currently use a very basic model for a recommended application memory quota, aiming for 65% average memory usage while accounting for spikes.
      • We display the data in Grafana, providing both high-level overviews covering numerous applications and targeted dashboards showing other application metrics.
      • Data transparency is an underlying tenet of our system.
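
A minimal sketch of what such a basic model might look like, assuming the quota is sized so that average usage sits at 65% and the observed peak stays within 90% (the 90% ceiling comes from the Fitness Reports slide later in the deck). The exact formula is not given in the slides; the function name and the 64 MB rounding step are assumptions made for illustration.

```python
import math

TARGET_AVG_UTILIZATION = 0.65  # from the slides: aim for 65% average usage
MAX_PEAK_UTILIZATION = 0.90    # from the slides: 90% absolute maximum
QUOTA_STEP_MB = 64             # assumption: round up to a 64 MB step


def recommend_quota_mb(avg_used_mb: float, peak_used_mb: float) -> int:
    """Return a memory quota satisfying both the average and the peak target."""
    if avg_used_mb < 0 or peak_used_mb < avg_used_mb:
        raise ValueError("expected 0 <= average <= peak")
    quota_for_avg = avg_used_mb / TARGET_AVG_UTILIZATION
    quota_for_peak = peak_used_mb / MAX_PEAK_UTILIZATION
    # Whichever constraint is stricter decides; spikes win when they are large.
    needed_mb = max(quota_for_avg, quota_for_peak)
    return math.ceil(needed_mb / QUOTA_STEP_MB) * QUOTA_STEP_MB


steady_app = recommend_quota_mb(avg_used_mb=400, peak_used_mb=450)  # average-bound
spiky_app = recommend_quota_mb(avg_used_mb=300, peak_used_mb=700)   # peak-bound
print(steady_app, spiky_app)
```

Taking the larger of the two requirements is one simple way to "account for spikes": a steady application is sized by its average, while a bursty one is sized by its peak.
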
  20. 20. Future Enhancements
      1) Utilize other data sources (logs, app-specific metrics, and traces) to further refine suggestions.
      2) Better understand resource utilization profiles.
      3) Provide recommendations for where to host applications.
      4) Analyze profiles to recommend an auto-scaling strategy.
      5) Provide guides around failure domains and application design best practices.
  21. 21. Fitness Reports - App Instance memory reduction
      • Sent initial fitness communications to application teams for 1 of 4 on-prem foundries (EDC 1 Pre Prod)
      • EDC 1 PP contains 19% of all Ford TAS app instances
      • Targeted 516 Ford Applications (485 TAS Orgs) that contained app instances with potential for memory downsizing
      • 31% of EDC 1 PP app instances were considered potentially overallocated (more memory allocated than required)
      • Suggested app instance memory reductions based on historical utilization
      • Aimed for 65% max average memory utilization and 90% absolute maximum utilization
      • Reduces app instance memory while minimizing any performance risks
      • Aggregated data and memory recommendations by Ford Application, with dashboard links for detailed app instance utilization metrics
  22. 22. Fitness Reports Effects
      • 7 days after the first fitness reports, total memory allocation in EDC 1 Pre Prod decreased by 1831 GB due to app instance downsizing
      • More app instances were downsized than expected (3421 instances were targeted, 4313 were downsized)
      • Teams that received fitness reports also reduced memory for app instances that were not specifically targeted
      • Sustaining these reductions for a year would save 16M GB-hours annually (15% of scalable platform load)
      • Fitness FAQ to help guide teams on how to adjust, monitor, and optimize app instance resources in TAS
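
The annual figure on this slide follows from the 7-day reduction if that reduction is held constant for a full year. The arithmetic is sketched below; the "implied platform load" is only a derived estimate that takes the slide's rounded 15% at face value, not a number reported in the talk.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

# Figures quoted on the slide.
memory_reduction_gb = 1831
reported_share_of_load = 0.15
targeted_instances = 3421
downsized_instances = 4313

# GB-hours saved if the 1831 GB reduction is sustained for one year.
annual_gb_hours = memory_reduction_gb * HOURS_PER_YEAR  # 16,039,560

# Derived estimate only: total scalable load implied by the rounded 15%.
implied_platform_gb_hours = annual_gb_hours / reported_share_of_load

# How far the downsizing went beyond the targeted instances.
extra_instances = downsized_instances - targeted_instances  # 892
overshoot_ratio = downsized_instances / targeted_instances

print(f"{annual_gb_hours:,} GB-hours per year")
print(f"about {implied_platform_gb_hours / 1e6:.0f}M GB-hours implied platform load")
print(f"{extra_instances} extra instances ({overshoot_ratio - 1:.0%} over target)")
```
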
  23. 23. Fitness Reports Next Steps
      • Send TAS app instance memory guidelines to developers who provision new Orgs and Spaces
      • Create and send fitness reports targeting all 11 foundries
      • Send recurring communications to application/product teams to maintain TAS fitness over time
      • Include additional resource utilization metrics in future fitness reports
      • Identify Orgs and Spaces that can reduce the number of active app instances
      • Train supervised models to forecast future resource utilization
      • Identify app instances trending towards becoming over/under allocated
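
The slides do not say which forecasting models are planned. As a rough illustration of spotting an app instance that is trending towards under-allocation, the sketch below fits an ordinary least-squares line to daily memory utilization and estimates when it would cross the 90% ceiling. The linear trend is a stand-in chosen for simplicity, and `days_until_threshold` and the sample data are hypothetical.

```python
from typing import Optional, Sequence


def days_until_threshold(
    daily_utilization: Sequence[float], threshold: float = 0.90
) -> Optional[float]:
    """Estimate days from the last sample until the fitted trend hits threshold.

    Returns None when the trend is flat or falling (no crossing expected).
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two daily samples")
    mean_x = (n - 1) / 2  # mean of day indices 0..n-1
    mean_y = sum(daily_utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_utilization)
    )
    slope = sxy / sxx  # least-squares slope, utilization per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Clamp at zero: the fitted line may already be above the threshold.
    return max(0.0, crossing_day - (n - 1))


rising = [0.50, 0.55, 0.60, 0.65, 0.70]  # made-up utilization fractions
remaining = days_until_threshold(rising)
print(remaining)
```

The same fit with a low threshold and a negative slope could flag instances drifting towards over-allocation; a production model would likely need to handle seasonality and deploy-time resets, which this sketch ignores.
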
