DevOps: Automate all the things
1. DEVOPS: AUTOMATE ALL THE THINGS
Mat Mannion, Web Development Team Leader, IT Services
14th November 2017 / CS352 Project Management for Computer Scientists
2. • 13 years ago I was sat where you are now. I knew nothing, and I didn’t
know that I knew nothing.
• 6 years as team leader. Many projects come and go.
• Everything you do is a learning experience.
Why should you listen to me?
3. • 12 web developers
• 40 web applications
Sitebuilder (CMS), Tabula, My Warwick app, web sign-on, CourseSync, Files.Warwick, Search, Online Payments, PeopleSearch,
Car Parking, Blogs, Start.Warwick…
• Loosely grouped into 3 agile teams: Ada, Babbage, Turing (so a developer
doesn’t need to know about everything at once)
• Mainly JVM-based
Java EE (Spring framework + Hibernate ORM), Scala (Play! Framework + anorm or Slick ORM), node.js (Express + mongodb)
The Web Development Team
4. • Unified development and operations
• Automation and monitoring at all steps
of software construction and
deployment
• Shorter development cycles, increased
deployment frequency
• More able to respond to changing
requirements – more agile
WHAT IS DEVOPS?
Source: Kharnagy on Wikipedia
https://en.wikipedia.org/wiki/DevOps#/media/File:Devops-toolchain.svg
6. • Initial questions:
- Who’s paying for it?
- Who’s going to use it?
- Is there a requirements document?
- Do we buy or build?
• In the 21st Century, software evolves
• So is this a product, or is it a service?
BRIEF: REPLACE THE EXISTING TOOL TO PROVIDE ID PHOTOS
8. • Agile software development
• Rapid, continuous delivery of useful software
• Late-breaking changes are welcome (mostly)
• Close co-operation between stakeholders and
the software development team
Build, but build how?
9. Incremental vs. Iterative
Source: Jeff Patton, “Don’t Know What I Want, But I Know How To Get It”, January 2008,
http://jpattonassociates.com/dont_know_what_i_want/
10. Scrum or Kanban?
With Scrum, build in a series of fixed-length iterations, with milestones at the end of each sprint:
• Work is taken from the backlog at the start of a sprint in a sprint planning meeting
• Daily stand-ups to assess progress and work through any blockers
• Sprint review meeting and release at the end of the sprint, if approved by the product owner
With Kanban, build to just-in-time (JIT) principles with continuous deployment:
• Work travels from left to right on a Kanban board through defined stages from the backlog to completion
• Releases can happen continuously, or at the team’s discretion
• Change can happen at any time
11. Scrumban!
Take some of the structure from Scrum that helps visibility to stakeholders:
• Meetings at the end of sprints to review the previous sprint and plan the next one
• Daily stand-ups to keep the team focused on the sprint goals
• Release at the end of sprints with work packages – each sprint has a goal
Take some of the flexibility of Kanban:
• Within a sprint, work travels across a Kanban board
• Changes can happen to the work during the sprint (but are generally discouraged, as they make it harder to learn about the team’s velocity)
12. • Bring together the development team with the product owner
• Identify development themes
• Split themes into epics (big user stories)
• Break down epics into stories – a story should fit
within a single sprint (if it doesn’t, break it down
further)
• Stories may comprise multiple tasks
• “Definition of Done” – consistent acceptance
criteria across all user stories
Initial planning meeting
Example breakdown:
• Theme: Photo upload
  - Epic: As a new student, I need to provide an official photo
    - Story: As a new student, I want to upload a new official photo
      Tasks: Create photo upload form; Send uploaded photo to membership system
    - Story: As a new student, I want to be able to change my official photo
      Tasks: Create page to display all uploaded photos; Set photo in membership when selected
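As a rough sketch of how this hierarchy can be modelled (the names and the “Definition of Done” items below are illustrative, not an actual data model from the talk):

```scala
// Illustrative model of the planning hierarchy: themes contain epics, epics
// contain stories, and stories contain tasks, with a shared "Definition of
// Done" captured as acceptance criteria on each story.
final case class Task(description: String)
final case class Story(asA: String, iWant: String, tasks: Seq[Task], definitionOfDone: Seq[String])
final case class Epic(summary: String, stories: Seq[Story])
final case class Theme(name: String, epics: Seq[Epic])

val photoUpload = Theme("Photo upload", Seq(
  Epic("As a new student, I need to provide an official photo", Seq(
    Story(
      asA = "new student",
      iWant = "to upload a new official photo",
      tasks = Seq(Task("Create photo upload form"), Task("Send uploaded photo to membership system")),
      definitionOfDone = Seq("Code reviewed", "Automated tests pass") // example criteria only
    ),
    Story(
      asA = "new student",
      iWant = "to be able to change my official photo",
      tasks = Seq(Task("Create page to display all uploaded photos"), Task("Set photo in membership when selected")),
      definitionOfDone = Seq("Code reviewed", "Automated tests pass") // example criteria only
    )
  ))
))
```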
13. • Estimating is hard
• Defining how many hours a task will take at the start
of the project is near-impossible
• Just give rough estimates to start with
• We use t-shirt sizes, i.e. XS, S, M, L, XL, XXL, XXXL
• Anything over L probably isn’t doable in a single sprint
• We can get better at estimating as we gain
experience in the project
How long is a piece of string?
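A rough illustration of the t-shirt scale and the “anything over L” rule of thumb above (a sketch only; the names are hypothetical):

```scala
// T-shirt sizes as an ordered scale; anything over L probably won't fit in a single sprint.
sealed abstract class TShirtSize(val order: Int)
case object XS   extends TShirtSize(1)
case object S    extends TShirtSize(2)
case object M    extends TShirtSize(3)
case object L    extends TShirtSize(4)
case object XL   extends TShirtSize(5)
case object XXL  extends TShirtSize(6)
case object XXXL extends TShirtSize(7)

def fitsInSingleSprint(size: TShirtSize): Boolean = size.order <= L.order
```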
14. The cone of uncertainty
Source: Steve McConnell, The Cone of Uncertainty
http://www.construx.com/Thought_Leadership/Books/The_Cone_of_Uncertainty/
15. • Do MoSCoW prioritisation of stories from the
backlog – do this every time as priorities change
• Take the highest priority stories and put them into
the next sprint
• Only put as much in there as you can achieve – as
estimations get better, this will become more
accurate
• Anything not done at the end of each sprint goes
back into the backlog for re-prioritisation
Sprint planning meeting
Source: https://www.agilebusiness.org/content/moscow-prioritisation-0
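A minimal sketch of the sprint-filling step described above, assuming each backlog item carries a MoSCoW priority and a points estimate (the names and the greedy fill are illustrative, not a prescribed algorithm):

```scala
// Fill the next sprint with the highest-priority backlog items that fit the
// team's capacity; anything left over goes back for re-prioritisation.
object MoSCoW extends Enumeration {
  val Must, Should, Could, Wont = Value
}

final case class BacklogItem(title: String, priority: MoSCoW.Value, points: Int)

def planSprint(backlog: Seq[BacklogItem], capacity: Int): (Seq[BacklogItem], Seq[BacklogItem]) = {
  // "Won't have (this time)" items are never scheduled; the rest are taken in priority order.
  val candidates = backlog.filterNot(_.priority == MoSCoW.Wont).sortBy(_.priority.id)
  val (sprint, _) = candidates.foldLeft((Vector.empty[BacklogItem], 0)) {
    case ((chosen, used), item) if used + item.points <= capacity =>
      (chosen :+ item, used + item.points)
    case (acc, _) => acc
  }
  (sprint, backlog.filterNot(sprint.contains)) // remainder returns to the backlog
}
```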
19. Development practices that support iterative development
There are a number of development practices that we use to make agile development work well and produce good quality, maintainable software. We use many practices from Extreme Programming (XP):
• Pair programming on new features and changes – effectively constant code review
• Continuous integration (using Bamboo, in our case) to ensure that the system meets all automated testing (“Definition of Done”)
• Coding standards to define a consistent style and format
• Code reviews for all non-trivial changes before they are merged into the mainline
20. • A branching model for a git (or other VCS) repository
• Development is against a develop branch; master is always the current state of production
• Work takes place in branches off develop – branches are
named after issue numbers. Keep feature branches
relatively short-lived so they don’t diverge
• When it’s time to release, develop is branched off as a release branch (e.g. release/1.0) to allow work to continue on develop
• Code reviews whenever we want to merge into develop or
master
Git flow
Source: http://nvie.com/posts/a-successful-git-branching-model/
21. • Demo what’s changed in the most recent sprint to the
customers
• Take feedback for the next sprint
• If everyone agrees, release the work to production
• Make a cup of tea and start again, using the gained
knowledge to improve prioritisation and estimation
Sprint review meeting
30. • Architecture and team must support continuous
delivery
• Traditionally servers are the responsibility of an
Ops or Platforms team – who probably don’t
understand the software you’re deploying on it
• In our environment, Operations manage the tin; we then jointly configure and manage the servers through code
BRIEF: NOW YOU’VE BUILT IT, RUN IT FOREVER
31. • Bad: Run an application server (e.g. Tomcat) and deploy the application to it. Redeploys
unload the application and reload it
• Less bad: Run a web server (e.g. Apache) and load balance (mod_jk or HAProxy) across
multiple application servers. Redeploys can take servers out of the load balancing pool
to minimise downtime. Still a single point of failure (SPOF)
• Good: Load balance across multiple servers with a load balancer appliance. Separate
concerns by running a service-oriented architecture for database, storage, search etc.
• Very good: Deploy your application as containers (e.g. Docker). Build new containers in
CI and switch out an entire new set of containers when deploying new versions
Application architecture - evolution
We are here
32. • Have to design your applications to run in a stateless environment
• No local filesystem, no in-memory or filesystem sessions
• Build your applications so it doesn’t matter if you hit one server with your first request
and a different one with your second
• Build mechanisms for your applications to communicate with each other where
necessary (e.g. propagating configuration changes while running)
Building stateless applications
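A minimal sketch of what this means in code, assuming session state lives in a store shared by every node (the in-memory implementation here is a stand-in for local testing; a real deployment would back it with something like memcached or a database):

```scala
import scala.collection.concurrent.TrieMap

// Session state is read from and written to a shared store, never kept in
// node-local memory, so any node behind the load balancer can serve any request.
trait SessionStore {
  def get(sessionId: String): Option[Map[String, String]]
  def put(sessionId: String, data: Map[String, String]): Unit
}

// Stand-in implementation for local testing only.
final class InMemorySessionStore extends SessionStore {
  private val sessions = TrieMap.empty[String, Map[String, String]]
  def get(sessionId: String): Option[Map[String, String]] = sessions.get(sessionId)
  def put(sessionId: String, data: Map[String, String]): Unit = sessions.update(sessionId, data)
}

// Request handling depends only on the request and the shared store, so it
// doesn't matter which server the first or second request lands on.
def handleRequest(store: SessionStore, sessionId: String): String = {
  val session = store.get(sessionId).getOrElse(Map.empty[String, String])
  val visits = session.getOrElse("visits", "0").toInt + 1
  store.put(sessionId, session + ("visits" -> visits.toString))
  s"You have made $visits requests in this session"
}
```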
33. • ~40 applications across dev, test & prod, with separation of
concerns: >750 nodes (at time of writing)
• How do you maintain configuration and consistency?
• How does support and maintenance work for
applications co-located across many servers?
• Our solution: Configure with code, decentralise all
management, have the system describe itself
The problem:
35. Node classification, roles and profiles
Each node describes what it is, and that builds a classification along with facts about the node:
• A node tuac3-tabula-prod-api-1 describes itself as running on the physical server tuac3, part of the tabula application, the production deployment, the api tier
• The node sends facts about itself to the master, e.g. that it’s running Solaris 11.2
A number of profiles are applied to a node based on its classification:
• Profiles to manage SSL, Java, Tomcat
• Our YAML configuration files describe configuration specific to applications, deployments, tiers etc., which are combined to get the actual configuration
• PuppetDB stores information about all configured nodes and can be queried to create overarching config
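As a sketch of how a classification can be derived from a node name like the one above (the parsing is illustrative; in practice this is handled by the Puppet role/profile setup rather than application code):

```scala
// Derive a node's classification from its hostname,
// e.g. tuac3-tabula-prod-api-1 -> server, application, deployment, tier, index.
final case class NodeClassification(
  server: String,      // physical server, e.g. "tuac3"
  application: String, // application, e.g. "tabula"
  deployment: String,  // deployment, e.g. "prod"
  tier: String,        // tier, e.g. "api"
  index: Int           // node number within the tier
)

def classify(hostname: String): Option[NodeClassification] =
  hostname.split("-") match {
    case Array(server, app, deployment, tier, index) =>
      index.toIntOption.map(NodeClassification(server, app, deployment, tier, _))
    case _ => None
  }

// classify("tuac3-tabula-prod-api-1")
//   => Some(NodeClassification("tuac3", "tabula", "prod", "api", 1))
```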
36. What does it look like in practice?
• 24 nodes in each Tabula deployment across 2 data centres:
  - 4 application nodes
  - 4 API nodes
  - 4 background task processors
  - 4 message queue brokers
  - 4 memcache nodes
  - 3-node ElasticSearch cluster + 1 Kibana node
• F5 BIG-IP load balancer
• Object storage service for storing files (OpenStack Swift)
• Oracle RAC cluster for database
• Multiply out for dev, test, sandbox (training) deployments
38. • check_mk monitors general health of the node (memory/CPU etc)
• This runs as a script on the node itself and the result is collected by central
monitoring server(s)
• Deliver check_mk plugins for each profile applied to a node (e.g. Tomcat
or memcache)
• Each application we deploy delivers service endpoints to monitor the
application – gtg, healthcheck, metrics
• Keep performance data for a long time to spot trends
• Notifications go into Slack channels, email, SMS, depending on the
importance of the node (e.g. is it prod, is it a dependency for other apps)
Monitoring individual nodes
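A minimal sketch of the kind of service endpoints each application delivers for monitoring (the individual component checks below are hypothetical examples, not the real checks):

```scala
// A "gtg" (good-to-go) endpoint aggregates individual component checks into a
// single pass/fail that central monitoring can poll; /healthcheck would expose
// the per-component detail and /metrics the raw performance data.
final case class HealthCheck(name: String, ok: Boolean, detail: String)

def componentChecks(): Seq[HealthCheck] = Seq(
  HealthCheck("database", ok = true, detail = "connection pool 8/50 in use"),
  HealthCheck("memcached", ok = true, detail = "all nodes reachable"),
  HealthCheck("elasticsearch", ok = true, detail = "cluster status green")
)

def gtg(): (Int, String) = {
  val checks = componentChecks()
  if (checks.forall(_.ok)) (200, "OK")
  else (503, checks.filterNot(_.ok).map(_.name).mkString("Failing: ", ", ", ""))
}
```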
39. • All our logs (application, audit and access) are sent and stored securely on
an ELK centralised logging service
• We monitor patterns in the logs and alert on an exception basis (e.g.
increased error rates, increased average response times)
• We can visualise log data across multiple instances of an application or
even multiple applications to diagnose issues
Monitoring applications
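A minimal sketch of exception-based alerting over centralised logs, assuming each log entry carries a level (the three-times-baseline rule is illustrative, not the actual alerting configuration):

```scala
// Alert only when the error rate in the latest window is unusually high
// compared with the long-term baseline for that application.
final case class LogEntry(timestamp: Long, application: String, level: String)

def errorRate(entries: Seq[LogEntry]): Double =
  if (entries.isEmpty) 0.0
  else entries.count(_.level == "ERROR").toDouble / entries.size

def shouldAlert(window: Seq[LogEntry], baselineRate: Double, factor: Double = 3.0): Boolean =
  errorRate(window) > baselineRate * factor
```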