DevOps is a software engineering culture and practice that aims at unifying software development (Dev) and software operation (Ops). [Wikipedia] So why do so many people confuse it with a job description or a position in the organization?
9. Our pricing model is freemium so we can do it.
Everything is free. You only pay for private domains, removing the Wix banner and receiving payments.
12. Taiichi Ohno
(1912-1990)
The father of Toyota’s production system
1. Build only what is needed
2. Eliminate anything which does not add value
3. Stop if something goes wrong
Throwaways
● Code documentation
● Integrations Documentation
● Task Switching
● Wrong process
16. Every developer / engineer is responsible for their product’s performance in production.
Toggle it, TDD, E2E Automation, Monitor, Alerts, Rollback, Maintain with the support of tools built by cross-engineering services
Automation System CI BI Product UX
18. From Git to Production
Feature toggles - Petri OSS
Gradual release - A/B testing
CI - Small, incremental changes
Monitoring - alerts, thresholds & anomaly detection
19. Risk to production = # of changes × size of change
[Figure: waterfall risk vs. continuous deployment risk, each drawn as size of change against # of changes]
20. When we do a FADICHA
Median time to resolution based on find-out method
[Figure: median time to resolution per find-out method: watching graphs, alerts, Wix employee, manual testing, automation, Wix users. Annotations: "We expect from engineers", "Quick win"]
* Numbers are based on data from Jul-2017 to Dec-2017
21. Getting it right
Urgent@wix.com: 1.53 / Day
# Employee Register: +800
Time to be “on it”: 7 min
Time to Resolution: 47 min
Rollback %: 3.6%
Postmortems
22. Takeaway
DevOps is a (dev-centric) culture.
Change your structure, tools and operations
to make it happen.
Challenges
Recruitment / Onboarding
Management
Offshoring teams
Mobile
Legacy
Hi,
My name is Erez Papenry. I joined Wix 2½ years ago as the Head of R&D operations.
By a show of hands,
Q: Who here works in a corporation?
Q: Who here works in a startup?
Q: Who here works in a startup of more than 2K people?
So my story begins 2½ years ago. When I joined Wix I came to my boss (Yaniv) and asked him what I was expected to do. The response was “Good luck”. I asked what my yearly budget was, and the answer was “Don’t look beyond 3 months, and focus on what is needed and why, not on your limits”.
I was shocked. How can you manage an organization like that?
And then I remembered a small sentence that I heard during my interview with Broder (the one I replaced):
“There are a lot of papers around the office. Pick up the one that interests you and make it yours.”
Q: Who knows Wix?
Wix started with the Editor, which gives a drag-and-drop solution for creating websites.
Then 4 years ago Wix understood that it needed to give more solutions for businesses, and it started to target VERTICALS that focus on business needs.
Then 2 years ago Wix understood that drag-and-drop is easy but not for everyone, so it launched Wix Artificial Design Intelligence [ADI]: with 5 questions and some crawling of the net, the system builds a stunning website for you.
Then in the past year we launched Wix Code, which allows you to add JS code to make an application out of your website, or anything else with our web elements (code is endless).
Q: We wanted to launch a new page and we needed to decide which page to use. Which one do you think we should use? A or B? Which one is better?
Define ‘better’. Better = our users are happier and react better to it (use it more).
Our users can be anybody: SMBs, students building themselves a CV, bloggers writing for fun and couples getting married.
And yes, we have 122M users. Actually, in the 5 minutes that I’ve been standing here talking, 500 additional users have joined Wix.
With that tempo and speed of growth, and with the fact that we need to measure everything in real time, we need a speedy development and deployment method that will allow our business to continue growing.
We have to use the Continuous Deployment method.
This means that we commit our code to one repository, check the quality of the code, deploy, and then test again with E2E automation that runs on production. And that must be as simple as pressing a button.
That way you make sure that your code, your process and the functionality of the system are of good quality.
Q: With that said, can you guess / estimate the number of changes we deliver to production each week?
Wix production (what users see) changes more than 1K times a week.
It's important to understand that the Wix business model is freemium. At a high level this means that ALL Wix services are free except for 3 main parts:
Domain
Removal of Wix Banner
Payment
97% of our users are FREE users. Only 3% of the users are paying customers.
So, due to our model and the fact that our cost of a breakdown is low, our type of product can / should work this way.
But it’s important to understand that this method does not fit everything:
E.g. surgery equipment software for sure does not fit this method.
But it may fit the monthly report tool of the same product.
You should understand that it’s possible, and decide when to use it.
Why do we need to look at a startup?
Q: Who is a parent here?
You know that once kids grow, things become more complex.
So before we make things complex, let’s look at the basics... and we are all in the SW industry, so let's talk / think startup.
We wanted to open a startup. Congratulations! Now we need to decide: who will be the developer? QA? Product? Analysis? Marketing? Production operations? Support?
Yes, we can go with outsourcing, but for sure we will not be 100% happy with the result, and as it becomes bigger and bigger we get closer and closer to the blame game.
We try all the time to remember Taiichi Ohno, the father of the Toyota production system, and to compare his production method to our world of SW: Is it needed? Does it give value to our customers? And if we see something wrong we stop, learn, fix and improve.
So taking it to our SW world we clearly understand that:
Code documentation - If your code needs documentation it means it’s written badly, so fix it.
Integration documents - Yes, they help our organization, but most of the time they do not add value to our product / customers. So we try to eliminate integration points and documents when they’re not needed.
Task switching - The smaller our tasks are, the less interruption they cause, and that reduces task switching, which we all understand is a big throwaway.
Wrong process - Ohhhhh, process. First, it’s clear that all organizations work with processes, and Kate Heddleston talked about Null Process. But you have to be careful that you don’t work for the process; remember that the process is there to serve you.
So how do we do it at Wix, with no / minimal blame game and being most effective:
Gangs - Responsible for the E2E lifecycle of the product, or of a sub-product depending on the size of the product. From analyzing the market / user needs to deleting the SW from production.
Q: Can you guess what the size of a Gang is? The size of a room. We try to have each Gang sit in the same location.
Guilds - Responsible for the E2E lifecycle of the employee / engineer. From recruitment, onboarding, training and shuffling to firing when needed.
The Company is responsible for the E2E lifecycle of the product.
Every product is a small Company / startup.
Each employee invests 80% of their time in the Company and 20% in the Guild.
(How: Guild days, Guild weeks, conferences...) Then they come back to the Company and spread the word.
When engineers join Wix they get two presents:
A small welcoming shot of whiskey
Keys (Q: Any guess what the keys are for?)
Keys to production. Actually, each engineer at Wix can bring Wix down at the click of a button from her first day of work.
Think about it: if I know that there is someone behind me who checks my code, I will always trust her to be there to save me.
If I’m exposed to production and I don’t have any safety net, I will do everything with more care and responsibility.
As an engineer my job is to make sure my code performs right in production.
In order to do it I need to make sure that:
New features will mostly be covered with feature toggles with backward compatibility.
Your code is tested before it’s written. Yes, we use TDD (Test-Driven Development): you start by writing a test and only then your code. 70% of Wix code is tests. 0 QA in backend and 1:8 QA in E2E / FED.
Deployment and E2E have 100% automation coverage to identify regression issues.
Lift up your eyes and look at the monitors: traffic / error / success rates.
Alerts - Make sure you have the right alerts in place; you never know when it will catch you.
Rollback is not a shame. Use it once you identify the issue, and then fix the issue in a relaxed mode.
Keep maintaining your code; don't let it become legacy. You have to clean toggles once a year. You have to change something in your services once a month.
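As a rough illustration of the monitoring, alerts and rollback points above, here is a minimal Python sketch. It is not Wix's tooling: the function names, the 5% error-rate threshold and the z-score limit of 3 are all assumptions made up for the example.

```python
from statistics import mean, stdev


def threshold_alert(error_rate, max_error_rate=0.05):
    """Fire when the latest error rate crosses a fixed threshold (assumed 5%)."""
    return error_rate > max_error_rate


def anomaly_alert(history, latest, z_limit=3.0):
    """Fire when the latest value is far from the recent baseline (simple z-score)."""
    if len(history) < 2:
        return False  # not enough data to define a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) / spread > z_limit


def should_roll_back(history, latest):
    """Either kind of alert is a reason to roll back first and fix later."""
    return threshold_alert(latest) or anomaly_alert(history, latest)


# Hypothetical error rates observed before the latest deployment.
recent_error_rates = [0.010, 0.012, 0.011, 0.009, 0.010]
```

The point of keeping both checks is that a fixed threshold catches obvious breakage, while the baseline comparison can catch a change that is still below the threshold but clearly unusual for this service.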
But the fact that as an engineer you do it all right doesn’t mean that you didn’t mess something up.
And in order to make sure it has a minimal effect on our customers, the following is our rollout methodology.
The fact that code is deployed, and that we don’t have a testing environment, doesn’t mean that our users see it:
Every new feature is covered with a toggle that allows us to control who gets it (geo, language, browser, specific user, set of users...)
In case of a mess you can turn it off (full rollback capability) in a matter of seconds.
Release in baby steps: first to you, then to Wix employees, then to monitored countries, then 50%...
Measure everything new that you release. Define KPIs for your main funnel. If something changes in the setup of the test, you start measuring all over again. (Petri supports this as well.)
Remember the MVP: make sure that the smallest MVP you can release has been released. Small package, smaller impact, easier control.
Monitor everything (thresholds and anomalies), and make sure you have the right alerts connected to the right communication channels.
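The toggle and baby-steps rollout described above can be sketched roughly like this in Python. This is not Petri's API: the `User` and `Toggle` classes, their fields, the user IDs and the hashing scheme are all hypothetical, chosen only to show the idea of controlling who sees a deployed feature.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str
    country: str = ""
    is_employee: bool = False


@dataclass
class Toggle:
    name: str
    enabled: bool = True                             # kill switch: False turns it off for everyone
    allowed_users: set = field(default_factory=set)  # specific users who always get the feature
    employees_only: bool = False                     # restrict to employees (plus allowed users)
    countries: set = field(default_factory=set)      # empty set means any country
    percentage: int = 0                              # share of eligible users, 0..100

    def _bucket(self, user_id: str) -> int:
        # Stable bucket in 0..99, so the same user always gets the same answer.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user: User) -> bool:
        if not self.enabled:
            return False
        if user.user_id in self.allowed_users:
            return True
        if self.employees_only:
            return user.is_employee
        if self.countries and user.country not in self.countries:
            return False
        return self._bucket(user.user_id) < self.percentage


me = User("dev-1", country="IL", is_employee=True)
visitor = User("visitor-42", country="US")

step1 = Toggle("new_page", allowed_users={"dev-1"})            # first to you
step2 = Toggle("new_page", employees_only=True)                # then to employees
step3 = Toggle("new_page", countries={"IL"}, percentage=100)   # then to a monitored country
step4 = Toggle("new_page", percentage=50)                      # then to 50% of everyone
killed = Toggle("new_page", enabled=False, percentage=100)     # turned off in case of a mess
```

Hashing the toggle name together with the user ID keeps each user in the same bucket between visits, which matters if you want the measurements of a gradual release to stay consistent.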
So with so many changes, don’t you worry about risking your customers? And I say NO.
Risk = Probability × Severity.
What has the biggest impact on Probability? The number of changes to production.
What has the biggest impact on Severity? The size of the code that we are changing in production.
So actually the risk is the same: in continuous deployment we are increasing the probability but reducing the severity.
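A small worked example of that argument, using the formula from the slide (risk to production = # of changes × size of change). The release counts and change sizes below are invented numbers, chosen only to show the trade-off; they are not Wix data.

```python
def production_risk(number_of_changes, size_of_change):
    """Risk to production = # of changes x size of change (the slide's formula)."""
    return number_of_changes * size_of_change


# Hypothetical waterfall setup: a few big releases.
waterfall = production_risk(number_of_changes=4, size_of_change=5000)

# Hypothetical continuous deployment setup: many small releases.
continuous = production_risk(number_of_changes=1000, size_of_change=20)

print(waterfall, continuous)
```

Both products come out equal with these assumed numbers. What changes is the shape: each individual continuous deployment change is small, so when one goes wrong it is easier to spot and to roll back.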
Yup, we are all developers and we all understand that we will have FADICHAs, stuff that we will break.
But here is what we learned (and this is based on real production incident numbers at Wix):
If we look at the deployment process, the later after GA we identify the issue, the longer it takes to fix: time to resolution increases exponentially.
So yes, we trust that our engineers will add more alerts / more monitoring...
But one simple tool that we have at Wix is the Urgent email: a simple email distribution list that allows any employee who sees an issue to report it by email.
Can you guess the number of Urgents that are sent a day?
Can you guess the number of people that are registered to the Urgent distribution list?
Avg. time until somebody says “it’s my fault, I’m on it”?
Avg. time to resolution of an Urgent?
% of rollbacks that are done in general at Wix?
We try to ensure that all Urgent incidents are followed up by postmortems. Who owns it? She then communicates it to the whole organization. (Currently ~30% of incidents are followed up by postmortems.)
So, yes, DevOps is not a position or a job title...
DevOps is a culture that requires a change of structure, new tools and operations that can make it happen.
A few challenges that we have with this method:
Recruitment is not easy
Management is very sensitive, as it’s easy to break this apart.
Offshoring teams (you all need to sit in the same room)
Mobile: mobile releases are very problematic due to the stores / distribution tools
Legacy: components from the day before you started are still there; we handle those from time to time during Guild weeks.
Questions?