Changing Etsy's Architectural Foundations with Continuous Deployment
  • Hey, thanks for the feedback! Glad you liked the hand-drawn graphs. One time I got sick of trying to make a chart drawing program do what I wanted, so I decided to draw them myself. I figured I could get away with it since I work at Etsy, the handmade marketplace.

    Enabling the unfeasible is about taking changes that would be too risky to do normally, and breaking them down into small enough chunks that they can be done safely.

    It is a lot of slides, but I was able to get through it in ~40 minutes. Some of the slides were only up for a few seconds to get a laugh or something. And as you say, it was for a technical conference. Within the domain, this is talking about a different approach to what is otherwise a familiar topic.
  • Actually I forgot, you are not presenting to Etsy people, you are presenting to people familiar with architecture, so forget everything I just said. ; )
  • Lots and lots of slides. Enough for a 3 hour presentation. Tip: never forget you are way way familiar with this stuff, and 95% are way way unfamiliar with it, so keep it slow and simple and fundamental, talk like you are presenting to a child, like Dean's kids Zack or Noah or even Akira. Just from looking at the slides, I think you are already aware of this, but if you try to cover all the ground you have laid out, you will almost certainly leave most of the people in the dust scratching their heads, wondering what just happened. ; )
  • A tale of 6 bugs looks like it will be both fun and instructional, and the phrase 'enable the unfeasible' makes me curious, makes me think: 'whaaaat?' ; )
  • I like the humour and the handwritten graphs, it makes them more people-friendly for lower tech people and I think more digestible.
  • I'm here to talk about continuous deployment and how it helps with BIG architectural changes. I'm an engineer on Etsy's Core Team and I'm coming at this from the perspective of an engineer involved with making fundamental changes to Etsy's infrastructure. I've found continuous deployment to be a very good way to work and I'm here to spread the good word and hopefully trigger some ideas about how it could help you and how to get started with it.
  • As the business grows, there is change pressuring the software from all over.
  • Start with an axiom: that good architecture is not static. The same company had 2 different architectures at 2 points in its history. While the 2005 architecture never would have scaled to 2012, the 2012 architecture would have been complete overkill in 2005. Etsy wouldn't have had time to build the features that got it started.
  • Here's a graph, very much not to scale, of how the business and architecture grew together.
  • Here's what it would look like if they had shot straight to the 2012 architecture back in 2005. First, they probably wouldn't have gotten it right, but let's assume they did. We're done, right?
  • What now? We overshot the 2005 architecture to be able to handle 2012, but now we're still not prepared for 2017. What do we do now? The point here is that you can't escape architectural change. All you can do is try to make it easier. Continuous deployment makes architectural change easier.
  • Ultimately, the correct architecture needs to change too.
  • We have to be able to make big architectural changes, and we need them to go better than this. The key to success here is breaking down big changes into many smaller changes. When we write code, we break it down into manageable modules. But when it comes time to deploy it, we mash it back together into an unmanageable chunk. This limits the scale of changes. With continuous deployment we remove that limit.
  • Let's step back and look at how deployment got to where it is. And we'll start here, the 80s. In the olden days, software had to be copied onto floppy disks, put in a box, shipped to a store and then finally purchased from the store. And you wouldn't want to give people updates for free, so what happened? They'd all be batched up in an “upgrade” release or a new version altogether. Deploys were understandably rare.
  • And then this happened. Likewise, we even started distributing our thick client software over the internet: Windows Update is a good example. We took applications that used to be physically distributed thick clients and made them “web applications”. For the most part we were still using the old school deploy cycle.
  • But some people wanted to go faster. We realized, “Hey, it's a website, we can deploy it every month.”
  • But... why can't we just deploy and deploy and deploy? Well, we can. We can invert the unit of measurement from days per deploy to deploys per day.
  • Makes it possible to do what might otherwise be too risky
  • Most modern software is dealing primarily with electrons. The impact on the real world is minimal and indirect. In these cases, optimizing for MTTR (mean time to recovery) is far cheaper than optimizing for MTTF (mean time to failure). Say you're coming up to a monthly release and you have 3 people spend 6 days testing for 100 different bugs. They find 4 and miss 2. The 2 that were missed take 2 days to get fixed and deployed. With continuous deployment, say we find 2 and miss 4.
  • Continuous Deployment minimizes bug hours
  • Not all bugs are equal, though. With MTTF, you're telling yourself, if we test it enough, there won't be any bugs. With MTTR, you're saying, we know there will be bugs, let's fix them as quickly as possible.
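The bug-hours comparison from the "Tale of Six Bugs" slides can be worked through in a few lines. This is a minimal sketch using the slide figures (2 missed bugs that each take 24 hours to fix under monthly deploys, 4 missed bugs that each take 6 hours under continuous deploys); the function name `bug_hours` is made up for illustration.

```python
def bug_hours(missed_bugs: int, hours_to_fix: float) -> float:
    """Total exposure: each missed bug stays live until its fix is deployed."""
    return missed_bugs * hours_to_fix


# Figures taken from the slides.
monthly = bug_hours(missed_bugs=2, hours_to_fix=24)
continuous = bug_hours(missed_bugs=4, hours_to_fix=6)

print(f"monthly deploys:    {monthly} bug hours")
print(f"continuous deploys: {continuous} bug hours")
```

Even though more bugs slip through in the continuous case, the shorter time to recovery halves the total exposure in this example.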
  • Cost to recover Steve Austin: $6 million in 1973, $31 million today.
  • In most other cases, continuous deployment may help.
  • GE MRI? No
  • NASA? No
  • Enterprise software? Yes!
  • Printing Health Insurance, Credit Cards? No
  • Continuously deploy to the App Store?
  • What about when it comes to processing financial transactions?
  • Etsy is PCI compliant, so we are financial software. The process is different for our credit card processing software, but we don't deploy on a schedule. We push code whenever it's necessary.
  • This is all it is, and they're all cultural. Everything else is an implementation detail. Doing it is cultural; the technical part is just improving how well you're doing it. I'll talk about a few things that Etsy does to help us, but they're not necessary if you want to start continuous deployment. They're just a few of the things you're likely to find helpful once you do start. Continuous deployment is like everything else in software: ship early and iterate.
  • Everything else measures how good you are at continuous deployment.
  • We don't have the mythical 100% automated test coverage, so we do manual testing too.
  • Great incentive to add automated tests. Manual testing once a month or once a week can be tolerated. Manual testing for multiple deploys per day is painful.
  • Laurie Denness
  • grep and Enter the Dragon were both released in 1973. Tailing a log and using grep to filter out uninteresting stuff is a great way to monitor the health of the system.
  • We use a tool called Deployinator to actually execute our deploys. Deployinator has buttons on it to kick off each stage of the deploy. It triggers a shell script that uses dsh to do stuff on each server, and it logs what's happening on deploys. It's designed, as a feature, to allow only minimal human input. All we do is say, “Start.” There are no options that we might screw up. The “deploy button” is probably the tool that most contributes to cutting down time spent on deploys.
  • With ~100 developers, there is going to be contention for doing deploys. We use another advanced technology for resolving that contention: the topic of an IRC channel. I'm mattg and I'm at the front of the queue. will_gallego is sharing my deploy and Michael Horowitz will do his deploy when he's done.
  • Ganglia is a common graphing tool. It's great for looking at a pool of machines. Each band here is a separate machine.
  • Graphite is another graphing tool. It lets you easily apply functions or stack graphs and is better for displaying system-level and business metrics. These 2 graphs actually show the same event, where we switched to an optimized version of libjpeg.
  • Here's another Ganglia graph, and there's a clear drop in memcache connections. What caused that? We draw these vertical lines at each deploy, and this one is blue. That means there was a configuration change at that time that led to the drop. Now I know I can check the deploy logs to see what went out.
  • Graphs are great to look at, but they don't help if there's not an easy way for developers to get the right data into them. We use a tool that we've open sourced called StatsD. It's a node.js UDP server that just listens for incoming data and sends it to Graphite. From our application code, the only thing we need to write is this little bit.
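For readers curious what such a call puts on the wire, here is a rough Python sketch of a StatsD-style client. It assumes the plain-text `bucket:value|type` datagram format that StatsD accepts (`ms` for timers, `c` for counters) and StatsD's default UDP port 8125; the helper names and the localhost address are illustrative and are not Etsy's client code.

```python
import socket


def format_timing(bucket: str, millis: int) -> bytes:
    """Build a timer datagram, e.g. b'query.runtime:320|ms'."""
    return f"{bucket}:{millis}|ms".encode("ascii")


def format_increment(bucket: str, count: int = 1) -> bytes:
    """Build a counter datagram, e.g. b'query.failure:1|c'."""
    return f"{bucket}:{count}|c".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> bool:
    """Send one UDP datagram; swallow socket errors so metrics never break the app."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))
        return True
    except OSError:
        return False


def record_query(success: bool, millis: int) -> bytes:
    """Mirror the slide's logic: time successful queries, count failed ones."""
    if success:
        payload = format_timing("query.runtime", millis)
    else:
        payload = format_increment("query.failure")
    send_metric(payload)
    return payload
```

Because UDP is fire-and-forget, the application does not wait for the metrics server, which is part of why instrumenting code this way is cheap.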
  • Logster is another tool we use to easily get data into graphs. It scans production logs, uses plugins to parse out interesting information, and pushes it to Graphite or Ganglia.
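The scan, parse, push loop can be illustrated without Logster itself. The sketch below is not Logster's plugin API; it is a standalone approximation that assumes a hypothetical access-log format and made-up metric names, and it stops short of actually sending anything to Graphite or Ganglia.

```python
import re
from collections import Counter

# Assumed log shape: ... "GET /path HTTP/1.1" 200 512
STATUS_RE = re.compile(r'" (\d{3}) ')


def count_status_classes(lines):
    """Parse each line and count responses per status class (2xx, 5xx, ...)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # lines that do not look like access-log entries are skipped
            counts[f"http_{match.group(1)[0]}xx"] += 1
    return dict(counts)


sample_log = [
    '10.0.0.1 - - [28/Sep/2012:10:00:00] "GET /listing/1 HTTP/1.1" 200 512',
    '10.0.0.2 - - [28/Sep/2012:10:00:01] "GET /listing/2 HTTP/1.1" 500 0',
    '10.0.0.3 - - [28/Sep/2012:10:00:02] "POST /cart HTTP/1.1" 200 128',
    "not an access log line",
]

metrics = count_status_classes(sample_log)
print(metrics)
```

A real setup would run something like this on a schedule against the newest part of the log and forward the resulting counts to the graphing system.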
  • A deploy is not the same as a product launch. Just because you deploy frequently doesn't mean you have to give up control of when software is “launched”. Feature flags are the tools that give that control through a “dark launch”. Our example: credit card processing.
  • Imagine if you have this function to get a link for feature A. It formats the string and returns it.
  • So now to dark launch it, change that function to check a config value. If it's enabled, return the value from a new function; otherwise return the value from the old function. This is just to generate a link, but we use these all over our code, so there's not really any limit to how you use feature flags.
  • A sample of configuration. At Etsy, we use admin to mean only Etsy employees. So FeatureA is dark launched for employees only, which lets us see how the development is progressing.
  • Now we're ready to let real users start seeing the new feature, so we increase it to 1% of all users. We can also whitelist which specific users get the new feature. This ramp up is a powerful way to reduce risk in a change and is why continuous deployment could work for financial trading software.
  • If at any point we see a problem, we just roll back to admin only.
  • Now 100% of people are using NewFeatureA
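The admin, percentage, and on/off settings described in the talk could be evaluated along these lines. This is only a sketch under stated assumptions: the talk does not say how Etsy assigns users to a percentage, so the hash-based `bucket` function, the `User` class, and the explicit `config` argument to `enabled` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class User:
    user_id: int
    is_admin: bool = False  # "admin" means an Etsy employee in the talk


def bucket(feature: str, user_id: int) -> int:
    """Map a user to a stable number in 0..99 (assumed bucketing scheme)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def enabled(config: dict, feature: str, user: User) -> bool:
    """Evaluate a flag whose setting is 'on', 'off', 'admin', or 'N%'."""
    setting = config.get(feature, "off")
    if setting == "on":
        return True
    if setting == "off":
        return False
    if setting == "admin":
        return user.is_admin
    if setting.endswith("%"):
        return bucket(feature, user.user_id) < int(setting[:-1])
    raise ValueError(f"unknown setting for {feature}: {setting!r}")


employee = User(user_id=1, is_admin=True)
shopper = User(user_id=2)
dark = {"creditcards": "admin"}
print(enabled(dark, "creditcards", employee), enabled(dark, "creditcards", shopper))
```

Because each user's bucket is stable, anyone who saw the feature at 5% still sees it at 25%, and rolling back is just a configuration change.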
  • Have an interface change and want to see if it moves metrics? Split users across different options and see what happens. Have a new feature and want to see if people like it before getting behind it? Let people select themselves into a beta group. Google does this with Labs. With continuous deployment you can make these changes, instantly see results, and then make more changes.
  • Communication is a very helpful tool in both directions.
  • First, if something small breaks, you want to have a feedback path for users to inform you. This is not specific to continuous deployment, but changes are spread with low intensity over a period of time, so you want to have a good low-intensity feedback channel. Forums or message boards are a great, low-intensity way for users to send feedback.
  • If something big breaks, you want to be able to inform users out of band of a potentially non-functional site. At Etsy we have a blog hosted by WordPress where we post outages or even slowness on the site.
  • We also have an etsystatus Twitter account.
  • And here's one reason to have 2 channels
  • What we're doing with all these tools is making deployment a first class member of the system. Compare this to tech support features or business intelligence.
  • Listing photos are a core part of our site, as they are what lets buyers see what people are selling, and the site gets 400k uploads per day. The Postgres DB was our central DB, and we're migrating all of the data there over to the shards. All this happens live.
  • Finally, get rid of the old stuff. This is the most satisfying step. Also very important to keep unused stuff from causing confusion.

Presentation Transcript

  • Changing Etsy's Architectural Foundation with Continuous Deployment. Matt Graham, Core Engineer @ Etsy, Continuous Deployer. #surgecon, September 28, 2012
  • Marketplace for Handmade Goods ● Gross Sales 2011: $537 million ● Total Members: 19 million ● Items For Sale: 15 million ● Uniques / month: 40 million ● Page Views / month: 1.4 billion
  • Architecture is Relative
  • Organic Architecture
  • Premature Architecture
  • Passing Time => Change ● Scale ● Product ● Technology ● Engineering Team
  • Passing Time => Change ● Scale ● Product ● Technology ● Engineering Team ● The Correct Architecture Changes
  • Architectural Change Antipattern
  • A Brief History of Deployment
  • The Internet
  • Agility
  • Continuous
  • What it's all about ● Reduce Failure Time ● Start with Culture ● Tools Help ● Enable the Unfeasible
  • A Tale of Six Bugs
  • Six Bugs with Monthly Deploys ● 4 caught ● 2 missed ● fix live: 24 hours
  • Six Bugs with Continuous Deploys ● 2 caught ● 4 missed ● fix live: 6 hours
  • Failure Time ● 2 Bugs * 24 Hours = 48 BH ● 4 Bugs * 6 Hours = 24 BH ● Minimize BugHours: 24 < 48
  • MTTR vs MTTF
  • Cost of Recovery ● Photons: Minimal ● Electrons: Low ● Protons & Neutrons: High ● Humans: Prohibitive
  • Cost of Recovery: $6 million in 1973 = $31m today
  • Good Excuses ● Infrequent Changes ● Infrequent Executions ● Life and Death ● Physical Investment
  • Medical Devices? No
  • NASA? No
  • Enterprise Software? Yes!
  • Print of Cards? No
  • App Store? No
  • Financial Transactions?
  • Financial Transactions? Yes!
  • Getting Started
  • Culture Before Tools ● Throw out the deploy schedule ● Ship changes when tested & ready ● Software is stable & supported
  • Tools of Etsy Deployment
  • Jenkins ● Unit Tests ● Functional Tests
  • Jenkins ● Unit Tests ● Functional Tests ● Manual Testing
  • Nagios & Naglite2 ● github.com/lozzd/Naglite2
  • tail -f | grep
  • github.com/etsy/deployinator
  • IRC
  • Graphs!!!
  • Ganglia
  • Graphite
  • Event Overlay
  • StatsD ● github.com/etsy/statsd
        // Report how long a successful query took, or count the failure.
        if ($success) {
            StatsD::timing("query.runtime", $time);
        } else {
            StatsD::increment("query.failure");
        }
  • github.com/etsy/logster
  • Practices @ Etsy Feature FlagsCustomer Communication
  • Feature FlagsDeploy != Product Launch
  • Dark Launch
        def get_payment_link():
            return ...
  • Dark Launch
        def get_payment_link():
            # Serve the new code path only when the flag is enabled.
            if enabled("creditcards"):
                return creditcard_link()
            else:
                return check_link()
  • Dark Launch
        application_config:
          - creditcards: admin
          - NewFeatureB: off
          - NewFeatureC: on
  • Ramp Up
        application_config:
          - creditcards: 1%
          - NewFeatureB: off
          - NewFeatureC: on
  • Ramp Up
        application_config:
          - creditcards: 5%
          - NewFeatureB: off
          - NewFeatureC: on
  • Whoops!
        application_config:
          - creditcards: admin
          - NewFeatureB: off
          - NewFeatureC: on
  • Ramp Up
        application_config:
          - creditcards: 5%
          - NewFeatureB: off
          - NewFeatureC: on
  • Ramp Up
        application_config:
          - creditcards: 25%
          - NewFeatureB: off
          - NewFeatureC: on
  • Credit Cards are ON
        application_config:
          - creditcards: 100%
          - NewFeatureB: off
          - NewFeatureC: on
  • A/B Testing ● Prove success of interface changes ● Prove interest in new features
  • Community Communication
  • Forums / Message Boards
  • etsystatus.com
  • twitter.com/etsystatus
  • Deployment is First Class Deployment is a First Class Feature
  • Engineers are Users Too
  • Examples from Etsy ● Photos From Twisted to PHP ● PostgreSQL to MySQL Shards
  • From Twisted to PHP ● Run Apache/PHP on a new port ● Implement one service in PHP ● Ramp up users on new service ● Repeat for each service ● Shut down Twisted version
  • PostgreSQL to MySQL Shards ● Migrate table by table ● Tee writes to both DBs ● Copy old data from PostgreSQL ● Verify data matches ● Ramp up reads from MySQL ● Stop PostgreSQL writes
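The migration steps on this slide (tee writes, copy old data, verify, ramp up reads) can be sketched with plain dictionaries standing in for the two databases. Everything here is illustrative: the class name, method names, and the `bucket` parameter are assumptions, and a real migration would also have to handle deletes, failures between the two writes, and concurrent access.

```python
class DualWriteStore:
    """Tees writes to an old and a new store while reads ramp over to the new one."""

    def __init__(self, old: dict, new: dict, read_percent_new: int = 0):
        self.old = old
        self.new = new
        self.read_percent_new = read_percent_new  # 0..100, raised gradually

    def write(self, key, value) -> None:
        # Tee every write so the new store stays current during the migration.
        self.old[key] = value
        self.new[key] = value

    def backfill(self) -> None:
        # Copy data written before teeing began; never clobber fresher teed writes.
        for key, value in self.old.items():
            self.new.setdefault(key, value)

    def mismatches(self) -> list:
        # Keys whose data is missing or different in the new store.
        return [k for k in self.old if k not in self.new or self.new[k] != self.old[k]]

    def read(self, key, bucket: int):
        # bucket is a stable 0..99 number per user or request (assumed to exist).
        source = self.new if bucket < self.read_percent_new else self.old
        return source[key]


postgres = {"listing:1": "photo-a"}   # pre-existing data in the old store
mysql_shard = {}
store = DualWriteStore(postgres, mysql_shard)
store.write("listing:2", "photo-b")   # teed to both stores
missing_before = store.mismatches()
store.backfill()
missing_after = store.mismatches()
store.read_percent_new = 100          # reads fully ramped up to the new store
```

Only once verification comes back clean and reads are fully ramped up would writes to the old store be stopped, which matches the last step on the slide.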
  • Continuous Deploy Pattern ● Change in small steps ● Dark launch via config ● Iterations to prod while dark ● Maintain old & new in parallel ● Ramp up new architecture ● Remove old architecture
  • Once Again ● Minimize BugHours ● Trash the Schedule ● Iterate on the Tools ● Make Big Changes
  • Mean Time To Addiction
  • Changing Etsy's Architectural Foundation with Continuous Deployment. Matt Graham, Core Engineer @ Etsy, Continuous Deployer. http://twitter.com/lapsu http://lapsu.tv http://codeascraft.etsy.com http://www.etsy.com/careers