Architecting for Change: QCONNYC 2012

A broad overview of Etsy's why and how.

Architecting for Change: QCONNYC 2012 Presentation Transcript

  • Optimized for change: Architecture @ Etsy. Kellan Elliott-McCrea (@kellan), CTO, Etsy. Monday, June 18, 2012
  • Launched June 18, 2005
    * 875,000 active sellers
    * 33.5MM items for sale
    * $65.9MM in sales, in May
    * 1.4B page views, in May
    * 102 engineers
    * 32 releases, last Friday
  • LAMP. Any questions? (credit: 8BitLit, http://www.etsy.com/listing/90066890/)
  • Why?
  • 3 inevitabilities we design for:
    1. Things break, unexpectedly
    2. What we’re building changes
    3. We don’t get to start over
  • 2 years of change.
  • Architectural Principles
    * Don’t bet against the future.
    * Our customers are humans.
    * Simplicity always wins, in the end.
    * Favor global vs local optimization.
    * Ambiguity kills momentum.
    * Make failure cheap.
    * Technical debt is an inevitable by-product of shipping code.
    * Optimize for change.
  • Cleverness (credit: Ckrickett, http://www.etsy.com/listing/90611466)
  • Complex systems and change
    1. Distributed systems are inherently complex.
    2. The outcome of change in complex systems is hard to predict.
    3. The outcomes of small, frequent, measurable changes are easier to predict and easier to recover from, and such changes promote learning.
  • Continuous deployment, Metrics Driven Development, Blameless Post-Mortems
  • Continuous deployment: Small, frequent changes to production
  • Continuous Deployment: No branching. “All existing revision control systems were built by people who build installed software” - Paul Hammond, Always Ship Trunk, Velocity 2010
  • Continuous Deployment: feature flags
        if ($cfg['awesome_new_search']) {
            # new hotness
            $rsp = do_solr();
        } else {
            # boring old stuff
            $rsp = do_grep();
        }
  • Continuous Deployment: Ramp-ups (on top of feature flags)
    1. Launch to staff only
    2. Launch to 1% of all users
    3. Launch to members of a beta group
  • Continuous Deployment: any engineer can launch a feature to 1% of users
  • Continuous Deployment: ~200 experiments live right now
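
The ramp-up stages above (staff only, a beta group, 1% of users) can be sketched roughly as follows. This is a minimal illustration in Python rather than the PHP of the slides, and every name in it (`FLAGS`, `bucket`, `is_enabled`, the layout of the flag config) is hypothetical, not Etsy's actual config system. It assumes users are bucketed by hashing their id, so a given user stays consistently inside or outside the 1%.

```python
import hashlib

# Hypothetical flag config: who sees a feature at each ramp-up stage.
FLAGS = {
    "awesome_new_search": {"staff": True, "groups": {"beta"}, "percent": 1},
}

def bucket(flag: str, user_id: int) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def is_enabled(flag: str, user_id: int, is_staff: bool = False, groups=()) -> bool:
    cfg = FLAGS.get(flag)
    if cfg is None:
        return False  # unknown flags are off
    if is_staff and cfg.get("staff"):
        return True  # stage 1: staff only
    if cfg.get("groups", set()) & set(groups):
        return True  # beta group members
    # Percentage ramp-up: the same user always lands in the same bucket.
    return bucket(flag, user_id) < cfg.get("percent", 0)
```

Hashing the flag name together with the user id keeps the 1% for one experiment independent of the 1% for another, which matters when many experiments are live at once.
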
  • Metrics driven development: introspection isn’t optional. Measure everything, log everything.
  • Metrics driven development: Metrics happen when you make it easy. And visible.
  • Metrics driven development: Teach computer to read graphs. holtWintersConfidence(Upper|Lower)
  • Metrics driven development: More info: http://www.slideshare.net/mikebrittain/metricsdriven-engineering
  • Optimize for MTTR, not MTBF
  • How?
  • Etsy
  • Etsy; EMR/S3; PCI; BCP, Cold
  • Architecture diagram. Labels, in transcript order: inbound request; CDNs, diversified at the DNS level; Internet providers, diversified at borders; AWS; Etsy; network appliances; analytics; imstor; etsystatic.com/; etsy.com/; bcn.etsy.com; EMR; S3 photos; api.etsy.com; JRuby/Cascading; /atlas; Squid; apache; S3 logs; php application; logrotate; MySQL; HDFS; search; NFS; memcache; async http; StatsD; sqlite; gearman; logs; server/OS; mail out; PCI hardware; Thrift; SMTP; dbindex; Jetty; dbshards; X-Yarnblaster; Solr slaves; via jsonp; dbaux; datasets; no privileged access; dbdata; Solr master; etc; HBase; sharded MySQL
  • CDNs: Put a slider on it. Just works via weighted DNS.
  • Apache
    * Well known
    * PHP is native
    * apache_note
    * fast start time
    * cheap in place replacement
    * .htaccess
    * Challenge: memory usage
  • Apache: apache_note
    * introspection through the life cycle
    * Additive! Insanely useful!
        apache_note('etsy_uaid', $id);
  • Apache: log format
        LogFormat "%{X-Forwarded-For}i %{True-Client-IP}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
  • Etsy: the App
    * 487,000 lines of PHP
    * 214,000 lines of JavaScript
    * Monolithic codebase
    * 3 front ends: Etsy.com, API, Atlas
  • Etsy: the App
    * routing handled by Apache
    * scripts fronting OO PHP5
    * PHP, fast by default
    * opcode caching
    * Challenge: liveliness when calling services
  • Etsy: coding patterns
    * lightweight, home-rolled “framework”
    * ORM handles DAO across backends
    * config and feature flags systems used everywhere
    * small slow moving datasets stored as PHP arrays
    * A/B tests
    * Smarty
    * StatsD
    * Concurrency
    * memcache
  • Etsy: A/B tests
    * beaconed
    * inserted into logs via apache_note
    * conditionalized on feature flags
    * nightly reports on conversion, bounce rate, etc
    * nightly reports on page speed, memory usage, etc
  • Etsy: Smarty
    * pre-compiled
    * pre-compiled per language
  • Etsy: StatsD
        StatsD::increment("logins.success");
        StatsD::timing("gearman.time", $msec);
    * 340,000 application metrics
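
To make the two StatsD calls above concrete, here is a minimal sketch of a client in Python (the slides use a PHP `StatsD` class whose internals are not shown). It relies on the plain-text StatsD datagram format, `name:value|type`, with `c` for counters and `ms` for timers, sent over UDP. The class layout, the `format_metric` helper and the localhost address are assumptions for illustration; 8125 is the conventional StatsD port.

```python
import socket

def format_metric(name: str, value, kind: str) -> bytes:
    """Build one StatsD datagram, e.g. b'logins.success:1|c'."""
    return f"{name}:{value}|{kind}".encode("ascii")

class StatsD:
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload: bytes) -> None:
        try:
            self.sock.sendto(payload, self.addr)
        except OSError:
            pass  # metrics are best effort; never break the request

    def increment(self, name, count=1):
        self._send(format_metric(name, count, "c"))

    def timing(self, name, msec):
        self._send(format_metric(name, msec, "ms"))

stats = StatsD()
stats.increment("logins.success")
stats.timing("gearman.time", 320)
```

Because the transport is fire-and-forget UDP, adding a metric costs one line and cannot block the page, which fits the "metrics happen when you make it easy" point.
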
  • Etsy: Concurrency
    * no native concurrency in PHP
    * asynchronous HTTP calls
    * Gearman
  • Etsy: Async HTTP calls
    * curl_multi_exec
    * non-blocking, per request time outs
    * used for optional aspects of a page
    * curl against http://localhost to avoid network overhead
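
The pattern above (concurrent calls, a timeout per request, and optional page sections that are simply dropped when they are slow) can be sketched as follows. This is an illustration in Python with `asyncio`, not the `curl_multi_exec` code the slides refer to. The two service coroutines are hypothetical stand-ins that sleep instead of making HTTP calls, and the timeout values are made up.

```python
import asyncio

async def with_timeout(coro, timeout: float, default=None):
    """Run coro, but give up and return default after timeout seconds."""
    try:
        return await asyncio.wait_for(coro, timeout)
    except asyncio.TimeoutError:
        return default

async def render_page(fetchers: dict, timeouts: dict) -> dict:
    """Run all optional fetches concurrently; slow ones come back as None."""
    names = list(fetchers)
    results = await asyncio.gather(
        *(with_timeout(fetchers[n](), timeouts[n]) for n in names)
    )
    return dict(zip(names, results))

# Hypothetical stand-ins for HTTP calls to internal services.
async def fast_service():
    await asyncio.sleep(0.01)
    return "recommendations"

async def slow_service():
    await asyncio.sleep(1.0)
    return "activity feed"

page = asyncio.run(render_page(
    {"recs": fast_service, "feed": slow_service},
    {"recs": 0.5, "feed": 0.05},
))
```

Here `page["feed"]` ends up as `None` because the slow call exceeds its own timeout, while the rest of the page is unaffected.
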
  • Etsy: Gearman
    * language agnostic job server
    * don’t use an MQ when you want a job server
    * 150 job types
    * persistent jobs flushed to MySQL, read from memory
    * non-persistent jobs just stored in memory
    * NP queue is wicked fast.
  • Etsy: Gearman
    * scaling CPU of cron jobs
    * denormalizing data
    * pushing to 3rd party services
  • Etsy: Challenges
    * Apache memory usage
    * liveliness talking to services: no concurrency, blocking by default
  • Etsy: graph of distributed failure
  • Etsy: Challenges
    * Apache memory usage
    * liveliness talking to services: no concurrency, blocking by default
    Enforce liveliness with a judicious application of force
  • Etsy: judicious application of force
        # /proc/self/statm reports sizes in pages: total, resident, shared, ...
        list($v, $res, $shar) = explode(' ', @file_get_contents('/proc/self/statm'));
        $mine = $res - $shar;
        if ($mine > $cfg['sizelimit']) {
            $pid = getmypid();
            @exec("kill -USR1 $pid");
        }
  • Etsy: judicious application of force: Bowhunter
    * Find long running PHP processes
    * Try to avoid those mid-post
        open(APACHE, "/usr/bin/curl -s http://localhost/server-status|") || die "$!";
  • Etsy: judicious application of force: Query_killer
    * Same idea, long running queries
    * MySQL “SHOW PROCESSLIST;”
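
The Query_killer idea can be sketched as a pure function over process list rows. This Python example is only an illustration: the slides do not show the tool itself, so the function name, the 10 second threshold and the choice of `KILL QUERY` (which stops the statement but keeps the connection) are assumptions. The `Id`, `Command`, `Time` and `Info` fields are columns that `SHOW PROCESSLIST` returns.

```python
def queries_to_kill(processlist, max_seconds=10):
    """Return KILL statements for queries running longer than max_seconds."""
    statements = []
    for row in processlist:
        # Only target running queries, not sleeping connections.
        if row["Command"] != "Query":
            continue
        if row["Time"] > max_seconds:
            statements.append(f"KILL QUERY {int(row['Id'])}")
    return statements

# Sample rows shaped like SHOW PROCESSLIST output (values are made up).
rows = [
    {"Id": 11, "Command": "Sleep", "Time": 900, "Info": None},
    {"Id": 12, "Command": "Query", "Time": 83, "Info": "DELETE FROM ..."},
    {"Id": 13, "Command": "Query", "Time": 0, "Info": "SELECT ..."},
]
to_kill = queries_to_kill(rows)
```

Keeping the decision separate from the database connection makes the policy easy to test before it is allowed to kill anything.
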
  • Memcache
    * Caching, obviously
    * Cache invalidation is hard
    * Write buffering
    * multi_get
    * rate limits
  • Memcache
    * atomic INCR is awesome
    * slice your time windows to reduce risk of cache eviction
    * we’ve been unlucky, lots of segfaults :(
    * multi_get slows down the more boxes in the pool
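
One reading of "rate limits", "atomic INCR" and "slice your time windows" is a counter per small time slice, summed over the window with a multi-get, so that an evicted key loses only a fraction of the count. The sketch below illustrates that reading in Python. `FakeMemcache` is an in-memory stand-in, and the key format, limits and slice size are hypothetical, not Etsy's.

```python
import time

class FakeMemcache:
    """In-memory stand-in for a memcache client (add, incr, get_multi)."""
    def __init__(self):
        self.data = {}

    def add(self, key, value):
        # Like memcache add: only sets the key if it does not exist yet.
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        self.data[key] += delta
        return self.data[key]

    def get_multi(self, keys):
        return {k: self.data[k] for k in keys if k in self.data}

def allow(cache, user_id, limit=5, window=3600, slice_secs=60, now=None):
    """Count one action and report whether the user is still under limit."""
    now = time.time() if now is None else now
    current = int(now // slice_secs)
    key = f"rl:{user_id}:{current}"
    cache.add(key, 0)   # no-op if this slice's counter already exists
    cache.incr(key)     # atomic on a real memcache server
    keys = [f"rl:{user_id}:{current - i}" for i in range(window // slice_secs)]
    return sum(cache.get_multi(keys).values()) <= limit
```

With one-minute slices over an hour, losing a single key to eviction undercounts by at most one minute of activity instead of resetting the whole window.
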
  • MySQL: By the numbers
    * 25K+ queries/sec avg
    * 3TB InnoDB buffer pool
    * 15TB+ data stored
    * 50 servers
    * 99.99% queries under 1ms
  • MySQL: a NotMuchSQL server
    * no joins
    * no foreign keys
    * no transactions or locks
    * no sub-selects
    * store data like you want to read it.
    * also: no auto_increment
  • MySQL: a NotMuchSQL server. “Normalization is for sissies.” - Cal Henderson, Flickr
  • MySQL: scale horizontally
    * objects shared by key
    * lookups maintained in dbindex (MySQL is a FAST key-value store)
    * avoid key hashing, range partitions, and partitioning functions
    more: http://www.slideshare.net/jgoulah/the-etsy-shard-architecture-starts-with-s-and-ends-with-hard
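
The difference between a lookup index and key hashing can be shown with a small sketch. In this Python illustration a dictionary stands in for the dbindex lookup table; the class, its round-robin placement policy and the shard ids are all hypothetical. The point it tries to show is that an existing object's shard is whatever the index says, so adding shards never remaps old objects, and moving an object is an index update.

```python
class ShardIndex:
    """Toy stand-in for a dbindex lookup table: object key -> shard id."""
    def __init__(self, shard_ids):
        self.shard_ids = list(shard_ids)
        self.lookup = {}
        self.next = 0

    def shard_for(self, user_id):
        # Existing objects always resolve through the stored mapping.
        if user_id not in self.lookup:
            # Hypothetical placement policy: round robin across shards.
            self.lookup[user_id] = self.shard_ids[self.next % len(self.shard_ids)]
            self.next += 1
        return self.lookup[user_id]

    def move(self, user_id, new_shard):
        """Rebalancing is an index update (plus a data copy, not shown)."""
        self.lookup[user_id] = new_shard

index = ShardIndex([1, 2])
first = index.shard_for(10)     # placed on shard 1
second = index.shard_for(11)    # placed on shard 2
index.shard_ids.append(3)       # capacity added later
still_first = index.shard_for(10)
```

With `hash(user_id) % shard_count`, the append above would have silently changed where most existing users are expected to live.
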
  • MySQL: Master-Master
    * objects hashed to a side, avoid split brain
    * allows in place schema upgrades without slave promotion
    * simplified capacity planning
    more: http://codeascraft.etsy.com/2012/04/20/two-sides-for-salvation/
  • MySQL: Introspection. SQL comments are awesome!
        web0038 : [Mon Jun 18 09:58:38 2012] [error] [client 10.101.1.12] [C6kds9y1MVptEDMoOe5KCYha9VWl] [error] [ORM_LONG_QUERY] [/var/etsy/current/phplib/EtsyORM/Query/RawSql.php:752] [15877310] Query exceeded 10 seconds: long_query_time=83.0927 long_query_string=/* [etsy_shard_005_A] [/remove_favorite_listing.php] */ DELETE FROM `users_favoritelistings` WHERE `user_id` = ? AND `listing_id` = ? long_query_trace=#10 __construct() /EtsyModel/UserFavoriteListingMirror.php:310 #4 delete() /EtsyModel/UserFavoriteListing.php:39 #3 delete() /EtsyModel/User.php:1840 #2 unfavoriteListing() /Controller/Favorites.php:344 #1 removeFavoriteListingRecord() /Controller/Favorites.php:94 #0 performRemoveFavoriteListing() /var/etsy/current/htdocs/remove_favorite_listing.php:9, referer: http://www.etsy.com/people/kellanem/favorites?page=5
  • MySQL: Deletes are expensive
    * update objects to state=‘deleted’
    * use partitions
    * truncatenator: on ext3, hard link file, move, delete slowly.
  • Anatomy of a feature: Shop Stats
  • Anatomy of a feature: Shop Stats. “Never get into a land war in Asia, and never build an analytics tool on top of MySQL.”
  • Anatomy of a feature: Shop Stats
    * buffer writes in Memcache using predictable keys
    * flush to MySQL tables periodically via cron
    * bake old data into all possible date ranges, and archive to S3
    * truncate tables
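
The first two steps above (buffer writes under predictable keys, then flush periodically) can be sketched as follows. This Python example uses dictionaries as stand-ins for Memcache and for the MySQL table, and the key format, class and method names are hypothetical. "Predictable" is taken here to mean that the flusher can rebuild every key from a shop id and a time bucket, since memcache offers no way to list keys.

```python
from collections import defaultdict

def stats_key(shop_id, day, hour):
    """Predictable key, so the flusher can enumerate keys without scanning."""
    return f"shopstats:{shop_id}:{day}:{hour:02d}"

class WriteBuffer:
    def __init__(self):
        self.cache = defaultdict(int)   # stand-in for memcache counters
        self.table = defaultdict(int)   # stand-in for a MySQL table

    def record_view(self, shop_id, day, hour):
        # On the request path: one cheap counter increment, no database write.
        self.cache[stats_key(shop_id, day, hour)] += 1

    def flush(self, shop_ids, day, hours=range(24)):
        """What a periodic cron job might do: move counts into the table."""
        for shop_id in shop_ids:
            for hour in hours:
                count = self.cache.pop(stats_key(shop_id, day, hour), 0)
                if count:
                    self.table[(shop_id, day, hour)] += count

buf = WriteBuffer()
for _ in range(3):
    buf.record_view(7, "2012-06-18", 9)
buf.flush([7], "2012-06-18")
```

Many page views collapse into one row update per shop and hour, which is what keeps the write load on MySQL bounded.
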
  • bcn.etsy.com: beaconed event stream
    * Server-side and JavaScript event stream
    * At least one per page view
    * Apache serving static assets
    * Aggregated on HDFS via logrotate
    * Archived on S3
    * Analyzed via JRuby/Cascading on Hadoop
    * Doesn’t use: Flume, Scribe, etc
  • bcn.etsy.com: beaconed event stream
        {"event_guid":"c2ffb51808b.6d2be52959ef{".user_id": 8528531,"php_event_name":"s2","php_unique_id":"4fdf1cb5d5c078.37523961","php_event_date":"18/Jun/2012:08:19:01","locale_currency_code":"USD","pref_language":"en-US","region":"US","detected_region":"US","accept-languages":"en-US,en","isMobileDevice":"0","isMobileSupported":"0","isTabletSupported":"0","isTouch":"0","isEtsyApp":"0","listing_ids":[60274277,101504389,98682771,88585080],"cids":[14103953,14239293,14247717,14209614],"query":"blue","keywords":["blue","blue","blue","blue"],"position":1,"replay_number":1,"s2_cached":1,"php_ab_test_names":"orm_record_instance_caching;mobile_detector.all_blackberry;multilang_shops_listings.view;ga_replacement_cookie;disable_search_autosuggest;admin_toolbar;translations.live_translations;ab_analytics_test;search_type_experiment;search_ads.max_replays_less;search_diversity_experiment;search_cached_listing_cards;placefinder.cache_memcached_migration;search_stream_a;search_all_items_ignores_supplies;search_default_type;search.two_cluster_deploy;search_parameter_sample;thrift_category2_transform;search.similar_listing_browse_page;orm_replicant_safe_find_many;bottom_first;foreign_language_carousel;search.related_searches_all_items;weddings.srp_promos;search_log_page_position;newrelic;clientlog;google_analytics_async;personalized_endpoint;search_no_dropdown;community_nav_popout;security_settings;search_changes_tooltip;inline_listing_hearts;framelogger;log_normal;analytics_second_beacon;analytics_second_beacon_privileged;analytics_second_beacon_mobile","php_ab_var_names":"1;1;1;1;control;1;0;A;ponycorn_v3;1;threshold_off;1;1;1;0;all_sans_supplies;0;1;1;1;1;0;top;0;0;1;0;1;0;1;1;1;0;1;1;1;0;1;0;1","php_ab_selector_names":"
  • Search (architecture diagram: Search Master; Search Slave01, Slave02 ... SlaveNN; Web01, Web02 ... WebNN)
    * BitTorrent to distribute indexes
    * Thrift, with server affinity to improve cache hit ratio, just returns ids
    * 100% of all indexes on each slave
    * incremental index, every 7 minutes, avoid even numbered cron times
    * hydrate IDs via multi-get, ignore a few failures
    * pull via cron, push via gearman
    * denormalized listing store, databases and memcache
    * transition from MySQL to HBase, not user facing
  • Search
    * Solr trunk
    * Custom ranking via crunched datasets
    * BitSet fields for personalized search
    * Scaling the JVM
    * 32% of visits, 40% of sales
    * Also powers categories, unshardable queries
    * Next time, just use HTTP
    * Up next: custom codecs
    * Avoiding sharding
  • Search
    * JVM slow start
    * Search deployinator does rolling restart
    * HotSpot and GC cause unpredictable throughput
    * Overfetch: ask multiple servers, go with 1st response
    * Index size is important. Don’t store too much.
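
"Overfetch" (ask multiple servers, go with the first response) can be sketched with a thread pool. This is a Python illustration only: the slides describe Thrift calls from PHP, and `overfetch`, `fake_search` and the server names below are hypothetical. The fake search sleeps to simulate one slave pausing, for example during a GC.

```python
import concurrent.futures as cf
import time

def overfetch(query, servers, search_fn, timeout=1.0):
    """Send the same query to several servers; return the first result."""
    pool = cf.ThreadPoolExecutor(max_workers=len(servers))
    futures = [pool.submit(search_fn, server, query) for server in servers]
    try:
        for future in cf.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return future.result()
        raise RuntimeError("all servers failed")
    finally:
        pool.shutdown(wait=False)  # do not wait for the slower replies

# Hypothetical stand-in for a search call; slave01 is having a slow moment.
def fake_search(server, query):
    time.sleep({"slave01": 0.3, "slave02": 0.01}[server])
    return server, [60274277, 101504389]

winner, listing_ids = overfetch("blue", ["slave01", "slave02"], fake_search)
```

The trade is extra load on the search tier in exchange for latency that tracks the fastest replica rather than the unluckiest one.
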
  • Photos (credit: JonathanOtis, http://www.etsy.com/listing/96361102/)
    * 400 million photos
    * Uploaded locally, then streamed to S3
    * GraphicsMagick FTW
    * Working set is tiny, served out of Squid
    * 2% read failure rate during full S3 outage.
    * 0% write failure rate during full S3 outage.
  • Technology no longer part of the stack
    * Python Twisted
    * PostgreSQL and stored procedures
    * Scala and MongoDB
    * Clojure and Tokyo Tyrant
    * Rails
    * ActiveMQ
    * RabbitMQ
    * a "Routes" framework
    * building RPMs
    * Lighttpd
  • Takeaways
    1. A few simple, boring, well known components
    2. Extensive instrumentation
    3. Rapid iteration and feedback loops
    4. Human centric
    5. A few tweaks on the classics for scale
    6. Technology supports business goals
  • Questions? More info:
    * http://codeascraft.etsy.com
    * http://slideshare.net/etsy
    * http://github.com/etsy
    * http://www.etsy.com/jobs
    * kellan@etsy.com