Patchwork Data at Etsy

Big data at Etsy began in early 2010 and has since grown to power applications as diverse as ETL, A/B testing, recommender systems, and search indexing. Join us for an amusing tour through the history of big data at Etsy, going back to the roots of our mission-critical A/B testing approach, followed by a dive into a selection of the technologies that power these applications today.

Published in: Technology

  1. 1. Patchwork Data at Etsy Matt Walker
  2. 2. Etsy: June (timeline axis: 2005, 2007, 2009, 2011, 2013)
  3. 3. What happened?
  4. 4. We don’t like to talk about it
  5. 5. Okay, we do: http://codeascraft.etsy.com • https://www.etsy.com/codeascraft/talks • http://kongscreenprinting.com
  6. 6. Catch Phrases: Continuous deployment • Blameless postmortems • Measure everything • Continuous experimentation
  7. 7. Metrics-Driven Development: Ganglia • StatsD/Graphite • Splunk
  8. 8. Scaling a Traditional RDBMS: Sharded MySQL • memcached • Object-relational mapping in PHP
  9. 9. December (timeline axis: 2005, 2007, 2009, 2011, 2013)
  10. 10. Adtuitive: Online advertising network • Match forum posts with rich product advertisements • Unafraid of scaling across Etsy sellers
  11. 11. Adtuitive: Amazon Web Services • JRuby • Rails
  12. 12. LAMP Stack for Big Data: HDFS • Pig • MapReduce • Oozie • HBase • Avro • Hive • Zookeeper • Flume • JDBC/ODBC • Hue (source: http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/)
  13. 13. LAMP Stack for Big Data: HDFS • Pig • MapReduce • Oozie • HBase • Avro • Hive • Zookeeper • Flume • JDBC/ODBC • Hue (source: http://gigaom.com/2010/08/01/meet-big-data-equivalent-of-the-lamp-stack/)
  14. 14. LAMP Stack for Big Data: HDFS → S3 • Pig → Cascading • MapReduce (Elastic) • Oozie • HBase • Avro → TupleSerialization • Hive • Zookeeper • Flume • JDBC/ODBC • Hue
  15. 15. Powered by MapReduce: ETL • Analytics • A/B testing • Recommenders • Search
  16. 16. Applications: Log ETL • A/B Analyzer • Database snapshotter • Catapult • TasteTest • Distributed search indexing • Facebook Gift Recommender • Fast Game (search index) • Complementary/similar listings • Search autosuggest • Funnel Cake • SearchAds • Feature Funnel • SCRAM ETL (fraud detection)
  17. 17. Applications: Log ETL • A/B Analyzer • Database snapshotter • Catapult • TasteTest • Distributed search indexing • Facebook Gift Recommender • Fast Game (search index) • Complementary/similar listings • Search autosuggest • Funnel Cake • SearchAds • Feature Funnel • SCRAM ETL (fraud detection)
  18. 18. Catapult: End-to-end success story • Extremely valuable for a web shop
  19. 19. Relevancy Thursdays: January (timeline axis: 2005, 2007, 2009, 2011, 2013)
  20. 20. Relevancy Thursdays: Switch default sort order to relevance • Each Thursday in January
  21. 21. Relevancy Thursdays: Default search order was recency • Relisting was our equivalent of advertising • $0.20 updated your listing’s timestamp
  22. 22. Relevancy Thursdays: Recency was meant to support “freshness” in search results • Search originated as PostgreSQL query • Converted to Solr to scale
  23. 23. What happens if we switch to relevance?
  24. 24. Relevancy Thursdays: No A/B testing framework • No event logs • Limping along with Google Analytics
  25. 25. First Log Analysis: February (timeline axis: 2005, 2007, 2009, 2011, 2013)
  26. 26. First Log Analysis: Raw web access logs • URL- and ref tag-based • Regex parser
  27. 27. Heyday of Tooling: A/B framework • Front end event logger • Database snapshotter • Barnum and Bailey • Custom operator library • Loaders
  28. 28. LAMP Stack for Big Data: HDFS → S3 • Pig → Cascading • MapReduce (Elastic) • Oozie • HBase • Avro → TupleSerialization • Hive • Zookeeper • Flume • JDBC/ODBC • Hue
  29. 29. LAMP Stack for Big Data: HDFS → S3 • Pig → Cascading • MapReduce (Elastic) • Oozie → Barnum • HBase • Avro → TupleSerialization • Hive • Zookeeper • Flume → Akamai • JDBC/ODBC → snapshotter/loaders • Hue
  30. 30. A/B Framework: Ramp-ups + A/B testing • Feature flag development
  31. 31. Self-service analytics for any A/B test on the site
  32. 32. A/B Framework: June (timeline axis: 2005, 2007, 2009, 2011, 2013)
  33. 33. A/B Analyzer: November (timeline axis: 2005, 2007, 2009, 2011, 2013)
  34. 34. Why did it take so long? Non-web developers learning the PHP stack • Failed experiments with “easier to use” MapReduce tools • Realizing self-service analytics was what Etsy needed
  35. 35. Catapult: February (timeline axis: 2005, 2007, 2009, 2011, 2013)
  36. 36. Catapult: A/B Analyzer + Launch Calendar • Full product lifecycle
  37. 37. LAMP Stack for Big Data: HDFS → S3 • Pig → Cascading • MapReduce (Elastic) • Oozie → Barnum • HBase • Avro → TupleSerialization • Hive • Zookeeper • Flume → Akamai • JDBC/ODBC → snapshotter/loaders • Hue
  38. 38. LAMP Stack for Big Data: HDFS • Pig → Cascading • MapReduce • Oozie • HBase • Avro → TupleSerialization • Hive → Vertica • Zookeeper • Flume → logrotate • JDBC/ODBC → snapshotter/loaders • Hue
  39. 39. Computation Models: Batch • Interactive • Streaming
  40. 40. Batch
  41. 41. Cascading
  42. 42. RDBMS / Cascading: SQL ↔ cascading.jruby • Query Planner/Optimizer ↔ Cascading • Execution Engine ↔ MapReduce • Storage ↔ HDFS
  43. 43. cascading.jruby
  44. 44. cascading.jruby: Productivity (no compile) • Reuse (factor out structure) • Efficiency (no JRuby runtime) • Optimization (move aggregations map-side)
  45. 45. A nice constructor
  46. 46. cascading.jruby
  47. 47. Productivity: Job templates • Reloader • Cascading local mode • Sampled data
  48. 48. Reuse
  49. 49. Reuse
  50. 50. Field Names
  51. 51. Efficiency: Just a constructor • Calls into Cascading API • No JRuby runtime on cluster
  52. 52. Optimization
  53. 53. Tuple Data Model
  54. 54. UDFs
  55. 55. Scalding: Distributed collections • Function literals replace UDFs
  56. 56. Interactive
  57. 57. Vertica
  58. 58. Sharded MySQL: Borrowed from Flickr • Works
  59. 59. Thou Shalt Not Join
  60. 60. Hive: January (timeline axis: 2005, 2007, 2009, 2011, 2013)
  61. 61. Hive Turned Off: April (timeline axis: 2005, 2007, 2009, 2011, 2013)
  62. 62. Hive: Slow • Sensitive • Operational burden • Educational burden
  63. 63. Vertica: Offline copy of shards, master, auxiliary databases • Joins are easy • Reasonable latency
  64. 64. Vertica: November (timeline axis: 2005, 2007, 2009, 2011, 2013)
  65. 65. Vertica: Game changer at Etsy • High demand for joins • Rapid prototyping data pipelines
  66. 66. RDBMS / Cascading: SQL ↔ cascading.jruby • Query Planner/Optimizer ↔ Cascading • Execution Engine ↔ MapReduce • Storage ↔ HDFS
  67. 67. Back to MapReduce: Event logs • Schedule • Load data in prod • Scale
  68. 68. Vertica: Not Hive, Impala, Shark, etc. • May change our minds
  69. 69. Streaming
  70. 70. Not Powered by MapReduce: Activity Feed • Shop Stats
  71. 71. Etsyweb: memcached • Gearman • Sharded MySQL
  72. 72. Use cases: Trending • Fraud detection • ?
  73. 73. Turns out people don’t make product decisions in real time (http://mcfunley.com/whom-the-gods-would-destroy-they-first-give-real-time-analytics)
  74. 74. Summing Up: Be glad you’re living in the future • Automated tools for the common case • Don’t be afraid to experiment
  75. 75. Image Credits: http://kongscreenprinting.com/what-we-do-showcase • http://www.globaltimes.cn/SPECIALCOVERAGE/Top10Peopleof2011.aspx • http://animal.discovery.com • http://www.theculturemap.com/scream-time-edvard-munch-museum/ • http://www.rallyrace.com/turning-over-the-stone-event-production-basics/ • http://www.repentamerica.com/webelieve.html • http://www.flickr.com/photos/bbalaji/2443820505/ • https://soundcloud.com/tearland/tl-hive • http://pocketnow.com/2012/08/02/wifi-vs-data-speed-vs-battery-life/bush-scratching-head • http://www.madeyoulaugh.com/funny_photos/caveman_harley/caveman_harley.jpg • http://theundercoverrecruiter.com/6-ways-catapult-your-job-search-after-layoff/
  76. 76. Contact / Reference: Matt Walker • @data_daddy • http://codeascraft.etsy.com/ • http://www.etsy.com/codeascraft/talks
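
Slide 26 describes the first log analysis as a regex parser over raw web access logs, keyed on URLs and ref tags. The talk does not show that parser, so the following is only a minimal Python sketch of the general idea. It assumes the logs follow the common Apache-style access-log layout and that the ref tag travels as a `ref` query parameter; both are assumptions, as are the names `LOG_LINE` and `parse_line` and the made-up sample line.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: Apache-style access log. The real log format is not given in the talk.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    parts = urlsplit(match.group("url"))
    query = parse_qs(parts.query)
    return {
        "host": match.group("host"),
        "time": match.group("time"),
        "path": parts.path,
        "status": int(match.group("status")),
        # "ref" is an assumed query-parameter name for the ref tag.
        "ref": query.get("ref", [None])[0],
    }

# Hypothetical sample line, invented for illustration only.
sample = ('203.0.113.7 - - [01/Jan/2000:00:00:00 +0000] '
          '"GET /listing/123?ref=search_results HTTP/1.1" 200 5120')
event = parse_line(sample)
```

Lines that do not match return `None`, so a batch job built on this could count and skip malformed input instead of failing.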

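
Slide 30 pairs ramp-ups with A/B testing and feature flags. One common way to get both from a single mechanism is deterministic hash bucketing, sketched below in Python. This is not Etsy's framework, which the talk does not show; the function names `bucket` and `variant`, the use of SHA-256, and the experiment name are all assumptions made for illustration.

```python
import hashlib

def bucket(experiment, visitor_id):
    """Map (experiment, visitor) to a stable number in [0, 100)."""
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash, reduced to hundredths of a percent.
    return (int(digest[:8], 16) % 10000) / 100.0

def variant(experiment, visitor_id, ramp_percent):
    """Return 'treatment' for visitors whose bucket falls below ramp_percent."""
    if bucket(experiment, visitor_id) < ramp_percent:
        return "treatment"
    return "control"

# Hypothetical experiment name and visitor ids.
assignments = [variant("relevance_sort", f"visitor-{i}", 50) for i in range(1000)]
```

Because a visitor's bucket never changes, raising `ramp_percent` only adds visitors to the treatment group. A ramp-up therefore never flips someone who already saw the new feature back to the old one.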
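
Slide 44 lists "move aggregations map-side" as an optimization. The idea can be illustrated without Hadoop: each mapper sums counts for its own input split before emitting anything, so fewer records cross the shuffle to the reducer. The Python sketch below simulates this in memory. It is a toy model of the technique, not cascading.jruby or Cascading code, and the function names and sample events are invented.

```python
from collections import Counter, defaultdict

def map_with_partial_aggregation(records):
    """Map phase: pre-sum counts per key within one input split before emitting."""
    partial = Counter()
    for key in records:
        partial[key] += 1
    return list(partial.items())

def reduce_counts(mapper_outputs):
    """Reduce phase: add up the partial counts emitted by every mapper."""
    totals = defaultdict(int)
    for output in mapper_outputs:
        for key, count in output:
            totals[key] += count
    return dict(totals)

# Two hypothetical input splits of event names.
splits = [["view", "view", "cart"], ["view", "cart", "cart", "purchase"]]
mapper_outputs = [map_with_partial_aggregation(split) for split in splits]
totals = reduce_counts(mapper_outputs)

# Records crossing the shuffle with and without map-side aggregation.
shuffled_with = sum(len(output) for output in mapper_outputs)
shuffled_without = sum(len(split) for split in splits)
```

In this toy input the mappers emit 5 partial counts instead of 7 raw records; on real logs with heavily repeated keys the reduction would be much larger.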