
Chris Bohn "Migrating Etsy infrastructure from On-premises to Google Cloud Platform"



Etsy is one of the largest and best-known specialty online marketplaces worldwide, with gross sales in 2017 exceeding $3 billion. Etsy was founded in 2005, before the emergence of viable cloud platforms. Until recently, all of Etsy's critical systems, including production and analytics data stacks, were hosted and managed on premises. In 2017, the decision was made to migrate all infrastructure to Google Cloud Platform (GCP), to become operational in 2018. This talk describes the migration, with a focus on moving Etsy's analytics data systems. The Etsy Analytics Data Stack consists of Hadoop for large batch jobs, Vertica for data analysis, and Kafka for clickstream and production data distribution, as well as custom tools for Data Science projects and ETL processes. In addition to migrating legacy technologies to GCP, Etsy has also integrated native GCP data products such as BigQuery (big data processing) and Airflow (workflow management, replacing Oozie).

The technical challenges and cloud economics of the migration will be discussed. This has been a very large project that has gone well, due to good planning and building the right teams. Anyone considering migrating infrastructure to the cloud, especially to GCP, will benefit from hearing about Etsy's challenges and solutions.

Published in: Technology


  1. 1. Migrating From On-Prem To The Cloud CB BOHN Senior Database Engineer Etsy
  2. 2. Migrating From On-Prem To The Cloud CB BOHN Senior Database Engineer Etsy
  3. 3. ETSY EMPOWERMENT LOOP • SELLER: Pursues craft, grows business • ETSY: Invests in the platform and delivers a global base of buyers • BUYER: Finds unique goods that are hard to find elsewhere • ETSY: Facilitates the transaction
  4. 4. Etsy Fun Facts • Founded in 2005 • GMS (gross merchandise sales) $2.84B in 2016 • 1.8 million active sellers • 31 million active buyers • Mobile visits: 66% • Mobile GMS: 51% • International GMS: 32% • 87% of sellers identify as female • ~500TB in analytics management
  5. 5. 5
  6. 6. 6
  7. 7. Why Move to the Cloud? • Etsy founded in 2005, before any viable commercial “cloud” existed (had to “roll your own”) • Momentum kept Etsy hosting its own infrastructure (it’s easy to stay in a relationship) • Investors demanded a move to the cloud to achieve cost reduction (but it is not really cheaper) • Real benefit of Cloud is being nimble: Faster time to deploy new infrastructure (“elasticity”)
  8. 8. Cloud Migration Planning • Decision made in Spring 2017 to move all infrastructure to Cloud • Extensive review of Cloud providers, with decision by end of 2017 • Target of moving production servers (application servers and MySQL servers) to Cloud by Summer 2018 • Analytics databases to be moved in stages, with final move in Summer 2019
  9. 9. Cloud Economics • Cloud charges are a combination of CPU, storage and network usage • Cloud providers charge for metered usage of shared resources, and a fixed price for dedicated resources, whether utilized or not (ELASTICITY HELPS HERE!) • Some Cloud providers bill usage of computing products (like databases) by storage access (bytes read from disk). This can get very expensive; a flat fee may be better, but you need to run the numbers :) • OLTP and OLAP have different economics in the Cloud
  10. 10. Production Data vs Analytics Data • Production data (OnLine Transaction Processing, OLTP): business processing on an RDBMS (Postgres, MySQL, etc.) • Analytics data (OnLine Analytics Processing, OLAP): data warehouse (AWS, Azure, GCP, Vertica, etc.) • Data warehouse contents: facts, dimensions, clickstream data (user behavior, conversion rate, A/B testing) • Data warehouse uses: business intelligence, data analytics, data science
  11. 11. OLTP vs OLAP
      Attribute | OLTP | OLAP
      Data Source | Business events | RDBMS and Clickstream
      Data Usage | Processing and storage of business events | Business introspection
      Design | Highly normalized tables | Often denormalized
      Purpose | Quick transaction processing | Fast data aggregation
      Data Ingestion | INSERT, UPDATE, DELETE | ETL from production RDBMS and Clickstreams
      Query Types | Simple but fast transactions | Complex but fast aggregations
      Speed Design | Indexes (B-Tree, etc.) | No indexes; uses sorting and encoding to effect very fast read times
      Storage | Relatively moderate if no change history | Can become very large if it serves as data repository
  12. 12. More OLTP vs OLAP Relational databases and analytic databases serve different purposes, and the underlying architectures differ accordingly: • OLTP databases are designed for fast transactions on specific records. The optimization pattern is to put indexes on high-cardinality fields, so that individual records can be retrieved quickly. • OLAP databases are designed for fast aggregation queries. These are usually columnar store databases, and the optimization pattern is to sort and encode low-cardinality columns, so that aggregation is very fast.
  13. 13. Comparing Enterprise vs Cloud OLAP Vertica: • Can be “On Premises” or in the Cloud (AWS, Google Cloud, Azure) • Advanced analytic SQL queries • SQL language very close to PostgreSQL • New “Eon” mode separates “Compute” from “Storage” in Cloud deployments; now very “elastic” Google BigQuery: • Cloud ONLY • Inexpensive storage • Separates “Compute” from “Storage”: very elastic • Rich SQL language • Queries can be expensive, BE CAREFUL! Charges based on runtime and rows touched • Until recently, data ingestion was append only: no updates/deletes
  14. 14. Architecture for Elastic OLAP • OLAP elasticity achieved by partitioning workloads by “vertical” classifications: Finance, Analytics, Rollups, etc. • Difficult to apply elasticity to OLTP databases, but can be applied to application layer • Need good scheduling to time the spin-up and spin-down of verticals
  15. 15. Relational Database Indexes
  16. 16. Columnar Databases: Sorting and Encoding
  17. 17. Example Query Let’s assume we have a table called “students” that has the same fields in both the relational database and the analytics database: CREATE TABLE students ( student_id INT, gender CHAR(1), class CHAR(2) ); For the relational database, let’s assume a primary key on student_id. Let’s see how each database handles this query: SELECT COUNT(*) FROM students WHERE gender = 'F';
  18. 18. Relational Database Query Path A sequential scan of the entire table is required to satisfy the query because there is no index on the gender field. Even if there were, the query planner would likely ignore it: with only a few distinct values, the column has too little cardinality for an index to be selective.
  19. 19. Columnar Store Query Path
  20. 20. Questions?
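
The contrast drawn in slides 16 to 19 (a full scan in the relational database versus sorting and encoding in the columnar store) can be illustrated with a small Python sketch. This is a simplified model, not Vertica's actual implementation, which the slides do not describe: the sample rows, the class codes, and the helper names `row_store_count`, `rle_encode` and `column_store_count` are all assumptions made for illustration. It answers the slide's `SELECT COUNT(*) FROM students WHERE gender = 'F'` query both ways.

```python
from itertools import groupby

# Hypothetical sample rows: (student_id, gender, class).
students = [
    (1, "F", "FR"),
    (2, "M", "SO"),
    (3, "F", "JR"),
    (4, "F", "SR"),
    (5, "M", "FR"),
    (6, "F", "SO"),
]


def row_store_count(rows, gender):
    """Row store with no index on gender: every row must be inspected."""
    return sum(1 for _, g, _ in rows if g == gender)


def rle_encode(values):
    """Sort a column, then run-length encode it as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(sorted(values))]


def column_store_count(encoded_column, value):
    """Answer COUNT(*) from run lengths instead of individual rows."""
    return sum(length for v, length in encoded_column if v == value)


# Only the gender column is read; student_id and class are never touched.
gender_column = rle_encode(g for _, g, _ in students)

print(gender_column)                           # [('F', 4), ('M', 2)]
print(row_store_count(students, "F"))          # 4, after visiting all 6 rows
print(column_store_count(gender_column, "F"))  # 4, after visiting 2 runs
```

Because gender has very few distinct values, the sorted and encoded column collapses to one run per value, so the count is read from a couple of pairs no matter how many rows the table holds. That is the sense in which low-cardinality columns are poor candidates for an index but good candidates for sorting and encoding.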