
AWS re:Invent 2016: Migrating Your Data Warehouse to Amazon Redshift (DAT202)


Amazon Redshift is a fast, simple, cost-effective data warehousing solution. In this session, we look at the tools and techniques you can use to migrate your existing data warehouse to Amazon Redshift, and then present a case study of Scholastic’s migration. Scholastic, a large, 100-year-old publishing company, was running its business on older, on-premises data warehousing and analytics solutions that could not keep up with business needs and were expensive. Scholastic also needed new capabilities such as streaming data and real-time analytics. By migrating to Amazon Redshift, Scholastic achieved agility and faster time to insight while dramatically reducing costs. In this session, Scholastic discusses how they achieved this, including the options considered, the technical architecture implemented, results, and lessons learned.



  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. November 30, 2016. Migrating Your Data Warehouse to Amazon Redshift (DAT202). Pavan Pothukuchi, Sr. Manager PM, Amazon Redshift; Ali Khan, Director of BI and Analytics, Scholastic; Laxmikanth Malladi, Principal Architect, NorthBay Solutions.
“It’s our biggest driver of growth in our biggest markets, and is a feature of the company” – on data mining in Redshift, Chris Lambert, Lyft CTO
“The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going on in our ad delivery landscape, something we have never been able to do until now.” – Bryan Blair, Vevo’s VP of Ad Operations
“Analytical queries are 10 times faster in Amazon Redshift than they were with our previous data warehouse.” – Yuki Moritani, NTT Docomo Innovation Manager
“We have several petabytes of data and use a massive Redshift cluster. Our data science team can get to the data faster and then analyze that data to find new ways to reduce costs, market products, and enable new business.” – Yuki Moritani, NTT Docomo Innovation Manager
“We saw a 2x performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement.” – Naeem Ali, Director of Software Development, Data Science at Cablevision (Optimum)
“Over the last few years, we’ve tried all kinds of databases in search of more speed, including $15k of custom hardware. Of everything we’ve tried, Amazon Redshift won out each time.” – Periscope Data, Analyst’s Guide to Redshift
“We took Amazon Redshift for a test run the moment it was released. It’s fast. It’s easy. Did I mention it’s ridiculously fast? We’re using it to provide our analysts an alternative to Hadoop.” – Justin Yan, Data Scientist at Yelp
“The move to Redshift also significantly improved dashboard query performance… Redshift performed ~200% faster than the traditional SQL Server we had been using in the past.” – Dean Donovan, Product Development at DiamondStream
“…[Redshift] performance has blown away everyone here (we generally see 50-100x speedup over Hive)” – Jie Li, Data Infrastructure at Pinterest
“450,000 online queries 98 percent faster than previous traditional data center, while reducing infrastructure costs by 80 percent.” – John O’Donovan, CTO, Financial Times
“We needed to load six months' worth of data, about 10 TB of data, for a campaign. That type of load would have taken about 20 days with our previous solution. By using Amazon Redshift, it only took six hours to load the data.” – Zhong Hong, VP of Infrastructure, Vivaki (Publicis Groupe)
“We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily.” – Bob Harris, CTO, Channel 4
“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift” – Kishore Raja, Director of Strategic Programs and R&D, Boingo Wireless
“Most competing data warehousing solutions would have cost us up to $1 million a year. By contrast, Amazon Redshift costs us just $100,000 all-in, representing a total cost savings of around 90%” – Joel Cumming, Head of Data, Kik Interactive
“Annual costs of Redshift are equivalent to just the annual maintenance of some of the cheaper on-premises options for data warehouses.” – Kevin Diamond, CTO, HauteLook (Nordstrom)
“Our data volume keeps growing, and we can support that growth because Amazon Redshift scales so well. We wouldn’t have that capability using the supporting on-premises hardware in our previous solution.” – Ajit Zadgaonkar, Director of Ops. and Infrastructure, Edmunds
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like - from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts had in different areas” – Jon Hoffman, Head of Engineering, Foursquare
  2. 2. Today’s agenda • Amazon Redshift Overview • Use cases and benefits • Migration options • Scholastic’s use case • Architecture details • Technical overview • Key project learnings
  3. 3. Relational data warehouse Massively parallel; petabyte scale Fully managed HDD and SSD platforms $1,000/TB/year; starts at $0.25/hour Amazon Redshift a lot faster a lot simpler a lot cheaper
  4. 4. The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change. Forrester Wave™ Enterprise Data Warehouse Q4 ’15
  5. 5. Selected Amazon Redshift customers
  6. 6. Why migrate to Amazon Redshift?
• From a transactional database: 100x faster; scales from GBs to PBs; analyze data without storage constraints
• From an MPP database: 10x cheaper; easy to provision and operate; higher productivity
• From Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, streaming
  7. 7. Migration from Oracle @ Boingo Wireless • 2,000+ commercial Wi-Fi locations • 1 million+ hotspots • 90M+ ad engagements • 100+ countries • Legacy DW: Oracle 11g-based data warehouse
Before migration: rapid data growth slowed analytics; mediocre IOPS, limited memory, vertical scaling; admin overhead; expensive (license, hardware, support)
After migration: 180x performance improvement; 7x cost savings
  8. 8. Migration from Oracle @ Boingo Wireless [Charts: annual cost: Exadata $400,000, SAP HANA $300,000, Redshift $55,000; latency in seconds (existing system vs. Redshift): query on 1 year of data, 7,200 vs. 15; load of 1 million records, 2,700 vs. 15] 7x cheaper than Oracle Exadata; 180x faster than Oracle database
  9. 9. Migration from Greenplum @ NTT Docomo • 68 million customers • 10s of TBs of data per day across the mobile network • 6 PB of total data (uncompressed) • Data science for marketing, operations, logistics, etc. • Legacy DW: Greenplum on-premises
After migration: 125-node DS2.8XL cluster (4,500 vCPUs, 30 TB RAM, 6 PB uncompressed); 10x faster analytic queries; 50% reduction in time to deploy new BI applications; significantly less operational overhead
  10. 10. Migration from SQL on Hadoop @ Yahoo • Analytics for website/mobile events across multiple Yahoo properties • On an average day: 2B events, 25M devices
Before migration: Hive, which they found slow and hard to use, share, and repeat
After migration: 21-node DC1.8XL (SSD) cluster; 50 TB compressed data; 100x performance improvement; real-time insights; easier deployment and maintenance
  11. 11. Migration from SQL on Hadoop @ Yahoo [Chart: query latency in seconds (log scale, 1 to 10,000) for count distinct devices, count all events, filter clauses, and joins; Amazon Redshift vs. Impala]
  12. 12. Business Value and Productivity • Analyze more data • Faster time to market • Get better insights • Match capacity with demand
  13. 13. How to migrate? [Diagram: migrating from an existing engine to Amazon Redshift, covering ETL scripts, SQL in reports, and ad hoc queries]
• Schema conversion: map data types; choose compression encodings, sort keys, and distribution keys; generate and apply DDL
• Schema and data transformation: convert SQL code; assess gaps (stored procedures, functions)
• Data migration: bulk load; capture updates; transformations
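The schema-conversion step above (map data types, choose encodings and keys, generate DDL) can be sketched in miniature. This is only an illustration: the type map, encodings, table, and column names below are hypothetical, and real conversions handle far more cases.

```python
# Hypothetical sketch of schema conversion: map source column types to
# Redshift types, attach a compression encoding, and emit CREATE TABLE DDL.
TYPE_MAP = {  # source type -> (Redshift type, encoding); illustrative only
    "NUMBER": ("BIGINT", "delta"),
    "VARCHAR2": ("VARCHAR", "lzo"),
    "DATE": ("TIMESTAMP", "lzo"),
}

def convert_column(name, src_type, length=None):
    rs_type, encoding = TYPE_MAP[src_type]
    if length and rs_type == "VARCHAR":
        rs_type = f"VARCHAR({length})"
    return f"{name} {rs_type} ENCODE {encoding}"

def generate_ddl(table, columns, distkey, sortkeys):
    cols = ",\n  ".join(convert_column(*c) for c in columns)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"DISTKEY ({distkey})\n"
        f"SORTKEY ({', '.join(sortkeys)});"
    )

ddl = generate_ddl(
    "sales_fact",
    [("order_id", "NUMBER"), ("customer", "VARCHAR2", 64), ("sold_at", "DATE")],
    distkey="order_id",
    sortkeys=["sold_at"],
)
print(ddl)
```

In practice this is the work AWS SCT automates, including its own optimization of key and encoding choices.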
  14. 14. AWS Schema Conversion Tool (AWS SCT) • Convert schema in a few clicks • Sources include Oracle, Teradata, Greenplum, and Netezza • Automatic schema optimization • Converts application SQL code • Detailed assessment report
  15. 15. AWS Schema Conversion Tool
  16. 16. AWS Database Migration Service (AWS DMS) • Start your first migration in a few minutes • Sources include Aurora, Oracle, SQL Server, MySQL, and PostgreSQL • Bulk load and continuous replication • Migrate a TB for $3 • Fault tolerant
  17. 17. AWS DMS: change data capture [Diagram: a replication instance copies tables t1 and t2 from source to target; transactions committed during the bulk load are captured and applied after it]
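The change-apply behavior on this slide can be sketched as plain data manipulation. The function names and event shape below are illustrative, not the DMS API: a bulk load copies the source snapshot, then changes captured while the load ran are applied in commit order.

```python
# Minimal sketch of DMS-style CDC: snapshot copy, then ordered change apply.
def bulk_load(snapshot):
    # Copy the full source snapshot into the target, keyed by primary key.
    return {row["id"]: dict(row) for row in snapshot}

def apply_changes(target, changes):
    # Changes must be applied in source commit order to converge correctly.
    for op, row in changes:
        if op in ("insert", "update"):
            target[row["id"]] = dict(row)
        elif op == "delete":
            target.pop(row["id"], None)
    return target

snapshot = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [
    ("update", {"id": 1, "v": "a2"}),  # row changed during the bulk load
    ("insert", {"id": 3, "v": "c"}),
    ("delete", {"id": 2}),
]
target = apply_changes(bulk_load(snapshot), changes)
print(target)  # {1: {'id': 1, 'v': 'a2'}, 3: {'id': 3, 'v': 'c'}}
```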
  18. 18. Data integration partners [Logos: data integration and systems integrator partners around Amazon Redshift]
  19. 19. Beyond Amazon Redshift…
  20. 20. Scholastic, Established 1920
  21. 21. Where were we?
Platform: 13+ years old. IBM AS/400 DB2 and Microsoft SQL Server are the primary data warehouse platforms. The BI platform is primarily Microsoft (SSRS, SSAS, Excel, SharePoint). 500+ direct users across every LOB and business function. 20+ TB; 5,500+ DB2 workloads, 350+ SQL Server workloads, 15 SSAS cubes, 150+ SSRS reports.
Challenges:
• Inflexible, multi-layered architecture: slow time to market
• Inability to meet internal SLAs due to performance of daily ETL processes
• Scalability limitations with SQL Server Analysis Services (SSAS) for reports
• Limited ability to perform self-service business intelligence
  22. 22. Moving forward: Key decision factors • Improved performance, scalability, availability, logging, security • Enablement of self service business intelligence • Leverage the skill set of current team (Relational DB & SQL) • Integration with existing technology stack • Alignment with the tech strategy (devops model, Cloud First) • Ability to support Big Data initiatives • Team up with an experienced consulting partner
  23. 23. Why we chose AWS and Amazon Redshift
AWS was chosen for its agility, scalability, elasticity, and security.
Redshift: • Scalable, fast • Managed service, cost-optimization models, elastic • SQL/relational matched the team’s skill set
S3 was chosen as the location for the ingestion process.
NorthBay was chosen as the implementation partner for their expertise in Big Data and Redshift migrations.
  24. 24. How the project unfolded
Goals: • 3-month pilot to migrate a functional area in a key LOB • Demonstrate immediate business value • Use the AWS stack and open source for data movement from DB2 (no CDC/ETL tool)
Outcomes: • Core framework for migration • ELT architecture and validation • Visualization/self-service capability through Tableau
  25. 25. Scholastic data cloud: technical architecture [Diagram: source DBs (AS400/DB2 staging, SQL Server EDW, SSAS cubes, SSRS reports); Data Pipeline orchestrating an EMR cluster running Sqoop; script output bucket in S3; EC2 instance running the COPY command; Redshift staging and Redshift enterprise data repository; SNS topics for pipeline status and failure with email notification; Lambda saving pipeline stats; RDS MySQL instance (pipeline configurations); DynamoDB; Tableau as the reporting tool]
  26. 26. Core framework • Jobs and job groups are defined as metadata in DynamoDB • Control-M scheduler, a custom application, and Data Pipeline for orchestration • ELT process: extraction with EMR/Sqoop; load and transform the data through Redshift SQL scripts • The core framework enables: restart capability from the point of failure; capture of operational statistics (number of rows updated, etc.); audit capability (which feed caused a fact to change, etc.)
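A minimal sketch of the restart capability described above (not Scholastic's actual code): job-group state, stored as metadata (a plain dict here, standing in for DynamoDB), records each job's status so a rerun resumes at the point of failure. The job names and the failure simulation are hypothetical.

```python
# Hedged sketch of restart-from-point-of-failure in a metadata-driven runner.
def run_job_group(jobs, state, runner):
    for name in jobs:
        if state.get(name) == "SUCCEEDED":
            continue  # restart capability: skip work completed in a prior run
        try:
            runner(name)
        except Exception:
            state[name] = "FAILED"
            raise  # stop the group; state records where to resume
        state[name] = "SUCCEEDED"
    return state

state = {}
attempts = {"load_staging": 0}

def flaky(name):
    # Simulate a transient failure in the second job on its first attempt.
    if name == "load_staging":
        attempts[name] += 1
        if attempts[name] == 1:
            raise RuntimeError("transient failure")

jobs = ["extract", "load_staging", "transform"]
try:
    run_job_group(jobs, state, flaky)
except RuntimeError:
    pass  # first run stops at load_staging
run_job_group(jobs, state, flaky)  # rerun skips extract, resumes at the failure
print(state)
```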
  27. 27. Extract • Pre-create EMR resources at the start of a batch • Achieve parallelism in Sqoop with mappers and fair scheduling • The Sqoop query adds additional fields such as batch_id and updated_date • Data extracts are split and compressed for optimized loading into Redshift [Diagram: Data Pipeline orchestrates extraction from AS400/DB2 through EMR with Sqoop into S3, using job metadata and KMS]
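The extract step above can be illustrated by assembling a Sqoop invocation. The JDBC URL, table, and S3 bucket below are placeholders; the flags are standard Sqoop import options, and the audit columns mirror the batch_id/updated_date fields mentioned on the slide.

```python
# Illustrative sketch: build a Sqoop import command that adds audit columns
# in a free-form query, runs with parallel mappers, and compresses output.
def sqoop_import_cmd(jdbc_url, table, batch_id, split_col, mappers=8):
    query = (
        f"SELECT t.*, {batch_id} AS batch_id, CURRENT_TIMESTAMP AS updated_date "
        f"FROM {table} t WHERE $CONDITIONS"  # Sqoop fills $CONDITIONS per mapper
    )
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--query", query,
        "--split-by", split_col,        # column used to partition the query
        "--num-mappers", str(mappers),  # parallel extract tasks
        "--compress",
        "--compression-codec", "org.apache.hadoop.io.compress.GzipCodec",
        "--target-dir", f"s3://example-extract-bucket/{table}/{batch_id}/",
    ]

cmd = sqoop_import_cmd("jdbc:db2://as400:446/prod", "orders", 20161130, "order_id")
```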
  28. 28. Load • Truncate and load through Data Pipeline for staging tables • Dynamic workload management (WLM) queues are set up to allow maximum resources during loading/transformation • Check for and terminate any locks on tables to allow truncation • Capture metrics such as number of rows loaded and time taken [Diagram: Data Pipeline drives an EC2 instance that loads extracts from S3 (with KMS) into Redshift staging]
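The truncate-and-load sequence can be sketched as the SQL the EC2 worker would issue. The bucket and IAM role below are placeholders; the COPY options are standard Redshift syntax for compressed, delimited extracts.

```python
# Minimal sketch of the staging load: TRUNCATE, then COPY from S3.
def staging_load_sql(table, s3_prefix, iam_role):
    return [
        f"TRUNCATE TABLE {table};",
        (
            f"COPY {table} FROM '{s3_prefix}' "
            f"IAM_ROLE '{iam_role}' "
            "GZIP DELIMITER '|' TIMEFORMAT 'auto';"  # matches gzip'd extracts
        ),
    ]

stmts = staging_load_sql(
    "staging.orders",
    "s3://example-extract-bucket/orders/20161130/",
    "arn:aws:iam::123456789012:role/example-redshift-copy",
)
```

Loading from a prefix of split, compressed files lets Redshift parallelize the COPY across slices, which is why the extract step splits and compresses its output.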
  29. 29. Transform • A custom application builds dimensions and facts • SQL scripts are stored in S3 and executed by the ELT process • SQL scripts were refactored from SQL Server and AS400 scripts • Non-functional requirements are handled by the custom application [Diagram: the app reads metadata and SQL scripts from S3 and transforms staging tables into dimensions and facts]
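The transform loop can be sketched with injected I/O so it runs anywhere: the custom application fetches the refactored SQL scripts (stored in S3 in the real pipeline) and executes them in order, capturing row counts for the audit metadata. The script names and the row-count stand-in are hypothetical.

```python
# Hedged sketch of the transform step with pluggable fetch/execute hooks.
def run_transforms(script_keys, fetch_script, execute):
    metrics = []
    for key in script_keys:
        sql = fetch_script(key)   # real app: read the script object from S3
        rows = execute(sql)       # real app: rows affected, from the DB driver
        metrics.append({"script": key, "rows_affected": rows})
    return metrics

scripts = {"dim_customer.sql": "UPDATE ...;", "fact_orders.sql": "INSERT ...;"}
metrics = run_transforms(
    ["dim_customer.sql", "fact_orders.sql"],  # dimensions before facts
    fetch_script=scripts.__getitem__,
    execute=lambda sql: len(sql),  # stand-in for rows affected
)
```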
  30. 30. Schema design • Modified star schema • Natural keys instead of generated unique identifiers • Commonly used columns from dimensions are copied over to facts • Surrogate keys are eliminated except in a few cases • Compression • Define appropriate distribution and sort keys • Define primary keys and foreign keys
  31. 31. Security • AWS Key Management Service (KMS) is used to encrypt access credentials for source and target databases • A Jenkins job lets database administrators encrypt credentials with KMS directly • Amazon EMR and Jenkins resources are given KMS decrypt permissions so they can connect to sources and targets during the ELT process • Standard security in transit and at rest throughout the process • IAM federation through enterprise Active Directory
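The credential flow can be sketched with an injected decrypt function so the example stays runnable; the "ciphertext" here is a base64 stand-in, not real KMS output, and in the actual pipeline the decrypt hook would wrap a KMS Decrypt call.

```python
import base64
import json

# Hedged sketch of the credential handling above: credentials are stored
# encrypted and decrypted at run time by the ELT process. decrypt_fn is
# injected; in the real pipeline it would call KMS Decrypt.
def load_credentials(encrypted_blob, decrypt_fn):
    return json.loads(decrypt_fn(encrypted_blob))

# Stand-in "ciphertext" for illustration; real ciphertext would come from
# the Jenkins job that encrypts DBA-supplied credentials with KMS.
fake_ciphertext = base64.b64encode(b'{"user": "etl", "password": "secret"}')
creds = load_credentials(fake_ciphertext, base64.b64decode)
print(creds["user"])  # etl
```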
  32. 32. Reporting • Business users access facts and dimensions through Tableau • Power users access staging tables through Tableau • Data analysts access files in S3 using Hive/Presto on EMR • Self-service capability across business users
  33. 33. Workstream effort • Define jobs and job groups specific to each workstream • Create Redshift tables (staging, facts, dimensions) based on mappings from AS400 and best practices learned • Create new SQL scripts (based on the logic from AS400/SQL Server code) for transformation • Develop, test, and deploy in 2-week Agile sprints
  34. 34. Key lessons - technical • Keep the core framework isolated, with project-specific code in separate repositories • Consolidating a logging solution across Amazon S3, Amazon Redshift, Amazon DynamoDB, etc. was a challenge • Make appropriate schema changes when migrating to a new platform • Build a custom framework for gathering operational stats (e.g., number of rows loaded) • Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project
  35. 35. Project timeline revisited
After the successful pilot:
• Executive leadership accelerated the timeline: reduce the project timeline by 50% (to 12 months) to deliver value faster to LOBs, and realize cost savings by eliminating the DB2 and SQL Server platforms earlier
• Users wanted to be on the new platform!
• Scholastic and NorthBay partnered to create a training curriculum to ensure a supply of skilled staff would be available to our teams
  36. 36. Scaling up: 7 workstreams • Developed a model for estimating effort and cost (AWS costs and labor per LOB migration) • Ran agile teams in parallel and employed Agile coaches • Enhanced the core framework to ensure it would scale effectively when used by multiple teams simultaneously • Built a code repository for use by all teams • Built CI/CD frameworks
  37. 37. Where are we now? • 4 of 7 LOBs migrated; the framework enables complete migration of a functional area within days or weeks as opposed to months; on track to migrate and decommission the entire legacy environment within the next 6 months • 10 weeks to migrate from an external vendor hosting data and providing reports for one LOB • Cost of the data ingestion framework is under $40/day (EC2, EMR, Data Pipeline) • First “Big Data” initiative in production captures and processes an average of 1.5 million e-reading events daily (peak: 7 million) • Profile of LOB #1: loading ~5-6 million rows/day (6-7 GB/day); processing over 1.5 billion rows within Redshift daily; complete ETL/ELT batch cycle performance improved by over 170%
  38. 38. Key lessons - project execution • Essential to monitor and optimize AWS costs • A “Data Champion”/“Data Guide” partnership is absolutely critical for successful adoption of new platforms • Importance of strong Agile coaches while scaling out Agile teams • Criticality of choosing consulting partners (AWS and NorthBay) who can ramp up and supply key resources fast and cycle off the project when finished • Creating new data platforms and migrating data into them is easy, especially with AWS; decommissioning existing data platforms is hard!
  39. 39. Thank you!
  40. 40. Remember to complete your evaluations!
  41. 41. Related Sessions. Hear from other customers discussing their Amazon Redshift use cases: • BDM402—Best Practices for Data Warehousing with Amazon Redshift • BDA304—What’s New with Amazon Redshift • SVR308—Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year • GAM301—How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights • BDA207—Fanatics: Deploying Scalable, Self-Service Business Intelligence on AWS • BDM306—Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem • BDA203—Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (GE Power and Water) • BDM206—Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS (Hello) • STG307—Case Study: How Prezi Built and Scales a Cost-Effective, Multipetabyte Data Platform and Storage Infrastructure on Amazon S3