
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series


You can gain substantially richer business insights and save costs by migrating your on-premises data warehouse to Amazon Redshift, a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data at a fraction of the cost of traditional data warehouses. This webinar will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process.

Learning Objectives:
• Understand how Amazon Redshift can deliver richer, faster analytics at much lower cost.
• Learn key factors to consider before migrating and how to put together a migration plan.
• Learn best practices and tools for migrating schema, data, ETL and SQL queries.



  1. 1. Migrate Your Data Warehouse to Amazon Redshift – Greg Khairallah, Business Development Manager, AWS; David Giffin, VP Technology, TrueCar; Sharat Nair, Director of Data, TrueCar; Blagoy Kaloferov, Data Engineer, TrueCar. September 21, 2016
  2. 2. Agenda • Motivation for Change and Migration • Migration Patterns and Best Practices • AWS Database Migration Service • Use Case – TrueCar • Questions and Answers
  3. 3. Amazon Redshift – a lot faster, a lot simpler, a lot cheaper: relational data warehouse; massively parallel, petabyte scale; fully managed; HDD and SSD platforms; $1,000/TB/year; starts at $0.25/hour
  4. 4. Amazon Redshift delivers performance “[Amazon] Redshift is twenty times faster than Hive.” (5x–20x reduction in query times) link “Queries that used to take hours came back in seconds. Our analysts are orders of magnitude more productive.” (20x–40x reduction in query times) link “…[Amazon Redshift] performance has blown away everyone here (we generally see 50–100x speedup over Hive).” link “Team played with [Amazon] Redshift today and concluded it is awesome. Un-indexed complex queries returning in < 10s.” “Did I mention it's ridiculously fast? We'll be using it immediately to provide our analysts an alternative to Hadoop.” “We saw… 2x improvement in query times.” Channel “We regularly process multibillion row datasets and we do that in a matter of hours.” link
  5. 5. Amazon Redshift is cost optimized. DS2 (HDD), price per hour for a single DS2.XLarge node / effective annual price per TB compressed: On-Demand $0.850 / $3,725; 1-Year Reservation $0.500 / $2,190; 3-Year Reservation $0.228 / $999. DC1 (SSD), price per hour for a single DC1.Large node / effective annual price per TB compressed: On-Demand $0.250 / $13,690; 1-Year Reservation $0.161 / $8,795; 3-Year Reservation $0.100 / $5,500. Pricing is simple: number of nodes x price/hour; no charge for the leader node; no upfront costs; pay as you go. Prices shown are for US East; other regions may vary.
  6. 6. Considerations Before You Migrate • Data is often being loaded into another warehouse – existing ETL process with investment in code and process • Temptation is to ‘lift & shift’ workload. • Resist temptation. Instead consider: – What do I really want to do? – What do I need? • Some data does not lend itself to a relational schema • Common pattern is to use Amazon EMR: – impose structure – import into Amazon Redshift
  7. 7. Amazon Redshift architecture • Leader node: simple SQL endpoint; stores metadata; optimizes the query plan; coordinates query execution • Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, and resizes • Start at just $0.25/hour and grow to 2 PB (compressed): DC1 (SSD) scales from 160 GB to 326 TB; DS2 (HDD) scales from 2 TB to 2 PB
  8. 8. A deeper look at compute node architecture • Dense compute nodes – Large: 2 slices/cores, 15 GB RAM, 160 GB SSD; 8XL: 32 slices/cores, 244 GB RAM, 2.56 TB SSD • Dense storage nodes – X-large: 2 slices/4 cores, 31 GB RAM, 2 TB HDD; 8XL: 16 slices/36 cores, 244 GB RAM, 16 TB HDD
  9. 9. Amazon Redshift Migration Overview (diagram): data moves from the corporate data center (source DBs, logs/files) into the AWS cloud over a VPN connection, AWS Direct Connect, S3 multipart upload, Amazon Snowball, EC2 or on-premises hosts (using SSH), AWS Database Migration Service, Amazon Kinesis, AWS Lambda, or AWS Data Pipeline, landing in Amazon S3, Amazon DynamoDB, Amazon EMR, Amazon RDS, Amazon Redshift, or Amazon Glacier depending on data volume and use case.
  10. 10. Uploading Files to Amazon S3 • Ensure that your data resides in the same region as your Redshift cluster • Split the data into multiple files (e.g., Client.txt → Client.txt.1, Client.txt.2, Client.txt.3, Client.txt.4) to facilitate parallel processing • Files should be individually compressed using GZIP or LZOP • Optionally, you can encrypt your data using Amazon S3 server-side or client-side encryption
  11. 11. Loading – Use multiple input files to maximize throughput • Use the COPY command • Each slice can load one file at a time • A single input file means only one slice is ingesting data • Instead of 100 MB/s, you're only getting 6.25 MB/s
  12. 12. Loading – Use multiple input files to maximize throughput • Use the COPY command • You need at least as many input files as you have slices • With 16 input files, all slices are working, so you maximize throughput • You get 100 MB/s per node and scale linearly as you add nodes
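As an illustration of the pattern on the slide above, here is a minimal COPY sketch that loads a set of compressed, pipe-delimited files sharing a common S3 prefix so every slice has a file to ingest. The table name, bucket, and IAM role are hypothetical placeholders, not part of the original deck.

    -- Hypothetical example: 16 gzip'd files named orders/part_00.gz ... orders/part_15.gz
    -- share the prefix below, so a 16-slice cluster loads them all in parallel.
    COPY orders
    FROM 's3://my-etl-bucket/orders/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    DELIMITER '|'
    GZIP;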
  13. 13. Loading Data with Manifest Files • Use a manifest to load all required files • Supply a JSON-formatted text file that lists the files to be loaded • Can load files from different buckets or with different prefixes { "entries": [ {"url":"s3://mybucket-alpha/2013-10-04-custdata", "mandatory":true}, {"url":"s3://mybucket-alpha/2013-10-05-custdata", "mandatory":true}, {"url":"s3://mybucket-beta/2013-10-04-custdata", "mandatory":true}, {"url":"s3://mybucket-beta/2013-10-05-custdata", "mandatory":true} ] }
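A corresponding COPY sketch using a manifest like the one above might look as follows; the manifest's S3 location, target table, and IAM role are assumptions for illustration.

    -- The MANIFEST keyword tells COPY that the FROM path is a manifest file,
    -- not a data file prefix.
    COPY custdata
    FROM 's3://mybucket-alpha/custdata.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    MANIFEST;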
  14. 14. Redshift COPY Command • Loads data into a table from data files in S3 or from an Amazon DynamoDB table. • The COPY command requires only three parameters: – Table name – Data source – Credentials: COPY table_name FROM data_source CREDENTIALS 'aws_access_credentials' • Optional parameters include: – Column mapping options – mapping source to target – Data format parameters – FORMAT, CSV, DELIMITER, FIXEDWIDTH, AVRO, JSON, BZIP2, GZIP, LZOP – Data conversion parameters – data type conversion between source and target – Data load operations – troubleshoot or reduce load times with parameters like COMPROWS, COMPUPDATE, MAXERROR, NOLOAD, STATUPDATE
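A hedged example combining several of the optional parameters listed above; the table name, bucket, and IAM role are hypothetical.

    COPY clients
    FROM 's3://my-etl-bucket/clients/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV      -- data format parameter
    GZIP               -- input files are individually gzip-compressed
    MAXERROR 10        -- tolerate up to 10 bad rows before failing the load
    COMPUPDATE ON      -- let COPY pick column encodings when loading an empty table
    STATUPDATE ON;     -- refresh table statistics after the load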
  15. 15. Loading JSON Data • COPY uses a jsonpaths text file to parse JSON data • JSONPath expressions specify the path to JSON name elements • Each JSONPath expression corresponds to a column in the Amazon Redshift target table. Suppose you want to load the VENUE table with the following content: { "id": 15, "name": "Gillette Stadium", "location": [ "Foxborough", "MA" ], "seats": 68756 } { "id": 15, "name": "McAfee Coliseum", "location": [ "Oakland", "CA" ], "seats": 63026 } You would use the following jsonpaths file to parse the JSON data: { "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] }
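To tie the pieces together, a COPY using a jsonpaths file like the one above might look like this; the S3 locations and IAM role are placeholders.

    COPY venue
    FROM 's3://my-etl-bucket/venue/venue.json'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    JSON 's3://my-etl-bucket/jsonpaths/venue_jsonpaths.json';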
  16. 16. Loading Data in Avro Format • Avro is a data serialization protocol. An Avro source file includes a schema that defines the structure of the data. The Avro schema type must be record. • COPY uses an avro_option to parse Avro data. Valid values for avro_option are as follows: – 'auto' (default) - COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table. – 's3://jsonpaths_file' - To explicitly map Avro data elements to columns, you can use a JSONPaths file. Avro schema: { "name": "person", "type": "record", "fields": [ {"name": "id", "type": "int"}, {"name": "guid", "type": "string"}, {"name": "name", "type": "string"}, {"name": "address", "type": "string"} ] }
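A minimal COPY sketch for the Avro case, assuming hypothetical S3 paths and using the default 'auto' mapping described above:

    COPY person
    FROM 's3://my-etl-bucket/person/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS AVRO 'auto';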
  17. 17. Amazon Kinesis Firehose – load massive volumes of streaming data into Amazon S3, Redshift, and Elasticsearch • Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure • Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations • Seamless elasticity: scales seamlessly to match data throughput without intervention. Capture and submit streaming data; Firehose loads it continuously into Amazon S3, Redshift, and Elasticsearch; analyze it using your favorite BI tools.
  18. 18. Best Practices for Loading Data • Use the COPY command to load data • Use a single COPY command per table • Split your data into multiple files • Compress your data files with GZIP or LZOP • Use multi-row inserts whenever possible • Bulk insert operations (INSERT INTO…SELECT and CREATE TABLE AS) provide high-performance data insertion • Use Amazon Kinesis Firehose to load streaming data directly into S3 and/or Redshift
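The multi-row and bulk insert bullets above can be illustrated with a short sketch; the table and column names are hypothetical.

    -- Multi-row insert: one statement carries many rows.
    INSERT INTO daily_sales (sale_date, dealer_id, amount) VALUES
      ('2016-09-01', 101, 25000.00),
      ('2016-09-01', 102, 31500.00),
      ('2016-09-02', 101, 27800.00);

    -- Bulk inserts run fully in parallel on the compute nodes.
    INSERT INTO sales_archive
    SELECT * FROM daily_sales WHERE sale_date < '2016-01-01';

    CREATE TABLE sales_2015 AS
    SELECT * FROM daily_sales
    WHERE sale_date BETWEEN '2015-01-01' AND '2015-12-31';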
  19. 19. Best Practices for Loading Data Continued • Load your data in sort key order to avoid needing to vacuum • Organize your data as a sequence of time-series tables, where each table is identical but contains data for different time ranges • Use staging tables to perform an upsert • Run the VACUUM command whenever you add, delete, or modify a large number of rows, unless you load your data in sort key order • Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count • Run the ANALYZE command whenever you’ve made a non-trivial number of changes to your data to ensure your table statistics are current
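A sketch of the staging-table upsert pattern described above, with the slot-count, VACUUM, and ANALYZE steps included; all object names and S3 paths are assumptions, not part of the original deck.

    -- Give this session extra WLM memory for the heavy COPY/VACUUM work.
    SET wlm_query_slot_count TO 3;

    BEGIN;

    -- Stage the incremental data next to the target table.
    CREATE TEMP TABLE stage_sales (LIKE sales);

    COPY stage_sales
    FROM 's3://my-etl-bucket/sales/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    DELIMITER '|' GZIP;

    -- Upsert: delete rows that will be replaced, then append everything.
    DELETE FROM sales USING stage_sales WHERE sales.sale_id = stage_sales.sale_id;
    INSERT INTO sales SELECT * FROM stage_sales;

    END;

    -- Keep statistics current and reclaim space (skip VACUUM if you load in sort key order).
    ANALYZE sales;
    VACUUM sales;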
  20. 20. Amazon Partner ETL • Amazon Redshift is supported by a variety of ETL vendors • Many simplify the process of data loading • Visit http://aws.amazon.com/redshift/partners • Many vendors offer a free trial of their products, allowing you to evaluate and choose the one that suits your needs.
  21. 21. AWS Database Migration Service (DMS) benefits: • Start your first migration in 10 minutes or less • Keep your apps running during the migration • Replicate within, to, or from Amazon EC2 or RDS • Move data from commercial database engines to open source engines, or move data to the same database engine • Consolidate databases and/or tables
  22. 22. Sources and Targets for AWS DMS Sources: On-premises and Amazon EC2 instance databases: • Oracle Database 10g – 12c • Microsoft SQL Server 2005 – 2014 • MySQL 5.5 – 5.7 • MariaDB (MySQL-compatible data source) • PostgreSQL 9.4 – 9.5 • SAP ASE 15.7+ RDS instance databases: • Oracle Database 11g – 12c • Microsoft SQL Server 2008R2 - 2014. CDC operations are not supported yet. • MySQL versions 5.5 – 5.7 • MariaDB (MySQL-compatible data source) • PostgreSQL 9.4 – 9.5. CDC operations are not supported yet. • Amazon Aurora (MySQL-compatible data source) Targets: On-premises and EC2 instance databases: • Oracle Database 10g – 12c • Microsoft SQL Server 2005 – 2014 • MySQL 5.5 – 5.7 • MariaDB (MySQL-compatible data source) • PostgreSQL 9.3 – 9.5 • SAP ASE 15.7+ RDS instance databases: • Oracle Database 11g – 12c • Microsoft SQL Server 2008 R2 - 2014 • MySQL 5.5 – 5.7 • MariaDB (MySQL-compatible data source) • PostgreSQL 9.3 – 9.5 • Amazon Aurora (MySQL-compatible data source) Amazon Redshift
  23. 23. AWS Database Migration Service Pricing • T2 for developing and periodic data migration tasks • C4 for large databases and minimizing time • T2 pricing starts at $0.018 per hour for T2.micro • C4 pricing starts at $0.154 per hour for C4.large • 50 GB GP2 storage included with T2 instances • 100 GB GP2 storage included with C4 instances • Data transfer inbound and within an AZ is free • Data transfer across AZs starts at $0.01 per GB
  24. 24. AWS Schema Conversion Tool
  25. 25. Resources on the AWS Big Data Blog • Best Practices for Micro-Batch Loading on Amazon Redshift • Using Attunity Cloudbeam at UMUC to Replicate Data to Amazon RDS and Amazon Redshift • A Zero-Administration Amazon Redshift Database Loader. Best practices references: • Best Practices for Designing Tables • Best Practices for Designing Queries • Best Practices for Loading Data
  26. 26. Amazon Redshift at TrueCar – Sep 21, 2016
  27. 27. About us ● About TrueCar ● David Giffin – VP Technology ● Sharat Nair – Director of Data ● Blagoy Kaloferov – Data Engineer
  28. 28. Agenda ● Amazon Redshift use case overview ● Architecture and migration process ● Tips and lessons learned
  29. 29. Amazon Redshift at TrueCar
  30. 30. Amazon Redshift at TrueCar ● Datasets that flow into Amazon Redshift: clickstream, transactions, sales, inventory, dealer, and leads ● How we do analytics and reporting: Redshift is our data store for BI tools and ad hoc analysis ● Data that is loaded into Amazon Redshift is already processed
  31. 31. Architecture (diagram): source datasets (leads, dealer, transactions, sales, inventory, clickstream) flow through ETL data processing (MR, Hive, Pig, Oozie, Talend) into HDFS and a Postgres staging/DWH.
  32. 32. Architecture with Amazon Redshift (diagram): the same ETL data processing (MR, Hive, Pig, Oozie, Talend) writes to HDFS and the Postgres staging/DWH, and also pushes data through S3 and a loading utility into Amazon Redshift, which serves reporting (MSTR) and ad hoc analysis (Tableau).
  33. 33. Schema design ● Schemas: our datasets live in a read-only schema for ad hoc and scheduled reporting ● Ad hoc and user tables sit in separate schemas, which makes it easy to separate final data from user-created data ● Simple table naming conventions: F_ - facts, D_ - dimensions, AGG_ - aggregates, V_ - views
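A minimal sketch of the schema layout described on this slide; the schema, group, and table names are hypothetical, not TrueCar's actual objects.

    -- Curated, read-only data in one schema; ad hoc/user work in another.
    CREATE SCHEMA dwh;
    CREATE SCHEMA adhoc;

    CREATE GROUP analysts;
    GRANT USAGE ON SCHEMA dwh TO GROUP analysts;
    GRANT SELECT ON ALL TABLES IN SCHEMA dwh TO GROUP analysts;   -- read-only access

    -- Naming conventions from the slide:
    --   f_web_visits (fact), d_dealer (dimension),
    --   agg_daily_leads (aggregate), v_active_inventory (view)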
  34. 34. Amazon Redshift learnings
  35. 35. Redshift loading process ● ETL is orchestrated through Talend and Oozie ● Processing tools: Talend, Hive, Pig, and MapReduce push data into HDFS and S3 ● We built our own Amazon Redshift loading utility ● It handles all loading use cases: Load, TruncateLoad, DeleteAppend, Upsert
  36. 36. Table design considerations ● Train developers on table design and Redshift best practices ● Compress columns with appropriate encodings and run ANALYZE COMPRESSION; it makes a significant difference in space usage ● Choose sort and distribution keys ● Plan a workload management (WLM) strategy; as usage of the Redshift cluster grows, you need to ensure that critical jobs get bandwidth
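A hypothetical table definition illustrating the compression, distribution key, and sort key points above, followed by the compression analysis the slide mentions; none of these names come from the deck.

    CREATE TABLE f_clickstream (
      event_time  TIMESTAMP     ENCODE delta32k,
      session_id  BIGINT        ENCODE lzo,
      dealer_id   INTEGER       ENCODE lzo,
      page_url    VARCHAR(1024) ENCODE lzo
    )
    DISTKEY (dealer_id)     -- co-locate rows commonly joined or filtered by dealer
    SORTKEY (event_time);   -- time-range scans can skip blocks outside the filter

    -- Ask Redshift to recommend encodings from a sample of the loaded data.
    ANALYZE COMPRESSION f_clickstream;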
  37. 37. Space considerations ● Retain pre-COPY data in S3; it can easily be used by other tools (Spark, Pig, MapReduce) ● Offload historical datasets into separate tables on a rolling basis ● Pre-aggregate data when possible to reduce load on the system
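One possible sketch of the rolling offload described above, using hypothetical table names and S3 paths.

    -- Roll older rows into a history table on a schedule.
    INSERT INTO f_clickstream_history
    SELECT * FROM f_clickstream WHERE event_time < '2016-07-01';
    DELETE FROM f_clickstream WHERE event_time < '2016-07-01';

    -- Very old history can be UNLOADed back to S3 for other tools to use.
    UNLOAD ('SELECT * FROM f_clickstream_history WHERE event_time < ''2015-01-01''')
    TO 's3://my-etl-bucket/archive/clickstream_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    GZIP;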
  38. 38. Long-term usage tips ● Have a cluster resize strategy ● Use reserved instances for cost savings ● Plan on having enough space for long-term growth ● Plan your maintenance (vacuuming) ● System tables are your friends ● Useful collection of utilities: https://github.com/awslabs/amazon-redshift-utils/
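As an example of "system tables are your friends", a quick check for the tables most in need of maintenance; SVV_TABLE_INFO is a standard Redshift system view, while the table named in the VACUUM and ANALYZE is hypothetical.

    -- Tables with the most unsorted data are the first VACUUM candidates.
    SELECT "table", size, tbl_rows, unsorted
    FROM svv_table_info
    ORDER BY unsorted DESC
    LIMIT 10;

    VACUUM f_clickstream;
    ANALYZE f_clickstream;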
  39. 39. Thanks! Questions?
