
Migration to Redshift from SQL Server

Published in: Technology
Comments

  • I think S3 support was added to CloverETL around version 3.0; right now 4.0 is coming out. CloverETL can write/read (gzipped) files to/from S3 without any file-size limitation (tested with a file over 5GB); listing files from an S3 bucket is on the roadmap. Failures would probably best be handled with CloverETL's 'jobflow' feature.
  • @dpavlis 'not correct' is a big statement. When was S3 support added to Clover? What aspects of the S3 API does Clover support? Can it write files larger than 5GB? How is failure handled?
  • Actually the claim that no ETL supports S3 storage is not correct. CloverETL has support for S3 - it can store data directly to S3 buckets, and there are a few projects on Redshift using CloverETL (both on-cloud and on-premise) to first create data on S3 and then quickly load it to Redshift. See www.cloveretl.com


  1. SQL Server to Redshift
  2. Background
     RealityMine provides digital behaviour analytics. Our applications passively measure the activity of opt-in users on all digital platforms. This could be focused on:
     • how to direct marketing
     • how to direct product development
     • questioning individuals who undertake certain behaviour patterns
  3. Starting State
     • SQL Server DW on an in-house server
     • SQL Server 2008 R2 Enterprise Edition
     • Single 4-core (8-thread) i7 w/ 16GB RAM
     • 2× 960GB PCIe SSDs for DBs
     • 1× 240GB PCIe SSD for TempDb
     SQL Server to Redshift - @joeharris76
  4. Data Environment
     • ~20 billion rows in active use
     • Largest table is also the widest
     • Volume doubles more than once a year
     • Data is in many languages
     • Starts as JSON, ends as a star schema DW
  5. Pain Points
     • Biggest cost is the SQL Server license
     • Biggest bottleneck is single-threaded performance
     • Hand tuning needed to push CPU / disks
     • SSD reliability is not perfect
     • SSD performance degrades over time
  6. Why Redshift
     • Vertica wanted £45k per terabyte
     • 16 SQL Server Enterprise cores cost even more!
     • Teradata, Netezza, etc. don't want <5TB sales
     • SAP HANA not viable for this volume on AWS
     • Infobright does not support incremental loads
     • Hadoop/Impala is slow & requires lots of learning
  7. Data Processing Approach
     • No ETL tool truly supports Redshift
       – The requirement to load from S3 is a killer
       – Tried SSIS, Pentaho, Talend and others
     • You're stuck with ELT
       – Load data, then transform as needed
       – Keep data as raw as possible from the source
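The ELT pattern above — stage raw files on S3, COPY them into a raw table, then transform inside Redshift with plain SQL — can be sketched as follows. The table, bucket, and credential strings are hypothetical placeholders, not values from the deck.

```python
# Sketch of the ELT pattern: stage on S3, COPY into a raw staging table,
# then transform in-database. Names below are illustrative only.

def build_copy(table: str, s3_path: str, creds: str) -> str:
    """Build a Redshift COPY statement for gzipped, pipe-delimited files."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"CREDENTIALS '{creds}' "
        "GZIP DELIMITER '|' TRUNCATECOLUMNS;"
    )

# The "L" of ELT: bulk load raw data from S3.
copy_sql = build_copy(
    "staging.events_raw",
    "s3://my-redshift-staging/events/",
    "aws_access_key_id=...;aws_secret_access_key=...",
)

# The "T" of ELT: transform with SQL once the data is inside Redshift.
transform_sql = """
INSERT INTO dw.fact_events
SELECT user_id, event_ts::timestamp, event_type
FROM staging.events_raw;
"""
```

Keeping the staged files raw means a bad transform can be re-run in-database without re-extracting from SQL Server.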
  8. War of Encodings
     The road to heaven goes through ÜÑÎÇØDÈ hell
  9. Redshift: UTF-8 Only
     • Redshift has zero tolerance for certain chars
       – NUL/0x00 => treated as EOR, documented
       – DEL/0x7F => treated as EOR, undocumented
       – 0xEFBFBF => UTF-8 spec "guaranteed non-char"
       – These must be removed before loading data
     • Other control characters can be loaded by escaping
       – You cannot escape a single column; it's all or nothing
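A minimal sanitizer for the three characters the slide says Redshift rejects: NUL (0x00), DEL (0x7F), and the noncharacter U+FFFF (encoded EF BF BF in UTF-8). A production pipeline would likely also strip the rest of the Unicode noncharacter ranges; this sketch covers only the cases named above.

```python
# Characters slide 9 says Redshift's COPY refuses or misinterprets.
BAD_CHARS = {"\x00", "\x7f", "\uffff"}

def clean_for_redshift(text: str) -> str:
    """Remove characters that break Redshift loads before writing out UTF-8."""
    return "".join(ch for ch in text if ch not in BAD_CHARS)

print(clean_for_redshift("abc\x00def\x7f\uffff"))  # -> abcdef
```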
  10. SQL Server: UTF-16LE Only
     • NVARCHAR takes 2x as much space as VARCHAR
     • Makes functions consistent across ASCII & Unicode
       – N/VARCHAR(32) = 32 chars; Redshift VARCHAR(32) = 32 bytes
     • SQL Server tolerates anything in character columns
     • Input and output are not sanitized against the UTF-16 spec
       – Invalid or "guaranteed non-chars" are stored as-is
  11. SQL Extract: The Hard Way
     • BCP is the "standard" way to extract data
     • Using BCP your process looks something like this:
       – Extract data as a huge UTF-16LE file using bcp
       – Convert to a new UTF-8 file using iconv
       – Remove or escape problem chars using sed
       – Compress the final file using gzip
       – All steps are heavily constrained by disk speed
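The four-step bcp → iconv → sed → gzip chain can be collapsed into a single streaming pass: read UTF-16LE, drop the problem characters, write gzipped UTF-8, without materializing intermediate files on disk. This is a sketch, not the deck's actual tooling; file names are hypothetical.

```python
# One streaming pass replacing iconv + sed + gzip:
# UTF-16LE in -> problem chars stripped -> gzipped UTF-8 out.
import gzip

BAD = {"\x00", "\x7f", "\uffff"}  # chars Redshift rejects (slide 9)

def convert(src_path: str, dst_path: str, chunk_chars: int = 1 << 20) -> None:
    """Convert a UTF-16LE extract into a clean gzipped UTF-8 file."""
    with open(src_path, "r", encoding="utf-16-le") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_chars)  # bounded memory, any file size
            if not chunk:
                break
            dst.write("".join(c for c in chunk if c not in BAD))
```

Because everything streams, disk I/O drops from four read/write cycles to one.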
  12. SQL Extract: The Easy Way
     SQLCMD one-liner for extracts:

     chcp 65001 & sqlcmd -E -Q "SET NOCOUNT ON; SELECT * FROM Db.Schema.Table;" -h-1 -k1 -s"|" -W -u | gzip > "C:\file.gz"

     • chcp 65001: set the cmd code page to UTF-8
     • sqlcmd: interactive SQL terminal
     • SET NOCOUNT ON: prevent summary in output
     • SELECT * FROM …: select from the table / view
     • -h-1: no column headers
     • -k1: remove special characters
     • -s"|": delimit output with 1 ASCII char
     • -W: no padding in output
     • -u: output in Unicode
     • | gzip: pipe stdout to gzip
  13. Data Encryption
     • On SQL Server we use TDE
     • Redshift offers AES-encrypted data on disk
     • Redshift can load client-side encrypted data
     • Client-side encryption only applies while on S3
     • "Small performance penalty" for using AES
  14. Security
     • S3 access => create bucket(s) just for Redshift staging
     • Redshift admin => use IAM, create automation user(s)
     • Redshift database =>
       – Do not use admin; it's like SQL Server 'sa'
     • Database objects =>
       – Must actively GRANT access to each object
       – Use groups to make management easier
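The "grant through groups" advice in miniature: generate one CREATE GROUP plus per-object GRANTs, so adding a user later is a single ALTER GROUP rather than re-granting every table. Group and table names here are hypothetical.

```python
# Sketch: build the GRANT statements for a read-only group.
# Granting to the group (not individual users) keeps management simple.

def grants_for_group(group: str, tables: list) -> list:
    """Return the SQL statements that set up a read-only access group."""
    stmts = [f"CREATE GROUP {group};"]
    stmts += [f"GRANT SELECT ON {t} TO GROUP {group};" for t in tables]
    return stmts

for stmt in grants_for_group("reporting", ["dw.fact_events", "dw.dim_user"]):
    print(stmt)
```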
  15. Sizing Your Cluster
     • Redshift is over-provisioned on storage
     • Redshift is super efficient at compression
       – Compression is not affected by the data model
     • Redshift scale-out is almost perfectly linear
       – 2 nodes is twice as fast as 1 node
     • You'll be sizing your cluster for speed!
  16. Performance
     • Redshift speed depends on node count
       – A single node is not particularly fast
     • Loading speed appears to be linked to S3 speed
       – You must use multiple files for bulk loads
     • Query speed appears to be CPU constrained
       – Vacuum runs at 250 MB/s, queries <20 MB/s
     • Data modeling matters for complex query speed
       – Use a star schema & a well-chosen distribution key
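The "multiple files for bulk loads" point works because COPY with a key prefix reads all matching files in parallel, one per slice. A sketch of splitting one delimited extract round-robin into N gzipped parts sharing a prefix; paths and part counts are illustrative.

```python
# Split one delimited text file into N gzipped parts with a common
# prefix, so COPY ... FROM 's3://bucket/part_' loads them in parallel.
import gzip
from itertools import cycle

def split_for_copy(src_path: str, prefix: str, parts: int) -> list:
    """Round-robin the lines of src_path into `parts` gzipped files."""
    names = [f"{prefix}{i:04d}.gz" for i in range(parts)]
    outs = [gzip.open(n, "wt", encoding="utf-8") for n in names]
    try:
        # cycle() deals lines to the outputs in turn, keeping parts even.
        for out, line in zip(cycle(outs), open(src_path, encoding="utf-8")):
            out.write(line)
    finally:
        for out in outs:
            out.close()
    return names
```

A common rule of thumb is one file per slice (or a multiple of the slice count) so no slice sits idle during the load.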
  17. Data Modeling
     2 main concepts to learn
     • Distribution key
       – Where data is placed: which node & slice
       – Needs to be common across most tables
     • Sort key
       – How data is ordered on disk within the slice
       – Good sort keys simplify expensive joins
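The two concepts look like this in Redshift DDL (held in a Python string here for illustration; the table and columns are hypothetical, not from the deck). The fact table distributes on the join column shared with its dimensions, and sorts on the column most queries filter by.

```python
# Illustrative Redshift DDL showing a distribution key and a sort key.
FACT_DDL = """
CREATE TABLE dw.fact_events (
    user_id    BIGINT    NOT NULL,  -- join column common across tables
    event_ts   TIMESTAMP NOT NULL,  -- most queries filter a date range
    event_type VARCHAR(32)
)
DISTKEY (user_id)                   -- which node & slice each row lands on
SORTKEY (event_ts);                 -- on-disk order within the slice
"""
print(FACT_DDL)
```

Sharing the DISTKEY between fact and dimension tables keeps their joins node-local, avoiding network redistribution at query time.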
  18. Database Maintenance
     • Data loaded to non-empty tables is not sorted
     • Data loaded to non-empty tables may kill their stats
     • ANALYZE rebuilds the stats without making changes
     • VACUUM re-sorts the physical data and rebuilds stats
       – Needed to get the best performance
       – Very similar to a REBUILD in SQL Server
  19. Database Backups
     • Redshift 'backups' are snapshots of the system
     • Taken very quickly; much slower to restore
     • Redshift automatically takes intra-day snapshots
     • Manual snapshots can be run from the AWS command line
     • Snapshot storage is free up to the size of cluster storage
     • Snapshots must be restored to an identical cluster
     • Snapshots cannot be restored to a running cluster
  20. Code Changes
     Code changes required so far
     • ROW_NUMBER() missing in Redshift
     • We gain LAG() and LEAD(), which helps
     • But it's very difficult to persist an order value
     • DATETIMEOFFSET (i.e. time zone) not available
     • DATETIMEs now split into 2 columns
     • Work in progress…
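One plausible reading of "DATETIMEs now split into 2 columns": store a UTC timestamp plus the original offset in a second integer column, recovering what SQL Server's DATETIMEOFFSET held in one type. This is a sketch of that workaround, not necessarily the deck's exact scheme.

```python
# Emulate DATETIMEOFFSET with two columns: a naive UTC timestamp
# and the original UTC offset in minutes.
from datetime import datetime, timezone, timedelta

def split_datetimeoffset(dt: datetime) -> tuple:
    """Return (naive UTC timestamp, offset in minutes) for an aware datetime."""
    offset_min = int(dt.utcoffset().total_seconds() // 60)
    utc_naive = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, offset_min

tz = timezone(timedelta(hours=5, minutes=30))  # e.g. India Standard Time
print(split_datetimeoffset(datetime(2014, 6, 1, 12, 0, tzinfo=tz)))
# -> (datetime.datetime(2014, 6, 1, 6, 30), 330)
```

Storing UTC makes range filters and joins unambiguous, while the offset column preserves the original local time for display.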
  21. That's all folks!
  22. Come Work With Me!
     http://www.realitymine.com/careers/
     Currently trying to fill the following roles:
     • Business Intelligence Architect (Redshift!)
     • Business Intelligence Developer (Tableau!)
     • Test Engineer (Quality!)
     • Server Developer (C#!)
     • Mobile App Developer (Android! iOS!)
     • Project Manager
