Migration to Redshift from SQL Server


Published in: Technology
  • I think S3 support was added to CloverETL around version 3.0; 4.0 is coming out right now.
    CloverETL can write and read (gzipped) files to and from S3 without any file size limitation (tested with files over 5GB). Listing files from an S3 bucket is on the roadmap.
    Failures would probably be handled best with the CloverETL 'jobflow' feature.
  • @dpavlis 'not correct' is a big statement. When was S3 support added to Clover? What aspects of the S3 API does Clover support? Can it write files larger than 5GB? How is failure handled?
  • Actually, the claim that no ETL tool supports S3 storage is not correct. CloverETL has support for S3: it can store data directly in S3 buckets, and there are a few projects on Redshift using CloverETL (both on-cloud and on-premise) to first create data on S3 and then quickly load it into Redshift. See www.cloveretl.com

  • Data and Log are always on different disks. Criss-cross pattern used to balance wear. TempDb split across 8 files (1 per thread).
  • TDE required for data encryption. Compression used to maximise SSD speed. A lot of tuning done to push CPU and disks harder. We've seen silent partial failures without any indication. Now have to regularly run DBCC to verify databases. So far we've seen a ~20% perf loss over a year.
  • We’re actually using our existing SQL Server automation setup to run batch scripts that execute SQL on Redshift.
  • Four byte character support was recently added and that makes things a little easier. SQL Server's REPLACE() function is **broken** and ***cannot remove any of these values***! Yes, really. I can't tell you how fun it was to figure that out. Because it wasn't fun at all. All escape-sensitive data must be escaped in all columns. Embedded newlines **must** be escaped as '\n'.
  • vs Oracle, which has LENGTH() for characters and LENGTHB() for bytes. vs Redshift, which has only LENGTH() and no way to get the byte length. SQL Server will tolerate _anything_ inside a character column. No sanitisation of inputs or outputs. UTF-16LE *compatible*, rather than *compliant*. I know this from painful experience.
  • All web searches will suggest using BCP. All ETL tools actually wrap BCP to get data out. **Forget about BCP. BCP is the enemy.** BCP DOES NOT SUPPORT STDOUT!!!
  • Voila! UTF-8 output from SQL Server directly to a gzip file.
  • * On SQL Server we use TDE (transparent data encryption)
      * Data on disk is AES encrypted, transparently.
    * Redshift offers AES encryption of the data on disk.
      * Not actively encrypted during use, same as SQL Server.
    * Redshift supports loading client-side 'envelope' encrypted data.
      * Good luck with that!
      * Slow: You'll have to land your data on disk and then reprocess it.
      * Custom: You'll have to write your own encrypter using OpenSSL or some such.
      * Client side encryption is somewhat moot as it only applies while data is on S3.
      * My 2p: Enable AES on both S3 and Redshift. Call it a day.
    * Amazon says there is a 'small performance penalty' for using AES.
      * In practice it seems to be acceptable.
      * I have *not actually tested* it without AES because I don't want to generate 10 billion rows of sample data.
  • * Managing user and admin access is kind of a pain in Redshift
    1. Access to S3
       * Create bucket(s) just for Redshift staging data.
    2. Access to Redshift admin
       * Use IAM access controls to limit individual's access.
       * Create users just for automation and enforce password rotation.
    3. Access to Redshift database
       * **Do not allow** use of the admin user - it's like SQL Server's `sa`.
       * Create 1:1 map of external users to Redshift users (no LDAP/AD support)
    4. Access to specific database objects
       * You must actively `GRANT` access to each object.
       * Use groups to make this task easier.
       * We have just 2 groups: "admin" (`GRANT ALL`) and "readers" (`GRANT SELECT`)
  • * Redshift nodes are waaaaaay over-provisioned on storage
      * 2 TB of storage available per node
    * Redshift is suuuuuper efficient at compression
      * Our data in Redshift is roughly 2x the gzipped UTF8 input.
      * The size varies depending on how we sort the tables.
      * Therefore you'll be sizing the cluster for **speed**.
      * You add nodes to go faster _not when you run out of disk._
    * Tough to get your head around.
  • Still faster than SQL Server on PCIe SSDs for our data. You must use multiple files for bulk loads.
  • You cannot schedule these AFAICT. They are auto-deleted on a schedule you can set. Default auto-delete is 1 day. Priced same as S3 beyond cluster size.
  • Migration to Redshift from SQL Server

    1. SQL Server to Redshift
    2. Background
       RealityMine provides digital behaviour analytics. Our applications passively measure the activity of opt-in users on all digital platforms. This could be focused on:
       • how to direct marketing
       • how to direct product development
       • questioning individuals who undertake certain behaviour patterns
    3. Starting State
       • SQL Server DW on in-house server
       • SQL Server 2008 R2 Enterprise Edition
       • Single 4 core (8 thread) i7 w/ 16GB RAM
       • 2 960GB PCIe SSDs for DBs
       • 1 240GB PCIe SSD for TempDb
       SQL Server to Redshift - @joeharris76
    4. Data Environment
       • ~20 billion rows in active use
       • Largest table is also the widest
       • Volume is doubling more than annually
       • Data is in many languages
       • Starts as JSON, ends as Star Schema DW
    5. Pain Points
       • Biggest cost is SQL Server license
       • Biggest bottleneck is single threaded perf.
       • Hand tuning needed to push CPU / disks
       • SSD reliability is not perfect
       • SSD performance degrades over time
    6. Why Redshift
       • Vertica wanted £45k per terabyte
       • 16 SQL Server Enterprise cores even more!
       • Teradata, Netezza, etc. don’t want <5TB sales
       • SAP HANA not viable for this volume on AWS
       • Infobright does not support incremental loads
       • Hadoop/Impala slow & requires lots of learning
    7. Data Processing Approach
       • No ETL tool truly supports Redshift
         – Requirement to load from S3 is a killer
         – Tried SSIS, Pentaho, Talend and others
       • You’re stuck with ELT
         – Load data then transform as needed
         – Keep data as raw as possible from source
    8. War of Encodings
       The road to heaven goes through ÜÑÎÇØDÈ hell
    9. Redshift: UTF-8 Only
       • Redshift has zero-tolerance for certain chars
         – NUL/0x00 => Treated as EOR, documented
         – DEL/0x7F => Treated as EOR, undocumented
         – 0xEFBFBF (U+FFFF) => Unicode "guaranteed non-char"
         – These must be removed before loading data
       • Other control characters can be loaded by escaping
         – You cannot escape a single column, all or nothing
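The clean-up rules on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's tooling. It assumes the "guaranteed non-char" is U+FFFF, that the file is pipe-delimited, and that escaping means prefixing a backslash in the style of the COPY `ESCAPE` option. The names `clean_field` and `clean_row` are hypothetical.

```python
# Characters the slide says must be removed before loading.
# Assumption: the "guaranteed non-char" is U+FFFF.
FORBIDDEN = {"\x00", "\x7f", "\uffff"}


def clean_field(value: str, delimiter: str = "|") -> str:
    """Drop forbidden characters and backslash-prefix anything that
    would otherwise break a delimited load file."""
    needs_escape = {"\\", "\n", "\r", delimiter}
    out = []
    for ch in value:
        if ch in FORBIDDEN:
            continue  # removed entirely, never escaped
        if ch in needs_escape:
            out.append("\\")  # assumed ESCAPE-style backslash prefix
        out.append(ch)
    return "".join(out)


def clean_row(fields, delimiter: str = "|") -> str:
    """Escape every column; escaping applies to the whole file, not one column."""
    return delimiter.join(clean_field(f, delimiter) for f in fields)
```

Because escaping is all or nothing, `clean_row` runs the same rule over every field rather than only the columns known to contain awkward text.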
    10. SQL Server: UTF-16LE Only
        • NVARCHAR takes 2x as much space as a VARCHAR
        • Makes functions consistent across ASCII & Unicode
          – N/VARCHAR(32) = 32 chars / Redshift = 32 bytes
        • SQL Server tolerates anything in character columns
        • Input and output is not sanitized against UTF-16 spec
          – Invalid or "guaranteed non-chars" are stored as is
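The chars-versus-bytes difference is easy to check before choosing column widths. The sketch below is only an illustration with hypothetical helper names: it counts UTF-16 code units (what an NVARCHAR(n) limit measures) and UTF-8 bytes (what a Redshift VARCHAR(n) limit measures) for the same string.

```python
def utf16_units(text: str) -> int:
    """Length as SQL Server NVARCHAR sees it: UTF-16 code units."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text: str) -> int:
    """Length as a Redshift VARCHAR limit sees it: UTF-8 bytes."""
    return len(text.encode("utf-8"))


def varchar_bytes_needed(values) -> int:
    """Smallest VARCHAR(n) in bytes that fits every sampled value."""
    return max((utf8_bytes(v) for v in values), default=0)
```

For example, `utf16_units("ÜÑÎÇØDÈ")` is 7 while `utf8_bytes("ÜÑÎÇØDÈ")` is 13, so a value that fits NVARCHAR(7) needs VARCHAR(13). In the worst case a UTF-16 code unit becomes 3 UTF-8 bytes, so NVARCHAR(n) can need up to VARCHAR(3n).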
    11. SQL Extract: The Hard Way
        • BCP is the “standard” way to extract data
        • Using BCP your process looks something like this:
          – Extract data as a huge UTF-16LE file using bcp
          – Convert to a new UTF-8 file using iconv
          – Remove or escape problem chars using sed
          – Compress the final file using gzip
          – All steps are heavily constrained by disk speed
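For comparison, the convert, clean and compress steps after the bcp extract can be folded into a single streaming pass. The Python sketch below is not what the slide describes (the slide uses iconv, sed and gzip as separate steps). It assumes the bcp output is UTF-16LE without a byte order mark and only removes the three forbidden characters; the function name is hypothetical.

```python
import gzip

# NUL, DEL and U+FFFF are deleted outright (assumed forbidden set).
STRIP = str.maketrans("", "", "\x00\x7f\uffff")


def utf16le_to_gzipped_utf8(src_path, dst_path, chunk_chars=1 << 20):
    """Stream a UTF-16LE text file into a gzipped UTF-8 file in one pass."""
    # newline="" keeps line endings exactly as they are in the source file.
    with open(src_path, "r", encoding="utf-16-le", newline="") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk.translate(STRIP))
```

This still reads the huge intermediate file from disk once, so it reduces the disk-bound steps but does not remove them.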
    12. SQL Extract: The Easy Way
        SQLCMD one-liner for extracts:
        chcp 65001 & sqlcmd -E -Q "SET NOCOUNT ON; SELECT * FROM Db.Schema.Table;" -h-1 -k1 -s"|" -W -u | gzip > "C:file.gz"
        • chcp 65001: Set the cmd code page to UTF-8
        • sqlcmd: Interactive SQL terminal
        • SET NOCOUNT ON: Prevent summary in output
        • SELECT * FROM ...: Select from the table / view
        • -h-1: No column headers
        • -k1: Remove special characters
        • -s"|": Delimit output with 1 ASCII char
        • -W: No padding in output
        • -u: Output in Unicode
        • | gzip: Pipe stdout to gzip
    13. Data Encryption
        • On SQL Server we use TDE
        • Redshift offers AES encrypted data on disk
        • Redshift can load client-side encrypted data
        • Client side encryption only applies while on S3
        • “Small performance penalty” for using AES
    14. Security
        • S3 Access => Create bucket(s) just for Redshift staging
        • Redshift admin => Use IAM, create automation user(s)
        • Redshift database =>
          – Do not use admin, it’s like SQL Server ‘sa’
        • Database objects =>
          – Must actively GRANT access to each object
          – Use groups to make management easier
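Since every object needs an explicit GRANT, the statements are worth generating rather than typing. The Python sketch below builds them for a two-group setup ("admin" with GRANT ALL, "readers" with GRANT SELECT). The function names and the sample table and user names are hypothetical, and the output is only a list of strings for review, not something executed here.

```python
def grant_statements(tables, admin_group="admin", reader_group="readers"):
    """Return CREATE GROUP plus per-table GRANT statements for two groups."""
    stmts = [f"CREATE GROUP {admin_group};", f"CREATE GROUP {reader_group};"]
    for table in tables:
        stmts.append(f"GRANT ALL ON TABLE {table} TO GROUP {admin_group};")
        stmts.append(f"GRANT SELECT ON TABLE {table} TO GROUP {reader_group};")
    return stmts


def membership_statements(members):
    """members maps a database user name to the group it belongs to."""
    return [
        f"ALTER GROUP {group} ADD USER {user};"
        for user, group in sorted(members.items())
    ]


if __name__ == "__main__":
    for stmt in grant_statements(["dw.fact_events", "dw.dim_user"]):
        print(stmt)
    for stmt in membership_statements({"etl_bot": "admin", "jsmith": "readers"}):
        print(stmt)
```

Tables outside the default schema would also need `GRANT USAGE ON SCHEMA ... TO GROUP ...`, which this sketch leaves out.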
    15. Sizing your cluster
        • Redshift is over-provisioned on storage
        • Redshift is super efficient at compression
          – Compression not affected by the data model
        • Redshift scale out is almost perfectly linear
          – 2 nodes is twice as fast as 1 node
        • You'll be sizing your cluster for speed!
    16. Performance
        • Redshift speed depends on node count
          – A single node is not particularly fast
        • Loading speed appears to be linked to S3 speed
          – You must use multiple files for bulk loads
        • Query speed appears to be CPU constrained
          – Vacuum runs 250 MB/s, queries <20 MB/s
        • Data modeling matters for complex query speed
          – Use a star schema & well chosen distribution key
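The "multiple files" point can be illustrated by splitting an extract into several gzip parts that share one key prefix, then loading the prefix with a single COPY. This Python sketch makes several assumptions: the file naming scheme, the pipe delimiter, the use of an IAM role for credentials, and the helper names are all illustrative, and uploading the parts to S3 is left out.

```python
import gzip
import os


def write_parts(lines, out_dir, prefix, parts):
    """Round-robin lines into `parts` gzip files that share one name prefix."""
    paths = [os.path.join(out_dir, f"{prefix}.part{i:03d}.gz") for i in range(parts)]
    files = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        for n, line in enumerate(lines):
            files[n % parts].write(line + "\n")
    finally:
        for f in files:
            f.close()
    return paths


def copy_sql(table, s3_prefix, iam_role):
    """Build a COPY that loads every object whose key starts with s3_prefix."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' DELIMITER '|' GZIP ESCAPE;"
    )
```

A part count that is a multiple of the number of slices in the cluster is the usual starting point, so that every slice has a file to work on.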
    17. Data Modeling
        2 main concepts to learn
        • Distribution key
          – Where data is placed, which node & slice
          – Needs to be common across most tables
        • Sort key
          – How data is ordered on disk within the slice
          – Good sort keys simplify expensive joins
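A toy model makes the two concepts concrete. The Python sketch below is not how Redshift places data internally: it assumes integer keys and uses a plain modulo in place of the real hash, and all names are hypothetical. It only shows why two tables sharing a distribution key can be joined without moving rows between slices, and that rows on a slice are kept in sort key order.

```python
def slice_for(key: int, slices: int) -> int:
    """Toy placement rule (assumption): integer key modulo slice count."""
    return key % slices


def place(rows, dist_key, sort_key, slices):
    """Assign rows to slices by dist_key, then order each slice by sort_key."""
    layout = {s: [] for s in range(slices)}
    for row in rows:
        layout[slice_for(row[dist_key], slices)].append(row)
    for rows_on_slice in layout.values():
        rows_on_slice.sort(key=lambda r: r[sort_key])
    return layout


def join_is_local(left, right, join_key):
    """True if each shared join_key value sits on one and the same slice
    in both layouts, so the join needs no data movement."""
    def owners(layout):
        found = {}
        for s, rows in layout.items():
            for row in rows:
                found.setdefault(row[join_key], set()).add(s)
        return found

    lo, ro = owners(left), owners(right)
    return all(len(lo[k]) == 1 and lo[k] == ro[k] for k in lo.keys() & ro.keys())
```

Distributing both an events table and a users table on `user_id` makes `join_is_local` return True for a join on `user_id`; distributing the events on `event_id` instead scatters each user's rows across slices.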
    18. Database Maintenance
        • Data loaded to non-empty tables is not sorted
        • Data loaded to non-empty tables may kill their stats
        • ANALYZE rebuilds the stats without making changes
        • VACUUM re-sorts the physical data and rebuilds stats
          – Needed to get the best performance
          – Very similar to a REBUILD in SQL Server
    19. Database Backups
        • Redshift ‘backups’ are snapshots of the system
        • Taken very quickly, much slower to restore
        • Redshift automatically takes intra-day snapshots
        • Manual snapshots can be run using AWS cmd line
        • Snapshot storage is free up to size of cluster storage
        • Snapshots must be restored to an identical cluster
        • Snapshots cannot be restored to a running cluster
    20. Code Changes
        Code changes required so far
        • ROW_NUMBER() missing in Redshift
        • We gain LAG() and LEAD() which helps
        • But very difficult to persist an order value
        • DATETIMEOFFSET (e.g. timezone) not avail.
        • DATETIMEs now split into 2 columns
        • Work in progress…
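The slide does not say how the two columns are defined. One plausible split, shown here purely as an assumption, is a UTC timestamp plus the original offset in minutes, which keeps the value reversible. The Python sketch below uses hypothetical function names.

```python
from datetime import datetime, timedelta, timezone


def split_offset(value: datetime):
    """Split an offset-aware datetime into (naive UTC timestamp, offset minutes)."""
    offset = value.utcoffset()
    if offset is None:
        raise ValueError("value must carry a UTC offset")
    utc_naive = value.astimezone(timezone.utc).replace(tzinfo=None)
    return utc_naive, int(offset.total_seconds() // 60)


def join_offset(utc_naive: datetime, offset_minutes: int) -> datetime:
    """Rebuild the original offset-aware datetime from the two columns."""
    tz = timezone(timedelta(minutes=offset_minutes))
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(tz)
```

Storing UTC in the timestamp column keeps range filters and ordering consistent across rows, while the offset column preserves the user's local time for reporting.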
    21. That’s all folks!
    22. Come Work With Me!
        http://www.realitymine.com/careers/
        Currently trying to fill the following roles:
        • Business Intelligence Architect (Redshift!)
        • Business Intelligence Developer (Tableau!)
        • Test Engineer (Quality!)
        • Server Developer (C#!)
        • Mobile App Developer (Android! iOS!)
        • Project Manager