In this article we will learn how to load data from SQL Server into the Amazon Redshift data warehouse using SSIS. The techniques outlined here also apply when extracting data from other relational sources (e.g. loading data from MySQL or Oracle to Redshift). First we will discuss the steps needed to load data into Amazon Redshift and the challenges involved, and then we will simplify the whole process using the SSIS Task for Amazon Redshift Data Transfer.
SQL Server to Redshift Data Load Using SSIS
1. SQL Server To Redshift
Data Load Using SSIS
Reach for the Clouds, Inc.
Next Generation SSIS Tasks and Connectors Series
AUTHOR:
NAYAN PATEL | SR. ETL SSIS ARCHITECT
NPATEL@RFTCLOUDS.COM
2. Content
• Introduction – SQL Server to Redshift Load
• Video Tutorial – Redshift Data Load
• Right way but hard way
• Steps for Amazon Redshift Data Load from On-Premise files or RDBMS (e.g. MySQL, SQL Server)
• Doing it the easy way
• Should I use SSIS to load Redshift?
• Setup your Amazon Redshift Cluster
• Add inbound rule for Redshift Cluster
• Automate Redshift Cluster Creation
• Create Sample table and data in Source – (in this example SQL Server)
• Create Sample table in Amazon Redshift
• SQL Server to Redshift Data Load using SSIS
• Conclusion - Related Links
3. Introduction – SQL Server to Redshift Load
• Before we talk about loading data from SQL Server to Redshift using SSIS, let's
talk about what Amazon Redshift (sometimes referred to as AWS
Redshift) is. Amazon Redshift is a cloud-based data warehouse service.
This type of system is also referred to as MPP (Massively Parallel
Processing). Amazon Redshift uses a highly modified version of the
PostgreSQL engine behind the scenes. Amazon Redshift offers the
advantage of scaling as you go, at a very low cost compared to an on-site
dedicated hardware/software approach.
4. Right way but hard way
• If you read the guidelines published by Amazon
regarding Redshift data load, you will quickly realize that there is
a lot to do under the covers to get it going the right way. Here are a few
steps you will have to perform while loading data to Redshift from
your on-premises server (data can be sitting in files or in a relational
source).
5. Right way but hard way
Steps for Amazon Redshift Data Load from On-Premise files or RDBMS (e.g. MySQL, SQL Server)
• Export local RDBMS data to flat files (make sure you remove invalid characters and apply escape sequences
during export)
• Split files into 10-15 MB each to get optimal performance during upload and final data load
• Compress files to *.gz format so you don’t end up with a $1000 surprise bill :) .. In my case text files were
compressed 10-20 times
• List all file names in a manifest file so that when you issue the COPY command to Redshift they are treated as one unit
of load
• Upload the manifest file to an Amazon S3 bucket
• Upload the local *.gz files to an Amazon S3 bucket
• Issue the Redshift COPY command with different options
• Schedule file archiving from on-premises and from the S3 staging area on AWS
• Capture errors and set up restartability in case something fails
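The split, compress and manifest steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the SSIS task works internally: the function name `stage_rows`, the part file naming, the manifest file name `load.manifest` and the S3 prefix are all hypothetical, and the upload to S3 is left out. The manifest layout (an `entries` list with `url` and `mandatory` keys) follows the format the Redshift COPY command reads.

```python
import gzip
import json
from pathlib import Path


def stage_rows(lines, out_dir, s3_prefix, max_bytes=10 * 1024 * 1024):
    """Split CSV lines into gzip parts of roughly max_bytes (uncompressed)
    and write a COPY manifest listing every part under s3_prefix."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, chunk, size = [], [], 0

    def flush():
        # Write the buffered lines as the next numbered *.csv.gz part.
        nonlocal chunk, size
        if not chunk:
            return
        part = out_dir / f"part_{len(parts):04d}.csv.gz"
        with gzip.open(part, "wt", encoding="utf-8", newline="") as fh:
            fh.writelines(chunk)
        parts.append(part)
        chunk, size = [], 0

    for line in lines:  # each line is expected to end with "\n"
        encoded_len = len(line.encode("utf-8"))
        if chunk and size + encoded_len > max_bytes:
            flush()
        chunk.append(line)
        size += encoded_len
    flush()

    # One manifest entry per part; "mandatory" makes COPY fail if a file is missing.
    manifest = {
        "entries": [
            {"url": f"{s3_prefix}/{p.name}", "mandatory": True} for p in parts
        ]
    }
    manifest_path = out_dir / "load.manifest"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return parts, manifest_path
```

Calling `stage_rows(rows, "staging", "s3://example-bucket/stage")` would leave the compressed parts and the manifest in the local `staging` folder, ready to be uploaded and referenced by a COPY command.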
6. Doing it the easy way
• If you are not sure you are ready to code the many steps listed above,
you can use the Amazon Redshift Data Transfer Task.
• In the next few sections we will describe how to set up your Redshift
cluster for demo purposes and load data from SQL Server to Redshift
using SSIS.
7. Doing it the easy way
Should I use SSIS to load Redshift
• If you are wondering which approach to use to load data, consider a few questions
• Do you have existing ETL processes written in SSIS?
• Do you need a more visual approach and better workflow management (which SSIS
provides)?
• Do you need connection string encryption and other goodies offered by SSIS, such
as native logging and passing parameters from the SSIS environment?
• Do you have SSIS expertise available in-house, or are you better off staying with command
line scripts?
• Do you need to create a workflow that can run on any server, including ones where SSIS is not
installed?
8. Setup your Amazon Redshift
Cluster
NOTE: SKIP THIS STEP IF YOU HAVE ALREADY SET UP
YOUR REDSHIFT CLUSTER
1. LOG IN TO YOUR AWS CONSOLE
AND CLICK ON THE REDSHIFT ICON, OR
CLICK HERE TO LAND DIRECTLY ON
REDSHIFT
2. CLICK ON LAUNCH CLUSTER
3. ON THE CLUSTER DETAIL PAGE SPECIFY
CLUSTER IDENTIFIER, DATABASE
NAME, PORT, MASTER USER AND
PASSWORD. CLICK CONTINUE TO GO
TO THE NEXT PAGE
9. Setup your Amazon Redshift
Cluster
4. ON THE NODE CONFIGURATION PAGE
SPECIFY NODE TYPE (THIS IS THE VM
TYPE), CLUSTER TYPE AND NUMBER
OF NODES. IF YOU ARE TRYING THIS UNDER
THE FREE TIER THEN SELECT THE SMALLEST
NODE POSSIBLE (IN THIS CASE IT
WAS DW2.LARGE). CLICK CONTINUE
TO GO TO THE NEXT PAGE
10. Setup your Amazon Redshift
Cluster
5. ON THE ADDITIONAL CONFIGURATION
PAGE YOU CAN PICK THE VPC (VIRTUAL
PRIVATE CLOUD), THE SECURITY
GROUP FOR THE CLUSTER AND OTHER
OPTIONS FOR ENCRYPTION. FOR
DEMO PURPOSES SELECT THE OPTIONS SHOWN IN THE
SCREENSHOT BELOW. CLICK CONTINUE TO
REVIEW YOUR SETTINGS AND CLICK
CREATE CLUSTER
11. Setup your Amazon Redshift
Cluster
6. GIVE IT A FEW MINUTES WHILE YOUR
CLUSTER IS BEING CREATED. AFTER
A FEW MINUTES (5-10 MINUTES) YOU CAN
GO BACK TO THE SAME PAGE AND
REVIEW THE CLUSTER STATUS AND
OTHER PROPERTIES AS BELOW.
COPY THE CLUSTER ENDPOINT
SOMEWHERE BECAUSE WE WILL
NEED IT LATER.
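The copied endpoint is typically a host name followed by a port, and the connection dialog later asks for the two separately. A minimal sketch of splitting it, assuming a `host:port` form and the default Redshift port 5439 when none is present (the helper name `parse_endpoint` and the sample endpoint are hypothetical):

```python
def parse_endpoint(endpoint, default_port=5439):
    """Split a 'host:port' cluster endpoint into (host, port).

    Falls back to default_port when the endpoint carries no port."""
    endpoint = endpoint.strip()
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        # No colon found: the whole string is the host name.
        return endpoint, default_port
    return host, int(port)


# Hypothetical endpoint, for illustration only.
host, port = parse_endpoint(
    "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439"
)
```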
12. Add inbound rule for Redshift
Cluster
NOTE: SKIP THIS STEP IF YOU HAVE ALREADY
ADDED YOUR IP TO AN INBOUND EXCEPTION RULE.
BY DEFAULT YOU CANNOT CONNECT TO AN AMAZON
REDSHIFT CLUSTER FROM OUTSIDE THE AWS
NETWORK (E.G. FROM YOUR ON-PREMISES
MACHINE). IF YOU WISH TO CONNECT, YOU
MUST ADD AN INBOUND EXCEPTION RULE TO ALLOW
YOUR REQUESTS TO REACH THE REDSHIFT CLUSTER ON
A SPECIFIC PORT.
TO CREATE A NEW INBOUND RULE, PERFORM THE
FOLLOWING STEPS
1. ON THE REDSHIFT HOME PAGE
CLICK THE [SECURITY] TAB. YOU MAY SEE
THE FOLLOWING NOTICE DEPENDING ON
WHICH REGION YOU ARE IN. CLICK ON THE
[GO TO THE EC2 CONSOLE] LINK OR
GO DIRECTLY TO EC2 BY
CLICKING THE SERVICES -> EC2 MENU AT
THE TOP
13. Add inbound rule for Redshift
Cluster
2. ON THE EC2 SECURITY GROUPS PAGE
SELECT THE SECURITY GROUP ATTACHED
TO YOUR REDSHIFT CLUSTER AND
THEN IN THE BOTTOM PANE CLICK
ON THE INBOUND TAB
3. ON THE INBOUND TAB CLICK THE EDIT
OPTION TO MODIFY THE DEFAULT ENTRY,
OR ADD A NEW RULE
4. CLICK ON ADD RULE IF YOU WISH
TO ADD A NEW ENTRY, ELSE EDIT AS
BELOW AND CLICK SAVE
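For readers who prefer to think of the inbound rule as data, the sketch below builds one permission entry for a single client address and checks whether a given source address and port would match it. The dictionary keys mirror the shape of an `IpPermissions` entry in the EC2 AuthorizeSecurityGroupIngress API, but nothing here calls AWS; the helper names and the sample address are hypothetical, and port 5439 is assumed because it is the Redshift default.

```python
import ipaddress

REDSHIFT_DEFAULT_PORT = 5439


def inbound_rule(client_ip, port=REDSHIFT_DEFAULT_PORT):
    """Build one TCP inbound permission for a single IPv4 client address."""
    address = ipaddress.IPv4Address(client_ip)  # raises ValueError if invalid
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # /32 limits the rule to exactly one address.
        "IpRanges": [{"CidrIp": f"{address}/32"}],
    }


def is_allowed(rule, source_ip, port):
    """Return True when source_ip and port fall inside the rule."""
    if not rule["FromPort"] <= port <= rule["ToPort"]:
        return False
    source = ipaddress.ip_address(source_ip)
    return any(
        source in ipaddress.ip_network(entry["CidrIp"])
        for entry in rule["IpRanges"]
    )


# 203.0.113.10 is a documentation-only address used as a placeholder.
rule = inbound_rule("203.0.113.10")
```

Opening the rule to a single /32 address rather than a wide range keeps the cluster reachable only from the machine that runs the SSIS package.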
14. Automate Redshift Cluster Creation
If you need to automate Redshift cluster creation or any of the following operations,
check out the Redshift Cluster Management Task
• Automate the Amazon Redshift cluster create action in a few clicks. You can also add
an access security rule.
• Automate the Amazon Redshift cluster delete action
• Fetch an Amazon Redshift cluster property into an SSIS variable (e.g. fetch cluster status)
• Fetch all clusters and their properties as a DataTable (use a ForEach Loop to iterate
through all clusters)
• Automate Redshift cluster snapshot creation
• Automate the Redshift cluster snapshot delete action
• Support for waiting until a cluster operation is done
15. Create Sample table and data in Source – (in this
example SQL Server)
Note: Skip this step if you wish to use your own table. If you do so, please ignore certain steps and
screenshots mentioned in this article.
For this demo we will use the free Northwind sample database
supplied by Microsoft.
• Download the sample database from here.
• Extract the zip file -> open the *.sql file and run it to create a new
database with sample tables and data.
16. Create Sample table in Amazon
Redshift
4. DOUBLE CLICK ON THE TASK TO
SEE THE UI.
5. CLICK ON [NEW] CONNECTION.
6. CONFIGURE THE REDSHIFT
CONNECTION PROPERTIES AND
CLICK TEST.
17. Create Sample table in Amazon
Redshift
7. ONCE THE CONNECTION TEST IS SUCCESSFUL,
CLICK OK TO SAVE THE
CONNECTION DETAILS.
8. ENTER THE FOLLOWING SCRIPT IN THE
SQL TEXTBOX AND HIT OK TO SAVE
IT.
18. Create Sample table in Amazon
Redshift
9. NOW RIGHT CLICK ON THE TASK
AND EXECUTE IT. THIS SHOULD CREATE
A NEW TABLE IN REDSHIFT.
19. SQL Server to Redshift Data Load using SSIS
Once the table is created, let's do the real work of moving data from SQL Server to Amazon Redshift.
Perform the following steps to configure the SSIS Amazon Redshift Data Transfer Task
1. Drag the Amazon Redshift Data Transfer Task onto the SSIS designer surface.
2. Double click on the task to edit its properties.
3. Select Action: In the Action drop-down at the top, select the Bulk Import to Redshift from any RDBMS (e.g.
MySQL, Oracle, SQL Server) option
4. Configure Source: On the Source tab click [New] next to the connection drop-down and configure the source
connection, or pick an existing connection. In our case we are extracting data from a SQL Server database
(Northwind) on the local server.
Enter the following SQL query to extract 100,000 rows from SQL Server
20. Create Sample table in Amazon
Redshift
5. CONFIGURE SOURCE STAGING
AREA: ON THE SOURCE TAB YOU
HAVE TO ENTER FOLDER LOCATION
WHERE STAGING FILES WILL BE
SAVED BEFORE WE UPLOAD TO
REDSHIFT (SEE ABOVE SCREEN).
21. Create Sample table in Amazon Redshift
6. CONFIGURE TARGET: ON TARGET
TAB SELECT EXISTING REDSHIFT
CONNECTION MANAGER (OR CREATE
NEW), SELECT TARGET TABLE FROM
THE DROPDOWN WHERE YOU WANT
TO LOAD DATA. IF YOU HAVE LONG
LIST OF TABLES THEN SIMPLY ENTER
SCHEMA NAME IN THE SCHEMA
FILTER TEXT BOX AND CLICK
REFRESH TO RELOAD TABLE
DROPDOWN WITH FEWER ITEMS.
22. Create Sample table in Amazon Redshift
7. CONFIGURE RELOAD OPTION AND
TARGET STAGING AREA: ON THE TARGET
TAB CHECK THE TRUNCATE TARGET
TABLE OPTION IF YOU WANT TO
RELOAD EACH TIME YOU EXECUTE THIS
TASK, ELSE LEAVE IT UNCHECKED TO
APPEND RECORDS. WE ALSO HAVE
TO SPECIFY THE AMAZON S3 STAGING
AREA WHERE REDSHIFT WILL LOOK
FOR FILES TO LOAD.
23. Create Sample table in Amazon Redshift
8. CONFIGURE FILE FORMAT: WE
ARE GOING TO GENERATE CSV FILES
FOR REDSHIFT LOAD SO MAKE SURE
YOU SELECT CORRECT COLUMN
DELIMITER. ALSO MAKE SURE YOU
CHECK ALWAYS COMPRESS FILE
OPTION TO REDUCE BANDWIDTH.
24. Create Sample table in Amazon Redshift
9. CONFIGURE ARCHIVE OPTIONS:
ON THE ARCHIVE TAB WE CAN SPECIFY
HOW TO ARCHIVE THE SOURCE AND
TARGET FILES WE GENERATED.
SOURCE FILES ARE CSV FILES AND
SOURCE STAGE FILES ARE *.GZ
FILES (IF YOU SELECT
COMPRESSION). TARGET STAGE
FILES ARE EITHER CSV OR *.GZ
FILES. BY DEFAULT SOURCE CSV
FILES ARE KEPT AND ALL OTHER
STAGE FILES ARE DELETED. SEE
THE SCREENSHOT BELOW
25. Create Sample table in Amazon Redshift
10. CONFIGURE ADVANCED
OPTIONS: ON THE ADVANCED OPTIONS
TAB YOU CAN FINE-TUNE THE LOAD PROCESS,
SUCH AS HOW TO HANDLE NULL
DATA, HOW TO HANDLE DATA
TRUNCATION ETC. READ THE HELP FILE
FOR MORE INFO
26. Create Sample table in Amazon Redshift
11. CONFIGURE ERROR HANDLING
OPTIONS: ON THE ERROR HANDLING TAB
YOU CAN SPECIFY HOW MANY
ERRORS YOU WANT TO IGNORE
BEFORE FAILING THE ENTIRE LOAD. YOU
CAN ALSO REPLACE SOME INVALID
CHARACTERS DURING YOUR LOAD IF YOU
CHECK THE [ALLOW INVALID
CHARACTERS] OPTION.
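Several of the settings in the steps above have close counterparts among the options of the Redshift COPY command (DELIMITER, GZIP, MANIFEST, NULL AS, TRUNCATECOLUMNS, MAXERROR, ACCEPTINVCHARS). The sketch below assembles such a statement to show how the pieces fit together. Which statement the task actually issues is not documented here, so the mapping is an assumption; the function name `build_copy_sql` and the table, bucket and IAM role values are hypothetical.

```python
def build_copy_sql(table, manifest_url, iam_role, delimiter=",",
                   gzipped=True, max_errors=0, accept_inv_chars=None,
                   truncate_columns=False, null_as=None):
    """Assemble a Redshift COPY statement that loads files listed in a manifest.

    `table` is inserted as-is, so it must come from a trusted source."""
    def quote(value):
        # Escape single quotes for use inside a SQL string literal.
        return "'" + value.replace("'", "''") + "'"

    clauses = [
        f"COPY {table}",
        f"FROM {quote(manifest_url)}",
        f"IAM_ROLE {quote(iam_role)}",
        "MANIFEST",                          # FROM points at a manifest file
        f"DELIMITER {quote(delimiter)}",     # column delimiter of the staged files
    ]
    if gzipped:
        clauses.append("GZIP")               # staged files are *.gz
    if max_errors:
        clauses.append(f"MAXERROR {int(max_errors)}")   # errors tolerated before failing
    if accept_inv_chars:
        clauses.append(f"ACCEPTINVCHARS AS {quote(accept_inv_chars)}")
    if truncate_columns:
        clauses.append("TRUNCATECOLUMNS")    # cut over-long strings to column width
    if null_as is not None:
        clauses.append(f"NULL AS {quote(null_as)}")
    return "\n".join(clauses) + ";"


# Hypothetical values, for illustration only.
sql = build_copy_sql(
    "public.customers",
    "s3://example-bucket/stage/load.manifest",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    max_errors=10,
    accept_inv_chars="?",
)
```

Printing `sql` gives a statement that could be run from any Redshift SQL client once the staged files and the manifest are in the S3 location it names.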
27. Create Sample table in Amazon Redshift
12. NOW WE ARE FINALLY READY TO EXECUTE OUR SSIS PACKAGE. ONCE IT IS
DONE YOU CAN REVIEW THE LOG. HERE IS A SAMPLE EXECUTION LOG.
28. Conclusion
In this article we outlined the steps needed to load data into Redshift from a relational source (e.g.
MySQL, SQL Server, Oracle). Redshift is a great way to offload your expensive data warehouse to the cloud so
you don’t have to worry about costly maintenance and future growth. With Redshift you can grow your data
size from gigabytes to petabytes. The SSIS Amazon Redshift Data Transfer Task can give you an easy way to
maintain your Redshift data transfer process, with ease of use and fast load options (for full or incremental
load).
Again, this was just a proof of concept, and we encourage you to do your own benchmarking and research to see
which approach best suits your needs.
• Related Links:
• SSIS Amazon Redshift Data Transfer Task