© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Loading Data into Amazon Redshift
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
PostgreSQL
Columnar
MPP
OLAP
AWS IAM, Amazon VPC, Amazon S3, AWS KMS, Amazon Route 53, Amazon CloudWatch, Amazon EC2
Amazon Redshift
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Amazon Redshift Architecture
• Massively parallel, shared-nothing columnar architecture
• Leader node
– SQL endpoint
– Stores metadata
– Coordinates parallel SQL processing
• Compute nodes
– Local, columnar storage
– Execute queries in parallel
– Load, unload, backup, restore
• Amazon Redshift Spectrum nodes
– Execute queries directly against Amazon Simple Storage Service (Amazon S3)
[Diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node, which coordinates N compute nodes (128 GB RAM, 16 TB disk, 16 cores each); compute nodes load, unload, back up, and restore via Amazon S3, and Amazon Redshift Spectrum nodes query Amazon S3 directly.]
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Terminology and Concepts: Disks
• Amazon Redshift uses locally attached storage devices
• Compute nodes have 2½ to 3 times the advertised storage capacity
• Each disk is split into two partitions
– Local data storage, accessed by the local compute node
– Mirrored data, accessed by the remote compute node
• Partitions are raw devices
– Local storage devices are ephemeral in nature
– Tolerant to multiple disk failures on a single node
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Terminology and Concepts: Redundancy
• Global commit ensures that every permanent table has written its blocks to another
node in the cluster, providing data redundancy
• Blocks are asynchronously backed up to Amazon S3—always a consistent snapshot
– Every 5 GB of changed data or eight hours, whichever comes first
– User-initiated on-demand manual snapshots
• Temporary tables
– Blocks are not mirrored to the remote partition—two-times faster write
performance
– Do not trigger a full commit or backups
Disable backups at the table level:
CREATE TABLE example(id int) BACKUP NO;
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Terminology and Concepts: Transactions
• Amazon Redshift is a fully transactional, ACID-compliant data
warehouse
• Isolation level is serializable
• Two-phase commit (local and global commit phases)
• Cluster commit statistics:
https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/commit_stats.sql
Design consideration:
• Because of the expense of commit overhead, limit commits by
explicitly creating transactions
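For example, a minimal sketch of batching DML into one explicit transaction so the cluster pays the commit cost once; the table names are illustrative, not from the deck:
-- One explicit transaction means one global commit instead of three
BEGIN;
DELETE FROM sales WHERE sale_date < '2017-01-01';        -- illustrative table
INSERT INTO sales_history SELECT * FROM sales_staging;   -- illustrative tables
UPDATE sales SET status = 'archived' WHERE status = 'old';
COMMIT;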
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: COPY Statement
• Ingestion throughput:
– Each slice’s query processors can load one file at a time:
o Streaming decompression
o Parse
o Distribute
o Write
• A single input file realizes only partial node usage, since just 6.25% of slices (1 of 16) are active
[Diagram: DC2.8XL compute node with 16 slices (0–15); one input file keeps only one slice busy.]
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: COPY Statement
• The number of input files should be a multiple of the number of slices
• Splitting the single file into 16 input files puts all slices to work, maximizing ingestion performance
• COPY continues to scale linearly as you add nodes
Recommendation is to use delimited files—1 MB to 1 GB after gzip compression
[Diagram: DC2.8XL compute node with 16 slices (0–15); 16 input files keep every slice busy.]
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Best Practices: COPY Ingestion
• Delimited files are recommended
– Pick a simple delimiter: '|', ',', or tabs
– Pick a simple NULL character ('\N')
– Use double quotes and an escape character ('\') for varchars
– UTF-8 varchar columns take four bytes per char
• Split files so the file count is a multiple of the number of slices
• File sizes should be 1 MB–1 GB after gzip compression
• Useful COPY options (see the example below)
– MAXERROR
– ACCEPTINVCHARS
– NULL AS
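A hedged example pulling these options together; the bucket, table, and IAM role are placeholders, not values from the deck:
COPY orders
FROM 's3://mybucket/orders/part_'    -- prefix matching the split, gzipped files
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
DELIMITER '|'
NULL AS '\N'                         -- simple NULL character
GZIP
MAXERROR 10                          -- tolerate up to 10 bad rows before failing
ACCEPTINVCHARS;                      -- replace invalid UTF-8 characters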
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: Amazon Redshift Spectrum
• Use INSERT INTO SELECT against external Amazon S3 tables (see the sketch below)
– Ingest additional file formats: Parquet, ORC, Grok
– Aggregate incoming data
– Select subset of columns and/or rows
– Manipulate incoming column data with SQL
• Best practices:
– Save cluster resources for querying and reporting rather than on ELT
– Filtering/aggregating incoming data can improve performance over COPY
• Design considerations:
– Repeated reads against Amazon S3 are not transactional
– $5/TB of (compressed) data scanned
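As a sketch, assuming an external schema spectrum with an external table spectrum.events already defined over Amazon S3 (both names are hypothetical):
INSERT INTO events_local              -- local Redshift table, assumed to exist
SELECT event_id, user_id, event_ts    -- select only the columns you need
FROM spectrum.events                  -- external table (Parquet/ORC/etc. on Amazon S3)
WHERE event_ts >= '2018-01-01';       -- filter on ingest to reduce data scanned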
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Design Considerations: Data Ingestion
• Designed for large writes
– Batch processing system, optimized for processing massive amounts of data
– 1 MB, immutable blocks mean that blocks are cloned on write, so fragmentation is never introduced
– A small write (~1–10 rows) has a cost similar to a larger write (~100K rows)
• UPDATE and DELETE
– Immutable blocks mean that rows are only logically deleted on UPDATE or DELETE
– Must VACUUM or deep copy to remove ghost rows from the table (a deep-copy sketch follows)
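A minimal deep-copy sketch, one way to rewrite a table without its ghost rows; the names reuse the deep_dive example that follows:
BEGIN;
CREATE TABLE deep_dive_copy (LIKE deep_dive);        -- inherits columns, encodings, dist/sort keys
INSERT INTO deep_dive_copy SELECT * FROM deep_dive;  -- copies only the live rows
DROP TABLE deep_dive;
ALTER TABLE deep_dive_copy RENAME TO deep_dive;
COMMIT;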
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: Deduplication/UPSERT

Incoming file s3://bucket/dd.csv:
aid loc dt
1   SFO 2017-10-20
2   JFK 2017-10-20
5   SJC 2017-10-10
6   SEA 2017-11-29

Table deep_dive, before the UPSERT:
aid loc dt
1   SFO 2016-09-01
2   JFK 2016-09-14
3   SFO 2017-04-01
4   JFK 2017-05-14

Table deep_dive, after the UPSERT (rows 1 and 2 updated, rows 5 and 6 appended):
aid loc dt
1   SFO 2017-10-20
2   JFK 2017-10-20
3   SFO 2017-04-01
4   JFK 2017-05-14
5   SJC 2017-10-10
6   SEA 2017-11-29
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: Deduplication/UPSERT
Steps:
1. Load the CSV data into a staging table
2. Delete duplicate data from the production table
3. Insert (or append) data from the staging table into the production table
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Data Ingestion: Deduplication/UPSERT
BEGIN;
-- Staging table inherits the production table's DDL
CREATE TEMP TABLE staging(LIKE deep_dive);
-- COMPUPDATE OFF skips automatic compression analysis on the staging load
COPY staging FROM 's3://bucket/dd.csv'
CREDENTIALS '<creds>' COMPUPDATE OFF;
-- Remove the production rows that are being replaced
DELETE FROM deep_dive
USING staging WHERE deep_dive.aid = staging.aid;
INSERT INTO deep_dive SELECT * FROM staging;
DROP TABLE staging;
COMMIT;
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Best Practices: ELT
• Wrap workflow/statements in an explicit transaction
• Consider using DROP TABLE or TRUNCATE instead of DELETE
• Staging tables:
– Use a temporary table, or a permanent table with the BACKUP NO option
– If possible, use DISTSTYLE KEY on both the staging table and production table to speed up the INSERT INTO SELECT statement
– Turn off automatic compression—COMPUPDATE OFF
– Copy compression settings from the production table, or use the ANALYZE COMPRESSION statement
• Use CREATE TABLE LIKE, or write encodings into the DDL
– For copying a large number of rows (> hundreds of millions), consider using ALTER TABLE APPEND instead of INSERT INTO SELECT (see the sketch below)
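A sketch of the append path for very large staging loads, reusing the deep_dive example; note that ALTER TABLE APPEND requires permanent tables on both sides and cannot run inside a transaction block:
-- Moves storage blocks from staging to the target instead of copying rows;
-- column sets must match (FILLTARGET / IGNOREEXTRA handle differences)
ALTER TABLE deep_dive APPEND FROM staging;
DROP TABLE staging;   -- staging is left empty after the append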
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Vacuum and Analyze
• VACUUM globally sorts the table and removes rows that are marked as deleted
– For tables with a sort key, ingestion operations locally sort new data and write it into the unsorted region
• ANALYZE collects table statistics for optimal query planning
Best practices:
• Run VACUUM only as necessary
– Typically nightly or weekly
– Consider a deep copy (recreating and copying data) for larger or wide tables
• ANALYZE can be run periodically after ingestion on just the columns that WHERE predicates filter on (example below)
• Utility to VACUUM and ANALYZE all the tables in the cluster:
https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility
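For example, against the deep_dive table from the earlier slides; the column list and sort threshold are illustrative:
VACUUM FULL deep_dive TO 75 PERCENT;   -- re-sort and reclaim space, skipped if the table is already 75% sorted
ANALYZE deep_dive (aid, dt);           -- refresh statistics only for the columns used in predicates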
Amazon Kinesis: Streaming Data Done the AWS Way
Makes it easy to capture, deliver, and process real-time data streams
Pay as you go, no up-front costs
Elastically scalable
Right services for your specific use cases
Real-time latencies
Easy to provision, deploy, and manage
Streaming Data Scenarios Across Verticals

Vertical                  | Accelerated Ingest-Transform-Load             | Continuous Metrics Generation                            | Responsive Data Analysis
Digital Ad Tech/Marketing | Publisher, bidder data aggregation            | Advertising metrics like coverage, yield, and conversion | User engagement with ads, optimized bid/buy engines
IoT                       | Sensor, device telemetry data ingestion       | Operational metrics and dashboards                       | Device operational intelligence and alerts
Gaming                    | Online data aggregation, e.g., top 10 players | Massively multiplayer online game (MMOG) live dashboard  | Leader board generation, player-skill match
Consumer Online           | Clickstream analytics                         | Metrics like impressions and page views                  | Recommendation engines, proactive care
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
• Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention.
• Serverless ETL using AWS Lambda: Firehose can invoke your Lambda function to transform incoming source data.
[Diagram: producers capture and submit streaming data; Firehose loads it continuously into Amazon S3, Amazon Redshift, and Amazon Elasticsearch; analyze the streaming data using your favorite BI tools.]
Amazon Kinesis Firehose
Capture IT and app logs, device and sensor data, and more; enable near-real-time analytics using existing tools
• Send data from IT infrastructure, mobile devices, and sensors
• Integrated with the AWS SDK, agents, and AWS IoT
• Fully managed service to capture streaming data
• Elastic, with no resource provisioning
• Pay as you go: 3.5 cents/GB transferred
• Batches, compresses, and encrypts data before loading
• Loads data into Amazon Redshift tables using the COPY command
[Diagram: AWS Platform SDKs, Mobile SDKs, Kinesis Agent, and AWS IoT feed Firehose, which delivers to Amazon S3 and Amazon Redshift.]
Amazon Kinesis Firehose to Amazon S3
Amazon Kinesis Firehose to Amazon Redshift
Amazon Kinesis Firehose to Amazon Elasticsearch
Amazon Kinesis Data Firehose vs.
Amazon Kinesis Data Streams
Amazon Kinesis Data Streams is for use cases that require custom
processing, per incoming record, with sub-1 second processing latency, and
a choice of stream processing frameworks.
Amazon Kinesis Data Firehose is for use cases that require zero
administration, the ability to use existing analytics tools based on Amazon S3,
Amazon Redshift, and Amazon Elasticsearch, and a data latency of 60
seconds or higher.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Hands-on Lab: Redshift Basics
http://github.com/wrbaldwin/da-week/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS