Amazon Redshift
Spend time with your data, not your database…
Data Warehouse Challenges
Cost
Complexity
Performance
Rigidity
[Chart: Enterprise Data vs. Data in Warehouse, 1990 to 2020]
Amazon Redshift powers Clickstream Analytics for
Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution—query across 1 week
– Hadoop—query across 1 month
Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
• 10B row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% of one DBA's time
• Backup
• Restore
• Resizing
Expanding Amazon Redshift
Functionality
Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– We reserve the f_ function-name prefix for customers
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run
statistical analyses
– Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
Scalar UDF Security
• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance
• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
• Haven’t fully optimized some cases, including nested functions
Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
SELECT f_hostname(url) FROM table;
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
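The body of f_hostname is ordinary Python, so it can be tried locally before it is wrapped in CREATE FUNCTION. The sketch below is an illustration under assumptions, not part of the original deck: it falls back to urllib.parse so it also runs on Python 3, and hostname_by_regex is a hypothetical helper that applies the same pattern as the REGEXP_REPLACE query using Python's re module.

```python
import re

try:
    from urlparse import urlparse  # Python 2.7, as used inside the UDF
except ImportError:
    from urllib.parse import urlparse  # Python 3 fallback for local testing


def f_hostname(url):
    """Same logic as the UDF body: let the URL parser extract the host."""
    return urlparse(url).hostname


# Same pattern as the REGEXP_REPLACE query; group 3 captures the host.
HOST_RE = re.compile(r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)')


def hostname_by_regex(url):
    """Hypothetical helper: replace the whole URL with capture group 3."""
    return HOST_RE.sub(r'\3', url)


if __name__ == "__main__":
    sample = "https://user@www.example.com:8080/path?q=1"
    print(f_hostname(sample))
    print(hostname_by_regex(sample))
```

Both approaches should agree on well-formed http and https URLs; the parser-based version is shorter to read and does not depend on getting the pattern right.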
Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
import math
r = 3963.1676 # earth's radius, in miles
phi_orig = math.radians(orig_lat)
phi_dest = math.radians(dest_lat)
delta_lat = math.radians(dest_lat - orig_lat)
delta_long = math.radians(dest_long - orig_long)
a = (math.sin(delta_lat/2) * math.sin(delta_lat/2) + math.cos(phi_orig)
     * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2))
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = r * c
return d
$$ LANGUAGE plpythonu;
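The distance calculation above is the haversine formula, and it can be sanity-checked as plain Python before deploying the UDF. This is a minimal sketch, assuming a local function with the same name and arguments as the UDF; the check values (identical points, and two points half the globe apart on the equator) are chosen here for illustration.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as in the UDF


def f_distance(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two lat/long points (degrees)."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    # Haversine term: parentheses let the expression span two lines.
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


if __name__ == "__main__":
    # Identical points should be zero miles apart.
    print(f_distance(47.6, -122.3, 47.6, -122.3))
    # Opposite points on the equator should be half the circumference.
    print(f_distance(0.0, 0.0, 0.0, 180.0), math.pi * EARTH_RADIUS_MILES)
```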
Redshift GitHub UDF Repository
Script                         Purpose
f_encryption.sql               Uses pyaes library to encrypt/decrypt strings using passphrase
f_next_business_day.sql        Uses pandas library to return dates which are US Federal Holiday aware
f_null_syns.sql                Uses python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql   Uses urlparse to parse the field-value pairs from a url query string
f_parse_xml.sql                Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql      Uses pandas library to convert a unix timestamp to UTC datetime
github.com/awslabs/amazon-redshift-udfs
Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an
application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Scales to match data throughput without intervention
Capture and submit streaming data to Firehose
Firehose loads streaming data continuously into S3 and Redshift
Analyze streaming data using Chartio
• Uses your S3 bucket as an intermediate destination
• S3 bucket has ‘manifests’ folder – holds manifest of files to be copied
• Issues COPY command synchronously
• Single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once previous one is finished
• Frequency of COPYs determined by how fast your cluster can load files
• No partial loads. If a single record fails, whole file or batch fails
• Info on skipped files delivered to S3 bucket as manifest in errors folder
• If the cluster cannot be reached, retries every 5 min for 60 min and then moves on to the next batch of objects
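The retry rule in the last bullet can be pictured with a small simulation. This is only a sketch of the behavior described on the slide, not Firehose's implementation: deliver_batch, ClusterUnreachable, issue_copy, and the injected sleep function are all hypothetical names introduced for illustration.

```python
class ClusterUnreachable(Exception):
    """Hypothetical error raised when the COPY cannot reach the cluster."""


def deliver_batch(issue_copy, sleep, retry_interval_s=300, retry_window_s=3600):
    """Try one COPY; on failure retry every 5 min until 60 min have been spent.

    Returns True if the COPY succeeded, False if the batch was given up on
    so that the caller can move on to the next batch of objects.
    """
    waited = 0
    while True:
        try:
            issue_copy()
            return True
        except ClusterUnreachable:
            if waited >= retry_window_s:
                return False  # window exhausted: skip this batch
            sleep(retry_interval_s)
            waited += retry_interval_s


if __name__ == "__main__":
    sleeps = []

    def always_down():
        raise ClusterUnreachable()

    # Record the waits instead of really sleeping.
    print(deliver_batch(always_down, sleeps.append), len(sleeps))
```

Passing the sleep function in as a parameter keeps the sketch testable without waiting an hour.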
Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty
Compound sort keys illustrated
• Four records fill a block, sorted by customer.
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.
[Figure: table with columns cust_id, prod_id, other columns, and blocks, alongside a 4 x 4 grid of [cust_id, prod_id] pairs (cust_id 1-4 by prod_id 1-4)]
Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are weighted equally.
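The block counts on the two sort key slides can be reproduced with a toy simulation of 16 records, four per block. This is a sketch under assumptions: the interleaved order is modeled with a simple bit-interleaving (Z-order) key chosen for illustration, which is not claimed to be how Redshift lays out data internally, and all names (RECORDS, interleave_bits, blocks_touched) are hypothetical.

```python
from collections import defaultdict

BLOCK_SIZE = 4
# One record for every (cust_id, prod_id) pair with ids 1..4.
RECORDS = [(c, p) for c in range(1, 5) for p in range(1, 5)]


def interleave_bits(c, p, bits=2):
    """Build a Z-order key by alternating the bits of c and p, high bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((c >> i) & 1)
        z = (z << 1) | ((p >> i) & 1)
    return z


def blocks_touched(sorted_records):
    """Count how many blocks hold each cust_id and each prod_id."""
    cust_blocks = defaultdict(set)
    prod_blocks = defaultdict(set)
    for i, (c, p) in enumerate(sorted_records):
        block = i // BLOCK_SIZE
        cust_blocks[c].add(block)
        prod_blocks[p].add(block)
    return ({c: len(b) for c, b in cust_blocks.items()},
            {p: len(b) for p, b in prod_blocks.items()})


# Compound: sort by cust_id, then prod_id.
compound_cust, compound_prod = blocks_touched(sorted(RECORDS))
# Interleaved: sort by the bit-interleaved key of the zero-based ids.
interleaved = sorted(RECORDS, key=lambda r: interleave_bits(r[0] - 1, r[1] - 1))
inter_cust, inter_prod = blocks_touched(interleaved)

if __name__ == "__main__":
    print("compound   ", compound_cust, compound_prod)
    print("interleaved", inter_cust, inter_prod)
```

In this toy layout a filter on prod_id alone reads four blocks under the compound order but only two under the interleaved order, while a filter on cust_id goes from one block to two, which is the trade-off the slides describe.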
[Figure: table with columns cust_id, prod_id, other columns, and blocks]
Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs.
compound keys
• If data increases monotonically, such as dates, interleaved sort order
will skew over time
– You’ll need to run a vacuum operation to re-analyze the distribution and re-sort
the data.
• Queries that filter on the leading sort column run faster with
compound sort keys than with interleaved sort keys
SAN FRANCISCO
Questions/Comments?
Please contact us at redshift-feedback@amazon.com

 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 

Redshift Chartio Event Presentation

  • 1. Amazon Redshift: Spend time with your data, not your database.
  • 2. Data Warehouse Challenges
      • Cost
      • Complexity
      • Performance
      • Rigidity
      [Chart: Enterprise Data vs. Data in Warehouse, 1990 to 2020]
  • 3. Amazon Redshift powers Clickstream Analytics for Amazon.com
      • Web log analysis for Amazon.com
          – Petabyte workload
          – Largest table: 400 TB
      • Understand customer behavior
          – Who is browsing but not buying
          – Which products/features are winners
          – What sequence led to higher customer conversion
      • Solution
          – Best scale-out solution: query across 1 week
          – Hadoop: query across 1 month
  • 4. Amazon Redshift benefits realized
      • Performance
          – Scan 2.25 trillion rows of data: 14 minutes
          – Load 5 billion rows of data: 10 minutes
          – Backfill 150 billion rows of data: 9.75 hours
          – Pig → Amazon Redshift: 2 days to 1 hr
              • 10B row join with 700M rows
          – Oracle → Amazon Redshift: 90 hours to 8 hrs
      • Cost
          – 1.6 PB cluster
          – 100 8xl HDD nodes
          – $180/hr
      • Complexity
          – 20% time of one DBA
              • Backup
              • Restore
              • Resizing
  • 5. Expanding Amazon Redshift Functionality
  • 6. Scalar User-Defined Functions (UDF)
      • Scalar UDFs using Python 2.7
          – Return a single result value for each input value
          – Executed in parallel across the cluster
          – Syntax largely identical to PostgreSQL
          – The f_ function-name prefix is reserved for customers
      • Pandas, NumPy, SciPy pre-installed
          – Do matrix operations, build optimization algorithms, and run statistical analyses
          – Build end-to-end modeling workflow
      • Import your own libraries

      CREATE FUNCTION f_function_name
        ( [ argument_name arg_type, ... ] )
        RETURNS data_type
        { VOLATILE | STABLE | IMMUTABLE }
      AS $$
        python_program
      $$ LANGUAGE plpythonu;
  • 7. Scalar UDF Security
      • Run in restricted container that is fully isolated
          – Cannot make system and network calls
          – Cannot corrupt your cluster or negatively impact its performance
      • Current limitations
          – Can’t access file system: functions that write files won’t work
          – Don’t yet cache stable and immutable functions
          – Slower than built-in functions compiled to machine code
              • Haven’t fully optimized some cases, including nested functions
  • 8. Scalar UDF example - URL parsing

      CREATE FUNCTION f_hostname (url varchar)
        RETURNS varchar
        IMMUTABLE
      AS $$
        import urlparse
        return urlparse.urlparse(url).hostname
      $$ LANGUAGE plpythonu;

      -- Using the UDF
      SELECT f_hostname(url) FROM table;

      -- Equivalent query using a built-in regular expression function
      SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\3') FROM table;
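The two approaches on slide 8 can be tried outside a cluster. The sketch below is a minimal illustration, assuming Python 3 (where the module is urllib.parse rather than the Python 2.7 urlparse module used in the UDF). The helper names hostname_via_parser and hostname_via_regex and the sample URL are hypothetical and do not come from the slides.

```python
import re
from urllib.parse import urlparse  # Python 2.7, as in the UDF, would use: import urlparse

# Same pattern as the REGEXP_REPLACE call on the slide; capture group 3 is the host.
URL_PATTERN = re.compile(r'(https?)://([^@]*@)?([^:/]*)([/:].*|$)')


def hostname_via_parser(url):
    """Mirror the body of f_hostname: let the URL parser find the host."""
    return urlparse(url).hostname


def hostname_via_regex(url):
    """Mirror the SQL alternative: replace the whole URL with capture group 3."""
    return URL_PATTERN.sub(r'\3', url)


sample = "https://user@example.com:8080/path?x=1"  # hypothetical input
print(hostname_via_parser(sample))
print(hostname_via_regex(sample))
```

Both helpers return the same host for this input. The parser version is easier to read and maintain, which is the point the slide makes by placing the regular expression next to the UDF.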
  • 9. Scalar UDF example – Distance

      CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
        RETURNS float
        STABLE
      AS $$
        import math
        r = 3963.1676  # earth's radius, in miles
        phi_orig = math.radians(orig_lat)
        phi_dest = math.radians(dest_lat)
        delta_lat = math.radians(dest_lat - orig_lat)
        delta_long = math.radians(dest_long - orig_long)
        a = (math.sin(delta_lat/2) * math.sin(delta_lat/2)
             + math.cos(phi_orig) * math.cos(phi_dest)
             * math.sin(delta_long/2) * math.sin(delta_long/2))
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        d = r * c
        return d
      $$ LANGUAGE plpythonu;
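The body of f_distance can be checked locally before it is wrapped in CREATE FUNCTION. The following is a sketch of the same calculation as a plain Python function. The name distance_miles and the sample coordinates are assumptions made for illustration, not part of the slides.

```python
import math

EARTH_RADIUS_MILES = 3963.1676  # same constant as on the slide


def distance_miles(orig_lat, orig_long, dest_lat, dest_long):
    """Great-circle distance in miles between two points given in degrees."""
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


# Approximate coordinates for Seattle and San Francisco (illustrative only).
print(round(distance_miles(47.6062, -122.3321, 37.7749, -122.4194), 1))
```

Two quick sanity checks are useful here: identical points should give a distance of zero, and two points on opposite sides of the equator should give half the earth's circumference, that is, pi times the radius.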
  • 10. Redshift Github UDF Repository

      Script                         Purpose
      f_encryption.sql               Uses pyaes library to encrypt/decrypt strings using passphrase
      f_next_business_day.sql        Uses pandas library to return dates which are US Federal Holiday aware
      f_null_syns.sql                Uses python sets to match strings, similar to a SQL IN condition
      f_parse_url_query_string.sql   Uses urlparse to parse the field-value pairs from a url query string
      f_parse_xml.sql                Uses xml.etree.ElementTree to parse XML
      f_unixts_to_timestamp.sql      Uses pandas library to convert a unix timestamp to UTC datetime

      github.com/awslabs/amazon-redshift-udfs
  • 11. Amazon Kinesis Firehose to Amazon Redshift
      Load massive volumes of streaming data into Amazon Redshift
      • Zero administration: Capture and deliver streaming data into Redshift without writing an application
      • Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
      • Seamless elasticity: Seamlessly scales to match data throughput without intervention
      [Diagram: Capture and submit streaming data to Firehose → Firehose loads streaming data continuously into S3 and Redshift → Analyze streaming data using Chartio]
  • 12. Amazon Kinesis Firehose to Amazon Redshift
      • Uses your S3 bucket as an intermediate destination
      • S3 bucket has a ‘manifests’ folder, which holds the manifest of files to be copied
      • Issues the COPY command synchronously
      • A single delivery stream loads into a single Redshift cluster, database, and table
      • Continuously issues a COPY once the previous one is finished
      • Frequency of COPYs is determined by how fast your cluster can load files
      • No partial loads. If a single record fails, the whole file or batch fails
      • Info on skipped files is delivered to the S3 bucket as a manifest in the errors folder
      • If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
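The retry behaviour in the last bullet of slide 12 (retry every 5 minutes for 60 minutes, then move on) can be sketched as a small loop. This is an illustration of the described schedule, not Firehose's actual implementation. The function deliver_batch and the copy_attempt callable are hypothetical, and counting the first try separately from the retries is an assumption.

```python
import time


def deliver_batch(copy_attempt, retry_interval_s=300, retry_window_s=3600, sleep=time.sleep):
    """Try to load one batch; retry at a fixed interval until the window is used up.

    copy_attempt is a hypothetical callable returning True when the load succeeds.
    Returns True if the batch was delivered, False if it was skipped.
    """
    waited = 0
    while True:
        if copy_attempt():
            return True
        if waited >= retry_window_s:
            return False  # give up; the caller moves on to the next batch
        sleep(retry_interval_s)
        waited += retry_interval_s


# Simulate an unreachable cluster without actually sleeping.
attempts = []
naps = []
delivered = deliver_batch(lambda: attempts.append(1) or False, sleep=naps.append)
print(delivered, len(attempts), sum(naps))
```

Passing the sleep function as a parameter keeps the sketch testable: the simulation above records the waits instead of blocking for an hour.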
  • 13. Multi-Column Sort
      • Compound sort keys
          – Filter data by one leading column
      • Interleaved sort keys
          – Filter data by up to eight columns
          – No storage overhead, unlike an index or projection
          – Lower maintenance penalty
  • 14. Compound sort keys illustrated
      • Four records fill a block, sorted by customer
      • Records with a given customer are all in one block.
      • Records with a given product are spread across four blocks.
      [Figure: 4×4 grid of [cust_id, prod_id] records, with axes cust_id and prod_id, mapped to four blocks holding cust_id, prod_id, and other columns]
  • 15. Interleaved sort keys illustrated
      • Records with a given customer are spread across two blocks.
      • Records with a given product are also spread across two blocks.
      • Both keys are given equal weight.
      [Figure: 4×4 grid of [cust_id, prod_id] records, with axes cust_id and prod_id, mapped to four blocks holding cust_id, prod_id, and other columns]
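The block counts on slides 14 and 15 can be reproduced with a small simulation over the same 4×4 grid of [cust_id, prod_id] records. The sketch below uses a Z-order (bit-interleaving) curve as a stand-in for the interleaved ordering. That choice is an assumption, since the slides show only the resulting layout, and the helper names are hypothetical.

```python
from itertools import product


def split_into_blocks(records, key, block_size=4):
    """Sort records by key and cut the sorted run into fixed-size blocks."""
    ordered = sorted(records, key=key)
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def z_order_key(record):
    """Interleave the bits of (cust_id, prod_id) so both columns carry equal weight."""
    cust, prod = record[0] - 1, record[1] - 1  # ids 1..4 become 2-bit values 0..3
    z = 0
    for bit in range(2):
        z |= ((cust >> bit) & 1) << (2 * bit + 1)
        z |= ((prod >> bit) & 1) << (2 * bit)
    return z


def blocks_touched(blocks, column, value):
    """Count blocks holding at least one record whose column equals value."""
    return sum(1 for block in blocks if any(rec[column] == value for rec in block))


records = list(product(range(1, 5), repeat=2))            # (cust_id, prod_id) pairs
compound = split_into_blocks(records, key=lambda r: r)    # sort by cust_id, then prod_id
interleaved = split_into_blocks(records, key=z_order_key)

for name, blocks in (("compound", compound), ("interleaved", interleaved)):
    print(name,
          "cust_id=1 blocks:", blocks_touched(blocks, 0, 1),
          "prod_id=1 blocks:", blocks_touched(blocks, 1, 1))
```

With the compound ordering a customer filter touches one block and a product filter touches all four. With the interleaved ordering each filter touches two blocks, matching the bullets on the two slides.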
  • 16. Interleaved Sort Key Considerations
      • Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
      • If data increases monotonically, such as dates, interleaved sort order will skew over time
          – You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
      • Queries that filter on the leading sort column run faster with compound sort keys than with interleaved sort keys
  • 17. SAN FRANCISCO Questions/Comments? Please contact us at redshift-feedback@amazon.com

Editor's Notes

  1. Can’t add a file. When you run something like python script.py, the script is compiled to bytecode, and the interpreter (CPython, which is really just a C program) reads the bytecode and executes the program accordingly. The bytecode is not translated to machine code. Other implementations, such as PyPy, use JIT compilation, i.e. they translate Python to machine code on the fly.
  2. urlparse is part of Python’s standard library.
  3. The distance example uses the haversine formula.
  4. The data producer sends data blobs as large as 1,000 KB to a delivery stream.
  5. With 1,000,000 blocks (1 TB per column) and an interleaved sort key on both customer ID and page ID, you scan 1,000 blocks when you filter on a specific customer or page, a speedup of 1,000x compared to the unsorted case.
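The arithmetic in note 5 can be written out under a simplified model. Assume the interleaved key weights its k columns equally, so a filter on one column narrows the scan from N blocks to about N^((k-1)/k) blocks. This model and the function name blocks_scanned are assumptions for illustration. The note itself states only the two-column figures.

```python
def blocks_scanned(total_blocks, key_columns):
    """Idealized block count for a filter on one column of an interleaved sort key.

    Assumes every key column is weighted equally; real tables will deviate.
    """
    return round(total_blocks ** ((key_columns - 1) / key_columns))


total = 1_000_000                    # blocks per column, as in the note
scanned = blocks_scanned(total, 2)   # two-column key: customer ID and page ID
print(scanned, total // scanned)     # blocks scanned, speedup over a full scan
```

For the two-column case this reduces to the square root of the block count, which gives the 1,000 blocks and 1,000x speedup quoted in the note.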