AWS Senior Product Manager Tina Adams discusses Redshift's new feature, User Defined Functions.
Learn how the new User Defined Functions for Amazon Redshift work with Chartio for quick and dynamic data analysis.
3. Amazon Redshift powers Clickstream Analytics for Amazon.com
• Web log analysis for Amazon.com
– Petabyte workload
– Largest table: 400 TB
• Understand customer behavior
– Who is browsing but not buying
– Which products/features are winners
– What sequence led to higher customer conversion
• Solution
– Best scale-out solution could query across only 1 week
– Hadoop could query across 1 month
4. Amazon Redshift benefits realized
• Performance
– Scan 2.25 trillion rows of data: 14 minutes
– Load 5 billion rows of data: 10 minutes
– Backfill 150 billion rows of data: 9.75 hours
– Pig → Amazon Redshift: 2 days to 1 hr
• 10B-row join with 700M rows
– Oracle → Amazon Redshift: 90 hours to 8 hrs
• Cost
– 1.6 PB cluster
– 100 8xl HDD nodes
– $180/hr
• Complexity
– 20% time of one DBA
• Backup
• Restore
• Resizing
6. Scalar User-Defined Functions (UDF)
• Scalar UDFs using Python 2.7
– Return single result value for each input value
– Executed in parallel across cluster
– Syntax largely identical to PostgreSQL
– The f_ prefix is reserved for customer-defined functions
• Pandas, NumPy, SciPy pre-installed
– Do matrix operations, build optimization algorithms, and run statistical analyses
– Build end-to-end modeling workflow
• Import your own libraries
CREATE FUNCTION f_function_name
( [ argument_name arg_type, ... ] )
RETURNS data_type
{ VOLATILE | STABLE | IMMUTABLE }
AS $$
python_program
$$ LANGUAGE plpythonu;
7. Scalar UDF Security
• Run in restricted container that is fully isolated
– Cannot make system and network calls
– Cannot corrupt your cluster or negatively impact its performance
• Current limitations
– Can’t access file system - functions that write files won’t work
– Don’t yet cache stable and immutable functions
– Slower than built-in functions compiled to machine code
• Haven’t fully optimized some cases, including nested functions
8. Scalar UDF example - URL parsing
CREATE FUNCTION f_hostname (url varchar)
RETURNS varchar
IMMUTABLE AS $$
import urlparse
return urlparse.urlparse(url).hostname
$$ LANGUAGE plpythonu;
SELECT f_hostname(url) FROM table;
SELECT REGEXP_REPLACE(url, '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\\3') FROM table;
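The UDF body above runs under Python 2.7 inside Redshift; the same parsing logic can be sanity-checked locally. A minimal sketch using Python 3, where the Python 2 urlparse module moved to urllib.parse:

```python
from urllib.parse import urlparse  # Python 3 home of Python 2's urlparse module

def f_hostname(url):
    # Mirrors the UDF body: extract the hostname component of a URL.
    # Note that hostname strips user info and port, and lowercases the result.
    return urlparse(url).hostname

host = f_hostname("https://user@www.example.com:8080/path?q=1")
```

Here `host` comes back as `www.example.com`, matching what the UDF (and the REGEXP_REPLACE group 3) would return for the same URL.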
9. Scalar UDF example – Distance
CREATE FUNCTION f_distance (orig_lat float, orig_long float, dest_lat float, dest_long float)
RETURNS float
STABLE
AS $$
import math
r = 3963.1676 # earth's radius, in miles
phi_orig = math.radians(orig_lat)
phi_dest = math.radians(dest_lat)
delta_lat = math.radians(dest_lat - orig_lat)
delta_long = math.radians(dest_long - orig_long)
a = (math.sin(delta_lat/2) * math.sin(delta_lat/2) +
     math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long/2) * math.sin(delta_long/2))
c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
d = r * c
return d
$$ LANGUAGE plpythonu;
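The UDF body implements the haversine great-circle distance. It can be checked outside Redshift with plain Python; distances are in miles, per the radius constant (the coordinate pair below is an illustrative example, not from the deck):

```python
import math

def f_distance(orig_lat, orig_long, dest_lat, dest_long):
    # Haversine great-circle distance, mirroring the UDF body above.
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2 +
         math.cos(phi_orig) * math.cos(phi_dest) * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return r * c

# Seattle to Portland, roughly 145 miles great-circle:
d = f_distance(47.6062, -122.3321, 45.5152, -122.6784)
```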
10. Redshift GitHub UDF Repository

Script – Purpose
f_encryption.sql – Uses the pyaes library to encrypt/decrypt strings using a passphrase
f_next_business_day.sql – Uses the pandas library to return dates that are US Federal Holiday aware
f_null_syns.sql – Uses Python sets to match strings, similar to a SQL IN condition
f_parse_url_query_string.sql – Uses urlparse to parse the field-value pairs from a URL query string
f_parse_xml.sql – Uses xml.etree.ElementTree to parse XML
f_unixts_to_timestamp.sql – Uses the pandas library to convert a Unix timestamp to UTC datetime

github.com/awslabs/amazon-redshift-udfs
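As a local sketch of what f_parse_url_query_string does (the repository version runs as a Redshift UDF under Python 2.7; the function name and signature here are illustrative, shown with Python 3's urllib.parse):

```python
from urllib.parse import urlparse, parse_qs  # Python 3 equivalents of Python 2's urlparse

def parse_url_query_string(url, field):
    # Return the first value for `field` in the URL's query string, or None.
    values = parse_qs(urlparse(url).query)
    return values.get(field, [None])[0]

source = parse_url_query_string("https://example.com/landing?utm_source=news&id=42", "utm_source")
```

Here `source` is `"news"`; a field absent from the query string yields None.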
11. Amazon Kinesis Firehose to Amazon Redshift
Load massive volumes of streaming data into Amazon Redshift
• Zero administration: Capture and deliver streaming data into Redshift without writing an application
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery
• Seamless elasticity: Scales to match data throughput without intervention
Flow: capture and submit streaming data to Firehose → Firehose loads streaming data continuously into S3 and Redshift → analyze streaming data using Chartio
12. Amazon Kinesis Firehose to Amazon Redshift (continued)
• Uses your S3 bucket as an intermediate destination
• S3 bucket has a ‘manifests’ folder that holds the manifest of files to be copied
• Issues COPY command synchronously
• A single delivery stream loads into a single Redshift cluster, database, and table
• Continuously issues COPY once the previous one is finished
• Frequency of COPYs is determined by how fast your cluster can load files
• No partial loads: if a single record fails, the whole file or batch fails
• Info on skipped files is delivered to the S3 bucket as a manifest in the errors folder
• If Firehose cannot reach the cluster, it retries every 5 min for 60 min and then moves on to the next batch of objects
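In shape, the COPY that Firehose issues against the manifest looks like the following sketch (the bucket, manifest key, IAM role, and the JSON data-format option are placeholders that depend on your delivery stream configuration):

```sql
COPY firehose_target_table
FROM 's3://your-bucket/manifests/manifest-file'   -- manifest written by Firehose
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/firehose-redshift-role'
MANIFEST                                          -- treat the S3 object as a manifest of files
JSON 'auto';                                      -- format option from the stream's COPY options
```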
13. Multi-Column Sort
• Compound sort keys
– Filter data by one leading column
• Interleaved sort keys
– Filter data by up to eight columns
– No storage overhead, unlike an index or projection
– Lower maintenance penalty
14. Compound sort keys illustrated
• Four records fill a block, sorted by customer.
• Records with a given customer are all in one block.
• Records with a given product are spread across four blocks.

[Diagram: a 4x4 grid of records [cust_id, prod_id] for cust_id 1-4 (rows) and prod_id 1-4 (columns), laid out into four blocks. Sorted by the compound key (cust_id, prod_id), each block holds all four records for one customer: block 1 holds [1,1] [1,2] [1,3] [1,4], block 2 holds [2,1] [2,2] [2,3] [2,4], and so on.]
15. Interleaved sort keys illustrated
• Records with a given customer are spread across two blocks.
• Records with a given product are also spread across two blocks.
• Both keys are equal.

[Diagram: the same 4x4 grid of records [cust_id, prod_id]. With an interleaved sort key on both columns, each block holds a 2x2 quadrant of the grid, so any single cust_id or prod_id value touches only two of the four blocks.]
16. Interleaved Sort Key Considerations
• Vacuum time can increase by 10-50% for interleaved sort keys vs. compound keys
• If data increases monotonically, such as dates, the interleaved sort order will skew over time
– You’ll need to run a vacuum operation to re-analyze the distribution and re-sort the data.
• Queries that filter on the leading sort column run faster with compound sort keys than with interleaved keys
Speaker notes:
• UDFs can’t write files. When you run something like python script.py, the script is compiled to bytecode, and the CPython interpreter (really just a C program) reads that bytecode and executes the program; it is not translated to machine code. Other implementations, like PyPy, have JIT compilation, i.e. they translate Python to machine code on the fly.
• urlparse is part of Python’s built-in libraries.
• The distance UDF implements the haversine formula.
• A data producer sends data blobs as large as 1,000 KB to a Firehose delivery stream.
• With 1,000,000 blocks (1 TB per column) and an interleaved sort key on both customer ID and page ID, you scan 1,000 blocks when you filter on a specific customer or page, a 1000x speedup compared to the unsorted case.
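The 1000x figure follows from treating the interleaved layout as a grid over the sort columns: filtering on k of n interleaved columns leaves roughly total_blocks ** ((n - k) / n) blocks to scan. A quick sketch of the arithmetic (the function name is illustrative, not a Redshift API):

```python
def interleaved_scan_blocks(total_blocks, num_sort_cols, cols_filtered=1):
    # With an interleaved sort key, blocks form an n-dimensional grid over
    # the sort columns; filtering on k of n columns restricts the scan to
    # one "slice" of roughly total_blocks ** ((n - k) / n) blocks.
    n, k = num_sort_cols, cols_filtered
    return total_blocks ** ((n - k) / n)

# 1,000,000 blocks, interleaved key on (customer ID, page ID):
blocks = interleaved_scan_blocks(1_000_000, 2)   # ~1,000 blocks scanned
speedup = 1_000_000 / blocks                     # ~1000x vs. the unsorted case
```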