
Redshift Chartio Event Presentation


AWS Senior Product Manager Tina Adams discusses Redshift's new feature, User-Defined Functions.

Learn how the new User-Defined Functions feature for Amazon Redshift works with Chartio for quick, dynamic data analysis.

Published in: Technology


  1. Amazon Redshift
     Spend time with your data, not your database.
  2. Data Warehouse Challenges
     • Cost • Complexity • Performance • Rigidity
     [chart: enterprise data vs. data in the warehouse, 1990–2020]
  3. Amazon Redshift powers Clickstream Analytics for Amazon.com
     • Web log analysis for Amazon.com
       – Petabyte workload
       – Largest table: 400 TB
     • Understand customer behavior
       – Who is browsing but not buying
       – Which products/features are winners
       – What sequence led to higher customer conversion
     • Solution
       – Best scale-out solution: query across 1 week
       – Hadoop: query across 1 month
  4. Amazon Redshift benefits realized
     • Performance
       – Scan 2.25 trillion rows of data: 14 minutes
       – Load 5 billion rows of data: 10 minutes
       – Backfill 150 billion rows of data: 9.75 hours
       – Pig → Amazon Redshift: 2 days to 1 hour
         • 10B-row join with 700M rows
       – Oracle → Amazon Redshift: 90 hours to 8 hours
     • Cost
       – 1.6 PB cluster: 100 8xl HDD nodes at $180/hr
     • Complexity
       – 20% of one DBA's time covers backup, restore, and resizing
  5. Expanding Amazon Redshift Functionality
  6. Scalar User-Defined Functions (UDFs)
     • Scalar UDFs are written in Python 2.7
       – Return a single result value for each input value
       – Execute in parallel across the cluster
       – Syntax is largely identical to PostgreSQL
       – The f_ prefix is reserved for customer functions, so they never collide with built-ins
     • Pandas, NumPy, and SciPy come pre-installed
       – Do matrix operations, build optimization algorithms, and run statistical analyses
       – Build end-to-end modeling workflows
     • Import your own libraries

     CREATE FUNCTION f_function_name ( [ argument_name arg_type, ... ] )
     RETURNS data_type
     { VOLATILE | STABLE | IMMUTABLE }
     AS $$
       python_program
     $$ LANGUAGE plpythonu;
  7. Scalar UDF Security
     • UDFs run in a restricted, fully isolated container
       – Cannot make system or network calls
       – Cannot corrupt your cluster or degrade its performance
     • Current limitations
       – No file-system access: functions that write files won't work
       – Results of STABLE and IMMUTABLE functions are not yet cached
       – Slower than built-in functions, which are compiled to machine code
       – Some cases, including nested functions, are not yet fully optimized
  8. Scalar UDF example: URL parsing

     CREATE FUNCTION f_hostname (url varchar)
     RETURNS varchar IMMUTABLE
     AS $$
       import urlparse
       return urlparse.urlparse(url).hostname
     $$ LANGUAGE plpythonu;

     SELECT f_hostname(url) FROM table;

     -- The same result without a UDF:
     SELECT REGEXP_REPLACE(url,
       '(https?)://([^@]*@)?([^:/]*)([/:].*|$)', '\\3')
     FROM table;
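Outside the cluster, the UDF body above can be exercised in plain Python. A minimal sketch, ported to Python 3's urllib.parse (the slide's urlparse module is Python 2 only); the sample URL is illustrative:

```python
from urllib.parse import urlparse

def f_hostname(url):
    """Mirror of the slide's UDF body: extract the hostname from a URL.

    urlparse strips any userinfo ("user@") and port from the netloc,
    which is exactly what the REGEXP_REPLACE alternative captures.
    """
    return urlparse(url).hostname

print(f_hostname("https://user@www.example.com:8080/path?q=1"))  # www.example.com
```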
  9. Scalar UDF example: Distance

     CREATE FUNCTION f_distance (orig_lat float, orig_long float,
                                 dest_lat float, dest_long float)
     RETURNS float STABLE
     AS $$
       import math
       r = 3963.1676  # earth's radius, in miles
       phi_orig = math.radians(orig_lat)
       phi_dest = math.radians(dest_lat)
       delta_lat = math.radians(dest_lat - orig_lat)
       delta_long = math.radians(dest_long - orig_long)
       a = math.sin(delta_lat/2) * math.sin(delta_lat/2) + \
           math.cos(phi_orig) * math.cos(phi_dest) * \
           math.sin(delta_long/2) * math.sin(delta_long/2)
       c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
       d = r * c
       return d
     $$ LANGUAGE plpythonu;
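The function body is a standard haversine great-circle calculation, so it can be sanity-checked as ordinary Python before registering it as a UDF. A sketch with the same body (the coordinates in the comment are illustrative test points):

```python
import math

def f_distance(orig_lat, orig_long, dest_lat, dest_long):
    """Haversine great-circle distance in miles, mirroring the UDF body."""
    r = 3963.1676  # earth's radius, in miles
    phi_orig = math.radians(orig_lat)
    phi_dest = math.radians(dest_lat)
    delta_lat = math.radians(dest_lat - orig_lat)
    delta_long = math.radians(dest_long - orig_long)
    a = (math.sin(delta_lat / 2) ** 2
         + math.cos(phi_orig) * math.cos(phi_dest)
         * math.sin(delta_long / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return r * c

# e.g. San Francisco (37.7749, -122.4194) to New York (40.7128, -74.0060)
# comes out around 2,570 miles.
```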
  10. Redshift GitHub UDF Repository

      Script                          Purpose
      f_encryption.sql                Uses the pyaes library to encrypt/decrypt strings with a passphrase
      f_next_business_day.sql         Uses the pandas library to return dates that are US federal holiday aware
      f_null_syns.sql                 Uses Python sets to match strings, similar to a SQL IN condition
      f_parse_url_query_string.sql    Uses urlparse to parse the field-value pairs from a URL query string
      f_parse_xml.sql                 Uses xml.etree.ElementTree to parse XML
      f_unixts_to_timestamp.sql       Uses the pandas library to convert a Unix timestamp to a UTC datetime

      github.com/awslabs/amazon-redshift-udfs
  11. Amazon Kinesis Firehose to Amazon Redshift
      Load massive volumes of streaming data into Amazon Redshift
      • Zero administration: capture and deliver streaming data into Redshift without writing an application
      • Direct-to-datastore integration: batch, compress, and encrypt streaming data for delivery
      • Seamless elasticity: scales to match data throughput without intervention
      [flow: capture and submit streaming data to Firehose → Firehose loads it continuously into S3 and Redshift → analyze the streaming data using Chartio]
  12. Amazon Kinesis Firehose to Amazon Redshift
      • Uses your S3 bucket as an intermediate destination
      • The bucket's 'manifests' folder holds manifests of files to be copied
      • Issues COPY commands synchronously; a single delivery stream loads into a single Redshift cluster, database, and table
      • Continuously issues the next COPY once the previous one finishes, so COPY frequency is determined by how fast your cluster can load files
      • No partial loads: if a single record fails, the whole file or batch fails
      • Info on skipped files is delivered to your S3 bucket as a manifest in the errors folder
      • If Firehose cannot reach the cluster, it retries every 5 minutes for 60 minutes, then moves on to the next batch of objects
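The manifest objects mentioned above follow Redshift's standard COPY manifest format. A sketch of its shape; the bucket name and key paths here are illustrative, not Firehose's actual naming scheme:

```json
{
  "entries": [
    { "url": "s3://my-firehose-bucket/batch-0001.gz", "mandatory": true },
    { "url": "s3://my-firehose-bucket/batch-0002.gz", "mandatory": true }
  ]
}
```

Setting "mandatory" to true is what enforces the all-or-nothing behavior: COPY fails the batch if any listed file is missing.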
  13. Multi-Column Sort
      • Compound sort keys
        – Filter data by one leading column
      • Interleaved sort keys
        – Filter data by up to eight columns
        – No storage overhead, unlike an index or projection
        – Lower maintenance penalty
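In DDL, the two styles above differ only in the SORTKEY clause. A minimal sketch; the table and column names are illustrative:

```sql
-- Compound sort key: fastest when queries filter on the leading
-- column (cust_id here); later columns help only alongside it.
CREATE TABLE sales_compound (
  cust_id  INT,
  prod_id  INT,
  amount   DECIMAL(10,2)
)
COMPOUND SORTKEY (cust_id, prod_id);

-- Interleaved sort key: gives each listed column (up to eight)
-- equal weight, so filtering on prod_id alone also prunes blocks.
CREATE TABLE sales_interleaved (
  cust_id  INT,
  prod_id  INT,
  amount   DECIMAL(10,2)
)
INTERLEAVED SORTKEY (cust_id, prod_id);
```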
  14. Compound sort keys illustrated
      • Four records fill a block, sorted by customer
      • Records with a given customer are all in one block
      • Records with a given product are spread across four blocks
      [diagram: 4×4 grid of (cust_id, prod_id) records laid out into blocks, sorted by cust_id and then prod_id]
  15. Interleaved sort keys illustrated
      • Records with a given customer are spread across two blocks
      • Records with a given product are also spread across two blocks
      • Both sort key columns carry equal weight
      [diagram: the same 4×4 grid of (cust_id, prod_id) records, interleaved so each cust_id value and each prod_id value spans two blocks]
  16. Interleaved Sort Key Considerations
      • Vacuum time can increase by 10–50% for interleaved sort keys vs. compound keys
      • If data increases monotonically (dates, for example), the interleaved sort order will skew over time
        – You'll need to run VACUUM REINDEX to re-analyze the key distribution and re-sort the data
      • Queries that filter on the leading sort column run faster with compound sort keys than with interleaved
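The re-sort mentioned above maps to a specific maintenance command. A sketch; the table name is illustrative:

```sql
-- Re-analyzes the interleaved sort key distribution, then re-sorts.
-- Heavier than a plain VACUUM, which is where the 10-50% extra
-- vacuum time for interleaved keys shows up.
VACUUM REINDEX my_interleaved_table;
```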
  17. SAN FRANCISCO
      Questions/Comments? Please contact us at redshift-feedback@amazon.com
