AWS Webcast - Amazon Redshift Best Practices Part 2 – Performance


This session follows our webinar on data loading and key choices and shows you how to use Amazon Redshift efficiently. Hear our experts discuss how to extract the best performance from your Amazon Redshift cluster by using commands like VACUUM appropriately. Understand what information is exposed in the Amazon Redshift console and how to use it. Learn how to tune performance by explaining query plans and examining how memory and disk space are used.

Reasons to attend:
  • Learn how to use Amazon Redshift efficiently.
  • Manage storage effectively with VACUUM.
  • Attend a Q&A session with Amazon Redshift experts.

Notes
  • Usual Progression: Steps that happen at a certain frequency (daily, hourly, weekly)
  • If your data has updates in the short term, consider having a short-term version of the table for staging and a long-term version once data gets stable. Example: orders stay in a short-term table while in process and move to the long-term table once they stabilize.
  • Transcript

    • 1. Amazon Redshift Best Practices, Part 2 (May 2013). Eric Ferreira & John Loughlin
    • 2. Agenda
      Introduction & Recap
      Best Practices for:
      • Workload Migration
      • COPY Command Options
      • Vacuum
      • Space Management
      Q&A
    • 3. AWS Database Services
      • Amazon DynamoDB: fast, predictable, highly scalable NoSQL data store
      • Amazon RDS: managed relational database service for MySQL, Oracle, and SQL Server
      • Amazon ElastiCache: in-memory caching service
      • Amazon Redshift: fast, powerful, fully managed, petabyte-scale data warehouse service
      (Slide diagram: the AWS Global Infrastructure stack with Compute, Storage, Database, Application Services, Deployment & Administration, and Networking layers)
    • 4. Amazon Redshift Architecture
      Leader node:
      • SQL endpoint (JDBC/ODBC)
      • Postgres based
      • Stores metadata
      • Communicates with the client
      • Compiles queries and coordinates query execution
      Compute nodes:
      • Local, columnar storage
      • Execute queries in parallel across slices
      • Load, backup, and restore via Amazon S3
      Everything is mirrored; nodes are connected over 10 GigE (HPC) for ingestion, backup, and restore.
    • 5. In Part 1…
      This is Part 2 of the Redshift Best Practices series. To watch Part 1, visit:
      http://aws.amazon.com/resources/databaseservices/webinars/
    • 6. Workload Migration
      ELT/ETL process:
      • Load atomic data (into the target table or a staging area)
      • Transform the data (including cleanup and aggregation)
      • Prepare target tables for queries/reports
      • Include statistics gathering and vacuum
      • Include the data retention policy
      Re-evaluate the process to take advantage of cloud characteristics. A minimal sketch of one such cycle follows.
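      As a sketch only: one ELT cycle in Redshift SQL, assuming a hypothetical staging table stage_orders, target table orders, and S3 path (none of these names are from the webinar):

        -- Load atomic data into a staging table (hypothetical bucket/path)
        COPY stage_orders
        FROM 's3://my-bucket/orders/2013-05-01/'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        GZIP DELIMITER '|';

        -- Transform: clean up and move stable rows into the target table
        INSERT INTO orders
        SELECT order_id, customer_id, order_date, total
        FROM stage_orders
        WHERE status = 'complete';

        -- Prepare the target table for queries: refresh statistics, reclaim space
        ANALYZE orders;
        VACUUM orders;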
    • 7. Workload Migration (cont.)
      Make provision for testing multiple options before you migrate the production workflow:
      • Different numbers of nodes
      • A few large nodes versus many small nodes (2x 8XL versus 16x XL)
      • WLM settings: concurrency versus response time
      • Different sort and distribution keys: test both queries and load/vacuum times
      • Compression
    • 8. Workload Best Practices
      Organizing and keeping your load files in S3 allows for re-runs and scenario testing as you evolve your workflow on the platform.
      • Keep files in S3 or Glacier for fiscal/legal reasons
      Data updated in the short term:
      • Consider having a short-term version of the table for staging and a long-term version once data gets stable
      Round-robin distribution (DISTSTYLE EVEN):
      • Use it when you don't have a good distribution key
      • See Part 1 for a query that checks for distribution skew
      • Trade it off against collocated joins
      Loading the target (final) table:
      • Use a chronological date/timestamp column as the first sort key; vacuum is then needed less often and runs faster
      • When the first sort column has low cardinality/resolution (e.g., date instead of timestamp), subsequent sort columns should match common filter and/or grouping columns
      A table definition applying this guidance is sketched below.
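      For illustration only, a hypothetical fact table following the sort- and distribution-key guidance above (the table and column names are assumptions, not from the webinar):

        CREATE TABLE orders (
            order_date  DATE          NOT NULL,  -- low-resolution chronological column, first sort key
            order_ts    TIMESTAMP     NOT NULL,
            customer_id INTEGER       NOT NULL,  -- commonly filtered/grouped column, second sort key
            total       DECIMAL(12,2)
        )
        DISTSTYLE EVEN                           -- round-robin: used when no good distribution key exists
        SORTKEY (order_date, customer_id);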
    • 9. Workload Best Practices (cont.)
      Use the UNLOAD command to archive data that is not needed for business reasons:
      • Data that needs to exist only for fiscal/legal reasons can be re-loaded as needed (see the sketch after this slide)
      Consider applying retention policies less often than the regular workflow:
      • Run a weekly/monthly process during a less busy time
      • Make space provision for the data growth
      • Make sure all queries have date/timestamp range filters (> and <)
      • Keep a sliding window of data to minimize block re-writes during vacuum
      Take manual snapshots to save status at specific mileposts (e.g., year-end).
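      A minimal UNLOAD sketch for archiving aged rows to S3 before removing them; the table, cutoff date, and bucket path are hypothetical:

        -- Archive rows that fall outside the retention window (hypothetical path)
        UNLOAD ('SELECT * FROM orders WHERE order_date < ''2012-01-01''')
        TO 's3://my-archive-bucket/orders/2011/part_'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        GZIP;

        -- Then delete them from the cluster and reclaim the space
        DELETE FROM orders WHERE order_date < '2012-01-01';
        VACUUM orders;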
    • 10. Workload Best Practices (cont.)
      Ratio between load and query performance needs:
      • Low ratio: consider load -> snapshot -> spin up "query" clusters -> tear down
      • High ratio: consider performance above space needs when choosing the number of nodes
      Normalization rule of thumb:
      • De-normalize only to avoid non-collocated joins
      • Slowly changing dimensions (Type II): keep them normalized and match their distkey with the fact table
    • 11. COPY Command
      COPY table_name [ (column1 [, column2, ...]) ]
      FROM 's3://objectpath' [ WITH ] CREDENTIALS [AS] 'aws_access_credentials'
      [ option [ ... ] ]
      Options worth mentioning:
      GZIP
      • Using compressed files saves network bandwidth and can speed up loads
      MAXERROR and NOLOAD
      • The default MAXERROR is 0; set it to a larger value while troubleshooting a new data stream
      • Use it with the NOLOAD option to speed up file validation
      STATUPDATE
      • When loading a significant amount of data into a non-empty table, STATUPDATE can update the statistics at the end of the load
      These options are combined in the sketch below.
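      A sketch of these options on a new feed (the bucket and table names are hypothetical):

        -- First pass: parse and validate the files without loading any rows
        COPY clicks
        FROM 's3://my-bucket/clicks/'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        GZIP MAXERROR 100 NOLOAD;

        -- Real load, refreshing table statistics when it finishes
        COPY clicks
        FROM 's3://my-bucket/clicks/'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        GZIP STATUPDATE ON;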
    • 12. COPY Command Common Issues
      UTF-8
      • Currently Redshift can only load well-formed UTF-8 characters up to 3 bytes
      NULL AS and ESCAPE
      • Common issues loading files can be circumvented with these options
      • Narrow down to a small set of rows and visually find what type of problem you have
      • Note that the error message might refer to a later portion of the row; for example, "Delimiter not found" might be caused by an EOL that was not escaped
      DATEFORMAT and TIMEFORMAT
      • Currently all date/timestamp columns have to use the same formatting defined by the option
      • Using ACCEPTANYDATE will not generate errors, but loads NULL when the format does not match
      The sketch below combines these options.
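      A sketch combining these options; the file layout (pipe-delimited, \N for NULL, MM/DD/YYYY dates) is an assumption for illustration:

        COPY events
        FROM 's3://my-bucket/events/'
        CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
        DELIMITER '|'
        NULL AS '\\N'            -- treat \N fields as SQL NULL
        ESCAPE                   -- honor backslash-escaped delimiters and embedded newlines
        DATEFORMAT 'MM/DD/YYYY'
        TIMEFORMAT 'MM/DD/YYYY HH:MI:SS';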
    • 13. COPY Command Troubleshooting
      STL_LOAD_ERRORS / STL_LOADERROR_DETAIL
      • Find errors during specific loads
      • You can create a view to simplify the troubleshooting process:

        create view loadview as (
          select distinct tbl, trim(name) as table_name, query, starttime,
                 trim(filename) as input, line_number, colname, err_code,
                 trim(err_reason) as reason
          from stl_load_errors sl, stv_tbl_perm sp
          where sl.tbl = sp.id);

      • Then run "select * from loadview where table_name = '<table>'" if you have any issues
      STL_LOAD_COMMITS / STL_FILE_SCAN / STL_S3CLIENT
      • Load times for specific files; confirms a given file was read
      STL_S3CLIENT_ERROR
      • Information about specific S3 or file-transfer errors that happen during the load process
    • 14. COPY Command – Historical Information
      Look back to confirm the number of files and bytes loaded by each COPY statement:

        select substring(q.querytxt, 1, 40) as querytxt,
               s.n_files, s.size_mb, s.time_seconds,
               s.size_mb / decode(s.time_seconds, 0, 1, s.time_seconds) as mb_per_s
        from (select query, count(*) as n_files,
                     sum(transfer_size / (1024 * 1024)) as size_mb,
                     (max(end_time) - min(start_time)) / 1000000 as time_seconds,
                     max(end_time) as end_time
              from stl_s3client
              where query > 0 and transfer_time > 0
              group by query) as s
        left join stl_query as q on q.query = s.query
        order by mb_per_s desc
        limit 10;
    • 15. COPY Command – Historical Information (cont.)

        querytxt                                                   | n_files | size_mb | time_seconds | mb_per_s
        -----------------------------------------------------------+---------+---------+--------------+----------
        copy lineitem from s3://tpc-h/100/lineitem.tbl. credential |     603 |   22201 |         2390 |        9
        copy lineitem from s3://tpc-h/1/lineitem.tbl. credentials  |      34 |     192 |           21 |        8
        copy customer from s3://tpc-h/100/customer.tbl. credential |     152 |     750 |           85 |        8
        copy partsupp from s3://tpc-h/100/partsupp.tbl. credential |      82 |    2720 |          367 |        7
        COPY ANALYZE part                                          |      22 |      40 |            7 |        5
        copy orders from s3://tpc-h/100/orders.tbl. credentials    |     152 |    4800 |         1035 |        4
        copy orders from s3://tpc-h/1/orders.tbl. credentials g    |      34 |      32 |            7 |        4
        copy part from s3://tpc-h/100/part.tbl. credentials gzi    |     202 |     400 |           95 |        4
        COPY ANALYZE supplier                                      |      34 |       0 |            3 |        0
        copy supplier from s3://tpc-h/100/supplier.tbl. credential |     102 |       0 |           10 |        0
        (10 rows)
    • 16. Vacuum
      Before vacuum:
      • Inserted data goes to a "non-sorted" area at the end of the table; as this area grows, query times grow
      • Deleted data is "marked" in a special column; as that column grows, query times grow
      What vacuum does:
      • The non-sorted area gets sorted and integrated into the table
      • Deleted rows are removed and blocks are reorganized
    • 17. Vacuum (cont.)
      • Vacuum takes advantage of the sort key and skips blocks that don't need to be modified
      • Vacuum is a maintenance-type operation; only one vacuum can be running at a time (cluster-wide)
      • More memory = faster vacuum: set wlm_query_slot_count to 4;
      • Keep track of vacuum progress (ETA): SVV_VACUUM_PROGRESS
      • Record vacuum details afterwards to consider adjusting its frequency: SVV_VACUUM_SUMMARY
      (Slide diagram: table blocks labeled by month, March-June 2013, with an unsorted region appended at the end)
      Put together, a session looks like the sketch below.
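      A minimal vacuum session using the setting and system views named above (the table name is hypothetical):

        -- Give the vacuum more memory by widening its WLM slot
        set wlm_query_slot_count to 4;

        vacuum orders;

        -- From another session: watch progress and the estimated time to completion
        select * from svv_vacuum_progress;

        -- Afterwards: review what the vacuum did, to tune how often you run it
        select * from svv_vacuum_summary;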
    • 18. Space Management
      Redshift has a single pool of space used for tables and temporary segments:
      • Loads need 2.5 times the space of the data being loaded if the table has a sort key
      • Vacuum may need 2.5 times the size of the table
      Monitor the free space via:
      • The Performance tab in the console
      • CloudWatch alarms
      • SQL (see the queries on the next slide)
    • 19. Space Management (cont.)
      Table sizes:

        select trim(pgdb.datname) as database, trim(pgn.nspname) as schema,
               trim(a.name) as table, b.mbytes, a.rows
        from (select db_id, id, name, sum(rows) as rows
              from stv_tbl_perm a group by db_id, id, name) as a
        join pg_class as pgc on pgc.oid = a.id
        join pg_namespace as pgn on pgn.oid = pgc.relnamespace
        join pg_database as pgdb on pgdb.oid = a.db_id
        join (select tbl, count(*) as mbytes
              from stv_blocklist group by tbl) b on a.id = b.tbl
        order by mbytes desc, a.db_id, a.name;

      Free space:

        select sum(capacity)/1024 as capacity_gbytes,
               sum(used)/1024 as used_gbytes,
               (sum(capacity) - sum(used))/1024 as free_gbytes
        from stv_partitions
        where part_begin = 0;

      • Redshift allows you to resize your cluster up and down and across node types, online (the cluster stays available for read-only access during the resize).
    • 20. Summary
      • Experiment to optimize your workflows
      • The various STL/STV tables hold most of the information needed for troubleshooting
      • The space management and vacuum schedule should be considered during the implementation phase
    • 21. More Information
      COPY command: http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html
      Load troubleshooting: http://docs.aws.amazon.com/redshift/latest/dg/t_Troubleshooting_load_errors.html
      Vacuum: http://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html
      Unloading data: http://docs.aws.amazon.com/redshift/latest/dg/c_unloading_data.html
    • 23. Q&A