Redshift at Lightspeed: How to continuously optimize and modify Redshift schemas, by panoply.io - Pop-up Loft Tel Aviv

We live in an era of rapid dev cycles and continuous deployment, where the code we commit is instantly tested and deployed. Yet maintaining data warehouse schemas remains a cumbersome, manual task. Redshift is an extremely powerful warehouse, and with a little fine-tuning it can keep pace with daily changes to code, data and query patterns by evolving and restructuring table schemas. In this talk we present a methodology for identifying query bottlenecks and under-optimized configurations by reviewing actual explain plans. We then discuss several techniques for modifying schema settings, including data types, sortkeys and distribution keys, that are robust, continuous and require no downtime.

Published in: Technology

  1. 1. Redshift at Lightspeed How to continuously optimize and modify Redshift schemas
  2. 2. Panoply.io The Missing Part: Continuous Data Warehousing Core Idea Product Continuous Integration Puppet Chef New Relic Unit Tests AWS Heroku Docker Server Frameworks Github Bitbucket Client Frameworks SCRUM Kanban Extreme
  3. 3. speed /spēd/ noun the rate at which someone or something is able to operate or change state
  4. 4. "First, make the change easy. Then, make the easy change." (Kent Beck)
  5. 5. Panoply.io Continuous Data Integration
        1. Data: columns, tables, data types, compression, constraints / code changes
        2. Queries: transformations, sortkeys, distkeys / life, business, environment
  6. 6. #1 Data & Metadata Changes
  7. 7. Panoply.io
        Tables:
          Users: int u_id, string gender, string name, string first_name, string last_name, int g_id
          Groups: int g_id, string name
          Messages: int m_id, int from, int to, string text, int to_u, int to_g
          Session Events: int s_id, int u_id, datetime time, datetime start_time, datetime end_time
        Commit log:
          Users support, commit #4acd617 by alice
          Adding messages, commit #0ca9e87 by bob
          Track user sessions, commit #709ff49 by alice
          Breakdown the name, commit #791079b by alice
          Add Groups, commit #44ff83b by bob
          Session time-range, commit #df7a369 by alice
  8. 8. Panoply.io automate with build scripts
        Commit log:
          Users support, commit #4acd617 by alice
          Breakdown the name, commit #791079b by alice
          Add Groups, commit #44ff83b by bob
        schema.yaml:
          users:
            - column: u_id
              type: int
            - column: first_name
              type: varchar
            - column: last_name
              type: varchar
            - column: g_id
              type: int
              references:
                - groups.g_id
          groups:
            - column: g_id
              type: int
  9. 9. Panoply.io remodel
        Tables: Users (integer id, varchar address); Groups (integer admin)
        schema.yaml:
          users:
            - column: id
              type: integer
            - column: address
              type: varchar
          groups:
            - column: admin
              type: integer
              references:
                - users.id
        Statements:
          create table ... ( ... )
          alter table ... add column ...
          alter table ... drop column ...
          alter table ... rename column ... to ...
        Reject on error
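The slide does not show the build script itself, so the following is only a minimal sketch of how a desired schema could be compared with the current one to emit DDL. It assumes the schema has already been loaded into plain dictionaries of `{table: {column: type}}`; the function name `diff_schema` and that dictionary shape are hypothetical. Renames and type changes are deliberately not detected here, and dropped tables are ignored.

```python
from typing import Dict, List

# Hypothetical in-memory form of schema.yaml: {table: {column: type}}.
Schema = Dict[str, Dict[str, str]]


def diff_schema(current: Schema, desired: Schema) -> List[str]:
    """Return DDL statements that move `current` toward `desired`."""
    statements: List[str] = []
    for table, columns in desired.items():
        if table not in current:
            # New table: create it with all of its columns at once.
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        existing = current[table]
        # Columns present in the desired schema but missing from the table.
        for name, ctype in columns.items():
            if name not in existing:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
        # Columns present in the table but removed from the desired schema.
        for name in existing:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name};")
    return statements


current = {"users": {"id": "integer"}}
desired = {
    "users": {"id": "integer", "address": "varchar"},
    "groups": {"admin": "integer"},
}
ddl = diff_schema(current, desired)
for statement in ddl:
    print(statement)
```

A real script would run these statements inside a transaction and reject the commit if any of them fails, in line with the "Reject on error" note on the slide.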
  10. 10. Panoply.io Concurrency & Locks
          Timeline: Start, commit 1, commit 2, commit 3, Done (with rollback, rollback)
          Users: integer id, varchar address
          Queries; Alter Table locks
  11. 11. Panoply.io Altering Column Types (reject on error)
          1. Add temporary column
               alter table messages add column created_tmp timestamp;
          2. Copy data to new column
               update messages set created_tmp = created;
          3. Rename columns
               alter table messages rename column created to created_old;
               alter table messages rename column created_tmp to created;
          4. Drop old column
               alter table messages drop column created_old;
          Messages table states: (date created), then (date created, ts created_tmp), then (date created_old, ts created), then (ts created)
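The add, copy, rename, drop sequence on this slide is mechanical enough to generate. The sketch below is an illustration only: the helper name `retype_column` and the `_tmp` / `_old` suffixes are assumptions, and the `UPDATE` relies on the old type converting implicitly to the new one (an explicit cast may be needed for other type pairs).

```python
from typing import List


def retype_column(table: str, column: str, new_type: str) -> List[str]:
    """Return the statements that change a column's type via a temporary column."""
    tmp = f"{column}_tmp"  # hypothetical naming convention
    old = f"{column}_old"  # hypothetical naming convention
    return [
        # 1. Add a temporary column with the new type.
        f"ALTER TABLE {table} ADD COLUMN {tmp} {new_type};",
        # 2. Copy the existing values across.
        f"UPDATE {table} SET {tmp} = {column};",
        # 3. Swap the names so readers see the new column under the old name.
        f"ALTER TABLE {table} RENAME COLUMN {column} TO {old};",
        f"ALTER TABLE {table} RENAME COLUMN {tmp} TO {column};",
        # 4. Drop the original column.
        f"ALTER TABLE {table} DROP COLUMN {old};",
    ]


steps = retype_column("messages", "created", "timestamp")
for step in steps:
    print(step)
```

Running the steps in one transaction and rolling back on the first failure would match the slide's "reject on error" behaviour.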
  12. 12. Panoply.io Rebuilding Views & Constraints
          Example: view Group Admins (string group_name, string admin_name) depends on Users (string name) and Groups (integer admin)
          Approach #1: drop all, and reconstruct until reaching stability
          Approach #2: read dependencies from pg_depend and rebuild along the DAG (directed acyclic graph)
          On error: reject
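Approach #2 amounts to a topological sort of the dependency graph. A minimal sketch, assuming the dependencies have already been read (for example from pg_depend) into a dictionary that maps each object to the set of objects it depends on; the function name `rebuild_order` and the example object names are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def rebuild_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order objects so each one comes after everything it depends on."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as err:
        # A cycle means the graph is not a DAG, so the change is rejected.
        raise ValueError(f"dependency cycle, rejecting change: {err.args[1]}") from err


# The view depends on both tables; groups references users.
deps = {
    "group_admins": {"users", "groups"},
    "groups": {"users"},
    "users": set(),
}
create_order = rebuild_order(deps)
drop_order = list(reversed(create_order))
print(create_order)
print(drop_order)
```

Dependent objects are dropped in `drop_order` before the change and recreated in `create_order` afterwards.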
  13. 13. #2 Query Changes Transformations Sortkeys Distkeys
  14. 14. Panoply.io ETL extract transform load Data Available Data Available ELT extract load transform ETL Is Yesterday’s Problem Rigid, Inflating Dev-dependent Lost
  15. 15. Panoply.io users … groups … View users-per-group int g_id varchar name int count_users View avg-turn-around string type float turn_around int uniques Raw Data Transformation Views Immediate Availability avg-turn-around … Selective Materialization
  16. 16. Panoply.io Unsorted gender 1mb blocks Sorted gender 1mb blocks Female Male Sortkeys: Recap
  17. 17. Panoply.io
          SELECT COUNT(1) FROM users WHERE gender = 'female'
            count: 2498644
          SELECT * FROM STL_EXPLAIN WHERE ...
            plannode          | cost                                    | info
            Aggregate         | cost=68718.76..68718.76 rows=1 width=0  |
            Seq Scan on users | cost=0.00..62500.00 rows=2487501 width=0 | Filter: gender = 'female'
          SELECT DATEDIFF('ms', starttime, endtime), * FROM STL_SCAN
            slice | datediff | rows | pre_filter | is_rrscan | rows / pre_filter
            1     | 250      | 4071 | 7813       | t         | 52%
            2     | 309      | 3846 | 7813       | t         | 49%
  18. 18. Panoply.io Even AllKey gender Female Male Diststyle & Distkeys: Recap
  19. 19. Panoply.io
          SELECT groups.name, COUNT(DISTINCT u_id)
          FROM groups FULL JOIN users ON groups.g_id = users.g_id
          GROUP BY groups.name;
            name    | count
            Group 1 | 6
            Group 2 | 2
            plan                        | cost        | info
            HashAggregate               | 61000331250 |
            Subquery Scan               | 61000306250 |
            HashAggregate               | 61000256250 |
            Hash Full Join DS_DIST_BOTH | 61000231250 | users.g_id = groups.g_id
            Seq Scan on users           | 50000       |
            Hash                        | 15000       |
            Seq Scan on groups          | 15000       |
          Diagram labels: inner, outer
  20. 20. Panoply.io DS_DIST_BOTH DS_DIST_ALL_INNER all all all Good OK Bad DS_DIST_INNER DS_DIST_NONE DS_DIST_ALL_NONE DS_BCAST_INNER node 1 node 2 node 3
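Spotting the redistribution label in each join line of an explain plan can be scripted. The sketch below is illustrative only: the flattened slide does not preserve which label sits under Good, OK or Bad, so the `SEVERITY` mapping is an assumption based on the general guidance in the Redshift documentation, and `classify_joins` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Tuple

# Assumed grouping; the slide's own Good / OK / Bad assignment is not recoverable.
SEVERITY: Dict[str, str] = {
    "DS_DIST_NONE": "good",      # no redistribution needed
    "DS_DIST_ALL_NONE": "good",  # inner table is DISTSTYLE ALL
    "DS_DIST_INNER": "ok",       # inner table redistributed
    "DS_DIST_OUTER": "ok",       # outer table redistributed
    "DS_DIST_ALL_INNER": "bad",  # inner table sent to a single slice
    "DS_BCAST_INNER": "bad",     # inner table broadcast to every node
    "DS_DIST_BOTH": "bad",       # both tables redistributed
}

DS_LABEL = re.compile(r"\bDS_[A-Z_]+\b")


def classify_joins(plan_lines: List[str]) -> List[Tuple[str, str]]:
    """Return (label, severity) for every plan line that carries a DS_ label."""
    findings: List[Tuple[str, str]] = []
    for line in plan_lines:
        match = DS_LABEL.search(line)
        if match:
            label = match.group(0)
            findings.append((label, SEVERITY.get(label, "unknown")))
    return findings


plan = [
    "Hash Full Join DS_DIST_BOTH",
    "Seq Scan on users",
    "Hash",
    "Seq Scan on groups",
]
findings = classify_joins(plan)
print(findings)
```

Lines without a DS_ label, such as scans, are simply skipped, so the result lists one entry per join step.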
  21. 21. Panoply.io
          users:  DISTSTYLE KEY DISTKEY (g_id)
          groups: DISTSTYLE KEY DISTKEY (g_id)
            plan                        | cost   | info
            HashAggregate               | 331250 |
            Subquery Scan               | 306250 |
            HashAggregate               | 256250 |
            Hash Full Join DS_DIST_NONE | 231250 | users.g_id = groups.g_id
            Seq Scan on users           | 50000  |
            Hash                        | 15000  |
            Seq Scan on groups          | 15000  |
  22. 22. Panoply.io Real Life Example
          SELECT "type", AVG(time_to_session) avg_time_to_session
          FROM (
            SELECT u_id, "type",
                   DATEDIFF(seconds, MS.time, first_session_after_message) time_to_session
            FROM (
              SELECT MM.u_id, MM.time,
                     CASE WHEN to_u IS NOT NULL THEN 'Private'
                          WHEN to_g IS NOT NULL THEN 'Group' END "type",
                     MIN(S.start_time) first_session_after_message
              FROM (
                SELECT M.time, to_u, to_g, nvl(A.u_id, B.u_id) u_id
                FROM messages M
                LEFT JOIN (
                  SELECT DISTINCT G.id AS g_id, U.u_id
                  FROM groups G RIGHT JOIN users U ON G.id = U.g_id
                ) A ON M.to_g = A.g_id
                LEFT JOIN users B ON M.to_u = B.u_id
              ) MM
              LEFT JOIN (SELECT u_id, start_time, end_time FROM sessions) S
                ON MM.u_id = S.u_id AND MM.time < S.start_time AND MM.time < S.end_time
              WHERE DATEDIFF(seconds, MM.time, S.start_time) < 3600
              GROUP BY MM.u_id, MM.time,
                       CASE WHEN to_u IS NOT NULL THEN 'Private'
                            WHEN to_g IS NOT NULL THEN 'Group' END
            ) MS
          ) MS1
          GROUP BY "type";
  23. 23. Panoply.io
          SELECT ...
            type    | time_to_session
            Private | 62.398
            Group   | 102.873
          EXPLAIN SELECT ...
            #  | plan                          | cost          | info
            1. | Hash Join DS_BCAST_INNER      | 5872656227563 | users.u_id = sessions.u_id
            2. | Hash Left Join DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
            3. | Hash Left Join DS_DIST_INNER  | 1349121397704 | messages.to_g = users.g_id
            4. | Hash Left Join DS_DIST_BOTH   | 49000231250   | users.g_id = groups.g_id
          Diagram labels: DS_DIST_BOTH, DS_BCAST_INNER, DS_DIST_INNER
  24. 24. Panoply.io
          users:    DISTKEY (u_id)
          sessions: DISTKEY (u_id)
          messages: DISTKEY (to_u)
          groups:   DISTSTYLE ALL
          Cost: 2x faster. Actual: 8x faster (12 mins to 1.5 mins).
          Previously:
            #  | hash join      | cost          | info
            1. | DS_BCAST_INNER | 5872656227563 | users.u_id = sessions.u_id
            2. | DS_BCAST_INNER | 1949786557708 | messages.to_u = users.u_id
            3. | DS_DIST_INNER  | 1349121397704 | messages.to_g = users.g_id
            4. | DS_DIST_BOTH   | 49000231250   | users.g_id = groups.g_id
          Now:
            #  | hash join      | cost
            1. | DS_DIST_OUTER  | 2900030128376
            2. | DS_DIST_NONE   | 1300007879765
            3. | DS_BCAST_INNER | 1300001568828
            4. | DS_DIST_NONE   | 231250
          Slide labels: 50%, 32%, 3%, 121,000%
  25. 25. Panoply.io Automated Optimization
          Stages shown: STL_EXPLAIN and STL_SCAN, parse, analyze, N-weeks optimum, rebuild
          explain_analysis: int q_id, timestamp time, varchar table, varchar column, float filter_cost, float dist_cost
          Constraint: 1 distkey per table (duplicate tables or use ALL)
          Current configuration:
            users:
              - diststyle: KEY
              - distkey: g_id
              - sortkeys:
                - gender
            groups:
              - diststyle: KEY
              - distkey: g_id
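The slide does not say how the optimum is computed from explain_analysis. One plausible reading, offered purely as an assumption, is to total `dist_cost` per table and column over the analysis window and pick the costliest join column as that table's single distkey. The `Analysis` record, the `pick_distkeys` helper and the numbers in the example are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Analysis(NamedTuple):
    """A subset of the explain_analysis columns named on the slide."""
    table: str
    column: str
    filter_cost: float
    dist_cost: float


def pick_distkeys(rows: Iterable[Analysis]) -> Dict[str, str]:
    """Pick, per table, the column with the highest total redistribution cost."""
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row.table][row.column] += row.dist_cost
    # Only one distkey is allowed per table, so keep the single worst offender.
    return {table: max(cols, key=cols.get) for table, cols in totals.items()}


# Made-up costs, for illustration only.
rows = [
    Analysis("users", "g_id", 0.0, 100.0),
    Analysis("users", "u_id", 0.0, 400.0),
    Analysis("users", "u_id", 0.0, 250.0),
    Analysis("groups", "g_id", 0.0, 100.0),
]
distkeys = pick_distkeys(rows)
print(distkeys)
```

The same aggregation over `filter_cost` could rank sortkey candidates; columns that lose the distkey contest are the cases the slide addresses with duplicate tables or DISTSTYLE ALL.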
  26. 26. Panoply.io Rebuilding Tables
          1. Clone table schema with new Sortkey and Distkey (users to users-dist-uid)
          2. Unload / copy data
          3. Chase by update_time
          4. Empirical test: replay queries / explains
          5. Instant swap
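A rough sketch of the clone, chase and swap steps as generated SQL follows. Several assumptions apply: it uses `CREATE TABLE ... AS` in place of the unload / copy step on the slide, it presumes an `update_time` column and append-only rows (updated rows would need a delete before the chase insert), and the `_new` / `_old` suffixes, the watermark value and the `rebuild_statements` helper are hypothetical. The empirical test step is left out.

```python
from typing import List


def rebuild_statements(table: str, distkey: str, sortkey: str, watermark: str) -> List[str]:
    """Return SQL that rebuilds `table` with new keys and swaps it into place."""
    new = f"{table}_new"  # hypothetical naming convention
    old = f"{table}_old"  # hypothetical naming convention
    return [
        # Clone the data into a table that has the new distkey and sortkey.
        f"CREATE TABLE {new} DISTKEY({distkey}) SORTKEY({sortkey}) AS SELECT * FROM {table};",
        # Chase: copy rows that arrived after the clone started.
        f"INSERT INTO {new} SELECT * FROM {table} WHERE update_time > '{watermark}';",
        # Swap: both renames happen inside one transaction.
        "BEGIN;",
        f"ALTER TABLE {table} RENAME TO {old};",
        f"ALTER TABLE {new} RENAME TO {table};",
        "COMMIT;",
    ]


statements = rebuild_statements("users", "u_id", "gender", "2016-01-01 00:00:00")
for statement in statements:
    print(statement)
```

Keeping the old table around under the `_old` name until the rebuilt one has been verified leaves a simple path back if the new keys turn out worse.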
  27. 27. Summary Continuous Data Warehousing
  28. 28. Future Tip of the Iceberg Frameworks & Platforms
  29. 29. ? Automated Data Management Platform over Redshift Panoply.io
  30. 30. Panoply.io Automated Data Management Platform over Redshift
