AWS Summit 2014: Redshift

Presentations from AWS Summit Sao Paulo 2014. Download the content prepared by our specialists to help you on your journey to the cloud.


  1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. May 2014. Uses and Best Practices for Amazon Redshift. Eric Ferreira, Sr. Database Engineer
  2. Amazon Redshift: fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
  3. The AWS big data pipeline: Collect (AWS Direct Connect, Amazon Kinesis), Store (Amazon Glacier, Amazon S3, Amazon DynamoDB), Analyze (Amazon Redshift, Amazon EMR, Amazon EC2)
  4. Amazon Redshift: a petabyte-scale, massively parallel, relational data warehouse; fully managed, zero admin. A lot faster, a lot cheaper, a whole lot simpler.
  5. Common Customer Use Cases. Traditional Enterprise DW: reduce costs by extending the DW rather than adding HW; migrate completely from existing DW systems; respond faster to the business. Companies with Big Data: improve performance by an order of magnitude; make more data available for analysis; access business data via standard reporting tools. SaaS Companies: add analytic functionality to applications; scale DW capacity as demand grows; reduce HW & SW costs by an order of magnitude.
  6. Amazon Redshift Customers (customer logos)
  7. Growing Ecosystem (partner logos)
  8. Data Loading Options: parallel upload to Amazon S3; AWS Direct Connect; AWS Import/Export; Amazon Kinesis; data integration tools and systems integrators.
  9. Amazon Redshift Architecture. Leader node: SQL endpoint (JDBC/ODBC); stores metadata; coordinates query execution. Compute nodes: local, columnar storage; execute queries in parallel; load, backup, and restore via Amazon S3; load from Amazon DynamoDB or SSH; interconnected over 10 GigE (HPC). Two hardware platforms, optimized for data processing: DW1 (HDD), scaling from 2TB to 1.6PB; DW2 (SSD), scaling from 160GB to 256TB.
  10. Amazon Redshift Node Types. DW1 (HDD): optimized for I/O-intensive workloads; high disk density; on-demand at $0.85/hour; as low as $1,000/TB/year; scales from 2TB to 1.6PB. DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage. DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate. DW2 (SSD): high performance at smaller storage sizes; high compute and memory density; on-demand at $0.25/hour; as low as $5,500/TB/year; scales from 160GB to 256TB. DW2.L (new): 16 GB RAM, 2 cores, 160 GB compressed SSD storage. DW2.8XL (new): 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage.
  11. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. With row storage you do unnecessary I/O: to total the Amount column, you have to read every row in full (example rows: ID 123, Age 20, State CA, Amount 500; 345, 25, WA, 250; 678, 40, FL, 125; 957, 37, WA, 375).
  12. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. With column storage, you only read the data you need: totaling Amount touches only that column's blocks (same example table: ID, Age, State, Amount).
  13. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. COPY compresses automatically; you can analyze and override; more performance, less cost. For example:
analyze compression listing;
 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
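A minimal sketch of the override step, assuming a hand-built variant of the TICKIT listing table (the table name and column subset here are illustrative): the ENCODE keyword pins a per-column encoding in the DDL instead of letting COPY choose.
create table listing_manual (
  listid     int          encode delta,      -- encoding chosen explicitly
  sellerid   int          encode delta32k,
  dateid     smallint     encode bytedict,
  totalprice decimal(8,2) encode mostly32,
  listtime   timestamp    encode raw         -- leave uncompressed
);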
  14. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. Zone maps track the minimum and maximum value for each block, so queries can skip over blocks that don't contain relevant data (illustrated on the slide by blocks of sorted values annotated with their min/max ranges).
  15. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage. Local storage is used for performance, maximizing scan rates, with automatic replication and continuous backup; available on both HDD & SSD platforms.
  16. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize.
  17. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize. Load in parallel from Amazon S3, Amazon DynamoDB, or any SSH connection; data is automatically distributed and sorted according to the DDL; load throughput scales linearly with the number of nodes.
  18. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize. Backups to Amazon S3 are automatic, continuous, and incremental, with a configurable system snapshot retention period; user snapshots can be taken on demand; cross-region backups support disaster recovery; streaming restores let you resume querying faster.
  19. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize. Resize while remaining online: a new cluster is provisioned in the background and data is copied in parallel from node to node; you are only charged for the source cluster.
  20. Amazon Redshift parallelizes and distributes everything: query, load, backup/restore, resize. When the resize completes, the SQL endpoint switches over automatically via DNS and the source cluster is decommissioned; the whole operation is simple via the Console or API.
  21. Amazon Redshift is priced to let you analyze all your data: number of nodes x cost per hour; no charge for the leader node; no upfront costs; pay as you go.
DW1 (HDD)          | Price/Hour, DW1.XL single node | Effective Annual Price per TB
On-Demand          | $0.850                         | $3,723
1 Year Reservation | $0.500                         | $2,190
3 Year Reservation | $0.228                         | $999
DW2 (SSD)          | Price/Hour, DW2.L single node  | Effective Annual Price per TB
On-Demand          | $0.250                         | $13,688
1 Year Reservation | $0.161                         | $8,794
3 Year Reservation | $0.100                         | $5,498
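As a check on the table, the effective annual price follows directly from the hourly rate and per-node storage: a DW1.XL at $0.850/hour costs about $0.850 x 8,760 = $7,446 per node-year, and with 2 TB of compressed storage per node that works out to roughly $3,723/TB/year, matching the on-demand row above.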
  22. Amazon Redshift has security built-in. SSL to secure data in transit; load encrypted files from S3; ECDHE for perfect forward secrecy. Encryption to secure data at rest: AES-256, hardware accelerated; all blocks on disks and in Amazon S3 encrypted; on-premises HSM & AWS CloudHSM support. No direct access to compute nodes: clients reach the leader node via JDBC/ODBC in the customer VPC, while compute nodes sit in an internal VPC. Audit logging & AWS CloudTrail integration. Amazon VPC support. SOC 1/2/3, PCI-DSS Level 1, FedRAMP.
  23. Amazon Redshift continuously backs up your data and recovers from failures. Replication within the cluster and backup to Amazon S3 maintain multiple copies of data at all times. Backups to Amazon S3 are continuous, automatic, and incremental, designed for eleven nines of durability. Continuous monitoring and automated recovery from failures of drives and nodes. Snapshots can be restored to any Availability Zone within a region, and backups can easily be enabled to a second region for disaster recovery.
  24. 50+ new features since launch in Feb 2013. Regions: N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney. Certifications: SOC 1/2/3, PCI-DSS Level 1, FedRAMP, others. Security: load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy. Manageability: snapshot sharing, backup/restore/resize progress indicators, cross-region backups. Query: regex, cursors, MD5, SHA1, time zone, workload queue timeout, HLL, concurrency up to 50 slots. Ingestion: S3 manifest, LZOP/LZO, JSON built-ins, UTF-8 4-byte, invalid character substitution, CSV, automatic datetime format detection, epoch, ingest from SSH, JSON, EMR.
  25. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. MicroStrategy: the industry's best BI, web, and mobile applications on-demand in the cloud. May 2014
  26. You've Got the Data. Now What?
  27. Deriving Big Insights from Big Data: trends in business analytics; popular use cases; agile business intelligence; governed self-service; information-driven mobile apps; Redshift certified; customer success.
  28. Popular Use Cases. Customer Analytics: information-driven purchasing (reviews, searches, pricing, comparisons, social networks, recommendations); omni-channel; Customer 360. Sales Enablement: improve sales efficiency and effectiveness; data blending (sales, product, marketing); CRM integration; mobility (BYOD, caching). Retail: information-driven experience (interaction, videos, documents); in-store apps (analytics, customer 360, personal shopper); store of the future; real-time decisions; inventory management.
  29. Self-Service Analytics Revolutionizes Traditional BI: boost user satisfaction while massively increasing productivity. More productive: more content per creator (5-10x more content). More producers: more users can create content (5-10x more content creators). More collaborative: peer-to-peer sharing (5-10x more sharing). Combined: >100x more content creation and consumption.
  30. Governed Self-Service
  31. World-Class Information-Driven Mobile Apps, the future of dashboards: more than graphs; multimedia; transaction-enabled; live data; comprehensive data; intuitive; easy to use; guided workflow for a consistent user experience; personalized for each user.
  32. Business User Access to 1000s of Data Sources, for faster access to your data. MicroStrategy Modeled Data: enterprise-certified single version of the truth. Relational Databases (Redshift certified): Redshift, Oracle, SQL Server, MySQL, Teradata, Netezza, etc. Big Data & Hadoop: EMR, MapR, Cloudera, Hortonworks, etc. Enterprise Applications: SAP, Oracle e-Business, Siebel, PeopleSoft, etc. Cloud-Based Data: Salesforce.com, NetSuite, Facebook, Eloqua, Google Docs, etc. Personal or Departmental: spreadsheets, Access databases, CSV, public data downloads, etc.
  33. Customer Success: Netflix has deployed the MicroStrategy business intelligence software platform on top of Amazon Elastic MapReduce (Amazon EMR) for interactive insights. Netflix analysts use advanced visualizations to explore the performance of its streaming service close to the time events are recorded, directly accessing Hadoop data on an ad hoc basis and without writing code.
  34. MicroStrategy Analytics Enterprise: a comprehensive analytics platform, #1 in mobile analytics, and an AWS partner, delivering business agility with trusted, governed data. Big Data analytics: transform your growing Big Data resources into insight and profit. Self-service analytics: see and understand data in minutes, no IT needed. Enterprise-grade business intelligence: produce and publish trusted analytics to improve your business operations.
  35. Experience MicroStrategy on AWS Today!
  36. Ingestion – Best Practices. Goal: with 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead. Best practices: the preferred method is COPY from S3, which loads data in sorted order through the compute nodes; use a single COPY command and split the data into multiple files; we strongly recommend that you gzip large datasets:
copy time
from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
If you must ingest through SQL, use multi-row inserts and avoid large numbers of singleton insert/update/delete operations:
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
To copy from another table, use CREATE TABLE AS or INSERT INTO ... SELECT, as sketched below.
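A minimal sketch of the two table-to-table options just mentioned, assuming the TICKIT category table from the deck's examples; category_backup is an illustrative name.
-- Deep copy into a brand-new table:
create table category_backup as
select * from category;

-- Append into an existing table:
insert into category_stage
select * from category;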
  37. Ingestion – Best Practices (contd.). Verifying load data files: for US East, S3 provides eventual consistency, so verify that the files are in S3 (Listing Object Keys) and query Redshift after the load. This query returns entries for loading the tables in the TICKIT database:
select query, trim(filename), curtime, status
from stl_load_commits
where filename like '%tickit%'
order by query;
 query | btrim                     | curtime                    | status
-------+---------------------------+----------------------------+--------
 22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 | 1
 22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 | 1
 22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 | 1
 22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 | 1
 22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  | 1
 22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 | 1
 22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 | 1
 22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 | 1
 22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 | 1
 22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 | 1
 22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 | 1
  38. Ingestion – Best Practices (contd.). We do not support an UPSERT statement: use a staging table and perform the upsert by joining the staging table with the target, update then insert (a sketch follows below). We do NOT enforce primary key constraints: if you COPY the same data twice, you will have duplicate rows. Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count:
set wlm_query_slot_count to 3;
Run the ANALYZE command whenever you've made a non-trivial number of changes to your data, to ensure your table statistics are current. The Amazon Redshift system table STL_LOAD_ERRORS can be helpful in troubleshooting data load issues: it records the errors that occurred during specific loads; adjust MAXERROR as needed. Check the character set: only UTF-8 is supported.
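A minimal sketch of the staging-table upsert pattern described above, assuming hypothetical tables target and stage that share a key column id and a payload column value:
begin;

-- Update rows that already exist in the target.
update target
set value = stage.value
from stage
where target.id = stage.id;

-- Insert rows that are new.
insert into target
select s.*
from stage s
left join target t on s.id = t.id
where t.id is null;

commit;
And a quick look at recent load errors via STL_LOAD_ERRORS:
select query, trim(filename), line_number, trim(err_reason)
from stl_load_errors
order by query desc
limit 10;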
  39. Choose a Sort Key. Goal: skip over data blocks to minimize I/O. Best practice: sort on columns used in range or equality predicates (the WHERE clause); if you access recent data frequently, sort on a TIMESTAMP column, as sketched below.
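A minimal sketch, assuming a hypothetical events fact table where recent rows are queried most often; sorting on the timestamp lets zone maps skip old blocks for time-range predicates:
create table events (
  event_id   bigint,
  event_type varchar(32),
  event_time timestamp
)
sortkey (event_time);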
  40. Choose a Distribution Key. Goal: distribute data evenly across nodes and minimize data movement among nodes (co-located joins and co-located aggregates). Best practice: consider using the join key as the distribution key (the JOIN clause); with multiple joins, use the foreign key of the largest dimension as the distribution key; consider using a GROUP BY column as the distribution key. Avoid using keys that serve as equality filters as your distribution key. If tables are de-normalized and there are no aggregates, do not specify a distribution key; Redshift will use round robin. A sketch follows below.
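A minimal sketch of a co-located join, assuming hypothetical sales fact and customers dimension tables joined on customer_id:
create table customers (
  customer_id int,
  name        varchar(64)
)
distkey (customer_id);

create table sales (
  sale_id     bigint,
  customer_id int,
  amount      decimal(12,2),
  sale_time   timestamp
)
distkey (customer_id)   -- same distribution key as the dimension
sortkey (sale_time);
Because both tables distribute on customer_id, a join on that column can run without moving rows between nodes.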
  41. Query Performance – Best Practices. Encode date and time using the TIMESTAMP data type instead of CHAR. Specify constraints: Redshift does not enforce constraints (primary key, foreign key, unique values) but the optimizer uses them, so loading and/or applications need to be aware (a sketch follows below). Specify a redundant predicate on the sort column of each joined table:
SELECT *
FROM tab1, tab2
WHERE tab1.key = tab2.key
  AND tab1.timestamp > '1/1/2013'
  AND tab2.timestamp > '1/1/2013';
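A minimal sketch of declaring informational constraints, with hypothetical d_customer and orders tables; Redshift will not enforce these, but the planner can exploit them, so the load process must guarantee they actually hold:
create table d_customer (
  customer_id int primary key,              -- informational, not enforced
  name        varchar(64)
);

create table orders (
  order_id    bigint primary key,           -- informational, not enforced
  customer_id int references d_customer (customer_id),
  order_time  timestamp
);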
  42. Workload Manager. WLM allows you to manage and adjust query concurrency: increase query concurrency up to 50; define user groups and query groups; segregate short- and long-running queries; help improve the performance of individual queries. Be aware that query workload is distributed to every compute node, so increasing concurrency may not always help due to resource contention (CPU, memory, and I/O); total throughput may increase by letting one query complete first while other queries wait. A sketch of query-group routing follows below.
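A minimal sketch of routing a session's queries to a particular WLM queue, assuming the cluster's parameter group defines a queue for a query group named 'short' (the group name, and the sales table from the earlier sketch, are illustrative):
-- Route subsequent queries in this session to the 'short' queue.
set query_group to 'short';

select count(*) from sales;

-- Return to the default queue.
reset query_group;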
  43. Workload Best Practices. Organizing and keeping your load files in S3 allows for re-runs or scenario testing as you evolve your workflow on the platform; keep them in S3 or Glacier for fiscal/legal reasons. For data updated in the short term, consider a short-term version of the table for staging and a long-term version once the data becomes stable. Round-robin distribution: use it when you don't have a good distribution key (see Part 1 for a query that checks for distribution skew, and the sketch below); trade this off against co-located joins. When loading the target (final) table, use a chronological date/timestamp column as the first sort key: VACUUM is needed less often and runs faster. When the first sort column has low cardinality/resolution (i.e., date instead of timestamp), subsequent sort columns should match common filters and/or grouping columns.
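A minimal sketch of a skew check in the spirit of the one referenced above, comparing per-slice row counts from the STV_TBL_PERM system table; a large gap between min and max suggests a skewed distribution key:
select trim(name) as table_name,
       min(rows)  as min_rows_per_slice,
       max(rows)  as max_rows_per_slice
from stv_tbl_perm
group by name
order by max(rows) - min(rows) desc;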
  44. Workload Best Practices (contd.). Use the UNLOAD command to archive data that is not needed for business reasons; data that needs to exist only for fiscal/legal reasons can be re-loaded as needed (see the sketch below). Consider applying retention policies less often than the regular workflow: run a weekly/monthly process during a less busy time; make space provision for data growth; make sure all queries have date/timestamp range filters (> and <); keep a sliding window of data to minimize block re-writes during VACUUM. Take manual snapshots to save status at specific mileposts (e.g., year-end).
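A minimal sketch of the archival step, reusing the hypothetical sales table and an illustrative bucket; the credentials string follows the COPY example earlier in the deck:
unload ('select * from sales where sale_time < ''2013-01-01''')
to 's3://mybucket/archive/sales_2012_'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip;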
  45. Workload Best Practices (contd.). Weigh the ratio between load and query performance needs: with a low ratio, consider load -> snapshot -> spin up "query" clusters -> tear down; with a high ratio, consider performance above space needs when choosing the number of nodes. Normalization rule of thumb: de-normalize only to avoid large non-collocated joins; for slowly changing dimensions (type II), keep them normalized and match their distkey with the fact table.
  46. Space Management. Redshift has a single pool of space used for tables and temporary segments: loads need 2.5 times the space of the data being loaded if the table has a sort key, and VACUUM may need 2.5 times the size of the table. Monitor the free space via the Performance tab in the console, CloudWatch alarms, or SQL (see the next slide).
  47. Space Management (contd.). Table sizes:
select trim(pgdb.datname) as Database,
       trim(pgn.nspname)  as Schema,
       trim(a.name)       as Table,
       b.mbytes,
       a.rows
from (select db_id, id, name, sum(rows) as rows
      from stv_tbl_perm a
      group by db_id, id, name) as a
join pg_class     as pgc  on pgc.oid  = a.id
join pg_namespace as pgn  on pgn.oid  = pgc.relnamespace
join pg_database  as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
      from stv_blocklist
      group by tbl) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
Free space:
select sum(capacity)/1024               as capacity_gbytes,
       sum(used)/1024                   as used_gbytes,
       (sum(capacity) - sum(used))/1024 as free_gbytes
from stv_partitions
where part_begin = 0;
Redshift allows you to resize your cluster up and down and across node types, online (with read-only access during the resize).
  48. For more best practices, search YouTube for "Amazon Redshift Best Practices". Thank you!
