Amazon RedShift - Ianni Vamvadelis

In this talk, Ianni will talk about Amazon Redshift, a managed petabyte-scale data warehouse, give an overview of integration with Amazon Elastic MapReduce, a managed Hadoop environment, and cover some exciting new developments in the analytics space.

Transcript of "Amazon RedShift - Ianni Vamvadelis"

  1. Amazon Redshift: Intro, Details. Ianni Vamvadelis, Solutions Architect
  2. AWS Database Services
     Amazon DynamoDB: Fast, Predictable, Highly Scalable NoSQL Data Store
     Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
     Amazon ElastiCache: In-Memory Caching Service
     Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service
     (Slide diagram: the database tier within the AWS stack of Compute, Storage, Database, Application Services, Deployment & Administration, and Networking, on the AWS Global Infrastructure.)
  3. AWS Database Services (the same overview repeated)
  4. Design Objectives
     A petabyte-scale data warehouse service that was...
     Amazon Redshift: A Whole Lot Simpler. A Lot Cheaper. A Lot Faster.
  5. Redshift Dramatically Reduces I/O
     • Direct-attached storage
     • Large data block sizes
     • Columnar storage
     • Data compression
     • Zone maps
     (Illustration of row storage vs. column storage for a sample table:)
        Id   Age  State
        123  20   CA
        345  25   WA
        678  40   FL
  6. Redshift Runs on Optimized Hardware
     • Optimized for I/O intensive workloads
     • HS1.8XL available on Amazon EC2
     • Runs in HPC - fast network
     • High disk density
     HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
     HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
     (Diagram: a grid of HS1.XL nodes, each 16GB RAM, 2TB disk, 2 cores, with the caption "Click to grow ... to 1.6PB".)
  7. Redshift Parallelizes and Distributes Everything
     Load. Query. Resize. Backup. Restore.
     (Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to a leader node, which coordinates compute nodes of 128GB RAM, 16TB disk, 16 cores each on a 10 GigE HPC network; ingestion, backup, and restore run against Amazon S3.)
  8. Point and Click Resize (console screenshot)
  9. Resize your cluster while remaining online
     • New target provisioned in the background
     • Only charged for source cluster
     (Diagram: SQL clients/BI tools stay connected to the source cluster while a new target cluster, with its own leader node and compute nodes of 128GB RAM, 48TB disk, 16 cores each, is provisioned.)
  10. Resize your cluster while remaining online
      • Fully automated
        - Data automatically redistributed
      • Read-only mode during resize
      • Parallel node-to-node data copy
      • Automatic DNS-based endpoint cut-over
      • Only charged for one cluster
  11. Amazon Redshift has security built-in
      • SSL to secure data in transit
      • Encryption to secure data at rest
        - AES-256
        - All blocks on disks and in Amazon S3 encrypted
      • No direct access to compute nodes
      • Amazon VPC support
      (Diagram: clients connect over JDBC/ODBC to the leader node inside the customer VPC; compute nodes sit in an internal VPC on a 10 GigE HPC network, with ingestion, backup, and restore to Amazon S3.)
  12. Continuous Backup, Automated Recovery
      • Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
      • Backups to Amazon S3 are continuous, automatic, and incremental
      • Continuous monitoring and automated recovery from failures of drives and nodes
      • Able to restore snapshots to any Availability Zone within a region
  13. (Chart: data volume over time; the gap between data generated and data available for analysis keeps widening, and closing it takes cost + effort.)
      Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares"
  14. Redshift is Priced to Analyze All Your Data
      $0.85 per hour for on-demand (2TB)
      $999 per TB per year (3-yr reservation)
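      (For scale: $0.85/hour comes to about $7,446 per year for a 2TB node, roughly $3,723 per TB per year on-demand, versus $999 per TB per year with the 3-year reservation.)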
  15. Integrates With Existing BI Tools
      (BI tools connect to Amazon Redshift over JDBC/ODBC.)
  16. Scenarios
  17. Reporting Warehouse
      • Accelerated operational reporting
      • Support for short-time use cases
      • Data compression, index redundancy
      (Diagram: OLTP and ERP systems feed an RDBMS, which loads into Redshift for reporting and BI.)
  18. On-Premises Integration
      (Diagram: OLTP and ERP systems feed an on-premises RDBMS; data integration partners* move the data into Redshift for reporting and BI.)
  19. Live Archive for (Structured) Big Data
      • Direct integration with COPY command
      • High velocity data
      • Data ages into Redshift
      • Low cost, high scale option for new apps
      (Diagram: web apps write OLTP data to DynamoDB; aged data moves into Redshift for reporting and BI.)
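      The "direct integration" above is Redshift's COPY from a DynamoDB table. A minimal sketch, assuming a hypothetical DynamoDB table WebAppEvents and a target Redshift table events:

          copy events from 'dynamodb://WebAppEvents'
          credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
          readratio 50;  -- cap the load at 50% of the table's provisioned read throughput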
  20. Cloud ETL for Big Data
      • Maintain online SQL access to historical logs
      • Transformation and enrichment with EMR
      • Longer history ensures better insight
      (Diagram: logs land in S3, are transformed with Elastic MapReduce, and load into Redshift for reporting and BI.)
  21. Ingestion – Best Practices
      § Goal: Leverage all the compute nodes and minimize overhead
      § Best Practices
        § Preferred method: COPY from S3
          § Loads data in sorted order through the compute nodes
          § Single COPY command; split data into multiple files
          § Strongly recommend that you gzip large datasets
        § If you must ingest through SQL
          § Multi-row inserts
          § Avoid large numbers of singleton insert/update/delete operations
        § To copy from another table
          § CREATE TABLE AS or INSERT INTO ... SELECT

          insert into category_stage values
            (default, default, default, default),
            (20, default, 'Country', default),
            (21, 'Concerts', 'Rock', default);

          copy time from 's3://mybucket/data/timerows.gz'
          credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
          gzip delimiter '|';
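      To illustrate the split-files advice: COPY treats the S3 path as a key prefix, so a single command loads all matching parts in parallel across the compute nodes. A sketch with hypothetical object names (venue.txt.1.gz ... venue.txt.N.gz under s3://mybucket/data/):

          copy venue from 's3://mybucket/data/venue.txt'
          credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
          gzip delimiter '|';  -- loads every object whose key starts with data/venue.txt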
  22. Choose a Sort Key
      § Goal
        § Skip over data blocks to minimize IO
      § Best Practice
        § Sort based on range or equality predicate (WHERE clause)
        § If you access recent data frequently, sort based on TIMESTAMP
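      A minimal sketch of the TIMESTAMP advice; the table and column names are hypothetical:

          create table sales (
            saleid    integer not null,
            productid integer not null,
            saledate  timestamp not null,  -- queries filtering on recent dates skip older blocks
            price     decimal(8,2)
          )
          sortkey (saledate);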
  23. Choose a Distribution Key
      § Goal
        § Distribute data evenly across nodes
        § Minimize data movement among nodes: co-located joins and co-located aggregates
      § Best Practice
        § Consider using join key as distribution key (JOIN clause)
        § If multiple joins, use the foreign key of the largest dimension as distribution key
        § Consider using GROUP BY column as distribution key (GROUP BY clause)
      § Avoid
        § Keys used as equality filter as your distribution key
        § If de-normalized tables and no aggregates, do not specify a distribution key; Redshift will use round robin
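      Extending the hypothetical sales table above: distributing on the join key co-locates the join with any table distributed the same way, which mirrors the worked example on the next slide:

          create table sales (
            saleid    integer not null,
            productid integer not null,  -- join key, shared with the category table
            saledate  timestamp not null,
            price     decimal(8,2)
          )
          distkey (productid)
          sortkey (saledate);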
  24. Example
      -- Total Produce sold in Washington in January 2013
      SELECT sum(S.Price * S.Quantity)
      FROM SALES S
      JOIN CATEGORY C ON C.ProductId = S.ProductId
      JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
      WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
        AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

      Dist key (S) = ProductId   Sort key (S) = Date
      Dist key (C) = ProductId
      Dist key (F) = FranchiseId
  25. Workload Manager
      § Allows you to manage and adjust query concurrency
      § WLM allows you to
        § Increase query concurrency up to 15
        § Define user groups and query groups
        § Segregate short and long running queries
        § Help improve performance of individual queries
      § Be aware: query workload is distributed to every compute node
        § Increasing concurrency may not always help due to resource contention (CPU, memory, and I/O)
        § Total throughput may increase by letting one query complete first and allowing other queries to wait
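      A sketch of routing a session's queries to a specific queue by query group; the group name short_queries is hypothetical and must match a queue defined in the cluster's WLM configuration:

          set query_group to 'short_queries';  -- subsequent queries run in the matching queue
          select count(*) from sales;
          reset query_group;                   -- back to the default queue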
  26. Workload Manager
      § Default: 1 queue with a concurrency of 5
      § Define up to 8 queues with a total concurrency of 15
      § Redshift has a super user queue internally
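      A related session-level knob (a sketch; how many slots to claim depends on the queue's configured concurrency): a heavy statement can temporarily take several of its queue's slots, and with them more memory:

          set wlm_query_slot_count to 3;  -- claim 3 slots in the current queue
          vacuum sales;                   -- memory-hungry maintenance gets more room
          set wlm_query_slot_count to 1;  -- release the extra slots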
  27. Query Performance – Best Practices
      § Encode date and time using TIMESTAMP data type instead of CHAR
      § Specify constraints
        § Redshift does not enforce constraints (primary key, foreign key, unique values) but the optimizer uses them
        § Loading and/or applications need to be aware
      § Specify redundant predicate on the sort column

          SELECT * FROM tab1, tab2
          WHERE tab1.key = tab2.key
          AND tab1.timestamp > '1/1/2013'
          AND tab2.timestamp > '1/1/2013';

      § WLM settings
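      A small sketch combining the TIMESTAMP and constraints advice; the names are hypothetical, and Redshift treats the keys as planner hints rather than enforced rules:

          create table events (
            eventid   integer not null primary key,               -- informational, not enforced
            userid    integer not null references users(userid),  -- planner hint only
            eventtime timestamp not null                          -- not char
          );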
  28. Summary
      § Avoid large numbers of singleton DML statements if possible
      § Use COPY for uploading large datasets
      § Choose sort and distribution keys with care
      § Encode date and time with TIMESTAMP data type
      § Experiment with WLM settings
  29. More Information
      Best Practices for Designing Tables
      http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
      Best Practices for Data Loading
      http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
      View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/
  30. Thanks. aws.amazon.com/big-data