Amazon RedShift - Ianni Vamvadelis

In this talk, Ian will talk about Amazon Redshift, a managed petabyte-scale data warehouse, give an overview of integration with Amazon Elastic MapReduce, a managed Hadoop environment, and cover some exciting new developments in the analytics space.

    Presentation Transcript

    • Amazon Redshift: Intro and Details. Ianni Vamvadelis, Solutions Architect
    • AWS Database Services
      § Amazon DynamoDB: fast, predictable, highly scalable NoSQL data store
      § Amazon RDS: managed relational database service for MySQL, Oracle and SQL Server
      § Amazon ElastiCache: in-memory caching service
      § Amazon Redshift: fast, powerful, fully managed, petabyte-scale data warehouse service
      (Diagram: the AWS stack of Compute, Storage, Database, Networking, Application Services, and Deployment & Administration on the AWS Global Infrastructure; "Scalable High Performance Application Storage in the Cloud".)
    • Design Objectives: a petabyte-scale data warehouse service that was a whole lot simpler, a lot cheaper, and a lot faster.
    • Redshift Dramatically Reduces I/O
      § Direct-attached storage
      § Large data block sizes
      § Columnar storage
      § Data compression (a DDL sketch follows this slide)
      § Zone maps
      (Diagram: row storage vs. column storage for a table with columns Id, Age, State and rows 123/20/CA, 345/25/WA, 678/40/FL.)
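      To make the columnar-storage and compression bullets concrete, here is a minimal sketch, not from the talk, of declaring per-column compression encodings in Redshift DDL (the table and the encoding choices are hypothetical):

        -- Hypothetical table mirroring the slide's Id/Age/State example.
        -- Each column is stored and compressed independently.
        create table users (
          id    integer encode delta,    -- successive ids differ by small deltas
          age   integer encode bytedict, -- few distinct values compress well
          state char(2) encode bytedict  -- low-cardinality column
        );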
    • Redshift Runs on Optimized Hardware
      § Optimized for I/O-intensive workloads
      § HS1.8XL available on Amazon EC2
      § Runs in HPC - fast network
      § High disk density
      HS1.XL: 16GB RAM, 2 cores, 3 spindles, 2TB storage
      HS1.8XL: 128GB RAM, 16 cores, 24 spindles, 16TB storage, 2GB/sec scan rate
      (Diagram: a cluster of 16GB RAM / 2TB disk / 2 core nodes grows node by node to 1.6PB.)
    • Redshift Parallelizes and Distributes Everything: load, query, resize, backup, restore.
      (Diagram: SQL clients/BI tools connect via JDBC/ODBC to a leader node fronting compute nodes of 128GB RAM, 16TB disk, 16 cores each on a 10 GigE HPC network; ingestion, backup, and restore run against Amazon S3.)
    • Point and Click Resize
    • Resize your cluster while remaining online: the new target cluster is provisioned in the background, and you are only charged for the source cluster during the resize.
      (Diagram: source and target clusters, each with a leader node and compute nodes of 128GB RAM, 48TB disk, 16 cores, serving SQL clients/BI tools.)
    • Resize your cluster while remaining online
      § Fully automated - data automatically redistributed
      § Read-only mode during resize
      § Parallel node-to-node data copy
      § Automatic DNS-based endpoint cut-over
      § Only charged for one cluster
    • Amazon Redshift has security built in
      § SSL to secure data in transit
      § Encryption to secure data at rest - AES-256; all blocks on disks and in Amazon S3 encrypted
      § No direct access to compute nodes
      § Amazon VPC support
      (Diagram: clients connect via JDBC/ODBC to the leader node in the customer VPC; compute nodes sit in an internal VPC, with ingestion, backup, and restore to Amazon S3.)
    • Continuous Backup, Automated Recovery
      § Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
      § Backups to Amazon S3 are continuous, automatic, and incremental
      § Continuous monitoring and automated recovery from failures of drives and nodes
      § Able to restore snapshots to any Availability Zone within a region
    • (Chart: data volume over time - the gap between data generated and data available for analysis keeps widening, driven by cost and effort. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares".)
    • Redshift is Priced to Analyze All Your Data: $0.85 per hour for on-demand (2TB); $999 per TB per year (3-yr reservation).
    • Integrates with existing BI tools over JDBC/ODBC.
    • Scenarios
    • Reporting Warehouse
      § Accelerated operational reporting
      § Support for short-time use cases
      § Data compression, index redundancy
      (Diagram: OLTP/ERP systems feed an RDBMS, which feeds Redshift for reporting and BI.)
    • On-Premises Integration via Data Integration Partners*
      (Diagram: on-premises OLTP/ERP systems and an RDBMS connect through partner tools to Redshift for reporting and BI.)
    • Live Archive for (Structured) Big Data
      § Direct integration with the COPY command (sketched after this slide)
      § High-velocity data
      § Data ages into Redshift
      § Low-cost, high-scale option for new apps
      (Diagram: web apps write OLTP data to DynamoDB, which ages into Redshift for reporting and BI.)
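      As a hedged illustration of the "direct integration with the COPY command" bullet, loading from a DynamoDB table looks roughly like this (the table names are hypothetical; READRATIO caps how much of the table's provisioned read throughput the load may consume):

        -- Age data out of a hypothetical DynamoDB table into Redshift.
        copy web_events from 'dynamodb://WebEvents'
        credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
        readratio 50; -- use at most 50% of the table's provisioned read capacity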
    • Cloud ETL for Big Data
      § Maintain online SQL access to historical logs
      § Transformation and enrichment with EMR
      § Longer history ensures better insight
      (Diagram: S3 feeds Elastic MapReduce, which feeds Redshift for reporting and BI.)
    • Ingestion - Best Practices
      § Goal: leverage all the compute nodes and minimize overhead
      § Best practices
        § Preferred method: COPY from S3 - loads data in sorted order through the compute nodes
        § Use a single COPY command and split the data into multiple files (see the sketch after this slide)
        § Strongly recommend that you gzip large datasets
      § If you must ingest through SQL
        § Use multi-row inserts
        § Avoid large numbers of singleton insert/update/delete operations
      § To copy from another table, use CREATE TABLE AS or INSERT INTO ... SELECT

        insert into category_stage values
          (default, default, default, default),
          (20, default, 'Country', default),
          (21, 'Concerts', 'Rock', default);

        copy time from 's3://mybucket/data/timerows.gz'
        credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
        gzip delimiter '|';
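      To illustrate the split-into-multiple-files advice: COPY treats the S3 path as a prefix, so a single command loads every matching file part and the compute nodes pull them in parallel. A minimal sketch, with hypothetical bucket and file names:

        -- Assumes the dataset was split into s3://mybucket/data/venue.txt.1.gz,
        -- venue.txt.2.gz, ...; the prefix matches all of the parts, and the
        -- node slices load them in parallel.
        copy venue from 's3://mybucket/data/venue.txt'
        credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
        gzip delimiter '|';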
    • Choose a Sort Key
      § Goal: skip over data blocks to minimize I/O
      § Best practice
        § Sort based on range or equality predicates (WHERE clause)
        § If you access recent data frequently, sort based on TIMESTAMP (sketched after this slide)
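      A minimal sketch of sorting on a timestamp, with a hypothetical table:

        -- Rows are stored in event_ts order, so queries that filter on a
        -- recent event_ts range can skip most blocks via the zone maps.
        create table events (
          event_id bigint,
          event_ts timestamp,
          payload  varchar(256)
        )
        sortkey (event_ts);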
    • Choose a Distribution Key
      § Goal
        § Distribute data evenly across nodes
        § Minimize data movement among nodes: co-located joins and co-located aggregates
      § Best practice
        § Consider using the join key as the distribution key (JOIN clause; sketched after this slide)
        § If there are multiple joins, use the foreign key of the largest dimension as the distribution key
        § Consider using the GROUP BY column as the distribution key (GROUP BY clause)
      § Avoid using a key that appears only as an equality filter as your distribution key
      § If tables are de-normalized and there are no aggregates, do not specify a distribution key - Redshift will use round robin
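      A hedged sketch of the join-key advice, anticipating the worked example on the next slide (the schemas are hypothetical):

        -- SALES and CATEGORY share productid as their distkey, so the join
        -- is co-located: matching rows land on the same node, no shuffle.
        create table category (
          productid  integer distkey,
          categoryid varchar(32)
        );
        create table sales (
          productid   integer distkey,
          franchiseid integer,
          price       decimal(8,2),
          quantity    integer,
          date        date
        )
        sortkey (date);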
    • Example: total produce sold in Washington in January 2013
      Dist key (S) = ProductId; Dist key (C) = ProductId; Dist key (F) = FranchiseId; Sort key (S) = Date

        -- Total produce sold in Washington in January 2013
        select sum(s.price * s.quantity)
        from sales s
        join category c on c.productid = s.productid
        join franchise f on f.franchiseid = s.franchiseid
        where c.categoryid = 'Produce'
          and f.state = 'WA'
          and s.date between '1/1/2013' and '1/31/2013';
    • Workload Manager
      § Allows you to manage and adjust query concurrency
      § WLM allows you to
        § Increase query concurrency up to 15
        § Define user groups and query groups
        § Segregate short- and long-running queries
        § Help improve performance of individual queries
      § Be aware: query workload is distributed to every compute node
        § Increasing concurrency may not always help due to resource contention: CPU, memory, and I/O
        § Total throughput may increase by letting one query complete first and allowing other queries to wait
    • Workload Manager
      § Default: 1 queue with a concurrency of 5
      § Define up to 8 queues with a total concurrency of 15
      § Redshift has a superuser queue internally
      (Query-group routing is sketched after this slide.)
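      As a hedged example of the query-group mechanism from the previous slide, a session can label its queries so WLM routes them to the matching queue (the group name is hypothetical and must match a query group configured on one of your queues):

        -- Route this session's queries to the queue whose query group
        -- matches 'short_queries', then revert to the default routing.
        set query_group to 'short_queries';
        select count(*) from events;
        reset query_group;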
    • Query Performance - Best Practices
      § Encode date and time using the TIMESTAMP data type instead of CHAR
      § Specify constraints (sketched after this slide)
        § Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
        § Loading and/or applications need to be aware
      § Specify a redundant predicate on the sort column:

        select * from tab1, tab2
        where tab1.key = tab2.key
        and tab1.timestamp > '1/1/2013'
        and tab2.timestamp > '1/1/2013';

      § WLM settings
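      A minimal sketch of the constraints point, with a hypothetical schema: Redshift accepts the declaration for the planner's benefit but will not reject violating rows, so the load process has to keep it true.

        -- Informational only: duplicate customer_id values are not rejected,
        -- but the optimizer uses the declaration when planning.
        create table customers (
          customer_id integer primary key,
          name        varchar(64)
        );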
    • Summary
      § Avoid large numbers of singleton DML statements if possible
      § Use COPY for uploading large datasets
      § Choose sort and distribution keys with care
      § Encode date and time with the TIMESTAMP data type
      § Experiment with WLM settings
    • More Information
      Best Practices for Designing Tables: http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
      Best Practices for Data Loading: http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
      View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/
    • Thanks. aws.amazon.com/big-data