
Redshift Introduction

Boston Data Mining Meetup introduction slides from Big Data Infrastructure workshop - A hands-on introduction

Published in: Software

  1. Amazon Redshift, Saturday, December 6, 2014
  2. Agenda
      • 08:30 AM Breakfast
      • 09:00 AM Introduction and Strengths of Technologies
      • 10:00 AM Break + set up query tool
      • 10:20 AM Hadoop hands-on
      • 10:55 AM Break
      • 11:10 AM Redshift hands-on
      • 11:40 AM Operationalizing your code
      • 12:00 PM Adjourn
  3. Session Goals
      • Understand:
          • Why an analytic database?
          • What is Amazon Redshift?
      • Do:
          • ‘Fire up’ a Redshift database
          • Load data
          • Run a few queries
          • Shut it down
      • Have fun!
  4. Why an Analytic Database?
      • Why use one?
          • It is a database optimized for read-only queries.
          • It’s fast.
          • It can handle a lot of data.
      • Why not use one?
          • Poor transaction processing (aka OLTP)
          • Rollback, multi-phase commits, etc.
  5. Under the Hood. Analytic databases typically have features like:
      • Compression
      • Column (as opposed to row) storage
      • Parallel queries across clusters of machines
      • Support for partitioning
      • Other cool stuff to make your queries fast
  6. Column vs. Row Storage
  7. Parallel Queries
  8. Compression
  9. Amazon Redshift Is an Example of an Analytic Database
  10. Amazon Redshift Uses Standard SQL to Query the Database
  11. Let’s Get Started! The basics:
      • You will need an AWS account
          • AWS Secret Key
          • AWS Access Key
      • Install SQL Workbench
          • http://www.sql-workbench.net/manual/install.html
      • Install PostgreSQL JDBC drivers:
          • http://jdbc.postgresql.org/
  12. Let’s Get Started!: https://aws.amazon.com/ Click Here
  13. Redshift: https://console.aws.amazon.com/redshift/. Click Here
  14. Launch: http://docs.aws.amazon.com/redshift/latest/gsg/rs-gsg-launch-sample-cluster.html Fill these out
  15. Single Node: https://console.aws.amazon.com/redshift/home?region=us-east-1#launch-cluster: Single Node
  16. Security: https://console.aws.amazon.com/redshift/home?region=us-east-1#launch-cluster: East, not in VPC, default, no alarms (below)
  17. Review: https://console.aws.amazon.com/redshift/home?region=us-east-1#launch-cluster: Review
  18. Launch!: Click
  19. Launch!: Click
  20. Wait: Wait, then click
  21. When Active: You’ll need these details
  22. Connect with SQL Workbench: Select Connect Window
  23. Connect with SQL Workbench: Fill this out
  24. Get the JDBC URL: Copy this
  25. Connect with SQL Workbench: Paste and fill this out
  26. Success!
  27. New SQL Tab: Add Tab
  28. New SQL Tab: Add Tab
  29. Make Tables: Create Some Tables
        CREATE TABLE rankings (
            pageURL     VARCHAR(300),
            pageRank    INT,
            avgDuration INT
        );

        CREATE TABLE uservisits (
            sourceIP       VARCHAR(116),
            destinationURL VARCHAR(100),
            visitDate      DATE,
            adRevenue      FLOAT,
            UserAgent      VARCHAR(256),
            cCode          CHAR(3),
            lCode          CHAR(6),
            searchWord     VARCHAR(32),
            duration       INT
        );
  30. Load Data: Load Data from S3
        COPY uservisits
        FROM 's3://big-data-benchmark/pavlo/text/tiny/uservisits/'
        CREDENTIALS 'aws_access_key_id=<your key>;aws_secret_access_key=<your key>'
        DELIMITER ',';

        COPY rankings
        FROM 's3://big-data-benchmark/pavlo/text/tiny/rankings/'
        CREDENTIALS 'aws_access_key_id=<your key>;aws_secret_access_key=<your key>'
        DELIMITER ',';
  31. Load Bigger Data: Load Data from S3
        's3://big-data-benchmark/pavlo/text/tiny/uservisits/'
        -- options for the size directory ("tiny" above): "tiny", "1node", "5nodes", "10nodes"
  32. Simple Queries: Query
        SELECT * FROM uservisits LIMIT 100;
        SELECT COUNT(*) FROM uservisits;
        SELECT * FROM rankings LIMIT 100;
        SELECT COUNT(*) FROM rankings;
  33. Complex Queries: Query
        SELECT pageURL, pageRank
        FROM rankings
        WHERE pageRank > 10;

        SELECT sourceIP,
               SPLIT_PART(sourceIP, '.', 1) AS fn,
               SPLIT_PART(sourceIP, '.', 2) AS sn
        FROM uservisits
        LIMIT 100;

        SELECT sourceIP,
               SUM(adRevenue) AS totalRevenue,
               AVG(pageRank) AS pageRank
        FROM rankings R
        JOIN (SELECT sourceIP, destinationURL, adRevenue
              FROM uservisits uv) NUV
          ON (R.pageURL = NUV.destinationURL)
        GROUP BY sourceIP
        ORDER BY totalRevenue DESC
        LIMIT 100;
  34. Shut it down! Click
  35. Shut it down! Click
  36. Shut it down! No snapshot
  37. Shut it down!
  38. Thanks … happy querying! See also:
      • http://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
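
The load step on slides 30 and 31 can be sketched in Python as a small helper that assembles the COPY statement for either table at any of the listed dataset sizes. This is a minimal sketch and not part of the original deck: `build_copy_statement` is a hypothetical name, and it assumes the larger datasets sit under the same S3 prefix with only the size directory changed, as slide 31 suggests. The helper only builds the SQL text; it does not connect to Redshift.

```python
# Dataset sizes listed on slide 31.
DATASET_SIZES = ("tiny", "1node", "5nodes", "10nodes")
# Tables created on slide 29.
TABLES = ("rankings", "uservisits")


def build_copy_statement(table, size="tiny",
                         access_key="<your key>", secret_key="<your key>"):
    """Return the text of a Redshift COPY statement for one benchmark table.

    Hypothetical helper for illustration; it is not part of any AWS library.
    """
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    if size not in DATASET_SIZES:
        raise ValueError(f"unknown dataset size: {size!r}")

    # Assumption: every size uses the path layout shown for "tiny" in the slides.
    s3_path = f"s3://big-data-benchmark/pavlo/text/{size}/{table}/"
    # No spaces around "=" inside the credentials string.
    credentials = (f"aws_access_key_id={access_key};"
                   f"aws_secret_access_key={secret_key}")
    return (f"COPY {table}\n"
            f"FROM '{s3_path}'\n"
            f"CREDENTIALS '{credentials}'\n"
            f"DELIMITER ',';")


if __name__ == "__main__":
    # Prints a statement with the key placeholders left in place.
    print(build_copy_statement("uservisits", size="1node"))
```

The printed statement can be pasted into a SQL Workbench tab in place of the hand-typed COPY. Fill in the keys at run time and keep real keys out of saved scripts.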
