Data & Analytics - Session 2 - Introducing Amazon Redshift


Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. This presentation will give an introduction to the service and its pricing before diving into how it delivers fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more.

Steffen Krause, Technical Evangelist, AWS
Padraic Mulligan, Architect and Lead Developer, and Mike McCarthy, CTO, SkillPages

Published in: Technology

  1. Introducing Amazon Redshift
     Steffen Krause, Technical Evangelist
  2. Data warehousing done the AWS way
     • No upfront costs, pay as you go
     • Really fast performance at a really low price
     • Open and flexible with support for popular tools
     • Easy to provision and scale up massively
  3. We set out to build…
     A fast and powerful, petabyte-scale data warehouse that is:
     • Delivered as a managed service
     • A lot faster
     • A lot cheaper
     • A lot simpler
     Amazon Redshift
  4. We’re off to a good start
  5. Amazon Redshift dramatically reduces I/O
     ID  | Age | State
     ----+-----+------
     123 | 20  | CA
     345 | 25  | WA
     678 | 40  | FL
     [Figure: scan direction for row storage vs. column storage]
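To make the I/O argument on slide 5 concrete, here is a minimal Python sketch using the three rows from the slide. The in-memory layouts and the "values read" counters are illustrative assumptions, not Redshift's actual on-disk format. The sketch only shows that an aggregate over one column touches a third of the values when each column is stored contiguously.

```python
# Toy comparison of row vs. column layout (illustrative only; this is not
# Redshift's real storage format). Data is the example table from slide 5.
rows = [
    (123, 20, "CA"),
    (345, 25, "WA"),
    (678, 40, "FL"),
]

# Column layout: one contiguous list per column.
columns = {
    "id": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "state": [r[2] for r in rows],
}


def avg_age_row_store(table):
    """Average age from a row layout: every field of every row is read."""
    values_read = 0
    total = 0
    for row in table:
        values_read += len(row)  # the whole row has to be scanned
        total += row[1]
    return total / len(table), values_read


def avg_age_column_store(cols):
    """Average age from a column layout: only the 'age' column is read."""
    ages = cols["age"]
    return sum(ages) / len(ages), len(ages)


row_avg, row_reads = avg_age_row_store(rows)
col_avg, col_reads = avg_age_column_store(columns)
print(f"row store:    avg={row_avg:.2f}, values read={row_reads}")
print(f"column store: avg={col_avg:.2f}, values read={col_reads}")
```

Both functions return the same average; only the number of values scanned differs.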
  6. Amazon Redshift automatically compresses your data
     • Compression saves space and reduces disk I/O
     • COPY automatically analyzes and compresses your data
       – Samples data; selects best compression encoding
       – Supports: byte dictionary, delta, mostly-n, run-length, text
     • Customers see 4-8x space savings with real data
       – 20x and higher possible based on data set
     • ANALYZE COMPRESSION to see details

     analyze compression listing;

     Table   | Column         | Encoding
     --------+----------------+----------
     listing | listid         | delta
     listing | sellerid       | delta32k
     listing | eventid        | delta32k
     listing | dateid         | bytedict
     listing | numtickets     | bytedict
     listing | priceperticket | delta32k
     listing | totalprice     | mostly32
     listing | listtime       | raw
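As a rough intuition for two of the encodings named on slide 6, the following Python sketch implements toy versions of run-length and delta encoding. The sample columns are hypothetical, and the functions show only the general idea; Redshift's real encodings work on disk blocks and are more elaborate.

```python
# Toy versions of two encodings named on the slide (run-length and delta).
# They illustrate the idea only and are not Redshift's implementation.

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


def run_length_decode(pairs):
    """Expand [value, count] pairs back into the original sequence."""
    return [v for v, count in pairs for _ in range(count)]


def delta_encode(values):
    """Store the first value, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by accumulating the differences."""
    out = []
    for d in deltas:
        out.append(d if not out else out[-1] + d)
    return out


# Hypothetical column contents, chosen to suit each encoding.
states = ["CA", "CA", "CA", "WA", "WA", "FL"]  # sorted, many repeats
list_ids = [1000, 1001, 1002, 1005, 1006]      # slowly increasing ids

print(run_length_encode(states))
print(delta_encode(list_ids))
```

Run-length encoding pays off on sorted, repetitive columns, while delta encoding turns large, slowly changing numbers into small differences that need fewer bytes.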
  7. Amazon Redshift architecture
     • Leader Node
       – SQL endpoint
       – Stores metadata
       – Coordinates query execution
     • Compute Nodes
       – Local, columnar storage
       – Execute queries in parallel
       – Load, backup, restore via Amazon S3
       – Parallel load from Amazon DynamoDB
     • Single node version available
     [Diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
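The division of labour between the leader node and the compute nodes on slide 7 can be sketched as a scatter-gather aggregation. This is a simplified Python analogy under stated assumptions: the per-node data slices are hypothetical, threads stand in for compute nodes, and the real query planner does far more than this.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical slices of one table column, one list of amounts per compute node.
node_slices = [
    [10, 20, 30],
    [5, 15],
    [40, 60, 100],
]


def compute_node_partial(amounts):
    """Each compute node aggregates only its local data."""
    return sum(amounts), len(amounts)


def leader_average(slices):
    """The leader fans the work out, then merges the partial results."""
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        partials = list(pool.map(compute_node_partial, slices))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


print(leader_average(node_slices))
```

The leader never sees the raw rows, only one small (sum, count) pair per node, which is why adding nodes can speed up scans without overloading the coordinator.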
  8. Amazon Redshift runs on optimized hardware
     HS1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed user storage, 2 GB/sec scan rate
     HS1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed customer storage
     • Optimized for I/O intensive workloads
     • High disk density
     • Runs in HPC - fast network
     • HS1.8XL available on Amazon EC2
  9. Amazon Redshift parallelizes and distributes everything
     • Query
     • Load
     • Backup
     • Restore
     • Resize
     [Diagram labels: JDBC/ODBC, 10 GigE (HPC), Ingestion, Backup, Restore]
  10. Amazon Redshift lets you start small and grow big
      Extra Large Node (HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores
      • Single Node (2 TB)
      • Cluster 2-32 Nodes (4 TB – 64 TB)
      Eight Extra Large Node (HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
      • Cluster 2-100 Nodes (32 TB – 1.6 PB)
      Note: Nodes not to scale
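The capacity ranges on slide 10 follow directly from storage per node times node count. The short Python sketch below reproduces them; the per-node figures and cluster limits are taken from the slide, while the helper names are made up for illustration.

```python
import math

# Node specs and multi-node cluster limits as stated on slide 10.
# (The HS1.XL is also available as a single 2 TB node, not modelled here.)
NODE_TYPES = {
    "HS1.XL": {"tb_per_node": 2, "min_nodes": 2, "max_nodes": 32},
    "HS1.8XL": {"tb_per_node": 16, "min_nodes": 2, "max_nodes": 100},
}


def cluster_capacity_tb(node_type, node_count):
    """Compressed storage of a multi-node cluster, in TB."""
    spec = NODE_TYPES[node_type]
    if not spec["min_nodes"] <= node_count <= spec["max_nodes"]:
        raise ValueError(f"{node_type} clusters have "
                         f"{spec['min_nodes']}-{spec['max_nodes']} nodes")
    return spec["tb_per_node"] * node_count


def nodes_needed(node_type, data_tb):
    """Smallest multi-node cluster of this type that holds data_tb."""
    spec = NODE_TYPES[node_type]
    count = max(spec["min_nodes"], math.ceil(data_tb / spec["tb_per_node"]))
    if count > spec["max_nodes"]:
        raise ValueError(f"{data_tb} TB does not fit in a {node_type} cluster")
    return count


print(cluster_capacity_tb("HS1.XL", 32))    # upper end of the HS1.XL range
print(cluster_capacity_tb("HS1.8XL", 100))  # 1,600 TB, i.e. 1.6 PB
print(nodes_needed("HS1.XL", 10))
```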
  11. Amazon Redshift is priced to let you analyze all your data
                         | Price per hour for HS1.XL single node | Effective hourly price per TB | Effective annual price per TB
      -------------------+---------------------------------------+-------------------------------+------------------------------
      On-Demand          | $0.850                                | $0.425                        | $3,723
      1 Year Reservation | $0.500                                | $0.250                        | $2,190
      3 Year Reservation | $0.228                                | $0.114                        | $999
      Simple pricing
      • Number of nodes x cost per hour
      • No charge for leader node
      • No upfront costs
      • Pay as you go
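The "effective" columns on slide 11 can be derived from the hourly node price, assuming the 2 TB per HS1.XL node from slide 10 and 8,760 hours per year. The Python sketch below reproduces the table; the function names are illustrative, and the prices are the ones quoted on the slide, not current list prices.

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
TB_PER_HS1_XL_NODE = 2      # from the hardware slide

# Hourly prices per HS1.XL node, as quoted on slide 11.
HOURLY_PRICE_PER_NODE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_price_per_node):
    """Return (hourly price per TB, annual price per TB) for one node price."""
    hourly_per_tb = hourly_price_per_node / TB_PER_HS1_XL_NODE
    annual_per_tb = hourly_per_tb * HOURS_PER_YEAR
    return hourly_per_tb, annual_per_tb


def cluster_cost_per_hour(node_count, hourly_price_per_node):
    """Slide formula: number of nodes x cost per hour (leader node is free)."""
    return node_count * hourly_price_per_node


for plan, price in HOURLY_PRICE_PER_NODE.items():
    hourly, annual = effective_prices(price)
    print(f"{plan}: ${hourly:.3f} per TB-hour, ${annual:,.0f} per TB-year")
```

For example, the on-demand rate works out to 0.850 / 2 = 0.425 dollars per TB per hour, and 0.425 x 8,760 = 3,723 dollars per TB per year, matching the first row of the table.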
  12. Amazon Redshift is easy to use
      • Provision in minutes
      • Monitor query performance
      • Point and click resize
      • Built in security
      • Automatic backups
  13. Provision a data warehouse in minutes
  14. Monitor query performance
  15. Point and click resize
  16. Resize your cluster while remaining online
      • New target provisioned in the background
      • Only charged for source cluster
  17. Resize your cluster while remaining online
      • Fully automated
        – Data automatically redistributed
      • Read only mode during resize
      • Parallel node-to-node data copy
      • Automatic DNS-based endpoint cutover
      • Only charged for one cluster
  18. Amazon Redshift has security built-in
      • SSL to secure data in transit
      • Encryption to secure data at rest
        – AES-256; hardware accelerated
        – All blocks on disks and in Amazon S3 encrypted
      • No direct access to compute nodes
      • Amazon VPC support
      [Diagram labels: JDBC/ODBC, Customer VPC, Internal VPC, 10 GigE (HPC), Ingestion, Backup, Restore]
  19. Amazon Redshift continuously backs up your data and recovers from failures
      • Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
      • Backups to Amazon S3 are continuous, automatic, and incremental
        – Designed for eleven nines of durability
      • Continuous monitoring and automated recovery from failures of drives and nodes
      • Able to restore snapshots to any Availability Zone within a region
  20. Amazon Redshift integrates with multiple data sources
      • Amazon DynamoDB
      • Amazon Elastic MapReduce
      • Amazon Simple Storage Service (S3)
      • Amazon Elastic Compute Cloud (EC2)
      • AWS Storage Gateway Service
      • Corporate Data Center
      • Amazon Relational Database Service (RDS)
      • Amazon Redshift
      More coming soon…
  21. Amazon Redshift provides multiple data loading options
      • Upload to Amazon S3
      • AWS Import/Export
      • AWS Direct Connect
      • Work with a partner
      [Partner categories: Data Integration, Systems Integrators]
      More coming soon…
  22. Amazon Redshift works with your existing analysis tools
      [Diagram labels: JDBC/ODBC, Amazon Redshift]
      More coming soon…
  23. Customer Use Case
      Mike McCarthy, CTO
  24. One Place to Find Skilled People
  25. Everyone Needs Skilled People
      • At Home
      • At Work
      • In Life
      • Repeatedly
  26. Registered members
      [Chart: 2 million to 15 million registered members; axis labels 2011, 2012, 2013]
  27. • 77 Instances: Reserved/Demand & Spot; add capacity as required; auto scale
      • 3 Availability Zones: US East (Northern VA); planned for multi region
      • 2.5+ Billion Relationships: our social graph models over 2.5 billion social relationships
      • Tech Team of 21: 1 data analyst; total company size 37
      • 10M+ Growth Increments: ready for an additional 10 million users at any point in time
      • 150M+ Emails: >150,000,000 emails sent per month
  28. 21,000,000+ skills added by members
  29. 1,500,000+ new members/month
  30. 1,200,000,000+ social connections imported
  32. We Measure Everything!
  33. Why Measure?
      • Business Insights
      • KPIs
      • Campaign Management
      • Behavioural Analysis
      • Algorithm Improvements
      • Performance Management
      Best user experience
  34. History with Redshift
      • Amazon Customer since 2010
      • Proprietary SQL Data Warehouse 2011
      • Rapid Growth 2012
      • Redshift Trials 2012
      • Redshift Production DW 2013
  35. Data Architecture
      [Diagram labels, in slide order: Data Analyst; Raw Data; Get Data; Join via Facebook; Add a Skill Page; Invite Friends; Web Servers; Amazon S3; User Action Trace Events; EMR; Amazon S3; Aggregated Data; Raw Events; Internal Web; Excel; Tableau; Amazon Redshift]
      Hive scripts process content:
      • Process log files with regular expressions to parse out the info we need.
      • Process cookies into useful searchable data such as Session, UserId, API security token.
      • Filter surplus info like internal Varnish logging.
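The three processing steps on slide 35 (regex parsing, cookie extraction, filtering surplus lines) might look roughly like the Python sketch below. The slides state that this work runs as Hive scripts on EMR and do not show the log format, so the line layout, the `varnish` prefix check, and all function names here are assumptions made purely for illustration.

```python
import re

# Hypothetical access-log layout; the real log format is not given in the slides.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<timestamp>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) "(?P<cookie>[^"]*)"'
)


def parse_cookie(cookie_header):
    """Turn 'k1=v1; k2=v2' into a dict of cookie names to values."""
    pairs = (item.split("=", 1) for item in cookie_header.split("; ") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def parse_line(line):
    """Return a flat event dict, or None for lines that should be dropped."""
    if line.startswith("varnish"):  # hypothetical marker for internal cache logging
        return None
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # line does not fit the expected layout
    event = match.groupdict()
    cookies = parse_cookie(event.pop("cookie"))
    event["session"] = cookies.get("Session")
    event["user_id"] = cookies.get("UserId")
    return event


sample = ('203.0.113.7 [12/Mar/2013:10:15:32 +0000] "GET /skills/add" 200 '
          '"Session=abc123; UserId=42"')
print(parse_line(sample))
print(parse_line("varnish internal health check"))
```

Flattening each line into a fixed set of fields at this stage is what lets the aggregated output be written to Amazon S3 and loaded into column-oriented tables later.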
  36. EMR
      • Heavy Lifting
      • Log Parsing & Data Extraction
      • Cookies
      • Clickstream
      • Directory Generation
      • Network Processing
      • Process 40 GB+ of telemetry data daily
      • Reserved & Spot Instances
  37. Redshift Implementation
      • High Storage Extra Large (XL) DW Node
      • Growing from 2 x dw.hs1.xlarge nodes
      • Reservations
      • ETL Activities
      • Approx. 90 minutes including exports from RDBMS, copying to S3, loading stage tables, loading target tables, vacuuming and analysing tables
      • Schema
      • Compression
      • Starting to use columnar compression
      • Retention
  38. DW Anatomy
      Dimension     | Purpose
      --------------+--------------------------------------------------------------------------------
      Users         | Analyse the composition of the user base
      Events        | Analyse significant actions that reflect user activity & behaviour
      Clickstream   | Analyse user browsing and landing events at a page level
      Email         | Click through and Cohort Analysis
      Notifications | Analyse user to user messaging: which users are mailing which users, and when
      Sessions      | Traffic & Visit Analysis
      Skills        | Analyse Skills by Classification and User Context
      Opportunities | Analyse Opportunities by Classification, User Context and response rate
      Search        | Analyse and quantify the characteristics of each search made on the platform
  39. Performance
  40. Accessing Data
      • Consumers
      • Tableau
      • Excel/PowerPivot
      • Technical Team
      • SQL Workbench
      Driver: JDBC for PostgreSQL 8.xx
  41. Data Visualisation
  42. Redshift - Nice to haves
      • Possibility to load LZO files from S3
      • Additional analytical functions e.g. MEDIAN
      • Hierarchies
      • ETL tool working with S3, many database vendors
  43. Why Redshift works for SkillPages
      • Scale - MPP
      • Performance
      • Columnar
      • Platform Integration
      • S3, DynamoDB
      • Operational Advantages
      • Ease of Access
      • Cost
  44. Thank you!
      Customer Use Case
      Mike McCarthy, CTO
  45. Resources & Questions
      • Steffen Krause | @AWS_Aktuell