Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

AWS re:Invent 2016: Deep Dive on Amazon DynamoDB (DAT304)

2,784 views

Published on

Explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.

Published in: Technology

AWS re:Invent 2016: Deep Dive on Amazon DynamoDB (DAT304)

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. December 1, 2016 DAT304 Deep Dive on Amazon DynamoDB Rick Houlihan, Principal TPM, DBS NoSQL
  2. 2. What to Expect from the Session • Brief history of data processing (Why NoSQL?) • Overview of Amazon DynamoDB • Scaling characteristics, hot keys, and design considerations • NoSQL data modeling • Normalized versus unstructured schema • Common NoSQL design patterns • Time Series, Write Sharding, MVCC, etc. • Amazon DynamoDB in the serverless ecosystem
  3. 3. Timeline of Database Technology DataPressure
  4. 4. Data volume since 2011 DataVolume Historical Current  90 percent of stored data generated in last 2 years  1 terabyte of data in 2011 equals 6.5 petabytes today  Linear correlation between data pressure and technical innovation  No reason these trends will not continue over time
  5. 5. Technology adoption and the hype curve
  6. 6. Why NoSQL? Optimized for storage Optimized for compute Normalized/relational Denormalized/hierarchical Ad hoc queries Instantiated views Scale vertically Scale horizontally Good for OLAP Built for OLTP at scale SQL NoSQL
  7. 7. Amazon DynamoDB Document or key-value Scales to any workloadFully managed NoSQL Access control Event driven programmingFast and consistent
  8. 8. Table Table Items Attributes Partition Key Sort Key Mandatory Key-value access pattern Determines data distribution Optional Model 1:N relationships Enables rich query capabilities All items for key ==, <, >, >=, <= “begins with” “between” “contains” “in” sorted results counts top/bottom N values
  9. 9. 00 55 A954 FFAA00 FF Partition Keys Id = 1 Name = Jim Hash (1) = 7B Id = 2 Name = Andy Dept = Eng Hash (2) = 48 Id = 3 Name = Kim Dept = Ops Hash (3) = CD Key Space Partition key uniquely identifies an item Partition key is used for building an unordered hash index Allows table to be partitioned for scale
  10. 10. Scaling
  11. 11. Scaling Throughput  Provision any amount of throughput to a table Size  Add any number of items to a table - Max item size is 400 KB - LSIs limit the number of range keys due to 10 GB limit Scaling is achieved through partitioning
  12. 12. Throughput Provisioned at the table level  Write capacity units (WCUs) are measured in 1 KB per second  Read capacity units (RCUs) are measured in 4 KB per second - RCUs measure strictly consistent reads - Eventually consistent reads cost 1/2 of consistent reads Read and write throughput limits are independent WCURCU
  13. 13. Partitioning Math In the future, these details might change… Number of Partitions By Capacity (Total RCU / 3000) + (Total WCU / 1000) By Size Total Size / 10 GB Total Partitions CEILING(MAX (Capacity, Size))
  14. 14. Partitioning Example Table size = 8 GB, RCUs = 5000, WCUs = 500 RCUs per partition = 5000/3 = 1666.67 WCUs per partition = 500/3 = 166.67 Data/partition = 10/3 = 3.33 GB RCUs and WCUs are uniformly spread across partitions Number of Partitions By Capacity (5000 / 3000) + (500 / 1000) = 2.17 By Size 8 / 10 = 0.8 Total Partitions CEILING(MAX (2.17, 0.8)) = 3
  15. 15. What causes throttling? If sustained throughput goes beyond provisioned throughput per partition Non-uniform workloads  Hot keys/hot partitions  Very large items Mixing hot data with cold data  Use a table per time period From the example before:  Table created with 5000 RCUs, 500 WCUs  RCUs per partition = 1666.67  WCUs per partition = 166.67  If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests - Solution: Increase provisioned throughput
  16. 16. What bad NoSQL looks like… Partition Time Heat
  17. 17. Getting the most out of DynamoDB throughput “To get the most out of DynamoDB throughput, create tables where the partition key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.” —DynamoDB Developer Guide Space: access is evenly spread over the key-space Time: requests arrive evenly spaced in time
  18. 18. Much better picture…
  19. 19. Data Modeling in NoSQL
  20. 20. It’s all about aggregations… Document management Process controlSocial network Data treesIT monitoring
  21. 21. SQL vs. NoSQL Design Pattern
  22. 22. Hierarchical data Tiered relational data structures
  23. 23. Hierarchical data structures as items Use composite sort key to define a hierarchy Highly selective result sets with sort queries Index anything, scales to any size Primary Key Attributes ProductID type Items 1 bookID title author genre publisher datePublished ISBN Some Book John Smith Science Fiction Ballantine Oct-70 0-345-02046-4 2 albumID title artist genre label studio released producer Some Album Some Band Progressive Rock Harvest Abbey Road 3/1/73 Somebody 2 albumID:trackID title length music vocals Track 1 1:30 Mason Instrumental 2 albumID:trackID title length music vocals Track 2 2:43 Mason Mason 2 albumID:trackID title length music vocals Track 3 3:30 Smith Johnson 3 movieID title genre writer producer Some Movie Scifi Comedy Joe Smith 20th Century Fox 3 movieID:actorID name character image Some Actor Joe img2.jpg 3 movieID:actorID name character image Some Actress Rita img3.jpg 3 movieID:actorID name character image Some Actor Frito img1.jpg
  24. 24. … or as documents (JSON) JSON data types (M, L, BOOL, NULL) Document SDKs available Indexing only by using DynamoDB Streams or AWS Lambda 400 KB maximum item size (limits hierarchical data structure) Primary Key Attributes ProductID Items 1 id title author genre publisher datePublished ISBN bookID Some Book Some Guy Science Fiction Ballantine Oct-70 0-345-02046-4 2 id title artist genre Attributes albumID Some Album Some Band Progressive Rock { label:"Harvest", studio: "Abbey Road", published: "3/1/73", producer: "Pink Floyd", tracks: [{title: "Speak to Me", length: "1:30", music: "Mason", vocals: "Instrumental"},{title: ”Breathe", length: ”2:43", music: ”Waters, Gilmour, Wright", vocals: ”Gilmour"},{title: ”On the Run", length: “3:30", music: ”Gilmour, Waters", vocals: "Instrumental"}]} 3 id title genre writer Attributes movieID Some Movie Scifi Comedy Joe Smith { producer: "20th Century Fox", actors: [{ name: "Luke Wilson", dob: "9/21/71", character: "Joe Bowers", image: "img2.jpg"},{ name: "Maya Rudolph", dob: "7/27/72", character: "Rita", image: "img1.jpg"},{ name: "Dax Shepard", dob: "1/2/75", character: "Frito Pendejo", image: "img3.jpg"}]
  25. 25. Design patterns and best practices
  26. 26. DynamoDB Streams and AWS Lambda
  27. 27. Triggers Lambda function Notify change Derivative tables Amazon CloudSearch Amazon ElastiCache
  28. 28. Real-time voting Write-heavy items
  29. 29. Partition 1 1000 WCUs Partition K 1000 WCUs Partition M 1000 WCUs Partition N 1000 WCUs Votes Table Candidate A Candidate B Scaling bottlenecks Voters Provision 200,000 WCUs
  30. 30. Write sharding Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_7 Candidate B_8 Candidate A_6 Candidate A_8 Candidate A_5 Voter Votes Table
  31. 31. Write sharding Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_7 Candidate B_8 UpdateItem: “CandidateA_” + rand(0, 10) ADD 1 to Votes Candidate A_6 Candidate A_8 Candidate A_5 Voter Votes Table
  32. 32. Votes Table Shard aggregation Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_5 Candidate A_6 Candidate A_8 Candidate A_7 Candidate B_8 Periodic process Candidate A Total: 2.5M 1. Sum 2. Store Voter
  33. 33. Trade off read cost for write scalability Consider throughput per partition key Shard write-heavy partition keys Your write workload is not horizontally scalable
  34. 34. Event logging Storing time series data
  35. 35. Time Series Tables (Static TTL Timestamps) Events_table_2015_April Event_id (Partition) Timestamp (Sort) Attribute1 …. Attribute N Events_table_2015_March Event_id (Partition) Timestamp (Sort) Attribute1 …. Attribute N Events_table_2015_Feburary Event_id (Partition) Timestamp (Sort) Attribute1 …. Attribute N Events_table_2015_January Event_id (Partition) Timestamp (Sort) Attribute1 …. Attribute N RCUs = 1000 WCUs = 1 RCUs = 10000 WCUs = 10000 RCUs = 100 WCUs = 1 RCUs = 10 WCUs = 1 Current table Older tables HotdataColddata Don’t mix hot and cold data; archive cold data to Amazon S3
  36. 36. Use a table per time period Precreate daily, weekly, monthly tables Provision required throughput for current table Writes go to the current table Turn off (or reduce) throughput for older tables Dealing with static time series data
  37. 37. Time Series Tables (Dynamic TTL Timestamps) Active Tickets_Table Event_id (Partition) Timestamp (Sort) GSIKey Rand(0-N) … Attribute N Expired Tickets GSI GSIKey (Partition) Timestamp (Sort) Archive Table Event_id (Partition) Timestamp (Sort) Attribute1 …. Attribute N RCUs = 10000 WCUs = 10000 RCUs = 100 WCUs = 1 Current table HotdataColddata Scatter Query GSI for Expired Tickets and use Streams to archive
  38. 38. Use Lambda to purge expired items Use a write sharded GSI to selectively query expired items Create a Lambda “stored procedure” to delete items Migrate data between tables with Streams/Lambda trigger Dealing with dynamic time series data
  39. 39. Product catalog Popular items (read)
  40. 40. Partition 1 2000 RCUs Partition K 2000 RCUs Partition M 2000 RCUs Partition 50 2000 RCU Scaling bottlenecks Product A Product B Shoppers ProductCatalog Table SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT"
  41. 41. Partition 1 Partition 2 ProductCatalog Table User DynamoDB User Cache popular items SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT"
  42. 42. Multiversion Concurrency Transactions in NoSQL
  43. 43. How OLTP Apps Use Data  Mostly hierarchical structures  Entity-driven workflows  Data spread across tables  Requires complex queries  Primary driver for ACID
  44. 44. Reading item partitions Get all items for a product assembly SELECT * FROM Assemblies WHERE ID=1 Order Item Picker ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 (Many more assemblies) Assemblies table Partition key Sort key
  45. 45. ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 4 UAC 53G56 5 UAC 64B17 6 UAC 48J19 Updates are problematic (Many more assemblies) Assemblies table Old items SELECT * FROM Assemblies WHERE ID=1 Order Item Picker New items
  46. 46. ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 4 UAC 53G56 5 UAC 64B17 6 UAC 48J19 Creating partition “locks” (Many more assemblies) Assemblies table • Use metadata item for versioning SET #ItemList.i = list append(:newList, #ItemList.i ), #Lock = 1 • Obtain locks with conditional writes --condition-expression “Lock = 0” • Remove old items or add version info to Sort Key Query composite keys for specific versions • Readers can determine how to handle “fat” reads • Add additional metadata to manage transactional workflow as needed Current State, Timeout, etc. ID PartID ItemList Lock 1 0 {i:[[1,2,3],[4,5,6]]} 0
  47. 47. Use item partitions to manage transactional workflows Manage versioning across items with metadata Tag attributes to maintain multiple versions Code your app to recognize when updates are in progress Implement app layer error handling and recovery logic Transactional writes across items is required
  48. 48. Messaging app Large items Filters vs. indexes M:N Modeling—inbox and outbox
  49. 49. Messages table Messages app David SELECT * FROM Messages WHERE Recipient='David' LIMIT 50 ORDER BY Date DESC Inbox SELECT * FROM Messages WHERE Sender ='David' LIMIT 50 ORDER BY Date DESC Outbox
  50. 50. Recipient Date Sender Message David 2014-10-02 Bob … … 48 more messages for David … David 2014-10-03 Alice … Alice 2014-09-28 Bob … Alice 2014-10-01 Carol … Large and small attributes mixed (Many more messages) David Messages table 50 items × 256 KB each Partition key Sort key Large message bodies Attachments SELECT * FROM Messages WHERE Recipient='David' LIMIT 50 ORDER BY Date DESC Inbox
  51. 51. Computing inbox query cost Items evaluated by query Average item size Conversion ratio Eventually consistent reads 50 * 256KB * (1 RCU / 4 KB) * (1 / 2) = 1600 RCU
  52. 52. Recipient Date Sender Subject MsgId David 2014-10-02 Bob Hi!… afed David 2014-10-03 Alice RE: The… 3kf8 Alice 2014-09-28 Bob FW: Ok… 9d2b Alice 2014-10-01 Carol Hi!... ct7r Separate the bulk data Inbox-GSI Messages table MsgId Body 9d2b … 3kf8 … ct7r … afed … David 1. Query inbox-GSI: 1 RCU 2. BatchGetItem messages: 1600 RCU (50 separate items at 256 KB) (50 sequential items at 128 bytes) Uniformly distributes large item reads
  53. 53. Messaging app Messages table David Inbox global secondary index Inbox Outbox global secondary index Outbox
  54. 54. Reduce one-to-many item sizes Configure secondary index projections Use GSIs to model M:N relationship between sender and recipient Distribute large items Querying many large items at once InboxMessagesOutbox
  55. 55. Multiplayer online gaming Query filters vs. composite key indexes
  56. 56. Secondary index Opponent Date GameId Status Host Alice 2014-10-02 d9bl3 DONE David Carol 2014-10-08 o2pnb IN_PROGRESS Bob Bob 2014-09-30 72f49 PENDING Alice Bob 2014-10-03 b932s PENDING Carol Bob 2014-10-03 ef9ca IN_PROGRESS David BobPartition key Sort key Multivalue sorts and filters
  57. 57. Secondary Index Approach 1: Query filter Bob Opponent Date GameId Status Host Alice 2014-10-02 d9bl3 DONE David Carol 2014-10-08 o2pnb IN_PROGRESS Bob Bob 2014-09-30 72f49 PENDING Alice Bob 2014-10-03 b932s PENDING Carol Bob 2014-10-03 ef9ca IN_PROGRESS David SELECT * FROM Game WHERE Opponent='Bob' ORDER BY Date DESC FILTER ON Status='PENDING' (filtered out)
  58. 58. Needle in a haystack Bob
  59. 59. Send back less data “on the wire” Simplify application code Simple SQL-like expressions • AND, OR, NOT, () Use query filter Your index isn’t entirely selective
  60. 60. Approach 2: Composite key StatusDate DONE_2014-10-02 IN_PROGRESS_2014-10-08 IN_PROGRESS_2014-10-03 PENDING_2014-09-30 PENDING_2014-10-03 Status DONE IN_PROGRESS IN_PROGRESS PENDING PENDING Date 2014-10-02 2014-10-08 2014-10-03 2014-10-03 2014-09-30 + =
  61. 61. Secondary Index Approach 2: Composite key Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Partition key Sort key
  62. 62. Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Secondary index Approach 2: Composite key Bob SELECT * FROM Game WHERE Opponent='Bob' AND StatusDate BEGINS_WITH 'PENDING'
  63. 63. Needle in a sorted haystack Bob
  64. 64. Sparse indexes Id (Partition) User Game Score Date Award 1 Bob G1 1300 2012-12-23 2 Bob G1 1450 2012-12-23 3 Jay G1 1600 2012-12-24 4 Mary G1 2000 2012-10-24 Champ 5 Ryan G2 123 2012-03-10 6 Jones G2 345 2012-03-20 Game-scores-table Award (Partition) Id User Score Champ 4 Mary 2000 Award-GSI Scan sparse GSIs
  65. 65. Concatenate attributes to form useful secondary index keys Take advantage of sparse indexes Replace filter with indexes You want to optimize a query as much as possible Status + Date
  66. 66. Reference architecture Amazon Kinesis consumer Amazon EMR
  67. 67. Elastic Serverless Applications
  68. 68. Thank you!
  69. 69. Remember to complete your evaluations!

×