Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Deep Dive on Amazon DynamoDB

620 views

Published on

by Rick Houlihan, Sr. Practice Manager, AWS

  • Hey guys! Who wants to chat with me? More photos with me here 👉 http://www.bit.ly/katekoxx
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Deep Dive on Amazon DynamoDB

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. June 2017 Deep Dive on DynamoDB Rick Houlihan, Senior Practice Manager – NoSQL
  2. 2. What to Expect from the Session • Brief History of Data Processing (Why NoSQL?) • Overview of DynamoDB • Scaling characteristics, hot keys, and design considerations • NoSQL Data Modeling • Normalized versus Unstructured schema • Common NoSQL Design Patterns • Time Series, Write Sharding, MVCC, etc. • DynamoDB in the Serverless Ecosystem
  3. 3. Timeline of Database Technology DataPressure
  4. 4. Data volume since 2011 DataVolume Historical Current  90 percent of stored data generated in last 2 years  1 terabyte of data in 2011 equals 6.5 petabytes today  Linear correlation between data pressure and technical innovation  No reason these trends will not continue over time
  5. 5. Technology adoption and the hype curve
  6. 6. Why NoSQL? Optimized for storage Optimized for compute Normalized/relational Denormalized/hierarchical Ad hoc queries Instantiated views Scale vertically Scale horizontally Good for OLAP Built for OLTP at scale SQL NoSQL
  7. 7. Amazon DynamoDB Document or Key-Value Scales to Any WorkloadFully Managed NoSQL Access Control Event Driven ProgrammingFast and Consistent
  8. 8. Table Table Items Attributes Partition Key Sort Key Mandatory Key-value access pattern Determines data distribution Optional Model 1:N relationships Enables rich query capabilities All items for key ==, <, >, >=, <= “begins with” “between” “contains” “in” sorted results counts top/bottom N values
  9. 9. 00 55 A954 FFAA00 FF Partition Keys Id = 1 Name = Jim Hash (1) = 7B Id = 2 Name = Andy Dept = Eng Hash (2) = 48 Id = 3 Name = Kim Dept = Ops Hash (3) = CD Key Space Partition Key uniquely identifies an item Partition Key is used for building an unordered hash index Allows table to be partitioned for scale
  10. 10. Partition 3 Partition:Sort Key uses two attributes together to uniquely identify an Item Within unordered hash index, data is arranged by the sort key No limit on the number of items (∞) per partition key Except if you have local secondary indexes Partition:Sort Key 00:0 FF:∞ Hash (2) = 48 Customer# = 2 Order# = 10 Item = Pen Customer# = 2 Order# = 11 Item = Shoes Customer# = 1 Order# = 10 Item = Toy Customer# = 1 Order# = 11 Item = Boots Hash (1) = 7B Customer# = 3 Order# = 10 Item = Book Customer# = 3 Order# = 11 Item = Paper Hash (3) = CD 55 A9:∞54:∞ AAPartition 1 Partition 2
  11. 11. Partitions are three-way replicated Id = 2 Name = Andy Dept = Engg Id = 3 Name = Kim Dept = Ops Id = 1 Name = Jim Id = 2 Name = Andy Dept = Engg Id = 3 Name = Kim Dept = Ops Id = 1 Name = Jim Id = 2 Name = Andy Dept = Engg Id = 3 Name = Kim Dept = Ops Id = 1 Name = Jim Replica 1 Replica 2 Replica 3 Partition 1 Partition 2 Partition N
  12. 12. Eventually Consistent Reads Partition A 1000 RCUs Partition A 1000 RCUs Partition A 1000 RCUs Host A Host B Host C Rand(1,3) Facility 1 Facility 2 Facility 3
  13. 13. Strongly Consistent Reads Partition A 1000 RCUs Partition A 1000 RCUs Partition A 1000 RCUs Primary Host A Host B Host C Locate Primary Facility 1 Facility 2 Facility 3
  14. 14. Indexes
  15. 15. Local secondary index (LSI) Alternate sort key attribute Index is local to a partition key A1 (partition) A3 (sort) A2 (item key) A1 (partition) A2 (sort) A3 A4 A5 LSIs A1 (partition) A4 (sort) A2 (item key) A3 (projected) Table KEYS_ONLY INCLUDE A3 A1 (partition) A5 (sort) A2 (item key) A3 (projected) A4 (projected) ALL 10 GB max per partition key, i.e. LSIs limit the # of range keys!
  16. 16. Global secondary index (GSI) Alternate partition and/or sort key Index is across all partition keys Use composite sort keys for compound indexes A1 (partition) A2 A3 A4 A5 A5 (partition) A4 (sort) A1 (item key) A3 (projected) INCLUDE A3 A4 (partition) A5 (sort) A1 (item key) A2 (projected) A3 (projected) ALL A2 (partition) A1 (itemkey) KEYS_ONLY GSIs Table RCUs/WCUs provisioned separately for GSIs Online indexing
  17. 17. How do GSI updates work? Table Primary table Primary table Primary table Primary table Global Secondary Index Client 2. Asynchronous update (in progress) If GSIs don’t have enough write capacity, table writes will be throttled!
  18. 18. LSI or GSI? LSI can be modeled as a GSI If data size in an item collection > 10 GB, use GSI If eventual consistency is okay for your scenario, use GSI!
  19. 19. Scaling
  20. 20. Throughput Provisioned at the table level  Write capacity units (WCUs) are measured in 1 KB per second  Read capacity units (RCUs) are measured in 4 KB per second - RCUs measure strictly consistent reads - Eventually consistent reads cost 1/2 of consistent reads Read and write throughput limits are independent WCURCU
  21. 21. Partitioning Math In the future, these details might change… Number of Partitions By Capacity (Total RCU / 3000) + (Total WCU / 1000) By Size Total Size / 10 GB Total Partitions CEILING(MAX (Capacity, Size))
  22. 22. Partitioning Example Table size = 8 GB, RCUs = 5000, WCUs = 500 RCUs per partition = 5000/3 = 1666.67 WCUs per partition = 500/3 = 166.67 Data/partition = 10/3 = 3.33 GB RCUs and WCUs are uniformly spread across partitions Number of Partitions By Capacity (5000 / 3000) + (500 / 1000) = 2.17 By Size 8 / 10 = 0.8 Total Partitions CEILING(MAX (2.17, 0.8)) = 3
  23. 23. What causes throttling? If sustained throughput goes beyond provisioned throughput per partition Non-uniform workloads  Hot keys/hot partitions  Very large items Mixing hot data with cold data  Use a table per time period From the example before:  Table created with 5000 RCUs, 500 WCUs  RCUs per partition = 1666.67  WCUs per partition = 166.67  If sustained throughput > (1666 RCUs or 166 WCUs) per key or partition, DynamoDB may throttle requests - Solution: Increase provisioned throughput
  24. 24. What bad NoSQL looks like… Partition Time Heat
  25. 25. Getting the most out of DynamoDB throughput “To get the most out of DynamoDB throughput, create tables where the partition key element has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.” —DynamoDB Developer Guide Space: access is evenly spread over the key-space Time: requests arrive evenly spaced in time
  26. 26. Much better picture…
  27. 27. Data Modeling in NoSQL
  28. 28. It’s all about aggregations… Document Management Process ControlSocial Network Data TreesIT Monitoring
  29. 29. SQL vs. NoSQL Design Pattern
  30. 30. 1:1 relationships or key-values Use a table or GSI with an alternate partition key Use GetItem or BatchGetItem API Example: Given an SSN or license number, get attributes Users Table Partiton key Attributes SSN = 123-45-6789 Email = johndoe@nowhere.com, License = TDL25478134 SSN = 987-65-4321 Email = maryfowler@somewhere.com, License = TDL78309234 Users-License-GSI Partition key Attributes License = TDL78309234 Email = maryfowler@somewhere.com, SSN = 987-65-4321 License = TDL25478134 Email = johndoe@nowhere.com, SSN = 123-45-6789
  31. 31. 1:N relationships or parent-children Use a table or GSI with partition and sort key Use Query API Example: • Given a device, find all readings between epoch X, Y Device-measurements Partition Key Sort key Attributes DeviceId = 1 epoch = 5513A97C Temperature = 30, pressure = 90 DeviceId = 1 epoch = 5513A9DB Temperature = 30, pressure = 90
  32. 32. N:M relationships Use a table and GSI with partition and sort key elements switched Use Query API Example: Given a user, find all games. Or given a game, find all users. User-Games-Table Partition Key Sort key UserId = bob GameId = Game1 UserId = fred GameId = Game2 UserId = bob GameId = Game3 Game-Users-GSI Partition Key Sort key GameId = Game1 UserId = bob GameId = Game2 UserId = fred GameId = Game3 UserId = bob
  33. 33. Hierarchical Data Relational data modeling
  34. 34. Hierarchical data structures as items Use composite sort key to define a hierarchy Highly selective result sets with sort queries Index anything, scales to any size Primary Key Attributes ProductID type Items 1 Book title author genre publisher datePublished ISBN Ringworld Larry Niven Science Fiction Ballantine Oct-70 0-345-02046-4 2 Album title artist genre label studio released producer Dark Side of the Moon Pink Floyd Progressive Rock Harvest Abbey Road 3/1/73 Pink Floyd Track-0 title length music vocals Speak to Me 1:30 Mason Instrumental Track-1 title length music vocals Breathe 2:43 Waters, Gilmour, Wright Gilmour Track-2 title length music vocals On the Run 3:30 Gilmour, Waters Instrumental 3 Movie title genre writer producer Idiocracy Scifi Comedy Mike Judge 20th Century Fox Actor-0 name character image Luke Wilson Joe Bowers img2.jpg Actor-1 name character image Maya Rudolph Rita img3.jpg Actor-2 name character image Dax Shepard Frito Pendejo img1.jpg
  35. 35. … or as documents (JSON) JSON data types (M, L, BOOL, NULL) Document SDKs available Indexing only by using DynamoDB Streams or AWS Lambda 400 KB maximum item size (limits hierarchical data structure) Primary Key Attributes ProductID Items 1 id title author genre publisher datePublished ISBN bookID Ringworld Larry Niven Science Fiction Ballantine Oct-70 0-345-02046-4 2 id title artist genre Attributes albumID Dark Side of the Moon Pink Floyd Progressive Rock { label:"Harvest", studio: "Abbey Road", published: "3/1/73", producer: "Pink Floyd", tracks: [{title: "Speak to Me", length: "1:30", music: "Mason", vocals: "Instrumental"},{title: ”Breathe", length: ”2:43", music: ”Waters, Gilmour, Wright", vocals: ”Gilmour"},{title: ”On the Run", length: “3:30", music: ”Gilmour, Waters", vocals: "Instrumental"}]} 3 id title genre writer Attributes movieID Idiocracy Scifi Comedy Mike Judge { producer: "20th Century Fox", actors: [{ name: "Luke Wilson", dob: "9/21/71", character: "Joe Bowers", image: "img2.jpg"},{ name: "Maya Rudolph", dob: "7/27/72", character: "Rita", image: "img1.jpg"},{ name: "Dax Shepard", dob: "1/2/75", character: "Frito Pendejo", image: "img3.jpg"}]
  36. 36. Design patterns and best practices
  37. 37. DynamoDB Streams and AWS Lambda
  38. 38. Triggers Lambda function Notify change Derivative tables Amazon CloudSearch Amazon ElastiCache
  39. 39. Write Sharding Partitioning high velocity writes
  40. 40. Partition 1 1000 WCUs Partition K 1000 WCUs Partition M 1000 WCUs Partition N 1000 WCUs Votes Table Candidate A Candidate B Scaling bottlenecks Voters Provision 200,000 WCUs
  41. 41. Write sharding Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_7 Candidate B_8 Candidate A_6 Candidate A_8 Candidate A_5 Voter Votes Table
  42. 42. Write sharding Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_7 Candidate B_8 UpdateItem: “CandidateA_” + rand(0, 10) ADD 1 to Votes Candidate A_6 Candidate A_8 Candidate A_5 Voter Votes Table
  43. 43. Votes Table Shard aggregation Candidate A_2 Candidate B_1 Candidate B_2 Candidate B_3 Candidate B_5 Candidate B_4 Candidate B_7 Candidate B_6 Candidate A_1 Candidate A_3 Candidate A_4 Candidate A_5 Candidate A_6 Candidate A_8 Candidate A_7 Candidate B_8 Periodic process Candidate A Total: 2.5M 1. Sum 2. Store Voter
  44. 44. Trade off read cost for write scalability Consider throughput per partition key Shard write-heavy partition keys Your write workload is not horizontally scalable
  45. 45. GSI Partitioning Creating full table indexes
  46. 46. Active Tickets Table Customer (Partition) Ticket (Sort) GSIKey (0-N) …. Timestamp Time Based Workflows Expired Tickets GSI GSIKey (Partition) Timestamp (Sort) Daily Archive Table GSIKey (Partition) Timestamp (Sort) Attribute1 …. Attribute N RCUs = 10000 WCUs = 10000 RCUs = 100 WCUs = 1 Current table HotdataColddata Scatter query GSI for any item on the table matching the sort condition
  47. 47. TTL job Time-To-Live (TTL) Amazon DynamoDB Table CustomerActiveO rder OrderId: 1 CustomerId: 1 MyTTL: 1492641900 DynamoDB Streams Stream Amazon Kinesis Amazon Redshift An epoch timestamp marking when an item can be deleted by a background process, without consuming any provisioned capacity Time-To-Live Removes data that is no longer relevant
  48. 48. Use Lambda to execute workflows on expired items Use a write sharded GSI to selectively query items across the table space Create a Lambda “stored procedure” to execute workflows Creating dense indexes
  49. 49. Product catalog Popular items (read)
  50. 50. Partition 1 2000 RCUs Partition K 2000 RCUs Partition M 2000 RCUs Partition 50 2000 RCU Scaling bottlenecks Product A Product B Shoppers ProductCatalog Table SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT"
  51. 51. Partition 1 Partition 2 ProductCatalog Table User DynamoDB User Cache popular items SELECT Id, Description, ... FROM ProductCatalog WHERE Id="POPULAR_PRODUCT"
  52. 52. Targeting Queries Query filters vs. composite key indexes
  53. 53. Secondary index Opponent Date GameId Status Host Alice 2014-10-02 d9bl3 DONE David Carol 2014-10-08 o2pnb IN_PROGRESS Bob Bob 2014-09-30 72f49 PENDING Alice Bob 2014-10-03 b932s PENDING Carol Bob 2014-10-03 ef9ca IN_PROGRESS David BobPartition key Sort key Multivalue sorts and filters
  54. 54. Secondary Index Approach 1: Query filter Bob Opponent Date GameId Status Host Alice 2014-10-02 d9bl3 DONE David Carol 2014-10-08 o2pnb IN_PROGRESS Bob Bob 2014-09-30 72f49 PENDING Alice Bob 2014-10-03 b932s PENDING Carol Bob 2014-10-03 ef9ca IN_PROGRESS David SELECT * FROM Game WHERE Opponent='Bob' ORDER BY Date DESC FILTER ON Status='PENDING' (filtered out)
  55. 55. Approach 2: Composite key StatusDate DONE_2014-10-02 IN_PROGRESS_2014-10-08 IN_PROGRESS_2014-10-03 PENDING_2014-09-30 PENDING_2014-10-03 Status DONE IN_PROGRESS IN_PROGRESS PENDING PENDING Date 2014-10-02 2014-10-08 2014-10-03 2014-10-03 2014-09-30 + =
  56. 56. Secondary Index Approach 2: Composite key Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Partition key Sort key
  57. 57. Opponent StatusDate GameId Host Alice DONE_2014-10-02 d9bl3 David Carol IN_PROGRESS_2014-10-08 o2pnb Bob Bob IN_PROGRESS_2014-10-03 ef9ca David Bob PENDING_2014-09-30 72f49 Alice Bob PENDING_2014-10-03 b932s Carol Secondary index Approach 2: Composite key Bob SELECT * FROM Game WHERE Opponent='Bob' AND StatusDate BEGINS_WITH 'PENDING'
  58. 58. Concatenate attributes to form useful secondary index keys Take advantage of sparse indexes Use composite keys to model data You want to optimize a query as much as possible Status + Date
  59. 59. Advanced Data Modeling
  60. 60. Multiversion Concurrency Transactions in NoSQL
  61. 61. How OLTP Apps Use Data  Mostly Hierarchical Structures  Entity Driven Workflows  Data Spread Across Tables  Requires Complex Queries  Primary driver for ACID
  62. 62. Reading item partitions Get all items for a product assembly SELECT * FROM Assemblies WHERE ID=1 Order Item Picker ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 (Many more assemblies) Assemblies table Partition key Sort key
  63. 63. ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 4 UAC 53G56 5 UAC 64B17 6 UAC 48J19 Updates are problematic (Many more assemblies) Assemblies table Old items SELECT * FROM Assemblies WHERE ID=1 Order Item Picker New items
  64. 64. ID PartID Vendor Bin 1 1 ACME 27Z19 2 ACME 39M97 3 ACME 75B25 4 UAC 53G56 5 UAC 64B17 6 UAC 48J19 Creating partition “locks” (Many more assemblies) Assemblies table • Use metadata item for versioning SET #ItemList.i = list append(:newList, #ItemList.i ), #Lock = 1 • Obtain locks with conditional writes --condition-expression “Lock = 0” • Remove old items or add version info to Sort Key Query composite keys for specific versions • Readers can determine how to handle “fat” reads • Add additional metadata to manage transactional workflow as needed Current State, Timeout, etc. ID PartID ItemList Lock 1 0 {i:[[1,2,3],[4,5,6]]} 0
  65. 65. Use item partitions to manage transactional workflows Manage versioning across items with metadata Tag attributes to maintain multiple versions Code your app to recognize when updates are in progress Implement app layer error handling and recovery logic Transactional writes across items is required
  66. 66. • Partition table on nodeId, add edges to define adjacency list • Define a default edge for every node type to describe the node itself • Use partitioned GSI’s to query large nodes (dates, places, etc) • Use Streams/Lambda/EMR for graph query projections • Neighbor Entity State • Subtree Aggregations • Breadth First Search • Node Ranking Materialized Graph Modeling GSI Primary Key Attributes GSIkey Data Target Type Node Projection 0-N Jason Bourne 1 Person 1 … James Bond 4 4 … 20170418 2 Birthdate 1 … 4 … Date 2 … Finland 3 Birthplace 1 … 4 … Place 3 … GSI Primary Key Attributes GSIkey Type Target Data Node Projection 0-N Person 1 Jason Bourne 1 … 4 James Bond 4 … Birthdate 2 20170418 1 … 4 … Birthplace 3 Finland 1 … 4 … Date 2 20170418 2 … Place 3 Finland 3 … Table Primary Key Attributes Node Type Target Data GSIkey Projection 1 Person 1 Jason Bourne HASH(Person.Data) Edge/spanning tree rollups Birthdate 2 20170418 Birthplace 3 Finland 2 Date 2 20170418 HASH(Data) (Summary Stats) 3 Place 3 Finland 4 Person 4 James Bond HASH(Person.Data Edge/spanning tree rollups Birthdate 2 20170418 Birthplace 3 Finland
  67. 67. Running Aggegations and Metrics DynamoDB Streams AWS Lambda Kinesis Firehose Amazon EMR
  68. 68. Elastic Serverless Applications
  69. 69. Thank you!

×