(ARC310) Solving Amazon's Catalog Contention With Amazon Kinesis


The Amazon.com product catalog receives millions of updates an hour across billions of products with many of the updates concentrated on comparatively few products. In this session, hear how Amazon.com has used Amazon Kinesis to build a pipeline orchestrator that provides sequencing, optimistic batching, and duplicate suppression whilst at the same time significantly lowering costs. This session covers the architecture of that solution and draws out the key enabling features that Amazon Kinesis provides. This talk is intended for those who are interested in learning more about the power of the distributed log and understanding its importance for enabling OLTP just as DHT is for storage.


1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tim Griffith – Amazon
October 2015
Orchestration over Amazon Kinesis
Solving Amazon’s Catalog Contention and Cost with Amazon Kinesis
ARC310
2. What to Expect from the Session
• How Amazon processes listing information
• Challenges that led us to look for a new approach
• (One method) of building an orchestrator on top of Amazon Kinesis – looking deeply into:
  • A comparison of the cost mechanics
  • How it models its distributed coordination
  • Other serendipitous benefits we found from “working with the log”
• How to get 2/3rds cost savings and radically simplified processing semantics
3. …with this
[Architecture diagram: an identified client submits prioritized work to the Gateway and gets back a submission ID, completion SLA, and QoS. Components: Process Instances, Process Templates, Channel Definitions, Action Definitions, Channel Dispatch Q, SLA Config, SLA Mgmt, Dispatch Manager, and Asynchronous Action Request Handlers (A-ARH), wired together over Amazon Kinesis and Amazon SQS.]

4. Listing on Amazon
5. Starts with a Product
What are we selling?
• Keyed by a Stock Keeping Unit (SKU)
• Holds a series of facts about the product, e.g.:
  • What type of product is it?
  • What color is it?
  • How big is it?
  • Who made it?
[Entity: Product – SKU [PK], attributes]

6. Then We Need an Offer
How much are we selling it for?
• Talks in relation to a Product [FK: SKU]
• Price
• When to start and stop selling it
[Entities: Product (SKU [PK], attributes) and Offer (price, when)]

7. We Can Also Relate Items
Consider a shoe:
• Facts for a model of shoe should be the same
• However, each size and color is a different product
[Entities: Product (SKU [PK], attributes), Offer (price, when), Relation (attributes)]
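A minimal sketch of this data model. The deck only shows the entity boxes above, so the field and type names here are hypothetical, chosen just to make the relationships concrete:

import java.math.BigDecimal;
import java.time.Instant;
import java.util.Map;

// Hypothetical shapes for the entities on slides 5-7; names are illustrative only.
record Product(String sku,                          // PK: Stock Keeping Unit
               Map<String, String> attributes) {}   // type, color, size, manufacturer, ...

record Offer(String sku,           // FK: talks in relation to a Product by SKU
             BigDecimal price,
             Instant sellFrom,     // when to start selling
             Instant sellUntil) {} // when to stop selling

record Relation(String modelSku,                    // e.g. the shoe model
                String variationSku,                // e.g. one size/color of that model
                Map<String, String> attributes) {}  // facts shared across the family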
8. Shared Nothing Is Easy! = ?
[Architecture diagram: Sellers → EC2 API cluster → catalog by ‘seller/sku’ in DynamoDB (with streams) → CloudSearch reverse index; customers reach it through an EC2 web hosting stack with a query cache.]
9. No… But Why?
• Amazon has the world’s largest selection, with one of the largest numbers of sellers.
• We believe that searching these directly yields a poor user experience:
  • Conflates finding the product desired with finding the best offer (and provides little opportunity to default that decision)
  • Dilutes feedback on the product itself – conflates seller quality with product quality

10. The Solution – Single Item Detail Page
Common description and features across products

11. The Solution – Single Item Detail Page
Sparse matrix of products by color and size variations available

12. The Solution – Single Item Detail Page
Multiple sellers often selling the same specific item (size + color)
13. What Does this Add to the Architecture?
Need to bring sellers’ representations of a product together.
• Requires standardizing the data to ensure it’s consistent
• Have to determine if each product is new, or matches other products
• Will now often end up with duplicate attributes, so we need to choose the best value(s)
[Pipeline: Standardize & validate → catalog by ‘seller/sku’ → Match → Reconcile → Amazon catalog by ID; offers by ‘ID’ / seller / sku]
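As a sketch, those three phases can be written down as pipeline stages. Every name here is hypothetical, intended only to pin down what flows where:

import java.util.List;
import java.util.Map;

// Hypothetical stage interfaces for the standardize -> match -> reconcile pipeline.
interface Standardizer {
    // Validate a seller's submission and normalize it to a consistent vocabulary.
    Map<String, String> standardizeAndValidate(Map<String, String> rawAttributes);
}

interface Matcher {
    // Decide whether this product is new or matches an existing catalog entry;
    // returns the catalog ID it matched (or a freshly minted one).
    String match(Map<String, String> standardizedAttributes);
}

interface Reconciler {
    // Several sellers now contribute duplicate attributes for one catalog ID;
    // choose the best value for each attribute.
    Map<String, String> reconcile(String catalogId, List<Map<String, String>> contributions);
}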
14. What Does this Add to the Architecture?
Top-level orchestration now required:
• Different types of data need to be sequenced
• Can do phases as aggregate (for bulk)
• Can also do individually
• Orchestration differs depending on approach
[Data types: Product, Relations, Offers]

15. Challenges
16. Mission Accomplished?
This paradigm has been running in prod since the turn of the century.
• Number of sellers has grown enormously!
• Number of sellers on some ASINs can be extreme – leading to contention
• Optimistic gave way to pessimistic – leading to serialization
[Diagram: many per-seller Standardize & validate / catalog by ‘seller/sku’ / Match pipelines all converging on Reconcile and the Amazon catalog by ‘ASIN’]
17. Mission Accomplished?
• Serialization leads to unpredictable performance
• Feed-processing orchestration is macro and can compound performance problems
• Dual entry between 1x1 and feeds can cause sequencing issues
• Unpredictable performance leads to regular user contact following up on status
18. Requirements

19. What Do We Need?
Client perspective:
• Ability to reason about when work will be done
• Ability to check a submission’s progress
Orchestrated services’ perspective:
• Support for both order and priority
• Ability to batch work on a key
Non-functional perspective:
• Run as fast as possible
• Run as cheaply as possible

20. Costing Workflow
21. Let’s Abstract This…
[Same architecture diagram as slide 3: Gateway, Process Instances, Process Templates, Channel Definitions, Action Definitions, Channel Dispatch Q, SLA Config, SLA Mgmt, Dispatch Manager, and Asynchronous Action Request Handlers (A-ARH) over Amazon Kinesis and Amazon SQS.]
22. … to a Simpler “Mile High” Generic View
[Diagram: Seller ↔ Orchestrator ↔ Services]
• Invoke the orchestrator, getting: acceptance, a completion-time SLA, and a QoS guarantee for that SLA
• Services are invoked and respond asynchronously to avoid coupling
• Services scale directly with seller demand
• The orchestrator responds with the outcome to the seller
23. Mile High View - Vended
[Diagram: Seller ↔ Services]
Pricing Math – SWF (/hour) for 5K TPS, average 2 activities:
• executions = $1,800
• tasks = $900
$23.6M / year – not viable
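Checking that annual figure against the hourly ones: $1,800 + $900 = $2,700 per hour, and $2,700 × 8,760 hours ≈ $23.6M per year.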
24. Mile High View - Traditional
[Diagram: Seller ↔ Orchestrator ↔ Services]
Pricing math – SQS/DynamoDB/EC2 for 5K TPS, average 2 activities:
• Amazon SQS: 15K TPS x 3 (send, read, delete) – ~$8.1/hour (10 batching)
• Amazon DynamoDB: 10K read ($1.30), 15K write ($9.75) = ~$10/hour
• Amazon EC2: ~200 TPS/host on c4.large… 30 = ~$3.3/hour
~$187.5K/year – viable, but worth further investigation
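Same arithmetic here: ~$8.1 + ~$10 + ~$3.3 ≈ $21.4 per hour, and $21.4 × 8,760 hours ≈ $187.5K per year.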
25. Mile High View – Starting to Look at Amazon Kinesis
[Diagram: Seller ↔ Orchestrator ↔ Services]
Pricing math – SQS/DynamoDB/Kinesis/EC2 for 5K TPS, average 2 activities, average size 2K:
• Amazon SQS: 5K TPS x 3 (send, read, delete) – ~$2.7/hour (10 batching)
• Amazon DynamoDB: 10K read ($1.30), 15K write ($9.75) = ~$10/hour
• Amazon EC2: ~200 TPS/host on c4.large… 30 = ~$3.3/hour
• Amazon Kinesis: Shards = 10 MB/s ~= 10 shards, ~$.15/hour; PUT units = ~$.25/hour
26. Mile High View – Questions on Storage?
[Diagram: Seller ↔ Orchestrator ↔ Services]
Storage is multi-function:
• Continuation of execution (re-location via execution ID)
• Audit of execution
Observations:
• Amazon Kinesis has key affinity as well… a process will go to the same shard…
• Audit doesn’t have to be conflated with online… or be key addressable
27. Mile High View – In-Memory Orchestration
[Diagram: Seller ↔ Orchestrator Gateway ↔ Orchestrator Back-End ↔ Services]
Thinking about memory:
• Residency of processes (worst case) – 12 hours of unprocessed work: [2K * 5K * 3600 * 12] = 412 GB
• R3 series: 30 x r3.large gives 450 GB
28. What About Durability?
• Amazon Kinesis only guarantees 24 hours of durability
• May not be sufficient for some low-priority work
• Reconstruction of 24 hours of data would be considerable
• Need to think of an alternate approach to ensuring durability…
• “Snapshotting” many GBs of memory is slow – don’t want a stop-the-world operation
29. Asynchronous Durability
Amazon Kinesis gives us a clue as to how to solve this… each shard supports 2 MB/s of reads against 1 MB/s of writes, so we get bandwidth such that we can read twice for every write.
30. Asynchronous Durability with an Archiver
[Diagram: Seller ↔ Orchestrator Gateway ↔ Orchestrator Back-End, plus a second Orchestrator Back-End acting as the archiver]
• Loads the prior snapshot for a shard, opens the stream at that sequence, reads deltas forward until the high water mark, then writes down a new snapshot
• The archiver can be a (much) smaller fleet than the main orchestrator as it can rotate through all the shards
• [Bonus] Effectively constantly checking the recovery strategy
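A sketch of that loop, assuming hypothetical SnapshotStore (e.g. S3-backed) and DeltaReader (e.g. a Kinesis shard iterator) abstractions. This is the shape of the idea, not the production code:

import java.util.List;

// Hypothetical archiver: snapshot + replayable deltas = asynchronous durability.
class Archiver {
    record Snapshot(String sequenceNumber, byte[] state) {}
    record Delta(String sequenceNumber, byte[] payload) {}

    interface SnapshotStore {                  // e.g. backed by Amazon S3
        Snapshot loadLatest(String shardId);   // first run: an empty snapshot at the trim horizon
        void store(String shardId, Snapshot snapshot);
    }

    interface DeltaReader {                    // e.g. backed by a Kinesis shard iterator
        List<Delta> readRange(String shardId, String fromSequence, String toSequence);
    }

    private final SnapshotStore snapshots;
    private final DeltaReader deltas;

    Archiver(SnapshotStore snapshots, DeltaReader deltas) {
        this.snapshots = snapshots;
        this.deltas = deltas;
    }

    // Called per shard; a small fleet can rotate through all shards this way.
    void archive(String shardId, String highWaterMark) {
        Snapshot prior = snapshots.loadLatest(shardId);   // recovery checkpoint (low water mark)
        byte[] state = prior.state();
        for (Delta d : deltas.readRange(shardId, prior.sequenceNumber(), highWaterMark)) {
            state = apply(state, d);                      // read deltas forward to the high water mark
        }
        snapshots.store(shardId, new Snapshot(highWaterMark, state)); // write down the new snapshot
    }

    private byte[] apply(byte[] state, Delta d) {
        return state;                                     // domain-specific fold; elided in this sketch
    }
}

Because the archiver does exactly what recovery would do (load a snapshot, replay deltas forward), running it continuously is also a standing test of the recovery strategy, which is the bonus point above.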
31. Archiver – Constraints Imposed (Checkpointing)
[Diagram: the Orchestrator Back-End runs on the KCL. The archiver stores state between a shard recovery checkpoint (low water mark, the previous snapshot) and a shard processed checkpoint (high water mark, the new snapshot). On start-up: recover state, then recommence processing.]
32. Archiver – Closed Input Monotonicity
[Diagram: Services ↔ Orchestrator Back-End, with backpressure stats flowing back]
33. Mile High View – Kinesis Aggregator
[Diagram: Seller ↔ Orchestrator Gateway ↔ Orchestrator Back-End, with the archiver Back-End alongside]
Pricing math (hourly) for 5K TPS, average 2 activities:
• Amazon SQS: 5K TPS x 3 (send, read, delete) – ~$2.7 (10 batching)
• Amazon EC2: ~200 TPS/host on r3.large… 30 = ~$5.25
• Amazon Kinesis: Shards = 10 MB/s ~= 10 shards x 3, ~$.45; PUT units = ~$.75
• Amazon S3: p100 retention = 450 GB ~= $.02, ~10K PUT/hour = $.05
• Amazon DynamoDB: (checkpoints) = 100 TPS write / 10 TPS read ~= $.07
~$80.8K/year
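Summing the hourly figures: $2.7 + $5.25 + $0.45 + $0.75 + $0.02 + $0.05 + $0.07 ≈ $9.3 per hour, and $9.3 × 8,760 hours ≈ $81K per year.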
34. How Are We Doing?
Client perspective:
• Ability to reason about when work will be done
• Ability to check a submission’s progress
Orchestrated services’ perspective:
• Support for both order and priority
• Ability to batch work on a key
Non-functional perspective:
• Run as fast as possible
• Run as cheaply as possible
35. Exploiting Affinity
36. Priority and Sequencing
Imagine a scenario on a key:
• Low-priority update v1 at t
• High-priority update v2 at t+2
• v1 has a large SLA / low QoS and blocks v2
• As v1 and v2 are on the same host, v2 can simply apply its precedence onto its predecessors, thus dispatching them sooner
[Example: key A918741894 starts with v1 at SLA ‘…:14:31.14.978’, QoS 2. When v2 arrives at SLA ‘…:13:35.20.107’, QoS 5, v1 is re-queued at SLA ‘…:13:35.15.107’, QoS 5.]
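A sketch of that precedence propagation, assuming a hypothetical per-key pending list held by the shard owner. Key affinity is what guarantees v1 and v2 land on the same host, so this can all happen in memory:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-key pending work; shard affinity puts all versions of a key on one host.
class KeyPendingQueue {
    record Pending(long version, Instant sla, int qos) {}

    private final List<Pending> pending = new ArrayList<>();

    // A later, higher-priority update applies its precedence to earlier versions of the key,
    // so blocked predecessors inherit the tighter SLA/QoS and dispatch sooner.
    void submit(long version, Instant sla, int qos) {
        pending.replaceAll(p ->
            p.version() < version && p.sla().isAfter(sla)
                ? new Pending(p.version(),
                              sla.minusMillis(1),       // nudged earlier so predecessors still go first
                              Math.max(p.qos(), qos))   // (the deck's example moves v1 5 s ahead of v2)
                : p);
        pending.add(new Pending(version, sla, qos));
    }

    List<Pending> view() { return List.copyOf(pending); }
}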
37. How Are We Doing?
Client perspective:
• Ability to reason about when work will be done
• Ability to check a submission’s progress
Orchestrated services’ perspective:
• Support for both order and priority
• Ability to batch work on a key
Non-functional perspective:
• Run as fast as possible
• Run as cheaply as possible
38. Batching Activity
In several situations, multiple updates on a key can be done for lower cost.
• At time of dispatch, if the key is batchable, walk the dependency chain
• Provided no other dependencies are outstanding, an update can be added to the batch
• Up to the max batch size for the activity
[Diagram: a Channel Dispatch Q holding A (v1), A (v2), A (v3), and B (v1) with dependencies between them; A (v1) and A (v2) are dispatched in a single payload as one batch.]
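A sketch of the dispatch-time batching walk, with hypothetical Update and dependency types:

import java.util.ArrayList;
import java.util.List;

// Hypothetical dispatch-time batching: fold ready updates on the same key into one payload.
class Batcher {
    record Update(String key, long version, boolean hasOutstandingDependencies) {}

    // Walk the dependency chain behind the head-of-queue update for this key;
    // stop at the first blocked update or at the activity's max batch size.
    List<Update> buildBatch(Update head, List<Update> successorsInOrder, int maxBatchSize) {
        List<Update> batch = new ArrayList<>(List.of(head));
        for (Update next : successorsInOrder) {
            if (batch.size() >= maxBatchSize) break;      // up to max batch size for the activity
            if (next.hasOutstandingDependencies()) break; // only batch when nothing else is outstanding
            batch.add(next);                              // e.g. A(v1) and A(v2) in a single payload
        }
        return batch;
    }
}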
39. How Are We Doing?
Client perspective:
• Ability to reason about when work will be done
• Ability to check a submission’s progress
Orchestrated services’ perspective:
• Support for both order and priority
• Ability to batch work on a key
Non-functional perspective:
• Run as fast as possible
• Run as cheaply as possible
40. Predictable Execution
41. Predictably Scheduling Work
• The existing system has no sense of when work will be completed, let alone the strength of the commitment
• Work backwards from the client contract: provide a completion-time Service Level Agreement (SLA) and the associated Quality of Service (QoS) guarantee
• The orchestrator explicitly chooses to dispatch in order of SLA; dispatch vs. breach is decided based on cumulative QoS risk
42. Acting on SLA / QoS – Channel Dispatch Q
An in-memory shard-processor priority queue per channel, { ordered on min(SLA) }:
• SLA: ‘…:18:29.13.540’, QoS: 3, Work: …
• SLA: ‘…:18:29.14.978’, QoS: 2, Work: …
• SLA: ‘…:18:29.15.123’, QoS: 4, Work: …
[Flow: Evaluate Dispatch → generate a weighted dispatch probability → passes a pseudo-random dice roll (seeded by sequence ID)? Y: dispatch to channel. N: breach SLA and extend. Record Event (inc. UTC timestamp).]
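A sketch of that evaluation loop. The probability function here is a placeholder for whatever weighting the real system uses, and the seeded Random stands in for the deck’s “pseudo-random [seeded sequence ID]” roll, whose point is that replays make identical choices:

import java.time.Duration;
import java.time.Instant;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical per-channel dispatch queue: ordered on min(SLA); breach decided by a weighted dice roll.
class ChannelDispatchQueue {
    record Work(Instant sla, int qos, Runnable payload) {}

    private final PriorityQueue<Work> queue =
        new PriorityQueue<>((a, b) -> a.sla().compareTo(b.sla()));    // { ordered on min(SLA) }
    private final Random dice;

    ChannelDispatchQueue(long seed) { this.dice = new Random(seed); } // seeded -> deterministic on replay

    void offer(Work w) { queue.add(w); }

    void evaluateDispatch(Instant now) {
        Work head = queue.poll();
        if (head == null) return;
        if (dice.nextDouble() < dispatchProbability(head, now)) {
            head.payload().run();                                     // dispatch to channel; record event
        } else {
            offer(new Work(head.sla().plusSeconds(1), head.qos(), head.payload())); // breach SLA and extend
        }
    }

    // Illustrative weighting only: past-due work always dispatches; otherwise higher QoS
    // and less SLA slack push the probability up.
    private double dispatchProbability(Work w, Instant now) {
        long slackMs = Duration.between(now, w.sla()).toMillis();
        if (slackMs <= 0) return 1.0;
        return Math.min(1.0, w.qos() / (1.0 + slackMs / 1000.0));
    }
}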
43. Predictably Scheduling Work – Priority Abuse
Don’t want to punish bursty customers, but:
• Need to smooth out traffic
• Need to prevent QoS abuse by a client sending all traffic as priority 1.0
• Need to insulate against denial of service / sustained SLA exceed
[Flow: SLA Config look-up → calculate the SLA base as max(now, prior SLA + pro-rata’d TPS), backed by an SLA Base Store (key -> last base) → exceeds the max tolerable lead? Y: downgrade to the next SLA tier. N: issue base + config SLA.]
The next three slides walk through this calculation; a code sketch follows them.
44. Predictably Scheduling Work – Priority Abuse
Imagine we have an SLA config for a client:
• 0.8 – 1.0, SLA: 2 seconds, TPS: 1, QoS: 5, max lead: 1s
• 0.5 – 0.8, SLA: 5 seconds, TPS: 2, QoS: 3, max lead: 10s
[Timeline diagram]
First request at now = 0, hwm = null, base = max(0, null) = 0:
• Datum SLA = base (0) + SLA (2) = 2
• hwm = base (0) + 1 / TPS (1) = 1
45. Predictably Scheduling Work – Priority Abuse
Imagine we have an SLA config for a client:
• 0.8 – 1.0, SLA: 2 seconds, TPS: 1, QoS: 5, max lead: 1s
• 0.5 – 0.8, SLA: 5 seconds, TPS: 2, QoS: 3, max lead: 10s
[Timeline diagram]
Second request, also at now = 0; in the top tier hwm = 1, base = max(0, 1):
• hwm (1) – now (0) >= max lead (1), so downgrade to the next tier
In the 0.5 – 0.8 tier, now = 0, hwm = null, base = max(0, null):
• Datum SLA = base (0) + SLA (5) = 5
• hwm = base (0) + 1 / TPS (2) = 0.5
46. Predictably Scheduling Work – Priority Abuse
Imagine we have an SLA config for a client:
• 0.8 – 1.0, SLA: 2 seconds, TPS: 1, QoS: 5, max lead: 1s
• 0.5 – 0.8, SLA: 5 seconds, TPS: 2, QoS: 3, max lead: 10s
[Timeline diagram]
Third request at now = 0.5; hwm = 1, base = max(0.5, 1) = 1:
• Datum SLA = base (1) + SLA (2) = 3
• hwm = base (1) + 1 / TPS (1) = 2
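Putting slides 43–46 together, here is a sketch of the base calculation. The tier layout and names are assumptions, and the per-tier high-water mark is inferred from the worked examples, where the downgraded request starts over from hwm = null:

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the SLA-base calculation from slides 43-46.
class SlaAssigner {
    record Tier(String name, double slaSeconds, double tps, int qos, double maxLeadSeconds) {}
    record Assignment(double datumSla, int qos) {}

    private final Map<String, Double> hwmStore = new HashMap<>(); // SLA base store: key -> last base

    Assignment assign(String clientKey, Tier[] tiersHighestFirst, double now) {
        for (Tier tier : tiersHighestFirst) {
            String key = clientKey + "/" + tier.name();
            Double hwm = hwmStore.get(key);
            if (hwm != null && hwm - now >= tier.maxLeadSeconds()) {
                continue;                                   // exceeds max tolerable lead: downgrade tier
            }
            double base = (hwm == null) ? now : Math.max(now, hwm);      // base = max(now, prior hwm)
            hwmStore.put(key, base + 1.0 / tier.tps());                  // pro-rata'd TPS moves the hwm forward
            return new Assignment(base + tier.slaSeconds(), tier.qos()); // datum SLA = base + tier SLA
        }
        throw new IllegalStateException("all tiers exceeded: sustained abuse / denial of service");
    }
}

Running the deck’s three requests through this gives datum SLAs of 2, then 5 (after the downgrade to the lower tier), then 3, matching the worked examples above.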
47. How Are We Doing?
Client perspective:
• Ability to reason about when work will be done
• Ability to check a submission’s progress
Orchestrated services’ perspective:
• Support for both order and priority
• Ability to batch work on a key
Non-functional perspective:
• Run as fast as possible
• Run as cheaply as possible
48. Summary
By putting Amazon Kinesis at the core of our architecture:
• We reduced TCO from $187.5K to $81K.
• We internalized sequencing and prioritization, making action handlers simple.
• We gained new capabilities (entity update batching).
49. Thank you!
And of course… we’re hiring!
Cupertino | New York | Phoenix | Seattle

50. Remember to complete your evaluations!

51. Related Sessions
