Deep Dive and Best Practices for Windows Azure Storage Services
More info on http://www.techdays.be

Transcript of "Deep Dive and Best Practices for Windows Azure Storage Services"

  1. Me? Yves Goeleven: Solution Architect at Capgemini, Windows Azure MVP, co-founder of Azug, founder of Goeleven.com and Messagehandler.net, contributor to NServiceBus.
  2. Agenda
     Deep Dive Architecture: Global Namespace, Front End Layer, Partition Layer, Stream Layer
     Best Practices: Table Partitioning Strategy, Change Propagation Strategy, Counting Strategy, Master Election Strategy
  3. Warning! This is a level 400 session!
  4. I assume you know: storage services are highly scalable, highly available cloud storage.
  5. I assume you know: storage services bring the following data abstractions: Tables, Blobs, Drives and Queues.
  6. I assume you know: it is NOT a relational database. Storage services are exposed through a RESTful API, with some sugar on top (aka an SDK). LINQ to Tables support might make you think it is a database, but it isn't.
  7. This talk is based on an academic whitepaper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
  8. Which discusses how storage services defy the rules of the CAP theorem: a distributed & scalable system that is highly available, strongly consistent and partition tolerant, all at the same time!
  9. Now how did they do that? Deep Dive: ready for a plunge?
  10. Global Namespace
  11. Global namespace: every request gets sent to http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
  12. Global namespace: DNS based service location. For http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName, DNS resolution maps the account name to a datacenter (e.g. Western Europe) and then to the virtual IP of a stamp.
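  To make that routing concrete: a plain DNS lookup on the account hostname returns the VIP of the stamp currently serving that account (a minimal sketch; the account name myaccount is a placeholder):

      using System.Net;

      // The account hostname resolves to the virtual IP of the stamp hosting
      // the account; failover or account migration just updates this DNS record.
      IPAddress[] vips = Dns.GetHostAddresses("myaccount.blob.core.windows.net");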
  13. Stamp: the scale unit, a Windows Azure deployment of 3 layers. Typically 20 racks each, roughly 360 XL instances with disks attached: 2 petabytes of capacity before June 2012, 30 petabytes after. Intra-stamp replication.
  14. Incoming read request: arrives at a Front End node and is routed to the partition layer based on http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName. Reads are usually served from memory. [Diagram: Front End Layer (FE nodes) routing to Partition Layer (PS nodes)]
  15. Incoming read request: 2 nodes per request, blazingly fast. Flat "Quantum 10" network topology: 50 Tbps of total bandwidth, 10 Gbps per node, faster than a typical SSD drive.
  16. Incoming write request: written to the stream layer with synchronous replication to 1 primary extent node and 2 secondary extent nodes. [Diagram: Front End Layer, Partition Layer (PS nodes), Stream Layer (EN nodes: one primary, two secondaries)]
  17. Geo replication: inter-stamp replication between stamps, intra-stamp replication within each stamp. [Diagram: two stamps, each with intra-stamp replication, linked by inter-stamp replication]
  18. Location Service: stamp allocation. Manages DNS records and determines the active/passive stamp configuration. Assigns accounts to stamps, and handles geo replication, disaster recovery, load balancing and account migration.
  19. Front End Layer
  20. Front End Nodes: stateless role instances that handle:
     Authentication & authorization: Shared Key (storage key), Shared Key Lite, Shared Access Signatures
     Routing to partition servers, based on the PartitionMap (an in-memory cache)
     Versioning: x-ms-version: 2011-08-18
     Throttling
  [Diagram: Front End Layer (FE nodes)]
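  As a hedged illustration of the Shared Key scheme (omitting the canonicalization rules, which the SDK handles for you): the front end verifies an HMAC-SHA256 signature computed over a canonicalized request string with the account key.

      using System;
      using System.Security.Cryptography;
      using System.Text;

      static class SharedKeySigner
      {
          // Sign the canonicalized request string with the base64-encoded account key.
          // The Authorization header then becomes "SharedKey {account}:{signature}".
          public static string Sign(string stringToSign, string base64AccountKey)
          {
              using (var hmac = new HMACSHA256(Convert.FromBase64String(base64AccountKey)))
              {
                  return Convert.ToBase64String(hmac.ComputeHash(Encoding.UTF8.GetBytes(stringToSign)));
              }
          }
      }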
  21. Partition Map: determines the partition server for a request. Used by front end nodes, managed by the Partition Manager (partition layer). An ordered dictionary:
     Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
     Value: partition server
  Split across partition servers as range partitions. (A toy lookup sketch follows below.)
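  Illustrative only, not the actual service code: the owning server for a key is the range whose lower bound is the greatest value not exceeding the key.

      using System;
      using System.Collections.Generic;
      using System.Linq;

      class PartitionMap
      {
          // Maps the inclusive lower bound of each key range to its partition server.
          private readonly SortedList<string, string> ranges =
              new SortedList<string, string>(StringComparer.Ordinal);

          public void AssignRange(string lowerBound, string server)
          {
              ranges[lowerBound] = server;
          }

          // The owner of a key is the range with the greatest lower bound <= key
          // (assumes some range covers the key, as the real map always does).
          public string Lookup(string key)
          {
              return ranges.Last(r => string.CompareOrdinal(r.Key, key) <= 0).Value;
          }
      }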
  22. Practical examples: Blob. Each blob is a partition; the blob URI determines the range, and the range only matters for listing. Same container: more likely the same range.
     https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg
     https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg
  Other container: less likely.
     https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg
  23. Practical examples: Queue. Each queue is a partition; the queue name determines the range. All messages live in the same partition; the range only matters for listing.
  24. Practical examples: Table. Each group of rows with the same PartitionKey is a partition; ordering within that set is by RowKey. Range matters a lot here: expect continuation tokens!
  25. Continuation tokens: requests across partitions.
     All on the same partition server: 1 result.
     Across multiple ranges (or too many results): sequential queries required; you get 1 result set plus a continuation token, and repeat the query with the token until done (see the sketch below).
     This will almost never occur in development, as the dataset is typically small!
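  A minimal sketch of draining continuation tokens, assuming the 2.x Windows Azure storage client library (orders is a CloudTable placeholder):

      using Microsoft.WindowsAzure.Storage.Table;

      static void ReadAllOrders(CloudTable orders)
      {
          var query = new TableQuery<DynamicTableEntity>().Where(
              TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "Orders"));

          // Keep reissuing the query with the returned token until the service
          // stops handing one back; each segment may come from a different range partition.
          TableContinuationToken token = null;
          do
          {
              var segment = orders.ExecuteQuerySegmented(query, token);
              foreach (var entity in segment.Results)
              {
                  // process entity
              }
              token = segment.ContinuationToken; // null once all ranges are drained
          } while (token != null);
      }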
  26. Partition Layer
  27. Partition Layer: Dynamic Scalable Object Index.
     Object Tables: system-defined data structures: Account, Entity, Blob, Message, Schema, PartitionMap.
     Dynamic load balancing: the load on each part of the index is monitored to detect hot spots; the index is dynamically split into thousands of Index Range Partitions based on load, and these are automatically load balanced across servers to quickly adapt to changes in load. Updates every 15 seconds.
  28. Partition layer consists of:
     Partition Master: distributes Object Table partitions across partition servers, dynamic load balancing
     Partition Server: serves data for exactly 1 range partition, manages streams
     Lock Server: master election, partition lock
  [Diagram: Partition Layer with PM, LS and PS nodes]
  29. Dynamic Scalable Object Index, how it works: the index over (Account, Container, Blob) is split into ranges (e.g. A-H, H'-R, R'-Z), each served by a partition server; front end nodes consult the partition map to route each request to the right server. [Diagram: FE nodes and a partition map above PM/LS/PS nodes serving index ranges, with sample rows such as Harry/Pictures/Sunrise, Harry/Pictures/Sunset, Richard/Images/Soccer, Richard/Images/Tennis]
  30. Dynamic load balancing: not in the critical path! [Diagram: Partition Layer (PM, LS, PS nodes) rebalancing range partitions above the Stream Layer (EN nodes: one primary, two secondaries)]
  31. Partition Server architecture: a Log Structured Merge-Tree (easy replication), built on append-only persisted streams:
     Commit log stream
     Metadata log stream
     Row data stream (snapshots)
     Blob data stream (actual blob data)
  Caches in memory: the memory table (= commit log), row data cache, index (snapshot locations), bloom filters, load metrics.
  32. Incoming read request, HOT data: served completely from memory (memory table and row data cache). [Diagram: in-memory structures above the persistent commit log, metadata log, row data and blob data streams]
  33. Incoming read request, COLD data: served only partially from memory; the in-memory index and bloom filters locate the data in the persistent row or blob data streams. [Diagram: in-memory structures above the persistent commit log, metadata log, row data and blob data streams]
  34. Incoming write request: persisted first (appended to the commit log stream), then applied to the memory table. [Diagram: in-memory structures above the persistent commit log, metadata log, row data and blob data streams]
  35. Checkpoint process: not in the critical path! The memory table is periodically checkpointed into the row data stream, after which the commit log can be truncated. [Diagram: memory structures flushing into the persistent streams]
  36. Geo replication process: not in the critical path! [Diagram: two partition server stacks (source and destination stamp), each with its own memory structures and persistent commit log, metadata log, row data and blob data streams]
  37. Stream Layer
  38. Stream layer: an append-only distributed file system. Streams are very large files in a file-system-like directory namespace. Stream operations: open, close, delete, rename, concatenate streams together, append for writing, random reads.
  39. Stream layer concepts:
     Block: the minimum unit of write/read, checksummed, up to N bytes (e.g. 4 MB)
     Extent: the unit of replication, a sequence of blocks with a size limit (e.g. 1 GB), sealed or unsealed
     Stream: an entry in a hierarchical namespace holding an ordered list of pointers to extents; supports append and concatenate
  [Diagram: stream //foo/myfile.data pointing to three extents; only the last extent is unsealed]
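  An illustrative data model of that hierarchy (a sketch only, not actual service code):

      using System.Collections.Generic;

      class Block { public byte[] Data; public uint Checksum; }          // min unit of read/write, up to ~4 MB
      class Extent { public List<Block> Blocks; public bool Sealed; }    // unit of replication, ~1 GB; append-only until sealed
      class Stream { public string Name; public List<Extent> Extents; }  // ordered extent pointers; only the last extent is unsealed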
  40. Stream layer consists of:
     A Paxos system: a distributed consensus protocol for the Stream Master
     Stream Master: manages extents, runs optimization routines
     Extent Nodes: attached disks, store the extents
  [Diagram: Stream Layer with a Paxos ring of SM nodes and EN nodes]
  41. Extent creation, allocation process: a partition server requests extent/stream creation; the Stream Master chooses 3 random extent nodes based on available capacity, picks the primary & secondaries, and allocates the replica set. Allocation is random to reduce the Mean Time To Recovery. [Diagram: PS requesting from SM, which allocates a primary and two secondary ENs]
  42. Replication process: bitwise equality on all nodes, synchronous. The append request goes to the primary, is written to disk and checksummed, and the offset and data are forwarded to the secondaries before the ack returns. Journaling: a dedicated disk (spindle/SSD) takes the writes, which are simultaneously copied to the data disk (from memory or the journal). [Diagram: PS appending via the primary EN to two secondary ENs]
  43. Dealing with failure.
     Ack from the primary lost on the way back to the partition layer: a retry from the partition layer can cause multiple blocks to be appended (duplicate records).
     Unresponsive/unreachable Extent Node (EN): seal the failed extent, then allocate a new extent and append immediately.
  [Diagram: stream //foo/myfile.data where the failed extent is sealed and a freshly allocated extent takes the appends]
  44. Replication failures (scenario 1): [Diagram: the Stream Master seals the extent; the reachable replicas agree on the commit length (120), so the extent is sealed at that length]
  45. Replication failures (scenario 2): [Diagram: the reachable replicas report different commit lengths (120 vs 100); the extent is sealed at the smallest commit length among them]
  46. Network partitioning during read, data stream: the partition layer only reads from offsets returned by successful appends, which are committed on all replicas. For row and blob data streams such an offset is therefore valid on any replica and safe to read. [Diagram: PS reading from any of the three EN replicas]
  47. Network partitioning during read, log stream: logs are read on partition load, so check the commit length first. Only read from an unsealed replica if all replicas have the same commit length, or read from a sealed replica. [Diagram: PS checking commit lengths while the SM seals replicas]
  48. Garbage collection, the downside to this architecture: lots of wasted disk space ('forgotten' streams, 'forgotten' extents, unused blocks in extents after write errors). Garbage collection is required:
     Partition servers mark the streams they use
     The stream manager cleans up unreferenced streams and unreferenced extents
     The stream manager also performs erasure coding (Reed-Solomon algorithm), which allows recreation of cold streams after deleting extents
  49. Beating the CAP theorem: layering & co-design.
     Stream layer: availability with partition/failure tolerance; for consistency, replicas are bit-wise identical up to the commit length.
     Partition layer: consistency with partition/failure tolerance; for availability, range partitions can be served by any partition server and are moved to available servers if a partition server fails.
     Designed for the specific classes of partitioning/failures seen in practice: process, disk, node and rack failures/unresponsiveness, and node-to-rack level network partitioning.
  50. Best practices
  51. Partitioning strategy. Tables: great for writing data, horrible for complex querying. Data is automatically spread across extent nodes and indexes are automatically spread across partition servers, so queries to many servers may be required to list data. [Diagram: Front End Layer, Partition Layer (PS nodes), Stream Layer (EN nodes)]
  52. Partitioning strategy: insert all aggregates in a dedicated partition.
     Aggregate: records that belong together as a unit, e.g. an order with its order lines
     Dedicated: a unique combination of partition parts (account, table, partition key) gives the best insert performance
     Use Entity Group Transactions: all records of the aggregate are inserted in 1 go, transactionally (see the sketch after the table)
     Use the RowKey only for ordering; it can be blank

     Account  Table   PK      RowKey  Properties
     Europe   Orders  Order1          Order 1 data
     Europe   Orders  Order1  01      Orderline 1 data
     Europe   Orders  Order1  02      Orderline 2 data
     Europe   Orders  Order2          Order 2 data
     Europe   Orders  Order2  01      Orderline 1 data
     Europe   Orders  Order2  02      Orderline 2 data
     US       Orders  Order3          Order 3 data
     US       Orders  Order3  01      Orderline 1 data
     US       Orders  Order3  02      Orderline 2 data
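  A minimal sketch of such an insert as an Entity Group Transaction, assuming the 2.x storage client library (OrderEntity and ordersTable are placeholders):

      using Microsoft.WindowsAzure.Storage.Table;

      public class OrderEntity : TableEntity
      {
          public string Data { get; set; }
      }

      static class OrderWriter
      {
          // All entities share PartitionKey "Order1", so the batch commits atomically
          // as one Entity Group Transaction (single partition, max 100 entities).
          public static void InsertAggregate(CloudTable ordersTable)
          {
              var batch = new TableBatchOperation();
              batch.Insert(new OrderEntity { PartitionKey = "Order1", RowKey = "",   Data = "Order 1 data" });
              batch.Insert(new OrderEntity { PartitionKey = "Order1", RowKey = "01", Data = "Orderline 1 data" });
              batch.Insert(new OrderEntity { PartitionKey = "Order1", RowKey = "02", Data = "Orderline 2 data" });
              ordersTable.ExecuteBatch(batch);
          }
      }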
  53. Partitioning strategy: Query By = Store By.
     Do not query across partitions, it hurts! Instead store a copy of the data to query in 1 partition.
     PartitionKey: a combination of query-by values, or well-known identifiers.
     RowKey: use for ordering; must be unique in combination with the PK.
     This guarantees fast retrieval: every query hits 1 partition server.

     Acct  Table   PK                   RK    Properties
     Yves  Orders  0634911534000000000  1     Order 1 data
     Yves  Orders  0634911534000000000  1_01  Orderline 1 data
     Yves  Orders  0634911534000000000  1_02  Orderline 2 data
     Yves  Orders  0634911534000000000  3     Order 3 data
     Yves  Orders  0634911534000000000  3_01  Orderline 1 data
     Yves  Orders  0634911534000000000  3_02  Orderline 2 data
     Yves  Orders  0634911666000000000  2     Order 2 data
     Yves  Orders  0634911666000000000  2_01  Orderline 1 data
     Yves  Orders  0634911666000000000  2_02  Orderline 2 data
  54. Partitioning strategy: DO denormalize! It's not a sin! Limit the dataset to the values required by the query, precompute, format up front, etc.

     Acct  Table   PK                   RK  Properties
     Yves  Orders  0634911534000000000  1   €200
     Yves  Orders  0634911534000000000  3   $500
     Yves  Orders  0634911666000000000  2   €1000
  55. Bonus tip: great partition keys. Any combination of properties of the object:
     Who: e.g. the current user id
     When: string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
     Where: geohash algorithm (geohash.org)
     Why: flow context through operations/messages
  Add all partition key values as properties to the original aggregate root, or update will be your next nightmare! (See the sketch below.)
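  For instance, a reverse-chronological "when" key combined with a "who" can be built like this (a sketch; currentUserId is a placeholder):

      using System;

      static class PartitionKeys
      {
          // D19 zero-pads the tick delta to 19 digits so lexicographic order matches
          // numeric order; subtracting from MaxValue.Ticks makes newer rows sort first.
          public static string TimelineKey(string currentUserId)
          {
              string when = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
              return currentUserId + "_" + when;
          }
      }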
  56. Change propagation strategy: use queued messaging. There is NO TRANSACTION support across partitions, so split a multi-partition update into separate atomic operations, rely on the built-in retry mechanism of storage queues, and use compensation logic if needed. Example: update the primary records in an entity group transaction, publish an 'Updated' event through queues, maintain the list of interested nodes in storage, and let the secondary nodes update their indexes in response (a sketch follows below). [Diagram: web roles/websites and worker roles; a primary partition P fanning out to secondaries S]
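  A minimal sketch of the publishing side with raw storage queues, assuming the 2.x storage client (queue name and payload are placeholders; NServiceBus, mentioned on the next slide, wraps this up properly):

      using Microsoft.WindowsAzure.Storage.Queue;

      static class ChangePublisher
      {
          // Enqueue an 'Updated' event after the primary entity group transaction commits.
          // Queue delivery is at-least-once, so the secondary index handlers must be idempotent.
          public static void PublishUpdated(CloudQueueClient queueClient, string orderId)
          {
              CloudQueue events = queueClient.GetQueueReference("order-updated");
              events.CreateIfNotExists();
              events.AddMessage(new CloudQueueMessage(orderId));
          }
      }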
  57. Change propagation strategy: sounds hard, but don't worry. I've gone through all the hard work already and contributed it back to the community: NServiceBus. Bonus: this works very well to update web role caches, and is very nice in combination with SignalR! As a service: www.messagehandler.net
  58. Counting: counting records across partitions is a PITA, because querying is... and there is no built-in support. Do it upfront:
     I typically count 'created' events in flight; it's just another handler
     BUT... +1 is not an idempotent operation
     Keep the count atomic: no other operation except +1 & save
     Use sharded counters: every thread on every node 'owns' a shard of the counter (prevents concurrency conflicts), all in a single partition
     Roll up when querying and/or at a regular interval
  [Diagram: worker roles each owning a counter shard C next to the secondaries]
  59. Counting: counting actions in Goeleven.com

      public class CountActions : IHandleMessages<ActionCreated>
      {
          public void Handle(ActionCreated message)
          {
              // One shard per role instance and thread, so no two writers share a row.
              string identifier = RoleEnvironment.CurrentRoleInstance.Id + "." + Thread.CurrentThread.ManagedThreadId;
              var shard = counterShardRepository.TryGetShard("Actions", identifier);
              if (shard == null)
              {
                  shard = new CounterShard { PartitionKey = "Actions", RowKey = identifier, Partial = 1 };
                  counterShardRepository.Add(shard);
              }
              else
              {
                  shard.Partial += 1;
                  counterShardRepository.Update(shard);
              }
              counterShardRepository.SaveAllChanges();
          }
      }

     Acct  Table     PK       RK                       Properties
     My    Counters  Actions  RollUp                   2164
     My    Counters  Actions  Goeleven.Worker_IN_0.11  12
     My    Counters  Actions  Goeleven.Worker_IN_1.56  15
  60. Counting: rollup

      public class Counter
      {
          public long Value { get; private set; }

          public Counter(IEnumerable<CounterShard> counterShards)
          {
              // Sum the partial counts of all shards into a single value.
              foreach (var counterShard in counterShards)
              {
                  Value += counterShard.Partial;
              }
          }
      }

     Acct  Table     PK       RK                       Properties
     My    Counters  Actions  RollUp                   2164
     My    Counters  Actions  Goeleven.Worker_IN_0.11  12
     My    Counters  Actions  Goeleven.Worker_IN_1.56  15
     My    Counters  Actions                           2181
  61. Master election: scheduling in a multi-node environment, like the rollup. All nodes kick in together doing the same thing, so master election is required to prevent conflicts. Remember the locking service? It surfaces through blob storage as a 'Lease', so you can use it too! [Diagram: Partition Layer with PM, LS and PS nodes]
  62. Master election: try to acquire a lease. If it succeeds, start work and hold on to the lease until done, renewing it in the background so it cannot lapse mid-work:

      leaseId = blob.TryAcquireLease();
      if (leaseId != null)
      {
          renewalThread = new Thread(() =>
          {
              // Renew well within the lease period, for as long as the work runs.
              while (true)
              {
                  Thread.Sleep(TimeSpan.FromSeconds(40));
                  blob.RenewLease(leaseId);
              }
          });
          renewalThread.Start();

          // do your thing

          renewalThread.Abort();
          blob.ReleaseLease(leaseId);
      }

  Find the blob extension on http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
  63. Wrapup
  64. What we discussed.
     Deep Dive: Global Namespace, Front End Layer, Partition Layer, Stream Layer
     Best Practices: Table Partitioning Strategy, Change Propagation Strategy, Counting Strategy, Master Election Strategy
  65. Resources. Want to know more?
     White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
     Video: http://www.youtube.com/watch?v=QnYdbQO0yj4
     More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx
     Blob extensions: http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
     NServiceBus: https://github.com/nservicebus/nservicebus
  See it in action: www.goeleven.com, www.messagehandler.net
  66. Q&A