Deep Dive and Best Practices for Windows Azure Storage Services
 

More info on http://www.techdays.be

    Presentation Transcript

    • Me? Yves Goeleven: Solution Architect at Capgemini, Windows Azure MVP, co-founder of Azug, founder of Goeleven.com and Messagehandler.net, contributor to NServiceBus.
    • Agenda. Deep Dive (Architecture): Global Namespace, Front End Layer, Partition Layer, Stream Layer. Best Practices: Table Partitioning Strategy, Change Propagation Strategy, Counting Strategy, Master Election Strategy.
    • Warning! This is a level 400 session!
    • I assume you know: Storage services are highly scalable, highly available cloud storage.
    • I assume you know: it offers the following data abstractions: Tables, Blobs, Drives and Queues.
    • I assume you know: it is NOT a relational database. Storage services are exposed through a RESTful API, with some sugar on top (aka an SDK). LINQ to Tables support might make you think it is a database.
    • This talk is based on an academic whitepaper http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
    • Which discusses how storage services defy the rules of the CAP theorem: a distributed and scalable system that is highly available, strongly consistent and partition tolerant, all at the same time!
    • Now how did they do that? Deep Dive: ready for a plunge?
    • Global Namespace
    • Global namespace: every request gets sent to http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
    • Global namespace: DNS-based service location. For http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName, DNS resolution maps the account name to a datacenter (e.g. Western Europe) and to the virtual IP of a stamp (sketch below).
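      A minimal sketch of how the account name surfaces in the per-service endpoints, assuming the Windows Azure storage client library 2.x; the account name and key below are placeholders.

          using System;
          using Microsoft.WindowsAzure.Storage;

          class EndpointSketch
          {
              static void Main()
              {
                  // Placeholder account name and key; with a real account these endpoints
                  // resolve via DNS to the virtual IP of the stamp the account lives on.
                  var account = CloudStorageAccount.Parse(
                      "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=bXlrZXk=");

                  Console.WriteLine(account.BlobEndpoint);   // https://myaccount.blob.core.windows.net/
                  Console.WriteLine(account.TableEndpoint);  // https://myaccount.table.core.windows.net/
                  Console.WriteLine(account.QueueEndpoint);  // https://myaccount.queue.core.windows.net/
              }
          }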
    • Stamp: the scale unit, a Windows Azure deployment of the 3 layers. Typically 20 racks each, roughly 360 XL instances with disks attached. Capacity: 2 petabytes (before June 2012), 30 petabytes since. Intra-stamp replication.
    • Incoming read request: arrives at a Front End node and is routed to the partition layer based on http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName. Reads are usually served from memory.
    • Incoming read request: only 2 nodes per request, blazingly fast. Flat "Quantum 10" network topology: 50 Tbps of bandwidth, 10 Gbps per node, faster than a typical SSD drive.
    • Incoming write request: written to the stream layer with synchronous replication to 1 primary extent node and 2 secondary extent nodes.
    • Geo replication: inter-stamp replication between stamps, intra-stamp replication within each stamp.
    • Location Service: stamp allocation. Manages DNS records, determines active/passive stamp configuration, assigns accounts to stamps. Enables geo replication, disaster recovery, load balancing and account migration.
    • Front End Layer
    • Front End Nodes: stateless role instances. Authentication & authorization: Shared Key (storage key), Shared Key Lite, Shared Access Signatures. Routing to partition servers based on the PartitionMap (an in-memory cache). Versioning via the x-ms-version header (e.g. x-ms-version: 2011-08-18). Throttling. (Request sketch below.)
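      A hedged sketch of what a front end node receives: a versioned HTTP request. Here a raw GET with the x-ms-version header against a blob URI that is assumed to carry a Shared Access Signature; the URI and SAS token are placeholders.

          using System;
          using System.Net;

          class FrontEndRequestSketch
          {
              static void Main()
              {
                  // Placeholder URI: account name, partition name (container) and object name,
                  // plus a hypothetical SAS token granting read access.
                  var uri = "https://myaccount.blob.core.windows.net/mycontainer/subdir/file1.jpg?sv=...";

                  var request = (HttpWebRequest)WebRequest.Create(uri);
                  request.Method = "GET";
                  // The version header tells the front end which protocol version to apply.
                  request.Headers["x-ms-version"] = "2011-08-18";

                  using (var response = (HttpWebResponse)request.GetResponse())
                  {
                      Console.WriteLine(response.StatusCode);
                  }
              }
          }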
    • Partition Map: determines the partition server. Used by Front End Nodes, managed by the Partition Manager (partition layer). An ordered dictionary with Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName and Value: partition server, split across partition servers as range partitions (illustrative model below).
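      Purely as an illustration of the range-partition idea (not the real implementation), the map can be modelled as an ordered dictionary from range start keys to the owning partition servers; all names below are made up.

          using System;
          using System.Collections.Generic;
          using System.Linq;

          class PartitionMapSketch
          {
              // Illustrative only: range start key -> partition server owning that range.
              static readonly SortedDictionary<string, string> Map =
                  new SortedDictionary<string, string>(StringComparer.Ordinal)
                  {
                      { "",  "PS-1" },  // everything up to "h"
                      { "h", "PS-2" },  // "h" up to "r"
                      { "r", "PS-3" }   // "r" and beyond
                  };

              static string Lookup(string fullKey)
              {
                  // The owning server is the one with the largest range start <= the key.
                  return Map.Where(e => string.CompareOrdinal(e.Key, fullKey) <= 0)
                            .Last().Value;
              }

              static void Main()
              {
                  Console.WriteLine(Lookup("myaccount/mycontainer/subdir/file1.jpg")); // PS-2
              }
          }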
    • Practical examples, Blob: each blob is a partition. The blob URI determines the range; the range only matters for listing. Same container: more likely the same range (https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg, https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg). Other container: less likely (https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg).
    • Practical examples, Queue: each queue is a partition. The queue name determines the range, all messages live in the same partition, and the range only matters for listing.
    • Practical examples, Table: each group of rows with the same PartitionKey is a partition, ordered within the set by RowKey. The range matters a lot! Expect continuation tokens!
    • Continuation tokens: requests across partitions. If everything lives on the same partition server you get 1 result set. Across multiple ranges (or with too many results) sequential queries are required: each returns a partial result plus a continuation token, and you repeat the query with the token until done (see the sketch below). This will almost never occur in development, as the dataset is typically small.
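      A minimal sketch of the query-until-no-token loop, assuming the Windows Azure storage client library 2.x table API; the table name is hypothetical and development storage is used for illustration.

          using System;
          using Microsoft.WindowsAzure.Storage;
          using Microsoft.WindowsAzure.Storage.Table;

          class ContinuationSketch
          {
              static void Main()
              {
                  var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
                  var table = account.CreateCloudTableClient().GetTableReference("Orders");
                  table.CreateIfNotExists();

                  // Listing without a PartitionKey filter crosses range partitions.
                  var query = new TableQuery<DynamicTableEntity>();

                  TableContinuationToken token = null;
                  do
                  {
                      // One round trip per segment; a segment ends at a range boundary or at 1000 entities.
                      var segment = table.ExecuteQuerySegmented(query, token);
                      foreach (var entity in segment.Results)
                          Console.WriteLine(entity.PartitionKey + " / " + entity.RowKey);

                      // A non-null token means more data remains on another range (or page); repeat with it.
                      token = segment.ContinuationToken;
                  } while (token != null);
              }
          }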
    • Partition Layer
    • Partition Layer: a dynamic scalable object index. Object Tables are system-defined data structures: Account, Entity, Blob, Message, Schema, PartitionMap. Dynamic load balancing: the load on each part of the index is monitored to determine hot spots, the index is dynamically split into thousands of index Range Partitions based on load, and Range Partitions are automatically load balanced across servers to quickly adapt to changes in load; updates every 15 seconds.
    • Partition layer consists of: the Partition Master (distributes Object Table partitions across partition servers, dynamic load balancing), Partition Servers (each serves data for exactly 1 Range Partition and manages streams) and the Lock Server (master election, partition locks).
    • Dynamic Scalable Object Index, how it works: (diagram) the object index (Account, Container, Blob; e.g. Harry/Pictures/Sunrise, Richard/Images/Soccer) is split into key ranges such as A-H, H'-R and R'-Z, each served by its own partition server; Front End nodes use the partition map to route each request to the right server.
    • Dynamic load balancing: not in the critical path!
    • Partition Server architecture: a Log Structured Merge-Tree (easy replication). Append-only persisted streams: Commit Log, Metadata Log, Row Data (snapshots) and Blob Data (the actual blob data). Caches in memory: the memory table (= commit log), row data cache, index (snapshot locations), bloom filters and load metrics.
    • Incoming read request, HOT data: served completely from memory (memory table, row data cache, index, bloom filters, load metrics).
    • Incoming read request, COLD data: served partially from memory, partially from the persistent row data and blob data streams.
    • Incoming write request: persisted first (commit log stream), then applied in memory (memory table).
    • Checkpoint process: not in the critical path! (Diagram: the in-memory commit log/memory table is checkpointed into the persistent row data stream.)
    • Geo replication process: not in the critical path! (Diagram: one partition server's memory and persistent streams are replicated to a partition server in another stamp.)
    • Stream Layer
    • Stream layer: an append-only distributed file system. Streams are very large files with a file-system-like directory namespace. Stream operations: open, close, delete, rename, concatenate streams together, append for writing, random reads.
    • Stream layer concepts. Block: the minimum unit of write/read, checksummed, up to N bytes (e.g. 4 MB). Extent: the unit of replication, a sequence of blocks, with a size limit (e.g. 1 GB), sealed or unsealed. Stream (e.g. //foo/myfile.data): a hierarchical namespace entry holding an ordered list of pointers to extents; supports append and concatenate.
    • Stream layer consists of: a Paxos system (distributed consensus protocol) backing the Stream Master, which manages extents and runs optimization routines, and Extent Nodes with attached disks.
    • Extent creation, allocation process: the partition server requests extent/stream creation; the Stream Master chooses 3 random extent nodes based on available capacity, chooses the primary and secondaries, and allocates the replica set. Random allocation reduces Mean Time To Recovery.
    • Replication process: bitwise equality on all nodes, synchronous. The append request goes to the primary, is written to disk with a checksum, the offset and data are forwarded to the secondaries, then the ack goes back. Journaling: a dedicated disk (spindle/SSD) for writes, simultaneously copied to the data disk (from memory or journal).
    • Dealing with failure. If the ack from the primary is lost on the way back to the partition layer, a retry from the partition layer can cause multiple blocks to be appended (duplicate records). On an unresponsive/unreachable extent node (EN): seal the failed extent, allocate a new extent and append immediately. (Diagram: stream //foo/myfile.data with a new extent appended after the sealed one.)
    • Replication failures, scenario 1: (diagram) the reachable replicas all report the same commit length (120), so the Stream Master seals the extent at that length.
    • Replication failures, scenario 2: (diagram) the reachable replicas report different commit lengths (120 vs 100); the Stream Master seals the extent at the smallest commit length reported by the available replicas.
    • Network partitioning during read, data stream: the partition layer only reads from offsets returned by successful appends, which are committed on all replicas. For row and blob data streams the offset is therefore valid on any replica and safe to read.
    • Network partitioning during read, log stream: logs are used on partition load. Check the commit length first; only read from an unsealed replica if all replicas have the same commit length, otherwise read from a sealed replica.
    • Garbage collection, the downside to this architecture: lots of wasted disk space ('forgotten' streams, 'forgotten' extents, unused blocks in extents after write errors). Garbage collection is required: partition servers mark the streams they use, the stream manager cleans up unreferenced streams and extents, and the stream manager performs erasure coding (Reed-Solomon algorithm), which allows recreation of cold streams after deleting extents.
    • Beating the CAP theorem: layering & co-design. Stream layer: availability with partition/failure tolerance; for consistency, replicas are bit-wise identical up to the commit length. Partition layer: consistency with partition/failure tolerance; for availability, RangePartitions can be served by any partition server and are moved to available servers if a partition server fails. Designed for the specific classes of partitioning/failures seen in practice: process, disk, node and rack failures/unresponsiveness, and node-to-rack-level network partitioning.
    • Best practices
    • Partitioning Strategy, Tables: great for writing data, horrible for complex querying. Data is automatically spread across extent nodes and indexes are automatically spread across partition servers, so queries to many servers may be required to list data.
    • Partitioning Strategy: insert all aggregates in a dedicated partition (batch sketch below). Aggregate: records that belong together as a unit, e.g. an order with its order lines. Dedicated: a unique combination of partition parts (account, table, partition key) for the best insert performance. Use Entity Group Transactions: all aggregate records are inserted in one go, transactionally. Use the RowKey only for ordering; it can be blank.

      Account  Table   PK      RowKey  Properties
      Europe   Orders  Order1          Order 1 data
      Europe   Orders  Order1  01      Orderline 1 data
      Europe   Orders  Order1  02      Orderline 2 data
      Europe   Orders  Order2          Order 2 data
      Europe   Orders  Order2  01      Orderline 1 data
      Europe   Orders  Order2  02      Orderline 2 data
      US       Orders  Order3          Order 3 data
      US       Orders  Order3  01      Orderline 1 data
      US       Orders  Order3  02      Orderline 2 data
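      A minimal sketch of inserting one aggregate as a single entity group transaction, assuming the storage client library 2.x table API; keys mirror the example table above and development storage is used for illustration.

          using Microsoft.WindowsAzure.Storage;
          using Microsoft.WindowsAzure.Storage.Table;

          class AggregateInsertSketch
          {
              static void Main()
              {
                  var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
                  var table = account.CreateCloudTableClient().GetTableReference("Orders");
                  table.CreateIfNotExists();

                  var order = new DynamicTableEntity("Order1", string.Empty);  // RowKey can be blank
                  order.Properties["Data"] = EntityProperty.GeneratePropertyForString("Order 1 data");
                  var line1 = new DynamicTableEntity("Order1", "01");
                  line1.Properties["Data"] = EntityProperty.GeneratePropertyForString("Orderline 1 data");
                  var line2 = new DynamicTableEntity("Order1", "02");
                  line2.Properties["Data"] = EntityProperty.GeneratePropertyForString("Orderline 2 data");

                  // Every entity shares the same PartitionKey, so the whole aggregate is
                  // committed atomically as one entity group transaction (max 100 entities).
                  var batch = new TableBatchOperation();
                  batch.Insert(order);
                  batch.Insert(line1);
                  batch.Insert(line2);
                  table.ExecuteBatch(batch);
              }
          }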
    • Partitioning Strategy: Query By = Store By. Do not query across partitions, it hurts! Instead store a copy of the data to query in 1 partition. PartitionKey: a combination of the values you query by, or well-known identifiers. RowKey: use it for ordering; it must be unique in combination with the PK. This guarantees fast retrieval: every query hits 1 partition server.

      Acct  Table   PK                   RK    Properties
      Yves  Orders  0634911534000000000  1     Order 1 data
      Yves  Orders  0634911534000000000  1_01  Orderline 1 data
      Yves  Orders  0634911534000000000  1_02  Orderline 2 data
      Yves  Orders  0634911534000000000  3     Order 3 data
      Yves  Orders  0634911534000000000  3_01  Orderline 1 data
      Yves  Orders  0634911534000000000  3_02  Orderline 2 data
      Yves  Orders  0634911666000000000  2     Order 2 data
      Yves  Orders  0634911666000000000  2_01  Orderline 1 data
      Yves  Orders  0634911666000000000  2_02  Orderline 2 data
    • Partitioning Strategy: DO denormalize! It's not a sin! Limit the dataset to the values required by the query, do precompute, do format up front, etc.

      Acct  Table   PK                   RK  Properties
      Yves  Orders  0634911534000000000  1   €200
      Yves  Orders  0634911534000000000  3   $500
      Yves  Orders  0634911666000000000  2   €1000
    • Bonus tip, great partition keys: any combination of properties of the object. Who: e.g. the current user id. When: string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks). Where: the geohash algorithm (geohash.org). Why: flow context through operations/messages. Add all partition key values as properties to the original aggregate root, or updates will be your next nightmare! (Key sketch below.)
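      A minimal sketch of a "who + when" partition key along those lines; the user id parameter is an assumption.

          using System;

          static class PartitionKeys
          {
              // "Who + when": the current user plus reverse ticks, so newer rows sort first.
              // D19 keeps the number fixed-width and therefore lexicographically sortable.
              public static string WhoWhen(string userId)
              {
                  string when = string.Format("{0:D19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks);
                  return userId + "_" + when;
              }
          }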
    • Change Propagation Strategy: use queued messaging. There is NO transaction support across partitions, so split a multi-partition update into separate atomic operations, rely on the built-in retry mechanism of storage queues, and use compensation logic if needed. Example: update the primary records in an entity group transaction, publish an 'Updated' event through queues, maintain the list of interested nodes in storage, and let the secondary nodes update their indexes in response (see the sketch below).
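      A hedged sketch of that hand-off using plain storage queues (the deck itself does this through NServiceBus); the queue name and message contents are illustrative and the calls assume the storage client library 2.x.

          using Microsoft.WindowsAzure.Storage;
          using Microsoft.WindowsAzure.Storage.Queue;

          class ChangePropagationSketch
          {
              static void Main()
              {
                  var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
                  var queue = account.CreateCloudQueueClient().GetQueueReference("order-updated");
                  queue.CreateIfNotExists();

                  // Primary node: after the entity group transaction commits, publish the event.
                  queue.AddMessage(new CloudQueueMessage("Order1"));

                  // Secondary node: receive, update its own partition/index, then delete.
                  // If the update fails, the message becomes visible again (built-in retry).
                  var message = queue.GetMessage();
                  if (message != null)
                  {
                      // ... update the secondary copy for message.AsString here ...
                      queue.DeleteMessage(message);
                  }
              }
          }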
    • Change Propagation Strategy: sounds hard, but don't worry. I've gone through all the hard work already and contributed it back to the community: NServiceBus. Bonus: it works very well to update web role caches, and it is very nice in combination with SignalR! As a service: www.messagehandler.net
    • Counting: counting records across partitions is a PITA, because querying is… and there is no built-in support. Do it up front: I typically count 'created' events in flight, it's just another handler. BUT… +1 is not an idempotent operation, so keep the count atomic: no other operation except +1 & save. Use sharded counters: every thread on every node 'owns' a shard of the counter (preventing concurrency conflicts), all shards live in a single partition, and you roll up when querying and/or at a regular interval.
    • Counting: counting actions in Goeleven.com.

          public class CountActions : IHandleMessages<ActionCreated>
          {
              public void Handle(ActionCreated message)
              {
                  string identifier = RoleEnvironment.CurrentRoleInstance.Id + "." + Thread.CurrentThread.ManagedThreadId;
                  var shard = counterShardRepository.TryGetShard("Actions", identifier);
                  if (shard == null)
                  {
                      shard = new CounterShard { PartitionKey = "Actions", RowKey = identifier, Partial = 1 };
                      counterShardRepository.Add(shard);
                  }
                  else
                  {
                      shard.Partial += 1;
                      counterShardRepository.Update(shard);
                  }
                  counterShardRepository.SaveAllChanges();
              }
          }

      Acct  Table     PK       RK                        Properties
      My    Counters  Actions  RollUp                    2164
      My    Counters  Actions  Goeleven.Worker_IN_0.11   12
      My    Counters  Actions  Goeleven.Worker_IN_1.56   15
    • Counting: rollup.

          public class Counter
          {
              public long Value { get; private set; }

              public Counter(IEnumerable<CounterShard> counterShards)
              {
                  foreach (var counterShard in counterShards)
                  {
                      Value += counterShard.Partial;
                  }
              }
          }

      Acct  Table     PK       RK                        Properties
      My    Counters  Actions  RollUp                    2164
      My    Counters  Actions  Goeleven.Worker_IN_0.11   12
      My    Counters  Actions  Goeleven.Worker_IN_1.56   15
      My    Counters  Actions                            2181
    • Master Election: scheduling in a multi-node environment (like the rollup). All nodes kick in together doing the same thing, so master election is required to prevent conflicts. Remember the lock service? It surfaces through blob storage as a 'lease', so you can use it too!
    • Master Election: try to acquire a lease; on success start work, and hold on to the lease until done.

          var leaseId = blob.TryAcquireLease();
          if (leaseId != null)
          {
              // Keep the lease alive while working: renew before it expires.
              var renewalThread = new Thread(() =>
              {
                  while (true)
                  {
                      Thread.Sleep(TimeSpan.FromSeconds(40));
                      blob.RenewLease(leaseId);
                  }
              });
              renewalThread.Start();

              // do your thing

              renewalThread.Abort();
              blob.ReleaseLease(leaseId);
          }

      Find the blob lease extension methods on http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
    • Wrapup
    • What we discussed. Deep Dive: Global Namespace, Front End Layer, Partition Layer, Stream Layer. Best Practices: Table Partitioning Strategy, Change Propagation Strategy, Counting Strategy, Master Election Strategy.
    • Resources, want to know more?
      White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
      Video: http://www.youtube.com/watch?v=QnYdbQO0yj4
      More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx
      Blob extensions: http://blog.smarx.com/posts/leasing-windows-azure-blobs-using-the-storage-client-library
      NServiceBus: https://github.com/nservicebus/nservicebus
      See it in action: www.goeleven.com, www.messagehandler.net
    • Q&A