Storage Services
Yves Goeleven
Still alive?
Yves Goeleven
Founder of MessageHandler.net
• Windows Azure MVP
Exhibition theater @ Kinepolis
www.messagehandler.net
What you might have expected
Azure Storage Services
• VM (Disk persistence)
• Azure Files
• CDN
• StorSimple
• Backup & replicas
• Application storage
• Much, much more
I assume you know what Storage Services does!
The real agenda
Deep Dive Architecture
• Global Namespace
• Front End Layer
• Partition Layer
• Stream Layer
WARNING!
Level 666
Which discusses how
Storage defies the rules of the CAP theorem
• Distributed & scalable system
• Highly available
• Strongly Consistent
• Partition tolerant
• All at the same time!
That can’t be done, right?
Let’s dive in
Global namespace
Global namespace
Every request gets sent to
• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Global namespace
DNS based service location
• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
• DNS resolution
• Datacenter (e.g. Western Europe)
• Virtual IP of a Stamp
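A minimal sketch of how the global namespace encodes routing information, using a made-up account and blob name purely for illustration:

```python
from urllib.parse import urlparse

def split_storage_uri(uri: str):
    parsed = urlparse(uri)
    account, service = parsed.netloc.split(".")[:2]    # AccountName.Service.core.windows.net
    partition_and_object = parsed.path.lstrip("/")     # PartitionName/ObjectName
    return account, service, partition_and_object

# DNS resolves the host name to the virtual IP of the stamp that holds the account;
# the path then locates the partition (and object) inside that stamp.
print(split_storage_uri("https://myaccount.blob.core.windows.net/mycontainer/photos/file1.jpg"))
```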
Stamp
Scale unit
• A Windows Azure deployment
• 3 layers
• Typically 20 racks each
• Roughly 360 XL instances
• With disks attached
• 2 Petabyte (before June 2012)
• 30 Petabyte
• Intra-stamp replication
Incoming read request
Arrives at Front End
• Routed to partition layer
• http(s)://AccountName.Service.core.windows.net/Pa...
Incoming read request
2 nodes per request
• Blazingly fast
• Flat “Quantum 10” network topology
• 50 Tbps of bandwidth
• 10 Gbps / node
• Faster than a typical SSD drive
Incoming Write request
Written to stream layer
• Synchronous replication
• 1 Primary Extent Node
• 2 Secondary Extent Nodes
Geo replication
Intra-stamp replication
Inter-stamp replication
Location Service
Stamp Allocation
• Manages DNS records
• Determines active/passive stamp configuration
• Assigns accounts to stamps
• Geo Replication
• Disaster Recovery
• Load Balancing
• Account Migration
Front end layer
Front End Nodes
Stateless role instances
• Authentication & Authorization
• Shared Key (storage key)
• Shared Access Signatures
• Routing to Partition Servers
• Based on PartitionMap (in-memory cache)
• Versioning
• x-ms-version: 2011-08-18
• Throttling
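A simplified sketch of the Shared Key check the Front End performs: it recomputes an HMAC-SHA256 over a canonicalized representation of the request using the (Base64-encoded) storage key and compares it to the signature in the Authorization header. The string-to-sign below is a placeholder, not the real canonicalization rules.

```python
import base64, hashlib, hmac

def sign_shared_key(account: str, key_b64: str, string_to_sign: str) -> str:
    key = base64.b64decode(key_b64)   # the storage key is Base64-encoded
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return f"SharedKey {account}:{base64.b64encode(digest).decode()}"

# Illustrative only: a fake key and a heavily simplified string-to-sign.
demo_key = base64.b64encode(b"not-a-real-storage-key").decode()
print(sign_shared_key("myaccount", demo_key, "GET\n\n/myaccount/mycontainer/file1.jpg"))
```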
Partition Map
Determine partition server
• Used By Front End Nodes
• Managed by Partition Manager (Partition Layer)
• Ordered Dictionary
• Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
• Value: Partition Server
• Split across partition servers
• Range Partition
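A minimal sketch of a range-partitioned lookup like the one the Front End performs against the partition map; the range boundaries and server names are made up for illustration.

```python
import bisect

# Each entry covers [split point, next split point) in lexicographic key order.
range_starts = ["", "h", "r"]            # hypothetical split points
servers      = ["PS1", "PS2", "PS3"]     # hypothetical partition servers

def lookup_partition_server(full_name: str) -> str:
    """Find the partition server whose key range contains AccountName/PartitionName/ObjectName."""
    i = bisect.bisect_right(range_starts, full_name.lower()) - 1
    return servers[i]

print(lookup_partition_server("myaccount/mycontainer/subdir/file1.jpg"))  # -> PS2
```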
Practical examples
Blob
• Each blob is a partition
• Blob uri determines range
• Range only matters for listing
• Same container: more likely same range
• https://myaccount.blob.core.windows.net/mycontainer/subdir/file1.jpg
• https://myaccount.blob.core.windows.net/mycontainer/subdir/file2.jpg
• Other container: less likely
• https://myaccount.blob.core.windows.net/othercontainer/subdir/file1.jpg
Practical examples
Queue
• Each queue is a partition
• Queue name determines range
• All messages same partition
• Range only matters for listing queues
Practical examples
Table
• Each group of rows with same PartitionKey is a partition
• Ordering within set by RowKey
• Range matters a lot!
• Expect continuation tokens!
Continuation tokens
Requests across partitions
• All on same partition server
• 1 result
• Across multiple ranges (or too many results)
• Sequential queries required
• 1 result + continuation token
• Repeat query with token until done
• Will almost never occur in development!
• As dataset is typically small
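A minimal sketch of chasing continuation tokens until a cross-range query is complete; `query_table` and the fake pager are hypothetical stand-ins for a table client.

```python
def query_all(query_table, filter_expr: str):
    """Drive a cross-range query to completion by repeating it with each token."""
    results, token = [], None
    while True:
        page, token = query_table(filter_expr, continuation=token)
        results.extend(page)
        if token is None:          # no further ranges / pages
            break
    return results

# Fake pager standing in for the table service: two pages, then done.
def fake_query_table(filter_expr, continuation=None):
    pages = {None: (["row1", "row2"], "token-1"), "token-1": (["row3"], None)}
    return pages[continuation]

print(query_all(fake_query_table, "PartitionKey ge 'A'"))   # ['row1', 'row2', 'row3']
```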
Partition layer
Partition layer
Dynamic Scalable Object Index
• Object Tables
• System defined data structures
• Account, Entity, Blob, Message, Schema, PartitionMap
• Dynamic Load Balancing
• Monitors load on each part of the index to determine hot spots
• Index is dynamically split into thousands of Index Range Partitions based on load
• Index Range Partitions are automatically load balanced across servers to quickly adapt to changes in load
• Updates every 15 seconds
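A toy sketch of the split-on-load idea behind dynamic load balancing; the threshold, median-key split and data shapes are illustrative assumptions, not the real policy.

```python
def rebalance(partitions):
    """partitions: list of dicts like {'range': (lo, hi), 'keys': [...], 'load': float}."""
    hot = max(partitions, key=lambda p: p["load"])
    if hot["load"] < 1000:                  # illustrative requests/sec threshold
        return partitions
    keys = sorted(hot["keys"])
    split = keys[len(keys) // 2]            # median key becomes the new range boundary
    lo, hi = hot["range"]
    left  = {"range": (lo, split), "keys": [k for k in keys if k <  split], "load": hot["load"] / 2}
    right = {"range": (split, hi), "keys": [k for k in keys if k >= split], "load": hot["load"] / 2}
    return [p for p in partitions if p is not hot] + [left, right]

print(rebalance([{"range": ("a", "z"), "keys": list("abcdefgh"), "load": 2500}]))
```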
Consists of
• Partition Master
• Distributes Object Tables
• Partitions across Partition Servers
• Dynamic Load Balancing
• Partition Server
• Serves data for exactly 1 Range Partition
• Manages Streams
• Lock Server
• Master Election
• Partition Lock
How it works
Dynamic Scalable Object Index
[Diagram: Front End nodes (FE) consult the partition map to route to Partition Servers (PS), coordinated by Partition Masters (PM) and Lock Servers (LS); the object index (Account, Container, Blob) is split into ranges such as A-H, H'-R and R'-Z.]
Not in the critical path!
Dynamic Load Balancing
[Diagram: Partition Masters and Lock Servers steer load balancing across Partition Servers and Extent Nodes without sitting on the request path.]
Architecture
• Log Structured Merge-Tree
• Easy replication
• Append only persisted streams
• Commit log
• Metadata log
• Row Data (snapshots)
• Blob Data (actual blob data)
• Caches in memory
• Memory Table (= commit log)
• Row Data
• Index (snapshot location)
• Bloom filter
• Load Metrics
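A minimal sketch of the log-structured merge flow described above, using plain Python lists and dicts as stand-ins for the memory table and the persisted streams; it compresses the real design (no metadata log, caches or bloom filters).

```python
class RangePartition:
    def __init__(self):
        self.commit_log = []      # append-only commit log stream (persisted first)
        self.memory_table = {}    # in-memory view of recent writes
        self.row_data = {}        # checkpointed snapshots (row data stream)

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to the commit log
        self.memory_table[key] = value         # 2. then update memory

    def read(self, key):
        # HOT data: served entirely from memory; COLD data: fall back to
        # the checkpointed row data stream.
        if key in self.memory_table:
            return self.memory_table[key]
        return self.row_data.get(key)

    def checkpoint(self):
        # Off the critical path: flush the memory table to row data,
        # then the commit log can be trimmed.
        self.row_data.update(self.memory_table)
        self.memory_table.clear()
        self.commit_log.clear()

p = RangePartition()
p.write("user:1", {"name": "Harry"})
print(p.read("user:1"))       # hot: served from memory
p.checkpoint()
print(p.read("user:1"))       # cold: served from the checkpointed row data
```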
HOT Data
• Completely in memory
Incoming Read request
[Diagram: the read is answered entirely from the in-memory structures (Memory Table, Row Cache); the persisted streams are not touched.]
COLD Data
• Partially in memory
Incoming Read request
[Diagram: the read misses the in-memory structures and falls through to the persisted Row Data / Blob Data Streams.]
Persistent First
• Followed by memory
Incoming Write Request
[Diagram: the write is appended to the Commit Log Stream first, then applied to the Memory Table.]
Not in critical path!
Checkpoint process
[Diagram: the checkpoint flushes the Memory Table into the Row Data Stream, off the critical path.]
Not in critical path!
Geo replication process
[Diagram: committed changes are shipped asynchronously from the local stamp to the secondary stamp, off the critical path.]
Stream layer
Append-Only Distributed File System
• Streams are very large files
• File system like directory namespace
• Stream Operations
• Open, Close, Delete Streams
• Rename Streams
• Concatenate Streams together
• Append for writing
• Random reads
Concepts
• Block
• Min unit of write/read
• Checksum
• Up to N bytes (e.g. 4MB)
• Extent
• Unit of replication
• Sequence of blocks
• Size limit (e.g. 1GB)
• Sealed/unsealed
• Stream
• Hierarchical namespace
• Ordered list of pointers to extents
• Append/Concatenate
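A minimal sketch of the block/extent/stream concepts, with the illustrative size limits (4 MB blocks, 1 GB extents) taken from the examples above; CRC32 stands in for whatever checksum the service actually uses.

```python
from dataclasses import dataclass, field
import zlib

BLOCK_LIMIT = 4 * 1024 * 1024          # example block size limit
EXTENT_LIMIT = 1024 * 1024 * 1024      # example extent size limit

@dataclass
class Block:                            # min unit of read/write, carries a checksum
    data: bytes
    checksum: int = 0
    def __post_init__(self):
        assert len(self.data) <= BLOCK_LIMIT
        self.checksum = zlib.crc32(self.data)

@dataclass
class Extent:                           # unit of replication: a sequence of blocks
    blocks: list = field(default_factory=list)
    sealed: bool = False
    def append(self, block: Block):
        assert not self.sealed          # only unsealed extents accept appends
        self.blocks.append(block)

@dataclass
class Stream:                           # e.g. "//foo/myfile.data": ordered extent pointers
    name: str
    extents: list = field(default_factory=list)

s = Stream("//foo/myfile.data")
e = Extent()
e.append(Block(b"hello world"))
s.extents.append(e)
```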
Consists of
• Paxos System
• Distributed consensus protocol
• Stream Master
• Extent Nodes
• Attached Disks
• Manages Extents
• Optimization Routines
Allocation process
• Partition Server
• Request extent / stream creation
• Stream Master
• Chooses 3 random Extent Nodes based on available capacity
• Chooses Primary & Secondaries
• Allocates replica set
• Random allocation
• To reduce Mean Time To Recovery
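A minimal sketch of the allocation step, assuming a made-up list of Extent Nodes with free-capacity figures and a simple capacity check; the real selection policy is more involved.

```python
import random

extent_nodes = {"EN1": 800, "EN2": 650, "EN3": 900, "EN4": 720, "EN5": 500}  # GB free (illustrative)

def allocate_replica_set(nodes, k=3):
    candidates = [n for n, free in nodes.items() if free > 100]  # enough capacity left
    chosen = random.sample(candidates, k)                        # random spread -> lower MTTR
    return {"primary": chosen[0], "secondaries": chosen[1:]}

print(allocate_replica_set(extent_nodes))
```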
Bitwise equality on all nodes
• Synchronous
• Append request to primary
• Written to disk, checksum
• Offset and data forwarded to secondaries
• Ack
Journaling
• Dedicated disk (spindle/SSD)
• For writes
• Simultaneously copied to the data disk (from memory or journal)
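A minimal sketch of the synchronous replication flow: the primary persists the append, forwards the offset and data to both secondaries, and only acknowledges to the partition layer when every replica holds the same bytes at the same offset. Journaling and checksum verification are reduced to a CRC for illustration.

```python
import zlib

class ExtentReplica:
    def __init__(self):
        self.blocks = []                       # (offset, data, checksum)
    def append_at(self, offset, data):
        assert offset == len(self.blocks)      # appends are strictly ordered
        self.blocks.append((offset, data, zlib.crc32(data)))
        return True

def replicated_append(primary, secondaries, data):
    offset = len(primary.blocks)
    primary.append_at(offset, data)            # write + checksum on the primary
    acks = [s.append_at(offset, data) for s in secondaries]
    return all(acks)                           # ack the partition layer only if all replicas succeeded

primary, secs = ExtentReplica(), [ExtentReplica(), ExtentReplica()]
assert replicated_append(primary, secs, b"record-1")
```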
Dealing with failure
Ack from primary lost going back to partition layer
• Retry from partition layer
• Can cause multiple blocks to be appended (duplicate records)
Unresponsive/unreachable Extent Node (EN)
• Seal the failed extent & allocate a new extent and append immediately
Replication Failures (Scenario 1)
[Diagram: replication failure, scenario 1: the reachable replicas report the same commit length (120), so the Stream Master seals the extent at 120.]
Replication Failures (Scenario 2)
[Diagram: replication failure, scenario 2: the replicas report different commit lengths (120 vs 100), so the Stream Master seals the extent at the shorter length, 100.]
Data Stream
• Partition Layer only reads from offsets returned from successful appends
• Committed on all replicas
• Row and Blob Data Streams
• Offset valid on any replica
• Safe to read
Log Stream
• Logs are used on partition load
• Check commit length first
• Only read from
• Unsealed replica if all replicas have the same commit length
• A sealed replica
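A minimal sketch of the log-read rule above: on partition load, check the commit length of each replica, read from any (even unsealed) replica when they all agree, and otherwise fall back to a sealed replica. The replica records are illustrative dictionaries.

```python
def choose_log_replica(replicas):
    """replicas: list of dicts like {'commit_length': int, 'sealed': bool}."""
    lengths = {r["commit_length"] for r in replicas}
    if len(lengths) == 1:                  # all replicas agree on commit length
        return replicas[0]                 # safe to read any, even an unsealed one
    for r in replicas:
        if r["sealed"]:                    # disagreement: only trust a sealed replica
            return r
    raise RuntimeError("no sealed replica available yet")

print(choose_log_replica([{"commit_length": 120, "sealed": True},
                          {"commit_length": 100, "sealed": False}]))
```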
Downside to this architecture
• Lots of wasted disk space
• ‘Forgotten’ streams
• ‘Forgotten’ extents
• Unused blocks in extents after write errors
• Garbage collection required
• Partition servers mark used streams
• Stream Manager cleans up unreferenced streams
• Stream Manager cleans up unreferenced extents
• Stream Manager performs erasure coding
• Reed-Solomon algorithm
• Allows recreation of cold streams after deleting extents
Wrapup
Layering & co-design
• Stream Layer
• Availability with Partition/failure tolerance
• For Consistency, replicas are bit-wise identical up to the commit length
• Partition Layer
• Consistency with Partition/failure tolerance
• For Availability, RangePartitions can be served by any partition server
• Are moved to available servers if a partition server fails
• Designed for specific classes of partitioning/failures seen in practice
• Process to Disk to Node to Rack failures/unresponsiveness
• Node to Rack level network partitioning
Want to know more?
• White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
• Video: http://www.youtube.com/watch?v=QnYdbQO0yj4
• More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx
• Erasure coding: http://www.snia.org/sites/default/files2/SDC2012/presentations/Cloud/ChengHuang_Ensuring_Code_in_Windows_Azure.pv3.pdf
Still alive?
And take home the Lumia 1320
Present your feedback form when you exit the last session & go for the drink
Give Me Feedback
Follow Technet Belgium
@technetbelux
Subscribe to the TechNet newsletter
aka.ms/benews
Be the first to know
Belgium’s biggest IT PRO Conference