SlideShare a Scribd company logo
1 of 61
Download to read offline
Storage Services
Yves Goeleven
Still alive?
Yves Goeleven
Founder of MessageHandler.net
• Windows Azure MVP
Exhibition theater @ kinepolis
Azure storage deep dive
www.messagehandler.net
What you might have expected
Azure Storage Services
• VM (Disk persistence)
• Azure Files
• CDN
• StoreSimple
• Backup & replica’s
• Application storage
• Much much more...
I assume you know what storage
services does!
The real agenda
Deep Dive Architecture
• Global Namespace
• Front End Layer
• Partition Layer
• Stream Layer
WARNING!
Level 666
Azure storage deep dive
Which discusses how
Storage defies the rules of the CAP theorem
• Distributed & scalable system
• Highly available
• Strongly Consistent
• Partition tolerant
• All at the same time!
That can’t be done, right?
Let’s dive in
Global namespace
Global namespace
Every request gets sent to
• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
Global namespace
DNS based service location
• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
• DNS resolution
• Datacenter (f.e. Western Europe)
• Virtual IP of a Stamp
Stamp
Scale unit
• A windows azure deployment
• 3 layers
• Typically 20 racks each
• Roughly 360 XL instances
• With disks attached
• 2 Petabyte (before june 2012)
• 30 Petabyte
Intra-stamp replication
Incoming read request
Arrives at Front End
• Routed to partition layer
• http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
• Reads usually from memory
Partition Layer
Front End Layer
FE FE FE FE FE FE FE FE FE
PS PS PS PS PS PS PS PS PS
Incoming read request
2 nodes per request
• Blazingly fast
• Flat “Quantum 10” network topology
• 50 Tbps of bandwidth
• 10 Gbps / node
• Faster than typical ssd drive
Incoming Write request
Written to stream layer
• Synchronous replication
• 1 Primary Extend Node
• 2 Secondary Extend Nodes
Front End Layer
Stream Layer
Partition Layer
Primary Secondary Secondary
PS PS PS PS PS PS PS PS PS
EN EN EN EN EN EN EN
Geo replication
Intra-stamp replicationIntra-stamp replication
Inter-stamp
replication
Location Service
Stamp Allocation
• Manages DNS records
• Determines active/passive stamp configuration
• Assigns accounts to stamps
• Geo Replication
• Disaster Recovery
• Load Balancing
• Account Migration
Front end layer
Front End Nodes
Stateless role instances
• Authentication & Authorization
• Shared Key (storage key)
• Shared Access Signatures
• Routing to Partition Servers
• Based on PartitionMap (in memory cache)
• Versioning
• x-ms-version: 2011-08-18
• Throttling
Front End Layer
FE FE FE FE FE FE FE FE FE
Partition Map
Determine partition server
• Used By Front End Nodes
• Managed by Partition Manager (Partition Layer)
• Ordered Dictionary
• Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
• Value: Partition Server
• Split across partition servers
• Range Partition
Practical examples
Blob
• Each blob is a partition
• Blob uri determines range
• Range only matters for listing
• Same container: More likely same range
• https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg
• https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg
• Other container: Less likely
• https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg
PS PS PS
Practical examples
Queue
• Each queue is a partition
• Queue name determines range
• All messages same partition
• Range only matters for listing queues
PS PS PS
Practical examples
Table
• Each group of rows with same PartitionKey is a partition
• Ordering within set by RowKey
• Range matters a lot!
• Expect continuation tokens!
PS PS PS
Continuation tokens
Requests across partitions
• All on same partition server
• 1 result
• Across multiple ranges (or to many results)
• Sequential queries required
• 1 result + continuation token
• Repeat query with token until done
• Will almost never occur in development!
• As dataset is typically small
PS PS PS
Partition layer
Partition layer
Dynamic Scalable Object Index
• Object Tables
• System defined data structures
• Account, Entity, Blob, Message, Schema, PartitionMap
• Dynamic Load Balancing
• Monitor load to each part of the index to determine hot spots
• Index is dynamically split into thousands of Index Range Partitions based on
load
• Index Range Partitions are automatically load balanced across servers to
quickly adapt to changes in load
• Updates every 15 seconds
Consists of
• Partition Master
• Distributes Object Tables
• Partitions across Partition Servers
• Dynamic Load Balancing
• Partition Server
• Serves data for exactly 1 Range Partition
• Manages Streams
• Lock Server
• Master Election
• Partition Lock
Partition layer
Partition Layer
PS PS PS
PM PM
LS LS
How it works
Dynamic Scalable Object Index
Partition Layer
PS PS PS
PM PM
LS LS
Front End Layer
FE FE FE FE
Account Container Blob
Aaaa Aaaa Aaaa
Harry Pictures Sunrise
Harry Pictures Sunset
Richard Images Soccer
Richard Images Tennis
Zzzz Zzzz zzzz
A-H
H’-R
R’-Z
Map
Not in the critical path!
Dynamic Load Balancing
Stream Layer
Partition Layer
Primary Secondary Secondary
PS PS PS PS PS PS PS PS PS
EN EN EN EN EN EN EN
PM PM
LS LS
Architecture
• Log Structured Merge-Tree
• Easy replication
• Append only persisted streams
• Commit log
• Metadata log
• Row Data (Snapshots)
• Blob Data (Actual blob data)
• Caches in memory
• Memory table (= Commit log)
• Row Data
• Index (snapshot location)
• Bloom filter
• Load Metrics
Partition Server
Memory Data
Persistent Data
Commit Log Stream
Metadata Log Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache
Index
Cache
Bloom
Filters
Load Metrics
HOT Data
• Completely in memory
Incoming Read request
Memory Data
Persistent Data
Commit Log Stream
Metadata Log Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache Index
Cache
Bloom
Filters
Load Metrics
COLD Data
• Partially Memory
Incoming Read request
Memory Data
Persistent Data
Commit Log Stream
Metadata Log Stream
Row Data StreamRow Data Stream
Blob Data Stream
Memory
Table
Row
Cache Index
Cache
Bloom
Filters
Load Metrics
Persistent First
• Followed by memory
Incoming Write Request
Memory Data
Persistent Data
Commit Log StreamCommit Log Stream
Metadata Log Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache Index
Cache
Bloom
Filters
Load Metrics
Not in critical path!
Checkpoint process
Memory Data
Persistent Data
Commit Log StreamCommit Log Stream
Metadata Log Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache Index
Cache
Bloom
Filters
Load Metrics
Row Data StreamCommit Log Stream
Not in critical path!
Geo replication process
Memory Data
Persistent Data
Commit Log StreamCommit Log Stream
Metadata Log
Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache
Index
Cache
Bloom
Filters
Load Metrics
Memory Data
Persistent Data
Commit Log StreamCommit Log Stream
Metadata Log
Stream
Row Data Stream
Blob Data Stream
Memory
Table
Row
Cache
Index
Cache
Bloom
Filters
Load Metrics
Row Data Stream
Stream layer
Append-Only Distributed File System
• Streams are very large files
• File system like directory namespace
• Stream Operations
• Open, Close, Delete Streams
• Rename Streams
• Concatenate Streams together
• Append for writing
• Random reads
Stream layer
Concepts
• Block
• Min unit of write/read
• Checksum
• Up to N bytes (e.g. 4MB)
• Extent
• Unit of replication
• Sequence of blocks
• Size limit (e.g. 1GB)
• Sealed/unsealed
• Stream
• Hierarchical namespace
• Ordered list of pointers to extents
• Append/Concatenate
Stream layer
Extent
Stream //foo/myfile.data
Extent Extent Extent
Pointer 1 Pointer 2 Pointer 3
Block
Block
Block
Block
Block
Block
Block
Block
Block
Block
Block
Consists of
• Paxos System
• Distributed consensus protocol
• Stream Master
• Extend Nodes
• Attached Disks
• Manages Extents
• Optimization Routines
Stream layer
Stream Layer
EN EN EN EN
SM
SM
SM
Paxos
Allocation process
• Partition Server
• Request extent / stream creation
• Stream Master
• Chooses 3 random extend nodes based
on available capacity
• Chooses Primary & Secondary's
• Allocates replica set
• Random allocation
• To reduce Mean Time To Recovery
Extent creation
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary Secondary Secondary
Bitwise equality on all nodes
• Synchronous
• Append request to primary
• Written to disk, checksum
• Offset and data forwarded to secondaries
• Ack
Replication Process
Journaling
• Dedicated disk (spindle/ssd)
• For writes
• Simultaneously copied to data disk
(from memory or journal)
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary Secondary Secondary
Ack from primary lost going back to partition layer
• Retry from partition layer
• can cause multiple blocks to be appended (duplicate records)
Dealing with failure
Unresponsive/Unreachable Extent Node (EN)
• Seal the failed extent & allocate a new extent and append immediately
ExtentExtent
Stream //foo/myfile.data
Extent Extent
Pointer 1 Pointer 2 Pointer 3
Block
Block
Block
Block
Block
Block
Block
Block
Block
Block
Block
Block
Extent
Block
Block
Block
Extent
Block
Replication Failures (Scenario 1)
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary
Block
Block
Block
Secondary
Block
Block
Block
Secondary
Block
Block
Block
Block
Block
Seal
120
120
Primary
Block
Block
Block
Secondary
Block
Block
Block
Block
Block
Secondary
Block
Block
Block
Block
Replication Failures (Scenario 2)
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary
Block
Block
Block
Secondary
Block
Block
Block
Secondary
Block
Block
Block
Block
Seal
120
100
Primary
Block
Block
Block
Secondary
Block
Block
Block
Secondary
Block
Block
Block
?
Data Stream
• Partition Layer only reads from offsets
returned from successful appends
• Committed on all replicas
• Row and Blob Data Streams
• Offset valid on any replica
• Safe to read
Network partitioning during read
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary
Block
Block
Block
Secondary
Block
Block
Block
Secondary
Block
Block
Block
Block
Block
Log Stream
• Logs are used on partition load
• Check commit length first
• Only read from
• Unsealed replica if all replicas
have the same commit length
• A sealed replica
Network partitioning during read
Stream Layer
EN EN EN
SM
Partition Layer
PS
Primary
Block
Block
Block
Secondary
Block
Block
Block
Secondary
Block
Block
Block
Block
1, 2
Seal
Secondary
Block
Block
Block
Block
Block
Secondary
Block
Block
Block
Block
Downside to this architecture
• Lot’s of wasted disk space
• ‘Forgotten’ streams
• ‘Forgotten’ extends
• Unused blocks in extends after write errors
• Garbage collection required
• Partition servers mark used streams
• Stream manager cleans up unreferenced streams
• Stream manager cleans up unreferenced extends
• Stream manager performs erasure coding
• Reed Solomon algorithm
• Allows recreation of cold streams after deleting extends
Garbage Collection
Wrapup
Layering & co-design
• Stream Layer
• Availability with Partition/failure tolerance
• For Consistency, replicas are bit-wise identical up to the commit length
• Partition Layer
• Consistency with Partition/failure tolerance
• For Availability, RangePartitions can be served by any partition server
• Are moved to available servers if a partition server fails
• Designed for specific classes of partitioning/failures seen in practice
• Process to Disk to Node to Rack failures/unresponsiveness
• Node to Rack level network partitioning
Beating the CAP theorem
Want to know more?
• White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
• Video: http://www.youtube.com/watch?v=QnYdbQO0yj4
• More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx
• Erasure coding: http://www.snia.org/sites/default/files2/SDC2012/presentations/Cloud/
ChengHuang_Ensuring_Code_in_Windows_Azure.pv3.pdf
Resources
Still alive?
And take home the
Lumia 1320
Present your feedback form when you exit
the last session & go for the drink
Give Me Feedback
Follow Technet Belgium
@technetbelux
Subscribe to the TechNet newsletter
aka.ms/benews
Be the first to know
Belgiums’ biggest IT PRO Conference

More Related Content

What's hot

HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceCloudera, Inc.
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworksViet-Trung TRAN
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017Alex Robinson
 
Proving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slobProving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slobKapil Goyal
 
Solr on Windows: Does it Work? Does it Scale? - Teun Duynstee
Solr on Windows: Does it Work? Does it Scale? - Teun DuynsteeSolr on Windows: Does it Work? Does it Scale? - Teun Duynstee
Solr on Windows: Does it Work? Does it Scale? - Teun Duynsteelucenerevolution
 
Advanced databases ben stopford
Advanced databases   ben stopfordAdvanced databases   ben stopford
Advanced databases ben stopfordBen Stopford
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneEnkitec
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseMarcel Franke
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisJignesh Shah
 
[Altibase] 13 backup and recovery
[Altibase] 13 backup and recovery[Altibase] 13 backup and recovery
[Altibase] 13 backup and recoveryaltistory
 
Pnuts yahoo!’s hosted data serving platform
Pnuts  yahoo!’s hosted data serving platformPnuts  yahoo!’s hosted data serving platform
Pnuts yahoo!’s hosted data serving platformlammya aa
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataJignesh Shah
 
WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.Takashi Hoshino
 
CNIT 40: 4: Monitoring and detecting security breaches
CNIT 40: 4: Monitoring and detecting security breachesCNIT 40: 4: Monitoring and detecting security breaches
CNIT 40: 4: Monitoring and detecting security breachesSam Bowne
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberFlink Forward
 
Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Marco Tusa
 

What's hot (20)

HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
 
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
The Hows and Whys of a Distributed SQL Database - Strange Loop 2017
 
Proving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slobProving out flash storage array performance using swingbench and slob
Proving out flash storage array performance using swingbench and slob
 
Solr on Windows: Does it Work? Does it Scale? - Teun Duynstee
Solr on Windows: Does it Work? Does it Scale? - Teun DuynsteeSolr on Windows: Does it Work? Does it Scale? - Teun Duynstee
Solr on Windows: Does it Work? Does it Scale? - Teun Duynstee
 
Advanced databases ben stopford
Advanced databases   ben stopfordAdvanced databases   ben stopford
Advanced databases ben stopford
 
In Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry OsborneIn Memory Database In Action by Tanel Poder and Kerry Osborne
In Memory Database In Action by Tanel Poder and Kerry Osborne
 
Designing High Performance ETL for Data Warehouse
Designing High Performance ETL for Data WarehouseDesigning High Performance ETL for Data Warehouse
Designing High Performance ETL for Data Warehouse
 
Best Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on SolarisBest Practices with PostgreSQL on Solaris
Best Practices with PostgreSQL on Solaris
 
[Altibase] 13 backup and recovery
[Altibase] 13 backup and recovery[Altibase] 13 backup and recovery
[Altibase] 13 backup and recovery
 
2 technical-dns-workshop-day1
2 technical-dns-workshop-day12 technical-dns-workshop-day1
2 technical-dns-workshop-day1
 
Pnuts yahoo!’s hosted data serving platform
Pnuts  yahoo!’s hosted data serving platformPnuts  yahoo!’s hosted data serving platform
Pnuts yahoo!’s hosted data serving platform
 
4 technical-dns-workshop-day2
4 technical-dns-workshop-day24 technical-dns-workshop-day2
4 technical-dns-workshop-day2
 
Redshift deep dive
Redshift deep diveRedshift deep dive
Redshift deep dive
 
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte DataProblems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
Problems with PostgreSQL on Multi-core Systems with MultiTerabyte Data
 
WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.WalB: Block-level WAL. Concept.
WalB: Block-level WAL. Concept.
 
CNIT 40: 4: Monitoring and detecting security breaches
CNIT 40: 4: Monitoring and detecting security breachesCNIT 40: 4: Monitoring and detecting security breaches
CNIT 40: 4: Monitoring and detecting security breaches
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
 
Measuring Firebird Disk I/O
Measuring Firebird Disk I/OMeasuring Firebird Disk I/O
Measuring Firebird Disk I/O
 
Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...Comparing high availability solutions with percona xtradb cluster and percona...
Comparing high availability solutions with percona xtradb cluster and percona...
 

Similar to Azure storage deep dive

Monitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildMonitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildTim Vaillancourt
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Amazon Web Services
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectureshypertable
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machineheraflux
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsSpeedment, Inc.
 
Fluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconFluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconN Masahiro
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Speedment, Inc.
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualizationSisimon Soman
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAmazon Web Services
 
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Bob Pusateri
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)MongoDB
 
Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Shi Shao Feng
 
Replication and replica sets
Replication and replica setsReplication and replica sets
Replication and replica setsChris Westin
 

Similar to Azure storage deep dive (20)

Monitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the WildMonitoring MongoDB’s Engines in the Wild
Monitoring MongoDB’s Engines in the Wild
 
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canb...
 
Dissecting Scalable Database Architectures
Dissecting Scalable Database ArchitecturesDissecting Scalable Database Architectures
Dissecting Scalable Database Architectures
 
Hadoop
HadoopHadoop
Hadoop
 
CAO-Unit-III.pptx
CAO-Unit-III.pptxCAO-Unit-III.pptx
CAO-Unit-III.pptx
 
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)
 
Right-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual MachineRight-Sizing your SQL Server Virtual Machine
Right-Sizing your SQL Server Virtual Machine
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMsJava one2015 - Work With Hundreds of Hot Terabytes in JVMs
Java one2015 - Work With Hundreds of Hot Terabytes in JVMs
 
Fluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at KubeconFluentd and Distributed Logging at Kubecon
Fluentd and Distributed Logging at Kubecon
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
 
Cnam azure 2015 storage
Cnam azure 2015  storageCnam azure 2015  storage
Cnam azure 2015 storage
 
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
Select Stars: A DBA's Guide to Azure Cosmos DB (SQL Saturday Oslo 2018)
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
cache memory
cache memorycache memory
cache memory
 
Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0Realtime olap architecture in apache kylin 3.0
Realtime olap architecture in apache kylin 3.0
 
Replication and replica sets
Replication and replica setsReplication and replica sets
Replication and replica sets
 

More from Yves Goeleven

Back to the 90s' - Revenge of the static website
Back to the 90s' - Revenge of the static websiteBack to the 90s' - Revenge of the static website
Back to the 90s' - Revenge of the static websiteYves Goeleven
 
Io t privacy and security considerations
Io t   privacy and security considerationsIo t   privacy and security considerations
Io t privacy and security considerationsYves Goeleven
 
Connecting your app to the real world
Connecting your app to the real worldConnecting your app to the real world
Connecting your app to the real worldYves Goeleven
 
Madn - connecting things with people
Madn - connecting things with peopleMadn - connecting things with people
Madn - connecting things with peopleYves Goeleven
 
Message handler customer deck
Message handler customer deckMessage handler customer deck
Message handler customer deckYves Goeleven
 
Cloudbrew - Internet Of Things
Cloudbrew - Internet Of ThingsCloudbrew - Internet Of Things
Cloudbrew - Internet Of ThingsYves Goeleven
 
Windows azure storage services
Windows azure storage servicesWindows azure storage services
Windows azure storage servicesYves Goeleven
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabitsYves Goeleven
 
Eda on the azure services platform
Eda on the azure services platformEda on the azure services platform
Eda on the azure services platformYves Goeleven
 

More from Yves Goeleven (10)

Back to the 90s' - Revenge of the static website
Back to the 90s' - Revenge of the static websiteBack to the 90s' - Revenge of the static website
Back to the 90s' - Revenge of the static website
 
Io t privacy and security considerations
Io t   privacy and security considerationsIo t   privacy and security considerations
Io t privacy and security considerations
 
Connecting your app to the real world
Connecting your app to the real worldConnecting your app to the real world
Connecting your app to the real world
 
Madn - connecting things with people
Madn - connecting things with peopleMadn - connecting things with people
Madn - connecting things with people
 
Message handler customer deck
Message handler customer deckMessage handler customer deck
Message handler customer deck
 
Cloudbrew - Internet Of Things
Cloudbrew - Internet Of ThingsCloudbrew - Internet Of Things
Cloudbrew - Internet Of Things
 
Windows azure storage services
Windows azure storage servicesWindows azure storage services
Windows azure storage services
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
 
Eda on the azure services platform
Eda on the azure services platformEda on the azure services platform
Eda on the azure services platform
 
Sql Azure
Sql AzureSql Azure
Sql Azure
 

Recently uploaded

EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarThousandEyes
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNeo4j
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)codyslingerland1
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfTejal81
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kitJamie (Taka) Wang
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameKapil Thakar
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptxHansamali Gamage
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveIES VE
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsDianaGray10
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTopCSSGallery
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTxtailishbaloch
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FESTBillieHyde
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud DataEric D. Schabell
 

Recently uploaded (20)

EMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? WebinarEMEA What is ThousandEyes? Webinar
EMEA What is ThousandEyes? Webinar
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
Novo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4jNovo Nordisk's journey in developing an open-source application on Neo4j
Novo Nordisk's journey in developing an open-source application on Neo4j
 
The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)The New Cloud World Order Is FinOps (Slideshow)
The New Cloud World Order Is FinOps (Slideshow)
 
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdfQ4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
Q4 2023 Quarterly Investor Presentation - FINAL - v1.pdf
 
20140402 - Smart house demo kit
20140402 - Smart house demo kit20140402 - Smart house demo kit
20140402 - Smart house demo kit
 
Flow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First FrameFlow Control | Block Size | ST Min | First Frame
Flow Control | Block Size | ST Min | First Frame
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx.NET 8 ChatBot with Azure OpenAI Services.pptx
.NET 8 ChatBot with Azure OpenAI Services.pptx
 
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES LiveKeep Your Finger on the Pulse of Your Building's Performance with IES Live
Keep Your Finger on the Pulse of Your Building's Performance with IES Live
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
Automation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projectsAutomation Ops Series: Session 2 - Governance for UiPath projects
Automation Ops Series: Session 2 - Governance for UiPath projects
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 
Top 10 Squarespace Development Companies
Top 10 Squarespace Development CompaniesTop 10 Squarespace Development Companies
Top 10 Squarespace Development Companies
 
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENTSIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
SIM INFORMATION SYSTEM: REVOLUTIONIZING DATA MANAGEMENT
 
Technical SEO for Improved Accessibility WTS FEST
Technical SEO for Improved Accessibility  WTS FESTTechnical SEO for Improved Accessibility  WTS FEST
Technical SEO for Improved Accessibility WTS FEST
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data3 Pitfalls Everyone Should Avoid with Cloud Data
3 Pitfalls Everyone Should Avoid with Cloud Data
 

Azure storage deep dive

  • 3. Yves Goeleven Founder of MessageHandler.net • Windows Azure MVP
  • 7. What you might have expected Azure Storage Services • VM (Disk persistence) • Azure Files • CDN • StoreSimple • Backup & replica’s • Application storage • Much much more...
  • 8. I assume you know what storage services does!
  • 9. The real agenda Deep Dive Architecture • Global Namespace • Front End Layer • Partition Layer • Stream Layer
  • 13. Which discusses how Storage defies the rules of the CAP theorem • Distributed & scalable system • Highly available • Strongly Consistent • Partition tolerant • All at the same time!
  • 14. That can’t be done, right?
  • 17. Global namespace Every request gets sent to • http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName
  • 18. Global namespace DNS based service location • http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName • DNS resolution • Datacenter (f.e. Western Europe) • Virtual IP of a Stamp
  • 19. Stamp Scale unit • A windows azure deployment • 3 layers • Typically 20 racks each • Roughly 360 XL instances • With disks attached • 2 Petabyte (before june 2012) • 30 Petabyte Intra-stamp replication
  • 20. Incoming read request Arrives at Front End • Routed to partition layer • http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName • Reads usually from memory Partition Layer Front End Layer FE FE FE FE FE FE FE FE FE PS PS PS PS PS PS PS PS PS
  • 21. Incoming read request 2 nodes per request • Blazingly fast • Flat “Quantum 10” network topology • 50 Tbps of bandwidth • 10 Gbps / node • Faster than typical ssd drive
  • 22. Incoming Write request Written to stream layer • Synchronous replication • 1 Primary Extend Node • 2 Secondary Extend Nodes Front End Layer Stream Layer Partition Layer Primary Secondary Secondary PS PS PS PS PS PS PS PS PS EN EN EN EN EN EN EN
  • 23. Geo replication Intra-stamp replicationIntra-stamp replication Inter-stamp replication
  • 24. Location Service Stamp Allocation • Manages DNS records • Determines active/passive stamp configuration • Assigns accounts to stamps • Geo Replication • Disaster Recovery • Load Balancing • Account Migration
  • 26. Front End Nodes Stateless role instances • Authentication & Authorization • Shared Key (storage key) • Shared Access Signatures • Routing to Partition Servers • Based on PartitionMap (in memory cache) • Versioning • x-ms-version: 2011-08-18 • Throttling Front End Layer FE FE FE FE FE FE FE FE FE
  • 27. Partition Map Determine partition server • Used By Front End Nodes • Managed by Partition Manager (Partition Layer) • Ordered Dictionary • Key: http(s)://AccountName.Service.core.windows.net/PartitionName/ObjectName • Value: Partition Server • Split across partition servers • Range Partition
  • 28. Practical examples Blob • Each blob is a partition • Blob uri determines range • Range only matters for listing • Same container: More likely same range • https://myaccount.blobs.core.windows.net/mycontainer/subdir/file1.jpg • https://myaccount.blobs.core.windows.net/mycontainer/subdir/file2.jpg • Other container: Less likely • https://myaccount.blobs.core.windows.net/othercontainer/subdir/file1.jpg PS PS PS
  • 29. Practical examples Queue • Each queue is a partition • Queue name determines range • All messages same partition • Range only matters for listing queues PS PS PS
  • 30. Practical examples Table • Each group of rows with same PartitionKey is a partition • Ordering within set by RowKey • Range matters a lot! • Expect continuation tokens! PS PS PS
  • 31. Continuation tokens Requests across partitions • All on same partition server • 1 result • Across multiple ranges (or to many results) • Sequential queries required • 1 result + continuation token • Repeat query with token until done • Will almost never occur in development! • As dataset is typically small PS PS PS
  • 33. Partition layer Dynamic Scalable Object Index • Object Tables • System defined data structures • Account, Entity, Blob, Message, Schema, PartitionMap • Dynamic Load Balancing • Monitor load to each part of the index to determine hot spots • Index is dynamically split into thousands of Index Range Partitions based on load • Index Range Partitions are automatically load balanced across servers to quickly adapt to changes in load • Updates every 15 seconds
  • 34. Consists of • Partition Master • Distributes Object Tables • Partitions across Partition Servers • Dynamic Load Balancing • Partition Server • Serves data for exactly 1 Range Partition • Manages Streams • Lock Server • Master Election • Partition Lock Partition layer Partition Layer PS PS PS PM PM LS LS
  • 35. How it works Dynamic Scalable Object Index Partition Layer PS PS PS PM PM LS LS Front End Layer FE FE FE FE Account Container Blob Aaaa Aaaa Aaaa Harry Pictures Sunrise Harry Pictures Sunset Richard Images Soccer Richard Images Tennis Zzzz Zzzz zzzz A-H H’-R R’-Z Map
  • 36. Not in the critical path! Dynamic Load Balancing Stream Layer Partition Layer Primary Secondary Secondary PS PS PS PS PS PS PS PS PS EN EN EN EN EN EN EN PM PM LS LS
  • 37. Architecture • Log Structured Merge-Tree • Easy replication • Append only persisted streams • Commit log • Metadata log • Row Data (Snapshots) • Blob Data (Actual blob data) • Caches in memory • Memory table (= Commit log) • Row Data • Index (snapshot location) • Bloom filter • Load Metrics Partition Server Memory Data Persistent Data Commit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics
  • 38. HOT Data • Completely in memory Incoming Read request Memory Data Persistent Data Commit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics
  • 39. COLD Data • Partially Memory Incoming Read request Memory Data Persistent Data Commit Log Stream Metadata Log Stream Row Data StreamRow Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics
  • 40. Persistent First • Followed by memory Incoming Write Request Memory Data Persistent Data Commit Log StreamCommit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics
  • 41. Not in critical path! Checkpoint process Memory Data Persistent Data Commit Log StreamCommit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics Row Data StreamCommit Log Stream
  • 42. Not in critical path! Geo replication process Memory Data Persistent Data Commit Log StreamCommit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics Memory Data Persistent Data Commit Log StreamCommit Log Stream Metadata Log Stream Row Data Stream Blob Data Stream Memory Table Row Cache Index Cache Bloom Filters Load Metrics Row Data Stream
  • 44. Append-Only Distributed File System • Streams are very large files • File system like directory namespace • Stream Operations • Open, Close, Delete Streams • Rename Streams • Concatenate Streams together • Append for writing • Random reads Stream layer
  • 45. Concepts • Block • Min unit of write/read • Checksum • Up to N bytes (e.g. 4MB) • Extent • Unit of replication • Sequence of blocks • Size limit (e.g. 1GB) • Sealed/unsealed • Stream • Hierarchical namespace • Ordered list of pointers to extents • Append/Concatenate Stream layer Extent Stream //foo/myfile.data Extent Extent Extent Pointer 1 Pointer 2 Pointer 3 Block Block Block Block Block Block Block Block Block Block Block
  • 46. Consists of • Paxos System • Distributed consensus protocol • Stream Master • Extend Nodes • Attached Disks • Manages Extents • Optimization Routines Stream layer Stream Layer EN EN EN EN SM SM SM Paxos
  • 47. Allocation process • Partition Server • Request extent / stream creation • Stream Master • Chooses 3 random extend nodes based on available capacity • Chooses Primary & Secondary's • Allocates replica set • Random allocation • To reduce Mean Time To Recovery Extent creation Stream Layer EN EN EN SM Partition Layer PS Primary Secondary Secondary
  • 48. Bitwise equality on all nodes • Synchronous • Append request to primary • Written to disk, checksum • Offset and data forwarded to secondaries • Ack Replication Process Journaling • Dedicated disk (spindle/ssd) • For writes • Simultaneously copied to data disk (from memory or journal) Stream Layer EN EN EN SM Partition Layer PS Primary Secondary Secondary
  • 49. Ack from primary lost going back to partition layer • Retry from partition layer • can cause multiple blocks to be appended (duplicate records) Dealing with failure Unresponsive/Unreachable Extent Node (EN) • Seal the failed extent & allocate a new extent and append immediately ExtentExtent Stream //foo/myfile.data Extent Extent Pointer 1 Pointer 2 Pointer 3 Block Block Block Block Block Block Block Block Block Block Block Block Extent Block Block Block Extent Block
  • 50. Replication Failures (Scenario 1) Stream Layer EN EN EN SM Partition Layer PS Primary Block Block Block Secondary Block Block Block Secondary Block Block Block Block Block Seal 120 120 Primary Block Block Block Secondary Block Block Block Block Block Secondary Block Block Block Block
  • 51. Replication Failures (Scenario 2) Stream Layer EN EN EN SM Partition Layer PS Primary Block Block Block Secondary Block Block Block Secondary Block Block Block Block Seal 120 100 Primary Block Block Block Secondary Block Block Block Secondary Block Block Block ?
  • 52. Data Stream • Partition Layer only reads from offsets returned from successful appends • Committed on all replicas • Row and Blob Data Streams • Offset valid on any replica • Safe to read Network partitioning during read Stream Layer EN EN EN SM Partition Layer PS Primary Block Block Block Secondary Block Block Block Secondary Block Block Block Block Block
  • 53. Log Stream • Logs are used on partition load • Check commit length first • Only read from • Unsealed replica if all replicas have the same commit length • A sealed replica Network partitioning during read Stream Layer EN EN EN SM Partition Layer PS Primary Block Block Block Secondary Block Block Block Secondary Block Block Block Block 1, 2 Seal Secondary Block Block Block Block Block Secondary Block Block Block Block
  • 54. Downside to this architecture • Lot’s of wasted disk space • ‘Forgotten’ streams • ‘Forgotten’ extends • Unused blocks in extends after write errors • Garbage collection required • Partition servers mark used streams • Stream manager cleans up unreferenced streams • Stream manager cleans up unreferenced extends • Stream manager performs erasure coding • Reed Solomon algorithm • Allows recreation of cold streams after deleting extends Garbage Collection
  • 56. Layering & co-design • Stream Layer • Availability with Partition/failure tolerance • For Consistency, replicas are bit-wise identical up to the commit length • Partition Layer • Consistency with Partition/failure tolerance • For Availability, RangePartitions can be served by any partition server • Are moved to available servers if a partition server fails • Designed for specific classes of partitioning/failures seen in practice • Process to Disk to Node to Rack failures/unresponsiveness • Node to Rack level network partitioning Beating the CAP theorem
  • 57. Want to know more? • White paper: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf • Video: http://www.youtube.com/watch?v=QnYdbQO0yj4 • More slides: http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder.pptx • Erasure coding: http://www.snia.org/sites/default/files2/SDC2012/presentations/Cloud/ ChengHuang_Ensuring_Code_in_Windows_Azure.pv3.pdf Resources
  • 59. And take home the Lumia 1320 Present your feedback form when you exit the last session & go for the drink Give Me Feedback
  • 60. Follow Technet Belgium @technetbelux Subscribe to the TechNet newsletter aka.ms/benews Be the first to know
  • 61. Belgiums’ biggest IT PRO Conference

Editor's Notes

  1. Real time message processing as a service Think of it as IFTTT for internet of things Solves today’s integration issues Scalability, data volume, multitude protocols & platforms, multitude of integration points, saas & social integration, mobile platforms, business ecosystems, ownership & centralized management, …