© Matthew Bass 2013
Architecting for the Cloud
Len and Matt Bass
Storage in the Cloud
© Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
© Matthew Bass 2013
History
• The relational data model was created in the
late 1960s
• In the 1980s relational databases became
commercially successful
– Replacing hierarchical and network databases
• Relational databases remain the dominant
database model today
© Matthew Bass 2013
Relational Databases
• The relational model is a mathematical model
for describing the structure of data
– We will not go into this model
• Let’s quickly review first and second
normal form, however
© Matthew Bass 2013
Example
Imagine you sell car parts
– You have warehouses
– You have part inventories
– You have orders
What’s the problem?
Warehouse | Warehouse Address | Part
© Matthew Bass 2013
What Happens Here?
Warehouse 1 | 123 Main Street | Transmission, Steering wheel, Brake pads, …
What about here?
Warehouse 1 | 123 Main Street | Transmission
Warehouse 1 | 123 Main Street | Steering wheel
Warehouse 1 | 123 Main Street | Brake pads
© Matthew Bass 2013
The Solution …
Warehouse Table: Warehouse ID | Warehouse Address
Parts Table: Part ID | Part Description
Relations Table: Warehouse ID | Part ID
© Matthew Bass 2013
This Works
• We have a standard language for querying the
data (SQL)
• We can now extract data in a very flexible way
• We can read, write, update, and delete data
pretty efficiently
– Joins add some overhead
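To make the normalized schema and its joins concrete, here is a minimal sketch using Python's sqlite3 module; the table and column names are illustrative, not part of the original slides:

```python
import sqlite3

# In-memory database mirroring the three-table solution above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse (warehouse_id INTEGER PRIMARY KEY, address TEXT);
    CREATE TABLE part (part_id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE warehouse_part (warehouse_id INTEGER, part_id INTEGER);
""")
conn.execute("INSERT INTO warehouse VALUES (1, '123 Main Street')")
conn.executemany("INSERT INTO part VALUES (?, ?)",
                 [(1, "Transmission"), (2, "Steering wheel"), (3, "Brake pads")])
conn.executemany("INSERT INTO warehouse_part VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3)])

# Flexible extraction via SQL: which parts does warehouse 1 stock?
# Note the two joins, the overhead mentioned above.
for address, description in conn.execute("""
        SELECT w.address, p.description
        FROM warehouse w
        JOIN warehouse_part wp ON wp.warehouse_id = w.warehouse_id
        JOIN part p ON p.part_id = wp.part_id
        WHERE w.warehouse_id = 1"""):
    print(address, description)
```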
© Matthew Bass 2013
Moreover We Have RDBMS
• We have robust software systems that manage the
data
• These systems provide many advanced features
including:
– Behavior
– Concurrency control
– Transactions
– Referential integrity
– Optimization
© Matthew Bass 2013
Behavior
• DBMSs provide mechanisms for building in
behavior
• These are mechanisms like
– Stored procedures
– PL/SQL
• This allows you to simplify the application
logic
© Matthew Bass 2013
Concurrency Control
• DBMSs support multiple-user access
• They will lock tables during updates to ensure
that writes are complete prior to reads
• They will manage multiple updates to ensure
integrity and consistency of data
© Matthew Bass 2013
Transactions
• Transactions are supported
• This ensures that updates either happen
completely or not at all
– Often an atomic update is a set of updates to
individual records across multiple tables
– If only some of these updates happen the integrity
of the overall database is compromised
© Matthew Bass 2013
Referential Integrity
• Ensures that references from one table refer
to a valid entry in another table
© Matthew Bass 2013
Optimization
• Database systems will perform a variety of actions
to optimize based on usage patterns
• They will
– Create indexes
– Create virtual tables
– Cache values
– …
© Matthew Bass 2013
Impedance Mismatch
• There is, however, a mismatch
– We need to translate between the relational structure and the
organizational needs
• Think about the reports needed for the warehouse
– Purchase orders
– History of orders for customer
– Parts inventory per warehouse
– …
• This means we will need lots of Joins
– This isn’t too much of an issue until we scale …
© Matthew Bass 2013
Speaking of Scaling …
Do relational databases scale?
© Matthew Bass 2013
Internet Scale Is Difficult
• We can “shard” the data
– Split the data across the machines
• This is very difficult to do efficiently
• This makes joins more costly
– Remember joins are common
• This also has a practical limit
– At some point you will need to replicate the data
• The database becomes slow …
© Matthew Bass 2013
Change is Needed
• For this reason internet scale applications moved to
distributed file systems
– Google was the first
– Many others followed
• This allowed the data to be partitioned across nodes
more efficiently
– We’ll talk about this in a minute
© Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
© Matthew Bass 2013
Needs
• Let’s explore the needs in a bit more detail
• The file system needed to:
– Be fault-tolerant
– Handle large files
– Accommodate extremely large data sets
– Accommodate many concurrent clients
– Be flexible enough to handle multiple kinds of applications
© Matthew Bass 2013
Fault-Tolerance
• Due to the scale of these systems, they were
deployed on hundreds or thousands of servers
• This meant that at any given time some of these
nodes would not be operational
• Problems from application bugs, operating system
bugs, human error, hardware failures, and networks
are common
© Matthew Bass 2013
Large Files/Large Data Sets
• It’s common for files in these systems to be
multiple GBs
• Each file could have millions of objects
– E.g. many individual web pages
• The data sets grow quickly
• The data sets can be multiple terabytes or
petabytes
© Matthew Bass 2013
Many Concurrent Clients
• The system needs to efficiently handle
multiple clients
• These clients could be reading or writing
© Matthew Bass 2013
Multiple Applications
• Additionally the system needs to be flexible
enough to handle multiple applications
• Applications have a variety of needs
– Long streaming reads
– Throughput oriented operations
– Low latency reads
– …
© Matthew Bass 2013
Addressing Needs
• There were a number of things that were done to address the
needs
• One primary decision was the de-normalization of the data
– We’ll talk about this more in the next slides
• Other decisions include (we’ll talk about these in a bit)
– Block size
– Replication strategy
– Data consistency checks
– API and capability of the system
© Matthew Bass 2013
De-Normalizing Data
• Remember what was difficult with relational models?
– Joins across nodes are expensive
– As is synchronization for replicated data
• If the data is de-normalized it can be “localized”
– Data that will likely be accessed together can be collocated
– In other words store it as you will use it
© Matthew Bass 2013
Example
• Imagine a Purchase Order
• Typically this would contain
– Customer information
– Product information
– Pricing
© Matthew Bass 2013
Relational Purchase Order
• The data would be split across multiple
tables such as
– Customer
– Product Catalog
– Inventory
– …
• If the data set is large enough the data would
be distributed
© Matthew Bass 2013
De-Normalized Purchase Order
• In a file system without a relational model the data
doesn’t need to be split up
• The purchase order data would be co-located
• If the data set was very large purchase orders would
still be co-located
– Different purchase orders could be distributed
– A single purchase order, however, would not be
© Matthew Bass 2013
Relational vs NoSQL
Relational model: Customers | Product Catalog | Inventory (one order's data split across tables)
NoSQL: Orders 1-100 | Orders 101-200 | Orders 201-300 (each order stored whole; orders distributed across nodes)
© Matthew Bass 2013
What Does This Mean?
• Data has no explicit structure (not entirely
true … but we’ll talk about this)
– Data is largely treated as a blob
• This has several implications
– You can change the nature of the data as needed
– You can collocate the data as desired
– The application now has increased burden
© Matthew Bass 2013
Back to Purchase Order
Key (PO Number) | Value (PO)
1 | Contents of PO1 …
2 | Contents of PO2 …
3 | Contents of PO3 …
4 | Contents of PO4 …
© Matthew Bass 2013
Retrieving Data
• To retrieve the purchase order data you provide the reference
key
• The file system routes you to the appropriate node (more
later)
• The single node returns the entire purchase order
• This can happen quickly … regardless of how many purchase
orders you have
Do you see any potential issues?
© Matthew Bass 2013
Data Locality
• First, being able to retrieve the data quickly depends on the
location of the data
• If the data is distributed it’s difficult to retrieve quickly
– Imagine you want to get the number of times a customer ordered
product X
– More on this later
• While there is not an explicit structure there is an implicit
structure
– Design of this structure is important
© Matthew Bass 2013
Data Processing
• As the file system treats the data as
unstructured it’s not able to preprocess the
data
• Getting an ordered list, for example, has to be
done in the application
• The validity of the data needs to be checked
by the application
© Matthew Bass 2013
Updating Data
• What happens if you want to change the data?
– Imagine trying to update the customer’s address
• Updates tend to be difficult
• In this environment you tend to not update data
– Instead you will append the new data
– You can establish rules for the lifetime of the data
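A minimal in-memory sketch of the append-instead-of-update pattern; the store and helper names are hypothetical, not from any particular product:

```python
import time

store = {}  # hypothetical in-memory stand-in for the data store

def append(key, value):
    # An "update" just appends a new timestamped record.
    store.setdefault(key, []).append((time.time(), value))

def latest(key):
    # Reads return the newest record; lifetime rules could prune old ones.
    return max(store[key])[1]

append("customer:8790:address", "123 Main Street")
append("customer:8790:address", "456 Oak Avenue")   # the address "update"
print(latest("customer:8790:address"))              # 456 Oak Avenue
```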
© Matthew Bass 2013
Other Issues
• Things like data integrity are not managed by the file
system
• You don’t (typically) have full support for transactions
• There is no notion of referential integrity
• There is support for some concurrent access, but with
built in assumptions
• Consistency is not typically guaranteed (more later)
© Matthew Bass 2013
A New Tool in Your Toolbox
• You’ve been given a new kind of hammer
– Remember that everything is not a nail
– In other words these kinds of data stores are good
for some things … and not others
• Today there are many different flavors of
these data stores
– Both in terms of structures and features
© Matthew Bass 2013
Multiple Data Structures
• Today many options exist
– Key value stores
– Document centric data stores
– Column databases
• We’ve also started to see old models
reemerge e.g.
– Hierarchical data stores
© Matthew Bass 2013
Key Value Databases
• Basically you have a key that maps to some “value”
• This value is just a blob
– The database doesn’t care about the content or structure of this value
• The operations are quite simple e.g.
– Read (get the value given a key)
– Insert (inserts a key/value pair)
– Remove (removes the value associated with a given key)
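A minimal sketch of these three operations using the redis-py client (Redis appears in the examples later in this section); the server address and key names are assumptions:

```python
import json
import redis  # assumes the redis-py client and a Redis server on localhost

r = redis.Redis(host="localhost", port=6379)

# Insert: the value is an opaque blob as far as the store is concerned.
po = {"customer": 8790, "line_items": [{"product_id": 2, "quantity": 2}]}
r.set("po:1", json.dumps(po))

# Read: get the value given a key; the application parses the blob itself.
order = json.loads(r.get("po:1"))

# Remove: delete the value associated with the key.
r.delete("po:1")
```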
© Matthew Bass 2013
Key Value Databases II
• There is no real schema
– Basically you query a key and get the value
– This can be useful when accessing things like user sessions, shopping
carts, …
• Concurrency
– Concurrency only makes sense at the level of a single key
– Can have either optimistic write or eventual consistency – we’ll talk
about this more later
• Replication
– Can be handled by the client or the data store – more about this later
© Matthew Bass 2013
Uses
• Very fast reads
• Scales well
• Good for quick access of data without complex
querying needs
– The classic example is for session management
• Not good for
– Situations where data integrity is critical
– Data with complex querying needs
© Matthew Bass 2013
Document Centric Databases
• Stores a “document”
ID: 123
Customer: 8790
Line Items: [{product id: 2, quantity: 2},
             {product id: 34, quantity: 1}]
…
© Matthew Bass 2013
Document Centric
• No schema
• You can query the data store
– Can return all or part of the document
– Typically query the store by using the id (or key)
• As with key value, discussing concurrency only makes
sense at the level of a single document
© Matthew Bass 2013
Advantages
• A document centric data store is similar in
many ways to a key/value data store
• It does, however, allow for more complex
queries
– For example you can query using a non-primary
key
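A sketch of such a query using pymongo (MongoDB appears in the examples later); the database, collection, and field names are assumptions:

```python
from pymongo import MongoClient  # assumes pymongo and a MongoDB server

orders = MongoClient("localhost", 27017)["shop"]["purchase_orders"]

orders.insert_one({"_id": 123, "customer": 8790,
                   "line_items": [{"product_id": 2, "quantity": 2},
                                  {"product_id": 34, "quantity": 1}]})

# Query by the primary key (the document id) ...
po = orders.find_one({"_id": 123})

# ... or by a non-primary key, which a plain key/value store cannot do.
for doc in orders.find({"customer": 8790}):
    print(doc["_id"])
```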
© Matthew Bass 2013
Column Databases
• Row key maps to “column families”
Row key 1234 maps to:
Profile column family: Name = Matt | Billing Address = 123 Main st | Phone = 412 770-4145
Orders column family: Order Data … | Order Data … | Order Data …
© Matthew Bass 2013
Column Databases - Rows
• Rows are grouped together to form units of load balancing
– Row keys are ordered and grouped together by locality
– In this example consecutive rows would be from the same domain
(CNN)
• Concurrency makes sense at the level of a row
Key | Contents | Anchor:cnnsi.com | Anchor:my.look.ca
com.cnn.www | Html page | … | …
© Matthew Bass 2013
Column Databases – Columns
• Columns are grouped into “column families”
• Column families form the unit of access
control
– Clients may or may not have access to all column
families
• Column keys can be used to query data
© Matthew Bass 2013
Column Databases – Timestamps
• The cells in a column database can be versioned with
a timestamp
• The cells can contain multiple versions
– The application can typically specify how many versions to
keep or when a version times out
• You can use either a client-generated timestamp
or one generated by the storage node
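A small in-memory sketch of timestamp-versioned cells with an application-specified version limit; all names are illustrative:

```python
import time

MAX_VERSIONS = 3  # the application decides how many versions to keep

table = {}  # row key -> column key -> list of (timestamp, value)

def put(row, column, value, ts=None):
    # ts may be client-generated; default to a storage-side timestamp.
    versions = table.setdefault(row, {}).setdefault(column, [])
    versions.append((ts if ts is not None else time.time(), value))
    versions.sort(reverse=True)   # newest first
    del versions[MAX_VERSIONS:]   # discard versions beyond the limit

def get(row, column):
    return table[row][column][0][1]  # newest version wins
```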
© Matthew Bass 2013
Examples
Document Centric
• MongoDB
• CouchDB
• RavenDB
Key Value
• DynamoDB
• Azure Table
• Redis
• Riak
Column
• HBase
• Cassandra
• Hypertable
• SimpleDB
© Matthew Bass 2013
NoSQL vs RDBMS
• Explicit vs Implicit Schema
– NoSQL databases do have an implicit schema – at
least in most cases
• Distribution of data
• Consistency
• Efficiency of storage
• Additional capabilities
© Matthew Bass 2013
Schema
• Clearly with Relational DB there is an explicit schema
• You do have an implicit schema with NoSQL databases as well
– You typically want to do something with the data
• With a relational schema, distributing the data has a big
performance impact
• The data model of a NoSQL store impacts performance as well
– It is easier to distribute the data so that related data is co-located
© Matthew Bass 2013
Consistency - CAP Theorem
• When data becomes distributed you need to
worry about a network partition
– Essentially this means that instances of your data
store can’t communicate
• When this happens you need to choose
between availability or consistency
© Matthew Bass 2013
Let’s Demonstrate
• Imagine we start a store that takes orders
– Who wants to work at this store?
• The operators need to be able to:
– Take orders
– Give order history
– Modify orders
• We will start with one operator until business grows
…
© Matthew Bass 2013
Consistency in the Cloud
• Many NoSQL databases give you options
– Eventual consistency
– Optimistic consistency
– …
• They all come with different trade offs
• You must understand the needs of your system to
ensure appropriate behavior
– We’ll talk more about this later
© Matthew Bass 2013
Outline
This section will focus on storage in the cloud
• We will first look at relational databases
• What solutions emerged for the cloud
• Storage options for NoSQL databases
• Architecture of typical NoSQL databases
© Matthew Bass 2013
Fault Tolerance
• As we said earlier fault tolerance was a prime
motivator for many of the decisions
• These systems are built with commodity components
that are prone to failure
• They also need to deal with other issues (previously
mentioned) that arise
• We’ll look at a representative example of such a system
to understand what decisions have been made
© Matthew Bass 2013
Google File System
• Grew out of “BigFiles”
• Distributed, scalable and portable file system
• Written in C++
• Supports the kinds of applications we discussed
earlier
– Search
– Large data retrieval
© Matthew Bass 2013
Leads to following requirements
1. High reliability through commodity hardware
– Even with RAID, disks will still have one failure per day. If the system has to deal with
failure smoothly in any case, it is much more economical to use commodity hardware.
– Even if disks do not fail, data blocks may get corrupted.
2. Minimal synchronization on writes
– Require each application process to write to a distinct file. File merge can take place
after files are written.
– This means minimal locking during the write process (or read process).
3. Data blocks are all the same size
– Streaming data. All blocks are 64 MBytes.
– GFS is unaware of any internal logic of the data; the internal logic must be
managed by the application
© Matthew Bass 2013
GFS Interfaces
• Supports the following commands
– Open
– Create
– Read
– Write
– Close
– Append
– Snapshot
© Matthew Bass 2013
Organization of GFS
• Organized into clusters
• Each cluster might have thousands of machines
• Within each cluster you have the following kinds of
entities
– Clients
– Master servers
– Chunk servers
© Matthew Bass 2013
GFS Clients
• Clients are any entity that makes a file request
• Requests are often to retrieve existing files
• They might also include manipulating or creating files
• Clients are other computers or applications
– Think of the web server that serves your search
engine as a client
© Matthew Bass 2013
Chunk Servers
• Responsible for storing the data “chunks”
– These chunks are all 64 MB blocks
• These chunk servers are the work horses of the file system
• They receive requests for data and send the chunks directly to the
client
• The client also writes the files directly to the appropriate chunk
servers
– The references for replicas also come from the master
• The chunk server is responsible for determining the correctness of
the write (more later)
© Matthew Bass 2013
Master Servers
• Acts as a coordinator for the cluster
• Keeps track of the metadata
– This is data that describes the data blocks (or chunks)
– It tells the Master which chunks belong to which file
• Master tells the client where the chunk is located
• Master keeps an operations log
– Logs the activities of the cluster
– One of the mechanisms used to keep service outages to a
minimum (more later)
© Matthew Bass 2013
Two Additional Concepts
Lease:
• A lease is the minimal locking that is performed. A client receives a lease on a file
when it is opened and, until the file is closed or the lease expires, no other process
can write to that file. This prevents accidentally using the same file name twice.
• The client must renew the lease periodically (~1 minute) or the lease expires.
Block:
• Every file managed by GFS is divided into 64MByte blocks. Each read/write is
in terms of <file, block #>
• Each block is replicated – three is the default number of replicas.
• As far as GFS is concerned there is no internal structure to a block. The
application must perform any parsing of the data that is necessary.
© Matthew Bass 2013
Basic Read Operation
1. Client requests the location of the file from the Master
2. Master returns the location (which chunk server holds the chunk)
3. Client sends the read request to that chunk server
4. The chunk server returns the file content directly to the client
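The same read flow as illustrative Python pseudocode; the master and chunk-server objects and their method names are hypothetical, not the actual GFS client API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # every chunk is 64 MB

def pick_closest(replicas):
    return replicas[0]  # e.g. choose the replica nearest the client

def gfs_read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE
    # 1-2. Ask the Master where the chunk lives; it returns a chunk
    #      handle and the replica locations, never the data itself.
    handle, replicas = master.lookup(filename, chunk_index)
    # 3-4. Read directly from one chunk server, which returns the content.
    return pick_closest(replicas).read(handle, offset % CHUNK_SIZE, length)
```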
© Matthew Bass 2013
Basic Write Operation
1. Client requests the locations of the primary and secondary replicas from the Master
2. Master returns the locations
3. Client caches the locations
4. Client sends the data to write to the chunk servers
5. The chunk servers apply the mutations
© Matthew Bass 2013
Reliability Mechanisms
• Master and chunk replication
• Rebalancing
• Stale replication detection
• Checksumming
• Garbage removal
© Matthew Bass 2013
Master Replication
• One active Master per cluster
• “Shadow” masters exist on other machines
– These shadows may perform limited functions (e.g. reads)
• A shadow monitors the operations of the active master
– Through the operations log
• Maintains contact with the Chunk Servers by polling
– Does this to keep track of data
• If the Master fails the shadow takes over
© Matthew Bass 2013
Data Replication/Rebalancing
• File system replicates chunks of data
• It stores data on different machines across different racks
– That way if a machine or rack fails another replica exists
• Master also monitors cluster as a whole
• It periodically rebalances the load across the cluster
– All chunk servers run at near capacity but never at full capacity
• Master also monitors each chunk to ensure data is current
– If not it’s designated as a stale replica
– The stale replica becomes garbage
© Matthew Bass 2013
Checksum
• In order to detect data corruption checksumming is used
• The system breaks each 64 MB chunk into 64 KB blocks
• Each block has its own 32-bit checksum
• The Master monitors the checksums for each block
• If a checksum doesn’t match what the Master has on record,
the block is deleted and a new replica is created
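A sketch of this checksumming scheme using Python's zlib.crc32; the function names are illustrative:

```python
import zlib

BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks within a 64 MB chunk

def block_checksums(chunk: bytes) -> list:
    # One 32-bit CRC per 64 KB block.
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, recorded: list) -> bool:
    # A mismatch means the block is corrupt and the replica must be replaced.
    return block_checksums(chunk) == recorded
```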
© Matthew Bass 2013
Failure Scenarios
• Let’s look at the following failure scenarios to
see what happens
– Client failure
– Corrupt disk
– Chunk server failure
– Master failure
© Matthew Bass 2013
Client Failure
• Client fails while file open
• Master recognizes this because lease expires
• File is placed in intermediate state where client can re-
activate lease
• After the intermediate state expires (~hour), the Master informs the
Chunk Servers that have blocks for that file to delete them
• Master removes all entries associated with file
• Chunk Server deletes blocks
© Matthew Bass 2013
Corrupt Disk
This is the case where a block becomes corrupted after writing.
Replica1 writes a checksum for every 64 KB in a parallel file.
Replica1 returns checksums along with the block during a read.
Client checks the checksum when the block is returned
If there is an error then Client:
• Retries read from different Replica2
• Informs Master of corrupt block on Replica1
Master:
• Allocates new replica for that block on Replica3
• Informs Replica2 with an existing replica to copy it to Replica3.
• Informs Replica1 with corrupted block to delete that block.
© Matthew Bass 2013
Chunk Server Failure
Master sends Heartbeat request to Chunk Server
• Active Replica responds with a list of block #, replica #s it has.
• Failed Replica does not respond
Master recognizes Replica’s failure.
Master maintains block #, replica # -> Chunk Server mapping from last Heartbeat.
Master queues all of the blocks replicated on the failed Chunk Server to generate an
additional replica.
The generation of an additional replica of Block A:
• Allocate new replica on an active Chunk Server say Replica1
• Instruct one of the Chunk Servers with valid replica of Block A to copy it to Replica1.
© Matthew Bass 2013
Master Failure
• Backup Master maintains a copy of the log
• Responsible for creating checkpoint image
and trimming EditLog
• BackupNode takes over in case of Master
failure
• BackupNode may also fail
(Diagram: Master and BackupNode, each maintaining the EditLog and the Checkpoint Image.)
© Matthew Bass 2013
More about Master Structure
Four Threads:
• Main – perform file management operations.
• Ping/Echo – check on status of Chunk Servers and receive responses from Chunk
Servers
• Replica Management – manage new replica creation and replica deletion
• Lease Management – cancel leases when they expire. Queues replicas for replica
deletion for files where the client has failed.
Three Modes
• Normal operations
• Safe mode – when the Master is restarted, no new requests are accepted until a
percentage of Chunk Servers have reported their block allocations
• Backup – act as Master backup
© Matthew Bass 2013
Summary
• Relational databases are difficult to distribute efficiently
– Scalability can be problematic
• NoSQL databases offer an alternative
– Data is typically schema-less
• Aggregates of data that mirror primary use cases are
considered a unit of data
• Queries across nodes require an efficient mechanism for
aggregation
© Matthew Bass 2013
Architecting for the Cloud
Misc Topics
© Matthew Bass 2013
Topics
These are topics that have architectural implications
and do not fit neatly into one of the other lectures.
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
© Matthew Bass 2013
Zookeeper
• Zookeeper is intended to manage distributed
coordination
– Synchronization
– Data
© Matthew Bass 2013
Distributed applications
• Zookeeper provides a guaranteed (mostly)
consistent data structure for every instance of a
distributed application.
– Definition of “mostly” is within eventual consistency
lag (but this is small). More on eventual consistency
later.
• Zookeeper deals with managing failure as well as
consistency.
– Done using a Paxos-like algorithm (Zab).
• Zookeeper guarantees that service requests are
linearly ordered and processed in a FIFO order
© Matthew Bass 2013
Model
• Zookeeper maintains a file type data structure
– Hierarchical
– Data in every node (called znode)
– The amount of data in each node is assumed to be
small (< 1 MB)
– Intended for metadata
• Configuration
• Location
• Group
© Matthew Bass 2013
Zookeeper znode structure
/          <data>
  /b1      <data>
    /b1/c1 <data>
    /b1/c2 <data>
  /b2      <data>
    /b2/c1 <data>
© Matthew Bass 2013
API
Function | Type
create | write
delete | write
exists | read
get children | read
get data | read
set data | write
+ others
• All calls return atomic views of state – either
succeed or fail. No partial state returned.
Writes also are atomic. Either succeed or fail.
If they fail, no side effects.
© Matthew Bass 2013
Example - Group membership
• Remember the load balancer. It has a list of
registered servers.
• The load balancer wants to know which of its
servers are
– Alive
– Providing service
• The list must be
– Highly available
– Reflect failures of individual servers
• Strict performance requirements on list manager
© Matthew Bass 2013
Using Zookeeper to manage group
membership
• Load balancer on initialization
– Connects to Zookeeper
– Gets the list of Zookeeper servers
– Creates a session (if a server fails – automatic failover)
• Load balancer issues a create("/Servers") call
– If the node already exists, the call fails
• Servers register by creating /Servers/my_IP
• Load balancer can list the children of /Servers and get their IPs.
• A watcher will inform the load balancer if a server fails or leaves.
• Latency is low (on the order of microseconds) since
Zookeeper keeps its data structures in memory.
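A sketch of this registration pattern using the kazoo Python client for Zookeeper; the host address and the server IP are assumptions:

```python
from kazoo.client import KazooClient  # assumes the kazoo client library

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()  # creates the session; kazoo fails over between Zookeeper servers

zk.ensure_path("/Servers")  # no error if the node already exists

# A server registers with an ephemeral znode: if the server's session
# dies, Zookeeper deletes the node automatically.
zk.create("/Servers/10.0.0.5", b"", ephemeral=True)

# The load balancer watches the children; the callback fires on any change,
# so it learns immediately when a server fails or leaves.
@zk.ChildrenWatch("/Servers")
def on_membership_change(server_ips):
    print("live servers:", server_ips)
```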
© Matthew Bass 2013
Other use cases
• Leader election
• Distributed locks
• Synchronization
• Configuration
© Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
90
© Matthew Bass 2013
Failures in the cloud
• Cloud failures large and small
• The Long Tail
• Techniques for dealing with the long tail
© Matthew Bass 2013
Sometimes the whole cloud fails …
© Matthew Bass 2013
Selected Cloud Outages - 2013
• July 10, Google down for 10 minutes
• June 18, Facebook down for 30 minutes
• Aug 14-17 Outlook.com offline for three days
• Aug 19, Amazon.com down for 40-45 minutes
• Aug 22, Apple iCloud down for 11 hours
• Aug 16, Google down for 5 minutes
• Sept 13, AWS down for ~two hours
• Nov 21, Microsoft services intermittent for ~2
hours
© Matthew Bass 2013
And sometimes just a part of it fails …
© Matthew Bass 2013
A year in the life of a Google
datacenter
• Typical first year for a new cluster:
– ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to
recover)
– ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come
back)
– ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
– ~5 racks go wonky (40-80 machines see 50% packetloss)
– ~8 network maintenances (4 might cause ~30-minute random connectivity
losses)
– ~12 router reloads (takes out DNS and external vips for a couple minutes)
– ~3 router failures (have to immediately pull traffic for an hour)
– ~dozens of minor 30-second blips for dns
– ~1000 individual machine failures
– ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, dead
horses, etc.
© Matthew Bass 2013
Amazon failure statistics
• In a data center with ~64,000 servers with 2
disks each, ~5 servers and ~17 disks fail every day.
© Matthew Bass 2013
What does this mean for a
consumer of the cloud?
• You need to be concerned about “long tail”
distribution for requests due to piecewise
failure
• You need to be concerned about business
continuity due to overall failure.
© Matthew Bass 2013
Short digression into probability
• A distribution describes the probability that any given
reading will have a particular value.
• Many phenomena in nature are “normally distributed”.
• Most values will cluster
around the mean with
progressively smaller
numbers of values going
toward the edges.
• In a normal distribution
the mean is equal to the
median
© Matthew Bass 2013
Long Tail
• In a long tail distribution, there are some values
far from the mean.
• These values are sufficient to influence the mean.
• The mean and the
median are
dramatically
different in a long
tail distribution.
© Matthew Bass 2013
What does this mean?
• If there is a partial failure of the cloud some
activities will take a long time to complete and
exhibit a long tail.
• The figure shows
distribution of
1000 AWS
“launch instance”
calls.
• 4.5% of calls were
“long tail”
Launch instance (EC2), in seconds: Mean 27.81 | Median 23.10 | STD 25.12 | Max 202.3
© Matthew Bass 2013
What can you do to prevent long
tail problems?
• “Hedged” request. Suppose you wish to launch
10 instances. Issue 11 requests. Terminate the
request that has not completed when 10 are
completed.
• “Alternative” request. In the above scenario, issue
10 requests. When 8 requests have completed
issue 2 more. Cancel the last 2 to respond.
• Using these techniques reduces the time of the
longest of the 1000 launch instance requests
from 202 sec to 51 sec.
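A sketch of the hedged-request technique using Python's concurrent.futures; launch_instance is a hypothetical stand-in for the cloud provider's API call:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def launch_instance(i):
    # hypothetical stand-in for the cloud "launch instance" API call
    return i

NEEDED, HEDGE = 10, 1
with ThreadPoolExecutor(max_workers=NEEDED + HEDGE) as pool:
    # Issue 11 requests to get 10 instances.
    futures = [pool.submit(launch_instance, i) for i in range(NEEDED + HEDGE)]
    results = []
    for f in as_completed(futures):
        results.append(f.result())
        if len(results) == NEEDED:
            break
    for f in futures:
        f.cancel()  # terminate the straggler; a real implementation would
                    # also release any instance that already launched
```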
© Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
© Matthew Bass 2013
Business continuity
• Business continuity means that the business should
continue to provide service even if a disaster such as a
fire, flood, or cloud outage occurs.
• Two numbers characterize a business continuity
strategy
– RTO is the Recovery Time Objective – how long before the
service is available again
– RPO is the Recovery Point Objective – what is the point in
time that the system rolls back to. i.e. how much data can
be potentially lost
• Allows for cost/benefit trade offs.
• Many industries such as banks have compliance rules
that require business continuity policies and practices.
© Matthew Bass 2013
How does business continuity work?
• Replicate site in physically distant location.
• Recall DNS server with multiple sites
• If first site does not respond promptly, client
will try second site.
(Diagram: DNS resolves Website.com to both Site 1 at 123.45.67.89 and Site 2 at 456.77.88.99; when Site 1 fails, the client tries Site 2.)
© Matthew Bass 2013
What does it mean to “replicate
site”?
• Must have a parallel datacenter
• Data must be replicated within RPO
– If RPO is small or zero this implies DB replication
– If the RPO is larger, then you can use other means to replicate the data
• Software must also be replicated.
– Versions must be identical in both sites
• Using different versions in different sites may result in different
results.
• Configurations in two sites will be different but must yield the
same results.
• Replication of a site incurs costs. You may wish to
increase the RPO and just copy (back up) data to
another site.
© Matthew Bass 2013
Recall discussion about DNS servers
• There is a hierarchy of DNS servers.
• Local DNS servers are under the control of the
local organization.
• When a disaster happens, the new data center
can be made operative by changing the IP
address in the local DNS server.
© Matthew Bass 2013
What are the architectural
implications?
• State maintained in servers will be lost if a
disaster happens
• Dependencies on other than configuration
parameters must be identical in a replicated
site.
• Applications must be architected to be
movable from one environment to another.
© Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
© Matthew Bass 2013
Dependencies
• There exist many different types of dependencies within a
system. E.g.
– Inter component
– Version
– Configuration parameters
– Hardware
– Location
– Names
– DB schemas
– Platform
– Libraries
• Inconsistency among these dependencies is a common
source of production time errors.
© Matthew Bass 2013
For example
• You develop some code on your desktop.
– You have installed the latest Java update
– You configure your code to use a Python script to do
some data cleansing
– You depend on a component that your colleagues are
simultaneously developing.
• You deploy your code into production.
– The latest Java version has not been installed.
– Python has not been installed in the production
environment.
– Your colleagues are delayed in their development.
© Matthew Bass 2013
You finally get your code into
production
• A user has a problem and calls the help desk.
• The help desk doesn’t know how to solve the
problem and escalates it back to you.
• You have gone on vacation.
© Matthew Bass 2013
Problems lead to a requirement for
a formal “release plan”
1. Define and agree release and deployment plans with customers/stakeholders.
2. Ensure that each release package consists of a set of related assets and service
components that are compatible with each other.
3. Ensure that integrity of a release package and its constituent components is maintained
throughout the transition activities and recorded accurately in the configuration
management system.
4. Ensure that all release and deployment packages can be tracked, installed, tested,
verified, and/or uninstalled or backed out, if appropriate.
5. Ensure that change is managed during the release and deployment activities.
6. Record and manage deviations, risks, issues related to the new or changed service, and
take necessary corrective action.
7. Ensure that there is knowledge transfer to enable the customers and users to optimise
their use of the service to support their business activities.
8. Ensure that skills and knowledge are transferred to operations and support staff to
enable them to effectively and efficiently deliver, support and maintain the service,
according to required warranties and service levels
*http://en.wikipedia.org/wiki/Deployment_Plan
© Matthew Bass 2013
Release planning is labor intensive
• Note the requirements for coordination in the release plan
• Each item requires multiple people and time consuming
activities.
– Time consuming activities delay introducing features included in
the release.
• Open questions
– Which items are dealt with through process?
– Which items are dealt with through tool support?
– Which items are dealt with through architecture design?
– Which items are dealt with through a combination of the
above?
• We will see an architecture designed to reduce team
coordination in a subsequent lecture.
© Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
© Matthew Bass 2013
What is a configuration parameter?
• A configuration parameter or environment
variable is a parameter for an application that
either controls the behavior of the application
or specifies a connection of the application to
its environment
– Thread pool or database connection pool size
control the behavior of the application.
– A database URL specifies a connection of the app to
a database.
© Matthew Bass 2013
When are configuration parameters
bound?
• Recommended practice is to bind these at
initialization time for the app.
– App is loaded into an execution environment
– App is told where to find configuration parameters
through language, OS, or environment-specific means,
e.g. the parameters of main in C
– App reads configuration parameters from the
specified location.
• The virtue of this approach is that an app can be
loaded into different execution environments and
doesn’t need to be aware of which environment
it is in.
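A minimal sketch of initialization-time binding using environment variables; the parameter names are assumptions:

```python
import os

# Bind configuration parameters once, at initialization time. The same
# code then runs unchanged in unit test, integration test, and production;
# only the environment differs.
DATABASE_URL = os.environ["DATABASE_URL"]              # e.g. test or prod DB
POOL_SIZE = int(os.environ.get("DB_POOL_SIZE", "10"))  # behavior parameter

def connect():
    print(f"connecting to {DATABASE_URL} with pool size {POOL_SIZE}")
```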
© Matthew Bass 2013
Use DB as an example – Unit test
• App is given URL for database access
component.
• In the case of unit test, the database access
component is a component that maintains
some fake data in memory for fast access
without the overhead of the full DB.
© Matthew Bass 2013
Integration Test
• Test database is maintained for integration
testing.
• Test database has subset of full data base.
• URL of test database is provided to App
• App can read or write test database
© Matthew Bass 2013
Performance testing
• Special database access component exists for
performance testing
– Passes through reads to production database
– Writes to mirror database
• App is given URL of special database access
component
• Allows testing with real data but blocks writes
to the real database
• Mirror database is checked at end of test for
correctness.
© Matthew Bass 2013
Other configuration parameters
• Other configuration parameters should be
identical from integration test through to
production.
• Reduces possibility of incorrect specification
of configuration parameters.
– Incorrect specification of configuration
parameters is a major source of deployment
errors.
© Matthew Bass 2013
Topics
• Zookeeper
• Failure in the cloud
• Business continuity
• Release planning
• Managing configuration parameters
• Monitoring
© Matthew Bass 2013
Monitoring
• When is it done?
• Why is it done?
• What can you get from monitoring?
• Data sources – monitors/logs
© Matthew Bass 2013
What is monitoring?
• Monitoring is the collection of data from
individual or collections of systems during the
runtime of these systems.
• Isn’t this an operations problem and not an
architectural problem?
– No.
• Operators are first class stakeholders and their needs should
be considered when designing the system.
• In the modern world, difficult runtime problems are solved
by the architect, so it’s to your advantage that the correct
information is available.
• Other reasons are implicit in the uses of monitoring
information which we are about to go into.
© Matthew Bass 2013
Why monitor?
1. Identifying failures and the associated faults both at
runtime and during post-mortems held after a failure has
occurred.
2. Identifying performance problems both of individual
systems and collections of interacting systems.
3. Characterizing workload for both short term and long
term billing and capacity planning purposes.
4. Measuring user reactions to various types of interfaces
or business offerings. We will discuss A/B testing later.
5. Detecting intruders who are attempting to break into
the system. (outside of our scope).
© Matthew Bass 2013
Basic metrics
• Per VM instance provider will collect
– CPU utilization
– Disk read/writes
– Messages in/out
• These metrics are used for
– Charging
– Scaling
– Mapping utilization to workload
• Similar type of metrics for storage and utilities
• Can aggregate these metrics over autoscaling groups,
regions, accounts, etc.
© Matthew Bass 2013
Other metrics
• The problem with the basic metrics is that they are not
related to particular activities whether business or
internal.
• Other things to monitor
– Transactions – transactions per second gives the business
an idea of how many customers are utilizing the system.
– Transactions by type.
– Messages from one portion of the system to another.
– Error conditions detected by different portions of the
system
– … anything you want
© Matthew Bass 2013
How do I decide what to
monitor?
• Look at reasons for monitoring
– Failure detection
– Performance degradation
– Workload characterization
– User reactions
• For each reason,
– decide what symptoms you would like reported.
– Place responsibilities to detect symptoms in various modules.
– Decide on active/passive monitoring (discussed soon)
– Decide what constitutes an alarm (discussed soon)
– Logic should be under configuration control – levels of reporting
© Matthew Bass 2013
Metadata is crucial
• Data by itself is not that useful.
• It must be tagged with identifying information, including
a timestamp.
• For example
– VM CPU usage divided among which processes
– I/O requests to which disks triggered from which VM process
– Messages from which component to which other component in
response to what user requests.
• Ideal – each user request is given a tag and all monitoring
information as a consequence of satisfying that request are
tagged with request ID.
• Other monitoring activities are tagged with an ID that
identifies why the activity was triggered.
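A sketch of request-ID tagging using Python's logging module; the event names and the emitter are illustrative:

```python
import json, logging, time, uuid

log = logging.getLogger("monitor")
logging.basicConfig(level=logging.INFO)

def emit(event, request_id):
    # Every emission carries the tag, so effects can be tied to causes.
    log.info(json.dumps({"event": event, "request_id": request_id,
                         "timestamp": time.time()}))

def handle_request(user_request):
    request_id = str(uuid.uuid4())  # one tag per user request
    emit("request.start", request_id)
    # ... do the work, passing request_id to every component involved ...
    emit("request.end", request_id)
```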
© Matthew Bass 2013
Why this emphasis on metadata?
• Any of the uses enumerated for monitoring
data require associating effect with its cause.
• The monitoring data represents the effect.
• The metadata enables determining the cause.
© Matthew Bass 2013
Active/Passive
• Active data collection involves the component
that generates the data. It emits it periodically or
based on a triggering event
– To a key-value store
– To a file
– A message to a known location
• Passive data collection involves the component
that generates the data making it available to an
agent in the same address space. The agent emits
the data either periodically or based on events.
© Matthew Bass 2013
Data Collection
• Whether active or passive data, the data is
emitted from a component to a known
location periodically or based on events.
(Diagram: an agent attached to each system/application emits data to a central monitoring system.)
© Matthew Bass 2013
Monitoring Systems
• Data collecting tools
– Nagios
– Sensu
– Icinga
– CloudWatch – AWS specific
© Matthew Bass 2013
Volumes of data
• It is possible to generate huge amounts of data.
• That is the purpose of data collating tools
– Logstash
– Splunk
• Features of such tools
– Collating data from different instances
– Visualization
– Filtering
– Organizing data
– Reports
© Matthew Bass 2013
Alarms
• An alarm is a specific message about some
condition needing attention.
– Can be e-mail, text, or on screen for operators.
• Problems with alarms
– False positives – an alarm is raised without
justification
– False negatives – justification exists but no alarm
is raised.
© Matthew Bass 2013
Summary
• Distributed coordination problems are simplified when
using a tool such as Zookeeper
• You must expect failure in the cloud and prepare for it.
• A disaster is when everything has failed and you need
to have business continuity plans
• Flexibility in the cloud is managed by setting
configuration parameters and they need to be
managed.
• Monitoring lets you know what is going on with your
system from whatever perspective you wish. But, you
must choose your perspective.
© Matthew Bass 2013
Questions??
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfkalichargn70th171
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 

Recently uploaded (20)

Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdfThe Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
The Essentials of Digital Experience Monitoring_ A Comprehensive Guide.pdf
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 

Cloud Storage Architectures

  • 16. © Matthew Bass 2013 Impedance Mismatch • There is however, a mismatch – We need to translate between the relational structure and the organizational needs • Think about the reports needed for the warehouse – Purchase orders – History of orders for customer – Parts inventory per warehouse – … • This means we will need lots of Joins – This isn’t too much of an issue until we scale …
  • 17. © Matthew Bass 2013 Speaking of Scaling … Do relational databases scale?
  • 18. © Matthew Bass 2013 Internet Scale Is Difficult • We can “shard” the data – Split the data across the machines • This is very difficult to do efficiently • This makes joins more costly – Remember joins are common • This also has a practical limit – At some point you will need to replicate the data • The database becomes slow …
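A minimal sketch of hash-based sharding, assuming a fixed, hypothetical set of shard nodes; production systems typically use consistent hashing so that adding a node does not remap most keys:

```python
import hashlib

SHARDS = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical shard nodes

def shard_for(key: str) -> str:
    """Map a record key to the node that stores it."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# A join across keys now potentially touches several machines:
print(shard_for("warehouse:1"))
print(shard_for("part:brake-pad"))  # often lands on a different node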
  • 19. © Matthew Bass 2013 Change is Needed • For this reason internet scale applications moved to distributed file systems – Google was the first – Many others followed • This allowed the data to be partitioned across nodes more efficiently – We’ll talk about this in a minute
  • 20. © Matthew Bass 2013 Outline This section will focus on storage in the cloud • We will first look at relational databases • What solutions emerged for the cloud • Storage options for NoSQL databases • Architecture of typical NoSQL databases
  • 21. © Matthew Bass 2013 Needs • Let’s explore the needs in a bit more detail • The file system needed to: – Be fault-tolerant – Handle large files – Accommodate extremely large data sets – Accommodate many concurrent clients – Be flexible enough to handle multiple kinds of applications
  • 22. © Matthew Bass 2013 Fault-Tolerance • Due to the scale of these systems, they were deployed on hundreds or thousands of servers • This meant that at any given time some of these nodes would not be operational • Problems from application bugs, operating system bugs, human error, hardware failures, and network failures are common
  • 23. © Matthew Bass 2013 Large Files/Large Data Sets • It’s common for files in these systems to be multiple GBs • Each file could have millions of objects – E.g. many individual web pages • The data sets grow quickly • The data sets can be multiple terabytes or petabytes
  • 24. © Matthew Bass 2013 Many Concurrent Clients • The system needs to efficiently handle multiple clients • These clients could be reading or writing
  • 25. © Matthew Bass 2013 Multiple Applications • Additionally the system needs to be flexible enough to handle multiple applications • Applications have a variety of needs – Long streaming reads – Throughput oriented operations – Low latency reads – …
  • 26. © Matthew Bass 2013 Addressing Needs • There were a number of things that were done to address the needs • One primary decision was the de-normalization of the data – We’ll talk about this more in the next slides • Other decisions include (we’ll talk about these in a bit) – Block size – Replication strategy – Data consistency checks – API and capability of the system
  • 27. © Matthew Bass 2013 De-Normalizing Data • Remember what was difficult with relational models? – Joins across nodes are expensive – As is synchronization for replicated data • If the data is de-normalized it can be “localized” – Data that will likely be accessed together can be collocated – In other words store it as you will use it
  • 28. © Matthew Bass 2013 Example • Imagine a Purchase Order • Typically this would contain – Customer information – Product information – Pricing
  • 29. © Matthew Bass 2013 Relational Purchase Order • The data would be split across multiple tables such as – Customer – Product Catalog – Inventory – … • If the data set is large enough the data would be distributed
  • 30. © Matthew Bass 2013 De-Normalized Purchase Order • In a file system without a relational model the data doesn’t need to be split up • The purchase order data would be co-located • If the data set was very large purchase orders would still be co-located – Different purchase orders could be distributed – A single purchase order, however, would not be
  • 31. © Matthew Bass 2013 Relational vs NoSQL [Diagram: the relational model splits the data into Customers, Product Catalog, and Inventory tables; the NoSQL model keeps whole orders together, distributing them as Orders 1–100, Orders 101–200, and Orders 201–300]
  • 32. © Matthew Bass 2013 What Does This Mean? • Data has no explicit structure (not entirely true … but we’ll talk about this) – Data is largely treated as a blob • This has several implications – You can change the nature of the data as needed – You can collocate the data as desired – The application now has increased burden
  • 33. © Matthew Bass 2013 Back to Purchase Order
Key (PO Number) | Value (PO)
1 | Contents of PO1 …
2 | Contents of PO2 …
3 | Contents of PO3 …
4 | Contents of PO4 …
  • 34. © Matthew Bass 2013 Retrieving Data • To retrieve the purchase order data you provide the reference key • The file system routes you to the appropriate node (more later) • The single node returns the entire purchase order • This can happen quickly … regardless of how many purchase orders you have. Do you see any potential issues?
  • 35. © Matthew Bass 2013 Data Locality • First, being able to retrieve the data quickly depends on the location of the data • If the data is distributed it’s difficult to retrieve quickly – Imagine you want to get the number of times a customer ordered product X – More on this later • While there is not an explicit structure there is an implicit structure – Design of this structure is important
  • 36. © Matthew Bass 2013 Data Processing • As the file system treats the data as unstructured it’s not able to preprocess the data • Getting an ordered list, for example, has to be done in the application • The validity of the data needs to be checked by the application
  • 37. © Matthew Bass 2013 Updating Data • What happens if you want to change the data? – Imagine trying to update the customer’s address • Updates tend to be difficult • In this environment you tend to not update data – Instead you will append the new data – You can establish rules for the lifetime of the data
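A minimal sketch of append-only "updates", assuming records carry a timestamp and the newest version wins; the rules for expiring old versions are an application policy and are not shown:

```python
import time

log = []  # the append-only store

def put(key, value):
    log.append({"key": key, "value": value, "ts": time.time()})

def get(key):
    """Return the most recently appended value for a key."""
    versions = [r for r in log if r["key"] == key]
    return max(versions, key=lambda r: r["ts"])["value"] if versions else None

put("customer:8790:address", "123 Main St")
put("customer:8790:address", "456 Oak Ave")   # an "update" is just an append
print(get("customer:8790:address"))           # 456 Oak Ave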
  • 38. © Matthew Bass 2013 Other Issues • Things like data integrity are not managed by the file system • You don’t (typically) have full support for transactions • There is no notion of referential integrity • There is support for some concurrent access, but with built in assumptions • Consistency is not typically guaranteed (more later)
  • 39. © Matthew Bass 2013 A New Tool in Your Toolbox • You’ve been given a new kind of hammer – Remember that everything is not a nail – In other words these kinds of data stores are good for some things … and not others • Today there are many different flavors of these data stores – Both in terms of structures and features
  • 40. © Matthew Bass 2013 Multiple Data Structures • Today many options exist – Key value stores – Document centric data stores – Column databases • We’ve also started to see old models reemerge e.g. – Hierarchical data stores
  • 41. © Matthew Bass 2013 Key Value Databases • Basically you have a key that maps to some “value” • This value is just a blob – The database doesn’t care about the content or structure of this value • The operations are quite simple e.g. – Read (get the value given a key) – Insert (inserts a key/value pair) – Remove (removes the value associated with a given key)
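A minimal in-memory sketch of that contract; the value is an opaque blob the store never inspects:

```python
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def read(self, key):
        """Get the value for a key, or None if absent."""
        return self._data.get(key)

    def insert(self, key, value):
        """Insert (or overwrite) a key/value pair."""
        self._data[key] = value

    def remove(self, key):
        """Remove the value associated with a key."""
        self._data.pop(key, None)

store = KeyValueStore()
store.insert("session:42", b"...opaque blob...")
print(store.read("session:42"))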
  • 42. © Matthew Bass 2013 Key Value Databases II • There is no real schema – Basically you query a key and get the value – This can be useful when accessing things like user sessions, shopping carts, … • Concurrency – Concurrency only makes sense at the level of a single key – Can have either optimistic write or eventual consistency – we’ll talk about this more later • Replication – Can be handled by the client or the data store – more about this later
  • 43. © Matthew Bass 2013 Uses • Very fast reads • Scales well • Good for quick access of data without complex querying needs – The classic example is for session management • Not good for – Situations where data integrity is critical – Data with complex querying needs
  • 44. © Matthew Bass 2013 Document Centric Databases • Stores a “document”, e.g.:
{ ID: 123,
  Customer: 8790,
  Line Items: [{product id: 2, quantity: 2},
               {product id: 34, quantity: 1}],
  … }
  • 45. © Matthew Bass 2013 Document Centric • No schema • You can query the data store – Can return all or part of the document – Typically query the store by using the id (or key) • As with key value, discussing concurrency only makes sense at the level of a single document
  • 46. © Matthew Bass 2013 Advantages • A document centric data store is similar in many ways to a key/value data store • It does, however, allow for more complex queries – For example you can query using a non-primary key
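As a sketch of such a query, assuming a MongoDB instance whose orders collection is shaped like the earlier purchase-order document (the connection string and field names are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
orders = client.shop.orders

# Query on a non-primary key: every order for customer 8790,
# returning only the line items rather than the whole document.
for doc in orders.find({"customer": 8790}, {"line_items": 1}):
    print(doc["line_items"])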
  • 47. © Matthew Bass 2013 Column Databases • Row key maps to “column families” [Diagram: row key 1234 with a Profile column family (Name: Matt, Billing Address: 123 Main st, Phone: 412 770-4145) and an Orders column family (Order Data …, Order Data …, Order Data …)]
  • 48. © Matthew Bass 2013 Column Databases - Rows • Rows are grouped together to form units of load balancing – Row keys are ordered and grouped together by locality – In this example consecutive rows would be from the same domain (CNN) • Concurrency makes sense at the level of a row
Key | Contents | Anchor:cnnsi.com | Anchor:my.look.ca
com.cnn.www | Html page … | … | …
  • 49. © Matthew Bass 2013 Column Databases – Columns • Columns are grouped into “column families” • Column families form the unit of access control – Clients may or may not have access to all column families • Column keys can be used to query data
  • 50. © Matthew Bass 2013 Column Databases – Timestamps • The cells in a column database can be versioned with a timestamp • The cells can contain multiple versions – The application can typically specify how many versions to keep or when a version times out • You can use either a client-generated timestamp or one generated by the storage node
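A minimal sketch of timestamp-versioned cells, assuming the application keeps at most MAX_VERSIONS values per (row, column); real column stores such as HBase implement this inside the storage engine:

```python
import time
from collections import defaultdict

MAX_VERSIONS = 3
cells = defaultdict(list)  # (row_key, column) -> [(ts, value)], newest first

def put(row, column, value, ts=None):
    ts = ts if ts is not None else time.time()  # client- or store-generated
    versions = cells[(row, column)]
    versions.append((ts, value))
    versions.sort(key=lambda v: v[0], reverse=True)
    del versions[MAX_VERSIONS:]                 # expire the oldest versions

def get(row, column, ts=None):
    """Latest value, or the latest value at or before a given timestamp."""
    for vts, value in cells[(row, column)]:
        if ts is None or vts <= ts:
            return value
    return None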
  • 51. © Matthew Bass 2013 Examples Document Centric • MongoDB • CouchDB • RavenDB Key Value • DynamoDB • Azure Table • Redis • Riak Column • HBase • Cassandra • Hypertable • SimpleDB
  • 52. © Matthew Bass 2013 NoSQL vs RDBMS • Explicit vs Implicit Schema – NoSQL databases do have an implicit schema – at least in most cases • Distribution of data • Consistency • Efficiency of storage • Additional capabilities
  • 53. © Matthew Bass 2013 Schema • Clearly with Relational DB there is an explicit schema • You do have an implicit schema with NoSQL db as well – You typically want to do something with the data • With a relational schema, distributing the data has a big performance impact • The data model of NoSQL data impacts performance as well – It is easier to distribute data so that related data is co-located
  • 54. © Matthew Bass 2013 Consistency - CAP Theorem • When data becomes distributed you need to worry about a network partition – Essentially this means that instances of your data store can’t communicate • When this happens you need to choose between availability and consistency
  • 55. © Matthew Bass 2013 Let’s Demonstrate • Imagine we start a store that takes orders – Who wants to work at this store? • The operators need to be able to: – Take orders – Give order history – Modify orders • We will start with one operator until business grows …
  • 56. © Matthew Bass 2013 Consistency in the Cloud • Many NoSQL databases give you options – Eventual consistency – Optimistic consistency – … • They all come with different trade-offs • You must understand the needs of your system to ensure appropriate behavior – We’ll talk more about this later
  • 57. © Matthew Bass 2013 Outline This section will focus on storage in the cloud • We will first look at relational databases • What solutions emerged for the cloud • Storage options for NoSQL databases • Architecture of typical NoSQL databases
  • 58. © Matthew Bass 2013 Fault Tolerance • As we said earlier fault tolerance was a prime motivator for many of the decisions • These systems are built with commodity components that are prone to failure • They also need to deal with other issues (previously mentioned) that arise • We’ll look at a representative example of such a system to understand what decisions have been made
  • 59. © Matthew Bass 2013 Google File System • Grew out of “BigFiles” • Distributed, scalable and portable file system • Implemented in C++ (HDFS, its open-source counterpart, is written in Java) • Supports the kinds of applications we discussed earlier – Search – Large data retrieval
  • 60. © Matthew Bass 2013 Leads to the following requirements 1. High reliability through commodity hardware – Even with RAID, disks will still have one failure per day. If the system has to deal with failure smoothly in any case, it is much more economical to use commodity hardware. – Even if disks do not fail, data blocks may get corrupted. 2. Minimal synchronization on writes – Require each application process to write to a distinct file. File merge can take place after files are written. – This means minimal locking during the write process (or read process). 3. Data blocks are all the same size – Streaming data. All blocks are 64 MB. – GFS is unaware of any internal structure of the data; the internal logic of the data must be managed by the application
  • 61. © Matthew Bass 2013 GFS Interfaces • Supports the following commands – Open – Create – Read – Write – Close – Append – Snapshot
  • 62. © Matthew Bass 2013 Organization of GFS • Organized into clusters • Each cluster might have thousands of machines • Within each cluster you have the following kinds of entities – Clients – Master servers – Chunk servers
  • 63. © Matthew Bass 2013 GFS Clients • Clients are any entity that makes a file request • Requests are often to retrieve existing files • They might also include manipulating or creating files • Clients are other computers or applications – Think of the web server that serves your search engine as a client
  • 64. © Matthew Bass 2013 Chunk Servers • Responsible for storing the data “chunks” – These chunks are all 64 MB blocks • These chunk servers are the work horses of the file system • They receive requests for data and send the chunks directly to the client • The client also writes the files directly to the appropriate chunk servers – The reference for replicas comes from the master as well • The chunk server is responsible for determining the correctness of the write (more later)
  • 65. © Matthew Bass 2013 Master Servers • Acts as a coordinator for the cluster • Keeps track of the metadata – This is data that describes the data blocks (or chunks) – Tells the Master which chunks a file is made up of • Master tells the client where the chunk is located • Master keeps an operations log – Logs the activities of the cluster – One of the mechanisms used to keep service outages to a minimum (more later)
  • 66. © Matthew Bass 2013 Two Additional Concepts Lease: • Lease is the minimal locking that is performed. Client receives a lease on a file when it is opened and, until the file is closed or the lease expires, no other process can write to that file. This prevents accidentally using the same file name twice. • Client must renew the lease periodically (~ 1 minute) or the lease expires. Block: • Every file managed by GFS is divided into 64 MByte blocks. Each read/write is in terms of <file, block #> • Each block is replicated – three is the default number of replicas. • As far as GFS is concerned there is no internal structure to a block. The application must perform any parsing of the data that is necessary.
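A minimal sketch of the lease bookkeeping described above; the one-minute renewal window and the names are illustrative assumptions:

```python
import time

LEASE_SECONDS = 60
leases = {}  # filename -> (client_id, expiry time)

def acquire(filename, client_id):
    holder = leases.get(filename)
    if holder and holder[1] > time.time() and holder[0] != client_id:
        return False                       # someone else holds a live lease
    leases[filename] = (client_id, time.time() + LEASE_SECONDS)
    return True

def renew(filename, client_id):
    return acquire(filename, client_id)    # renewal = re-acquire by the holder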
  • 67. © Matthew Bass 2013 Basic Read Operation [Diagram: Client, Master, and three Chunk Servers] 1. Client requests the location of the file from the Master 2. Master returns the location 3. Client sends the read request to the appropriate Chunk Server 4. Chunk Server returns the file content
  • 68. © Matthew Bass 2013 Basic Write Operation [Diagram: Client, Master, and three Chunk Servers] 1. Client requests the locations of the primary and secondary replicas from the Master 2. Master returns the locations 3. Client caches the locations 4. Client sends the data to write to the Chunk Servers 5. The Chunk Servers apply the mutations
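A toy, self-contained simulation of these two flows; the classes and names are illustrative assumptions and omit the real protocol's chunk handles, mutation ordering by the primary, and replica acknowledgements:

```python
class ChunkServer:
    def __init__(self):
        self.chunks = {}                    # handle -> bytes

    def read(self, handle):
        return self.chunks[handle]

    def write(self, handle, data):
        self.chunks[handle] = data

class Master:
    def __init__(self):
        self.metadata = {}                  # (file, block#) -> (server, handle)

    def locate(self, file, block):
        return self.metadata[(file, block)]

servers = [ChunkServer() for _ in range(3)]
master = Master()

# Write path: the Master assigns a location; the client writes directly.
master.metadata[("po.log", 0)] = (servers[1], "handle-0")
server, handle = master.locate("po.log", 0)
server.write(handle, b"Contents of PO1")

# Read path: ask the Master for the location, then read directly.
server, handle = master.locate("po.log", 0)
print(server.read(handle))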
  • 69. © Matthew Bass 2013 Reliability Mechanisms • Master and chunk replication • Rebalancing • Stale replica detection • Checksumming • Garbage removal
  • 70. © Matthew Bass 2013 Master Replication • One active Master per cluster • “Shadow” masters exist on other machines – These shadows may perform limited functions (e.g. reads) • Monitors the operations of the active master – Through the operations log • Maintains contact with the Chunk Servers by polling – Does this to keep track of data • If the Master fails the shadow takes over
  • 71. © Matthew Bass 2013 Data Replication/Rebalancing • File system replicates chunks of data • It stores data on different machines across different racks – That way if a machine or rack fails another replica exists • Master also monitors the cluster as a whole • It periodically rebalances the load across the cluster – All chunk servers run at near capacity but never at full capacity • Master also monitors each chunk to ensure data is current – If not it’s designated as a stale replica – The stale replica becomes garbage
  • 72. © Matthew Bass 2013 Checksum • In order to detect data corruption checksumming is used • The system breaks each 64 MB chunk into 64 KB blocks • Each block has its own 32-bit checksum • The Master monitors the checksums for each block • If the checksums don’t match what the Master has on record the chunk is deleted and a new replica is created
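A sketch of the per-block checksumming described above, using CRC32 as an assumed stand-in for the actual 32-bit checksum function:

```python
import zlib

BLOCK = 64 * 1024  # 64 KB blocks within a 64 MB chunk

def block_checksums(chunk: bytes) -> list:
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, recorded: list) -> bool:
    """Compare freshly computed checksums against the recorded ones."""
    return block_checksums(chunk) == recorded

chunk = bytes(64 * 1024 * 1024)             # one 64 MB chunk of zeros
recorded = block_checksums(chunk)
assert verify(chunk, recorded)              # a mismatch triggers re-replication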
  • 73. © Matthew Bass 2013 Failure Scenarios • Let’s look at the following failure scenarios to see what happens – Client failure – Corrupt disk – Chunk server failure – Master failure
  • 74. © Matthew Bass 2013 Client Failure • Client fails while file open • Master recognizes this because the lease expires • File is placed in an intermediate state where the client can re-activate the lease • After the intermediate state expires (~hour), Master informs the Chunk Servers that have blocks for that file to delete them • Master removes all entries associated with the file • Chunk Server deletes the blocks
  • 75. © Matthew Bass 2013 Corrupt Disk This is the case where a block becomes corrupted after writing. Replica1 writes a checksum for every 64 KB in a parallel file. Replica1 returns checksums along with the block during a read. Client checks the checksum when the block is returned. If there is an error then the Client: • Retries the read from a different Replica2 • Informs the Master of the corrupt block on Replica1 Master: • Allocates a new replica for that block on Replica3 • Informs Replica2 with an existing replica to copy it to Replica3. • Informs Replica1 with the corrupted block to delete that block.
  • 76. © Matthew Bass 2013 Chunk Server Failure Master sends a Heartbeat request to the Chunk Server • An active Replica responds with a list of the block #, replica #s it has. • A failed Replica does not respond Master recognizes the Replica’s failure. Master maintains the block #, replica # -> Chunk Server mapping from the last Heartbeat. Master queues all of the blocks replicated on the failed Chunk Server to generate an additional replica. The generation of an additional replica of Block A: • Allocate a new replica on an active Chunk Server, say Replica1 • Instruct one of the Chunk Servers with a valid replica of Block A to copy it to Replica1.
  • 77. © Matthew Bass 2013 Master Failure • Backup Master maintains a copy of the log • Responsible for creating the checkpoint image and trimming the EditLog • BackupNode takes over in case of Master failure • BackupNode may also fail [Diagram: Master and BackupNode sharing the EditLog and Checkpoint Image]
  • 78. © Matthew Bass 2013 More about Master Structure Four Threads: • Main – perform file management operations. • Ping/Echo – check on the status of Chunk Servers and receive responses from Chunk Servers • Replica Management – manage new replica creation and replica deletion • Lease Management – cancel leases when they expire. Queues replicas for deletion for files where the client has failed. Three Modes • Normal operations • Safe mode – when the Master is restarted no new requests are accepted until a percentage of the Chunk Servers have reported their block allocations • Backup – act as Master backup
  • 79. © Matthew Bass 2013 Summary • Relational databases are difficult to distribute efficiently – Scalability can be problematic • NoSQL databases offer an alternative – Data is typically schema-less • Aggregates of data that mirror primary use cases are considered a unit of data • Queries across nodes require an efficient mechanism for aggregation
  • 80. © Matthew Bass 2013 Architecting for the Cloud Misc Topics
  • 81. © Matthew Bass 2013 Topics These are topics that have architectural implications and do not fit neatly into one of the other lectures. • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 82. © Matthew Bass 2013 Zookeeper • Zookeeper is intended to manage distributed coordination – Synchronization – Data
  • 83. © Matthew Bass 2013 Distributed applications • Zookeeper provides a guaranteed, (mostly) consistent data structure for every instance of a distributed application. – The definition of “mostly” is within the eventual consistency lag (but this is small). More on eventual consistency later. • Zookeeper deals with managing failure as well as consistency. – Done using an atomic broadcast protocol (ZAB, a Paxos-like algorithm). • Zookeeper guarantees that service requests are linearly ordered and processed in FIFO order
  • 84. © Matthew Bass 2013 Model • Zookeeper maintains a file type data structure – Hierarchical – Data in every node (called a znode) – Amount of data in each node assumed small (< 1 MB) – Intended for metadata • Configuration • Location • Group
  • 85. © Matthew Bass 2013 Zookeeper znode structure
/        <data>
/b1      <data>
/b1/c1   <data>
/b1/c2   <data>
/b2      <data>
/b2/c1   <data>
  • 86. © Matthew Bass 2013 API
Function | Type
create | write
delete | write
exists | read
get children | read
get data | read
set data | write
(+ others)
• All calls return atomic views of state – they either succeed or fail; no partial state is returned. Writes are also atomic – they either succeed or fail, and if they fail there are no side effects.
  • 87. © Matthew Bass 2013 Example - Group membership • Remember the load balancer. It has a list of registered servers. • The load balancer wants to know which of its servers are – Alive – Providing service • The list must – Be highly available – Reflect the failure of individual servers • Strict performance requirements on the list manager
  • 88. © Matthew Bass 2013 Using Zookeeper to manage group membership • Load balancer on initialization – connects to zookeeper – Gets the list of zookeeper servers – Creates a session (if a server fails – automatic fail over) • Load balancer issues a create(“/Servers”) call • If it already exists it gets a failure • Servers register by creating /Servers/my_IP • Load balancer can list the children of /Servers and get their IPs. • A watcher will inform the load balancer if a server fails or leaves. • Latency is low (on the order of microseconds) since Zookeeper keeps its data structures in memory.
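A sketch of this membership pattern using the kazoo client library; the ensemble addresses and server IP are assumptions. An ephemeral znode disappears when its creator's session dies, which is what makes the watch fire on server failure:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

zk.ensure_path("/Servers")                  # the load balancer's group node

# Each server registers itself with an ephemeral znode:
zk.create("/Servers/10.0.0.5", ephemeral=True)

# The load balancer watches the group for joins and failures:
@zk.ChildrenWatch("/Servers")
def on_membership_change(children):
    print("Live servers:", children)        # re-invoked when a server fails or leaves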
  • 89. © Matthew Bass 2013 Other use cases • Leader election • Distributed locks • Synchronization • Configuration
  • 90. © Matthew Bass 2013 Topics • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 91. © Matthew Bass 2013 Failures in the cloud • Cloud failures large and small • The Long Tail • Techniques for dealing with the long tail
  • 92. © Matthew Bass 2013 Sometimes the whole cloud fails …
  • 93. © Matthew Bass 2013 Selected Cloud Outages - 2013 • July 10, Google down for 10 minutes • June 18, Facebook down for 30 minutes • Aug 14-17 Outlook.com offline for three days • Aug 19, Amazon.com down for 40-45 minutes • Aug 22, Apple iCloud down for 11 hours • Aug 16, Google down for 5 minutes • Sept 13, AWS down for ~two hours • Nov 21, Microsoft services intermittent for ~2 hours
  • 94. © Matthew Bass 2013 And sometimes just a part of it fails …
  • 95. © Matthew Bass 2013 A year in the life of a Google datacenter • Typical first year for a new cluster: – ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover) – ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back) – ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back) – ~5 racks go wonky (40-80 machines see 50% packet loss) – ~8 network maintenances (4 might cause ~30-minute random connectivity losses) – ~12 router reloads (takes out DNS and external vips for a couple minutes) – ~3 router failures (have to immediately pull traffic for an hour) – ~dozens of minor 30-second blips for dns – ~1000 individual machine failures – ~thousands of hard drive failures • slow disks, bad memory, misconfigured machines, flaky machines, dead horses, etc.
  • 96. © Matthew Bass 2013 Amazon failure statistics • In a data center with ~64,000 servers with 2 disks each, ~5 servers and ~17 disks fail every day.
  • 97. © Matthew Bass 2013 What does this mean for a consumer of the cloud? • You need to be concerned about “long tail” distribution for requests due to piecewise failure • You need to be concerned about business continuity due to overall failure.
  • 98. © Matthew Bass 2013 Short digression into probability • A distribution describes the probability that any given reading will have a particular value. • Many phenomena in nature are “normally distributed”. • Most values will cluster around the mean with progressively smaller numbers of values going toward the edges. • In a normal distribution the mean is equal to the median
  • 99. © Matthew Bass 2013 Long Tail • In a long tail distribution, there are some values far from the mean. • These values are sufficient to influence the mean. • The mean and the median are dramatically different in a long tail distribution.
  • 100. © Matthew Bass 2013 What does this mean? • If there is a partial failure of the cloud some activities will take a long time to complete and exhibit a long tail. • The figure shows the distribution of 1000 AWS “launch instance” calls. • 4.5% of calls were “long tail”
EC2 launch instance (seconds): Mean 27.81 | Median 23.10 | STD 25.12 | Max 202.3
  • 101. © Matthew Bass 2013 What can you do to prevent long tail problems? • “Hedged” request. Suppose you wish to launch 10 instances. Issue 11 requests. Terminate the request that has not completed when 10 are completed. • “Alternative” request. In the above scenario, issue 10 requests. When 8 requests have completed issue 2 more. Cancel the last 2 to respond. • Using these techniques reduces the time of the longest of the 1000 launch instance requests from 202 sec to 51 sec.
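A sketch of the "alternative request" technique with a thread pool, using a random stand-in for the long-tailed launch call; the numbers mirror the 10-instance scenario above and the cancellation is best-effort:

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def launch_instance(i):
    time.sleep(random.expovariate(10))      # long-tailed latency stand-in
    return f"instance-{i}"

NEEDED = 10
with ThreadPoolExecutor(max_workers=NEEDED + 2) as pool:
    pending = {pool.submit(launch_instance, i) for i in range(NEEDED)}
    done, hedged = set(), False
    while len(done) < NEEDED:
        finished, pending = wait(pending, return_when=FIRST_COMPLETED)
        done |= finished
        if not hedged and len(done) >= NEEDED - 2:
            # 8 of 10 are back: issue 2 extra requests to cover stragglers
            pending |= {pool.submit(launch_instance, i) for i in (10, 11)}
            hedged = True
    for straggler in pending:
        straggler.cancel()                  # best-effort cancel of the last to respond
    print(f"{len(done)} completed; the first {NEEDED} results are used")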
  • 102. © Matthew Bass 2013 Topics • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 103. © Matthew Bass 2013 Business continuity • Business continuity means that the business should continue to provide service even if a disaster such as a fire, flood, or cloud outage occurs. • Two numbers characterize a business continuity strategy – RTO is the Recovery Time Objective – how long before the service is available again – RPO is the Recovery Point Objective – what is the point in time that the system rolls back to, i.e. how much data can potentially be lost • Allows for cost/benefit trade-offs. • Many industries such as banks have compliance rules that require business continuity policies and practices.
  • 104. © Matthew Bass 2013 How does business continuity work? • Replicate site in physically distant location. • Recall DNS server with multiple sites • If first site does not respond promptly, client will try second site. [Diagram: Website.com resolves through DNS to Site 1 (123.45.67.89) and Site 2 (456.77.88.99); when Site 1 fails, clients go to Site 2]
  • 105. © Matthew Bass 2013 What does it mean to “replicate site”? • Must have a parallel datacenter • Data must be replicated within the RPO – If the RPO is small or zero this implies DB replication – If the RPO is larger then you can use other means to replicate data • Software must also be replicated. – Versions must be identical in both sites • Using different versions in different sites may result in different results. • Configurations in the two sites will be different but must yield the same results. • Replication of a site incurs costs. You may wish to increase the RPO and just copy (back up) data to another site.
  • 106. © Matthew Bass 2013 Recall discussion about DNS servers • There is a hierarchy of DNS servers. • Local DNS servers are under the control of the local organization. • When a disaster happens, the new data center can be made operative by changing the IP address in the local DNS server.
  • 107. © Matthew Bass 2013 What are the architectural implications? • State maintained in servers will be lost if a disaster happens • Dependencies other than configuration parameters must be identical in a replicated site. • Applications must be architected to be movable from one environment to another.
  • 108. © Matthew Bass 2013 Topics • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 109. © Matthew Bass 2013 Dependencies • There exist many different types of dependencies within a system. E.g. – Inter component – Version – Configuration parameters – Hardware – Location – Names – DB schemas – Platform – Libraries • Inconsistency among these dependencies is a common source of production time errors.
  • 110. © Matthew Bass 2013 For example • You develop some code on your desktop. – You have installed the latest Java update – You configure your code to use a Python script to do some data cleansing – You depend on a component that your colleagues are simultaneously developing. • You deploy your code into production. – The latest Java version has not been installed. – Python has not been installed in the production environment. – Your colleagues are delayed in their development.
  • 111. © Matthew Bass 2013 You finally get your code into production • A user has a problem and calls the help desk. • The help desk doesn’t know how to solve the problem and escalates it back to you. • You have gone on vacation.
  • 112. © Matthew Bass 2013 Problems lead to a requirement for a formal “release plan” 1. Define and agree release and deployment plans with customers/stakeholders. 2. Ensure that each release package consists of a set of related assets and service components that are compatible with each other. 3. Ensure that integrity of a release package and its constituent components is maintained throughout the transition activities and recorded accurately in the configuration management system. 4. Ensure that all release and deployment packages can be tracked, installed, tested, verified, and/or uninstalled or backed out, if appropriate. 5. Ensure that change is managed during the release and deployment activities. 6. Record and manage deviations, risks, issues related to the new or changed service, and take necessary corrective action. 7. Ensure that there is knowledge transfer to enable the customers and users to optimise their use of the service to support their business activities. 8. Ensure that skills and knowledge are transferred to operations and support staff to enable them to effectively and efficiently deliver, support and maintain the service, according to required warranties and service levels *http://en.wikipedia.org/wiki/Deployment_Plan
  • 113. © Matthew Bass 2013 Release planning is labor intensive • Note the requirements for coordination in the release plan • Each item requires multiple people and time consuming activities. – Time consuming activities delay introducing features included in the release. • Open questions – Which items are dealt with through process? – Which items are dealt with through tool support? – Which items are dealt with through architecture design? – Which items are dealt with through a combination of the above? • We will see an architecture designed to reduce team coordination in a subsequent lecture.
  • 114. © Matthew Bass 2013 Topics • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 115. © Matthew Bass 2013 What is a configuration parameter? • A configuration parameter or environment variable is a parameter for an application that either controls the behavior of the application or specifies a connection of the application to its environment – Thread pool or database connection pool size control the behavior of the application. – Database url specifies a connection of the app to a database.
  • 116. © Matthew Bass 2013 When are configuration parameters bound? • Recommended practice is to bind these at initialization time for the app. – App is loaded into an execution environment – App is told where to find configuration parameters through language, OS, or environment specific means. E.g. main parameter in C – App reads configuration parameters from the specified location. • The virtue of this approach is that an app can be loaded into different execution environments and doesn’t need to be aware of which environment it is.
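A minimal sketch of initialization-time binding through environment variables; the variable names and defaults are assumptions. The same code can then be loaded into unit-test, integration, and production environments unchanged:

```python
import os

def load_config():
    return {
        # Connection of the app to its environment:
        "db_url": os.environ.get("APP_DB_URL", "postgres://localhost/dev"),
        # Control of the app's behavior:
        "pool_size": int(os.environ.get("APP_POOL_SIZE", "8")),
    }

if __name__ == "__main__":
    config = load_config()                  # read once, at startup
    print("Connecting to", config["db_url"], "with pool", config["pool_size"])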
  • 117. © Matthew Bass 2013 Use DB as an example – Unit test • App is given URL for database access component. • In the case of unit test, the database access component is a component that maintains some fake data in memory for fast access without the overhead of the full DB.
  • 118. © Matthew Bass 2013 Integration Test • Test database is maintained for integration testing. • Test database has subset of full data base. • URL of test database is provided to App • App can read or write test database
  • 119. © Matthew Bass 2013 Performance testing • Special database access component exists for performance testing – Passes through reads to the production database – Writes to a mirror database • App is given the URL of the special database access component • Allows testing with real data but blocks writes to the real database • Mirror database is checked at the end of the test for correctness.
  • 120. © Matthew Bass 2013 Other configuration parameters • Other configuration parameters should be identical from integration test through to production. • Reduces possibility of incorrect specification of configuration parameters. – Incorrect specification of configuration parameters is a major source of deployment errors.
  • 121. © Matthew Bass 2013 Topics • Zookeeper • Failure in the cloud • Business continuity • Release planning • Managing configuration parameters • Monitoring
  • 122. © Matthew Bass 2013 Monitoring • When is this done • Why is it done • What can you get from monitoring • Data sources – monitors/logs
  • 123. © Matthew Bass 2013 What is monitoring? • Monitoring is the collection of data from individual or collections of systems during the runtime of these systems. • Isn’t this an operations problem and not an architectural problem? – No. • Operators are first class stakeholders and their needs should be considered when designing the system. • In the modern world, difficult run time problems are solved by the architect so it’s to your advantage that the correct information is available. • Other reasons are implicit in the uses of monitoring information which we are about to go into.
  • 124. © Matthew Bass 2013 Why monitor? 1. Identifying failures and the associated faults both at runtime and during post-mortems held after a failure has occurred. 2. Identifying performance problems both of individual systems and collections of interacting systems. 3. Characterizing workload for both short term and long term billing and capacity planning purposes. 4. Measuring user reactions to various types of interfaces or business offerings. We will discuss A/B testing later. 5. Detecting intruders who are attempting to break into the system. (outside of our scope).
  • 125. © Matthew Bass 2013 Basic metrics • Per VM instance the provider will collect – CPU utilization – Disk read/writes – Messages in/out • These metrics are used for – Charging – Scaling – Mapping utilization to workload • Similar types of metrics for storage and utilities • Can aggregate these metrics over autoscaling groups, regions, accounts, etc.
  • 126. © Matthew Bass 2013 Other metrics • The problem with the basic metrics is that they are not related to particular activities whether business or internal. • Other things to monitor – Transactions – transactions per second gives the business an idea of how many customers are utilizing the system. – Transactions by type. – Messages from one portion of the system to another. – Error conditions detected by different portions of the system – … anything you want
  • 127. © Matthew Bass 2013 How do I decide what to monitor? • Look at reasons for monitoring – Failure detection – Performance degradation – Workload characterization – User reactions • For each reason, – decide what symptoms you would like reported. – Place responsibilities to detect symptoms in various modules. – Decide on active/passive monitoring (discussed soon) – Decide what constitutes an alarm (discussed soon) – Logic should be under configuration control – levels of reporting
  • 128. © Matthew Bass 2013 Metadata is crucial • Data by itself is not that useful. • It must be tagged with identifying information including a timestamp. • For example – VM CPU usage divided among which processes – I/O requests to which disks triggered from which VM process – Messages from which component to which other component in response to what user requests. • Ideal – each user request is given a tag and all monitoring information generated as a consequence of satisfying that request is tagged with the request ID. • Other monitoring activities are tagged with an ID that identifies why the activity was triggered.
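A sketch of tagging each monitoring record with a request ID so that effects can later be traced to their cause; the field names and the print stand-in for a monitoring sink are assumptions:

```python
import json
import time
import uuid

def emit(component, event, request_id):
    record = {"ts": time.time(), "component": component,
              "event": event, "request_id": request_id}
    print(json.dumps(record))               # stand-in for a monitoring sink

def handle_request(request):
    request_id = str(uuid.uuid4())          # one tag per user request
    emit("frontend", f"request.received {request['path']}", request_id)
    emit("db", "query.executed", request_id)  # downstream work carries the tag
    return request_id

handle_request({"path": "/orders/123"})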
  • 129. © Matthew Bass 2013 Why this emphasis on metadata? • Any of the uses enumerated for monitoring data require associating effect with its cause. • The monitoring data represents the effect. • The metadata enables determining the cause.
  • 130. © Matthew Bass 2013 Active/Passive • Active data collection involves the component that generates the data. It emits it periodically or based on a triggering event – To a key-value store – To a file – A message to a known location • Passive data collection involves the component that generates the data making it available to an agent in the same address space. The agent emits the data either periodically or based on events.
  • 131. © Matthew Bass 2013 Data Collection • Whether collection is active or passive, the data is emitted from a component to a known location, periodically or based on events. [Diagram: agents on each system/application emit data to a central Monitoring System]
  • 132. © Matthew Bass 2013 Monitoring Systems • Data collecting tools – Nagios – Sensu – Icinga – CloudWatch – AWS specific
  • 133. © Matthew Bass 2013 Volumes of data • It is possible to generate huge amounts of data. • That is the purpose of data collating tools – Logstash – Splunk • Features of such tools – Collating data from different instances – Visualization – Filtering – Organizing data – Reports
  • 134. © Matthew Bass 2013 Alarms • An alarm is a specific message about some condition needing attention. – Can be e-mail, text, or on screen for operators. • Problems with alarms – False positives – an alarm is raised without justification – False negatives – justification exists but no alarm is raised.
  • 135. © Matthew Bass 2013 Summary • Distributed coordination problems are simplified when using a tool such as Zookeeper • You must expect failure in the cloud and prepare for it. • A disaster is when everything has failed and you need to have business continuity plans • Flexibility in the cloud is achieved by setting configuration parameters, and those parameters need to be managed. • Monitoring lets you know what is going on with your system from whatever perspective you wish. But, you must choose your perspective.
  • 136. © Matthew Bass 2013 Questions??