Cloud computing is a concept that emerged in the late 1990s and the 2000s, in which applications and data are hosted over the internet. It developed in two main forms: software as a service, where vendors host specific applications, and infrastructure as a service, where clients run their own software on virtual machines hosted on the vendor's computers. Large cloud vendors include Amazon and Google. Cloud databases store and retrieve data for very large numbers of users and prioritize availability and scalability over consistency. Bigtable is Google's cloud data store; it stores each attribute value as a string and retrieves it using a hierarchical key made up of a record identifier and an attribute name.
Cloud computing
A new concept in computing that emerged in the late 1990s and the 2000s.
First, software as a service
o Vendors of software services provided specific customizable applications that
they hosted on their own machines
Then, generic computers as a service
o Clients run their own software, but run it on the vendor’s computers.
o These machines are called virtual machines, which are simulated by software that
allows a single real computer to simulate several independent computers
o Clients can add machines as needed to meet demand and release them at times of
light load.
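The elasticity described above can be sketched as a simple sizing rule. This is only an illustration: the function name machines_needed, the per-machine capacity, and the load figures are assumptions made for the example, not part of any vendor's API.

```python
import math


def machines_needed(requests_per_sec: float,
                    capacity_per_machine: float,
                    minimum: int = 1) -> int:
    """Return how many virtual machines are needed to serve the given load."""
    if capacity_per_machine <= 0:
        raise ValueError("capacity_per_machine must be positive")
    return max(minimum, math.ceil(requests_per_sec / capacity_per_machine))


# Hypothetical load samples (requests per second) and per-machine capacity.
load_samples = [50, 400, 1200, 300]
fleet_sizes = [machines_needed(load, capacity_per_machine=250)
               for load in load_samples]
print(fleet_sizes)
```

The fleet grows as the load rises and shrinks again when it falls, which is the behavior a client would otherwise need spare physical machines to achieve.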
Other services
o Data storage services, map services, and other services can be accessed using a
Web-service API.
Vendors of cloud services
o Traditional computing vendors, Amazon, Google
Cloud-based database
o Web applications need to store and retrieve data for very large numbers of users
o Value availability and scalability over consistency
Systems for data storage on the cloud
o Bigtable from Google
o Simple Storage Service (S3) from Amazon
o Cassandra from Facebook
o Sherpa/PNUTS from Yahoo!
Data Representation
It needs to provide flexibility in the set of attributes that a record contains, and the types
of these attributes
o XML, JSON
o Bigtable has its own data model (described below)
It does not need extensive query language support. Two primitive functions:
o put(key, value): store values with an associated key
o get(key): retrieve the stored value associated with the specified key
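The two primitives can be illustrated with a minimal in-memory sketch. The class name KeyValueStore, the key format, and the stored value are assumptions made for the example; a real cloud store would distribute and replicate the data rather than keep it in one dictionary.

```python
from typing import Dict, Optional


class KeyValueStore:
    """In-memory stand-in for a cloud key-value store (illustrative only)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        """Store value under key, overwriting any previous value."""
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        """Return the value stored under key, or None if the key is absent."""
        return self._data.get(key)


store = KeyValueStore()
store.put("user:22222", '{"deptname": "Sales"}')  # hypothetical key and value
print(store.get("user:22222"))
print(store.get("user:99999"))
```

Note that the store never interprets the value; anything beyond lookup by key is left to the application.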
An example application
o The profile of a user needs to be accessible to many different applications that are
run by an organization.
o The profile contains many attributes, and there are frequent additions to the
attributes stored in the profile
o Some attributes may contain complex data.
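As a rough illustration of such a profile, the sketch below represents it as JSON using Python's standard json module. All field names and values are made up for the example.

```python
import json

# Hypothetical user profile; every field name and value here is illustrative.
profile = {
    "id": "22222",
    "name": {"firstname": "Ann", "lastname": "Lee"},
    "deptname": "Sales",
}

# A new attribute can be added at any time; no schema change is required.
# The value may itself be complex data, such as a list of nested records.
profile["children"] = [{"firstname": "Sam", "lastname": "Lee"}]

# Serialize to a JSON string so the profile can be stored as a single value.
encoded = json.dumps(profile, sort_keys=True)
decoded = json.loads(encoded)
print(decoded["children"][0]["firstname"])
```

Because the representation is self-describing, different applications can read the attributes they understand and ignore the rest.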
Bigtable
A record is split into component attributes that are stored separately.
The key for an attribute value consists of (record-identifier, attribute-name).
Each attribute value is just a string.
Example: A record with identifier “22222” can have multiple attribute names such as
“name.firstname”, “deptname”, “children[1].firstname”, “children[2].lastname” (cf. the
JSON example in chapter 23).
To fetch all attributes of a record, a prefix-match query consisting of just the record
identifier is used.
The record identifier can itself be structured hierarchically.
A single instance of Bigtable can store data for multiple applications, with multiple tables
per application, by simply prefixing the application name and table name to the record
identifier.
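The data model described above can be sketched as follows. This is an illustration of the idea only, not Bigtable's actual API: the helper names flatten, store_record, and fetch_record, the application and table prefix, and the sample values are all assumptions.

```python
from typing import Any, Dict, Iterator, Tuple

Key = Tuple[str, str]  # (record-identifier, attribute-name)


def flatten(value: Any, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (attribute-name, string-value) pairs for a nested record."""
    if isinstance(value, dict):
        for name, child in value.items():
            yield from flatten(child, f"{prefix}.{name}" if prefix else name)
    elif isinstance(value, list):
        # Lists are numbered from 1, matching names like children[1].firstname.
        for index, child in enumerate(value, start=1):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        yield prefix, str(value)  # every attribute value is stored as a string


def store_record(table: Dict[Key, str], record_id: str,
                 record: Dict[str, Any]) -> None:
    """Split a record into attributes and store each one under its own key."""
    for attribute_name, text in flatten(record):
        table[(record_id, attribute_name)] = text


def fetch_record(table: Dict[Key, str], record_id: str) -> Dict[str, str]:
    """Prefix-match query: return all attributes stored under record_id."""
    return {attr: text for (rid, attr), text in table.items()
            if rid == record_id}


table: Dict[Key, str] = {}
# Hypothetical application and table names prefixed to the record identifier.
record_id = "hr_app.staff.22222"
store_record(table, record_id, {
    "name": {"firstname": "Ann", "lastname": "Lee"},
    "deptname": "Sales",
    "children": [{"firstname": "Sam"}, {"firstname": "Kim"}],
})
print(table[(record_id, "children[2].firstname")])
print(sorted(fetch_record(table, record_id)))
```

A single attribute is read with the full (record-identifier, attribute-name) key, while the whole record is recovered by matching on the record identifier alone.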
Partitioning and Retrieving Data
Unlike in a regular parallel database, it is usually not possible to decide on a partitioning
function ahead of time.
Therefore, data is partitioned into small units, called tablets.
The partitioning is done on the search key, so that a request for a specific key value is
directed to a single tablet.
The site to which a tablet is assigned acts as the master site.
o All updates are routed through this site, and then propagated to replicas
The partitioning of data is not fixed, but happens dynamically.
A tablet controller site tracks the partitioning function, which maps a get() request to a
tablet, and the mapping from tablets to sites.
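The role of the tablet controller can be sketched with range partitioning on the search key. This is a simplified illustration under stated assumptions: the class TabletController, its method names, the split keys, and the site names are hypothetical, and real systems also handle replication, failures, and load-based splitting.

```python
import bisect
from typing import List


class TabletController:
    """Tracks which tablet covers each key range and which site hosts it."""

    def __init__(self, first_site: str) -> None:
        # Tablet i covers keys k with lower_bounds[i] <= k < lower_bounds[i+1];
        # the last tablet has no upper bound.
        self.lower_bounds: List[str] = [""]
        self.sites: List[str] = [first_site]

    def tablet_for(self, key: str) -> int:
        """Map a search key to the index of the single tablet that covers it."""
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def site_for(self, key: str) -> str:
        """Map a search key to the master site of its tablet."""
        return self.sites[self.tablet_for(key)]

    def split(self, split_key: str, new_site: str) -> None:
        """Split the tablet covering split_key; keys >= split_key move."""
        index = self.tablet_for(split_key)
        if self.lower_bounds[index] == split_key:
            raise ValueError("a tablet already starts at this key")
        self.lower_bounds.insert(index + 1, split_key)
        self.sites.insert(index + 1, new_site)


controller = TabletController(first_site="site-A")
controller.split("m", new_site="site-B")  # partitioning changes at run time
print(controller.site_for("22222"), controller.site_for("smith"))
```

Because the controller holds only the range boundaries and site assignments, a tablet can be split or moved without changing how clients issue get() requests.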
Figure: Architecture of a cloud data storage system
Challenges with Cloud-based Databases
Advantages
o Do not need to build a computing infrastructure from scratch
o Essential for certain applications
Disadvantages
o Additional communication cost, as in a traditional distributed database system
o The physical location of data is under the control of the vendor, and the client is
unaware of it
Hard to perform query optimization
o Replication is under the control of the vendor
Hard to ensure that the latest version of the data is read
o Data held by another organization carries risks in terms of security and legal
liability
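The difficulty of reading the latest version under vendor-controlled replication can be illustrated with a small simulation. It assumes updates go to a master copy first and reach the replicas only later; the class ReplicatedStore, its methods, and the sample key and values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class ReplicatedStore:
    """A master copy plus replicas that are updated only by propagate()."""

    def __init__(self, replica_count: int) -> None:
        self.master: Dict[str, str] = {}
        self.replicas: List[Dict[str, str]] = [
            {} for _ in range(replica_count)
        ]
        self.pending: List[Tuple[str, str]] = []  # updates not yet propagated

    def put(self, key: str, value: str) -> None:
        """Apply the update at the master and queue it for the replicas."""
        self.master[key] = value
        self.pending.append((key, value))

    def propagate(self) -> None:
        """Apply queued updates to every replica, in the order they were made."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

    def get_from_replica(self, index: int, key: str) -> Optional[str]:
        """Read from one replica, which may lag behind the master."""
        return self.replicas[index].get(key)


replicated = ReplicatedStore(replica_count=2)
replicated.put("22222.deptname", "Sales")
replicated.propagate()
replicated.put("22222.deptname", "Finance")
stale = replicated.get_from_replica(0, "22222.deptname")  # update not yet seen
replicated.propagate()
fresh = replicated.get_from_replica(0, "22222.deptname")
print(stale, fresh)
```

Between the second put and the following propagate, the replica still returns the old value, and the client has no control over when propagation happens.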