Understanding Bigtable
Tarun Kumar Sarkar
Adviser: Prof. Dr. Stefan Böttcher
University of Paderborn
September 30, 2015
Abstract
Bigtable is a distributed storage system designed by Google to manage large volumes of
structured data. Google's applications (Google Analytics, Google Earth, Google Finance,
etc.) place very different demands on a storage system in terms of data size, latency
requirements, and flexibility in managing their data. Google wanted a single system that
could meet these varied demands and could be deployed over a distributed environment.
The result was Bigtable, which provides the high scalability, flexibility, and high
performance those applications need. Bigtable offers a very simple data model that
gives clients dynamic control over data layout and format. In this paper we discuss the
Bigtable data model, its architecture, and the implementation of that architecture.
1. Introduction
1.1. Motivation
One central problem Google faced was storing and managing a large and rapidly
growing volume of information; another requirement was analyzing that information,
which could add significant value to decision making. Handling these issues with
traditional systems may involve complex workloads that push the boundaries of what is
possible with traditional data warehousing and data management techniques and
technologies. Traditional relational databases present a view composed of multiple
tables, each with rows and named columns. Queries, mostly expressed in SQL
(Structured Query Language), allow one to extract specific columns from a row where
certain conditions are met (e.g., a column has a specific value). Moreover, one can
query across multiple tables (this is the "relational" part of a relational database).
For example, a table of students may include a student's name, ID number, and contact
information, while a table of grades may include a student's ID number, course number,
and grade. We can construct a query that extracts a grade by name by looking up the ID
number in the student table and then matching that ID number in the grade table.
With traditional databases we also expect ACID guarantees: transactions are atomic,
consistent, isolated, and durable. As with distributed transactions in general, however,
it is impossible to guarantee consistency while also providing high availability and
network partition tolerance. This makes ACID databases unattractive for highly
distributed environments and led to the emergence of alternative data stores that
target high availability and high performance. Here, we will look at the structure and
capabilities of Bigtable.
1.2. Ground Work
A basic understanding of relational database concepts as well as the fundamentals of
Big Data will help in understanding this paper. Many references are available online;
even the Wikipedia pages on relational databases and Big Data are a good starting
point.
2. Bigtable
Bigtable is a distributed storage system that is structured as a large table (e.g., one
that may be petabytes in size and distributed among tens of thousands of machines). It
is designed for storing items such as billions of URLs with many versions of their
content, or over 100 TB of satellite image data. It can handle hundreds of millions of
users and perform thousands of queries per second. Bigtable was developed at Google
and has been in use since 2005 in hundreds of Google services.
2.1. Characteristics
Bigtable is, at its core, a sparse, distributed, persistent, multi-dimensional sorted map.
Map
A map is an associative array: a data structure that allows one to quickly look up the
value associated with a given key. Bigtable is a collection of (key, value) pairs.
Persistent
The data is stored persistently on disk.
Distributed
Bigtable’s data is distributed among many independent machines. The table is broken
up among rows, with groups of adjacent rows managed by a server. A row itself is
never distributed.
Sparse
The table is sparse, meaning that different rows in a table may use different columns,
with many of the columns empty for a particular row.
Sorted
Bigtable sorts its data by key. This keeps related data close together, usually on the
same machine, provided that one structures keys in such a way that sorting brings
related data together.
Multidimensional
A table is indexed by row key. Each row contains one or more named column families.
Each column family may have multiple columns, and each cell (the intersection of a row
and a column) may contain multiple versions of data, distinguished by timestamp.
Timestamp based
Time is another dimension in Bigtable's data storage. Every cell in Bigtable may keep
multiple versions of its data. If an application does not specify a timestamp, it
retrieves the latest version of the cell. Alternatively, it can specify a timestamp
and get the latest version that is earlier than or equal to that timestamp.
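The timestamp semantics above can be sketched in a few lines of Python. This is an illustrative model only, not Bigtable's actual API; the `Cell` class and its `read`/`write` methods are made-up names.

```python
# Sketch of Bigtable's timestamp semantics: each cell keeps several versions;
# a read without a timestamp returns the newest version, and a read with a
# timestamp returns the newest version at or before that timestamp.

class Cell:
    def __init__(self):
        self.versions = {}  # timestamp (int) -> value (bytes)

    def write(self, timestamp, value):
        self.versions[timestamp] = value

    def read(self, timestamp=None):
        if not self.versions:
            return None
        if timestamp is None:
            return self.versions[max(self.versions)]        # latest version
        eligible = [t for t in self.versions if t <= timestamp]
        return self.versions[max(eligible)] if eligible else None

cell = Cell()
cell.write(1, b"70%")
cell.write(2, b"85%")
cell.write(3, b"60%")
print(cell.read())    # latest version: b'60%'
print(cell.read(2))   # latest version at or before t=2: b'85%'
```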
2.2. Data Model
Bigtable is designed with semi-structured data storage in mind. It is a large map that
is indexed by a row key, column key, and a timestamp. Each value within the map is
an array of bytes that is interpreted by the application.
Let us look at a sample slice of a table (Figure 1), named serverperformance, that
stores performance information for many servers.
Figure 1: A slice of an example table that stores CPU and memory usage for many
servers. The row key is the name of the server (e.g., server1, server2). The cpu column
family contains CPU usage, and the memory column family contains memory usage for each
server. The cell at the intersection of the server1 row and the cpu:core2 column has
three versions of data, at timestamps t1, t2, and t3.
Rows
Bigtable maintains data in lexicographic order by row key. Every read or write of data
in a row is atomic, regardless of how many different columns are read or written
within that row. A table is logically split by rows into multiple sub-tables called
tablets. A tablet is a set of consecutive rows of a table and is the unit of
distribution and load balancing within Bigtable. Because the table is always sorted by
row, reads of short row ranges are efficient: one typically needs to communicate with
only a small number of machines. This is a key idea for ensuring a high degree of
locality in data access. The row keys in a table are arbitrary strings; in our
example, server1 and server2 are row keys.
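The locality argument above can be illustrated with a small sketch: if tablets are defined by sorted row-key boundaries, finding the tablets covering a key range is a binary search. The boundaries and server names below are hypothetical, chosen only for the example.

```python
import bisect

# Illustrative sketch (not Bigtable's real code): rows are kept in sorted
# order and partitioned into tablets by end-row boundaries, so a scan over a
# short key range touches only the few tablets covering that range.

tablet_boundaries = ["server3", "server6", "server9"]   # last row key of each tablet
tablet_servers = ["ts-a", "ts-b", "ts-c"]               # hypothetical server per tablet

def tablets_for_range(start_key, end_key):
    """Return the servers holding rows in [start_key, end_key]."""
    first = bisect.bisect_left(tablet_boundaries, start_key)
    last = bisect.bisect_left(tablet_boundaries, end_key)
    return tablet_servers[first:last + 1]

print(tablets_for_range("server1", "server2"))  # ['ts-a']: one tablet, one machine
print(tablets_for_range("server2", "server7"))  # ['ts-a', 'ts-b', 'ts-c']
```

A short range ("server1" to "server2") resolves to a single machine, which is exactly the locality property the sorted row order buys.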
Column Families
Each row contains one or more named column families; that is, column keys are grouped
into sets called column families. A column family must be defined before data can be
stored under any column key in that family. Within a column family, one may have any
number of named columns, and columns within a family can be created on the fly. All
data within a column family is usually of the same type, and the Bigtable
implementation usually compresses all of a column family's data together. Rows, column
families, and columns thus provide a three-level naming hierarchy for identifying
data. A column key is written as a printable family name followed by a column name,
which may be an arbitrary string; for example, cpu:core1 is a column key. A column
family is also the unit of access control, and both disk and memory accounting are
performed at the column family level. For example, a client might be allowed to read
only the cpu column family.
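The family:qualifier scheme and the rule that families must be declared up front can be sketched as follows. The `Table` class and `put` method are invented for this illustration and do not reflect Bigtable's real client API.

```python
# Hypothetical illustration of the family:qualifier naming scheme and of the
# rule that a column family must exist before data is written under it,
# while individual columns can be created on the fly.

class Table:
    def __init__(self, families):
        self.families = set(families)   # families must be declared up front
        self.cells = {}                 # (row, family, qualifier) -> value

    def put(self, row, column_key, value):
        family, _, qualifier = column_key.partition(":")
        if family not in self.families:
            raise KeyError(f"column family {family!r} not defined")
        self.cells[(row, family, qualifier)] = value  # column created on the fly

t = Table(families=["cpu", "memory"])
t.put("server1", "cpu:core1", b"70%")      # fine: family 'cpu' exists
try:
    t.put("server1", "disk:sda", b"40%")   # fails: family 'disk' was never defined
except KeyError as e:
    print(e)
```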
Figure 1, rendered as text:

serverperformance |          cpu          |      memory
                  | core1  core2   core3  | physical  virtual
server1           |      t1, t2, t3       |
server2           |                       |
server3           |                       |
Timestamps
Each cell can contain multiple versions of the same data. In the example, we may have
several timestamped versions (t1, t2, t3) of CPU performance data in the cpu:core2
column of server1. Each version is identified by a 64-bit timestamp that either
represents real time or is a value assigned by the client. A table is configured with
per-column-family settings for garbage collection of old data: a column family can be
set to keep only the latest n versions, or to keep only the versions written since
some time t (e.g., only values written in the last seven days).
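The two garbage-collection settings just described can be sketched as plain functions. This is an illustration of the policies, not Bigtable's actual configuration interface.

```python
# Sketch of the two per-column-family garbage-collection policies: keep only
# the newest n versions, or keep only versions written at or after a cutoff.

def gc_keep_last_n(versions, n):
    """versions: dict of timestamp -> value; keep the n newest versions."""
    keep = sorted(versions, reverse=True)[:n]
    return {t: versions[t] for t in keep}

def gc_keep_since(versions, cutoff):
    """Keep only versions written at or after the cutoff timestamp."""
    return {t: v for t, v in versions.items() if t >= cutoff}

versions = {1: b"a", 2: b"b", 3: b"c", 4: b"d"}
print(gc_keep_last_n(versions, 2))   # keeps timestamps 4 and 3
print(gc_keep_since(versions, 3))    # keeps timestamps 3 and 4
```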
2.3. Supported API
Bigtable supports functions for creating and deleting tables and column families, and
for changing cluster, table, and column family metadata such as access control rights.
A Bigtable client application can write or delete values in a table, retrieve values
from individual rows, or iterate over a subset of the data in a table.
Bigtable supports many features that allow the user to work on data and manipulate it
in complex ways, but it does not support transactions across row keys. Currently
Bigtable supports only single-row transactions, which means that atomic
read-modify-write sequences can be performed on data stored under a single row key.
A Bigtable client can execute its scripts in the address space of the servers. The
supported scripting language is Sawzall, developed at Google for processing data.
Bigtable also provides a set of wrappers for MapReduce, which allow a Bigtable to be
used both as an input source and as an output target for MapReduce jobs.
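The single-row transaction guarantee can be sketched with a per-row lock: a read-modify-write under one row key is atomic, and no such guarantee spans multiple rows. The class below is an illustration of the semantics only; the lock-based mechanism is an assumption for the sketch, not Bigtable's implementation.

```python
import threading

# Sketch of single-row transaction semantics: a read-modify-write under one
# row key is atomic; nothing here coordinates across different row keys.

class SingleRowStore:
    def __init__(self):
        self.rows = {}      # row key -> dict of column -> value
        self.locks = {}     # row key -> lock

    def _lock(self, row):
        return self.locks.setdefault(row, threading.Lock())

    def read_modify_write(self, row, column, fn):
        with self._lock(row):                 # atomic within one row key
            cols = self.rows.setdefault(row, {})
            cols[column] = fn(cols.get(column, 0))
            return cols[column]

store = SingleRowStore()
store.read_modify_write("server1", "cpu:restarts", lambda v: v + 1)
print(store.read_modify_write("server1", "cpu:restarts", lambda v: v + 1))  # 2
```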
2.4. Bigtable Architecture
Bigtable comprises three main components, as shown in Figure 2: a client library that
is linked into every client, a master server that coordinates activity, and many
tablet servers.
Figure 2: Bigtable Architecture
A Bigtable cluster may store a number of tables; each table consists of a set of
tablets, and each tablet contains a range of rows. Initially, each table consists of
just one tablet. As a table grows, it is automatically split into multiple tablets
(typically 100-200 MB in size). Tablet servers can be added or removed dynamically.
The master assigns tablets to tablet servers and balances tablet server load. It is
also responsible for garbage collection of files in GFS and for managing schema
changes (table and column family creation).
Each tablet server manages a set of tablets (typically 10-1,000 tablets per server).
It handles read/write requests to the tablets it manages and splits tablets that
become too large. As in other distributed systems, client data does not move through
the master; clients communicate directly with tablet servers for reads and writes.
This keeps the master lightly loaded.
2.5. Architecture Implementation
This section describes the fundamentals of the Bigtable architecture implementation.
Building Blocks
Bigtable is not self-contained; it is built on several other pieces of Google
infrastructure.
Bigtable uses the Google File System (GFS) to store data and log files. GFS provides
efficient, reliable access to data using large clusters of commodity hardware.
Bigtable depends on a cluster management system to schedule jobs, manage resources
on shared machines, deal with machine failures, and monitor machine status.
The SSTable file format is used to store Bigtable data. SSTables are designed so that
a data access requires at most a single disk access. An SSTable, once created, is
never changed. If new data is added, a new SSTable is created, and once an old SSTable
is no longer needed, it is set out for garbage collection. SSTable immutability is at
the core of Bigtable's checkpointing and recovery routines.
Chubby is a highly available and persistent distributed lock service that manages
leases for resources and stores configuration information. The service runs with five
active replicas, one of which is elected master to serve requests; a majority of
replicas must be running for the service to work, and the Paxos algorithm keeps the
replicas consistent. Chubby provides a namespace of files and directories, and each
file or directory can be used as a lock. Bigtable uses Chubby to ensure there is only
one active master, to store the bootstrap location of Bigtable data, to discover
tablet servers, to store Bigtable schema information, and to store access control
lists.
Tablet Location
Tablet locations within a Bigtable are managed in a three-level hierarchy (Figure 3).
The first level is a file stored in Chubby that contains the location of the root
tablet. The root (top-level) tablet stores the locations of all the tablets of a
special METADATA table, and each METADATA tablet contains the locations of a set of
user data tablets. The client library caches tablet locations for efficiency. Some
secondary information is also stored in the METADATA table for debugging and
performance analysis.
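The three-level lookup with client-side caching can be sketched as follows. All the names and locations here (the `ts-*` servers, the tablet names) are invented for the illustration.

```python
# Hypothetical walk through the three-level location hierarchy: a Chubby file
# points at the root tablet, the root tablet points at METADATA tablets, and
# METADATA tablets point at user tablets. The client caches what it learns so
# later lookups skip the upper levels.

chubby_file = "root-tablet@ts-a"                       # level 1: root tablet location
root_tablet = {"meta1": "ts-b", "meta2": "ts-c"}       # level 2: METADATA tablet locations
metadata = {"meta1": {"usertable/row0-row4": "ts-d"},  # level 3: user tablet locations
            "meta2": {"usertable/row5-row9": "ts-e"}}

location_cache = {}

def locate(user_tablet):
    if user_tablet in location_cache:                  # cached: no extra lookups
        return location_cache[user_tablet]
    _ = chubby_file                                    # 1st hop: read the Chubby file
    for meta_tablet in root_tablet:                    # 2nd hop: scan the root tablet
        if user_tablet in metadata[meta_tablet]:       # 3rd hop: read a METADATA tablet
            location_cache[user_tablet] = metadata[meta_tablet][user_tablet]
            return location_cache[user_tablet]
    return None

print(locate("usertable/row5-row9"))   # found via meta2
print(locate("usertable/row5-row9"))   # same answer, this time from the cache
```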
Figure 3: Tablet Location Hierarchy
Tablet Assignment
A tablet is assigned to at most one tablet server at any point in time. The master
keeps track of the set of live tablet servers and the current assignment of tablets
to tablet servers, including which tablets are unassigned. When a tablet is
unassigned and a tablet server has room for it, the master assigns the tablet by
sending a tablet load request to that server. Chubby is used to keep track of tablet
servers: when a tablet server starts, it creates, and acquires an exclusive lock on,
a uniquely named file in a specific Chubby directory. The master monitors this
directory (the servers directory) to discover tablet servers.
Whenever a master is started by the Bigtable cluster management system, it executes
the following steps to discover the current tablet assignments: (1) the master grabs
a unique master lock in Chubby, which prevents concurrent master instantiations; (2)
the master scans the servers directory in Chubby to find the live servers; (3) the
master communicates with every live tablet server to discover which tablets are
already assigned to each server; (4) the master scans the METADATA table to learn the
full set of tablets; (5) the master builds the set of unassigned tablets, which
become eligible for assignment.
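The five startup steps above can be sketched as one function. The `FakeChubby` and `FakeServer` stubs, and every method name on them, are stand-ins invented for this sketch, not real APIs.

```python
# Sketch of the master's five startup steps, with fake stand-ins for Chubby
# and the tablet servers.

class FakeServer:
    def __init__(self, tablets): self.tablets = tablets
    def list_tablets(self): return self.tablets

class FakeChubby:
    def acquire(self, name): pass                       # pretend to grab the lock
    def list_dir(self, path): return [FakeServer({"t1", "t2"})]

def master_startup(chubby, metadata_table):
    chubby.acquire("master-lock")                 # (1) grab the unique master lock
    servers = chubby.list_dir("servers")          # (2) scan the servers directory
    assigned = set()
    for s in servers:                             # (3) ask each live server its tablets
        assigned |= s.list_tablets()
    all_tablets = set(metadata_table)             # (4) scan METADATA for all tablets
    return all_tablets - assigned                 # (5) the unassigned set

print(master_startup(FakeChubby(), ["t1", "t2", "t3"]))  # {'t3'} is unassigned
```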
Tablet splits are treated specially, since a tablet server initiates them. The tablet
server commits a split by recording information for the new tablet in the METADATA
table; when the split has committed, it notifies the master.
Tablet Serving
Bigtable stores the persistent state of its data in GFS (Figure 4). Any update to a
Bigtable is first stored in a commit log, which holds redo records; these redo
records are used for recovery in case of failure. The most recently committed updates
are stored in the memtable (an in-memory sorted buffer), while older updates are
stored in a sequence of SSTables.
Figure 3, rendered as text: Chubby file → root tablet (1st METADATA tablet) → other
METADATA tablets → user tables (User Table 1 … User Table N)
Figure 4: Tablet Representation
When a write operation arrives at a tablet server, the server checks that the data is
well-formed and that the sender is authorized to perform the mutation (a mutation is
an abstraction for a series of updates). A valid mutation is first written to the
commit log, after which its contents are inserted into the memtable.
When a read operation arrives at a tablet server, it is similarly checked for
well-formedness and proper authorization. A valid read operation is executed on a
merged view of the sequence of SSTables and the memtable.
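The write and read paths just described can be sketched with dicts standing in for the real on-disk formats; the structures and values below are illustrative only.

```python
# Sketch of the tablet serving paths: a write goes to the commit log first,
# then into the in-memory memtable; a read merges the memtable with the
# (older) SSTables, with the newest value winning.

commit_log = []                               # redo records, for crash recovery
memtable = {}                                 # recent writes, in-memory buffer
sstables = [{"k1": b"old1", "k2": b"old2"}]   # older, immutable data in GFS

def write(key, value):
    commit_log.append((key, value))    # 1) log the mutation for recovery
    memtable[key] = value              # 2) then apply it to the memtable

def read(key):
    merged = {}
    for table in sstables:             # oldest data first...
        merged.update(table)
    merged.update(memtable)            # ...memtable last, so newest wins
    return merged.get(key)

write("k1", b"new1")
print(read("k1"))   # b'new1': the memtable shadows the SSTable value
print(read("k2"))   # b'old2': served from the SSTable
```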
Compactions
Bigtable performs two kinds of compaction: minor compaction and major compaction. In
a minor compaction, once the memtable reaches a threshold size it is frozen and a new
memtable is created; the frozen memtable is converted to an SSTable and written to
GFS. This has two goals: it shrinks the memory usage of the tablet server, and it
reduces the amount of data that has to be read from the commit log during recovery.
In a major compaction, Bigtable reads the contents of several SSTables (created
during minor compactions) and the memtable and writes them out to a single new
SSTable. The SSTables produced by major compactions contain no special deletion
entries, which may be present in SSTables created by minor compactions; major
compactions therefore allow Bigtable to reclaim the resources used by deleted data. A
client application can optionally specify which compression scheme to use for
SSTables.
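Minor compaction can be sketched as follows; the three-entry threshold and the dict-based structures are illustrative stand-ins for the real sizes and formats.

```python
# Minor compaction, sketched: when the memtable reaches a size threshold it
# is frozen, written out as a new immutable SSTable, and a fresh memtable
# takes its place.

MEMTABLE_LIMIT = 3        # illustrative threshold (the real one is in MB)

memtable = {}
sstables = []             # each entry is one immutable SSTable (a frozen dict)

def write(key, value):
    global memtable
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:        # memtable hit its threshold:
        sstables.append(dict(memtable))        # freeze it as a new SSTable
        memtable = {}                          # and start a fresh memtable

for i in range(7):
    write(f"k{i}", b"v")

print(len(sstables), len(memtable))   # two SSTables written, one key still in memory
```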
2.6. Open Ends
Where to use?
Bigtable is ideal for applications that need very high throughput and scalability for
semi-structured data. Bigtable can also be used as a storage engine for batch
MapReduce operations, stream processing, analytics, and machine-learning
applications. Bigtable can store and query marketing data (such as purchase histories
and customer preferences), financial data (such as transaction histories, stock
prices, and currency exchange rates), Internet of Things data (such as usage reports
from energy meters and home appliances), and time-series data (such as CPU and memory
usage over time for multiple servers).
Bigtable is not a relational database: it does not support SQL queries or joins, nor
does it support multi-row transactions. It is also not a good solution for small
amounts of data.
What is Next?
Research is ongoing on additional Bigtable features, such as support for secondary
indices and infrastructure for building cross-data-center replicated Bigtables with
multiple master replicas.
3. Conclusions
I have described Bigtable's characteristics, its data model, its architecture and the
implementation of that architecture, and who needs it. It is important to realize
that data comes in many shapes and sizes, and it has many different uses: real-time
fraud detection, web display advertising and competitive analysis, social media and
sentiment analysis, intelligent traffic management, and smart power grids are a few
examples. All of these analytical solutions involve significant volumes of both
semi-structured and structured data, and many of them were previously out of reach
because they were too costly to implement on a standard relational database system.
Bigtable, in combination with these new and evolving analytical processing
technologies, can bring significant benefits to a business. Google originally
developed this distributed system for storing structured data for its internal use.
Bigtable clusters have been in production use since April 2005; today Bigtable is
used in hundreds of Google products, and Google has many customers outside as well.
Bigtable users like the performance and high availability the implementation
provides, and the fact that they can scale the capacity of their clusters simply by
adding more machines as their resource demands change over time.

More Related Content

What's hot

ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtapADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
Vikas Jagtap
 
Introduction to database with ms access.hetvii
Introduction to database with ms access.hetviiIntroduction to database with ms access.hetvii
Introduction to database with ms access.hetvii
07HetviBhagat
 
Data resource management and DSS
Data resource management and DSSData resource management and DSS
Data resource management and DSS
RajThakuri
 
JovianDATA MDX Engine Comad oct 22 2011
JovianDATA MDX Engine Comad oct 22 2011JovianDATA MDX Engine Comad oct 22 2011
JovianDATA MDX Engine Comad oct 22 2011Satya Ramachandran
 
Database Part 1
Database Part 1Database Part 1
Database Part 1
Fizaril Amzari Omar
 
Database Part 2
Database Part 2Database Part 2
Database Part 2
Fizaril Amzari Omar
 
Introduction to databases
Introduction to databasesIntroduction to databases
Introduction to databases
Bryan Corpuz
 
Spot db consistency checking and optimization in spatial database
Spot db  consistency checking and optimization in spatial databaseSpot db  consistency checking and optimization in spatial database
Spot db consistency checking and optimization in spatial database
Pratik Udapure
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
Abishek V S
 
11 Database Concepts
11 Database Concepts11 Database Concepts
11 Database Concepts
Praveen M Jigajinni
 
Database system concepts
Database system conceptsDatabase system concepts
Database system conceptsKumar
 
Chapter 6 Database SC025 2017/2018
Chapter 6 Database SC025 2017/2018Chapter 6 Database SC025 2017/2018
Chapter 6 Database SC025 2017/2018
Fizaril Amzari Omar
 
Database concepts
Database conceptsDatabase concepts
Database concepts
Harry Potter
 
Database management system by Neeraj Bhandari ( Surkhet.Nepal )
Database management system by Neeraj Bhandari ( Surkhet.Nepal )Database management system by Neeraj Bhandari ( Surkhet.Nepal )
Database management system by Neeraj Bhandari ( Surkhet.Nepal )Neeraj Bhandari
 
ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1Witoon Thammatuch-aree
 
Database Design
Database DesignDatabase Design
Database Designlearnt
 

What's hot (17)

ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtapADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
ADVANCE DATABASE MANAGEMENT SYSTEM CONCEPTS & ARCHITECTURE by vikas jagtap
 
Introduction to database with ms access.hetvii
Introduction to database with ms access.hetviiIntroduction to database with ms access.hetvii
Introduction to database with ms access.hetvii
 
Data resource management and DSS
Data resource management and DSSData resource management and DSS
Data resource management and DSS
 
JovianDATA MDX Engine Comad oct 22 2011
JovianDATA MDX Engine Comad oct 22 2011JovianDATA MDX Engine Comad oct 22 2011
JovianDATA MDX Engine Comad oct 22 2011
 
Database Part 1
Database Part 1Database Part 1
Database Part 1
 
Database Part 2
Database Part 2Database Part 2
Database Part 2
 
Introduction to databases
Introduction to databasesIntroduction to databases
Introduction to databases
 
Spot db consistency checking and optimization in spatial database
Spot db  consistency checking and optimization in spatial databaseSpot db  consistency checking and optimization in spatial database
Spot db consistency checking and optimization in spatial database
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
11 Database Concepts
11 Database Concepts11 Database Concepts
11 Database Concepts
 
Database system concepts
Database system conceptsDatabase system concepts
Database system concepts
 
Chapter 6 Database SC025 2017/2018
Chapter 6 Database SC025 2017/2018Chapter 6 Database SC025 2017/2018
Chapter 6 Database SC025 2017/2018
 
Database concepts
Database conceptsDatabase concepts
Database concepts
 
Database management system by Neeraj Bhandari ( Surkhet.Nepal )
Database management system by Neeraj Bhandari ( Surkhet.Nepal )Database management system by Neeraj Bhandari ( Surkhet.Nepal )
Database management system by Neeraj Bhandari ( Surkhet.Nepal )
 
ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1ความรู้เบื้องต้นฐานข้อมูล 1
ความรู้เบื้องต้นฐานข้อมูล 1
 
Week 1
Week 1Week 1
Week 1
 
Database Design
Database DesignDatabase Design
Database Design
 

Viewers also liked

CPS_11.10.16(a)
CPS_11.10.16(a)CPS_11.10.16(a)
CPS_11.10.16(a)Jim Eskin
 
Las herramientas web en los estudios
Las herramientas web en los estudiosLas herramientas web en los estudios
Las herramientas web en los estudios
Gus Gus
 
Storyboarding
StoryboardingStoryboarding
Storyboarding
SeyiiO
 
Robert Williams Work Experience
Robert Williams Work ExperienceRobert Williams Work Experience
Robert Williams Work ExperienceRobert Williams
 
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
A-lehdet Oy
 
Proyecto 5
Proyecto 5Proyecto 5
INLIGHT App Note 37
INLIGHT App Note 37INLIGHT App Note 37
INLIGHT App Note 37Amber Cook
 
Smith, Simon- Resume - Master
Smith, Simon- Resume - MasterSmith, Simon- Resume - Master
Smith, Simon- Resume - Mastersimon smith
 
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
PMI Szczecin
 
PI_JohnnieGriffin_102615
PI_JohnnieGriffin_102615PI_JohnnieGriffin_102615
PI_JohnnieGriffin_102615Johnnie Griffin
 
Herramientas Ofimáticas
Herramientas OfimáticasHerramientas Ofimáticas
Herramientas Ofimáticas
Jose Andres Cerda Hidalgo
 
Keshu
KeshuKeshu
Keshu
kesu1234
 

Viewers also liked (13)

CPS_11.10.16(a)
CPS_11.10.16(a)CPS_11.10.16(a)
CPS_11.10.16(a)
 
Las herramientas web en los estudios
Las herramientas web en los estudiosLas herramientas web en los estudios
Las herramientas web en los estudios
 
Storyboarding
StoryboardingStoryboarding
Storyboarding
 
Robert Williams Work Experience
Robert Williams Work ExperienceRobert Williams Work Experience
Robert Williams Work Experience
 
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
Tytöt ja teknologia -tilaisuus 3.3.2016: Miten ja miksi tytöt käyttävät sosia...
 
Proyecto 5
Proyecto 5Proyecto 5
Proyecto 5
 
INLIGHT App Note 37
INLIGHT App Note 37INLIGHT App Note 37
INLIGHT App Note 37
 
ACCURATE WEIGHING SCALES Catalog CountryWingGroup
ACCURATE WEIGHING SCALES Catalog CountryWingGroupACCURATE WEIGHING SCALES Catalog CountryWingGroup
ACCURATE WEIGHING SCALES Catalog CountryWingGroup
 
Smith, Simon- Resume - Master
Smith, Simon- Resume - MasterSmith, Simon- Resume - Master
Smith, Simon- Resume - Master
 
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
Michał Koniewicz - "SCRUM - jak ugryźć i nie połamać sobie zębów - doświadcza...
 
PI_JohnnieGriffin_102615
PI_JohnnieGriffin_102615PI_JohnnieGriffin_102615
PI_JohnnieGriffin_102615
 
Herramientas Ofimáticas
Herramientas OfimáticasHerramientas Ofimáticas
Herramientas Ofimáticas
 
Keshu
KeshuKeshu
Keshu
 

Similar to Bigtable_Paper

Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
Shahbaz Sidhu
 
Bigtable
Bigtable Bigtable
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06temp2004it
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
mrlonganh
 
Google BigTable
Google BigTableGoogle BigTable
Google Big Table
Google Big TableGoogle Big Table
Google Big Table
Omar Al-Sabek
 
Google - Bigtable
Google - BigtableGoogle - Bigtable
Google - Bigtable
영원 서
 
8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databases
Fabio Fumarola
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
theijes
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
ijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
IJSCAI Journal
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
ijscai
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
IJSCAI Journal
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qsCapgemini
 

Similar to Bigtable_Paper (20)

Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable
Bigtable Bigtable
Bigtable
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Google BigTable
Google BigTableGoogle BigTable
Google BigTable
 
Google Big Table
Google Big TableGoogle Big Table
Google Big Table
 
GOOGLE BIGTABLE
GOOGLE BIGTABLEGOOGLE BIGTABLE
GOOGLE BIGTABLE
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
Google - Bigtable
Google - BigtableGoogle - Bigtable
Google - Bigtable
 
Big table
Big tableBig table
Big table
 
Sap abap material
Sap abap materialSap abap material
Sap abap material
 
8. column oriented databases
8. column oriented databases8. column oriented databases
8. column oriented databases
 
Database aggregation using metadata
Database aggregation using metadataDatabase aggregation using metadata
Database aggregation using metadata
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQLA STUDY ON GRAPH STORAGE DATABASE OF NOSQL
A STUDY ON GRAPH STORAGE DATABASE OF NOSQL
 
A Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQLA Study on Graph Storage Database of NOSQL
A Study on Graph Storage Database of NOSQL
 
22827361 ab initio-fa-qs
22827361 ab initio-fa-qs22827361 ab initio-fa-qs
22827361 ab initio-fa-qs
 

Bigtable_Paper

  • 1. Understanding Bigtable Tarun Kumar Sarkar Adviser: Prof. Dr. Stefan B̈ttcher University of Paderborn September 30, 2015 Abstract Bigtable is a distributed storage system designed by Google to manage large scale of structured data. Various application of Google (Google Analytics, Google Earth, Google Finance etc.) having different kind of demands in terms of dada size, latency requirement, flexibility of managing its data. Google wanted to develop an application, which can solve the varied demands from those applications and can be deployed over a distributed environment. After years of brainstorming they developed Bigtable, which provide high scalability, flexibility and high performance needed by those application. Bigtable provide a very simple data model, which gives the clients dynamic control over its data layout and format. We will discuss the Bigtable data model, its architecture and implementation of the architecture in this paper. 1. Introduction 1.1. Motivation One main problem Google faced was to store and manage the large and rapidly growing volume of information, another requirement was to analyze that information which could add significant value to the decision making process. Dealing these issues using traditional system may involve complex workloads, which push the boundaries of what are possible using traditional data warehousing and data management techniques and technologies. Traditional relational databases present a view that is composed of multiple tables, each with rows and named columns. Queries, mostly performed in SQL (Structured Query Language) allow one to extract specific columns from a row where certain conditions are met (e.g., a column has a specific value). Moreover, one can perform queries across multiple tables (this is the “relational” part of a relational database). For example a table of students may include a student’s name, ID number, and contact information. 
A table of grades may include a student's ID number, course number, and grade. We can construct a query that extracts a grade by name: it searches for the ID number in the student table and then matches that ID number in the grade table. Moreover, with traditional databases we expect ACID guarantees: transactions are atomic, consistent, isolated, and durable. As with distributed transactions in general, it is impossible to guarantee consistency while also providing high availability and tolerance to network partitions. This makes ACID databases unattractive for highly distributed environments and led to the emergence of alternative data stores that target high availability and high performance. Here, we will look at the structure and capabilities of Bigtable.

1.2. Ground Work
A basic understanding of relational database concepts as well as the fundamentals of Big Data will help in understanding this paper. Many references are available online; even the Wikipedia pages on relational databases and Big Data are of great help.
  • 2. 2. Bigtable
Bigtable is a distributed storage system that is structured as a large table, e.g., one that may be petabytes in size and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs with many versions of their content, or over 100 TB of satellite image data. It can handle hundreds of millions of users and perform thousands of queries per second. Bigtable was developed at Google and has been in use since 2005 in hundreds of Google services.

2.1. Characteristics
Bigtable is basically a sparse, distributed, persistent, multi-dimensional sorted map.
Map: A map is an associative array, a data structure that allows one to quickly look up the value corresponding to a key. Bigtable is a collection of (key, value) pairs.
Persistent: The data is stored persistently on disk.
Distributed: Bigtable's data is distributed among many independent machines. The table is broken up along row boundaries, with groups of adjacent rows managed by a server. A row itself is never distributed.
Sparse: The table is sparse, meaning that different rows in a table may use different columns, with many of the columns empty for a particular row.
Sorted: Bigtable sorts its data by key. This keeps related data close together, usually on the same machine, assuming that one structures keys so that sorting brings related data together.
Multidimensional: A table is indexed by row key. Each row contains one or more named column families. Each column family may have multiple columns, and each cell (the intersection of a row and a column) may contain multiple versions of the data, distinguished by timestamp.
Timestamp based: Time is another dimension in Bigtable's data storage. Every cell may keep multiple versions of its data. If an application does not specify a timestamp, it retrieves the latest version.
Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.

2.2. Data Model
Bigtable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key, a column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application.
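As a sketch, this data model can be viewed as a map from (row key, column key, timestamp) to an uninterpreted byte string. The class below is a toy illustration of that idea, not Google's implementation; all names in it are invented for the example.

```python
class SimpleBigtable:
    """Toy model of Bigtable's (row, column, timestamp) -> value map.
    Values are opaque byte strings interpreted by the application."""

    def __init__(self):
        # row key -> column key -> list of (timestamp, value), newest first
        self.rows = {}

    def put(self, row, column, timestamp, value):
        cells = self.rows.setdefault(row, {}).setdefault(column, [])
        cells.append((timestamp, value))
        cells.sort(reverse=True)  # keep the newest version first

    def get(self, row, column, timestamp=None):
        """Return the latest value, or the latest at or before `timestamp`."""
        for ts, value in self.rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

t = SimpleBigtable()
t.put("server1", "cpu:core2", 1, b"35%")
t.put("server1", "cpu:core2", 3, b"80%")
t.put("server1", "cpu:core2", 2, b"60%")
print(t.get("server1", "cpu:core2"))               # -> b'80%' (latest)
print(t.get("server1", "cpu:core2", timestamp=2))  # -> b'60%' (at or before t=2)
```

Note how the lookup with no timestamp returns the newest version, matching the default retrieval behavior described above.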
  • 3. Let us look at a sample slice of a table (Figure 1), named serverperformance, that stores performance information about many servers.

Figure 1: A slice of an example table that stores the CPU and memory usage of many servers.

The row key is the name of the server (e.g., server1, server2). The cpu column family contains the CPU usage, and the memory column family contains the memory usage of each server. The cell at the intersection of the server1 row and the cpu:core2 column holds three versions of its data, at timestamps t1, t2, and t3.

Rows
Bigtable maintains data in lexicographic order by row key. Every read or write of data within a single row is atomic, regardless of how many different columns are read or written within that row. A table is logically split along row boundaries into multiple sub-tables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable. Because the table is always sorted by row, reads of short ranges of rows are efficient: one typically needs to communicate with only a small number of machines. This is a key idea for ensuring a high degree of locality of data access. The row keys in a table are arbitrary strings; in our example, server1 and server2 are row keys.

Column Families
Each row contains one or more named column families; that is, column keys are grouped into sets called column families. A column family must be defined before data can be stored under any column key in that family. Within a column family, one may have any number of named columns. All data within a column family is usually of the same type, and the Bigtable implementation usually compresses all of a column family's data together. Columns within a column family can be created on the fly. Rows, column families, and columns provide a three-level naming hierarchy for identifying data. A column key is written as a printable family name followed by a column name, which may be an arbitrary string.
For example, cpu:core1 is a column key. The column family is the unit of access control, and both disk and memory accounting are also performed at the column family level. For example, a client may be allowed to read only the data of the cpu column family.
[Figure 1 shows the table serverperformance with rows server1–server3, column families cpu (core1, core2, core3) and memory (physical, virtual), and timestamps t1–t3.]
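The family:qualifier naming convention can be split mechanically on the first colon; a minimal sketch (the function name is invented for illustration):

```python
def parse_column_key(key):
    """Split a column key of the form 'family:qualifier'.
    The family name must be printable; the qualifier may be an
    arbitrary string, so only the first ':' separates the parts."""
    family, _, qualifier = key.partition(":")
    return family, qualifier

print(parse_column_key("cpu:core1"))        # -> ('cpu', 'core1')
print(parse_column_key("memory:physical"))  # -> ('memory', 'physical')
```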
  • 4. Timestamps
Each column family cell can contain multiple versions of the same data. In the example, we may have several timestamped versions (t1, t2, t3) of CPU performance data in the cpu:core2 column of server1. Each version is identified by a 64-bit timestamp that either represents real time or is a value assigned by the client. A table is configured with per-column-family settings for garbage collection of old data. A column family can be configured to keep only the latest n versions, or to keep only the versions written since some time t (e.g., only the values written in the last seven days).

2.3. Supported API
Bigtable provides functions for creating and deleting tables and column families, and for changing cluster, table, and column family metadata, such as access control rights. A Bigtable client application can write or delete values in a table, retrieve values from individual rows, or iterate over a subset of the data in a table. Bigtable supports many features that allow the user to work on data and manipulate it in complex ways, but it does not support transactions across row keys. Currently Bigtable supports only single-row transactions, which means that atomic read-modify-write sequences can be performed on data stored under a single row key. A Bigtable client can execute scripts in the address space of the servers; the supported scripting language is Sawzall, developed at Google for processing data. Bigtable also provides a set of wrappers for MapReduce, which allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

2.4. Bigtable Architecture
Bigtable comprises three main components, as shown in Figure 2: a client library (linked into every client), a master server that coordinates activities, and many tablet servers.

Figure 2: Bigtable Architecture
[Figure 2 shows clients communicating with a master server and with several tablet servers, each holding tablets.]
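The single-row transaction guarantee can be illustrated with a toy sketch: because a read-modify-write sequence never spans rows, it suffices to serialize it under a per-row lock. This is an invented illustration of the guarantee, not Bigtable's actual client API.

```python
import threading

class Row:
    """Toy single-row transaction: an atomic read-modify-write
    sequence on one row key; nothing spans multiple rows."""

    def __init__(self):
        self.cells = {}
        self.lock = threading.Lock()

    def read_modify_write(self, column, fn):
        # The whole read-modify-write sequence holds the row lock,
        # so concurrent writers to the same row cannot interleave.
        with self.lock:
            old = self.cells.get(column, 0)
            self.cells[column] = fn(old)
            return self.cells[column]

row = Row()
row.read_modify_write("counter:hits", lambda v: v + 1)
row.read_modify_write("counter:hits", lambda v: v + 1)
print(row.cells["counter:hits"])  # -> 2
```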
  • 5. A Bigtable cluster stores a number of tables; each table consists of a set of tablets, and each tablet contains the data of a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets (typically 100–200 MB in size). Tablet servers can be added or removed dynamically. The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and for managing schema changes (table and column family creation). Each tablet server manages a set of tablets (typically 10–1,000 tablets per server). It handles read and write requests to the tablets it manages and splits tablets that have grown too large. As in other distributed systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes. This keeps the master lightly loaded.

2.5. Architecture Implementation
This section describes the fundamentals of the Bigtable architecture's implementation.

Building Blocks
Bigtable is not independent; it is built on several other pieces of Google infrastructure. Bigtable uses the Google File System (GFS) to store data and log files; GFS provides efficient, reliable access to data using large clusters of commodity hardware. Bigtable also depends on a cluster management system to schedule jobs, manage resources on shared machines, deal with machine failures, and monitor machine status. The SSTable file format is used to store Bigtable data. SSTables are designed so that a data access requires at most a single disk access. An SSTable, once created, is never changed: if new data is added, a new SSTable is created, and once an old SSTable is no longer needed, it is set out for garbage collection. SSTable immutability is at the core of Bigtable's data checkpointing and recovery routines.
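Splitting a table into tablets can be sketched as repeatedly halving any contiguous, sorted row range that exceeds a size threshold. The function below is an invented illustration; real tablets split by data volume (around 100–200 MB), not by row count.

```python
def split_tablet(rows, max_rows):
    """Toy tablet split: a tablet is a sorted, contiguous row range.
    Any tablet larger than the threshold is split at its midpoint
    until every tablet fits; together they cover the original range."""
    tablets = [sorted(rows)]
    result = []
    while tablets:
        t = tablets.pop()
        if len(t) <= max_rows:
            result.append(t)
        else:
            mid = len(t) // 2
            tablets.extend([t[:mid], t[mid:]])
    return sorted(result)  # tablets in row-key order

rows = [f"server{i:03d}" for i in range(10)]
for tablet in split_tablet(rows, max_rows=4):
    print(tablet[0], "...", tablet[-1])
```

Because each tablet remains a contiguous run of sorted row keys, a scan over a short key range still touches only one or two tablets, which is the locality property described above.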
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected master to serve requests; a majority of replicas must be running for the service to work. The Paxos algorithm is used to keep the replicas consistent. Chubby provides a namespace of files and directories, and each file or directory can be used as a lock. Bigtable uses Chubby to ensure there is only one active master, to store the bootstrap location of Bigtable data, to discover tablet servers, to store Bigtable schema information, and to store access control lists.

Tablet Location
Locating a tablet within a Bigtable is managed in a three-level hierarchy (Figure 3). The first level is a file stored in Chubby that contains the location of the root tablet. The root (top-level) tablet stores the locations of all tablets of a special METADATA table, and each METADATA tablet contains the locations of a set of user data tablets. The client library caches tablet locations for efficiency. Some secondary information is also stored in the METADATA table for debugging and performance analysis.
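The three-level lookup with client-side caching can be sketched as follows. This is an invented illustration: the class, its fields, and the server names are all hypothetical, and the "3 lookups" counter stands in for the network round trips a cold cache would cost.

```python
class TabletLocator:
    """Toy three-level location lookup: Chubby file -> root tablet ->
    METADATA tablet -> user tablet server, with client-side caching."""

    def __init__(self, chubby, root, metadata):
        self.chubby = chubby      # location of the root tablet
        self.root = root          # table name -> METADATA tablet
        self.metadata = metadata  # (METADATA tablet, row key) -> tablet server
        self.cache = {}
        self.lookups = 0

    def locate(self, table, row_key):
        key = (table, row_key)
        if key in self.cache:      # cache hit: no round trips at all
            return self.cache[key]
        self.lookups += 3          # Chubby + root tablet + METADATA reads
        meta_tablet = self.root[table]
        server = self.metadata[(meta_tablet, row_key)]
        self.cache[key] = server
        return server

loc = TabletLocator(
    chubby="root-tablet-location",
    root={"serverperformance": "meta-tablet-7"},
    metadata={("meta-tablet-7", "server1"): "tabletserver-42"},
)
print(loc.locate("serverperformance", "server1"))  # -> tabletserver-42
print(loc.lookups)   # 3 on a cold cache
loc.locate("serverperformance", "server1")
print(loc.lookups)   # still 3: the second lookup hit the cache
```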
  • 6. Figure 3: Tablet Location Hierarchy

Tablet Assignment
A tablet is assigned to one tablet server at any point in time. The master keeps track of the set of live tablet servers and the current assignment of tablets to tablet servers, including which tablets are unassigned. If a tablet is unassigned and room is available on a tablet server, the master assigns the tablet by sending a tablet load request to that server. Chubby is used to keep track of tablet servers: when a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers. Whenever a master is started by the Bigtable cluster management system, it executes the following steps to discover the current tablet assignments:
(1) The master grabs a unique master lock in Chubby, which prevents concurrent master instantiations.
(2) The master scans the servers directory in Chubby to find the live servers.
(3) The master communicates with every live tablet server to discover which tablets are already assigned to each server.
(4) The master scans the METADATA table to learn the set of tablets.
(5) The master builds the set of unassigned tablets, which become eligible for assignment.
Tablet splits are treated specially, since a tablet server initiates them: the tablet server commits a split by recording information for the new tablet in the METADATA table, and once the split has committed, it notifies the master.

Tablet Serving
Bigtable stores the persistent state of its data in GFS (Figure 4). Any updates to a Bigtable are first recorded in a commit log that stores redo records; these redo records are used for recovery in case of failure. The most recently committed updates are stored in the memtable (an in-memory sorted buffer); older updates are stored in a sequence of SSTables.
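The commit-log/memtable/SSTable flow described above can be sketched in a few lines, including the minor compaction that freezes a full memtable into a new immutable SSTable. This is a toy illustration with invented names, not the actual tablet server.

```python
class TabletServer:
    """Toy write/read path: writes go to the commit log, then the
    memtable; reads see a merged view of the memtable and SSTables.
    A minor compaction freezes the memtable into a new SSTable."""

    def __init__(self, memtable_limit):
        self.commit_log = []    # redo records, kept for crash recovery
        self.memtable = {}      # recent writes, in-memory sorted buffer
        self.sstables = []      # older, immutable snapshots on "disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, then apply
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def read(self, key):
        # Merged view: the memtable shadows SSTables, newest SSTable first.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def minor_compaction(self):
        # Freeze the memtable and write it out as an immutable SSTable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

ts = TabletServer(memtable_limit=2)
ts.write("server1", "cpu=35%")
ts.write("server2", "cpu=50%")  # fills the memtable: minor compaction
ts.write("server1", "cpu=80%")  # newer value lands in the fresh memtable
print(ts.read("server1"))  # -> cpu=80% (memtable shadows the SSTable)
print(ts.read("server2"))  # -> cpu=50% (served from the SSTable)
```

A major compaction would additionally merge the accumulated SSTables back into one, discarding deletion entries along the way.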
  • 7. Figure 4: Tablet Representation

When a write operation arrives at a tablet server, the server checks that the data is well-formed and that the sender is authorized to perform the mutation (a mutation is an abstraction for performing a series of updates). A valid mutation is first written to the commit log, after which its contents are inserted into the memtable. When a read operation arrives at a tablet server, it is similarly checked for well-formedness and proper authorization. A valid read operation is executed on a merged view of the sequence of SSTables and the memtable.

Compactions
Bigtable performs two kinds of compaction: minor compaction and major compaction. In a minor compaction, once the memtable reaches a size threshold, it is frozen and a new memtable is created; the frozen memtable is converted to an SSTable and written to GFS. This has two goals: it shrinks the memory usage of the tablet server, and it reduces the amount of data that has to be read from the commit log during recovery. In a major compaction, Bigtable reads the contents of several SSTables (created during minor compactions) and the memtable and writes them out as a single new SSTable. SSTables produced by major compactions contain no special deletion entries, which may be present in SSTables created during minor compactions; major compactions thus allow Bigtable to reclaim the resources used by deleted data. A client application can optionally specify which compression to use for SSTables.

2.6. Open Ends
Where to use?
Bigtable is ideal for applications that need very high throughput and scalability for non-structured data. Bigtable can also be used as a storage engine for batch MapReduce operations, stream processing, analytics, and machine-learning applications.
Bigtable can be used to store and query marketing data (such as purchase histories and customer preferences), financial data (such as transaction histories, stock prices, and currency exchange rates), Internet of Things data (such as
  • 8. usage reports from energy meters and home appliances), and time-series data (such as CPU and memory usage over time for multiple servers). Bigtable is not a relational database: it does not support SQL queries or joins, nor does it support multi-row transactions. It is also not a good solution for small amounts of data.

What is Next?
Research is ongoing on additional Bigtable features, such as support for secondary indices and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas.

3. Conclusions
I have described the characteristics of Bigtable, its data model, its architecture and implementation, and who needs it. It is important to realize that data comes in many shapes and sizes and has many different uses: real-time fraud detection, web display advertising and competitive analysis, social media and sentiment analysis, intelligent traffic management, and smart power grids are a few examples. All of these analytical solutions involve significant volumes of both semi-structured and structured data. Many of them were not possible previously because they were too costly to implement using standard relational database systems. Bigtable, in combination with these new and evolving analytical processing technologies, can bring significant benefits to a business. Google initially developed this distributed system for storing structured data for its internal use. Bigtable clusters have been in production use since April 2005. Currently Bigtable is used in hundreds of Google products, and Google also has many external customers. Bigtable users like the performance and high availability provided by the implementation, and the fact that they can scale the capacity of their clusters by simply adding more machines as their resource demands change over time.

4. References
[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A.
Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.
[2] https://cloud.google.com/bigtable/
[3] https://en.wikipedia.org/wiki/BigTable
[4] https://en.wikipedia.org/wiki/Big_data
[5] https://en.wikipedia.org/wiki/Relational_database_management_system