2. Dec 8th , 2011Dec 8th , 2011
Bigtable: A Distributed Storage System
1. Introduction
2. What is a Bigtable?
3. Why not a DBMS?
4. Data model: Row
Column
Timestamps
5. APIs
6. Building Blocks
7. Real Applications
8. Conclusion
Introduction
• Bigtable is a distributed storage system
for managing structured data.
• Designed to scale to a very large size
- Petabytes of data across thousands of
servers
• Used for many Google projects
- Web indexing, Personalized Search, Google
Earth, Google Analytics, Google Finance, …
• Flexible, high-performance solution for
these Google products
What is a Bigtable?
• “A Bigtable is a sparse, distributed,
persistent multidimensional sorted map. The
map is indexed by a row key, a column key,
and a timestamp; each value in the map is an
uninterpreted array of bytes.”
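The definition above can be pictured as a dictionary keyed by (row key, column key, timestamp). The following is only a minimal in-memory sketch of that idea, not Bigtable's implementation; the class name `TinyTable` and the sample keys are hypothetical, and distribution and persistence are not modeled.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical key type: (row key, column key, timestamp).
Key = Tuple[str, str, int]


class TinyTable:
    """Illustrative stand-in for a sparse, sorted, multidimensional map."""

    def __init__(self) -> None:
        self._cells: Dict[Key, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # Values are uninterpreted bytes; the table never parses them.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        # Sparse: a cell that was never written simply has no entry.
        return self._cells.get((row, column, timestamp))

    def scan(self) -> List[Tuple[Key, bytes]]:
        # Sorted: entries come back ordered by (row, column, timestamp).
        return sorted(self._cells.items())


table = TinyTable()
table.put("row-b", "contents:", 2, b"<html>v2</html>")
table.put("row-a", "anchor:example", 1, b"Example")
```

Calling `table.scan()` returns the "row-a" entry before the "row-b" entry, regardless of insertion order.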
Why not a DBMS?
• Few DBMSs support the requisite scale
– Google required a database with wide scalability, wide
applicability, high performance and high
availability
• Could not afford it even if there were one
– Most DBMSs require very expensive
infrastructure
• DBMSs provide more than Google needs
– E.g., full transactions, SQL
• Google has highly optimized lower-level systems
that could be exploited
– GFS, Chubby, MapReduce, Job scheduling
Data model: Row
• Row keys are arbitrary strings
• Row is the unit of transactional consistency
• Data is maintained in lexicographic order by row
key
• Rows with consecutive keys (a row range) are
grouped together into “tablets”.
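As a rough illustration of the last two bullets, the sketch below sorts row keys lexicographically and cuts the sorted sequence into contiguous row ranges. The function names, the sample keys, and the fixed rows-per-tablet split are assumptions made for illustration only; the slides do not say how tablet boundaries are chosen.

```python
from bisect import bisect_left
from typing import List


def split_into_tablets(row_keys: List[str], rows_per_tablet: int) -> List[List[str]]:
    """Sort keys lexicographically, then group consecutive keys into row ranges."""
    ordered = sorted(row_keys)
    return [
        ordered[i:i + rows_per_tablet]
        for i in range(0, len(ordered), rows_per_tablet)
    ]


def find_tablet(tablets: List[List[str]], row_key: str) -> int:
    """Return the index of the tablet whose row range would hold row_key."""
    end_keys = [tablet[-1] for tablet in tablets]  # last key of each range
    index = bisect_left(end_keys, row_key)
    # Keys beyond the final end key fall into the last tablet in this sketch.
    return min(index, len(tablets) - 1)


tablets = split_into_tablets(
    ["org.example.www", "com.example.www", "com.sample.www", "com.example.mail"],
    rows_per_tablet=2,
)
```

Because the keys are sorted, rows with a shared prefix (here `com.example.`) end up next to each other, so a short range scan touches few tablets.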
Data model: Column
• Column keys are grouped into sets called “column
families”, which form the unit of access control.
• A column key is named using the following syntax:
family:qualifier
• Access control and disk/memory accounting are
performed at the column family level
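The family:qualifier syntax and per-family access control could be sketched as follows. This is an illustrative assumption, not the Bigtable API: the `READERS` table, the user names, and both function names are hypothetical.

```python
from typing import Dict, Set, Tuple


def parse_column_key(column_key: str) -> Tuple[str, str]:
    """Split 'family:qualifier' at the first colon; the qualifier may be empty."""
    family, separator, qualifier = column_key.partition(":")
    if not separator or not family:
        raise ValueError(f"expected 'family:qualifier', got {column_key!r}")
    return family, qualifier


# Hypothetical access-control list: column family -> users allowed to read it.
READERS: Dict[str, Set[str]] = {
    "anchor": {"alice"},
    "contents": {"alice", "bob"},
}


def can_read(user: str, column_key: str) -> bool:
    """Permission is decided by the family alone, never by the qualifier."""
    family, _ = parse_column_key(column_key)
    return user in READERS.get(family, set())
```

Checking only the family keeps the permission table small even when a family holds a very large number of qualifiers.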
Data model: Timestamps
• Each cell in Bigtable can contain multiple versions
of data, each indexed by timestamp
• Timestamps are 64-bit integers
• Assigned by:
– Bigtable
– Client application
• Data is stored in decreasing timestamp order, so
that most recent data is easily accessed
– Application specifies how many versions (n) of data items
are maintained in a cell
– Bigtable garbage-collects older cell versions automatically.
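A minimal sketch of a versioned cell, assuming a keep-the-last-n policy as described above: versions are held newest first, and anything beyond n is dropped on write. The class `VersionedCell` and its method names are hypothetical and only illustrate the ordering and garbage-collection idea.

```python
from typing import List, Optional, Tuple


class VersionedCell:
    """Illustrative cell that keeps at most max_versions values, newest first."""

    def __init__(self, max_versions: int) -> None:
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._versions: List[Tuple[int, bytes]] = []  # (timestamp, value)

    def write(self, timestamp: int, value: bytes) -> None:
        # A write with an existing timestamp replaces that version.
        self._versions = [v for v in self._versions if v[0] != timestamp]
        self._versions.append((timestamp, value))
        # Keep decreasing timestamp order so the newest version is first.
        self._versions.sort(key=lambda version: version[0], reverse=True)
        # Garbage-collect everything older than the newest max_versions.
        del self._versions[self.max_versions:]

    def latest(self) -> Optional[bytes]:
        return self._versions[0][1] if self._versions else None

    def timestamps(self) -> List[int]:
        return [timestamp for timestamp, _ in self._versions]


cell = VersionedCell(max_versions=2)
for ts, value in [(3, b"v3"), (6, b"v6"), (5, b"v5")]:
    cell.write(ts, value)
```

After these three writes the cell holds timestamps 6 and 5; the version at timestamp 3 has been collected.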
Data Model
Example: Web Indexing
[Figure slides: the web indexing example table, annotated in turn to show
timestamps, column families, and family:qualifier column keys]
APIs
• The Bigtable API provides functions for:
- Creating and deleting tables and column families
- Changing cluster, table, and column family
metadata
- Single-row transactions
- Using cells as integer counters
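The last two capabilities can be approximated in a few lines. The sketch below is not the Bigtable client API: `RowStore`, `mutate_row`, and `increment` are hypothetical names, and a per-row lock provides only mutual exclusion within one process (no rollback if a mutation fails midway).

```python
import threading
from typing import Callable, Dict, Tuple

Row = Dict[str, int]


class RowStore:
    """Illustrative store where each row is updated under its own lock."""

    def __init__(self) -> None:
        self._rows: Dict[str, Row] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the two dictionaries above

    def _entry(self, row_key: str) -> Tuple[threading.Lock, Row]:
        with self._guard:
            lock = self._locks.setdefault(row_key, threading.Lock())
            row = self._rows.setdefault(row_key, {})
            return lock, row

    def mutate_row(self, row_key: str, mutation: Callable[[Row], None]) -> None:
        # Single-row transaction: the read-modify-write runs without interleaving.
        lock, row = self._entry(row_key)
        with lock:
            mutation(row)

    def increment(self, row_key: str, column: str, delta: int = 1) -> int:
        # Treat the cell as an integer counter and return the new value.
        result = []

        def bump(row: Row) -> None:
            row[column] = row.get(column, 0) + delta
            result.append(row[column])

        self.mutate_row(row_key, bump)
        return result[0]


store = RowStore()
first = store.increment("page-1", "stats:hits")
second = store.increment("page-1", "stats:hits", delta=4)
```

Scoping the lock to one row mirrors the slide's point that the row is the unit of transactional consistency: updates to different rows never wait on each other here.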
Building Blocks
• Bigtable uses the distributed Google File
System (GFS) to store log and data files
• The Google SSTable file format is used
internally to store Bigtable data
• An SSTable provides a persistent, ordered,
immutable map from keys to values
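To make "ordered, immutable map" concrete, here is a small in-memory sketch. It is not the SSTable file format: the class `SketchSSTable` is hypothetical, persistence is not modeled, and binary search over a sorted tuple stands in for whatever index the real format uses.

```python
from bisect import bisect_left
from typing import Iterator, Mapping, Optional, Tuple


class SketchSSTable:
    """Illustrative ordered, immutable map from byte keys to byte values."""

    def __init__(self, entries: Mapping[bytes, bytes]) -> None:
        items = sorted(entries.items())
        # Tuples cannot be modified after construction, so the map is fixed.
        self._keys: Tuple[bytes, ...] = tuple(key for key, _ in items)
        self._values: Tuple[bytes, ...] = tuple(value for _, value in items)

    def get(self, key: bytes) -> Optional[bytes]:
        # Binary search works because the keys are stored in sorted order.
        index = bisect_left(self._keys, key)
        if index < len(self._keys) and self._keys[index] == key:
            return self._values[index]
        return None

    def scan(self, start: bytes, end: bytes) -> Iterator[Tuple[bytes, bytes]]:
        # Yield all pairs with start <= key < end, in key order.
        index = bisect_left(self._keys, start)
        while index < len(self._keys) and self._keys[index] < end:
            yield self._keys[index], self._values[index]
            index += 1


sstable = SketchSSTable({b"b": b"2", b"a": b"1", b"d": b"4"})
```

Since nothing is ever updated in place, readers need no locking; changes are expressed by writing a new table rather than editing an existing one.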
Real Applications
• Google Analytics
http://analytics.google.com
• Google Earth & Google Maps
http://earth.google.com
• Personalized Search
www.google.com/psearch
• Web Indexing
• Google Finance
• Orkut
• Writely
Conclusion
• Bigtable has achieved its goals of high performance,
data availability, and scalability.
• It has been successfully deployed in real applications
(Personalized Search, Orkut, Google Maps, …)
• Building its own storage system gave Google significant
advantages: flexibility in designing the data model, and control
over the implementation and the other infrastructure on which
Bigtable relies.