Big table

Presented By:
Riddhi Tandon
Akshay Gupta
Vasu Ragan Lohia

Outline
1) Introduction
2) Google Services
3) GFS
4) Chubby
5) Map Reduce
6) Big Table
7) Structure Of BigTable
8) Log Files and Compaction
9) Load Balancing
10) LookUp
11) Compression: Snappy
12) Conclusion

Introduction
Google is best known for its reliable and fast services, but what is
working behind the scenes?
Here is a short introduction to Google.
About Google:
• The Google.com domain was registered on September 15, 1997.
• Google services are highly efficient, robust and trustworthy.
• The best known are Google Search, Docs, App Engine, Maps, Gmail and
many more.

What is Google?
• Google is an Internet information provider company (according to
NASDAQ).
• It makes money from its advertising business: AdWords and AdSense.
• Google helps your business grow through advertising, and you pay per
click (CPC, cost per click) or per thousand impressions (CPM, cost per mille).
• With this, Google set up a revolutionary advertising model.
• The revenue from these businesses lets Google build products that are
costly to maintain, yet which we get for free.

How are Google's services so fast?
Undoubtedly, a number of factors are behind this
(hardware, software, operating system, highly skilled staff, etc.).
This presentation covers the software part:
• GFS
• Chubby
• MapReduce
• Bigtable

What is GFS?
• GFS stands for Google File System.
• It is a proprietary (for Google's internal use, not open source)
distributed file system developed by Google for its services.
• It is designed to provide efficient, reliable access to data
using large clusters of commodity hardware, that is, low-cost
machines rather than state-of-the-art computers. Google uses
relatively inexpensive computers running Linux, and GFS works
just fine on them.

What is Chubby?
• Chubby is a lock service: it controls who gains access to shared
resources.
• It is used to synchronize access to those shared resources.
• Inside Google it is now also used as a name service, in place of the
Domain Name System.

What is MapReduce?
• MapReduce is a software framework that processes massive amounts
of unstructured data.
• It lets developers write programs that process data in parallel
across a distributed cluster of processors or stand-alone computers.
• Google has used it since 2004, mainly for its web indexing service.
• The Map() procedure performs the filtering and sorting.
• The Reduce() procedure performs the summary operations.
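
A minimal single-machine sketch of the Map() and Reduce() roles described above, using word counting as the example task. The function names (map_phase, shuffle, reduce_phase, word_count) are assumptions made for illustration and are not Google's API; a real MapReduce job would run the map and reduce calls on many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group the emitted values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce(): summarise all values seen for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = []
    for doc in documents:  # on a cluster, each document could be mapped by a different worker
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big table big data", "big cluster"])
```

Here counts maps each word to the number of times it appears across both input strings.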

What is Google BigTable?
• BigTable is a compressed, high-performance, proprietary data
storage system built on Google File System, the Chubby lock service,
SSTable (log-structured storage, as in LevelDB) and a few other Google
technologies.
• It is a proprietary data storage system (that is, it is for Google's
internal use only).
• Most importantly, it is a non-relational database.
• Its load-balancing structure lets it run on commodity hardware.
• It uses the Snappy compression library to compress the data.

In other words:
• It is a database that uses compression to store and retrieve data
efficiently.
• It stores data in a special structure (the load-balancing structure),
which gives it high performance.
• It is proprietary: it is for Google's internal use only and is not
open source.
• Google BigTable is built on several other Google technologies.

Requirements?
• BigTable is designed to run on commodity hardware (low-cost
computers).
• Thus the machines BigTable runs on are comparable to ordinary PCs like ours.
• Very low incremental cost for new services and for expanding
computing power.

Special Features
• It is a robust database: it keeps working in much the same way even
under bad conditions.
• BigTable gives the highest importance to read and query performance.
• Higher data availability:
a write is immediately replicated to multiple data centers.
• Automatic scaling:
BigTable uses a distributed architecture to automatically
manage scaling to very large data sets.

Structure of BigTable
• Each table is a multi-dimensional sparse map (a memory-efficient hash-map
implementation).
• A table consists of (1) rows and (2) columns, and (3) each cell has a time
version (timestamp).
• Time versions produce multiple copies of each cell with different timestamps.
This is a great deal of redundancy, but Google's services require it, so it
should not be seen as a drawback of the system.
• Google does web indexing to collect data on all websites. It stores every
URL, its title, its timestamp and many other required fields.
• Web indexing: indexing the contents of a website.
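
To make the row / column / timestamp idea concrete, here is a small sketch of a sparse map that stores only the cells that actually exist and keeps every time version of a cell. The class name SparseTable, its methods and the row key "com.example/index" are hypothetical, chosen only for illustration; they are not BigTable's real interface.

```python
class SparseTable:
    """Toy model: (row, column, timestamp) -> value, storing only existing cells."""

    def __init__(self):
        self._cells = {}  # {(row, column): {timestamp: value}}

    def put(self, row, column, timestamp, value):
        # A new timestamp adds a version; it does not overwrite older ones.
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

table = SparseTable()
table.put("com.example/index", "title", 1, "Old title")
table.put("com.example/index", "title", 5, "New title")

latest = table.get("com.example/index", "title")
as_of_3 = table.get("com.example/index", "title", timestamp=3)
missing = table.get("com.example/index", "language")  # never written, so nothing is stored
```

Empty cells cost nothing in this layout, which is what "sparse" means here, and older versions stay readable by asking for an earlier timestamp.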

Load Balancing Structure
(dummy sitemap of my website Codeplaza, where 5 fields are shown)
• Consider one huge table with millions of entries.
• To manage such tables, they are split at row boundaries and saved
as tablets.
• Each tablet is 100-200 MB in size, and each machine stores about 100 of
them. 100-200 MB of data can hold thousands of rows, or even more.
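
A rough sketch of splitting a row-sorted table into tablets at row boundaries. As a simplifying assumption, the split here is by row count rather than by size in MB, and the function name split_into_tablets and the row keys are invented for the example.

```python
def split_into_tablets(sorted_rows, rows_per_tablet):
    """Cut a row-sorted table into contiguous tablets at row boundaries."""
    tablets = []
    for start in range(0, len(sorted_rows), rows_per_tablet):
        chunk = sorted_rows[start:start + rows_per_tablet]
        tablets.append({
            "first_row": chunk[0][0],   # smallest row key held by this tablet
            "last_row": chunk[-1][0],   # largest row key held by this tablet
            "rows": chunk,
        })
    return tablets

# Ten dummy rows of (row key, title), sorted by row key.
rows = sorted((f"page{i:02d}", f"title {i}") for i in range(10))
tablets = split_into_tablets(rows, rows_per_tablet=4)
```

Because each tablet covers one contiguous key range, a row can be found by comparing its key with the first_row and last_row of each tablet.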

Example showing 4 rows = 1 tablet.
• This setup allows fine-grained load balancing. (If one tablet receives
lots of queries, it can be split and its data shared among other tablets, or
the busy tablet can be moved to a less busy machine.)
• This setup also allows fast rebuilding. (When a machine goes down, other
machines each take one tablet from the downed machine. So 100 machines each
get one new tablet, and the load on each machine of picking up a new tablet is
fairly small.)
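
The fast-rebuilding idea can be sketched as follows: when one machine fails, its tablets are handed out one at a time to whichever surviving machine currently holds the fewest. The function reassign_tablets, the machine names and the tablet names are assumptions for illustration; the real master's placement policy is not described in these slides.

```python
def reassign_tablets(assignment, failed_machine):
    """Spread the failed machine's tablets over the survivors, least-loaded first."""
    orphaned = assignment.pop(failed_machine)
    for tablet in orphaned:
        # Pick the surviving machine that currently serves the fewest tablets.
        target = min(assignment, key=lambda machine: len(assignment[machine]))
        assignment[target].append(tablet)
    return assignment

assignment = {
    "m1": ["t1", "t2", "t3"],
    "m2": ["t4", "t5", "t6"],
    "m3": ["t7", "t8", "t9"],
    "m4": ["t10", "t11", "t12"],
}
reassign_tablets(assignment, "m4")
```

Each survivor picks up only one extra tablet in this example, which is why recovery puts little load on any single machine.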

Log Files and Compaction
• Tablets are stored as immutable SSTables plus a tail of logs (one
log per machine).
• SSTable stands for 'Sorted String Table'. Some also call it 'Static and
Sorted Table'. The accompanying figure shows a dummy structure of an SSTable.
• When system memory fills up, the system compacts some tablets.
• There are two kinds of compaction: minor and major.

• Minor compactions involve only a few tablets, while major compactions
involve the whole system and reclaim hard disk space. The locations of
the tablets are themselves stored in special BigTable cells.
Immutable SSTable:
To mutate means to change or be updated over time. Remember the
mutants from X-Men and Krrish 3 (mutants are a special kind of species whose
DNA has changed over time).
SSTables are immutable: they are never changed or updated, that
is, they are static!
• Now the question is: how are entries stored in, or modifications made to,
an immutable SSTable?
• The answer is: remove the old SSTable and make a new one.
Sounds weird? But it is a great idea, because it saves all the time that
would otherwise go into searching and re-sorting to update data in a single
(large) table.
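
The "never update, write a new one" idea can be sketched like this. It is a toy model under stated assumptions: writes land in an in-memory buffer (called memtable here, a name not used in these slides), a minor compaction freezes that buffer into a new immutable sorted table, and a major compaction replaces all existing tables with one merged table. The class and method names are hypothetical.

```python
class Tablet:
    """Toy tablet: a mutable in-memory buffer plus a list of immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # recent writes, still mutable
        self.sstables = []        # tuples of sorted (key, value) pairs, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memory buffer into a new immutable, sorted SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def major_compaction(self):
        """Replace all SSTables with one new table; newer values win."""
        merged = {}
        for sstable in self.sstables:   # oldest to newest
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest table first
            for k, v in sstable:
                if k == key:
                    return v
        return None

tablet = Tablet()
tablet.write("a", 1)
tablet.write("b", 2)    # buffer full: first SSTable is created
tablet.write("a", 10)   # "updates" a by writing a newer entry, not by editing the old table
tablet.write("c", 3)    # buffer full: second SSTable is created
tablet.major_compaction()
```

No existing SSTable is ever edited in place: the outdated value of "a" simply disappears when the old tables are dropped during the merge, which is also where disk space is reclaimed.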

LookUp
• Lookup is a three-level system.
• Benefit: no big bottleneck in the system, and it also makes heavy use of
pre-fetching and caching.
Tablet Location Hierarchy
• The Chubby file contains the location of the root tablet.
• The root tablet contains the locations of all tablets in the Metadata table.
• The Metadata table stores the locations of the actual tablets.
• The client moves up the hierarchy (Metadata -> Root -> Chubby) if the
location of a tablet is unknown or incorrect.
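
A simplified sketch of the three-level lookup with client-side caching. Several things are assumptions for illustration: the class name TabletLocator, the server names, the dictionary layout, and the fact that tablets are looked up by table name rather than by row-key range. On a cache miss this toy always walks the hierarchy from the Chubby file downwards.

```python
class TabletLocator:
    """Toy lookup: Chubby file -> root tablet -> Metadata table -> actual tablet."""

    def __init__(self, chubby_file, servers):
        self.chubby_file = chubby_file  # {"root": server holding the root tablet}
        self.servers = servers          # {server name: data stored on that server}
        self.cache = {}                 # table name -> cached tablet location
        self.hops = 0                   # how many levels were read, for illustration

    def locate(self, table):
        if table in self.cache:         # cached location: no hierarchy reads at all
            return self.cache[table]
        self.hops += 1                  # level 1: read the Chubby file
        root_server = self.chubby_file["root"]
        self.hops += 1                  # level 2: read the root tablet
        metadata_server = self.servers[root_server]["metadata_location"]
        self.hops += 1                  # level 3: read the Metadata table
        location = self.servers[metadata_server]["tablets"][table]
        self.cache[table] = location
        return location

    def invalidate(self, table):
        """Forget a cached location, e.g. after finding that it is incorrect."""
        self.cache.pop(table, None)

chubby_file = {"root": "server-a"}
servers = {
    "server-a": {"metadata_location": "server-b"},
    "server-b": {"tablets": {"webtable": "server-c"}},
}
locator = TabletLocator(chubby_file, servers)
first = locator.locate("webtable")    # walks all three levels
second = locator.locate("webtable")   # answered from the cache
```

The second call costs nothing, which illustrates why caching keeps the upper levels of the hierarchy from becoming a bottleneck.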

Compression: Snappy
• There is a lot of redundant data in the system (especially across time), so
it makes heavy use of compression.
• Compression looks for similar values along the rows, columns and times.
(Priority comes into play here: lower-priority data is fetched less often
and compressed more.)
• Google used variations of BMDiff and Zippy for compression.
BMDiff gives high write speeds (~100 MB/s) and even faster read
speeds (~1000 MB/s). Zippy compresses very fast. This work led to the
library named "Snappy" (Zippy is its earlier name).
• Snappy is a compression/decompression library that does not aim for
maximum compression; instead, it aims for very high speeds and reasonable
compression. (On a single core of a Core i7 processor in 64-bit mode, Snappy
compresses at about 250 MB/sec or more and decompresses at about 500
MB/sec or more.)
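
To illustrate why redundancy across time versions compresses so well, the sketch below compresses ten nearly identical versions of a page. Note the assumption: Python's standard zlib module stands in for Snappy here, since Snappy is not part of the standard library, so the numbers say nothing about Snappy's actual ratio or speed. The page content is made up.

```python
import zlib

# Ten timestamped versions of the same page that differ only slightly.
versions = [
    f"<html><title>Codeplaza</title><body>visit {n}</body></html>".encode()
    for n in range(10)
]

raw = b"".join(versions)
compressed = zlib.compress(raw, 1)   # level 1 favours speed over compression ratio
ratio = len(raw) / len(compressed)   # how many times smaller the stored data is
restored = zlib.decompress(compressed)
```

Choosing the fastest level mirrors the trade-off described above: accept a reasonable ratio in exchange for speed.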

Actual Hierarchical Load Balancing Structure
• A request arrives at ROOT (the master computer).
• ROOT checks its master record and sends the request to the right PC.
• The SSTable contains the records of the tablets.
• Via the meta tablets, the request is sent to the tablet containing the
original data, and the data is then fetched.
This is how it works.

Conclusion
• Bigtable has achieved its goals of high performance, data
availability and scalability.
• It has been successfully deployed in real applications (Personalized
Search, Orkut, Google Maps, …).
• Building its own storage system gave Google significant advantages:
flexibility in designing the data model, and control over the implementation
and over the other infrastructure on which Bigtable relies.
