Presented By: 
Riddhi Tandon 
Akshay Gupta 
Vasu Ragan Lohia
Outline 
1) Introduction. 
2) Google Services 
3) GFS 
4) Chubby 
5) Map Reduce 
6) Big Table 
7) Structure Of BigTable 
8) Log Files and Compaction 
9) Load Balancing 
10) LookUp 
11) Compression:Snappy 
12) Conclusion
Introduction 
Google is best known for its fast and reliable services, but what is
working behind the scenes?
Let’s start with a short introduction to Google.
About Google:
 The Google.com domain was registered on September 15, 1997.
 Google services are highly efficient, robust and trustworthy.
 To name a few: Google Search first of all, then Docs,
App Engine, Maps, Gmail and many more.
What is Google?
 Google is an Internet Information Provider company (according to
NASDAQ).
It makes money from its advertising business: AdWords & AdSense.
 Google helps your business grow through advertising, and you pay it per CPC
(Cost Per Click) or CPM (Cost Per Mille, that is, per thousand impressions).
Google has set up a revolutionary advertising model.
 With the earnings from these businesses, Google builds amazing products
that are costly to maintain, yet we get them for free.
Why are Google’s services so fast?
Undoubtedly, a number of factors are behind this
(hardware, software, the operating system, some of the best staff in the world,
etc.).
But what I am going to explain here is the software part:
 GFS 
 Chubby 
 Map Reduce 
 Bigtable
What is GFS?
 GFS stands for Google File System.
 It is a proprietary (for Google’s internal use, not open source)
distributed file system developed by Google for its services.
 It is designed to provide efficient, reliable access to data
using large clusters of commodity hardware, that is, low-cost
machines rather than state-of-the-art computers. Google
uses relatively inexpensive computers running the Linux operating
system, and GFS works just fine on them!
What is Chubby?
 Chubby is a lock service: it controls who gains access to shared
resources.
 It is used to synchronize access to shared resources across many machines.
 Inside Google it is also used as a name service, in place of DNS for many internal lookups.
What is MapReduce?
 MapReduce is a software framework that processes massive amounts
of unstructured data.
 It allows developers to write programs that process data in parallel
across a distributed cluster of processors or stand-alone computers.
 Google has used it for its Web Indexing service, among other things;
the design was published in 2004.
 The Map() procedure handles filtering and sorting: it turns each input
record into key/value pairs.
 The Reduce() procedure handles the summary operations: it combines all the values that share a key.
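To make the Map() and Reduce() split concrete, here is a minimal single-machine sketch of the classic word count. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration and are not Google's API; a real MapReduce runs the same steps spread over many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map(): emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key (the framework does this between the two phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): summarise all values that share one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big table big data", "big cluster"]) counts "big" three times and every other word once.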
What is Google BigTable?
 BigTable is a compressed, high-performance, proprietary data
storage system built on Google File System, the Chubby lock service,
SSTable (log-structured storage, as in LevelDB) and a few other Google
technologies.
 It is a proprietary data storage system (that is, it is for Google’s
internal use only).
 Most importantly, it is a non-relational database.
 It uses a load-balancing structure that lets it run well on
commodity hardware.
 It uses the Snappy compression library to compress the data.
In other words:
 It is a database that uses compression to store and
retrieve data efficiently.
 It stores data in a special structure (the load-balancing structure), which gives it high
performance.
 It is proprietary: it is for Google’s internal use only and
is not open source.
 Google BigTable is built upon several other Google technologies.
Requirements?
 BigTable is designed to run on commodity hardware (low-cost
computers).
 Thus BigTable needs no special machines: its servers are comparable to ordinary PCs like ours.
 New services and extra computing power come at a very low
incremental cost.
Special Features
 It is a robust database, meaning it keeps working even when things go
wrong.
 BigTable gives the highest importance to read and query performance.
 Higher data availability:
A write is immediately replicated to multiple data centers.
 Automatic scaling:
BigTable uses a distributed architecture to automatically
manage scaling to very large data sets.
Structure of BigTable
 Each table is a sparse, multi-dimensional sorted map (memory efficient: empty cells
take up no space).
 A cell is addressed by (1) a row, (2) a column and (3) a time version
(timestamp).
 Time versions mean that several copies of a cell are kept, one per timestamp. This creates
a great deal of redundancy, but it is a requirement for Google services, so do not
think of it as a drawback of the system.
 Google does web indexing to collect data on all websites. It stores the
URLs, their titles, timestamps and many more required fields.
 Web indexing: indexing the contents of a website.
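The (row, column, timestamp) addressing can be sketched as a toy in-memory map. The class SparseTable and its methods are hypothetical names for this illustration, not Bigtable's API, and the sketch ignores distribution and persistence completely.

```python
class SparseTable:
    """Toy model of a sparse map keyed by (row, column, timestamp)."""

    def __init__(self):
        # Only cells that were actually written take up space.
        self._cells = {}

    def put(self, row, column, timestamp, value):
        """Store one version of one cell."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column and (timestamp is None or ts <= timestamp)
        ]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def rows(self):
        """Row keys in sorted order, which is what makes splitting at row boundaries easy."""
        return sorted({row for (row, _, _) in self._cells})
```

For example, after writing the "title" column of row "com.example/index" at timestamps 1 and 2, get() without a timestamp returns the version written at 2, and get(..., timestamp=1) still returns the older one.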
Load Balancing Structure
(Figure: dummy sitemap of my website Codeplaza, where 5 fields are shown)
 Think of this as one huge table with millions of entries.
 To manage such tables, they are split at row boundaries and saved
as tablets.
 Each tablet is 100-200 MB in size, and each machine stores about 100 of them.
100-200 MB of data can hold thousands of rows, or even more.
Example showing 4 rows = 1 tablet.
 This setup allows fine-grained load balancing. (If one tablet
receives lots of queries, it can be split into smaller tablets, or
the busy tablet can be moved to a less busy machine.)
 This setup also allows fast rebuilding. (When a machine goes down,
other machines each take one tablet from the downed machine, so 100 machines
each get one new tablet, and the load on each machine for picking up its new tablet is fairly
small.)
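A rough sketch of both ideas, splitting at row boundaries and spreading a failed machine's tablets over the survivors. The function names and the "least loaded machine first" rule are assumptions for illustration only; the slides do not say how Bigtable actually picks the target machine.

```python
def split_into_tablets(sorted_row_keys, rows_per_tablet):
    """Cut a sorted list of row keys into consecutive chunks (tablets)."""
    return [
        sorted_row_keys[i:i + rows_per_tablet]
        for i in range(0, len(sorted_row_keys), rows_per_tablet)
    ]


def reassign_after_failure(assignment, failed_machine):
    """Hand each tablet of the failed machine to the currently least loaded survivor.

    `assignment` maps a machine name to the list of tablet ids it serves.
    A new mapping is returned; the input is left untouched.
    """
    orphans = assignment[failed_machine]
    survivors = {
        machine: list(tablets)
        for machine, tablets in assignment.items()
        if machine != failed_machine
    }
    for tablet in orphans:
        # Ties are broken by machine name so the result is deterministic.
        target = min(survivors, key=lambda m: (len(survivors[m]), m))
        survivors[target].append(tablet)
    return survivors
```

With 8 rows and 4 rows per tablet you get 2 tablets, as in the example figure. With 100 orphaned tablets and 100 equally loaded survivors, each survivor picks up exactly one tablet.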
Log Files and Compaction
 Tablets are stored as immutable SSTables plus a tail of logs (one
log per machine).
 SSTable stands for ‘Sorted String Table’. Some also call it ‘Static and Sorted
Table’. The figure below shows a dummy structure of an SSTable.
 Recent writes are held in memory. When that memory fills up, the system compacts.
 There are two kinds of compaction: minor and major.
 Minor compactions involve only a small part of the data (the filled in-memory buffer is
written out as a new SSTable), while major compactions rewrite all of a tablet’s SSTables
into one, which reclaims hard disk space. The locations of
the tablets are themselves stored in special BigTable cells.
Immutable SSTable:
To mutate means to change or be updated over time. Remember the
mutants from X-Men and Krrish 3. (Mutants are a special kind of species whose
DNA has changed over time.)
SSTables are immutable: they are never changed or updated, that
is, they are static!
 Now the question is: how are entries stored, or
modifications made, if an SSTable is immutable?
 The answer: write a new SSTable and, in time, remove the old
one.
Sounds weird? It is a great idea, because it saves a lot of the time
spent searching and re-sorting to update data in place in a single (large) table.
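The "never update, just write a new SSTable" idea can be sketched in a few lines. This is a toy model under stated assumptions: the class Tablet, the tiny memtable_limit and the tuple-based SSTables are all invented for illustration, and the log, deletes and disk I/O are left out.

```python
class Tablet:
    """Toy tablet: a mutable in-memory buffer plus a list of immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}           # recent writes, still mutable
        self.sstables = []           # oldest first; each one is an immutable sorted tuple
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the in-memory buffer into a new SSTable; existing SSTables are untouched."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def major_compaction(self):
        """Rewrite every SSTable into a single one, dropping overwritten values."""
        self.minor_compaction()
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table)
        self.sstables = [tuple(sorted(merged.items()))] if merged else []

    def get(self, key):
        """Check the memory buffer first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for stored_key, value in table:
                if stored_key == key:
                    return value
        return None
```

Note that an "update" never edits an old SSTable: the new value simply lands in a newer table, and the stale copy disappears only when a major compaction rewrites everything.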
LookUp
 Lookup is a three-level system.
 Benefit: there is no big bottleneck in the system, and it also makes heavy use of
prefetching and caching.
Tablet Location Hierarchy
1) A Chubby file contains the location of the root tablet.
2) The root tablet contains the locations of all tablets of the Metadata table.
3) The Metadata table stores the locations of the actual tablets.
The client moves up the hierarchy (Metadata -> Root -> Chubby) if the
location of a tablet is unknown or incorrect.
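Here is a small sketch of the three-level lookup with a client-side cache. The dictionaries CHUBBY_FILE, ROOT_TABLET and METADATA, the server names and the Client class are all made up for this illustration; in the real system each level lives on a different server.

```python
# Hypothetical contents of the three levels.
CHUBBY_FILE = {"root_tablet": "server-root"}
ROOT_TABLET = {"meta-1": "server-m1"}
METADATA = {"meta-1": {"tablet-a": "server-7", "tablet-b": "server-9"}}


class Client:
    """Caches tablet locations and walks the hierarchy only on a cache miss."""

    def __init__(self):
        self.cache = {}
        self.walks = 0  # how many times the full hierarchy was consulted

    def locate(self, tablet):
        if tablet in self.cache:
            return self.cache[tablet]
        return self._walk_hierarchy(tablet)

    def _walk_hierarchy(self, tablet):
        self.walks += 1
        _root_server = CHUBBY_FILE["root_tablet"]   # level 1: where is the root tablet?
        for meta_id in ROOT_TABLET:                 # level 2: which metadata tablets exist?
            locations = METADATA[meta_id]           # level 3: where are the user tablets?
            if tablet in locations:
                self.cache[tablet] = locations[tablet]
                return locations[tablet]
        raise KeyError(tablet)

    def invalidate(self, tablet):
        """Call this when a cached location turns out to be wrong."""
        self.cache.pop(tablet, None)
```

Repeated lookups for the same tablet hit the cache, which is why the hierarchy does not become a bottleneck; only an unknown or stale location triggers a new walk.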
Compression: Snappy
 There is a lot of redundant data in the system (especially across time versions), so it makes
heavy use of compression.
 Compression looks for similar values along the rows, columns and times.
(This is where priority comes in: lower-priority data is fetched less often and can be
compressed more.)
 Google used variations of BMDiff and Zippy in its compression software.
BMDiff gives high write speeds (~100 MB/s) and even faster read
speeds (~1000 MB/s). Zippy compresses very fast. Building on this work, Google released a
library named “Snappy” (previously known as Zippy).
 Snappy is a compression/decompression library that does not aim for
maximum compression; instead, it aims for very high speeds and reasonable
compression. (On a single core of a Core i7 processor in 64-bit mode, Snappy
compresses at about 250 MB/sec or more and decompresses at about 500
MB/sec or more.)
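To see why redundant time versions compress so well, here is a small experiment. Snappy is not part of the Python standard library, so this sketch uses zlib as a stand-in, with level 1 playing the "fast" role and level 9 the "maximum compression" role. The sample pages and the function names are invented for illustration.

```python
import zlib


def compress_versions(versions, level):
    """Join all versions of a cell and compress them at the given zlib level."""
    raw = "\n".join(versions).encode("utf-8")
    return raw, zlib.compress(raw, level)


def demo():
    """Return (raw size, size at level 1, size at level 9) for 200 near-identical pages."""
    versions = [
        "<html><title>Codeplaza</title><body>visit %d</body></html>" % i
        for i in range(200)
    ]
    raw, fast = compress_versions(versions, level=1)
    _, small = compress_versions(versions, level=9)
    # Compression must be lossless at both levels.
    assert zlib.decompress(fast) == raw
    assert zlib.decompress(small) == raw
    return len(raw), len(fast), len(small)
```

Running demo() shows both compressed sizes far below the raw size, because the 200 versions differ only in a few characters. The trade-off Snappy makes is the level 1 end of this scale: give up some compression ratio to gain speed.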
Actual Hierarchical Load Balancing Structure
1) A request arrives at ROOT (the master computer).
2) ROOT checks its master record and sends the request to the right PC.
3) An SSTable contains the records of the tablets.
4) Via the meta tablets, the request is sent on to the tablet that holds the
original data, and the data is then fetched.
This is how it works.
Conclusion
 Bigtable has achieved its goals of high performance, data
availability and scalability.
 It has been successfully deployed in real applications (Personalized
Search, Orkut, Google Maps, …).
 Building its own storage system gave Google significant advantages, such as
flexibility in designing the data model and control over the implementation
and the other infrastructure on which Bigtable relies.
Thank You
