32. The Google File System
GFS is a distributed file system built on clusters of inexpensive commodity PC servers. A cluster consists of
more than 2,000 chunkservers, and Google runs
more than 30 such clusters.
GFS is a petabyte-scale file system with
read/write throughput above 2,000 MB/s.
http://209.85.163.132/papers/gfs-sosp2003.pdf
33. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
The Google File System
Gobioff died of malignant lymphoma on March 11, 2008.
34. GFS Architecture
[Architecture diagram: an application uses the GFS client library; the GFS client sends (file name, chunk index) to the GFS Master, which holds the file namespace (e.g. /foo/bar -> chunk 2ef0) and replies with (chunk handle, chunk locations). The master also sends instructions to chunkservers and receives chunkserver state. The client then sends (chunk handle, byte range) directly to a GFS chunkserver, which stores chunks on its local Linux file system and returns the chunk data.]
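The read path in the diagram can be written out as a small protocol sketch. The classes and method names below (Master.lookup, Chunkserver.read, read_file) are hypothetical stand-ins for illustration only; the real GFS client library is not public.

CHUNK_SIZE = 64 * 1024 * 1024   # GFS uses fixed-size 64 MB chunks

class Master:
    """Stand-in for the GFS master: file namespace -> chunk handles and locations."""
    def __init__(self):
        self.namespace = {("/foo/bar", 0): ("2ef0", ["chunkserver-1", "chunkserver-2"])}
    def lookup(self, path, chunk_index):
        return self.namespace[(path, chunk_index)]        # (chunk handle, locations)

class Chunkserver:
    """Stand-in for a chunkserver storing chunks on its local file system."""
    def __init__(self, chunks):
        self.chunks = chunks                               # chunk handle -> bytes
    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]

def read_file(master, chunkservers, path, offset, length):
    chunk_index = offset // CHUNK_SIZE                     # (file name, chunk index)
    handle, locations = master.lookup(path, chunk_index)   # (chunk handle, locations)
    replica = chunkservers[locations[0]]                   # contact one replica directly
    return replica.read(handle, offset % CHUNK_SIZE, length)   # (chunk handle, byte range)

servers = {"chunkserver-1": Chunkserver({"2ef0": b"hello gfs"}),
           "chunkserver-2": Chunkserver({"2ef0": b"hello gfs"})}
print(read_file(Master(), servers, "/foo/bar", 0, 5))      # b'hello'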
35. MapReduce: Simplified Data
Processing on Large Clusters
MapReduce is a programming model, based on
functional programming, for processing and
generating large data sets.
http://209.85.163.132/papers/mapreduce-osdi04.pdf
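To make the model concrete, the canonical word-count job needs only a user-supplied map function and reduce function; the framework does the grouping. This is a minimal single-process sketch of the idea, not Google's MapReduce library.

from collections import defaultdict

# Word count in the MapReduce model: the user supplies only map() and
# reduce(); the framework handles grouping intermediate values by key.
def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # "shuffle": group values by key
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    output = {}                           # reduce phase
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            output[rk] = rv
    return output

print(run_mapreduce([("doc1", "the cat sat on the mat")], map_fn, reduce_fn))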
36. Jeffrey Dean and Sanjay Ghemawat
MapReduce: Simplified Data Processing on Large Clusters
38. Bigtable: A Distributed Storage
System for Structured Data
Bigtable is a distributed storage system for
managing structured data that is designed to
scale to a very large size: petabytes of data
across thousands of servers.
http://209.85.163.132/papers/bigtable-osdi06.pdf
http://video.google.com/videoplay?docid=7278544055668715642&q=bigtable
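The Bigtable data model is a sparse, sorted map from (row key, column, timestamp) to an uninterpreted value. The toy class below illustrates that map in memory; the names (ToyBigtable, put, get) are hypothetical and this is not the Bigtable client API.

from collections import defaultdict

# Toy model of Bigtable's data model:
# (row key, column family:qualifier, timestamp) -> value.
class ToyBigtable:
    def __init__(self):
        self.cells = defaultdict(dict)    # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells[(row, column)][timestamp] = value

    def get(self, row, column):
        """Return the most recent version of a cell (the default read)."""
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None

t = ToyBigtable()
t.put("com.cnn.www", "contents:", 3, "<html>...v3...</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
print(t.get("com.cnn.www", "contents:"))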
39. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes,
Robert E. Gruber
Bigtable: A Distributed Storage System for Structured Data
40. BigTable system architecture
[Architecture diagram: a Bigtable client uses the client library to open a Bigtable cell. Inside the cell, the Bigtable Master performs metadata operations and load balancing, while Bigtable Tablet Servers serve the data. Supporting services: a Cluster Scheduling System handles failover and monitoring, the Google File System holds tablet data and logs, and the Chubby Lock Service holds metadata and handles master election.]
44. Transactions Across Datacenters
(Weekend Project)
A little before this outage, at Google I/O on May 27, 2009,
Ryan Barrett had proposed a system spanning
datacenters as his own Weekend Project.
http://www.google.com/intl/ja/events/io/2009/sessions/TransactionsAcrossDatacenters.html
what if your entire datacenter falls off the face of the
earth? This talk will examine how current large scale
storage systems handle fault tolerance and
consistency, with a particular focus on the App
Engine datastore. We'll cover techniques such as
replication, sharding, two phase commit, and
consensus protocols (e.g. Paxos), then explore how
they can be applied across datacenters.
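Of the techniques listed in the abstract, two-phase commit is the easiest to show in miniature. The sketch below is a bare-bones coordinator under the usual simplifying assumptions (no failures, no timeouts, in-memory logging) and is not tied to any particular Google system.

# Bare-bones two-phase commit: the coordinator asks every participant to
# prepare; only if all vote yes does it tell them to commit.
class Participant:
    """Stand-in for one datastore replica/shard taking part in the commit."""
    def __init__(self, name):
        self.name = name
        self.state = "idle"
    def prepare(self, txn):
        self.state = "prepared"        # durably log the intent, then vote
        return True
    def commit(self, txn):
        self.state = "committed"
    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1: ask everyone to prepare; any "no" vote aborts the transaction.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    # (A real coordinator durably logs `decision` here before phase 2.)
    # Phase 2: deliver the decision to every participant.
    for p in participants:
        p.commit(txn) if decision == "commit" else p.abort(txn)
    return decision

print(two_phase_commit([Participant("us"), Participant("eu")], txn="txn-42"))   # commit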
45. Post-mortem for
February 24th, 2010 outage
After further analysis, we determine that
although power has returned to the datacenter,
many machines in the datacenter are missing
due to the power outage, and are not able to
serve traffic.
Particularly, it is determined that the GFS and
Bigtable clusters are not in a functioning state
due to having lost too many machines, and
that thus the Datastore is not usable in the
primary datacenter at that time.
46. September 2009: Rethinking Bigtable
On September 14, 2009, Ryan Barrett published
"Migration to a Better Datastore".
http://googleappengine.blogspot.jp/2009/09/migration-to-better-datastore.html
Megastore replication saves the day!
Megastore is an internal library on top of
Bigtable that supports declarative
schemas, multi-row transactions,
secondary indices, and recently,
consistent replication across datacenters.
47. January 2011
Megastore: Providing Scalable, Highly
Available Storage for Interactive Services
http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf
cf. Japan: the Firstserver, Inc. incident of June 20, 2012
84. Our new search index:
Caffeine
June 8, 2010
Google Webmaster Central Blog
http://googlewebmastercentral.blogspot.jp/2010/06/our-new-searchindex-caffeine.html
94. Google Caffeine jolts
worldwide search machine
June 9, 2010
Interview with Matt
http://www.theregister.co.uk/2010/06/09/google_completes_caffeine_search_index_overhaul/
97. Google search index splits with
MapReduce
Welds BigTable to file system 'Colossus'
September 9, 2010
Interview with Lipkovitz
http://www.theregister.co.uk/2010/09/09/google_caffeine_explained/
98. The new search infrastructure uses a revamped version
of the GFS distributed file system. This had been called
Google File System 2, or GFS2, but inside Google,
as Lipkovitz says, it is known as Colossus.
"Caffeine is a database-driven indexing system built
on a modified BigTable. Google will present a paper
discussing this system next month at the
USENIX Symposium on Operating Systems Design and
Implementation (OSDI)."
107. Incrementally Indexing the
Web with Percolator
October 4, 2010
Frank Dabek and Daniel Peng
at OSDI, 2010
https://docs.google.com/presentation/d/1gKD4FbaUIGtoimP6-ZB0iiW8V0KZt8fVET-Cfu5KiG8/present#slide=id.i0
124. System Building by D. Rumsfeld
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
— Sec. Donald Rumsfeld
125. Unknown unknowns
CPUs that get XOR wrong periodically: checksum "failures"
Bigtable scaling: can't delete files fast enough
>50% of seeks going to useless readahead and metadata.
Incorrect resistor value: our workload powers off machine
Advice:
Push performance debugging through all layers of system
Expected weirdness proportional to machine count
127. Large-scale Incremental
Processing Using Distributed
Transactions and Notifications
Daniel Peng and Frank Dabek
Presented by Nick Radcliffe
http://courses.cs.vt.edu/cs5204/fall11butt/lectures/perculator.pdf
129. Tradeoffs
Percolator trades efficient use of
resources for scalability.
Caffeine (the Percolator-based indexing
system) uses twice as many resources
as the previous system to process the
same crawl rate.
Roughly 30 times more CPU per
transaction than a standard DBMS.
130. Overview of Percolator design
Percolator is built on top of Bigtable.
A Percolator system consists of three binaries
that run on every machine in the cluster:
Percolator worker
Bigtable tablet server
GFS chunkserver.
131. Overview of Percolator design
Data is organized into Bigtable rows and
columns, with Percolator metadata
stored alongside in special columns.
The Percolator library largely consists of
Bigtable operations wrapped in
Percolator-specific computation.
Percolator adds multi-row transactions
and observers to Bigtable.
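The flavor of those multi-row transactions can be conveyed with a small snapshot-isolation sketch. Everything here (the in-memory store, the counter standing in for the timestamp oracle, the conflict check) is a deliberately collapsed stand-in for Percolator's lock/write/data columns in Bigtable, not the real implementation.

import itertools

timestamps = itertools.count(1)          # stand-in for the timestamp oracle
store = {}                               # (row, column) -> {commit_ts: value}

class Transaction:
    """Illustrative stub of a Percolator-style multi-row transaction."""
    def __init__(self):
        self.start_ts = next(timestamps) # snapshot timestamp
        self.writes = {}                 # writes buffered until commit

    def get(self, row, column):
        # Read the latest version committed at or before start_ts.
        versions = store.get((row, column), {})
        visible = [ts for ts in versions if ts <= self.start_ts]
        return versions[max(visible)] if visible else None

    def set(self, row, column, value):
        self.writes[(row, column)] = value

    def commit(self):
        # Abort on write-write conflict: someone committed after start_ts.
        # (Real Percolator locks each written cell, with one primary lock.)
        for key in self.writes:
            if any(ts > self.start_ts for ts in store.get(key, {})):
                return False
        commit_ts = next(timestamps)
        for key, value in self.writes.items():
            store.setdefault(key, {})[commit_ts] = value
        return True

# Atomically move 10 units between two rows:
t0 = Transaction(); t0.set("Bob", "balance", 100); t0.set("Joe", "balance", 5); t0.commit()
t = Transaction()
t.set("Bob", "balance", t.get("Bob", "balance") - 10)
t.set("Joe", "balance", t.get("Joe", "balance") + 10)
print(t.commit(), Transaction().get("Bob", "balance"))   # True 90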
132. Overview of Percolator design
An observer is like an event handler
that is invoked whenever a user-specified column changes.
Percolator applications are structured as
a series of observers.
Each observer completes a task and
creates more work for “downstream”
observers by writing to the table.
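A minimal sketch of that observer pattern: the registry and dispatch below are an in-memory stand-in (the real runtime scans special notify columns in Bigtable), and the column names are made up for illustration.

# Each observer reacts to a change in one column and creates downstream
# work by writing other columns, which triggers further observers.
observers = {}   # observed column -> handler

def register_observer(column, handler):
    observers[column] = handler

def write(table, row, column, value):
    table[(row, column)] = value
    handler = observers.get(column)
    if handler:
        handler(table, row)                    # trigger downstream work

def document_processor(table, row):
    raw = table[(row, "raw_document")]
    write(table, row, "parsed_document", raw.lower())     # feeds the next observer

def link_extractor(table, row):
    parsed = table[(row, "parsed_document")]
    write(table, row, "outlinks", [w for w in parsed.split() if w.startswith("http")])

register_observer("raw_document", document_processor)
register_observer("parsed_document", link_extractor)

table = {}
write(table, "http://example.com", "raw_document", "See http://a.example and http://b.example")
print(table[("http://example.com", "outlinks")])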
133. Large-scale Incremental
Processing Using Distributed
Transactions and Notifications
October 4, 2010
D Peng, F Dabek - OSDI, 2010
https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Peng.pdf
153. Dremel: Interactive Analysis
of Web-Scale Datasets
Proc. of the 36th Int'l Conf on Very
Large Data Bases (2010), pp. 330-339
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/ja//pubs/archive/36632.pdf
164. The Dremel query language
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Output record:
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0
Output schema:
message QueryResult {
  required int64 Id;
  repeated group Name {
    optional uint64 Cnt;
    repeated group Language {
      optional string Str; }}}
165. The query serving tree
SELECT A, COUNT(B) FROM T GROUP BY A
is rewritten as
SELECT A, SUM(c) FROM (R11 UNION ALL ... R1n) GROUP BY A
where R1i = SELECT A, COUNT(B) AS c FROM T1i GROUP BY A,
and T1i is the partition of table T's tablets processed by server i at level 1 of the tree.
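In code, the rewrite corresponds to each leaf server computing a partial GROUP BY/COUNT over its own tablets and the root summing the partial counts. A minimal sketch with made-up tablet data:

from collections import Counter

def leaf_query(tablet):
    # R1i = SELECT A, COUNT(B) AS c FROM T1i GROUP BY A
    partial = Counter()
    for record in tablet:
        if record.get("B") is not None:     # COUNT(B) counts non-null B
            partial[record["A"]] += 1
    return partial

def root_query(tablets_per_leaf):
    # SELECT A, SUM(c) FROM (R11 UNION ALL ... R1n) GROUP BY A
    total = Counter()
    for tablet in tablets_per_leaf:
        total.update(leaf_query(tablet))     # merge partial counts
    return dict(total)

tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": 2}],   # tablet handled by leaf server 1
    [{"A": "x", "B": 3}],                       # tablet handled by leaf server 2
]
print(root_query(tablets))   # {'x': 2, 'y': 1}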
167. MapReduce and Dremel
numRecs: table sum of int;
numWords: table sum of int;
emit numRecs <- 1;
emit numWords <- CountWords(input.txtField);
Q1: SELECT SUM(CountWords(txtField))
/ COUNT(*) FROM T1
3000 nodes, 85 billion records
172. Example: Social Network
[Diagram: user posts and friend lists, each sharded x1000, spread across datacenters in the US (San Francisco, Seattle, Arizona), Brazil (Sao Paulo, Santiago, Buenos Aires), Spain (London, Paris, Berlin, Madrid, Lisbon), and Russia (Moscow, Berlin, Krakow).] (OSDI 2012)
179. Multiple datacenters
[Diagram: generating my page requires reading Friend1's post from the US datacenter, Friend2's post from Spain, ..., Friend999's post from Brazil, and Friend1000's post from Russia; each datacenter holds user posts and friend lists sharded x1000.] (OSDI 2012)
188. Commit Wait and 2-Phase Commit
[Timeline diagram: the coordinator TC and participants TP1, TP2 each acquire locks; each participant computes a prepare timestamp s, logs its prepared state, and sends s to the coordinator; the coordinator computes the overall commit timestamp s, starts and finishes logging the commit, waits out commit wait, notifies the participants of s, and then all release locks.] (OSDI 2012)
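The timing discipline in the diagram can be sketched as follows: the coordinator picks a commit timestamp s no smaller than any participant's prepare timestamp, then performs a commit wait until TrueTime guarantees s is in the past before notifying participants and releasing locks. tt_now() below mimics the TrueTime interval [earliest, latest]; the fixed epsilon and the rest of the code are assumptions for illustration.

import time

EPSILON = 0.007   # assumed clock uncertainty in seconds, for illustration only

def tt_now():
    # Stand-in for TrueTime's TT.now(): an interval bounding the true time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit_wait(s):
    # Wait until TT.after(s): the earliest possible true time exceeds s,
    # so the commit timestamp is guaranteed to be in the past everywhere.
    while tt_now()[0] <= s:
        time.sleep(0.001)

def coordinator_commit(participant_prepare_timestamps):
    # Overall commit timestamp s must exceed every participant's prepare
    # timestamp and the coordinator's own TT.now().latest.
    s = max(participant_prepare_timestamps + [tt_now()[1]])
    commit_wait(s)     # commit wait before making the commit visible
    return s           # then log commit, notify participants, release locks

print(coordinator_commit([tt_now()[1], tt_now()[1]]))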
189. Example
[Timeline diagram: a transaction with coordinator TC ("remove X from my friend list", sC=6) and participant TP ("remove myself from X's friend list", sP=8) commits with overall timestamp s=8; the risky post P (T2) commits at s=15. Reads before timestamp 8 see My friends = [X] and X's friends = [me]; at 8 both become []; at 15 My posts contains [P].] (OSDI 2012)
223. Spanner: Google's Globally-Distributed Database
James C. Corbett, Jeffrey Dean, Michael Epstein,
Andrew Fikes, Christopher Frost, JJ Furman, Sanjay
Ghemawat et al.
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/ja//archive/spanner-osdi2012.pdf