Handling Large Datasets at Google: Current Systems and Future Directions
Jeff Dean, Google Fellow
http://labs.google.com/people/jeff
Outline
• Hardware infrastructure
• Distributed systems infrastructure:
  – Scheduling system
  – GFS
  – BigTable
  – MapReduce
• Challenges and Future Directions
  – Infrastructure that spans all datacenters
  – More automation
Sample Problem Domains
• Offline batch jobs
  – Large datasets (PBs), bulk reads/writes (MB chunks)
  – Short outages acceptable
  – Web indexing, log processing, satellite imagery, etc.
• Online applications
  – Smaller datasets (TBs), small reads/writes (KBs)
  – Outages immediately visible to users; low latency vital
  – Web search, Orkut, GMail, Google Docs, etc.
• Many areas: IR, machine learning, image/video processing, NLP, machine translation, ...
Typical New Engineer
• Never seen a petabyte of data
• Never used a thousand machines
• Never really experienced machine failure
Our software has to make them successful.
Google’s Hardware Philosophy
Truckloads of low-cost machines
• Workloads are large and easily parallelized
• Care about perf/$, not absolute machine perf
• Even reliable hardware fails at our scale
• Many datacenters, all around the world
  – Intra-DC bandwidth >> Inter-DC bandwidth
  – Speed of light has remained fixed in last 10 yrs :)
Effects of Hardware Philosophy
• Software must tolerate failure
• Application’s particular machine should not matter
• No special machines – just 2 or 3 flavors
(Photo: Google, 1999)
Current Design
• In-house rack design
• PC-class motherboards
• Low-end storage and networking hardware
• Linux
• + in-house software
The Joys of Real Hardware
Typical first year for a new cluster:
• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
• ~1 network rewiring (rolling ~5% of machines down over 2-day span)
• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
• ~5 racks go wonky (40-80 machines see 50% packet loss)
• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
• ~12 router reloads (takes out DNS and external vips for a couple minutes)
• ~3 router failures (have to immediately pull traffic for an hour)
• ~dozens of minor 30-second blips for DNS
• ~1000 individual machine failures
• ~thousands of hard drive failures
• slow disks, bad memory, misconfigured machines, flaky machines, etc.
Typical Cluster
• Cluster-wide services: cluster scheduling master, Chubby lock service, GFS master
• Every machine (Machine 1 … Machine N) runs Linux, a scheduler slave, and a GFS chunkserver
• User jobs share the same machines: user apps (app1, app2, …), BigTable tablet servers, and a BigTable master
File Storage: GFS
• Master: manages file metadata
• Chunkserver: manages 64MB file chunks
• Clients talk to the master to open and find files
• Clients talk directly to chunkservers for data (read path sketched below)
(Diagram: clients, GFS master, and Chunkservers 1 … N, each holding chunk replicas C0, C1, C2, C3, C5, …)
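To make the read path concrete, here is a minimal Python sketch of how a GFS-style client might read a byte range: it asks the master only for chunk locations, then fetches data directly from a chunkserver. The RPC stubs master_rpc and chunkserver_rpc and their method names are hypothetical stand-ins, not the real GFS API.

    # Sketch of a GFS-style read path (hypothetical RPC stubs).
    CHUNK_SIZE = 64 * 1024 * 1024  # files are split into 64MB chunks

    def read(master_rpc, chunkserver_rpc, path, offset, length):
        """Read `length` bytes at `offset` from `path`, chunk by chunk."""
        data = []
        end = offset + length
        while offset < end:
            chunk_index = offset // CHUNK_SIZE
            # 1. Ask the master for metadata only: chunk handle + replica locations.
            handle, replicas = master_rpc.find_chunk(path, chunk_index)
            # 2. Fetch the bytes directly from one chunkserver replica;
            #    the master never sees file data.
            chunk_offset = offset % CHUNK_SIZE
            n = min(CHUNK_SIZE - chunk_offset, end - offset)
            data.append(chunkserver_rpc(replicas[0]).read(handle, chunk_offset, n))
            offset += n
        return b"".join(data)

Keeping the master on the metadata path only is what lets a single master serve thousands of clients.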
GFS Usage
• 200+ GFS clusters
• Managed by an internal service team
• Largest clusters:
  – 5000+ machines
  – 5+ PB of disk usage
  – 10000+ clients
Data Storage: BigTable
What is it, really?
• 10-ft view: row & column abstraction for storing data
• Reality: distributed, persistent, multi-level sorted map
BigTable Data Model
• Multi-dimensional sparse sorted map
    (row, column, timestamp) => value
• Example: row “www.cnn.com”, column “contents:”, timestamps t3, t11, t17, value “<html>…” (a toy sketch of this map follows)
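A toy in-memory version of this map, nothing like Bigtable's actual SSTable/log implementation, just the data model: keys are kept sorted, and each (row, column) can hold multiple timestamped versions with the newest returned by default.

    import bisect

    class TinyTable:
        """Toy (row, column, timestamp) -> value map; illustrates the data model only."""

        def __init__(self):
            self._keys = []    # sorted list of (row, column, -timestamp) for ordered scans
            self._cells = {}   # (row, column, -timestamp) -> value

        def put(self, row, column, timestamp, value):
            key = (row, column, -timestamp)      # newest version sorts first
            if key not in self._cells:
                bisect.insort(self._keys, key)
            self._cells[key] = value

        def get(self, row, column):
            """Return the most recent value for (row, column), or None."""
            i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
            if i < len(self._keys):
                r, c, neg_ts = self._keys[i]
                if (r, c) == (row, column):
                    return self._cells[(r, c, neg_ts)]
            return None

    t = TinyTable()
    t.put("www.cnn.com", "contents:", 11, "<older html>")
    t.put("www.cnn.com", "contents:", 17, "<html>...")
    assert t.get("www.cnn.com", "contents:") == "<html>..."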
Tablets
• The table’s sorted row space is broken into contiguous row ranges, called tablets, which are the unit of distribution and load balancing (sketched below)
• Example rows (with “language:” and “contents:” columns): “aaa.com”, “cnn.com” (EN, “<html>…”), “cnn.com/sports.html”, “website.com”, …, “yahoo.com/kids.html”, …, “zuppa.com/menu.html”
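A small helper illustrates the partitioning idea. This is a simplification: real tablet boundaries are chosen by tablet size and load, not by a fixed row count.

    def split_into_tablets(sorted_rows, rows_per_tablet):
        """Partition a sorted list of row keys into contiguous row ranges ("tablets")."""
        tablets = []
        for i in range(0, len(sorted_rows), rows_per_tablet):
            chunk = sorted_rows[i:i + rows_per_tablet]
            tablets.append((chunk[0], chunk[-1]))   # (start_row, end_row) of the tablet
        return tablets

    rows = ["aaa.com", "cnn.com", "cnn.com/sports.html",
            "website.com", "yahoo.com/kids.html", "zuppa.com/menu.html"]
    print(split_into_tablets(rows, 3))
    # [('aaa.com', 'cnn.com/sports.html'), ('website.com', 'zuppa.com/menu.html')]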
Bigtable System Structure
• A Bigtable cell consists of:
  – Bigtable master: performs metadata ops + load balancing
  – Bigtable tablet servers (many): each serves data for a set of tablets
• Bigtable clients use the Bigtable client library:
  – Open()
  – metadata ops go to the master
  – read/write requests go directly to tablet servers (request routing sketched below)
• Built on:
  – Cluster scheduling system: handles failover, monitoring
  – GFS: holds tablet data, logs
  – Lock service: holds metadata, handles master election
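A hedged sketch of how a request might be routed in this structure; the names and calls (locate_tablet, write, read) are invented for illustration, and the real client library caches tablet locations so it rarely needs a metadata op per request.

    # Illustration only: request routing in a Bigtable-like cell.

    def write_cell(client, row, column, value):
        # Metadata op: find which tablet server currently serves the tablet
        # containing `row` (cached by the real client library).
        server = client.locate_tablet(row)
        # Data op: talk to the tablet server directly; the master is not involved.
        server.write(row, column, value)

    def read_cell(client, row, column):
        server = client.locate_tablet(row)
        return server.read(row, column)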
Some BigTable Features
• Single-row transactions: easy to do read/modify/write operations
• Locality groups: segregate columns into different files
• In-memory columns: random access to small items
• Suite of compression techniques: per locality group
• Bloom filters: avoid seeks for non-existent data (generic sketch below)
• Replication: eventual-consistency replication across datacenters, between multiple BigTable serving setups (master/slave & multi-master)
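The Bloom filter bullet refers to keeping a small in-memory filter per on-disk file so that most lookups for absent rows never touch disk. A generic Bloom filter sketch, not Bigtable's implementation:

    import hashlib

    class BloomFilter:
        """Generic Bloom filter: may report false positives, never false negatives."""

        def __init__(self, num_bits=8192, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, key):
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.num_bits

        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, key):
            # False here means the key is definitely absent: skip the disk seek.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    f = BloomFilter()
    f.add("cnn.com")
    assert f.might_contain("cnn.com")
    # Lookups for most absent keys return False without reading the file.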
BigTable Usage
• 500+ BigTable cells
• Largest cells manage 6000+ TB of data on 3000+ machines
• Busiest cells sustain >500,000 ops/second, 24 hours/day, and peak much higher
Data Processing: MapReduce
• Google’s batch processing tool of choice
• Users write two functions:
  – Map: produces (key, value) pairs from input
  – Reduce: merges (key, value) pairs from Map
• Library handles data transfer and failures
• Used everywhere: Earth, News, Analytics, Search Quality, Indexing, …
Example: Document Indexing
• Input: set of documents D1, …, DN
• Map
  – Parse document D into terms T1, …, TN
  – Produces (key, value) pairs
     • (T1, D), …, (TN, D)
• Reduce
  – Receives list of (key, value) pairs for term T
     • (T, D1), …, (T, DN)
  – Emits single (key, value) pair
     • (T, (D1, …, DN)) — see the code sketch below
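In code, the two user-written functions for this indexing job could look like the following Python sketch; MapReduce at Google is a C++ library, so the yield-based interface here is just illustrative.

    import re

    def map_fn(doc_id, doc_text):
        """Map: parse a document into terms and emit (term, doc_id) pairs."""
        for term in set(re.findall(r"\w+", doc_text.lower())):
            yield (term, doc_id)

    def reduce_fn(term, doc_ids):
        """Reduce: merge all (term, doc_id) pairs for one term into a posting list."""
        yield (term, sorted(set(doc_ids)))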
MapReduce Execution
(Diagram: a MapReduce master coordinates map tasks reading input from GFS, each producing intermediate key:value pairs; a shuffle-and-sort phase groups pairs by key; reduce tasks consume the grouped pairs and write output back to GFS. The toy driver below mimics this flow.)
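A toy single-process driver, using map_fn and reduce_fn from the indexing sketch above, shows the same data flow: run the map phase, shuffle and sort intermediate pairs by key, then run the reduce phase. There is no GFS, parallelism, or fault tolerance here, only the data flow.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: each input record produces intermediate (key, value) pairs.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):
                intermediate[k].append(v)          # "shuffle": group values by key
        # Reduce phase: keys are processed in sorted order, as in the real library.
        output = []
        for k in sorted(intermediate):
            output.extend(reduce_fn(k, intermediate[k]))
        return output

    docs = [("D1", "the quick brown fox"), ("D2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', ['D1']), ('dog', ['D2']), ('fox', ['D1']), ('lazy', ['D2']),
    #  ('quick', ['D1']), ('the', ['D1', 'D2'])]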
MapReduce Tricks / Features
• Data locality
• # tasks >> # machines
• Task re-execution on failure
• Local or cluster execution
• Distributed counters
• Backup copies of tasks
• Multiple I/O data types
• Data compression
• Pipelined shuffle stage
• Fast sorter
MapReduce Programs in Google’s Source Tree
(Chart: number of MapReduce programs in the source tree, Jan-03 through Sep-07; y-axis 0 to 12,500.)
New MapReduce Programs Per Month
(Chart: new MapReduce programs per month, Jan-03 through Sep-07; y-axis 0 to 700, with a recurring spike annotated “Summer intern effect”.)
MapReduce in Google
Easy to use. Library hides complexity.

                               Mar ’05    Mar ’06    Sep ’07
  Number of jobs                   72K       171K     2,217K
  Average time (seconds)           934        874        395
  Machine years used               981      2,002     11,081
  Input data read (TB)          12,571     52,254    403,152
  Intermediate data (TB)         2,756      6,743     34,774
  Output data written (TB)         941      2,970     14,018
  Average worker machines          232        268        394
Current Work
Scheduling system + GFS + BigTable + MapReduce work well within single clusters

Many separate instances in different data centers
  – Tools on top deal with cross-cluster issues
  – Each tool solves a relatively narrow problem
  – Many tools => lots of complexity

Can next generation infrastructure do more?
Next Generation Infrastructure
Truly global systems to span all our datacenters
• Global namespace with many replicas of data worldwide
• Support both consistent and inconsistent operations
• Continued operation even with datacenter partitions
• Users specify high-level desires:
   “99%ile latency for accessing this data should be <50ms”
   “Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
• Goals:
  – Increased utilization through automation
  – Automatic migration, growing and shrinking of services
  – Lower end-user latency
  – Provide high-level programming model for data-intensive interactive services
Questions?
Further info:
• The Google File System, Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung, SOSP ’03.
• Web Search for a Planet: The Google Cluster Architecture, Luiz André Barroso, Jeffrey Dean, Urs Hölzle, IEEE Micro, 2003.
• Bigtable: A Distributed Storage System for Structured Data, Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, OSDI ’06.
• MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI ’04.
• Failure Trends in a Large Disk Drive Population, Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso, FAST ’07.

http://labs.google.com/papers
http://labs.google.com/people/jeff or jeff@google.com