SlideShare a Scribd company logo
1 of 19
Download to read offline
F1 - The Fault-Tolerant
Distributed RDBMS
Supporting Google's Ad
Business
Jeff Shute, Mircea Oancea, Stephan Ellner,
Ben Handy, Eric Rollins, Bart Samwel,
Radek Vingralek, Chad Whipkey, Xin Chen,
Beat Jegerlehner, Kyle Littlefield, Phoenix Tong

SIGMOD
May 22, 2012
Today's Talk
F1 - A Hybrid Database combining the
● Scalability of Bigtable
● Usability and functionality of SQL databases

Key Ideas
● Scalability: Auto-sharded storage
● Availability & Consistency: Synchronous replication
● High commit latency: Can be hidden
   ○ Hierarchical schema
   ○ Protocol buffer column types
   ○ Efficient client code

Can you have a scalable database without going NoSQL? Yes.
The AdWords Ecosystem
One shared database backing Google's core AdWords business

               advertiser


SOAP API         web UI             reports

                                                   Java / "frontend"
 ad-hoc
SQL users                                          C++ / "backend"
                  DB
                                log aggregation

ad servers    ad approvals
                                 spam analysis

                ad logs
Our Legacy DB: Sharded MySQL
Sharding Strategy
● Sharded by customer
● Apps optimized using shard awareness

Limitations
 ● Availability
     ○ Master / slave replication -> downtime during failover
     ○ Schema changes -> downtime for table locking
 ● Scaling
     ○ Grow by adding shards
     ○ Rebalancing shards is extremely difficult and risky
     ○ Therefore, limit size and growth of data stored in database
 ● Functionality
     ○ Can't do cross-shard transactions or joins
Demanding Users
Critical applications driving Google's core ad business
● 24/7 availability, even with datacenter outages
● Consistency required
      ○ Can't afford to process inconsistent data
      ○ Eventual consistency too complex and painful
● Scale: 10s of TB, replicated to 1000s of machines

Shared schema
● Dozens of systems sharing one database
● Constantly evolving - multiple schema changes per week

SQL Query
● Query without code
Our Solution: F1
A new database,
● built from scratch,
● designed to operate at Google scale,
● without compromising on RDBMS features.

Co-developed with new lower-level storage system, Spanner
Underlying Storage - Spanner
Descendant of Bigtable, Successor to Megastore

Properties
● Globally distributed
● Synchronous cross-datacenter replication (with Paxos)
● Transparent sharding, data movement
● General transactions
    ○ Multiple reads followed by a single atomic write
    ○ Local or cross-machine (using 2PC)
● Snapshot reads
F1
Architecture                                           F1
                                                      client
● Sharded Spanner servers
     ○ data on GFS and in memory
● Stateless F1 server
● Pool of workers for query execution
                                                     F1 server

                                          F1 query
                                          workers
Features
● Relational schema                                  Spanner
    ○ Extensions for hierarchy and rich data types    server
    ○ Non-blocking schema changes
● Consistent indexes
● Parallel reads with SQL or Map-Reduce                GFS
How We Deploy
● Five replicas needed for high availability
● Why not three?
   ○ Assume one datacenter down
   ○ Then one more machine crash => partial outage

Geography
● Replicas spread across the country to survive regional disasters
   ○ Up to 100ms apart

Performance
● Very high commit latency - 50-100ms
● Reads take 5-10ms - much slower than MySQL
● High throughput
Hierarchical Schema
Explicit table hierarchies. Example:
 ● Customer (root table): PK (CustomerId)
 ● Campaign (child): PK (CustomerId, CampaignId)
 ● AdGroup (child): PK (CustomerId, CampaignId, AdGroupId)



              Rows and PKs              Storage Layout
                                        Customer   (1)
                1              2        Campaign   (1,3)
                                        AdGroup    (1,3,5)
                                        AdGroup    (1,3,6)
        1,3           1,4     2,5
                                        Campaign   (1,4)
                                        AdGroup    (1,4,7)
                                        Customer   (2)
1,3,5         1,3,6   1,4,7   2,5,8
                                        Campaign   (2,5)
                                        AdGroup    (2,5,8)
Clustered Storage
●   Child rows under one root row form a cluster
●   Cluster stored on one machine (unless huge)
●   Transactions within one cluster are most efficient
●   Very efficient joins inside clusters (can merge with no sorting)

              Rows and PKs                    Storage Layout
                                              Customer    (1)
                1              2              Campaign    (1,3)
                                              AdGroup     (1,3,5)
                                              AdGroup     (1,3,6)
        1,3           1,4     2,5
                                              Campaign    (1,4)
                                              AdGroup     (1,4,7)
                                              Customer    (2)
1,3,5         1,3,6   1,4,7   2,5,8
                                              Campaign    (2,5)
                                              AdGroup     (2,5,8)
Protocol Buffer Column Types
Protocol Buffers
● Structured data types with optional and repeated fields
● Open-sourced by Google, APIs in several languages

Column data types are mostly Protocol Buffers
● Treated as blobs by underlying storage
● SQL syntax extensions for reading nested fields
● Coarser schema with fewer tables - inlined objects instead

Why useful?
● Protocol Buffers pervasive at Google -> no impedance mismatch
● Simplified schema and code - apps use the same objects
   ○ Don't need foreign keys or joins if data is inlined
SQL Query
● Parallel query engine implemented from scratch
● Fully functional SQL, joins to external sources
● Language extensions for protocol buffers


SELECT CustomerId
FROM Customer c PROTO JOIN c.Whitelist.feature f
WHERE f.feature_id = 302
  AND f.status = 'STATUS_ENABLED'


Making queries fast
● Hide RPC latency
● Parallel and batch execution
● Hierarchical joins
Coping with High Latency
Preferred transaction structure
● One read phase: No serial reads
    ○ Read in batches
    ○ Read asynchronously in parallel
● Buffer writes in client, send as one RPC

Use coarse schema and hierarchy
● Fewer tables and columns
● Fewer joins

For bulk operations
● Use small transactions in parallel - high throughput

Avoid ORMs that add hidden costs
ORM Anti-Patterns
● Obscuring database operations from app developers
● Serial reads
   ○ for loops doing one query per iteration
● Implicit traversal
   ○ Adding unwanted joins and loading unnecessary data

These hurt performance in all databases.
They are disastrous on F1.
Our Client Library
● Very lightweight ORM - doesn't really have the "R"
   ○ Never uses Relational joins or traversal
● All objects are loaded explicitly
   ○ Hierarchical schema and protocol buffers make this easy
   ○ Don't join - just load child objects with a range read
● Ask explicitly for parallel and async reads
Results
Development
● Code is slightly more complex
    ○ But predictable performance, scales well by default
● Developers happy
    ○ Simpler schema
    ○ Rich data types -> lower impedance mismatch

User-Facing Latency
● Avg user action: ~200ms - on par with legacy system
● Flatter distribution of latencies
    ○ Mostly from better client code
    ○ Few user actions take much longer than average
    ○ Old system had severe latency tail of multi-second transactions
Current Challenges
● Parallel query execution
   ○ Failure recovery
   ○ Isolation
   ○ Skew and stragglers
   ○ Optimization

● Migrating applications, without downtime
  ○ Core systems already on F1, many more moving
  ○ Millions of LOC
Summary
We've moved a large and critical application suite from MySQL to F1.

This gave us
● Better scalability
● Better availability
● Equivalent consistency guarantees
● Equally powerful SQL query

And also similar application latency, using
● Coarser schema with rich column types
● Smarter client coding patterns

In short, we made our database scale, and didn't lose any key
database features along the way.

More Related Content

What's hot

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...InfluxData
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Seq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) modelSeq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) model佳蓉 倪
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Julien Le Dem
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringTaro L. Saito
 
Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Neo4j
 
Adbms 36 transaction support in sql 2
Adbms 36 transaction support in sql 2Adbms 36 transaction support in sql 2
Adbms 36 transaction support in sql 2Vaibhav Khanna
 
Scaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphsScaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphsAlan Morrison
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala InternalsDavid Groozman
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageJulien Le Dem
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrievalBasma Gamal
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDatabricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Glove global vectors for word representation
Glove global vectors for word representationGlove global vectors for word representation
Glove global vectors for word representationhyunyoung Lee
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 

What's hot (20)

Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
Impacts of Sharding, Partitioning, Encoding, and Sorting on Distributed Query...
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Seq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) modelSeq2Seq (encoder decoder) model
Seq2Seq (encoder decoder) model
 
Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020Data lineage and observability with Marquez - subsurface 2020
Data lineage and observability with Marquez - subsurface 2020
 
Presto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoringPresto as a Service - Tips for operation and monitoring
Presto as a Service - Tips for operation and monitoring
 
Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach Training Week: Create a Knowledge Graph: A Simple ML Approach
Training Week: Create a Knowledge Graph: A Simple ML Approach
 
Graph Databases at Netflix
Graph Databases at NetflixGraph Databases at Netflix
Graph Databases at Netflix
 
Adbms 36 transaction support in sql 2
Adbms 36 transaction support in sql 2Adbms 36 transaction support in sql 2
Adbms 36 transaction support in sql 2
 
Scaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphsScaling the mirrorworld with knowledge graphs
Scaling the mirrorworld with knowledge graphs
 
Cloudera Impala Internals
Cloudera Impala InternalsCloudera Impala Internals
Cloudera Impala Internals
 
RDF and OWL
RDF and OWLRDF and OWL
RDF and OWL
 
TCS MODULE 6.pdf
TCS MODULE 6.pdfTCS MODULE 6.pdf
TCS MODULE 6.pdf
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
automatic classification in information retrieval
automatic classification in information retrievalautomatic classification in information retrieval
automatic classification in information retrieval
 
Introduction to Survey Paper
Introduction to Survey PaperIntroduction to Survey Paper
Introduction to Survey Paper
 
Designing Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Glove global vectors for word representation
Glove global vectors for word representationGlove global vectors for word representation
Glove global vectors for word representation
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 

Similar to Google F1

Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...ScyllaDB
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud EraMydbops
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data gridBogdan Dina
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid Matt Sarrel
 
Benchmarking Apache Druid
Benchmarking Apache DruidBenchmarking Apache Druid
Benchmarking Apache DruidImply
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et Rpkernevez
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...DataStax
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Omid Vahdaty
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...NETWAYS
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Omid Vahdaty
 
Cpp In Soa
Cpp In SoaCpp In Soa
Cpp In SoaWSO2
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed_Hat_Storage
 
Olap scalability
Olap scalabilityOlap scalability
Olap scalabilitylucboudreau
 
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...CodeScience
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About ShardingMongoDB
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentJean-François Gagné
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014Dylan Tong
 

Similar to Google F1 (20)

Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
 
Evolution of DBA in the Cloud Era
 Evolution of DBA in the Cloud Era Evolution of DBA in the Cloud Era
Evolution of DBA in the Cloud Era
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
 
Benchmarking Apache Druid
Benchmarking Apache Druid Benchmarking Apache Druid
Benchmarking Apache Druid
 
Benchmarking Apache Druid
Benchmarking Apache DruidBenchmarking Apache Druid
Benchmarking Apache Druid
 
REX Hadoop et R
REX Hadoop et RREX Hadoop et R
REX Hadoop et R
 
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
Webinar: Dyn + DataStax - helping companies deliver exceptional end-user expe...
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...Lessons learned from designing QA automation event streaming platform(IoT big...
Lessons learned from designing QA automation event streaming platform(IoT big...
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...Lessons learned from designing a QA Automation for analytics databases (big d...
Lessons learned from designing a QA Automation for analytics databases (big d...
 
Cpp In Soa
Cpp In SoaCpp In Soa
Cpp In Soa
 
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based HardwareRed hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
Red hat Storage Day LA - Designing Ceph Clusters Using Intel-Based Hardware
 
Olap scalability
Olap scalabilityOlap scalability
Olap scalability
 
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
Technical Webinar: Patterns for Integrating Your Salesforce App with Off-Plat...
 
Everything You Need to Know About Sharding
Everything You Need to Know About ShardingEverything You Need to Know About Sharding
Everything You Need to Know About Sharding
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
MySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated EnvironmentMySQL Scalability and Reliability for Replicated Environment
MySQL Scalability and Reliability for Replicated Environment
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014MongoDB Sharding Webinar 2014
MongoDB Sharding Webinar 2014
 

More from ikewu83

《云计算核心技术剖析》Mini书
《云计算核心技术剖析》Mini书《云计算核心技术剖析》Mini书
《云计算核心技术剖析》Mini书ikewu83
 
云计算与NoSQL
云计算与NoSQL云计算与NoSQL
云计算与NoSQLikewu83
 
Yun table 云时代的数据库
Yun table 云时代的数据库Yun table 云时代的数据库
Yun table 云时代的数据库ikewu83
 
Dean keynote-ladis2009
Dean keynote-ladis2009Dean keynote-ladis2009
Dean keynote-ladis2009ikewu83
 
云计算091124(李德毅院士)
云计算091124(李德毅院士)云计算091124(李德毅院士)
云计算091124(李德毅院士)ikewu83
 
04 陈良忠ibm cloud forum ibm experience 0611
04 陈良忠ibm cloud forum  ibm experience 061104 陈良忠ibm cloud forum  ibm experience 0611
04 陈良忠ibm cloud forum ibm experience 0611ikewu83
 
05 朱近之 ibm云计算解决方案概览 0611
05 朱近之 ibm云计算解决方案概览 061105 朱近之 ibm云计算解决方案概览 0611
05 朱近之 ibm云计算解决方案概览 0611ikewu83
 
03 李实恭-乘云之势以智致远 0611
03 李实恭-乘云之势以智致远 061103 李实恭-乘云之势以智致远 0611
03 李实恭-乘云之势以智致远 0611ikewu83
 
Cisco nexus 1000v
Cisco nexus 1000vCisco nexus 1000v
Cisco nexus 1000vikewu83
 
OVF 1.0 Whitepaper
OVF 1.0 WhitepaperOVF 1.0 Whitepaper
OVF 1.0 Whitepaperikewu83
 
De 03 Introduction To V Cloud Api V1
De 03 Introduction To V Cloud Api V1De 03 Introduction To V Cloud Api V1
De 03 Introduction To V Cloud Api V1ikewu83
 

More from ikewu83 (13)

《云计算核心技术剖析》Mini书
《云计算核心技术剖析》Mini书《云计算核心技术剖析》Mini书
《云计算核心技术剖析》Mini书
 
云计算与NoSQL
云计算与NoSQL云计算与NoSQL
云计算与NoSQL
 
Yun table 云时代的数据库
Yun table 云时代的数据库Yun table 云时代的数据库
Yun table 云时代的数据库
 
Pnp
PnpPnp
Pnp
 
Dean keynote-ladis2009
Dean keynote-ladis2009Dean keynote-ladis2009
Dean keynote-ladis2009
 
云计算091124(李德毅院士)
云计算091124(李德毅院士)云计算091124(李德毅院士)
云计算091124(李德毅院士)
 
04 陈良忠ibm cloud forum ibm experience 0611
04 陈良忠ibm cloud forum  ibm experience 061104 陈良忠ibm cloud forum  ibm experience 0611
04 陈良忠ibm cloud forum ibm experience 0611
 
05 朱近之 ibm云计算解决方案概览 0611
05 朱近之 ibm云计算解决方案概览 061105 朱近之 ibm云计算解决方案概览 0611
05 朱近之 ibm云计算解决方案概览 0611
 
03 李实恭-乘云之势以智致远 0611
03 李实恭-乘云之势以智致远 061103 李实恭-乘云之势以智致远 0611
03 李实恭-乘云之势以智致远 0611
 
Cisco nexus 1000v
Cisco nexus 1000vCisco nexus 1000v
Cisco nexus 1000v
 
OVF 1.0 Whitepaper
OVF 1.0 WhitepaperOVF 1.0 Whitepaper
OVF 1.0 Whitepaper
 
De 03 Introduction To V Cloud Api V1
De 03 Introduction To V Cloud Api V1De 03 Introduction To V Cloud Api V1
De 03 Introduction To V Cloud Api V1
 
OVF 1.1
OVF 1.1OVF 1.1
OVF 1.1
 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 

Google F1

  • 1. F1 - The Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business Jeff Shute, Mircea Oancea, Stephan Ellner, Ben Handy, Eric Rollins, Bart Samwel, Radek Vingralek, Chad Whipkey, Xin Chen, Beat Jegerlehner, Kyle Littlefield, Phoenix Tong SIGMOD May 22, 2012
  • 2. Today's Talk F1 - A Hybrid Database combining the ● Scalability of Bigtable ● Usability and functionality of SQL databases Key Ideas ● Scalability: Auto-sharded storage ● Availability & Consistency: Synchronous replication ● High commit latency: Can be hidden ○ Hierarchical schema ○ Protocol buffer column types ○ Efficient client code Can you have a scalable database without going NoSQL? Yes.
  • 3. The AdWords Ecosystem One shared database backing Google's core AdWords business advertiser SOAP API web UI reports Java / "frontend" ad-hoc SQL users C++ / "backend" DB log aggregation ad servers ad approvals spam analysis ad logs
  • 4. Our Legacy DB: Sharded MySQL Sharding Strategy ● Sharded by customer ● Apps optimized using shard awareness Limitations ● Availability ○ Master / slave replication -> downtime during failover ○ Schema changes -> downtime for table locking ● Scaling ○ Grow by adding shards ○ Rebalancing shards is extremely difficult and risky ○ Therefore, limit size and growth of data stored in database ● Functionality ○ Can't do cross-shard transactions or joins
  • 5. Demanding Users Critical applications driving Google's core ad business ● 24/7 availability, even with datacenter outages ● Consistency required ○ Can't afford to process inconsistent data ○ Eventual consistency too complex and painful ● Scale: 10s of TB, replicated to 1000s of machines Shared schema ● Dozens of systems sharing one database ● Constantly evolving - multiple schema changes per week SQL Query ● Query without code
  • 6. Our Solution: F1 A new database, ● built from scratch, ● designed to operate at Google scale, ● without compromising on RDBMS features. Co-developed with new lower-level storage system, Spanner
  • 7. Underlying Storage - Spanner Descendant of Bigtable, Successor to Megastore Properties ● Globally distributed ● Synchronous cross-datacenter replication (with Paxos) ● Transparent sharding, data movement ● General transactions ○ Multiple reads followed by a single atomic write ○ Local or cross-machine (using 2PC) ● Snapshot reads
  • 8. F1 Architecture F1 client ● Sharded Spanner servers ○ data on GFS and in memory ● Stateless F1 server ● Pool of workers for query execution F1 server F1 query workers Features ● Relational schema Spanner ○ Extensions for hierarchy and rich data types server ○ Non-blocking schema changes ● Consistent indexes ● Parallel reads with SQL or Map-Reduce GFS
  • 9. How We Deploy ● Five replicas needed for high availability ● Why not three? ○ Assume one datacenter down ○ Then one more machine crash => partial outage Geography ● Replicas spread across the country to survive regional disasters ○ Up to 100ms apart Performance ● Very high commit latency - 50-100ms ● Reads take 5-10ms - much slower than MySQL ● High throughput
  • 10. Hierarchical Schema Explicit table hierarchies. Example: ● Customer (root table): PK (CustomerId) ● Campaign (child): PK (CustomerId, CampaignId) ● AdGroup (child): PK (CustomerId, CampaignId, AdGroupId) Rows and PKs Storage Layout Customer (1) 1 2 Campaign (1,3) AdGroup (1,3,5) AdGroup (1,3,6) 1,3 1,4 2,5 Campaign (1,4) AdGroup (1,4,7) Customer (2) 1,3,5 1,3,6 1,4,7 2,5,8 Campaign (2,5) AdGroup (2,5,8)
  • 11. Clustered Storage ● Child rows under one root row form a cluster ● Cluster stored on one machine (unless huge) ● Transactions within one cluster are most efficient ● Very efficient joins inside clusters (can merge with no sorting) Rows and PKs Storage Layout Customer (1) 1 2 Campaign (1,3) AdGroup (1,3,5) AdGroup (1,3,6) 1,3 1,4 2,5 Campaign (1,4) AdGroup (1,4,7) Customer (2) 1,3,5 1,3,6 1,4,7 2,5,8 Campaign (2,5) AdGroup (2,5,8)
  • 12. Protocol Buffer Column Types Protocol Buffers ● Structured data types with optional and repeated fields ● Open-sourced by Google, APIs in several languages Column data types are mostly Protocol Buffers ● Treated as blobs by underlying storage ● SQL syntax extensions for reading nested fields ● Coarser schema with fewer tables - inlined objects instead Why useful? ● Protocol Buffers pervasive at Google -> no impedance mismatch ● Simplified schema and code - apps use the same objects ○ Don't need foreign keys or joins if data is inlined
  • 13. SQL Query ● Parallel query engine implemented from scratch ● Fully functional SQL, joins to external sources ● Language extensions for protocol buffers SELECT CustomerId FROM Customer c PROTO JOIN c.Whitelist.feature f WHERE f.feature_id = 302 AND f.status = 'STATUS_ENABLED' Making queries fast ● Hide RPC latency ● Parallel and batch execution ● Hierarchical joins
  • 14. Coping with High Latency Preferred transaction structure ● One read phase: No serial reads ○ Read in batches ○ Read asynchronously in parallel ● Buffer writes in client, send as one RPC Use coarse schema and hierarchy ● Fewer tables and columns ● Fewer joins For bulk operations ● Use small transactions in parallel - high throughput Avoid ORMs that add hidden costs
  • 15. ORM Anti-Patterns ● Obscuring database operations from app developers ● Serial reads ○ for loops doing one query per iteration ● Implicit traversal ○ Adding unwanted joins and loading unnecessary data These hurt performance in all databases. They are disastrous on F1.
  • 16. Our Client Library ● Very lightweight ORM - doesn't really have the "R" ○ Never uses Relational joins or traversal ● All objects are loaded explicitly ○ Hierarchical schema and protocol buffers make this easy ○ Don't join - just load child objects with a range read ● Ask explicitly for parallel and async reads
  • 17. Results Development ● Code is slightly more complex ○ But predictable performance, scales well by default ● Developers happy ○ Simpler schema ○ Rich data types -> lower impedance mismatch User-Facing Latency ● Avg user action: ~200ms - on par with legacy system ● Flatter distribution of latencies ○ Mostly from better client code ○ Few user actions take much longer than average ○ Old system had severe latency tail of multi-second transactions
  • 18. Current Challenges ● Parallel query execution ○ Failure recovery ○ Isolation ○ Skew and stragglers ○ Optimization ● Migrating applications, without downtime ○ Core systems already on F1, many more moving ○ Millions of LOC
  • 19. Summary We've moved a large and critical application suite from MySQL to F1. This gave us ● Better scalability ● Better availability ● Equivalent consistency guarantees ● Equally powerful SQL query And also similar application latency, using ● Coarser schema with rich column types ● Smarter client coding patterns In short, we made our database scale, and didn't lose any key database features along the way.