I promise that understanding NoSQL is as easy as playing with LEGO bricks! Google Bigtable, presented in 2006, is the inspiration for Apache HBase: let's take a deep dive into Bigtable to better understand HBase.
Engineering Fast Indexes for Big-Data Applications: Spark Summit East talk by... (Spark Summit)
Contemporary computing hardware offers massive new performance opportunities. Yet high-performance programming remains a daunting challenge.
We present some of the lessons learned while designing faster indexes, with a particular emphasis on compressed bitmap indexes. Compressed bitmap indexes accelerate queries in popular systems such as Apache Spark, Git, Elastic, Druid and Apache Kylin.
We will review a multi-layered framework for PostgreSQL security, with a deeper focus on limiting access to the database and data, as well as securing the data. Using the popular AAA (Authentication, Authorization, Auditing) framework we will cover:
Best practices for authentication (trust, certificate, MD5, SCRAM, etc.).
Advanced approaches, such as password profiles.
Deep dive into authorization and data access control for roles, database objects (tables, etc.), view usage, row-level security, and data redaction.
Auditing, encryption and SQL injection attack prevention.
Kanban is the simplest approach currently used in software development. Since Kanban prescribes close to nothing, there are often a lot of basic questions about the method.
The presentation explains what Kanban is, using Scrum as a reference point, then presents a series of situations that answer basic questions about working with Kanban.
C* Summit 2013: How Not to Use Cassandra by Axel Liljencrantz (DataStax Academy)
At Spotify, we see failure as an opportunity to learn. During the two years we've used Cassandra in our production environment, we have learned a lot. This session touches on some of the exciting design anti-patterns, performance killers and other opportunities to lose a finger that are at your disposal with Cassandra.
HBase at Bloomberg: High Availability Needs for the Financial Industry (HBaseCon)
Speaker: Sudarshan Kadambi and Matthew Hunt (Bloomberg LP)
Bloomberg is a financial data and analytics provider, so data management is core to what we do. There's tremendous diversity in the type of data we manage, and HBase is a natural fit for many of these datasets - from the perspective of the data model as well as in terms of a scalable, distributed database. This talk covers data and analytics use cases at Bloomberg and operational challenges around HA. We'll explore the work currently being done under HBASE-10070, further extensions to it, and how this solution is qualitatively different to how failover is handled by Apache Cassandra.
An Overview of Spanner: Google's Globally Distributed Database (Benjamin Bengfort)
Spanner is a globally distributed database that provides external consistency between data centers and stores data in a schema-based, semi-relational data structure. Not only that, Spanner provides a versioned view of the data that allows for instantaneous snapshot isolation across any segment of the data. This versioned isolation allows Spanner to provide globally consistent reads of the database at a particular time, allowing for lock-free read-only transactions (and therefore no communication overhead for consensus during these types of reads). Spanner also provides externally consistent reads and writes with a timestamp-based linear execution of transactions and two-phase commits. Spanner is the first distributed database that provides global sharding and replication with strong consistency semantics.
Instaclustr has a diverse customer base, including ad tech, IoT, and messaging applications, ranging from small start-ups to large enterprises. In this presentation we share our experiences, common issues, diagnosis methods, and some tips and tricks for managing your Cassandra cluster.
About the Speaker
Brooke Jensen VP Technical Operations & Customer Services, Instaclustr
Instaclustr is the only provider of fully managed Cassandra as a Service in the world. Brooke Jensen manages our team of engineers that maintains the operational performance of our diverse fleet of clusters, as well as providing 24/7 advice and support to our customers. Brooke has over 10 years' experience as a software engineer, specializing in performance optimization of large systems, and has extensive experience managing and resolving major system incidents.
This simple and crisp quick reference card is for Agile and Scrum basics. It is a simple way to glance through all the concepts and use it as a tool for revision, even before an interview.
Apache HBase Improvements and Practices at Xiaomi (HBaseCon)
Duo Zhang and Liangliang He (Xiaomi)
In this session, we’ll discuss the various practices around HBase in use at Xiaomi, including those relating to HA, tiered compaction, multi-tenancy, and failover across data centers.
In this session, you get an overview of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also discuss new features, architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... (DataStax)
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed that deletes data from Cassandra during standard or user-defined compaction, without resorting to tombstones or TTLs.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect, and day one employee of ProtectWise, Inc., specializing in massive real time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
SQL Server Wait Types Everyone Should Know (Dean Richards)
Many people use wait types for performance tuning but do not know what some of the most common ones indicate. This presentation goes into detail about the top 8 wait types I see at the customers I work with, providing wait descriptions as well as solutions.
Security Best Practices for your Postgres Deployment (PGConf APAC)
These slides were used by Sameer Kumar of Ashnik for presenting his topic at pgDay Asia 2016. He took the audience through some of the security best practices for deploying and hardening PostgreSQL.
What Every Developer Should Know About Database Scalability (jbellis)
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
A lecture given in the Aalto University course "Design of WWW Services".
The single-page app is a web application paradigm that is already several years old and is now gaining traction due to interest in HTML5 and particularly cross-platform mobile (web) applications. The presentation overviews the single-page application paradigm and compares it with other web app paradigms.
The presentation uses Backbone.js as the sample and gives practical tips on how to best structure Backbone.js applications. It contains an extensive set of tips and links in the notes section.
The reader is advised to download the presentation for better readability of the notes.
Back in 2015, Square and Google collaborated to launch gRPC, an open source RPC framework backed by protocol buffers and HTTP/2, based on real-world experiences operating microservices at scale. If you build microservices, you will be interested in gRPC.
This webcast covers:
- a technical overview of gRPC
- use cases and applicability in your stack
- a deep dive into the practicalities of operationalizing gRPC
Video: https://www.parleys.com/tutorial/life-beyond-illusion-present
Summary: The idea of the present is an illusion. Everything we see, hear and feel is just an echo from the past. But this illusion has influenced us and the way we view the world in so many ways; from Newton’s physics with a linearly progressing timeline accruing absolute knowledge along the way to the von Neumann machine with its total ordering of instructions updating mutable state with full control of the “present”. But unfortunately this is not how the world works. There is no present, all we have is facts derived from the merging of multiple pasts. The truth is closer to Einstein’s physics where everything is relative to one’s perspective.
As developers we need to wake up and break free from the perceived reality of living in a single globally consistent present. The advent of multicore and cloud computing architectures meant that most applications today are distributed systems—multiple cores separated by the memory bus or multiple nodes separated by the network—which puts a harsh end to this illusion. Facts travel at the speed of light (at best), which makes the distinction between past and perceived present even more apparent in a distributed system where latency is higher and where facts (messages) can get lost.
The only way to design truly scalable and performant systems that can construct a sufficiently consistent view of history—and thereby our local “present”—is by treating time as a first class construct in our programming model and to model the present as facts derived from the merging of multiple concurrent pasts.
In this talk we will explore what all this means to the design of our systems, how we need to view and model consistency, consensus, communication, history and behaviour, and look at some practical tools and techniques to bring it all together.
An overview of how recent changes in technology have changed priorities for databases to distributed systems, and how you can preserve consistency in distributed data stores like Riak.
PostgreSQL is very differently architected and presents none of these problems. All PostgreSQL operations are multi-versioned using Multi-Version Concurrency Control (MVCC). As a result, common operations such as re-indexing, adding or dropping columns, and recreating views can be performed online and without excessive locking.
5.–7. Architectural view of the storage hierarchy
[Diagram, built up over three slides: processors (P) with per-core L1$ and a shared L2$ sit above local DRAM and disk within one server; servers are joined by a rack switch into a rack; racks are joined by a cluster switch into a cluster.]
One server: DRAM: 16GB, 100ns, 20GB/s; Disk: 2TB, 10ms, 200MB/s
Local rack (80 servers): DRAM: 1TB, 300us, 100MB/s; Disk: 160TB, 11ms, 100MB/s
Cluster (30+ racks): DRAM: 30TB, 500us, 10MB/s; Disk: 4.80PB, 12ms, 10MB/s
8. Storage hierarchy: a different view
A bumpy ride that has been getting bumpier over time
9. Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super reliable servers (MTBF of 30 years)
– Build a computing system with 10,000 of those
– Watch one fail per day
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
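A quick check of the arithmetic behind "watch one fail per day"; a minimal sketch using only the figures from this slide:

    # Expected failures for a fleet of highly reliable servers (sketch).
    mtbf_years = 30          # per-server MTBF, from the slide
    servers = 10_000         # fleet size, from the slide

    # Assuming independent failures, fleet failure rate = servers / MTBF.
    failures_per_year = servers / mtbf_years
    print(f"{failures_per_year:.0f}/year = {failures_per_year / 365:.1f}/day")
    # -> 333/year = 0.9/day: roughly one server failure per day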
10. The Joys of Real Hardware
Typical first year for a new cluster:
~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
~1 network rewiring (rolling ~5% of machines down over 2-day span)
~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
~5 racks go wonky (40-80 machines see 50% packet loss)
~8 network maintenances (4 might cause ~30-minute random connectivity losses)
~12 router reloads (takes out DNS and external VIPs for a couple of minutes)
~3 router failures (have to immediately pull traffic for an hour)
~dozens of minor 30-second blips for DNS
~1000 individual machine failures
~thousands of hard drive failures
slow disks, bad memory, misconfigured machines, flaky machines, etc.
Long distance links: wild dogs, sharks, dead horses, drunken hunters, etc.
15. Understanding fault statistics matters
Can we expect faults to be independent or correlated?
Are there common failure patterns we should program around?
16. Google Cluster Environment
• Cluster is 1000s of machines, typically one or a handful of configurations
• File system (GFS) + Cluster scheduling system are core services
• Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s)
• mix of batch and low-latency, user-facing production jobs
[Diagram: Machine 1 … Machine N each run Linux on commodity HW, hosting a GFS chunkserver, a scheduling slave, and tasks from several jobs (e.g. jobs 1, 3, 5, 7, 12); cluster-wide services include the scheduling master, the GFS master, and the Chubby lock service.]
18. GFS Usage @ Google
• 200+ clusters
• Many clusters of 1000s of machines
• Pools of 1000s of clients
• 4+ PB filesystems
• 40 GB/s read/write load
– (in the presence of frequent HW failures)
19. Google: Most Systems are Distributed Systems
• Distributed systems are a must:
– data, request volume or both are too large for single machine
• careful design about how to partition problems
• need high capacity systems even within a single datacenter
–multiple datacenters, all around the world
• almost all products deployed in multiple locations
–services used heavily even internally
• a web search touches 50+ separate services, 1000s of
machines
20. Many Internal Services
• Simpler from a software engineering standpoint
– few dependencies, clearly specified (Protocol Buffers)
– easy to test new versions of individual services
– ability to run lots of experiments
• Development cycles largely decoupled
– lots of benefits: small teams can work independently
– easier to have many engineering offices around the world
21. Protocol Buffers
• Good protocol description language is vital
• Desired attributes:
– self-describing, multiple language support
– efficient to encode/decode (200+ MB/s), compact serialized form
Our solution: Protocol Buffers (in active use since 2000)
message SearchResult {
  required int32 estimated_results = 1;  // (1 is the tag number)
  optional string error_message = 2;
  repeated group Result = 3 {
    required float score = 4;
    required fixed64 docid = 5;
    optional message<WebResultDetails> = 6;
    …
  }
};
22. Protocol Buffers (cont)
• Automatically generated language wrappers
• Graceful client and server upgrades
– systems ignore tags they don't understand, but pass the information through
(no need to upgrade intermediate servers)
• Serialization/deserialization
– high performance (200+ MB/s encode/decode)
– fairly compact (uses variable length encodings)
– format used to store data persistently (not just for RPCs)
• Also allow service specifications:
service Search {
  rpc DoSearch(SearchRequest) returns (SearchResponse);
  rpc DoSnippets(SnippetRequest) returns (SnippetResponse);
  rpc Ping(EmptyMessage) returns (EmptyMessage) {
    { protocol=udp; };
  };
};
• Open source version: http://code.google.com/p/protobuf/
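To make "ignore tags they don't understand, but pass the information through" concrete: the public wire format keys every field with a varint (field_number << 3 | wire_type), and the wire type alone determines the value's length, so even unknown fields can be skipped and re-emitted byte-for-byte. A minimal sketch (illustrative only, not the real library):

    # Scan a protobuf-style buffer, preserving fields with unknown tags (sketch).
    # Wire types: 0 = varint, 1 = 64-bit, 2 = length-delimited, 5 = 32-bit.

    def read_varint(buf, pos):
        """Decode a base-128 varint at pos; return (value, new_pos)."""
        result, shift = 0, 0
        while True:
            b = buf[pos]
            pos += 1
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                return result, pos
            shift += 7

    def scan_fields(buf, known_tags):
        """Yield (tag, value_bytes, is_known) for every field in buf.
        Unknown fields are still sized correctly, so an old server can
        copy them through unchanged instead of dropping them."""
        pos = 0
        while pos < len(buf):
            key, pos = read_varint(buf, pos)
            tag, wire_type = key >> 3, key & 0x7
            start = pos
            if wire_type == 0:                       # varint value
                _, pos = read_varint(buf, pos)
            elif wire_type == 1:                     # fixed 64-bit value
                pos += 8
            elif wire_type == 2:                     # length-delimited value
                length, pos = read_varint(buf, pos)
                pos += length
            elif wire_type == 5:                     # fixed 32-bit value
                pos += 4
            else:
                raise ValueError(f"unsupported wire type {wire_type}")
            yield tag, buf[start:pos], tag in known_tags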
23. Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
– without actually having to build it!
24. Numbers Everyone Should Know
L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns
Compress 1K bytes with Zippy             3,000 ns
Send 2K bytes over 1 Gbps network       20,000 ns
Read 1 MB sequentially from memory     250,000 ns
Round trip within same datacenter      500,000 ns
Disk seek                           10,000,000 ns
Read 1 MB sequentially from disk    20,000,000 ns
Send packet CA->Netherlands->CA    150,000,000 ns
25.–27. Back of the Envelope Calculations
How long to generate image results page (30 thumbnails)?
Design 1: Read serially, thumbnail 256K images on the fly
30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms
Design 2: Issue reads in parallel:
10 ms/seek + 256K read / 30 MB/s = 18 ms
(Ignores variance, so really more like 30-60 ms, probably)
Lots of variations:
– caching (single images? whole sets of thumbnails?)
– pre-computing thumbnails
– …
Back of the envelope helps identify most promising…
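The same estimates, scripted; a minimal sketch assuming the slide's figures (10 ms/seek and an effective 30 MB/s read rate), which reproduces the numbers above up to rounding:

    # Back-of-the-envelope for a 30-thumbnail image results page (sketch).
    SEEK_MS = 10                # disk seek time, from "Numbers Everyone Should Know"
    READ_MB_PER_S = 30          # effective read rate assumed on the slide
    IMAGES = 30
    IMAGE_MB = 256 / 1024       # 256K per image

    read_ms = IMAGE_MB / READ_MB_PER_S * 1000        # ~8.5 ms to read one image

    serial_ms = IMAGES * (SEEK_MS + read_ms)         # Design 1: seek+read per image
    parallel_ms = SEEK_MS + read_ms                  # Design 2: all reads in parallel

    print(f"serial:   {serial_ms:.0f} ms")    # ~550 ms (the slide rounds to 560)
    print(f"parallel: {parallel_ms:.0f} ms")  # ~18 ms, ignoring variance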
28. Write Microbenchmarks!
• Great to understand performance
– Builds intuition for back-of-the-envelope calculations
• Reduces cycle time to test performance improvements
Benchmark                 Time(ns)  CPU(ns)  Iterations
BM_VarintLength32/0              2        2   291666666
BM_VarintLength32Old/0           5        5   124660869
BM_VarintLength64/0              8        8    89600000
BM_VarintLength64Old/0          25       24    42164705
BM_VarintEncode32/0              7        7    80000000
BM_VarintEncode64/0             18       16    39822222
BM_VarintEncode64Old/0          24       22    31165217
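A harness like the one behind that table is a few lines in any language. A sketch using Python's standard timeit module; the two statements compared are illustrative stand-ins for old and new varint-length routines:

    # Minimal microbenchmark harness (sketch).
    import timeit

    def bench(name, stmt, setup="pass", n=1_000_000):
        """Run stmt n times and report nanoseconds per iteration."""
        total_s = timeit.timeit(stmt, setup=setup, number=n)
        print(f"{name:<24} {total_s / n * 1e9:8.1f} ns/iter  ({n} iterations)")

    # Two ways to compute the varint-encoded byte length of an integer:
    bench("bit_length arithmetic", "(x.bit_length() + 6) // 7 or 1", setup="x = 2**40")
    bench("shift loop", "y = x\nn = 0\nwhile y: y >>= 7; n += 1", setup="x = 2**40")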
29. Know Your Basic Building Blocks
Core language libraries, basic data structures,
protocol buffers, GFS, BigTable,
indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their
implementations (at least at a high level)
If you don’t know what’s going on, you can’t do
decent back-of-the-envelope calculations!
30. Encoding Your Data
• CPUs are fast, memory/bandwidth are precious, ergo…
– Variable-length encodings
– Compression
– Compact in-memory representations
• Compression/encoding very important for many systems
– inverted index posting list formats
– storage systems for persistent data
• We have lots of core libraries in this area
– Many tradeoffs: space, encoding/decoding speed, etc. E.g.:
• Zippy: encode@300 MB/s, decode@600MB/s, 2-4X compression
• gzip: encode@25MB/s, decode@200MB/s, 4-6X compression
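As a concrete illustration of the variable-length encodings mentioned above, a sketch of the base-128 varint scheme used by protocol buffers: each byte carries 7 data bits plus a continuation bit, so small values cost a single byte.

    # Base-128 varint encode/decode (sketch).
    def encode_varint(n: int) -> bytes:
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # high bit set: more bytes follow
            else:
                out.append(byte)          # final byte
                return bytes(out)

    def decode_varint(buf: bytes) -> int:
        result, shift = 0, 0
        for b in buf:
            result |= (b & 0x7F) << shift
            if not (b & 0x80):
                return result
            shift += 7
        raise ValueError("truncated varint")

    assert encode_varint(1) == b"\x01"        # small values: 1 byte
    assert encode_varint(300) == b"\xac\x02"  # 2 bytes
    assert decode_varint(encode_varint(2**40)) == 2**40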
31. Designing & Building Infrastructure
Identify common problems, and build software systems to address them
in a general way
• Important not to try to be all things to all people
– Clients might be demanding 8 different things
– Doing 6 of them is easy
– …handling 7 of them requires real thought
– …dealing with all 8 usually results in a worse system
• more complex, compromises other clients in trying to satisfy everyone
Don't build infrastructure just for its own sake
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
• Best approach: use your own infrastructure (especially at first!)
– (much more rapid feedback about what works, what doesn't)
32. Design for Growth
Try to anticipate how requirements will evolve
keep likely features in mind as you design base system
Ensure your design works if scale changes by 10X or 20X
but the right solution for X often not optimal for 100X
33. Interactive Apps: Design for Low Latency
• Aim for low avg. times (happy users!)
–90%ile and 99%ile also very important
–Think about how much data you’re shuffling around
• e.g. dozens of 1 MB RPCs per user request -> latency will be lousy
• Worry about variance!
–Redundancy or timeouts can help bring in latency tail
• Judicious use of caching can help
• Use higher priorities for interactive requests
• Parallelism helps!
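One technique behind "redundancy or timeouts can help bring in latency tail" is the backup (hedged) request: if the first replica hasn't answered within a small delay, send the same request to a second replica and take whichever answer arrives first. A sketch; query_replica and the 10 ms hedge delay are hypothetical:

    # Hedged request sketch: fire a backup copy of the request after a short
    # delay and return whichever replica answers first.
    import concurrent.futures as cf

    def hedged_request(query_replica, replicas, hedge_delay_s=0.010):
        with cf.ThreadPoolExecutor(max_workers=2) as pool:
            first = pool.submit(query_replica, replicas[0])
            done, _ = cf.wait([first], timeout=hedge_delay_s)
            if done:
                return first.result()        # fast path: no backup needed
            backup = pool.submit(query_replica, replicas[1])
            done, _ = cf.wait([first, backup], return_when=cf.FIRST_COMPLETED)
            # a production version would also cancel the losing request
            return done.pop().result()       # whichever replica won the race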
34. Making Applications Robust Against Failures
Canary requests
Failover to other replicas/datacenters
Bad backend detection:
stop using for live requests until behavior gets better
More aggressive load balancing when imbalance is more severe
Make your apps do something reasonable even if not all is right
– Better to give users limited functionality than an error page
35. Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all
requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
If your system is slow or misbehaving, can you figure out why?
36. MapReduce
• A simple programming model that applies to many large-scale
computing problems
• Hide messy details in MapReduce runtime library:
– automatic parallelization
– load balancing
– network and disk transfer optimizations
– handling of machine failures
– robustness
– improvements to core library benefit all users of library!
37. Typical problem solved by MapReduce
•Read a lot of data
•Map: extract something you care about from each record
•Shuffle and Sort
•Reduce: aggregate, summarize, filter, or transform
•Write the results
Outline stays the same,
map and reduce change to fit the problem
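The canonical instance of this outline is word count. A single-process sketch of the three phases; the runtime's job is to run exactly these steps across thousands of machines:

    # Word count as map -> shuffle/sort -> reduce (single-process sketch).
    from collections import defaultdict

    def map_phase(record):
        for word in record.split():       # extract what we care about
            yield word, 1

    def shuffle(pairs):
        groups = defaultdict(list)        # group intermediate values by key
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        return key, sum(values)           # aggregate

    records = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = (kv for r in records for kv in map_phase(r))
    print(dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items()))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}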
38. Example: Rendering Map Tiles
Input: geographic feature list (e.g. I-5, Lake Washington, WA-520, I-90)
Map: emit each feature to all overlapping latitude-longitude rectangles, keyed by rectangle id, e.g. (0, I-5), (0, Lake Wash.), (0, WA-520), (1, I-90), (1, I-5), (1, Lake Wash.), …
Shuffle and Sort: sort by key (key = rectangle id), bringing each tile's features together
Reduce: render each tile using the data for all enclosed features
Output: rendered tiles
40. Parallel MapReduce
[Diagram: input data split across many parallel Map tasks; each shuffle partitions map output across the Reduce tasks, which write the partitioned output; a Master coordinates the job.]
For large enough problems, it's more about disk and
network performance than CPU & DRAM
41. MapReduce Usage Statistics Over Time
                                Aug '04   Mar '06   Sep '07   Sep '09
Number of jobs                      29K      171K    2,217K    3,467K
Average completion time (secs)      634       874       395       475
Machine years used                  217     2,002    11,081    25,562
Input data read (TB)              3,288    52,254   403,152   544,130
Intermediate data (TB)              758     6,743    34,774    90,120
Output data written (TB)            193     2,970    14,018    57,520
Average worker machines             157       268       394       488
42. MapReduce in Practice
•Abstract input and output interfaces
–lots of MR operations don’t just read/write simple files
• B-tree files
• memory-mapped key-value stores
• complex inverted index file formats
• BigTable tables
• SQL databases, etc.
• ...
•Low-level MR interfaces are in terms of byte arrays
–Hardly ever use textual formats, though: slow, hard to parse
–Most input & output is in encoded Protocol Buffer format
• See “MapReduce: A Flexible Data Processing Tool” (to appear in
upcoming CACM)
43. BigTable: Motivation
• Lots of (semi-)structured data at Google
– URLs: contents, crawl metadata, links, anchors, pagerank, …
– Per-user data: user preference settings, recent queries/search results, …
– Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
• Scale is large
– billions of URLs, many versions/page (~20K/version)
– hundreds of millions of users, thousands of q/sec
– 100TB+ of satellite image data
44.–48. Basic Data Model
• Distributed multi-dimensional sparse map
(row, column, timestamp) → cell contents
[Diagram, built up over five slides: row "www.cnn.com", column "contents:", holding timestamped versions t3, t11, t17 of the cell value "<html>…".]
• Rows are ordered lexicographically
• Good match for most of our applications
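To make the (row, column, timestamp) → value mapping concrete, a toy in-memory model; a sketch only, with a plain dict standing in for tablets and SSTables:

    # Toy (row, column, timestamp) -> value map in the spirit of the slide.
    table = {}

    def put(row, col, ts, value):
        table[(row, col, ts)] = value

    def read_latest(row, col):
        """Return the value with the highest timestamp for (row, col)."""
        versions = [(ts, v) for (r, c, ts), v in table.items() if (r, c) == (row, col)]
        return max(versions)[1] if versions else None

    put("www.cnn.com", "contents:", 3,  "<html>…v3")
    put("www.cnn.com", "contents:", 11, "<html>…v11")
    put("www.cnn.com", "contents:", 17, "<html>…v17")
    print(read_latest("www.cnn.com", "contents:"))   # "<html>…v17"

    # Lexicographic row order means a scan over sorted keys visits related
    # rows together (which is why row keys are often reversed, e.g. "com.cnn.www").
    for key in sorted(table):
        pass   # range scans see keys in lexicographic order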
52.–62. BigTable System Structure
[Diagram, built up across these slides, of one Bigtable Cell:]
• Bigtable master: performs metadata ops + load balancing
• Bigtable tablet servers (many): each serves data
• Layered on shared infrastructure:
– GFS: holds tablet data, logs
– Cluster scheduling system: handles failover, monitoring
– Lock service: holds metadata, handles master-election
• Bigtable clients talk through a client library:
– Open() consults the lock service
– read/write requests go directly to tablet servers
– metadata ops go to the master
63. BigTable Status
• Design/initial implementation started beginning of 2004
• Production use or active development for 100+ projects:
– Google Print
– My Search History
– Orkut
– Crawling/indexing pipeline
– Google Maps/Google Earth
– Blogger
– …
• Currently ~500 BigTable clusters
• Largest cluster:
–70+ PB data; sustained: 10M ops/sec; 30+ GB/s I/O
64. BigTable: What’s New Since OSDI’06?
• Lots of work on scaling
• Service clusters, managed by dedicated team
• Improved performance isolation
–fair-share scheduler within each server, better
accounting of memory used per user (caches, etc.)
–can partition servers within a cluster for different users
or tables
• Improved protection against corruption
–many small changes
–e.g. immediately read results of every compaction,
compare with CRC.
• Catches ~1 corruption/5.4 PB of data compacted
65. BigTable Replication (New Since OSDI’06)
• Configured on a per-table basis
• Typically used to replicate data to multiple bigtable
clusters in different data centers
• Eventual consistency model: writes to table in one cluster
eventually appear in all configured replicas
• Nearly all user-facing production uses of BigTable use
replication
66. BigTable Coprocessors (New Since OSDI’06)
• Arbitrary code that runs next to each tablet in a table
–as tablets split and move, coprocessor code
automatically splits/moves too
• High-level call interface for clients
–Unlike RPC, calls addressed to rows or ranges of rows
• coprocessor client library resolves to actual locations
–Calls across multiple rows automatically split into multiple
parallelized RPCs
• Very flexible model for building distributed services
– automatic scaling, load balancing, request routing for apps
67. Example Coprocessor Uses
• Scalable metadata management for Colossus (next gen
GFS-like file system)
• Distributed language model serving for machine
translation system
• Distributed query processing for full-text indexing support
• Regular expression search support for code repository
68. Current Work: Spanner
• Storage & computation system that spans all our datacenters
– single global namespace
• Names are independent of location(s) of data
• Similarities to Bigtable: tables, families, locality groups, coprocessors, ...
• Differences: hierarchical directories instead of rows, fine-grained replication
• Fine-grained ACLs, replication configuration at the per-directory level
– support mix of strong and weak consistency across datacenters
• Strong consistency implemented with Paxos across tablet replicas
• Full support for distributed transactions across directories/machines
– much more automated operation
• system automatically moves and adds replicas of data and computation based
on constraints and usage patterns
• automated allocation of resources across entire fleet of machines
69. • Future scale: ~10^6 to 10^7 machines, ~10^13 directories,
~10^18 bytes of storage, spread at 100s to 1000s of
locations around the world, ~10^9 client machines
– zones of semi-autonomous control
– consistency after disconnected operation
– users specify high-level desires:
“99%ile latency for accessing this data should be <50ms”
“Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
Design Goals for Spanner
70. Adaptivity in World-Wide Systems
• Challenge: automatic, dynamic world-wide placement of
data & computation to minimize latency and/or cost, given
constraints on:
– bandwidth
– packet loss
– power
– resource usage
– failure modes
– ...
• Users specify high-level desires:
“99%ile latency for accessing this data should be <50ms”
“Store this data on at least 2 disks in EU, 2 in U.S. & 1 in Asia”
71. Building Applications on top of Weakly Consistent Storage Systems
• Many applications need state replicated across a wide area
– For reliability and availability
• Two main choices:
– consistent operations (e.g. use Paxos)
• often imposes additional latency for common case
– inconsistent operations
• better performance/availability, but apps harder to write and reason about in
this model
• Many apps need to use a mix of both of these:
– e.g. Gmail: marking a message as read is asynchronous, sending a
message is a heavier-weight consistent operation
72. Building Applications on top of Weakly Consistent Storage Systems
• Challenge: General model of consistency choices,
explained and codified
– ideally would have one or more “knobs” controlling performance vs.
consistency
– “knob” would provide easy-to-understand tradeoffs
• Challenge: Easy-to-use abstractions for resolving
conflicting updates to multiple versions of a piece of state
– Useful for reconciling client state with servers after disconnected
operation
– Also useful for reconciling replicated state in different data centers
after repairing a network partition
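One widely used abstraction for detecting conflicting updates is the version vector: each replica counts the updates it has seen from every replica, so reconciliation can tell "newer" apart from "concurrent" and hand true conflicts to an application-level merge. A sketch:

    # Version-vector comparison (sketch) for reconciling replicated state.
    def dominates(a, b):
        """True if clock a has seen at least everything clock b has."""
        return all(a.get(k, 0) >= b.get(k, 0) for k in set(a) | set(b))

    def compare(a, b):
        if dominates(a, b) and dominates(b, a):
            return "equal"
        if dominates(a, b):
            return "a newer"
        if dominates(b, a):
            return "b newer"
        return "concurrent"    # true conflict: application must merge

    # Two replicas update independently from common ancestor {X: 1, Y: 1}:
    clock_x = {"X": 2, "Y": 1}         # replica X applied a local write
    clock_y = {"X": 1, "Y": 2}         # replica Y wrote during a partition
    print(compare(clock_x, clock_y))   # "concurrent" -> merge needed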
73. Thanks! Questions...?
Further reading:
• Ghemawat, Gobioff, & Leung. The Google File System. SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
• Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population. FAST 2007.
• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation. EMNLP 2007.
• Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan & Claypool Synthesis Series on Computer Architecture, 2009.
• Malewicz et al. Pregel: A System for Large-Scale Graph Processing. PODC 2009.
• Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.
• Protocol Buffers. http://code.google.com/p/protobuf/
These and many more available at: http://labs.google.com/papers.html