This is a talk about Netflix's path to Cassandra. The first few slides may look similar to those in previous presentations, but they are just to set the context. Most of the content is brand new!
3. Motivation
Circa late 2008, Netflix had a single data center
Single point of failure (a.k.a. SPOF)
Approaching limits on cooling, power, space, and traffic capacity
Alternatives
Build more data centers
Outsource the majority of capacity planning and scale-out
Allows us to focus on core competencies
4. Motivation
Winner: Outsource the majority of capacity planning and scale-out
Leverage a leading Infrastructure-as-a-Service provider: Amazon Web Services
5.
6. Cloud Migration Strategy
Components
Applications and Software Infrastructure
Data
Migration Considerations
Avoid sensitive data for now
PII and PCI DSS data stays in our DC; the rest can go to the cloud
Favor Web Scale applications & data
7. Cloud Migration Strategy
Examples of Data that can be moved
Video-centric data
Critics’ and Users’ reviews
Video Metadata (e.g. director, actors, plot description, etc…)
User-video-centric data – some of our largest data sets
Video Queue
Watched History
Video Ratings (i.e. a 5-star rating system)
Video Playback Metadata (e.g. streaming bookmarks, activity logs)
13. Pick a Data Store in the Cloud
An ideal storage solution should have the following features:
• Be hosted in AWS
• We wanted a database-as-a-service
• Be highly scalable and available and have acceptable latencies
• It should automatically scale with Netflix’s traffic growth
• It should be as available as television – i.e. zero downtime
• Support SQL
• Developers already familiar with the model
14. Pick a Data Store in the Cloud
We picked SimpleDB and S3
SimpleDB was targeted as the AP (c.f. CAP theorem) equivalent of our RDBMS databases in our data center
S3 was used for data sets where item or row data exceeded SimpleDB limits and could be looked up purely by a single key (i.e. does not require secondary indices and complex query semantics)
Video encodes
Streaming device activity logs (i.e. CLOB, BLOB, etc.)
Compressed (old) Rental History
16. Technology Overview : SimpleDB
Terminology (SimpleDB | Hash Table | Relational Databases)
Domain | Hash Table | Table
Item | Entry | Row
Item Name | Key | (Mandatory) Primary Key
Attribute | Part of the Entry Value | Column
17. Technology Overview : SimpleDB
Example domain: Soccer Players (Key | Value)
ab12ocs12v9 | First Name = Harold, Last Name = Kewell, Nickname = Wizard of Oz, Teams = Leeds United, Liverpool, Galatasaray
b24h3b3403b | First Name = Pavel, Last Name = Nedved, Nickname = Czech Cannon, Teams = Lazio, Juventus
cc89c9dc892 | First Name = Cristiano, Last Name = Ronaldo, Teams = Sporting, Manchester United, Real Madrid
SimpleDB’s salient characteristics
• SimpleDB offers a range of consistency options
• SimpleDB domains are sparse and schema-less
• The Key and all Attributes are indexed
• Each item must have a unique Key
• An item contains a set of Attributes
• Each Attribute has a name
• Each Attribute has a set of values
• All data is stored as UTF-8 character strings (i.e. no support for types such as numbers or dates)
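To make the data model concrete, here is a minimal sketch (plain Python, not the actual SimpleDB API) of a sparse, schema-less domain: each item maps a key to named attributes, each attribute holds a set of UTF-8 string values, and items need not share the same attributes. The contents mirror the Soccer Players example above.

    # Illustrative model of a SimpleDB domain: item key -> {attribute name -> set of string values}.
    # This is a conceptual sketch, not the AWS SimpleDB client.
    soccer_players = {
        "ab12ocs12v9": {
            "First Name": {"Harold"},
            "Last Name":  {"Kewell"},
            "Nickname":   {"Wizard of Oz"},
            "Teams":      {"Leeds United", "Liverpool", "Galatasaray"},  # multi-valued attribute
        },
        "cc89c9dc892": {
            "First Name": {"Cristiano"},
            "Last Name":  {"Ronaldo"},
            "Teams":      {"Sporting", "Manchester United", "Real Madrid"},
            # No "Nickname" attribute here -- items are sparse and schema-less.
        },
    }

    # All attributes are indexed, so a query can filter on any of them,
    # e.g. "players whose Teams attribute contains 'Liverpool'":
    liverpool_players = [
        key for key, attrs in soccer_players.items()
        if "Liverpool" in attrs.get("Teams", set())
    ]
    print(liverpool_players)   # ['ab12ocs12v9']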
18. Technology Overview : SimpleDB
What does the API look like?
Manage Domains
CreateDomain
DeleteDomain
ListDomains
DomainMetaData
Access Data
Retrieving Data
GetAttributes – returns a single item
Select – returns multiple items using SQL syntax
Writing Data
PutAttributes – put single item
BatchPutAttributes – put multiple items
Removing Data
DeleteAttributes – delete single item
BatchDeleteAttributes – delete multiple items
19. Technology Overview : SimpleDB
Options available on reads and writes
Consistent Read
Read the most recently committed write
May have lower throughput / higher latency / lower availability
Conditional Put/Delete
i.e. Optimistic Locking
Useful if you want to build a consistent multi-master data store – you will still require your own consistency checking
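A conditional put is essentially compare-and-set on one attribute: the write is applied only if that attribute still holds the value the writer last read. The sketch below is a hypothetical in-memory helper illustrating the optimistic-locking pattern, not the real SimpleDB call.

    # Conceptual conditional put: apply the update only if the item's "version"
    # attribute still matches what the writer last read (optimistic locking).
    class ConditionalCheckFailed(Exception):
        pass

    def conditional_put(domain, item_name, new_attrs, expected_name, expected_value):
        item = domain.setdefault(item_name, {})
        if item.get(expected_name) != expected_value:
            raise ConditionalCheckFailed(
                f"{expected_name} is {item.get(expected_name)!r}, expected {expected_value!r}")
        item.update(new_attrs)

    domain = {"user-42": {"plan": "2-discs", "version": "7"}}

    # Read, modify, and write back only if nobody else bumped the version in between.
    conditional_put(domain, "user-42",
                    {"plan": "streaming-only", "version": "8"},
                    expected_name="version", expected_value="7")
    print(domain["user-42"]["plan"])   # streaming-only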
21. Major Issues with SimpleDB
Manual data set partitioning was needed to work around size and throughput limits
Read & write latency variance could be large
SimpleDB multi-tenancy created both throughput and latency issues
Bad cost model – poor performance cost us more money
Not global – new requirement: global expansion
No (external) back-up and recovery available – new requirement: refresh Test DBs from Prod DBs
22.
23. Pick Another Data Store in the Cloud
An ideal storage solution should have the following features:
• New Requirements
• Support Global Use-cases
• Support Back-up and Recovery
• Retained Requirements
• Be highly scalable and available and have acceptable latencies
• Obsolete Requirements
• Be hosted in AWS
• Support SQL
24. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Support Global Use-cases
• Easy to add new nodes to the cluster, even if the new nodes are on a different continent
• Be easy to own and operate
• Dynamo’s masterless design avoids SPOF
• Dynamo’s repair mechanisms (i.e. read repair, anti-entropy repair, and hinted handoff) promote self-management
25. Pick Another Data Store in the Cloud
We picked Cassandra (i.e. Dynamo + BigTable)
• Be highly scalable and available and have acceptable latencies
• Known to scale for writes
• Netflix achieved >1 million writes/sec
• Netflix leverages caches for reads
• Bonus
• Data model identical to SimpleDB – i.e. one less thing for developers to learn
27. Technology Overview : Cassandra
Features
• Consistent Hashing
• Tunable Consistency at the Request-level (like SimpleDB)
• Automatic Healing
• Read Repair
• Anti-entropy Repair
• Hinted Handoff
• Failure Detection
• Clusters are configurable & upgradeable without restart
• Infinite Incremental Write Scalability
28. Consistent Hashing
How does Consistent Hashing work in Cassandra?
• Take a number line from [0–159]
• Wrap it on itself, so you now have a number ring
29. Consistent Hashing
• Given a key k, map the key to the ring using:
• hash_func(k) % 160 = token = position on the ring
30. Consistent Hashing
• You can then manually map machines to the ring
• 8 machines are mapped to the ring here
• N1 → 0
• N2 → 20
• N3 → 40
31. Consistent Hashing
Now map key ranges to machines:
• Host N2 owns all tokens in the range (0, 20]
• In other words, “foo” is mapped to bucket 9 but assigned to server N2
• Host N5 owns all tokens in the range (60, 80]
• “bar” is mapped to bucket 79 and assigned to server N5
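The mapping above can be sketched in a few lines of Python. This uses the toy 160-position ring from the slides with 8 nodes at tokens 0, 20, 40, and so on; the MD5-based hash is a stand-in for Cassandra's partitioner, and a node at token T owns the range (previous token, T].

    import hashlib

    RING_SIZE = 160
    # 8 nodes manually assigned to evenly spaced tokens, as in the slides.
    NODES = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}

    def token_for(key: str) -> int:
        # Stand-in hash; the real partitioner hashes into a much larger token space.
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

    def owner_of(key: str) -> str:
        t = token_for(key)
        # A node at token T owns (previous_token, T], so walk clockwise to the
        # first node token >= t, wrapping around to the smallest token.
        for node_token in sorted(NODES):
            if t <= node_token:
                return NODES[node_token]
        return NODES[min(NODES)]          # wrapped past the largest token

    for k in ("foo", "bar", "baz"):
        print(k, "-> token", token_for(k), "-> node", owner_of(k))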
32. Consistent Hashing : Dead Node?
What happens if a node dies or becomes unresponsive?
With a replication factor of, say, 3, data is always written to 3 places
• Writes to tokens in N2’s primary token range are also written to N3 and N4
• Writes to tokens in N5’s primary token range are also written to N6 and N7
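Replication follows directly from the ring: the replicas for a token are the owning node plus the next RF-1 distinct nodes walking clockwise. A minimal, self-contained sketch of that walk (simple placement, ignoring racks and data centers):

    RING_SIZE = 160
    NODES = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}

    def replicas_for(key_token: int, rf: int = 3):
        # The primary replica owns the token; the remaining rf-1 replicas are the
        # next nodes clockwise around the ring.
        tokens = sorted(NODES)
        idx = next((i for i, t in enumerate(tokens) if key_token <= t), 0)
        return [NODES[tokens[(idx + i) % len(tokens)]] for i in range(rf)]

    print(replicas_for(9))    # ['N2', 'N3', 'N4'] -- matches the slide
    print(replicas_for(79))   # ['N5', 'N6', 'N7']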
34. The Write Path
• Cassandra clients are not required to know the token-ring mapping
• A client can send a request to any node in the cluster
• The receiving node is called the “coordinator” for that request
• The coordinator will always execute <Replication Factor> number of writes (e.g. 3)
35. The Write Path
• Coordinators take care of:
• Key routing
• Executing the consistency level of the request
• e.g. “CL=1” means that the coordinator will wait till 1 node ACKs the write before sending a response to the client
• e.g. “CL=Quorum” means that the coordinator will wait till 2 of 3 nodes ACK the write
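The coordinator's job can be pictured as counting acknowledgements: send the mutation to every replica and answer the client once the requested consistency level is met. The sketch below is a simplified, synchronous illustration (a real coordinator sends the writes concurrently); send_write is a hypothetical stand-in for the inter-node RPC.

    class UnavailableError(Exception):
        pass

    def coordinate_write(replicas, key, value, consistency_level, send_write):
        # Send the mutation to every replica; succeed once `consistency_level`
        # replicas have acknowledged (e.g. 1 for CL=1, 2 of 3 for CL=Quorum).
        acks = 0
        for replica in replicas:
            try:
                send_write(replica, key, value)   # stand-in for the inter-node write RPC
                acks += 1
            except TimeoutError:
                pass   # a dead or slow replica does not fail the request by itself
        if acks < consistency_level:
            raise UnavailableError(f"only {acks} ack(s), needed {consistency_level}")
        return "ack to client"

    # Example: N3 is down, but CL=Quorum (2 of 3) still succeeds.
    def send_write(replica, key, value):
        if replica == "N3":
            raise TimeoutError(replica)

    print(coordinate_write(["N2", "N3", "N4"], "foo", "v1", consistency_level=2, send_write=send_write))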
36. The Write Path
• If node N3 is presumed dead by the failure detection algorithm, or if N3 is too busy to respond
• N5 will log the write to N3 and deliver it as soon as N3 comes back
• This is called Hinted Handoff
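Hinted handoff can be modeled as a per-target queue of mutations the coordinator could not deliver, replayed when the target is reachable again. A rough in-memory sketch (real hints are persisted locally on the coordinator):

    from collections import defaultdict

    hints = defaultdict(list)          # target node -> mutations it missed while down

    def write_or_hint(target, key, value, node_is_up, send_write):
        if node_is_up(target):
            send_write(target, key, value)
        else:
            hints[target].append((key, value))   # store a hint instead of dropping the write

    def replay_hints(target, send_write):
        # Called when the failure detector notices the target is back.
        while hints[target]:
            key, value = hints[target].pop(0)
            send_write(target, key, value)

    # Example: N3 misses a write while down, then receives it on recovery.
    delivered = []
    write_or_hint("N3", "foo", "v1", node_is_up=lambda n: False,
                  send_write=lambda n, k, v: delivered.append((n, k, v)))
    replay_hints("N3", send_write=lambda n, k, v: delivered.append((n, k, v)))
    print(hints["N3"], delivered)      # [] [('N3', 'foo', 'v1')]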
37. The Write Path
• Now that N5 is holding uncommitted writes, what happens if N5 dies before it can replay the failed write?
• How will the logged write ever make it to N3?
• Another repair mechanism helps in this case: Read Repair (stay tuned)
39. The Read Path
• Again, N5 acts as a proxy, this time for a get request
• Assume that the consistency_level on the request is “Quorum”; N5 will send a full read request to N2 and digest requests to N3 & N4
• This is a network optimization
40. The Read Path : Read Repair
• N5 will compare the responses as soon as any 2 requests return
• At least one response will be a digest (e.g. say N2 and N3 respond first)
• If the digest is more recent than the full read, N5 doesn’t have the latest data
• N5 then asks the N3 replica for a full read
41. The Read Path : Read Repair
• Once N5 receives the response of the full read from N3, N5 returns this data to the client
• But wait! What do we do about the stale data on N2?
• N5 then schedules an async update for the stale node(s)
• This is called read repair
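Putting slides 39–41 together, a quorum read takes one full response, compares digests from the other replicas, returns the newest version to the client, and pushes that version back to any stale replica. A simplified, synchronous sketch, using an explicit timestamp as a stand-in for Cassandra's per-column write timestamps:

    def digest(row):
        return hash((row["value"], row["ts"]))        # cheap stand-in for a real hash digest

    def quorum_read(key, replicas, read_full, read_digest, repair):
        # Full read from the first replica, digest reads from the rest (a network optimization).
        full_node, *digest_nodes = replicas
        responses = {full_node: read_full(full_node, key)}
        newest = responses[full_node]

        for node in digest_nodes:
            if read_digest(node, key) != digest(newest):
                row = read_full(node, key)            # mismatch: ask this replica for the full row
                responses[node] = row
                if row["ts"] > newest["ts"]:
                    newest = row

        for node in replicas:                         # read repair: push the newest version
            if digest(responses.get(node, newest)) != digest(newest):
                repair(node, key, newest)             # async in the real system

        return newest["value"]

    # Tiny in-memory replicas; N2 holds a stale copy of "foo".
    store = {"N2": {"foo": {"value": "old", "ts": 1}},
             "N3": {"foo": {"value": "new", "ts": 2}},
             "N4": {"foo": {"value": "new", "ts": 2}}}
    read_full   = lambda node, key: store[node][key]
    read_digest = lambda node, key: digest(store[node][key])
    def repair(node, key, row): store[node][key] = row

    print(quorum_read("foo", ["N2", "N3", "N4"], read_full, read_digest, repair))  # new
    print(store["N2"]["foo"]["value"])                                             # new (repaired)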
43. Consistent Hashing : Growing the Ring
• To double the capacity of the ring and keep the data distributed evenly, we add nodes (N9, N10, etc.) in the interstitial positions (10, 30, etc.)
• Cassandra bootstraps new nodes via data streaming
• For large data sets, Netflix recovers from backup and runs AER (anti-entropy repair)
44. Consistent Hashing : Growing the Ring
• After the new nodes have been added, they take over half of the primary token ranges of each old node
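The hand-over of ranges when the ring doubles can be seen in a small sketch: adding interstitial tokens halves each old node's primary range. (The token and node names below are illustrative, reusing the toy ring from earlier.)

    def ranges(tokens_to_nodes):
        # Map each node to the (exclusive, inclusive] token range it primarily owns.
        tokens = sorted(tokens_to_nodes)
        return {tokens_to_nodes[t]: (tokens[i - 1], t) for i, t in enumerate(tokens)}

    old = {0: "N1", 20: "N2", 40: "N3", 60: "N4", 80: "N5", 100: "N6", 120: "N7", 140: "N8"}
    print(ranges(old)["N2"])          # (0, 20)  -- N2 owns (0, 20]

    # Add interstitial nodes at 10, 30, 50, ... : each old node keeps only half its range.
    new = dict(old)
    new.update({t + 10: f"N{9 + i}" for i, t in enumerate(sorted(old))})
    print(ranges(new)["N9"])          # (0, 10)  -- new node N9 takes the first half
    print(ranges(new)["N2"])          # (10, 20) -- N2 keeps the second half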
45. A Global Ring
• One of the benefits of Cassandra is that it can support deployment of a global ring
46. A Global Ring
• In January 2012, Netflix rolled out its streaming service to the UK and Ireland
• It did this by running a few Cassandra clusters spanning Virginia (US-East) and the UK
47. A Global Ring
• Growing the ring into the UK from its origin in Virginia was easy
• Double the ring as mentioned before
• The new interstitial nodes are in the UK
• Increase the global replication factor from 3 → 6
• Now each key is covered by 6 nodes: 3 in Virginia and 3 in the UK
49. Single Node Design
A single node is implemented as per the LSM tree (i.e. log-structured merge tree) pattern:
• Writes first append to the commit log and then write to the memtable in memory
• This model gives fast writes
When the memtable is full, it is flushed to disk to give an SSTable
The SSTables are immutable
50. Single Node Design
In a pure KV LSM store (e.g. LevelDB), when a read occurs, the memtable is first consulted
If the key is found, the data is returned from the memtable
This is a fast read
If the data is not in the memtable, then it is read from the latest SSTable
This is a slower read
51. Single Node Design
Since Cassandra supports multiple value columns, only a subset of which may be written at any time, a read must consult the memtable and potentially multiple SSTables
Hence, reads can be slower than they would be for pure LSM KV stores
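A toy version of the single-node write and read path described above: writes append to a commit log and a memtable; a full memtable is flushed as an immutable SSTable; reads check the memtable first and then SSTables from newest to oldest. (Per-column merging, bloom filters, and real on-disk formats are omitted.)

    class TinyLSM:
        def __init__(self, memtable_limit=3):
            self.commit_log = []        # append-only durability log
            self.memtable = {}          # in-memory, mutable
            self.sstables = []          # immutable dicts, newest last
            self.memtable_limit = memtable_limit

        def put(self, key, value):
            self.commit_log.append((key, value))   # 1. append to the commit log
            self.memtable[key] = value             # 2. write to the memtable (fast)
            if len(self.memtable) >= self.memtable_limit:
                self.flush()

        def flush(self):
            # A full memtable is written out as an immutable SSTable.
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

        def get(self, key):
            if key in self.memtable:                  # fast path
                return self.memtable[key]
            for sstable in reversed(self.sstables):   # newest SSTable first (slower path)
                if key in sstable:
                    return sstable[key]
            return None

    db = TinyLSM()
    for i in range(7):
        db.put(f"k{i}", f"v{i}")
    print(db.get("k0"), db.get("k6"), len(db.sstables))   # v0 v6 2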
52. Single Node Design
To reduce the number of SSTables that need to be read, periodic compaction needs to be run
This puts an additional load on GC and I/O
The end result of compaction is fewer files with, hopefully, fewer total rows
To mitigate the problems inherent in compaction, Cassandra 1.x supports Leveled Compaction
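Compaction merges several SSTables into one, keeping only the newest version of each row, so later reads touch fewer files. A minimal sketch, assuming each SSTable maps key -> (timestamp, value):

    def compact(sstables):
        # Merge many SSTables into one: for each key, keep only the newest (timestamp, value).
        merged = {}
        for sstable in sstables:                 # oldest to newest
            for key, (ts, value) in sstable.items():
                if key not in merged or ts > merged[key][0]:
                    merged[key] = (ts, value)
        # Tombstones (deletions) can be dropped here once old enough; omitted for brevity.
        return merged

    old_sstables = [
        {"foo": (1, "a"), "bar": (1, "x")},
        {"foo": (2, "b")},
        {"baz": (3, "z"), "bar": (2, "y")},
    ]
    print(compact(old_sstables))   # {'foo': (2, 'b'), 'bar': (2, 'y'), 'baz': (3, 'z')}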
54. Current Status
Timeline of Migration
• Jan 2009 – Start investigating the AWS cloud & SimpleDB
• Feb 2010 – Deliver the app to production
• Dec 2010 – ~95% of Netflix traffic has been moved into both the AWS cloud and SimpleDB (+10 months)
• Apr 2011 – Start migration to Cassandra
• Jan 2012 – Netflix EU launched on Cassandra (+9 months)
Editor's Notes
Existing functionality needs to move in phases. This limits the risk and exposure to bugs, and limits conflicts with new product launches.