#MongoDBDays Chicago

Introduction To Sharding
J. Randall Hunt

Hackathoner, MongoDB

@jrhunt, randall@mongodb.com
In Today's Talk

•

What? Why? When?

•

How?

•

What's happening behind the scenes?
What Is Sharding?
This is a picture of my cat.
This is a picture of ~100 cats.

http://a1.s6img.com/cdn/0011/p/3123272_8220815_lz.jpg
This is a cat trying to find a home

webserver

mongod
100 cats trying to find a home.

webserver

(not to scale)

mongod
Scale Up?
Data Store Scalability

•

Custom Hardware

•

Custom Software

In the past you've had two options for achieving data store scalability:
1) custom hardware (Oracle?)
2) custom software (Google, Facebook)

These things were custom because the problems were not yet common enough. The number of people on the internet 10 years ago was
incredibly small compared to the number of people who will be using web services 10 years from now.
Scale Out?
The MongoDB Sharding Solution
•

Automatically partition your data

•

Worry about failover at the partition layer

•

Application independent

•

Free and open source
Why Do I Shard?
Input/Output

Your input/output exceeds the capacity of a single node or replica set.

This is not easy to do!
Working Set Exceeds Physical Memory

[Diagram, built up over several slides: data, indexes, sorts, and aggregations all compete for space in RAM.]
How Does Sharding Work?
MongoDB's Sharding Infrastructure

[Diagram, built up over several slides: app server -> mongos -> shard (one or more mongod), with mongod --configsvr alongside.]
Terminology
•

Shards

•

Chunks

•

Config Servers

•

mongos

A shard is a server, or a collection of servers, that holds a subset of a collection's data. That data is stored as chunks, which are split up according to a shard key.
A chunk is a group of documents whose shard key values fall in a particular range; it can be moved logically from server to server.
Config servers hold information about where chunks live.
mongos is the router and balancer: it communicates with the config servers and figures out how to intelligently direct your query.
What exactly is a shard?
•

Shard is a node of the cluster

•

Can be a single mongod or an entire replica set

[Diagram: a shard is either a single mongod, or a replica set with a primary and two secondaries.]

Now what do shards hold? Chunks, which are partitions of your data that live in certain ranges.
Partitioning
•

User defines a shard key or uses hash based sharding

•

Shard key defines a range of data

•

The key space is like points on a line

•

A range is a segment of that line

[Diagram: the key space drawn as a line from -∞ to +∞.]

Remember interval notation?
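To make the "points on a line" picture concrete, here is a minimal sketch of how a key could be matched to a chunk. The chunk table, shard names, and the findChunk helper are all hypothetical; the only idea taken from the slide is that each chunk owns a half-open interval [min, max) of the key space.

```javascript
// Hypothetical chunk table: each chunk covers the half-open range [min, max).
// -Infinity and Infinity stand in for the two ends of the key space.
const chunks = [
  { min: -Infinity, max: 100, shard: "shard0000" },
  { min: 100, max: 200, shard: "shard0001" },
  { min: 200, max: Infinity, shard: "shard0000" },
];

// Return the chunk whose range contains the given shard key value.
function findChunk(chunkTable, key) {
  return chunkTable.find((c) => key >= c.min && key < c.max);
}

console.log(findChunk(chunks, 99).shard);   // shard0000
console.log(findChunk(chunks, 100).shard);  // shard0001 (lower bound is inclusive)
console.log(findChunk(chunks, 5000).shard); // shard0000
```

Because the intervals are half-open and cover the whole line, every key belongs to exactly one chunk.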
Data Distribution
Initially a single chunk
Default max chunk size: 64 MB
MongoDB will automatically split and migrate chunks as they reach the max size

[Diagram: mongos routers, a config server, and two shards.]
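A rough sketch of the split step, under simplifying assumptions: the chunk shape, the maybeSplit helper, and the "split at the median key" rule are all made up for illustration, and the real split logic is more involved. Only the 64 MB default and the "start with a single chunk" idea come from the slide.

```javascript
const MAX_CHUNK_BYTES = 64 * 1024 * 1024; // the 64 MB default from the slide

// A hypothetical chunk: a key range plus its documents, sorted by distinct keys.
function chunkSize(chunk) {
  return chunk.docs.reduce((sum, d) => sum + d.bytes, 0);
}

// Split an oversized chunk at its median key; leave small chunks alone.
function maybeSplit(chunk, maxBytes = MAX_CHUNK_BYTES) {
  if (chunkSize(chunk) <= maxBytes || chunk.docs.length < 2) return [chunk];
  const mid = Math.floor(chunk.docs.length / 2);
  const splitKey = chunk.docs[mid].key;
  return [
    { min: chunk.min, max: splitKey, docs: chunk.docs.slice(0, mid) },
    { min: splitKey, max: chunk.max, docs: chunk.docs.slice(mid) },
  ];
}

// Initially a single chunk covering the whole key space.
const initial = {
  min: -Infinity,
  max: Infinity,
  docs: [10, 20, 30, 40].map((key) => ({ key, bytes: 20 * 1024 * 1024 })),
};

// 80 MB is over the limit, so the chunk splits at key 30.
const parts = maybeSplit(initial);
console.log(parts.map((p) => [p.min, p.max, p.docs.length]));
```

After a split, the two halves are ordinary chunks, so either one can be migrated to another shard.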
Shards and Shard Keys
Chunks!

Shard Keys!
What is a config server?
•

A config server is for storing shard meta-data

•

It stores chunk ranges and locations

•

Run with 3 in production!
[Diagram: config server deployment options.]

This is not a replica set; the three servers are purely for failover purposes.

Pro tip: use CNAMEs to identify these.
What is a mongos?
•

Acts as a router / balancer for queries and ops

•

No local data (persists all info to the config servers)

•

Can run with just one or many
[Diagram: app servers connecting to one mongos or to many.]
MongoDB's Sharding Infrastructure

[Diagram: three app servers, three mongos routers, three shards, and three config servers.]
Get Started With Sharding?
1. Choose a shard key (we'll talk about this later)
2. Start config servers
3. Turn on sharding
4. Profit.
Mechanics of Sharding
Oh hey there devops!
Start the Configuration Server

mongod --configsvr
Starts a configuration server on the default port (27019)
Start the mongos router

mongos --configdb catconf.mongodb.com:27019
Start the mongod

mongod --shardsvr
Starts a mongod with the default shard port (27018)
Shard is not yet connected to the rest of the cluster
Could have already been a part of the cluster
Add the Shard

On mongos:
sh.addShard('cat1.mongodb.com:27018')
For a replica set:
sh.addShard('<rsname>/<seedlist>')
Check that everything is working!

[mongos] admin> db.runCommand({ listShards: 1 })
{
  "shards": [
    {
      "_id": "shard0000",
      "host": "cat1.mongodb.com:27018"
    }
  ],
  "ok": 1
}
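If you want to script this check rather than eyeball it, something like the following could work. It is plain JavaScript operating on a copy of the response document shown above; shardHosts is a hypothetical helper, not part of the shell.

```javascript
// A copy of the response document shown above.
const response = {
  shards: [{ _id: "shard0000", host: "cat1.mongodb.com:27018" }],
  ok: 1,
};

// Hypothetical helper: fail loudly on an error response, otherwise list the hosts.
function shardHosts(res) {
  if (res.ok !== 1) throw new Error("shard listing failed");
  return res.shards.map((s) => s.host);
}

console.log(shardHosts(response)); // [ 'cat1.mongodb.com:27018' ]
```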
Now enable sharding
•

Enable Sharding on a database

sh.enableSharding("<dbname>")

•

Shard a collection (with a key):

sh.shardCollection("<dbname>.cats", {"name": 1})

•

Use a compound shard key to prevent duplicates

sh.shardCollection("<dbname>.cats", {"name": 1, "uniqueid": 1})
Tag Aware Sharding
•

Total control over the distribution of your data!

•

Tag a range of shard keys:

sh.addTagRange(<collection>,<min>,<max>,<tag>)

•

Tag a shard:

sh.addShardTag("shard0000","NYC")
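Here is a small model of what the two tagging calls set up, written as plain JavaScript rather than shell commands. The zip field, the second shard, the "SFO" tag, and the eligibleShards helper are assumptions made for this sketch; only "shard0000" and "NYC" come from the slide.

```javascript
// Hypothetical tag metadata, mirroring what sh.addShardTag and sh.addTagRange record.
const shardTags = {
  shard0000: ["NYC"],
  shard0001: ["SFO"],
};
const tagRanges = [
  { min: { zip: "10001" }, max: { zip: "10281" }, tag: "NYC" },
  { min: { zip: "94102" }, max: { zip: "94135" }, tag: "SFO" },
];

// Shards allowed to hold a document with this zip: those carrying the tag of the
// range [min, max) that covers it, or every shard if no tagged range covers it.
function eligibleShards(zip) {
  const range = tagRanges.find((r) => zip >= r.min.zip && zip < r.max.zip);
  const all = Object.keys(shardTags);
  if (!range) return all;
  return all.filter((s) => shardTags[s].includes(range.tag));
}

console.log(eligibleShards("10011")); // [ 'shard0000' ]
console.log(eligibleShards("94110")); // [ 'shard0001' ]
console.log(eligibleShards("60601")); // [ 'shard0000', 'shard0001' ]
```

The point of the design is that tags constrain where the balancer may place a range; untagged data stays free to live anywhere.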

The Balancer

•

Ensures even distribution of chunks across the cluster

•

Transparent to driver and application

•

Very tuneable but defaults are often sensible

Try to minimize clock skew with ntpd.
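A toy version of "even distribution of chunks", to show the idea and nothing more. The balance function and its threshold parameter are hypothetical; the real balancer uses its own thresholds and moves actual data, not counters.

```javascript
// Move one chunk at a time from the fullest shard to the emptiest shard until the
// gap between them drops below `threshold` (a hypothetical tuning knob).
function balance(chunkCounts, threshold = 2) {
  // A gap of 1 can never be improved by a move, so smaller thresholds would loop forever.
  if (threshold < 2) throw new RangeError("threshold must be at least 2");
  const counts = { ...chunkCounts };
  const migrations = [];
  for (;;) {
    const shards = Object.keys(counts).sort((a, b) => counts[a] - counts[b]);
    const least = shards[0];
    const most = shards[shards.length - 1];
    if (counts[most] - counts[least] < threshold) break;
    counts[most] -= 1;
    counts[least] += 1;
    migrations.push({ from: most, to: least });
  }
  return { counts, migrations };
}

// Nine chunks, badly skewed: ends at 3 chunks per shard after 5 migrations.
const result = balance({ shard0000: 8, shard0001: 0, shard0002: 1 });
console.log(result.counts);
console.log(result.migrations.length);
```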
Routing Requests
(Oh hi there application developers!)
Cluster Request Routing

Scatter Gather

Targeted

Choose your own adventure!
Targeted Query

[Diagram, built up over four slides: one mongos in front of three shards.]

1. Routable request received
2. Request routed to appropriate shard
3. Shard returns results
4. mongos returns results to client
Non-targeted queries

[Diagram, built up over four slides: one mongos in front of three shards.]

1. Request received
2. Farm request out to all shards
3. Shards return results to mongos
4. mongos returns results to client
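The difference between the two paths comes down to whether the query contains the shard key. A minimal sketch, assuming a collection sharded on name with three made-up chunks; route and chunkTable are hypothetical names, and null stands in for the open ends of the key space.

```javascript
// Hypothetical cluster metadata: collection sharded on "name", three chunks.
const SHARD_KEY = "name";
const chunkTable = [
  { min: null, max: "h", shard: "shard0000" },
  { min: "h", max: "q", shard: "shard0001" },
  { min: "q", max: null, shard: "shard0002" },
];

// Decide which shards a query must visit, the way a router would.
function route(query) {
  const allShards = [...new Set(chunkTable.map((c) => c.shard))];
  const key = query[SHARD_KEY];
  if (typeof key !== "string") {
    // No shard key in the query: scatter-gather to every shard.
    return { type: "scatter-gather", shards: allShards };
  }
  // Shard key present: exactly one chunk range [min, max) contains it.
  const chunk = chunkTable.find(
    (c) => (c.min === null || key >= c.min) && (c.max === null || key < c.max)
  );
  return { type: "targeted", shards: [chunk.shard] };
}

console.log(route({ name: "mittens" })); // targeted: shard0001 only
console.log(route({ color: "black" }));  // scatter-gather: all three shards
```

This is why "try to choose a field used in queries" matters: every query that omits the shard key costs you a round trip to every shard.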
Choosing A Shard Key
Things to remember!
•

Shard key is immutable

•

Shard key values are immutable

•

Shard key must be indexed

•

It is limited to 512 bytes in size

•

Try to choose a field used in queries

•

Only the shard key can be guaranteed unique across shards

Should not be monotonically increasing!
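Why avoid a monotonically increasing key? A quick simulation, with made-up chunk boundaries and a stand-in hash function (this is not the hash MongoDB uses for hash based sharding), shows every new write landing on the same shard under range partitioning.

```javascript
const SHARDS = 3;

// Range-based placement with hypothetical chunk boundaries at 1000 and 2000:
// every key >= 2000 falls into the last chunk, which lives on shard 2.
function rangeShard(key) {
  if (key < 1000) return 0;
  if (key < 2000) return 1;
  return 2;
}

// Stand-in hash (Knuth multiplicative), NOT the function MongoDB uses for hashed keys.
function hashedShard(key) {
  return (Math.imul(key, 2654435761) >>> 0) % SHARDS;
}

// Count where 1000 new, ever-increasing keys (like timestamps or counters) land.
function distribute(placeFn) {
  const counts = new Array(SHARDS).fill(0);
  for (let key = 5000; key < 6000; key++) counts[placeFn(key)] += 1;
  return counts;
}

console.log(distribute(rangeShard));  // [ 0, 0, 1000 ]: one hot shard takes every write
console.log(distribute(hashedShard)); // writes spread over all three shards
```

Hashing fixes write distribution but gives up range queries on the key, so it is a trade-off rather than a free win.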
How to choose your key?
•

Cardinality

•

Write Distribution

•

Query Isolation

•

Reliability

•

Index Locality

Cardinality: can your data be broken down enough?
Query isolation: targeting queries to a specific shard
Reliability: shard outages

A good shard key can:
Optimize routing
Minimize (unnecessary) traffic
Allow best scaling

Consider pre-splitting.
No unique indexes unless the index is prefixed by the shard key.

Geo keys cannot be part of a shard key.
$near won't work, but the $geo commands work fine.
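One way to sanity-check cardinality before committing to a key is to count distinct key values in a sample of your data. The sample documents and the cardinality helper below are hypothetical; the name and uniqueid fields are the ones from the shardCollection examples earlier.

```javascript
// Hypothetical sample of documents from the cats collection.
const cats = [
  { name: "Felix", uniqueid: 1 },
  { name: "Felix", uniqueid: 2 },
  { name: "Tom", uniqueid: 3 },
  { name: "Felix", uniqueid: 4 },
];

// Number of distinct shard key values: an upper bound on how many chunks
// the data can ever be split into, since one key value cannot span chunks.
function cardinality(docs, keyFields) {
  const seen = new Set(docs.map((d) => JSON.stringify(keyFields.map((f) => d[f]))));
  return seen.size;
}

console.log(cardinality(cats, ["name"]));             // 2
console.log(cardinality(cats, ["name", "uniqueid"])); // 4
```

With name alone, all the Felix documents are stuck in one chunk no matter how many there are; the compound key lets them be broken down further.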
Thanks!
•

What's Next?

•

Resources:

https://education.mongodb.com/

https://www.mongodb.com/presentations

•

Me:

@jrhunt, randall@mongodb.com

In summary (and this is not a sales pitch): lots of other databases out there have sharding and replication, but not many of them provide the granularity of
control that you need for your applications while maintaining sensible defaults.

Sharding in MongoDB Days 2013