Agility and Scalability with MongoDB

MongoDB
Scalability and Agility
Chris.Biow@MongoDB.com

Data Challenge
“I want my data...”
• Now
• Secure
• All varieties
• Fast and interactive
• Scalable to “Big”
• Agile to develop and deploy operationally
• Cloud and edge

Scalability with MongoDB
Metric: meaning, with an example
• Operations per Second: concurrent reads and writes per second
– Example: > 1 million per second
• Nodes per Cluster: horizontal scale-out, distributed to multiple data centers worldwide, with high availability, using inexpensive cloud resources
– Example: > 1,000 nodes
• Records / Documents: data objects in any number of schemas or structures
– Example: > 10 billion
• Data Volume: total amount of data (documents × size)
– Example: > 1 petabyte = 10^15 = 1,000,000,000,000,000 ≈ 2^50 bytes
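The "≈" between 10^15 and 2^50 hides a gap of roughly 12.6%. A minimal Python sketch of the arithmetic follows; the constant names are illustrative and do not come from the slides.

```python
# Compare the decimal petabyte (10^15 bytes) with 2^50 bytes.
PETABYTE_DECIMAL = 10**15
PETABYTE_BINARY = 2**50

ratio = PETABYTE_BINARY / PETABYTE_DECIMAL

print(f"{PETABYTE_DECIMAL:,}")            # decimal petabyte, in bytes
print(f"{PETABYTE_BINARY:,}")             # 2^50, in bytes
print(f"binary / decimal = {ratio:.4f}")  # how far apart the two readings are
```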

Operational Database Landscape

Document Data Model
Relational (figure not reproduced) vs. MongoDB document:
{
  first_name: 'Paul',
  surname: 'Miller',
  city: 'London',
  location: [45.123, 47.232],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}

Documents are Rich Data Structures
{
  first_name: 'Paul',
  surname: 'Miller',
  cell: '+447557505611',
  city: 'London',
  location: [45.123, 47.232],
  Profession: ['banking', 'finance', 'trader'],
  cars: [
    { model: 'Bentley',
      year: 1973,
      value: 100000, … },
    { model: 'Rolls Royce',
      year: 1965,
      value: 330000, … }
  ]
}
Callouts on the slide:
• Fields
• Typed field values
• Fields can contain arrays
• Fields can contain an array of sub-documents
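As a rough illustration of why a document maps naturally onto application data structures, the sketch below holds the same record in a plain Python dict. This is an assumption-laden stand-in, not driver code: fields elided with "…" on the slide are left out, and the Profession values are treated as strings.

```python
# A plain Python dict standing in for the document on the slide.
# Fields elided with "…" in the original are left out, not guessed.
person = {
    "first_name": "Paul",
    "surname": "Miller",
    "cell": "+447557505611",
    "city": "London",
    "location": [45.123, 47.232],
    "Profession": ["banking", "finance", "trader"],
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Arrays and sub-documents are reached directly, with no join step.
total_car_value = sum(car["value"] for car in person["cars"])
oldest_car = min(person["cars"], key=lambda car: car["year"])
print(total_car_value, oldest_car["model"])
```

The nested `cars` array plays the role that a separate table plus a join would play in a relational schema.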

Document Model Benefits
• Agility and flexibility
– Data model supports business change
– Rapidly iterate to meet new requirements
• Intuitive, natural data representation
– Eliminates ORM layer
– Developers are more productive
• Reduces the need for joins, disk seeks
– Programming is simpler
– Performance delivered at scale

Big Data Tech Interest Comparison
j.mp/Ssvpev

Enterprise Adoption Comparison
bit.ly/1vAI7rF

Architecture for Availability & Scalability

Replica Sets
• Replica Set – two or more copies
• Availability solution
– High Availability
– Disaster Recovery
– Maintenance
• Deployment Flexibility
– Data locality to users
– Workload isolation: operational & analytics
• Self-healing shard
(Diagram: Application → Driver → Primary, with replication from the Primary to two Secondaries.)

Global Data Distribution
(Figure: one Primary replicating in real time to seven Secondaries in different locations.)

Automatic Sharding
• Sharding types
– Range
– Hash
– Tag-aware
• Elastic increase or decrease in capacity
• Automatic balancing

Query Routing
• Multiple query optimization models
• Each sharding option appropriate for different apps

Drag Strip:
straight ahead, quarter-mile, stop

Road Race:
stay fast, stay agile, continuous
Nürburgring, Germany

CARFAX
• Large data set

Baseline:
• Vehicle History Database
• 11 billion records (growing at 1 billion per year)
• 30-year-old VMS-based RDBMS
• Cumbersome
• Costly
MongoDB Comparison (in-depth NoSQL evaluation):
• Performance: 4x faster than baseline, 10x key-value
• Scale out using inexpensive commodity servers
• Built-in redundancy
• Flexible dynamic schema data model
• Strong consistency
• Analytics/aggregation
Initial Production:
• MongoDB is primary data store
• 50 servers
• 10 shards
• 5-node replica sets per shard

CARFAX Sharding and Replication
• 13 billion+ documents
– 1.5 billion documents added every year
• 1 vehicle history report is > 200 documents
• 12 Shards
• 9-node replica sets
• Replicas distributed across 3 data centers

Foursquare
• 50M users.
• 6B check-ins to date (6M per day growth).
• 55M points of interest / venues.
• 1.7M merchants using the platform for marketing
• Operations Per Second: 300,000
• Documents: 5.5B (~16.5B with replication).*

Foursquare clusters
• 11 MongoDB clusters
– 8 are sharded
• Largest cluster for check-ins
• 15 shards (check-ins)
• Shard key: user_id

Facebook / parse.com mobile apps
• Persistent database for 270,000 mobile applications
• 200 M end-user mobile devices
• 250% annual growth in client apps
• 500% growth in requests
• 1.5 M collections
• Key differentiators:
– Document data model
– High performance and availability
– Geospatial query and index
• Charity Majors on operations: j.mp/X3jVRC
– Understand your database and your data, and build for them.

Scalability Exercises in the Cloud with Amazon Web Services

Petascale Database
• 27x hs1.8xlarge instances
– 16x vCPU
– 24x 2TB SATA drives, RAID0
– 8x mongod microshards
• Modified Yahoo Cloud Serving Benchmark (YCSB)
– Long Integer IDs (>2B)
– Zipfian-distributed integer fields
– Aggregation queries
• Load direct to 216 shards, 10 days, $4K
Resulting statistics (commas added):
"objects" : 7,170,648,489,
"avgObjSize" : 147,438.99952658816,
"dataSize" : NumberLong("1,057,240,224,818,640")
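The slide's figures can be cross-checked with simple arithmetic. The Python sketch below uses only numbers from the slide, with one stated assumption: "2TB" is read as 2 × 10^12 bytes. The fill fraction compares document data with raw drive capacity only and ignores indexes, journals, and filesystem overhead.

```python
# Figures copied from the slide; derived values are simple arithmetic.
INSTANCES = 27
MONGODS_PER_INSTANCE = 8
DRIVES_PER_INSTANCE = 24
DRIVE_BYTES = 2 * 10**12           # assumption: "2TB" means 2 x 10^12 bytes
DATA_SIZE = 1_057_240_224_818_640  # "dataSize" from the slide

shards = INSTANCES * MONGODS_PER_INSTANCE
raw_capacity = INSTANCES * DRIVES_PER_INSTANCE * DRIVE_BYTES
bytes_per_shard = DATA_SIZE / shards
fill_fraction = DATA_SIZE / raw_capacity

print(shards)                                         # microshards in total
print(f"{bytes_per_shard / 10**12:.2f} TB per shard")
print(f"{fill_fraction:.1%} of raw RAID0 capacity")
```

The shard count works out to the 216 shards quoted in the load bullet (27 instances × 8 mongod processes each).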

CGroup Memory Segregation
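for DB in `seq 0 3`; do
  # Create one memory cgroup per mongod instance, owned by the mongodb user
  sudo cgcreate \
    -a mongodb:mongodb \
    -t mongodb:mongodb \
    -g memory:mongodb$DB
  # Cap the group at 48G; tee is used because a plain ">" redirect does not run under sudo
  echo 48G | sudo tee \
    /sys/fs/cgroup/memory/mongodb$DB/memory.limit_in_bytes
  # Start mongod inside the cgroup, interleaving memory across NUMA nodes
  # (assumes each config file forks mongod so the loop can continue)
  cgexec \
    -g memory:mongodb$DB \
    numactl --interleave=all \
    mongod --config ~/mongod$DB.conf
done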

Megawrite Ingest
• Ingest 250-byte stock quotes at 2M/s
• Concurrently run 5 QPS, subsecond/indexed response on timeStamp, accountId, instrumentId, systemKey
• 5x r3.4xlarge
– 16x VCPU, 1x 320GB SSD, 122GB RAM, 16x mongod
– 2.1M insert/second direct to shards
• 16x c3.8xlarge
– 32x VCPU, 2x 320GB SSD, 60GB RAM, 16x mongod, 4x mongos
– 2.1M insert/second via mongos

Java API comparison
• 2 threads on c3.8xlarge
• 264 bsonsize object, _id index only
• coll.insert(): 15,600 inserts/sec
• coll.insert(List<DBObject>), listsize = 64: 118,000 inserts/sec
• Bulk ops API, size = 64: 120,000 inserts/sec
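import com.mongodb.BasicDBObject;
import com.mongodb.BulkWriteOperation;
import com.mongodb.BulkWriteResult;

// Fragment from inside a loader method; collection, items, stayAlive,
// listsize, m, and fillMap() are defined elsewhere in the class.
BulkWriteOperation bo = null;
int pending = 0; // inserts queued in the current batch
for (int a = 0; a < this.items && stayAlive; a++) {
    if (bo == null) {
        bo = collection.initializeUnorderedBulkOperation();
    }
    fillMap(this.m); // populate the reusable map with the next record
    bo.insert(new BasicDBObject(this.m));
    if (++pending == listsize) { // flush a full batch
        BulkWriteResult rc = bo.execute();
        bo = null;
        pending = 0;
    }
}
if (bo != null) { // flush the final partial batch
    bo.execute();
}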
7x Load with BulkOp

Shard Key characteristics
• A good shard key has:
– sufficient cardinality
– distributed writes
– targeted reads ("query isolation")
• Shard key should be in every query if possible
– scatter gather otherwise
• Choosing a good shard key is important!
– affects performance and scalability
– changing it later is expensive

Hashed shard key
• Pros:
– Evenly distributed writes
• Cons:
– Random data (and index) updates can be IO intensive
– Range-based queries turn into scatter gather
(Diagram: mongos spreading writes across Shard 1, Shard 2, Shard 3, … Shard N.)
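The trade-off can be sketched with a small simulation in Python. It is a simplified stand-in, not MongoDB's actual placement logic: a hash-modulo rule replaces hashed chunk ranges, and the shard count, key count, and function names are all hypothetical.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4      # hypothetical cluster size
NUM_KEYS = 10_000   # hypothetical number of documents

def hashed_shard(key: int) -> int:
    """Pick a shard from an MD5 digest of the key (simplified stand-in)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def range_shard(key: int) -> int:
    """Pick a shard from contiguous, equally sized key ranges."""
    return key * NUM_SHARDS // NUM_KEYS

# Writes: sequential keys spread almost evenly under hashing.
hashed_load = Counter(hashed_shard(k) for k in range(NUM_KEYS))

# Reads: a range query over 100 adjacent keys.
query = range(2_000, 2_100)
hashed_targets = {hashed_shard(k) for k in query}
range_targets = {range_shard(k) for k in query}

print(sorted(hashed_load.items()))
print(len(hashed_targets), len(range_targets))
```

Under the hashed rule the adjacent keys scatter across the shards, so the range query becomes scatter-gather; under the range rule the same query stays on one shard.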

Low cardinality shard key
• Induces "jumbo chunks"
• Example: a boolean field
(Diagram: mongos routing to a single chunk [ a, b ) on Shard 1.)
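A chunk can only be split between distinct shard key values, which is why a low-cardinality key leaves chunks that cannot shrink. The Python sketch below illustrates this with a document-count threshold; that threshold and the function name are assumptions made for illustration, since real chunk splitting is based on data size.

```python
from collections import Counter

MAX_DOCS_PER_CHUNK = 1_000   # hypothetical split threshold, in documents

def chunk_sizes(shard_key_values):
    """Greedily pack documents into chunks, splitting only between distinct key values."""
    counts = Counter(shard_key_values)
    chunks, current = [], 0
    for key in sorted(counts):
        # Start a new chunk only at a key boundary, never inside one key value.
        if current and current + counts[key] > MAX_DOCS_PER_CHUNK:
            chunks.append(current)
            current = 0
        current += counts[key]
    if current:
        chunks.append(current)
    return chunks

boolean_key = [i % 2 == 0 for i in range(100_000)]
user_id_key = list(range(100_000))

bool_chunks = chunk_sizes(boolean_key)
id_chunks = chunk_sizes(user_id_key)
print(len(bool_chunks), max(bool_chunks))
print(len(id_chunks), max(id_chunks))
```

With the boolean key there are only two possible chunks no matter how many documents arrive, so each one grows far past the threshold.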

Ascending shard key
• Monotonically increasing shard key values cause "hot spots" on inserts
• Examples: timestamps, _id
(Diagram: mongos routing inserts to the chunk [ ISODate(…), $maxKey ) on Shard 1.)
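The hot spot follows directly from range-based routing: every new key is larger than all existing split points, so it lands in the open-ended last chunk. A minimal Python sketch, with hypothetical chunk boundaries, shard names, and integer timestamps:

```python
import bisect
from collections import Counter

# Hypothetical chunk lower bounds on a timestamp shard key; the last chunk
# is open-ended, like [ ISODate(…), $maxKey ) on the slide.
chunk_lower_bounds = [0, 1_000, 2_000, 3_000]
chunk_to_shard = ["shard1", "shard2", "shard3", "shard1"]

def route(key: int) -> str:
    """Return the shard owning the chunk whose range contains key."""
    chunk = bisect.bisect_right(chunk_lower_bounds, key) - 1
    return chunk_to_shard[chunk]

# New inserts carry ever-increasing timestamps beyond the last split point.
new_inserts = range(3_000, 3_500)
insert_load = Counter(route(ts) for ts in new_inserts)
print(insert_load)
```

Every insert in the sketch is routed to the shard that holds the last chunk, while the other shards receive none.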

Ensuring Success with High Scalability

Success Factors
• Storage: random seeks (IOPS)
• RAM: working set based on query patterns
• Query: indexing
• Delete: most expensive operation
• Real-time vs. bulk operations
• Continuity: HA, DR, backup, restore
• Agile process: iterate by powers of 4
• Sharding: shard key and strategy
• Resources: don’t go it alone!
