2. Challenges / Opportunities
Challenges:
• Variably typed data
• Complex objects
• High transaction rates
• Large data size
• High availability
Opportunities:
• Agility
• Cost reduction
• Simplification
3. AGILE DEVELOPMENT
• Iterative and continuous
• New and emerging apps
VOLUME AND TYPE OF DATA
• Trillions of records
• 10’s of millions of queries per second
• Semi-structured and unstructured data
NEW ARCHITECTURES
• Systems scaling horizontally, not vertically
• Commodity servers
• Cloud computing
4. PROJECT START → +30 DAYS → +90 DAYS → +6 MONTHS → +1 YEAR → LAUNCH
(Timeline: denormalize data model, stop using joins, custom caching layer, custom sharding)
DEVELOPER PRODUCTIVITY DECREASES
• Needed to add new software layers of ORM, caching, sharding and message queues
• Polymorphic, semi-structured and unstructured data not well supported
COST OF DATABASE INCREASES
• Increased database licensing cost
• Vertical, not horizontal, scaling
• High cost of SAN
8. Entity Attribute Value (EAV)
Entity  Attribute     Value
1       Type          Book
1       Title         Inside Intel
1       Author        Andy Grove
2       Type          Laptop
2       Manufacturer  Dell
2       Model         Inspiron
2       RAM           8gb
3       Type          Cereal
• Difficult to query
• Difficult to index
• Expensive to query
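A quick sketch of why EAV is awkward: every entity has to be reassembled from attribute rows before it can be used. A minimal, illustrative pivot (the triples mirror the table above; the function name is ours):

```javascript
// Reassembling EAV rows into per-entity objects.
var rows = [
  [1, "Type", "Book"], [1, "Title", "Inside Intel"], [1, "Author", "Andy Grove"],
  [2, "Type", "Laptop"], [2, "Manufacturer", "Dell"], [2, "Model", "Inspiron"],
  [2, "RAM", "8gb"],
  [3, "Type", "Cereal"]
];

function pivot(rows) {
  var entities = {};
  rows.forEach(function (r) {
    var id = r[0], attr = r[1], value = r[2];
    if (!entities[id]) entities[id] = {};
    entities[id][attr] = value;  // one column per attribute, per entity
  });
  return entities;
}

console.log(pivot(rows)[2].Manufacturer); // → Dell
```

Even this simple lookup requires touching every row for the entity; with documents, the whole object is stored (and indexed) as one unit.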
9. Type    Title          Author      Manufacturer  Model     RAM  Screen Width
Book       Inside Intel   Andy Grove
Laptop                                Dell          Inspiron  8gb  12”
TV                                    Panasonic     Viera          52”
MP3        Margin Walker  Fugazi      Dischord
• Constant schema changes
• Space inefficient
• Overloaded fields
10. One table per type:
Type    Manufacturer  Model     RAM  Screen Width
Laptop  Dell          Inspiron  8gb  12”

Manufacturer  Model  Screen Width
Panasonic     Viera  52”

Title          Author  Manufacturer
Margin Walker  Fugazi  Dischord

• Querying is difficult
• Hard to add more types
12. Challenges
• Constant query load from clients
• Far in excess of single-server capacity
• Can never take the system down
(Diagram: many concurrent queries hitting a single database)
13. Challenges
• Adding more storage over time
• Aging out data that’s no longer needed
• Minimizing resource overhead of “cold” data
(Diagram: recent data on fast storage, old data on archival storage; add capacity over time)
16. Variably typed data
• Rigid schemas
Complex objects
• Normalization can be hard
• Dependent on joins
High transaction rate
• Vertical scaling
• Poor data locality
• Difficult to maintain consistency & HA
High availability
• HA a bolt-on to many RDBMSs
Agile development
• Schema changes
• Monolithic data model
18. > var post = { author: "Jared",
                   date: new Date(),
                   text: "NoSQL Now 2012",
                   tags: ["NoSQL", "MongoDB"] }
> db.posts.save(post)
19. > db.posts.find()
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Jared",
  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
  text : "NoSQL Now 2012",
  tags : [ "NoSQL", "MongoDB" ] }
Notes:
- _id is unique, but can be anything you’d like
20. Create an index on any field in a document
// 1 means ascending, -1 means descending
> db.posts.ensureIndex({author: 1})
> db.posts.find({author: 'Jared'})
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author : "Jared",
  ... }
40. Value          Meaning
<n:integer>        Replicate to n members of the replica set
"majority"         Replicate to a majority of replica set members
<m:modeName>       Use the custom error mode named m
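For the integer form of w, "majority" simply resolves to more than half the members. A small sketch of that arithmetic (illustrative only; the server computes this internally):

```javascript
// What w:"majority" means for a replica set of n voting members:
// strictly more than half must acknowledge the write.
function majority(n) {
  return Math.floor(n / 2) + 1;
}

console.log(majority(5)); // → 3
console.log(majority(4)); // → 3
```

In the 2012-era shell this was requested per write via getLastError, e.g. `db.runCommand({getLastError: 1, w: "majority"})`.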
41. { _id: "someSet",
  members: [
    { _id: 0, host: "A", tags: { dc: "ny" }},
    { _id: 1, host: "B", tags: { dc: "ny" }},
    { _id: 2, host: "C", tags: { dc: "sf" }},
    { _id: 3, host: "D", tags: { dc: "sf" }},
    { _id: 4, host: "E", tags: { dc: "cloud" }}
  ],
  settings: {
    getLastErrorModes: {
      veryImportant: { dc: 3 },   // these are the modes you can
      sortOfImportant: { dc: 2 }  // use in write concern
    }
  }
}
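A sketch of how a tagged mode like veryImportant: { dc: 3 } is evaluated: the write must be acknowledged by members spanning three distinct dc tag values. The member-to-tag mapping mirrors the config above; the checking function is ours, not MongoDB's API:

```javascript
// Host → dc tag, taken from the replica set config above.
var tags = { A: "ny", B: "ny", C: "sf", D: "sf", E: "cloud" };

// mode is e.g. { dc: 3 }: require acknowledgement from members
// covering at least 3 distinct values of the "dc" tag.
function satisfies(ackedHosts, mode) {
  var seen = {};
  ackedHosts.forEach(function (h) { seen[tags[h]] = true; });
  return Object.keys(seen).length >= mode.dc;
}

console.log(satisfies(["A", "C", "E"], { dc: 3 })); // → true (ny, sf, cloud)
console.log(satisfies(["A", "B", "C"], { dc: 3 })); // → false (only ny, sf)
```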
42. • Priority is between 0 and 1000
• The highest-priority member that is up to date wins
  – Up to date == within 10 seconds of primary
• If a higher-priority member catches up, it will force an election and win
Primary (priority = 3)   Secondary (priority = 2)   Secondary (priority = 1)
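The election rule above can be sketched as: filter out members more than 10 seconds behind, then pick the highest priority. Illustrative only; real elections also involve votes and network state:

```javascript
var members = [
  { host: "A", priority: 3, lagSecs: 0 },
  { host: "B", priority: 2, lagSecs: 4 },
  { host: "C", priority: 1, lagSecs: 30 } // too far behind to be eligible
];

function electionWinner(members) {
  // "up to date" == within 10 seconds of the primary
  var eligible = members.filter(function (m) { return m.lagSecs <= 10; });
  eligible.sort(function (a, b) { return b.priority - a.priority; });
  return eligible.length ? eligible[0].host : null;
}

console.log(electionWinner(members)); // → A
```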
43. • Lags behind the primary by a configurable time delay
• Automatically hidden from clients
• Protects against operator error
  – Accidentally deleting a database
  – Application corrupting data
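A delayed secondary is just a member with a delay and hidden flag in the replica set config. A hypothetical member entry (host name is illustrative), using the 2012-era slaveDelay option:

```javascript
// Hypothetical delayed, hidden secondary in a replica set config.
var delayedMember = {
  _id: 3,
  host: "backup.example.com:27017", // illustrative host
  priority: 0,                      // can never become primary
  hidden: true,                     // invisible to normal client reads
  slaveDelay: 3600                  // stays one hour behind the primary
};
```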
44. • Vote in elections
• Don’t store a copy of the data
• Used as a tie breaker
47. Single data center
Zone 1: Primary
Zone 2: Secondary
Zone 3: Secondary (hidden = true, used for backups)
48. Active Data Center
Zone 1: Primary (priority = 1)
Zone 2: Secondary (priority = 1)
Standby Data Center
Secondary (priority = 0)
49. West Coast DC
Zone 1: Primary (priority = 2)
Zone 2: Secondary (priority = 2)
Central DC
Arbiter
East Coast DC
Zone 1: Secondary (priority = 1)
Zone 2: Secondary (priority = 1)
60. Min Key          Max Key          Shard
-∞                   dan@10gen.com    1
dan@10gen.com        jsr@10gen.com    1
jsr@10gen.com        scott@10gen.com  1
scott@10gen.com      +∞               1
• Stored in the config servers
• Cached in MongoS
• Used to route requests and keep the cluster balanced
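The routing this table enables can be sketched as a range lookup (MIN/MAX stand in for -∞/+∞; the helper is ours, not MongoDB code):

```javascript
// Chunk table from the slide; ranges are min-inclusive, max-exclusive.
var MIN = "", MAX = "\uffff";
var chunks = [
  { min: MIN, max: "dan@10gen.com", shard: 1 },
  { min: "dan@10gen.com", max: "jsr@10gen.com", shard: 1 },
  { min: "jsr@10gen.com", max: "scott@10gen.com", shard: 1 },
  { min: "scott@10gen.com", max: MAX, shard: 1 }
];

function routeToShard(key) {
  for (var i = 0; i < chunks.length; i++) {
    if (key >= chunks[i].min && key < chunks[i].max) return chunks[i].shard;
  }
  return null;
}

console.log(routeToShard("jsr@10gen.com")); // → 1
```

Here every chunk still lives on shard 1; as data grows, the balancer moves chunks so the same lookup spreads requests across shards.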
66. By shard key           Routed                   db.users.find({email: "jsr@10gen.com"})
Sorted by shard key        Routed in order          db.users.find().sort({email: -1})
Find by non-shard key      Scatter gather           db.users.find({state: "CA"})
Sorted by non-shard key    Distributed merge sort   db.users.find().sort({state: 1})
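The distinction in the first column comes down to whether the query includes the shard key. A toy version of that decision (simplified: real mongos also targets ranges and handles sorts as shown above):

```javascript
// If the query filters on the shard key, mongos can target one shard;
// otherwise it must ask every shard and merge the results.
function routingStrategy(query, shardKey) {
  return query.hasOwnProperty(shardKey) ? "routed" : "scatter-gather";
}

console.log(routingStrategy({ email: "jsr@10gen.com" }, "email")); // → routed
console.log(routingStrategy({ state: "CA" }, "email"));            // → scatter-gather
```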
72. Content Management • Operational Intelligence • Meta Data Management • User Data Management • High Volume Data Feeds
73. Machine Generated Data
• More machines, more sensors, more data
• Variably structured
Stock Market Data
• High frequency trading
Social Media Firehose
• Multiple sources of data
• Each changes their format constantly
74. High-volume data feed architecture
• Flexible document model can adapt to changes in sensor format
• Asynchronous writes
• Write to memory with periodic disk flush
• Scale writes over multiple shards
(Diagram: multiple data sources feeding a sharded cluster)
75. Ad Targeting
• Large volume of state about users
• Very strict latency requirements
Real-time dashboards
• Expose report data to millions of customers
• Report on large volumes of data
• Reports that update in real time
Social Media Monitoring
• What are people talking about?
76. Real-time dashboard architecture
• Low-latency reads
• Parallelize queries across replicas and shards
• In-database aggregation
• Flexible schema adapts to changing input data
• Can use the same cluster to collect, store, and report on data
(Diagram: data flows through an API into dashboards)
77. Intuit relies on a MongoDB-powered real-time analytics tool for small businesses to derive interesting and actionable patterns from their customers’ website traffic
Problem
• Intuit hosts more than 500,000 websites
• Wanted to collect and analyze data to recommend conversion and lead-generation improvements to customers
• With 10 years’ worth of user data, it took several days to process the information using a relational database
Impact
• In one week Intuit was able to become proficient in MongoDB development
• Developed application features more quickly for MongoDB than for relational databases
• MongoDB was 2.5 times faster than MySQL
“We did a prototype for one week, and within one week we had made big progress. Very big progress. It was so amazing that we decided, ‘Let’s go with this.’” – Nirmala Ranganathan, Intuit
78. Rich profiles collecting multiple complex actions
• Scale out to support high throughput of activities tracked
• Dynamic schemas make it easy to track vendor-specific attributes
• Indexing and querying to support matching, frequency capping
Sequence: 1 See Ad → 2 See Ad → 3 Click → 4 Convert
{ cookie_id: "1234512413243",
  advertiser: {
    apple: {
      actions: [
        { impression: 'ad1', time: 123 },
        { impression: 'ad2', time: 232 },
        { click: 'ad2', time: 235 },
        { add_to_cart: 'laptop',
          sku: 'asdf23f',
          time: 254 },
        { purchase: 'laptop', time: 354 }
      ]
    }
  }
}
79. Data Archiving
• Meta-data about artifacts
• Content in the library
Information discovery
• Have data sources that you don’t have access to
• Stores meta-data on those stores and figures out which ones have the content
Biometrics
• Retina scans
• Fingerprints
80. Indexing and rich query API for easy searching and sorting:
db.archives.find({ "country": "Egypt" });
Flexible data model for similar, but different objects:
{ type: "Artefact",
  medium: "Ceramic",
  country: "Egypt",
  year: "3000 BC"
}
{ ISBN: "00e8da9b",
  type: "Book",
  country: "Egypt",
  title: "Ancient Egypt"
}
81. Shutterfly uses MongoDB to safeguard more than six billion images for millions of customers in the form of photos and videos, and turn everyday pictures into keepsakes
Problem
• Managing 20TB of data (six billion images for millions of customers), partitioning by function
• Home-grown key-value store on top of their Oracle database offered sub-par performance
• Codebase for this hybrid store became hard to manage
• High licensing, HW costs
Why MongoDB
• JSON-based data structure
• Provided Shutterfly with an agile, high performance, scalable solution at a low cost
• Works seamlessly with Shutterfly’s services-based architecture
Impact
• 500% cost reduction and 900% performance improvement compared to previous Oracle implementation
• Accelerated time-to-market for nearly a dozen projects on MongoDB
• Improved performance by reducing average latency for inserts from 400ms to 2ms
The “really killer reason” for using MongoDB is its rich JSON-based data structure, which offers Shutterfly an agile approach to develop software. With MongoDB, the Shutterfly team can quickly develop and deploy new applications, especially Web 2.0 and social features. – Kenny Gorman, Director of Data Services
82. News Site
• Comments and user-generated content
• Personalization of content, layout
Multi-Device rendering
• Generate layout on the fly for each device that connects
• No need to cache static pages
Sharing
• Store large objects
• Simple modeling of metadata
83. Geospatial indexing for location-based searches
GridFS for large object storage
Flexible data model for similar, but different objects
Horizontal scalability for large data sets
{ camera: "Nikon d4",
  location: [ -122.418333, 37.775 ]
}
{ camera: "Canon 5d mkII",
  people: [ "Jim", "Carol" ],
  taken_on: ISODate("2012-03-07T18:32:35.002Z")
}
{ origin: "facebook.com/photos/xwdf23fsdf",
  license: "Creative Commons CC0",
  size: {
    dimensions: [ 124, 52 ],
    units: "pixels"
  }
}
84. Wordnik uses MongoDB as the foundation for its “live” dictionary that stores its entire text corpus – 3.5T of data in 20 billion records
Problem
• Analyze a staggering amount of data for a system built on a continuous stream of high-quality text pulled from online sources
• Adding too much data too quickly resulted in outages; tables locked for tens of seconds during inserts
• Initially launched entirely on MySQL but quickly hit performance roadblocks
Why MongoDB
• Migrated 5 billion records in a single day with zero downtime
• MongoDB powers every website request: 20m API calls per day
• Ability to eliminate the memcached layer, creating a simplified system that required fewer resources and was less prone to error
Impact
• Reduced code by 75% compared to MySQL
• Fetch time cut from 400ms to 60ms
• Sustained insert speed of 8k words per second, with frequent bursts of up to 50k per second
• Significant cost savings and 15% reduction in servers
“Life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. Since we don’t spend time worrying about the database, we can spend more time writing code for our application.” – Tony Tam, Vice President of Engineering and Technical Co-founder
85. Social Graphs
• Scale out to large graphs
• Easy to search and process
Identity Management
• Authentication, Authorization and Accounting
86. Social Graph
• Native support for arrays makes it easy to store connections inside the user profile
• Documents enable disk locality of all profile data for a user
• Sharding partitions user profiles across available servers
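As a concrete illustration of arrays-as-edges (field names are our assumption, not from the deck):

```javascript
// A user profile storing its graph connections in embedded arrays,
// so the whole neighborhood loads with one document fetch.
var user = {
  _id: "jared",
  name: "Jared",
  follows: ["meghan", "scott", "dan"], // outgoing edges
  followers: ["scott"]                  // incoming edges
};

// In the shell, an index on the array supports reverse lookups:
//   db.users.ensureIndex({ follows: 1 })
//   db.users.find({ follows: "scott" })
console.log(user.follows.length); // → 3
```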