Presented by Andrew Erlichson, Vice President, Engineering, Developer Experience, MongoDB
Audience level: Beginner
MongoDB’s basic unit of storage is a document. Documents can represent rich, schema-free data structures, meaning that we have several viable alternatives to the normalized, relational model. In this talk, we’ll discuss the tradeoffs of various data modeling strategies in MongoDB. You will learn:
- How to work with documents
- How to evolve your schema
- Common schema design patterns
Medical Records
• Collect all patient information in a central repository
• Provide a central point of access for
– Patients
– Care providers: physicians, nurses, etc.
– Billing
– Insurance reconciliation
• Hospitals, physicians, patients, procedures, records
[Diagram: patient records (medications, lab results, procedures) and hospital records (physicians, patients, nurses, billing)]
Medical Record Data
• Hospitals
– Have physicians
• Physicians
– Have patients
– Perform procedures
– Belong to hospitals
• Patients
– Have physicians
– Are the subject of procedures
• Procedures
– Associated with a patient
– Associated with a physician
– Have a record
– Variable metadata
• Records
– Associated with a procedure
– Binary data
– Variable fields
Embedding (1:1)
• Advantages
– Retrieve all relevant information in a single query/document
– Avoid implementing joins in application code
– Update related information as a single atomic operation
• At the time of this talk, MongoDB did not offer multi-document transactions (support arrived in MongoDB 4.0)
• Limitations
– Large documents mean more overhead if most fields are not relevant
– 16 MB document size limit
Referencing (1:1)
• Advantages
– Smaller documents
– Less likely to reach 16 MB document limit
– Infrequently accessed information is not fetched on every query
• Limitations
– Two queries required to retrieve information
– Cannot update related information atomically
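To make the trade-off between the two 1:1 layouts concrete, the sketch below counts lookups for each one. It is an illustration only: in-memory Maps stand in for collections so that it runs without a database, and the collection names and the `resultId` field are hypothetical, not taken from the talk.

```javascript
// Illustration only: in-memory Maps stand in for MongoDB collections.
// Collection names and the resultId field are hypothetical.

// Embedded 1:1: the result lives inside the procedure document.
const embeddedProcedures = new Map([
  [333, { _id: 333, type: "Chest X-ray", result: { type: "txt", size: 12 } }],
]);

// Referenced 1:1: the procedure stores only the _id of its result.
const referencedProcedures = new Map([
  [333, { _id: 333, type: "Chest X-ray", resultId: 9001 }],
]);
const results = new Map([[9001, { _id: 9001, type: "txt", size: 12 }]]);

let lookups = 0;
function findById(collection, id) {
  lookups += 1; // each call stands for one round trip to the database
  return collection.get(id);
}

function loadEmbedded(id) {
  lookups = 0;
  const procedure = findById(embeddedProcedures, id);
  return { result: procedure.result, lookups };
}

function loadReferenced(id) {
  lookups = 0;
  const procedure = findById(referencedProcedures, id);
  const result = findById(results, procedure.resultId);
  return { result, lookups };
}

console.log(loadEmbedded(333).lookups);   // 1
console.log(loadReferenced(333).lookups); // 2
```

Both layouts return the same result data; the referenced layout pays one extra lookup in exchange for a smaller procedure document.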
One to One: General Recommendations
• Embed
– No additional data duplication
– Can query or index on an embedded field
• e.g., "result.type"
• Exceptional cases:
– Embedding results in large documents
– A set of infrequently accessed fields
{
  "_id" : 333,
  "date" : ISODate("2003-02-09T05:00:00"),
  "hospital" : "County Hills",
  "patient" : "John Doe",
  "physician" : "Stephen Smith",
  "type" : "Chest X-ray",
  "result" : {
    "type" : "txt",
    "size" : NumberInt(12),
    "content" : {
      value1: 343,
      value2: "abc",
      …
    }
  }
}
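A dotted path such as "result.type" addresses a field inside an embedded document; in a query filter it would appear as { "result.type": "txt" }. The sketch below is a tiny path resolver written only to show what such a path points at. It is not how MongoDB evaluates queries, and the `getPath` helper is a hypothetical name.

```javascript
// Illustration only: resolve a dotted field path against a plain object.
// getPath is a hypothetical helper, not a MongoDB API.
function getPath(doc, path) {
  return path
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

const procedure = {
  _id: 333,
  type: "Chest X-ray",
  result: { type: "txt", size: 12 },
};

console.log(getPath(procedure, "result.type")); // txt
console.log(getPath(procedure, "type"));        // Chest X-ray
console.log(getPath(procedure, "result.missing.deeper")); // undefined
```

Note that the top-level "type" and the embedded "result.type" are distinct fields, which is why the embedded one needs the full path.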
One to Many: General Recommendations
• Embed, when possible
– Access all information in a single query
– Take advantage of update atomicity
– No additional data duplication
– Can query or index on any field
• e.g., { "phones.type": "mobile" }
• Exceptional cases:
– 16 MB document size limit
– Large number of infrequently accessed fields
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    {
      id: 12345,
      date: ISODate("2015-02-15"),
      type: "Cat scan",
      … },
    {
      id: 12346,
      date: ISODate("2015-02-15"),
      type: "blood test",
      … }
  ]
}
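When the "many" side is embedded as an array, a dotted path such as "procedures.type" matches a document if any array element has the given value. The sketch below imitates that behaviour in plain JavaScript. It is an approximation for illustration; the `matchesArrayField` helper and the second patient are hypothetical additions, not part of the talk.

```javascript
// Illustration only: imitate matching on a field of embedded array elements.
// matchesArrayField is a hypothetical helper, not a MongoDB API.
function matchesArrayField(doc, arrayField, subField, expected) {
  const items = Array.isArray(doc[arrayField]) ? doc[arrayField] : [];
  return items.some((item) => item[subField] === expected);
}

const patients = [
  {
    _id: 2,
    first: "Joe",
    last: "Patient",
    procedures: [
      { id: 12345, type: "Cat scan" },
      { id: 12346, type: "blood test" },
    ],
  },
  {
    // Hypothetical second patient, added so the filter has something to exclude.
    _id: 3,
    first: "Ann",
    last: "Example",
    procedures: [{ id: 12347, type: "blood test" }],
  },
];

const withCatScan = patients.filter((p) =>
  matchesArrayField(p, "procedures", "type", "Cat scan")
);
console.log(withCatScan.map((p) => p._id)); // [ 2 ]
```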
Many to Many
Traditional Relational Association
• Tables
– Physicians: name, specialty, phone
– Hospitals: name
– HosPhysicanRel (join table): hospitalId, physicianId
• MongoDB alternative: drop the join table and use arrays instead
Many to Many
General Recommendation
• Use case determines whether to reference or embed:
1. Data duplication
• Embedding may result in data duplication
• Duplication may be okay if reads dominate updates
2. Referencing may be required if there are many related items
3. Hybrid approach
• Potentially do both
Hospitals (reference physicians by _id):
{
  _id: 2,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [12345, 12346]
}
Physicians:
{
  _id: 12345,
  name: "Joe Doctor",
  address: {…},
  … }
{
  _id: 12346,
  name: "Mary Well",
  address: {…},
  … }
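With references, the application resolves the array of physician _ids itself. The sketch below shows that second step using in-memory Maps in place of collections (an assumption made so it runs without a database); against a real database the second step would typically be a single query using `$in` over the array. The `physiciansAt` helper is a hypothetical name.

```javascript
// Illustration only: resolve an array of references in application code.
// Maps stand in for the hospitals and physicians collections.
const hospitals = new Map([
  [2, { _id: 2, name: "Oak Valley Hospital", physicians: [12345, 12346] }],
]);
const physicians = new Map([
  [12345, { _id: 12345, name: "Joe Doctor" }],
  [12346, { _id: 12346, name: "Mary Well" }],
]);

// physiciansAt is a hypothetical helper: first fetch the hospital,
// then fetch every physician whose _id appears in its array.
function physiciansAt(hospitalId) {
  const hospital = hospitals.get(hospitalId);
  if (!hospital) return [];
  return hospital.physicians
    .map((id) => physicians.get(id))
    .filter((doc) => doc !== undefined); // skip dangling references
}

console.log(physiciansAt(2).map((p) => p.name)); // [ 'Joe Doctor', 'Mary Well' ]
```

Because nothing enforces the references, the sketch filters out ids that no longer resolve rather than assuming every one is valid.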
What If I Want to Store Large Files in MongoDB?
Data From Vital Signs Monitoring Device
{
deviceId: 123456,
spO2: 88,
pulse: 74,
bp: [128, 80],
ts: ISODate("2013-10-16T22:07:00.000-0500")
}
• One document per minute per device
• Mirrors the relational approach: one row per reading
Document Per Hour (By minute)
{
deviceId: 123456,
spO2: { 0: 88, 1: 90, …, 59: 92},
pulse: { 0: 74, 1: 76, …, 59: 72},
bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78]},
ts: ISODate("2013-10-16T22:00:00.000-0500")
}
• Store per-minute data at the hourly level
• Update-driven workload
• 1 document per device per hour
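One possible way to drive the update workload is to turn each per-minute reading into a filter that selects the hourly document and a `$set` with dotted paths for that minute's slot. The sketch below only builds those two objects. The `hourlyUpdate` helper is hypothetical, and it assumes the device's UTC offset is a whole number of hours so that the UTC minute equals the local minute.

```javascript
// Illustration only: build the filter and update for one reading in the
// document-per-hour layout. hourlyUpdate is a hypothetical helper.
function hourlyUpdate(reading) {
  const ts = new Date(reading.ts);
  const minute = ts.getUTCMinutes(); // slot within the hour (0..59)

  // Truncate the timestamp to the start of its hour to identify the bucket.
  const hourStart = new Date(ts);
  hourStart.setUTCMinutes(0, 0, 0);

  return {
    filter: { deviceId: reading.deviceId, ts: hourStart },
    update: {
      $set: {
        [`spO2.${minute}`]: reading.spO2,
        [`pulse.${minute}`]: reading.pulse,
        [`bp.${minute}`]: reading.bp,
      },
    },
  };
}

const op = hourlyUpdate({
  deviceId: 123456,
  spO2: 88,
  pulse: 74,
  bp: [128, 80],
  ts: "2013-10-16T22:07:00.000-05:00",
});

console.log(op.filter.ts.toISOString()); // 2013-10-17T03:00:00.000Z
console.log(Object.keys(op.update.$set)); // [ 'spO2.7', 'pulse.7', 'bp.7' ]
```

Passing such a filter and update to an update call with the upsert option enabled would let the first reading of the hour create the document and the remaining readings modify it, which is one way to arrive at the single insert followed by updates.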
Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
– Document per event: 60 inserts
– Document per hour: 1 insert, 59 updates
Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:
– Document per event: 1440 reads
– Document per hour: 24 reads
• Read performance is greatly improved with one document per hour
Characterizing Memory and Storage Differences
                        Document Per Minute    Document Per Hour
Number of documents     52.6 B                 876 M
Total index size        6364 GB                106 GB
  _id index             1468 GB                24.5 GB
  {ts: 1, deviceId: 1}  4895 GB                81.6 GB
Document size           92 bytes               758 bytes
Database size           4503 GB                618 GB
• 100K devices
• 1 year's worth of data
• Number of documents: 100000 * 365 * 24 * 60 (per minute) vs. 100000 * 365 * 24 (per hour)
• Total index size: 100000 * 365 * 24 * 60 * 130 bytes vs. 100000 * 365 * 24 * 130 bytes
• Database size: 100000 * 365 * 24 * 60 * 92 bytes vs. 100000 * 365 * 24 * 758 bytes
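The sizing arithmetic can be reproduced with a few lines of code. The sketch below uses the figures from the slide (100K devices, one year, 130 index bytes per document, 92- and 758-byte documents) and assumes that "GB" means 2^30 bytes, which is the interpretation that matches the slide's totals. The `estimate` helper is a hypothetical name.

```javascript
// Illustration only: reproduce the slide's sizing arithmetic.
// Assumes 1 GB = 2^30 bytes; estimate is a hypothetical helper.
const DEVICES = 100000;
const HOURS_PER_YEAR = 365 * 24;
const GB = 2 ** 30;

function estimate(docsPerHour, docBytes, indexBytesPerDoc) {
  const documents = DEVICES * HOURS_PER_YEAR * docsPerHour;
  return {
    documents,
    indexGB: (documents * indexBytesPerDoc) / GB,
    dataGB: (documents * docBytes) / GB,
  };
}

const perMinute = estimate(60, 92, 130); // one document per reading
const perHour = estimate(1, 758, 130);   // one document per device per hour

console.log(perMinute.documents);          // 52560000000
console.log(perHour.documents);            // 876000000
console.log(perMinute.indexGB.toFixed(1)); // 6363.5
console.log(perHour.indexGB.toFixed(1));   // 106.1
console.log(perMinute.dataGB.toFixed(1));  // 4503.4
console.log(perHour.dataGB.toFixed(1));    // 618.4
```

The hourly layout stores larger documents but 60 times fewer of them, so both the index and the data shrink: the index by a factor of 60, the data by roughly a factor of 7.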
Summary
• Relationships can be modeled by embedding or referencing
• The decision should be made in the context of the application's data and query workload
– Tailor schema to application workload
• It is okay (even recommended) to violate RDBMS schema design principles
– No duplication of data
– Normalization
• Different schemas may result in dramatically different
– Query performance
– Hardware requirements