In this session, you will learn how to translate one-to-one, one-to-many and many-to-many relationships, and see how MongoDB's JSON document structure, atomic updates and rich indexes can influence your design. We will also explore the implications of storage engines, indexing and query patterns, along with available tools and related new features in MongoDB 3.2.
5. Medical Records
• Collects all patient information in a central repository
• Provides a central point of access for
• Patients
• Care providers: physicians, nurses, etc.
• Billing
• Insurance reconciliation
• Hospitals, physicians, patients, procedures, records
[Diagram: patient records (medications, lab results, procedures) and hospital records (physicians, patients, nurses, billing) flowing into the central repository]
6. Medical Record Data
• Hospitals
• have physicians
• Physicians
• Have patients
• Perform procedures
• Belong to hospitals
• Patients
• Have physicians
• Are the subject of procedures
• Procedures
• Associated with a patient
• Associated with a physician
• Have a record
• Variable metadata
• Records
• Associated with a procedure
• Binary data
• Variable fields
10. MongoDB vs. Relational
Attribute     | MongoDB                  | Relational
------------- | ------------------------ | ----------------
Storage       | N-dimensional            | Two-dimensional
Field Values  | 0, 1, many, or embedded  | Single value
Query         | Any field, at any level  | Any field
Schema        | Flexible                 | Very structured
20. Embedding
• Advantages
• Retrieve all relevant information in a single query/document
• Avoid implementing joins in application code
• Update related information as a single atomic operation
• MongoDB doesn’t offer multi-document transactions
• Limitations
• Large documents mean more overhead if most fields are not relevant
• 16 MB document size limit
23. Referencing
• Advantages
• Smaller documents
• Less likely to reach 16 MB document limit
• Infrequently accessed information not accessed on every query
• No duplication of data
• Limitations
• Two queries required to retrieve information
• Cannot update related information atomically
24. 1-1: General Recommendations
• Embed
• No additional data duplication
• Can query or index on the embedded field
• e.g., "result.type"
• Exceptional cases…
• Embedding results in large documents
• Sets of infrequently accessed fields
{
  "_id": 333,
  "date": "2003-02-09T05:00:00",
  "hospital": "County Hills",
  "patient": "John Doe",
  "physician": "Stephen Smith",
  "type": "Chest X-ray",
  "result": {
    "type": "txt",
    "size": 12,
    "content": {
      "value1": 343,
      "value2": "abc"
    }
  }
}
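To make the "query or index on the embedded field" point concrete, here is a minimal sketch in plain Node.js (no live server). In mongosh you would run `db.procedures.find({ "result.type": "txt" })` and `db.procedures.createIndex({ "result.type": 1 })`; here we build the same filter document and apply its dot path in memory to the sample procedure above.

```javascript
// The sample 1-1 document from the slide (abbreviated):
const procedure = {
  _id: 333,
  type: "Chest X-ray",
  result: { type: "txt", size: 12, content: { value1: 343, value2: "abc" } }
};

// The filter you would pass to find(); dot notation reaches into the
// embedded "result" sub-document:
const filter = { "result.type": "txt" };

// Walk the dot path the way the server would:
const dotPath = Object.keys(filter)[0].split(".");        // ["result", "type"]
const fieldValue = dotPath.reduce((doc, key) => doc[key], procedure);
const matches = fieldValue === filter["result.type"];     // doc would be returned
```

The same dotted path works in `createIndex`, so the embedded field costs nothing extra to index compared with a top-level field.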
27. 1-M : General Recommendations
• Embed, when possible
• Many are weak entities
• Access all information in a single query
• Take advantage of update atomicity
• No additional data duplication
• Can query or index on any field
• e.g., { "phones.type": "mobile" }
• Exceptional cases:
• 16 MB document size
• Large number of infrequently accessed fields
{
  _id: 2,
  first: "Joe",
  last: "Patient",
  addr: { … },
  procedures: [
    {
      id: 12345,
      date: ISODate("2015-02-15"),
      type: "Cat scan",
      …
    },
    {
      id: 12346,
      date: ISODate("2015-02-15"),
      type: "blood test",
      …
    }
  ]
}
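A quick sketch (plain Node.js, no live server) of how the embedded array above is queried: a filter like `{ "procedures.type": "blood test" }` matches the patient if any element of the array has that type, and an index on the same dotted path would be a multikey index with one entry per array element.

```javascript
// The 1-M patient document from the slide (abbreviated):
const patient = {
  _id: 2, first: "Joe", last: "Patient",
  procedures: [
    { id: 12345, date: "2015-02-15", type: "Cat scan" },
    { id: 12346, date: "2015-02-15", type: "blood test" }
  ]
};

// The filter document as you would pass it to find():
const filter = { "procedures.type": "blood test" };

// In-memory equivalent of the server-side array match: the document
// matches if ANY array element satisfies the condition.
const matches = patient.procedures.some(p => p.type === filter["procedures.type"]);

// In mongosh the index would be: db.patients.createIndex({ "procedures.type": 1 })
```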
32. M-M: General Recommendations
• Use case determines whether to reference or embed:
1. Data duplication
• Embedding may result in data duplication
• Duplication may be okay if reads dominate updates
• Of the two entities, which one changes the least?
2. Referencing may be required if there are many related items
3. Hybrid approach
• Potentially do both… it’s OK!
Hospitals reference physicians by _id:
{
  _id: 2,
  name: "Oak Valley Hospital",
  city: "New York",
  beds: 131,
  physicians: [12345, 12346]
}
{
  _id: 12345,
  name: "Joe Doctor",
  address: { … },
  …
}
{
  _id: 12346,
  name: "Mary Well",
  address: { … },
  …
}
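The referencing side of this M-M design implies a two-step read: fetch the hospital, then fetch its physicians with an `$in` filter on the ids stored in the array. A minimal sketch (plain Node.js, in-memory stand-ins for the two collections, no live server):

```javascript
// Stand-ins for the two collections from the slide:
const hospitals = [
  { _id: 2, name: "Oak Valley Hospital", physicians: [12345, 12346] }
];
const physiciansColl = [
  { _id: 12345, name: "Joe Doctor" },
  { _id: 12346, name: "Mary Well" },
  { _id: 99999, name: "Someone Else" }   // not on staff at this hospital
];

// Query 1 — in mongosh: db.hospitals.findOne({ _id: 2 })
const hospital = hospitals.find(h => h._id === 2);

// Query 2 — in mongosh: db.physicians.find({ _id: { $in: hospital.physicians } })
const staff = physiciansColl.filter(p => hospital.physicians.includes(p._id));
// staff now holds Joe Doctor and Mary Well; the join happens in application code
```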
39. Vital Sign Monitoring Device
Vital Signs Measured:
• Blood Pressure
• Pulse
• Blood Oxygen Levels
Produces data at regular intervals
• Once per minute
• Many Devices, Many Hospitals
40. Data From Vital Signs Monitoring Device
{
  deviceId: 123456,
  ts: ISODate("2013-10-16T22:07:00.000-0500"),
  spO2: 88,
  pulse: 74,
  bp: [128, 80]
}
• One document per minute, per device
• This mirrors the relational approach (one row per event)
41. Document Per Hour (by minute)
{
  deviceId: 123456,
  ts: ISODate("2013-10-16T22:00:00.000-0500"),
  spO2: { 0: 88, 1: 90, …, 59: 92 },
  pulse: { 0: 74, 1: 76, …, 59: 72 },
  bp: { 0: [122, 80], 1: [126, 84], …, 59: [124, 78] }
}
• 1 document per device, per hour
• Store per-minute data at the hourly level
• Update-driven workload
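The update-driven workload means each new reading is an in-place `$set` into the hourly bucket rather than an insert. A minimal sketch (plain Node.js, no live server) of the update document for minute 17, plus an in-memory apply of its dot paths; in mongosh this would be `db.readings.updateOne(filter, update, { upsert: true })`:

```javascript
// Identify the hourly bucket for this device:
const filter = { deviceId: 123456, ts: new Date("2013-10-16T22:00:00-05:00") };

// Write minute 17's readings into the bucket (values are illustrative):
const update = {
  $set: { "spO2.17": 91, "pulse.17": 75, "bp.17": [125, 81] }
};

// Minimal in-memory apply of the $set: dot paths create/overwrite leaves.
const bucket = { deviceId: 123456, spO2: {}, pulse: {}, bp: {} };
for (const [dotted, value] of Object.entries(update.$set)) {
  const keys = dotted.split(".");
  let node = bucket;
  for (const k of keys.slice(0, -1)) node = node[k] = node[k] || {};
  node[keys[keys.length - 1]] = value;
}
// bucket.spO2 is now { "17": 91 }, and the other 59 minutes fill in the same way
```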
42. Characterizing Write Differences
• Example: data generated every minute
• Recording the data for 1 patient for 1 hour:
• Document Per Event: 60 inserts
• Document Per Hour: 1 insert, 59 updates
43. Characterizing Read Differences
• Want to graph 24 hours of vital signs for a patient:
• Document Per Event: 1440 reads
• Document Per Hour: 24 reads
• Read performance is greatly improved
44. Characterizing Memory and Storage Differences
Metric                 | Document Per Minute | Document Per Hour
---------------------- | ------------------- | -----------------
Number of Documents    | 52.6 Billion        | 876 Million
Total Index Size       | 6,364 GB            | 106 GB
_id index              | 1,468 GB            | 24.5 GB
{ts: 1, deviceId: 1}   | 4,895 GB            | 81.6 GB
Document Size          | 92 Bytes            | 758 Bytes
Database Size          | 4,503 GB            | 618 GB
• 100K devices
• 1 year's worth of data at minute resolution (365 × 24 × 60 minutes)
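The document counts in the table fall out of simple arithmetic, which is worth checking when sizing hardware. Note the counts only work out at minute resolution, not second resolution:

```javascript
// 100K devices, one year of per-minute readings:
const devices = 100000;
const minutesPerYear = 365 * 24 * 60;   // 525,600
const hoursPerYear = 365 * 24;          // 8,760

// Document Per Minute design: one document per device per minute.
const perMinuteDocs = devices * minutesPerYear;  // 52,560,000,000 ≈ "52.6 Billion"

// Document Per Hour design: one document per device per hour.
const perHourDocs = devices * hoursPerYear;      // 876,000,000 = "876 Million"

// A 60x reduction in document (and index-entry) count, at the cost of
// larger individual documents (92 bytes vs. 758 bytes in the table).
const reduction = perMinuteDocs / perHourDocs;   // 60
```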
46. MongoDB 3.2 – a GIANT Release
Features by release:
• 2.2: Aggregation Framework, Location-Aware Sharding
• 2.4: Hash-Based Sharding, Roles, Kerberos, On-Prem Monitoring
• 2.6: $out, Index Intersection, Text Search, Field-Level Redaction, LDAP & x509, Auditing
• 3.0: Doc-Level Concurrency, Compression, Storage Engine API, ≤50 replicas, Auditing++, Ops Manager
• 3.2: Document Validation, Fast Failover, Simpler Scalability, Aggregation++, Encryption At Rest, In-Memory Storage Engine, BI Connector, $lookup, MongoDB Compass, APM Integration, Profiler Visualization, Auto Index Builds, Backups to File System
47. Tools
• mgenerate
• Part of mtools: https://github.com/rueckstiess/mtools/wiki/mgenerate
• Model a schema using a JSON definition
• Generate millions of documents with random data
• How well does the schema work?
• Queries, indexes, data size, index size, replication
• Demo
48. Documents are Rich Data Structures
{
  first_name: "Paul",
  last_name: "Miller",
  cell: 1234567890,
  city: "London",
  location: [45.123, 47.232],
  professions: ["banking", "finance", "trader"],
  physicians: [
    { name: "Canelo Álvarez, M.D.",
      last_visit: "Mission Hospital",
      last_visit_dt: "20160501", … },
    { name: "Érik Morales, M.D.",
      last_visit: "Del Prado Hospital",
      last_visit_dt: "20160302", … }
  ]
}
• Fields have typed values
• Fields can contain arrays
• Fields can contain arrays of sub-documents
• Fields can be indexed and queried at any level
• ORM layer removed – the data is already an object!
51. Visual Query Profiler & Index Suggestions
• Visual Query Profiler: identify your slow-running queries with the click of a button
• Index Suggestions: index recommendations to improve your deployment
55. MongoDB 3.2 Document Validation
db.runCommand({
  collMod: "Patients",
  validator: { $and: [
    { "first_name": { "$type": "string" } },
    { "last_name": { "$type": "string" } },
    { "physicians": { "$type": "array" } }
  ] },
  validationLevel: "strict"
});
https://docs.mongodb.com/manual/core/document-validation/
All Patient records must have string values for first_name and last_name, and an array of physicians.
56. Summary
01 Embedding and Referencing
02 Decisions are made in the context of application data and query workload
03 1-1: Embed; 1-M: Embed when possible; M-M: Hybrid
04 3.2: $lookup, Document Validation
05 Tools: mgenerate/mtools, Compass, Cloud Manager / Ops Manager
06 Measure data/index size and query performance, then iterate – different schemas may result in dramatically different query performance, data/index size and hardware requirements!
Hi my name is Sigfrido Narvaez, and I like to go by Sig.
Today we will be talking about MongoDB schema design and some of its performance implications. We will also explore some of the new features in MongoDB 3.2 that are relevant to schema design, and some additional tools that will help you iterate and try out different approaches quickly.
During the webinar, please feel free to type any questions in the chat box; at the end we will have a Q&A session and answer as many as we can.
Ok, so I am a Sr. Solutions Architect here at MongoDB, based out of Southern California. Prior to joining, I was the Principal Software Architect for a hybrid cloud and polyglot persistence solution that used MongoDB, and that required leveraging MongoDB's dynamic, flexible schema to power cloud and mobile apps whose main source of data originated from many on-premise ERPs. I have also been organizing the Orange County MongoDB User Group (MUG) for almost 4 years.
I have provided my email address and my Twitter handle in case I don't cover all the questions or there are any follow-ups, so please feel free to reach out with any questions afterwards and I will make sure I find the information you are looking for.
The agenda for today's presentation: we will use a medical record example and explore its schema in MongoDB vs. relational, using embedding and referencing, and comparing against the classic 1-1, 1-M and M-M relationships. We will then jump into a performance analysis examining data and index growth, and finally explore new features in MongoDB 3.2.
We will design a schema for a medical information system, where we will need to store data for the patients, the physicians, the procedures, and many other aspects of a medical system. All this data is interrelated, and we have to assume the system will be around for many years and will grow over time.
Let's examine the data entities that are going to be part of this system.
First we have hospitals and hospitals have many physicians,
Then we have the physicians, who attend many patients, perform many procedures, and themselves belong to many hospitals.
Next the patients, who again are attended by many physicians and are the subject of many procedures.
The procedures are of course applied by a physician to a patient, inside a hospital, at a particular time, and the data produced by each procedure can vary a lot. For example, an x-ray will produce a set of data values along with an image or set of images, but a blood test will only produce data values. Each procedure has different data: this is the schema design problem.
As we can see, the main entities and their relationships may be a great fit for a relational database. But the procedure data is not. Over time, procedures will change, use new medical devices or go through improvements, and may produce even more data with more variability, and we still have to keep historical records too. This is a real challenge for a relational database.
But for MongoDB and the flexible document model, this is easy. We would model this by having the common data points that all procedures share, such as the timestamp, the physician, the patient and the hospital, and then a variable section in the JSON document for the data points unique to each procedure.
This will make great use of the polymorphic schema capabilities of MongoDB, and with modern languages, this can be modeled using base classes and extensions or inheritance.
Before we go into the modeling exercises, let's level-set on MongoDB concepts versus relational concepts. In MongoDB, data is stored in a collection, which is analogous to a table. Collections contain documents, and a document is analogous to a row or record.
More importantly, in MongoDB we think about what data we need to use and how it will be used, versus how the data will be stored. In MongoDB we look at queries to guide schema design decisions, whereas in relational we model first, then answer questions, and eventually add indexes and, in some cases, denormalize data to support queries and performance over time.
Another difference is that in MongoDB fields have many dimensions, versus just two (rows and columns). Each field can contain 0, 1 or many values, such as an array, or even embedded sub-documents, and the type can vary from document to document, versus a single value of a pre-defined type. I can also query any field, at any level in the document, versus a single flat column.
Okay, so when we start modeling data, the first thing to avoid is thinking of every single little thing we may not use immediately, which usually leads to complex, over-normalized schemas. Do not jump straight into 3rd-normal-form modeling and create hundreds of tables, with join tables for M-M relationships and all kinds of entities, which will be very difficult to join, will slow down performance and will be hard to maintain over time.
Instead what we do is create rich data structures that are single documents. As you can see in this example we have many fields about a patient, where they live, what professions they practice, a list of the physicians they're currently seeing, when was the last visit, etc. So I can get a quick view of a patient in a single document.
Now we have talked about MongoDB having strong data types, such as strings and numbers, but we also have more advanced data types such as coordinates, and arrays of other sub-documents
In MongoDB I can query and index on any number of fields at any level, and the document is already in object form, so I don't need an ORM layer like Hibernate or Entity Framework to translate data from relational to objects; the data is already an object.
There are two ways to model relationships: referencing and embedding. Referencing is a very relational-like approach where IDs are duplicated across collections. But take into account that MongoDB does not enforce foreign-key constraints, so if you delete a master document you will likely end up with orphans, and this has to be handled at the application level. Embedding is more natural to MongoDB: it works by nesting data inside a single document. There may or may not be a need to generate an ID for nested data, but there is certainly no need to duplicate IDs, as everything lives together.
So how does this apply to our medical schema? Let's look at Procedures and Results. With referencing, I could use two collections and have a relationship between them. With embedding, I could embed the results inside the procedure. Now, something to think about: which of these two is the strong entity and which is the weak one? Clearly Results is the weak entity, as it cannot exist without a Procedure.
Here is how the referencing approach would look. Obtaining all the data I need requires two reads and two round trips to the database. Notice we have placed the result ID in the procedure. Why? Because my application will display procedures and their results; this way I only need to read the Procedures collection and then look up the Result document by its ID, and I can perform this lookup in the application layer.
And to give you a hint about the latter section of the presentation, with MongoDB 3.2 I can use the $lookup pipeline stage to perform what is essentially a left-outer join performed at the database layer.
Take a second to think about this design: using classic relational modeling and considering the strong and weak entities, I would probably have placed the ProcedureID in the Results. But then I would need to create an additional index, which costs disk and memory.
However, with the Embedding approach, this is quite easy to model and getting my data requires a single read and a single roundtrip to the database.
So the advantages of embedding are that I can retrieve all relevant information by reading a single document. I don't have to implement any joins in my application code, and when I update or insert data, it is a single atomic operation. Consider that MongoDB, at this time, does not offer multi-document, multi-collection transactions.
Let’s talk about Atomicity for a bit.
In a single database command, we can update many fields, or the whole document. If there are concurrent reads and writes to the same document, the application will see the document before or after the update, but never in between. So a single update statement can alter either the complete document or parts of it, as we see in this example, and that is atomic.
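A minimal sketch of such a single-document atomic update, built as the update document you would pass to `updateOne` in mongosh or a driver. The specific fields changed here are illustrative (they follow the earlier patient example), not from the deck:

```javascript
// One update document can change several scalar fields AND append to an
// embedded array; MongoDB applies all of it atomically to the one document.
// In mongosh: db.patients.updateOne({ _id: 2 }, update)
const update = {
  $set:  { last: "Patient", "addr.city": "Austin" },    // change several fields…
  $push: { procedures: { id: 12347, type: "MRI" } }     // …and append to an array
};
// Concurrent readers see the patient either before or after this update,
// never with the $set applied and the $push still pending.
```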
But what is not possible, up to MongoDB 3.2, is to do multi-document transactions. You cannot begin a transaction, perform operations and then either commit or rollback.
What you may have guessed already, is that Embedding takes advantage of mongodb’s document-level atomicity
But there are limitations. A large document costs more overhead, and there is a 16 MB limit, although 16 MB of JSON is a considerable amount of data. So larger documents can cost more to read and update, especially if most of the data does not change.
The exact opposite of embedding
Avoid duplication (1-M)
Always look at embedding first, and then prove that embedding doesn’t work
Can always query on any embedded information
Careful: extra large documents, or embedded data not accessed frequently
Mixed or Hybrid approach – reference to keep master data, but also embed to store the latest or most-used data for speed
Avoid join tables! – what is a join table? A list of key pairs that relate two independent entities
In MongoDB we have arrays
The relationship can be done as embedded or referencing
Using embedding, arrays can be used. Data duplication will happen, and this is not as bad an idea as it is in relational. Notice how we are denormalizing some of the fields we need most often (like the doctors' names) and can still satisfy our queries very fast.
The downside: if the fields we duplicate change, then we do have maintenance work or stale data. So take into account which fields will most likely not change, such as a doctor's name.
What to do if the fields change often?
If the fields change quite often, then perhaps we could revert to Referencing, knowing we may need to hit the DB multiple times.
Decision is really dependent on your application
Fast queries
Atomic updates
Data maintenance when duplication - How often does data change?
Read or Write intensive?
Let’s look at Patients and Procedures
Hypothetically decided to always use Referencing
Look at queries – find all patients from a state that have had a particular procedure
Very difficult query!! Bad performance
Query the Patients collection for New Hampshire – get the patient IDs
Now go against all procedures of type X-ray for those patient IDs – join code in the application
Referencing and embedding
Contains the Type of the Procedure
Can now embed a small amount of procedure info and execute in a single query
If the “Chest X-Ray” name changes, we have to change it everywhere – but it very seldom changes, maybe once a decade!
Tons of data into MongoDB every second
Patients pulse, heart pressure, from which device, when, etc.
The schema is easy if we create a record per event – easy, but let's analyze the consequences.
Millions of records accumulate very quickly, and a lot of the same data repeats, e.g. the device ID, the patient ID, and most of the timestamp.
Index space will grow significantly, and operations and queries will be expensive too!
Store one document per hour! Vs. 1 doc per minute
Each doc will contain 60 mins of data
bp is effectively a two-dimensional array (one [systolic, diastolic] pair per minute)
In general, an update is less costly than an insert, in this case we are creating less write workload by doing more updates than inserts
Graph 1 day of activity
Substantially less IOPS for Read, which means reading is faster
Order-of-magnitude differences when planning 1 year's worth of data!
ALWAYS consider what indexes are needed, and the size of those indexes
Consider the hardware needed! Servers with 100s of GB of RAM are easier to come by than TB-sized ones – same for disk space
Use mgenerate to model data to see actual data sizes!
Quickly identify your slow-running queries.
Part of MongoDB Ops Manager, the Visual Query Profiler displays how query and write latency vary over time
With the click of a button, the Visual Query Profiler consolidates and displays metrics from all your nodes on a single screen
Let’s go back to this example from earlier, and imagine that the Procedure Name changes quite often, and we have decided to reference instead of embed.
But I also want a view of the data that has just the patient and his/her procedures, but not the physicians
Using $lookup I can do this
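A sketch of the `$lookup` aggregation stage (new in 3.2) that performs this left outer join at the database layer. The collection and field names here (`procedures`, `procedureIds`) are assumed for illustration; built as a plain pipeline array since no live server is assumed:

```javascript
// In mongosh: db.patients.aggregate(pipeline)
const pipeline = [
  { $match: { _id: 2 } },                  // the patient we want
  { $lookup: {
      from: "procedures",                  // referenced collection
      localField: "procedureIds",          // ids stored on the patient (assumed name)
      foreignField: "_id",                 // matched against procedure _ids
      as: "procedures"                     // joined documents land in this array
  } },
  { $project: { first: 1, last: 1, procedures: 1 } }  // patient + procedures, no physicians
];
```

Patients with no matching procedures still come back (with an empty `procedures` array), which is what makes this a left outer join rather than an inner join.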
Finally, when the schema is done, working and performing well, and I am in production, I may want to lock it down. I can do this with Document Validation in 3.2.