2. Index
• What Is a Document?
• Documents and Key-Value Pairs
• Managing Multiple Documents in Collections
• Avoid Explicit Schema Definitions
• Basic Operations on Document Databases
• Summary
2
3. What Is a Document?
• Unlike relational databases, document databases do not require you to define a common
structure for all records in the data store. Document databases, however, do have some similar
features to relational databases.
• For example, it is possible to query and filter collections of documents much as you would rows in
a relational table.
• A document is a set of key-value pairs.
• Keys are represented as strings of characters.
• Values may be basic data types (such as numbers, strings, and Booleans) or structures (such as
arrays and objects).
• Documents contain both structure information and data.
• The name in a name-value pair indicates an attribute and the value in a name-value pair is the
data assigned to that attribute.
• JSON and XML are two formats commonly used to define documents.Binary JSON, or BSON, is a
binary representation of JSON objects and is another method for specifying documents.
3
4. The Structure of JSON objects are constructed using several simple
syntax rules:
Data is organized in key-value pairs, similar to key-value databases.
Documents consist of name-value pairs separated by commas.
Documents start with a { and end with a }.
Names are strings, such as "customer_id" and "address".
Values can be numbers, strings, Booleans (true or false), arrays,
objects, or the NULL value.
The values of arrays are listed within square brackets, that is [ and ].
The values of objects are listed as key-value pairs within curly
brackets, that is, { and }.
4
5. Let’s consider a simple example of a customer record that tracks the customer
ID, name, address, first order date, and last order date. Using JavaScript Object
Notation (JSON), a sample customer record is:
{
“customer_id”:187693,
“name”: “Kiera Brown”,
“address” : {
“street” : “1232 Sandy Blvd.”,
“city” : “Vancouver”,
“state” : “Washington”,
“zip” : “99121”
},
“first_order” : “01/15/2013”,
“last_order” : ” 06/27/2014”
}
5
6. The same information in the preceding example is represented in XML as
follows:
<customer_record>
<customer_id>187693</customer_id>
<name>“Kiera Brown”</name>
<address>
<street>“1232 Sandy Blvd.”</street>
<city>“Vancouver”</city>
<state>“Washington”</state>
<zip>“99121”</zip>
</address>
<first_order>“01/15/2013”</first_order>
<last_order>“06/27/2014”</last_order>
</customer_record>
6
7. Documents and Key-Value Pairs
• Documents, like relational tables, organize
multiple attributes in a single object.
• This allows database developers to more
easily implement common requirements,
such as returning all attributes of an entity
based on a filter applied to one of the
attributes.
• For example, in one step you could filter a
list of customer documents to identify those
whose last purchase was at least six months
ago and return their IDs, names, and
addresses.
• Document databases require less code than
key-value data stores to query multiple
attributes.
7
8. Managing Multiple Documents in Collections
The full potential of document databases becomes apparent when you work with large numbers of
documents.
Documents are generally grouped into collections of similar documents.
One of the key parts of modeling document databases is deciding how you will organize your documents
into collections.
Getting Started with Collections:
• Collections can be thought of as lists of documents.
• Document database designers optimize document databases to quickly add, remove, update, and search for
documents.
• They are also designed for scalability, so as your document collection grows, you can add more servers to
your cluster to keep up with demands for your database.
• It is important to note that documents in the same collection do not need to have identical structures, but
they should share some common structures.
• One of the advantages of document databases is that they provide flexibility when it comes to the structure
of documents.
• For example, A collection of two documents with similar structures given as
8
10. Tips on Designing Collections
Collections are sets of documents.
Because collections do not impose a consistent structure on documents in the collection,
it is possible to store different types of documents in a single collection.
You could, for example, store customer, web clickstream data, and server log data in the
same collection. In practice, this is not advisable.
In general, collections should store documents about the same type of entity.
Avoid Highly Abstract Entity Types
A system event entity such as this is probably too abstract for practical modeling.
This is because web clickstream data and server log data will have few common fields.
They may share an ID field and a time stamp but few other attributes.
Notice how dissimilar web clickstream data is from server log data:
10
12. • If you were to store these two document types in a single collection, you would likely need to add a type indicator so your code could easily
distinguish between a web clickstream document and a server log event document. In the preceding example, the documents would be modified to
include a type indicator:
{
“id” : 12334578,
“datetime” : “201409182210”,
“doc_type”: “click_stream”,
“session_num” : 987943,
“client_IP_addr” : “192.168.10.10”,
“user_agent” : “Mozilla / 5.0”,
“referring_page” : “http://www.example.com/page1”
}
{
“id” : 31244578,
“datetime” : “201409172140”
“doc_type” : “server_log”
“event_type” : “add_user”
“server_IP_addr” : “192.168.11.11”
“descr” : “User jones added with sudo privileges”
}
12
13. • Filtering collections is often slower than
working directly with multiple
collections, each of which contains a
single document type.
• Consider if you had a system event
collection with 1 million documents:
650,000 clickstream documents and
350,000 server log events.
• Because both types of events are added
over time, the document collection will
likely store a mix of clickstream and
server log documents in close proximity
to each other.
• If you are using disk storage, you will
likely retrieve blocks of data that contain
both clickstream and server log
documents. This will adversely impact
performance.
Figure 1 : Mixing document types can lead to multiple document
types in a disk data block. This can lead to inefficiencies because
data is read from disk but not used by the application that filters
documents based on type.
13
14. Watch for Separate Functions for Manipulating
Different Document Types:
• Another clue that a collection should be
broken into multiple collections is your
application code.
• The application code that manipulates a
collection should have substantial amounts of
code that apply to all documents and some
amount of code that accommodates
specialized fields in some documents.
• For example, most of the code you would
write to insert, update, and delete documents
in the customer collection would apply to all
documents.
• You would probably have additional code to
handle loyalty and discount fields that would
apply to only a subset of all documents.
Figure: High-level branching in functions manipulating
documents can indicate a need to create separate collections.
Branching at lower levels is common when some documents
have optional attributes.
14
15. Use Document Subtypes When Entities Are Frequently Aggregated or
Share Substantial Code:
Avoid overly abstract document types. If you find yourself writing
separate code to process different document subtypes, you should
consider separating the types into different collections.
Poor collection design can adversely affect performance and slow your
application.
There are cases where it makes sense to group somewhat dissimilar
objects (for example, small kitchen appliances and books) if they are
treated as similar (for example, they are all products) in your
application.
Queries can help guide the organization of documents and collections
can be explained through an example given as:
15
16. e.g If in addition to tracking customers and their
clickstream data, you would like to track the products
customers have ordered.
You have decided the first step in this process is to
create a document collection containing all products,
which for our purposes, includes books, CDs, and small
kitchen appliances.
There are only three types of products now, but your
client is growing and will likely expand into other
product types as well.
All of the products have the following information
associated with them:
• Product name
• Short description
• SKU (stock keeping unit)
• Product dimensions
• Shipping weight
• Average customer review score
• Standard price to customer
• Cost of product from supplier
Each of the product types will have specific fields. Books
will have fields with information about:
• Author name
• Publisher
• Year of publication
• Page count
The CDs will have the following fields:
• Artist name
• Producer name
• Number of tracks
• Total playing time
The small kitchen appliances will have the following
fields:
• Color
• Voltage
• Style
16
17. Figure 2: When is a toaster the same as a database
design book? When they are treated the same by
application queries. Queries can help guide the
organization of documents and collections.
How should you go about deciding how to organize this data
into one or more document collections? Start with how the
data will be used. Your client might tell you that she needs
to be able to answer the following queries:
• What is the average number of products bought by each
customer?
• What is the range of number of products purchased by
customers (that is, the lowest
number to the highest number of products purchased)?
• What are the top 20 most popular products by customer
state?
• What is the average value of sales by customer state (that is,
Standard price to
customer – Cost of product from supplier)?
• How many of each type of product were sold in the last 30
days?
All the queries use data from all product types, and only the
last query subtotals the number of products sold by type. This
is a good indication that all the products should be in a single
document collection (see Figure ).
17
18. Avoid Explicit Schema Definitions
• A schema is a specification that describes the structure of an object, such as a table. A
pseudoschema specification for the customer record discussed above is
CREATE TABLE customer (
customer_ID integer,
name varchar(100),
street varchar(100),
city varchar(100),
state varchar(2),
zip varchar(5),
first_purchase_date date,
last_purchase_date date
)
18
19. • Document databases do not require this
formal definition step. Instead, developers can
create collections and documents in
collections by simply inserting them into the
database.
• Document databases are designed to
accommodate variations in documents within
a collection. Because any document could be
different from all previously inserted
documents.
• This freedom from the need to predefine the
database structure is captured in the term
often used to describe document databases:
schemaless.
• Polymorphic schema is another term that
describes document databases. Polymorphic is
derived from Latin and literally means “many
shapes.” This makes it an apt description for a
document database that supports multiple
types of documents in a single collection.
Figure 3: Relational databases require an
explicitly defined schema, but document
databases do not.
19
20. Basic Operations on Document Databases
The basic operations on document databases are the same as other types of
databases and include the following:
• Inserting
• Deleting
• Updating
• Retrieving
There is no standard data manipulation language used across document
databases to perform these operations.
The database is the container for collections and containers are for
documents. The logical relationship between these three data structures is
shown in:
20
21. Figure : The database is the highest-level logical structure in a
document database and contains collections of documents.
21
22. Inserting Documents into a Collection:
Collections are objects that can perform a number of different operations.
The insert method adds documents to a collection. For example, the following
adds a single document describing a book by Kurt Vonnegut to the books
collection:
db.books.insert( {“title”:” Mother Night”, “author”: “Kurt Vonnegut, Jr.”} )
Instead of simply adding a book document with the title and author name, a
better option would be to include a unique identifier as in
db.books.insert( {book_id: 1298747, “title”:“Mother Night”, “author”: “Kurt
Vonnegut, Jr.”} )
In many cases, it is more efficient to perform bulk inserts instead of a series of
individual inserts. For example, the following three insert commands could be
used to add three books to the book collection:
22
24. Deleting Documents from a Collection
• You can delete documents from a collection using the remove methods. The following
command deletes all documents in the books collection:
db.books.remove()
• To delete the book titled Science and the Modern World, you would issue the following
command:
db.books.remove({“book_id”: 639397})
Updating Documents in a Collection
• Once a document has been inserted into a collection, it can be modified with the update
method. The update method requires two parameters to update:
• Document query Set of keys and values to update MongoDB uses the $set operator to
specify which keys and values should be updated.
• For example, the following command adds the quantity key with the value of 10 to the
collection:
db.books.update ({“book_id”: 1298747}, {$set {“quantity” : 10 }})
24
25. • MongoDB has an increment operator ($inc), which is used to increase
the value of a key by the specified amount. The following command
would change the quantity of Mother Night from 10 to 15:
db.books.update ({“book_id”: 1298747}, {$inc {“quantity” : 5 }})
Retrieving Documents from a Collection
• The find method is used to retrieve documents from a collection.
• The following command matches all documents in a collection:
• db.books.find()
• For example, the following returns all books by Kurt Vonnegut, Jr.:
db.books.find({“author”: “Kurt Vonnegut, Jr.”})
• To retrieve all books with a quantity greater than or equal to 10 and
less than 50, you could use the following command:
Db.books.find( {“quantity” : {“$gte” : 10, “$lt” : 50 }} )
25
26. The conditionals and Booleans supported in MongoDB include the
following:
• $lt—Less than
• $let—Less than or equal to
• $gt—Greater than
• $gte—Greater than or equal to
• $in—Query for values of a single key
• $or—Query for values in multiple keys
• $not—Negation
26
27. Summary
• Documents are flexible data structures.
• They do not require predefined schemas and they readily
accommodate variations in the structure of documents.
• Documents are organized into related sets called collections.
Collections are analogous to a relational table, and documents are
analogous to rows in a relational table.
• Collections should contain similar types of entities, although the keys
and values may differ across documents.
• Unlike relational databases, there is not a standard query language for
document databases.
27