No More SQL
A chronicle of moving a data repository from a traditional relational database to MongoDB

Glenn Street
Database Architect, Copyright Clearance Center
Who am I?
● Database Architect at Copyright Clearance Center
● Oracle Certified Professional
● Many years of database development and administration
● Learning to embrace “polyglot persistence”
● Been working with MongoDB since version 1.6
What is Copyright Clearance Center?
"Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world’s most sought-after books, journals, blogs, movies and more.
Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works."
www.copyright.com
What I want to talk about today
● Not application design, but data management issues
● Our experience in moving away from the "legacy" relational way of doing things
● These experiences come from one large project
What do I mean by “data management”?
● Topics like naming conventions and data element definitions
● Data modeling
● Data integration
● Talking to legacy (relational) databases
● Archive, purge, retention, backups
Where we started
● 200+ tables in a relational database
● A smaller core set of tables, but many supporting tables
● 2.5 TB total (including TEMP space, etc.)
● Many PL/SQL packages and procedures
● Solr for search
Today
● We use MongoDB in several products
● The one I'll talk about today is our largest MongoDB database (> 2 TB)
● Live in production end of September
What options did we have in the past for horizontal scaling?
● At the database layer, few
● Clustering ($$)
● So, we emphasized scaling at the application tier
● We wanted to be able to scale out the database tier in a low-cost way
What kind of data?
● "Work" data, primarily books, articles, journals
● Associated metadata
  – Publisher, author, etc.
Application characteristics
● Most queries are reads via the Solr index
● Database access is needed for additional metadata not stored in Solr
● Custom matching algorithms for data loads
● Database updates are done in bulk (loading)
● Loads of data come from third-party providers
● On top of this we've built many reports, canned and ad hoc
Here's what the core data model looked like: highly normalized
Where we are today
● 12 MongoDB shards x 200 GB each, a 2.4 TB MongoDB database (a sharding sketch follows)
● Replica sets, including hidden members for backup (more about that later)
● GridFS for data to be loaded
● MMS for monitoring
● JEE application (no stored procedure code)
● Solr for search
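Purely as an illustration of what a sharded setup like this involves, here is a minimal sketch of enabling sharding through a mongos router with pymongo. The host, database, collection, and shard key names are hypothetical; the slide does not show our actual shard key or topology.

# Minimal sketch: enable sharding for a database and a collection
# through a mongos router. All names ("mongos-host", "worksdb",
# "works", "work_id") are hypothetical, not our production values.
from pymongo import MongoClient

client = MongoClient("mongos-host", 27017)  # connect to a mongos router

# Allow collections in this database to be distributed across shards
client.admin.command("enableSharding", "worksdb")

# Shard the collection on a hypothetical key
client.admin.command("shardCollection", "worksdb.works", key={"work_id": 1})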
What motivated us?
● Downtime every time we made even the simplest database schema update
● The data model was not appropriate for our use case
  – Bulk loading (very poor performance)
  – Read-mostly (few updates)
  – We want to be able to see most of a "work's" metadata at once
  – This led to many joins, given our normalized data model
More motivators
● Every data loader required custom coding
● The business users wanted more control over adding data to the data model “on the fly” (e.g., a new data provider with added metadata)
● This would be nearly impossible using a relational database
● MongoDB's flexible schema model is perfect for this use!
What were our constraints?
● Originally, we wanted to revamp the nature of how we represent a work
● Our idea was to construct a work made up of varying data sources, a “canonical” work
● But, as so often happens, time the avenger was not on our side
We needed to reverse-engineer functionality
● This meant we needed to translate the relational structures
● We probably didn't take full advantage of a document-oriented database
● The entire team was more familiar with the relational model
● Lesson:
  – Help your entire team get into the polyglot persistence mindset
We came up with a single JSON document
● We weighed the usual issues:
  – Embedding vs. linking
● Several books touch on this topic, as does the MongoDB manual
  – One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media
We favored embedding
● "Child" tables became "child" documents
● This seemed the most natural translation of relational to document
● But, this led to larger documents
● Lesson:
  – We could have used linking more (a linking sketch follows)
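To contrast with the embedded examples on the next slides, here is a minimal sketch of the linking alternative, assuming hypothetical works and contributors collections: child rows become documents in their own collection that reference the parent work by _id.

# Minimal sketch of linking instead of embedding. Collection and
# field names ("worksdb", "works", "contributors", "work_id") are
# hypothetical.
from pymongo import MongoClient

db = MongoClient()["worksdb"]

work_id = db.works.insert_one({"title": "Some Work"}).inserted_id

db.contributors.insert_many([
    {"work_id": work_id, "contributorName": "Ballauri, Jorgji S.",
     "contributorRoleDescr": "Author"},
    {"work_id": work_id, "contributorName": "Maxwell, William",
     "contributorRoleDescr": "Editor"},
])

# Reading the work now takes a second query: the "join" moves into
# the application, but the parent documents stay small.
work = db.works.find_one({"_id": work_id})
contributors = list(db.contributors.find({"work_id": work_id}))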
Example: one-to-one relationship
In a MongoDB work document:
"publicationCountry" : {
    "country_code" : "CHE",
    "country_description" : "Switzerland"
}
Example: one-to-many relationship
In MongoDB, an array of “work contributors”:
"work_contributor" : [
    {
        "contributorName" : "Ballauri, Jorgji S.",
        "contributorRoleDescr" : "Author"
    },
    {
        "contributorName" : "Maxwell, William",
        "contributorRoleDescr" : "Editor"
    },
    ...
]
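One payoff of embedding is that queries reach into the array with dot notation, with no join. A minimal sketch, assuming a hypothetical works collection holding these documents:

# Minimal sketch: querying into the embedded contributor array.
# The "worksdb"/"works" names are hypothetical.
from pymongo import MongoClient

works = MongoClient()["worksdb"]["works"]

# Dot notation matches any element of the embedded array
authored = works.find({"work_contributor.contributorRoleDescr": "Author"})

# $elemMatch constrains several fields to the *same* array element
editors = works.find({"work_contributor": {"$elemMatch": {
    "contributorName": "Maxwell, William",
    "contributorRoleDescr": "Editor",
}}})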
When embedding...
● Consider the resulting size of your documents (a size-check sketch follows)
● Embedding is akin to denormalization in the relational world
● Denormalization is not always the answer (even for an RDBMS)!
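One cheap way to apply the first point during design: measure the BSON size of a representative document before committing to an embedding scheme. A minimal sketch with pymongo's bson package; the sample document is invented.

# Minimal sketch: check the BSON size of a candidate document.
# The sample document is invented.
import bson

doc = {
    "title": "Some Work",
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.",
         "contributorRoleDescr": "Author"},
    ] * 500,  # simulate a work with many embedded contributors
}

size_bytes = len(bson.encode(doc))  # bson.BSON.encode(doc) in older pymongo
print("document is %.1f KiB" % (size_bytes / 1024.0))
# MongoDB enforces a hard per-document cap (16 MB at the time of this
# talk), so heavily embedded documents need watching.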
Data migration from our relational database
● Wrote a custom series of ETL processes (a hypothetical sketch follows)
● Combined Talend Data Integration and custom-built code
● Also leveraged our new loader program
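The real pipeline combined Talend jobs with custom code, none of which is shown here. Purely to illustrate the general shape of one such ETL step, here is a hypothetical sketch that folds normalized rows into a single work document; every table, column, and connection name is invented.

# Hypothetical sketch of one ETL step: fold normalized relational
# rows into a single work document. sqlite3 stands in for the real
# relational source; table and column names are invented.
import sqlite3
from pymongo import MongoClient

src = sqlite3.connect("legacy.db")
works = MongoClient()["worksdb"]["works"]

for work_id, title in src.execute("SELECT work_id, title FROM work"):
    contributors = [
        {"contributorName": name, "contributorRoleDescr": role}
        for name, role in src.execute(
            "SELECT name, role FROM work_contributor WHERE work_id = ?",
            (work_id,))
    ]
    works.insert_one({
        "_id": work_id,
        "title": title,
        "work_contributor": contributors,  # child table -> child documents
    })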
But...we still had to talk to a relational database
● The legacy relational database became a reporting and batch-process database (at least for now)
● Data from our new MongoDB system of record needed to be synced with the relational database
  – Wrote a custom process to transform the JSON structure back to relational tables
● Lesson:
  – Consider relational constraints when syncing from MongoDB to a relational database
    ● We had to account for some discrepancies in field lengths (MongoDB is more flexible; a sketch follows)
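A MongoDB string field has no declared length, while the target relational column does, so the sync process needs a policy for values that would overflow. A minimal sketch of one such policy (warn and truncate); the column widths and field names are hypothetical, not what our custom process did.

# Minimal sketch: guard string values against relational column
# widths when syncing documents back to tables. Widths and field
# names are hypothetical.
COLUMN_WIDTHS = {"title": 255, "contributorName": 100}

def fit_to_column(field, value):
    """Truncate a string to its target column width, reporting overflow."""
    width = COLUMN_WIDTHS[field]
    if len(value) > width:
        print("warning: %s overflows %d chars; truncating" % (field, width))
        return value[:width]
    return value

doc = {"title": "A" * 300, "contributorName": "Ballauri, Jorgji S."}
row = {field: fit_to_column(field, value) for field, value in doc.items()}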
More Lessons Learned
● Document size is key!
● The data management practices you're used to from the relational world must be adapted; example: key names
● In the relational world, we favor longer names
● We found that large key names were causing us pain (a key-shortening sketch follows)
  – We're not the first: see the “On shortened field names in MongoDB” blog post
  – But, this goes against “good” relational database naming practices (e.g., longer column names are self-documenting)
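The reason long keys hurt: MongoDB stores every key name inside every document, so a verbose key is paid for once per document rather than once per schema. One common mitigation, sketched here with a hypothetical mapping, is to shorten keys on write and restore them on read.

# Minimal sketch: shorten key names on write and restore them on
# read. The mapping is hypothetical; MongoDB stores key names in
# every document, so this saves real space at scale.
SHORT = {"contributorName": "cn", "contributorRoleDescr": "cr"}
LONG = {short: long for long, short in SHORT.items()}

def shorten(doc):
    return {SHORT.get(key, key): value for key, value in doc.items()}

def restore(doc):
    return {LONG.get(key, key): value for key, value in doc.items()}

stored = shorten({"contributorName": "Maxwell, William",
                  "contributorRoleDescr": "Editor"})
# stored == {"cn": "Maxwell, William", "cr": "Editor"}
assert restore(stored)["contributorName"] == "Maxwell, William"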
More Lessons Learned
● Our way of using Spring Data introduced its own problems
  – “scaffolding”
● Nesting of keys for flexibility was painful
● Example: workItemValues.work_createdUser.rawValue
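Where the pain shows up: every query and every index that touches such a value must spell out the full dotted path. A minimal sketch, with a hypothetical collection and user value:

# Minimal sketch: the deeply nested key from this slide as it looks
# in queries and index definitions. Collection name and the "jsmith"
# value are hypothetical.
from pymongo import MongoClient, ASCENDING

works = MongoClient()["worksdb"]["works"]

path = "workItemValues.work_createdUser.rawValue"
docs = works.find({path: "jsmith"})      # every query repeats the full path
works.create_index([(path, ASCENDING)])  # so does every index definition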
Backups at this scale are challenging!
● Mongodump and mongoexport were too slow for our needs
● Decided on hidden replica set members on AWS
● Using filesystem snapshots for backups (a lock/snapshot/unlock sketch follows)
● Looking into the MMS Backup service
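The usual shape of a snapshot backup against a hidden member is flush-and-lock, snapshot, unlock; a minimal sketch follows. The host name and snapshot command are placeholders, and depending on journaling and storage layout the lock step may not be strictly required.

# Minimal sketch of a filesystem-snapshot backup against a hidden
# replica set member. Host name and snapshot command are
# placeholders; whether the lock is required depends on journaling
# and storage layout.
import subprocess
from pymongo import MongoClient

hidden = MongoClient("hidden-member-host", 27017, directConnection=True)

hidden.admin.command("fsync", lock=True)  # flush writes and block new ones
try:
    subprocess.run(["snapshot-data-volume.sh"], check=True)  # placeholder
finally:
    hidden.admin.command("fsyncUnlock")   # always release the lock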
Another Lesson: Non/Semi-Technical Users
● For example, business analysts, product owners
● Many know and like SQL
● Many don't understand a document-oriented database
● Engineering spent a lot of time and effort in raising the comfort level
  – This was not universally successful
● An interesting project: SQL4NoSQL
How to communicate structure?
Communicating Structure
● A mind map was helpful initially
● Difficult to maintain
JSON Schema
{"$schema": "http://json-schema.org/draft-03/schema",
 "title": "Phase I Schema",
 "description": "Describes the structure of the MongoDB database for Phase I",
 "type": "object",
 "id": "http://jsonschema.net",
 "required": false,
 "properties": {
   "_id": {
     "type": "string",
     "required": false
   },
...
JSON Schema for communicating structure
● I created a JSON Schema representation of the “work” document
  – JSON Schema
  – JSON Schema.net
● It was used by QA and other teams for supporting tools (a validation sketch follows)
● JSON Schema was also useful, but cumbersome to maintain
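One of the supporting uses a schema enables: checking a document against it programmatically. A minimal sketch with the Python jsonschema package's draft-3 validator, using an abbreviated version of the schema from the previous slide:

# Minimal sketch: validate a document against an abbreviated
# draft-03 version of the schema on the previous slide, using the
# Python "jsonschema" package.
from jsonschema import Draft3Validator

schema = {
    "$schema": "http://json-schema.org/draft-03/schema",
    "type": "object",
    "properties": {
        "_id": {"type": "string", "required": False},
    },
}

Draft3Validator(schema).validate({"_id": "abc123"})  # raises on mismatch
print("document conforms to the schema")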
Next Steps/Challenges
● Investigating on-disk (file system) compression
  – Very promising so far
● Can we be more "document-oriented"?
  – Remove vestiges of relational data models
● Implement an archiving and purging strategy
● Investigating MMS Backup
Vote for these JIRA Items!
● “Option to store data compressed”
● “Bulk insert is slow in sharded environment”
● “Tokenize the field names”
● “Increase max document size to at least 64mb”
● “Collection level locking”
Thanks!
● Twitter: @GlennRStreet
● Blog: http://glennstreet.net/
● LinkedIn: http://www.linkedin.com/in/glennrstreet/
