
No More SQL: Migrating a Data Repository from a Traditional Relational Database to MongoDB

Published in: Technology

  1. No More SQL: A chronicle of moving a data repository from a traditional relational database to MongoDB
  2. Who Am I?
     ● Database Architect at Copyright Clearance Center
     ● Oracle Certified Professional
     ● Many years of database development and administration
     ● Learning to love “polyglot persistence”
  3. What is Copyright Clearance Center?
     "Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world’s most sought-after books, journals, blogs, movies and more. Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works."
  4. What I want to talk about today
     ● Not application design, but data design issues
     ● Also data management issues
     ● Our experience in moving away from the "legacy" relational way of doing things
     ● These experiences come from one large project; your mileage may vary
  5. What Do I Mean By Data Management?
     ● Topics like naming conventions, data element definitions
     ● Data models
     ● Data quality
     ● Archive, purge, retention, backups
  6. Where we started
     ● 200+ tables in an Oracle relational database
     ● A smaller core set of tables, but many supporting tables
     ● 2.5 TB total (including TEMP space, etc.)
     ● Many PL/SQL packages and procedures
     ● Solr for search
  7. Today
     ● We use MongoDB in several products
     ● The one I'll talk about today is our largest MongoDB database (> 2 TB)
  8. What options did we have in the past for scaling?
     ● At the database layer, few
     ● Clustering ($$)
     ● So, we emphasized scaling at the application tier
     ● We wanted to be able to scale out the database tier in a low-cost way
  9. What kind of data?
     ● "Work" data, primarily books, articles, journals
     ● Associated metadata
       – Publisher, author, etc.
  10. Application characteristics
      ● Most queries are reads via Solr index
      ● Database access is needed for additional metadata not stored in Solr
      ● Custom matching algorithms
      ● Database updates are done in bulk (loading)
      ● Loads of data come from third-party providers
      ● On top of this we've built many reports
  11. Here's what the core data model looked like: highly normalized
  12. Where we are today
      ● 12 MongoDB shards x 200 GB (2.4 TB) MongoDB database
      ● Replica sets, primarily for backup (more about that later)
      ● JEE application (no stored procedure code)
      ● Solr for search
  13. What motivated us?
      ● Downtime every time we made even the simplest database schema update
      ● The data model was not appropriate for our use case
        – Bulk loading
        – Read-mostly (few updates)
        – We want to be able to see most of a "work's" metadata at once
        – This led to many joins, given our normalized data model
  14. More motivators
      ● Every data loader required custom coding
      ● The business users wanted more control over adding data to the data model “on-the-fly” (e.g., a new data provider with added metadata)
      ● This would be nearly impossible using a relational database
      ● MongoDB's flexible schema model is perfect for this use!
  15. What were our constraints?
      ● Originally, we wanted to revamp the nature of how we represent a work
      ● Our idea was to construct a work made up of varying data sources => a “canonical” work
      ● But, as so often happens, time was not on our side
  16. We needed to reverse-engineer functionality
      ● This meant we needed to translate the relational structures
      ● We probably didn't take full advantage of a document-oriented database
      ● The entire team was more familiar with the relational model
  17. We came up with a single JSON document
      ● We weighed the usual issues:
        – Embedding vs. linking
      ● Several books touch on this topic, as does the MongoDB manual
        – One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media.
  18. We favored embedding
      ● "Child" tables became "child" documents
      ● This seemed the most natural translation of relational to document
      ● But, this led to larger documents
      ● Lesson:
        – We could have used linking more
  19. Example: one-to-one relationship
  20. In MongoDB
      work...
      "publicationCountry" : {
        "country_code" : "CHE",
        "country_description" : "Switzerland"
      }
  21. Example: one-to-many relationship
  22. In MongoDB
      An array of “work contributors”
      "work_contributor" : [
        {
          "contributorName" : "Ballauri, Jorgji S.",
          "contributorRoleDescr" : "Author"
        }
      ]
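The two slide examples can be tied together with a small sketch of how "child" rows might be folded into one document. This is only an illustration in plain Python: the row layouts, column names, the `build_work_document` helper and the work title are all hypothetical, and only the document keys (`publicationCountry`, `work_contributor`, and so on) come from the slides.

```python
# Hypothetical relational rows; table and column names are assumptions.
work_row = {"work_id": 1, "title": "Example Work", "country_code": "CHE"}
country_rows = {"CHE": "Switzerland"}
contributor_rows = [
    {"work_id": 1, "name": "Ballauri, Jorgji S.", "role": "Author"},
]


def build_work_document(work, countries, contributors):
    """Fold the one-to-one and one-to-many child rows into one document."""
    return {
        "title": work["title"],
        # One-to-one: the lookup row becomes an embedded sub-document.
        "publicationCountry": {
            "country_code": work["country_code"],
            "country_description": countries[work["country_code"]],
        },
        # One-to-many: the child rows become an array of sub-documents.
        "work_contributor": [
            {"contributorName": c["name"], "contributorRoleDescr": c["role"]}
            for c in contributors
            if c["work_id"] == work["work_id"]
        ],
    }


doc = build_work_document(work_row, country_rows, contributor_rows)
```

The resulting `doc` has the same shape as the two fragments above, which is why this translation felt natural coming from a normalized model.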
  23. When embedding...
      ● Consider the resulting size of your documents
      ● Embedding is akin to denormalization in the relational world
      ● Denormalization is not always the answer (even for RDBMS)!
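To make the size trade-off concrete, here is a rough sketch that contrasts an embedded contributor with a linked one. It uses plain Python dictionaries rather than a real database; the `contributor_id` value `c1`, the `work_contributor_ids` key and both helpers are assumptions, and the byte counts are for compact JSON, which only approximates stored document size.

```python
import json

# Hypothetical contributor "collection", keyed by an assumed contributor id.
contributors = {
    "c1": {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
}

# Embedding copies the contributor into the work document.
embedded_work = {"work_contributor": [contributors["c1"]]}
# Linking stores only a reference to it.
linked_work = {"work_contributor_ids": ["c1"]}


def resolve_contributors(work, collection):
    """Follow the references; against a database this is an extra lookup."""
    return [collection[cid] for cid in work["work_contributor_ids"]]


def json_size(doc):
    """Rough size proxy: bytes of compact JSON, not the stored size."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


embedded_size = json_size(embedded_work)
linked_size = json_size(linked_work)
```

The linked form keeps the work document smaller, at the cost of the extra lookup in `resolve_contributors`.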
  24. Data migration from our relational database
      ● Wrote a custom series of ETL processes
      ● Combined Talend Data Integration and custom-built code
      ● Leveraged our new loader program
  25. But...we still had to talk to a relational database
      ● The legacy relational database became a reporting and batch-process database (at least for now)
      ● Data from our new MongoDB system of record needed to be synced with the relational database
      ● Wrote a custom process
      ● Transform the JSON structure back to relational tables
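The slides do not show the custom sync process, so the following is only a guess at its general shape: one document is split back into a parent row and child rows. The target tables, column names, the `seq` column and the `document_to_rows` helper are all hypothetical.

```python
def document_to_rows(work_id, doc):
    """Split one work document into rows for two hypothetical tables."""
    country = doc.get("publicationCountry", {})
    # Parent row: the embedded one-to-one sub-document collapses to a code.
    work_row = {
        "work_id": work_id,
        "country_code": country.get("country_code"),
    }
    # Child rows: one per array element, with a sequence number as part of the key.
    contributor_rows = [
        {
            "work_id": work_id,
            "seq": seq,
            "contributor_name": c.get("contributorName"),
            "contributor_role": c.get("contributorRoleDescr"),
        }
        for seq, c in enumerate(doc.get("work_contributor", []), start=1)
    ]
    return work_row, contributor_rows


doc = {
    "publicationCountry": {"country_code": "CHE", "country_description": "Switzerland"},
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
    ],
}
work_row, contributor_rows = document_to_rows(42, doc)
```

Using `.get` throughout reflects the flexible schema: a document may lack a key that the relational side still expects as a (nullable) column.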
  26. Lessons Learned
      ● Document size is key!
      ● The data management practices you're used to from the relational world must be adapted; example: key names
      ● In the relational world, we favor longer names
      ● We found that large key names were causing us pain
        – We're not the first: see the “On shortened field names in MongoDB” blog post
        – But this goes against “good” relational database naming practices (e.g., longer column names are self-documenting)
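Key names are stored with every document, so long names are paid for again and again. The sketch below estimates the effect by renaming keys and comparing compact-JSON byte counts; the short names in `SHORT_KEYS` and both helpers are made up for illustration, and JSON size is only a stand-in for what the database actually stores.

```python
import json

# Hypothetical mapping from long key names to short ones.
SHORT_KEYS = {
    "work_contributor": "wc",
    "contributorName": "n",
    "contributorRoleDescr": "r",
}


def shorten_keys(value, mapping):
    """Recursively rename keys in nested dicts and lists."""
    if isinstance(value, dict):
        return {mapping.get(k, k): shorten_keys(v, mapping) for k, v in value.items()}
    if isinstance(value, list):
        return [shorten_keys(v, mapping) for v in value]
    return value


def json_size(doc):
    """Rough size proxy: bytes of compact JSON, not the stored size."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


long_doc = {
    "work_contributor": [
        {"contributorName": "Ballauri, Jorgji S.", "contributorRoleDescr": "Author"},
    ]
}
short_doc = shorten_keys(long_doc, SHORT_KEYS)
saved_bytes = json_size(long_doc) - json_size(short_doc)
```

The saving per document is small, but it is multiplied by every document and every array element, and it comes at the cost of names that no longer document themselves.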
  27. More Lessons Learned
      ● Nesting of keys was painful (workItemValues.work_createdUser.rawValue)
      ● Our way of using Spring Data introduced its own problems
        – “scaffolding”
      ● Backups at this scale are challenging!
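Part of the pain with deeply nested keys is that every read has to walk several levels and cope with any of them being absent. A minimal sketch of a dotted-path helper, assuming documents are plain nested dictionaries; `get_path` and the sample value `"loader"` are hypothetical, and only the path itself comes from the slide.

```python
def get_path(doc, dotted_path, default=None):
    """Walk a nested dict along a dotted path, returning default if a level is missing."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical document fragment shaped like the path on the slide.
doc = {"workItemValues": {"work_createdUser": {"rawValue": "loader"}}}

created_by = get_path(doc, "workItemValues.work_createdUser.rawValue")
missing = get_path(doc, "workItemValues.work_updatedUser.rawValue")
```

A flatter document would make such helpers unnecessary for most reads.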
  28. Another Lesson: Non/Semi-Technical Users
      ● For example, business analysts, product owners
      ● Many know and like SQL
      ● Many don't understand a document-oriented database
      ● Engineering spent a lot of time and effort in raising the comfort level
        – This was not universally successful
  29. How to communicate structure?
  30. Communicating Structure
      ● Mind map was helpful initially
      ● Difficult to maintain
      ● JSON Schema also useful, but also cumbersome to maintain
  31. Information Management
      ● JSON Schema
      ● JSON Schema.net
      ● Was used by QA and other teams for supporting tools
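As an illustration of how a JSON Schema can double as documentation and as input to supporting tools, here is a small sketch. The schema is hypothetical (it is not the one used on the project), and `check` is a hand-rolled checker covering only the `type`, `required`, `properties` and `items` keywords; a real tool would use a full JSON Schema validator.

```python
# Hypothetical schema for a fragment of the work document.
WORK_SCHEMA = {
    "type": "object",
    "required": ["publicationCountry"],
    "properties": {
        "publicationCountry": {
            "type": "object",
            "required": ["country_code"],
            "properties": {"country_code": {"type": "string"}},
        },
        "work_contributor": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"contributorName": {"type": "string"}},
            },
        },
    },
}

PY_TYPES = {"object": dict, "array": list, "string": str}


def check(value, schema, path="$"):
    """Return a list of problems; covers only type, required, properties, items."""
    expected = PY_TYPES.get(schema.get("type"))
    if expected is not None and not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}"]
    problems = []
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}: missing required key")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                problems.extend(check(value[key], sub_schema, f"{path}.{key}"))
    if isinstance(value, list) and "items" in schema:
        for index, item in enumerate(value):
            problems.extend(check(item, schema["items"], f"{path}[{index}]"))
    return problems


good = {"publicationCountry": {"country_code": "CHE"}, "work_contributor": []}
bad = {"work_contributor": [{"contributorName": 7}]}
```

Calling `check(good, WORK_SCHEMA)` yields no problems, while `check(bad, WORK_SCHEMA)` reports the missing `publicationCountry` key and the non-string contributor name.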
  32. Next Steps/Challenges
      ● Investigating on-disk (file system) compression
        – Very promising so far
      ● Can we be more "document-oriented"?
        – Remove vestiges of relational data models
      ● Implement an archiving and purging strategy
  33. Vote for these JIRA Items!
      ● “Option to store data compressed”
      ● “Bulk insert is slow in sharded environment”
      ● “Tokenize the field names”
      ● “Increase max document size to at least 64mb”
      ● “Collection level locking”
  34. Thanks!
      ● Twitter: @GlennRStreet
      ● Blog: http://glennstreet.net/
