No More SQL
Slides from my talk at MongoDB Boston 2013 on moving an application from a relational database to MongoDB. I discussed challenges and lessons learned from this project.

  • 1. No More SQL A chronicle of moving a data repository from a traditional relational database to MongoDB Glenn Street Database Architect, Copyright Clearance Center
  • 2. Who am I? ● Database Architect at Copyright Clearance Center ● Oracle Certified Professional ● Many years of database development and administration ● Learning to embrace “polyglot persistence” ● Been working with MongoDB since version 1.6
  • 3. What is Copyright Clearance Center? "Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world’s most sought-after books, journals, blogs, movies and more. Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works."
  • 4. What I want to talk about today ● Not application design, but data management issues ● Our experience moving away from the "legacy" relational way of doing things ● These experiences come from one large project
  • 5. What do I mean by “data management”? ● Topics like naming conventions, data element definitions ● Data modeling ● Data integration ● Talking to legacy (relational) databases ● Archive, purge, retention, backups
  • 6. Where we started ● 200+ tables in a relational database ● Core set of tables fewer, but many supporting tables ● 2.5 TB total (including TEMP space, etc.) ● Many PL/SQL packages and procedures ● Solr for search
  • 7. Today ● We use MongoDB in several products ● The one I'll talk about today is our largest MongoDB database (> 2 TB) ● Live in production end of September
  • 8. What options did we have in the past for horizontal scaling? ● At the database layer, few ● Clustering ($$) ● So, we emphasized scaling at the application tier ● We wanted to be able to scale out the database tier in a low-cost way
  • 9. What kind of data? ● "Work" data, primarily books, articles, journals ● Associated metadata – Publisher, author, etc.
  • 10. Application characteristics ● Most queries are reads via the Solr index ● Database access is needed for additional metadata not stored in Solr ● Custom matching algorithms for data loads ● Database updates are done in bulk (loading) ● Loads of data come from third-party providers ● On top of this we've built many reports, canned and ad hoc
  • 11. Here's what the core data model looked like: highly normalized
  • 12. Where we are today ● A 2.4 TB MongoDB database: 12 shards × 200 GB ● Replica sets, including hidden members for backup (more about that later) ● GridFS for data to be loaded ● MMS for monitoring ● JEE application (no stored procedure code) ● Solr for search
  • 13. What motivated us? ● Downtime every time we made even the simplest database schema update ● The data model was not appropriate for our use case – Bulk loading (very poor performance) – Read-mostly (few updates) – We want to be able to see most of a "work's" metadata at once – This led to many joins, given our normalized data model
  • 14. More motivators ● Every data loader required custom coding ● The business users wanted more control over adding data to the data model "on-the-fly" (e.g., a new data provider with added metadata) ● This would be nearly impossible using a relational database ● MongoDB's flexible schema model is perfect for this use!
  • 15. What were our constraints? ● Originally, we wanted to revamp the nature of how we represent a work ● Our idea was to construct a work made up of varying data sources, a "canonical" work ● But, as so often happens, time the avenger was not on our side
  • 16. We needed to reverse-engineer functionality ● This meant we needed to translate the relational structures ● We probably didn't take full advantage of a document-oriented database ● The entire team was more familiar with the relational model ● Lesson: help your entire team get into the polyglot persistence mindset
  • 17. We came up with a single JSON document ● We weighed the usual issues: embedding vs. linking ● Several books touch on this topic, as does the MongoDB manual – One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media
  • 18. We favored embedding ● "Child" tables became "child" documents ● This seemed the most natural translation of relational to document ● But, this led to larger documents ● Lesson: we could have used linking more
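The embed-vs-link trade-off can be sketched in plain JavaScript using the work/contributor shape from this deck. The linked variant, its `contributorIds` field, and the `_id` values are illustrative assumptions, not our actual schema:

```javascript
// Embedded: contributors live inside the work document (what we chose).
const workEmbedded = {
  _id: "work-1",
  title: "Example Work",
  work_contributor: [
    { contributorName: "Ballauri, Jorgji S.", contributorRoleDescr: "Author" }
  ]
};

// Linked: contributors are separate documents referenced by _id, keeping
// the work document small at the cost of a second lookup.
const workLinked = { _id: "work-1", title: "Example Work", contributorIds: ["c-1"] };
const contributors = [
  { _id: "c-1", contributorName: "Ballauri, Jorgji S.", contributorRoleDescr: "Author" }
];

// Resolving the link: what a second query would do in MongoDB.
const resolved = workLinked.contributorIds.map(
  id => contributors.find(c => c._id === id)
);
```

Embedding reads everything in one fetch; linking keeps documents small and avoids repeating shared data, which matters once documents grow.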
  • 19. Example: one-to-one relationship
  • 20. In MongoDB work... "publicationCountry" : { "country_code" : "CHE", "country_description" : "Switzerland" }
  • 21. Example: one-to-many relationship
  • 22. In MongoDB: an array of “work contributors”
    "work_contributor" : [
      { "contributorName" : "Ballauri, Jorgji S.", "contributorRoleDescr" : "Author" },
      { "contributorName" : "Maxwell, William", "contributorRoleDescr" : "Editor" },
      ...
    ]
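One nice property of this array: MongoDB's dot notation (e.g. `db.work.find({"work_contributor.contributorName": "Maxwell, William"})`) matches a document if any element of the embedded array has that field value. A minimal plain-JavaScript simulation of that matching behavior, with sample data from the slide and a made-up `_id`:

```javascript
const works = [
  {
    _id: "work-1",
    work_contributor: [
      { contributorName: "Ballauri, Jorgji S.", contributorRoleDescr: "Author" },
      { contributorName: "Maxwell, William", contributorRoleDescr: "Editor" }
    ]
  },
  { _id: "work-2", work_contributor: [] }
];

// A work matches if ANY element of its contributor array has the name.
function findByContributor(docs, name) {
  return docs.filter(w =>
    w.work_contributor.some(c => c.contributorName === name));
}

const hits = findByContributor(works, "Maxwell, William");
// hits contains only work-1
```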
  • 23. When embedding... ● Consider the resulting size of your documents ● Embedding is akin to denormalization in the relational world ● Denormalization is not always the answer (even for an RDBMS)!
  • 24. Data migration from our relational database ● Wrote a custom series of ETL processes ● Combined Talend Data Integration and custom-built code ● Also leveraged our new loader program
  • 25. But...we still had to talk to a relational database ● The legacy relational database became a reporting and batch-process database (at least for now) ● Data from our new MongoDB system of record needed to be synced with the relational database – Wrote a custom process to transform the JSON structure back to relational tables ● Lesson: consider relational constraints when syncing from MongoDB to a relational database – We had to account for some discrepancies in field lengths (MongoDB is more flexible)
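A hedged sketch of that sync direction: flatten the embedded document back into parent/child rows, truncating values to fit the legacy columns. The table and column names and the 30-character limit are illustrative, not the actual CCC schema:

```javascript
// Hypothetical column length from the legacy relational schema.
const CONTRIBUTOR_NAME_LEN = 30;

function toRelationalRows(work) {
  // Parent row for the WORK table.
  const workRow = { work_id: work._id, title: work.title };
  // One child row per embedded contributor; values are truncated because
  // MongoDB imposes no field-length limit but the relational column does.
  const contributorRows = (work.work_contributor || []).map(c => ({
    work_id: work._id,
    contributor_name: c.contributorName.slice(0, CONTRIBUTOR_NAME_LEN)
  }));
  return { workRow, contributorRows };
}
```

In practice the truncation (or a rejection of over-long values) is exactly the kind of relational constraint the lesson above warns about.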
  • 26. More Lessons Learned ● Document size is key! ● The data management practices you're used to from the relational world must be adapted; example: key names – In the relational world, we favor longer names – We found that large key names were causing us pain – We're not the first: see the "On shortened field names in MongoDB" blog post – But this goes against "good" relational database naming practices (e.g., longer column names are self-documenting)
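One way to act on that lesson (a sketch, not necessarily the approach we shipped): translate verbose keys to short ones at write time through a mapping table, since every BSON document stores each key name in full. The mapping below is hypothetical:

```javascript
// Hypothetical long-key → short-key mapping, applied before insert.
const SHORT_KEYS = {
  publicationCountry: "pc",
  country_code: "cc",
  country_description: "cd"
};

// Recursively rename keys; arrays and scalars pass through unchanged.
function shortenKeys(doc) {
  if (Array.isArray(doc)) return doc.map(shortenKeys);
  if (doc === null || typeof doc !== "object") return doc;
  return Object.fromEntries(
    Object.entries(doc).map(([k, v]) => [SHORT_KEYS[k] || k, shortenKeys(v)])
  );
}
```

The cost is readability in the shell and in logs, which is why the mapping table itself has to be treated as part of the schema documentation.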
  • 27. More Lessons Learned ● Our way of using Spring Data introduced its own problems – "scaffolding" ● Nesting of keys for flexibility was painful – Example: workItemValues.work_createdUser.rawValue
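The pain of that nesting shows up whenever a value has to be pulled out by a dotted path; a minimal helper illustrating the walk (the document shape follows the example above, the "gstreet" value is made up):

```javascript
// Walk a dotted path like "workItemValues.work_createdUser.rawValue"
// one segment at a time; returns undefined if any segment is missing.
function getPath(doc, dottedPath) {
  return dottedPath.split(".").reduce(
    (node, key) => (node == null ? undefined : node[key]),
    doc
  );
}

// Document shaped like the Spring Data "scaffolding" described above.
const doc = { workItemValues: { work_createdUser: { rawValue: "gstreet" } } };
```

Every extra level of scaffolding is another segment each reader of the data has to know about, in queries, indexes, and application code alike.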
  • 28. Backups at this scale are challenging! ● Mongodump and mongoexport were too slow for our needs ● Decided on hidden replica set members on AWS ● Using filesystem snapshots for backups ● Looking into MMS Backup service
  • 29. Another Lesson: Non- and Semi-Technical Users ● For example, business analysts, product owners ● Many know and like SQL ● Many don't understand a document-oriented database ● Engineering spent a lot of time and effort in raising the comfort level – This was not universally successful ● An interesting project: SQL4NoSQL
  • 30. How to communicate structure?
  • 31. Communicating Structure ● Mind map was helpful initially ● Difficult to maintain
  • 32. JSON Schema
    {
      "$schema": "",
      "title": "Phase I Schema",
      "description": "Describes the structure of the MongoDB database for Phase I",
      "type": "object",
      "id": "",
      "required": false,
      "properties": {
        "_id": { "type": "string", "required": false },
        ...
  • 33. JSON Schema for communicating structure ● I created a JSON Schema representation of the “work” document ● It was used by QA and other teams for supporting tools ● JSON Schema is useful, but also cumbersome to maintain
  • 34. Next Steps/Challenges ● Investigating on-disk (file system) compression – Very promising so far ● Can we be more "document-oriented"? – Remove vestiges of relational data models ● Implement an archiving and purging strategy ● Investigating MMS Backup
  • 35. Vote for these JIRA Items! ● “Option to store data compressed“ ● “Bulk insert is slow in sharded environment” ● “Tokenize the field names” ● “Increase max document size to at least 64mb” ● “Collection level locking”
  • 36. Thanks! ● Twitter: @GlennRStreet ● Blog: ● LinkedIn: