No More SQL: Migrating a Data Repository from a Traditional Relational Database to MongoDB

Presentation Transcript

  • No More SQL: A chronicle of moving a data repository from a traditional relational database to MongoDB
  • Who Am I? ● Database Architect at Copyright Clearance Center ● Oracle Certified Professional ● Many years of database development and administration ● Learning to love “polyglot persistence”
  • What is Copyright Clearance Center? "Copyright Clearance Center (CCC), the rights licensing expert, is a global rights broker for the world’s most sought-after books, journals, blogs, movies and more. Founded in 1978 as a not-for-profit organization, CCC provides smart solutions that simplify the access and licensing of content. These solutions let businesses and academic institutions quickly get permission to share copyright-protected materials, while compensating publishers and creators for the use of their works."
  • What I want to talk about today ● Not application design, but data design issues ● Also data management issues ● Our experience in moving away from the "legacy" relational way of doing things ● These experiences come from one large project; your mileage may vary
  • What Do I Mean By Data Management? ● Topics like naming conventions, data element definitions ● Data models ● Data quality ● Archive, purge, retention, backups
  • Where we started ● 200+ tables in an Oracle relational database ● Core set of tables fewer, but many supporting tables ● 2.5 TB total (including TEMP space, etc.) ● Many PL/SQL packages and procedures ● Solr for search
  • Today ● We use MongoDB in several products ● The one I'll talk about today is our largest MongoDB database (> 2 TB)
  • What options did we have in the past for scaling? ● At the database layer, few ● Clustering ($$) ● So, we emphasized scaling at the application tier ● We wanted to be able to scale out the database tier in a low-cost way
  • What kind of data? ● "Work" data, primarily books, articles, journals ● Associated metadata – Publisher, author, etc.
  • Application characteristics ● Most queries are reads via the Solr index ● Database access is needed for additional metadata not stored in Solr ● Custom matching algorithms ● Database updates are done in bulk (loading) ● Loads of data come from third-party providers ● On top of this we've built many reports
  • Here's what the core data model looked like: highly normalized
  • Where we are today ● A 12-shard MongoDB database, roughly 200 GB per shard (2.4 TB total) ● Replica sets, primarily for backup (more about that later) ● JEE application (no stored procedure code) ● Solr for search
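The deck doesn't show how the cluster itself was configured. As a rough illustration only, here is a minimal sketch of enabling sharding for a collection, written against the current MongoDB Java driver rather than the 2.x driver likely in use at the time; the mongos hostname, the "works" database, the "work" collection, and the hashed _id shard key are assumptions, not details from the talk.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public class ShardingSetupSketch {
        public static void main(String[] args) {
            // Connect through a mongos router (hostname is a placeholder).
            try (MongoClient client = MongoClients.create("mongodb://mongos.example.com:27017")) {
                // Enable sharding on the (hypothetical) "works" database...
                client.getDatabase("admin")
                      .runCommand(new Document("enableSharding", "works"));
                // ...and shard the "work" collection on a hashed _id so bulk loads
                // spread across all of the shards.
                client.getDatabase("admin")
                      .runCommand(new Document("shardCollection", "works.work")
                              .append("key", new Document("_id", "hashed")));
            }
        }
    }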
  • What motivated us? ● Downtime every time we made even the simplest database schema update ● The data model was not appropriate for our use case – Bulk loading – Read-mostly (few updates) – We want to be able to see most of a "work's" metadata at once – This led to many joins, given our normalized data model
  • More motivators ● Every data loader required custom coding ● The business users wanted more control over adding data to the data model “on-the-fly” (e.g., a new data provider with added metadata) ● This would be nearly impossible using a relational database ● MongoDB's flexible schema model is perfect for this use!
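To make the "flexible schema" point concrete: documents from a new provider can simply carry extra fields, with no ALTER TABLE and no downtime. A minimal sketch with the Java driver follows; apart from the publicationCountry sub-document (which appears later in the deck), the field names are invented for illustration.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class FlexibleSchemaSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> works =
                        client.getDatabase("works").getCollection("work");

                // A work loaded from an existing provider.
                works.insertOne(new Document("title", "Example Journal Article")
                        .append("publicationCountry",
                                new Document("country_code", "CHE")
                                        .append("country_description", "Switzerland")));

                // A work from a new provider that supplies extra metadata:
                // no schema change needed -- the new fields simply appear
                // on the documents that have them.
                works.insertOne(new Document("title", "Example Book")
                        .append("providerName", "NewProvider")
                        .append("providerSpecificIds",
                                new Document("catalogId", "NP-12345")));
            }
        }
    }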
  • What were our constraints? ● Originally, we wanted to revamp the nature of how we represent a work ● Our idea was to construct a work made up of varying data sources => a “canonical” work ● But, as so often happens, time was not on our side
  • We needed to reverse-engineer functionality ● This meant we needed to translate the relational structures ● We probably didn't take full advantage of a document-oriented database ● The entire team was more familiar with the relational model
  • We came up with a single JSON document ● We weighed the usual issues: – Embedding vs. linking ● Several books touch on this topic, as does the MongoDB manual – One excellent one: MongoDB Applied Design Patterns by Rick Copeland, O'Reilly Media.
  • We favored embedding ● "Child" tables became "child" documents ● This seemed the most natural translation of relational to document ● But, this led to larger documents ● Lesson: – We could have used linking more
  • Example: one-to-one relationship
  • In MongoDB work...
    "publicationCountry" : {
        "country_code" : "CHE",
        "country_description" : "Switzerland"
    }
  • Example: one-to-many relationship
  • In MongoDB: an array of “work contributors”
    "work_contributor" : [
        {
            "contributorName" : "Ballauri, Jorgji S.",
            "contributorRoleDescr" : "Author"
        }
    ]
  • When embedding... ● Consider the resulting size of your documents ● Embedding is akin to denormalization in the relational world ● Denormalization is not always the answer (even for RDBMS)!
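The trade-off described here can be sketched in code. Below, the same contributor data is modeled both ways with the Java driver: embedded (one document, one read, but the work document grows with every contributor) and linked (a separate collection referencing the work, so the work stays small but reading everything takes a second query). The collection and field names follow the deck's examples; the rest is illustrative, not the project's actual model.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;
    import org.bson.types.ObjectId;

    import java.util.List;

    public class EmbedVsLinkSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("works");

                // Embedding (what the project favored): contributors live inside the work.
                db.getCollection("work").insertOne(new Document("title", "Embedded example")
                        .append("work_contributor", List.of(
                                new Document("contributorName", "Ballauri, Jorgji S.")
                                        .append("contributorRoleDescr", "Author"))));

                // Linking: contributors in their own collection, referencing the work by _id.
                ObjectId workId = new ObjectId();
                db.getCollection("work").insertOne(
                        new Document("_id", workId).append("title", "Linked example"));
                db.getCollection("work_contributor").insertOne(
                        new Document("workId", workId)
                                .append("contributorName", "Ballauri, Jorgji S.")
                                .append("contributorRoleDescr", "Author"));
            }
        }
    }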
  • Data migration from our relational database ● Wrote a custom series of ETL processes ● Combined Talend Data Integration and custom-built code ● Leveraged our new loader program
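The actual migration used Talend plus the team's own loader, so the following is only a hand-rolled sketch of the general shape of such an ETL step: read joined rows from the legacy Oracle schema over JDBC and fold each child row into an embedded sub-document, mirroring the "child table becomes child document" translation. The JDBC URL, credentials, and table/column names are hypothetical.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RelationalToMongoSketch {
        public static void main(String[] args) throws Exception {
            try (Connection oracle = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//legacy-db.example.com:1521/ORCL", "user", "pass");
                 MongoClient mongo = MongoClients.create("mongodb://localhost:27017")) {

                MongoCollection<Document> works =
                        mongo.getDatabase("works").getCollection("work");

                // Join the parent table to one of its child tables and fold each
                // child row into an embedded sub-document.
                String sql = "SELECT w.work_id, w.title, c.country_code, c.country_description "
                           + "FROM work w JOIN publication_country c ON c.work_id = w.work_id";
                try (Statement stmt = oracle.createStatement();
                     ResultSet rs = stmt.executeQuery(sql)) {
                    while (rs.next()) {
                        works.insertOne(new Document("_id", rs.getLong("work_id"))
                                .append("title", rs.getString("title"))
                                .append("publicationCountry",
                                        new Document("country_code", rs.getString("country_code"))
                                                .append("country_description",
                                                        rs.getString("country_description"))));
                    }
                }
            }
        }
    }

In practice a bulk loader would batch the writes (for example with insertMany) rather than insert one document per row, as the deck's emphasis on in-bulk loading suggests.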
  • But...we still had to talk to a relational database ● The legacy relational database became a reporting and batch-process database (at least for now) ● Data from our new MongoDB system of record needed to be synced with the relational database ● Wrote a custom process ● Transform the JSON structure back to relational tables
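The reverse sync was likewise a custom process; a minimal sketch under the same assumptions (hypothetical table and field names) is to walk the work documents and unfold the embedded structures back into flat rows for the reporting database.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class MongoToRelationalSyncSketch {
        public static void main(String[] args) throws Exception {
            try (MongoClient mongo = MongoClients.create("mongodb://localhost:27017");
                 Connection oracle = DriverManager.getConnection(
                         "jdbc:oracle:thin:@//legacy-db.example.com:1521/ORCL", "user", "pass");
                 PreparedStatement insert = oracle.prepareStatement(
                         "INSERT INTO work_report (work_id, title, country_code) VALUES (?, ?, ?)")) {

                // Walk every work document and flatten the embedded sub-document
                // back into relational columns for reporting.
                for (Document work : mongo.getDatabase("works")
                                          .getCollection("work").find()) {
                    Document country = work.get("publicationCountry", Document.class);
                    insert.setLong(1, work.getLong("_id"));
                    insert.setString(2, work.getString("title"));
                    insert.setString(3, country == null ? null : country.getString("country_code"));
                    insert.addBatch();
                }
                insert.executeBatch();
            }
        }
    }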
  • Lessons Learned ● Document size is key! ● The data management practices you're used to from the relational world must be adapted; example: key names ● In the relational world, we favor longer names ● We found that large key names were causing us pain – We're not the first: see the “On shortened field names in MongoDB” blog post – But, this goes against “good” relational database naming practices (e.g., longer column names are self-documenting)
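One way to get short stored keys without giving up readable code, for a team already on Spring Data MongoDB as this one was, is the @Field annotation: the Java property keeps its long, self-documenting name while the key written into every document stays short. The deck doesn't say this is how the team did it, and the mapped names ("t", "cu") below are invented for illustration.

    import org.springframework.data.annotation.Id;
    import org.springframework.data.mongodb.core.mapping.Document;
    import org.springframework.data.mongodb.core.mapping.Field;

    // The Java property keeps its descriptive, "relational-style" name;
    // only the key stored (and repeated) in every MongoDB document is shortened.
    @Document(collection = "work")
    public class Work {

        @Id
        private String id;

        @Field("t")
        private String title;

        @Field("cu")
        private String createdUser;

        // getters and setters omitted
    }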
  • More Lessons Learned ● Nesting of keys was painful (workItemValues.work_createdUser.rawValue) ● Our way of using Spring Data introduced its own problems – “scaffolding” ● Backups at this scale are challenging!
  • Another Lesson: Non/Semi-Technical Users ● For example, business analysts, product owners ● Many know and like SQL ● Many don't understand a document-oriented database ● Engineering spent a lot of time and effort in raising the comfort level – This was not universally successful
  • How to communicate structure?
  • Communicating Structure ● A mind map was helpful initially ● Difficult to maintain ● JSON Schema was also useful, but cumbersome to maintain
  • Information Management ● JSON Schema ● JSON ● Used by QA and other teams for supporting tools
  • Next Steps/Challenges ● Investigating on-disk (file system) compression – Very promising so far ● Can we be more "document-oriented"? – Remove vestiges of relational data models ● Implement an archiving and purging strategy
  • Vote for these JIRA Items! ● “Option to store data compressed” ● “Bulk insert is slow in sharded environment” ● “Tokenize the field names” ● “Increase max document size to at least 64mb” ● “Collection level locking”
  • Thanks! ● Twitter: @GlennRStreet ● Blog: