7. MongoDB
• Not just about big data
MongoDB has a flexible schema
• Not just about new systems
MongoDB easily integrates with RDBMS
• Not just about software
It’s about saving lives
29. Future – What If…
Avoiding the typical “Catch 22”:
1. Is it worth collecting the data?
2. What is the value of the data?
3. Need the data to find the value
33. MongoDB
• Not just about big data
MongoDB has a flexible schema
• Not just about new systems
MongoDB easily integrates with RDBMS
• Not just about software
It’s about saving lives
34. “You better do it fast”
For my Father
Who I hope would have enjoyed this talk
Editor's Notes
*** possible joke:
This is the second time that I’ve been to the “First World Conference” for a groundbreaking product. You don’t get this chance very often. The first time was 18 years ago for a product that some of you may have heard of: Java.
My name is Doug Garrett. I’m a software engineer in the Bioinformatics and Computational Biology department of Genentech Research and Early Development.
Since that’s quite a mouthful I’ll just refer to them as gRED and Bioinformatics.
Genentech was the first biotech company – the first company to produce drugs, such as insulin, from genetically engineered organisms. In 2009 Genentech was purchased by the Swiss pharmaceutical company Roche, which wisely decided to keep Genentech Research as a separate group reporting directly to the CEO.
*** possible joke: describe cultural clash between a laid back San Francisco academic culture and a Swiss business
– 1st time we saw senior management together on the same stage, Genentech wore ties and Roche didn’t.
gRED does basic research into disease mechanisms/causes and then uses those discoveries to develop new drugs.
Although major successes have been in Cancer,
we are now investigating other areas as well including
Neurology – Alzheimer's, Parkinson's
Immunology (arthritis and asthma)
Metabolism (diabetes)
Infectious Diseases (Flu, Hepatitis C)
My customers are the scientists discovering the cause of diseases and then trying to find new drugs for the diseases.
But uppermost, and most important, the ultimate customers are the patients.
How is being a software engineer in Bioinformatics different from a typical software development environment?
First – most of the people within Bioinformatics are scientists. But within Bioinformatics there is a fairly small group of software engineers, such as myself.
Software Engineers in Bioinformatics have to speak a different language.
Have to understand the terminology AND the underlying science.
But ALSO – the need to be flexible and adapt quickly
– It’s research
Terms used for above word map:
Heterozygous, Alleles, Genes, Polymorphism, SNP (single nucleotide polymorphism),
PCR (polymerase chain reaction), IVF, Cryo, Multiplex, Primer, Probes,
Genetic Assay, Colony, Congenic, Genome, Backcross, Chimera,
Microinjection, hCG (human chorionic gonadotropin), PMSG
I’m going to be discussing a recently completed project which used MongoDB.
I hope to extend people's understanding of what MongoDB excels at and the situations in which it is best used.
Many talks discuss MongoDB for “big data” – but that’s not all MongoDB excels at
Flexible schema can speed development and provide system flexibility
Most talks I’ve seen also cover MongoDB for new systems – where that’s all that’s used
How many of you would have to integrate MongoDB with an existing Relational Database?
(stop to ask this question?)
In fact, though, a Relational Database and MongoDB can co-exist in the same environment
There are some simple ways to allow the two to easily work together
In many ways the two complement each other
And for us MongoDB–
Is not just about software
It’s about saving lives
Many of the people in this room have probably been touched by the death of someone in their family – quite often from cancer
In my case, my father died of non-Hodgkin’s lymphoma shortly after I went to work for Genentech
so I know the importance of speeding up the development of new drugs because…
You never know when even a single day will make a major difference in someone’s life.
In our case
The flexible schema has helped us reduce the time needed to introduce new lab equipment from months to weeks, or even days
This reduced time is not entirely due to MongoDB, but MongoDB plays a key part in the improvement
As far as integrating MongoDB with our existing relational database environment,
- we did find a very simple way to integrate the two
Not completely integrated, not a two-phase commit
But Integrated “Enough”
AND – it’s simple
This allowed us to easily integrate our MongoDB with the existing system,
use existing tools geared towards Relational Database
While still being able to take advantage of MongoDB’s flexible schema.
This is an oversimplified view of drug development, but it illustrates the importance of mouse genetic models in many cases.
Drug research begins with an idea – what is the cause of this disease?
If the cause is related to genes we create new mouse genetic models, new genetic strains of mice, which are meant to reflect the underlying disease cause.
This mouse genetic model is then used to verify the underlying disease cause.
If verified – move on to trying to discover drugs to address the underlying genetic cause
They then test new drugs first on the genetically modified mouse, testing for safety and effectiveness
If safe and effective, only then will they move on to initial clinical trials with humans, although in many cases it’s back to the drawing board.
As you can see, the mouse genetic model is an important part of disease research and drug discovery.
And increasingly we’re finding that the underlying genetic cause is much more complex than we thought
Determining disease causes and developing drugs to address those diseases requires genetically engineered mice
We support around 500 investigators and in the area of 500 different genetic strains of mice
New research requires that we develop in the area of 200 new genetic strains of mice per year
*** In most cases you can’t purchase a new genetic strain of an animal – a new Mouse Model
Creating a new mouse genetic strain requires genetic testing
LOTS of genetic testing – about 700,000 genetic tests per year for us
The entire process of developing new genetic animal strains is very complex
It requires breeding a number of generations of mice to obtain the desired genetic mutation
Today I’ll be covering only the step where we determine if a particular gene is present or not
Genetic tests use a plate of “amplified” DNA, with a well for each sample and genetic test
We run that DNA sample through one of a variety of lab instruments we use for genetic testing
We then load those test results, usually a CSV file, into our database
Using these results the investigator can then decide which animals to breed
There are different types of tests, different lab instruments – new ones coming out all the time
This has driven demand within one of the departments that I support, The Genetic Analysis Lab.
The demand for mouse genetic testing has increased for two reasons:
There is the normal growth in research and therefore the number of samples to be tested
But in addition, the growing complexity of sample testing is driving this even faster.
We now test an average of two different genes instead of just one
In order to keep up with rising demand we needed to update the Genotyping Lab Instruments
Originally we had just a 3730 Genetic Test.
We loaded a file containing results for a plate
Each file had results for one or more wells containing a genetic test
We added a new Genetic Test.
From this test we would produce some of the same results information as for the original genetic test
But we needed to capture additional and different details for the new genetic test.
So in our relational database we created a child table of PCR Wells.
But we still generated the original PCR Well row since that was the integration point with the rest of the system.
It took six months to integrate this new genetic test.
We then added a second new genetic test, this one from the same instrument but generating additional data.
This required another child table for the new data, and took an additional three months to implement.
You can see where this is going…
Every new instrument began to add new complexity
AND
Perhaps more important – it took too long
And – the requirement to add new types of genetic tests was expected to increase
– driven by the need to increase throughput in the lab in order to keep up with rising demand.
To help address this we had undertaken a redesign of the system
As part of this new design we included a new DB design integrating our Relational Database with MongoDB.
We were fortunate that a project for another department had required MongoDB. As a result our Oracle DBAs were comfortable with supporting mongod, making it easy for us to request a new MongoDB database.
The key point was to isolate data which we expected to vary for different genetic tests, into a new MongoDB document.
For each different type of genetic test we planned to create an instrument specific load process to:
Read the CSV File
Parse that file into the MongoDB Document
Edit, Validate, Preprocess
Save the preprocessed data in the MongoDB
The next step in the process, a “Generalized Loader”, would then use certain commonly defined fields within the MongoDB document to load the Relational Database.
Now, if we need to add a new genetic test – no time is needed to modify the database schema
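A rough sketch of the two-step load described above. This is only an illustration, not the production code: the field names (`plate_id`, `well`, `result`, `peak_height`), the instrument name and the `pcr_well` table are hypothetical, an in-memory list stands in for the MongoDB collection, and SQLite stands in for the relational database.

```python
import csv
import io
import sqlite3

# Hypothetical common fields that every instrument-specific loader must supply.
COMMON_FIELDS = ("plate_id", "well", "result")


def parse_instrument_csv(text, instrument):
    """Instrument-specific step: parse and validate a CSV file into one document."""
    wells = []
    for row in csv.DictReader(io.StringIO(text)):
        # Validate: every common field must be present and non-empty.
        missing = [f for f in COMMON_FIELDS if not row.get(f)]
        if missing:
            raise ValueError(f"missing common fields: {missing}")
        wells.append(dict(row))  # instrument-specific columns ride along untouched
    return {"instrument": instrument, "wells": wells}


def generalized_load(doc_id, document, conn):
    """Generalized step: only the common fields reach the relational schema."""
    conn.executemany(
        "INSERT INTO pcr_well (doc_id, plate_id, well, result) VALUES (?, ?, ?, ?)",
        [(doc_id, w["plate_id"], w["well"], w["result"]) for w in document["wells"]],
    )
    conn.commit()


documents = []  # stand-in for a MongoDB collection (assumption)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pcr_well (doc_id INTEGER, plate_id TEXT, well TEXT, result TEXT)"
)

sample = "plate_id,well,result,peak_height\nP1,A1,positive,1830\nP1,A2,negative,12\n"
doc = parse_instrument_csv(sample, "new_instrument")
documents.append(doc)  # save the full document first
generalized_load(len(documents) - 1, doc, conn)  # then load the relational rows
```

In this sketch a new instrument only needs a new parse function. The `peak_height` column never touches the relational schema, yet it is kept in the document.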
From a user perspective, this is how it appears.
Most of the data displayed is coming from our relational database. But details within the results which come from MongoDB are combined with the relational DB data by a Java program and then displayed on the User Interface.
Currently, the variable data is only needed when the genetic test results are initially being processed, though it will be available if needed.
In the future we may perform further analysis on this data and we may also capture more data since that has become so easy - mainly because with MongoDB Flexible Schema we can do this without any programming effort.
This is an actual example – before the new system was even done!
The users were “nice” enough to give us an “opportunity” to test out the flexibility of our MongoDB schema
While we were in the middle of implementing the data loading for the first time,
The user decided we should drop that genetic test and instead load a different, newer genetic test that was just coming online to replace the previous one.
There was Zero impact on our data model – all changes were in the MongoDB Flexible Schema
No time required to change the schema
Approximately a three-week impact on the project vs. a previous history of three to six months
Mongo’s Flexible Schema was a big help in achieving this.
It allowed us to use a new instrument without any changes to the data model.
Luckily this was NOT what going live looked like.
It wasn’t a circus.
It might have been a circus if we hadn’t used MongoDB though.
The entire mouse breeding program, which this genetic testing is just a part of, is so important that we maintain a “Disaster Recovery Data Center” which keeps a running copy of the system - ready to take over if our main data center fails.
Keeping a second copy of a database is a no-brainer in MongoDB – keeping one or two copies of the MongoDB collections is the default configuration for most production MongoDB systems.
But if you’ve ever tried to do this with Oracle, the product we use, you may find it a much more difficult task. For example, when we went from Oracle 10g to Oracle 11g, somehow the defaults changed and our “disaster recovery” copy ended up being corrupted. Even scarier, we didn’t know until a number of months later when we ran our yearly “disaster recovery” test and it failed.
When we went live with MongoDB though, we reminded our DBAs that we needed a copy of the production database at the backup site. Although they did already have a replica running, they hadn’t set up one in our disaster recovery Data Center. Luckily, because of MongoDB, they were able to set this up in less than an hour – something I wouldn’t have tried to do with Oracle.
Next let’s talk about synchronizing our relational DB with the MongoDB.
How do we maintain consistency between the Relational DB and MongoDB?
Whenever you join two databases together you run into issues regarding keeping the two “synchronized”. Often this requires a complex two phase commit or similar mechanism.
In our case we always insert the complete MongoDB document first. The MongoDB then contains a standard set of fields which are needed to define the genetic test results and are then used to load the relational database.
But suppose there is a failure before the Relational Database insert is completed?
Net result:
MongoDB Document left in Collection with no corresponding Relational Database table row
We considered a “quasi two phase commit”
Set document “status” to “in progress”
Insert and commit Relational Database
Set document “status” to “committed”
But then we still had to deal with scripts that clean up after any failure such as finding any MongoDB documents with a status of “in progress” and either setting the status to “committed” if there is a corresponding Relational Database row, or deleting the document if there wasn’t.
But the question was: why bother?
Who cares if there is an “extra” MongoDB document?
If we just look at those which have an ID in the Relational Database – we’ll never see extra MongoDB documents
Our simple solution makes Relational Database the DB of record and lets it handle the transaction management, something it does quite well.
If an ID isn’t in the Relational Database, it doesn’t exist, as far as we’re concerned
If we ever begin to go against MongoDB Directly, we can write a simple “clean up” script to delete any orphan documents
But for now, we just ignore them
Doesn’t cause a problem
Won’t happen often
Not as if we have to worry about the MongoDB size
The main objective is keeping it simple
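A small sketch of this “document first, relational database of record” ordering. The names here (`save_result`, `visible_documents`, the `test_result` table) are hypothetical, a plain dict stands in for the MongoDB collection, and SQLite stands in for the relational database.

```python
import sqlite3

documents = {}  # stand-in for a MongoDB collection, keyed by document id (assumption)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_result (doc_id TEXT PRIMARY KEY, plate_id TEXT)")


def save_result(doc_id, document, fail_before_commit=False):
    """Insert the document first, then the relational row that makes it 'exist'."""
    documents[doc_id] = document  # step 1: the complete document goes in first
    if fail_before_commit:
        raise RuntimeError("simulated crash before the relational insert")
    with conn:  # step 2: relational transaction; commits on success, rolls back on error
        conn.execute(
            "INSERT INTO test_result (doc_id, plate_id) VALUES (?, ?)",
            (doc_id, document["plate_id"]),
        )


def visible_documents():
    """Readers start from the relational IDs, so orphan documents are never seen."""
    ids = [r[0] for r in conn.execute("SELECT doc_id FROM test_result ORDER BY doc_id")]
    return [documents[i] for i in ids]


save_result("d1", {"plate_id": "P1", "raw": [1, 2, 3]})
try:
    save_result("d2", {"plate_id": "P2", "raw": [4, 5]}, fail_before_commit=True)
except RuntimeError:
    pass  # d2 is now an orphan document with no relational row
```

After the simulated failure the stand-in collection holds two documents, but anything that reads through the relational IDs only ever sees the first one.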
If at some future date we did go directly against the MongoDB and needed to clean up the “orphan” MongoDB Documents there are various ways we could handle this.
Here’s just one example of how it might be done. There are many other simple ways to do this though.
In this case we simply need to mark those documents where we do have a corresponding Relational Database Row,
And with a single delete command we can delete what doesn’t have a row in the relational database.
The point is that there are a number of simple ways to correct this problem, in the rare case that it even happens.
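One way the mark-then-delete idea might look, as a sketch only. A dict again stands in for the MongoDB collection, and the marker field `has_row` is a made-up name for illustration.

```python
def remove_orphans(documents, relational_ids):
    """Mark documents that have a relational row, then delete the unmarked ones."""
    # Step 1: mark every document whose ID is known to the relational database.
    for doc_id in relational_ids:
        if doc_id in documents:
            documents[doc_id]["has_row"] = True
    # Step 2: one pass deletes everything that was not marked.
    orphans = [doc_id for doc_id, doc in documents.items() if not doc.get("has_row")]
    for doc_id in orphans:
        del documents[doc_id]
    return orphans


docs = {
    "d1": {"plate_id": "P1"},
    "d2": {"plate_id": "P2"},  # orphan: never reached the relational database
    "d3": {"plate_id": "P3"},
}
removed = remove_orphans(docs, relational_ids=["d1", "d3"])
```

Against a real collection, the same two steps could probably be expressed with PyMongo's `update_many` (setting the marker on the IDs found in the relational database) followed by a single `delete_many` on documents lacking the marker. A real script would also want to avoid running while a load is in flight, since a document inserted a moment ago has no relational row yet.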
Now that we’re live we’re realizing that **IF** we could easily do so, it might be nice to load some additional data that is available from the instrument. In the past we avoided this because we’d have to add new columns to the Relational Database schema.
But many lab instruments often allow users to specify additional data elements they want in the CSV file we use to load the results.
The current CSV load program always looks for a known set of fields to load into the MongoDB document,
These fields must be at the beginning of each row
But in fact our load program will also load any other fields added onto the end of each row.
As long as the beginning of each “row” of the CSV File is what we expect, we can parse and save any additional comma separated values into the MongoDB document without programming changes.
Here again the MongoDB flexible schema allows us to do things which would otherwise be difficult to support in a relational database.
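A sketch of how a loader could keep whatever extra columns the instrument appends. The known field names, the `extra` sub-document and the assumption that the CSV has a header row naming the extra columns are all illustrative, not details of the actual load program.

```python
import csv
import io

# Hypothetical fields the loader expects at the beginning of every row.
KNOWN_FIELDS = ["sample_id", "well", "call"]


def parse_rows(text):
    """Parse a CSV with a header row; keep known fields plus any trailing extras."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    n = len(KNOWN_FIELDS)
    if header[:n] != KNOWN_FIELDS:
        raise ValueError("file does not start with the expected fields")
    docs = []
    for row in reader:
        doc = dict(zip(KNOWN_FIELDS, row))            # the fields we always load
        doc["extra"] = dict(zip(header[n:], row[n:]))  # anything added on the end
        docs.append(doc)
    return docs


sample = "sample_id,well,call,quality,operator\nS1,A1,het,0.98,dg\n"
parsed = parse_rows(sample)
```

If a user later adds another column to the instrument export, it simply shows up as one more key under `extra`, with no change to the parser.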
You often can’t tell how useful the data might be until you collect it and examine it.
With MongoDB’s flexible schema it becomes very easy to collect this additional data at low or no cost, providing the luxury of collecting much more than you might otherwise.
So why not collect as much as you can?
It’s inexpensive
It’s easy
As a result we may one day start to analyze some of that additional data
–access additional lab instrument specific detailed data which would otherwise be difficult to obtain.
You never know what you’ll find. How the information can be used to improve the process
Improve accuracy?
Spot problems before they occur?
Who knows what else…
Until you capture the data and take a look at it – you never know what you’ll find.
And with MongoDB you lower the barrier so much, that it becomes easy to collect all the data you’d ever want.
This is made even easier by new Aggregation Framework capabilities which have removed some of the previous resource limitations of the framework.
If you want to find out more about the MongoDB Aggregation Framework
Including major revisions in the April MongoDB release, 2.6, which:
Removes the 16MB limit for aggregation pipeline results
Provides the option to remove limits on intermediate result set sizes
by allowing you to save intermediate results on disk
Chapter 6 of the soon to be released 2nd Edition of MongoDB in Action will cover this.
Please use code mdbdgcf for 44% off MongoDB in Action, 2e (all formats) for all attendees.
Please also give away a free MEAP and send us the winner’s name and email.
The book is scheduled to be released this summer but Manning has an early access program which will allow you to read the chapter when it is completed – which should be soon.
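As a rough idea of what analyzing the captured data with the Aggregation Framework might look like, here is a sketch that counts results per instrument and call. The document shape (`instrument`, `wells`, `wells.call`) is assumed for illustration. The function takes any collection object exposing PyMongo's `aggregate` method, and `allowDiskUse` is the option that lets large stages spill to disk.

```python
def calls_per_instrument(collection):
    """Count results per instrument and call, letting large stages spill to disk."""
    pipeline = [
        {"$unwind": "$wells"},
        {"$group": {
            "_id": {"instrument": "$instrument", "call": "$wells.call"},
            "count": {"$sum": 1},
        }},
        {"$sort": {"count": -1}},
    ]
    # allowDiskUse lets stages that exceed the in-memory limit use temporary files.
    return list(collection.aggregate(pipeline, allowDiskUse=True))


# Hypothetical usage against a real deployment (database and collection names are made up):
#   from pymongo import MongoClient
#   results = calls_per_instrument(MongoClient().genotyping.test_results)
```

In recent PyMongo versions `aggregate` returns a cursor, which is why the result is wrapped in `list` here.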
There are other future possibilities for MongoDB in our department, Bioinformatics, as well.
While conducting an internal review of this project (BTSC), the possibilities enabled by MongoDB flexible schema started others thinking about additional ways we could leverage it.
One idea was to use MongoDB to help in dealing with different formats of data arriving from a variety of sources. (actually Jan’s idea)
If nothing else, MongoDB could provide a common and flexible access method for programs which need to process these data.
It could also provide a common place to first store and then curate the data, if we need to do any preprocessing or validation
We could then use the results to load a Relational Database, or even process it directly from MongoDB with either the aggregation framework or other languages which have MongoDB adapters, such as R.
MongoDB’s flexible schema as well as easy access makes it a natural tool for this use.
So – as you can see
MongoDB is not just about big data
The flexible schema can speed development and provide system flexibility
In our case, just for the genetic testing system,
We’ve reduced the time to introduce some new lab equipment from months to weeks
And we can actually capture some new instrument data without any programming changes
And again - MongoDB is not just for new systems where you don’t need to integrate with existing Relational Database:
We found a very simple way to integrate the two
Not completely integrated
But Integrated “Enough”
“Eventually” as consistent as needed
And you never know when a single day will make a big difference in someone’s life.
As we’ve seen, MongoDB does help us integrate new genetic tests faster, which in turn can help reduce drug development time.
In closing I wanted to share a personal story, one that helps motivate me to do things faster.
“You better do it fast” was the punch line from the last joke my father ever made.
He died of cancer shortly after I went to work for Genentech.
A few weeks before he died I had told him that I was joining this great company, Genentech, and that we were researching cures for cancer.
He smiled, laughed and said “You better do it fast”
With the help of MongoDB we’ve reduced the time needed to introduce new genetic tests.
And you never know when even a single day will make a major difference in someone’s life.