The Best of Both Worlds: Speeding Up Drug Research with MongoDB & Oracle (Genentech)


  • *** possible joke:
    This is the second time that I’ve been to the “First World Conference” for a groundbreaking product. You don’t get this chance very often. The first time was 18 years ago for a product that some of you may have heard of: Java.
    My name is Doug Garrett. I’m a software engineer in the Bioinformatics and Computational Biology department of Genentech Research and Early Development.
    Since that’s quite a mouthful I’ll just refer to them as gRED and Bioinformatics.
    Genentech was the first biotech company – the first company to produce drugs, such as insulin, from genetically engineered organisms. In 2009 Genentech was purchased by the Swiss pharmaceutical company Roche, which wisely decided to keep Genentech Research as a separate group reporting directly to the CEO.
    *** possible joke: describe cultural clash between a laid back San Francisco academic culture and a Swiss business
    – 1st time we saw senior management together on the same stage, Genentech wore ties and Roche didn’t.
  • gRED does basic research into disease mechanisms/causes and then uses those discoveries to develop new drugs.
    Although major successes have been in Cancer,
    we are now investigating other areas as well including
    Neurology – Alzheimer's, Parkinson's
    Immunology (arthritis and asthma)
    Metabolism (diabetes)
    Infectious Diseases (Flu, Hepatitis C)
  • My customers are the scientists discovering the cause of diseases and then trying to find new drugs for the diseases.
  • But uppermost and most important, the ultimate customers are the patients.
  • How is being a software engineer in Bioinformatics different from a typical software development environment?
    First – most of the people within Bioinformatics are scientists. But within Bioinformatics is a fairly small group of software engineers, such as myself.
    Software Engineers in Bioinformatics have to speak a different language.
    Have to understand the terminology AND the underlying science.
    But ALSO – the need to be flexible and adapt quickly
    – It’s research
    Terms used for above word map:
    Heterozygous, Alleles, Genes, Polymorphism, SNP (single nucleotide polymorphism),
    PCR (polymerase chain reaction), IVF, Cryo, Multiplex, Primer, Probes,
    Genetic Assay, Colony, Congenic, Genome, Backcross, Chimera,
    Microinjection, hCG (human chorionic gonadotropin), PMSG
  • I’m going to be discussing a recently completed project which used MongoDB.
    I hope to extend people's understanding of what MongoDB excels at and in what situations it is best used.
    Many talks discuss MongoDB for “big data” – but that’s not all MongoDB excels at
    Flexible schema can speed development and provide system flexibility
    Most talks I’ve seen also cover MongoDB for new systems – where that’s all that’s used
    How many of you would have to integrate MongoDB with an existing Relational Database?
    (stop to ask this question?)
    In fact though, both a Relational Database and MongoDB can co-exist in the same environment
    There are some simple ways to allow the two to easily work together
    In many ways the two complement each other
    And for us MongoDB–
    Is not just about software
    It’s about saving lives
    Many of the people in this room have probably been touched by the death of someone in their family – quite often from cancer
    In my case, my father died of non-Hodgkin's lymphoma shortly after I went to work for Genentech
    so to me I know the importance of speeding up the development of new drugs because…
    You never know when even a single day will make a major difference in someone’s life.
  • In our case
    The flexible schema has helped us reduce the time needed to introduce new lab equipment from months to weeks, or even days
    This reduced time is not entirely due to MongoDB, but MongoDB plays a key part in the improvement
    As far as integrating MongoDB with our existing relational database environment,
    - we did find a very simple way to integrate the two
    Not completely integrated, not a two-phase commit
    But Integrated “Enough”
    AND – it’s simple
    This allowed us to easily integrate our MongoDB with the existing system,
    use existing tools geared towards Relational Database
    While still being able to take advantage of MongoDB’s flexible schema.
  • This is an oversimplified view of drug development, but it illustrates the importance of mouse genetic models in many cases.
    Drug research begins with an idea – what is the cause of this disease?
    If the cause is related to genes we create new mouse genetic models, new genetic strains of mice, which are meant to reflect the underlying disease cause.
    This mouse genetic model is then used to verify the underlying disease cause.
  • If verified – move on to trying to discover drugs to address the underlying genetic cause
    They then test new drugs first on the genetically modified mouse, testing for safety and effectiveness
  • If safe and effective, only then will they move on to initial clinical trials with humans, although in many cases it’s back to the drawing board.
    As you can see, the mouse genetic model is an important part of disease research and drug discovery.
    And Increasingly we’re finding that the underlying genetic cause is much more complex than we thought
  • Determining disease causes and developing drugs to address those diseases requires genetically engineered mice
    We support around 500 investigators and in the area of 500 different genetic strains of mice
    New research requires that we develop in the area of 200 new genetic strains of mice per year
    *** In most cases you can’t purchase a new genetic strain of an animal – a new Mouse Model
    Creating a new mouse genetic strain requires genetic testing
    LOTS of genetic testing – about 700,000 genetic tests per year for us
    The entire process of developing new genetic animal strains is very complex
    It requires breeding a number of generations of mice to obtain the desired genetic mutation
    Today I’ll be covering only the step where we determine if a particular gene is present or not
    Genetic tests use a plate of “amplified” DNA, with a well for each sample and genetic test
    We run that DNA sample through one of a variety of lab instruments we use for genetic testing
    We then load those test results, usually a CSV file, into our database
    Using these results the investigator can then decide which animals to breed
    There are different types of tests, different lab instruments – new ones coming out all the time
  • This has driven demand within one of the departments that I support, The Genetic Analysis Lab.
    Demand for mouse genetic testing has increased for two reasons:
    There is the normal growth in research and therefore the number of samples to be tested
    But in addition, the growing complexity of sample testing is driving this even faster.
    We now test an average of two different genes instead of just one
  • In order to keep up with rising demand we needed to update the Genotyping Lab Instruments
    Originally we had just a 3730 Genetic Test.
    We loaded a file containing results for a plate
    Each file had results for one or more wells containing a genetic test
  • We added a new Genetic Test.
    From this test we would produce some of the same results information as for the original genetic test
    But we needed to capture additional and different details for the new genetic test.
    So in our relational database we created a child table of PCR Wells.
    But we still generated the original PCR Well row since that was the integration point with the rest of the system.
    It took six months to integrate this new genetic test.
  • We then added a second new genetic test, this one from the same instrument but generating additional data.
    This required another child table for the new data, and took an additional three months to implement.
    You can see where this is going…
    Every new instrument began to add new complexity
    Perhaps more important – it took too long
    And – the requirement to add new types of genetic tests was expected to increase
    – driven by the need to increase throughput in the lab in order to keep up with rising demand.
  • To help address this we had undertaken a redesign of the system
    As part of this new design we included a new DB design integrating our Relational Database with MongoDB.
    We were fortunate that a project for another department had required MongoDB. As a result our Oracle DBAs were comfortable with supporting MongoDB, making it easy for us to request a new MongoDB database.
    The key point was to isolate data which we expected to vary for different genetic tests, into a new MongoDB document.
    For each different type of genetic test we planned to create an instrument specific load process to:
    Read the CSV File
    Parse that file into the MongoDB Document
    Edit, Validate, Preprocess
    Save the preprocessed data in the MongoDB
  • The next step in the process, a “Generalized Loader”, would then use certain commonly defined fields within the MongoDB document to load the Relational Database.
    Now, if we need to add a new genetic test, no time is needed to modify the database schema
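The two-stage load described above can be sketched roughly as follows. This is a minimal illustration in plain JavaScript, not the production loader: the function names, the `Result` field, and the naive comma splitting (no quoted fields) are all assumptions made for the example. Only `plate_wells`, `Well`, and `Sample` appear in the slides.

```javascript
// Hypothetical sketch; names and field choices are assumptions, not the real system.

// Instrument-specific step: parse one instrument's CSV text into a document.
// Every column becomes a field on the well, whatever the instrument produces.
function parseInstrumentCsv(csvText, instrument) {
  const [headerLine, ...rows] = csvText.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  const plateWells = rows.map((row) => {
    const values = row.split(",").map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  });
  return { instrument, plate_wells: plateWells };
}

// Validation step: every well must carry the commonly defined fields.
const COMMON_FIELDS = ["Well", "Sample", "Result"]; // "Result" is an assumed name
function validate(doc) {
  for (const well of doc.plate_wells) {
    for (const field of COMMON_FIELDS) {
      if (!well[field]) {
        throw new Error(`Missing ${field} in well ${well.Well ?? "?"}`);
      }
    }
  }
  return doc;
}

// Generalized loader: reads only the common fields, so it does not change
// when a new instrument adds instrument-specific fields to the document.
function toRelationalRows(doc, docId) {
  return doc.plate_wells.map((w) => ({
    mongo_id: docId,
    well: w.Well,
    sample: w.Sample,
    result: w.Result,
  }));
}
```

Under these assumptions, supporting a new instrument would mean writing a new parser in the style of `parseInstrumentCsv`, while `toRelationalRows` and the relational schema stay as they are.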
  • From a user perspective, this is how it appears.
    Most of the data displayed is coming from our relational database. But details within the results which come from MongoDB are combined with the relational DB data by a Java program and then displayed on the User Interface.
    Currently, the variable data is only needed when the genetic test results are initially being processed, though it will be available if needed.
    In the future we may perform further analysis on this data and we may also capture more data since that has become so easy - mainly because with MongoDB Flexible Schema we can do this without any programming effort.
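As a rough illustration of how the relational rows and the MongoDB details might be combined for display: the talk says this is done by a Java program, so the JavaScript below is only a sketch, and the function name, the `mongo_id` column, and the in-memory `Map` standing in for a MongoDB lookup are assumptions.

```javascript
// Hypothetical sketch: join relational rows to document details by ID.
// documentsById stands in for a lookup against the MongoDB collection.
function combineForDisplay(relationalRows, documentsById) {
  return relationalRows.map((row) => {
    const doc = documentsById.get(row.mongo_id);
    // Find the matching well inside the document, if the document exists.
    const well = doc ? doc.plate_wells.find((w) => w.Well === row.well) : undefined;
    // The relational row drives the display; details are attached when found.
    return { ...row, details: well ?? null };
  });
}
```

Because the relational row drives the result, a row with no matching document still displays, just without the instrument-specific details.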
  • This is an actual example – before the new system was even done!
    The user was “nice” enough to give us an “opportunity” to test out the flexibility of our MongoDB schema
    While we were in the middle of implementing the data loading for the first time,
    The user decided we should drop that genetic test and instead load a different, newer genetic test that was just coming online to replace the previous one.
  • There was Zero impact on our data model – all changes were in the MongoDB Flexible Schema
    No time required to change the schema
    Approximately three-week impact on the project vs. the previous history of three to six months
    Mongo’s Flexible Schema was a big help in achieving this.
    It allowed us to use a new instrument without any changes to the data model.
  • Luckily this was NOT what going live looked like.
    It wasn’t a circus.
    It might have been a circus if we hadn’t used MongoDB though.
    The entire mouse breeding program, which this genetic testing is just a part of, is so important that we maintain a “Disaster Recovery Data Center” which keeps a running copy of the system - ready to take over if our main data center fails.
    Keeping a second copy of a database is a no-brainer in MongoDB – keeping one or two copies of the MongoDB collections is the default configuration for most production MongoDB systems.
    But if you’ve ever tried to do this with Oracle, the product we use, you may find it a much more difficult task. For example, when we went from Oracle 10g to Oracle 11g, somehow the defaults changed and our “disaster recovery” copy ended up being corrupted. Even scarier, we didn’t know until a number of months later, when we ran our yearly “disaster recovery” test and it failed.
    When we went live with MongoDB though, we reminded our DBAs that we needed a copy of the production database at the backup site. Although they did already have a replica running, they hadn’t set up one in our disaster recovery Data Center. Luckily, because of MongoDB, they were able to set this up in less than an hour – something I wouldn’t have tried to do with Oracle.
  • Next let’s talk about synchronizing our relational DB with the MongoDB.
    How do we maintain consistency between the Relational DB and MongoDB?
    Whenever you join two databases together you run into issues regarding keeping the two “synchronized”. Often this requires a complex two phase commit or similar mechanism.
    In our case we always insert the complete MongoDB document first. The MongoDB then contains a standard set of fields which are needed to define the genetic test results and are then used to load the relational database.
    But suppose there is a failure before the Relational Database insert is completed?
  • Net result:
    MongoDB Document left in Collection with no corresponding Relational Database table row
    We considered a “quasi two phase commit”
    Set document “status” to “in progress”
    Insert and commit Relational Database
    Set document “status” to “committed”
    But then we still had to deal with scripts that clean up after any failure such as finding any MongoDB documents with a status of “in progress” and either setting the status to “committed” if there is a corresponding Relational Database row, or deleting the document if there wasn’t.
    But the question was: why bother?
    Who cares if there is an “extra” MongoDB document?
    If we just look at those which have an ID in the Relational Database – we’ll never see extra MongoDB documents
    Our simple solution makes Relational Database the DB of record and lets it handle the transaction management, something it does quite well.
    If an ID isn’t in the Relational Database, it doesn’t exist, as far as we’re concerned
    If we ever begin to go against MongoDB Directly, we can write a simple “clean up” script to delete any orphan documents
    But for now, we just ignore them
    It doesn’t cause a problem
    It won’t happen often
    It’s not as if we have to worry about the MongoDB size
    The main objective is keeping it simple
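The “document first, relational database as the DB of record” ordering above can be sketched as follows. The two stores are replaced by in-memory stand-ins, and every name here (`saveTestResults`, `visibleDocuments`, `mongo_id`, `plate`) is an assumption for illustration rather than the actual implementation.

```javascript
// In-memory stand-ins for the two stores; both are assumptions for illustration.
const mongoDocs = new Map(); // _id -> document (stands in for a MongoDB collection)
const relationalRows = [];   // stands in for the relational table (DB of record)
let nextDocId = 1;

// Step 1: save the complete document. Step 2: insert the relational row.
// If step 2 fails, the document is simply left behind as an orphan.
function saveTestResults(doc, failRelationalInsert = false) {
  const id = `doc-${nextDocId++}`;
  mongoDocs.set(id, { _id: id, ...doc });
  if (failRelationalInsert) throw new Error("relational insert failed");
  relationalRows.push({ mongo_id: id, plate: doc.plate });
  return id;
}

// Reads always start from the relational rows, so orphans are never seen.
function visibleDocuments() {
  return relationalRows.map((row) => mongoDocs.get(row.mongo_id));
}
```

In this sketch a failed relational insert leaves one extra entry in `mongoDocs`, but `visibleDocuments()` never returns it, which is the sense in which the two stores are integrated “enough”.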
  • If at some future date we did go directly against MongoDB and needed to clean up the “orphan” MongoDB documents, there are various ways we could handle this.
    Here’s just one example of how it might be done. There are many other simple ways to do this though.
    In this case we simply need to mark those documents where we do have a corresponding Relational Database Row,
    And with a single delete command we can delete what doesn’t have a row in the relational database.
    The point is that there are a number of simple ways to correct this problem, in the rare case that it even happens.
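The mark-then-delete idea can be tried out without a database. The sketch below mirrors the approach in plain JavaScript over arrays; the function name is hypothetical, and it assumes every document starts with `useCount: 0`.

```javascript
// Hypothetical in-memory version of "mark referenced documents, delete the rest".
// testResults: array of documents, each assumed to start with useCount: 0.
// referencedIds: the document IDs found in the relational database.
function sweepOrphans(testResults, referencedIds) {
  const byId = new Map(testResults.map((doc) => [doc._id, doc]));
  // Mark: bump useCount on every document that has a relational row.
  for (const id of referencedIds) {
    const doc = byId.get(id);
    if (doc) doc.useCount += 1;
  }
  // Sweep: keep only the documents that were marked.
  return testResults.filter((doc) => doc.useCount > 0);
}
```

Note that the sketch mutates the `useCount` of the documents passed in, much as an in-place update would.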
  • Now that we’re live we’re realizing that **IF** we could easily do so, it might be nice to load some additional data that is available from the instrument. In the past we avoided this because we’d have to add new columns to the Relational Database schema.
    But many lab instruments often allow users to specify additional data elements they want in the CSV file we use to load the results.
  • The current CSV load program always looks for a known set of fields to load into the MongoDB document.
    These fields must be at the beginning of each row
    But in fact our load program will also load any other fields added onto the end of each row.
  • As long as the beginning of each “row” of the CSV File is what we expect, we can parse and save any additional comma separated values into the MongoDB document without programming changes.
    Here again the MongoDB flexible schema allows us to do things which would otherwise be difficult to support in a relational database.
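A small sketch of that behavior: check that the expected fields lead each row, then keep whatever else follows. The function name, the choice of `Well` and `Sample` as the required leading fields, and the naive comma splitting are assumptions; `New1` is borrowed from the slide example.

```javascript
// Hypothetical sketch: required fields must come first, extras are kept as-is.
const REQUIRED_LEADING_FIELDS = ["Well", "Sample"]; // assumed for illustration

function parseRows(csvText) {
  const [headerLine, ...lines] = csvText.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  // The known fields must appear, in order, at the start of the header.
  REQUIRED_LEADING_FIELDS.forEach((name, i) => {
    if (headers[i] !== name) {
      throw new Error(`Expected column ${i + 1} to be ${name}, found ${headers[i]}`);
    }
  });
  // Every column, known or new, becomes a field on the well document.
  return lines.map((line) => {
    const values = line.split(",").map((v) => v.trim());
    return Object.fromEntries(headers.map((h, i) => [h, values[i]]));
  });
}
```

Adding a `New1` column to the end of the CSV needs no change to `parseRows`; the extra value simply shows up as a new field in each well.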
  • You often can’t tell how useful the data might be until you collect it and examine it.
    With MongoDB’s flexible schema it becomes very easy to collect this additional data at low or no cost, providing the luxury of collecting much more than you might otherwise.
    So why not collect as much as you can?
    It’s inexpensive
    It’s easy
  • As a result we may one day start to analyze some of that additional data
    – access additional lab instrument specific detailed data which would otherwise be difficult to obtain.
    You never know what you’ll find. How the information can be used to improve the process
    Improve accuracy?
    Spot problems before they occur?
    Who knows what else…
    Until you capture the data and take a look at it – you never know what you’ll find.
    And with MongoDB you lower the barrier so much, that it becomes easy to collect all the data you’d ever want.
  • This is made even easier by new Aggregation Framework capabilities which have removed some of the previous resource limitations of the framework.
    If you want to find out more about the MongoDB Aggregation Framework,
    including the major revisions in the April MongoDB release, 2.6, which:
    Removes the 16MB limit for aggregation pipeline results
    Provides the option to remove limits on intermediate result set sizes
    by allowing you to save intermediate results on disk
    Chapter 6 of the soon-to-be-released 2nd Edition of MongoDB in Action will cover this.
    Please use code mdbdgcf for 44% off MongoDB in Action, 2e (all formats) for all attendees.
    Please also give away a free MEAP and send us the winner’s name and email.
    The book is scheduled to be released this summer but Manning has an early access program which will allow you to read the chapter when it is completed – which should be soon.
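For a sense of what such an analysis might look like, here is a hedged sketch of an aggregation pipeline over documents shaped like the slide example. The pipeline itself is an assumption (counting wells per sample is just an illustrative question), as is using `TestResults` as the collection name. `allowDiskUse` is the 2.6 option that lets stages spill intermediate results to disk.

```javascript
// Pipeline stages are plain objects, so they can be built and inspected
// without a database connection. The analysis itself is illustrative only.
const pipeline = [
  // One output document per well in each plate.
  { $unwind: "$plate_wells" },
  // Count how many wells were run for each sample.
  { $group: { _id: "$plate_wells.Sample", wells: { $sum: 1 } } },
];

// Lets pipeline stages write intermediate results to disk (MongoDB 2.6+).
const options = { allowDiskUse: true };

// In the mongo shell this would be run as, for example:
//   db.TestResults.aggregate(pipeline, options)
```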
  • There are other future possibilities for MongoDB in our department, Bioinformatics, as well.
    While conducting an internal review of this project (BTSC), the possibilities enabled by MongoDB flexible schema started others thinking about additional ways we could leverage it.
    One idea was to use MongoDB to help in dealing with different formats of data arriving from a variety of sources. (actually Jan’s idea)
    If nothing else, MongoDB could provide a common and flexible access method for programs which need to process these data.
    It could also provide a common place to first store and then curate the data, if we need to do any preprocessing or validation
    We could then use the results to load a Relational Database, or even process it directly from MongoDB with either the aggregation framework or other languages which have MongoDB adapters, such as R.
    MongoDB’s flexible schema as well as easy access makes it a natural tool for this use.
  • So – as you can see
    MongoDB is not just about big data
    The flexible schema can speed development and provide system flexibility
    In our case, just for the genetic testing system,
    We’ve reduced the time to introduce some new lab equipment from months to weeks
    And we can actually capture some new instrument data without any programming changes
    And again - MongoDB is not just for new systems where you don’t need to integrate with existing Relational Database:
    We found a very simple way to integrate the two
    Not completely integrated
    But Integrated “Enough”
    “Eventually” as consistent as needed
    And you never know when a single day will make a big difference in someone’s life.
  • As we’ve seen, MongoDB does help us integrate new genetic tests faster, which in turn can help reduce drug development time.
    In closing I wanted to share a personal story, one that helps motivate me to do things faster.
    “You better do it fast” was the punch line from the last joke my father ever made.
    He died of cancer shortly after I went to work for Genentech.
    A few weeks before he died I had told him that I was joining this great company, Genentech, and that we were researching cures for cancer.
    He smiled, laughed and said “You better do it fast”
    With the help of MongoDB we’ve reduced the time needed to introduce new genetic tests.
    And you never know when even a single day will make a major difference in someone’s life.
  • The Best of Both Worlds: Speeding Up Drug Research with MongoDB & Oracle (Genentech)

    1. Speeding Up Drug Research with MongoDB Introducing MongoDB into an RDBMS Environment
    2. Doug Garrett • Genentech Research and Early Development (gRED) Bioinformatics and Computational Biology (B&CB) Software Engineer
    3. gRED: Disease and Drug Research
    4. Bioinformatics Customers: Scientists
    5. Most of All: Patients
    6. Bioinformatics: Not Your Typical IT
    7. MongoDB • Not just about big data MongoDB has a flexible schema • Not just about new systems MongoDB easily integrates with RDBMS • Not just about software It’s about saving lives
    8. Time to Introduce New Genetic Test Weeks
    9. Drug Development Process New Drug
    10. Drug Development Process New Drug
    11. Drug Development Process New Drug
    12. New Mouse Model - Genetic Testing File (csv)
    13. J. Colin Cox Sept. 2013 Presentation Growth In Genetic Testing (thousands) Samples Genotypes
    14. 6 months 3 months
    15. 6 months 3 months
    16. 6 months 3 months
    17. Varies by Genetic Test
    18. Case Study: New Genetic Test Instrument New Instrument! Impact? Bio-Rad CFX384 ABI 7900HT
    19. Case Study: New Genetic Test Instrument New Instrument! Impact? DB Schema? No Impact Project? 3 weeks Bio-Rad CFX384 ABI 7900HT
    20. Going Live…
    21. Failure Mode
    22. Failure Mode
    23. Synch MongoDB with RDBMS db.NewRdbmsMongoId.find().forEach(function(doc){ db.TestResults.update({'_id':doc._id}, {'$inc':{useCount:1}}) }); db.TestResults.remove({'useCount':0});
    24. But Wait! There’s More… • Flexible data collection
    25. Load CSV to MongoDB "_id" : ObjectId("…."), "plate_wells" : [ { "Well" : "A01", "Sample" : "308…", … } ]
    26. Add Fields to CSV "_id" : ObjectId("…."), "plate_wells" : [ { "Well" : "A01", "Sample" : "308…", … "New1" : "New Value" } ]
    27. Future – What If… Avoiding the typical “Catch 22”: 1. Is it worth collecting the data? 2. What is the value of the data? 3. Need the data to find the value
    28. Future Analytics MongoDB Aggregation Framework R Matlab
    29. MongoDB Aggregation Framework 40% Discount Thru July 4 Use Code: mdbdgcf
    30. Under Discussion CSV JSON XML Other Lab Instruments Government Agencies Third Party Sources MongoDB Load to RDBMS Process Directly
    31. MongoDB • Not just about big data MongoDB has a flexible schema • Not just about new systems MongoDB easily integrates with RDBMS • Not just about software It’s about saving lives
    32. “You better do it fast” For my Father Who I hope would have enjoyed this talk