Geo Searches for Health Care Pricing Data with MongoDB


Originally presented at MongoDB Days SF on May 10, 2013. The presentation describes how Castlight Health uses MongoDB to support very low latency searches for very large volumes of health care pricing data.

Published in: Technology, Health & Medicine
  • File system was ext4 with mount flags defaults,noatime,discard

    1. Geo Searches for Health Care Pricing Data with MongoDB. MongoDB Days San Francisco 2013. Robert Stewart, Senior Architect, Castlight
    2. Castlight Health; The Business and Technical Problems; Initial Solution; MongoDB, Geo Haystack Index and SSDs; Replica Set Flipping
    3. Castlight Health
       • Hosted web and mobile applications providing unbiased information on health care cost and quality
       • Customers are employers and health plans
       • Founded in 2008, raised $181 million in VC funding
       • #1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
       • Hiring!
    4. Home Page
    5. Search Results
    6. Business Problem
       • Support searches for:
           • Prices for a procedure performed by any in-network provider in a geographical area
           • Prices for all procedures performed by a single provider
       • Sub-second response, even if returning data on thousands of prices
    7. Technical Problems
       • Need a very fast geo index
       • Rate count doubled in last 3 months to 600 million
       • Major rate updates monthly
       • Difficult to index data to ensure sequential reads
       • Sometimes lots of random reads
    8. Pricing Retrieval Architecture (diagram)
       • User: Web Browser, Mobile Web Browser, Native Mobile Application
       • Castlight: Castlight Web App, Castlight Mobile Web App, Proxy Service, Search Service, Pricing Service, Prices
    9. Initial Solution
       • Store pricing data in MySQL
       • When Pricing Service starts, create two in-memory indexes and cache most of the rates
           • 55 GB JVM heap with lots of GC tuning
           • 20-minute service startup time to build indexes
           • 3 hours for background caching of most rates
       • Trouble brewing:
           • Total rates growing quickly
           • Rolling restart becoming unacceptably slow
           • If rates not in Java or MySQL cache, retrieval was very slow
    11. Geo Indexes
       • Tried standard geo 2D indexes in MongoDB
           • Too slow for my use case
       • Geo Haystack index
           • Conceptually similar
           • From “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
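The idea of an index tuned for small areas plus one extra filter can be illustrated with a plain grid of buckets. The following is a conceptual sketch in standalone JavaScript, not MongoDB's implementation: the function names (`bucketKey`, `buildIndex`, `search`) and the sample documents are hypothetical, and only the `loc` and `pm` field names and the 0.5 bucket size are taken from the slides.

```javascript
// Conceptual sketch of grid bucketing with a secondary filter.
// NOT MongoDB's haystack implementation; all function names are hypothetical.

// Map a coordinate to the id of the grid cell that contains it.
function bucketKey(lon, lat, bucketSize) {
  return Math.floor(lon / bucketSize) + ":" + Math.floor(lat / bucketSize);
}

// Group documents by (grid cell, secondary filter value).
function buildIndex(docs, bucketSize) {
  const index = new Map();
  for (const doc of docs) {
    const key = bucketKey(doc.loc[0], doc.loc[1], bucketSize) + "|" + doc.pm;
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc);
  }
  return index;
}

// Scan only the cells that can contain points within maxDistance of `near`,
// then apply an exact flat (Euclidean) distance check in coordinate units.
function search(index, bucketSize, near, maxDistance, pm) {
  const [lon, lat] = near;
  const reach = Math.ceil(maxDistance / bucketSize);
  const bx = Math.floor(lon / bucketSize);
  const by = Math.floor(lat / bucketSize);
  const results = [];
  for (let dx = -reach; dx <= reach; dx++) {
    for (let dy = -reach; dy <= reach; dy++) {
      const key = (bx + dx) + ":" + (by + dy) + "|" + pm;
      for (const doc of index.get(key) || []) {
        if (Math.hypot(doc.loc[0] - lon, doc.loc[1] - lat) <= maxDistance) {
          results.push(doc);
        }
      }
    }
  }
  return results;
}

// Made-up sample documents, for illustration only.
const docs = [
  { _id: 1, loc: [-122.4, 37.8], pm: 6757 },   // close, matching filter
  { _id: 2, loc: [-122.1, 37.6], pm: 6757 },   // close, matching filter
  { _id: 3, loc: [-121.0, 37.79], pm: 6757 },  // too far away
  { _id: 4, loc: [-122.4, 37.79], pm: 1234 },  // close, different filter value
];
const index = buildIndex(docs, 0.5);
const hits = search(index, 0.5, [-122.4, 37.79], 0.5, 6757);
console.log(hits.map((d) => d._id).sort((a, b) => a - b));
```

Because the filter value is part of the bucket key, documents with a different `pm` are never touched, which is the property that makes this kind of index attractive when a geo search always carries a secondary filter.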
    12. Mercator Projection with 10 degree grid
    13. Geo Haystack
       • We chose degrees long-lat for x-y coordinate system
       • 25 miles is our default search radius
           • Roughly 0.5 degrees in middle of the US

         db.priceables_1.ensureIndex({ loc: "geoHaystack", pm: 1 },
                                     { bucketSize: 0.5 })
         db.runCommand({ geoSearch: "priceables_1",
                         near: [-122.4, 37.79],
                         maxDistance: 0.5,
                         search: { pm: 6757 },
                         limit: 50000 })

       • maxDistance calculated using great circle algorithm
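The relationship between a radius in miles and a `maxDistance` in degrees can be checked with a few lines of arithmetic. This is a minimal sketch assuming a spherical Earth with a mean radius of 3958.8 miles; the helper names (`milesToLatDegrees`, `milesToLonDegrees`, `haversineMiles`) are hypothetical, and the slides do not show how Castlight actually performs the calculation.

```javascript
// Assumption of this sketch: spherical Earth, mean radius in miles.
const EARTH_RADIUS_MILES = 3958.8;

const toRadians = (degrees) => (degrees * Math.PI) / 180;
const toDegrees = (radians) => (radians * 180) / Math.PI;

// Degrees of latitude spanned by a north-south distance in miles.
function milesToLatDegrees(miles) {
  return toDegrees(miles / EARTH_RADIUS_MILES);
}

// Degrees of longitude spanned by an east-west distance in miles
// at the given latitude (longitude lines converge toward the poles).
function milesToLonDegrees(miles, latDegrees) {
  return milesToLatDegrees(miles) / Math.cos(toRadians(latDegrees));
}

// Great-circle (haversine) distance in miles between two [lon, lat] points.
function haversineMiles([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(h));
}

console.log(milesToLatDegrees(25).toFixed(2));      // degrees of latitude for 25 miles
console.log(milesToLonDegrees(25, 39).toFixed(2));  // degrees of longitude at 39°N
console.log(haversineMiles([-122.4, 37.79], [-122.4, 38.79]).toFixed(1)); // 1° of latitude in miles
```

Under these assumptions, 25 miles is about 0.36 degrees of latitude and about 0.47 degrees of longitude at 39°N, so a single value of 0.5 degrees is a slightly generous flat approximation of the search radius.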
    14. Geo Haystack Pros
       • Very fast when retrieving many documents in a relatively small search radius
       • Great when you also need to apply a secondary filter
           • Compound 2dsphere index in Mongo 2.4 has even better support
    15. Geo Haystack Cons
       • Supports only one extra filter in index (SERVER-2979)
       • A bug if unindexed query on only the second part of the key (SERVER-8645)

         > db.priceables_1.find({pm: 6757})
         error: { "$err" : "assertion src/mongo/db/geo/haystack.cpp:178" }

       • Second part of index can’t have an array value
       • Location part of key can’t be null
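The last two restrictions can be caught before a bulk load with a small pre-check. The sketch below is a hypothetical helper (`haystackProblems` is not part of any MongoDB API); it assumes locations are stored as `[x, y]` number pairs, as in the `near` argument on the Geo Haystack slide, and it treats a missing location the same as a null one.

```javascript
// Hypothetical pre-load check for the restrictions listed on the slide.
// Returns a list of human-readable problems; an empty list means the
// document passes this sketch's checks.
function haystackProblems(doc, locField = "loc", filterField = "pm") {
  const problems = [];
  const loc = doc[locField];

  // The location part of the key cannot be null (missing is treated alike here).
  if (loc === null || loc === undefined) {
    problems.push(locField + " is missing or null");
  } else if (
    !Array.isArray(loc) ||
    loc.length !== 2 ||
    !loc.every((n) => Number.isFinite(n))
  ) {
    // Assumption of this sketch: locations are [x, y] pairs of finite numbers.
    problems.push(locField + " is not an [x, y] pair of numbers");
  }

  // The second part of the index cannot hold an array value.
  if (Array.isArray(doc[filterField])) {
    problems.push(filterField + " is an array");
  }
  return problems;
}

console.log(haystackProblems({ loc: [-122.4, 37.79], pm: 6757 }));
console.log(haystackProblems({ loc: null, pm: [6757, 6758] }));
```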
    16. SSDs
       • For uncached data on HDD, Geo Haystack was twice as fast as custom Java geo index and MySQL
           • Still close to 1 minute for big queries with full data set
           • Death by random read
       • Tested with a $200 Samsung SSD
           • Typical query dropped to 20 millis
           • Big query only about 150 millis
    17. Mongoperf on SSDs

       Random 4k block reads, 5 GB file, 16 threads

       Env     SSD                           Read Ops/s    Read MB/s
       Prod    Samsung 200GB SLC             74k           288
       QA VM   Samsung 200GB SLC             30k           117
       Dev     Samsung 830 256GB SATA MLC    47k           183

       Sequential write of the 5 GB file

       Env     SSD                           Write Ops/s   Write MB/s
       Prod    Samsung 200GB SLC             1074          289
       QA VM   Samsung 200GB SLC             405           196
       Dev     Samsung 830 256GB SATA MLC    438           210
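The read figures can be sanity-checked by multiplying operations per second by the block size. This short calculation assumes that "4k" means 4096 bytes and that the throughput column counts units of 2^20 bytes; both are inferences from the numbers, not statements from the slides. The ops/s values are the rounded figures from the read table.

```javascript
// Assumptions of this sketch: 4k block = 4096 bytes, 1 MB = 2^20 bytes.
const BLOCK_BYTES = 4 * 1024;
const BYTES_PER_MB = 1024 * 1024;

// Throughput implied by a given rate of fixed-size block reads.
function readMBPerSecond(opsPerSecond) {
  return (opsPerSecond * BLOCK_BYTES) / BYTES_PER_MB;
}

// Rounded read ops/s from the slide's read table.
const readOps = { "Prod": 74000, "QA VM": 30000, "Dev": 47000 };

for (const [env, ops] of Object.entries(readOps)) {
  console.log(env, readMBPerSecond(ops).toFixed(1));
}
```

The results (roughly 289, 117, and 184) line up with the reported 288, 117, and 183 once the rounding of the ops/s column to the nearest thousand is taken into account.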
    18. Low Impact Pricing Updates
       • Requirements
           • Major price updates monthly
           • Minor updates more frequently
           • Huge bulk loads with no impact on active replica set
           • I/O bound, not CPU bound
    19. Replica Set Flipping Solution
       • Two replica sets
           • Lowered cost with two SSDs on each pricing server
       • scp compressed files from QA to passive replica set
           • Protip: to compress and uncompress

             tar cvf - pricing | pigz > ~/pricing.tgz
             pigz -dc pricing.tgz | tar xvf -

       • Page in index and data

             db.runCommand({ touch: "priceables_1", index: true, data: true })

       • Pricing Service operation to atomically flip
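The flip itself can be pictured as swapping which replica set the service reads from in one step. The slides do not describe how the Pricing Service implements this, so the following is only a hypothetical sketch: the `ReplicaSetSwitch` class and its methods are invented for illustration, and only the replica set names `prodpricing1` and `prodpricing2` come from the presentation.

```javascript
// Hypothetical sketch of a "flip" between an active and a passive replica set.
// The real Pricing Service mechanism is not described in the slides.
class ReplicaSetSwitch {
  constructor(active, passive) {
    // One immutable state object; a flip replaces it in a single assignment,
    // so a reader never observes a half-updated pair.
    this.state = Object.freeze({ active, passive });
  }

  // Replica set that serves queries right now.
  activeSet() {
    return this.state.active;
  }

  // Replica set that is free to receive the next bulk load.
  passiveSet() {
    return this.state.passive;
  }

  // Swap the roles and return the newly active replica set.
  flip() {
    const { active, passive } = this.state;
    this.state = Object.freeze({ active: passive, passive: active });
    return this.state.active;
  }
}

const pricingSwitch = new ReplicaSetSwitch("prodpricing1", "prodpricing2");

// Bulk-load and page in data on pricingSwitch.passiveSet() first, then:
pricingSwitch.flip();
console.log(pricingSwitch.activeSet());
```

Keeping the active and passive names in one immutable object means the swap is a single reference update, which is the property a multi-threaded service would need from whatever primitive it actually uses.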
    20. Replica Set Architecture (diagram)
       • Replica sets: prodpricing1, prodpricing2
       • Server pricing1: mongod 28001 (primary), mongod 28002 (secondary)
       • Server pricing2: mongod 28001 (secondary), mongod 28002 (primary)
       • Server db1: mongod 28001 (arbiter)
       • Server db2: mongod 28002 (arbiter)
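A layout like this could be written down as one replica set configuration document per set, of the kind passed to `rs.initiate()` in the mongo shell. The sketch below builds such a document as a plain JavaScript object. The host strings are assembled from the server names and ports on the slide, and the pairing of port 28001 with the name `prodpricing1` is an assumption; the slides do not show the actual configuration.

```javascript
// Hypothetical configuration for the replica set on port 28001.
// Assumptions: host names resolve as written, and this set is the one
// named "prodpricing1" (the slide does not say which name maps to which port).
const prodpricing1Config = {
  _id: "prodpricing1",
  members: [
    { _id: 0, host: "pricing1:28001" },                  // data-bearing member
    { _id: 1, host: "pricing2:28001" },                  // data-bearing member
    { _id: 2, host: "db1:28001", arbiterOnly: true },    // votes, holds no data
  ],
};

// In a mongo shell connected to one of the members, this document would be
// passed as: rs.initiate(prodpricing1Config)
const dataMembers = prodpricing1Config.members.filter((m) => !m.arbiterOnly);
console.log(dataMembers.map((m) => m.host));
```

Placing only an arbiter on the `db` servers matches the cost point made in the talk: the SSDs are needed only on the two pricing servers that hold data.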
    21. Replica Set Flipping Drawbacks
       • Obviously, increased cost, but only for SSDs
       • Recently added caching of remote pricing lookups
           • TTL collections
           • Cache is lost during a flip
           • But, usually flip late at night
           • Cache eviction time is only a few hours
    22. Overall Results
       • Geo search speed with cold cache acceptable
       • Geo search speed with warm cache awesome
       • Pricing Service startup down to a few seconds
       • No production impact for major rate updates
       • Lowered risk for minor rate updates
    23. Summary
       • Geo Haystack Index great for …
           • Retrieving lots of documents in a constrained search area
           • Geo searches with a secondary filter
       • SSDs great for …
           • Random reads
           • Reducing need for lots of complex indexes
       • Replica set flipping great for …
           • Instant swap of large amounts of data
           • Primarily, if not solely, read only
           • Trading cost for operational flexibility