Geo Searches for Health Care Pricing Data with MongoDB
 


I presented this updated version of my talk at NoSQL Now! 2013 in San Jose, CA, on August 22, 2013. The presentation describes how Castlight Health uses MongoDB to support very low latency searches for very large volumes of health care pricing data. Key factors are geospatial indexes, SSDs and replica sets.

  • Comment: the file system was ext4, mounted with the flags defaults,noatime,discard.

Presentation Transcript

  • Geo Searches for Health Care Pricing Data with MongoDB
    NoSQL Now 2013
    Robert Stewart, Senior Architect, Castlight Health
    rstewart@castlighthealth.com, @wombatnation
  • Castlight Health; The Business and Technical Problems; Initial Solution; MongoDB, Geospatial Indexes and SSDs; Replica Set Flipping
  • Castlight Health
    • Hosted web and mobile applications providing unbiased information on health care cost and quality
    • Customers are employers and health plans
    • Founded in San Francisco in 2008
    • $181 million in VC funding
    • #1 on Wall Street Journal’s list of “Top 50 Venture-Backed Companies” for 2011
    • Hiring!
  • Home Page
  • Search Results
  • Business Problem
    • Support searches for
      • Prices for a procedure performed by any in-network provider in a geographical area
      • Prices for all procedures performed by a single provider
    • Sub-second response, even if returning data on thousands of prices
  • Technical Problems
    • Need a very fast geospatial index
    • Rate count at 1 billion and rising
    • Major rate updates monthly
    • Difficult to index data to ensure sequential reads
    • Sometimes lots of random reads
    [Chart with a time axis running from Apr-11 to Aug-13]
  • Pricing Retrieval Architecture
    [Diagram. User side: Web Browser, Mobile Web Browser, Native Mobile Application. Castlight side: Castlight Web App, Castlight Mobile Web App, Proxy Service, Search Service, Pricing Service, Prices]
  • Initial Solution
    • Store pricing data in MySQL
    • When Pricing Service starts, create two in-memory indexes and cache most of the rates
      • 55 GB JVM Heap with lots of GC tuning
      • 20-minute service startup time to build indexes
      • 3 hours for background caching of most rates
    • Trouble Brewing:
      • Total rates growing quickly
      • Rolling restart becoming unacceptably slow
      • If rates not in Java or MySQL cache, retrieval was very slow
  • Enter the Mongo
  • Geospatial Indexes We Evaluated
    • Standard 2D index in MongoDB 2.2
      • Too slow for my use case
    • Geo Haystack index
      • From docs.mongodb.org: “A haystack index is a special index that is optimized to return results over small areas. Haystack indexes improve performance on queries that use flat geometry.”
    • 2DSphere index in MongoDB 2.4
  • Mercator Projection with 10 degree grid
  • Geo Haystack
    • We chose degrees long-lat for x-y coordinate system
    • 25 miles is our default search radius
      • Roughly 0.5 degrees in middle of the US

      db.priceables_1.ensureIndex(
        { loc: "geoHaystack", pm: 1 },
        { bucketSize: 0.5 })

      db.runCommand({
        geoSearch: "priceables_1",
        near: [-122.4, 37.79],
        maxDistance: 0.5,
        search: { pm: 6757 },
        limit: 50000 })
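The "roughly 0.5 degrees" figure on the Geo Haystack slide can be sanity-checked with a little arithmetic. The sketch below is not from the talk: it assumes a spherical Earth with a mean radius of 3959 miles, and the function names are made up for illustration.

```javascript
// Rough conversion of a search radius in miles to degrees.
// Assumption (not from the slides): spherical Earth, mean radius 3959 miles.
const EARTH_RADIUS_MILES = 3959;

// Length of one degree of latitude in miles (the same at every latitude on a sphere).
function milesPerDegreeLatitude() {
  return (2 * Math.PI * EARTH_RADIUS_MILES) / 360;
}

// Length of one degree of longitude in miles; it shrinks with the cosine of the latitude.
function milesPerDegreeLongitude(latitudeDeg) {
  return milesPerDegreeLatitude() * Math.cos((latitudeDeg * Math.PI) / 180);
}

// Express a radius in miles as degrees of latitude and of longitude at a given latitude.
function radiusInDegrees(miles, latitudeDeg) {
  return {
    latitude: miles / milesPerDegreeLatitude(),
    longitude: miles / milesPerDegreeLongitude(latitudeDeg),
  };
}

// 25 miles around the sample point used in the geoSearch command (latitude 37.79).
const atSamplePoint = radiusInDegrees(25, 37.79);
console.log(atSamplePoint.latitude.toFixed(2), atSamplePoint.longitude.toFixed(2));
```

Under these assumptions, 25 miles comes out to about 0.36 degrees of latitude and about 0.46 degrees of longitude at that point, so a flat 0.5 degree maxDistance slightly overshoots the radius, more so north-south than east-west.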
  • Geo Haystack Cons
    • Only one secondary filter
    • Second part of index can’t have an array value
    • Error on unindexed query on only the second part of the key
  • 2DSphere Index
    • Supports earth-like spherical geometries
    • Points can be GeoJSON or x,y pairs
    • GeoJSON LineString and Polygon
    • Queries for inclusion, intersection and proximity
  • 2DSphere Index Creation and Sample Query

      db.priceables_1.ensureIndex(
        { loc: "2dsphere", pm: 1, pn: 1 })

      db.priceables_1.find({
        "loc": { "$geoWithin": { "$centerSphere": [ [ -94.2128, 36.3840 ], 0.006314 ] } },
        "pm": 6441,
        "pn": { "$in": [ 5236, 5237 ] } })
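The radius passed to $centerSphere is in radians, and 0.006314 is consistent with the 25-mile default search radius divided by an Earth radius of about 3959 miles. A minimal sketch of building the same filter document follows; the helper names and the 3959-mile radius are assumptions, while the field names loc, pm and pn are taken from the slide as-is.

```javascript
// Assumption (not from the slides): mean Earth radius of 3959 miles.
const EARTH_RADIUS_MILES = 3959;

// $centerSphere expects the radius as an angle in radians: distance / sphere radius.
function milesToRadians(miles) {
  return miles / EARTH_RADIUS_MILES;
}

// Build a filter document shaped like the sample query on the slide.
// buildPriceQuery is a hypothetical helper, not part of any MongoDB driver.
function buildPriceQuery(longitude, latitude, radiusMiles, pm, pnList) {
  return {
    loc: {
      $geoWithin: {
        $centerSphere: [[longitude, latitude], milesToRadians(radiusMiles)],
      },
    },
    pm: pm,
    pn: { $in: pnList },
  };
}

const query = buildPriceQuery(-94.2128, 36.384, 25, 6441, [5236, 5237]);
console.log(JSON.stringify(query, null, 2));
```

The resulting object could be passed to find() in the mongo shell or a driver. Note that GeoJSON-style coordinate order is longitude first, then latitude.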
  • 2DSphere Results
    • Geospatially Accurate
    • Even Faster than Haystack
  • SSDs
    • For uncached data on HDD, MongoDB geo index was twice as fast as custom Java geo index with MySQL
      • Still close to 1 minute for big queries with full data set
      • Death by random read
    • Tested with a $200 Samsung SSD
      • Typical query dropped to 20 millis
      • Big query only about 150 millis
  • Mongoperf on SSDs
    Random 4k block reads, 5 GB file, 16 threads:

      Env    | SSD                         | Read Ops/s | Read MB/s
      Prod   | Samsung 200GB SLC           | 74k        | 288
      QA VM  | Samsung 200GB SLC           | 30k        | 117
      Dev    | Samsung 830 256GB SATA MLC  | 47k        | 183

    Sequential write of the 5 GB file:

      Env    | SSD                         | Write Ops/s | Write MB/s
      Prod   | Samsung 200GB SLC           | 1074        | 289
      QA VM  | Samsung 200GB SLC           | 405         | 196
      Dev    | Samsung 830 256GB SATA MLC  | 438         | 210
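The read columns of the mongoperf table are related by simple arithmetic: operations per second times the block size gives throughput. The sketch below assumes that "4k" means 4096 bytes and that MB/s means 2^20 bytes per second; neither is stated on the slide, and the function name is illustrative.

```javascript
// Throughput implied by a random-read rate.
// Assumptions: 4k block = 4096 bytes, 1 MB = 1024 * 1024 bytes.
function readThroughputMBPerSec(opsPerSec, blockBytes = 4096) {
  return (opsPerSec * blockBytes) / (1024 * 1024);
}

// Read Ops/s figures from the table (rounded to thousands on the slide).
const readOps = { Prod: 74000, "QA VM": 30000, Dev: 47000 };

for (const [env, ops] of Object.entries(readOps)) {
  console.log(env, readThroughputMBPerSec(ops).toFixed(1), "MB/s");
}
```

Under these assumptions the three environments work out to about 289.1, 117.2 and 183.6 MB/s, which agrees with the reported 288, 117 and 183 MB/s to within the rounding of the Ops/s column.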
  • Low Impact Pricing Updates
    • Requirements
      • Major price updates monthly
      • Minor updates more frequently
      • Huge bulk loads with no impact on active replica set
      • I/O bound, not CPU bound
    • Solution
      • Two MongoDB replica sets
      • Multiple SSDs per server
  • Replica Set Architecture
    Replica sets: prodpricing1, prodpricing2
    Physical servers:

      Server    | mongod on port 28001 | mongod on port 28002
      pricing1  | primary              | secondary
      pricing2  | secondary            | primary
      db1       | arbiter              | (none)
      db2       | (none)               | arbiter
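The layout on the Replica Set Architecture slide maps naturally onto two replica set configuration documents of the kind passed to rs.initiate(). This is a sketch under assumptions: the slide does not say which set listens on which port, so pairing prodpricing1 with 28001 and prodpricing2 with 28002 is a guess, the server labels are used as hostnames, and buildReplicaSetConfig is a hypothetical helper.

```javascript
// Build a replica set config document: data-bearing members first, then one arbiter.
// The document shape (_id, members[]._id, members[].host, members[].arbiterOnly)
// follows MongoDB's replica set configuration format.
function buildReplicaSetConfig(setName, port, dataHosts, arbiterHost) {
  const members = dataHosts.map((host, index) => ({
    _id: index,
    host: `${host}:${port}`,
  }));
  members.push({
    _id: members.length,
    host: `${arbiterHost}:${port}`,
    arbiterOnly: true, // votes in elections but stores no data
  });
  return { _id: setName, members: members };
}

// Assumed pairing of set names to ports; hostnames are the slide's server labels.
const prodpricing1 = buildReplicaSetConfig("prodpricing1", 28001, ["pricing1", "pricing2"], "db1");
const prodpricing2 = buildReplicaSetConfig("prodpricing2", 28002, ["pricing2", "pricing1"], "db2");

console.log(JSON.stringify([prodpricing1, prodpricing2], null, 2));
```

Each set spans both pricing servers and adds an arbiter on a separate database server, so either set can elect a primary if one pricing server is lost. Which member becomes primary is decided by election; the config above does not pin it.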
  • Replica Set Flipping Solution
    • Transfer compressed data files to passive replica set
      • Protip: to compress and uncompress

        tar cvf - pricing | pigz > ~/pricing.tgz
        pigz -dc pricing.tgz | tar xvf -

    • Page in index and data

        db.runCommand({ touch: "priceables_1", index: true, data: true })

    • Pricing Service operation to atomically flip
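The slides do not show how the Pricing Service performs the atomic flip. One common way to get that effect is to keep a single reference to the active replica set and swap it in one step, so every read sees either the old set or the new one and never a mixture. The sketch below illustrates that idea only; the class and method names are hypothetical and no real MongoDB connections are opened.

```javascript
// Hypothetical router that tracks which replica set serves reads.
class ReplicaSetRouter {
  constructor(activeSet, passiveSet) {
    this.active = activeSet;   // set currently serving queries
    this.passive = passiveSet; // set free to receive bulk loads
  }

  // Callers ask for the active set on every request instead of caching it.
  activeSet() {
    return this.active;
  }

  // Swap roles in a single synchronous step. JavaScript runs this to completion
  // before any other callback, so no request observes a half-finished flip.
  flip() {
    [this.active, this.passive] = [this.passive, this.active];
    return this.active;
  }
}

const router = new ReplicaSetRouter("prodpricing1", "prodpricing2");
console.log("serving from", router.activeSet());
// ...load and touch the new data on the passive set, then:
router.flip();
console.log("serving from", router.activeSet());
```

In a multi-threaded service the same idea would need an atomic reference or a lock around the swap, and in-flight queries against the old set would have to be allowed to finish before it is reloaded.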
  • Replica Set Flipping Drawbacks
    • Obviously, increased cost, but only for extra SSDs
    • Recently added caching of remote pricing lookups
      • TTL collections
      • Cache is lost during a flip
      • But, usually flip late at night
      • Cache eviction time is only a few hours
  • Overall Results
    • Geo search speed with cold cache acceptable
    • Geo search speed with warm cache awesome
    • Pricing Service startup down to a few seconds
    • No production impact for major rate updates
    • Lowered risk for minor rate updates
  • Summary
    • Geo Haystack Index great for …
      • Retrieving lots of documents in a constrained search area
      • Very simple geospatial searches with a single secondary filter
    • 2DSphere Index great for …
      • Complex geospatial searches or complex indexing
    • SSDs great for …
      • Random reads
      • Reducing need for lots of complex indexes
    • Replica set flipping great for …
      • Instant swap of large amounts of data
      • Primarily, if not solely, read only
      • Trading cost for operational flexibility
  • Q & A