MongoDB Use Cases: Healthcare, CMS, Analytics


  • Thomas O’Rourke, Oulu / Seattle. Seattle’s a great place: Amazon, Microsoft, Facebook, Google. Big Data is here. Hearing some of the world’s architects say “It’s all a big hash table” or “You can’t do global relations anyway, so de-normalize” gave me the confidence to try MongoDB. Cassandra, Hadoop, Riak, Redis, CouchDB: all are good. MongoDB is the EASIEST to work with and get started with, and has the BROADEST use cases because of its document architecture and indexing. Fun to hear horror stories; I’m afraid I don’t have any. Or maybe a few. Stand on shoulders. Visa cards: just reconcile at the end of the day.
  • Data is kept for one year so we can answer “Has this ID ever been seen before in the entire year?” The data structure needs to be flexible.
  • Timestamps might have 1-second resolution, or milliseconds, in which case there are 2 values (seconds and fraction). Do it! MongoDB was easy to get started with. De-duplication: an index built on a partial MD5 string hash and a timestamp (2 numbers in a compound index). L1. De-duplication: log lines must be unique. Indexes hold only recent values in RAM. Playback of log files in case of problems. L2. Parsing into key/value pairs with JSON.parse(). L3. Processing. L4. Reports can work from slaves.
  • No duplicates: integer indexes are fast. Pre-allocate scheme. 100% Mongo. Collections make more collections, which refine collections, etc. (see example). Use the dynamic nature of creating collections! It’s not a relational DB. BE DYNAMIC FOR GOODNESS’ SAKE! For example, “Event_<NAME>”: create a collection with the event name. Might need to do some cleanup. So what. Playback is SUPER IMPORTANT. Verify everything is there. AUTO INTEGRITY CHECKS.
  • We actually write JSON to our log files for events we want to capture. These can be parsed with one line of code. Dynamic creation of collections. Then it can go directly into MongoDB.
  • Say you have a collection (red) that you want to “process”. Pre-allocate a processed flag (there may be many). In the processed collection, store the source_id and create an index with no duplicates. This way you can have many target collections, but you will never process twice. ONLY UPDATE the processed flag after being 100% sure the processed document has been inserted. BUT you might want to update several. The processed flag does not have to be safe-updated. GUARANTEED. Also, because of playback, high/low water marks didn’t work.
  • Almost everything was map/reduce. MySQL was considered for reports, but 100% Mongo was easier!
  • You can read from a replica (use tags), but it might be 50 ms behind. Only the primary takes writes and is guaranteed to be consistent (comparable to MySQL). People who run benchmarks need to consider this. Connection pool? Yes! Multithreaded or non-blocking I/O (EventMachine/Tornado)? Yes!
  • Journaling is on by default. The oplog and journal writes are done in an atomic transaction (how is that possible?). After n operations or a single operation. If you are using a MULTITHREADED driver, your write and read might not be consistent. getLastError() is per thread! So driver… N-writes = majority. For us, we don’t have n-writes because we have the integrity checks. Journal means written to disk. And you can combine it with write concern.
  • A write is not committed until it hits a majority of nodes, even with journaling (journaling is the default). All writes that were never replicated will trigger a rollback. The changes are stored in a “rollback” sub-directory. USE write concern to wait until a write is replicated to a majority, either after every write or after a series of writes.
  • Nothing greater than to wake up and see it failed over without intervention.
  • A “.” is an event in 200 seconds. A “0” is no event in 200 seconds. Entire month of data. 5 million events. 30 seconds to map-reduce this.

    1. 1. MongoDB Use Cases: Healthcare, CMS, Analytics. Thomas O’Rourke, Upstream Innovations Ltd., Oulu / Seattle
    2. 2.
    3. 3. Dashwire Dashconfig
       • Users configure their mobile phones on a PC.
         o Email accounts, wallpapers, ringtones, bookmarks, contacts, etc.
         o Generates a lot of data!
       • Wanted: Google Analytics + Splunk + BI.
         o Sensitive data: can’t send it out, so no Google Analytics.
         o Many sources (server log files, SQS, web analytics, etc.).
         o Internal error reports and UI issues (a powerful paradigm).
         o Real time vs. reports/enterprise.
       • ~500,000 events a day.
         o Stored for a year.
    4. 4. Solution
       • Ecosystem in Mongo
         o Evolved
       • Layered architecture
         o L1. Store: de-duplication.
           • Streaming live (syslog)
           • Playback of log files
         o L2. Parsing into key/value pairs.
         o L3. Processing.
         o L4. Reports.
       • Trade-offs for real time
         o Reconciler
         o Trade-offs between real time and offline
    5. 5. Tools
       • MongoDB
       • Ruby
       • Sinatra
       • Ruby driver
         o (Connection pooling, multithreaded, replica set support)
       • EventMachine + em-mongo
       • ZeroMQ
       • Sinatra/Rack/Thin
       • Mixpanel
       • Server Density
       • Excel
       • Highcharts
       • SoftLayer
    6. 6. Ecosystem
       [Diagram. Labels: Syslog; Playback; Store strings with timestamps; No duplicates; Integrity checks (once a day); Process to key/value pairs; Sanitize/intermediate; Real-time charts; External interface; App-specific reports (Excel, etc., daily/weekly).]
    7. 7. Parsing logs
       "2012-08-17 13:08:11 app02 Passngr[20167]: I script(www-data) --{"analytics":{"scenario":"three","initial scenario":"three","phone":"CoolPhone","name":"Facebook","time":"2012-08-17 18:08:11.399 UTC","event":"BookmarkAdded","browser_tracking_id":"857b307a4d1xxxxx08ebca70f6","browser_time":"2012-08-17 18:08:14.794 UTC","browser_event":1,"session_id":"68528379d5xxxxxxxcda27fd625fe"}}"
       JSON.parse( )
       Collection = Event_Bookmark_Added
       { scenario: "three", phone: "Cool phone", event: "Bookmark Added", session_id: ... }
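The parsing step on this slide can be sketched in plain Ruby. This is only an illustration, not the speaker's code: the helper names `parse_event` and `collection_name` are hypothetical, the rule of splitting on the first "--" is an assumption based on the sample line, and the sample line is shortened. Nothing is written to MongoDB here.

```ruby
require "json"

# Hypothetical helper: take everything after the first "--" marker
# and parse it as JSON. Returns the "analytics" hash, or nil when the
# line carries no payload or the payload is not valid JSON.
def parse_event(line)
  _prefix, payload = line.split("--", 2)
  return nil if payload.nil?
  JSON.parse(payload)["analytics"]
rescue JSON::ParserError
  nil
end

# Hypothetical naming rule: "BookmarkAdded" becomes "Event_Bookmark_Added".
def collection_name(event)
  words = event.scan(/[A-Z][a-z0-9]*/)
  words = [event] if words.empty?
  "Event_" + words.join("_")
end

# Shortened version of the log line from the slide.
line = '2012-08-17 13:08:11 app02 Passngr[20167]: I script(www-data) --' \
       '{"analytics":{"scenario":"three","phone":"CoolPhone","event":"BookmarkAdded"}}'

doc = parse_event(line)
name = collection_name(doc["event"])
puts "#{name}: #{doc.inspect}"
```

Returning nil for unparseable lines keeps playback of whole log files from aborting on a single bad line; whether the original system did the same is not stated in the slides.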
    8. 8. De-duplication
       • Compound index
         o Integers perform well
           • MD5 of the entire log line as a string (only half of the result is used)
           • Unix timestamp (seconds)
           • Fraction of a second (if one is present)
           • Milliseconds are better but not required
       # Unique compound index on timestamp, fraction, and line hash.
       @collections[collection].create_index(
         [ [:ts, Mongo::ASCENDING],
           [:ts_frac, Mongo::ASCENDING],
           [:dhash, Mongo::ASCENDING] ],
         { :unique => true, :drop_dups => true } )
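As a rough sketch of how the three indexed fields could be derived, the snippet below builds `ts`, `ts_frac`, and `dhash` from a log line and a timestamp. The helper `dedup_key`, the choice of the first 16 hex digits as "half" of the MD5, and the mask that keeps the value inside a signed 64-bit integer are all assumptions made for illustration. A `Set` stands in for the unique index so the example runs without a database.

```ruby
require "digest"
require "set"

# Assumption: clear the top bit so the hash fits a signed 64-bit integer.
INT63_MASK = 0x7FFF_FFFF_FFFF_FFFF

# Hypothetical helper: build the three integer fields of the compound key.
def dedup_key(line, time)
  half_md5 = Digest::MD5.hexdigest(line)[0, 16].to_i(16) # first half of the digest
  {
    ts: time.to_i,              # Unix timestamp in seconds
    ts_frac: time.usec / 1000,  # milliseconds within the second
    dhash: half_md5 & INT63_MASK
  }
end

seen = Set.new # in-memory stand-in for the unique index

line = "2012-08-17 13:08:11 app02 example log line"
time = Time.utc(2012, 8, 17, 18, 8, 11, 399_000)

key = dedup_key(line, time)
first = !seen.add?(key).nil?                    # first sighting is accepted
replay = !seen.add?(dedup_key(line, time)).nil? # played-back line is rejected
puts "first=#{first} replay=#{replay}"
```

Because the key depends only on the line and its timestamp, playing back a log file produces the same keys again, which is what lets a unique index discard the duplicates.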
    9. 9. Process pattern
       • Pre-allocate "processed: 0" at insert time (creation):
         @collections[collection].insert( doc )
       • Index (no duplicates)
       • Process
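The process pattern can be illustrated with in-memory stand-ins for the source and target collections. This is a sketch under stated assumptions, not the speaker's implementation: the class `UniqueStore` and the function `process_pending` are hypothetical, and the "processing" step (upcasing the event name) is a placeholder. The point is the ordering: the target insert is protected by a no-duplicates key on `source_id`, and the `processed` flag is flipped only afterwards.

```ruby
# In-memory stand-in for a target collection with a unique index on source_id.
class UniqueStore
  def initialize
    @docs = {}
  end

  # Returns false instead of storing when source_id was already inserted.
  def insert(source_id, doc)
    return false if @docs.key?(source_id)
    @docs[source_id] = doc
    true
  end

  def size
    @docs.size
  end
end

# Hypothetical processing loop over documents pre-allocated with processed: 0.
def process_pending(source, target)
  source.each do |doc|
    next unless doc[:processed] == 0
    # The insert is safe to repeat: a duplicate source_id is rejected.
    target.insert(doc[:_id], { source_id: doc[:_id], event: doc[:event].upcase })
    doc[:processed] = 1 # flip the flag only after the insert attempt
  end
end

source = [
  { _id: 1, event: "bookmark_added", processed: 0 },
  { _id: 2, event: "contact_added",  processed: 0 }
]
target = UniqueStore.new

process_pending(source, target)
source[0][:processed] = 0 # simulate a crash before the flag was saved
process_pending(source, target)
puts target.size
```

If the process dies between the insert and the flag update, the document is simply attempted again and the duplicate is refused, which is why the flag itself does not need a safe write.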
    10. 10. Reports
       • Needed both real-time and enterprise (Excel) reports
         o We use MongoDB for both, and for all intermediate tables
       • Reports
         o Map/Reduce for reports and graphs
         o Considered MySQL but rejected it as unnecessary
         o Write Excel (*.xlsx) directly using Ruby and accessing MongoDB
       • Real-time
         o Incremental Map/Reduce gives the performance needed for real-time graphs
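The idea behind incremental Map/Reduce can be shown with a small pure-Ruby analogue. In MongoDB the map and reduce functions are written in JavaScript, so this is only a sketch of the principle: each new batch is reduced on its own and merged into the running totals, instead of re-reducing the whole history. The function names and the 200-second bucket size are assumptions chosen for illustration.

```ruby
BUCKET_SECONDS = 200 # assumed bucket width for the event timeline

# "Map" each event to its bucket start and "reduce" by counting per bucket.
def count_by_bucket(events)
  events.each_with_object(Hash.new(0)) do |event, counts|
    bucket = (event[:ts] / BUCKET_SECONDS) * BUCKET_SECONDS
    counts[bucket] += 1
  end
end

# Incremental step: fold a partial result into the running totals.
def merge_counts!(totals, partial)
  partial.each { |bucket, count| totals[bucket] += count }
  totals
end

old_batch = [{ ts: 0 }, { ts: 150 }, { ts: 200 }]
new_batch = [{ ts: 399 }, { ts: 400 }]

totals = Hash.new(0)
merge_counts!(totals, count_by_bucket(old_batch)) # initial run
merge_counts!(totals, count_by_bucket(new_batch)) # only the new events are scanned

full = count_by_bucket(old_batch + new_batch)     # full recomputation, for comparison
puts totals == full
```

This works because counting is associative: merging per-batch counts gives the same result as counting everything at once, so a real-time graph only pays for the events that arrived since the last run.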
    11. 11. Server Density
    12. 12. PART 2: Technical Discussion
       • Performance
       • Durability
       • Replica sets
       • Maintenance
       • Transactions
       • Drivers and languages
       • Demos
    13. 13. Performance
       • ~3000 inserts a second in unsafe mode.
       • < 1000 in safe mode.
       • Indexes = memory.
       • Use slaves when possible for reads (note: consistency).
       • Your driver makes a HUGE difference.
       • Pre-allocate for updates!
       • Safe mode is much slower.
         o Not everything is required to be 100% safe.
         o Not everything is unsafe.
         o Think! ARCHITECT your durability where you need it!
    14. 14. Durability
       [Diagram: safe modes on a scale from FAST/UNSAFE to SAFE/SLOWER. Labels: Single; Replica set; Cluster; Unsafe; Safe; Journal; n-writes (with journal); majority.]
    15. 15. Replica set uses
       • Redundancy
         o Data is on multiple nodes
         o An n-seconds-behind node is an “ass” saver (it’s very easy to accidentally drop a collection!)
       • Failover
         o Sleep at night
       • Maintenance
         o Back up slaves
         o Build indexes on slaves and promote them
       • Load balancing
         o Reads on slaves
       @collection.insert(doc, :safe => { :w => "majority" } )
       Journal + replicate: the journal applies only to the primary, but it guarantees that the rollback data will be available if the primary fails before replication.
    16. 16. Maintenance
       • Backup/Maintenance
         o Back up by stopping a slave, copying the files, and starting the slave again
           • /data/*
           • Can be copied, backed up, and compressed
           • Compression is high (can be 70%!) because field names are stored uncompressed
         o Mongo export and import of BSON can be run while the database is running
         o Server Density
           • Node health
           • Slave lag (time behind)
           • Index size
           • Etc.
    17. 17. Transactions
       • findAndModify().
         o Atomically updates a document and returns it in the same operation
       • Upserts and indexes.
       • Plan for failure; do not assume transactions.
    18. 18. Driver and language
       • Driver and language
         o Use a dynamic language! Ruby, Python, etc.
         o Driver support for replica sets and connection pooling is preferred.
         o A simple ORM/mapper works great.
           • Mongoid
           • MongoMapper
           • Or even just the plain driver (Mongo Ruby driver)
         o Learn JavaScript!
           • Shell JavaScript commands and Ruby driver methods are very similar
             o findOne vs find_one
           • Map/Reduce is always JavaScript
           • Everything is a Map/Reduce; get used to it.
           • (It’s not difficult for these purposes!)
    19. 19. Demos
       • Demos
         o jQuery tree view
         o Sinatra
         o Mongo
       • Cool
         o Integrating R with MongoDB
         o Highcharts
       • Contact information: