MongoDB at the energy frontier

Published in: Technology

  1. MongoDB at the energy frontier
     Valentin Kuznetsov, Cornell University
     MongoNYC, May 2012 (Monday, May 21, 2012)
  2. Outline
     ✤ CMS :: LHC :: CERN
     ✤ Data Aggregation System and MongoDB
     ✤ Experience
     ✤ Summary
  3. CMS :: LHC :: CERN
     The Large Hadron Collider (LHC) is located at CERN, Geneva, Switzerland. CMS is one of the 4 experiments that probe our knowledge of particle interactions and search for new physics.
  4. CMS :: LHC :: CERN
     Compact Muon Solenoid (CMS)
  5. CMS :: LHC :: CERN
     Typical proton-proton collision in the CMS detector
  6. CMS :: LHC :: CERN
     ✤ 40 countries, 172 institutions, more than 3000 scientists
     ✤ The CMS experiment produces a few PB of real data each year, and we collect ~TB of meta-data
     ✤ CMS relies on GRID infrastructure for data processing and uses 100+ computing centers worldwide
     ✤ CMS software consists of 4M lines of C++ (framework), 2M lines of Python (data management), plus Java, Perl, etc.
     ✤ ORACLE, MySQL, SQLite, NoSQL
  7. Dilemma
     How can I find my data?
     [Figure: the CMS data services surrounding the user: GenDB, LumiDB, Data Quality, Phedex, DBS, PSetDB, SiteDB, Overview, RunDB]
  8. Motivations
     ✤ Users want to query different data services without knowing about their existence
     ✤ Users want to combine information from different data services
     ✤ Some users may have domain knowledge, but they need to query X services, using Y interfaces and dealing with Z data formats, to get our data
     [Figure: the Data Aggregation System on top of the data services, which are linked by shared keys (run, lumi, block, site, MC id, pset). Keys shown per service:
        RunSummary: run, trigger, detector, ...
        DataQuality: trigger, ecal, hcal, ...
        LumiDB: lumi, luminosity, hltpath
        Phedex: block, file, block.replica, file.replica, se, node, ...
        DBS: run, file, block, site, config, tier, dataset, lumi, parameters, ...
        GenDB: generator, xsection, process, decay, ...
        SiteDB: site, admin, site.status, ...
        Overview: country, node, region, ...
        Parameter Set DB: CMSSW parameters
        Services A to E: param1, param2, ...]
  9. Implementation idea
     ✤ When we talk, we may use different languages (English, French, etc.) or different conventions (pounds vs kg)
     ✤ To establish communication, we use a translation, a dictionary, or a thesaurus
  10. Implementation idea
  11. Pros
      ✤ Separates data management from the discovery service
      ✤ Data are safe and secure
      ✤ Pluggable architecture (new translations)
      ✤ Users never have to deal with interfaces, naming and schema conflicts, data formats, or security policies
      ✤ Information is aggregated in real time over distributed services
      ✤ Data consistency checks for free
      ✤ DB and API changes are transparent to end users
  12. Cons
      ✤ DAS does not own the data
         ✤ lots of writes/reads/translations
      ✤ Data services are the real bottleneck
         ✤ nothing is guaranteed: a service can go down, we have no control over its performance, the requested data can be really large, etc.
         ✤ cache often and preemptively
      MongoDB to the rescue!
  13. Data Aggregation System
      [Figure: DAS architecture. Recoverable labels:
         DAS robot: fetches popular queries/APIs, invokes the same API(params), updates the cache periodically
         DAS mapping: maps data-service output to DAS records
         DAS Analytics: records queries and API calls
         DAS cache and DAS merge
         DAS core: parser, aggregator, and data-service plugins (runsum, lumidb, phedex, sitedb, dbs)
         DAS web server: RESTful interface and UI
         DAS cache server]
  14. Mapping DB
      ✤ Holds the translation between user keywords and data-service APIs, resolves naming conflicts, etc.
      ✤ The query city=Ithaca translates into a Google API call:
         {das2api: [{api_param: q, das_key: city.name, pattern: }],
          daskeys: [{key: city, map: city.name, pattern: }],
          expire: 3600, format: JSON,
          params: {output: json, q: required},
          system: google_maps,
          url: http://maps.google.com/maps/geo,
          urn: google_geo_maps}
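The translation step on the Mapping DB slide can be sketched in a few lines of Python. This is an illustration only, not the DAS implementation: the function name `build_api_call` is hypothetical, the record is the one shown on the slide with its blank `pattern` fields represented as empty strings (an assumption), and no network call is made.

```python
from urllib.parse import urlencode

# Mapping record copied from the slide; empty "pattern" values are an assumption.
MAPPING_RECORD = {
    "system": "google_maps",
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "format": "JSON",
    "expire": 3600,
    "params": {"output": "json", "q": "required"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
    "das2api": [{"api_param": "q", "das_key": "city.name", "pattern": ""}],
}


def build_api_call(record, das_query):
    """Hypothetical helper: turn {das_key: value} into a data-service URL."""
    params = dict(record["params"])  # start from the API defaults
    for rule in record["das2api"]:
        # Rename each DAS key to the parameter name the data service expects.
        if rule["das_key"] in das_query:
            params[rule["api_param"]] = das_query[rule["das_key"]]
    missing = [name for name, value in params.items() if value == "required"]
    if missing:
        raise ValueError("missing required API parameters: %s" % missing)
    # Sort parameters so the resulting URL is deterministic.
    return record["url"] + "?" + urlencode(sorted(params.items()))


url = build_api_call(MAPPING_RECORD, {"city.name": "Ithaca"})
```

Keeping the renaming rules in data rather than in code is what lets a new data service be plugged in by adding a record instead of changing the core.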
  15. Analytics DB
      ✤ Keeps track of user queries and data-service API calls:
         {api: {params: {q: Ithaca, output: json}, name: google_geo_maps},
          qhash: 7272bdeac45174823d3a4ea240c124ec,
          system: google_maps, counter: 5}
      ✤ Used by DAS analytics daemons to pre-fetch “hot” queries
         ✤ ValueHotSpot looks up data by popular values
         ✤ KeyHotSpot looks up data by popular keys
         ✤ QueryMaintainer keeps a given query always in cache
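A minimal in-memory sketch of the bookkeeping described on the Analytics DB slide follows. It is not the DAS code: the class and function names are hypothetical, a plain dict stands in for the MongoDB collection, and the way `qhash` is computed (MD5 over a key-sorted JSON dump) is an assumption, since the slide only shows a 32-character hex value.

```python
import hashlib
import json


def query_hash(api_name, params):
    """Assumed hashing scheme: MD5 of a canonical (key-sorted) JSON dump."""
    payload = json.dumps({"name": api_name, "params": params}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


class Analytics:
    """Hypothetical stand-in for the analytics collection."""

    def __init__(self):
        self.records = {}  # qhash -> record shaped like the slide's example

    def record_call(self, system, api_name, params):
        """Count one API call, creating the record on first use."""
        qhash = query_hash(api_name, params)
        record = self.records.setdefault(qhash, {
            "api": {"name": api_name, "params": dict(params)},
            "qhash": qhash,
            "system": system,
            "counter": 0,
        })
        record["counter"] += 1
        return qhash

    def hot_queries(self, limit=10):
        """Return the most frequently requested calls, hottest first."""
        ranked = sorted(self.records.values(),
                        key=lambda rec: rec["counter"], reverse=True)
        return ranked[:limit]
```

A pre-fetching daemon in the spirit of the slide would periodically take `hot_queries()` and re-issue those API calls before their cached results expire.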
  16. Caching DB
      ✤ Data coming from data-service providers are translated into JSON and stored in the cache collection
         ✤ naming translations are performed at this level
      ✤ Data records from the cache collection are processed on a common key, e.g. city.name, and merged into the merge collection
      cache collection:
         {city: {name: Ithaca, lat:42, lng:-76}}
         {city: {name: Ithaca, zip:14850}}
      merge collection:
         {city: {name: Ithaca, lat:42, lng:-76, zip:14850}}
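The merge step on the Caching DB slide can be sketched as follows, using plain lists and dicts in place of the cache and merge collections. The helper names are hypothetical, and the conflict policy (a later record overwrites an earlier one for the same field) is an assumption that the slide does not specify.

```python
import copy


def get_nested(record, dotted_key):
    """Resolve a dotted key such as 'city.name' inside a nested dict."""
    value = record
    for part in dotted_key.split("."):
        value = value[part]
    return value


def deep_update(target, source):
    """Recursively copy fields of source into target (later values win)."""
    for key, val in source.items():
        if isinstance(val, dict) and isinstance(target.get(key), dict):
            deep_update(target[key], val)
        else:
            target[key] = val
    return target


def merge_records(cache_records, common_key):
    """Group cache records by the value of common_key and merge each group."""
    merged = {}
    for record in cache_records:
        key_value = get_nested(record, common_key)
        # Deep-copy so the cached records themselves are never mutated.
        deep_update(merged.setdefault(key_value, {}), copy.deepcopy(record))
    return list(merged.values())


cache = [
    {"city": {"name": "Ithaca", "lat": 42, "lng": -76}},
    {"city": {"name": "Ithaca", "zip": 14850}},
]
merge = merge_records(cache, "city.name")
```

With the two cache records from the slide, `merge` holds a single record that combines the coordinates and the zip code.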
  17. DAS workflow
      ✤ Query parser
      ✤ Query the DAS merge collection
      ✤ Query the DAS cache collection
      ✤ Invoke a call to the data service
      ✤ Write to analytics
      ✤ Aggregate results
      ✤ Represent results on the web UI or via the command line interface
      [Figure: flowchart. The query enters DAS core (with logging) and goes through the parser. If the DAS merge collection has the answer, results are returned; otherwise the DAS cache is queried; on a cache miss the data services are called through DAS Mapping, DAS Analytics is updated, and the Aggregator passes results to the Web UI.]
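The lookup order in the workflow above can be sketched as a small function. This is a simplified illustration under stated assumptions: dicts stand in for the merge and cache collections, `fetch` and `merge` are caller-supplied placeholders for the data-service call and the merge step, a list stands in for the analytics log, and record expiry is left out.

```python
def das_workflow(query, merge_col, cache_col, fetch, merge, analytics):
    """Return merged records for query, consulting the cheapest source first."""
    # 1. Merged results already exist: serve them directly.
    if query in merge_col:
        return merge_col[query]
    # 2. Nothing cached yet: call the data services and log the call.
    if query not in cache_col:
        cache_col[query] = fetch(query)
        analytics.append(query)
    # 3. Merge the cached service records and remember the result.
    merge_col[query] = merge(cache_col[query])
    return merge_col[query]
```

Because the data services are only contacted in step 2, a repeated query is answered entirely from the merge collection, which is the point of treating DAS as a cache in front of slow or unreliable back ends.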
  18. Example
  19. DAS QL & MongoDB QL
      ✤ The DAS Query Language is built on top of MongoDB QL; it represents MongoDB QL in human-readable form
         ✤ UI level: block dataset=/a/b/c | grep block.size | count(block.size)
         ✤ DB level: col.find(spec={'dataset.name': '/a/b/c'}, fields=['block.size']).count()
      ✤ We enrich the QL with additional filters (grep, sort, unique) and implement a set of coroutines for aggregator functions
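To make the UI-level to DB-level translation concrete, here is a toy parser for the example query on this slide. It is a sketch, not the DAS parser: the function name, the returned dictionary layout, and the hard-coded `dataset` to `dataset.name` key map are assumptions (in DAS that relationship would come from the Mapping DB), and only `grep` filters and `name(field)` aggregators are handled.

```python
import re


def parse_das_query(text, key_map=None):
    """Split a DAS-QL-like string into select keys, a spec, fields and aggregators."""
    if key_map is None:
        key_map = {"dataset": "dataset.name"}  # assumed mapping, taken from the slide
    head, *pipes = [part.strip() for part in text.split("|")]
    select_keys, spec = [], {}
    for token in head.split():
        if "=" in token:
            key, value = token.split("=", 1)
            spec[key_map.get(key, key)] = value  # condition, e.g. dataset=/a/b/c
        else:
            select_keys.append(token)  # entity the user asks for, e.g. block
    fields, aggregators = [], []
    for pipe in pipes:
        match = re.fullmatch(r"(\w+)\((.+)\)", pipe)
        if match:
            aggregators.append((match.group(1), match.group(2)))
        elif pipe.startswith("grep "):
            fields.extend(name.strip() for name in pipe[5:].split(","))
        else:
            raise ValueError("unsupported filter: %r" % pipe)
    return {"select": select_keys, "spec": spec,
            "fields": fields, "aggregators": aggregators}


parsed = parse_das_query("block dataset=/a/b/c | grep block.size | count(block.size)")
```

The `spec` and `fields` entries correspond to the two arguments of the `find` call shown at the DB level, while the aggregator pair names the function to apply to the returned records.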
  20. DAS & MongoDB
      ✤ DAS works with 15 distributed data services
         ✤ their sizes vary, on average O(100GB)
      ✤ DAS uses 40 MongoDB collections
         ✤ caching, mapping, analytics, logging (normal, capped, and GridFS collections)
      ✤ DAS inserts/deletes O(1M) records on a daily basis
      ✤ We operate on a single 64-bit Linux node with 8 CPUs, 24 GB of RAM, and 1 TB of disk space; sharding was tested but is not enabled
  21. MongoDB benefits
      ✤ Fast I/O and a schema-less database are ideal for a cache implementation
         ✤ you’re not limited by the key:value approach
      ✤ The flexible query language allows us to build a domain-specific QL
         ✤ stays on par with SQL
      ✤ No administrative costs for the DB
         ✤ easy to install and maintain
  22. MongoDB issues (ver 2.0.X)
      ✤ We were unable to store DAS queries directly in the analytics collection because of the dot constraint on key names, e.g. {'a.b': 1}
         ✤ queries <=> storage format {'key': 'a.b', 'value': 1}
      ✤ SCons is not suitable for a fully controlled build environment
         ✤ it removes $PATH/$LD_LIBRARY_PATH for compiler commands and forces the use of -L/lib64; as a result we used wrappers
      ✤ Uncompressed field names and limitations with pagination/aggregation
         ✤ should be addressed in the new MongoDB aggregation framework
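The workaround for the dot constraint can be sketched as a pair of conversion helpers. The function names are hypothetical, and the sketch only handles top-level keys, which is an assumption; nested dictionaries with dotted keys would need the same treatment applied recursively.

```python
def encode_spec(spec):
    """Turn {'a.b': 1} into [{'key': 'a.b', 'value': 1}] so no key contains a dot."""
    return [{"key": key, "value": value} for key, value in spec.items()]


def decode_spec(pairs):
    """Inverse of encode_spec: rebuild the original query dictionary."""
    return {pair["key"]: pair["value"] for pair in pairs}


stored = encode_spec({"dataset.name": "/a/b/c"})
restored = decode_spec(stored)
```

The dotted name moves from the key position, where it is rejected, to a value position, where any string is allowed, and the original query is recovered on the way out.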
  23. Tradeoffs
      ✤ Query collisions: DAS does not own the data and there are no transactions, so we rely on the query status and update it accordingly
      ✤ Index choice: initially one per select key, later one per query hash
      ✤ Storage size: we trade storage against data flexibility and naming conventions
      ✤ Speed: we trade simple data access against a conglomerate of restrictions (naming, security policies, interfaces, etc.), but we tune our data-service APIs based on query patterns
  24. Results
      ✤ The service has been in production for over a year
      ✤ Users are authenticated via GRID certificates, and DAS uses a proxy server to pass credentials to back-end services
      ✤ A single query request yields a few thousand records and is resolved within a few seconds
      ✤ The pluggable architecture allows you to query your own service(s)
         ✤ unit tests are run against public data services, e.g. Google, IP look-up, etc.
  25. NoSQL @ CERN
      ✤ MongoDB is used by other experiments at CERN
         ✤ logging, monitoring, data analytics
      ✤ MongoDB is not the only NoSQL solution used at CERN
         ✤ One size does not fit all
         ✤ CouchDB, Cassandra, HBase, etc.
      ✤ There is an ongoing discussion between the experiments and CERN IT about the adoption of NoSQL
  26. Summary
      ✤ The CMS experiment built the Data Aggregation System as an intelligent cache to query distributed data services
      ✤ MongoDB is used as the DAS back-end
      ✤ During the first year of operation we did not experience any significant problems
      ✤ I’d like to thank the MongoDB team and its community for their constant support
      ✤ Questions? Contact: vkuznet@gmail.com
      ✤ https://github.com/vkuznet/DAS/
  27. Back-up slides
  28. From query to results
      [Figure: flow diagram, built up over several animation steps. Labels: Query, API lookup, three parallel chains of Data service, generator, and Aggregator, then Merge and results.]
      ✤ block dataset=/a/b/c is converted into a MongoDB spec
      ✤ Mapping DB holds relationships
      ✤ Caching DB holds service records
      ✤ Merge DB holds merged records
