Data Aggregation System


The talk presents a new Data Aggregation System for the CMS experiment at CERN. We use MongoDB as a caching layer to query multiple data providers (backed by RDBMS) and aggregate data across them.

The talk was presented at the ICCS 2010 conference.

Published in: Technology

  1. CMS Data Aggregation System: How can I find my data?
     Valentin Kuznetsov, Cornell University
     ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
     [Diagram: CMS data-services GenDB, LumiDB, Data Quality, Phedex, PSetDB, DBS, SiteDB, RunDB, Overview]
  2. Talk outline
     ✤ Introduction
     ✤ Motivations
     ✤ What is DAS?
     ✤ Design, architecture, implementations
     ✤ Current status & benchmarks
     ✤ Future plans
  3. Introduction
     ✤ CMS is a general-purpose physics detector built for the LHC
     ✤ Beam collisions every 25 ns, online trigger 300 Hz, event size 1-2 MB
     ✤ More than 3000 physicists, 183 institutions, 38 countries
     ✤ CMS uses a distributed computing and data model
     ✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
     ✤ 2-6 PB/year of real data + 1x simulated data, ~500 GB/year of meta-data
     ✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
  4. Motivations ... Data Aggregation System
     ✤ A user wants to query different meta-data services without knowing of their existence
     ✤ A user wants to combine information from different meta-data services
     ✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get our data
     [Diagram: meta-data services and the keys they expose, linked by shared keys such as run, lumi, block, file, site, pset and MC id. Services shown: RunSummary, DataQuality, LumiDB, Phedex, DBS, GenDB, SiteDB, Overview, Parameter Set DB (CMSSW parameters), and generic Services A-E (param1, param2, ..). Keys shown include run, trigger, detector, ecal, hcal, lumi, luminosity, hltpath, block, file, block.replica, file.replica, se, node, config, tier, dataset, generator, xsection, process, decay, site, admin, site.status, country, region.]
  5. What is DAS?
     ✤ DAS stands for Data Aggregation System
     ✤ It is a layer on top of existing data-services
     ✤ It aggregates data across distributed data-services while preserving their integrity, security policy and data formats
     ✤ It provides caching for data-services (side effect)
     ✤ It represents data in a defined format: JSON documents
     ✤ It allows querying data via free-text queries
     ✤ Agnostic to data content
  6. Challenges ...
     ✤ Combining N data-services is a great idea, but
     ✤ there is no ad-hoc IT solution
     ✤ DAS doesn't hold the data, so it can't have a pre-defined schema
     ✤ it must support existing APIs, data formats, interfaces, security policies
     ✤ it must relate and aggregate meta-data
     ✤ it must be efficient, flexible, scalable and easy to use
     ✤ Work on a DAS prototype to understand those challenges
  7. DAS prototype
     ✤ Code written in Python, ideal for prototyping
     ✤ Use existing meta-data from CMS data-services as a test-bed
     ✤ 8 data-services, 75/250 GB in tables/indexes
     ✤ Use a document-oriented “schema-less” database: MongoDB
     ✤ raw cache, merge result cache, mapping and analytics DBs
     ✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
     ✤ Aggregate information using key-value matching
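Aggregation by key-value matching can be illustrated with a minimal sketch. This is not the DAS code: the function `merge_by_key`, the record fields and the block names are all hypothetical, chosen only to show records from two services being joined on a shared key.

```python
from collections import defaultdict


def merge_by_key(key, *record_sets):
    """Merge records from several services that share the same value of `key`."""
    merged = defaultdict(dict)
    for records in record_sets:
        for record in records:
            if key not in record:
                continue  # a record without the key cannot be matched
            # Later record sets overwrite earlier ones on conflicting fields.
            merged[record[key]].update(record)
    return list(merged.values())


# Hypothetical records, loosely shaped like bookkeeping and transfer output.
dbs_records = [
    {"block": "/a/b/c#1", "nfiles": 10},
    {"block": "/a/b/c#2", "nfiles": 4},
]
phedex_records = [
    {"block": "/a/b/c#1", "site": "T1_CERN"},
]

merged = merge_by_key("block", dbs_records, phedex_records)
for doc in merged:
    print(doc)
```

A block known to only one service still yields a document, just with fewer fields. A real system would also have to decide what to do when two services disagree on a field, which this sketch resolves by letting the last writer win.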
  8. DAS architecture
     [Diagram: data-services (runsum, lumidb, phedex, sitedb, dbs) are accessed through plugins by the DAS core (parser, aggregator), which runs behind the DAS Cache server and the DAS web server (RESTful interface, UI). The DAS core uses four stores: DAS mapping (maps data-service output to DAS records), DAS cache, DAS merge, and DAS Analytics (records each query and API call). A DAS robot fetches popular queries/APIs from Analytics, invokes the same API(params), and updates the cache periodically.]
  9. DAS workflow
     ✤ Query parser
     ✤ Query DAS merge collection
     ✤ Query DAS cache collection
     ✤ Invoke call to data service
     ✤ Write to analytics
     ✤ Aggregate results (generator)
     [Flowchart: query → DAS core parser (with logging) → found in DAS merge? if no, found in DAS cache? if no, call data-services via Mapping → Aggregator → results to Web UI; calls are recorded in DAS Analytics]
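The lookup order in this workflow can be sketched as follows. This is an assumption-laden illustration rather than the DAS implementation: plain dictionaries stand in for the MongoDB merge and cache collections, and `parse_query`, `call_data_service` and `das_lookup` are hypothetical names.

```python
import re

# In-memory stand-ins for the MongoDB collections (assumption: plain dicts).
merge_cache = {}
raw_cache = {}
analytics = []


def parse_query(text):
    """Turn 'site=T1_CERN, run=100' into a sorted tuple of (key, value) pairs."""
    pairs = re.findall(r"(\w[\w.]*)\s*=\s*([^\s,]+)", text)
    return tuple(sorted(pairs))


def call_data_service(query):
    """Hypothetical data-service call; a real one would go over HTTP."""
    return [dict(query, source="fake-service")]


def das_lookup(text):
    """Yield results for a query, consulting the caches before any service."""
    query = parse_query(text)
    if query not in merge_cache:
        if query not in raw_cache:
            raw_cache[query] = call_data_service(query)
            analytics.append(query)  # record which call was made
        # The aggregation step is trivial here: raw records become merged ones.
        merge_cache[query] = list(raw_cache[query])
    yield from merge_cache[query]


first = list(das_lookup("site=T1_CERN, run=100"))
second = list(das_lookup("run=100 site=T1_CERN"))
print(first == second, len(analytics))
```

Sorting the parsed pairs makes two differently ordered queries hit the same cache entry, so the second call above is served from the merge cache and no new service call is recorded.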
  10. DAS and data-services
     ✤ DAS is data-service agnostic
     ✤ a data-service is identified by its URI and input parameters
     ✤ Use a plug-and-play mechanism:
     ✤ add a new data-service using an ASCII map file (URI, parameters, ...)
     ✤ use generic HTTP access and standard data-parsers (XML, JSON)
     ✤ Use a dedicated plugin:
     ✤ specific access requirements, custom parsers, etc.
  11. DAS map files
     The DAS mapping describes a data service reachable as URL/api?params:
         system : google_maps
         format : JSON
         ---
         urn : google_geo_maps
         url : ""
         expire : 30
         params : { "q" : "required", "output": "json" }
         daskeys : [ {"key":"city","map":"","pattern":""}, ]
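How such a map entry could drive a generic HTTP call is sketched below. The slide leaves `url` empty, so `BASE_URL` is a placeholder and not the real service address; `build_request` is a hypothetical helper, and the rule that the single DAS key fills the single "required" parameter is an assumption made for this example.

```python
from urllib.parse import urlencode

# Map entry mirroring the fields on the slide. BASE_URL is a placeholder,
# because the slide does not show the actual service URL.
BASE_URL = "https://example.org/api"
map_entry = {
    "urn": "google_geo_maps",
    "url": BASE_URL,
    "expire": 30,
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "", "pattern": ""}],
}


def build_request(entry, query):
    """Fill the API parameters of `entry` from a DAS query dict."""
    # Assumption: one DAS key feeds the one parameter marked "required".
    das_key = entry["daskeys"][0]["key"]
    params = {}
    for name, default in entry["params"].items():
        if default == "required":
            if das_key not in query:
                raise KeyError(f"query must supply '{das_key}'")
            params[name] = query[das_key]
        else:
            params[name] = default  # fixed value taken from the map file
    return entry["url"] + "?" + urlencode(params)


url = build_request(map_entry, {"city": "Amsterdam"})
print(url)
```

Keeping the service description in data rather than code is what lets a new service be added without writing a plugin, as long as generic HTTP access and a standard parser are enough.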
  12. DAS benchmark
     ✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data services
     ✤ parse, remap notations, store to cache, merge matched records (aggregation)
     ✤ Linux 64-bit, 1 CPU for DAS, 1 CPU for MongoDB, record size ~1 KB
     ✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time

     Step           Format   Records   Time, no cache   Time w/ cache
     DBS yield      XML      387K      68s              0.98s
     PhEDEx yield   XML      190K      107s             0.98s
     Merge step     JSON     577K      63s              0.9s
     DAS total      JSON     393K      238s             2.05s

     393K DAS records, create ~6K docs/s, read ~7.6K docs/s
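As a quick reading aid for the benchmark numbers, the short calculation below checks that the three no-cache step times add up to the quoted DAS total and derives the overall gain from caching. Only figures from the table are used; the variable names are illustrative.

```python
# Times in seconds, copied from the benchmark table.
no_cache = {"DBS yield": 68.0, "PhEDEx yield": 107.0, "Merge step": 63.0}
total_no_cache = 238.0
total_cached = 2.05

steps_sum = sum(no_cache.values())
speedup = total_no_cache / total_cached

print(f"sum of no-cache steps: {steps_sum:.0f}s")
print(f"cache speedup: ~{speedup:.0f}x")
```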
  13. Future plans
     ✤ DAS goes into production this year in CMS:
     ✤ confirm scalability, transparency and durability with various data-services
     ✤ work on analytics to organize pre-fetch strategies
     ✤ Apply to other domain disciplines
     ✤ Release as open source
  14. Summary
     ✤ The Data Aggregation System is data agnostic and allows querying and aggregating meta-data information in a customizable way
     ✤ The current architecture easily integrates with existing data-services, preserving their access, security policy and development cycle
     ✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary
     ✤ The plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain