This document summarizes a talk about the CMS Data Aggregation System (DAS). DAS aggregates metadata from multiple CMS databases to allow users to query across different services. It uses a plug-and-play architecture to integrate new databases in a customizable way while preserving each database's access policies. Benchmark tests showed DAS can aggregate over 500,000 records from two databases into JSON documents within a few seconds by caching results. Future plans include further testing DAS in production and potentially releasing it as open source software.
Find Your CMS Data
1. CMS Data Aggregation System
Valentin Kuznetsov, Cornell University
ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
[Figure: CMS data-services (GenDB, LumiDB, Phedex, PSetDB, Data Quality, DBS, SiteDB, RunDB, Overview) surrounding the question "How can I find my data?"]
2. Talk outline
✤ Introduction
✤ Motivations
✤ What is DAS?
✤ Design, architecture, implementations
✤ Current status & benchmarks
✤ Future plans
3. Introduction
✤ CMS is a general purpose physics detector built for the LHC
✤ beam collision 25 nsec, online trigger 300 Hz, event size 1-2MB
✤ More than 3000 physicists, 183 institutions, 38 countries
✤ CMS uses a distributed computing and data model
✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
✤ 2-6 PB/year of real data + 1x Simulated data, ~500GB/year of meta-data
✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
4. Motivations ...
✤ A user wants to query different meta-data services without knowing of their existence
✤ A user wants to combine information from different meta-data services
✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats to get our data
[Figure: the Data Aggregation System sits between the user and the meta-data services, which are linked by shared keys such as run, lumi, block, file, site and pset. Services and the keys they expose:
RunSummary: run, trigger, detector, ...
DataQuality: trigger, ecal, hcal, ...
LumiDB: lumi, luminosity, hltpath
Phedex: block, file, block.replica, file.replica, se, node, ...
DBS: block, run, file, site, config, tier, dataset, lumi, parameters, ...
GenDB: generator, xsection, process, decay, ...
SiteDB: site, admin, site.status, ...
Overview: country, node, region, ...
Parameter Set DB: CMSSW parameters
Service A ... Service E: param1, param2, ...]
5. What is DAS?
✤ DAS stands for Data Aggregation System
✤ It is a layer on top of existing data-services
✤ It aggregates data across distributed data-services while preserving
their integrity, security policy and data-formats
✤ It provides caching for data-services (side effect)
✤ It represents data in a defined format: JSON documents
✤ It allows querying data via free text-based queries
✤ Agnostic to data content
6. Challenges ...
✤ Combining N data-services is a great idea, but
✤ there is no ad-hoc IT solution
✤ DAS doesn’t hold the data, can’t have pre-defined schema
✤ must support existing APIs, data formats, interfaces, security
policies
✤ must relate and aggregate meta-data
✤ must be efficient, flexible, scalable and easy to use
✤ Work on DAS prototype to understand those challenges
7. DAS prototype
✤ Code written in Python, ideal for prototyping
✤ Use existing meta-data from CMS data-services as test-bed
✤ 8 data-services, 75/250GB in tables/indexes
✤ Use document-oriented "schema-less" database: MongoDB
✤ raw cache, merge result cache, mapping and analytics DBs
✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
✤ Aggregate information using key-value matching
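The keyword queries and key-value matching listed above could look roughly like the sketch below. It is an illustration only, not the DAS code: the function names `parse_query` and `merge_on_key` are hypothetical, and the record fields are made up rather than taken from the real DBS or PhEDEx schemas.

```python
def parse_query(query):
    """Split a free keyword query such as 'site=T1_CERN, run=100' into a dict."""
    conditions = {}
    for token in query.replace(",", " ").split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        # Numeric values (e.g. run numbers) are stored as integers.
        conditions[key] = int(value) if value.isdigit() else value
    return conditions


def merge_on_key(left, right, key):
    """Key-value matching: join records from two services that share `key`."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    merged = []
    for rec in left:
        for match in index.get(rec[key], []):
            merged.append({**match, **rec})
    return merged


# Hypothetical records; field names are illustrative, not the real schemas.
dbs_blocks = [{"block": "/a/b#1", "nfiles": 10}, {"block": "/a/b#2", "nfiles": 4}]
phedex_blocks = [{"block": "/a/b#1", "site": "T1_CERN"}]

conditions = parse_query("site=T1_CERN, run=100")
merged = merge_on_key(dbs_blocks, phedex_blocks, "block")
```

Only the block present in both lists survives the merge, which is the behaviour one would expect from matching on a shared key.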
8. DAS architecture
[Figure: DAS architecture. Data-services (runsum, lumidb, phedex, sitedb, dbs) are reached through plugins. The DAS core (parser, mapping, aggregator) sits behind the DAS web server (RESTful interface, UI) and the DAS Cache server. Supporting components: DAS mapping (map data-service output to DAS records), DAS cache, DAS merge, and DAS Analytics (record query, API call to Analytics). A DAS robot fetches popular queries/APIs, invokes the same API(params), and updates the cache periodically.]
9. DAS workflow
✤ Query parser
✤ Query DAS merge collection
✤ Query DAS cache collection
✤ invoke call to data service
✤ write to analytics
✤ Aggregate results (generator)
[Figure: flowchart. A query enters the DAS core (with DAS logging) and the parser. The DAS merge collection is checked first (yes/no), then the DAS cache collection (yes/no). On a miss the data-services are called through the DAS Mapping, the call is written to DAS Analytics, and the Aggregator returns results to the Web UI.]
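The lookup order in this workflow can be sketched with plain dictionaries standing in for the MongoDB collections. This is a toy model under stated assumptions: the class `DASWorkflow`, its attributes, and the fake service are hypothetical, and "aggregation" here is simple concatenation rather than the real merge step.

```python
class DASWorkflow:
    """Toy version of the lookup order: merge cache, raw cache, data-services."""

    def __init__(self, services):
        self.services = services   # name -> callable(query) -> list of records
        self.merge_cache = {}      # query -> aggregated records
        self.raw_cache = {}        # (service, query) -> raw records
        self.analytics = []        # log of (service, query) calls actually made

    def run(self, query):
        # 1. Return the merged result if it is already cached.
        if query in self.merge_cache:
            return self.merge_cache[query]
        results = []
        for name, call in self.services.items():
            cache_key = (name, query)
            # 2. On a raw-cache miss, call the data-service and log the call.
            if cache_key not in self.raw_cache:
                self.raw_cache[cache_key] = call(query)
                self.analytics.append(cache_key)
            results.extend(self.raw_cache[cache_key])
        # 3. Store the aggregated result for the next identical query.
        self.merge_cache[query] = results
        return results


calls = []


def fake_dbs(query):
    """Stand-in for a real data-service; records every invocation."""
    calls.append(query)
    return [{"system": "dbs", "query": query}]


das = DASWorkflow({"dbs": fake_dbs})
first = das.run("run=100")
second = das.run("run=100")   # served from the merge cache, no second call
```

The second call never reaches the data-service, which is the caching side effect the talk describes.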
10. DAS and data-services
✤ DAS is data-service agnostic
✤ a data-service is identified by its URI and input parameters
✤ Use plug-and-play mechanism:
✤ add new data-service using ASCII map file (URI, parameters, ...)
✤ use generic HTTP access and standard data-parsers (XML, JSON)
✤ Use dedicated plugin:
✤ specific access requirements, custom parsers, etc.
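One way to picture the split between standard parsers and dedicated plugins is a small parser registry, sketched below. The names `PARSERS`, `register_parser`, and `parse` are assumptions made for this example and do not come from the DAS code base. HTTP access is left out so the sketch runs offline.

```python
import json
import xml.etree.ElementTree as ET


def parse_json(payload):
    """Standard JSON parser: always return a list of records."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def parse_xml(payload):
    """Standard XML parser: one record per child element of the root."""
    root = ET.fromstring(payload)
    return [dict(child.attrib, tag=child.tag) for child in root]


PARSERS = {"JSON": parse_json, "XML": parse_xml}


def register_parser(fmt, func):
    """Hook for a dedicated plugin that needs a custom parser."""
    PARSERS[fmt.upper()] = func


def parse(fmt, payload):
    try:
        parser = PARSERS[fmt.upper()]
    except KeyError:
        raise ValueError(f"no parser registered for format {fmt!r}") from None
    return parser(payload)


xml_records = parse("xml", '<blocks><block name="/a/b#1" size="10"/></blocks>')
json_records = parse("json", '{"site": "T1_CERN"}')

# A hypothetical plugin adding a format the standard parsers do not cover.
register_parser(
    "csv",
    lambda text: [dict(zip(("site", "tier"), line.split(","))) for line in text.splitlines()],
)
csv_records = parse("csv", "T1_CERN,1")
```

With this layout, a new service that speaks XML or JSON needs no code at all, while an unusual one only has to register its own parser.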
11. DAS map files
system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [
{"key":"city","map":"city.name","pattern":""},
]
[Figure: the DAS mapping relates a data-service call (URL/api?params) to DAS keys]
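To make the map file more concrete, the sketch below transcribes it by hand into a Python dict and uses it to build a request URL and a DAS-style record. The interpretation is an assumption: treating "required" as "must be supplied by the user" and the dotted `map` value as a nested record path is a guess from the slide, and `build_url` and `to_das_record` are hypothetical helpers.

```python
from urllib.parse import urlencode

# The map file above, transcribed by hand into a Python dict.
GOOGLE_MAP = {
    "system": "google_maps",
    "format": "JSON",
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}


def build_url(api_map, **user_params):
    """Fill required parameters from the user, keep the others as defaults."""
    params = {}
    for name, default in api_map["params"].items():
        if name in user_params:
            params[name] = user_params[name]
        elif default == "required":
            raise ValueError(f"missing required parameter {name!r}")
        else:
            params[name] = default
    return api_map["url"] + "?" + urlencode(params)


def to_das_record(api_map, das_key, value):
    """Nest a value under the dotted name, e.g. city.name -> {'city': {'name': ...}}."""
    entry = next(e for e in api_map["daskeys"] if e["key"] == das_key)
    record = value
    for part in reversed(entry["map"].split(".")):
        record = {part: record}
    return record


url = build_url(GOOGLE_MAP, q="Amsterdam")
record = to_das_record(GOOGLE_MAP, "city", "Amsterdam")
```

No request is sent; the point is only that a new service can be described by data (URL, parameters, key mapping) rather than by new code.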
12. DAS benchmark
✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data services
✤ parse, remap notations, store to cache, merge matched records (aggregation)
✤ Linux 64-bit, 1CPU for DAS, 1CPU for MongoDB, record size ~1KB
✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing
time + output creation time
Step           Format   Records   Time, no cache   Time w/ cache
DBS yield      XML      387K      68s              0.98s
PhEDEx yield   XML      190K      107s             0.98s
Merge step     JSON     577K      63s              0.9s
DAS total      JSON     393K      238s             2.05s
✤ 393K DAS records: create ~6K docs/s, read ~7.6K docs/s
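A quick arithmetic check of the benchmark figures, using only numbers from the table: the three uncached steps add up to the reported DAS total, the merge step handles the sum of the two yields, and the cached total is roughly two orders of magnitude faster. The variable names are ad hoc, and record counts are the rounded values from the slide.

```python
# Timings in seconds and record counts, copied from the benchmark table.
no_cache = {"DBS yield": 68, "PhEDEx yield": 107, "Merge step": 63}
das_total_no_cache = 238
das_total_cached = 2.05

records_in = {"DBS yield": 387_000, "PhEDEx yield": 190_000}
merge_records = 577_000

steps_sum = sum(no_cache.values())          # 68 + 107 + 63
merge_input = sum(records_in.values())      # 387K + 190K
speedup = das_total_no_cache / das_total_cached

print(f"sum of uncached steps: {steps_sum}s (reported total: {das_total_no_cache}s)")
print(f"records entering merge: {merge_input} (reported: {merge_records})")
print(f"cached vs uncached speedup: ~{speedup:.0f}x")
```

The cached per-step times do not sum to the cached total in the same way, so only the uncached column is checked here.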
13. Future plans
✤ DAS goes into production this year in CMS:
✤ confirm scalability, transparency and durability w/ various data-services
✤ work on analytics to organize pre-fetch strategies
✤ Apply to other domain disciplines
✤ Release as open source
14. Summary
✤ Data Aggregation System is data agnostic and allows users to query and
aggregate meta-data information in a customizable way
✤ The current architecture easily integrates with existing data-services
preserving their access, security policy and development cycle
✤ DAS is designed to work with existing CMS data-services, but can
easily go beyond that boundary
✤ Plug-and-play mechanism makes it easy to add new data-services
and configure DAS for a specific domain