This document summarizes a talk about the CMS Data Aggregation System (DAS). DAS aggregates metadata from multiple CMS databases to allow users to query across different services. It uses a plug-and-play architecture to integrate new databases in a customizable way while preserving each database's access policies. Benchmark tests showed DAS can aggregate over 500,000 records from two databases into JSON documents within a few seconds by caching results. Future plans include further testing DAS in production and potentially releasing it as open source software.
Data Aggregation System
1. Title slide: a cloud of CMS database names (GenDB, LumiDB, PhEDEx, PSetDB, Data Quality, DBS, SiteDB, RunDB, Overview) surrounding the question "How can I find my data?"
CMS Data Aggregation System
Valentin Kuznetsov, Cornell University
ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
2. Talk outline
✤ Introduction
✤ Motivations
✤ What is DAS?
✤ Design, architecture, implementations
✤ Current status & benchmarks
✤ Future plans
3. Introduction
✤ CMS is a general-purpose physics detector built for the LHC
✤ beam collisions every 25 ns, online trigger 300 Hz, event size 1-2 MB
✤ More than 3000 physicists, 183 institutions, 38 countries
✤ CMS uses a distributed computing and data model
✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
✤ 2-6 PB/year of real data + an equal amount of simulated data, ~500 GB/year of meta-data
✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
4. Motivations ...
✤ A user wants to query different meta-data services without knowing of their existence
✤ A user wants to combine information from different meta-data services
✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get the data
(Diagram: each service with its query keys — RunSummary: run, trigger, detector, ...; DataQuality: trigger, ecal, hcal, ...; LumiDB: lumi, luminosity, hltpath; PhEDEx: block, file, block.replica, file.replica, se, node, ...; DBS: block, run, file, site, config, tier, dataset, lumi, parameters, ...; GenDB: MC id, generator, xsection, process, decay, ...; SiteDB: site, admin, site.status, country, node, region, ...; Parameter Set DB: CMSSW parameters; plus generic services A-E with param1, param2, ...)
5. What is DAS?
✤ DAS stands for Data Aggregation System
✤ It is a layer on top of existing data-services
✤ It aggregates data across distributed data-services while preserving their integrity, security policies and data formats
✤ it provides caching for data-services (a side effect)
✤ It represents data in a well-defined format: JSON documents
✤ It allows querying data via free text-based queries
✤ Agnostic to data content
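A free text-based query of the kind DAS accepts (e.g. site=T1_CERN, run=100, the examples given later for the prototype) could be parsed with a sketch like the following; the function name and behaviour are illustrative assumptions, not the actual DAS parser.

```python
# Hypothetical sketch of parsing a free text-based DAS query such as
# "site=T1_CERN, run=100"; the real DAS parser is richer than this.
def parse_query(text):
    conditions = {}
    for clause in text.replace(",", " ").split():
        if "=" in clause:
            key, value = clause.split("=", 1)
            conditions[key] = value     # key-value selection condition
        else:
            conditions[clause] = None   # bare selection key, e.g. "block"
    return conditions

print(parse_query("site=T1_CERN, run=100"))
# → {'site': 'T1_CERN', 'run': '100'}
```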
6. Challenges ...
✤ Combining N data-services is a great idea, but
✤ there is no ready-made IT solution
✤ DAS doesn't hold the data, so it can't have a pre-defined schema
✤ it must support existing APIs, data formats, interfaces and security policies
✤ it must relate and aggregate meta-data
✤ it must be efficient, flexible, scalable and easy to use
✤ Work on a DAS prototype to understand those challenges
7. DAS prototype
✤ Code written in Python, ideal for prototyping
✤ Use existing meta-data from CMS data-services as a test-bed
✤ 8 data-services, 75/250 GB in tables/indexes
✤ Use a document-oriented "schema-less" database: MongoDB
✤ raw cache, merged result cache, mapping and analytics DBs
✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
✤ Aggregate information using key-value matching
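The key-value matching mentioned in the last bullet can be illustrated with a small sketch (hypothetical function and field names, not the actual DAS code): records fetched from two services are merged whenever they agree on the value of a common key, here "block".

```python
def aggregate(records_a, records_b, key):
    """Merge records from two services that agree on the value of `key`."""
    index = {}
    for rec in records_b:
        index.setdefault(rec[key], []).append(rec)
    for rec in records_a:
        for match in index.get(rec[key], []):
            merged = dict(match)   # start from the service-B record
            merged.update(rec)     # overlay the service-A fields
            yield merged

# Toy records loosely resembling DBS (bookkeeping) and PhEDEx (transfer) output
dbs_records = [{"block": "/A#1", "dataset": "/A", "nevents": 1000}]
phedex_records = [{"block": "/A#1", "site": "T1_CERN", "replicas": 2}]

merged = list(aggregate(dbs_records, phedex_records, "block"))
```

Note that `aggregate` is written as a generator, matching the "Aggregate results (generator)" step of the workflow described later in the deck.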
8. DAS architecture
(Diagram: data-services (runsum, lumidb, phedex, sitedb, dbs) feed the DAS mapping DB, which maps data-service output to DAS records. The DAS core combines a query parser, an aggregator and data-service plugins; it records each query and API call to DAS analytics, stores raw results in the DAS cache and merged results in the DAS merge collection. A DAS robot invokes the same API(params) to update the cache periodically, fetching popular queries/APIs from DAS analytics. A DAS cache server exposes a RESTful interface consumed by the DAS web server and UI.)
9. DAS workflow
✤ Query parser parses the input query
✤ Query the DAS merge collection; on a hit, return the stored results
✤ Otherwise, query the DAS cache collection
✤ On a miss, invoke calls to the data-services (resolved via DAS mapping)
✤ write each query/API call to DAS analytics
✤ Aggregate results (as a generator) and present them via the Web UI
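The cache-then-fetch steps above can be sketched as follows (hypothetical function and collection names; the real DAS stores these collections in MongoDB and its aggregation is far richer than the placeholder here):

```python
def aggregate(raw_records):
    # Placeholder for DAS aggregation: simply de-duplicate raw records
    seen, merged = set(), []
    for rec in raw_records:
        fingerprint = tuple(sorted(rec.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            merged.append(rec)
    return merged

def das_lookup(query, merge_cache, raw_cache, fetch_from_services):
    if query in merge_cache:        # 1. merged-result cache hit
        return merge_cache[query]
    if query in raw_cache:          # 2. raw cache hit: aggregate on the fly
        result = aggregate(raw_cache[query])
    else:                           # 3. miss: call the underlying data-services
        raw = fetch_from_services(query)
        raw_cache[query] = raw
        result = aggregate(raw)
    merge_cache[query] = result
    return result

# Usage: the second lookup is served from the merge cache, so the
# data-services are only contacted once.
merge_cache, raw_cache, calls = {}, {}, []
def fetch(q):
    calls.append(q)
    return [{"run": 100}, {"run": 100}]

first = das_lookup("run=100", merge_cache, raw_cache, fetch)
second = das_lookup("run=100", merge_cache, raw_cache, fetch)
```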
10. DAS and data-services
✤ DAS is data-service agnostic
✤ a data-service is identified by its URI and input parameters
✤ Use plug-and-play mechanism:
✤ add new data-service using ASCII map file (URI, parameters, ...)
✤ use generic HTTP access and standard data-parsers (XML, JSON)
✤ Use dedicated plugin:
✤ specific access requirements, custom parsers, etc.
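The plug-and-play idea above could look roughly like this (an illustrative sketch; the class and handler names are assumptions, not DAS code): services with standard output get a generic parsing path, while services with special requirements register a dedicated plugin.

```python
import json

def generic_json_handler(payload):
    """Generic parser for services returning standard JSON."""
    return json.loads(payload)

class ServiceRegistry:
    def __init__(self):
        self._handlers = {}

    def register(self, name, handler=generic_json_handler):
        # Default to the generic handler; services with specific access
        # requirements supply a dedicated plugin instead.
        self._handlers[name] = handler

    def parse(self, name, payload):
        return self._handlers[name](payload)

registry = ServiceRegistry()
registry.register("sitedb")                                # generic JSON path
registry.register("runsum", lambda text: [{"raw": text}])  # dedicated plugin
```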
11. DAS map files
system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [
{"key":"city","map":"city.name","pattern":""},
]
Data Service: URL/api?params
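For illustration, a map file like the one on this slide could drive a generic HTTP request builder along these lines (a sketch under assumed semantics of the `params` field; `build_request` is not a real DAS function):

```python
from urllib.parse import urlencode

# The service map from the slide, represented as a Python dict
service_map = {
    "system": "google_maps",
    "format": "JSON",
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}

def build_request(smap, user_input):
    """Fill required parameters from user input and build the service URL."""
    params = {}
    for name, default in smap["params"].items():
        if default == "required":
            params[name] = user_input[name]  # must be supplied by the user
        else:
            params[name] = default           # fixed value from the map file
    return smap["url"] + "?" + urlencode(params)

url = build_request(service_map, {"q": "Amsterdam"})
```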
12. DAS benchmark
✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data-services
✤ parse, remap notations, store to cache, merge matched records (aggregation)
✤ Linux 64-bit, 1 CPU for DAS, 1 CPU for MongoDB, record size ~1 KB
✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time

              Format   Records   Time, no cache   Time w/ cache
DBS yield     XML      387K      68s              0.98s
PhEDEx yield  XML      190K      107s             0.98s
Merge step    JSON     577K      63s              0.9s
DAS total     JSON     393K      238s             2.05s

393K DAS records; create ~6K docs/s, read ~7.6K docs/s
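As a side note, the no-cache timings reported on this slide are internally consistent, which a trivial check confirms:

```python
# Consistency check on the no-cache column: the "DAS total" time equals
# the sum of the DBS retrieval, PhEDEx retrieval and merge steps.
dbs_s, phedex_s, merge_s = 68, 107, 63
total_s = dbs_s + phedex_s + merge_s
print(total_s)  # → 238, matching the "DAS total" row
```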
13. Future plans
✤ DAS goes into production this year in CMS:
✤ confirm scalability, transparency and durability with various data-services
✤ work on analytics to organize pre-fetch strategies
✤ Apply to other domain disciplines
✤ Release as open source
14. Summary
✤ The Data Aggregation System is data agnostic and allows querying and aggregating meta-data in a customizable way
✤ The current architecture easily integrates with existing data-services, preserving their access, security policies and development cycles
✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary
✤ The plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain