The document discusses MongoDB's use in the CMS experiment at CERN. MongoDB is used as the backend for CMS's Data Aggregation System (DAS), which acts as an intelligent cache to query distributed data services. DAS translates user queries, retrieves data from multiple services, aggregates the results, and returns consolidated responses. This architecture allows users to access different data without knowledge of the underlying services. MongoDB provides a flexible schema and fast I/O that make it suitable for caching distributed data and executing complex queries in DAS.
Behind the Scenes at LiveJournal: Scaling Storytime - Sergey Chernyshev
Brad talks about clustering setups using MySQL and DRBD, and the open-source software behind them, most of which he wrote initially and continues to develop.
Many of these techniques and tools are used by other companies as well, among them Flickr/Yahoo! and Facebook.
A comprehensive PowerPoint presentation on DBMS concepts from start to end, with examples chapter by chapter. Please go through the chapters sequentially.
An easy-going study resource for a better understanding of Database Management Systems.
Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and... - JAX London
2011-11-02 | 02:25 PM - 03:15 PM
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could access simultaneously. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence, the ODC departs from the trend set by most caching solutions by holding its data in a normalised form, making it both memory-efficient and easy to change. However, it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012 - Chris Richardson
The database world is undergoing a major upheaval. NoSQL databases such as MongoDB and Cassandra are emerging as a compelling choice for many applications. They can simplify the persistence of complex data models and offer significantly better scalability and performance. But these databases have very different and unfamiliar data models and APIs, as well as a limited transaction model. Meanwhile, the relational world is fighting back with so-called NewSQL databases such as VoltDB, which, by using a radically different architecture, offer high scalability and performance as well as the familiar relational model and ACID transactions. This sounds great, but unlike with a traditional relational database you can't use JDBC and you must partition your data.
In this presentation you will learn about popular NoSQL databases – MongoDB and Cassandra – as well as VoltDB. We will compare and contrast each database's data model and Java API using NoSQL and NewSQL versions of a use case from the book POJOs in Action. We will learn about the benefits and drawbacks of using NoSQL and NewSQL databases.
Nagios Conference 2012 - John Murphy - Rational Configuration Design - Nagios
John Murphy's presentation on well designed Nagios configurations.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
OpenSplice DDS enables seamless, timely, scalable and dependable data sharing between distributed applications and network-connected devices. Its technical and operational benefits have propelled adoption across multiple industries, such as Defence and Aerospace, SCADA, Gaming, Cloud Computing, Automotive, etc.
If you want to learn about OpenSplice DDS or discover some of its advanced features, this webcast is for you!
In this two-part webcast we will cover the aspects of architecting and developing OpenSplice DDS systems. We will look into Quality of Service, data selectors, concurrency and scalability concerns.
We will present the brand-new, recently finalized C++ and Java APIs for DDS, including examples of how they can be used with C++11 features. We will show how increasingly popular functional languages such as Scala can be used to efficiently and elegantly exploit the massive hardware parallelism provided by modern multi-core processors.
Finally, we will present some OpenSplice-specific extensions for dealing with very high volumes of data, meaning several million messages per second.
Rainbird: Realtime Analytics at Twitter (Strata 2011) - Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Oracle vs NoSQL – The good, the bad and the ugly - John Kanagaraj
A good understanding of the NoSQL database technologies that can be used to support a Big Data implementation is essential for today's Oracle professional. This was discussed in detail in a two-hour deep-dive technical session at COLLABORATE 2014 - The Oracle User Group Conference. In this slide deck, you will learn what Big Data brings to the table as well as the concepts behind the underlying NoSQL data stores, in comparison to the ancestor you know well: the Oracle RDBMS. We will determine where and how to employ these NoSQL data stores effectively, and point out some of the issues that you will have to think through (and prepare for) before your organization rushes headlong into a "Big Data" implementation. We will look specifically at MongoDB, Couchbase and Cassandra in this context. At the end of the session, we will provide pointers and links to help the audience take the next step in learning about these technologies for themselves.
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
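To make the indexing ideas above concrete, here is a small, self-contained sketch of index planning; the collection, field names (`user_id`, `ts`) and the `index_covers` helper are hypothetical illustrations, not taken from the talk.

```python
# Sketch of index planning. With a live MongoDB, each spec below would be
# passed to collection.create_index(spec); here we just build and reason
# about the specs.

ASCENDING, DESCENDING = 1, -1  # PyMongo's sort-direction constants

# Single-field index: fast equality lookups on user_id.
single = [("user_id", ASCENDING)]

# Compound index: equality on user_id, then sort/range on ts.
# Field order follows the equality-sort-range rule of thumb.
compound = [("user_id", ASCENDING), ("ts", DESCENDING)]

def index_covers(index, filter_fields, projected_fields):
    """A query is 'covered' when every filtered and projected field is
    present in the index, so no documents need to be fetched at all."""
    indexed = {field for field, _ in index}
    return set(filter_fields) | set(projected_fields) <= indexed

# {"user_id": 42} projecting only user_id and ts is covered by `compound`:
print(index_covers(compound, ["user_id"], ["user_id", "ts"]))  # True
print(index_covers(single, ["user_id"], ["user_id", "ts"]))    # False
```

The helper captures one of the pitfalls the talk alludes to: an index that matches your filter but not your projection still forces document fetches.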
Docker is all the rage these days. While one doesn't hear much about Solr on Docker, we're here to tell you not only that it can be done, but also share how it's done.
We'll quickly go over the basic Docker ideas - containers are lighter than VMs, they solve "but it worked on my laptop" issues - so we can dive into the specifics of running Solr on Docker.
We'll do a live demo showing you how to run Solr master-slave as well as SolrCloud using containers, how to manage CPU assignments, constrain memory, and use Docker data volumes when running Solr in containers. We will also show you how to create your own containers with custom configurations.
Finally, we'll address one of the core Solr questions - which deployment type should I use? We will demonstrate performance differences between the following deployment types:
- Single Solr instance running on a bare metal machine
- Multiple Solr instances running on a single bare metal machine
- Solr running in containers
- Solr running on a virtual machine
- Solr running on a virtual machine using a unikernel
For each deployment type we'll address how it impacts performance, operational flexibility and all other key pros and cons you ought to keep in mind.
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg... - MongoDB
The United States will be deploying 16,000 traffic speed monitoring sensors - 1 on every mile of US interstate in urban centers. These sensors update the speed, weather, and pavement conditions once per minute. MongoDB will collect and aggregate live sensor data feeds from roadways around the country, support real-time queries from cars on traffic conditions on their route as well as be the platform for real-time dashboards displaying traffic conditions and more complex analytical queries used to identify traffic trends. In this session, we’ll implement a few different data aggregation techniques to query and dashboard the metrics gathered from the US interstate.
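As a hedged sketch of the kind of aggregation the session describes, the pipeline below computes average speed per sensor per hour; the collection and field names (`speeds`, `road`, `sensor_id`, `speed`, `ts`) are illustrative assumptions, not taken from the session.

```python
# Illustrative MongoDB aggregation pipeline; with a live deployment it
# would be passed to db.speeds.aggregate(pipeline).
from datetime import datetime

pipeline = [
    # Keep one day's readings for one stretch of interstate.
    {"$match": {"road": "I-95",
                "ts": {"$gte": datetime(2014, 6, 1), "$lt": datetime(2014, 6, 2)}}},
    # Bucket by sensor and hour of day, averaging the reported speed.
    {"$group": {"_id": {"sensor": "$sensor_id", "hour": {"$hour": "$ts"}},
                "avg_speed": {"$avg": "$speed"},
                "samples": {"$sum": 1}}},
    # Slowest hours first, useful for a congestion dashboard.
    {"$sort": {"avg_speed": 1}},
]

print([next(iter(stage)) for stage in pipeline])  # ['$match', '$group', '$sort']
```

The `$match`-first ordering matters: filtering before grouping lets MongoDB use an index on `ts`/`road` and keeps the `$group` stage small.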
This talk presents a new Data Aggregation System for the CMS experiment at CERN. We use the MongoDB database as a caching layer to query multiple data providers (backed by RDBMSes) and aggregate data across them.
The talk was presented at the ICCS 2010 conference.
Balancing Replication and Partitioning in a Distributed Java Database - Ben Stopford
This talk, presented at JavaOne 2011, describes the ODC, a distributed, in-memory database built in Java that holds objects in a normalized form in a way that alleviates the traditional degradation in performance associated with joins in shared-nothing architectures. The presentation describes the two patterns that lie at the core of this model. The first is an adaptation of the Star Schema model, used to hold either replicated or partitioned data depending on whether the data is a fact or a dimension. In the second pattern, the data store tracks arcs on the object graph to ensure that only the minimum amount of data is replicated. Through these mechanisms, almost any join can be performed across the various entities stored in the grid, without the need for key shipping or iterative wire calls.
Apache Camel: The Swiss Army Knife of Open Source Integration - prajods
The Camel project from Apache (camel.apache.org) is a very popular, lightweight, open-source integration framework.
This presentation shows some interesting features of Camel and the unique advantages that Camel brings to your integration projects. Some business use cases are shown to explain how Camel makes open-source integration a cakewalk.
Table of contents:
1. An overview of Apache Camel
2. Integration architecture explained
3. Using Camel in different integration architectures
3.a. In the Securities domain
3.b. In the Travel domain
4. High Availability and Load Balancing with Camel
Hummingbird - Open Source for Small Satellites - GSAW 2012 - Logica_hummingbird
This presentation about the Hummingbird project won best presentation at the GSAW 2012 event in Los Angeles - http://csse.usc.edu/gsaw/index.html
To read more from this author about how the IT industry can shape the boundaries of the space industry, please visit - http://www.logica.com/we-work-in/space/related%20media/thought-pieces/2011/new-paradigms-for-space/
Hadoop & Greenplum: Why Do Such a Thing? - Ed Kohlwey
Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.
Had the pleasure of delivering the keynote presentation at Informa's 3G, HSPA & LTE Optimization conference in Prague. A great event with many very important presentations.
Services Oriented Infrastructure in a Web2.0 World - Lexumo
Tom Maguire discusses applying SOA, Web 2.0 technologies, and open standards to the problems faced by IT in an ever-changing world.
This session was recorded at EMC World 2007 in Orlando, Florida.
The Potential Impact of Software Defined Networking SDN on Security - Brent Salisbury
The Potential Impact of Software Defined Networking SDN on Security. The video of the presentation is at http://networkstatic.net/the-potential-impact-of-software-defined-networking-sdn-on-security/ by Brent Salisbury. It is a first cut, so a lot is not yet in the deck on the example use-case front.
The Construction of the Internet Geological Data System Using WWW+Java+DB Tec... - Channy Yun
YUN, SEOKCHAN, 1997, The Construction of the Internet Geological Data System Using WWW+Java+DB Technique, Tertiary Deposits of Korea, AAPG Annual Convention Abstracts, Association of American Petroleum Geologists 1997.4.23-26, Dallas, TX, USA, p.420
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Zotonic presentation, Erlang Camp Boston, August 2011 - Arjan
I gave this presentation on Friday, August 12, 2011, in the John Hancock Conference Center in Boston as the closing session of the two-day Erlang Camp event.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, trying out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
1. MongoDB at the energy frontier
Valentin Kuznetsov, Cornell University
MongoNYC, May 2012
Monday, May 21, 12 1
2. Outline
✤ CMS :: LHC :: CERN
✤ Data Aggregation System and MongoDB
✤ Experience
✤ Summary
3. CMS :: LHC :: CERN
Large Hadron Collider located at CERN, Geneva, Switzerland
CMS is one of the 4 experiments to probe our knowledge of particle interactions and search for new physics
5. CMS :: LHC :: CERN
Typical proton-proton collision in CMS detector
6. CMS :: LHC :: CERN
✤ 40 countries, 172 institutions, more than 3000 scientists
✤ The CMS experiment produces a few PB of real data each year and we collect ~TB of meta-data
✤ CMS relies on GRID infrastructure for data processing and uses 100+ computing centers world-wide
✤ CMS software consists of 4M lines of C++ (framework), 2M lines of Python (data management), plus Java, Perl, etc.
✤ ORACLE, MySQL, SQLite, NoSQL
7. Dilemma
[Diagram: a user surrounded by independent data services (GenDB, LumiDB, Data Quality, Phedex, DBS, PSetDB, SiteDB, Overview, RunDB), asking: "How can I find my data?"]
8. Motivations
✤ Users want to query different data services without knowing about their existence
✤ Users want to combine information from different data services
✤ Some users may have domain knowledge, but they need to query X services, using Y interfaces and dealing with Z data formats, to get their data
[Diagram: the Data Aggregation System sitting in front of the individual services and their keys, e.g. RunSummary (run, trigger, detector, ...), DataQuality (run, lumi, ...), LumiDB (lumi, luminosity, hltpath), Phedex (block, file, block.replica, file.replica, se, node, site), DBS (run, file, block, site, config, tier, dataset, lumi, parameters, ...), GenDB (MC id, generator, xsection, process, decay, ...), SiteDB (site, admin, site.status, ...), Overview (country, node, region, ...), Parameter Set DB (CMSSW parameters), plus further services with their own parameters]
9. Implementation idea
✤ When we talk we may use different languages (English, French, etc.) or different conventions (pounds vs kg)
✤ In order to establish communication we use translation, a dictionary, a thesaurus
11. Pros
✤ Separates data management from the discovery service
✤ Data are safe and secure
✤ Pluggable architecture (new translations)
✤ Users never bother with interfaces, naming and schema conflicts, data-formats, security policies
✤ Information is aggregated in real time over distributed services
✤ Data-consistency checks for free
✤ DB and API changes are transparent to end-users
12. Cons
✤ DAS does not own the data
✤ lots of writes/reads/translations
✤ Data-services are the real bottleneck
✤ nothing is guaranteed, e.g. a service can go down, there is no control of its performance, requested data can be really large, etc.
✤ cache often and preemptively
MongoDB to the rescue!!!
13. Data Aggregation System
[Architecture diagram: a DAS robot periodically invokes the same API(params) to update the cache and pre-fetch popular queries/APIs. The DAS core consists of a query parser, a mapping component that maps data-service output to DAS records, the DAS cache and DAS merge collections, an aggregator, and an Analytics component that records each query and API call. A DAS web server exposes a RESTful interface and UI on top of the DAS cache server, whose plugins call the underlying data-services (runsum, lumidb, phedex, sitedb, dbs).]
14. Mapping DB
✤ Holds translations between user keywords and data-service APIs, resolves naming conflicts, etc.
✤ The query city=Ithaca translates into a Google API call:
{'das2api': [{'api_param': 'q', 'das_key': 'city.name', 'pattern': ''}],
'daskeys': [{'key': 'city', 'map': 'city.name', 'pattern': ''}],
'expire': 3600,
'format': 'JSON',
'params': {'output': 'json', 'q': 'required'},
'system': 'google_maps',
'url': 'http://maps.google.com/maps/geo',
'urn': 'google_geo_maps'}
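To make the translation concrete, here is a minimal sketch of how a mapping record like the one above could drive an API call. The record mirrors the google_geo_maps mapping shown on this slide; `build_api_call` is an illustrative helper, not part of the DAS codebase.

```python
# Hypothetical sketch: turning a DAS query into a data-service API call
# using a Mapping DB record. build_api_call is an illustrative helper.

mapping = {
    'das2api': [{'api_param': 'q', 'das_key': 'city.name', 'pattern': ''}],
    'daskeys': [{'key': 'city', 'map': 'city.name', 'pattern': ''}],
    'expire': 3600,
    'format': 'JSON',
    'params': {'output': 'json', 'q': 'required'},
    'system': 'google_maps',
    'url': 'http://maps.google.com/maps/geo',
    'urn': 'google_geo_maps',
}

def build_api_call(mapping, query):
    """Translate a DAS query, e.g. {'city': 'Ithaca'}, into (url, params)."""
    # resolve the user keyword to its internal DAS key (city -> city.name)
    das_keys = {d['key']: d['map'] for d in mapping['daskeys']}
    # resolve the DAS key to the data-service API parameter (city.name -> q)
    api_params = {d['das_key']: d['api_param'] for d in mapping['das2api']}
    params = dict(mapping['params'])  # start from the default API parameters
    for key, value in query.items():
        params[api_params[das_keys[key]]] = value
    return mapping['url'], params

url, params = build_api_call(mapping, {'city': 'Ithaca'})
# params == {'output': 'json', 'q': 'Ithaca'}
```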
15. Analytics DB
✤ Keeps track of user queries and data-service API calls
{'api': {'params': {'q': 'Ithaca', 'output': 'json'}, 'name': 'google_geo_maps'}, 'qhash':
'7272bdeac45174823d3a4ea240c124ec', 'system': 'google_maps', 'counter': 5}
✤ Used by DAS analytics daemons to pre-fetch “hot” queries
✤ ValueHotSpot: look up data by popular values
✤ KeyHotSpot: look up data by popular keys
✤ QueryMaintainer: keep a given query always in cache
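A rough sketch of how an analytics record like the one above could be maintained: a query hash (qhash) identifies the API call, and a counter tracks its popularity so that daemons can pre-fetch "hot" queries. `record_call` and the in-memory dict are illustrative stand-ins, not the DAS implementation.

```python
import hashlib

analytics = {}  # stand-in for the MongoDB analytics collection

def record_call(system, api, params):
    """Count an API call; popular entries can later be pre-fetched."""
    # hash the call signature so identical calls share one record
    sig = '%s:%s:%s' % (system, api, sorted(params.items()))
    qhash = hashlib.md5(sig.encode()).hexdigest()
    rec = analytics.setdefault(qhash, {
        'system': system,
        'api': {'name': api, 'params': params},
        'qhash': qhash,
        'counter': 0,
    })
    rec['counter'] += 1
    return rec

for _ in range(5):
    rec = record_call('google_maps', 'google_geo_maps',
                      {'q': 'Ithaca', 'output': 'json'})
# rec['counter'] is now 5, matching the record shown above
```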
16. Caching DB
✤ Data coming from data-service providers are translated into JSON and stored in the cache collection
✤ naming translations are performed at this level
✤ Data records from the cache collection are processed on a common key, e.g. city.name, and merged into the merge collection
cache collection:
{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76}}
{'city': {'name': 'Ithaca', 'zip': 14850}}
merge collection:
{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76, 'zip': 14850}}
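The merge step above can be sketched as follows: records from different services that share the same value of a common key (here city.name) are folded into one merged record. `merge_records` is an illustrative helper, not the DAS code.

```python
# Sketch of the cache -> merge step on a common key (city.name).

def merge_records(records, key='city'):
    merged = {}
    for rec in records:
        name = rec[key]['name']  # common key value, e.g. 'Ithaca'
        merged.setdefault(name, {key: {}})[key].update(rec[key])
    return list(merged.values())

cache = [
    {'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76}},  # from service A
    {'city': {'name': 'Ithaca', 'zip': 14850}},           # from service B
]
merged = merge_records(cache)
# merged == [{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76, 'zip': 14850}}]
```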
17. DAS workflow
✤ Parse the query
✤ Query the DAS merge collection; on a hit, return the results
✤ Query the DAS cache collection; on a hit, merge and return the results
✤ On a miss, invoke calls to the data-services (resolved via the Mapping DB)
✤ Write the query and API calls to Analytics
✤ Aggregate the results
✤ Represent the results on the web UI or via the command-line interface
[Workflow diagram: DAS core with parser and logging, merge and cache collections, data-services, Mapping and Analytics DBs, aggregator and Web UI.]
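The lookup order described above (merge collection first, then cache, then a call out to the data-services on a miss) can be sketched as a small function. All names here are illustrative stand-ins for the DAS components.

```python
# Minimal sketch of the DAS lookup order: merge -> cache -> data-services.

def aggregate(records):
    # placeholder aggregation step: fold raw records into one result
    out = {}
    for rec in records:
        out.update(rec)
    return out

def das_lookup(query, merge_db, cache_db, call_services, analytics):
    analytics.append(query)                     # record the query for analytics
    if query in merge_db:                       # 1. already aggregated?
        return merge_db[query]
    if query not in cache_db:                   # 2. raw records cached?
        cache_db[query] = call_services(query)  # 3. miss: invoke data-services
    merge_db[query] = aggregate(cache_db[query])
    return merge_db[query]

analytics = []
merge_db, cache_db = {}, {}
services = lambda q: [{'lat': 42}, {'lng': -76}]  # fake data-service call
result = das_lookup('city=Ithaca', merge_db, cache_db, services, analytics)
# result == {'lat': 42, 'lng': -76}; a repeated query now hits merge_db
```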
19. DAS QL & MongoDB QL
✤ DAS Query Language is built on top of MongoDB QL; it represents
MongoDB QL in a human-readable form
✤ UI level:
block dataset=/a/b/c | grep block.size | count(block.size)
✤ DB level:
col.find(spec={'dataset.name': '/a/b/c'}, fields=['block.size']).count()
✤ We enrich the QL with additional filters (grep, sort, unique) and
implement a set of coroutines for the aggregator functions
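A rough sketch of how the UI-level query above could be lowered to the MongoDB level. The parsing here is deliberately simplistic and only handles the `key selector=value | grep field` shape; it is not the actual DAS parser.

```python
# Illustrative DAS QL -> MongoDB QL lowering for a simple query shape.

def dasql_to_mongo(dasql):
    parts = [p.strip() for p in dasql.split('|')]
    head = parts[0].split()                  # e.g. ['block', 'dataset=/a/b/c']
    select_key = head[0]
    cond_key, cond_val = head[1].split('=')
    spec = {'%s.name' % cond_key: cond_val}  # condition -> MongoDB spec
    # "grep" filters select the fields to return
    fields = [p.split()[1] for p in parts[1:] if p.startswith('grep')]
    return spec, fields or [select_key]

spec, fields = dasql_to_mongo('block dataset=/a/b/c | grep block.size')
# spec == {'dataset.name': '/a/b/c'}, fields == ['block.size'],
# which maps onto: col.find(spec, fields)
```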
20. DAS & MongoDB
✤ DAS works with 15 distributed data-services
✤ their sizes vary, on average O(100GB)
✤ DAS uses 40 MongoDB collections
✤ caching, mapping, analytics, logging (normal, capped, gridfs cols)
✤ DAS inserts/deletes O(1M) records on a daily basis
✤ We operate on a single 64-bit Linux node with 8 CPUs, 24 GB of RAM
and 1 TB of disk space; sharding was tested but is not enabled
21. MongoDB benefits
✤ Fast I/O and a schema-less database are ideal for a cache implementation
✤ you're not limited to a key:value approach
✤ A flexible query language allows us to build a domain-specific QL
✤ staying on a par with SQL
✤ No DB administration costs
✤ easy to install and maintain
22. MongoDB issues (ver 2.0.X)
✤ We were unable to directly store DAS queries into the analytics
collection, due to the dot constraint on key names, e.g. {'a.b': 1}
✤ queries <=> storage format {'key': 'a.b', 'value': 1}
✤ SCons is not suitable in a fully controlled build environment
✤ it removes $PATH/$LD_LIBRARY_PATH for compiler commands and
forces the use of -L/lib64. As a result we used wrappers.
✤ Uncompressed field names and limitations with pagination/
aggregation
✤ should be addressed in the new MongoDB aggregation framework
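The queries <=> storage translation for the dot constraint mentioned above can be sketched as a pair of helpers: MongoDB 2.0.x forbids dots in field names, so a query like {'a.b': 1} is stored as key/value pairs. `encode`/`decode` are illustrative names, not the actual DAS code.

```python
# Sketch of the dot-constraint workaround: {'a.b': 1} cannot be stored
# as a MongoDB document key, so it is flattened into key/value records.

def encode(query):
    """{'a.b': 1} -> [{'key': 'a.b', 'value': 1}]"""
    return [{'key': k, 'value': v} for k, v in query.items()]

def decode(records):
    """[{'key': 'a.b', 'value': 1}] -> {'a.b': 1}"""
    return {r['key']: r['value'] for r in records}

stored = encode({'a.b': 1})
# stored == [{'key': 'a.b', 'value': 1}]; decode(stored) round-trips it
```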
23. Tradeoffs
✤ Query collisions: DAS does not own the data and there are no
transactions; we rely on the query status and update it accordingly
✤ Index choice: initially one per select key, later one per query hash
✤ Storage size: we trade storage size against data flexibility and
naming conventions
✤ Speed: we trade simple data access against a conglomerate of
restrictions (naming, security policies, interfaces, etc.), but we tune
our data-service APIs based on query patterns
24. Results
✤ The service has been in production for over one year
✤ Users are authenticated via GRID certificates, and DAS uses a proxy
server to pass credentials to the back-end services
✤ A single query request yields a few thousand records and is resolved
within a few seconds
✤ A pluggable architecture allows you to query your own service(s)
✤ unit tests are done against public data-services, e.g. Google, IP
look-up, etc.
25. NoSQL @ CERN
✤ MongoDB is used by other experiments at CERN
✤ logging, monitoring, data analytics
✤ MongoDB is not the only NoSQL solution used at CERN
✤ One size does not fit all
✤ CouchDB, Cassandra, HBase, etc.
✤ There is an ongoing discussion between the experiments and CERN IT
about the adoption of NoSQL
26. Summary
✤ CMS experiment built Data Aggregation System as an intelligent
cache to query distributed data-services
✤ MongoDB is used as DAS back-end
✤ During the first year of operation we did not experience any significant
problems
✤ I’d like to thank MongoDB team and its community for their constant
support
✤ Questions? Contact: vkuznet@gmail.com
✤ https://github.com/vkuznet/DAS/
28. From query to results
[Pipeline diagram, built up incrementally over slides 28-33: a Query goes through an API lookup and fans out to several data-service generators; each generator feeds an Aggregator, and the aggregated streams are merged into the final results. The example query block dataset=/a/b/c is translated into a MongoDB spec. The Mapping DB holds relationships, the Caching DB holds service records, and the Merge DB holds merged records.]
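The generator pipeline in the diagram above can be sketched as follows: each data-service yields records lazily, a per-service aggregator consumes its stream, and a final step merges the aggregated parts into the results. All names and the sum-over-a-key aggregation are illustrative assumptions, not the DAS implementation.

```python
# Sketch of the query-to-results pipeline: data-service generators feed
# aggregators, whose outputs are merged into the final result.

def data_service(records):
    """Generator standing in for one remote data-service."""
    for rec in records:
        yield rec

def aggregator(stream, key):
    """Fold one record stream into a summary, here a sum over a key."""
    total = 0
    for rec in stream:
        total += rec.get(key, 0)
    return {key: total}

def merge_results(parts, key):
    """Merge the per-service aggregates into the final result."""
    return {key: sum(p[key] for p in parts)}

services = [
    data_service([{'size': 10}, {'size': 20}]),  # fake service A
    data_service([{'size': 5}]),                 # fake service B
]
parts = [aggregator(s, 'size') for s in services]
result = merge_results(parts, 'size')
# result == {'size': 35}
```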