[Figure: CMS data-services (GenDB, LumiDB, Data Quality, Phedex, PSetDB, DBS, SiteDB, RunDB, Overview) with the question "How can I find my data?"]

CMS Data Aggregation System
Valentin Kuznetsov, Cornell University

ICCS Workshop, Amsterdam, May 31 - Jun. 2d, 2010
Talk outline

✤   Introduction

✤   Motivations

✤   What is DAS?

✤   Design, architecture, implementations

✤   Current status & benchmarks

✤   Future plans

Introduction

✤   CMS is a general purpose physics detector built for the LHC

    ✤   beam collision 25 nsec, online trigger 300 Hz, event size 1-2MB

✤   More than 3000 physicists, 183 institutions, 38 countries

✤   CMS uses a distributed computing and data model

    ✤   1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers

    ✤   2-6 PB/year of real data + 1x Simulated data, ~500GB/year of meta-data

✤   Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
Motivations ...

✤   A user wants to query different meta-data services without knowing of their existence

✤   A user wants to combine information from different meta-data services

✤   A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats to get the data

[Figure: the Data Aggregation System sits on top of the data-services, which are linked by shared keys (run, lumi, block, site, MC id, pset)]

    ✤   RunSummary: run, trigger, detector, ...

    ✤   DataQuality: trigger, ecal, hcal, ...

    ✤   LumiDB: lumi, luminosity, hltpath

    ✤   Phedex: block, file, block.replica, file.replica, se, node, ...

    ✤   DBS: run, file, block, site, config, tier, dataset, lumi, parameters, ...

    ✤   GenDB: generator, xsection, process, decay, ...

    ✤   SiteDB: site, admin, site.status, ...

    ✤   Overview: country, node, region, ...

    ✤   Parameter Set DB: CMSSW parameters

    ✤   Service A, B, C, D, E, ...: param1, param2, ...
What is DAS?
✤   DAS stands for Data Aggregation System

✤   It is a layer on top of existing data-services

✤   It aggregates data across distributed data-services while preserving
    their integrity, security policy and data-formats

    ✤   it provides caching for data-services (side effect)

✤   It represents data in a defined format: JSON documents

✤   It allows querying data via free text-based queries

✤   Agnostic to data content
Challenges ...
✤   Combining N data-services is a great idea, but

    ✤   there is no ad-hoc IT solution

    ✤   DAS doesn’t hold the data, can’t have pre-defined schema

    ✤   must support existing APIs, data formats, interfaces, security
        policies

    ✤   must relate and aggregate meta-data

    ✤   must be efficient, flexible, scalable and easy to use

✤   Work on DAS prototype to understand those challenges
DAS prototype

✤   Code written in Python, ideal for prototyping

✤   Use existing meta-data from CMS data-services as test-bed

    ✤   8 data-services, 75/250GB in tables/indexes

✤   Use document-oriented "schema-less" database: MongoDB

    ✤   raw cache, merge result cache, mapping and analytics DBs

✤   Support free keyword-based queries, e.g. site=T1_CERN, run=100

✤   Aggregate information using key-value matching
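
The two ideas on this slide, keyword queries and key-value matching, can be illustrated with a minimal Python sketch. The function names (`parse_query`, `merge_by_key`), the query syntax handled by the regular expression, and the sample records are all hypothetical; this is not DAS code, and plain dicts stand in for whatever the prototype stores in MongoDB.

```python
import re

def parse_query(text):
    """Split a keyword query into a {key: value} dict (assumed syntax)."""
    pairs = re.findall(r"([\w.]+)\s*=\s*([^\s,]+)", text)
    return {key: value for key, value in pairs}

def merge_by_key(key, *record_sets):
    """Combine records from several services that share a value of `key`."""
    merged = {}
    for records in record_sets:
        for rec in records:
            if key not in rec:
                continue  # a record without the key cannot be matched
            merged.setdefault(rec[key], {}).update(rec)
    return list(merged.values())

# Made-up records standing in for the output of two data-services.
dbs_blocks = [{"block": "/a/b#1", "nfiles": 10}, {"block": "/a/b#2", "nfiles": 3}]
phedex_blocks = [{"block": "/a/b#1", "replicas": 2}]

query = parse_query("site=T1_CERN, run=100")
blocks = merge_by_key("block", dbs_blocks, phedex_blocks)
```

Here `query` holds the two key-value pairs from the slide's example, and `blocks` contains one merged record per distinct block name.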
DAS architecture

[Figure: DAS architecture diagram]

✤   Data-services (runsum, lumidb, phedex, sitedb, dbs) are accessed through plugins

✤   DAS core: parser, mapping, aggregator

✤   DAS mapping: map data-service output to DAS records

✤   DAS cache, DAS merge

✤   DAS Analytics: record query, API call to Analytics

✤   DAS robot: fetch popular queries/APIs, invoke the same API(params), update cache periodically

✤   DAS Cache server: DAS core instances (each labelled with a CPU core), RESTful interface

✤   DAS web server: UI
DAS workflow

✤   Query parser

✤   Query DAS merge collection

    ✤   Query DAS cache collection

        ✤   invoke call to data service

        ✤   write to analytics

✤   Aggregate results (generator)

[Figure: workflow flowchart. query → DAS core (with DAS logging) → parser → query DAS merge; if found, read DAS merge, otherwise query DAS cache; if found, read DAS cache, otherwise query data-services (using DAS Mapping, recorded in DAS Analytics) → Aggregator → results → Web UI]
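
The lookup order in this workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dicts stand in for the DAS merge and DAS cache collections, a list stands in for analytics, the aggregation step is reduced to passing records through, and the names `das_lookup` and `fake_service` are hypothetical.

```python
def das_lookup(query, merge_cache, raw_cache, call_service, analytics):
    """Yield records for `query`, consulting the caches before the service."""
    if query in merge_cache:            # 1. merged results already cached
        yield from merge_cache[query]
        return
    if query not in raw_cache:          # 2. raw cache miss: call the service
        raw_cache[query] = list(call_service(query))
        analytics.append(query)         #    and record the call in analytics
    merged = raw_cache[query]           # 3. aggregation (identity in this sketch)
    merge_cache[query] = merged
    yield from merged                   # 4. results are produced by a generator

calls = []

def fake_service(query):
    """Stand-in for a real data-service call; remembers what it was asked."""
    calls.append(query)
    return [{"query": query, "value": 1}]

merge_cache, raw_cache, analytics = {}, {}, []
first = list(das_lookup("run=100", merge_cache, raw_cache, fake_service, analytics))
second = list(das_lookup("run=100", merge_cache, raw_cache, fake_service, analytics))
```

The second call is answered from the merge cache, so the data-service is contacted only once.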
DAS and data-services

✤   DAS is data-service agnostic

    ✤   a data-service is identified by its URI and input parameters

✤   Use plug-and-play mechanism:

    ✤   add new data-service using ASCII map file (URI, parameters, ...)

    ✤   use generic HTTP access and standard data-parsers (XML, JSON)

✤   Use dedicated plugin:

    ✤   specific access requirements, custom parsers, etc.
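
As a rough illustration of what "standard data-parsers" could look like, the sketch below turns a JSON or XML response body into a list of dicts using only the Python standard library. The function name `parse_payload`, the assumption that XML records are child elements carrying attributes, and the sample payloads are hypothetical and not taken from DAS.

```python
import json
import xml.etree.ElementTree as ET

def parse_payload(payload, fmt):
    """Turn a JSON or XML response body into a list of flat dicts."""
    if fmt == "JSON":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    if fmt == "XML":
        root = ET.fromstring(payload)
        # Assumed layout: one child element per record, fields as attributes.
        return [dict(child.attrib) for child in root]
    raise ValueError(f"unsupported format: {fmt}")

records_json = parse_payload('[{"site": "T1_CERN"}]', "JSON")
records_xml = parse_payload('<blocks><block name="b1" files="3"/></blocks>', "XML")
```

A service whose output does not fit such a generic parser is the case the slide reserves for a dedicated plugin.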
DAS map files

system : google_maps
format : JSON
---
urn : google_geo_maps
url : "http://maps.google.com/maps/geo"
expire : 30
params : { "q" : "required", "output": "json" }
daskeys : [
    {"key":"city","map":"city.name","pattern":""},
]

[Figure labels: Data Aggregation System, DAS mapping, Data Service: URL/api?params]
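
One way such a map entry could be used is sketched below in Python. The entry from the slide is written as a dict for illustration. Treating the value "required" as "the user must supply this parameter", as well as the helper names `build_url` and `das_key_for`, are assumptions and not the documented DAS behaviour. No network request is made.

```python
from urllib.parse import urlencode

# The map entry from the slide, expressed as a Python dict.
api_map = {
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}

def build_url(entry, **user_params):
    """Fill required parameters from the caller; keep defaults for the rest."""
    params = {}
    for name, default in entry["params"].items():
        if default == "required":  # assumed meaning: caller must provide it
            if name not in user_params:
                raise ValueError(f"missing required parameter: {name}")
            params[name] = user_params[name]
        else:
            params[name] = user_params.get(name, default)
    return entry["url"] + "?" + urlencode(params)

def das_key_for(entry, das_key):
    """Return the dotted record path a DAS key maps to, or None."""
    for item in entry["daskeys"]:
        if item["key"] == das_key:
            return item["map"]
    return None

url = build_url(api_map, q="Ithaca")
path = das_key_for(api_map, "city")
```

The resulting `url` has the URL/api?params shape named on the slide, and `path` is the record field that the DAS key `city` points to.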
DAS benchmark
✤    Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data services

     ✤   parse, remap notations, store to cache, merge matched records (aggregation)

     ✤   Linux 64-bit, 1CPU for DAS, 1CPU for MongoDB, record size ~1KB

✤    Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing
     time + output creation time


                   Format     Records     Time, no cache     Time w/ cache

    DBS yield      XML        387K        68s                0.98s
    PhEDEx yield   XML        190K        107s               0.98s
    Merge step     JSON       577K        63s                0.9s
    DAS total      JSON       393K        238s               2.05s

    393K DAS records, create ~6K docs/s, read ~7.6K docs/s
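
The numbers in the table can be cross-checked with a little arithmetic, shown here as a short Python sketch. All values are copied from the table; only the variable names are introduced for illustration.

```python
# Timings in seconds without cache, copied from the benchmark table.
no_cache = {"DBS yield": 68, "PhEDEx yield": 107, "Merge step": 63}
total_no_cache = 238
total_with_cache = 2.05

steps_sum = sum(no_cache.values())           # 68 + 107 + 63
speedup = total_no_cache / total_with_cache  # gain from a warm cache
merge_input = 387_000 + 190_000              # DBS + PhEDEx records fed to the merge
```

The three uncached steps add up to the reported 238 s total, the merge step's 577K input records are the sum of the DBS and PhEDEx yields, and the cached total is roughly 116 times faster than the uncached one.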
Future plans

✤   DAS goes into production this year in CMS:

    ✤   confirm scalability, transparency and durability w/ various data-
        services

    ✤   work on analytics to organize pre-fetch strategies

✤   Apply to other domain disciplines

✤   Release as open source
Summary

✤   Data Aggregation System is data agnostic and allows querying and
    aggregating meta-data information in a customizable way

✤   The current architecture easily integrates with existing data-services
    preserving their access, security policy and development cycle

✤   DAS is designed to work with existing CMS data-services, but can
    easily go beyond that boundary

✤   Plug-and-play mechanism makes it easy to add new data-services
    and configure DAS to a specific domain

A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 

Find Your CMS Data

  • 1. CMS Data Aggregation System: How can I find my data?
      Valentin Kuznetsov, Cornell University
      ICCS Workshop, Amsterdam, May 31 - Jun. 2, 2010
      [Title diagram labels: GenDB, LumiDB, Data Quality, Phedex, PSetDB, DBS, SiteDB, RunDB, Overview]
  • 2. Talk outline
      ✤ Introduction
      ✤ Motivations
      ✤ What is DAS?
      ✤ Design, architecture, implementations
      ✤ Current status & benchmarks
      ✤ Future plans
  • 3. Introduction
      ✤ CMS is a general purpose physics detector built for the LHC
          ✤ beam collisions every 25 nsec, online trigger 300 Hz, event size 1-2 MB
      ✤ More than 3000 physicists, 183 institutions, 38 countries
      ✤ CMS uses a distributed computing and data model
          ✤ 1 Tier-0, 7 Tier-1, O(50) Tier-2, O(50) Tier-3 centers
          ✤ 2-6 PB/year of real data + 1x simulated data, ~500 GB/year of meta-data
      ✤ Code: C++/Python; Databases: ORACLE, MySQL, CouchDB, MongoDB ...
  • 4. Motivations ...
      ✤ A user wants to query different meta-data services without knowing of their existence
      ✤ A user wants to combine information from different meta-data services
      ✤ A user has domain knowledge, but needs to query X services, using Y interfaces and dealing with Z data formats, to get our data
      [Diagram "Data Aggregation System", showing the data-services and their keys: RunSummary (run, trigger, detector, ...), DataQuality (trigger, ecal, hcal, ...), LumiDB (lumi, luminosity, hltpath), Phedex (block, file, block.replica, file.replica, se, node, ...), DBS (block, run, file, site config, tier, dataset, lumi, parameters, ...), GenDB (generator, xsection, process, decay, ...), SiteDB (site, admin, site.status, ...), Overview (country, node, region, ...), Parameter Set DB (CMSSW parameters), Services A-E (param1, param2, ...). Links between services are labeled run, lumi, block, site, MC id, pset.]
  • 5. What is DAS?
      ✤ DAS stands for Data Aggregation System
      ✤ It is a layer on top of existing data-services
      ✤ It aggregates data across distributed data-services while preserving their integrity, security policy and data formats
          ✤ it provides caching for data-services (side effect)
      ✤ It represents data in a defined format: JSON documents
      ✤ It allows querying data via free text-based queries
      ✤ Agnostic to data content
  • 6. Challenges ...
      ✤ Combining N data-services is a great idea, but
          ✤ there is no ad-hoc IT solution
          ✤ DAS doesn't hold the data, can't have a pre-defined schema
          ✤ must support existing APIs, data formats, interfaces, security policies
          ✤ must relate and aggregate meta-data
          ✤ must be efficient, flexible, scalable and easy to use
      ✤ Work on DAS prototype to understand those challenges
  • 7. DAS prototype
      ✤ Code written in Python, ideal for prototyping
      ✤ Use existing meta-data from CMS data-services as test-bed
          ✤ 8 data-services, 75/250 GB in tables/indexes
      ✤ Use document-oriented "schema-less" database: MongoDB
          ✤ raw cache, merge result cache, mapping and analytics DBs
      ✤ Support free keyword-based queries, e.g. site=T1_CERN, run=100
      ✤ Aggregate information using key-value matching
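The keyword queries and key-value matching mentioned on slide 7 can be illustrated with a minimal Python sketch. This is not the DAS parser or aggregator: the function names `parse_query` and `merge_by_key`, the treatment of commas and whitespace as separators, and the sample records are all assumptions made for illustration.

```python
def parse_query(query):
    """Split a keyword query such as 'site=T1_CERN, run=100' into a dict.

    Assumption: conditions are separated by commas and/or whitespace.
    """
    conditions = {}
    for token in query.replace(",", " ").split():
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        # Store purely numeric values (e.g. run numbers) as integers.
        conditions[key] = int(value) if value.isdigit() else value
    return conditions


def merge_by_key(records_a, records_b, key):
    """Combine records from two services whose `key` values match."""
    index = {}
    for rec in records_a:
        index.setdefault(rec[key], {}).update(rec)
    # Keep only records present in both lists, merging their fields.
    return [{**index[rec[key]], **rec} for rec in records_b if rec[key] in index]


# Hypothetical records standing in for output of two data-services.
dbs_records = [{"block": "b1", "nfiles": 10}, {"block": "b2", "nfiles": 3}]
phedex_records = [{"block": "b1", "site": "T1_CERN"}]

parsed = parse_query("site=T1_CERN, run=100")
merged = merge_by_key(dbs_records, phedex_records, "block")
```

Here `merged` holds a single document for block `b1` that carries fields from both inputs, which is the effect the slide describes as aggregation by key-value matching.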
  • 8. DAS architecture
      [Architecture diagram. Components: DAS web server (RESTful interface, UI, DAS core), DAS Cache server (DAS core with parser, mapping, aggregator, plugins; one CPU core), DAS robot, and the DAS mapping, DAS cache, DAS merge and DAS Analytics databases. Data-services: runsum, lumidb, phedex, sitedb, dbs. Captions: "Map data-service output to DAS records"; "record query, API call to Analytics"; "Fetch popular queries/APIs"; "Invoke the same API(params)"; "Update cache periodically".]
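The "DAS robot" captions on slide 8 (fetch popular queries, invoke the same API, update the cache periodically) suggest a pre-fetch loop. The sketch below is one possible reading under stated assumptions: the analytics store is modeled as a plain list of past queries, the cache as a dict, and `call_service` is a hypothetical callable standing in for a data-service API. It does not reproduce the actual robot.

```python
from collections import Counter


def popular_queries(analytics_log, top_n=2):
    """Return the `top_n` most frequent queries recorded in the log."""
    return [query for query, _ in Counter(analytics_log).most_common(top_n)]


def refresh_cache(cache, analytics_log, call_service, top_n=2):
    """Re-invoke the data-service for popular queries and overwrite the cache."""
    for query in popular_queries(analytics_log, top_n):
        cache[query] = list(call_service(query))
    return cache


# Hypothetical analytics log and a stub data-service.
log = ["run=100", "site=T1_CERN", "run=100", "block=b1", "site=T1_CERN", "run=100"]
cache = refresh_cache({}, log, lambda query: [{"query": query}])
```

A scheduler would call `refresh_cache` at a fixed interval; the interval and the number of queries to pre-fetch are tuning choices the slides do not specify.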
  • 9. DAS workflow
      ✤ Query parser
      ✤ Query DAS merge collection
      ✤ Query DAS cache collection
          ✤ invoke call to data service
          ✤ write to analytics
      ✤ Aggregate results (generator)
      [Workflow diagram nodes: query, DAS core, DAS logging, parser, query DAS merge (yes/no), query DAS cache (yes/no), DAS merge, DAS cache, data-services, DAS Mapping, Aggregator, DAS Analytics, results, Web UI]
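The lookup order on slide 9 (merge collection first, then raw cache, then the data-service, with results returned by a generator) can be sketched as follows. Plain dicts and a list stand in for the MongoDB collections and the analytics DB, and `call_service` and `aggregate` are hypothetical callables, so this only illustrates the control flow, not the DAS implementation.

```python
def das_lookup(query, merge_cache, raw_cache, call_service, analytics,
               aggregate=list):
    """Yield records for `query`, calling the data-service only on a cache miss."""
    if query in merge_cache:                 # merged results already stored
        yield from merge_cache[query]
        return
    if query not in raw_cache:               # raw cache miss
        raw_cache[query] = list(call_service(query))
        analytics.append(query)              # record the API call
    merged = aggregate(raw_cache[query])     # aggregation step
    merge_cache[query] = merged
    yield from merged


# Stub data-service that counts how often it is invoked.
calls = []

def stub_service(query):
    calls.append(query)
    return [{"run": 100}]

merge_cache, raw_cache, analytics = {}, {}, []
first = list(das_lookup("run=100", merge_cache, raw_cache, stub_service, analytics))
second = list(das_lookup("run=100", merge_cache, raw_cache, stub_service, analytics))
```

The second call is served entirely from `merge_cache`, so `stub_service` runs once. Writing the function as a generator lets a caller such as a web UI start consuming records before the full result set is materialized.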
  • 10. DAS and data-services
      ✤ DAS is data-service agnostic
          ✤ a data-service is identified by its URI and input parameters
      ✤ Use plug-and-play mechanism:
          ✤ add new data-service using ASCII map file (URI, parameters, ...)
          ✤ use generic HTTP access and standard data-parsers (XML, JSON)
      ✤ Use dedicated plugin:
          ✤ specific access requirements, custom parsers, etc.
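The idea on slide 10 of choosing a standard parser (XML or JSON) per data-service can be sketched with the Python standard library. The registry name `PARSERS`, the function names, and the assumed XML shape (child elements whose attributes hold the record fields) are illustrative assumptions; a dedicated plugin would register its own parser in place of these.

```python
import json
import xml.etree.ElementTree as ET


def parse_json(payload):
    """Return a list of records from a JSON payload."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def parse_xml(payload):
    """Return one record per child element, built from its attributes."""
    root = ET.fromstring(payload)
    return [dict(child.attrib) for child in root]


# Format name (as it might appear in a map file) to parser function.
PARSERS = {"JSON": parse_json, "XML": parse_xml}


def parse_response(payload, fmt):
    """Dispatch to the standard parser registered for `fmt`."""
    try:
        parser = PARSERS[fmt.upper()]
    except KeyError:
        raise ValueError(f"no standard parser for format {fmt!r}") from None
    return parser(payload)


xml_records = parse_response('<blocks><block name="b1" size="10"/></blocks>', "xml")
json_records = parse_response('{"run": 100}', "json")
```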
  • 11. DAS map files
      Example map file:
          system : google_maps
          format : JSON
          ---
          urn : google_geo_maps
          url : "http://maps.google.com/maps/geo"
          expire : 30
          params : { "q" : "required", "output": "json" }
          daskeys : [ {"key":"city","map":"city.name","pattern":""}, ]
      [Diagram: Data Aggregation System, DAS mapping, Data Service: URL/api?params]
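To show how a map file entry like the one on slide 11 might drive a generic HTTP call, here is a small Python sketch. The entry is copied into a dict rather than parsed from the file, and two interpretations are assumptions rather than documented DAS behavior: that a parameter value of "required" means the user must supply it, and that "map" names the record field a DAS key refers to. No request is sent; the code only builds the URL.

```python
from urllib.parse import urlencode

# Map-file entry from the slide, transcribed as a Python dict.
MAP_ENTRY = {
    "urn": "google_geo_maps",
    "url": "http://maps.google.com/maps/geo",
    "expire": 30,
    "params": {"q": "required", "output": "json"},
    "daskeys": [{"key": "city", "map": "city.name", "pattern": ""}],
}


def build_url(entry, **user_params):
    """Build the data-service URL, filling defaults and checking required params."""
    params = {}
    for name, default in entry["params"].items():
        if name in user_params:
            params[name] = user_params[name]
        elif default == "required":   # assumed meaning: caller must provide it
            raise ValueError(f"missing required parameter {name!r}")
        else:
            params[name] = default
    return entry["url"] + "?" + urlencode(params)


def field_for_key(entry, das_key):
    """Look up the record field that a DAS key maps to (assumed semantics)."""
    for item in entry["daskeys"]:
        if item["key"] == das_key:
            return item["map"]
    raise KeyError(das_key)


url = build_url(MAP_ENTRY, q="Amsterdam")
field = field_for_key(MAP_ENTRY, "city")
```

Keeping the URL, parameters, and key mapping in data rather than code is what lets a new service be added without writing a plugin.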
  • 12. DAS benchmark
      ✤ Fetch all blocks from our bookkeeping (DBS) and data transfer (PhEDEx) CMS data services
          ✤ parse, remap notations, store to cache, merge matched records (aggregation)
      ✤ Linux 64-bit, 1 CPU for DAS, 1 CPU for MongoDB, record size ~1 KB
      ✤ Elapsed time = retrieval time + parsing time + remapping time + cache insertion/indexing time + output creation time

                       Format   Records   Time, no cache   Time w/ cache
        DBS yield      XML      387K      68s              0.98s
        PhEDEx yield   XML      190K      107s             0.98s
        Merge step     JSON     577K      63s              0.9s
        DAS total      JSON     393K      238s             2.05s

      393K DAS records; create ~6K docs/s, read ~7.6K docs/s
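A short arithmetic check of the slide 12 figures, written as Python for convenience. It only recombines numbers from the table; the variable names are illustrative, and the speedup ratio is derived here rather than quoted from the slides.

```python
# Uncached stage times in seconds, taken from the benchmark table.
no_cache_s = {"DBS yield": 68, "PhEDEx yield": 107, "Merge step": 63}
total_no_cache_s = sum(no_cache_s.values())

# Record counts (in thousands) yielded by the two data-services.
records_k = {"DBS yield": 387, "PhEDEx yield": 190}
merge_input_k = sum(records_k.values())

# Ratio of the uncached DAS total to the cached DAS total (2.05 s).
speedup = total_no_cache_s / 2.05
```

The three uncached stages add up to the 238 s reported as the DAS total, and the two yields add up to the 577K records entering the merge step. The cached total of 2.05 s is not the sum of the cached per-stage times, so it is used as reported.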
  • 13. Future plans
      ✤ DAS goes into production this year in CMS:
          ✤ confirm scalability, transparency and durability w/ various data-services
          ✤ work on analytics to organize pre-fetch strategies
      ✤ Apply to other domain disciplines
      ✤ Release as open source
  • 14. Summary
      ✤ Data Aggregation System is data agnostic and allows querying and aggregating meta-data in a customizable way
      ✤ The current architecture easily integrates with existing data-services, preserving their access, security policy and development cycle
      ✤ DAS is designed to work with existing CMS data-services, but can easily go beyond that boundary
      ✤ Plug-and-play mechanism makes it easy to add new data-services and configure DAS for a specific domain