The document discusses MongoDB's use in the CMS experiment at CERN. MongoDB is used as the backend for CMS's Data Aggregation System (DAS), which acts as an intelligent cache to query distributed data services. DAS translates user queries, retrieves data from multiple services, aggregates the results, and returns consolidated responses. This architecture allows users to access different data without knowledge of the underlying services. MongoDB provides a flexible schema and fast I/O that make it suitable for caching distributed data and executing complex queries in DAS.
Behind the Scenes at LiveJournal: Scaling Storytime - Sergey Chernyshev
Brad talks about clustering setups using MySQL and DRBD, and the open-source software behind them, most of which he wrote initially and continues to develop.
Many of these techniques and tools are used by other companies as well, among them Flickr/Yahoo! and Facebook.
A comprehensive PowerPoint presentation on DBMS concepts from start to end, with examples chapter by chapter. Please go through the chapters sequentially.
An easy-going study resource for a better understanding of Database Management Systems.
Java Tech & Tools | Beyond the Data Grid: Coherence, Normalisation, Joins and... - JAX London
2011-11-02 | 02:25 PM - 03:15 PM
In 2009 RBS set out to build a single store of trade and risk data that all applications in the bank could access simultaneously. This talk discusses a number of novel techniques that were developed as part of this work. Based on Oracle Coherence, the ODC departs from the trend set by most caching solutions by holding its data in a normalised form, making it both memory-efficient and easy to change. However, it does this in a novel way that supports most arbitrary queries without the usual problems associated with distributed joins. We'll be discussing these patterns as well as others that allow linear scalability, fault tolerance and millisecond latencies.
SQL? NoSQL? NewSQL?!? What's a Java developer to do? - PhillyETE 2012 - Chris Richardson
The database world is undergoing a major upheaval. NoSQL databases such as MongoDB and Cassandra are emerging as a compelling choice for many applications. They can simplify the persistence of complex data models and offer significantly better scalability and performance. But these databases have very different and unfamiliar data models and APIs, as well as a limited transaction model. Meanwhile, the relational world is fighting back with so-called NewSQL databases such as VoltDB, which, by using a radically different architecture, offer high scalability and performance as well as the familiar relational model and ACID transactions. This sounds great, but unlike with a traditional relational database you can't use JDBC and you must partition your data.
In this presentation you will learn about popular NoSQL databases – MongoDB and Cassandra – as well as VoltDB. We will compare and contrast each database's data model and Java API using NoSQL and NewSQL versions of a use case from the book POJOs in Action. We will learn about the benefits and drawbacks of using NoSQL and NewSQL databases.
Nagios Conference 2012 - John Murphy - Rational Configuration Design - Nagios
John Murphy's presentation on well designed Nagios configurations.
The presentation was given during the Nagios World Conference North America held Sept 25-28th, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna
OpenSplice DDS enables seamless, timely, scalable and dependable data sharing between distributed applications and network-connected devices. Its technical and operational benefits have propelled adoption across multiple industries, such as Defence and Aerospace, SCADA, Gaming, Cloud Computing, Automotive, etc.
If you want to learn about OpenSplice DDS or discover some of its advanced features, this webcast is for you!
In this two-part webcast we will cover the aspects of architecting and developing OpenSplice DDS systems. We will look into Quality of Service, data selectors, concurrency and scalability concerns.
We will present the brand-new, recently finalized C++ and Java APIs for DDS, including examples of how they can be used with C++11 features. We will show how increasingly popular functional languages such as Scala can be used to efficiently and elegantly exploit the massive hardware parallelism provided by modern multi-core processors.
Finally, we will present some OpenSplice-specific extensions for dealing with very high volumes of data, meaning several million messages per second.
Rainbird: Realtime Analytics at Twitter (Strata 2011) - Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Oracle vs NoSQL – The good, the bad and the ugly - John Kanagaraj
A good understanding of the NoSQL database technologies that can be used to support a Big Data implementation is essential for today's Oracle professional. This was discussed in detail in a two-hour deep-dive technical session at COLLABORATE 2014 - The Oracle User Group Conference. In this slide deck, you will learn what Big Data brings to the table as well as the concepts behind the underlying NoSQL data stores, in comparison to the ancestor you know well: the Oracle RDBMS. We will determine where and how to employ these NoSQL data stores effectively, and point out some of the issues that you will have to think through (and prepare for) before your organization rushes headlong into a "Big Data" implementation. We will look specifically at MongoDB, Couchbase and Cassandra in this context. At the end of the session, we will provide pointers and links to help the audience take the next step in learning about these technologies for themselves.
Determining the root cause of performance issues is a critical task for Operations. In this webinar, we'll show you the tools and techniques for diagnosing and tuning the performance of your MongoDB deployment. Whether you're running into problems or just want to optimize your performance, these skills will be useful.
As your data grows, the need to establish proper indexes becomes critical to performance. MongoDB supports a wide range of indexing options to enable fast querying of your data, but what are the right strategies for your application?
In this talk we’ll cover how indexing works, the various indexing options, and use cases where each can be useful. We'll dive into common pitfalls using real-world examples to ensure that you're ready for scale.
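To make the indexing ideas above concrete, here is a small, self-contained sketch of index planning; the collection, field names (`user_id`, `ts`) and the `index_covers` helper are hypothetical illustrations, not taken from the talk.

```python
# Sketch of index planning. With a live MongoDB, each spec below would be
# passed to collection.create_index(spec); here we just build and reason
# about the specs.

ASCENDING, DESCENDING = 1, -1  # PyMongo's sort-direction constants

# Single-field index: fast equality lookups on user_id.
single = [("user_id", ASCENDING)]

# Compound index: equality on user_id, then sort/range on ts.
# Field order follows the equality-sort-range rule of thumb.
compound = [("user_id", ASCENDING), ("ts", DESCENDING)]

def index_covers(index, filter_fields, projected_fields):
    """A query is 'covered' when every filtered and projected field is
    present in the index, so no documents need to be fetched at all."""
    indexed = {field for field, _ in index}
    return set(filter_fields) | set(projected_fields) <= indexed

# {"user_id": 42} projecting only user_id and ts is covered by `compound`:
print(index_covers(compound, ["user_id"], ["user_id", "ts"]))  # True
print(index_covers(single, ["user_id"], ["user_id", "ts"]))    # False
```

The helper captures one of the pitfalls the talk alludes to: an index that matches your filter but not your projection still forces document fetches.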
Docker is all the rage these days. While one doesn't hear much about Solr on Docker, we're here to tell you not only that it can be done, but also share how it's done.
We'll quickly go over the basic Docker ideas - containers are lighter than VMs, they solve "but it worked on my laptop" issues - so we can dive into the specifics of running Solr on Docker.
We'll do a live demo showing you how to run Solr master-slave as well as SolrCloud using containers, how to manage CPU assignments, constrain memory, and use Docker data volumes when running Solr in containers. We will also show you how to create your own containers with custom configurations.
Finally, we'll address one of the core Solr questions - which deployment type should I use? We will demonstrate performance differences between the following deployment types:
- Single Solr instance running on a bare metal machine
- Multiple Solr instances running on a single bare metal machine
- Solr running in containers
- Solr running on a virtual machine
- Solr running on a virtual machine using a unikernel
For each deployment type we'll address how it impacts performance, operational flexibility and all other key pros and cons you ought to keep in mind.
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg... - MongoDB
The United States will be deploying 16,000 traffic speed monitoring sensors - 1 on every mile of US interstate in urban centers. These sensors update the speed, weather, and pavement conditions once per minute. MongoDB will collect and aggregate live sensor data feeds from roadways around the country, support real-time queries from cars on traffic conditions on their route as well as be the platform for real-time dashboards displaying traffic conditions and more complex analytical queries used to identify traffic trends. In this session, we’ll implement a few different data aggregation techniques to query and dashboard the metrics gathered from the US interstate.
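As a hedged sketch of the kind of aggregation the session describes, the pipeline below computes average speed per sensor per hour; the collection and field names (`speeds`, `road`, `sensor_id`, `speed`, `ts`) are illustrative assumptions, not taken from the session.

```python
# Illustrative MongoDB aggregation pipeline; with a live deployment it
# would be passed to db.speeds.aggregate(pipeline).
from datetime import datetime

pipeline = [
    # Keep one day's readings for one stretch of interstate.
    {"$match": {"road": "I-95",
                "ts": {"$gte": datetime(2014, 6, 1), "$lt": datetime(2014, 6, 2)}}},
    # Bucket by sensor and hour of day, averaging the reported speed.
    {"$group": {"_id": {"sensor": "$sensor_id", "hour": {"$hour": "$ts"}},
                "avg_speed": {"$avg": "$speed"},
                "samples": {"$sum": 1}}},
    # Slowest hours first, useful for a congestion dashboard.
    {"$sort": {"avg_speed": 1}},
]

print([next(iter(stage)) for stage in pipeline])  # ['$match', '$group', '$sort']
```

The `$match`-first ordering matters: filtering before grouping lets MongoDB use an index on `ts`/`road` and keeps the `$group` stage small.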
This talk presents a new Data Aggregation System for the CMS experiment at CERN. We use the MongoDB database as a caching layer to query multiple data providers (backed by RDBMSes) and aggregate data across them.
The talk was presented at the ICCS 2010 conference.
Balancing Replication and Partitioning in a Distributed Java Database - Ben Stopford
This talk, presented at JavaOne 2011, describes the ODC, a distributed, in-memory database built in Java that holds objects in a normalized form in a way that alleviates the traditional degradation in performance associated with joins in shared-nothing architectures. The presentation describes the two patterns that lie at the core of this model. The first is an adaptation of the Star Schema model, used to hold either replicated or partitioned data depending on whether the data is a fact or a dimension. In the second pattern, the data store tracks arcs on the object graph to ensure that only the minimum amount of data is replicated. Through these mechanisms, almost any join can be performed across the various entities stored in the grid, without the need for key shipping or iterative wire calls.
Apache Camel: The Swiss Army Knife of Open Source Integration - prajods
The Camel project from Apache (camel.apache.org) is a very popular, lightweight, open-source integration framework.
This presentation shows some interesting features of Camel and the unique advantages that Camel brings to your integration projects. Some business use cases are shown to explain how Camel makes open-source integration a cakewalk.
Table of contents:
1. An overview of Apache Camel
2. Integration architecture explained
3. Using Camel in different integration architectures
3.a. In the Securities domain
3.b. In the Travel domain
4. High Availability and Load Balancing with Camel
Hummingbird - Open Source for Small Satellites - GSAW 2012 - Logica_hummingbird
This presentation about the Hummingbird project won best presentation at the GSAW 2012 event in Los Angeles - http://csse.usc.edu/gsaw/index.html
To read more from this author about how the IT industry can shape the boundaries of the space industry, please visit - http://www.logica.com/we-work-in/space/related%20media/thought-pieces/2011/new-paradigms-for-space/
Hadoop & Greenplum: Why Do Such a Thing? - Ed Kohlwey
Greenplum is using Hadoop in several interesting ways as part of a larger big data architecture with EMC Greenplum Database (a scale-out MPP SQL database) and EMC Isilon (a scale-out network-attached storage appliance). After a quick introduction of Greenplum Database and Isilon, I list some ways Greenplum is tightly integrating with Hadoop and why we would want to do such a thing. Integration points discussed include: Greenplum Database external tables to seamlessly access data in HDFS, querying HBase tables natively from Greenplum Database, Greenplum Database having its underlying storage on HDFS, and Isilon OneFS as a seamless replacement for HDFS.
Had the pleasure of delivering the keynote presentation at Informa's 3G, HSPA & LTE Optimization conference in Prague. A great event with many very important presentations.
Services Oriented Infrastructure in a Web2.0 World - Lexumo
Tom Maguire discusses applying SOA, Web 2.0 technologies, and open standards to the problems faced by IT in an ever-changing world.
This session was recorded at EMC World 2007 in Orlando, Florida.
The Potential Impact of Software Defined Networking SDN on Security - Brent Salisbury
The Potential Impact of Software Defined Networking SDN on Security. The video of the presentation is at http://networkstatic.net/the-potential-impact-of-software-defined-networking-sdn-on-security/ by Brent Salisbury. It is a first cut, so a lot is not yet in the deck on the example use-case front.
The Construction of the Internet Geological Data System Using WWW+Java+DB Tec... - Channy Yun
YUN, SEOKCHAN, 1997, The Construction of the Internet Geological Data System Using WWW+Java+DB Technique, Tertiary Deposits of Korea, AAPG Annual Convention Abstracts, Association of American Petroleum Geologists 1997.4.23-26, Dallas, TX, USA, p.420
This deck was presented at the Spark meetup at Bangalore. The key idea behind the presentation was to focus on limitations of Hadoop MapReduce and introduce both Hadoop YARN and Spark in this context. An overview of the other aspects of the Berkeley Data Analytics Stack was also provided.
Zotonic presentation, Erlang Camp Boston, August 2011 - Arjan
I gave this presentation on Friday, August 12, 2011, in the John Hancock Conference Center in Boston as the closing session of the two-day Erlang Camp event.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He has around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop with the participants, trying out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already got working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
1. MongoDB at the energy frontier
Valentin Kuznetsov, Cornell University
MongoNYC, May 2012
Monday, May 21, 12 1
2. Outline
✤ CMS :: LHC :: CERN
✤ Data Aggregation System and MongoDB
✤ Experience
✤ Summary
3. CMS :: LHC :: CERN
Large Hadron Collider located at CERN, Geneva, Switzerland
CMS is one of the 4 experiments to probe our knowledge of particle interactions and search for new physics
5. CMS :: LHC :: CERN
Typical proton-proton collision in CMS detector
6. CMS :: LHC :: CERN
✤ 40 countries, 172 institutions, more than 3000 scientists
✤ The CMS experiment produces a few PB of real data each year and we collect ~TB of meta-data
✤ CMS relies on GRID infrastructure for data processing and uses 100+ computing centers world-wide
✤ CMS software consists of 4M lines of C++ (framework), 2M lines of Python (data management), plus Java, Perl, etc.
✤ ORACLE, MySQL, SQLite, NoSQL
7. Dilemma
[Diagram: a user surrounded by independent data services (GenDB, LumiDB, Data Quality, Phedex, DBS, PSetDB, SiteDB, Overview, RunDB), asking: "How can I find my data?"]
8. Motivations
✤ Users want to query different data services without knowing about their existence
✤ Users want to combine information from different data services
✤ Some users may have domain knowledge, but they need to query X services, using Y interfaces and dealing with Z data formats, to get their data
[Diagram: the Data Aggregation System sitting in front of the individual services and their keys, e.g. RunSummary (run, trigger, detector, ...), DataQuality (run, lumi, ...), LumiDB (lumi, luminosity, hltpath), Phedex (block, file, block.replica, file.replica, se, node, site), DBS (run, file, block, site, config, tier, dataset, lumi, parameters, ...), GenDB (MC id, generator, xsection, process, decay, ...), SiteDB (site, admin, site.status, ...), Overview (country, node, region, ...), Parameter Set DB (CMSSW parameters), plus further services with their own parameters]
9. Implementation idea
✤ When we talk we may use different languages (English, French, etc.) or different conventions (pounds vs kg)
✤ In order to establish communication we use translation, a dictionary, a thesaurus
11. Pros
✤ Separates data management from the discovery service
✤ Data are safe and secure
✤ Pluggable architecture (new translations)
✤ Users never bother with interfaces, naming and schema conflicts, data-formats, security policies
✤ Information is aggregated in real time over distributed services
✤ Data-consistency checks for free
✤ DB and API changes are transparent to end-users
12. Cons
✤ DAS does not own the data
✤ lots of writes/reads/translations
✤ Data-services are the real bottleneck
✤ nothing is guaranteed, e.g. a service can go down, there is no control of its performance, requested data can be really large, etc.
✤ cache often and preemptively
MongoDB to the rescue!!!
13. Data Aggregation System
[Architecture diagram: a DAS robot periodically invokes the same API(params) to update the cache and pre-fetch popular queries/APIs. The DAS core consists of a query parser, a mapping component that maps data-service output to DAS records, the DAS cache and DAS merge collections, an aggregator, and an Analytics component that records each query and API call. A DAS web server exposes a RESTful interface and UI on top of the DAS cache server, whose plugins call the underlying data-services (runsum, lumidb, phedex, sitedb, dbs).]
14. Mapping DB
✤ Holds translations between user keywords and data-service APIs, resolves naming conflicts, etc.
✤ The query city=Ithaca translates into a Google API call:
{'das2api': [{'api_param': 'q', 'das_key': 'city.name', 'pattern': ''}],
'daskeys': [{'key': 'city', 'map': 'city.name', 'pattern': ''}],
'expire': 3600,
'format': 'JSON',
'params': {'output': 'json', 'q': 'required'},
'system': 'google_maps',
'url': 'http://maps.google.com/maps/geo',
'urn': 'google_geo_maps'}
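To make the translation concrete, here is a minimal sketch of how a mapping record like the one above could drive an API call. The record mirrors the google_geo_maps mapping shown on this slide; `build_api_call` is an illustrative helper, not part of the DAS codebase.

```python
# Hypothetical sketch: turning a DAS query into a data-service API call
# using a Mapping DB record. build_api_call is an illustrative helper.

mapping = {
    'das2api': [{'api_param': 'q', 'das_key': 'city.name', 'pattern': ''}],
    'daskeys': [{'key': 'city', 'map': 'city.name', 'pattern': ''}],
    'expire': 3600,
    'format': 'JSON',
    'params': {'output': 'json', 'q': 'required'},
    'system': 'google_maps',
    'url': 'http://maps.google.com/maps/geo',
    'urn': 'google_geo_maps',
}

def build_api_call(mapping, query):
    """Translate a DAS query, e.g. {'city': 'Ithaca'}, into (url, params)."""
    # resolve the user keyword to its internal DAS key (city -> city.name)
    das_keys = {d['key']: d['map'] for d in mapping['daskeys']}
    # resolve the DAS key to the data-service API parameter (city.name -> q)
    api_params = {d['das_key']: d['api_param'] for d in mapping['das2api']}
    params = dict(mapping['params'])  # start from the default API parameters
    for key, value in query.items():
        params[api_params[das_keys[key]]] = value
    return mapping['url'], params

url, params = build_api_call(mapping, {'city': 'Ithaca'})
# params == {'output': 'json', 'q': 'Ithaca'}
```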
15. Analytics DB
✤ Keeps track of user queries and data-service API calls
{'api': {'params': {'q': 'Ithaca', 'output': 'json'}, 'name': 'google_geo_maps'}, 'qhash':
'7272bdeac45174823d3a4ea240c124ec', 'system': 'google_maps', 'counter': 5}
✤ Used by DAS analytics daemons to pre-fetch “hot” queries
✤ ValueHotSpot: look up data by popular values
✤ KeyHotSpot: look up data by popular keys
✤ QueryMaintainer: keep a given query always in cache
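A rough sketch of how an analytics record like the one above could be maintained: a query hash (qhash) identifies the API call, and a counter tracks its popularity so that daemons can pre-fetch "hot" queries. `record_call` and the in-memory dict are illustrative stand-ins, not the DAS implementation.

```python
import hashlib

analytics = {}  # stand-in for the MongoDB analytics collection

def record_call(system, api, params):
    """Count an API call; popular entries can later be pre-fetched."""
    # hash the call signature so identical calls share one record
    sig = '%s:%s:%s' % (system, api, sorted(params.items()))
    qhash = hashlib.md5(sig.encode()).hexdigest()
    rec = analytics.setdefault(qhash, {
        'system': system,
        'api': {'name': api, 'params': params},
        'qhash': qhash,
        'counter': 0,
    })
    rec['counter'] += 1
    return rec

for _ in range(5):
    rec = record_call('google_maps', 'google_geo_maps',
                      {'q': 'Ithaca', 'output': 'json'})
# rec['counter'] is now 5, matching the record shown above
```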
16. Caching DB
✤ Data coming from data-service providers are translated into JSON and stored in the cache collection
✤ naming translations are performed at this level
✤ Data records from the cache collection are processed on a common key, e.g. city.name, and merged into the merge collection
cache collection:
{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76}}
{'city': {'name': 'Ithaca', 'zip': 14850}}
merge collection:
{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76, 'zip': 14850}}
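The merge step above can be sketched as follows: records from different services that share the same value of a common key (here city.name) are folded into one merged record. `merge_records` is an illustrative helper, not the DAS code.

```python
# Sketch of the cache -> merge step on a common key (city.name).

def merge_records(records, key='city'):
    merged = {}
    for rec in records:
        name = rec[key]['name']  # common key value, e.g. 'Ithaca'
        merged.setdefault(name, {key: {}})[key].update(rec[key])
    return list(merged.values())

cache = [
    {'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76}},  # from service A
    {'city': {'name': 'Ithaca', 'zip': 14850}},           # from service B
]
merged = merge_records(cache)
# merged == [{'city': {'name': 'Ithaca', 'lat': 42, 'lng': -76, 'zip': 14850}}]
```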
17. DAS workflow
✤ Parse the query
✤ Query the DAS merge collection; on a hit, return the results
✤ Query the DAS cache collection; on a hit, merge and return the results
✤ On a miss, invoke calls to the data-services (resolved via the Mapping DB)
✤ Write the query and API calls to Analytics
✤ Aggregate the results
✤ Represent the results on the web UI or via the command-line interface
[Workflow diagram: DAS core with parser and logging, merge and cache collections, data-services, Mapping and Analytics DBs, aggregator and Web UI.]
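The lookup order described above (merge collection first, then cache, then a call out to the data-services on a miss) can be sketched as a small function. All names here are illustrative stand-ins for the DAS components.

```python
# Minimal sketch of the DAS lookup order: merge -> cache -> data-services.

def aggregate(records):
    # placeholder aggregation step: fold raw records into one result
    out = {}
    for rec in records:
        out.update(rec)
    return out

def das_lookup(query, merge_db, cache_db, call_services, analytics):
    analytics.append(query)                     # record the query for analytics
    if query in merge_db:                       # 1. already aggregated?
        return merge_db[query]
    if query not in cache_db:                   # 2. raw records cached?
        cache_db[query] = call_services(query)  # 3. miss: invoke data-services
    merge_db[query] = aggregate(cache_db[query])
    return merge_db[query]

analytics = []
merge_db, cache_db = {}, {}
services = lambda q: [{'lat': 42}, {'lng': -76}]  # fake data-service call
result = das_lookup('city=Ithaca', merge_db, cache_db, services, analytics)
# result == {'lat': 42, 'lng': -76}; a repeated query now hits merge_db
```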
19. DAS QL & MongoDB QL
✤ DAS Query Language is built on top of MongoDB QL; it represents
MongoDB QL in a human-readable form
✤ UI level:
block dataset=/a/b/c | grep block.size | count(block.size)
✤ DB level:
col.find(spec={'dataset.name': '/a/b/c'}, fields=['block.size']).count()
✤ We enrich the QL with additional filters (grep, sort, unique) and
implement a set of coroutines for the aggregator functions
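A rough sketch of how the UI-level query above could be lowered to the MongoDB level. The parsing here is deliberately simplistic and only handles the `key selector=value | grep field` shape; it is not the actual DAS parser.

```python
# Illustrative DAS QL -> MongoDB QL lowering for a simple query shape.

def dasql_to_mongo(dasql):
    parts = [p.strip() for p in dasql.split('|')]
    head = parts[0].split()                  # e.g. ['block', 'dataset=/a/b/c']
    select_key = head[0]
    cond_key, cond_val = head[1].split('=')
    spec = {'%s.name' % cond_key: cond_val}  # condition -> MongoDB spec
    # "grep" filters select the fields to return
    fields = [p.split()[1] for p in parts[1:] if p.startswith('grep')]
    return spec, fields or [select_key]

spec, fields = dasql_to_mongo('block dataset=/a/b/c | grep block.size')
# spec == {'dataset.name': '/a/b/c'}, fields == ['block.size'],
# which maps onto: col.find(spec, fields)
```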
20. DAS & MongoDB
✤ DAS works with 15 distributed data-services
✤ their sizes vary, on average O(100GB)
✤ DAS uses 40 MongoDB collections
✤ caching, mapping, analytics, logging (normal, capped, gridfs cols)
✤ DAS inserts/deletes O(1M) records on a daily basis
✤ We operate on a single 64-bit Linux node with 8 CPUs, 24 GB of RAM
and 1 TB of disk space; sharding was tested but is not enabled
21. MongoDB benefits
✤ Fast I/O and a schema-less database are ideal for a cache implementation
✤ you're not limited to a key:value approach
✤ A flexible query language allows us to build a domain-specific QL
✤ staying on a par with SQL
✤ No DB administration costs
✤ easy to install and maintain
22. MongoDB issues (ver 2.0.X)
✤ We were unable to directly store DAS queries into the analytics
collection, due to the dot constraint on key names, e.g. {'a.b': 1}
✤ queries <=> storage format {'key': 'a.b', 'value': 1}
✤ SCons is not suitable in a fully controlled build environment
✤ it removes $PATH/$LD_LIBRARY_PATH for compiler commands and
forces the use of -L/lib64. As a result we used wrappers.
✤ Uncompressed field names and limitations with pagination/
aggregation
✤ should be addressed in the new MongoDB aggregation framework
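The queries <=> storage translation for the dot constraint mentioned above can be sketched as a pair of helpers: MongoDB 2.0.x forbids dots in field names, so a query like {'a.b': 1} is stored as key/value pairs. `encode`/`decode` are illustrative names, not the actual DAS code.

```python
# Sketch of the dot-constraint workaround: {'a.b': 1} cannot be stored
# as a MongoDB document key, so it is flattened into key/value records.

def encode(query):
    """{'a.b': 1} -> [{'key': 'a.b', 'value': 1}]"""
    return [{'key': k, 'value': v} for k, v in query.items()]

def decode(records):
    """[{'key': 'a.b', 'value': 1}] -> {'a.b': 1}"""
    return {r['key']: r['value'] for r in records}

stored = encode({'a.b': 1})
# stored == [{'key': 'a.b', 'value': 1}]; decode(stored) round-trips it
```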
23. Tradeoffs
✤ Query collisions: DAS does not own the data and there are no
transactions; we rely on the query status and update it accordingly
✤ Index choice: initially one per select key, later one per query hash
✤ Storage size: we trade storage size against data flexibility and
naming conventions
✤ Speed: we trade simple data access against a conglomerate of
restrictions (naming, security policies, interfaces, etc.), but we tune
our data-service APIs based on query patterns
24. Results
✤ The service has been in production for over one year
✤ Users are authenticated via GRID certificates, and DAS uses a proxy
server to pass credentials to the back-end services
✤ A single query request yields a few thousand records and is resolved
within a few seconds
✤ A pluggable architecture allows you to query your own service(s)
✤ unit tests are done against public data-services, e.g. Google, IP
look-up, etc.
25. NoSQL @ CERN
✤ MongoDB is used by other experiments at CERN
✤ logging, monitoring, data analytics
✤ MongoDB is not the only NoSQL solution used at CERN
✤ One size does not fit all
✤ CouchDB, Cassandra, HBase, etc.
✤ There is an ongoing discussion between the experiments and CERN IT
about the adoption of NoSQL
26. Summary
✤ CMS experiment built Data Aggregation System as an intelligent
cache to query distributed data-services
✤ MongoDB is used as DAS back-end
✤ During the first year of operation we did not experience any significant
problems
✤ I’d like to thank MongoDB team and its community for their constant
support
✤ Questions? Contact: vkuznet@gmail.com
✤ https://github.com/vkuznet/DAS/
28. From query to results
[Pipeline diagram, built up incrementally over slides 28-33: a Query goes through an API lookup and fans out to several data-service generators; each generator feeds an Aggregator, and the aggregated streams are merged into the final results. The example query block dataset=/a/b/c is translated into a MongoDB spec. The Mapping DB holds relationships, the Caching DB holds service records, and the Merge DB holds merged records.]
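The generator pipeline in the diagram above can be sketched as follows: each data-service yields records lazily, a per-service aggregator consumes its stream, and a final step merges the aggregated parts into the results. All names and the sum-over-a-key aggregation are illustrative assumptions, not the DAS implementation.

```python
# Sketch of the query-to-results pipeline: data-service generators feed
# aggregators, whose outputs are merged into the final result.

def data_service(records):
    """Generator standing in for one remote data-service."""
    for rec in records:
        yield rec

def aggregator(stream, key):
    """Fold one record stream into a summary, here a sum over a key."""
    total = 0
    for rec in stream:
        total += rec.get(key, 0)
    return {key: total}

def merge_results(parts, key):
    """Merge the per-service aggregates into the final result."""
    return {key: sum(p[key] for p in parts)}

services = [
    data_service([{'size': 10}, {'size': 20}]),  # fake service A
    data_service([{'size': 5}]),                 # fake service B
]
parts = [aggregator(s, 'size') for s in services]
result = merge_results(parts, 'size')
# result == {'size': 35}
```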