Overlay Opportunistic Clouds in CMS/ATLAS at CERN: The CMSooooooCloud in Detail, by Jose Antonio Coarasa Perez
The CMS and ATLAS online clusters consist of more than 3000 computers each. They have been used exclusively for the data acquisition that led to the Higgs particle discovery, handling 100 GB/s data flows and archiving 20 TB of data per day.
An OpenStack cloud layer has been deployed on the newest part of the clusters (totalling 1300 hypervisors and more than 13000 cores in CMS alone) as a minimal overlay, so as to leave the primary role of the computers untouched while allowing opportunistic usage of the cluster.
This presentation will show how to share resources with minimal impact on the existing infrastructure. We will present the architectural choices made to deploy an unusual, as opposed to dedicated, "overlaid cloud infrastructure". These choices ensured a minimal impact on the running cluster configuration while giving maximal segregation of the overlaid virtual computer infrastructure. The use of Open vSwitch to avoid changes to the network infrastructure and to encapsulate the virtual machines' traffic will be illustrated, as well as the networking configuration adopted due to the nature of our private network. The design and performance of the OpenStack cloud controlling layer will be presented. We will also show the integration carried out to allow the cluster to be used in an opportunistic way while giving full control to the CMS online run control.
How to Terminate the GLIF by Building a Campus Big Data Freeway System, by Larry Smarr
12.10.11
Keynote Lecture
12th Annual Global LambdaGrid Workshop
Chicago, IL
A High-Performance Campus-Scale Cyberinfrastructure for Effectively Bridging End-User Laboratories to Data-Intensive Sources, by Larry Smarr
10.04.07
Presentation by Larry Smarr to the NSF Campus Bridging Workshop
University Place Conference Center
Indianapolis, IN
The BonFIRE architecture was presented at the TridentCom Conference. These are the slides for the paper, which describes the key components and principles of the architecture, as well as some specific features offered to experimenters that are not available elsewhere.
Virtual Network Functions as Real-Time Containers in Private Clouds, by tcucinotta
This paper presents preliminary results from our ongoing research on ensuring stable performance of co-located distributed cloud services in a resource-efficient way. It is based on using a real-time CPU scheduling policy to achieve fine-grained control of the temporal interference among real-time services running in co-located containers. We present results obtained by applying the method to a synthetic application running within LXC containers on Linux, using a modified kernel that includes our real-time scheduling policy.
More information about the paper is available at:
http://retis.sssup.it/~tommaso/papers/cloud18.php
With the HPC Cloud facility, SURFsara offers self-service, dynamically scalable and fully configurable HPC systems to the Dutch academic community. Users have, for example, a free choice of operating system and software.
The HPC Cloud offers full control over a HPC cluster, with fast CPUs and high memory nodes and it is possible to attach terabytes of local storage to a compute node. Because of this flexibility, users can fully tailor the system for a particular application. Long-running and small compute jobs are equally welcome. Additionally, the system facilitates collaboration: users can share control over their virtual private HPC cluster with other users and share processing time, data and results. A portal with wiki, fora, repositories, issue system, etc. is offered for collaboration projects as well.
An Evaluation of Adaptive Partitioning of Real-Time Workloads on Linux, by tcucinotta
This paper provides an open implementation and an experimental evaluation of an adaptive partitioning approach for scheduling real-time tasks on symmetric multicore systems. The proposed technique combines partitioned EDF scheduling with an adaptive migration policy that moves tasks across processors only when strictly needed to respect their temporal constraints. The implementation of the technique within the Linux kernel, via modifications to the SCHED_DEADLINE code base, is presented. Extensive experimentation has been conducted by applying the technique on a real multi-core platform with several randomly generated synthetic task sets. The obtained results highlight that the approach exhibits promising performance for scheduling real-time workloads on a real system, with a greatly reduced number of migrations compared to the original global EDF available in SCHED_DEADLINE.
More information about the paper is available at:
http://retis.sssup.it/~tommaso/papers/isorc21.php
A set of new real-time scheduling algorithms providing temporal isolation among tasks is being developed for the Linux kernel. These include an implementation of the POSIX sporadic server (SS) and a deadline-based scheduler, both built around specifying the scheduling guarantees required from the kernel in terms of a budget and a period.
This presentation tackles the issues around designing a proper kernel-space / user-space interface for accessing such new functionality. For the SS, a POSIX-compliant implementation would break binary compatibility. However, the currently implemented API seems to lack some important features, such as a sufficient level of extensibility. This would be required, for example, for adding further parameters in the future, e.g., deadlines different from periods, soft (i.e., work-conserving) reservations, or bringing power management into the loop (if ever).
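To illustrate the budget/period flavor of such an interface: the deadline scheduler was eventually merged into mainline Linux as SCHED_DEADLINE, exposed through the sched_setattr() syscall. Below is a hedged sketch of requesting a reservation through it (the raw syscall number is for x86_64; this shows the general style of a budget/period API, not the SS interface discussed in the talk):

```python
import ctypes, os

SYS_sched_setattr = 314   # x86_64 syscall number (assumption: amd64 Linux)
SCHED_DEADLINE = 6

class SchedAttr(ctypes.Structure):
    # struct sched_attr, as documented in the kernel's sched-deadline docs
    _fields_ = [("size", ctypes.c_uint32),
                ("sched_policy", ctypes.c_uint32),
                ("sched_flags", ctypes.c_uint64),
                ("sched_nice", ctypes.c_int32),
                ("sched_priority", ctypes.c_uint32),
                ("sched_runtime", ctypes.c_uint64),   # budget, in ns
                ("sched_deadline", ctypes.c_uint64),  # relative deadline, ns
                ("sched_period", ctypes.c_uint64)]    # period, ns

libc = ctypes.CDLL(None, use_errno=True)
attr = SchedAttr(size=ctypes.sizeof(SchedAttr),
                 sched_policy=SCHED_DEADLINE,
                 sched_runtime=10_000_000,    # 10 ms budget...
                 sched_deadline=100_000_000,  # ...with a 100 ms deadline
                 sched_period=100_000_000)    # ...every 100 ms
# pid 0 = calling thread; requires root or CAP_SYS_NICE
if libc.syscall(SYS_sched_setattr, 0, ctypes.byref(attr), 0) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))
```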
Energy Efficient GPS Acquisition with Sparse-GPS, by Prasant Misra
With rising demand for GPS positioning, low-cost receivers are becoming widely available, but their energy demands are still too high. For energy-efficient GPS sensing in delay-tolerant applications, the possibility of offloading a few milliseconds of raw signal samples and leveraging the greater processing power of the cloud to obtain a position fix is being actively investigated.
In an attempt to reduce the energy cost of this data offloading operation, we propose Sparse-GPS: a new computing framework for GPS acquisition via sparse approximation.
Within the framework, GPS signals can be efficiently compressed by random ensembles. The sparse acquisition information, pertaining to the visible satellites that are embedded within these limited measurements, can subsequently be recovered by our proposed representation dictionary.
Through extensive empirical evaluations, we demonstrate the acquisition quality and energy gains of Sparse-GPS. We show that it is twice as energy efficient as offloading uncompressed data, and has 5-10 times lower energy costs than standalone GPS, with a median positioning accuracy of 40 m.
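As a generic illustration of the sparse-approximation pipeline described above (not Sparse-GPS's actual dictionary or signal model; the sizes, random dictionary, and sparsity level are made up for the sketch), compression by a random ensemble and recovery via orthogonal matching pursuit can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 512, 64, 3                 # signal length, measurements, sparsity
D = rng.standard_normal((n, n))      # stand-in for the representation dictionary
D /= np.linalg.norm(D, axis=0)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
signal = D @ x                       # "GPS signal", sparse in D

Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random ensemble (compression)
y = Phi @ signal                     # the few samples the device would offload

# Orthogonal matching pursuit recovery, as it would run in the cloud
A = Phi @ D
residual, support = y.copy(), []
for _ in range(k):
    support.append(int(np.argmax(np.abs(A.T @ residual))))      # best-matching atom
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)    # refit on support
    residual = y - A[:, support] @ coef
print("recovered support:", sorted(support), "true:", sorted(np.flatnonzero(x)))
```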
In this deck from the Perth HPC Conference, Rob Farber from TechEnablement presents: AI is Impacting HPC Everywhere.
"The convergence of AI and HPC has created a fertile venue that is ripe for imaginative researchers — versed in AI technology — to make a big impact in a variety of scientific fields. From new hardware to new computational approaches, the true impact of deep- and machine learning on HPC is, in a word, “everywhere”. Just as technology changes in the personal computer market brought about a revolution in the design and implementation of the systems and algorithms used in high performance computing (HPC), so are recent technology changes in machine learning bringing about an AI revolution in the HPC community. Expect new HPC analytic techniques including the use of GANs (Generative Adversarial Networks) in physics-based modeling and simulation, as well as reduced precision math libraries such as NLAFET and HiCMA to revolutionize many fields of research. Other benefits of the convergence of AI and HPC include the physical instantiation of data flow architectures in FPGAs and ASICs, plus the development of powerful data analytic services."
Learn more: http://www.techenablement.com/
and
http://hpcadvisorycouncil.com/events/2019/australia-conference/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
OSMC 2012 | Monitoring at CERN, by Christophe Haen (NETWAYS)
CERN, the European Organization for Nuclear Research, is the world's largest research centre for particle physics. High-energy physics experiments are carried out there using the particle accelerator and other provided infrastructure. The investigations around the Large Hadron Collider (LHC) require extensive IT infrastructure to process the data generated by the collisions. Even the monitoring of the LHC itself depends on a complex infrastructure. CERN IT provides many different services to staff and users and is, above all, the main actor in the LHC GRID. The GRID is the globally distributed computing and storage network that provides the capacity needed to analyse the volume of data collected with the particle accelerator. It consists of 200,000 cores distributed across 34 countries. All these large computing centres require careful monitoring, but each has its own peculiarities, so different monitoring strategies and tools have to be applied. This talk presents the many approaches to this challenge, along with an outlook on planned future developments.
CERN, the European Organization for Nuclear Research, is one of the world’s largest centres for scientific research. Its business is fundamental physics, finding out what the universe is made of and how it works. At CERN, accelerators such as the 27km Large Hadron Collider, are used to study the basic constituents of matter. This talk reviews the challenges to record and analyse the 25 Petabytes/year produced by the experiments and the investigations into how OpenStack could help to deliver a more agile computing infrastructure.
C2MON - A highly scalable monitoring platform for Big Data scenarios @CERN by... (J On The Beach)
Developing reliable data acquisition, processing and control modules for mission critical systems - as they run at CERN - has always been challenging. As both data volumes and rates increase, non-functional requirements such as performance, availability, and maintainability are getting more important than ever. C2MON is a modular Open Source Java framework for realising highly available, large industrial monitoring and control solutions. It has been initially developed for CERN’s demanding infrastructure monitoring needs and is based on more than 10 years of experience with the Technical Infrastructure Monitoring (TIM) systems at CERN. Combining maintainability and high-availability within a portable architecture is the focus of this work. Making use of standard Java libraries for in-memory data management, clustering and data persistence, the platform becomes interesting for many Big Data scenarios.
Achieving Performance Isolation with Lightweight Co-Kernels, by Jiannan Ouyang, PhD
These slides were presented at the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15).
Performance isolation is emerging as a requirement for High Performance Computing (HPC) applications, particularly as HPC architectures turn to in situ data processing and application composition techniques to increase system throughput. These approaches require the co-location of disparate workloads on the same compute node, each with different resource and runtime requirements. In this paper we claim that these workloads cannot be effectively managed by a single Operating System/Runtime (OS/R). Therefore, we present Pisces, a system software architecture that enables the co-existence of multiple independent and fully isolated OS/Rs, or enclaves, that can be customized to address the disparate requirements of next generation HPC workloads. Each enclave consists of a specialized lightweight OS co-kernel and runtime, which is capable of independently managing partitions of dynamically assigned hardware resources. Contrary to other co-kernel approaches, in this work we consider performance isolation to be a primary requirement and present a novel co-kernel architecture to achieve this goal. We further present a set of design requirements necessary to ensure performance isolation, including: (1) elimination of cross OS dependencies, (2) internalized management of I/O, (3) limiting cross enclave communication to explicit shared memory channels, and (4) using virtualization techniques to provide missing OS features. The implementation of the Pisces co-kernel architecture is based on the Kitten Lightweight Kernel and Palacios Virtual Machine Monitor, two system software architectures designed specifically for HPC systems. Finally we will show that lightweight isolated co-kernels can provide better performance for HPC applications, and that isolated virtual machines are even capable of outperforming native environments in the presence of competing workloads.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview, by Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Transcript: Selling digital books in 2024: Insights from industry leaders - T..., by BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Essentials of Automations: Optimizing FME Workflows with Parameters, by Safe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Neuro-symbolic is not enough, we need neuro-*semantic*, by Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really reap the gains of NeSy. These gains will only materialize when the symbolic structures have an actual semantics. I give an operational definition of semantics as "predictable inference".
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Connector Corner: Automate dynamic content and events by pushing a button, by DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Generating a custom Ruby SDK for your web service or Rails API using Smithy, by g2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti..., by Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
GraphRAG is All You Need? LLM & Knowledge Graph, by Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I was wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our beloved cloud native principles to it as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I have already gotten working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl - ..., by DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Lxcloud
1. PES CERN's Cloud Computing Infrastructure
Tony Cass, Sebastien Goasguen, Belmiro Moreira, Ewan Roche, Ulrich Schwickerath, Romain Wartel
Cloudview conference, Porto, 2010
See also related presentations:
HEPiX spring and autumn meetings 2009, 2010
Virtualization vision, Grid Deployment Board (GDB), 9/9/2009
Batch virtualization at CERN, EGEE09 conference, Barcelona
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
2. PES Outline and disclaimer
An introduction to CERN
Why virtualization and cloud computing?
Virtualization of batch resources at CERN
Building blocks and current status
Image management systems: ISF and ONE
Status of the project and first numbers
Disclaimer: We are still in the testing and evaluation phase. No final decision has been taken yet on what we are going to use in the future. All given numbers and figures are preliminary.
3. PES Introduction to CERN
European Organization for Nuclear Research
The world’s largest particle physics laboratory
Located on the Swiss/French border
Founded in 1954; funded and staffed by 20 member states
With many contributors in the USA
Birthplace of the World Wide Web
Made popular by the movie "Angels and Demons"
Flagship accelerator: LHC
http://www.cern.ch
4. PES Introduction to CERN: LHC and the experiments
[Diagram: the LHC ring and its experiments ATLAS, CMS, ALICE, LHCb, LHCf and TOTEM]
Circumference of LHC: 26 659 m
Magnets: 9300
Temperature: -271.3°C (1.9 K)
Cooling: ~60 t liquid He
Max. beam energy: 7 TeV
Current beam energy: 3.5 TeV
5. PES Introduction to CERN
Data: signal/noise ratio of 10^-9
Data volume: high rate * large number of channels * 4 experiments
→ 15 PetaBytes of new data each year
Compute power: event complexity * number of events * thousands of users
→ 100 k CPUs (cores)
Worldwide analysis & funding: computing funded locally in major regions & countries, efficient analysis everywhere
→ GRID technology
6. PES LCG computing GRID
Required computing capacity: ~100 000 processors
Number of sites:
T0: 1 (CERN), 20%
T1: 11 around the world
T2: ~160
http://lcg.web.cern.ch/lcg/public/
7. PES The CERN Computer Center
Disk and tape:
1500 disk servers
5 PB disk space
16 PB tape storage
Computing facilities:
>20,000 CPU cores (batch only)
Up to ~10,000 concurrent jobs
Job throughput: ~200,000/day
http://it-dep.web.cern.ch/it-dep/communications/it_facts__figures.htm
8. PES Why virtualization and cloud computing?
Service consolidation:
Improve resource usage by squeezing mostly unused machines onto single big hypervisors
Decouple the hardware life cycle from the applications running on the box
Ease management by supporting live migration
Virtualization of batch resources:
Decouple jobs and physical resources
Ease management of the batch farm resources
Enable the computer center for new computing models
This presentation is about virtualization of batch resources only.
9. PES Batch virtualization
CERN batch farm lxbatch:
~3000 physical hosts
~20000 CPU cores
>70 queues
Type 1: Run my jobs in your VM
Type 2: Run my jobs in my VM
10. PES Towards cloud computing
Type 3: Give me my infrastructure, i.e. a VM or a batch of VMs
11. PES Philosophy
(SLC = Scientific Linux CERN)
[Diagram. Near future: the batch farm runs SLC4 and SLC5 worker nodes both on physical machines and on a hypervisor cluster. (Far) future: an internal cloud on the hypervisor cluster hosting batch, T0 development and other/cloud applications.]
12. PES Visions beyond the current plan
Reusing/sharing images between different sites (phase 2):
A HEPiX working group was founded in autumn 2009 to define rules and boundary conditions (https://www.hepix.org/)
Experiment-specific images (phase 3):
Use of images which are customized for specific experiments
Use of resources in a cloud-like operation mode (phase 4)
Images directly join experiment-controlled scheduling systems (phase 5):
Controlled by the experiment
Spread across sites
13. PES Virtual batch: basic ideas
Virtual batch worker nodes:
Clones of real worker nodes, same setup
Mix with physical resources
Dynamically join the batch farm as normal worker nodes
Limited lifetime: stop accepting jobs after 24h
Destroy when empty
Only one user job per VM at a time
Note: The limited lifetime allows for a fully automated system which dynamically adapts to the current needs and automatically deploys intrusive updates.
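A hedged sketch of this lifetime policy as it could run inside a VM worker node, assuming the standard LSF commands badmin and bjobs are available on the node (the drain hook, thresholds, and helper names are illustrative assumptions, not CERN's actual implementation):

```python
import socket
import subprocess
import time

DRAIN_AFTER = 24 * 3600          # stop accepting jobs after 24h
HOST = socket.gethostname()

def close_host() -> None:
    # Close this host in LSF so the scheduler dispatches no new jobs to it.
    subprocess.run(["badmin", "hclose", "-C", "VM lifetime expired", HOST],
                   check=True)

def running_jobs() -> int:
    # Count jobs still running on this host; "bjobs -r -m <host>" lists them,
    # with one header line when any jobs exist.
    out = subprocess.run(["bjobs", "-u", "all", "-r", "-m", HOST],
                         capture_output=True, text=True)
    lines = out.stdout.strip().splitlines()
    return max(len(lines) - 1, 0)

start = time.time()
while time.time() - start < DRAIN_AFTER:
    time.sleep(60)
close_host()

# Drain: wait until the last job has finished, then ask the management
# layer to destroy this VM (hypothetical self-destruct hook).
while running_jobs() > 0:
    time.sleep(30)
subprocess.run(["/usr/local/sbin/self-destruct"], check=False)  # illustrative
```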
14. PES Virtual batch: basic ideas, technical
Images:
Staged on the hypervisors
Master images; instances use LVM snapshots
Start with a few different flavors only
Image creation:
Derived from a centrally managed "golden node"
Regularly updated and distributed to get updates "in"
15. PES Virtual batch: basic ideas
Image distribution:
Only shared file system available is AFS
Prefer peer-to-peer methods (more on that later):
SCP wave
rtorrent
16. PES Virtual batch: basic ideas
VM placement and management system:
Use existing solutions
Testing both a free and a commercial solution:
OpenNebula (ONE)
Platform's Infrastructure Sharing Facility (ISF)
17. PES Batch virtualization: architecture
[Diagram (grey: physical resources; colored: different VMs): jobs are submitted through CE/interactive nodes to the batch system management; centrally managed golden nodes feed the VM kiosk; the VM management system instantiates VM worker nodes, with limited lifetime, on the hypervisors / HW resources.]
18. PES Status of building blocks (test system)
                        | Submission and batch mgmt | Hypervisor cluster     | VM kiosk and image distribution | VM management system
Initial deployment      | OK                        | OK                     | OK                              | OK
Central management      | OK                        | OK                     | Mostly implemented              | ISF OK, ONE missing
Monitoring and alarming | OK                        | Switched off for tests | missing                         | missing
19. PES Image distribution: SCP wave versus rtorrent
[Chart (preliminary): comparison of SCP wave and rtorrent (BT = BitTorrent) distribution times. Some slow nodes, under investigation.]
20. PES VM placement and management
OpenNebula (ONE):
Basic model:
Single ONE master
Communication with hypervisors via ssh only
(Currently) no special tools on the hypervisors
Some scalability issues at the beginning (starting at around 50 VMs)
Addressing issues as they turn up
Close collaboration with developers, ideas for improvements
Managed to start more than 7,500 VMs
21. PES Scalability tests: some first numbers
"One shot" test with OpenNebula:
Inject virtual machine requests and let them die
Record the number of alive machines seen by LSF every 30s
[Plot: number of alive VMs over time; time axis units of 1h05min]
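A minimal sketch of the measurement side of such a test, assuming LSF's bhosts command is on the PATH and that the VM worker nodes share a recognizable hostname prefix (the prefix and the log format are illustrative assumptions):

```python
import subprocess
import time

VM_PREFIX = "vbatch"   # hypothetical hostname prefix for VM worker nodes
INTERVAL = 30          # seconds, as in the test described above

def alive_vm_count() -> int:
    # "bhosts -w" prints one line per host known to LSF; count VM hosts
    # whose status is usable (ok/closed, i.e. not unreach or unavail).
    out = subprocess.run(["bhosts", "-w"], capture_output=True, text=True)
    count = 0
    for line in out.stdout.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if (len(fields) >= 2
                and fields[0].startswith(VM_PREFIX)
                and fields[1].startswith(("ok", "closed"))):
            count += 1
    return count

with open("alive_vms.log", "a") as log:
    while True:
        log.write(f"{time.time():.0f} {alive_vm_count()}\n")
        log.flush()
        time.sleep(INTERVAL)
```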
22. PES VM placement and management
Platform's Infrastructure Sharing Facility (ISF):
One active ISF master node, plus fail-over candidates
Hypervisors run an agent which talks to Xen
Needed to be packaged for CERN
Resource management layer similar to LSF
Scalability expected to be good, but needs verification
Tested with 2 racks (96 machines) so far, ramping up
Filled with ~2,000 VMs so far (which is the maximum)
See: http://www.platform.com
23. PES Screen shots: ISF
[Screenshot: VM status view; 1 of 2 racks available. Note: one out of the two racks enabled for demonstration purposes.]
24. PES Screen shots: ISF
[Screenshot: ISF management console]
25. PES Summary
Virtualization efforts at CERN are proceeding; still some work to be done.
Main challenges:
Scalability of the provisioning systems and of the batch system
No decision yet on which provisioning system will be used
Reliability and speed of image distribution
General readiness for production (hardening)
Seamless integration into the existing infrastructure
26. PES Outlook
What's next?
Continue testing of ONE and ISF in parallel
Solve remaining (known) issues
Release first VMs for testing by our users soon
27. PES Questions?
28. PES Philosophy
[Diagram: the VM provisioning system(s) (OpenNebula, Platform's Infrastructure Sharing Facility (ISF) for high-level VM management, pVMO) sit on top of the hypervisor cluster (the physical resources).]
29. PES Details: Some additional explanations ...
"Golden node":
A centrally managed (i.e. Quattor-controlled) standard worker node which:
Is a virtual machine
Does not accept jobs
Receives regular updates
Purpose: creation of VM images
"Virtual machine worker node":
A virtual machine derived from a golden node
Not updated during its lifetime
Dynamically adds itself to the batch farm
Accepts jobs for only 24h
Runs only one user job at a time
Destroys itself when empty
30. PES VM kiosk and image distribution
Boundary conditions at CERN:
Only available shared file system is AFS
Network infrastructure with a single 1GE connection
No dedicated fast network for transfers that could be used (e.g. 10GE, IB or similar)
Tested options:
SCP wave:
Developed at Clemson University
Based on simple scp
rtorrent:
Infrastructure developed at CERN
Each node starts serving blocks it already hosts
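To illustrate the idea behind the SCP wave approach (not the Clemson implementation itself): every node that has already received the image becomes an additional source, so the number of seeded hypervisors roughly doubles each round. A minimal sketch, assuming passwordless ssh between the hosts (host names and the image path are illustrative):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

IMAGE = "/var/lib/images/slc5-batch.img.gz"   # illustrative path

def push(src: str, dst: str) -> str:
    # Copy the image from a node that already has it to one that does not.
    subprocess.run(["ssh", src, "scp", IMAGE, f"{dst}:{IMAGE}"], check=True)
    return dst

def scp_wave(seed: str, targets: list[str]) -> None:
    sources, pending = [seed], list(targets)
    with ThreadPoolExecutor(max_workers=64) as pool:
        while pending:
            # Pair each available source with one pending target; the
            # number of sources roughly doubles after every round.
            batch = pending[:len(sources)]
            pending = pending[len(sources):]
            done = pool.map(push, sources, batch)
            sources.extend(done)   # freshly seeded nodes serve the next round

scp_wave("kiosk01", [f"hv{i:03d}" for i in range(452)])  # 452 nodes, as in the test later
```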
31. PES Details: VM kiosk and image distribution
Local image distribution model at CERN:
Virtual machines are instantiated as LVM snapshots of a base image. This process is very fast.
Replacing a production image:
Approved images are moved to a central image repository (the "kiosk")
Hypervisors check regularly for new images on the kiosk node
The new image is transferred to a temporary area on the hypervisors
When the transfer is finished, a new LV is created
The new image is unpacked into the new LV
The current production image is renamed (via lvrename)
The new image is renamed to become the production image
Note: this process may need a sophisticated locking strategy.
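A hedged sketch of that replacement sequence as a script on a hypervisor, using the standard LVM tools the slide names (volume group, LV names, sizes, and the simple file lock are illustrative assumptions; the slide itself notes a more sophisticated locking strategy may be needed):

```python
import fcntl
import subprocess

VG = "vg0"                                     # illustrative volume group
PROD, NEW, OLD = "slc5-prod", "slc5-new", "slc5-old"

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

with open("/var/lock/image-rotate.lock", "w") as lock:
    fcntl.flock(lock, fcntl.LOCK_EX)           # no snapshot creation during the swap
    # 1. Create a fresh LV and unpack the downloaded image into it.
    run("lvcreate", "-L", "10G", "-n", NEW, VG)
    run("sh", "-c", f"gunzip -c /tmp/staging/image.img.gz > /dev/{VG}/{NEW}")
    # 2. Move the current production image aside, then promote the new one.
    run("lvrename", VG, PROD, OLD)
    run("lvrename", VG, NEW, PROD)
    # 3. Drop the old image once no snapshot depends on it anymore.
    run("lvremove", "-f", f"/dev/{VG}/{OLD}")

# New VMs are then instantiated as copy-on-write snapshots of the base image:
#   lvcreate -s -L 2G -n vm-0042 /dev/vg0/slc5-prod
```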
32. PES Image distribution with torrent: transfer speed
7 GB compressed file, 452 target nodes
[Plot (preliminary): torrent transfer progress over time; 90% finished after 25 min. Some slow nodes, under investigation.]
33. PES Image distribution: total distribution speed
7 GB compressed file, 452 target nodes
[Plot (preliminary): total distribution progress over time; all done after 1.5 h.]
→ Unpacking still needs some tuning!