SlideShare a Scribd company logo
1 of 22
HTCondor atRAL: AnUpdate
Andrew Lahiff
STFCRutherfordAppleton Laboratory
8th June, European HTCondor Workshop 2017
Overview
• Introduction
• Docker universe
• Monitoring
• Multi-core jobs
• Private cloud
• Containers
– on-premises
– public clouds
• Plans
2
Introduction
• RAL Tier-1 supports
– ALICE, ATLAS, CMS, LHCb
– many others, both HEP & non-HEP
• Batch system
– 24000 cores (soon 26000)
– HTCondor 8.6.3 (worker nodes, CMs), 8.4.8 (CEs)
Past 6 months
3
Computing elements
• Currently all job submission is via the Grid
• 4 ARC CEs
– working fine
• still using 5.0.5
• generally don’tupgradevery aggressively
– ATLAS submit jobs using pre-loaded pilots
• jobs specifytheiractualresourcerequirements
– no plans to migrate to HTCondor CEs yet
4
Docker universe
• We migrated to 100% Docker universe this year
– migration to SL7 took place over several months
– Docker 17.03 CE using OverlayFS storage backend
– worker node filesystem:
• / (RAID 0, XFS)
• /pool (RAID 0, XFS with ftype=1) for Docker, job
sandboxes & CVMFS cache
5
Docker universe
• HTCondor config relating to Docker
DOCKER=/usr/local/bin/docker.py
DOCKER_DROP_ALL_CAPABILITIES=regexp("pilot",x509UserProxyFirstFQAN) =?= False
DOCKER_MOUNT_VOLUMES=GRID_SECURITY, MJF, GRIDENV, GLEXEC, LCMAPS, LCAS, PASSWD, GROUP, CVMFS,
CGROUPS, ATLAS_RECOVERY, ETC_ATLAS, ETC_CMS, ETC_ARC
DOCKER_VOLUME_DIR_ATLAS_RECOVERY=/pool/atlas/recovery:/pool/atlas/recovery
DOCKER_VOLUME_DIR_ATLAS_RECOVERY_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_CGROUPS=/sys/fs/cgroup:/sys/fs/cgroup:ro
DOCKER_VOLUME_DIR_CGROUPS_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_CVMFS=/cvmfs:/cvmfs:shared
DOCKER_VOLUME_DIR_ETC_ARC=/etc/arc:/etc/arc:ro
DOCKER_VOLUME_DIR_ETC_ATLAS=/etc/atlas:/etc/atlas:ro
DOCKER_VOLUME_DIR_ETC_ATLAS_MOUNT_IF=regexp("atl",Owner)
DOCKER_VOLUME_DIR_ETC_CMS=/etc/cms:/etc/cms:ro
DOCKER_VOLUME_DIR_ETC_CMS_MOUNT_IF=regexp("cms",Owner)
DOCKER_VOLUME_DIR_GLEXEC=/etc/glexec.conf:/etc/glexec.conf:ro
DOCKER_VOLUME_DIR_GRIDENV=/etc/profile.d/grid-env.sh:/etc/profile.d/grid-env.sh:ro
DOCKER_VOLUME_DIR_GRID_SECURITY=/etc/grid-security:/etc/grid-security:ro
DOCKER_VOLUME_DIR_GROUP=/etc/group:/etc/group:ro
DOCKER_VOLUME_DIR_LCAS=/etc/lcas:/etc/lcas:ro
DOCKER_VOLUME_DIR_LCMAPS=/etc/lcmaps:/etc/lcmaps:ro
DOCKER_VOLUME_DIR_MJF=/etc/machinefeatures:/etc/machinefeatures:ro
DOCKER_VOLUME_DIR_PASSWD=/etc/passwd:/etc/passwd:ro
6
Docker universe
• Before the Docker universe, used
– cgroups with soft memory limits
– PERIODIC_SYSTEM_REMOVE to kill jobs which exceed 3x requested memory
• With the Docker universe
– by default jobs are given a hard memory limit
• high rateofjobs being killed(mainly ATLAS)
– now use a script which modifies arguments to “docker run” generated by
HTCondor (the “DOCKER” knob)
• set asoftmemorylimit atthe memoryrequested byjob
• hardlimit at2xrequestedmemory
7
Docker universe
• Goodbye MOUNT_UNDER_SCRATCH
– we used this for a long time
• enables /tmp to be unique for each job
– no need for this with the Docker universe
• /tmp is already unique to each container
• with OverlayFS storage backend /tmp is more than big
enough
• Converting Vanilla to Docker universe
– currently using job routers on each CE
– will switch to schedd job transforms once we upgrade
CEs to 8.6.x
8
Gateway
Gateway
Docker universe & storage
• Started rolling outxrootd Ceph gateways & proxies onto workernodes
– an important driver for migrating to SL7 worker nodes
– xrootd daemons running in containers
– jobs think they’re talking to the remote gateway, but they’re actually talking to
the local proxy
Ceph
Gateway
Workernode
xrootd
gateway
xrootd proxy
Swift, S3& GridFTP,xrootd
9
Monitoring
• Still have condor_gangliad running
– but generally never look at Ganglia
– gmond no longer being installed on some services
• Alternative in use since 2015
– Telegraf for collecting metrics
– InfluxDB time series database
– Grafana for visualisation
– Kapacitor for stream processing of metrics
• ClassAds for completed jobs
– Filebeat, Logstash, Elasticsearch
– MySQL (accounting)
10
Monitoring: overview
InfluxDB
InfluxDB Grafana
Kapacitor
worker
nodes
worker
nodes
worker
nodes
schedd
schedds
schedd
central
managers
Logstash Elasticsearch Kibana
MySQL
Telegraf
Telegraf
Telegraf
Filebeat
HTCondor metrics: Telegraf executes Python scri
which use the HTCondor Python bindings
11
Monitoring: dashboards
Overviewdashboard
Clusterhealthdashboard
12
Monitoring: containers
• Using Telegraf to collect container metrics from WNs
– currently just xrootd gateway & proxy containers
• No per-job time-series metrics yet, will revisit when
– InfluxDB 1.3 is released
• supports high cardinality indexing
– show tag values with a WHERE time clause
13
Multi-core jobs
• Multi-core jobs account for ~50% numberof used cores
– all 8-cores
– only ATLAS & CMS
• Haven’t made any changesfor overa year
– it just works
Past6months
14
Multi-core jobs
• Wedon’t use the defrag daemon orthe “Drained” state
– why have any idle cores?
– we wantto be able to run preemptable jobs on cores which are draining
• START expression on workernodes contains:
ifThenElse($(PREEMPTABLE_ONLY), isPreemptable =?= True, True)
• Draining sets PREEMPTABLE_ONLY to True
– only jobs with isPreemptable set to True can run
– currently only ATLAS Event Service jobs
• Preemptable jobs (if any) arekilled when enough cores have beendrained on
a node
15
Cloud
• For several years have been making opportunistic
use of our private OpenNebula cloud
– using condor rooster for provisioning virtual worker
nodes
– only idle cloud resources used
• backs off if cloud usage increases
• Future (short-term)
– migrating to OpenStack
– migrating to CernVMs
16
Worker nodes & containers
• Investigating ways of having more flexible computing
infrastructure
– need to be able to support more communities, not just
LHC experiments
– single set of resources for
• HTCondor
• OpenStack hypervisors
• Other StuffTM
• Ideally
– install nothing related to any experiment/community on
any machine
– run everything in containers, managed by schedulers
(not people)
17
container substrate
bare metal
HTCondor Hypervisors ...
Worker nodes & containers
• Example using Apache Mesos as the cluster
manager
– last year did some tests with real jobs from all 4 LHC
experiments
• running >5000 cores
– Mesos cluster containing 248 nodes
– custom Mesos framework
• creates startds in containers which join our production
HTCondor pool
– auto-scaling squids in containers for CVMFS/Frontier
18
Kubernetes as an abstraction layer
• Kubernetes is an open-source container cluster manager which can be run
anywhere
– on-premises
– “click a button” on public clouds (natively or via 3rd parties)
• Using it as an abstraction to enable portability between
– on-premises resources
– multiple public clouds
• Benefits
– Don’t need to worry about handling different cloud APIs
– Run a single command to create an elastic, self-healing Grid site
• ona single cluster
• onmultiple clustersaroundthe world(via Kubernetesfederation)
19
Kubernetes as an abstraction layer
• Did initial testing with CMS CRAB3 analysis jobs
– RAL, Google (GKE), Azure (ACS), AWS (StackPointCloud)
• Nowrunning ATLAS production jobs on Azure
– using “vacuum” model for independently creating startds which join a
HTCondor pool at CERN
schedd
central manager
AzureContainerService
CERN
squid
squid startd
startd
startd
ThankstoMicrosoftforanAzureResearchAward 20
Upcoming plans include
• Thereturn of local users
– haven’t had local users for many years
– would like to be able to support users of local facilities
• job submissionusing Kerberosauth?
• federatedidentity?CanINDIGOIAM help?
• SLA-aware automated rolling upgrades
– use a key-value store to coordinate draining & reboots
• idea fromCoreOS
• machinesneed toget accesstoa locktodrain
– have done PoC testing but not yet tried in production
– not HTCondor-specific, could be useful other services
• Getting ridof pool accounts!!!
21
Questions?
22

More Related Content

What's hot

Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Belmiro Moreira
 
Ceph de facto storage backend for OpenStack
Ceph de facto storage backend for OpenStack Ceph de facto storage backend for OpenStack
Ceph de facto storage backend for OpenStack eNovance
 
High Availability from the DevOps side - OpenStack Summit Portland
High Availability from the DevOps side - OpenStack Summit PortlandHigh Availability from the DevOps side - OpenStack Summit Portland
High Availability from the DevOps side - OpenStack Summit PortlandeNovance
 
20170926 cern cloud v4
20170926 cern cloud v420170926 cern cloud v4
20170926 cern cloud v4Tim Bell
 
Ceph Performance and Optimization - Ceph Day Frankfurt
Ceph Performance and Optimization - Ceph Day Frankfurt Ceph Performance and Optimization - Ceph Day Frankfurt
Ceph Performance and Optimization - Ceph Day Frankfurt Ceph Community
 
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...eNovance
 
CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016Belmiro Moreira
 
CERN OpenStack Cloud Control Plane - From VMs to K8s
CERN OpenStack Cloud Control Plane - From VMs to K8sCERN OpenStack Cloud Control Plane - From VMs to K8s
CERN OpenStack Cloud Control Plane - From VMs to K8sBelmiro Moreira
 
OpenStack Summit Vancouver: Lessons learned on upgrades
OpenStack Summit Vancouver:  Lessons learned on upgradesOpenStack Summit Vancouver:  Lessons learned on upgrades
OpenStack Summit Vancouver: Lessons learned on upgradesFrédéric Lepied
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNBelmiro Moreira
 
Future Science on Future OpenStack
Future Science on Future OpenStackFuture Science on Future OpenStack
Future Science on Future OpenStackBelmiro Moreira
 
CERN User Story
CERN User StoryCERN User Story
CERN User StoryTim Bell
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack CloudQiming Teng
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStackRainya Mosher
 
John Spray - Ceph in Kubernetes
John Spray - Ceph in KubernetesJohn Spray - Ceph in Kubernetes
John Spray - Ceph in KubernetesShapeBlue
 
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaS
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaSAutoscaling OpenStack Natively with Heat, Ceilometer and LBaaS
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaSShixiong Shang
 
Heat - keep the clouds up
Heat - keep the clouds upHeat - keep the clouds up
Heat - keep the clouds upKiran Murari
 
Evolution of Openstack Networking at CERN
Evolution of Openstack Networking at CERNEvolution of Openstack Networking at CERN
Evolution of Openstack Networking at CERNBelmiro Moreira
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERNArne Wiebalck
 

What's hot (20)

Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015Unveiling CERN Cloud Architecture - October, 2015
Unveiling CERN Cloud Architecture - October, 2015
 
Ceph de facto storage backend for OpenStack
Ceph de facto storage backend for OpenStack Ceph de facto storage backend for OpenStack
Ceph de facto storage backend for OpenStack
 
High Availability from the DevOps side - OpenStack Summit Portland
High Availability from the DevOps side - OpenStack Summit PortlandHigh Availability from the DevOps side - OpenStack Summit Portland
High Availability from the DevOps side - OpenStack Summit Portland
 
20170926 cern cloud v4
20170926 cern cloud v420170926 cern cloud v4
20170926 cern cloud v4
 
Ceph Performance and Optimization - Ceph Day Frankfurt
Ceph Performance and Optimization - Ceph Day Frankfurt Ceph Performance and Optimization - Ceph Day Frankfurt
Ceph Performance and Optimization - Ceph Day Frankfurt
 
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...
OpenStack in Action 4! Sebastien Han - Ceph: de facto storage backend for Ope...
 
CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016CPU Optimizations in the CERN Cloud - February 2016
CPU Optimizations in the CERN Cloud - February 2016
 
CERN OpenStack Cloud Control Plane - From VMs to K8s
CERN OpenStack Cloud Control Plane - From VMs to K8sCERN OpenStack Cloud Control Plane - From VMs to K8s
CERN OpenStack Cloud Control Plane - From VMs to K8s
 
OpenStack Summit Vancouver: Lessons learned on upgrades
OpenStack Summit Vancouver:  Lessons learned on upgradesOpenStack Summit Vancouver:  Lessons learned on upgrades
OpenStack Summit Vancouver: Lessons learned on upgrades
 
Moving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERNMoving from CellsV1 to CellsV2 at CERN
Moving from CellsV1 to CellsV2 at CERN
 
Future Science on Future OpenStack
Future Science on Future OpenStackFuture Science on Future OpenStack
Future Science on Future OpenStack
 
CERN User Story
CERN User StoryCERN User Story
CERN User Story
 
High Availability in OpenStack Cloud
High Availability in OpenStack CloudHigh Availability in OpenStack Cloud
High Availability in OpenStack Cloud
 
Learning to Scale OpenStack
Learning to Scale OpenStackLearning to Scale OpenStack
Learning to Scale OpenStack
 
John Spray - Ceph in Kubernetes
John Spray - Ceph in KubernetesJohn Spray - Ceph in Kubernetes
John Spray - Ceph in Kubernetes
 
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaS
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaSAutoscaling OpenStack Natively with Heat, Ceilometer and LBaaS
Autoscaling OpenStack Natively with Heat, Ceilometer and LBaaS
 
OpenStack Heat
OpenStack HeatOpenStack Heat
OpenStack Heat
 
Heat - keep the clouds up
Heat - keep the clouds upHeat - keep the clouds up
Heat - keep the clouds up
 
Evolution of Openstack Networking at CERN
Evolution of Openstack Networking at CERNEvolution of Openstack Networking at CERN
Evolution of Openstack Networking at CERN
 
The OpenStack Cloud at CERN
The OpenStack Cloud at CERNThe OpenStack Cloud at CERN
The OpenStack Cloud at CERN
 

Similar to Euro ht condor_alahiff

Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Dave Holland
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...Radhika Puthiyetath
 
On Docker and its use for LHC at CERN
On Docker and its use for LHC at CERNOn Docker and its use for LHC at CERN
On Docker and its use for LHC at CERNSebastien Goasguen
 
Clocker - The Docker Cloud Maker
Clocker - The Docker Cloud MakerClocker - The Docker Cloud Maker
Clocker - The Docker Cloud MakerAndrew Kennedy
 
The Rise of the Container: The Dev/Ops Technology That Accelerates Ops/Dev
The Rise of the Container:  The Dev/Ops Technology That Accelerates Ops/DevThe Rise of the Container:  The Dev/Ops Technology That Accelerates Ops/Dev
The Rise of the Container: The Dev/Ops Technology That Accelerates Ops/DevRobert Starmer
 
Clocker: Managing Container Networking and Placement
Clocker: Managing Container Networking and PlacementClocker: Managing Container Networking and Placement
Clocker: Managing Container Networking and PlacementDocker, Inc.
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications OpenEBS
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebula Project
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes InternalsShimi Bandiel
 
OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015Jesse Pretorius
 
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02Derek Ashmore
 
Kubernetes workshop
Kubernetes workshopKubernetes workshop
Kubernetes workshopKumar Gaurav
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to KubernetesVishal Biyani
 
Let's Try Every CRI Runtime Available for Kubernetes
Let's Try Every CRI Runtime Available for KubernetesLet's Try Every CRI Runtime Available for Kubernetes
Let's Try Every CRI Runtime Available for KubernetesPhil Estes
 
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...Anant Corporation
 
Containerization - The DevOps Revolution
Containerization - The DevOps RevolutionContainerization - The DevOps Revolution
Containerization - The DevOps RevolutionYulian Slobodyan
 
Pairs OpenStack Summit Summary
Pairs OpenStack Summit SummaryPairs OpenStack Summit Summary
Pairs OpenStack Summit SummaryGuangya Liu
 

Similar to Euro ht condor_alahiff (20)

Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017Sanger OpenStack presentation March 2017
Sanger OpenStack presentation March 2017
 
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
On CloudStack, Docker, Kubernetes, and Big Data…Oh my ! By Sebastien Goasguen...
 
On Docker and its use for LHC at CERN
On Docker and its use for LHC at CERNOn Docker and its use for LHC at CERN
On Docker and its use for LHC at CERN
 
141204 upload
141204 upload141204 upload
141204 upload
 
Clocker - The Docker Cloud Maker
Clocker - The Docker Cloud MakerClocker - The Docker Cloud Maker
Clocker - The Docker Cloud Maker
 
The Rise of the Container: The Dev/Ops Technology That Accelerates Ops/Dev
The Rise of the Container:  The Dev/Ops Technology That Accelerates Ops/DevThe Rise of the Container:  The Dev/Ops Technology That Accelerates Ops/Dev
The Rise of the Container: The Dev/Ops Technology That Accelerates Ops/Dev
 
Clocker: Managing Container Networking and Placement
Clocker: Managing Container Networking and PlacementClocker: Managing Container Networking and Placement
Clocker: Managing Container Networking and Placement
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander DibboOpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
OpenNebulaConf2015 1.07 Cloud for Scientific Computing @ STFC - Alexander Dibbo
 
Kubernetes Internals
Kubernetes InternalsKubernetes Internals
Kubernetes Internals
 
OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015OpenStack London Meetup, 18 Nov 2015
OpenStack London Meetup, 18 Nov 2015
 
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02
Microservices with Terraform, Docker and the Cloud. JavaOne 2017 2017-10-02
 
Kubernetes workshop
Kubernetes workshopKubernetes workshop
Kubernetes workshop
 
Introduction to Kubernetes
Introduction to KubernetesIntroduction to Kubernetes
Introduction to Kubernetes
 
HPC in the Cloud
HPC in the CloudHPC in the Cloud
HPC in the Cloud
 
Containers and HPC
Containers and HPCContainers and HPC
Containers and HPC
 
Let's Try Every CRI Runtime Available for Kubernetes
Let's Try Every CRI Runtime Available for KubernetesLet's Try Every CRI Runtime Available for Kubernetes
Let's Try Every CRI Runtime Available for Kubernetes
 
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...
Apache Cassandra Lunch #41: Cassandra on Kubernetes - Docker/Kubernetes/Helm ...
 
Containerization - The DevOps Revolution
Containerization - The DevOps RevolutionContainerization - The DevOps Revolution
Containerization - The DevOps Revolution
 
Pairs OpenStack Summit Summary
Pairs OpenStack Summit SummaryPairs OpenStack Summit Summary
Pairs OpenStack Summit Summary
 

Recently uploaded

FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | Delhi
FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | DelhiFULL ENJOY - 9953040155 Call Girls in Dwarka Mor | Delhi
FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | DelhiMalviyaNagarCallGirl
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...akbard9823
 
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad EscortsIslamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escortswdefrd
 
RAK Call Girls Service # 971559085003 # Call Girl Service In RAK
RAK Call Girls Service # 971559085003 # Call Girl Service In RAKRAK Call Girls Service # 971559085003 # Call Girl Service In RAK
RAK Call Girls Service # 971559085003 # Call Girl Service In RAKedwardsara83
 
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiMalviyaNagarCallGirl
 
Olivia Cox. intertextual references.pptx
Olivia Cox. intertextual references.pptxOlivia Cox. intertextual references.pptx
Olivia Cox. intertextual references.pptxLauraFagan6
 
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad EscortsCall girls in Ahmedabad High profile
 
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | DelhiMalviyaNagarCallGirl
 
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | DelhiFULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | DelhiMalviyaNagarCallGirl
 
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...anilsa9823
 
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service AkolaAkola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service Akolasrsj9000
 
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...akbard9823
 
FULL ENJOY - 9953040155 Call Girls in Noida | Delhi
FULL ENJOY - 9953040155 Call Girls in Noida | DelhiFULL ENJOY - 9953040155 Call Girls in Noida | Delhi
FULL ENJOY - 9953040155 Call Girls in Noida | DelhiMalviyaNagarCallGirl
 
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...dajasot375
 
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girls
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call GirlsPragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girls
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girlsashishs7044
 
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiFULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiMalviyaNagarCallGirl
 
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | DelhiMalviyaNagarCallGirl
 
The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)thephillipta
 
Bridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.comBridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.comthephillipta
 
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call Girls
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call GirlsJagat Puri Call Girls : ☎ 8527673949, Low rate Call Girls
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call Girlsashishs7044
 

Recently uploaded (20)

FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | Delhi
FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | DelhiFULL ENJOY - 9953040155 Call Girls in Dwarka Mor | Delhi
FULL ENJOY - 9953040155 Call Girls in Dwarka Mor | Delhi
 
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
Aminabad @ Book Call Girls in Lucknow - 450+ Call Girl Cash Payment 🍵 8923113...
 
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad EscortsIslamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
Islamabad Call Girls # 03091665556 # Call Girls in Islamabad | Islamabad Escorts
 
RAK Call Girls Service # 971559085003 # Call Girl Service In RAK
RAK Call Girls Service # 971559085003 # Call Girl Service In RAKRAK Call Girls Service # 971559085003 # Call Girl Service In RAK
RAK Call Girls Service # 971559085003 # Call Girl Service In RAK
 
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gtb Nagar | Delhi
 
Olivia Cox. intertextual references.pptx
Olivia Cox. intertextual references.pptxOlivia Cox. intertextual references.pptx
Olivia Cox. intertextual references.pptx
 
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
(NEHA) Call Girls Ahmedabad Booking Open 8617697112 Ahmedabad Escorts
 
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in Uttam Nagar | Delhi
 
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | DelhiFULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | Delhi
FULL ENJOY - 9953040155 Call Girls in Gandhi Vihar | Delhi
 
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
Lucknow 💋 Virgin Call Girls Lucknow | Book 8923113531 Extreme Naughty Call Gi...
 
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service AkolaAkola Call Girls #9907093804 Contact Number Escorts Service Akola
Akola Call Girls #9907093804 Contact Number Escorts Service Akola
 
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
Hazratganj / Call Girl in Lucknow - Phone 🫗 8923113531 ☛ Escorts Service at 6...
 
FULL ENJOY - 9953040155 Call Girls in Noida | Delhi
FULL ENJOY - 9953040155 Call Girls in Noida | DelhiFULL ENJOY - 9953040155 Call Girls in Noida | Delhi
FULL ENJOY - 9953040155 Call Girls in Noida | Delhi
 
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
Call Girl in Bur Dubai O5286O4116 Indian Call Girls in Bur Dubai By VIP Bur D...
 
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girls
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call GirlsPragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girls
Pragati Maidan Call Girls : ☎ 8527673949, Low rate Call Girls
 
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | DelhiFULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
FULL ENJOY - 9953040155 Call Girls in Shahdara | Delhi
 
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | DelhiFULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | Delhi
FULL ENJOY - 9953040155 Call Girls in New Ashok Nagar | Delhi
 
The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)The First Date by Daniel Johnson (Inspired By True Events)
The First Date by Daniel Johnson (Inspired By True Events)
 
Bridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.comBridge Fight Board by Daniel Johnson dtjohnsonart.com
Bridge Fight Board by Daniel Johnson dtjohnsonart.com
 
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call Girls
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call GirlsJagat Puri Call Girls : ☎ 8527673949, Low rate Call Girls
Jagat Puri Call Girls : ☎ 8527673949, Low rate Call Girls
 

Euro ht condor_alahiff

  • 1. HTCondor atRAL: AnUpdate Andrew Lahiff STFCRutherfordAppleton Laboratory 8th June, European HTCondor Workshop 2017
  • 2. Overview • Introduction • Docker universe • Monitoring • Multi-core jobs • Private cloud • Containers – on-premises – public clouds • Plans 2
  • 3. Introduction • RAL Tier-1 supports – ALICE, ATLAS, CMS, LHCb – many others, both HEP & non-HEP • Batch system – 24000 cores (soon 26000) – HTCondor 8.6.3 (worker nodes, CMs), 8.4.8 (CEs) Past 6 months 3
  • 4. Computing elements • Currently all job submission is via the Grid • 4 ARC CEs – working fine • still using 5.0.5 • generally don’tupgradevery aggressively – ATLAS submit jobs using pre-loaded pilots • jobs specifytheiractualresourcerequirements – no plans to migrate to HTCondor CEs yet 4
  • 5. Docker universe • We migrated to 100% Docker universe this year – migration to SL7 took place over several months – Docker 17.03 CE using OverlayFS storage backend – worker node filesystem: • / (RAID 0, XFS) • /pool (RAID 0, XFS with ftype=1) for Docker, job sandboxes & CVMFS cache 5
  • 6. Docker universe • HTCondor config relating to Docker DOCKER=/usr/local/bin/docker.py DOCKER_DROP_ALL_CAPABILITIES=regexp("pilot",x509UserProxyFirstFQAN) =?= False DOCKER_MOUNT_VOLUMES=GRID_SECURITY, MJF, GRIDENV, GLEXEC, LCMAPS, LCAS, PASSWD, GROUP, CVMFS, CGROUPS, ATLAS_RECOVERY, ETC_ATLAS, ETC_CMS, ETC_ARC DOCKER_VOLUME_DIR_ATLAS_RECOVERY=/pool/atlas/recovery:/pool/atlas/recovery DOCKER_VOLUME_DIR_ATLAS_RECOVERY_MOUNT_IF=regexp("atl",Owner) DOCKER_VOLUME_DIR_CGROUPS=/sys/fs/cgroup:/sys/fs/cgroup:ro DOCKER_VOLUME_DIR_CGROUPS_MOUNT_IF=regexp("atl",Owner) DOCKER_VOLUME_DIR_CVMFS=/cvmfs:/cvmfs:shared DOCKER_VOLUME_DIR_ETC_ARC=/etc/arc:/etc/arc:ro DOCKER_VOLUME_DIR_ETC_ATLAS=/etc/atlas:/etc/atlas:ro DOCKER_VOLUME_DIR_ETC_ATLAS_MOUNT_IF=regexp("atl",Owner) DOCKER_VOLUME_DIR_ETC_CMS=/etc/cms:/etc/cms:ro DOCKER_VOLUME_DIR_ETC_CMS_MOUNT_IF=regexp("cms",Owner) DOCKER_VOLUME_DIR_GLEXEC=/etc/glexec.conf:/etc/glexec.conf:ro DOCKER_VOLUME_DIR_GRIDENV=/etc/profile.d/grid-env.sh:/etc/profile.d/grid-env.sh:ro DOCKER_VOLUME_DIR_GRID_SECURITY=/etc/grid-security:/etc/grid-security:ro DOCKER_VOLUME_DIR_GROUP=/etc/group:/etc/group:ro DOCKER_VOLUME_DIR_LCAS=/etc/lcas:/etc/lcas:ro DOCKER_VOLUME_DIR_LCMAPS=/etc/lcmaps:/etc/lcmaps:ro DOCKER_VOLUME_DIR_MJF=/etc/machinefeatures:/etc/machinefeatures:ro DOCKER_VOLUME_DIR_PASSWD=/etc/passwd:/etc/passwd:ro 6
  • 7. Docker universe • Before the Docker universe, used – cgroups with soft memory limits – PERIODIC_SYSTEM_REMOVE to kill jobs which exceed 3x requested memory • With the Docker universe – by default jobs are given a hard memory limit • high rateofjobs being killed(mainly ATLAS) – now use a script which modifies arguments to “docker run” generated by HTCondor (the “DOCKER” knob) • set asoftmemorylimit atthe memoryrequested byjob • hardlimit at2xrequestedmemory 7
  • 8. Docker universe • Goodbye MOUNT_UNDER_SCRATCH – we used this for a long time • enables /tmp to be unique for each job – no need for this with the Docker universe • /tmp is already unique to each container • with OverlayFS storage backend /tmp is more than big enough • Converting Vanilla to Docker universe – currently using job routers on each CE – will switch to schedd job transforms once we upgrade CEs to 8.6.x 8
  • 9. Gateway Gateway Docker universe & storage • Started rolling outxrootd Ceph gateways & proxies onto workernodes – an important driver for migrating to SL7 worker nodes – xrootd daemons running in containers – jobs think they’re talking to the remote gateway, but they’re actually talking to the local proxy Ceph Gateway Workernode xrootd gateway xrootd proxy Swift, S3& GridFTP,xrootd 9
  • 10. Monitoring • Still have condor_gangliad running – but generally never look at Ganglia – gmond no longer being installed on some services • Alternative in use since 2015 – Telegraf for collecting metrics – InfluxDB time series database – Grafana for visualisation – Kapacitor for stream processing of metrics • ClassAds for completed jobs – Filebeat, Logstash, Elasticsearch – MySQL (accounting) 10
  • 11. Monitoring: overview InfluxDB InfluxDB Grafana Kapacitor worker nodes worker nodes worker nodes schedd schedds schedd central managers Logstash Elasticsearch Kibana MySQL Telegraf Telegraf Telegraf Filebeat HTCondor metrics: Telegraf executes Python scri which use the HTCondor Python bindings 11
  • 13. Monitoring: containers • Using Telegraf to collect container metrics from WNs – currently just xrootd gateway & proxy containers • No per-job time-series metrics yet, will revisit when – InfluxDB 1.3 is released • supports high cardinality indexing – show tag values with a WHERE time clause 13
  • 14. Multi-core jobs • Multi-core jobs account for ~50% numberof used cores – all 8-cores – only ATLAS & CMS • Haven’t made any changesfor overa year – it just works Past6months 14
  • 15. Multi-core jobs • Wedon’t use the defrag daemon orthe “Drained” state – why have any idle cores? – we wantto be able to run preemptable jobs on cores which are draining • START expression on workernodes contains: ifThenElse($(PREEMPTABLE_ONLY), isPreemptable =?= True, True) • Draining sets PREEMPTABLE_ONLY to True – only jobs with isPreemptable set to True can run – currently only ATLAS Event Service jobs • Preemptable jobs (if any) arekilled when enough cores have beendrained on a node 15
  • 16. Cloud • For several years have been making opportunistic use of our private OpenNebula cloud – using condor rooster for provisioning virtual worker nodes – only idle cloud resources used • backs off if cloud usage increases • Future (short-term) – migrating to OpenStack – migrating to CernVMs 16
  • 17. Worker nodes & containers • Investigating ways of having more flexible computing infrastructure – need to be able to support more communities, not just LHC experiments – single set of resources for • HTCondor • OpenStack hypervisors • Other StuffTM • Ideally – install nothing related to any experiment/community on any machine – run everything in containers, managed by schedulers (not people) 17 container substrate bare metal HTCondor Hypervisors ...
  • 18. Worker nodes & containers • Example using Apache Mesos as the cluster manager – last year did some tests with real jobs from all 4 LHC experiments • running >5000 cores – Mesos cluster containing 248 nodes – custom Mesos framework • creates startds in containers which join our production HTCondor pool – auto-scaling squids in containers for CVMFS/Frontier 18
  • 19. Kubernetes as an abstraction layer • Kubernetes is an open-source container cluster manager which can be run anywhere – on-premises – “click a button” on public clouds (natively or via 3rd parties) • Using it as an abstraction to enable portability between – on-premises resources – multiple public clouds • Benefits – Don’t need to worry about handling different cloud APIs – Run a single command to create an elastic, self-healing Grid site • ona single cluster • onmultiple clustersaroundthe world(via Kubernetesfederation) 19
  • 20. Kubernetes as an abstraction layer • Did initial testing with CMS CRAB3 analysis jobs – RAL, Google (GKE), Azure (ACS), AWS (StackPointCloud) • Nowrunning ATLAS production jobs on Azure – using “vacuum” model for independently creating startds which join a HTCondor pool at CERN schedd central manager AzureContainerService CERN squid squid startd startd startd ThankstoMicrosoftforanAzureResearchAward 20
  • 21. Upcoming plans include • Thereturn of local users – haven’t had local users for many years – would like to be able to support users of local facilities • job submissionusing Kerberosauth? • federatedidentity?CanINDIGOIAM help? • SLA-aware automated rolling upgrades – use a key-value store to coordinate draining & reboots • idea fromCoreOS • machinesneed toget accesstoa locktodrain – have done PoC testing but not yet tried in production – not HTCondor-specific, could be useful other services • Getting ridof pool accounts!!! 21