Speech for the presentation:
In this paper, the authors propose a new lock algorithm called Remote Core Locking (RCL). It was presented in 2012 in Boston. Several approaches have been developed for optimizing critical sections, because many applications nowadays obtain their best performance with fewer cores than are available on today's multicore architectures. Performance is influenced by the lock algorithm; in particular, a good lock should reduce access contention and cache misses. This can be seen in these pictures: the authors ran experiments with 18 applications in which the amount of time spent in critical sections grows as the number of cores increases. In the picture above we can instead see the memcached application, where a get operation performs best with 10 cores, while a set performs best with 2. On the right, a memcached measurement shows the RCL lock as the best performer.
The authors design a solution that tries to reduce the problem of bus saturation and improve cache locality, in order to obtain better execution times. The technique is implemented entirely in software on the x86 architecture. The purpose is to improve the performance of critical-section execution in applications running on multicore architectures.
RCL replaces lock acquisition with an optimized remote procedure call to a dedicated server core. The key advantage is that shared data stays in the server core's cache.
We need to identify the locks to transform from POSIX to RCL, and the core on which to run the server.
Tools: a profiler and the reengineering tool Coccinelle.
There are three situations that can lead to a deadlock, because the server becomes unable to execute the critical sections of other locks: the servicing thread could be blocked at the OS level, could spin in a wait loop, or could be preempted at the OS level.
The runtime ensures responsiveness and liveness by avoiding blocking threads at the OS level and priority inversion, and by managing at run time a pool of threads for each server: if the servicing thread blocks or waits, it is replaced by another thread from the pool. To guarantee that a free thread exists when needed, a highest-priority "management thread" wakes up at each timeout and checks whether at least one critical section has been executed since its last activation; otherwise, it tries to modify thread priorities. There is also a backup thread at the lowest priority, used when all server threads are blocked in the OS, to wake up the management thread. The RCL runtime uses the POSIX FIFO scheduling policy to avoid thread preemption, and reduces delay by minimizing the length of the FIFO queue. Since the FIFO policy could induce priority inversion between threads, a lock-free algorithm using one shared cache line is used to avoid it.
With RCL, the higher the contention, the lower the delay.
1. Remote Core Locking:
Migrating Critical-Section Execution
to Improve the Performance of Multithreaded Applications
Written by Jean-Pierre Lozi, Florian David, Gaël Thomas, Julia Lawall, Gilles Muller
Presented by Andrea Lombardo
2. Problem:
Some applications that work well on a small number of cores do not scale to the number of cores available in today's multicore architectures.
Lock performance is influenced by:
1. access contention
• Solution: reduce the number of threads that simultaneously request access to the critical section
2. cache misses
• Solution: improve locality
REMOTE CORE LOCKING 2
3. Example of the problem
Memcached is an example of an application with this problem. It performs best:
• for a get operation, with 10 cores
• for a set operation, with 2 cores
One of the bottlenecks of this application is its critical sections, in which shared information must be accessed atomically and is protected by locks.
High contention means more processing time, so it is more expensive.
It becomes a problem when the number of cores starts increasing.
Test system: a 48-core Opteron 6172 running a 3.0.0 Linux kernel with glibc 2.13
4. Application performance
Measurements: 1) time spent in critical sections, 2) number of cache misses, 3) other measurements.
RCL performs better than the other lock algorithms as the number of clients increases.
Memcached application performance:
- no Flat Combining results, because memcached periodically blocks on condition variables, which Flat Combining does not support.
5. Other studies for optimizing the execution of critical sections on multicore architectures
In the last 20 years, several approaches have been developed:
• Software solutions where the server is an ordinary client thread and the server role is distributed among the client threads; this approach produces overhead for managing the server.
• Hardware-based solutions that introduce new instructions to perform the transfer of control, and use a special fast core to execute critical sections.
• A fast transfer of control from the other client cores to the server, to reduce access contention.
• Execution of a succession of critical sections on a single server core, to improve cache locality.
6. Real causes of the low performance of lock-based approaches:
1) cache misses when executing critical sections
2) bus saturation, caused by spinlocks inducing frequent broadcasts on the bus
Solution: design better locks. RCL is introduced to address both issues simultaneously.
7. RCL key features:
1. Goal: improve the performance of critical-section execution in legacy applications that run on top of multicore architectures.
2. Developed entirely in software on the x86 architecture.
3. Works better than other kinds of locks:
- POSIX
- CAS spinlock
- MCS
- Flat Combining
8. How it works
Replace the management of critical sections with an optimized remote procedure call to a dedicated server core.
Shared information stays in the server core's cache, so there is no need to transfer data from one core to another.
9. Overview
The execution and management of critical sections is transferred to a server core, chosen through profiling: the core of the client with the most frequent lock usage.
On the client side, locks are implemented as remote procedure calls of critical sections.
10. Core algorithm
• The remote call is implemented by client/server communication through an array of request structures of size C × L, unique to each server:
• C is the maximum number of clients
• L is the size of the hardware cache line; each entry represents a request made by a client to the server
• Each request is mapped onto a single cache line.
Each request contains, in order:
1. the address of the lock associated with the critical section
2. the address of the structure holding the context
3. the address of the function that encapsulates the critical section:
• a valid address means the client has requested access
• NULL means no request.
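The request-array layout above can be sketched in C. This is a minimal, hedged sketch: the names, the 64-byte line size, and the client count are illustrative assumptions, not the authors' actual code.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64   /* L: assumed hardware cache-line size */
#define MAX_CLIENTS 48   /* C: assumed maximum number of clients */

/* One request, padded so that it occupies exactly one cache line. */
struct rcl_request {
    void *lock;                        /* address of the lock for this critical section */
    void *context;                     /* variables referenced/updated by the critical section */
    void (*volatile function)(void *); /* non-NULL = pending request, NULL = no request */
    char pad[CACHE_LINE - 3 * sizeof(void *)];
};

/* The per-server request array: C entries of L bytes each. */
static struct rcl_request requests[MAX_CLIENTS]
    __attribute__((aligned(CACHE_LINE)));

/* Client side: post a request, then wait until the server resets `function`.
 * (Never call this without a running server: it would spin forever.) */
void rcl_execute(int client_id, void *lock, void *ctx, void (*cs)(void *)) {
    struct rcl_request *r = &requests[client_id];
    r->lock = lock;
    r->context = ctx;
    r->function = cs;               /* written last: publishes the request */
    while (r->function != NULL)     /* wait for the server's answer */
        ;
}
```

Padding each entry to one cache line keeps clients from invalidating each other's lines when they post requests.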
11. Server side
• A server thread scans all the requests, waiting for entries whose function field points to a critical section.
• It iterates over each entry:
• if the function field holds an address and the lock is free, the server thread acquires the lock and executes the critical section
• the server then resets the entry's function field
• and resumes the iteration.
Client side
• After writing the entry's cache line with all the information, the client waits until the address of the function is reset to NULL.
• When the number of clients is less than the number of available cores, the SSE3 monitor/mwait instructions are used so that the client sleeps until the server answers.
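The server-side loop and the reset-to-NULL handshake above can be sketched as follows. This is an illustrative simplification with hypothetical names: a single servicing thread, a bounded number of rounds instead of an infinite loop, and no lock-field checking or thread pool.

```c
#include <stddef.h>

#define MAX_CLIENTS 48   /* assumed maximum number of clients */

struct rcl_request {
    void *lock;                        /* lock guarding this critical section */
    void *context;                     /* context passed to the critical section */
    void (*volatile function)(void *); /* non-NULL while a request is pending */
};

/* Sketch of the server loop: scan every entry; when a request is pending,
 * execute its critical section, then reset the entry to answer the client. */
void rcl_server_loop(struct rcl_request *req, int nclients, int rounds) {
    for (int r = 0; r < rounds; r++) {      /* the real server loops forever */
        for (int i = 0; i < nclients; i++) {
            void (*cs)(void *) = req[i].function;
            if (cs != NULL) {               /* pending request? */
                /* here the real server also checks that req[i].lock is
                 * free and acquires it before executing */
                cs(req[i].context);         /* execute the critical section */
                req[i].function = NULL;     /* answer the client */
            }
        }
    }
}

/* Tiny single-threaded demo: post one request, serve it, return the result. */
static void add_one(void *ctx) { (*(int *)ctx)++; }

int rcl_demo(void) {
    struct rcl_request reqs[MAX_CLIENTS] = {{0}};
    int x = 0;
    reqs[0].context = &x;
    reqs[0].function = add_one;             /* the client posts its request */
    rcl_server_loop(reqs, MAX_CLIENTS, 1);  /* the server executes it */
    return x;                               /* 1 if the section ran */
}
```

In the real runtime the client and server run on different cores; the demo collapses both roles into one thread only to show the handshake.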
12. Profiler: developed by the authors to collect information about locks:
• lock usage frequency
• time spent in critical sections
This information is used to identify the core on which to run the server, and the locks that need replacing from POSIX to RCL.
A tool based on Coccinelle is used to transform critical sections into remote procedure calls.
Critical sections become separate functions.
Problems handled by the transformation:
• shared variables
• additional elements
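The extraction of a critical section into a separate function, with the local variables it shares passed through a context structure, can be pictured with a hedged before/after sketch. The function and structure names are hypothetical, and `rcl_call` here is a stand-in that simply runs the function under the mutex; in RCL it would instead post a request to the server core.

```c
#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter;

/* BEFORE: a classic POSIX critical section. */
int before(int delta) {
    pthread_mutex_lock(&m);
    shared_counter += delta;        /* critical section body */
    int snapshot = shared_counter;
    pthread_mutex_unlock(&m);
    return snapshot;
}

/* AFTER: the body is extracted into a separate function; local variables
 * used inside the critical section travel through a context structure. */
struct cs_context { int delta; int snapshot; };

static void cs_function(void *p) {
    struct cs_context *ctx = p;
    shared_counter += ctx->delta;
    ctx->snapshot = shared_counter;
}

/* Stand-in for the RCL runtime call (here: just run under the mutex). */
static void rcl_call(pthread_mutex_t *lock, void (*f)(void *), void *ctx) {
    pthread_mutex_lock(lock);
    f(ctx);
    pthread_mutex_unlock(lock);
}

int after(int delta) {
    struct cs_context ctx = { .delta = delta };
    rcl_call(&m, cs_function, &ctx);
    return ctx.snapshot;
}
```

The "shared variable" problem mentioned above is visible here: `snapshot` lives in the caller but is written inside the critical section, so it must move into the context structure.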
13. Implementation of the RCL runtime (supported by POSIX threads)
The runtime ensures responsiveness and liveness by avoiding blocking threads at the OS level and priority inversion, and by managing at run time a pool of threads for each server:
- if the servicing thread blocks or waits, it is replaced by another thread from the pool.
The management thread manages the pool of threads:
- highest priority
- checks the progress of the threads every time it is woken up, and either
1) modifies priorities, or
2) changes nothing.
The backup thread is used when all threads are blocked in the OS: it wakes up the management thread.
1) The runtime uses the POSIX FIFO scheduling policy, so a thread executes until it blocks:
1.1) this could induce priority inversion between threads
2) Delay is reduced by minimizing the length of the FIFO queue.
There are situations to avoid because they would generate a deadlock: the server would be unable to execute the critical sections of other locks. The core algorithm is applied to a servicing thread and requires that the thread is never blocked at the OS level and never spins in a wait loop. Regarding the liveness and responsiveness of the RCL runtime, three situations can arise:
• the thread could be blocked at the OS level
• the thread could spin, if the critical section tries to acquire a spinlock
• the thread could be preempted at the OS level
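The management thread's progress check can be sketched as pure bookkeeping: the servicing thread bumps a counter after each critical section, and at each timeout the management thread compares the counter with the value seen last time. This is a hedged, thread-free sketch with hypothetical names; the real runtime also involves priorities, timeouts, and the thread pool.

```c
/* Per-server state shared between the servicing and management threads. */
struct server_state {
    unsigned long progress;   /* bumped by the servicing thread after each CS */
    unsigned long last_seen;  /* value at the management thread's last wakeup */
};

/* Called at each management-thread timeout. Returns 1 if the servicing
 * thread made progress since the last wakeup, 0 if it appears stuck
 * (in which case the runtime would promote another thread from the pool). */
int management_check(struct server_state *s) {
    int ok = (s->progress != s->last_seen);
    s->last_seen = s->progress;
    return ok;
}

/* Demo: no progress, then some progress, then none again. */
int management_demo(void) {
    struct server_state s = {0, 0};
    int a = management_check(&s);   /* nothing executed yet -> 0 */
    s.progress = 5;                 /* servicing thread ran 5 sections */
    int b = management_check(&s);   /* progress since last check -> 1 */
    int c = management_check(&s);   /* no new progress -> 0 */
    return a * 100 + b * 10 + c;
}
```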
14. Comparison when varying the degree of contention
• The critical section is executed repeatedly on all cores except one, which manages the lifecycle of the threads.
• The degree of contention on the lock is varied by varying the delay between executions of the critical section.
• The locality of the critical section is varied by varying the number of shared cache lines each critical section accesses.
• Cache line accesses are not pipelined: the address of the next memory access is constructed from the previously read value.
Results are the average of 30 runs.
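The "not pipelined" point can be illustrated with a dependent-load chain: each access computes its address from the value just read, so the processor cannot overlap the resulting cache misses. This is an illustrative sketch, not the authors' benchmark.

```c
#include <stddef.h>

#define NLINES 8   /* number of cache lines in the chain (illustrative) */

struct line {
    size_t next;                    /* index of the next line to visit */
    char pad[64 - sizeof(size_t)];  /* pad to one 64-byte cache line */
};

/* Visit `steps` lines. Each load's address depends on the value read by
 * the previous load, which serializes the memory accesses. */
size_t chase(struct line *lines, size_t start, int steps) {
    size_t i = start;
    for (int s = 0; s < steps; s++)
        i = lines[i].next;          /* dependent load: no pipelining */
    return i;
}

/* Demo: build a ring 0 -> 1 -> ... -> 7 -> 0 and walk 11 steps from 0. */
int chase_demo(void) {
    struct line lines[NLINES];
    for (size_t k = 0; k < NLINES; k++)
        lines[k].next = (k + 1) % NLINES;
    return (int)chase(lines, 0, NLINES + 3);  /* 11 mod 8 = index 3 */
}
```

With independent addresses the hardware would issue the loads in parallel; the ring forces them to complete one at a time, which is exactly what the benchmark wants when measuring locality.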
15. False serialization
• To adapt the Berkeley DB application to RCL, the 2 most used locks are allocated first, and then the other 9. If all 11 locks are implemented as RCLs on the same server, their critical sections are artificially serialized.
• The impact of this serialization is measured with two metrics:
• Use rate: measures the server workload.
• False serialization rate: a ratio based on the number of iterations over the request array.
• What matters is how these rates change between one and two servers:
• with 1 server the use rate is high; with 2 servers, false serialization is eliminated and throughput increases by about 50%.
16. Analysis of performance
• Execution time when each critical section accesses 5 cache lines.
• The average number of L2 cache misses (top) and the average execution time (bottom), per critical section over 5000 iterations, when critical sections access one shared cache line.
17. Conclusion
• RCL is a technique focused on reducing lock acquisition time and improving the execution speed of critical sections through increased data locality, obtained by migrating their execution to the server core.
• RCL is most powerful when an application relies on highly contended locks.
18. Future work
• Design new applications with these strategies.
• Consider the design and implementation of an adaptive RCL runtime:
• a system able to dynamically switch between locking strategies
• the capability to migrate locks between multiple servers, to dynamically balance the load and avoid false serialization.