Self-managed and automatically reconfigurable stream processing (Vasia Kalavri)
With its superior state management and savepoint mechanism, Apache Flink is unique among modern stream processors in supporting minimal-effort job reconfiguration. Savepoints are extensively used to enable dynamic scaling, bug fixing, upgrades, and numerous other reconfiguration use cases, all while preserving exactly-once semantics. However, when it comes to dynamic scaling, the burden of reconfiguration decisions (when and how much to scale) currently falls on the user.
In this talk, I share our recent work at ETH Zurich on providing support for self-managed and automatically reconfigurable stream processing. I present SnailTrail (NSDI'18), an online critical path analysis module that detects bottlenecks and provides insights on streaming application performance, and DS2 (OSDI'18), an automatic scaling controller that identifies optimal backpressure-free configurations and operates reactively online. Both SnailTrail and DS2 are integrated with Apache Flink and publicly available. I conclude with evaluation results, ongoing work, and future challenges in this area.
The shortest path is not always a straight line (Vasia Kalavri)
The document proposes a 3-phase algorithm to compute the metric backbone of a weighted graph, in order to improve the performance of graph algorithms and queries. Phase 1 finds first-order semi-metric edges by examining only triangles. Phase 2 identifies metric edges in 2-hop paths. Phase 3 runs BFS to label the remaining edges. The algorithm removes up to 90% of semi-metric edges and scales to billion-edge graphs. Real-world graphs exhibit significant semi-metricity, and the backbone provides up to 6x speedups for graph queries and analytics.
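To make the triangle-based Phase 1 concrete, here is a minimal sketch: an edge (u, v) is first-order semi-metric if some common neighbor x offers a shorter indirect path u-x-v. The graph, edge names, and weights below are hypothetical examples, not from the paper.

```python
# Illustrative sketch of Phase 1: an edge (u, v) is first-order semi-metric
# if some triangle neighbor x offers a shorter indirect path u-x-v.

def first_order_semi_metric(graph):
    """graph: dict mapping node -> dict of neighbor -> weight (undirected)."""
    semi_metric = set()
    for u in graph:
        for v, w_uv in graph[u].items():
            if u >= v:  # examine each undirected edge once
                continue
            # common neighbors of u and v form the triangles containing (u, v)
            for x in graph[u].keys() & graph[v].keys():
                if graph[u][x] + graph[x][v] < w_uv:
                    semi_metric.add((u, v))
                    break
    return semi_metric

# Edge (a, c) weighs 5, but a-b-c costs 1 + 2 = 3, so (a, c) is semi-metric.
g = {
    "a": {"b": 1, "c": 5},
    "b": {"a": 1, "c": 2},
    "c": {"a": 5, "b": 2},
}
print(first_order_semi_metric(g))  # {('a', 'c')}
```

Because Phase 1 only inspects triangles, it misses semi-metric edges whose shortcut path is longer than two hops; that is exactly what Phases 2 and 3 address.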
Nondeterminism is unavoidable, but data races are pure evil (racesworkshop)
Presentation by Hans-J. Boehm.
Paper and more information: http://soft.vub.ac.be/races/paper/position-paper-nondeterminism-is-unavoidable-but-data-races-are-pure-evil/
(Relative) Safety Properties for Relaxed Approximate Programs (racesworkshop)
Presentation by Michael Carbin.
Paper and more information: http://soft.vub.ac.be/races/paper/relative-safety-properties-for-relaxed-approximate-programs/
Does Better Throughput Require Worse Latency? (racesworkshop)
The document discusses the tradeoff between throughput and latency in parallel systems. It provides examples of how different algorithms for shared counters can impact latency and throughput. Specifically, it shows that increasing throughput, such as through more parallelism or replication, often leads to worse latency due to the increased communication between cores. The document concludes that there is generally a tradeoff between throughput and latency based on the number of readers, writers, and contention level in a parallel system.
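The shared-counter example can be sketched in a few lines. Sharding a counter lets writers proceed without contending on a single cell (better write throughput), but a read must now visit every shard (worse read latency). The class and names below are an invented illustration of the tradeoff, not the talk's actual algorithms.

```python
# Toy illustration of the throughput/latency tradeoff for shared counters.

class ShardedCounter:
    def __init__(self, num_shards):
        self.shards = [0] * num_shards

    def increment(self, writer_id):
        # each writer touches only its own shard: writes do not contend
        self.shards[writer_id % len(self.shards)] += 1

    def read(self):
        # a read must combine all shards: read cost grows with shard count
        return sum(self.shards)

c = ShardedCounter(num_shards=4)
for writer in range(8):
    c.increment(writer)
print(c.read())        # 8
print(len(c.shards))   # 4 cells visited per read
```

With one shard, reads are cheap but every writer contends on the same cell; with many shards, the roles reverse, mirroring the reader/writer/contention tradeoff the talk describes.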
This document summarizes a research paper that proposes a new framework called FinnMun for emulating spreadsheets. The paper introduces FinnMun and describes its implementation. It then discusses the experimental setup and results from evaluating FinnMun on various hardware configurations. The evaluation analyzes trends in metrics like throughput, response time, and hit ratio. The paper finds that FinnMun can successfully emulate spreadsheets and improve system performance. It concludes that FinnMun helps advance research on producer-consumer problems and complex systems.
This document discusses the performance of MochaWet, a system for managing constant-time algorithms. The system is made up of four independent components: probabilistic communication, context-free grammar, Byzantine fault tolerance evaluation, and low-energy configurations. Experimental results show that tripling the effective flash memory speed of topologically stochastic archetypes is crucial to MochaWet's results. The document concludes that MochaWet has set a precedent for synthesizing Byzantine fault tolerance.
If you’re a technical person and you find yourself leading people, it might be worth leaning on what you know: What if we were to understand your team as a distributed system? Even if your team isn’t distributed, it can act like a distributed system.
This document summarizes a research paper that proposes a new heuristic called PAUSE for investigating the producer-consumer problem in distributed systems. The paper motivates the need to study this problem, describes PAUSE's approach of using compact configurations and decentralized components, outlines its implementation in Lisp and Java, and presents experimental results showing PAUSE outperforms previous methods. Related work investigating similar challenges is also discussed.
An introduction to R (ssuser3c3f88)
R is a language and environment for statistical computing and graphics. It provides functions for data manipulation, calculation, and graphical displays. Key features of R include its ability to produce publication-quality plots, perform statistical tests, fit models to data, and develop statistical software. R has an extensive library of additional user-contributed packages that extend its capabilities. The document provides information on downloading and using R, reading data into R, customizing plots, and interactive plotting functions.
Wireless data broadcast is an efficient way of disseminating data to users in mobile computing environments. From the server's point of view, how to place data items on channels is a crucial issue, with the objective of minimizing the average access time and tuning time. Similarly, how to schedule the data retrieval process for a given request at the client side, such that all the requested items can be downloaded in a short time, is also an important problem. In this paper, we investigate multi-item data retrieval scheduling in push-based multichannel broadcast environments. The most important issues in mobile computing are energy efficiency and query response efficiency; however, in data broadcast the objectives of reducing access latency and energy cost can be contradictory. Consequently, we define two new problems: the Minimum Cost Data Retrieval (MCDR) Problem and the Large Number Data Retrieval (LNDR) Problem. We also develop a heuristic algorithm to download a large number of items efficiently. When there is no replicated item in a broadcast cycle, we show that an optimal retrieval schedule can be obtained in polynomial time.
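The client-side scheduling problem can be illustrated with a minimal greedy sketch: channels repeat fixed schedules, the client can tune to one channel per slot, and it grabs any still-needed item it sees, counting elapsed slots (access time) and channel switches (an energy proxy). The schedules and item names are made up; the paper's LNDR heuristic is more sophisticated than this.

```python
# Minimal greedy sketch of downloading a set of items from a push-based
# multichannel broadcast.

def greedy_retrieve(channels, wanted):
    """channels: list of per-channel cyclic item schedules.
    Returns (slots elapsed, channel switches) to fetch every wanted item."""
    remaining, slot, switches, tuned = set(wanted), 0, 0, None
    cycle = max(len(c) for c in channels)
    while remaining and slot < 10 * cycle:  # safety bound for the sketch
        # tune to any channel whose current slot carries a still-needed item
        for ch, sched in enumerate(channels):
            if sched[slot % len(sched)] in remaining:
                remaining.discard(sched[slot % len(sched)])
                if tuned is not None and tuned != ch:
                    switches += 1  # each switch costs extra energy
                tuned = ch
                break
        slot += 1
    return slot, switches

chans = [["a", "b", "c"], ["d", "e", "f"]]
print(greedy_retrieve(chans, {"a", "e", "c"}))  # (3, 2)
```

The sketch makes the latency/energy conflict visible: finishing in 3 slots here requires 2 channel switches, whereas a switch-free schedule would take longer.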
Configuration Optimization for Big Data Software (Pooyan Jamshidi)
The document discusses configuration optimization for big data software using an approach developed in the DICE project funded by the European Union's Horizon 2020 program. It describes optimizing configurations for Apache Storm and Cassandra to significantly reduce configuration time. Experiments showed large performance variations between configurations and that default settings often performed poorly compared to optimized settings. Tuning on one version did not guarantee good performance on other versions, but transferring more observations from other versions improved performance, though with diminishing returns due to increased optimization costs.
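The core loop behind search-based configuration tuning can be sketched simply: sample configurations, measure each, keep the best. The parameter names and the synthetic latency function below are invented for illustration; the DICE work drives real benchmark runs against Storm and Cassandra rather than a closed-form cost.

```python
# Sketch of search-based configuration tuning with plain random search.

import random

def measured_latency(cfg):
    # stand-in for an actual benchmark run of the system under this config;
    # minimized at spouts=3, executors=16 in this synthetic example
    return (cfg["spouts"] - 3) ** 2 + (cfg["executors"] - 16) ** 2 + 5

def random_search(samples, seed=0):
    rng = random.Random(seed)
    best_cfg, best_lat = None, float("inf")
    for _ in range(samples):
        cfg = {"spouts": rng.randint(1, 8), "executors": rng.randint(1, 64)}
        lat = measured_latency(cfg)
        if lat < best_lat:
            best_cfg, best_lat = cfg, lat
    return best_cfg, best_lat

default = {"spouts": 1, "executors": 1}
best_cfg, best_lat = random_search(samples=200)
print(measured_latency(default), ">", best_lat)  # default far from optimal
```

Even this naive search recovers the abstract's observation that defaults often perform poorly; the diminishing-returns effect appears when each extra sample (or transferred observation) costs a real benchmark run.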
The document proposes BergSump, a new framework for analyzing I/O automata. BergSump aims to confirm that superblocks and flip-flop gates are generally incompatible. It discusses related work on XML, wireless networks, and cryptography. The implementation section outlines version 5.9 of BergSump and plans to release the code under an open source license. The evaluation analyzes BergSump's performance and shows its median complexity is better than prior solutions. The conclusion argues that BergSump can successfully observe many sensor networks at once.
This is the course that was presented by James Liddle and Adam Vile for Waters in September 2008.
The book of this course can be found at: http://www.lulu.com/content/4334860
The document proposes a new method called Anvil for analyzing IPv7 configurations using pseudorandom methodologies. It describes Anvil's implementation as a collection of 13 lines of Python shell scripts that must run within the same JVM as the virtual machine monitor. The document outlines experiments run using Anvil to evaluate its performance and compares the results to related work on modeling networked systems.
Brian Klumpe Unification of Producer Consumer Key Pairs (Brian_Klumpe)
This document discusses a framework called Vulva that aims to achieve several goals: (1) confirm that SCSI disks can be made omniscient, stable, and trainable; (2) evaluate the use of public-private key pairs to unify the producer-consumer problem and cryptography; (3) demonstrate that Vulva runs in O(n!) time. The paper describes experiments conducted using Vulva that analyzed seek time, complexity, bandwidth, and other metrics on various systems. However, the results were inconsistent due to bugs and electromagnetic disturbances. The paper also reviews related work on thin clients, online algorithms, and extensible symmetries.
This is a fake scientific article generated by a computer program. It is a parody of science and a perfect example of the problem of our age: achievement without actual knowledge and effort.
Constructing Operating Systems and E-Commerce (IJARIIT)
Information retrieval systems and the partition table, while essential in theory, have not until recently been considered important [15]. In fact, few theorists would disagree with the deployment of massive multiplayer online role-playing games, which embodies the robust principles of complexity theory. In this work we investigate how Smalltalk can be applied to the synthesis of lambda calculus.
This document proposes a new framework called EnodalPincers for understanding DHCP. EnodalPincers uses a novel heuristic to cache multi-processors and explores the exploration of thin clients. The methodology assumes each component enables introspective algorithms independently. Experimental results show EnodalPincers has an expected response time and energy usage that varies with work factor and signal-to-noise ratio. In conclusion, EnodalPincers runs in Θ(log n) time like other stable algorithms for congestion control.
The large-scale cyberinformatics method to replication is defined not only by the analysis of local-area networks, but also by the structured need for the Internet. Here, we confirm the refinement of superpages, which embodies the unfortunate principles of operating systems. SHODE, our new methodology for secure methodologies, is the solution to all of these obstacles.
SECURE & EFFICIENT AUDIT SERVICE OUTSOURCING FOR DATA INTEGRITY IN CLOUDS (Gyan Prakash)
Cloud-based outsourced storage relieves the client's load for storage management and maintenance by providing a comparably low-cost, scalable, location-independent platform. However, the fact that clients no longer have physical control of their data means they face a potentially formidable risk of missing or corrupted data. To avoid these security risks, audit services are critical to ensure the integrity and availability of outsourced data and to support digital forensics and reliability in cloud computing. Provable data possession (PDP), a cryptographic method for verifying the integrity of data without retrieving it from an untrusted server, can be used to realize audit services. In this project, profiting from interactive zero-knowledge proof systems, we construct an interactive PDP protocol that prevents fraudulence of the prover (soundness property) and leakage of verified data (zero-knowledge property). We prove that our construction holds these properties based on the computational Diffie–Hellman assumption and a rewindable black-box knowledge extractor. An efficient mechanism based on probabilistic queries and periodic verification is proposed to reduce the audit cost per verification and to detect abnormalities in a timely manner. We also present an efficient method for choosing an optimal parameter value to reduce the computational overhead of cloud audit services.
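The probabilistic-audit structure behind PDP can be conveyed with a deliberately simplified sketch: the client keeps per-block tags and challenges the server on a random sample of blocks instead of downloading the file. Real PDP uses homomorphic tags and, in this work, an interactive zero-knowledge protocol; the HMAC scheme, key, and block layout below are invented solely to show the spot-check idea.

```python
# Simplified spot-check illustration of provable data possession (PDP).
# NOT the paper's protocol: real PDP tags are homomorphic and the audit
# here is interactive and zero-knowledge.

import hashlib
import hmac
import random

KEY = b"client-secret"  # hypothetical client key

def tag(index, block):
    return hmac.new(KEY, index.to_bytes(4, "big") + block, hashlib.sha256).digest()

blocks = [f"block-{i}".encode() for i in range(100)]
tags = {i: tag(i, b) for i, b in enumerate(blocks)}  # retained by the client

def audit(server_blocks, sample_size, seed=1):
    rng = random.Random(seed)
    for i in rng.sample(range(len(tags)), sample_size):
        if tag(i, server_blocks[i]) != tags[i]:
            return False  # server cannot forge a tag without the key
    return True

print(audit(blocks, sample_size=10))  # True: data intact
corrupted = list(blocks)
corrupted[7] = b"oops"
print(audit(corrupted, sample_size=100))  # False: a full audit detects it
```

Sampling is what makes the audit cheap: a small random challenge detects large-scale corruption with high probability, and the paper's periodic-verification mechanism tunes the sample size against the desired detection guarantee.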
Discover the Unseen: Tailored Recommendation of Unwatched Content (ScyllaDB)
The session shares how JioCinema approaches "watch discounting." This capability ensures that once a user has watched a certain amount of a show or movie, the platform no longer recommends that content to the user. Flawless operation of this feature promotes the discovery of new content, improving the overall user experience.
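The filtering rule at the heart of watch discounting can be sketched in a few lines: drop a title from the candidate list once the user's watched fraction crosses a threshold. The threshold value and data shapes are illustrative assumptions; JioCinema's production pipeline is far richer.

```python
# Minimal sketch of watch discounting over a recommendation candidate list.

WATCHED_THRESHOLD = 0.9  # assumed cutoff for "effectively watched"

def discount_watched(candidates, watch_progress):
    """watch_progress: title -> fraction of runtime watched (0.0-1.0)."""
    return [t for t in candidates
            if watch_progress.get(t, 0.0) < WATCHED_THRESHOLD]

progress = {"movie_a": 0.95, "show_b": 0.4}
print(discount_watched(["movie_a", "show_b", "movie_c"], progress))
# ['show_b', 'movie_c']
```

The hard part in production is not the filter itself but serving per-user watch progress at recommendation-time latency and scale, which is where the session's data-store discussion comes in.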
JioCinema is an Indian over-the-top media streaming service owned by Viacom18.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk (Fwdays)
In this talk we discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022, and see which techniques helped keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukraine experience.
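One common building block in the mitigation practices such talks cover is request rate limiting; a token bucket is the classic shape. The capacity and refill rate below are illustrative, and this in-process sketch stands in for what is normally enforced at the edge (e.g., by a WAF or load balancer).

```python
# Token-bucket rate limiter sketch: absorb short bursts, shed sustained floods.

class TokenBucket:
    def __init__(self, capacity, refill_per_tick):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_tick

    def tick(self):
        # called once per time unit to replenish tokens, up to capacity
        self.tokens = min(self.capacity, self.tokens + self.refill)

    def allow(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # request shed: the bucket is empty

bucket = TokenBucket(capacity=3, refill_per_tick=1)
burst = [bucket.allow() for _ in range(5)]  # a 5-request burst in one tick
print(burst)  # [True, True, True, False, False]
bucket.tick()
print(bucket.allow())  # True again after refill
```

The capacity bounds the tolerated burst while the refill rate bounds the sustained throughput, which is why the two are tuned separately in practice.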
"Scaling RAG Applications to serve millions of users", Kevin Goedecke (Fwdays)
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months, with lessons from technical challenges around managing high load for LLMs, RAG pipelines, and vector databases.
inQuba Webinar: Mastering Customer Journey Management with Dr Graham Hill (LizaNolte)
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
"Choosing proper type of scaling", Olena Syrota (Fwdays)
Imagine an IoT processing system that is already quite mature and production-ready, whose client coverage is growing, and for which scaling and performance are life-and-death questions. The system has Redis, MongoDB, and stream processing based on ksqlDB. In this talk, we will first analyze scaling approaches and then select the proper ones for our system.
Introducing BoxLang: A new JVM language for productivity and modularity! (Ortus Solutions, Corp)
Just like life, our code must adapt to the ever-changing world we live in: one day coding for the web, the next for tablets, APIs, or serverless applications. Multi-runtime development is the future of coding; the future is dynamic. Let us introduce you to BoxLang.
Dynamic. Modular. Productive.
BoxLang redefines development with its dynamic nature, empowering developers to craft expressive and functional code effortlessly. Its modular architecture prioritizes flexibility, allowing for seamless integration into existing ecosystems.
Interoperability at its Core
With 100% interoperability with Java, BoxLang seamlessly bridges the gap between traditional and modern development paradigms, unlocking new possibilities for innovation and collaboration.
Multi-Runtime
From the tiny 2m operating system binary to running on our pure Java web server, CommandBox, Jakarta EE, AWS Lambda, Microsoft Functions, Web Assembly, Android, and more. BoxLang has been designed to enhance and adapt according to its runtime.
The Fusion of Modernity and Tradition
Experience the fusion of modern features inspired by CFML, Node, Ruby, Kotlin, Java, and Clojure, combined with the familiarity of Java bytecode compilation, making BoxLang a language of choice for forward-thinking developers.
Empowering Transition with Transpiler Support
Transitioning from CFML to BoxLang is seamless with our JIT transpiler, facilitating smooth migration and preserving existing code investments.
Unlocking Creativity with IDE Tools
Unleash your creativity with powerful IDE tools tailored for BoxLang, providing an intuitive development experience and streamlining your workflow. Join us as we embark on a journey to redefine JVM development. Welcome to the era of BoxLang.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
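The elasticity argument for tablets can be sketched in miniature: a table's token range is carved into independently movable fragments, so a hot tablet can be split and one half migrated without touching the rest of the table. The ranges, node names, and split policy below are invented; ScyllaDB's actual tablet scheduler is considerably more involved.

```python
# Rough sketch of tablet-based data distribution: split and migrate
# individual fragments of a table independently.

class Tablet:
    def __init__(self, lo, hi, node):
        self.lo, self.hi, self.node = lo, hi, node  # token range [lo, hi)

    def split(self):
        mid = (self.lo + self.hi) // 2
        return [Tablet(self.lo, mid, self.node), Tablet(mid, self.hi, self.node)]

tablets = [Tablet(0, 2**16, "node-1"), Tablet(2**16, 2**17, "node-2")]
hot = tablets.pop(0)
tablets[0:0] = hot.split()   # split only the overloaded fragment...
tablets[1].node = "node-3"   # ...and migrate one half, leaving the rest alone
print([(t.lo, t.hi, t.node) for t in tablets])
```

Contrast this with vNode replication, where ownership is tied to node-level token assignments and rebalancing moves much coarser units of data; per-tablet movement is what enables the dynamic distribution and elasticity the keynote describes.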
The Microsoft 365 Migration Tutorial For Beginner.pptx (operationspcvita)
This presentation will help you understand the power of Microsoft 365. It also covers every productivity app included in Office 365, discusses migration scenarios related to Office 365, and explains how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
Must Know Postgres Extension for DBA and Developer during Migration (Mydbops)
Mydbops Opensource Database Meetup 16
Topic: Must-Know PostgreSQL Extensions for Developers and DBAs During Migration
Speaker: Deepak Mahto, Founder of DataCloudGaze Consulting
Date & Time: 8th June | 10 AM - 1 PM IST
Venue: Bangalore International Centre, Bangalore
Abstract: Discover how PostgreSQL extensions can be your secret weapon! This talk explores how key extensions enhance database capabilities and streamline the migration process for users moving from other relational databases like Oracle.
Key Takeaways:
* Learn about crucial extensions like oracle_fdw, pgtt, and pg_audit that ease migration complexities.
* Gain valuable strategies for implementing these extensions in PostgreSQL to achieve license freedom.
* Discover how these key extensions can empower both developers and DBAs during the migration process.
* Don't miss this chance to gain practical knowledge from an industry expert and stay updated on the latest open-source database trends.
Mydbops Managed Services specializes in taking the pain out of database management while optimizing performance. Since 2015, we have been providing top-notch support and assistance for the top three open-source databases: MySQL, MongoDB, and PostgreSQL.
Our team offers a wide range of services, including assistance, support, consulting, 24/7 operations, and expertise in all relevant technologies. We help organizations improve their database's performance, scalability, efficiency, and availability.
Contact us: info@mydbops.com
Visit: https://www.mydbops.com/
Follow us on LinkedIn: https://in.linkedin.com/company/mydbops
For more details and updates, please follow the links below.
Meetup Page : https://www.meetup.com/mydbops-databa...
Twitter: https://twitter.com/mydbopsofficial
Blogs: https://www.mydbops.com/blog/
Facebook(Meta): https://www.facebook.com/mydbops/
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA, and what are its benefits?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from one minute of downtime run from $5,000 to $10,000. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
GlobalLogic Java Community Webinar #18 “How to Improve Web Application Perfor...GlobalLogic Ukraine
In this talk, we answer why improving application performance matters and which approaches are most effective. We also discuss what a cache is, which kinds of caches exist, and, most importantly, how to find a performance bottleneck.
Video and event details: https://bit.ly/45tILxj
Lee Barnes - Path to Becoming an Effective Test Automation Engineer.pdfleebarnesutopia
So… you want to become a Test Automation Engineer (or hire and develop one)? While there’s quite a bit of information available about important technical and tool skills to master, there’s not enough discussion around the path to becoming an effective Test Automation Engineer who knows how to add VALUE. In my experience, this has led to a proliferation of engineers who are proficient with tools and building frameworks but have skill and knowledge gaps, especially in software testing, that reduce the value they deliver with test automation.
In this talk, Lee will share his lessons learned from over 30 years of working with, and mentoring, hundreds of Test Automation Engineers. Whether you’re looking to get started in test automation or just want to improve your trade, this talk will give you a solid foundation and roadmap for ensuring your test automation efforts continuously add value. This talk is equally valuable for both aspiring Test Automation Engineers and those managing them! All attendees will take away a set of key foundational knowledge and a high-level learning path for leveling up test automation skills and ensuring they add value to their organizations.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect, Anika Systems
2. Thank You
✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters
Saturday 4 May 13
3. Announcements
✦ Program at:
✦ http://soft.vub.ac.be/races/program/
✦ Strict timekeepers
✦ Dinner?
✦ Recording
4. 9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up
6. Expandable Array
(diagram: shared pointer a → array object with length = 4, next = 2, values[])

Two threads execute append(o) concurrently:

append(o)                  append(o)
  c = a;                     c = a;
  i = c.next;                i = c.next;
  if (c.length <= i)         if (c.length <= i)
    n = expand c;              n = expand c;
    a = n; c = n;              a = n; c = n;
  c.values[i] = o;           c.values[i] = o;
  c.next = i + 1;            c.next = i + 1;
12. Expandable Array
(diagram: shared pointer a → array object with length = 4, next = 2, values[])

append(o)                  append(o)
  c = a;                     c = a;
  i = c.next;                i = c.next;
  if (c.length <= i)         if (c.length <= i)
    n = expand c;              n = expand c;
    a = n; c = n;              a = n; c = n;
  c.values[i] = o;           c.values[i] = o;
  c.next = i + 1;            c.next = i + 1;

Data Race!
16. Towards Approximate Computing: Programming with Relaxed Synchronization
Renganarayanan et al., IBM Research, RACES’12, Oct. 21, 2012

(figure: a spectrum from the computing model of today to the human brain)
  Computation: Precise → Less precise
  Data: Accurate → Less accurate, less up-to-date, possibly corrupted
  Hardware: Reliable → Variable
Relaxed synchronization sits toward the human-brain end of this spectrum.
18. Nondeterminism is Unavoidable, but Data Races are Pure Evil
Hans-J. Boehm, HP Labs

• Much low-level code is inherently nondeterministic, but
• Data races
  – Are forbidden by the C/C++/OpenMP/Posix language standards.
  – May break code now, or when you recompile.
  – Don’t improve scalability significantly, even if the code still works.
  – Are easily avoidable in C11 & C++11.
19. How FIFO is Your Concurrent FIFO Queue?
Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer
University of Salzburg

Semantically correct, and therefore “slow”, FIFO queues vs. semantically relaxed, and thereby “fast”, FIFO queues.
Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.
20. A Case for Relativistic Programming
Philip W. Howard and Jonathan Walpole

• Alter ordering requirements (causal, not total)
• Don’t alter correctness requirements
• High performance, highly scalable
• Easy to program
22. Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction

As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

  Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency

For this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if the tradeoff were linear

                                                  Version A    Version B
Core count                                        10           10
Best-possible inter-core latency                  200 µs       200 µs
Mean observed latency in application              1,000 µs     3,000 µs
Normalized latency (observed / best possible)     5            15
App-operations/sec. (1 core)                      1,000        1,000
App-operations/sec. (10 cores)                    2,500        7,500
Normalized throughput (vs. perfect scaling)       0.25         0.75
Latency / Throughput                              20           20
A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock, can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
the word [1]. The atomic instruction ensures that the
word was not changed by some other thread while the
updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.
Table 1: Hypothetical figures if the tradeoff were linear

                                                 Version A    Version B
  Core count                                            10           10
  Best-possible inter-core latency                  200 µs       200 µs
  Mean observed latency in application            1,000 µs     3,000 µs
  Normalized latency (observed / best possible)          5           15
  App-operations/sec (1 core)                        1,000        1,000
  App-operations/sec (10 cores)                      2,500        7,500
  Normalized throughput (vs. perfect scaling)         0.25         0.75
  Latency / throughput                                  20           20
A Progression of Techniques Trading Throughput for Latency
As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock and the processing time lost while waiting for one can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry.
Slide summary: taking turns and broadcasting changes gives low latency; dividing into sections and working round-robin gives high throughput (throughput -> parallel -> distributed/replicated -> latency).
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
Saturday 4 May 13
Parallel sorting on a spatial computer
Max Orhai, Andrew P. Black
Spatial computing offers insights into:
• the costs and constraints of communication in large parallel computer arrays
• how to design algorithms that respect these costs and constraints