This document proposes hardware-aware thread scheduling on asymmetric multicore processors like the AMD Bulldozer. It presents a workload characterization technique using hardware performance counters to identify threads that are floating-point intensive. An optimized scheduler then performs scheduling decisions based on hardware resource usage and workload characterization to improve occupancy of processing units on Bulldozer chips. Evaluation on SPEC CPU2006 and SciMark2.0 benchmarks shows the approach improves performance over default OS scheduling by better distributing integer and floating-point workloads across the cores.
Hardware-aware thread scheduling: the case of asymmetric multicore processors
1. Hardware-aware thread scheduling: the
case of asymmetric multicore processors
Achille Peternier*, Danilo Ansaloni, Daniele Bonetta,
Cesare Pautasso and Walter Binder
* achille.peternier@usi.ch
http://sosoa.inf.unisi.ch
3. Context
• Modern CPUs increase the computational
power through additional cores
• HW architectures are becoming increasingly
complex
– Shared caches
– Non Uniform Memory Access (NUMA)
– Single Instruction Multiple Data (SIMD) registers
– Simultaneous MultiThreading (SMT) units
4. Context
• Operating System (OS) kernel and scheduler
try to automatically optimize applications’
performance according to the available
resources
– Based on the underlying HW
– Using a limited set of performance indicators (CPU
time, memory usage, etc.)
5. “Today it is impossible to estimate performance:
you have to measure it. Programming has become
an empirical science.”
Performance Anxiety: Performance analysis in the new millennium
Joshua Bloch, Google Inc.
6. Contributions
1) Automated workload analysis technique relying on a
specific set of performance metrics that are currently not
used by common OS schedulers
2) Hardware-aware optimized scheduler making
decisions based on hardware resource usage and the
output of the workload analysis
- to improve processing-unit occupancy on
SMT/asymmetric processors
7. The big picture
[Diagram labels: Monitoring daemon; OS threads and processes; Workload characterization (FPU / INT)]
8. The big picture
[Diagram labels: Workload characterization (FPU / INT); Hardware-aware scheduler]
10. AMD Bulldozer
• AMD Bulldozer architecture
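– Each CPU is implemented as a series of modules
(sometimes also called “cores”), each with two cores
(a.k.a. “processing units” or “SMT units”)
– Integer Arithmetic-Logic Units (ALUs) are physically
present in each SMT unit, while the floating-point
unit is shared within a module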
– A module is more similar to:
• A dual core when doing integer ops
• A single core with SMT=2 when
doing floating point ops
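The module layout above can be sketched as a small topology helper. This is only an illustration under a stated assumption: processing units are numbered so that units 2k and 2k+1 share module k. The real numbering should be checked on the target machine (for example with hwloc). The function names are hypothetical and not part of BulldOver.

```python
from typing import Iterable

# Assumption: units 2k and 2k+1 belong to module k (verify on real hardware).
UNITS_PER_MODULE = 2


def module_of(unit: int) -> int:
    """Return the module index that hosts the given processing unit."""
    return unit // UNITS_PER_MODULE


def sibling_of(unit: int) -> int:
    """Return the other processing unit in the same module."""
    return unit ^ 1


def shared_fpu_conflicts(fp_units: Iterable[int]) -> int:
    """Count modules whose two units both run floating-point-intensive threads.

    Integer work does not conflict in this model, because each unit has its
    own integer ALUs; only the FPU is shared inside a module.
    """
    modules = [module_of(u) for u in set(fp_units)]
    return sum(1 for m in set(modules) if modules.count(m) == UNITS_PER_MODULE)
```

With four FP-intensive threads on units 0, 2, 4 and 6 no module has both units competing for its FPU, whereas placing them on units 0, 1, 2 and 3 creates two conflicts.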
15. Workload characterization
• Used to sort processes and threads by
floating-point intensity
– Among the X most active threads
• (where X = the number of cores available)
• Based on a real-time monitoring system using
Hardware Performance Counters (HPCs)
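A minimal sketch of this selection step, assuming each thread already has a cycle count and an FPU-usage value for the last sampling interval. The `ThreadStats` type and the `rank_most_active` function are hypothetical names used for illustration; the slides do not show how BulldOver implements the step.

```python
from typing import List, NamedTuple


class ThreadStats(NamedTuple):
    tid: int          # thread id
    cycles: int       # CPU cycles consumed in the last sampling interval
    fpu_usage: float  # share of those cycles with the FPU in use (0.0 to 1.0)


def rank_most_active(threads: List[ThreadStats], num_cores: int) -> List[ThreadStats]:
    """Keep the num_cores most active threads, sorted by FPU intensity."""
    # Step 1: the X most active threads, where X = number of cores available.
    most_active = sorted(threads, key=lambda t: t.cycles, reverse=True)[:num_cores]
    # Step 2: order them from most to least floating-point intensive.
    return sorted(most_active, key=lambda t: t.fpu_usage, reverse=True)
```

A mostly idle thread is filtered out in the first step even if its FPU usage is high, so it does not influence placement decisions.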
16. …about HPCs…
• Registers embedded into processors to keep track
of hardware-related events such as cache misses,
number of CPU cycles, branch mispredictions,
etc.
• Very low overhead (about 1%)
• Extremely accurate
• Limited resources: only a few of them can be used
at the same time
– This has so far limited their adoption on a large scale
• HW-specific
17. Workload characterization
• HPCs used:
– PERF_COUNT_HW_CPU_CYCLES: measures the
total number of CPU cycles consumed by a thread
during its execution time
– CYCLES_FPU_EMPTY: keeps track of the number
of CPU cycles the floating point units are not being
used by a thread during its execution time
– L2_CACHE_MISSES: counts the number of L2
cache misses generated by a thread during its
execution time
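One plausible way to turn these three counters into per-thread metrics is sketched below. The slides do not state how the values are combined, so the formulas are assumptions: FPU usage is taken as the fraction of cycles in which the FPU was not empty, and cache pressure as L2 misses per thousand cycles. `CounterSample` and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    cpu_cycles: int        # PERF_COUNT_HW_CPU_CYCLES
    fpu_empty_cycles: int  # CYCLES_FPU_EMPTY
    l2_cache_misses: int   # L2_CACHE_MISSES


def fpu_usage(sample: CounterSample) -> float:
    """Fraction of the thread's cycles in which the FPU was in use (assumed metric)."""
    if sample.cpu_cycles <= 0:
        return 0.0
    busy = sample.cpu_cycles - sample.fpu_empty_cycles
    # Clamp, since counters read at slightly different instants may disagree.
    return max(0.0, min(1.0, busy / sample.cpu_cycles))


def l2_misses_per_kilocycle(sample: CounterSample) -> float:
    """L2 cache misses per 1000 CPU cycles (assumed metric)."""
    if sample.cpu_cycles <= 0:
        return 0.0
    return 1000.0 * sample.l2_cache_misses / sample.cpu_cycles
```

A thread with 1000 cycles, 250 of them with an empty FPU, gets an FPU usage of 0.75 under this definition.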
20. BulldOver design
• Server
– Daemon
– Scans the underlying architecture
– Time-based HPC monitoring (once per sec)
• We target scientific workloads; short-lived threads are
not well suited
– Applies scheduling policies
– libHpcOverseer, hwloc, libpfm
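The slides do not spell out the scheduling policy the daemon applies. The sketch below is one simple possibility, not the authors' implementation: spread the most FPU-intensive threads one per module first, then fill the second unit of each module so that the least FPU-intensive thread shares a module with the most FPU-intensive one. It assumes units 2m and 2m+1 belong to module m. `plan_placement` and `apply_placement` are hypothetical names; `os.sched_setaffinity` is the standard Python call, available on Linux only, and is used here on the assumption that passing a thread id pins that thread.

```python
import os
from typing import Dict, List, Tuple

UNITS_PER_MODULE = 2  # assumption: units 2m and 2m+1 belong to module m


def plan_placement(ranked: List[Tuple[int, float]], num_modules: int) -> Dict[int, int]:
    """Map thread id -> processing unit.

    `ranked` holds (tid, fpu_usage) pairs sorted from most to least
    FPU-intensive; threads beyond the number of units are ignored.
    """
    threads = ranked[: UNITS_PER_MODULE * num_modules]
    plan: Dict[int, int] = {}
    first_wave = threads[:num_modules]   # most FPU-intensive, one per module
    second_wave = threads[num_modules:]  # remaining, less FPU-intensive
    for module, (tid, _) in enumerate(first_wave):
        plan[tid] = UNITS_PER_MODULE * module
    # Reverse so the least FPU-intensive thread joins the most intensive one.
    for module, (tid, _) in enumerate(reversed(second_wave)):
        plan[tid] = UNITS_PER_MODULE * module + 1
    return plan


def apply_placement(plan: Dict[int, int]) -> None:
    """Pin each thread to its planned unit (Linux only, needs permission)."""
    for tid, unit in plan.items():
        os.sched_setaffinity(tid, {unit})
```

Keeping planning separate from `apply_placement` lets the plan be inspected or logged once per sampling interval before any affinity is changed.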
21. BulldOver design
• Client
– Command-line tool
• prompt> bulldover java myprogram
– Traces the creation/termination of
threads/processes
– Shares information with the server through
shared memory
– libmonitor, boost
24. Testing environment
• Dell PowerEdge M915
– 4x AMD 6282SE 2.6 GHz CPUs (16 cores/8
modules each)
• Limited to 1 CPU with 8 cores/4 modules
– Test limited to a single NUMA node
• Avoids latencies and other well-known
NUMA-related effects
– Turbo mode and freq. scaling disabled
25. Benchmark suites
• SPEC CPU 2006
– Perfect match for evaluating Integer vs. Floating point
behaviors
• SciMark 2.0
– Java based
– Noisy environment (additional threads for garbage
collection, JIT, etc.)
– Mainly FPU-oriented, with different levels of stress
– Modified multi-threaded version running several
random benchmarks over a thread-pool
29. Results for SPEC CPU 2006
Running 4x Int and 4x FPU benchmarks on a single NUMA
node (4 modules/8 cores)
[Chart series: Inefficient baseline; Improved scheduling; Default OS scheduling]
30. Discussion
• BulldOver avoids the worst-case scenario
– The default OS scheduler is not aware of the
workload characterization
• Benefits come from both improved cache
usage and better FPU/integer unit
occupancy
31. Results for SciMark 2.0
Running 8 benchmarks that change randomly over time
on a single NUMA node (4 modules/8 cores)
[Chart series: Default OS scheduling; Improved scheduling]
32. Discussion
• All the threads are FPU-intensive
– But at different levels
• Still a reasonable speedup “for free”
• Dynamic adaptation, since the FPU usage
intensity varies over time
– BulldOver reacts accordingly
33. Conclusions
- We show that thread scheduling that is unaware of the shared
HW resources available on the AMD Bulldozer processor
can incur a significant performance penalty
- We presented a monitoring system that is able to
characterize the most active threads according to their
FPU/Integer usage
- Thanks to the real-time analysis, improved scheduling can
be applied and performance improved
- Our system is minimally intrusive:
- Low overhead (below 2%)
- No kernel patching required
- No code instrumentation
- Works on any application
34. Conclusions
• Currently tuned for a specific HW architecture
• Good for scientific workloads
– A sampling interval is required (1 sec in our case; it
could be shorter but can’t be 0…)
• Based on a very simple scheduling policy
– More sophisticated policies could be used
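The role of the sampling interval can be illustrated with a small sketch. It assumes the counters are read as cumulative per-thread totals, so the activity in one interval is the difference between two consecutive readings. That reading model and the name `interval_deltas` are assumptions, not details taken from the slides.

```python
from typing import Dict


def interval_deltas(previous: Dict[int, int], current: Dict[int, int]) -> Dict[int, int]:
    """Per-thread counter increase between two consecutive samples.

    Keys are thread ids, values are cumulative counter readings.
    """
    deltas: Dict[int, int] = {}
    for tid, value in current.items():
        # A thread seen for the first time contributes its full count.
        deltas[tid] = value - previous.get(tid, 0)
    # Threads that ended since the previous sample are simply dropped.
    return deltas
```

A thread that starts and ends between two samples never appears in `current`, which is one reason short-lived threads are a poor fit for time-based monitoring.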
36. “Pow7Over”
• Work in progress on IBM Power7 processors
– 1 CPU, 8 cores, up to 4 SMT units per core
– Completely different…
• …operating system: RHEL 6.3
• …architecture: PowerPC
• …HPCs: IBM-specific ones (more than 500 available…)
• …compiler: autotools 6.0
• Similar approach
• Slightly less significant speedup
– But this is a full SMT architecture
– Similar overall behavior both for the PUs and L2 caches