A modern CPU looks under the cover to understand what performance. What about hardware for Java application performance? Let's understand Performance Monitoring Unit works, Hardware Counters & using them to speeding up Java apps.
Nowadays, CPU microarchitecture is concealed from developers by compilers, VMs, etc. Do Java developers need to know microarchitecture details of modern processors? Or, does it like to learn quantum mechanics for cooking? Are Java developers safe from leaking low-level microarchitecture details into high level application performance behaviour? We will try to answer these questions by analyzing several Java example.
Nowadays, CPU microarchitecture is concealed from developers by compilers, VMs, etc.
Do Java developers need to know microarchitecture details of modern processors?
Or, does it like to learn quantum mechanics for cooking?
Are Java developers safe from leaking low-level microarchitecture details into high level application performance behaviour?
We will try to answer these questions by analyzing several Java examples.
JDK8 is bringing lambdas into the Java world. But without significant Java SE 8 libraries extensions lambdas would be only a toy. The centerpiece of the new libraries is the Streams API (formerly ‘Collections bulk operations’). This session presents the Streams API and examples how to get the best performance and scalability using the library.
Speedup Your Java Apps with Hardware CountersC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2n593ji.
Sergey Kuksenko discusses how Performance Monitoring Unit works, what Hardware Counters are, which tools have friendship with Java and how to use HWC for speeding up our Java applications. Filmed at qconsf.com.
Sergey Kuksenko works as Java Performance Engineer at Oracle. His primary goal is making Oracle JVM faster digging into JVM runtime, JIT compilers, class libraries, etc. His favorite area is an interaction of Java with modern hardware what he is doing since 2005 when he worked at Intel in Apache Harmony Performance team.
An overview of the inner-workings of OpenJDK - with emphasis on...
- what triggers the just-in-time compiler (JIT)
- types of speculative optimizations performed by the JIT
- aspects of the Java language & ecosystem that make ahead-of-time (AOT) compilation challenging
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Akihiro Hayashi
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
Nowadays, CPU microarchitecture is concealed from developers by compilers, VMs, etc. Do Java developers need to know microarchitecture details of modern processors? Or, does it like to learn quantum mechanics for cooking? Are Java developers safe from leaking low-level microarchitecture details into high level application performance behaviour? We will try to answer these questions by analyzing several Java example.
Nowadays, CPU microarchitecture is concealed from developers by compilers, VMs, etc.
Do Java developers need to know microarchitecture details of modern processors?
Or, does it like to learn quantum mechanics for cooking?
Are Java developers safe from leaking low-level microarchitecture details into high level application performance behaviour?
We will try to answer these questions by analyzing several Java examples.
JDK8 is bringing lambdas into the Java world. But without significant Java SE 8 libraries extensions lambdas would be only a toy. The centerpiece of the new libraries is the Streams API (formerly ‘Collections bulk operations’). This session presents the Streams API and examples how to get the best performance and scalability using the library.
Speedup Your Java Apps with Hardware CountersC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2n593ji.
Sergey Kuksenko discusses how Performance Monitoring Unit works, what Hardware Counters are, which tools have friendship with Java and how to use HWC for speeding up our Java applications. Filmed at qconsf.com.
Sergey Kuksenko works as Java Performance Engineer at Oracle. His primary goal is making Oracle JVM faster digging into JVM runtime, JIT compilers, class libraries, etc. His favorite area is an interaction of Java with modern hardware what he is doing since 2005 when he worked at Intel in Apache Harmony Performance team.
An overview of the inner-workings of OpenJDK - with emphasis on...
- what triggers the just-in-time compiler (JIT)
- types of speculative optimizations performed by the JIT
- aspects of the Java language & ecosystem that make ahead-of-time (AOT) compilation challenging
Speculative Execution of Parallel Programs with Precise Exception Semantics ...Akihiro Hayashi
Akihiro Hayashi, Max Grossman, Jisheng Zhao, Jun Shirako, Vivek Sarkar. The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013), September 25-27, 2013 Qualcomm Research Silicon Valley, Santa Clara, CA (co-located with CnC-2013).
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Verification of Concurrent and Distributed SystemsMykola Novik
Building correct concurrent and distributed systems is hard and very challenging task also high complexity of such software increases the probability of human error in design and architecture. On practice standard verification techniques in industry are necessary but not sufficient. In my talk we will discuss formal specification and verification language that helps engineers design, specify, reason about and verify complex, real-life algorithms and software systems.
A quick tutorial on what debuggers are and how to use them. We present a debugging example using GDB. At the end of this tutorial, you will be able to work your way through a crash and analyze the cause of the error responsible for the crash.
Kernel Recipes 2013 - Automating source code evolutions using CoccinelleAnne Nicolas
New APIs are continually added to the Linux kernel, to improve functionality, reliability, or performance. Nevertheless, it is a challenge to update old code to use these new APIs. Coccinelle is a program matching and transformation tool for C code that has been extensively been applied to the Linux kernel. This talk will introduce Coccinelle and illustrate how it can be used for this kind of code renovation.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of 4 OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
Production Time Profiling and Diagnostics on the JVMMarcus Hirt
These are the slides for my Code One 2018 talk on profiling and diagnostics on the JVM. The talk goes through various serviceability technologies built into the JVM, but with a focus on the production time use cases.
How do we go from your Java code to the CPU assembly that actually runs it? Using high level constructs has made us forget what happens behind the scenes, which is however key to write efficient code.
Starting from a few lines of Java, we explore the different layers that constribute to running your code: JRE, byte code, structure of the OpenJDK virtual machine, HotSpot, intrinsic methds, benchmarking.
An introductory presentation to these low-level concerns, based on the practical use case of optimizing 6 lines of code, so that hopefully you to want to explore further!
Presentation given at the Toulouse (FR) Java User Group.
Video (in french) at https://www.youtube.com/watch?v=rB0ElXf05nU
Slideshow with animations at https://docs.google.com/presentation/d/1eIcROfLpdTU2_Z_IKiMG-AwqZGZgbN1Bs2E0nGShpbk/pub?start=true&loop=false&delayms=60000
Inside the JVM - Follow the white rabbit! / Breizh JUGSylvain Wallez
Presentation given at the Rennes (FR) Java User Group in Feb 2019.
How do we go from your Java code to the CPU assembly that actually runs it? Using high level constructs has made us forget what happens behind the scenes, which is however key to write efficient code.
Starting from a few lines of Java, we explore the different layers that constribute to running your code: JRE, byte code, structure of the OpenJDK virtual machine, HotSpot, intrinsic methds, benchmarking.
An introductory presentation to these low-level concerns, based on the practical use case of optimizing 6 lines of code, so that hopefully you to want to explore further!
SIMD machines — machines capable of evaluating the same instruction on several elements of data in parallel — are nowadays commonplace and diverse, be it in supercomputers, desktop computers or even mobile ones. Numerous tools and libraries can make use of that technology to speed up their computations, yet it could be argued that there is no library that provides a satisfying minimalistic, high-level and platform-agnostic interface for the C++ developer.
Verification of Concurrent and Distributed SystemsMykola Novik
Building correct concurrent and distributed systems is hard and very challenging task also high complexity of such software increases the probability of human error in design and architecture. On practice standard verification techniques in industry are necessary but not sufficient. In my talk we will discuss formal specification and verification language that helps engineers design, specify, reason about and verify complex, real-life algorithms and software systems.
A quick tutorial on what debuggers are and how to use them. We present a debugging example using GDB. At the end of this tutorial, you will be able to work your way through a crash and analyze the cause of the error responsible for the crash.
Kernel Recipes 2013 - Automating source code evolutions using CoccinelleAnne Nicolas
New APIs are continually added to the Linux kernel, to improve functionality, reliability, or performance. Nevertheless, it is a challenge to update old code to use these new APIs. Coccinelle is a program matching and transformation tool for C code that has been extensively been applied to the Linux kernel. This talk will introduce Coccinelle and illustrate how it can be used for this kind of code renovation.
Early Results of OpenMP 4.5 Portability on NVIDIA GPUs & CPUsJeff Larkin
This talk was presented at the DOE Centers of Excellence Performance Portability Workshop in August 2017. In this talk I explore the current status of 4 OpenMP 4.5 compilers for NVIDIA GPUs and CPUs from the perspective of performance portability between compilers and between the GPU and CPU.
Production Time Profiling and Diagnostics on the JVMMarcus Hirt
These are the slides for my Code One 2018 talk on profiling and diagnostics on the JVM. The talk goes through various serviceability technologies built into the JVM, but with a focus on the production time use cases.
How do we go from your Java code to the CPU assembly that actually runs it? Using high level constructs has made us forget what happens behind the scenes, which is however key to write efficient code.
Starting from a few lines of Java, we explore the different layers that constribute to running your code: JRE, byte code, structure of the OpenJDK virtual machine, HotSpot, intrinsic methds, benchmarking.
An introductory presentation to these low-level concerns, based on the practical use case of optimizing 6 lines of code, so that hopefully you to want to explore further!
Presentation given at the Toulouse (FR) Java User Group.
Video (in french) at https://www.youtube.com/watch?v=rB0ElXf05nU
Slideshow with animations at https://docs.google.com/presentation/d/1eIcROfLpdTU2_Z_IKiMG-AwqZGZgbN1Bs2E0nGShpbk/pub?start=true&loop=false&delayms=60000
Inside the JVM - Follow the white rabbit! / Breizh JUGSylvain Wallez
Presentation given at the Rennes (FR) Java User Group in Feb 2019.
How do we go from your Java code to the CPU assembly that actually runs it? Using high level constructs has made us forget what happens behind the scenes, which is however key to write efficient code.
Starting from a few lines of Java, we explore the different layers that constribute to running your code: JRE, byte code, structure of the OpenJDK virtual machine, HotSpot, intrinsic methds, benchmarking.
An introductory presentation to these low-level concerns, based on the practical use case of optimizing 6 lines of code, so that hopefully you to want to explore further!
Using Deep Learning on Apache Spark to Diagnose Thoracic Pathology from Chest...Databricks
Overview and extended description: AI is expected to be the engine of technological advancements in the healthcare industry, especially in the areas of radiology and image processing. The purpose of this session is to demonstrate how we can build a AI-based Radiologist system using Apache Spark and Analytics Zoo to detect pneumonia and other diseases from chest x-ray images. The dataset, released by the NIH, contains around 110,00 X-ray images of around 30,000 unique patients, annotated with up to 14 different thoracic pathology labels. Stanford University developed a state-of-the-art model using CNN and exceeds average radiologist performance on the F1 metric. This talk focuses on how we can build a multi-label image classification model in a distributed Apache Spark infrastructure, and demonstrate how to build complex image transformations and deep learning pipelines using BigDL and Analytics Zoo with scalability and ease of use. Some practical image pre-processing procedures and evaluation metrics are introduced. We will also discuss runtime configuration, near-linear scalability for training and model serving, and other general performance topics.
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
Our main aim of research is to find the limit of Amdahl's Law for multicore processors, to make number of cores giving more efficiency to overall architecture of the CMP(Chip Multi
Processor a.k.a. Multicore Processor). As it is expected this limit will be in the architecture of Multicore Processor, or in the programming. We surveyed the architecture of the Multicore
processors of various chip manufacturers namely INTEL™, AMD™, IBM™ etc., and the various techniques there followed in, for improving the performance of the Multicore
Processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate design of Multicore processor based on the results of our cluster experiment.
Affect of parallel computing on multicore processorscsandit
Our main aim of research is to find the limit of Amdahl's Law for multicore processors, to make
number of cores giving more efficiency to overall architecture of the CMP(Chip Multi
Processor a.k.a. Multicore Processor). As it is expected this limit will be in the architecture of
Multicore Processor, or in the programming. We surveyed the architecture of the Multicore
processors of various chip manufacturers namely INTEL™, AMD™, IBM™ etc., and the
various techniques there followed in, for improving the performance of the Multicore
Processors.
We conducted cluster experiments to find this limit. In this paper we propose an alternate design
of Multicore processor based on the results of our cluster experiment.
Sparklens: Understanding the Scalability Limits of Spark Applications with R...Databricks
One of the common requests we receive from customers (at Qubole) is debugging slow spark application. Usually this process is done with trial and error, which takes time and requires running clusters beyond normal usage (read wasted resources). Moreover, it doesn’t tell us where to looks for further improvements. We at Qubole are looking into making this process more self-serve. Towards this goal we have built a tool (OSS https://github.com/qubole/sparklens) based on spark event listener framework.
From a single run of the application, Sparklens provides insights about scalability limits of given spark application. In this talk we will cover what Sparklens does and theory behind Sparklens. We will talk about how structure of spark application puts important constraints on its scalability. How can we find these structural constraints and how to use these constraints as a guide in solving performance and scalability problems of spark applications.
This talk will help audience in answering the following questions about their spark applications: 1) Will their application run faster with more executors? 2) How will cluster utilization change as number of executors change? 3) What is the absolute minimum time this application will take even if we give it infinite executors? 4) What is the expected wall clock time for the application when we fix the most important structural limits of these application? Sparklens makes the ROI of additional executor extremely obvious for a given application and needs just a single run of the application to determine how application with behave with different executor counts. Specifically, it will help managers take the correct side of the tradeoff between spending developer time optimising applications vs spending money on compute bills.
Talk for YOW! by Brendan Gregg. "Systems performance studies the performance of computing systems, including all physical components and the full software stack to help you find performance wins for your application and kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes the topic for everyone, touring six important areas: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events) and tracing (ftrace, bcc/BPF, and bpftrace/BPF), advice about what is and isn't important to learn, and case studies to see how it is applied. This talk is aimed at everyone: developers, operations, sysadmins, etc, and in any environment running Linux, bare metal or the cloud.
"
Stay up-to-date with the OpenACC Monthly Highlights. July's edition covers the OpenACC Summit 2021, GCC, upcoming GPU Hackathons and Bootcamps, Sunita Chandrasekaran named as PI for SOLLVE Project, recent research and more!
A 2015 presentation to introduce users to Java profiling. The Yourkit Profiler is used for concrete examples. The following topics are covered:
1) When to profile
2) Profiler sampling
3) Profiler instrumentation
4) Where to Start
5) Macro vs micro benchmarking
With interest in AI based applications growing, and companies like IBM, Google, Microsoft, NVidia investing heavily in computing and software applications, interest in Deep Learning has been growing in the past few years. With advances in software and hardware technologies, Neural Networks are making a resurgence.
In this workshop, we will discuss the basics of Neural Networks and discuss how Deep Learning Neural networks are different from conventional Neural Network architectures. We will review a bit of mathematics that goes into building neural networks and understand the role of GPUs in Deep Learning. We will also get an introduction to Autoencoders, Convolutional Neural Networks, Recurrent Neural Networks and understand the state-of-the-art in hardware and software architectures. Functional Demos will be presented in Keras, a popular Python package with a backend in Theano and Tensorflow.
After explaining what problem Reactive Programming solves I will give an introduction to one implementation: RxJava. I show how to compose Observable without concurrency first and then with Scheduler. I finish the talk by showing examples of flow control and draw backs.
Inspired from https://www.infoq.com/presentations/rxjava-reactor and https://www.infoq.com/presentations/rx-service-architecture
Code: https://github.com/toff63/Sandbox/tree/master/java/rsjug-rx/rsjug-rx/src/main/java/rs/jug/rx
Small introduction to FPGA acceleration and the impact of the new High Level Synthesis toolchains to their programmability
Video here: https://www.linkedin.com/posts/marcobarbone_can-my-application-benefit-from-fpga-acceleration-activity-6848674747375460352-0fua
Pragmatic Optimization in Modern Programming - Ordering Optimization ApproachesMarina Kolpakova
The slides give an idea about how to look pragmatically at software optimization and order optimization approaches according to this pragmatic point of view
Java In-Process Caching - Performance, Progress and PitfallsJens Wilke
How to speed up in-process caching and why you should not use LRU. Comparison of EHCache, Google Guava, Caffeine and cache2k. With benchmarks of throughput and eviction efficiency.
Similar to Java Performance: Speedup your application with hardware counters (20)
Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...Anthony Dahanne
Les Buildpacks existent depuis plus de 10 ans ! D’abord, ils étaient utilisés pour détecter et construire une application avant de la déployer sur certains PaaS. Ensuite, nous avons pu créer des images Docker (OCI) avec leur dernière génération, les Cloud Native Buildpacks (CNCF en incubation). Sont-ils une bonne alternative au Dockerfile ? Que sont les buildpacks Paketo ? Quelles communautés les soutiennent et comment ?
Venez le découvrir lors de cette session ignite
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Check out the webinar slides to learn more about how XfilesPro transforms Salesforce document management by leveraging its world-class applications. For more details, please connect with sales@xfilespro.com
If you want to watch the on-demand webinar, please click here: https://www.xfilespro.com/webinars/salesforce-document-management-2-0-smarter-faster-better/
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Globus Compute wth IRI Workflows - GlobusWorld 2024Globus
As part of the DOE Integrated Research Infrastructure (IRI) program, NERSC at Lawrence Berkeley National Lab and ALCF at Argonne National Lab are working closely with General Atomics on accelerating the computing requirements of the DIII-D experiment. As part of the work the team is investigating ways to speedup the time to solution for many different parts of the DIII-D workflow including how they run jobs on HPC systems. One of these routes is looking at Globus Compute as a way to replace the current method for managing tasks and we describe a brief proof of concept showing how Globus Compute could help to schedule jobs and be a tool to connect compute at different facilities.
Quarkus Hidden and Forbidden ExtensionsMax Andersen
Quarkus has a vast extension ecosystem and is known for its subsonic and subatomic feature set. Some of these features are not as well known, and some extensions are less talked about, but that does not make them less interesting - quite the opposite.
Come join this talk to see some tips and tricks for using Quarkus and some of the lesser known features, extensions and development techniques.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
Unleash Unlimited Potential with One-Time Purchase
BoxLang is more than just a language; it's a community. By choosing a Visionary License, you're not just investing in your success, you're actively contributing to the ongoing development and support of BoxLang.
Software Engineering, Software Consulting, Tech Lead.
Spring Boot, Spring Cloud, Spring Core, Spring JDBC, Spring Security,
Spring Transaction, Spring MVC,
Log4j, REST/SOAP WEB-SERVICES.
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
The Earth System Grid Federation (ESGF) is a global network of data servers that archives and distributes the planet’s largest collection of Earth system model output for thousands of climate and environmental scientists worldwide. Many of these petabyte-scale data archives are located in proximity to large high-performance computing (HPC) or cloud computing resources, but the primary workflow for data users consists of transferring data, and applying computations on a different system. As a part of the ESGF 2.0 US project (funded by the United States Department of Energy Office of Science), we developed pre-defined data workflows, which can be run on-demand, capable of applying many data reduction and data analysis to the large ESGF data archives, transferring only the resultant analysis (ex. visualizations, smaller data files). In this talk, we will showcase a few of these workflows, highlighting how Globus Flows can be used for petabyte-scale climate analysis.
Large Language Models and the End of ProgrammingMatt Welsh
Talk by Matt Welsh at Craft Conference 2024 on the impact that Large Language Models will have on the future of software development. In this talk, I discuss the ways in which LLMs will impact the software industry, from replacing human software developers with AI, to replacing conventional software with models that perform reasoning, computation, and problem-solving.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
2. @kuksenk0#JavaPerfAndHWC
The following is intended to outline our general product
direction. It is intended for information purposes only, and may
not be incorporated into any contract. It is not a commitment
to deliver any material, code, or functionality, and should not be
relied upon in making purchasing decisions. The development,
release, and timing of any features or functionality described for
Oracle’s products remains at the sole discretion of Oracle.
2/64
4. @kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
4/64
5. @kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j];
}
R[i][j] = s;
}
}
return result;
}
𝑂( 𝑁3)⋆
⋆
big-O
4/64
10. @kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
L1-load-misses
9/64
11. @kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiply(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int j = 0; j < size; j++) {
int s = 0;
for (int k = 0; k < size; k++) {
s += A[i][k] * B[k][j] ;
}
R[i][j] = s;
}
}
return result;
}
9/64
12. @kuksenk0#JavaPerfAndHWC
Example 1: Very Hot Code
public Matrix multiplyIKJ(Matrix other) {
int size = data.length;
Matrix result = new Matrix(size);
int [][] R = result.data;
int [][] A = this.data;
int [][] B = other.data;
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
int aik = A[i][k];
for (int j = 0; j < size; j++) {
R[i][j] += aik * B[k][j];
}
}
}
return result;
}
10/64
22. @kuksenk0#JavaPerfAndHWC
Three magic questions
What? ⇒ Where? ⇒ How?
∙ What prevents my application to work faster?
(monitoring)
∙ Where does it hide?
(profiling)
∙ How to stop it messing with performance?
(tuning/optimizing)
15/64
36. @kuksenk0#JavaPerfAndHWC
CPU Utilization
∙ What does ∼100% CPU Utilization mean?
– OS has enough tasks to schedule
Can profiling help?
Profiler
⋆
shows «WHERE» application time is spent,
but there’s no answer to the question «WHY».
⋆
traditional profiler
20/64
37. @kuksenk0#JavaPerfAndHWC
Who/What is to blame?
Complex CPU microarchitecture:
∙ Inefficient algorithm ⇒ 100% CPU
∙ Pipeline stall due to memory load/stores ⇒ 100% CPU
∙ Pipeline flush due to mispredicted branch ⇒ 100% CPU
∙ Expensive instructions ⇒ 100% CPU
∙ Insufficient ILP
⋆
⇒ 100% CPU
∙ etc. ⇒ 100% CPU
⋆
Instruction Level Parallelism
21/64
40. @kuksenk0#JavaPerfAndHWC
PMU: Performance Monitoring Unit
Performance Monitoring Unit - profiles hardware activity,
built into CPU.
PMU Internals (in less than 21 seconds) :
Hardware counters (HWC) count hardware performance events
(performance monitoring events)
23/64
42. @kuksenk0#JavaPerfAndHWC
Events: Issues
∙ Hundreds of events
∙ Microarchitectural experience is required
∙ Platform dependent
– vary from CPU vendor to vendor
– may vary when CPU manufacturer introduces new
microarchitecture
∙ How to work with HWC?
25/64
43. @kuksenk0#JavaPerfAndHWC
HWC
HWC modes:
∙ Counting mode
– if (event_happened) ++counter;
– general monitoring (answering «WHAT?» )
∙ Sampling mode
– if (event_happened)
if (++counter < threshold) INTERRUPT;
– profiling (answering «WHERE?»)
26/64
44. @kuksenk0#JavaPerfAndHWC
HWC: Issues
So many events, so few counters
e.g. “Nehalem”:
– over 900 events
– 7 HWC (3 fixed + 4 programmable)
∙ «multiple-running» (different events)
– Repeatability
∙ «multiplexing» (only a few tools are able to do that)
– Steady state
27/64
68. @kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
∙ a memory cache that stores recent translations of virtual
memory addresses to physical addresses
∙ each memory access → access to TLB
∙ TLB miss may take hundreds cycles
46/64
69. @kuksenk0#JavaPerfAndHWC
Example 2: dTLB misses
TLB = Translation Lookaside Buffer
∙ a memory cache that stores recent translations of virtual
memory addresses to physical addresses
∙ each memory access → access to TLB
∙ TLB miss may take hundreds cycles
How to check?
∙ dtlb_load_misses_miss_causes_a_walk
∙ dtlb_load_misses_walk_duration
46/64
75. @kuksenk0#JavaPerfAndHWC
Example 3: False Sharing
@State(Scope.Group)
public static class StateBaseline {
int field0;
int field1;
}
@Benchmark
@Group("baseline")
public int justDoIt(StateBaseline s) {
return s.field0 ++;
}
@Benchmark
@Group("baseline")
public int doItFromOtherThread(StateBaseline s) {
return s.field1 ++;
}
52/64
78. @kuksenk0#JavaPerfAndHWC
Example 3: «on a core and a prayer»
in case of the single core(HT) we have to look into
MACHINE_CLEARS.MEMORY_ORDERING
Same Core Diff Cores Diff Sockets
sharing padded sharing padded sharing padded
CPI 1.3 0.7 1.4 0.4 1.7 0.4
cycles 33130 17536 36012 9163 46484 9608
instructions 26418 25865 26550 25747 26717 25768
CLEARS 238 ≈0 ≈0 ≈0 ≈0 ≈0
55/64
84. @kuksenk0#JavaPerfAndHWC
Summary: Performance is easy
To achieve high performance:
∙ You have to know your Application!
∙ You have to know your Frameworks!
∙ You have to know your Virtual Machine!
∙ You have to know your Operating System!
∙ You have to know your Hardware!
61/64
85. @kuksenk0#JavaPerfAndHWC
Summary: «Here be dragons»
To achieve high performance:
∙ You have to know your Application!
∙ You have to know your Frameworks!
∙ You have to know your Virtual Machine!
∙ You have to know your Operating System!
∙ You have to know your Hardware!
61/64
86. @kuksenk0#JavaPerfAndHWC
Enlarge your knowledge with these simple tricks!
Reading list:
∙ “Computer Architecture: A Quantitative Approach”
John L. Hennessy, David A. Patterson
∙ CPU vendors documentation
∙ http://www.agner.org/optimize/
∙ http://www.google.com/search?q=Hardware+
performance+counter
∙ etc. . .
62/64