The document discusses challenges with existing memory managers and proposes solutions. Current memory managers are inadequate for high-performance applications on modern multicore architectures as they limit scalability and performance. The talk introduces the Heap Layers framework for building customizable memory managers. It also describes Hoard, a provably scalable memory manager that bounds local memory consumption by explicitly tracking utilization and moving free memory to a global heap. Finally, an extended memory manager called Reap is proposed for server applications.
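The mechanism behind Hoard's bound is easy to sketch. The following toy model is illustrative only: the threshold, superblock size, and class names are invented for this sketch and are not Hoard's real code. It shows the key invariant: when a per-thread heap's utilization falls below an emptiness threshold, spare superblocks move to a shared global heap where other threads can reclaim them.

```python
# Toy model of Hoard's emptiness invariant; the threshold, superblock
# size, and class names are invented for illustration, not taken from
# Hoard's actual implementation.

EMPTY_FRACTION = 0.25   # hypothetical emptiness threshold
SUPERBLOCK = 4096       # bytes per superblock (illustrative)

class ThreadHeap:
    def __init__(self, global_free):
        self.global_free = global_free  # shared list of free superblocks
        self.superblocks = 0            # superblocks owned by this heap
        self.in_use = 0                 # live bytes on this heap

    def allocate(self, size):
        # Grow until the request fits, recycling superblocks from the
        # global heap before asking the OS for fresh memory.
        while self.in_use + size > self.superblocks * SUPERBLOCK:
            if self.global_free:
                self.global_free.pop()
            self.superblocks += 1
        self.in_use += size

    def free(self, size):
        self.in_use -= size
        # The key invariant: if utilization drops below the threshold,
        # return superblocks to the global heap so other threads can
        # reuse them; this is what bounds per-thread memory blowup.
        while (self.superblocks > 1 and
               self.in_use < EMPTY_FRACTION * self.superblocks * SUPERBLOCK):
            self.superblocks -= 1
            self.global_free.append(SUPERBLOCK)
```

Because freed capacity flows back to the global heap instead of sitting idle per thread, total memory stays within a constant factor of what the program actually uses.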
Hoard: A Scalable Memory Allocator for Multithreaded Applications (Emery Berger)
Fast and effective memory management is crucial for many applications, including web servers, database managers, and scientific codes. However, current memory managers do not provide adequate support for these applications on modern architectures, severely limiting their performance, scalability, and robustness.
In this talk, I describe how to design memory managers that support high-performance applications. I first address the software engineering challenges of building efficient memory managers. I then show how current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. I describe a fast, provably scalable general-purpose memory manager called Hoard (available at www.hoard.org) that solves these problems, improving performance by up to a factor of 60.
The document discusses best practices for using Oracle Database In-Memory. It provides an overview of In-Memory and describes how to configure and populate the In-Memory Column Store. It also discusses how the optimizer utilizes In-Memory statistics and hints to optimize queries for In-Memory. Several examples of queries that benefit from In-Memory, such as aggregation queries and queries with predicates, are also provided.
Operating System
Topic: Memory Management
for B.Tech/B.Sc (C.S.)/BCA...
Memory management is the operating-system function that handles primary memory. It keeps track of every memory location, whether allocated to some process or free; decides how much memory to allocate to each process and when; and updates its records whenever memory is allocated or freed.
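The bookkeeping described above can be pictured as a simple allocation table. In this toy sketch (the frame count and function names are invented for illustration), the OS records, for each frame of primary memory, whether it is free or which process owns it, and updates the table on allocation and release.

```python
# Toy allocation table: one entry per frame of primary memory.
# None means the frame is free; otherwise the entry names the owner.

FRAMES = 8  # illustrative memory size, in frames

table = [None] * FRAMES

def allocate(pid, n):
    """Give process `pid` any `n` free frames; return their indices."""
    free = [i for i, owner in enumerate(table) if owner is None]
    if len(free) < n:
        raise MemoryError("not enough free frames")
    for i in free[:n]:
        table[i] = pid
    return free[:n]

def release(pid):
    """Free every frame owned by `pid` and update the table."""
    for i, owner in enumerate(table):
        if owner == pid:
            table[i] = None
```

Real memory managers keep richer structures (free lists, bitmaps, page tables), but the core duty is the same: the table must always reflect which memory is allocated to whom.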
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
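The execution/storage arbitration described above can be sketched as a unified pool in which execution may evict cached storage to make room, while storage never displaces memory that execution has claimed. This is a simplified illustration of the policy, not Spark's actual UnifiedMemoryManager; the class and method names are invented for the sketch.

```python
# Simplified model of unified execution/storage memory arbitration.
# Not Spark's real implementation; units are abstract.

class UnifiedPool:
    def __init__(self, total):
        self.total = total
        self.execution = 0
        self.storage = 0

    def acquire_execution(self, n):
        free = self.total - self.execution - self.storage
        if n > free:
            # Execution may evict cached blocks to make room.
            evict = min(self.storage, n - free)
            self.storage -= evict
        if self.execution + self.storage + n <= self.total:
            self.execution += n
            return True
        return False

    def acquire_storage(self, n):
        # Storage may only use what execution has not claimed;
        # it never forces execution memory out.
        if self.execution + self.storage + n <= self.total:
            self.storage += n
            return True
        return False
```

The asymmetry is the interesting design choice: evicting a cached block merely costs a recomputation later, while reclaiming execution memory mid-task could fail the task.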
A glance at memory management in operating systems.
This note is useful for those keen to know how the OS works, with brief explanations of several terms such as:
- paging
- segmentation
- fragmentation
- virtual memory
- page tables
For A Level (A2) Computing students, this light note may be helpful for revision.
The document discusses different memory management strategies:
- Swapping allows processes to be swapped temporarily out of memory to disk, then back into memory for continued execution. This improves memory utilization but incurs long swap times.
- Contiguous memory allocation allocates processes into contiguous regions of physical memory using techniques like memory mapping and dynamic storage allocation with first-fit or best-fit. This can cause external and internal fragmentation over time.
- Paging permits the physical memory used by a process to be noncontiguous by dividing memory into pages and mapping virtual addresses to physical frames, allowing more efficient use of memory but requiring page tables for translation.
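The placement policies named above for contiguous allocation can be made concrete. This sketch models the free list as (start, size) holes and contrasts first-fit, which takes the first hole that is large enough, with best-fit, which takes the smallest hole that is large enough.

```python
# First-fit vs. best-fit placement over a free list of (start, size)
# holes, as used in contiguous memory allocation.

def first_fit(holes, size):
    """Return the start of the first hole large enough, or None."""
    for start, length in holes:
        if length >= size:
            return start
    return None

def best_fit(holes, size):
    """Return the start of the smallest hole large enough, or None."""
    fits = [(length, start) for start, length in holes if length >= size]
    return min(fits)[1] if fits else None
```

For example, with holes at (0, 100), (200, 30), and (300, 50), a 25-unit request lands at 0 under first-fit but at 200 under best-fit, which leaves the large hole intact at the cost of creating a small 5-unit fragment.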
Operating Systems and Memory Management (guest1415ae65)
The document discusses operating systems and how they manage hardware, software, memory and processes. It defines key concepts like physical memory, virtual memory, paging, swapping and buffers. It also categorizes different types of operating systems like real-time OS, single-user OS, multi-user OS and discusses how they schedule processes and allocate system resources.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27, 2018.
This document summarizes a lecture on processes and threads. It discusses the key differences between processes and threads, including that processes have separate address spaces while threads share an address space. It covers process and thread APIs, examples of using processes and threads, interprocess communication techniques like pipes and sockets, and considerations for when to use processes versus threads, such as that threads have lower overhead but processes are more robust.
This document contains lecture slides about operating system architecture from Emery Berger at the University of Massachusetts Amherst. The slides cover topics like the memory hierarchy including registers, caches, locality, and quantifying locality through hit curves. They also discuss important CPU internals like pipelining, branch prediction, and superscalar architectures.
WorkflowSim is a toolkit for simulating scientific workflows in distributed environments. It models workflow overhead, failures, and the hierarchical nature of workflows with tasks and jobs. WorkflowSim extends CloudSim to be workflow-aware and supports modeling diverse overhead distributions, failure models, and fault tolerant techniques like reclustering and job retry. It helps researchers evaluate workflow optimization techniques more accurately. Validation experiments show WorkflowSim can accurately simulate overhead and failures and their impact on workflow scheduling heuristics and fault tolerant clustering approaches.
Slides prepared for a lab-internal reading group, covering the book below:
Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, "Operating Systems: Three Easy Pieces"
http://pages.cs.wisc.edu/~remzi/OSTEP/
This document discusses the Maatkit toolkit and how it can be used to simplify various MySQL administration tasks. Some key capabilities and tools covered include mk-archiver for efficiently archiving and purging data, mk-table-checksum for checking replication consistency, and mk-query-digest (formerly mk-log-parser) for analyzing query logs and performance. The speaker advocates that Maatkit tools can help avoid complex custom coding by providing robust solutions for common problems like archiving, replication monitoring, and query analysis.
Operating Systems - Distributed Parallel Computing (Emery Berger)
The document discusses distributed parallel programming and message passing. It begins with an introduction to distributed memory machines and message passing as a programming model. It then covers the Message Passing Interface (MPI) library for message passing and provides an example MPI program that prints "Hello world" from multiple processes. The document also discusses sending and receiving messages directly between processes.
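The MPI program itself is not reproduced in this summary. As a rough stand-in, the same share-nothing, message-passing model can be sketched with Python's standard multiprocessing module (this is not MPI; process ranks and channels here only mimic the structure of the "Hello world" example described above).

```python
# Message-passing sketch: processes share no memory and communicate
# only by sending and receiving messages over explicit channels.
# This mimics the structure of an MPI hello-world, using the Python
# standard library instead of MPI.

from multiprocessing import Process, Pipe

def worker(rank, conn):
    # Each process sends a greeting back to the parent over its channel.
    conn.send(f"Hello world from process {rank}")
    conn.close()

if __name__ == "__main__":
    procs = []
    for rank in range(4):
        parent_end, child_end = Pipe()
        p = Process(target=worker, args=(rank, child_end))
        p.start()
        procs.append((p, parent_end))
    for p, conn in procs:
        print(conn.recv())  # receive one message from each process
        p.join()
```

In real MPI the channels are implicit (MPI_Send/MPI_Recv address peers by rank within a communicator), but the programming model is the same: no shared state, only messages.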
This document discusses containers and virtual machines. It explains that containers provide a lightweight virtualization method that isolates applications but shares the host operating system kernel. Containers use resource isolation features like cgroups and namespaces to limit CPU, memory, storage, and networking usage. In contrast, virtual machines run their own full operating system and provide stronger isolation but are more resource intensive.
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a... (Spark Summit)
Mesos is an open source cluster manager that improves resource utilization. It allows Spark Streaming jobs to leverage Mesos fault tolerance features like driver supervision using Marathon. Backpressure is also supported in Spark Streaming to prevent scheduling delays from fast data arrival. Reactive Streams provide more direct backpressure control and are expected in future Spark versions.
Have you recently started working with Spark, and do your jobs take forever to finish? This presentation is for you.
Himanshu Arora and Nitya Nand Yadav have gathered many best practices, optimizations, and tunings that they have applied in production over the years to make their jobs faster and less resource-hungry.
In this presentation, they cover advanced Spark optimization techniques, data serialization formats, storage formats, hardware optimizations, control over parallelism, resource manager settings, better data locality, GC tuning, and more.
They also show the appropriate use of RDD, DataFrame, and Dataset in order to benefit fully from Spark's internal optimizations.
Scaling Deep Learning Algorithms on Extreme Scale Architectures (inside-BigData.com)
This document summarizes a presentation on scaling deep learning algorithms on extreme scale architectures. It discusses challenges in using deep learning, a vision for machine/deep learning R&D including novel algorithms, and the MaTEx toolkit which supports distributed deep learning on GPU and CPU clusters. Sample results show strong and weak scaling of asynchronous gradient descent on Summit. Fault tolerance needs and the impact of deep learning on other domains are also covered.
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ... (Gianmario Spacagna)
Abstract:
Legacy enterprise architectures still rely on relational data warehouses and require moving and syncing data with the so-called "Data Lake", where raw data is stored and periodically ingested into a distributed file system such as HDFS.
Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, such as for regulations or reducing latency, in which case Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.
We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.
In this talk we will:
* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet
* Explain the iterative exploratory process and advantages of adopting functional programming
* Make a crucial analysis on the issues faced with the existing methodology
* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds
* Discuss some future improvements to the overall architecture
Bio:
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications.
His main expertise is on building production-oriented machine learning systems.
Co-author of the Professional Manifesto for Data Science (datasciencemanifesto.com), founder of the Data Science Milan Meetup group, and currently writing the book "Python Deep Learning" (to be published soon).
He loves evangelising his passion for best practices and effective methodologies amongst the community.
Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
Quick, what do memcache, MogileFS, and Gearman have in common? They are scalable, distributed technologies, and they can also interface with PHP, your ubiquitous web development language. Digg uses all 3 (and a few more) in its quest for social news domination, and this presentation shares what we’ve learned about them and how they are best utilized with PHP.
The document provides an overview and introduction to crash dump analysis on Nexenta systems. It discusses core dumps, crash dumps, the panic process, and basic crash dump analysis using mdb. Key topics include process and thread terminology, interrupts and traps, hangs vs crashes vs panics, forensic data sources like console logs and crash dumps, and C language basics relevant to crash analysis like data types and functions. Examples of panic strings, stack traces, and thread lists from crash dumps are also provided, as well as guidance on determining if an issue is hardware, firmware, or software-related.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how best to configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
This document provides guidance on tuning MySQL for optimal performance. It discusses adjusting various configuration settings related to I/O, memory allocation, query caching, and InnoDB. Tuning I/O, queries, maintenance tasks, and configuration settings is recommended to maximize speed within the constraints of other services. Transaction logs and temporary file storage particularly impact performance, as they are heavy consumers of I/O.
Here are some ways to optimize the code:
1. Use strtr() instead of preg_replace() since it avoids the overhead of regular expressions.
2. Define the replacement array outside the loop to avoid redefining it on each iteration.
3. Use direct string concatenation instead of sprintf() for better performance.
4. Avoid function calls inside the loop like sizeof(). Define the length before the loop for better performance.
5. Consider using string replacement/manipulation functions like str_replace() instead of redefining/reconcatenating strings on each loop iteration.
So in summary, the optimized code would be along these lines (completing the truncated snippet; the replacement map follows points 1 and 2 above):

$rep = ['-' => '*', '.' => '*'];
$out = strtr($str, $rep);
H2O Design and Infrastructure with Matt Dowle (Sri Ambati)
This document provides an overview of H2O, an open source machine learning platform that allows for distributed, in-memory analytics of large datasets. It discusses how H2O works, including how it uses a map-reduce style to parallelize machine learning algorithms across multiple nodes. The document demonstrates starting an 8-node H2O cluster on Amazon EC2 and importing a 23GB dataset in under a minute, significantly faster than with other tools. It also summarizes how H2O's distributed fork-join framework executes tasks across nodes and shares data through its distributed data structures.
Doppio: Breaking the Browser Language Barrier (Emery Berger)
Web browsers have become a de facto universal operating system, and JavaScript its instruction set. Unfortunately, running other languages in the browser is not generally possible. Translation to JavaScript is not enough because browsers are a hostile environment for other languages. Previous approaches are either non-portable or require extensive modifications for programs to work in a browser.
This talk presents Doppio, a JavaScript-based runtime system that makes it possible to run unaltered applications written in general-purpose languages directly inside the browser. Doppio provides a wide range of runtime services, including a file system that enables local and external (cloud-based) storage, an unmanaged heap, sockets, blocking I/O, and multiple threads. We demonstrate Doppio's usefulness with two case studies: we extend Emscripten with Doppio, letting it run an unmodified C++ application in the browser with full functionality, and present DoppioJVM, an interpreter that runs unmodified JVM programs directly in the browser. While substantially slower than a native JVM, DoppioJVM makes it feasible to directly reuse existing, non-compute-intensive code.
Dthreads is an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. Dthreads enforces determinism in the face of data races and deadlocks. It is easy to use: just link your program with -ldthread instead of -lpthread.
Dthreads can be downloaded from its source code repo on GitHub (https://github.com/plasma-umass/dthreads). A technical paper describing Dthreads appeared at SOSP 2012 (https://github.com/plasma-umass/dthreads/blob/master/doc/dthreads-sosp11.pdf?raw=true).
Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4× slower than pthreads.
This talk presents Dthreads, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. Dthreads enforces determinism in the face of data races and deadlocks. Dthreads works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, Dthreads has the additional benefit of eliminating false sharing. Experimental results show that Dthreads substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks we evaluated, matches and occasionally exceeds the performance of pthreads.
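The core idea, private copies of shared state committed in a fixed thread order, can be illustrated in miniature. This sketch is not Dthreads itself (which isolates real pthreads in processes via copy-on-write memory mappings and virtual-memory write tracking, in C/C++); it only models the deterministic-commit step, with names invented for the sketch.

```python
# Miniature model of deterministic commits: each "thread" runs in its
# own process on a private copy of the shared state, and updates are
# merged in thread order, so the result is independent of scheduling.

from multiprocessing import Process, Queue

def worker(tid, snapshot, results):
    # Mutate a *private* copy of the shared state, never the original.
    private = dict(snapshot)
    private["x"] = private.get("x", 0) + tid
    results.put((tid, private))

def run_deterministic(nthreads):
    shared = {"x": 0}
    results = Queue()
    procs = [Process(target=worker, args=(t, shared, results))
             for t in range(1, nthreads + 1)]
    for p in procs:
        p.start()
    # Drain results before joining to avoid blocking on a full queue.
    updates = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    # Commit each thread's updates in thread order, not arrival order.
    for tid in sorted(updates):
        shared.update(updates[tid])
    return shared
```

However the OS schedules the worker processes, the commit loop applies their updates in the same order every run, which is the property Dthreads enforces for unmodified pthreads programs.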
More Related Content
Similar to Memory Management for High-Performance Applications
This document summarizes a lecture on processes and threads. It discusses the key differences between processes and threads, including that processes have separate address spaces while threads share an address space. It covers process and thread APIs, examples of using processes and threads, interprocess communication techniques like pipes and sockets, and considerations for when to use processes versus threads, such as that threads have lower overhead but processes are more robust.
This document contains lecture slides about operating system architecture from Emery Berger at the University of Massachusetts Amherst. The slides cover topics like the memory hierarchy including registers, caches, locality, and quantifying locality through hit curves. They also discuss important CPU internals like pipelining, branch prediction, and superscalar architectures.
WorkflowSim is a toolkit for simulating scientific workflows in distributed environments. It models workflow overhead, failures, and the hierarchical nature of workflows with tasks and jobs. WorkflowSim extends CloudSim to be workflow-aware and supports modeling diverse overhead distributions, failure models, and fault tolerant techniques like reclustering and job retry. It helps researchers evaluate workflow optimization techniques more accurately. Validation experiments show WorkflowSim can accurately simulate overhead and failures and their impact on workflow scheduling heuristics and fault tolerant clustering approaches.
下記論文を扱った研究室内輪読用の資料です
This is slides for group reading in Lab.
Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau, "Operating Systems: Three Easy Pieces"
http://pages.cs.wisc.edu/~remzi/OSTEP/
This document discusses the Maatkit toolkit and how it can be used to simplify various MySQL administration tasks. Some key capabilities and tools covered include mk-archiver for efficiently archiving and purging data, mk-table-checksum for checking replication consistency, and mk-query-digest (formerly mk-log-parser) for analyzing query logs and performance. The speaker advocates that Maatkit tools can help avoid complex custom coding by providing robust solutions for common problems like archiving, replication monitoring, and query analysis.
Operating Systems - Distributed Parallel ComputingEmery Berger
The document discusses distributed parallel programming and message passing. It begins with an introduction to distributed memory machines and message passing as a programming model. It then covers the Message Passing Interface (MPI) library for message passing and provides an example MPI program that prints "Hello world" from multiple processes. The document also discusses sending and receiving messages directly between processes.
This document discusses containers and virtual machines. It explains that containers provide a lightweight virtualization method that isolates applications but shares the host operating system kernel. Containers use resource isolation features like cgroups and namespaces to limit CPU, memory, storage, and networking usage. In contrast, virtual machines run their own full operating system and provide stronger isolation but are more resource intensive.
Leverage Mesos for running Spark Streaming production jobs by Iulian Dragos a...Spark Summit
Mesos is an open source cluster manager that improves resource utilization. It allows Spark Streaming jobs to leverage Mesos fault tolerance features like driver supervision using Marathon. Backpressure is also supported in Spark Streaming to prevent scheduling delays from fast data arrival. Reactive Streams provide more direct backpressure control and are expected in future Spark versions.
Vous avez récemment commencé à travailler sur Spark et vos jobs prennent une éternité pour se terminer ? Cette présentation est faite pour vous.
Himanshu Arora et Nitya Nand YADAV ont rassemblé de nombreuses bonnes pratiques, optimisations et ajustements qu'ils ont appliqué au fil des années en production pour rendre leurs jobs plus rapides et moins consommateurs de ressources.
Dans cette présentation, ils nous apprennent les techniques avancées d'optimisation de Spark, les formats de sérialisation des données, les formats de stockage, les optimisations hardware, contrôle sur la parallélisme, paramétrages de resource manager, meilleur data localité et l'optimisation du GC etc.
Ils nous font découvrir également l'utilisation appropriée de RDD, DataFrame et Dataset afin de bénéficier pleinement des optimisations internes apportées par Spark.
Scaling Deep Learning Algorithms on Extreme Scale Architecturesinside-BigData.com
This document summarizes a presentation on scaling deep learning algorithms on extreme scale architectures. It discusses challenges in using deep learning, a vision for machine/deep learning R&D including novel algorithms, and the MaTEx toolkit which supports distributed deep learning on GPU and CPU clusters. Sample results show strong and weak scaling of asynchronous gradient descent on Summit. Fault tolerance needs and the impact of deep learning on other domains are also covered.
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...Gianmario Spacagna
Abstract:
Legacy enterprise architectures still rely on relational data warehouse and require moving and syncing with the so-called "Data Lake" where raw data is stored and periodically ingested into a distributed file system such as HDFS.
Moreover, there are a number of use cases where you might want to avoid storing data on the development cluster disks, such as for regulations or reducing latency, in which case Alluxio (previously known as Tachyon) can make this data available in-memory and shared among multiple applications.
We propose an Agile workflow by combining Spark, Scala, DataFrame (and the recent DataSet API), JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop high quality machine learning pipelines that can then be deployed straight into production.
In this talk we will:
* Present how to load raw data from an RDBMS and use Spark to make it available as a DataSet
* Explain the iterative exploratory process and advantages of adopting functional programming
* Make a crucial analysis on the issues faced with the existing methodology
* Show how to deploy Alluxio and how it greatly improved the existing workflow by providing the desired in-memory solution and by decreasing the loading time from hours to seconds
* Discuss some future improvements to the overall architecture
Bio:
Gianmario is a Senior Data Scientist at Pirelli Tyre, processing telemetry data for smart manufacturing and connected vehicles applications.
His main expertise is on building production-oriented machine learning systems.
Co-author of the Professional Manifesto for Data Science (datasciencemanifesto.com), founder of the Data Science Milan Meetup group and currently writing "Python Deep Learning" book (will be published soon).
He loves evangelising his passion for best practices and effective methodologies amongst the community.
Prior to Pirelli, he worked in Financial Services (Barclays), Cyber Security (Cisco) and Predictive Marketing (AgilOne).
Memory Management for High-Performance Applications
1. Memory Management for High-Performance Applications
Emery Berger
University of Massachusetts Amherst
UNIVERSITY OF MASSACHUSETTS AMHERST • Department of Computer Science
2. High-Performance Applications
Web servers, search engines, scientific codes
Written in C or C++
Run on one or a cluster of server boxes
[Figure: server boxes, each with multiple CPUs, RAM, and RAID drives]
Needs support at every level: compiler, runtime system, operating system, hardware
3. New Applications, Old Memory Managers
Applications and hardware have changed:
  Multiprocessors now commonplace
  Object-oriented, multithreaded
  Increased pressure on the memory manager (malloc, free)
But memory managers have not kept up:
  Inadequate support for modern applications
4. Current Memory Managers Limit Scalability
As we add processors, the program slows down
Caused by heap contention
[Figure: runtime performance; speedup vs. number of processors (1-14), ideal vs. actual]
Larson server benchmark on 14-processor Sun
5. The Problem
Current memory managers are inadequate for high-performance applications on modern architectures
They limit scalability & application performance
6. This Talk
Building memory managers: the Heap Layers framework
Problems with current memory managers: contention, false sharing, space
Solution: Hoard, a provably scalable memory manager
Extended memory manager for servers: Reap
7. Implementing Memory Managers
Memory managers must be space efficient and very fast
In practice: heavily-optimized C code
  Hand-unrolled loops
  Macros
  Monolithic functions
Hard to write, reuse, or extend
8. Real Code: DLmalloc 2.7.2

#define chunksize(p)         ((p)->size & ~(SIZE_BITS))
#define next_chunk(p)        ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))
#define prev_chunk(p)        ((mchunkptr)(((char*)(p)) - ((p)->prev_size)))
#define chunk_at_offset(p,s) ((mchunkptr)(((char*)(p)) + (s)))
#define inuse(p) \
  ((((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size) & PREV_INUSE)
#define set_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size |= PREV_INUSE
#define clear_inuse(p) \
  ((mchunkptr)(((char*)(p)) + ((p)->size & ~PREV_INUSE)))->size &= ~(PREV_INUSE)
#define inuse_bit_at_offset(p,s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size & PREV_INUSE)
#define set_inuse_bit_at_offset(p,s) \
  (((mchunkptr)(((char*)(p)) + (s)))->size |= PREV_INUSE)
#define MALLOC_ZERO(charp, nbytes)                                 \
do {                                                               \
  INTERNAL_SIZE_T* mzp = (INTERNAL_SIZE_T*)(charp);                \
  CHUNK_SIZE_T mctmp = (nbytes)/sizeof(INTERNAL_SIZE_T);           \
  long mcn;                                                        \
  if (mctmp < 8) mcn = 0; else { mcn = (mctmp-1)/8; mctmp %= 8; }  \
  switch (mctmp) {                                                 \
    case 0: for(;;) { *mzp++ = 0;                                  \
    case 7:           *mzp++ = 0;                                  \
    case 6:           *mzp++ = 0;                                  \
    case 5:           *mzp++ = 0;                                  \
    case 4:           *mzp++ = 0;                                  \
    case 3:           *mzp++ = 0;                                  \
    case 2:           *mzp++ = 0;                                  \
    case 1:           *mzp++ = 0; if (mcn <= 0) break; mcn--; }    \
  }                                                                \
} while(0)
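The MALLOC_ZERO macro above is a hand-unrolled zeroing loop (a Duff's device). As a standalone sketch of the same structure (the function name zero_words is mine, not dlmalloc's, and the zero-length guard is added for safety):

```cpp
#include <cassert>
#include <cstddef>

// 8-way unrolled word-zeroing loop, mirroring dlmalloc's MALLOC_ZERO.
// The case labels fall through inside the for loop (a Duff's device):
// the switch jumps into the middle of the unrolled body to handle the
// nwords % 8 remainder, then the loop zeroes 8 words per iteration.
void zero_words(size_t* mzp, size_t nwords) {
    if (nwords == 0) return;  // guard added; dlmalloc never passes 0
    size_t mctmp = nwords;
    long mcn;
    if (mctmp < 8) mcn = 0; else { mcn = (mctmp - 1) / 8; mctmp %= 8; }
    switch (mctmp) {
        case 0: for (;;) { *mzp++ = 0;
        case 7:            *mzp++ = 0;
        case 6:            *mzp++ = 0;
        case 5:            *mzp++ = 0;
        case 4:            *mzp++ = 0;
        case 3:            *mzp++ = 0;
        case 2:            *mzp++ = 0;
        case 1:            *mzp++ = 0; if (mcn <= 0) break; mcn--; }
    }
}
```

Exactly nwords words are written; a word just past the range stays untouched. The point of the slide stands: this style is fast, but clearly hard to read, reuse, or extend.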
9. Programming Language Support
Classes: overhead, rigid hierarchy
Mixins: no overhead, flexible hierarchy
Sounds great...
10. A Heap Layer
A C++ mixin with malloc & free methods:

template <class SuperHeap>
class GreenHeapLayer :
  public SuperHeap {…};

[Diagram: GreenHeapLayer stacked on top of RedHeapLayer]
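To make the mixin pattern concrete, here is a minimal sketch (the names MallocHeap and CountingHeapLayer are illustrative, not part of Heap Layers itself): a base heap that forwards to the system allocator, and a layer that adds live-object counting while delegating everything else to its SuperHeap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Base heap: forwards to the C allocator (illustrative name).
class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* ptr)    { std::free(ptr); }
};

// A heap layer: a mixin that adds one behavior (counting live
// objects) and delegates the actual work to its superheap.
template <class SuperHeap>
class CountingHeapLayer : public SuperHeap {
public:
    void* malloc(size_t sz) { ++live_; return SuperHeap::malloc(sz); }
    void free(void* ptr)    { --live_; SuperHeap::free(ptr); }
    int liveObjects() const { return live_; }
private:
    int live_ = 0;
};
```

Composition is just template instantiation, e.g. CountingHeapLayer<MallocHeap>; every call resolves statically, so the layer adds no virtual-dispatch overhead, and layers can be reordered freely.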
11. Example: Thread-Safe Heap Layer
LockedHeap: protect the superheap with a lock
LockedMallocHeap = LockedHeap layered over mallocHeap
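A sketch of the LockedHeap idea, assuming std::mutex rather than whatever lock type the original library used (MallocHeap is again an illustrative base heap):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

class MallocHeap {
public:
    void* malloc(size_t sz) { return std::malloc(sz); }
    void free(void* ptr)    { std::free(ptr); }
};

// Thread-safe heap layer: every call into the superheap is
// protected by a single lock, as on the slide.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
    void* malloc(size_t sz) {
        std::lock_guard<std::mutex> guard(lock_);
        return SuperHeap::malloc(sz);
    }
    void free(void* ptr) {
        std::lock_guard<std::mutex> guard(lock_);
        SuperHeap::free(ptr);
    }
private:
    std::mutex lock_;
};

using LockedMallocHeap = LockedHeap<MallocHeap>;
```

Concurrent callers are now safe, but the single lock serializes all allocation: exactly the heap contention the earlier scalability slide measured.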
12. Empirical Results
Heap Layers vs. the originals:
  KingsleyHeap vs. the BSD allocator
  LeaHeap vs. DLmalloc 2.7
Competitive runtime and memory efficiency
[Figures: runtime and space for Kingsley, KingsleyHeap, Lea, and LeaHeap, normalized to the Lea allocator, on cfrac, espresso, lindsay, LRUsim, perl, roboop, and their average]
13. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
14. Problems with General-Purpose
Memory Managers
Previous work for multiprocessors
Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]
Impractical
Multiple heaps [Larson 98, Gloger 99]
Reduce contention but cause other problems, as we show:
  P-fold or even unbounded increase in space
  Allocator-induced false sharing
15. Multiple Heap Allocator:
Pure Private Heaps
One heap per processor:
  malloc gets memory from its local heap
  free puts memory on its local heap
[Diagram key: blocks are either in use by processor 0 or free on heap 1.]
Example (processor 0 | processor 1):
  x1 = malloc(1)   x2 = malloc(1)
  free(x1)         free(x2)
  x3 = malloc(1)   x4 = malloc(1)
  free(x3)         free(x4)
Used by STL, Cilk, ad hoc allocators
17. Multiple Heap Allocator:
Private Heaps with Ownership
free returns memory to the original (owning) heap
Bounded memory consumption: no crash!
Example: x1 = malloc(1); free(x1); x2 = malloc(1); free(x2)
Used by “Ptmalloc” (Linux), LKmalloc
18. Problem:
P-fold Memory Blowup
Occurs in practice: round-robin producer-consumer
  processor i mod P allocates; processor (i+1) mod P frees
Example (P = 3, processors 0-2): x1 = malloc(1); free(x1); x2 = malloc(1); free(x2); x3 = malloc(1); free(x3)
Footprint = 1 (2GB), but space = 3 (6GB)
Exceeds 32-bit address space: crash!
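The arithmetic behind the blowup can be checked with a toy bookkeeping model (my own construction, not allocator code): each private heap records how much memory it has acquired from the OS and how much is currently free on it; because frees return memory to the owning heap, one round-robin pass leaves every one of the P heaps holding one object's worth of memory.

```cpp
#include <cstddef>
#include <vector>

// Model of "private heaps with ownership": capacity is memory taken
// from the OS, freeBytes is memory sitting free on that heap.
struct Heap { std::size_t capacity = 0, freeBytes = 0; };

struct PrivateHeapsWithOwnership {
  std::vector<Heap> heaps;
  explicit PrivateHeapsWithOwnership(int P) : heaps(P) {}

  // malloc on processor p: reuse free memory on heap p, else grow it.
  void malloc(int p, std::size_t sz) {
    Heap& h = heaps[p];
    if (h.freeBytes >= sz) h.freeBytes -= sz;
    else h.capacity += sz;               // get more from the OS
  }

  // free returns memory to the heap that allocated it (ownership).
  void free(int owner, std::size_t sz) { heaps[owner].freeBytes += sz; }

  std::size_t totalCapacity() const {
    std::size_t t = 0;
    for (const Heap& h : heaps) t += h.capacity;
    return t;
  }
};
```

With P = 3 and 1-unit objects, one pass leaves total capacity 3 while at most 1 unit is ever live, matching the slide's footprint-vs-space gap.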
19. Problem:
Allocator-Induced False Sharing
False sharing: non-shared objects on the same cache line
  The bane of parallel applications; extensively studied
All these allocators cause false sharing!
Example: processor 0 runs x1 = malloc(1) while processor 1 runs x2 = malloc(1); the two objects land on one cache line, which ping-pongs across the bus as both CPUs write: thrash… thrash…
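One way to see whether an allocator has co-located two objects is to compare their cache-line indices; a small helper (my own, with an assumed 64-byte line size, which varies by CPU):

```cpp
#include <cstdint>

// Two addresses share a cache line iff they fall into the same
// line-sized bucket of the address space.
bool onSameCacheLine(const void* a, const void* b,
                     std::uintptr_t lineSize = 64) {
  return reinterpret_cast<std::uintptr_t>(a) / lineSize ==
         reinterpret_cast<std::uintptr_t>(b) / lineSize;
}
```

With a sequential allocator, two back-to-back malloc(1) calls from different threads often return addresses for which this is true; writes from the two CPUs then invalidate each other's cached copy of that line even though no datum is shared.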
20. So What Do We Do Now?
Where do we put free memory?
  on a central heap: heap contention
  on our own heap (pure private heaps): unbounded memory consumption
  on the original heap (private heaps with ownership): P-fold blowup
How do we avoid false sharing?
21. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
22. Hoard: Key Insights
Bound local memory consumption
Explicitly track utilization
Move free memory to a global heap
Provably bounds memory consumption
Manage memory in large chunks
Avoids false sharing
Reduces heap contention
23. Overview of Hoard
[Diagram: per-processor heaps 0 … P-1 below a shared global heap.]
Manage memory in page-sized heap blocks
  Avoids false sharing
Allocate from a local, per-processor heap block
  Avoids heap contention
On low utilization, move a heap block to the global heap
  Avoids space blowup
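The "low utilization, move to the global heap" rule can be sketched schematically; this is an illustrative simplification of the idea (the threshold value and bookkeeping here are my own, not Hoard's actual invariant):

```cpp
#include <cstddef>
#include <vector>

// A heap block tracks how many of its bytes are in use.
struct HeapBlock { std::size_t inUse = 0, size = 4096; };

struct LocalHeap {
  std::vector<HeapBlock> blocks;
  std::vector<HeapBlock>* globalHeap = nullptr;  // shared; lock elided here
  static constexpr double kEmptyFraction = 0.25; // illustrative threshold

  // On free, utilization may drop low enough to release the block.
  void onFree(std::size_t blockIdx, std::size_t sz) {
    blocks[blockIdx].inUse -= sz;
    maybeRelease(blockIdx);
  }

  // Release a mostly-empty block so other processors can reuse it,
  // bounding how much free memory this processor can hoard.
  void maybeRelease(std::size_t i) {
    double util = double(blocks[i].inUse) / blocks[i].size;
    if (util < kEmptyFraction && blocks.size() > 1) {
      globalHeap->push_back(blocks[i]);
      blocks.erase(blocks.begin() + static_cast<std::ptrdiff_t>(i));
    }
  }
};
```

Because freed memory flows back to the global heap instead of accumulating locally, the producer-consumer blowup of the earlier slides cannot occur.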
24. Summary of Analytical Results
Space consumption: near-optimal worst case
  Hoard: O(n log M/m + P), with P « n
  Optimal: O(n log M/m) [Robson 70]
  Private heaps with ownership: O(P n log M/m)
where n = memory required, M = biggest object size, m = smallest object size, P = processors
Provably low synchronization
25. Empirical Results
Measure runtime on 14-processor Sun
Allocators
Solaris (system allocator)
Ptmalloc (GNU libc)
mtmalloc (Sun’s “MT-hot” allocator)
Micro-benchmarks
Threadtest: no sharing
Larson: sharing (server-style)
Cache-scratch: mostly reads & writes
(tests for false sharing)
Real application experience similar
26. Runtime Performance:
threadtest
Many
threads,
no sharing
Hoard
achieves
linear
speedup
speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)
27. Runtime Performance:
Larson
Many
threads,
sharing
(server-style)
Hoard
achieves
linear
speedup
28. Runtime Performance:
false sharing
Many
threads,
mostly reads
& writes of
heap data
Hoard
achieves
linear
speedup
29. Hoard in the “Real World”
Open source code
www.hoard.org
13,000 downloads
Solaris, Linux, Windows, IRIX, …
Widely used in industry
AOL, British Telecom, Novell, Philips
Reports: 2x-10x, “impressive” improvement in performance
Search server, telecom billing systems, scene rendering,
real-time messaging middleware, text-to-speech engine,
telephony, JVM
Scalable general-purpose memory manager
30. Overview
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Extended memory manager for servers
Reap
31. Custom Memory Allocation
Replace new/delete, bypassing the general-purpose allocator
Very common practice: Apache, gcc, lcc, STL, database servers…
Language-level support in C++: “use custom allocators”
Motivations:
  Reduce runtime – often
  Expand functionality – sometimes
  Reduce space – rarely
32. The Reality
Lea allocator often as fast or faster
Custom allocation ineffective, except for regions. [OOPSLA 2002]
[Chart: runtime on the custom-allocator benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle), normalized; bars for Custom, Win32, and DLmalloc, grouped as non-regions, regions, and averages.]
33. Overview of Regions
Separate areas, deletion only en masse:
  regioncreate(r)
  regionmalloc(r, sz)
  regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space
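The API above can be sketched in a few lines; a minimal arena, assuming requests fit in a chunk and ignoring alignment (the chunk size and struct layout are my own choices, not any particular region library's):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// A region: a list of malloc'd chunks plus a bump offset in the last one.
struct Region {
  std::vector<char*> chunks;
  std::size_t used = 0, chunkSize = 0;
};

Region* regioncreate(std::size_t chunkSize = 4096) {
  Region* r = new Region;
  r->chunkSize = chunkSize;
  return r;
}

// Pointer-bumping allocation (assumes sz <= chunkSize).
void* regionmalloc(Region* r, std::size_t sz) {
  if (r->chunks.empty() || r->used + sz > r->chunkSize) {
    r->chunks.push_back(static_cast<char*>(std::malloc(r->chunkSize)));
    r->used = 0;
  }
  void* p = r->chunks.back() + r->used;
  r->used += sz;
  return p;
}

// One call frees all memory: deletion of whole chunks, en masse.
void regiondelete(Region* r) {
  for (char* c : r->chunks) std::free(c);
  delete r;
}
```

Note there is deliberately no per-object free: that is both the speed advantage and the drawback discussed on the following slides.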
34. Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
35. Drawbacks of Regions
Can’t reclaim memory within regions: unbounded memory consumption
  A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs
Current situation for Apache:
  vulnerable to denial-of-service
  limits runtime of connections
  limits module programming
36. Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & a heap:
  reapcreate(r)
  reapmalloc(r, sz)
  reapfree(r, p)
  reapdelete(r)
+ Fast
+ Adapts to its use (region or heap style)
+ Cheap deletion
+ Can reduce memory consumption
37. Using Reap as Regions
[Chart: runtime on the region-based benchmarks lcc and mudlle, normalized; bars for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap (DLmalloc peaks at 4.08).]
Reap performance nearly matches regions
38. Reap: Best of Both Worlds
Combining new/delete with regions
usually impossible:
Incompatible APIs
Hard to rewrite code
Use Reap: Incorporate new/delete code into Apache
“mod_bc” (arbitrary-precision calculator)
Changed 20 lines (out of 8000)
Benchmark: compute 1000th prime
With Reap: 240K
Without Reap: 7.4MB
39. Summary
Building memory managers
Heap Layers framework [PLDI 2001]
Problems with current memory managers
Contention, false sharing, space
Solution: provably scalable memory manager
Hoard [ASPLOS-IX]
Extended memory manager for servers
Reap [OOPSLA 2002]
40. Current Projects
CRAMM: Cooperative Robust Automatic Memory
Management
Garbage collection without paging
Automatic heap sizing
SAVMM: Scheduler-Aware Virtual Memory Management
Markov:
Programming language for building high-performance servers
COLA: Customizable Object Layout Algorithms
Improving locality in Java
41. www.cs.umass.edu/~plasma
43. Looking Forward
“New” programming languages
Increasing use of Java = garbage collection
New architectures
NUMA: SMT/CMP (“hyperthreading”)
Technology trends
Memory hierarchy
44. The Ever-Steeper
Memory Hierarchy
Higher = smaller, faster, closer to the CPU. A real desktop machine (mine):
  registers: 8 integer, 8 floating-point; 1-cycle latency
  L1 cache: 8K data & instructions; 2-cycle latency
  L2 cache: 512K; 7-cycle latency
  RAM: 1GB; 100-cycle latency
  Disk: 40GB; 38,000,000-cycle latency (!)
45. Swapping & Throughput
Heap > available memory → throughput plummets
46. Why Manage Memory At All?
Just buy more!
Simplifies memory management
Still have to collect garbage eventually…
Workload fits in RAM = no more swapping!
Sounds great…
47. Memory Prices Over Time
[Chart: RAM prices over time, 1977-2005, in 1977 dollars per GB, for successive conventional-DRAM generations (2K, 8K, 32K, 128K, 512K, 2M, 8M); prices fall steadily across the $10,000.00 to $0.01 axis range.]
“Soon it will be free…”
48. Memory Prices: Inflection Point!
[Chart: the same RAM-price data (1977 dollars per GB, 1977-2005) extended with SDRAM, RDRAM, DDR, and Chipkill parts (512M, 1G); the long price decline shows an inflection point.]
49. Memory Is Actually Expensive
Desktops: most ship with 256MB; 1GB = 50% more $$
Laptops: = 70% more, if possible; limited capacity
Servers:
  Buy 4GB, get 1 CPU free!
  Sun Enterprise 10000: 8GB extra = $150,000! (8GB of Sun RAM = 1 Ferrari Modena)
Fast RAM: new technologies
Cosmic rays…
50. Key Problem: Paging
Garbage collectors: VM oblivious
GC disrupts LRU queue
Touches non-resident pages
Virtual memory managers: GC oblivious
Likely to evict pages needed by GC
Paging
Orders of magnitude more time than RAM
BIG hit in performance and LONG pauses
51. Cooperative Robust Automatic
Memory Management (CRAMM)
Garbage collector ↔ virtual memory manager: “I'm a cooperative application!”
Coarse-grained (heap-level) cooperation:
  VM signals changes in memory pressure; it tracks per-process and overall memory utilization
  GC adjusts its heap size accordingly
Fine-grained (page-level) cooperation:
  VM's page replacement selects victim pages and sends page-eviction notifications
  GC evacuates the victim page(s)
Joint work: Eliot Moss (UMass), Scott Kaplan (Amherst College)
52. Fine-Grained Cooperative GC
Fine-grained page-eviction notification: the VM's page replacement selects victim pages; the garbage collector evacuates them.
Goal: GC triggers no additional paging
Key ideas:
Adapt collection strategy on-the-fly
Page-oriented memory management
Exploit detailed page information from VM
53. Summary
Building memory managers
Heap Layers framework
Problems with memory managers
Contention, space, false sharing
Solution: provably scalable allocator
Hoard
Future directions
54. If You Have to Spend $$...
more Ferraris: good
more memory: bad
56. This Page Intentionally Left Blank
57. Virtual Memory Manager Support
New VM required: detailed page-level information
“Segmented queue” (unprotected and protected segments) for low overhead
Local LRU order per-process, not gLRU (Linux)
Complementary to SAVM work:
“Scheduler-Aware Virtual Memory manager”
Under development – modified Linux kernel
58. Current Work: Robust
Performance
Currently: no VM-GC communication
  BAD interactions under memory pressure
Our approach (with Eliot Moss, Scott Kaplan): Cooperative Robust Automatic Memory Management
  Virtual memory manager → garbage collector/allocator: LRU queue, memory pressure
  Garbage collector/allocator → virtual memory manager: empty pages
  Result: reduced impact
59. Current Work: Predictable VMM
Recent work on scheduling for QoS
E.g., proportional-share
Under memory pressure, VMM is scheduler
Paged-out processes may never recover
Intermittent processes may wait long time
Scheduler-faithful virtual memory
(with Scott Kaplan, Prashant Shenoy)
Based on page value rather than order
60. Conclusion
Memory management for high-performance applications
Heap Layers framework [PLDI 2001]
Reusable components, no runtime cost
Hoard scalable memory manager [ASPLOS-IX]
High-performance, provably scalable & space-efficient
Reap hybrid memory manager [OOPSLA 2002]
Provides speed & robustness for server applications
Current work: robust memory management for
multiprogramming
61. The Obligatory URL Slide
http://www.cs.umass.edu/~emery
62. If You Can Read This,
I Went Too Far
63. Hoard: Under the Hood
[Diagram: Hoard's composition of heap layers: SelectSizeHeap (select heap based on size), MallocOrFreeHeap (large objects, > 4K), PerProcessorHeap with FreeToHeapBlock (malloc from the local heap, free to the owning heap block), SuperblockHeap and empty heap blocks, HeapBlockManager instances wrapped in LockedHeap, and the SystemHeap (get or return memory to the global heap).]
64. Custom Memory Allocation
Replace new/delete, bypassing the general-purpose allocator
Very common practice: Apache, gcc, lcc, STL, database servers…
Language-level support in C++: “use custom allocators”
Motivations:
  Reduce runtime – often
  Expand functionality – sometimes
  Reduce space – rarely
65. Drawbacks of Custom Allocators
Avoiding memory manager means:
More code to maintain & debug
Can’t use memory debuggers
Not modular or robust:
Mix memory from custom
and general-purpose allocators → crash!
Increased burden on programmers
66. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
67. (1) Per-Class Allocators
Recycle freed objects from a free list:
  a = new Class1;  b = new Class1;  c = new Class1;
  delete a;  delete b;  delete c;     (each goes onto Class1's free list)
  a = new Class1;  b = new Class1;  c = new Class1;   (recycled)
+ Fast: linked-list operations
+ Simple
+ Identical semantics
+ C++ language support
- Possibly space-inefficient
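The C++ language support mentioned above is class-specific operator new/delete; a minimal sketch (Class1 and its payload are illustrative, and freed objects are chained through their own storage):

```cpp
#include <cstddef>
#include <new>

class Class1 {
public:
  // Pop a recycled object if one is available, else fall back to the
  // general-purpose allocator. All Class1 objects have the same size.
  void* operator new(std::size_t sz) {
    if (freeList) {
      void* p = freeList;
      freeList = freeList->next;
      return p;
    }
    return ::operator new(sz);
  }

  // Push the object's storage onto the class's free list.
  void operator delete(void* p) {
    FreeNode* n = static_cast<FreeNode*>(p);
    n->next = freeList;
    freeList = n;
  }

private:
  struct FreeNode { FreeNode* next; };
  static FreeNode* freeList;
  char payload[32];   // example object state (>= sizeof(FreeNode))
};

Class1::FreeNode* Class1::freeList = nullptr;
```

Deleting and re-allocating a Class1 hits only a couple of pointer operations, which is exactly the per-class speed claim; the space risk is that this free list is never returned to the general heap.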
68. (II) Custom Patterns
Tailor-made to fit allocation patterns
Example: 197.parser (natural language parser) allocates out of one fixed char[MEMORY_LIMIT] array, bumping an end_of_array pointer:
  a = xalloc(8);
  b = xalloc(16);
  c = xalloc(8);
  xfree(b);
  xfree(c);
  d = xalloc(8);
+ Fast: pointer-bumping allocation
- Brittle
- Fixed memory size
- Requires stack-like lifetimes
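A sketch of this pattern (my own simplification: only top-of-stack frees are supported, which is the stack-like-lifetime restriction in its purest form; 197.parser's real xalloc/xfree differ in detail):

```cpp
#include <cassert>
#include <cstddef>

// One fixed array and an end-of-array offset, as on the slide.
constexpr std::size_t MEMORY_LIMIT = 1 << 20;
static char memory[MEMORY_LIMIT];
static std::size_t end_of_array = 0;

// Pointer-bumping allocation out of the fixed array.
void* xalloc(std::size_t sz) {
  assert(end_of_array + sz <= MEMORY_LIMIT);  // fixed memory size: brittle
  void* p = &memory[end_of_array];
  end_of_array += sz;
  return p;
}

// Frees p and everything allocated after it by rolling the end
// pointer back: valid only for stack-like lifetimes.
void xfree(void* p) {
  end_of_array = static_cast<std::size_t>(static_cast<char*>(p) - memory);
}
```

Allocation is a bounds check plus an add, which is why such patterns look fast; the cost is that any non-stack-like lifetime breaks the scheme.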
69. (III) Regions
Separate areas, deletion only en masse:
  regioncreate(r)
  regionmalloc(r, sz)
  regiondelete(r)
+ Fast: pointer-bumping allocation, deletion of chunks
+ Convenient: one call frees all memory
- Risky: accidental deletion, too much space
70. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
71. Custom Allocators Are Faster…
[Chart: runtime on the custom-allocator benchmarks (197.parser, boxed-sim, c-breeze, 175.vpr, 176.gcc, apache, lcc, mudlle), normalized; bars for Custom and Win32, grouped as non-regions, regions, and averages. Custom allocators appear faster than the Win32 allocator.]
72. Not So Fast…
[Chart: the same runtime comparison with DLmalloc added; the Lea allocator matches or beats most of the custom allocators.]
73. The Lea Allocator (DLmalloc 2.7.0)
Optimized for common allocation patterns
Per-size quicklists ≈ per-class allocation
Deferred coalescing
(combining adjacent free objects)
Highly-optimized fastpath
Space-efficient
74. Space Consumption Results
[Chart: space on the custom-allocator benchmarks, normalized; bars for Original and DLmalloc, grouped as non-regions, regions, and averages. DLmalloc is space-competitive with the custom allocators.]
75. Overview
Introduction
Perceived benefits and drawbacks
Three main kinds of custom allocators
Comparison with general-purpose allocators
Advantages and drawbacks of regions
Reaps – generalization of regions & heaps
76. Why Regions?
Apparently faster, more space-efficient
Servers need memory management support:
Avoid resource leaks
Tear down memory associated with terminated
connections or transactions
Current approach (e.g., Apache): regions
77. Drawbacks of Regions
Can’t reclaim memory within regions: unbounded memory consumption
  A problem for long-running computations, producer-consumer patterns, and off-the-shelf “malloc/free” programs
Current situation for Apache:
  vulnerable to denial-of-service
  limits runtime of connections
  limits module programming
78. Reap Hybrid Allocator
Reap = region + heap
Adds individual object deletion & a heap:
  reapcreate(r)
  reapmalloc(r, sz)
  reapfree(r, p)
  reapdelete(r)
+ Fast
+ Adapts to its use (region or heap style)
+ Cheap deletion
+ Can reduce memory consumption
79. Using Reap as Regions
[Chart: runtime on the region-based benchmarks lcc and mudlle, normalized; bars for Original, Win32, DLmalloc, WinHeap, Vmalloc, and Reap (DLmalloc peaks at 4.08).]
Reap performance nearly matches regions
80. Reap: Best of Both Worlds
Combining new/delete with regions
usually impossible:
Incompatible APIs
Hard to rewrite code
Use Reap: Incorporate new/delete code into Apache
“mod_bc” (arbitrary-precision calculator)
Changed 20 lines (out of 8000)
Benchmark: compute 1000th prime
With Reap: 240K
Without Reap: 7.4MB
81. Conclusion
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions
Reaps:
Nearly matches region performance
without other drawbacks
Take-home message:
Stop using custom memory allocators!
83. Experimental Methodology
Comparing to general-purpose allocators
Same semantics: no problem
E.g., disable per-class allocators
Different semantics: use emulator
Uses general-purpose allocator
but adds bookkeeping
regionfree: Free all associated objects
Other functionality (nesting, obstacks)
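The emulator idea can be sketched directly; an illustrative reduction (function names and the int region handle are my own, not the paper's emulation code): region calls go through the general-purpose allocator, with bookkeeping so that regionfree can release every associated object.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Bookkeeping: which live objects belong to which region.
std::map<int, std::vector<void*>> regionObjects;

// Region allocation routed through the general-purpose allocator.
void* emuRegionMalloc(int region, std::size_t sz) {
  void* p = std::malloc(sz);
  regionObjects[region].push_back(p);   // the added bookkeeping
  return p;
}

// regionfree semantics: free all objects associated with the region.
void emuRegionFree(int region) {
  for (void* p : regionObjects[region]) std::free(p);
  regionObjects[region].clear();
}
```

This preserves the region API's semantics while letting every benchmark run on the same general-purpose allocator, which is what makes the head-to-head comparison fair.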
84. Use Custom Allocators?
Strongly recommended by practitioners
Little hard data on performance/space
improvements
Only one previous study [Zorn 1992]
Focused on just one type of allocator
Custom allocators: waste of time
Small gains, bad allocators
Different allocators better? Trade-offs?
85. Kinds of Custom Allocators
Three basic types of custom allocators
Per-class
Fast
Custom patterns
Fast, but very special-purpose
Regions
Fast, possibly more space-efficient
Convenient
Variants: nested, obstacks
86. Optimization Opportunity
[Chart: % of runtime spent in memory operations vs. other work, per benchmark and on average; memory operations account for a substantial fraction of total runtime.]
88. Custom Memory Allocation
Programmers often replace malloc/free
Attempt to increase performance
Provide extra functionality (e.g., for servers)
Reduce space (rarely)
Empirical study of custom allocators
Lea allocator often as fast or faster
Custom allocation ineffective,
except for regions. [OOPSLA 2002]