This talk covers the design of and practical experience with dynamic testing tools for C/C++ programs: AddressSanitizer, ThreadSanitizer, and MemorySanitizer. These tools find bugs such as use of memory after it has been freed, out-of-bounds accesses to arrays and objects, data races in multithreaded programs, and uses of uninitialized memory.
There are many reasons to convert managed languages to native ones: above all performance, but also protection against reverse engineering and support for hardware technologies or certain specialized platforms. In this talk we will look at an example of building a converter from C# to C++ and the nuances that come up in solving this problem.
Zero-Overhead Metaprogramming: Reflection and Metaobject Protocols Fast and w... — Stefan Marr
Runtime metaprogramming enables many useful applications and is often a convenient solution to solve problems in a generic way, which makes it widely used in frameworks, middleware, and domain-specific languages. However, powerful metaobject protocols are rarely supported and even common concepts such as reflective method invocation or dynamic proxies are not optimized. Solutions proposed in literature either restrict the metaprogramming capabilities or require application or library developers to apply performance improving techniques.
For overhead-free runtime metaprogramming, we demonstrate that dispatch chains, a generalized form of polymorphic inline caches common to self-optimizing interpreters, are a simple optimization at the language-implementation level. Our evaluation with self-optimizing interpreters shows that unrestricted metaobject protocols can be realized for the first time without runtime overhead, and that this optimization is applicable for just-in-time compilation of interpreters based on meta-tracing as well as partial evaluation. In this context, we also demonstrate that optimizing common reflective operations can lead to significant performance improvements for existing applications.
Building High-Performance Language Implementations With Low Effort — Stefan Marr
This talk shows how languages can be implemented as self-optimizing interpreters, and how Truffle or RPython go about to just-in-time compile these interpreters to efficient native code.
Programming languages are never perfect, so people start building domain-specific languages to be able to solve their problems more easily. However, custom languages are often slow, or take enormous amounts of effort to be made fast by building custom compilers or virtual machines.
With the notion of self-optimizing interpreters, researchers proposed a way to implement languages easily and generate a JIT compiler from a simple interpreter. We explore the idea and experiment with it on top of RPython (of PyPy fame) with its meta-tracing JIT compiler, as well as Truffle, the JVM framework of Oracle Labs for self-optimizing interpreters.
In this talk, we show how a simple interpreter can reach the same order of magnitude of performance as the highly optimizing JVM for Java. We discuss the implementation on top of RPython as well as on top of Java with Truffle so that you can start right away, independent of whether you prefer the Python or JVM ecosystem.
While our own experiments focus on SOM, a little Smalltalk variant to keep things simple, other people have used this approach to improve the peak performance of JRuby, or to build languages such as JavaScript, R, and Python 3.
Shai Halevi discusses new ways to protect cloud data and security. Presented at "New Techniques for Protecting Cloud Data and Security" organized by the New York Technology Council.
Cloud computing is an ever-growing field in today's era. With the accumulation of data and the advancement of technology, a large amount of data is generated every day. Storage, availability, and security of the data form major concerns in the field of cloud computing. This paper focuses on homomorphic encryption, which is largely used for security of data in the cloud. Homomorphic encryption is defined as the technique of encryption in which specific operations can be carried out on the encrypted data. The data is stored on a remote server; the task here is operating on the encrypted data. There are two types of homomorphic encryption: fully homomorphic encryption and partially homomorphic encryption. Fully homomorphic encryption allows arbitrary computation on the ciphertext in a ring, while partially homomorphic encryption is the one in which either addition or multiplication operations can be carried out on the ciphertext. Homomorphic encryption plays a vital role in cloud computing, as the encrypted data of companies is stored in a public cloud, thus taking advantage of the cloud provider's services. Various algorithms and methods of homomorphic encryption that have been proposed are discussed in this paper.
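To make the "operations on encrypted data" idea concrete, here is a minimal C sketch (not from the paper) of the multiplicative homomorphism of textbook RSA, with tiny hard-coded demo parameters; real homomorphic encryption schemes and key sizes are far more involved, and textbook RSA is not secure.

    #include <stdio.h>
    #include <stdint.h>

    /* Square-and-multiply modular exponentiation: computes b^e mod m. */
    static uint64_t modpow(uint64_t b, uint64_t e, uint64_t m) {
        uint64_t r = 1;
        b %= m;
        while (e > 0) {
            if (e & 1) r = (r * b) % m;
            b = (b * b) % m;
            e >>= 1;
        }
        return r;
    }

    int main(void) {
        /* Toy textbook-RSA parameters (p=61, q=53): insecure, illustration only. */
        const uint64_t n = 3233, e = 17;
        uint64_t m1 = 7, m2 = 9;

        uint64_t c1 = modpow(m1, e, n);   /* E(m1) */
        uint64_t c2 = modpow(m2, e, n);   /* E(m2) */
        uint64_t cprod = (c1 * c2) % n;   /* multiply the two ciphertexts */

        /* E(m1)*E(m2) mod n equals E(m1*m2 mod n): both lines print the same value. */
        printf("E(m1)*E(m2) mod n = %llu\n", (unsigned long long)cprod);
        printf("E(m1*m2)    mod n = %llu\n",
               (unsigned long long)modpow((m1 * m2) % n, e, n));
        return 0;
    }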
A fast-paced introduction to Deep Learning that starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful.
This slide is going to introduce the concept of TensorFlow based on the source code study, including tensor, operation, computation graph and execution.
Rust: Reach Further (from QCon São Paulo 2018) — nikomatsakis
Rust is a new programming language that is growing rapidly. Rust's goal is to support a high-level coding style while offering performance comparable to C and C++ as well as minimal runtime requirements -- it does not require a runtime or garbage collector, and you can even choose to forego the standard library. At the same time, Rust offers strong support for parallel programming, including guaranteed freedom from data-races (something that GC’d languages like Java or Go do not provide).
Rust’s slim runtime requirements make it an ideal choice for integrating into other languages and projects. Anywhere that you could integrate a C or C++ library, you can choose to use Rust instead. Mozilla, for example, has rewritten a portion of the Firefox web browser in Rust -- while keeping the rest in C++. There are also projects for writing native extensions to Python, Ruby, and Node in Rust, as well as a recent effort to have the Rust compiler generate WebAssembly.
This talk will cover some of the highlights of Rust's design, and show how Rust's type system not only supports different parallel styles but also encourages users to write code that is amenable to parallelization. I'll also talk a bit about some of the experiences of using Rust in production, as well as how to integrate Rust into existing projects written in different languages.
Highlighted notes of:
Introduction to CUDA C: NVIDIA
Author: Blaise Barney
From: GPU Clusters, Lawrence Livermore National Laboratory
https://computing.llnl.gov/tutorials/linux_clusters/gpu/NVIDIA.Introduction_to_CUDA_C.1.pdf
Blaise Barney is a research scientist at Lawrence Livermore National Laboratory.
Conflux provides a parallel programming framework for using CPUs and GPUs in collaboration as components of an integrated computing system. Conflux adopts the already-known kernel-based architecture, which is compatible with CUDA.
Porting and optimizing UniFrac for GPUs — Igor Sfiligoi
Poster presented at PEARC20.
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another ("beta diversity"). The recently implemented Striped UniFrac added the capability to split the problem into many independent subproblems and exhibits near linear scaling. In this poster we describe steps undertaken in porting and optimizing Striped UniFrac to GPUs. We reduced the run time of computing UniFrac on the published Earth Microbiome Project dataset from 13 hours on an Intel Xeon E5-2680 v4 CPU to 12 minutes on an NVIDIA Tesla V100 GPU, and to about one hour on a laptop with an NVIDIA GTX 1050 (with minor loss in precision). Computing UniFrac on a larger dataset containing 113k samples reduced the run time from over one month on the CPU to less than 2 hours on the V100 and 9 hours on an NVIDIA RTX 2080 Ti GPU (with minor loss in precision). This was achieved by using OpenACC for generating the GPU offload code and by improving the memory access patterns. A BSD-licensed implementation is available, which produces a C shared library linkable by any programming language.
4Developers 2018: How much do you (not) know about structs in .NET (Łukasz Pyrzyk) — PROIDEA
When did you last create a new struct while writing a .NET application? Do you know what structs are for and how they can increase the performance of your program? In this presentation I will show what characterizes structs, how much they differ from classes, and talk about several interesting experiments.
Towards Safe Automated Refactoring of Imperative Deep Learning Programs to Gr... — Raffi Khatchadourian
Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code—supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations to make code amenable to safe, accurate, and efficient graph execution. We present our ongoing work on automated refactoring that assists developers in specifying whether and how their otherwise eagerly-executed imperative DL code could be reliably and efficiently executed as graphs while preserving semantics. The approach, based on a novel imperative tensor analysis, will automatically determine when it is safe and potentially advantageous to migrate imperative DL code to graph execution and modify decorator parameters or eagerly executing code already running as graphs. The approach is being implemented as a PyDev Eclipse IDE plug-in and uses the WALA Ariadne analysis framework. We discuss our ongoing work towards optimizing imperative DL code to its full potential.
Towards neural processing of general purpose approximate programs — Paridha Saxena
Validated one of the neural-network machine learning algorithms, and compared the results of its implementation on hardware (FPGA) using Xilinx with those of a sequential code execution (using FANN).
Options and trade-offs for parallelism and concurrency in Modern C++ — Satalia
While threads have become a first class citizen in C++ since C++11, it is not always the case that they are the best abstraction to express parallelism where the objective is to speed up computations. OpenMP is a parallelism API for C/C++ and Fortran that has been around for a long time. Intel's Threading Building Blocks (TBB) is only a little bit more than 10 years old, but is very mature, and specifically for C++.
Mats will introduce OpenMP and TBB and their use in modern C++ and provide some best practices for them as well as try to predict what the C++ standard has in store for us when it comes to parallelism in the future.
Presentation of NvFX: an effect layer that allows encapsulation of GLSL and/or D3D shading language.
The basic concept follows in the footsteps of NVIDIA CgFX.
https://github.com/tlorach/nvFX
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi — Databricks
Apache Spark provides an elegant API for developing machine learning pipelines that can be deployed seamlessly in production. However, one of the most intriguing and performant family of algorithms – deep learning – remains difficult for many groups to deploy in production, both because of the need for tremendous compute resources and also because of the inherent difficulty in tuning and configuring.
In this session, you'll discover how to deploy the Microsoft Cognitive Toolkit (CNTK) inside of Spark clusters on the Azure cloud platform. Learn about the key considerations for administering GPU-enabled Spark clusters, configuring such workloads for maximum performance, and techniques for distributed hyperparameter optimization. You'll also see a real-world example of training distributed deep learning algorithms for speech recognition and natural language processing.
The increased availability of biomedical data, particularly in the public domain, offers the opportunity to better understand human health and to develop effective therapeutics for a wide range of unmet medical needs. However, data scientists remain stymied by the fact that data remain hard to find and to productively reuse because data and their metadata i) are wholly inaccessible, ii) are in non-standard or incompatible representations, iii) do not conform to community standards, and iv) have unclear or highly restricted terms and conditions that preclude legitimate reuse. These limitations require a rethink of how data can be made machine- and AI-ready – the key motivation behind the FAIR Guiding Principles. Concurrently, while recent efforts have explored the use of deep learning to fuse disparate data into predictive models for a wide range of biomedical applications, these models often fail even when the correct answer is already known, and fail to explain individual predictions in terms that data scientists can appreciate. These limitations suggest that new methods to produce practical artificial intelligence are still needed.
In this talk, I will discuss our work in (1) building an integrative knowledge infrastructure to prepare FAIR and "AI-ready" data and services along with (2) neurosymbolic AI methods to improve the quality of predictions and to generate plausible explanations. Attention is given to standards, platforms, and methods to wrangle knowledge into simple, but effective semantic and latent representations, and to make these available into standards-compliant and discoverable interfaces that can be used in model building, validation, and explanation. Our work, and those of others in the field, creates a baseline for building trustworthy and easy to deploy AI models in biomedicine.
Bio
Dr. Michel Dumontier is the Distinguished Professor of Data Science at Maastricht University, founder and executive director of the Institute of Data Science, and co-founder of the FAIR (Findable, Accessible, Interoperable and Reusable) data principles. His research explores socio-technological approaches for responsible discovery science, which includes collaborative multi-modal knowledge graphs, privacy-preserving distributed data mining, and AI methods for drug discovery and personalized medicine. His work is supported through the Dutch National Research Agenda, the Netherlands Organisation for Scientific Research, Horizon Europe, the European Open Science Cloud, the US National Institutes of Health, and a Marie-Curie Innovative Training Network. He is the editor-in-chief for the journal Data Science and is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies including ontologies and linked data.
A brief information about the SCOP protein database used in bioinformatics.
The Structural Classification of Proteins (SCOP) database is a comprehensive and authoritative resource for the structural and evolutionary relationships of proteins. It provides a detailed and curated classification of protein structures, grouping them into families, superfamilies, and folds based on their structural and sequence similarities.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ... — Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters spanning 0.4–0.9 µm) and novel JWST images with 14 filters spanning 0.8–5 µm, including 7 medium-band filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data at >2.3 µm to construct an ultradeep image, reaching as deep as ≈31.4 AB mag in the stack and 30.3–31.0 AB mag (5σ, r = 0.1″ circular aperture) in individual filters. We measure photometric redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts z = 11.5–15. These objects show compact half-light radii of R_1/2 ~ 50–200 pc, stellar masses of M⋆ ~ 10^7–10^8 M⊙, and star-formation rates of SFR ~ 0.1–1 M⊙ yr⁻¹. Our search finds no candidates at 15 < z < 20, placing upper limits at these redshifts. We develop a forward-modeling approach to infer the properties of the evolving luminosity function without binning in redshift or luminosity that marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results, and that the luminosity function normalization and UV luminosity density decline by a factor of ~2.5 from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical models for evolution of the dark matter halo mass function.
(May 29th, 2024) Advancements in Intravital Microscopy - Insights for Preclini... — Scintica Instrumentation
Intravital microscopy (IVM) is a powerful tool utilized to study cellular behavior over time and space in vivo. Much of our understanding of cell biology has been accomplished using various in vitro and ex vivo methods; however, these studies do not necessarily reflect the natural dynamics of biological processes. Unlike traditional cell culture or fixed tissue imaging, IVM allows for the ultra-fast, high-resolution imaging of cellular processes over time and space as they occur in their natural environment. Real-time visualization of biological processes in the context of an intact organism helps maintain physiological relevance and provides insights into the progression of disease, response to treatments, or developmental processes.
In this webinar we give an overview of advanced applications of the IVM system in preclinical research. IVIM Technology is a provider of all-in-one intravital microscopy systems and solutions optimized for in vivo imaging of live animal models at sub-micron resolution. The system's unique features and user-friendly software enable researchers to probe fast, dynamic biological processes such as immune cell tracking, cell-cell interaction, as well as vascularization and tumor metastasis with exceptional detail. This webinar will also give an overview of IVM being utilized in drug development, offering a view into the intricate interaction between drugs/nanoparticles and tissues in vivo and allowing for the evaluation of therapeutic intervention in a variety of tissues and organs. This interdisciplinary collaboration continues to drive the advancement of novel therapeutic strategies.
Cancer cell metabolism: special reference to the lactate pathway — AADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvates made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis – Krebs cycle – oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELLS:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the WARBURG PHENOMENON:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose than normal cells do from outside.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the 1931 Nobel Prize in Physiology or Medicine for his "discovery of the nature and mode of action of the respiratory enzyme."
WARBURG EFFECT: the tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, including worms, insects, mammals, and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993 Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing of another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22nt and 61nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that are causing the silencing by RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA
Length: 23-25 nt
Trans-acting
Binds its target mRNA with mismatches
Translation inhibition
siRNA
Length: 21 nt
Cis-acting
Binds its target mRNA with a perfectly complementary sequence
piRNA (Piwi-interacting RNA)
Length: 25 to 36 nt
Expressed in germ cells
Regulates transposon activity
MECHANISM OF RNAi:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) multi-protein RNA-binding complex which triggers mRNA degradation in response to siRNA.
Unwinding of the double-stranded siRNA is performed by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease that cleaves the target mRNA.
DICER: an endonuclease (RNase III family).
Argonaute: the central component of the RNA-Induced Silencing Complex (RISC).
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Mammalian Pineal Body: Structure and Functions
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and the Expression of Parallelism
1. The Effect of Hierarchical Memory on
the Design of Parallel Applications
and the Expression of Parallelism
David W. Walker
Cardiff School of Computer Science & Informatics
15. Overlap in a parallel algorithm. The grid is divided into interior points and boundary points; each step is then:
1. Send boundary data to neighbours
2. Update interior points
3. Receive boundary data from neighbours
4. Update boundary points
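As a sketch of this overlap pattern (not code from the slides), a nonblocking MPI halo exchange for a 1D strip decomposition might look as follows; the routine and variable names are illustrative:

    #include <mpi.h>

    /* One Jacobi step on an (ny+2) x nx strip with halo rows 0 and ny+1.
       'up' and 'down' are neighbour ranks, or MPI_PROC_NULL at the edges. */
    void jacobi_step(double *phi, double *oldphi, int ny, int nx, int up, int down)
    {
        MPI_Request reqs[4];

        /* 1. Start the halo exchange: send boundary rows 1 and ny,
              receive into halo rows 0 and ny+1. */
        MPI_Irecv(&oldphi[0],         nx, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&oldphi[(ny+1)*nx], nx, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&oldphi[1*nx],      nx, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&oldphi[ny*nx],     nx, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[3]);

        /* 2. Update interior points (rows 2..ny-1) while messages are in flight. */
        for (int j = 2; j <= ny-1; j++)
            for (int i = 1; i < nx-1; i++)
                phi[j*nx+i] = 0.25*(oldphi[j*nx+i-1] + oldphi[j*nx+i+1]
                                  + oldphi[(j-1)*nx+i] + oldphi[(j+1)*nx+i]);

        /* 3. Wait for the halo data, then update boundary rows 1 and ny. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        int rows[2] = {1, ny};
        for (int r = 0; r < 2; r++) {
            int j = rows[r];
            for (int i = 1; i < nx-1; i++)
                phi[j*nx+i] = 0.25*(oldphi[j*nx+i-1] + oldphi[j*nx+i+1]
                                  + oldphi[(j-1)*nx+i] + oldphi[(j+1)*nx+i]);
        }
    }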
18. CUDA: used on NVIDIA GPUs. Fine-grained parallelism, with large numbers of threads running on thousands of cores.
19. OpenACC: “a single programming model that
will allow you to write a single program that
runs with high performance in parallel across
a wide range of target systems”
Michael Wolfe in OpenACC for Multicore GPUs
https://www.pgroup.com/lit/brochures/openacc_sc15.pdf
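As a sketch of what this looks like in practice (not from the slides), the Laplace-style update that appears later in the deck could be offloaded with OpenACC directives; the same annotated C source can be compiled for multicore CPUs or GPUs:

    /* Illustrative OpenACC version of a masked Jacobi sweep. */
    void jacobi_acc(float *restrict phi, float *restrict oldphi,
                    const int *restrict mask, int ny, int nx, int nsteps)
    {
        /* Keep the arrays resident on the device for all time steps. */
        #pragma acc data copy(phi[0:ny*nx]) create(oldphi[0:ny*nx]) copyin(mask[0:ny*nx])
        for (int k = 0; k < nsteps; k++) {
            #pragma acc parallel loop collapse(2)
            for (int j = 0; j < ny; j++)
                for (int i = 0; i < nx; i++)
                    oldphi[j*nx+i] = phi[j*nx+i];

            #pragma acc parallel loop collapse(2)
            for (int j = 1; j < ny-1; j++)
                for (int i = 1; i < nx-1; i++)
                    if (mask[j*nx+i])
                        phi[j*nx+i] = 0.25f*(oldphi[j*nx+i-1] + oldphi[j*nx+i+1]
                                           + oldphi[(j-1)*nx+i] + oldphi[(j+1)*nx+i]);
        }
    }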
21. PGAS languages: each thread has its own private
memory and also has access to globally shared
memory
22. PGAS: Local and shared variables. (Figure: threads 0–3, each with its own private memory, beneath a global shared address space holding an array.) A thread is said to have an “affinity” for certain elements in an array, which it can access faster than others.
23. To optimize performance PGAS languages still
require the programmer to reason about data
locality and synchronization
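A minimal UPC fragment (illustrative, not from the slides) showing a shared array, a private variable, and affinity-driven loop scheduling:

    #include <upc.h>

    #define N (THREADS*8)

    shared float a[N];   /* default cyclic layout: element i has affinity to thread i%THREADS */
    float scale;         /* private: every thread has its own copy */

    void scale_array(void)
    {
        int i;
        /* The fourth clause of upc_forall assigns iteration i to the thread
           with affinity to a[i], so each access below is thread-local. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] *= scale;
        upc_barrier;
    }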
24. Example: 2D Laplace Problem. (Figure: an NPTSY × NPTSX grid divided into horizontal strips, Strip 0 to Strip N−1, each NY rows deep.) The solution is held at 0 on the boundary, and 1 at the 4 centre squares.
25. 2D Laplace Problem: MPI solution. (Figure: the same grid with one strip per process, Process 0 to Process N−1.) At the start of each Jacobi iteration, each process exchanges its first and last rows with the processes above and below.
27. 2D Laplace Problem: UPC solution. (Figure: the same grid with one NY-row strip per thread, Thread 0 to Thread THREADS−1.)
28. #include <stdio.h>
    #include <upc.h>
    #define NY 20
    #define NPTSX 200
    #define NPTSY (NY*THREADS)
    #define NSTEPS 5000
    shared[*] float phi[NPTSY][NPTSX], oldphi[NPTSY][NPTSX];
    shared[*] int mask[NPTSY][NPTSX];
    // Routines setup_grid(), output_array(), and RGBval()
    int main ()
    {
      int i, j, k;
      setup_grid();
      upc_barrier;
      for(k=1;k<=NSTEPS;k++){
        upc_forall(j=0;j<NPTSY;j++;j/NY)
          for(i=0;i<NPTSX;i++) oldphi[j][i] = phi[j][i];
        upc_barrier;
        upc_forall(j=0;j<NPTSY;j++;j/NY)
          for(i=0;i<NPTSX;i++) {
            if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
                oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
          }
        upc_barrier;
      }
      output_array();
    }
UPC Program #1 notes: shared arrays can be used directly; note the barriers. Updating values lying on the upper and lower boundaries of a thread requires access to data values with different affinities, and these accesses are slow.
29. Data Sharing Between Threads. (Figure: the NY × NPTSX strip of each thread, Thread 0 to Thread THREADS−1.) To update a value on a thread’s upper or lower boundary requires data from the thread above or below.
30. Each thread copies its first and last rows into shared memory at the start of each time step, and then reads rows from neighbouring threads from shared memory.
31. Coordinating Private and Shared Memory. (Figure: each thread’s Row 1 and Row NY staged through a shared-memory buffer.) Shared memory is used as a way of coordinating the sharing of data between threads. This avoids the explicit barriers, and coalesces data movement between local and remote memory.
32. Main Program: Array Declarations

    #include <stdio.h>
    #include <upc.h>
    #define NY 20
    #define NPTSX 200
    #define NPTSY (NY*THREADS)
    #define NSTEPS 5000
    shared[NPTSX] float ud[2][THREADS*NPTSX];    // shared array to hold rows 1 and NY of each thread
    shared[*] float finalphi[NPTSY][NPTSX];      // needed for output
    float phi[NY+2][NPTSX], oldphi[NY+2][NPTSX]; // arrays in private memory
    int mask[NY+2][NPTSX];                       // also private
    // Routines setup_grid(), output_array(), and RGBval()
    int main ()
    {
      int i, j, k;
      setup_grid();
      upc_barrier;
      for(k=1;k<=NSTEPS;k++){
        …                                        // see the next slide for the update code
      }
      output_array();
    }
33. Main Program: Update (only one barrier)

    for(i=0;i<NPTSX;i++){                        // copy rows 1 and NY of phi to shared memory
      ud[0][MYTHREAD*NPTSX+i] = phi[1][i];
      ud[1][MYTHREAD*NPTSX+i] = phi[NY][i];
    }
    upc_barrier;
    if (MYTHREAD>0) {                            // copy into row 0
      for(i=0;i<NPTSX;i++)
        phi[0][i] = ud[1][(MYTHREAD-1)*NPTSX+i];
    }
    if (MYTHREAD<THREADS-1) {                    // copy into row NY+1
      for(i=0;i<NPTSX;i++)
        phi[NY+1][i] = ud[0][(MYTHREAD+1)*NPTSX+i];
    }
    for(j=0;j<NY+2;j++)                          // copy phi to oldphi
      for(i=0;i<NPTSX;i++) oldphi[j][i] = phi[j][i];
    for(j=1;j<NY+1;j++)                          // do the update
      for(i=0;i<NPTSX;i++) {
        if (mask[j][i]) phi[j][i] = 0.25*(oldphi[j][i-1] +
            oldphi[j][i+1] + oldphi[j-1][i] + oldphi[j+1][i]);
      }
34. That was a straightforward example: regular
communication and good load balance.
36. Start at the root and visit every node of the tree using a depth-first traversal algorithm. Note: it’s an implicit tree – every node contains all the information needed to specify its children.
37. Each thread has a node stack. When it’s empty a thread will steal work from the stack of another thread. (Figure: Node A is popped from a stack holding Nodes A–E, and its children, Nodes X and Y, are pushed in its place.)
38. The UPC implementation allows a thread to do push and pull operations on the top of its stack, and to steal nodes from the bottom of other threads’ stacks. A thread has affinity for the top part of its stack and can access it using a local pointer; other threads can steal nodes from the bottom part of its stack by accessing it using a global pointer. Complications: need to use locks to synchronise access to the bottom part of the stack, and when moving nodes between the top and bottom parts of the stack.
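A plain-C sketch of this split-stack scheme using pthreads (the deck’s actual implementation is in UPC; the names and fixed capacities here are illustrative, and the periodic transfer of surplus nodes from the private top to the stealable bottom is omitted):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct Node Node;

    typedef struct {
        Node *top[64];            /* private part: owner-only access              */
        int   ntop;
        Node *bottom[64];         /* shared part: thieves steal from here         */
        int   nbottom;
        pthread_mutex_t lock;     /* init with PTHREAD_MUTEX_INITIALIZER; guards
                                     bottom[] and transfers between the parts     */
    } WorkStack;

    void push(WorkStack *s, Node *n) { s->top[s->ntop++] = n; }

    Node *pop(WorkStack *s)
    {
        if (s->ntop > 0) return s->top[--s->ntop];
        /* Top is empty: refill one node from our own bottom part, under the lock. */
        pthread_mutex_lock(&s->lock);
        Node *n = (s->nbottom > 0) ? s->bottom[--s->nbottom] : NULL;
        pthread_mutex_unlock(&s->lock);
        return n;
    }

    Node *steal(WorkStack *victim)
    {
        /* Thieves take the oldest node from the bottom of the victim's stack. */
        pthread_mutex_lock(&victim->lock);
        Node *n = (victim->nbottom > 0) ? victim->bottom[0] : NULL;
        if (n) {
            for (int i = 1; i < victim->nbottom; i++)
                victim->bottom[i-1] = victim->bottom[i];
            victim->nbottom--;
        }
        pthread_mutex_unlock(&victim->lock);
        return n;
    }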
40. Aim: to visit each node and process it in some way.

    Node x;
    int n = make_tree(&x, max_depth);
    omp_set_dynamic(0);             // do not adjust the number of threads at runtime
    #pragma omp parallel shared(x) num_threads(nthreads)  // create a pool of threads
    {
      #pragma omp single            // one thread visits the root node
      visit_node(&x);
    }
41. void visit_node(Node *x){
      Node *y = x->children;
      while(y != NULL){                  // loop over the children of node x
        #pragma omp task firstprivate(y) // creates a new task for each child
        visit_node(y);                   //   to call visit_node(y)
        y = y->next;
      }
      #pragma omp taskwait               // wait here until all the child tasks have finished
      process_node(x);
      return;
    }

The runtime system schedules the tasks on the threads.
42. OpenMP tasks work well for parallelizing recursive
problems with dynamic load imbalance
44. “To achieve good performance the
programmer and the programming system
must reason about locality and
independence”
45. In Sequoia, recursive tasks act as self-contained units of computation, and hierarchical memory is represented by a tree.
46. Programmer must provide Sequoia with a
task mapping specification that maps
different levels of the memory hierarchy to
different granularities of task.
47. In addition to changing the order of
arithmetical operations, we can also
change the layout of data in memory
51. Square 2^n × 2^n Arrays: RM and Morton index. Block size b = 2^(n−r) (maximum r = n−1).
52. Morton Order. (Figure: a 4 × 4 grid of blocks numbered 0–15 in Morton order.) Example with n = 5, r = 2: consider (i, j) = (18, 13); in binary, i = 10010 and j = 01101. Interlace the top 2 bits of i and j: 1001 → 9. The Morton index is 1001010101 → 597.
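A C sketch (not from the slides) of this index computation, under the reading that the top r bits of i and j are interleaved to form the block number while the low n−r bits form a row-major offset within the 2^(n−r) × 2^(n−r) block; this reproduces the slide’s example:

    #include <stdint.h>

    uint32_t morton_index(uint32_t i, uint32_t j, int n, int r)
    {
        uint32_t bi = i >> (n - r);            /* top r bits of i */
        uint32_t bj = j >> (n - r);            /* top r bits of j */

        uint32_t block = 0;
        for (int k = r - 1; k >= 0; k--)       /* interleave: i bit, then j bit */
            block = (block << 2) | (((bi >> k) & 1u) << 1) | ((bj >> k) & 1u);

        uint32_t mask   = (1u << (n - r)) - 1; /* low n-r bits of each index   */
        uint32_t offset = ((i & mask) << (n - r)) | (j & mask);

        return (block << (2 * (n - r))) | offset;
    }
    /* morton_index(18, 13, 5, 2) == 597, matching the slide's example. */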
53. The unshuffle operation takes a shuffled sequence of items and unshuffles them:

    a1 b1 a2 b2 … an bn → a1 a2 … an b1 b2 … bn

where each ai is a contiguous vector of ℓa items, and each bi is a contiguous vector of ℓb items.
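One possible C implementation of the unshuffle (illustrative; it uses a scratch buffer, and its argument convention differs from the mortonOrder pseudocode on the next slide):

    #include <string.h>
    #include <stdlib.h>

    /* x holds n runs of the form a_i (la items) followed by b_i (lb items);
       on return it holds a_1..a_n followed by b_1..b_n. */
    void unshuffle(float *x, int nruns, int la, int lb)
    {
        int total = nruns * (la + lb);
        float *tmp = malloc(total * sizeof *tmp);
        float *a = tmp, *b = tmp + nruns * la;

        for (int i = 0; i < nruns; i++) {
            memcpy(a + i*la, x + i*(la+lb),      la * sizeof *x);
            memcpy(b + i*lb, x + i*(la+lb) + la, lb * sizeof *x);
        }
        memcpy(x, tmp, total * sizeof *x);
        free(tmp);
    }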
54. Apply Morton Ordering to Matrix A (n is the matrix size, b is the block size; both are powers of 2; p1, p2, p3 are the quadrant offsets shown in the figure):

    mortonOrder (A,n,b){
      if( b < n ){
        p1 = (n*n)/4
        p2 = 2*p1
        p3 = 3*p1
        unshuffle(A,n/2,n/2)
        unshuffle(A+p2,n/2,n/2)
        mortonOrder(A,n/2,b)
        mortonOrder(A+p1,n/2,b)
        mortonOrder(A+p2,n/2,b)
        mortonOrder(A+p3,n/2,b)
      }
    }
55. Possible use of Morton or SFC ordering would be
in a library – optionally convert between matrix
layouts on entry to, and exit from, the library.
56. Recursive Matrix Multiply (A is partitioned into quadrants A00, A01, A10, A11, and similarly B and C):

    mm_Recursive (A,B,C,n,b){ // C = C + AB
      if(n==b){
        matmul(A,B,C,n)       // end of recursion; choose b so matrices fit in cache
      }
      else{
        mm_Recursive(A00,B00,C00,n/2,b)
        mm_Recursive(A01,B10,C00,n/2,b)
        mm_Recursive(A00,B01,C01,n/2,b)
        mm_Recursive(A01,B11,C01,n/2,b)
        mm_Recursive(A10,B00,C10,n/2,b)
        mm_Recursive(A11,B10,C10,n/2,b)
        mm_Recursive(A10,B01,C11,n/2,b)
        mm_Recursive(A11,B11,C11,n/2,b)
      }
      return
    }

Note: all the computational work happens in the leaves of the recursion tree.
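A runnable C version of this recursion for a Morton-ordered matrix (a sketch, not the deck’s code): because each quadrant is stored contiguously, the four sub-blocks of an n × n matrix start at offsets 0, q, 2q, 3q with q = (n/2)·(n/2), matching p1, p2, p3 on slide 54, and each b × b leaf tile is a contiguous row-major block:

    /* Leaf case: multiply contiguous row-major b x b tiles, C += A*B. */
    static void matmul_leaf(const double *A, const double *B, double *C, int b)
    {
        for (int i = 0; i < b; i++)
            for (int k = 0; k < b; k++)
                for (int j = 0; j < b; j++)
                    C[i*b+j] += A[i*b+k] * B[k*b+j];
    }

    void mm_recursive(const double *A, const double *B, double *C, int n, int b)
    {
        if (n == b) { matmul_leaf(A, B, C, b); return; }
        int q = (n/2)*(n/2);   /* elements per quadrant in Morton layout */
        const double *A00=A, *A01=A+q, *A10=A+2*q, *A11=A+3*q;
        const double *B00=B, *B01=B+q, *B10=B+2*q, *B11=B+3*q;
        double *C00=C, *C01=C+q, *C10=C+2*q, *C11=C+3*q;

        mm_recursive(A00, B00, C00, n/2, b);
        mm_recursive(A01, B10, C00, n/2, b);
        mm_recursive(A00, B01, C01, n/2, b);
        mm_recursive(A01, B11, C01, n/2, b);
        mm_recursive(A10, B00, C10, n/2, b);
        mm_recursive(A11, B10, C10, n/2, b);
        mm_recursive(A10, B01, C11, n/2, b);
        mm_recursive(A11, B11, C11, n/2, b);
    }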
57. Platform 1: MacBook Pro, Intel i7, 4 cores, 256KB L2 cache/core, 6MB L3 cache. Platform 2: two Xeon E5-2620, 6 cores each, 256KB L2 cache/core, 15MB L3 cache. gcc compiler used with –O3 flag set.
59. Tail Recursive Cholesky (A is partitioned into A00 (b×b), A10, and A11):

    choleskyTailRecursive (A,n,b){ // factorize A in place
      if(n==b){
        cholesky(A,b)              // end of recursion; choose b so matrices fit in cache
      }
      else{
        cholesky(A00,b)
        triangularSolve(A10,A00,n-b,b)
        symmetricRankUpdate(A11,A10,n-b,b)
        choleskyTailRecursive(A11,n-b,b)
      }
      return
    }

Note: computational work happens at all levels of the recursion tree.
60. Binary Recursive Cholesky (A is partitioned into A00, A10, and A11):

    choleskyBinaryRecursive (A,n,b){ // factorize A in place
      if(n==b){
        cholesky(A,b)                // end of recursion; choose b so matrices fit in cache
      }
      else{
        choleskyBinaryRecursive(A00,n/2,b)
        triangularSolve(A10,A00,n/2,n/2)
        symmetricRankUpdate(A11,A10,n/2,n/2)
        choleskyBinaryRecursive(A11,n/2,b)
      }
      return
    }

Note: the 4 operations at the inner nodes of the recursion tree have to be done in order, so the recursive calls cannot be made in parallel.
61. Blocked RM order: standard algorithm based on rectangular blocks.
Tiled RM order: all operations are expressed in terms of operations involving square tiles, but matrices are stored in RM order.
Tiled Morton order: as above, but matrices are stored in Morton order.
All times are relative to the time for a single call to DPOTRF.
62. Morton order algorithms require Morton
index computations. There are a number
of ways to do these (bitwise operations,
lookup tables) but the method used does
not impact performance much.
63. These plots show results for the binary
recursive algorithm on Platform 1. Similar
results were obtained on Platform 2.
64. The Fourier transform of an n×n array X can be expressed as

    Y = Fn X Fn

where element (p,q) of the matrix Fn is wn^(pq), and wn = exp(−2πi/n). For example, writing w for w4:

    F4 = | 1  1   1   1  |
         | 1  w   w²  w³ |
         | 1  w²  w⁴  w⁶ |
         | 1  w³  w⁶  w⁹ |
65. 2D Fast Fourier Transform

    Y = Fn X Fn = Fn X FnT = At ⋯ A1 (PnT X Pn) A1T ⋯ AtT

where t = log2(n) and PnT is a permutation matrix such that PnT X exchanges row k of X with row k′, where k′ is the t bits of k in reverse order. For n = 8, applying the bit-reversal permutation to the vector (0,1,2,3,4,5,6,7)T gives (0,4,2,6,1,5,3,7)T.
66. Aq is a block-diagonal “butterfly” matrix with r diagonal blocks of BL:

    Aq = diag(BL, …, BL),   where L = 2^q, r = n/L, L* = L/2

and ⊗ is the Kronecker matrix product:

    A ⊗ B = | a00 B     ⋯  a0,n−1 B   |
            | ⋮         ⋱  ⋮          |
            | am−1,0 B  ⋯  am−1,n−1 B |
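The slide defines L* = L/2, but the structure of the butterfly block BL itself did not survive extraction. In the standard radix-2 factorization (e.g. Van Loan’s treatment) it has the form below; this reconstruction is an assumption, not text from the deck:

    B_L =
    \begin{pmatrix}
    I_{L^*} & \Omega_{L^*} \\
    I_{L^*} & -\Omega_{L^*}
    \end{pmatrix},
    \qquad
    \Omega_{L^*} = \mathrm{diag}\!\left(1,\ \omega_L,\ \omega_L^{2},\ \dots,\ \omega_L^{L^*-1}\right),
    \qquad
    \omega_L = e^{-2\pi i / L}.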
67. I recommend this book if
you want to understand
the mathematics behind
the FFT algorithm.
68. A Common 2D FFT Algorithm

    Y = At ⋯ A1 (PnT X Pn) A1T ⋯ AtT

    1. Evaluate X̃ = At ⋯ A1 PnT X
    2. Transpose to get X̃T
    3. Evaluate YT = At ⋯ A1 PnT X̃T
    4. Transpose YT to get Y
69. Πn is a permutation matrix that performs a perfect shuffle index operation, and Πb,n performs a partial bit reversal on indices. Basis of the recursive 2D FFT:

    Fn Πb,n = Bb,n (In/b ⊗ Fb)
    Πb,n = Πn (I2 ⊗ Πn/2)(I4 ⊗ Πn/4) ⋯ (In/(2b) ⊗ Π2b)
    Bb,n = Bn (I2 ⊗ Bn/2)(I4 ⊗ Bn/4) ⋯ (In/(2b) ⊗ B2b)
70. Hb,n = Πb,nT X Πb,n permutes the columns and rows of X based on a partial bit-reversal of indices. Then

    Fn X Fn = Fn X FnT = Bb,n [ (In/b ⊗ Fb) Hb,n (In/b ⊗ Fb)T ] Bb,nT

The bracketed factor is the result of partitioning the matrix into b×b blocks and performing a 2D FFT on each; denote this by Kb,n.
72. Evaluate Kb,n: the FFTs of the b×b blocks. Then evaluate K2b,n = (I2 ⊗ B2b) Kb,n (I2 ⊗ B2b)T: the FFTs of the 2b×2b blocks. Finally, evaluate B4b K2b,n B4bT: the FFT of the whole n×n array (n = 4b in this illustration).
73. Transpose-based 2D FFT

    transposeFFT2D (X,n,b){
      partialBitReversal(X,n,b)
      for (each bxb block, B, of X)
        fft2D(B,b)                  // do the FFT of each block using any algorithm
      recursiveTransposeFFT(X,n,b)  // pre-multiply blocks as we move up the recursion tree
      transpose(X,n,b)              // transpose X
      recursiveTransposeFFT(X,n,b)
      transpose(X,n,b)
      return
    }
74. Recursive Transpose-Based 2D FFT

    recursiveTransposeFFT (X,n,b){
      if(n>b){
        recursiveTransposeFFT(X00,n/2,b)
        recursiveTransposeFFT(X01,n/2,b)
        recursiveTransposeFFT(X10,n/2,b)
        recursiveTransposeFFT(X11,n/2,b)
        butterflyPre(X,n,b)   // pre-multiply the n×n block by the butterfly matrix, overwriting X
      }
      return
    }

End recursion when n=b; choose b so matrices fit in cache. Note: this includes work at each level of the recursion tree, and the recursive calls are readily parallelizable.
76. Recursive Vector Radix 2D FFT

    recursiveVRFFT (X,n,b){
      if(n==b){
        fft2D(X,n)            // end recursion when n=b; choose b so matrices fit in cache
      }
      else{
        recursiveVRFFT(X00,n/2,b)
        recursiveVRFFT(X01,n/2,b)
        recursiveVRFFT(X10,n/2,b)
        recursiveVRFFT(X11,n/2,b)
        butterflyPre(X,n,b)   // pre-multiply the n×n block by the butterfly matrix, overwriting X,
        butterflyPost(X,n,b)  // and then post-multiply
      }
      return
    }

Note: the recursive calls are readily parallelizable.
77. All times are relative to the time for the transpose-based FFT on an RM matrix of the same size.
78. Morton ordering doesn’t improve FFT timings by as much as for matrix multiplication: the computation-to-data-movement ratio is n for matrix multiply, but only log(n) for the FFT.
79. Morton ordering and related recursive parallel
algorithms may work well when hierarchical
memory is handled programmatically.