Presentation on Auto Tuning delivered as part of our "Software for Multicore Processors" course at UT Austin. It covers the basics of AutoTuning and the details of two library generators, PHiPAC and ATLAS.
3. What is Auto Tuning?
● Several Definitions
○ First result on Wikipedia - "Auto-Tune is an audio
processor created by Antares Audio Technologies"
● A Definition
○ Autotuning is an automatic process for selecting one
out of several possible solutions to a computational
problem.
● Techniques used by:
○ Library generators, Compilers and Runtime systems
4. Possible Versions of a Solution
● The solutions may differ in the
○ algorithm (quicksort vs selection sort)
○ implementation (loop unroll).
● The versions may result from
○ transformations (unroll, tile, interchange)
● The versions could be generated by
○ programmer manually (coding or directives)
○ compiler automatically
5. Motivation
■ Increasing diversity of computation supports
■ New influences on the execution of parallel
applications
○ Hierarchical structure
○ Heterogeneity of the processors
■ Design efficient software that takes full
advantage of such systems
■ Solving a target problem by using a single
algorithm is not always efficient everywhere
6. First Ideas
● Poly-Algorithms
○ (1969) John Rice (Purdue): "A polyalgorithm for the automatic
solution of nonlinear equations"
● Profiling and feedback assisted compilation
○ (1982) S. Graham et al.: gprof
○ (1991) P. Chang et al.: "Using profile information to assist classic
code optimizations"
● Code generation
○ (1989) J. Johnson et al.: “A methodology for designing, modifying,
and implementing Fourier Transform algorithms on various
architectures.”
○ (1992) M. Covell et al.: “Computer-aided algorithm design and
arrangement”
7. Context: High Performance Libraries
● Linear Algebra
○ BLAS, LAPACK, ScaLAPACK
● Signal/Image Processing
○ Vector Signal Image Processing Library (VSIPL)
● Distributed/Parallel Systems
○ Message Passing Interface (MPI)
● Can we implement libraries:
○ Automatically and Portably
○ Incorporating platform-specific features
○ matching performance of hand-tuned
implementations leveraging compiler technology
○ using domain-specific knowledge
8. AutoTuning
● 2 phase scheme for producing automatically
tuned code
● Given: Program; inputs; machine
● Step1: Identify and generate a space of
candidate implementations
● Step2: Select the fastest one using empirical
modeling and/or automated experiments
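The second step can be sketched in a few lines of C. This is only an illustration under stated assumptions: the candidate space is two hand-written summation routines rather than generated code, and the names `sum_fn`, `sum_simple`, `sum_unroll4` and `pick_fastest` are hypothetical. Only `clock()` and `CLOCKS_PER_SEC` come from the standard library.

```c
#include <stddef.h>
#include <time.h>

/* A candidate implementation: sums n floats (hypothetical signature). */
typedef float (*sum_fn)(const float *x, size_t n);

/* Candidate 1: straightforward loop. */
float sum_simple(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Candidate 2: unrolled by 4 with independent accumulators. */
float sum_unroll4(const float *x, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i]; s1 += x[i + 1]; s2 += x[i + 2]; s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];   /* remainder elements */
    return (s0 + s1) + (s2 + s3);
}

/* Times every candidate on the given input and returns the index of the
 * fastest one. Ties and timer noise are not handled in this sketch. */
size_t pick_fastest(const sum_fn *cands, size_t ncands,
                    const float *x, size_t n, int reps) {
    size_t best = 0;
    double best_time = -1.0;
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    for (size_t c = 0; c < ncands; c++) {
        clock_t t0 = clock();
        for (int r = 0; r < reps; r++) sink += cands[c](x, n);
        double elapsed = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = c;
        }
    }
    (void)sink;
    return best;
}
```

A real autotuner would generate the candidates (Step 1) instead of listing them by hand, and would repeat the measurement enough times to smooth out timer resolution.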
9. Why not let the compiler worry?
● General Purpose
○ whereas Library generators can focus on specific
problems
● Engineering
○ Hard to modify a production compiler and its effects
are global
● Analysis
○ Limited access to relevant run-time information
○ Over specified dependencies
○ Correctness Criteria
10. Compiler Vs AutoTuner
● Input
○ Compiler: General Purpose Source Code
○ AutoTuner: Specification including problem size, machine
parameters and problem specific transformations
● Output
○ Compiler: Low level Machine Code
○ AutoTuner: Mostly High Level Source (eg: C code)
● Time to Generate
○ Compiler: Short (unless feedback/profiling enabled)
○ AutoTuner: Usually Long (depends on search space)
● Select Implementation
○ Compiler: Mostly Static Analysis (rarely feedback tuning)
○ AutoTuner: Automated Empirical Models and experiments
11. Some AutoTuning Projects
● Linear Algebra
○ Portable High-Performance ANSI C
■ PHiPAC
○ Automatically Tuned Linear Algebra Software
■ ATLAS
● Signal and Image Processing
○ Fastest Fourier Transform in the West
■ FFTW
○ SPIRAL
14. PHiPAC (1997)
● Developing Portable High-Performance
matrix vector libraries in ANSI C
● Parametrized C-code Generator
○ produces code according to certain
guidelines
● Auto Tune the code
● Exhaustive search over all parameters
● Claim: achieve over 90% of peak-perf and
17. Efficient Code Generation
● Studied several ANSI C Compilers and
determined that it is best to
● Rely on Compilers for:
○ Register allocation
○ Instruction selection and Scheduling
● Manually perform:
○ register/cache blocking
○ loop unrolling
○ software pipe-lining, etc
18. Local Variables to explicitly remove false
dependencies
● Before
a[i] = b[i] + c;
a[i+1] = b[i+1] * d;
● After
float f1, f2;
f1 = b[i]; f2 = b[i+1];
a[i] = f1 + c;
a[i+1] = f2 * d;
● The compiler may not assume &a[i] != &b[i+1]
and so is forced to first store a[i] before
loading b[i+1] (Pointer Aliasing)
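A small C sketch shows why the compiler cannot do this reordering on its own. The function names `update_before` and `update_after` are hypothetical. When `a` and `b` overlap so that `&a[i] == &b[i+1]`, the two versions produce different results, so hoisting the loads is a decision only the programmer can make.

```c
/* "Before": loads and stores in source order. */
void update_before(float *a, const float *b, int i, float c, float d) {
    a[i]     = b[i] + c;
    a[i + 1] = b[i + 1] * d;   /* must see the store above if a[i] aliases b[i+1] */
}

/* "After": both loads are hoisted into locals before any store. */
void update_after(float *a, const float *b, int i, float c, float d) {
    float f1 = b[i];
    float f2 = b[i + 1];
    a[i]     = f1 + c;
    a[i + 1] = f2 * d;
}
```

For example, with a three-element buffer `{1, 2, 3}`, `a = buf + 1`, `b = buf`, `i = 0`, `c = 10` and `d = 2`, the first version leaves `{1, 11, 22}` and the second leaves `{1, 11, 4}`. With non-overlapping arrays both give the same answer.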
20. Exploit Multiple Registers
● Explicitly keep values in local variables
○ Reduces memory bandwidth
○ compiler would reload fil values for every
iteration (potential aliasing with res)
● Before
while(...) {
    *res++ = fil[0] * sig[0]
           + fil[1] * sig[1];
    sig++;
}
● After
float f0 = fil[0];
float f1 = fil[1];
while(...) {
    *res++ = f0 * sig[0]
           + f1 * sig[1];
    sig++;
}
21. Minimize pointer updates by striding with
constant offsets
● Before
f0 = *r8; r8 += 4;
f1 = *r8; r8 += 4;
f2 = *r8; r8 += 4;
● After
f0 = r8[0];
f1 = r8[4];
f2 = r8[8];
r8 += 12;
Compilers can fold constant index into
(register + offset) addressing mode.
22. Minimize branches, avoid magnitude
compares
● Branches are costly
○ Unroll loops
○ Use do{} while(); loops to avoid loop
head branches
● Using == and != is cheaper
● Before
for(i = 0, a = start_ptr;
    i < ARRAY_SIZE;
    i++, a++) {
    ...
}
● After
end_ptr = &a[ARRAY_SIZE];
do {
    ...
    a++;
} while (a != end_ptr);
24. Other Guidelines
● Balance Instruction Mix
○ Interleave 1 FPM, 1 FPA and 1-2 FP loads or
stores
● Increase Locality
○ Arrange code to have unit-stride memory
accesses and try to reuse data in cache
● Convert Integer multiplies to adds
○ * and / are slower than +
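The last guideline can be illustrated with a column traversal of a row-major matrix. This is a minimal sketch with hypothetical function names; `ld` is the row length (leading dimension). The first version multiplies in the index computation on every iteration, the second replaces that multiply with a running add.

```c
#include <stddef.h>

/* Index computed with an integer multiply on every iteration. */
void scale_column_mul(float *m, size_t rows, size_t ld, size_t col, float k) {
    for (size_t r = 0; r < rows; r++)
        m[r * ld + col] *= k;
}

/* Same traversal with the multiply replaced by a running add. */
void scale_column_add(float *m, size_t rows, size_t ld, size_t col, float k) {
    size_t idx = col;
    for (size_t r = 0; r < rows; r++) {
        m[idx] *= k;
        idx += ld;
    }
}
```

Many modern compilers perform this strength reduction themselves; writing it out by hand follows the PHiPAC approach of not depending on the compiler for it.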
25. Matrix Multiply Generators
● Produce C code with PHiPAC guidelines
● C = αop(A)op(B) + βC
○ MxK, KxN and MxN matrices
○ op(X) is either X or transpose(X)
● mm_cgen and mm_lgen
○ Core (register blocking)
○ Level (higher level cache blocking)
● mm_cgen -l0 M0 K0 N0 [-l1 M1 K1 N1] ...
26. Blocked MMM
for (i=0; i<M; i+=M0)
  for (j=0; j<N; j+=N0)
    for (l=0; l<K; l+=K0)
      for (r=i; r<i+M0; r++)
        for (s=j; s<j+N0; s++)
          for (t=l; t<l+K0; t++)
            c[r][s] += a[r][t] * b[t][s];
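A self-contained version of the blocked loop nest, next to an unblocked reference, might look like the sketch below. The function names are hypothetical, the matrices are assumed to be row-major flat arrays, and the `min_sz` clamp is an addition so that M, N and K need not be multiples of the block sizes. The inner loops start at the block origins `i`, `j` and `l` respectively.

```c
#include <stddef.h>

/* Reference: C += A*B, row-major; A is M x K, B is K x N, C is M x N. */
void mmm_naive(size_t M, size_t N, size_t K,
               const double *a, const double *b, double *c) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t l = 0; l < K; l++)
                c[i * N + j] += a[i * K + l] * b[l * N + j];
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked: outer loops step over M0 x N0 x K0 blocks,
 * inner loops stay inside one block. */
void mmm_blocked(size_t M, size_t N, size_t K,
                 size_t M0, size_t N0, size_t K0,
                 const double *a, const double *b, double *c) {
    for (size_t i = 0; i < M; i += M0)
        for (size_t j = 0; j < N; j += N0)
            for (size_t l = 0; l < K; l += K0)
                for (size_t r = i; r < min_sz(i + M0, M); r++)
                    for (size_t s = j; s < min_sz(j + N0, N); s++)
                        for (size_t t = l; t < min_sz(l + K0, K); t++)
                            c[r * N + s] += a[r * K + t] * b[t * N + s];
}
```

Both routines accumulate into C, so the caller zeroes C first when a plain product is wanted.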
31. Optimal Block Sizes
● Naive brute force search
● For Register Parameters
○ NR/4 <= M0N0 <= NR ; NR is max regs
○ 1 <= K0 <= K0max ; K0max = 20 (tunable)
● Benchmark all squares M = K = N = D
○ D runs over 2x, 3x, 10x and all primes
○ 3D^2 fits in L1 cache
32. Contd.
● For L1 blocking Parameters
● The square case ( D x D)
● Search the neighborhood centered at 3D^2 = L1
● Set the values of M1, K1, N1 to ϕ D/M0
○ Where, ϕ ∈ { 0.25, 0.5, 1.0, 1.5, 2.0 }
○ D = sqrt(L1/3)
○ 125 Combinations
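The 125 combinations are the 5 x 5 x 5 choices of ϕ for the three block dimensions. The sketch below enumerates them under several assumptions that the slide does not state: L1 is given in matrix elements rather than bytes, K1 and N1 are scaled by K0 and N0 in the same way that M1 is scaled by M0, and fractional values are truncated. The names `l1_block` and `l1_candidates` are hypothetical.

```c
#include <stddef.h>

typedef struct { int m1, k1, n1; } l1_block;

/* Largest r with r*r <= x (avoids linking the math library). */
static long isqrt_floor(long x) {
    long r = 0;
    while ((r + 1) * (r + 1) <= x) r++;
    return r;
}

/* Fills out[] with the 125 (M1, K1, N1) candidates and returns the count.
 * l1_elems: L1 capacity in matrix elements (assumption). */
size_t l1_candidates(long l1_elems, int m0, int k0, int n0, l1_block out[125]) {
    static const double phi[5] = {0.25, 0.5, 1.0, 1.5, 2.0};
    double d = (double)isqrt_floor(l1_elems / 3);   /* D = sqrt(L1/3) */
    size_t count = 0;
    for (int a = 0; a < 5; a++)
        for (int b = 0; b < 5; b++)
            for (int c = 0; c < 5; c++) {
                out[count].m1 = (int)(phi[a] * d / m0);
                out[count].k1 = (int)(phi[b] * d / k0);
                out[count].n1 = (int)(phi[c] * d / n0);
                count++;
            }
    return count;
}
```

Each candidate would then be benchmarked; for very small caches some entries can truncate to zero and would need to be filtered out.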
33. Naive Brute Force ?
● Search takes too long
● Generates very lengthy code
● Very slow under full optimization
● Need a better search strategy
34. Smarter Search
● Majority of the computation is performed in
register blocked code
● Benchmark only in multiples of register block
size
● Search space of M0, N0, K0 is not reduced
○ Prioritize neighborhood of the best ones found
○ {M0-1, M0, M0+1} etc.
● Terminate after reaching acceptable
efficiency
36. Single Precision MMM (100 MHz SGI
Indigo R4k)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
37. Double Precision MMM (HP 712/80i)
Source: PHiPAC: a Portable, High-Performance, ANSI C Coding Methodology
38. There is no Golden Hammer
Strengths:
● Automatic Search for optimal Params
● Produces portable ANSI C Code.
Weaknesses:
● Focus on uniprocessor Machines
● No support for vector based CPUs
● No control over instruction scheduling
41. ATLAS
● Automatically Tuned Linear Algebra
Software
● Generates optimized BLAS library
● C and Fortran77
● Provides implementation for BLAS levels 1,2
and 3.
● We will focus on Matrix-Matrix-Multiply
(MMM)
42. Naive MMM
● C = A * B using 3 for-loops
● Dimensions of A, B and C are NxK, KxM and
NxM respectively.
43. Optimization for L1 cache
● Matrix divided into NB x NB blocks
● Each block is called mini-MMM
● Optimization parameter NB is chosen such
that each mini-MMM fits in cache
45. Optimization for register file
● Mini-MMMs are further represented as
micro-MMMs
● Multiplies MU x 1 sub-matrix of A by 1 x NU sub-
matrix of B and accumulates the result into MU x
NU sub-matrix of C
● Here MU and NU are the optimization parameters
● Necessary condition : MU + NU + MU*NU <= NR
● where NR = no. of floating point registers
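The register constraint counts MU values of A, NU values of B and MU*NU accumulators for C. The sketch below fixes MU = NU = 2 to show the shape of a micro-MMM; it is an illustration, not the code ATLAS generates, and the function names and the row-major layout with leading dimensions `lda`, `ldb`, `ldc` are assumptions.

```c
#include <stddef.h>

/* Necessary condition from the slide: MU + NU + MU*NU <= NR. */
int fits_in_registers(int mu, int nu, int nr) {
    return mu + nu + mu * nu <= nr;
}

/* Micro-MMM for MU = NU = 2: C(2x2) += A(2xK) * B(Kx2), row-major. */
void micro_mmm_2x2(size_t K,
                   const double *a, size_t lda,
                   const double *b, size_t ldb,
                   double *c, size_t ldc) {
    /* MU*NU = 4 accumulators held in locals */
    double c00 = c[0], c01 = c[1], c10 = c[ldc], c11 = c[ldc + 1];
    for (size_t k = 0; k < K; k++) {
        /* MU = 2 values from a column of A, NU = 2 values from a row of B */
        double a0 = a[k], a1 = a[lda + k];
        double b0 = b[k * ldb], b1 = b[k * ldb + 1];
        c00 += a0 * b0; c01 += a0 * b1;
        c10 += a1 * b0; c11 += a1 * b1;
    }
    c[0] = c00; c[1] = c01; c[ldc] = c10; c[ldc + 1] = c11;
}
```

With MU = NU = 2 the kernel needs 2 + 2 + 4 = 8 values live at once, so it satisfies the condition for any NR of 8 or more.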
48. Pipeline scheduling
The 2 innermost loops (i'' and j'') are unrolled,
to create interleaved multiply and add
statements
Exploits instruction-level parallelism
● If there is fused multiply-add, then these 2
operations can be executed together
● The optimization parameter FMA tells
the code generator whether this facility is available
49. Pipeline scheduling
● MU + NU loads and stores
● MU * NU additions and multiplications
● Latency of operations might stall the pipeline
● Solution : Interleave the operations such that
dependent operations are separated by a
particular distance (What would that be?)
● This is governed by another optimization
parameter - LS
50. Pipeline scheduling
● Inject MU + NU loads of A and B
● Loads divided into:
○ Initial fetch (IF)
○ Blocks of other load operations (NF)
51. Loop Unrolling
● KU is the optimization parameter that
controls loop unrolling
● Constrained by the capacity of instruction
cache
● Should be neither too small (wasted cache)
nor too large (overflows the instruction cache)
52. Other Optimizations
● Copying tiles of A is done in the beginning of
outermost loop. These tiles are fully reused
in each iteration of j loop
● Copying jth vertical panel of B -- done before
beginning of i loop.
● Copying tile (i,j) of C just before the "k" loop
starts
53. Other optimizations
● Choosing loop order:
○ if N < M then JIK loop order (so that A
completely fits into L2 cache)
○ else if M < N then IJK loop order
54. Other optimizations
● Copying A, B, C for smaller matrices might
be an overhead
● Non-copying versions are generated with
optimization parameter NCNB
● This version is used if:
○ M * N * K is less than a threshold
○ at least 1 dimension of 1 of the matrices is
smaller than 3 * NCNB
55. Estimating parameters
● Orthogonal search is used for optimizing
parameters.
● It is a heuristic, and finds approximate
solutions
● No guarantee of an optimal solution
● It needs these details:
○ Order in which the parameters are optimized
○ Possible solution range for each parameter
○ Reference value used for parameter k while
optimizing parameters 1 to k-1
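The idea of orthogonal search can be sketched as below. This is a generic illustration, not ATLAS's search code: `measure` is a hypothetical callback that would build and time one code variant, and the dictionary-based interface is an assumption.

```python
def orthogonal_search(measure, order, ranges, reference):
    """Optimize one parameter at a time, in the given order.

    measure   -- callable taking a dict of parameter values, returns a score (higher is better)
    order     -- parameter names in the order they are optimized
    ranges    -- candidate values for each parameter
    reference -- starting values, used for parameters not yet optimized
    """
    current = dict(reference)
    for name in order:
        best_value, best_score = current[name], measure(current)
        for value in ranges[name]:
            score = measure({**current, name: value})
            if score > best_score:
                best_value, best_score = value, score
        current[name] = best_value  # this parameter stays fixed from now on
    return current
```

The cost is the sum of the range sizes rather than their product, which is why the result is only approximate: interactions between parameters optimized at different stages are never explored.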
57. Estimating Machine Parameters
Machine parameters are measured:
● C1 - Size of L1 data cache
● NR - Number of floating point registers
● FMA - Availability of fused multiply-add
● LS - Amount of separation between
dependent multiply and add instructions
59. Finding NB
● Generates values in range :
16 <= NB <= min(80, √C1)
where C1 = size of L1 data cache
60. Finding MU and NU
● All combinations that satisfy:
○ MU * NU + MU + NU + LS <= NR
● NB was obtained earlier
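Enumerating the candidate register tiles is straightforward. A sketch, assuming both MU and NU are at least 1 (the function name is hypothetical):

```python
def register_tile_candidates(NR, LS):
    """All (MU, NU) with MU*NU + MU + NU + LS <= NR, both at least 1."""
    candidates = []
    for MU in range(1, NR + 1):
        for NU in range(1, NR + 1):
            if MU * NU + MU + NU + LS <= NR:
                candidates.append((MU, NU))
    return candidates
```

For example, with NR = 8 and LS = 1 this yields five candidates: (1, 1), (1, 2), (1, 3), (2, 1) and (3, 1). Each candidate would then be timed with the NB found earlier.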
61. Finding LS and IF, NF
● LS:
○ Tries values in the interval [1, 6]
○ Boundary value fixed based on experiments
○ LS must divide MU * NU * KU (instruction scheduling)
● IF: searched in the interval [2, MU + NU]
● NF: searched in the interval [1, MU + NU - IF]
62. Finding NCNB
● Searches in the range [NB : -4 : 4] (from NB
down to 4 in steps of 4)
● Terminates the search when performance drops
20% below the best solution found
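A sketch of this descending search with early termination. It reads the stopping rule as "performance falls below 80% of the best seen so far", which is an interpretation of the slide; `measure` is a hypothetical callback returning a positive performance figure such as MFLOPS.

```python
def search_ncnb(measure, NB, drop=0.20):
    """Try NCNB = NB, NB-4, ... down to 4; stop once performance falls `drop` below the best."""
    best_value, best_perf = None, float("-inf")
    for ncnb in range(NB, 3, -4):
        perf = measure(ncnb)
        if perf > best_perf:
            best_value, best_perf = ncnb, perf
        elif perf < (1.0 - drop) * best_perf:
            break  # dropped too far below the best seen, give up early
    return best_value
```

The early exit saves installation time but can miss a better value further down the range.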
64. Finding KU
● Constrained by instruction cache
● Values between 4 and NB/2 are tried
● Special values 1 and NB are also considered
65. Empirical Optimization
● Estimation of optimal values is the key
○ Compilers use analytical models
○ Library generators (e.g. ATLAS) use search
● Empirical Search:
○ Get a version of program for each combination of
parameters
○ Execute it on the target machine and measure
performance
○ Select the one that performs best
○ Increased installation time!!
● How is the search space bounded?
○ The hardware parameters
66. Yotov et al.
● Realised that most optimizations used in
the ATLAS code generator are already known to
compilers.
○ Cache tiling, register tiling, etc.
● Replaced the search module with a
parameter estimator based on standard
analytical models
● Code generator is not modified
○ Any performance change is solely based on
differently chosen parameters
68. Analysis
● Results indicated that a simple and intuitive
model is able to estimate near-optimal
values for the parameters
● Focus on the ATLAS generated code
● Notations:
○ ATLAS CGw/S - Code Generator with Search
○ ATLAS Model - Modified ATLAS (No search)
○ ATLAS Unleashed - Hand-written code may be used
along with predefined architecture defaults for the
parameter values to produce the library.
69. Model-Based Optimization
● Requires more machine parameters than
original ATLAS
○ No Search!!
● Empirical optimizers:
○ Approximate values of machine params are okay
○ Only used to bound the search space
● Model-based Optimizers:
○ Need accurate values
○ Developed a tool called X-RAY to accurately
measure them
70. Hardware Parameters
● C1, B1: the capacity and the line size of the
L1 data cache
● CI: the capacity of the L1 instruction cache
● LX: hardware latency of the floating-point
multiply instruction
● |ALUFP|: number of floating-point functional
units
● NR: the number of floating-point registers
● FMA: the availability of a fused multiply-add
instruction
71. Estimating NB
● Consider L1 cache - Fully Associative,
Optimal replacement, Unit line size
● Working set of mini-MMM loop has 3 blocks
of NB x NB
3 * NB^2 <= C1
● An element of C, once computed in the innermost
loop, is not used again, so only 1 element of C
must stay in cache. Similarly, only 1 column of B
is needed in cache.
NB^2 + NB + 1 <= C1
72. Refined Estimate of NB
● Correcting for non-unit line size
⌈NB^2/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
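A sketch of solving the line-size-corrected inequality for the largest NB. It assumes C1 and B1 are both expressed in matrix elements (for example, a 32 KB cache of doubles with 64-byte lines gives C1 = 4096 and B1 = 8); the function name is hypothetical.

```python
def estimate_nb(C1, B1):
    """Largest NB with ceil(NB^2/B1) + ceil(NB/B1) + 1 <= C1/B1 (sizes in matrix elements)."""
    def ceil_div(a, b):
        return -(-a // b)

    nb = 0
    # The left-hand side never decreases as NB grows, so a linear scan finds the largest NB.
    while ceil_div((nb + 1) ** 2, B1) + ceil_div(nb + 1, B1) + 1 <= C1 / B1:
        nb += 1
    return nb
```

With C1 = 4096 and B1 = 8 this gives NB = 63; setting B1 = 1 reduces the test to the unit-line-size model NB^2 + NB + 1 <= C1.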
73. Further Refinement
● Estimated NB may not be multiple of MU and
NU
● This might cause fractional register tiles and
extra clean up
● Avoid this by choosing proper NB
● ATLAS needs NB to be an even integer
● So, we have: NB =
74. Estimating MU and NU
● View register file as a software cache
○ that is fully associative
○ unit line size
○ capacity = # registers, NR
● ATLAS performs outer products of (MU x 1)
and (1 x NU) vectors for register tiling
75. Contd.
● ATLAS allocates MU elements for A, NU
elements for B, and MU*NU elements for C
● Also need LS registers to store temp values
of multiplications to make use of pipelining
● So we have:
(MU x NU) + NU + MU + LS <= NR
LS calculation will be shown later, NR is known.
Only unknowns are MU and NU.
76. Estimation Scheme
● Let MU = NU = u. Solve the previous inequality for u
● Let MU = max(u, 1). Solve for NU
● Let NU = max(NU, 1)
● <MU, NU> = <max(MU, NU), min(MU, NU)>
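The four steps can be written out as a short sketch. It takes the largest integer solution at each step, which is an assumption about how "solve" is meant; the function name is hypothetical.

```python
import math


def estimate_mu_nu(NR, LS):
    """Follow the four steps above to pick a register tile (MU, NU)."""
    # Step 1: MU = NU = u gives u^2 + 2u + LS <= NR, i.e. (u + 1)^2 <= NR - LS + 1.
    u = math.isqrt(max(NR - LS + 1, 0)) - 1
    # Step 2: fix MU and solve MU*NU + NU + MU + LS <= NR for NU.
    MU = max(u, 1)
    NU = (NR - LS - MU) // (MU + 1)
    # Step 3: keep NU at least 1.
    NU = max(NU, 1)
    # Step 4: order the pair so that MU >= NU.
    return max(MU, NU), min(MU, NU)
```

For NR = 32 and LS = 4 this returns (4, 4), which uses 16 + 4 + 4 + 4 = 28 of the 32 registers.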
77. Estimating KU
● Not limited by the size of the register file
● Limited by the size of I-Cache
● Unroll the innermost loop within the size
constraints of instruction cache
● Avoid micro-MMM code cleanup
○ Trim KU so that it divides NB
○ On most machines, KU = NB
78. Estimating LS
● Skew factor that ATLAS code generator
uses to schedule dependent multiplication
and addition operations for CPU Pipeline
● The LS independent multiplications and LS-1
independent additions between mul_i and the
corresponding add_i should at least hide the
latency of multiplication.
79. Estimating LS
● LX = latency of multiplication
● 2 * LS - 1 independent instructions hides this
latency
● So, 2 * LS - 1 >= LX
● There may be multiple floating point units
(2 * LS - 1) / |ALUFP| >= LX
● Solution for LS: LS = ⌈(LX * |ALUFP| + 1) / 2⌉
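Rearranging (2 * LS - 1) / |ALUFP| >= LX gives LS >= (LX * |ALUFP| + 1) / 2, so the smallest integer LS is the ceiling of that value. A sketch, where the parameter name `num_fp_units` stands in for |ALUFP| and is an assumption:

```python
def estimate_ls(LX, num_fp_units=1):
    """Smallest integer LS with (2*LS - 1) / num_fp_units >= LX."""
    # For an integer x, ceil((x + 1) / 2) equals (x + 2) // 2.
    return (LX * num_fp_units + 2) // 2
```

For a multiply latency of 4 cycles on a single floating-point unit this gives LS = 3, since 2 * 3 - 1 = 5 independent instructions cover the 4-cycle latency while 2 * 2 - 1 = 3 would not.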
80. Summary
1. Estimate FMA
2. Estimate LS: LS = ⌈(LX * |ALUFP| + 1) / 2⌉
3. Estimate MU and NU
MU*NU + NU + MU + LS <= NR
Set MU = NU = u. Solve for u
MU = max(1, u). Solve for NU
NU = max(NU, 1). If MU < NU swap MU and NU
4. Estimate NB
⌈NB^2/B1⌉ + ⌈NB/B1⌉ + 1 <= C1/B1
○ Trim NB to be a multiple of 2, MU and NU
5. Estimate KU
○ Constrained by I-cache.
○ Make KU divide NB
6. Estimate NF, IF
○ IF = 2, NF = 2
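The two trimming steps (4 and 5) can be sketched as below. Rounding NB down rather than up is an assumption, and `ku_limit` is a hypothetical stand-in for whatever unroll bound the instruction cache allows; neither helper is taken from ATLAS.

```python
import math


def trim_nb(NB, MU, NU):
    """Round NB down to a multiple of 2, MU and NU."""
    step = 2
    for m in (MU, NU):
        step = step * m // math.gcd(step, m)  # least common multiple so far
    return (NB // step) * step


def trim_ku(NB, ku_limit):
    """Largest divisor of NB that does not exceed ku_limit."""
    for ku in range(min(NB, ku_limit), 0, -1):
        if NB % ku == 0:
            return ku
    return 1
```

For example, an estimated NB of 63 with MU = NU = 4 is trimmed to 60, and with room to unroll the full k loop KU becomes 60 as well, matching the observation that KU = NB on most machines.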
82. Conclusions
● On all machines (other than Itanium), the
model-based code performed almost as well as
the global-search-based code
● Finding parameters with models is much faster
than searching
● Might be difficult to implement analytical
methods in compilers
○ This model is focused on only 1 application