This document discusses Fortran programming best practices and features. It recommends avoiding labeled DO loops, EQUIVALENCE, and COMMON blocks due to errors and complexity. Instead, it suggests using CYCLE and EXIT statements, WHERE constructs, and DO CONCURRENT for vectorization. The document also covers sparse matrix storage formats and using Intel MKL and BLAS routines for linear algebra operations.
Unleash performance through parallelism - Intel® Math Kernel Library - Intel IT Center
This document discusses Intel's Math Kernel Library (MKL) and its support for Intel Xeon Phi coprocessors. MKL is a math library that provides optimized routines for linear algebra, Fourier transforms, vector math and more. It supports Intel Xeon Phi coprocessors using three usage models: automatic offload for transparent parallelism, compiler assisted offload for explicit control, and native execution to use coprocessors independently. The document provides examples and recommendations for choosing the best usage model based on application needs.
TensorRT is an NVIDIA tool that optimizes and accelerates deep learning models for production deployment. It performs optimizations like layer fusion, reduced precision from FP32 to FP16 and INT8, kernel auto-tuning, and multi-stream execution. These optimizations reduce latency and increase throughput. TensorRT automatically optimizes models by taking in a graph, performing optimizations, and outputting an optimized runtime engine.
This document contains instructions for three labs using Intel MKL on an Intel Xeon Phi coprocessor. Lab 1 demonstrates matrix multiplication using MKL's SGEMM and automatic offload. Lab 2 uses MKL's FFT functions with offload pragmas. Lab 3 runs a LINPACK benchmark natively on the coprocessor. The labs exercise different usage modes of MKL and optimization techniques like affinity settings.
This document introduces the compilation flow and IR design of Glow, an open-source framework for optimizing and compiling machine learning models to multiple backends and devices. It discusses the three levels of IR in Glow: High Level IR (HIR), Low Level IR (LIR), and backends. Pros include supporting training and inference compilation, quantization, and many HIR and LIR optimizations. Cons include lacking Python support and real ASIC backends. The document suggests areas for further work on Glow, such as adding more advanced optimizations, offloading subgraphs, improving JIT performance, and debugging optimized models.
The document discusses the steps in the logic synthesis process from RTL to optimized gate-level netlist. It includes:
1) RTL description is converted to an internal representation
2) Logic is optimized to remove redundancy
3) Technology mapping implements the representation using cells from a technology library
The document also discusses floor planning, which determines routing areas by placing blocks/macros, and placement which places standard cells in rows to minimize area and interconnect cost.
The document provides an introduction to OpenMP, which is an application programming interface for explicit, portable, shared-memory parallel programming in C/C++ and Fortran. OpenMP consists of compiler directives, runtime calls, and environment variables that are supported by major compilers. It is designed for multi-processor and multi-core shared memory machines, where parallelism is accomplished through threads. Programmers have full control over parallelization through compiler directives that control how the program works, including forking threads, work sharing, synchronization, and data environment.
This document discusses compiler optimization techniques. It describes Intel's guided optimization path which involves using general optimization options, identifying performance hotspots, and using inter-procedural optimization (IPO), profile-guided optimization (PGO), and parallel performance options. IPO allows optimizations across procedures, PGO improves performance based on profiling, and high-level optimizations exploit properties of high-level languages. The document provides examples of optimizations performed by each technique.
The document discusses various compiler optimizations including:
1. Procedure integration replaces procedure calls with the procedure body to eliminate function call overhead.
2. Common subexpression elimination replaces repeated computations of the same expression with a single variable to store the result.
3. Constant propagation replaces variables assigned a constant value with the constant throughout the code.
4. The document provides examples of these and other optimizations like copy propagation, code motion, induction variable elimination, and loop unrolling which aims to improve performance by reducing instructions and improving pipeline utilization.
Synthesizing HDL using LeonardoSpectrum - Hossam Hassan
This document discusses synthesizing HDL designs using the LeonardoSpectrum synthesis tool. It begins with an overview of synthesis and the basic synthesis process. It then describes the LeonardoSpectrum tool flow, including the stages of synthesis from technology independent to dependent. The document concludes with a tutorial on getting started with LeonardoSpectrum, walking through invoking the tool, loading a technology library, specifying input/output files, and setting global constraints.
This document provides an overview of SystemC Transaction Level Modeling (TLM) and the TLM standard. It describes what TLM is, why it is useful, how it is being adopted, and key concepts like abstraction levels, interfaces, and the goals of the TLM standard API. It also provides examples of how to model a system using TLM and leverage TLM to enable system debug and analysis.
High Performance Analytics Toolkit (HPAT) is a Julia-based framework for big data analytics on clusters that is both easy to use and extremely fast; it is orders of magnitude faster than alternatives like Apache Spark.
HPAT automatically parallelizes analytics tasks written in Julia and generates efficient MPI/C++ code.
The document provides a history of digital logic and programmable logic devices such as PLDs, CPLDs, and ASICs. It describes the advantages of FPGAs over other technologies including lower costs, faster time to market, and easier design changes. The architecture of FPGAs is explained including logic blocks, interconnects, embedded memory and DSP blocks. Modern SoC FPGAs integrate an ARM processor for improved performance. Applications include automotive, wireless, military, and medical imaging systems.
This document discusses instruction level parallelism (ILP) and how it can be used to improve performance by overlapping the execution of instructions through pipelining. ILP refers to the potential overlap among instructions within a basic block. Factors like dynamic branch prediction and compiler dependence analysis can impact the ideal pipeline CPI and number of data hazard stalls. Loop level parallelism refers to the parallelism available across iterations of a loop. Dependencies between instructions, if not properly handled, can limit parallelism and require instructions to execute in order. The three types of dependencies are data, name, and control dependencies.
This document discusses programmable logic devices (PLDs) like field programmable gate arrays (FPGAs). It provides details on FPGA types and vendors like Xilinx. FPGAs offer efficient resource utilization and flexibility. Xilinx is a major FPGA vendor and the document describes Xilinx FPGA families and features. It also includes an example VHDL code for a half adder circuit implemented on an FPGA.
This document discusses shared-memory parallel programming using OpenMP. It begins with an overview of OpenMP and the shared-memory programming model. It then covers key OpenMP constructs for parallelizing loops, including the parallel for pragma and clauses for declaring private variables. It also discusses managing shared data with critical sections and reductions. The document provides several techniques for improving performance, such as loop inversions, if clauses, and dynamic scheduling.
These are slides from the Dec 17 SF Bay Area Julia Users meeting [1]. Ehsan Totoni presented the ParallelAccelerator Julia package, a compiler that performs aggressive analysis and optimization on top of the Julia compiler. Ehsan is a Research Scientist at Intel Labs working on the High Performance Scripting project.
[1] http://www.meetup.com/Bay-Area-Julia-Users/events/226531171/
(8) cpp stack automatic_memory_and_static_memory - Nico Ludwig
Check out these exercises: http://de.slideshare.net/nicolayludwig/8-cpp-stack-automaticmemoryandstaticmemory-38510742
- Introducing CPU Registers
- Function Stack Frames and the Decrementing Stack
- Function Call Stacks, the Stack Pointer and the Base Pointer
- C/C++ Calling Conventions
- Stack Overflow, Underflow and Channelling incl. Examples
- How variable Argument Lists work with the Stack
- Static versus automatic Storage Classes
- The static Storage Class and the Data Segment
Erlang on Xen (LING) is a new Erlang platform that runs without an operating system for improved performance. The LINCX project ported the LINC-Switch to LING, demonstrating high compatibility. LINCX runs 100x faster than the original code by optimizing the fast processing path. LING can achieve throughput of up to 0.5 million packets per second and was used to build a high-performance network switch called LINCX.
The document discusses Emergent Game Technologies' Floodgate cross-platform stream processing library. It describes Floodgate as a foundation for easing multi-core development across platforms like PC, Xbox, PS3 and Wii. It outlines how Floodgate uses a stream processing model to partition work into tasks that can run concurrently, improving performance by taking advantage of multiple cores. Examples are given showing how tasks like skinning and morphing benefit from being offloaded to Floodgate.
Implementation of Soft-core processor on FPGA (Final Presentation) - Deepak Kumar
Implementation of Soft-core processor(PicoBlaze) on FPGA using Xilinx.
Establishing communication between two PicoBlaze processors.
Creating an application using the multi-core processor.
Peephole optimization techniques in compiler design - Anul Chaudhary
This document discusses various compiler optimization techniques, focusing on peephole optimization. It defines optimization as transforming code to run faster or use less memory without changing functionality. Optimization can be machine-independent, transforming code regardless of hardware, or machine-dependent, tailored to a specific architecture. Peephole optimization examines small blocks of code and replaces them with faster or smaller equivalents using techniques like constant folding, strength reduction, null sequence elimination, and algebraic laws. Common replacement rules aim to improve performance, reduce memory usage, and decrease code size.
OpenMP is a portable programming model that allows for parallel programming on shared memory architectures. It utilizes multithreading and shared memory to parallelize serial programs. OpenMP uses compiler directives, runtime libraries, and environment variables to parallelize loops and sections of code. It uses a fork-join model where the master thread forks additional threads to run portions of the program concurrently using shared memory. OpenMP provides a way to incrementally parallelize programs and is supported across many platforms.
Implementing subprograms requires saving execution context, allocating activation records, and maintaining dynamic or static chains. Activation records contain parameters, local variables, return addresses, and dynamic/static links. Nested subprograms are supported through static chains that connect activation records. Dynamic scoping searches the dynamic chain for non-local variables, while shallow access uses a central variable table. Blocks are implemented as parameterless subprograms to allocate separate activation records for block variables.
This document outlines how to simulate Verilog modules within MATLAB. It describes using Xilinx ISIM to generate a simulation executable from a module and testbench. A MATLAB function called runverilogmodule allows running the simulation executable and interfacing module inputs/outputs with MATLAB. An example short circuit module is provided to demonstrate the workflow, including a MATLAB wrapper function, Verilog testbench, and module code.
OpenMP is a framework for parallel programming that utilizes shared memory multiprocessing. It allows users to split their programs into threads that can run simultaneously across multiple processors or processor cores. OpenMP uses compiler directives, runtime libraries, and environment variables to implement parallel regions, shared memory, and thread synchronization. It is commonly used with C/C++ and Fortran to parallelize loops and speed up computationally intensive programs. A real experiment showed a nested for loop running 3.4x faster when parallelized with OpenMP compared to running sequentially.
This document summarizes random number generation using OpenCL. It discusses the Marsaglia polar method for generating random numbers and Gaussian pairs. It presents pseudocode for the Gaussian pair generation algorithm. Profiling results show that 54% of time is spent generating Gaussian pairs while 46% is for random numbers. The document also discusses optimization techniques like using local memory, coalesced global memory access, and choosing an optimal work group size. Performance results show near linear speedup from 1 to 8 GPUs.
1) The velocity gradient tensor decomposes into the rate of deformation tensor (Sij) and the rate of rotation tensor (Ωij). Sij is symmetric and represents straining motions while Ωij is antisymmetric and represents rotational motions.
2) Simple examples are analyzed to demonstrate irrotational and rotational flows, and how the tensors decompose them into straining and rotational parts.
3) Solid body rotation is shown to be purely rotational without strain, while other examples like plane shear flow contain both straining and rotational motions.
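The decomposition summarized in point 1 is, written out explicitly:

```latex
\frac{\partial u_i}{\partial x_j} = S_{ij} + \Omega_{ij},
\qquad
S_{ij} = \frac{1}{2}\!\left(\frac{\partial u_i}{\partial x_j}
       + \frac{\partial u_j}{\partial x_i}\right),
\qquad
\Omega_{ij} = \frac{1}{2}\!\left(\frac{\partial u_i}{\partial x_j}
            - \frac{\partial u_j}{\partial x_i}\right)
```

The symmetric part $S_{ij}$ captures straining motion and the antisymmetric part $\Omega_{ij}$ captures rigid rotation, consistent with the examples in points 2 and 3.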
The document describes a level set method for simulating droplets governed by the Navier-Stokes equations. It defines the level set function φ such that φ<0 in the gas region and φ>0 in the liquid region. The interface is defined as the zero level set of φ. Density and viscosity are defined as constant in each region using the Heaviside function of φ. Dimensionless forms of the equations are presented. A projection method is used to handle high Reynolds number flows. The level set function is smoothed so that density varies continuously across an interface of finite thickness, defined using a function of φ. An equation is added to correct volume during simulation.
The document summarizes the level set method for simulating droplets. It describes using the level set method to represent the interface as the zero level set of a function φ. The interface moves with the fluid particles according to the Navier-Stokes equation. Surface tension is modeled as a body force localized on the interface. Density and viscosity are constant in each region. Dimensionless equations are derived and the projection method is used to handle high Reynolds number flows. The thickness of the interface is defined based on a smoothed Heaviside function.
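The smoothed Heaviside function commonly used in level set methods to give the interface a finite half-thickness ε (and the form these summaries appear to refer to) is:

```latex
H_\varepsilon(\phi) =
\begin{cases}
0, & \phi < -\varepsilon,\\[4pt]
\dfrac{1}{2}\left(1 + \dfrac{\phi}{\varepsilon}
  + \dfrac{1}{\pi}\sin\dfrac{\pi\phi}{\varepsilon}\right),
  & |\phi| \le \varepsilon,\\[4pt]
1, & \phi > \varepsilon,
\end{cases}
```

so that density and viscosity, defined through $H_\varepsilon(\phi)$, vary smoothly across the interface rather than jumping at the zero level set.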
The MPACK: Multiple precision version of BLAS and LAPACK - Maho Nakata
We are interested in the accuracy of linear algebra operations: the accuracy of solutions of linear equations, of eigenvalues and eigenvectors of matrices, etc. This is one reason we have been developing MPACK. MPACK consists of MBLAS and MLAPACK, multiple precision versions of BLAS and LAPACK, respectively. Features of MPACK are: (i) based on LAPACK 3.x, (ii) provides a reference implementation and API, (iii) written in C++, rewritten from FORTRAN77, (iv) supports GMP, MPFR, DD/QD and binary128 as multiple precision arithmetic libraries, and (v) portable. The current version of MPACK is 0.7.0, supporting 76 MBLAS routines and 100 MLAPACK routines. The matrix-matrix multiplication routine has been accelerated using an NVIDIA C2050 GPU. All source code is available at: http://mplapack.sourceforge.net/
The document discusses storage area networks (SANs) and fiber channel technology. It provides background on SANs and how they function as a separate high-speed network connecting storage resources like RAID systems directly to servers. It then covers SAN topologies using fiber channel, including point-to-point, arbitrated loop, and fabric switch configurations. Finally, it discusses planning, managing and the management perspective of SANs in the data center.
The document discusses Android development and UI design. It introduces some common widgets in Android like TextView, buttons, and different layouts like linear, relative and table layouts. It also discusses activities, services, intents and the Android component and manifest files.
World War 2 involved extensive spying and intelligence gathering efforts. The US established the Coordinator of Information in 1941 to collect and analyze national security data, but the FBI was reluctant to share information. After the Pearl Harbor attacks, it was clear that intelligence coordination needed improvement. The Office of Strategic Services was formed in 1942 to conduct clandestine operations like training foreign troops and sending operatives behind enemy lines. Notable OSS agent Moe Berg was a Major League Baseball player who used his language skills and undercover abilities on missions in Yugoslavia, Norway, and Italy. The military also had their own intelligence operations such as code breaking and interrogating prisoners of war. After the war, the OSS was dissolved, and its functions were eventually reorganized into the Central Intelligence Agency.
The document discusses trends in various industries and demographic groups. It provides statistics on year-over-year growth and key demographics for several companies including Abbott Laboratories, Pacific Life, and CVS Caremark, along with trends for age groups ranging from 13 to 70 years old. The document uses unexplained terms that make its overall meaning difficult to discern.
This document discusses various topics related to criminal justice programs and profiling of serial killers. It includes outlines on profiling serial killers and the SARA problem solving model. It also covers classical, biological, social and psychological theories of crime. Various criminal justice organizations and conferences are mentioned. Case studies related to Ted Bundy are discussed in the context of criminal justice programs and profiling.
This document defines storage area networks (SANs) and discusses their architecture, technologies, management, security and benefits. A SAN consists of storage devices connected via a dedicated network that allows servers to access storage independently. Fibre Channel is the most widely used technology but iSCSI and FCIP allow block storage over IP networks. Effective SAN management requires coordination across storage, network and system levels. Security measures like authentication, authorization and encryption help protect data in this shared storage environment.
Sorayya Khan's novel "City of Spies" tells the story of 11-year-old Aliya Shah, who is struggling with her identity between her Pakistani father and Dutch mother in 1970s Pakistan. As political unrest grows under General Zia's regime, Aliya observes the arrests of Zulfiqar Ali Bhutto and the impact on those close to her like her servant's son Hanif. When anti-American protests erupt, Aliya's American friend Lizzy and her family are forced to leave the country. The novel explores Aliya's coming-of-age and search for identity amid the country's turmoil during this volatile period in Pakistan's history.
MATLAB is a high-level programming language and computing environment used for numerical computations, visualization, and programming. The document discusses MATLAB's capabilities including its toolboxes, plotting functions, control structures, M-files, and user-defined functions. MATLAB is useful for engineering and scientific calculations due to its matrix-based operations and built-in functions.
This document provides a summary of a course on introduction to MATLAB. The course includes 7 lectures covering topics like variables, operations, plotting, visualization, programming, solving equations and advanced methods. It will have problem sets to be submitted after each lecture and requirements to pass include attending all lectures and completing all problem sets. The course materials provide an overview of MATLAB including getting started, creating and manipulating variables, and basic plotting.
This document provides an overview of embedded C programming concepts including:
- The C preprocessor and directives like #define, #include, #if.
- Bitwise operations like bit masking, setting, clearing, and toggling bits.
- Type qualifiers like const and volatile and their usage.
- Compiler optimization levels and tradeoffs between execution time, code size, and memory usage.
- Enumerations and typedef for defining standard data types.
- Design concepts like layered architectures and finite state machines.
- The contents and purpose of object files like .text, .data, .bss sections.
- AUTOSAR architecture with layers like MCAL, ECUAL, and services layer.
Klee and Angr are tools for symbolic execution. Klee is a symbolic virtual machine that executes programs symbolically and generates test cases by solving constraints. It works on LLVM bitcode. Angr is a Python framework for analyzing binaries using static and dynamic symbolic analysis. It lifts binaries into an intermediate representation called VEX to analyze machine code across architectures. Both tools explore all paths in a program and solve path constraints to generate inputs that execute each path.
This document summarizes key features introduced in Java SE 5.0 (Tiger) including generics, autoboxing/unboxing, enhanced for loops, type-safe enums, varargs, static imports, and annotations. It also discusses performance enhancements in the virtual machine as well as new concurrency utilities like Executors and ScheduledExecutorService that make multi-threaded programming easier and more robust.
Introduction to Verilog HDL. This class notes present basic HDL structures, data types, operators, and expressions in Verilog. It also describes three typical modeling style for HDL design; behavioral, dataflow, and structural.
Welcome to the wonderful world of Java Streams ported for the CFML world!The beauty of streams is that the elements in a stream are processed and passed across the processing pipeline. Unlike traditional CFML functions like map(), reduce() and filter() which create completely new collections until all items in the pipeline are processed. With streams, the elements are streamed across the pipeline to increase efficiency and performance.
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...Ortus Solutions, Corp
This session will introduce the cbStreams module. It will discuss what Java streams are, each of the available methods and options, and how to implement cbStreams into their applications. With real-world examples of stream implementation, this session will also show how using streams can enhance the performance of your application and reduce latency. Target Audience: Anyone wishing to learn about Java streams.
Is SQLcl the Next Generation of SQL*Plus?Zohar Elkayam
Session from ILOUG I presented in May, 2016
Introducing the new tool from the developers of SQL Developer: SQLcl – a new command line tool from the SQL Developer team that might replace SQL*Plus and all of its functions which has been around for over 30 years!
In this session, we will explore the new functionality of the SQLcl, and use a live demonstration to show what SQLcl has to offer over the old SQL*Plus. We will use real life example to see what makes this tool such a time saver in day-to-day tasks for DBAs and developers who prefer using the command line interface.
This document provides an overview of an introductory C# programming course. The course covers C# fundamentals like setting up a development environment, data types, conditionals, loops, object-oriented programming concepts, and data structures. It includes topics like installing Visual Studio, writing a "Hello World" program, built-in data types like string, integer, boolean, and more. The document also outlines sample code solutions for exercises on command line arguments, integer operations, leap year finder, and powers of two.
Go is a general purpose programming language created by Google. It is statically typed, compiled, garbage collected, and memory safe. Go has good support for concurrency with goroutines and channels. It has a large standard library and integrates well with C. Some key differences compared to other languages are its performance, explicit concurrency model, and lack of classes. Common data types in Go include arrays, slices, maps, structs and interfaces.
This document provides an overview of MATLAB including its history, applications, development environment, built-in functions, and toolboxes. MATLAB stands for Matrix Laboratory and was originally developed in the 1970s at the University of New Mexico to provide an interactive environment for matrix computations. It has since grown to be a comprehensive programming language and environment used widely in technical computing across many domains including engineering, science, and finance. The key components of MATLAB are its development environment, mathematical function library, programming language, graphics capabilities, and application programming interface. It also includes a variety of toolboxes that provide domain-specific functionality in areas like signal processing, neural networks, and optimization.
Using existing language skillsets to create large-scale, cloud-based analyticsMicrosoft Tech Community
This document discusses how to use Python for analytics with Azure Data Lake. Currently, Python can be used via an extension library to run Python code in a reducer context. Going forward, Python will be able to run natively on vertices, allowing Python code to be used to build extractors, processors, outputters, reducers, appliers, and combiners. This will enable fully leveraging Python for analytics tasks like transforming data, creating new columns, and deleting columns.
Oracle 9i is changing the ETL (Extract, Transform, Load) paradigm by providing powerful new ETL capabilities within the database. Key features discussed include external tables for reading flat files directly without loading to temporary tables, the MERGE statement for updating or inserting rows with one statement, multi-table inserts for conditionally inserting rows into multiple tables, pipelined table functions for efficiently passing row sets between functions, and native compilation for improving PL/SQL performance. These new Oracle 9i capabilities allow for simpler, more efficient, and lower cost ETL processes compared to traditional third-party ETL tools.
The document provides an overview of the C++ programming language. It discusses that C++ was designed by Bjarne Stroustrup to provide Simula's facilities for program organization together with C's efficiency and flexibility for systems programming. It outlines key C++ features such as classes, operator overloading, references, templates, exceptions, and input/output streams. It also covers topics like class definitions, constructors, destructors, friend functions, and operator overloading. The document provides examples of basic C++ programs and explains concepts like compiling, linking, and executing C++ programs.
This document provides an overview and programming tips for using SQL procedural language (SQL PL) stored procedures on DB2 for z/OS. It discusses various features and enhancements for SQL PL including compound blocks, templates, dynamic SQL, XML support, array data types, global variables, and autonomous transactions. The document also provides examples and best practices for writing SQL procedures, including handling naming resolution, using templates for readability, and working with arrays and dynamic SQL.
This document discusses Java primitive data types and operators. It describes the 8 primitive types in Java - boolean, byte, char, double, float, int, long, short - including their ranges and behaviors. It also covers literals, variables, scopes and lifetimes. For operators, it explains arithmetic, relational, logical, assignment, increment/decrement, shift, and ternary operators. It includes examples to demonstrate the usage of various data types and operators in Java programs.
This document provides an overview of a 5-day Java programming workshop covering operators and conditionals. It discusses arithmetic, assignment, relational and logical operators as well as operator precedence. It also covers conditional statements using if/else and switch/case and provides examples of evaluating grades based on percentages. Additional learning resources on Java programming concepts and documentation are recommended.
Trace flags are used to temporarily change SQL Server's behavior for debugging or diagnosing issues. This document discusses several trace flags including:
TF 652, 661, 834, 836 which disable certain SQL Server processes or enable large page allocations.
TF 1211, 1224 which avoid lock escalation. TF 1117 forces data files to auto grow equally. TF 1204, 1205, 1222 provide more information on deadlocks.
TF 1118 addresses tempdb contention. TFs 3226, 3014, 3004 provide more backup/restore details. TF 4199 enables query processor fixes. TF 3502 prints checkpoint messages.
The document provides explanations of these trace flags
This document provides an overview of key concepts in MATLAB including:
- MATLAB can be used as a powerful calculator or programming language. It has many built-in functions and the ability to define variables and scripts.
- Scripts allow storing and running sequences of MATLAB commands. Variables can be created and manipulated using basic arithmetic, element-wise, and matrix operations.
- Common variable types include numeric arrays and cell arrays. Variables are initialized without declaring type or size. Built-in functions help work with variables.
- Key concepts covered include scripts, variables, vectors, matrices, basic operations, and plotting. Examples are provided to demonstrate MATLAB basics.
3. Fortran….
• Still Fortran 77, 90, or 95?
• Fortran 2003 and 2008 are already here, and Fortran 2015 is on the way.
• Some features have been deleted or declared obsolescent.
• Many of us are still using Fortran the wrong way.
4. What you shouldn’t use
Labeled Do Loops
do 100 ii = istart, ilast, istep
  isum = isum + ii
100 continue
EQUIVALENCE
specify the sharing of storage units by two or more objects
in a scoping unit
character (len=3) :: C(2)
character (len=4) :: A,B
equivalence (A,C(1)), (B,C(2))
[Diagram: storage units 1 to 7; C(1) occupies units 1-3, C(2) units 4-6, A units 1-4, B units 4-7.]
COMMON
Blocks of physical storage accessed by any of
the scoping units in a program
COMMON /BLOCKA/ A,B,C(10,30)
COMMON I, J, K
ENTRY
subroutine-like entry points inside a subprogram
FIXED FORM SOURCE
Fortran 77 style (80 column restriction)
CHARACTER* form
replaced with CHARACTER(LEN=?)
NON-BLOCK DO CONSTRUCT
the DO range doesn't end in a CONTINUE or
END DO
5. What you shouldn’t use
Labeled Do Loops
Labels are unnecessary, and it is hard to remember
what each number means. Moreover, we now have
the END DO and CYCLE statements.
EQUIVALENCE
Equivalence is also error-prone: it is hard to
keep track of all the positions these variables
point to.
Since COMMON and EQUIVALENCE are discouraged,
the BLOCK DATA statement (used to initialize
COMMON) should be avoided as well.
COMMON
Sharing lots of variables across a program is
dangerous and error-prone.
ENTRY
It complicates the program; modules and
separate subroutines cover the same need.
NON-BLOCK DO CONSTRUCT
It is hard to see where the DO loop ends.
6. What you might want to use – CYCLE , EXIT
• Avoid GOTO Statement
• Use CYCLE or EXIT statement
• CYCLE : skip the rest of the current iteration and begin the next
• EXIT : leave the loop entirely
do i=1, 100
x = real(i)
y = sin(x)
if (i == 20) exit
z = cos(x)
enddo
do i=1, 100
x = real(i)
y = sin(x)
if (i == 20) cycle
z = cos(x)
enddo
With EXIT (left): 19 iterations complete successfully; on the
20th iteration y = sin(x) is executed, then the loop is exited,
so z = cos(x) is skipped.
With CYCLE (right): all 100 iterations run, but at i = 20,
z = cos(x) is not executed.
7. What you might want to use – CYCLE , EXIT
• Avoid GOTO statement
• Use CYCLE or EXIT statement with nested loop
• Constructs (DO, IF, CASE, etc.) may have names
outer: do j=1, 100
inner: do i=1, 100
x = real(i)
y = sin(x)
if (i > 20) exit outer
z = cos(x)
enddo inner
enddo outer
Left: exits both loops once i reaches 21. Right: skips z = cos(x) and jumps to the next outer iteration once i > 20.
outer: do j=1, 100
inner: do i=1, 100
x = real(i)
y = sin(x)
if (i > 20) cycle outer
z = cos(x)
enddo inner
enddo outer
8. What you might want to use – WHERE
real, dimension(4) :: &
  x = [ -1, 0, 1, 2 ], &
  a = [ 5, 6, 7, 8 ]
...
where (x < 0)
  a = -1.
end where
! a : {-1.0, 6.0, 7.0, 8.0}

where (x /= 0)
  a = 1. / a
elsewhere
  a = 0.
end where
! a : {-1.0, 0.0, 1.0/7.0, 1.0/8.0}
9. What you might want to use – ANY
integer, parameter :: n = 100
real, dimension(n,n) :: a, b, c1, c2
c1 = my_matmul(a, b) ! home-grown function
c2 = matmul(a, b) ! built-in function
if (any(abs(c1 - c2) > 1.e-4)) then
  print *, 'There are significant differences'
endif
• ANY and WHERE remove redundant DO loops
10. What you might want to use – DO CONCURRENT
• Vectorization
• Simple example of Auto-Parallelization
• Definition : Processes one operation on multiple pairs of operands at once
do concurrent (i=1:m)
call dosomething()
end do
DO i=1,1024
C(i) = A(i) * B(i)
END DO
DO i=1,1024,4
C(i:i+3) = A(i:i+3) * B(i:i+3)
END DO
• DO CONCURRENT allows/requests vectorization and parallelization. If you also want auto-parallelized code, enable the -parallel option.
• Restrictions: no data dependencies, no EXIT, CYCLE, or RETURN statements inside the loop.
• It can also be used together with OpenMP.
11. For More..
• Read Fortran 2008 Standard
• http://www.j3-fortran.org/doc/year/10/10-007.pdf
• A more recent draft document for Fortran 2015 (work in progress)
• http://j3-fortran.org/doc/year/15/15-007.pdf
• Easy to read documents
• The new features of Fortran 2008 : ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1828.pdf
• Modern Programming Languages: Fortran90/95/2003/2008 :
https://www.tacc.utexas.edu/documents/13601/162125/fortran_class.pdf
13. Build?
• Build: the process from source code to executable files.
• Compiler : a tool for compiling. Linker : a tool for linking.
• ifort, gcc, gfortran, and so on are combined tools for compiling & linking.
Source Code1.f, Source Code2.f, Source Code3.f
  --(Compile)--> Source Code1.o, Source Code2.o, Source Code3.o
  --(Link, together with libraries such as FFTW)--> a.out
Source files (.f) are human-readable; object files (.o) and the executable (a.out) are not.
14. Makefile?
• make does all the compile & link jobs automatically. A Makefile is a build script.
• make (actually gmake) is one of many such tools; tools like make are called build systems.
• Visual Studio has its own build system, so it doesn't use Makefiles.
$ gcc -o hellomake hellomake.c hellofunc.c -I.
hellomake: hellomake.c hellofunc.c
gcc -o hellomake hellomake.c hellofunc.c -I.
1. Command-line
2. Simple Makefile (1)
• “hellomake:” : rule name
• “hellomake.c hellofunc.c” : dependencies
• “gcc …” : the actual command
• Plain “make” executes the first rule defined in the Makefile
Makefile Command-line
$ make or
$ make hellomake
15. Makefile?
CC=gcc
CFLAGS=-I.
hellomake: hellomake.o hellofunc.o
$(CC) -o hellomake hellomake.o hellofunc.o -I.
3. Simple Makefile (3)
Add variables
• “CC=gcc” : the C compiler
• “CFLAGS” : list of flags to pass to the compilation command
• For Fortran, use “FC” instead of “CC” and “FFLAGS” instead of “CFLAGS”
• The command line (“$(CC) …”) must be indented with a tab!
$ make or
$ make hellomake
16. Makefile?
CC=gcc
CFLAGS=-I.
DEPS = hellomake.h
hellomake: hellomake.o hellofunc.o
$(CC) -o hellomake hellomake.o hellofunc.o -I.
%.o: %.c $(DEPS)
$(CC) -c $< $(CFLAGS)
4. Simple Makefile (4)
Automatically match .c files and generate a rule for compilation (.o). $@ and $< are special (automatic) macros in a Makefile.
• Rule %.o : rule for compilation. Rule hellomake : rule for linking.
• $@ : the name of the target being made (e.g. hellomake for the rule hellomake).
• $< : the name of the first prerequisite (hellomake.o is the first prerequisite of the rule hellomake).
• $^ : the names of all the prerequisites, separated by spaces.
• $* : the stem shared by the target and dependent files (for hellomake.o built from hellomake.c, $* is hellomake).
$ make or
$ make hellomake
17. Compiler & Linker Options
FFLAGS=-O3 -r8 -openmp -I /home/astromece/usr/fftw/include
LIBS=-L/home/astromeca/usr/lib -lfftw3 -lm
Compiler Options and Linker Options
• -O3 : optimization level (-O1 : optimize for code size, -O2 : general optimization (default), -O3 : aggressive optimization)
• -r8 : the real type becomes double precision (8 bytes (= 64 bits) per real)
• -I : specify an include directory (include files: .h, declarations)
• -L : specify a library directory (library files: .so or .a)
• -lfftw3 : link with the fftw3 library
• -lm : link with the math library (needed for several math intrinsic functions)
18. Compiler & Linker Options
Recommend options
• -heap-arrays [size] : puts automatic arrays and temporary arrays larger than [size] KB on the heap instead of the stack. Same effect as using allocatable arrays.
• -ax[code] : generate code for a specific CPU architecture. DGIST, Boolt : AVX; CSE Server (OMP) : SSE4.1; CSE Server (SMP) : SSE4.2.
• -O2 : before enabling -O3, compare results with -O2 and -O3. “Sometimes” -O3 causes different results.
• -parallel : enable auto-parallelized code. Turn it on if you use DO CONCURRENT.
• -free : free-form source (f90 style). ifort compiles a .f file as Fortran 77 by default; if you want .f files treated as Fortran 90 or higher, enable this option.
• $ man ifort gives a lot of additional information.
Debug vs Release
• The -g (use a debugger) and -check (check array bounds and so on) options help reduce errors; however, they add extra code, which slows the program down, and they turn off optimization automatically.
• Once you are confident the code is error-free and want production results, enable optimization and remove the -g and -check options.
20. Intel MKL(Math Kernel Library) and BLAS
Intel MKL
• A library of optimized math routines for science, engineering, and financial applications.
• Includes basic functions for matrices and vectors.
• No separate installation is needed; just link against the library.
BLAS
• Basic Linear Algebra Subprograms
• A set of low-level routines for common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication.
• It has a common interface but many implementations: ATLAS, MKL, OpenBLAS, GotoBLAS, and so on.
• I will use MKL BLAS because it is easy to compile against and well documented.
• It is already parallelized: just enabling one option gives you threading without writing OpenMP yourself (MPI parallelism is not provided).
I will show how to build the CG method using MKL BLAS line by line.
21. Sparse Matrix Format
• Before using the BLAS library functions, we need to consider how to store the 𝐴 matrix in 𝐴𝑥 = 𝑏.
A =
| 1 7 0 0 |
| 0 2 8 0 |
| 5 0 3 9 |
| 0 6 0 4 |
• 9 entries (non-zero entries).
• CSR (Compressed Sparse Row) stores three arrays: values, column indices, and row offsets. Walking the matrix row by row, each non-zero value and its column index are appended, and the row-offsets array records where each row starts in the values array.
30. Sparse Matrix Format
• The finished CSR representation of 𝐴 (one-based indexing):
row offsets    : 1 3 5 8 10   (the final entry, 10 = nnz + 1, indicates the end)
column indices : 1 2 2 3 1 3 4 2 4
values         : 1 7 2 8 5 3 9 6 4
• 9 entries (non-zero entries)
31. Sparse matrix
• Storing the A matrix densely, zeros included, requires 16 × 8 bytes.
• The sparse (CSR) form of this matrix requires 23 × 8 bytes (9 values + 9 column indices + 5 row offsets).
• Inefficient? No: for a large A matrix, such as (𝑛𝑥 ⋅ 𝑛𝑦) × (𝑛𝑥 ⋅ 𝑛𝑦), CSR is SOOOO efficient.
row offsets    : 1 3 5 8 10
column indices : 1 2 2 3 1 3 4 2 4
values         : 1 7 2 8 5 3 9 6 4
32. What BLAS Library Functions Required?
• mkl_dcsrgemv : computes the matrix-vector product of a sparse general matrix stored in the CSR format (3-array variation) with one-based indexing, in double precision. Used in the 𝐴𝑥 computation.
• call mkl_dcsrgemv(transa, m, a, ia, ja, x, y)
• transa : selects 𝐴𝑥 (transa=‘N’ or ‘n’) or 𝐴ᵀ𝑥 (transa=‘T’, ‘t’, ‘C’, or ‘c’).
• m : # of rows of A
• a : values array of A in CSR format
• ia : row offsets array of A in CSR format
• ja : column indices array of A in CSR format
• x : the x vector
• y : output (𝐴𝑥)
• dcopy : copies a vector: 𝑦 = 𝑥
• call dcopy(n, x, incx, y, incy)
• n : # of elements in vectors 𝑥 and 𝑦
• x : input, the 𝑥 vector; incx : stride between elements of x (usually 1)
• y : output, the 𝑦 vector; incy : stride between elements of y (usually 1)
33. What BLAS Library Functions Required?
• ddot : computes a vector-vector dot product, 𝑥 ⋅ 𝑦
• It is a function, not a subroutine: res = ddot(n, x, incx, y, incy)
• n : # of elements; x, y : the 𝑥 and 𝑦 vectors; incx, incy : strides (usually 1)
• daxpy : computes a vector-scalar product and adds the result to a vector. DAXPY : Double-precision A·X Plus Y
• 𝑦 = 𝑎 ⋅ 𝑥 + 𝑦
• call daxpy(n, a, x, incx, y, incy)
• n : # of elements in vectors 𝑥 and 𝑦
• a : the scalar 𝑎
• x : input, the 𝑥 vector
• y : input/output, the 𝑦 vector
• dnrm2 : computes the Euclidean norm of a vector: ‖𝑥‖₂ = sqrt(𝑥 ⋅ 𝑥)
• It is a function, not a subroutine: res = dnrm2(n, x, incx)
• n : # of elements in vector 𝑥
• x : input, the 𝑥 vector