While threads have been a first-class citizen in C++ since C++11, they are not always the best abstraction for expressing parallelism when the objective is to speed up computations. OpenMP is a parallelism API for C/C++ and Fortran that has been around for a long time. Intel's Threading Building Blocks (TBB) is only a little more than 10 years old, but it is very mature and built specifically for C++.
Mats will introduce OpenMP and TBB and their use in modern C++, provide some best practices for them, and try to predict what the C++ standard has in store for us when it comes to parallelism in the future.
Introduction to Convolutional Neural Network with TensorFlow - Etsuji Nakai
Explaining the basic mechanism of convolutional neural networks with sample TensorFlow code.
Sample codes: https://github.com/enakai00/cnn_introduction
TensorFlow is an open source neural network library for Python and C++. It defines data flows as graphs with nodes representing operations and edges representing multidimensional data arrays called tensors. It supports supervised learning algorithms like gradient descent to minimize cost functions. TensorFlow automatically computes gradients so the user only needs to define the network structure, cost function, and optimization algorithm. An example shows training various neural network models on the MNIST handwritten digit dataset, achieving up to 99.2% accuracy. TensorFlow can implement other models like recurrent neural networks and is a simple yet powerful framework for neural networks.
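The training loop that TensorFlow automates can be sketched in plain NumPy. This is not TensorFlow code; the model (a one-parameter linear fit), the mean-squared-error cost, and the learning rate are illustrative choices showing how gradient descent minimizes a cost function:

```python
import numpy as np

# Synthetic, noise-free data: y = 3x + 1, so the fit should recover w=3, b=1.
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0

w, b = 0.0, 0.0   # model parameters, initialized to zero
lr = 0.5          # learning rate

for _ in range(2000):
    y_hat = w * x + b                        # forward pass
    grad_w = 2.0 * np.mean((y_hat - y) * x)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(y_hat - y)        # d(MSE)/db
    w -= lr * grad_w                         # descend along the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))  # close to 3.0 and 1.0
```

In TensorFlow the two gradient lines disappear: the framework differentiates the cost automatically, which is exactly the point made above.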
The document describes how to use TensorBoard, TensorFlow's visualization tool. It outlines 5 steps: 1) annotate nodes in the TensorFlow graph to visualize, 2) merge summaries, 3) create a writer, 4) run the merged summary and write it, 5) launch TensorBoard pointing to the log directory. TensorBoard can visualize the TensorFlow graph, plot metrics over time, and show additional data like histograms and scalars.
Towards Typesafe Deep Learning in Scala - Tongfei Chen
Deep learning is predominantly done in Python, a type-unsafe language. Try changing this unfortunate fact by doing it in Scala instead! With expressive types, compile-time typechecking, and awesome DSLs, we present Nexus, a typesafe, expressive and succinct deep learning framework.
The document discusses arrays and sparse matrices as data structures. It defines array and sparse matrix abstract data types, including methods for creating, accessing, and manipulating the structures. Examples are given of representing polynomials using arrays or as a sparse matrix to illustrate different implementations of these data structures.
Scientific Computing with Python Webinar March 19: 3D Visualization with Mayavi - Enthought, Inc.
In this webinar, Didrik Pinte provides an introduction to MayaVi, the 3D interactive visualization library for the open source Enthought Tool Suite. These tools provide scientists and engineers a sophisticated Python development framework for analysis and visualization.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/dec-2019-alliance-vitf-facebook
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Joseph Spisak, Product Manager at Facebook, delivers the presentation "PyTorch Deep Learning Framework: Status and Directions" at the Embedded Vision Alliance's December 2019 Vision Industry and Technology Forum. Spisak gives an update on the Torch deep learning framework and where it’s heading.
SciPy and NumPy are Python packages that provide scientific computing capabilities. NumPy provides multidimensional array objects and fast linear algebra functions. SciPy builds on NumPy and adds modules for optimization, integration, signal and image processing, and more. Together, NumPy and SciPy give Python powerful data analysis and visualization capabilities. The community contributes to both projects to expand their functionality. Memory mapped arrays in NumPy allow working with large datasets that exceed system memory.
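The memory-mapped arrays mentioned above can be sketched as follows; the file path is a temporary one created just for this example:

```python
import numpy as np
import os
import tempfile

# Create a disk-backed array; the data lives in the file, not in RAM.
path = os.path.join(tempfile.mkdtemp(), "big.dat")
mm = np.memmap(path, dtype=np.float64, mode="w+", shape=(1000, 1000))

mm[:] = 1.0                     # writes go through to the file
mm[0, :10] = np.arange(10)      # slices work like ordinary ndarrays
mm.flush()                      # ensure everything is on disk

# Re-open read-only: only the pages actually touched are paged in,
# so arrays larger than system memory remain workable.
ro = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 1000))
print(ro[0, :5])
```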
An introduction to Deep Learning concepts, with a simple yet complete neural network, CNNs, followed by rudimentary concepts of Keras and TensorFlow, and some simple code fragments.
Pythran: Static Compiler for High Performance, by Mehdi Amini (PyData SV 2014) - PyData
Pythran is an ahead-of-time compiler that turns modules written in a large subset of Python into C++ meta-programs that can be compiled into efficient native modules. It mainly targets compute-intensive parts of the code, so it comes as no surprise that it focuses on scientific applications that make extensive use of NumPy. Under the hood, Pythran inter-procedurally analyses the program and performs high-level optimizations and parallel code generation. Parallelism can be found implicitly in Python intrinsics or NumPy operations, or explicitly specified by the programmer using OpenMP directives directly in the Python source code. Either way, the input code remains fully compatible with the Python interpreter. While the idea is similar to Parakeet or Numba, the approach differs significantly: the code generation is not performed at runtime but offline. Pythran generates heavily templated C++11 code that makes use of the NT2 meta-programming library and relies on any standard-compliant compiler to generate the binary code. We propose to walk through some examples and benchmarks, exposing the current state of what Pythran provides as well as the limits of the approach.
This fast-paced session starts with an introduction to neural networks and linear regression models, along with a quick view of TensorFlow, followed by some Scala APIs for TensorFlow. You'll also see a simple dockerized image of Scala and TensorFlow code and how to execute the code in that image from the command line. No prior knowledge of NNs, Keras, or TensorFlow is required (but you must be comfortable with Scala).
An introductory-to-mid-level presentation on complex network analysis: network metrics, analysis of online social networks, approximated algorithms, memorization issues, storage.
NumPy is a Python library used for working with multidimensional arrays and matrices for scientific computing. It allows fast operations on arrays through optimized C code and is the foundation of the Python scientific computing stack. NumPy arrays can be created in many ways and support operations like indexing, slicing, broadcasting, and universal functions. NumPy provides many useful features for linear algebra, Fourier transforms, random number generation and more.
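The indexing, slicing, broadcasting, and universal-function features named above can be shown in a few lines (the array values here are arbitrary illustrations):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)    # 3x4 array holding 0..11

col = a[:, 1]                      # slicing: second column -> [1, 5, 9]

centered = a - a.mean(axis=0)      # broadcasting: the 1x4 row of column
                                   # means is stretched across all 3 rows

squared = np.square(a)             # universal function: applied
                                   # elementwise in optimized C code
```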
Introduction to TensorFlow, by Machine Learning at Berkeley - Ted Xiao
A workshop introducing the TensorFlow Machine Learning framework. Presented by Brenton Chu, Vice President of Machine Learning at Berkeley.
This presentation covers how to construct, train, evaluate, and visualize neural networks in TensorFlow 1.0.
http://ml.berkeley.edu
An introduction to Deep Learning (DL) concepts, starting with a simple yet complete neural network (no frameworks), followed by aspects of deep neural networks, such as back propagation, activation functions, CNNs, and the AUT theorem. Next, a quick introduction to TensorFlow and Tensorboard, and then some code samples with Scala and TensorFlow.
A fast-paced introduction to Deep Learning that starts with a simple yet complete neural network (no frameworks), followed by an overview of activation functions, cost functions, backpropagation, and then a quick dive into CNNs. Next we'll create a neural network using Keras, followed by an introduction to TensorFlow and TensorBoard. For best results, familiarity with basic vectors and matrices, inner (aka "dot") products of vectors, and rudimentary Python is definitely helpful.
This document discusses algorithms complexity and data structures efficiency. It covers topics like time and memory complexity, asymptotic notation, fundamental data structures like arrays, lists, trees and hash tables, and choosing proper data structures. Computational complexity is important for algorithm design and efficient programming. The document provides examples of analyzing complexity for different algorithms.
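The payoff of choosing the right structure and algorithm can be made concrete by counting operations rather than measuring wall-clock time. A sketch, comparing linear search against binary search on sorted data:

```python
def linear_search(items, target):
    """O(n): compare against each element in turn."""
    comparisons = 0
    for x in items:
        comparisons += 1
        if x == target:
            break
    return comparisons

def binary_search(items, target):
    """O(log n) on sorted input: halve the candidate range each step."""
    lo, hi, comparisons = 0, len(items) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return comparisons

data = list(range(1_000_000))
print(linear_search(data, 999_999))  # about a million comparisons
print(binary_search(data, 999_999))  # about twenty comparisons
```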
Tensorflow in Practice, by engineer Donghwi Cha
- The document is an introduction to the machine learning framework TensorFlow, covering key concepts like computation graphs, operations, sessions, training, replication, and clustering.
- Key aspects discussed include how Tensorflow executes operations as a static computation graph, uses sessions to run graphs and tensors to hold values, and supports data parallelism through replication across devices/workers.
- The document provides examples of building neural network models in Tensorflow and discusses techniques for training models like backpropagation and distributing training using data parallelism.
An introduction to Google's AI engine, with a deeper look into artificial neural networks and machine learning, and an appreciation of how even the simplest neural network can be codified and used for data analytics.
This document provides an overview and introduction to deep learning concepts including linear regression, activation functions, gradient descent, backpropagation, hyperparameters, convolutional neural networks (CNNs), recurrent neural networks (RNNs), and TensorFlow. It discusses clustering examples to illustrate neural networks, explores different activation functions and cost functions, and provides code examples of TensorFlow operations, constants, placeholders, and saving graphs.
Abstract: This PDSG workshop introduces basic concepts of TensorFlow. The course covers fundamentals. Concepts covered are Vectors/Matrices, Design & Run, Constants, Operations, Placeholders, Bindings, Operators, Loss Functions, and Training.
Level: Fundamental
Requirements: Some basic programming knowledge is preferred. No prior statistics background is required.
Vector is a container that improves on C++ arrays by allowing elements to be added and removed dynamically without specifying a size. Vectors can hold different data types like integers, strings, and custom objects. Elements can be accessed and modified using operator[], and iterated over using begin() and end() iterators. Common operations include push_back() to add elements, pop_back() to remove the last element, and insert() and erase() to modify the elements within the vector. Vectors can be compared using ==, !=, and < to check for equality and sort order.
The document discusses dictionaries and different data structures used to implement them efficiently, including hashing. Dictionaries store key-value pairs and support fast retrieval of elements using keys. Hashing is described as an efficient solution to implement dictionaries, where keys are mapped to table slots using a hash function. Hashing provides expected O(1) time for search, insert and delete operations when the load factor is a constant. Collision resolution using chaining is explained, where elements hash to the same slot are stored in a linked list.
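The chaining scheme described above fits in a short sketch; the class name and fixed slot count are illustrative, not from the document:

```python
class ChainedHashTable:
    """Minimal key-value dictionary: a fixed-size table of buckets,
    with collisions resolved by chaining inside each bucket's list."""

    def __init__(self, slots=8):
        self.slots = [[] for _ in range(slots)]

    def _bucket(self, key):
        # The hash function maps any key to one of the table slots.
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # key exists: overwrite
                return
        bucket.append((key, value))        # collision or new key: chain

    def get(self, key):
        for k, v in self._bucket(key):     # expected O(1) with a
            if k == key:                   # constant load factor
                return v
        raise KeyError(key)
```

With only 8 slots and many keys, the chains grow and lookups degrade toward O(n); real implementations resize the table to keep the load factor constant.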
This document provides an overview and introduction to TensorFlow, an open source software library for numerical computation using data flow graphs. The graphs are composed of nodes, which are operations on data, and edges, which are multidimensional data arrays (tensors) passing between operations. It also weighs the pros and cons of TensorFlow and describes higher-level APIs, requirements and installation, program structure, tensors, variables, operations, and other key concepts.
C++11 introduced several new features to C++ that support modern programming patterns like lambdas, smart pointers, and move semantics. Lambdas allow for anonymous inline functions, which improve code readability and eliminate the need for functor classes. Smart pointers like unique_ptr and shared_ptr make memory management safer and more straightforward using features like move semantics and RAII. C++11 also aims to bring C++ more in line with modern hardware by supporting multi-threading and higher level abstractions while maintaining performance.
- Linear algebra is important for image recognition and other fields like physics, economics, and politics. It allows analyzing relationships between multiple variables without calculus.
- Python is a good platform for linear algebra due to libraries like NumPy that allow fast processing of multi-dimensional data like matrices. It also has simple syntax without semicolons.
- Key concepts discussed include vectors, matrices, linear transformations, abstraction, and how linear algebra solves problems in fields like quantum mechanics. Comprehensions provide a concise way to generate sets, lists, and arrays in Python.
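The two ideas in that last bullet, linear transformations and comprehensions, combine naturally. A sketch in plain Python (the rotation matrix and the unit-square points are made up for illustration):

```python
# A 2x2 linear transformation: rotation by 90 degrees about the origin.
rotate90 = [[0, -1],
            [1,  0]]

def transform(matrix, vec):
    """Multiply a 2x2 matrix by a 2-vector, via a comprehension."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Comprehensions also generate whole datasets concisely:
unit_square = [(x, y) for x in (0, 1) for y in (0, 1)]
rotated = [transform(rotate90, v) for v in unit_square]
print(rotated)
```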
Simple, Fast, and Scalable Torch7 Tutorial - Jin-Hwa Kim
A tutorial based on basic information about Torch7. It covers installation, simple runnable code, tensor manipulation, a sweep through key packages, and post-hoc audience Q&A.
A contiguous buffer is what makes code run fast. With a regular and compact layout, data can move quickly through the memory hierarchy to the processor. NumPy makes use of this to achieve runtimes almost as fast as C.
However, not all algorithms can be easily maintained or designed using array operations. A more general way is to use the contiguous buffer as an interface. Python's dynamic typing system makes it easy to use but hard to optimize. A typed, fixed-shape buffer allows efficient data sharing between fast code written in C++ and the Python application. In the end, the system is as easy to use as Python but much faster, with runtime matching C++.
In this talk, I will show how to design a simple array system in C++ and make it available in Python. It may be made as a general-purpose library, or embedded for ad hoc optimization.
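The "typed, fixed-shape buffer as an interface" idea can be inspected from the Python side. This sketch uses NumPy's flags and the standard buffer protocol, which is the same mechanism a C++ extension (e.g. via pybind11) would read from without copying:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
print(a.flags["C_CONTIGUOUS"])   # True: one compact row-major buffer

# A transpose is just a view with swapped strides: same buffer,
# but no longer contiguous in row-major order.
t = a.T
print(t.flags["C_CONTIGUOUS"])   # False

# The buffer protocol exposes the typed, fixed-shape memory to other
# code without copying -- type and shape travel with the pointer.
view = memoryview(a)
print(view.format, view.shape)   # 'd' (C double), (3, 4)
```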
Rust provides safe, fast code through its ownership and borrowing model which prevents common bugs like use-after-free and data races. It enables building efficient parallel programs while avoiding the need for locking. Traits allow defining common interfaces that can be implemented for different types, providing abstraction without runtime costs. The language also supports unsafe code for interfacing with other systems while still enforcing safety within Rust programs through the type system.
This document provides an overview of advanced data structures and algorithm analysis taught by Dr. Sukhamay Kundu at Louisiana State University. It discusses the role of data structures in making computations faster by supporting efficient data access and storage. The document distinguishes between algorithms, which determine the computational steps and data access order, and data structures, which enable efficient reading and writing of data. It also describes different methods for measuring algorithm performance, such as theoretical time complexity analysis and empirical measurements. Examples are provided for instrumenting code to count operations. Overall, the document introduces fundamental concepts about algorithms and data structures.
Java/Scala Lab: Anatoliy Kmetyuk - Scala SubScript: An Algebra for Reactive Pr... - GeeksLab Odessa
SubScript is an extension of the Scala language that adds support for the constructs and syntax of the Algebra of Communicating Processes (ACP). SubScript is a promising extension, applicable both to the development of highly loaded parallel systems and to simple personal applications.
Although Python is a slow interpreter, it can drive a system that delivers utmost speed if some guidelines are followed. The key is to treat programming languages as syntactic sugar over machine code. This expedites the workflow of timing, iterative design, automatic testing, and optimization, and realizes an HPC system that balances time to market against quality of code.
Speed is king. 10x more productive developers change a business; so does 10x faster code. Python is 100x slower than C++, but that only matters when you actually use Python to implement number-crunching algorithms. We should not do that, and instead go directly to C++ for speed. This calls for strict discipline in software engineering and code quality, but note that here quality is defined by runtime and time to market.
The presentation focuses on the Python side of the development workflow. It is made possible by confining C++ in architecture defined by the Python code, which realizes most of the software engineering. The room for writing fast C++ code is provided by pybind11 and careful design of typed data objects. The data objects hold memory buffers exposed to Python as numpy ndarrays for direct access for speed.
This document provides an overview of CBStreams, a ColdFusion module that implements Java Streams functionality for processing data in a functional programming style. It discusses key concepts like lazy evaluation, intermediate operations that transform streams, and terminal operations that produce final results. Examples are given for building streams from various data sources, applying filters, maps, reductions and more. Lambda expressions and closures play an important role in functional-style stream processing.
This document discusses parallelizing loops to compute pi through numerical integration. It begins by describing how pi can be computed by approximating the integral of 1/(1+x^2) from 0 to 1. It then shows C++ and Fortran code that performs this computation sequentially with increasing numbers of steps to improve the approximation of pi. The document discusses variable scopes in C++ and Fortran and how OpenMP parallelization works by dividing the loop workload across threads. It demonstrates parallelizing the loops in both languages using the reduction clause to correctly sum values across threads. Finally, it discusses considerations for parallelizing loops and challenges like data dependencies.
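The serial version of the loop the talk parallelizes can be sketched in Python (the talk's own code is C++ and Fortran; this mirrors its shape with a midpoint rule):

```python
# Approximate pi as the integral of 4/(1+x^2) over [0, 1],
# summing rectangle areas at the midpoint of each slice.
# In OpenMP terms, `total` is the variable the reduction clause
# would sum across threads.
def compute_pi(num_steps):
    step = 1.0 / num_steps
    total = 0.0
    for i in range(num_steps):
        x = (i + 0.5) * step          # midpoint of the i-th slice
        total += 4.0 / (1.0 + x * x)  # rectangle height
    return total * step               # sum of slice areas

print(compute_pi(1_000_000))
```

Each iteration is independent except for the accumulation into `total`, which is exactly the data dependency the reduction clause resolves; increasing `num_steps` improves the approximation.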
4Developers 2018: What You (Don't) Know About Structs in .NET (Łukasz Pyrzyk) - PROIDEA
When did you last create a new struct while writing a .NET application? Do you know what structs are for and how they can increase your program's performance? In this presentation I will show what characterizes structs and how much they differ from classes, and describe a few interesting experiments.
Bartosz Milewski, “Re-discovering Monads in C++” - Platonov Sergey
Once you know what a monad is, you start seeing them everywhere. The std::future library of C++11 was an example of an incomplete design, which stopped short of recognizing the monadic nature of futures. This is now being remedied in C++17, and there are new library additions, like std::expected and the range library, that are much more monad-conscious. I’ll explain what a monad is using copious C++ examples.
Here are the steps to plot the given functions using MATLAB:
1. Plot y = 0.4x + 1.8 for 0 ≤ x ≤ 35 and 0 ≤ y ≤ 3.5:
x = 0:35;
y = 0.4.*x + 1.8;
plot(x,y)
xlim([0 35])
ylim([0 3.5])
2. Plot the imaginary vs. real parts of 0.2 + 0.8i*n for 0 ≤ n ≤ 20:
n = 0:20;
z = 0.2 + 0.8i*n;
plot(real(z),imag(z))
xlabel('Real Part')
ylabel('Imaginary Part')
Adam Sitnik, "State of the .NET Performance" - Yulia Tsisyk
MSK DOT NET #5
2016-12-07
In this talk Adam will describe how the latest changes in .NET are affecting performance.
Adam wants to go through:
C# 7: ref locals and ref returns, ValueTuples.
.NET Core: Spans, Buffers, ValueTasks
And how all of these things help build zero-copy streams aka Channels/Pipelines which are going to be a game changer in the next year.
This document summarizes Adam Sitnik's presentation on .NET performance. It discusses new features in C# 7 like ValueTuple, ref returns and locals, and Span. It also covers .NET Core improvements such as ArrayPool and ValueTask that reduce allocations. The presentation shows through benchmarks how these features improve performance and reduce GC pressure, and provides examples and guidance on making the best use of new features like Span, pipelines, and unsafe code.
The document discusses Intel Threading Building Blocks (TBB), a C++ template library for parallel programming. TBB provides features like parallel_for to simplify parallelizing loops across CPU cores without managing threads directly. It uses generic programming principles and provides common parallel algorithms, concurrent data structures, and synchronization primitives to make parallel programming more accessible. TBB aims to improve both correctness through avoiding race conditions and performance through efficient hardware utilization.
4. Vector vs Thread parallelism
• Vector parallelism maps naturally to regular data parallelism
• Inner loops can (sometimes) be vectorized
• Ways to vectorize:
• Auto-vectorization
• Vector intrinsics
• _mm_add_ps(__m128 x, __m128 y)
• Compiler hints
• Cilk Plus array notation
• #pragma omp simd
Images courtesy Intel and Rebel Science News
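Two of the approaches above can be sketched side by side: a plain loop left to the auto-vectorizer, and the same addition written with explicit SSE intrinsics. This is a minimal sketch assuming an x86 target with SSE; the function names are illustrative, not from the slides.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// A loop like this is a prime candidate for auto-vectorization
// (compile with -O2/-O3 and inspect the vectorization report):
void add_auto(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// The same addition with an explicit SSE intrinsic,
// processing four floats per instruction:
void add_sse(const float* a, const float* b, float* c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);        // unaligned load of 4 floats
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                          // scalar remainder loop
        c[i] = a[i] + b[i];
}
```

The intrinsic version is what the compiler ideally generates from the first loop on its own; intrinsics are only worth the portability cost when the auto-vectorizer fails.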
5. Key features for Performance
• Data locality
• Chunks that fit in cache
• Reuse data locally
• Avoid cache conflicts
• Use few virtual pages
• Avoid false sharing
• Parallel slack
• Specify potential parallelism much
higher than the actual parallelism
• Load balance
• All threads have the same amount of
work to do
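The false-sharing bullet can be illustrated with a small sketch: per-thread counters padded so that each one occupies its own cache line. The 64-byte line size is an assumption (it varies by CPU), and the names are illustrative.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Per-thread counters packed into the same cache line cause false
// sharing: every increment invalidates the line in the other cores'
// caches. Padding each counter to a full cache line (64 bytes assumed
// here) keeps the updates independent.
struct alignas(64) PaddedCounter {
    long value = 0;
};

long count_parallel(int nthreads, long iters_per_thread) {
    std::vector<PaddedCounter> counters(nthreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; t++)
        threads.emplace_back([&counters, t, iters_per_thread] {
            for (long i = 0; i < iters_per_thread; i++)
                counters[t].value++;   // touches only its own cache line
        });
    for (auto& th : threads) th.join();
    long sum = 0;
    for (auto& c : counters) sum += c.value;
    return sum;
}
```

The result is the same with or without the padding; only the number of coherence misses, and hence the run time, differs.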
6. Example: SAXPY, scaling of a vector
• SAXPY computes y = a*x + y: the vector x is scaled by the factor a and added to the vector y
• y is used for both input and output
• Single-precision floating point (DAXPY: double precision)
• Low arithmetic intensity
• Little arithmetic work compared to the amount of data consumed and produced
• 2 FLOPs per element, against 8 bytes read and 4 bytes written
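The serial baseline that the later parallel versions build on can be sketched as follows (a minimal sketch; the function name is illustrative):

```cpp
#include <cassert>
#include <vector>

// Serial SAXPY: y = a*x + y. The vector y is both input and output.
// Per element: 2 FLOPs (one multiply, one add) against 12 bytes of
// memory traffic, hence the low arithmetic intensity.
void saxpy_serial(float a, const std::vector<float>& x,
                  std::vector<float>& y) {
    for (std::size_t i = 0; i < x.size(); i++)
        y[i] = a * x[i] + y[i];
}
```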
7. The Map Pattern
• Applies a function to every
element of a collection of data
items
• Elemental function
• No side effects
• Embarrassingly parallel
• Often combined with collective
patterns
8. Crash course C++11 thread programming
Compile: g++ -std=c++11 -O -pthread -o hello-threads hello-threads.cc
#include <thread> at top of file
std::thread t(foo, a1, a2); Declare a thread handle t and start a new thread running foo(a1, a2)
t.join(); Join with thread t: wait for the thread behind handle t to finish
std::mutex m; Declare a mutual exclusion lock
m.lock(); Enter the critical section protected by m
m.unlock(); Leave the critical section
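The crash-course calls above assembled into a runnable sketch of the hello-threads program (the structure is an assumption; a lock_guard is used instead of raw lock()/unlock() as it is the idiomatic RAII form):

```cpp
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::mutex io_mutex;   // protects std::cout and the counter below
int greetings = 0;

void hello(int id) {
    // lock_guard does m.lock() here and m.unlock() at scope exit
    std::lock_guard<std::mutex> lock(io_mutex);
    std::cout << "Hello from thread " << id << "\n";
    ++greetings;
}

int run_threads(int n) {
    std::vector<std::thread> threads;
    for (int i = 0; i < n; i++)
        threads.emplace_back(hello, i);  // start thread running hello(i)
    for (auto& t : threads)
        t.join();                        // wait for each thread to finish
    return greetings;
}
```

Without the mutex the output lines could interleave mid-line; the join loop is mandatory, since destroying a joinable std::thread terminates the program.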
9. Explicit threading on multicores (in C++11)
• Create one thread per core
• Divide the work manually
• Substantial amount of extra code over the serial version
• Inflexible scheduling of work
void saxpy_t(float a, const vector<float> &x,
             vector<float> &y, int nthreads,
             int thr_id) {
  int n = x.size();
  int start = thr_id*n/nthreads;
  int end = min((thr_id+1)*n/nthreads, n);
  for (int i = start; i < end; i++)
    y[i] = a*x[i] + y[i];
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < nthreads; i++){
  tarr.push_back(thread(saxpy_t, a, ref(x),
                        ref(y), nthreads, i));
}
// Wait for threads to finish
for (auto & t : tarr){
  t.join();
}
11. What happens when the iterations are different?
void map_serial(
    float a,                      // scale factor
    const std::vector<float> &x,  // input vec
    std::vector<float> &y)        // output and input vec
{
  int n = x.size();
  for (int i = 0; i < n; i++)
    y[i] = map(i);
}
12. Explicit threading with load imbalance
• Atomic update of index variable
• Fine granularity of load balancing
• Overheads when multiple threads want to update the index
• Note that declaring saxpy_index to be atomic guarantees no data races
• saxpy_index.fetch_add(1) atomically adds 1 to the variable and returns the old value
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, vector<float> &x,
             vector<float> &y) {
  int i = saxpy_index.fetch_add(1);
  while (i < x.size()) {
    y[i] = map(i, a*x[i] + y[i]);
    i = saxpy_index.fetch_add(1);
  }
}
int main(…)
…
vector<thread> tarr;
for (int i = 0; i < nthreads; i++){
  tarr.push_back(thread(saxpy_t, a,
                        ref(x), ref(y)));
}
// Wait for threads to finish
for (auto & t : tarr){
  t.join();
}
13. Load imbalance with chunks
• The index variable might be a bottleneck
• Use chunks so that each thread works on a range of indices
• Note that declaring the index to be atomic guarantees no data races
std::atomic<int> saxpy_index {0};
void saxpy_t(float a, const std::vector<float> &x,
             std::vector<float> &y) {
  int c = saxpy_index.fetch_add(CHUNK);
  int n = x.size();
  while (c < n) {
    for (int i = c; i < min(n, c+CHUNK); i++)
      y[i] = map(i, a*x[i] + y[i]);
    c = saxpy_index.fetch_add(CHUNK);
  }
}
14. Sequence of maps vs Map of sequences
• Also called: code fusion
• Do this whenever possible!
• Increases the arithmetic intensity
• Less data to load and store
• Explicit changes may be needed
• Make sure consecutive elemental functions are fused, so that intermediate results do not pass through memory, either through compiler optimizations or by design
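The fusion above can be sketched with two elemental functions, scale and offset, applied as two separate maps versus one fused map (the function names are illustrative):

```cpp
#include <cassert>
#include <vector>

// Sequence of maps: each loop streams the whole vector through
// memory, so the intermediate result of map 1 is written out
// and read back in by map 2.
void scale_then_offset_unfused(float a, float b, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); i++) y[i] = a * y[i];   // map 1
    for (std::size_t i = 0; i < y.size(); i++) y[i] = y[i] + b;   // map 2
}

// Map of the (fused) sequence: one pass over the data, twice the
// arithmetic per byte loaded -- higher arithmetic intensity.
void scale_then_offset_fused(float a, float b, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); i++)
        y[i] = a * y[i] + b;
}
```

Fusion is only legal here because the two elemental functions touch the same element independently of all others; with side effects or cross-element reads the transformation needs care.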
15. Cache fusion optimization
• Almost as important as code
fusion
• Break down maps to sequences
of smaller maps, executed by
each thread
• Keep aggregate data small
enough to fit in cache
16. Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
17. Higher abstraction models
• OpenMP
void saxpy_par_openmp(float a,
                      const vector<float> & x,
                      vector<float> & y) {
  int n = x.size();
  #pragma omp parallel for
  for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}
• TBB
auto n = x.size();
tbb::parallel_for(size_t(0), n,
  [&]( size_t i ) {
    y[i] = a * x[i] + y[i];
  });
• Parallel STL (C++17)
std::transform(std::execution::par,
               x.begin(), x.end(),
               y.begin(),
               y.begin(),
               [=](float x, float y){
                 return a*x + y;
               });
18. OpenMP support for map: for loops
• Threads are assigned independent sets of iterations
• Work-sharing construct
• Implicit barrier at the end of the work-sharing construct
(Diagram: the 16 iterations i=0…15 of the loop below are divided among the threads of the parallel region; all threads wait at the implicit barrier before continuing)
#pragma omp parallel
#pragma omp for
for (i=0; i < 16; i++)
  c[i] = b[i] + a[i];
19. Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
20. The Reduce Pattern
• Combines the elements of a collection of data into a single value
• A combiner function is used to combine elements pairwise
• The combiner function must be associative for parallelism
• 𝑎 ⊗ (𝑏 ⊗ 𝑐) = (𝑎 ⊗ 𝑏) ⊗ 𝑐
21. Serial reduction
Example: Dot-product
float sdot(const vector<float> &x, const vector<float> &y){
  float sum = 0.0;
  for (int i = 0; i < x.size(); i++)
    sum += x[i]*y[i];
  return sum;
}
Note that this is a fusion of a map (vector element product) and the reduce (sum).
22. Implementation of parallel reduction
• Simple approach:
• Let each thread perform the reduction on its part of the data, in parallel
• Let one (master) thread combine the per-thread results into a scalar value
(Diagram: threads 0–3 each reduce their quarter of the data; the master thread then reduces the four partial sums to a scalar)
23. Awkward way of returning results from a
thread: dot-product example
Plain C/C++:
void sprod(const vector<float> &a,
           const vector<float> &b,
           int start,
           int end,
           float &sum) {
  float lsum = 0.0;
  for (int i = start; i < end; i++)
    lsum += a[i] * b[i];
  sum = lsum;
}
#include <thread>
using namespace std;
…
vector<float> sum_array(nthr, 0.0);
vector<thread> t_arr;
for (i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i == nthr-1)
    end = size;
  t_arr.push_back(thread(
      sprod, ref(a), ref(b),
      start, end,
      ref(sum_array[i])));
}
for (i = 0; i < nthr; i++){
  t_arr[i].join();
  sum += sum_array[i];
}
24. The async function using futures
Plain C/C++:
float sprod(const vector<float> &a,
            const vector<float> &b,
            int start,
            int end) {
  float lsum = 0.0;
  for (int i = start; i < end; i++)
    lsum += a[i] * b[i];
  return lsum;
}
#include <thread>
#include <future>
using namespace std;
…
vector<future<float>> f_arr(nthr);
for (i = 0; i < nthr; i++) {
  int start = i*size/nthr;
  int end = (i+1)*size/nthr;
  if (i == nthr-1)
    end = size;
  f_arr[i] = async(launch::async,
                   sprod, cref(a), cref(b),
                   start, end);
}
for (i = 0; i < nthr; i++){
  sum += f_arr[i].get();
}
25. Definition of async
• The template function async runs the function f asynchronously
(potentially in a separate thread) and returns a std::future that will
eventually hold the result of that function call.
• The launch::async argument makes the function run on a separate
thread (which could be held in a thread pool, or created for this call)
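A minimal self-contained sketch of that definition: one half of a sum is computed asynchronously while the caller handles the other half, and get() blocks until the future's result is ready (the function names are illustrative):

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

int sum_range(const std::vector<int>& v, std::size_t lo, std::size_t hi) {
    return std::accumulate(v.begin() + lo, v.begin() + hi, 0);
}

int parallel_sum(const std::vector<int>& v) {
    std::size_t mid = v.size() / 2;
    // launch::async forces execution on a separate thread;
    // launch::deferred would instead run the call lazily inside get().
    // cref avoids copying the vector into the async task.
    auto f = std::async(std::launch::async,
                        sum_range, std::cref(v), 0, mid);
    int upper = sum_range(v, mid, v.size());  // other half on this thread
    return f.get() + upper;                   // get() waits for the result
}
```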
26. Example:
Numerical Integration
Mathematically, we know that:
∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles:
∑ᵢ₌₀ᴺ F(xᵢ)Δx
where each rectangle has width Δx and height F(xᵢ) at the middle of interval i.
(Figure: the curve F(x) = 4.0/(1+x²) over 0 ≤ x ≤ 1, approximated by rectangles)
27. Serial PI Program
(The loop body computing 4.0/(1.0+x*x) is the map; the accumulation into sum is the reduction)
static long num_steps = 100000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
28. Map reduce in OpenMP
static long num_steps = 100000;
double step;
int main ()
{
int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
#pragma omp parallel for reduction(+:sum) private(x)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
29. Outline
• Parallelism and Concurrency in C++
• Higher abstraction models:
• OpenMP
• TBB
• Parallel STL
• Parallel map-reduce with threads and OpenMP
• Task-parallel models
• Threads vs tasks
• OpenMP tasks
• TBB
• Performance implications
• Conclusions
30. Main challenges in writing parallel software
• Difficult to write composable parallel software
• The parallel models of different languages do not work well together
• Poor resource management
• Difficult to write portable parallel software
31. Make Tasks a First Class Citizen
• Separation of concerns
• Concentrate on exposing parallelism
• Not how it is mapped onto hardware
(Diagram: exposed tasks are handed to a run-time scheduler, which maps them onto the hardware)
32. • The (naïve) sequential Fibonacci calculation
int fib(int n){
if( n<2 ) return n;
else {
int a,b;
a = fib(n-1);
b = fib(n-2);
return b+a;
}
}
An example of task-parallelism
Parallelism in fib:
• The two calls to fib are independent and can
be computed in any order and in parallel
• It helps that fib is side-effect free but
disjoint side-effects are OK
The need for synchronization:
• The return statement must be executed
after both recursive calls have been
completed because of data-dependence on
a and b.
33. int fib(int n){
if( n<2 ) return n;
else {
int a,b;
#pragma omp task shared(a)
a = fib(n-1);
#pragma omp task shared(b)
b = fib(n-2);
#pragma omp taskwait
return b+a;
}
}
A task-parallel fib in OpenMP 3+
Starting code:
...
#pragma omp parallel
#pragma omp single
fib(n);
...
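The same task decomposition can be sketched in plain C++11 with std::async standing in for task spawning and get() playing the role of the taskwait (a sketch, not the deck's OpenMP or TBB code; the cutoff value is an illustrative assumption):

```cpp
#include <cassert>
#include <future>

long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Task-parallel Fibonacci: fib(n-1) runs as a separate task while
// this thread computes fib(n-2); get() synchronizes, like taskwait.
// Near the leaves, task-creation overhead dwarfs the work, so we
// cut off to the serial version for small n.
long fib_par(int n, int cutoff = 20) {
    if (n < cutoff) return fib_serial(n);
    auto a = std::async(std::launch::async, fib_par, n - 1, cutoff);
    long b = fib_par(n - 2, cutoff);
    return a.get() + b;
}
```

A work-stealing runtime (OpenMP tasks, TBB) does the same decomposition with far cheaper task creation than one thread per std::async call.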
34. Work-stealing schedulers
• Cores work on tasks in their
own queue
• Generated tasks are put in local
queue
• If empty, select a random
queue to steal work from
• Steal from:
• Continuation, or
• Generated tasks
(Diagram: four cores, each with its own task queue; an idle core steals from another core's queue)
36. Task model benefits
• Automatic load-balancing through work-stealing
• Serial semantics => debug in serial mode
• Composable parallelism
• Parallel libraries can be called from parallel code
• Can be mixed with data-parallelism
• SIMD/Vector instructions
• Data-parallel workers can also be tasks
• Adapts naturally to
• Different number of cores, even in run-time
• Different speeds of cores (e.g. ARM big.LITTLE)
37. Greatest thing since sliced bread?
• Overheads are still too big to ignore when creating tasks
• Tasks need to have high enough arithmetic intensity to amortize the cost of creation and scheduling
• Different models do not use the same run-time
• You can’t have a task in TBB calling a function that creates tasks written in OpenMP
• Still no accepted way to target different ISAs
• Research is ongoing
• The operating system does not know about tasks
• Current OSes only schedule threads
38. Heterogeneous processing
• With the same ISA, performance-heterogeneous processing is transparent
• Real heterogeneity is challenging
39. OpenMP 4.0: extending OpenMP with
dependence annotations
sort(A);
sort(B);
sort(C);
sort(D);
merge(A,B,E);
merge(C,D,F);
(Task graph: sort(A), sort(B), sort(C), sort(D) can run in parallel; merge(A,B,E) depends on A and B, merge(C,D,F) depends on C and D)
40. OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task
sort(A);
#pragma omp task
sort(B);
#pragma omp task
sort(C);
#pragma omp task
sort(D);
#pragma omp taskwait
#pragma omp task
merge(A,B,E);
#pragma omp task
merge(C,D,F);
41. OpenMP 4.0: extending OpenMP with
dependence annotations
#pragma omp task depend(inout:A)
sort(A);
#pragma omp task depend(inout:B)
sort(B);
#pragma omp task depend(inout:C)
sort(C);
#pragma omp task depend(inout:D)
sort(D);
// taskwait not needed
#pragma omp task depend(in:A,B, out:E)
merge(A,B,E);
#pragma omp task depend(in:C,D, out:F)
merge(C,D,F);
42. Benefits of tasks in OpenMP 4.0
• More parallelism can be exposed
• Complex synchronization patterns can be avoided/automated
• Knowing a task's memory usage/footprint, offloading to accelerators
can now be made almost transparent to the user
- In terms of memory handling and execution!
44. Conclusions
• Parallelism should be exploited at all levels
• Use higher abstraction models and compiler support
• Vector instructions
• STL parallel algorithms
• TBB / OpenMP
• Only then, use threads
• Measure performance bottlenecks with profilers
• Beware of
• Granularity
• False sharing