This document discusses GPU computing for machine learning. Machine learning algorithms are computationally expensive, and their requirements grow with the volume of data being processed. For inherently parallel problems such as those found in machine learning, GPUs provide significant performance gains over CPUs, and many machine learning algorithms have been implemented on GPUs with speedups of one to two orders of magnitude. However, most GPU implementations are closed-source; open-source implementations offer advantages such as reproducibility and fair comparison of algorithms.
1. deep learning
Algorithms and Applications
Bernardete Ribeiro, bribeiro@dei.uc.pt
University of Coimbra, Portugal
INIT/AERFAI Summer School on Machine Learning, Benicassim 22-26 June 2015
4. outline
∙ Motivation
∙ Graphics Processing Units (GPUs) Computing
∙ Machine Learning (ML) GPU algorithms
∙ Advantages of Open-Source in the ML field
∙ Open-source GPU ML library (GPUMLib)
∙ Overview of GPUMLib algorithms
∙ Conclusions
5. motivation
∙ The volume of data is increasing at an exponential rate
[Diagram: diverse data sources, including low-cost sensors, high-bandwidth networks, robotic systems, high-capacity storage devices, remote sensing, and commodity computing]
7. big data
∙ Nowadays, there are projects that can generate several petabytes of data per day [Hey et al., 2009]:
∙ Australian Square Kilometre Array of radio telescopes
∙ CERN's Large Hadron Collider
∙ Pan-STARRS array of celestial telescopes
10. data science
∙ Data is an asset, from which useful and valuable information can be extracted.
∙ Science is gradually moving toward being computational and data centric.
∙ To obtain information represents only a fraction of the time and effort needed to analyze it.
11. challenges
[Diagram: real data and artificial data from computer simulation models accumulate in persistent repositories; the challenge is to extract useful and relevant information from volumes of data that vastly exceed our capacity to analyze them]
12. potential solution
[Diagram: the same pipeline, with machine learning algorithms bridging the large volumes of accumulated data and the extraction of useful and relevant information]
16. computational resources
∙ Machine Learning (ML) algorithms are computationally expensive.
∙ Their computational requirements are usually proportional to the amount of data being processed.
∙ ML algorithms often demand prohibitive computational resources.
20. advanced computing
∙ Problems are becoming increasingly challenging and demanding (in some cases intractable by traditional CPU architectures).
∙ Toolkits supporting ML software development fail to meet the expectations in terms of computational performance.
∙ The scientific breakthroughs of the future will undoubtedly be powered by advanced computing capabilities that will allow researchers to manipulate and explore massive datasets [Hey et al., 2009].
∙ Pressure to shift development toward high-throughput parallel architectures (crucial for real-world applications).
22. graphical processing units (gpus)
∙ Highly parallel and programmable devices that can be used for general-purpose computing applications [Owens et al., 2008].
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
23. gpus strengths
∙ Provide remarkable performance gains (compared to CPUs).
∙ Relatively inexpensive (serve the large gaming industry).
∙ Availability.
∙ Scalability.
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
24. gpu vs cpu performance
Disparity between the GPU and CPU peak floating-point performance:
∙ GPU performance doubles every 12 months, while CPU performance doubles every 18 months [Zhongwen et al., 2005].
[Chart: peak floating-point performance (GFLOPS), 2001–2009, for AMD and NVIDIA GPUs versus Intel CPUs (dual-core and quad-core).]
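A back-of-the-envelope consequence of these growth rates (an illustrative calculation, not a figure from the talk): if GPU performance doubles every 12 months while CPU performance doubles every 18 months, then after t years the ratio of peak performances grows as 2^t / 2^(2t/3) = 2^(t/3). In other words, under these assumptions the GPU-to-CPU performance gap itself doubles roughly every three years.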
25. nvidia gpu architecture
[Diagram: an array of streaming multiprocessors, each with SIMT control and its own shared memory, fed by a thread scheduler and a host interface; a memory interface connects the chip to off-chip DRAM.]
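This organization maps directly onto the programming model. Below is a minimal, hypothetical CUDA sketch for illustration (not taken from GPUMLib): each thread block is scheduled onto one streaming multiprocessor and can stage data in that multiprocessor's fast on-chip shared memory; here a block reduces its slice of a vector to a single partial sum.

#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK 256

// Each thread block runs on one streaming multiprocessor (SM); its
// threads stage data in the SM's on-chip shared memory and reduce
// it to a single partial sum.
__global__ void blockSum(const float *x, float *partial, int n) {
    __shared__ float buf[BLOCK];                     // fast on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1 << 20, blocks = (n + BLOCK - 1) / BLOCK;
    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));        // memory visible to CPU and GPU
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    blockSum<<<blocks, BLOCK>>>(x, partial, n);
    cudaDeviceSynchronize();                         // wait for the GPU
    float sum = 0.0f;
    for (int b = 0; b < blocks; ++b) sum += partial[b];
    printf("sum = %.0f (expected %d)\n", sum, n);
    cudaFree(x); cudaFree(partial);
    return 0;
}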
29. speedups
∙ GPUs are responsible for dramatic speedups in a wide range of areas for many problems.
∙ It is not uncommon to obtain speedups of one or two orders of magnitude:
∙ Tasks that would take years on the CPU can now be completed in days.
∙ Weeks of processing can be transformed into hours [Lopes and Ribeiro, 2009].
∙ Computations that would otherwise take hours can now be completed in a few seconds.
34. machine learning tools
∙ Caffe: Framework for convolutional neural network algorithms
∙ cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks
∙ Theano: Python library to define, optimize, and evaluate mathematical expressions
∙ Torch7: Scientific computing framework for machine learning algorithms
∙ cuBLAS: GPU-accelerated version of the complete standard BLAS library
∙ MATLAB: Easy-to-use HPC language integrating computation, visualization, and programming
∙ GPUMLib: GPU Machine Learning Library
35. companies using gpus for machine learning
http://www.nvidia.com/object/machine-learning.html
38. ml algorithms in gpu platform
∙ Large computational requirements.
∙ Algorithms should present a high degree of parallelism.
∙ Favor data throughput at the expense of the latency of individual operations.
39. gpu ml implementations
[Timeline figure, 2004–2012, split into closed-source and open-source implementations:]
∙ Multilayer Perceptrons, forward phase (Oh and Jung)
∙ Self-Organizing Maps (Campbell et al.; Luo et al.)
∙ Genetic Algorithms (Wong et al.; Yu et al.)
∙ Back-Propagation, two layers (Steinkrau et al.)
∙ Convolutional Neural Networks (Chellapilla et al.)
∙ Spiking Neural Networks (Bernhard and Keriven)
∙ Belief Propagation (Brunton et al.; Yang et al.)
∙ Fuzzy ART neural networks (Martínez-Zarzuela et al.)
∙ K-Means Clustering (Shalom et al.)
∙ Recurrent networks (Trebatický and Pospíchal)
∙ Decision Trees and Forests (Sharp)
∙ Neural-network-based text detection (Jang et al.)
∙ Linear Radial Basis Functions (Brandstetter and Artusi)
∙ Deep Belief Networks and Sparse Coding (Raina et al.)
∙ Back-Propagation, three layers (Guzhva et al.)
∙ Support Vector Machines (Catanzaro et al.)
∙ Genetic Algorithms (Langdon and Banzhaf)
∙ K-Nearest Neighbor (Garcia et al.)
∙ Spiking Neural Networks (Nageswaran et al.)
∙ Multiple Back-Propagation and Back-Propagation (Lopes and Ribeiro)
∙ Non-negative Matrix Factorization (Lopes and Ribeiro)
41. gpu implementations
∙ The number of GPU implementations of ML algorithms has increased substantially over the last few years.
∙ However, most of the implementations are not openly shared.
48. open source advantages
∙ Better reproducibility of experimental results;
∙ Fair comparison of algorithms;
∙ Quicker detection of errors;
∙ Quicker adoption of algorithms;
∙ Innovative applications and easier combination of advances;
∙ Faster adoption of ML methods in other disciplines and in industry;
∙ Cooperation among researchers [Sonnenburg et al., 2007].
55. cuda
∙ Represented a major step toward the simplification of the GPU programming model:
∙ Support for accessible programming interfaces and industry-standard languages, such as C and C++.
∙ Released by NVIDIA at the end of 2006; since then, numerous GPU implementations, spanning a wide range of applications, have been developed using this technology.
∙ While there are alternative options, such as OpenCL, Microsoft DirectCompute, and AMD Stream, so far CUDA is the only technology that has achieved wide adoption and usage [Stamatopoulos et al., 2012].
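As a concrete taste of this programming model, below is a minimal, generic CUDA example (an illustrative sketch; not code from this talk or from GPUMLib). A kernel is an almost ordinary C/C++ function marked __global__; it is executed by thousands of lightweight threads, each computing its own element index from built-in block and thread coordinates.

#include <cstdio>
#include <cuda_runtime.h>

// CUDA kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));  // memory visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }
    vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // 4 blocks of 256 threads
    cudaDeviceSynchronize();                       // wait for the kernel to finish
    printf("c[10] = %.0f (expected 30)\n", c[10]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Compiled with nvcc, the same C++ source targets both host and device, which is precisely the accessibility CUDA brought to GPU programming.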
62. neural selective input model (nsim)
[Diagram, physical model: a selective input neuron gates each input x_j^p through a multiplier with a selector r_j^p, producing the effective input x̃_j^p = r_j^p · x_j^p; the gated inputs x_1^p, x_2^p, x_3^p, ... feed a network with weights w_jk and bias θ_k, yielding outputs y_1^p and y_2^p.]
Conceptual models:
∙ Model 1, when x_3^p is missing: r_3^p = 0 (the third input is effectively disconnected)
∙ Model 2, when the value of x_3^p is known: r_3^p = 1 (the input passes through unchanged)
63. resource allocating network with long term memory
[Diagram: a network with an input layer (x1 to x4), a hidden layer, and an output layer (z1, z2), coupled to a long-term memory that stores generated memory items and retrieves them during learning (Generate & Store / Retrieve & Learn).]
80. non-negative matrix factorization (nmf)
V ≈ WH, where:
∙ V (D × N) holds the N samples, each with D original features
∙ W (D × r) holds the r basis vectors
∙ H (r × N) holds the N samples re-expressed with r new features
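For reference, factorizations of this form are commonly computed with the multiplicative update rules of Lee and Seung; the slides do not spell out which variant GPUMLib uses here, so take these as an illustrative formulation:

H ← H ⊙ (WᵀV) ⊘ (WᵀWH)
W ← W ⊙ (VHᵀ) ⊘ (WHHᵀ)

where ⊙ and ⊘ denote element-wise multiplication and division. Both updates keep W and H non-negative and do not increase the reconstruction error ‖V − WH‖²_F, so the algorithm simply iterates them many times (10,000 iterations in the experiment that follows). Each iteration is dominated by dense matrix products, which is exactly the workload GPUs excel at.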
81. yale and orl image datasets
∙ Yale
∙ Vtrain is composed of 4096 rows (64 × 64 pixels) and 150 columns (face images)
∙ Vtest is composed of 4096 rows and 15 columns
∙ AT&T (ORL)
∙ Vtrain is composed of 10304 rows (112 × 92 pixels) and 360 columns (face images)
∙ Vtest is composed of 10304 rows and 40 columns
82. time to perform 10,000 nmf iterations on the yale database
[Log-scale plot of time (from 10 s up to 3 h 46 m 40 s) against r = 20 to 120, for Vtrain and Vtest on the CPU and on the GPU. GPU speedups grow with r: from 55.6× up to 251.7× for Vtrain and from 6.6× up to 74.1× for Vtest.]
100. conclusions
∙ Parallel implementations of ML algorithms are crucial for the development of real-world ML applications.
∙ The GPU is particularly well positioned to fulfil this need, given its availability, high performance, and relatively low cost.
∙ Experimental results with GPUMLib algorithms show the potential and usefulness of this library.
∙ Problems involving larger datasets benefit the most from this architecture.
∙ To promote cooperation among researchers and benefit the field, open-source GPU ML algorithms are fundamental.
101. references
Hey, T., Tansley, S., and Tolle, K., editors (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.
Lopes, N. and Ribeiro, B. (2009). Fast pattern classification of ventricular arrhythmias using graphics processing units. In Proceedings of the 14th Iberoamerican Conference on Pattern Recognition (CIARP 2009), LNCS 5856, pages 603–610. Springer.
Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., and Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5):879–899.
Sonnenburg, S., Braun, M. L., Ong, C. S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K.-R., Pereira, F., Rasmussen, C. E., Rätsch, G., Schölkopf, B., Smola, A., Vincent, P., Weston, J., and Williamson, R. C. (2007). The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443–2466.
Stamatopoulos, C., Chuang, T. Y., Fraser, C. S., and Lu, Y. Y. (2012). Fully automated image orientation in the absence of targets. In International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (XXII ISPRS Congress), volume XXXIX-B5, pages 303–308.
Zhongwen, L., Hongzhi, L., Zhengping, Y., and Xincai, W. (2005). Self-organizing maps computing on graphic process unit. In Proceedings of the 13th European Symposium on Artificial Neural Networks, pages 557–562.