Multi-cores have become ubiquitous both in the general-purpose computing and the
embedded domain. The current technology trends show that the number of on-chip cores is
rapidly increasing, while their complexity is decreasing due to power and thermal constraints.
An increasing number of simple cores enables parallel applications to benefit from abundant thread-level parallelism (TLP), while sequential fragments suffer from poor exploitation of instruction-level parallelism (ILP). Recent research has proposed adaptive multi-core architectures that are
capable of coalescing simple physical cores to create more complex virtual cores so as to
accelerate sequential code. Such adaptive architectures can seamlessly exploit both ILP and TLP.
The goal of this paper is to quantitatively characterize the performance potential of adaptive
multi-core architectures. Previous research has primarily focused on sequential
workloads on adaptive multi-cores. We address a more realistic scenario where parallel and
sequential applications co-exist on an adaptive multi-core platform. Scheduling tasks on adaptive
architectures reveals challenging resource allocation problems for existing schedulers. We
construct offline and online schedulers that intelligently reconfigure and allocate the cores to the
applications so as to minimize the overall makespan under the constraints of a realistic adaptive
multi-core architecture. Experimental results reveal that adaptive multi-core architectures can
substantially decrease the makespan compared to both static symmetric and asymmetric multi-core architectures.
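To make the scheduling problem concrete, here is a minimal sketch (in Python, and not the paper's actual algorithm) of greedy makespan-oriented scheduling on an adaptive multi-core: jobs are placed in longest-first order, and each job picks the core-coalescing configuration that finishes it earliest. The sqrt(k) sequential speedup, the parallel efficiency factor, and the `max_fuse` limit are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch (not the paper's scheduler): greedy offline scheduling of
# sequential and parallel jobs on an adaptive multi-core, where up to
# `max_fuse` simple cores can be coalesced into one virtual core.
import math

def makespan(jobs, n_cores, max_fuse=4, efficiency=0.85):
    """jobs: list of (work, is_parallel). Returns the estimated makespan when
    each job greedily grabs the configuration that finishes it earliest."""
    free_at = [0.0] * n_cores        # time at which each base core is free
    for work, is_parallel in sorted(jobs, reverse=True):   # LPT order
        best = None
        limit = n_cores if is_parallel else max_fuse
        for k in range(1, limit + 1):
            cores = sorted(range(n_cores), key=lambda c: free_at[c])[:k]
            start = free_at[cores[-1]]      # must wait for all k cores
            # Assumed speedup models: sqrt(k) for a coalesced virtual core,
            # linear-with-efficiency for a parallel job on k cores.
            speed = efficiency * k if is_parallel else math.sqrt(k)
            finish = start + work / speed
            if best is None or finish < best[0]:
                best = (finish, cores)
        finish, cores = best
        for c in cores:
            free_at[c] = finish
    return max(free_at)

print(makespan([(100, False), (400, True), (60, False)], n_cores=8))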
Low Energy Task Scheduling based on Work Stealing (LEGATO project)
Abstract: Optimizing the energy efficiency of parallel execution on computing systems, from server farms to mobile devices to embedded systems, is increasingly a first-order concern. A common way to express a parallel application is as a directed acyclic graph (DAG) in which each node represents a task; the task scheduling problem on multiprocessor systems is to find the proper processors on which to execute these tasks. This matters especially for today's asymmetric multiprocessor systems, which feature different types of cores with different performance and power consumption, e.g. Arm big.LITTLE and Intel Lakefield. Naive task assignment that ignores core types and task features can result in inefficient resource utilization and detrimentally impact overall energy consumption. Dynamic task scheduling is a widely used strategy that requires no prior knowledge (e.g. architecture heterogeneity or task DAG structure) before execution, making its decisions at runtime instead. Among dynamic approaches, work stealing has proven effective, with better scalability in larger systems. DVFS is a common technique for improving energy efficiency; however, exploiting it incurs reconfiguration overhead ranging from tens of microseconds to one millisecond. With fine-grained tasks as small as milliseconds, as required to expose large parallelism, it is not realistic to apply DVFS at a per-task level. Moreover, the energy consumed during cores' under-utilized periods has been shown to be significant.
Based on these problem statements, we propose a low-energy task-scheduling work-stealing runtime based on XiTAO, in which the system environment configurations are either fixed or managed by the OS power governors or system administrators. The runtime contains a dynamic performance tracing module, an idleness tracing module, a power profiling module, and a task mapping algorithm. The dynamic performance model gives accurate predictions for future tasks given a set of resources; it is independent of platforms and frequencies, achieving scalability and portability. Power profiling helps the runtime understand CPU power consumption trends with respect to the number and type of cores and the frequencies used. Idleness tracing presents the real-time status of cores and contributes to saving energy during under-utilized periods. It also provides the real-time parallel slackness of active cores, which allows the task mapping algorithm to attribute the corresponding power consumption to each concurrently running task. The task mapping algorithm integrates the information from the above three modules and outputs the predicted best resource placements for ready tasks.
Poster presented by Jing Chen at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
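For readers unfamiliar with work stealing, the following is a minimal illustrative skeleton (not XiTAO's implementation, which adds the tracing, profiling, and mapping modules described above): each worker treats its own deque as a LIFO stack and steals from a random victim's FIFO end, the classic arrangement that preserves locality while balancing load.

```python
# Minimal work-stealing skeleton (illustrative only).
import random
import threading
from collections import deque

class WorkStealingPool:
    def __init__(self, n_workers):
        self.queues = [deque() for _ in range(n_workers)]
        self.locks = [threading.Lock() for _ in range(n_workers)]
        self.n = n_workers

    def push(self, worker_id, task):
        with self.locks[worker_id]:
            self.queues[worker_id].append(task)

    def next_task(self, worker_id):
        # Own tasks first, newest-first (LIFO) for cache locality.
        with self.locks[worker_id]:
            if self.queues[worker_id]:
                return self.queues[worker_id].pop()
        # Otherwise try stealing the oldest task (FIFO) from a random victim.
        victim = random.randrange(self.n)
        if victim != worker_id:
            with self.locks[victim]:
                if self.queues[victim]:
                    return self.queues[victim].popleft()
        return None
```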
Moldable pipelines for CNNs on heterogeneous edge devices (LEGATO project)
Abstract: Modern edge devices are equipped with more powerful computing resources than ever before, which opens up the opportunity to execute deep neural networks on the devices themselves instead of in cloud-based implementations. Existing DNN frameworks such as TensorFlow, Caffe, and Torch do not exploit the heterogeneous features of these devices beyond supporting GPUs. Modern edge devices contain different categories of heterogeneity, i.e. clusters of different types of cores fabricated on the same chip; an example of such a board is the Jetson TX2 from Nvidia. To support DNN applications on heterogeneous edge devices, we have developed a framework that generates an efficient and balanced parallel pipelined implementation of CNN inference from a simplified template language interface. We leverage the input and output information provided by the template language to generate a balanced pipeline. Since the cores are heterogeneous, we run a brief online training phase to find the best core distribution for a balanced, high-throughput pipeline.
Our experiments show that a pipeline mapping configuration obtained from online training yields an up to 22% faster pipeline compared to the baseline. We compare against a kernel-level parallel implementation of VGG-16, a widely used image classification CNN, on the Nvidia Jetson TX2 board.
Poster presented by Pirah Noor Soomro at the LEGaTO Final Event: 'Low-Energy Heterogeneous Computing Workshop'
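A hedged sketch of the "brief online training phase" idea: if `measure(stage, cluster)` stands in for a real timed run of one pipeline stage on one core cluster (both names are hypothetical, not the framework's API), a balanced mapping can be found by minimizing the pipeline's bottleneck stage. A real implementation would prune this exhaustive search.

```python
# Illustrative search for a balanced stage-to-cluster mapping: the pipeline's
# throughput is limited by its slowest stage, so minimize that bottleneck.
from itertools import product

def balance_pipeline(stages, clusters, measure):
    best_assign, best_bottleneck = None, float("inf")
    for assign in product(clusters, repeat=len(stages)):
        bottleneck = max(measure(s, c) for s, c in zip(stages, assign))
        if bottleneck < best_bottleneck:
            best_assign, best_bottleneck = assign, bottleneck
    return best_assign, best_bottleneck
```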
Talk @ APT Group, University of Manchester, 06 August 2014
Abstract:
Nowadays HPC systems, such as those in the Top500, are equipped with a range of different processors, from multi-core CPUs to GPUs. Programming them can be a tough job, especially if we want to squeeze every last FLOP of performance out of them.
As a PhD student, I am now doing a brief research visit in the APT group, working on topics related to the programmability and efficient use of GPUs and many-core coprocessors. In particular, I am implementing a large database operation using OpenCL on these state-of-the-art systems. In this talk I will summarize my work in Manchester and discuss future work on this topic.
Extracting a Rails Engine to a separate application (Jônatas Paganini)
As a Rails Application grows, there is a need to decouple heavy systems from the monolithic applications. Several teams in different companies are doing the same: extracting (micro) services from their monolithic applications to give the engineering teams more flexibility to speed up the workflow.
From the separation of the business logic to the server's setup, every change should respect the zero-downtime approach.
This talk shares the automated steps and exercises we created to have a smooth transition to the new system.
I'll share the context of the tool that automatically extracts an entire Rails engine from a project and moves it to a separate service.
We live in an era where the atomic building elements of silicon computers, e.g. transistors and wires, are no longer visible using traditional optical microscopes and their sizes are measured in just tens of angstroms. In addition, power dissipation per unit volume is bounded by the laws of physics, which, among other things, has resulted in stagnating processor clock frequencies. Adding more and more processor cores that perform simpler and simpler tasks, in an attempt to efficiently fill the available on-chip area, seems to be the current trend taken by the industry.
A customer case implemented for the Forsmark power group, a Swedish company that owns and operates the Forsmark nuclear power plant north of Stockholm.
This is a customer where safety always comes first, and with the use of FME they can facilitate access to data without compromising security. The presentation covers everything from manual paperwork to the handling of sophisticated models in BIM. We solved the assignment using open standards such as IFC and used FME to move, and assure the quality of, data from both flat DWG models and smart 3D models to a database. The end result is a customized system with FME Server running in the background.
See more presentations from the FME User Conference 2014 at: www.safe.com/fmeuc
Parallel Distributed Deep Learning on HPCC Systems (HPCC Systems)
As part of the 2018 HPCC Systems Community Day event:
The training process for modern deep neural networks requires big data and large amounts of computational power. Combining HPCC Systems and Google’s TensorFlow, Robert created a parallel stochastic gradient descent algorithm to provide a basis for future deep neural network research, thereby helping to enhance the distributed neural network training capabilities of HPCC Systems.
Robert Kennedy is a first-year Ph.D. student in computer science at Florida Atlantic University with research interests in deep learning and parallel and distributed computing. His current research is in improving distributed deep learning by implementing and optimizing distributed algorithms.
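As a rough illustration of the approach (not the HPCC/TensorFlow code itself), synchronous data-parallel SGD averages per-shard gradients before each shared weight update:

```python
# Illustrative synchronous data-parallel SGD step: each of the P workers
# computes a gradient on its own data shard; the gradients are averaged
# before the shared weights are updated.
import numpy as np

def parallel_sgd_step(w, shards, grad_fn, lr=0.01):
    """w: weight vector; shards: list of (X, y), one per worker;
    grad_fn(w, X, y): gradient of the loss on that shard."""
    grads = [grad_fn(w, X, y) for X, y in shards]  # run in parallel in practice
    return w - lr * np.mean(grads, axis=0)
```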
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016 (MLconf)
Applying Deep Learning at Facebook Scale: Facebook leverages deep learning for various applications, including event prediction, machine translation, natural language understanding, and computer vision, at a very large scale. More than a billion users log on to Facebook every day, generating thousands of posts per second and uploading more than a billion images and videos daily. This talk will explain how Facebook scaled deep learning inference for real-time applications with latency budgets in the milliseconds.
Image Classification Done Simply using Keras and TensorFlow (Rajiv Shah)
This presentation walks through the process of building an image classifier using Keras with a TensorFlow backend. It will give a basic understanding of image classification and show the techniques used in industry to build image classifiers. The presentation will start with building a simple convolutional network, augmenting the data, using a pretrained network, and finally using transfer learning by modifying the last few layers of a pretrained network. The classification will be based on the classic example of classifying cats and dogs. The code for the presentation can be found at https://github.com/rajshah4/image_keras, and the presentation will discuss how to extend the code to your own pictures to make a custom image classifier.
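A condensed transfer-learning sketch in the spirit of the presentation (the complete code lives in the linked repository): freeze a pretrained VGG16 base and train only a small new classification head for the two-class cats-vs-dogs problem.

```python
# Transfer learning with a frozen VGG16 base and a fresh classification head.
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(150, 150, 3))
base.trainable = False                      # keep pretrained features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # cat vs. dog
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # datasets not shown
```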
Object classification using CNN & VGG16 Model (Keras and TensorFlow) (Lalit Jain)
Using a CNN with Keras and TensorFlow, we have deployed a solution that can train on any image on the fly. The code uses the Google API to fetch new images, the VGG16 model for training, and is deployed using the Python Django framework.
FPGAs as Components in Heterogeneous HPC Systems (paraFPGA 2015 keynote) (Wim Vanderbauwhede)
Keynote I gave at the ParCo conference (http://www.parco2015.org) workshop paraFPGA in Edinburgh, Sept 2015, on the need to raise the abstraction level for programming of heterogeneous systems.
OpenPOWER Acceleration of HPCC Systems (HPCC Systems)
JT Kellington (IBM) and Allan Cantle (Nallatech) present at the 2015 HPCC Systems Engineering Summit Community Day on porting HPCC Systems to the POWER8-based ppc64el architecture.
Assisting User’s Transition to Titan’s Accelerated Architecture (inside-BigData.com)
Oak Ridge National Lab is home to Titan, the largest GPU-accelerated supercomputer in the world. This fact alone can be intimidating for users new to leadership computing facilities. Our facility has collected over four years of experience helping users port applications to Titan. This talk will explain common paths and tools for successfully porting applications, and expose common difficulties experienced by new users. Lastly, learn how our free and open training program can assist your organization in this transition.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. We are the pioneering supercomputing center in Spain. Our specialty is high-performance computing (HPC), and our mission is twofold: to offer supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology for transfer to society. We are a Severo Ochoa Center of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and we manage the Red Española de Supercomputación (RES). As a research center, we have more than 456 experts from 45 countries, organized in four major research areas: computer sciences, life sciences, earth sciences, and computational applications in science and engineering.
ParaForming - Patterns and Refactoring for Parallel Programming (khstandrews)
Despite Moore's "law", uniprocessor clock speeds have now stalled. Rather than single processors running at ever higher clock speeds, it is common to find dual-, quad- or even hexa-core processors, even in consumer laptops and desktops. Future hardware will not be slightly parallel, however, as in today's multicore systems, but will be massively parallel, with manycore and perhaps even megacore systems becoming mainstream.
This means that programmers need to start thinking parallel. To achieve this they must move away from traditional programming models where parallelism is a bolted-on afterthought. Rather, programmers must use languages where parallelism is deeply embedded into the programming model from the outset.
By providing a high-level model of computation, without explicit ordering of computations, declarative languages in general, and functional languages in particular, offer many advantages for parallel programming. One of the most fundamental advantages of the functional paradigm is purity. In a purely functional language, as exemplified by Haskell, there are simply no side effects: it is therefore impossible for parallel computations to conflict with each other in ways that are not well understood.
ParaForming aims to radically improve the process of parallelising purely functional programs through a comprehensive set of high-level parallel refactoring patterns for Parallel Haskell, supported by advanced refactoring tools. By matching parallel design patterns with appropriate algorithmic skeletons using advanced software refactoring techniques and novel cost information, we will bridge the gap between fully automatic and fully explicit approaches to parallelisation, helping programmers "think parallel" in a systematic, guided way. This talk introduces the ParaForming approach, gives some examples and shows how effective parallel programs can be developed using advanced refactoring technology.
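As a language-neutral illustration of the algorithmic-skeleton idea (ParaForming itself targets Parallel Haskell, where purity makes this safe by construction), a task-farm skeleton can be as small as a parallel map; in Python, the purity discipline falls on the programmer:

```python
# A task-farm skeleton as a parallel map (illustrative; not ParaForming code).
from multiprocessing import Pool

def par_map(f, xs, workers=4):
    # Farm skeleton: xs are independent tasks, f must be free of side effects.
    with Pool(workers) as pool:
        return pool.map(f, xs)

def heavy(x):              # must be a top-level, pure function for Pool
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    print(par_map(heavy, [10_000] * 8))
```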
40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility (inside-BigData.com)
In this deck from the Swiss HPC Conference, Mark Wilkinson presents: 40 Powers of 10 - Simulating the Universe with the DiRAC HPC Facility.
"DiRAC is the integrated supercomputing facility for theoretical modeling and HPC-based research in particle physics, and astrophysics, cosmology, and nuclear physics, all areas in which the UK is world-leading. DiRAC provides a variety of compute resources, matching machine architecture to the algorithm design and requirements of the research problems to be solved. As a single federated Facility, DiRAC allows more effective and efficient use of computing resources, supporting the delivery of the science programs across the STFC research communities. It provides a common training and consultation framework and, crucially, provides critical mass and a coordinating structure for both small- and large-scale cross-discipline science projects, the technical support needed to run and develop a distributed HPC service, and a pool of expertise to support knowledge transfer and industrial partnership projects. The on-going development and sharing of best-practice for the delivery of productive, national HPC services with DiRAC enables STFC researchers to produce world-leading science across the entire STFC science theory program."
Watch the video: https://wp.me/p3RLHQ-k94
Learn more: https://dirac.ac.uk/
and
http://hpcadvisorycouncil.com/events/2019/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Slides from Strata+Hadoop Singapore 2016 presenting how deep learning can be scaled both vertically and horizontally, and when to use CPUs versus GPUs.
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/auvizsystems/embedded-vision-training/videos/pages/may-2016-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Nagesh Gupta, Founder and CEO of Auviz Systems, presents the "Semantic Segmentation for Scene Understanding: Algorithms and Implementations" tutorial at the May 2016 Embedded Vision Summit.
Recent research in deep learning provides powerful tools that begin to address the daunting problem of automated scene understanding. Modifying deep learning methods, such as CNNs, to classify pixels in a scene with the help of the neighboring pixels has provided very good results in semantic segmentation. This technique provides a good starting point towards understanding a scene. A second challenge is how such algorithms can be deployed on embedded hardware at the performance required for real-world applications. A variety of approaches are being pursued for this, including GPUs, FPGAs, and dedicated hardware.
This talk provides insights into deep learning solutions for semantic segmentation, focusing on current state-of-the-art algorithms and implementation choices. Gupta discusses the effect of porting these algorithms to fixed-point representation and the pros and cons of implementing them on FPGAs.
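A toy example of the fixed-point porting question the talk raises (illustrative only; real FPGA flows choose per-layer formats): quantize float weights to a Q8.8 format and inspect the worst-case rounding error.

```python
# Quantize float weights to 16-bit fixed point (Q8.8) and back, then measure
# the rounding error introduced by the conversion.
import numpy as np

def to_fixed(x, frac_bits=8, total_bits=16):
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return np.clip(np.round(x * scale), lo, hi).astype(np.int32)

def to_float(q, frac_bits=8):
    return q.astype(np.float32) / (1 << frac_bits)

w = np.random.randn(1000).astype(np.float32)
err = np.abs(w - to_float(to_fixed(w))).max()   # worst-case rounding error
print(f"max quantization error: {err:.5f}")     # ~1/512 for Q8.8
```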
Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study us... (Xavier Llorà)
Data-intensive computing has positioned itself as a valuable programming paradigm to efficiently approach problems requiring processing very large volumes of data. This paper presents a pilot study about how to apply the data-intensive computing paradigm to evolutionary computation algorithms. Two representative cases (selectorecombinative genetic algorithms and estimation of distribution algorithms) are presented, analyzed, and discussed. This study shows that equivalent data-intensive computing evolutionary computation algorithms can be easily developed, providing robust and scalable algorithms for the multicore-computing era. Experimental results show how such algorithms scale with the number of available cores without further modification.
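A schematic of the paper's idea in plain Python (the actual study targets a real data-intensive/MapReduce stack): fitness evaluation is the embarrassingly parallel map step, and selection plays the role of the reduce step.

```python
# Map/reduce decomposition of one generation of a selectorecombinative GA.
import random

def map_phase(population, fitness):
    # Each (individual, fitness) pair is independent: trivially parallel.
    return [(ind, fitness(ind)) for ind in population]

def reduce_phase(scored, k=2):
    # Tournament selection over the scored population.
    selected = []
    for _ in range(len(scored)):
        contenders = random.sample(scored, k)
        selected.append(max(contenders, key=lambda p: p[1])[0])
    return selected

onemax = sum                                   # toy fitness: count of 1 bits
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(50)]
parents = reduce_phase(map_phase(pop, onemax))
```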
Application Profiling at the HPCAC High Performance Center (inside-BigData.com)
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"To achieve good scalability performance on the HPC scientific applications typically involves good understanding of the workload though performing profile analysis, and comparing behaviors of using different hardware which pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of the tuning to improve on the application performance from tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Hokkaido University Academic Cloud: Largest Academic Cloud System in Japan (Masaharu Munetomo)
The Hokkaido University Academic Cloud, which started services in 2011, is the largest academic cloud system in Japan. Its peak performance is 43 TFlops, and HPC/Hadoop cluster instances can be deployed automatically.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums, and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is part of our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many features provide convenience and capability while sacrificing security. This best practices guide outlines steps users can take to better protect personal devices and information.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATOs (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behavior in your code. We introduce DIAR, a technique designed to speed up fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
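A hedged sketch of the seed-trimming idea (DIAR's actual criterion for identifying uninteresting bytes is not reproduced here; `behavior` is a hypothetical stand-in for a coverage fingerprint): remove byte ranges whose deletion leaves observed behavior unchanged, so mutations stop landing on dead weight.

```python
# Greedy seed trimming: drop chunks that do not change observed behavior.
def trim_seed(seed: bytes, behavior, chunk=16):
    baseline = behavior(seed)
    i = 0
    while i < len(seed):
        candidate = seed[:i] + seed[i + chunk:]   # try removing one chunk
        if behavior(candidate) == baseline:
            seed = candidate                      # chunk was inert: drop it
        else:
            i += chunk                            # chunk matters: keep it
    return seed
```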
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor... (Neo4j)
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
1. Realizing Robust and Scalable Evolutionary Algorithms toward Exascale Era
Masaharu Munetomo
Information Initiative Center, Hokkaido University, Sapporo, JAPAN.
2. Toward the next generation of supercomputing
• Next year (2012): 10 PFlops systems will be installed
• "K" Computer (Fujitsu): 10 PFlops
• Sequoia (IBM): 20 PFlops
• Near future (~2018): Toward exascale (10^18 Flops)
• International Exa-Scale Software Project
3. Potential Architecture in Exa-Flops Era
• # of nodes: O(10^6)
• # of concurrency per node: O(10^3)
• Total concurrency: O(10^9)
• Memory BW: 400 GB/s
• Inter-node BW: 50 GB/s
(Jack Dongarra, presentation at SC10)
4. Uniform or Heterogeneous
• Uniform Architecture: using cores with the same architecture
• PROS: Natural approach, easy to program
• CONS: Needs more power compared with heterogeneous architecture
• Heterogeneous Architecture: CPU + Accelerator with different architecture
• PROS: Energy efficiency, concurrency
• CONS: Difficult to program and optimize compared with uniform one
5. Challenges toward Exa-Scale
• Improvements in interconnect bandwidth and latency are not keeping pace with the increase in the number of cores
• It is difficult to develop massively parallel libraries for many-core architectures
• It is also difficult to develop scalable algorithms for many-core
• In the field of high performance computing, efforts concentrate on parallel numerical libraries such as large-scale matrix calculations. Extreme-scale parallel optimization is not considered.
6. Scalability of Master-Worker Model
• When delay is proportional to # of processors
• When delay is proportional to log of # of processors (ideal situation)
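The equations these two bullets summarize did not survive the transcript; a plausible reconstruction of the standard master-worker analysis (a hedged sketch, not the slide's exact formulas) uses n tasks of evaluation cost T_e, a per-message cost T_c, and P workers:

```latex
% Linear delay: the master serves workers one by one.
% Logarithmic delay: tree-structured task distribution (the ideal case).
\[
T_{\mathrm{lin}}(P) = \frac{n\,T_e}{P} + P\,T_c,
\qquad
P^{*} = \sqrt{\frac{n\,T_e}{T_c}},
\qquad
T_{\mathrm{log}}(P) = \frac{n\,T_e}{P} + T_c \log_2 P .
\]
```

Minimizing T_lin(P) (set its derivative to zero) gives the finite optimum P*, which is consistent with the concrete P* figure quoted on the next slide; with logarithmic delay the communication term grows so slowly that useful scaling continues far beyond that point.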
7. Extremely Parallel Master-Slave Model
• P* should be around 100,000 ~ 1,000,000
• X should be more than 10^10 !!
• OK
• It is preferable if we can hide communication overheads.
8. Toward exa-scale massive parallelization
• Communication latency should be at most O(log n)
• Network hardware architecture: Hypercube, etc.
• Network software stack: Optimizing collective communications, etc.
• Algorithms that reduce communications
• Employing heterogeneous architecture: GPGPU, etc.
9. Massive parallelization of Evolutionary
computation
• Master-Worker Model: Same as general parallelized algorithms
• Island models: difficult to implement massive parallelism as it is
• Localize inter-node communications such as cellular GA
• Massive parallelization of advanced methods like EDAs
→ Dependency (communications) among variables should be considered,
ex) DBOA, pLINC, pD5, etc.
• Massive parallelization on cloud systems: MapReduce GA, ECGA, etc.
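A minimal cellular GA step for illustration (the `fitness`, `mutate`, and `crossover` callbacks are left to the caller): every individual interacts only with its four grid neighbours, so communication stays local, which is exactly the property that makes cGAs attractive on many-core hardware.

```python
# One generation of a cellular GA on a toroidal grid with local selection.
def cga_step(grid, fitness, mutate, crossover):
    n, m = len(grid), len(grid[0])
    new = [[None] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            nbrs = [grid[(i - 1) % n][j], grid[(i + 1) % n][j],
                    grid[i][(j - 1) % m], grid[i][(j + 1) % m]]
            mate = max(nbrs, key=fitness)                 # local selection
            child = mutate(crossover(grid[i][j], mate))
            new[i][j] = max(child, grid[i][j], key=fitness)  # elitist replace
    return new
```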
10. Research toward exa-scale ECs
• We should assume more than million-way parallelization
• Making communication overheads as low as possible
• Designing algorithms that need less communication
• Dependency on the target application problems
• When fitness evaluations are costly, simply parallelizing them is enough
• Parallelization of "intelligent" algorithms needs to be considered
• Analyzing interactions among genes to solve complex problems
11. Promising approach toward massive
parallelization
• Massively parallel cellular GAs with local communications on many-core
• Massive parallelization of linkage identification: pLINC, pLIEM, pD5
• Parallelization of EDAs: pBOA, gBOA (A GPGPU implementation)
• GPGPU implementations: gBOA, MAXSAT, MINLP
• MapReduce implementations: MapReduce GA, ECGA, linkage identification
12. A cellular GA implemented on CellBE [Asim 2009]
• A cellular GA (cGA) to solve the Capacitated Vehicle Routing Problem (CVRP) implemented on a many-core architecture, the Cell Broadband Engine (Sony, IBM, Toshiba).
• The cellular GA was employed due to the limited communication BW between SPE and PPE.
13. Parallelization of linkage identification techniques
• Linkage identification techniques are based on pair-wise perturbations to detect nonlinearity/non-monotonicity.
• They are relatively easy to parallelize by assigning the calculation of each pair to a processor
• pLINC, pLIEM, pD5 have been proposed.
• We have succeeded in solving a half-million-bit problem by employing pD5
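A sketch of the pairwise-perturbation test that LINC-style methods build on (simplified; pLINC/pD5 add statistical machinery and the parallel decomposition): genes i and j are flagged as linked when flipping both changes fitness non-additively, and since every (i, j) pair is independent, the l(l-1)/2 tests distribute trivially across processors.

```python
# Pairwise perturbation: detect nonlinearity between genes i and j.
def flip(s, i):
    return s[:i] + [1 - s[i]] + s[i + 1:]

def linked(f, s, i, j, eps=1e-9):
    df_i  = f(flip(s, i)) - f(s)
    df_j  = f(flip(s, j)) - f(s)
    df_ij = f(flip(flip(s, i), j)) - f(s)
    # Linked if the joint effect differs from the sum of individual effects.
    return abs(df_ij - (df_i + df_j)) > eps

# Each processor can test its own slice of the l*(l-1)/2 pairs.
```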
14. A half-million-bit optimization with pD5 [Tsuji 2004]
• A 500,000-bit problem (5-bit trap × 100,000)
• # of local optima = 2^100,000 − 1 ≒ 10^30,000
• HITACHI SR-11000 model K1 (5.4 TFlops)
• 32 days by 1 node (16 CPU)
• 1 day by whole system (40 nodes)
15. pBOA on GPGPU [Asim 2008]
• Implementation of parallel Bayesian Optimization
Algorithm on NVIDIA CUDA.
16. Massive Parallelization over the Cloud: MapReduce
• MapReduce is a simple but powerful approach to realizing massive parallelization over cloud computing systems
• Several papers have been published on MapReduce implementations of GA and ECGA.
• MapReduce ECGA: Calculation of the Marginal Product Model (MPM) and the generation of offspring can be parallelized via the "Map" process.
• Fault tolerance can be handled by MapReduce for massive parallelization
17. Concluding Remarks
• Designing robust and scalable algorithms is not easy, since analysis of linkage or interactions among genes is necessary, which usually requires at least O(l^2) computational overhead and communication among processors.
• In order to adapt to exa-scale massively parallel processing with more than O(10^6) cores, we need to reduce the communication overhead of each message exchange to less than O(log n). Otherwise, the parallel algorithm cannot scale well to such massively parallel architectures.
• We also need to adapt to modern architectures and systems such as heterogeneous architectures with GPGPUs and cloud computing environments.
Editor's Notes
I’m Masaharu Munetomo from the Information Initiative Center of Hokkaido University, one of the Japanese national supercomputing centers. This talk reviews the current status of massively parallel evolutionary algorithms toward the exascale era. Robustness means that an algorithm is applicable to a wide spectrum of applications by analyzing interactions among genes. Realizing robustness and scalability at the same time is generally considered difficult.