The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
1. POWER10 Innovations for HPC
Presented at the OpenPOWER Workshop at CINECA
Wednesday 06/30/2021
José Moreira
IBM Research
2. My most direct collaborators, who made POWER10 a reality …
• Bill Starke (chip architect)
• Brian Thompto (core architect)
• Dung Q. Nguyen
• Ramon Bertran
• Hans Jacobson
• Richard J. Eickemeyer
• Rahul M. Rao
• Michael Goulet
• Marcy Byers
• Christopher J. Gonzalez
• Karthik Swaminathan
• Nagu R. Dhanwada
• Silvia M. Müller
• Andreas Wagner
• Satish Kumar Sadasivam
• Robert K. Montoye
• Christian G. Zoellin
• Michael S. Floyd
• Jeffrey Stuecheli
• Nandhini Chandramoorthy
• John-David Wellman
• Alper Buyuktosunoglu
• Matthias Pflanz
• Balaram Sinharoy
• Pradip Bose
• Kit Barton
• Steven Battle
• Peter Bergner
• Puneeth Bhat
• Pedro Caldeira
• David Edelsohn
• Gordon Fossum
• Brad Frey
• Nemanja Ivanovic
• Chip Kerchner
• Vincent Lim
• Shakti Kapoor
• Tulio Machado Filho
• Brett Olsson
• Baptiste Saleil
• Bill Schmidt
• Rajalakshmi Srinivasaraghavan
• Andreas Wagner
• Nelson Wu
• Serif Yesil
...
Most of what I will talk about today is not really my work; I will cover what I specifically worked on towards the end of the talk.
3. POWER10 chip: key features
▪ Enterprise focused design:
▪ 602 mm2 7nm 18 layer metal stack
▪ SCM: 15 SMT-8 cores / 30 SMT-4 cores → 120 HW threads
▪ DCM: 30 SMT-8 cores / 60 SMT-4 cores → 240 HW Threads
▪ Robust data plane: Bandwidth, Scale, Composability
▪ Open Memory Interface up to 1 TB/s
▪ PowerAXON Interface up to 1 TB/s
▪ Up to 16 SCM sockets "glueless" SMP
▪ Integrated AI acceleration
▪ Security co-optimized across the stack
▪ Re-architected for efficiency: up to 3x relative to POWER9 (DCM)
[Die photo (courtesy of Samsung Foundry): 16 SMT8 cores, each with 2 MB L2 and a local 8 MB L3 region; two 64 MB L3 hemispheres; two PCIe Gen 5 signaling (x16) blocks; four PowerAXON x32+4 interfaces at the corners; two memory signaling edges (8x8 OMI); SMP / memory / accelerator / cluster / PCI interconnect regions]
4. POWER10 interconnect architecture
[Diagram: SCM OpenCAPI attach. A POWER10 chip in an SCM with main-tier DRAM attached over OMI memory (1 Terabyte/sec), PowerAXON links (1 Terabyte/sec) to an ASIC or FPGA hosting an accelerated app, and PCIe Gen 5 I/O]
(PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings)
5. System Composability: PowerAXON & Open Memory Interfaces
▪ Built on best-of-breed low-power, low-latency, high-bandwidth signaling technology
▪ Multi-protocol “Swiss-army-knife” flexible / modular interfaces
▪ PowerAXON corner: 4x8 @ 32 GT/s, up to 1 Terabyte/sec
▪ OMI edge: 8x8 @ 32 GT/s, up to 1 Terabyte/sec; 6x bandwidth/mm² compared to DDR4 signaling
6. System Enterprise Scale and Bandwidth: SMP & Main Memory
▪ Main-tier DRAM attached over OMI memory (up to 1 Terabyte/sec); PowerAXON SMP links (up to 1 Terabyte/sec)
▪ Built on best-of-breed low-power, low-latency, high-bandwidth signaling technology; multi-protocol “Swiss-army-knife” flexible / modular interfaces
▪ PowerAXON: build up to 16-SCM-socket, robustly scalable, high-bisection-bandwidth “glueless” SMP
▪ Initial offering: up to 4 TB/socket OMI DRAM memory, 410 GB/s peak bandwidth (Microchip DDR4 buffer), < 10 ns latency adder
▪ DIMM-swap upgradeable: DDR5 OMI DRAM memory with higher bandwidth and higher capacity
(PowerAXON and OMI Memory configurations show processor capability only, and do not imply system product offerings)
7. DRAM DIMM comparison
[Figure: JEDEC DDR DIMM, IBM Centaur DIMM, and OMI DDIMM shown at approximate scale]
• Technology agnostic
• Low cost
• Ultra-scale system density
• Enterprise reliability
• Low-latency
• High bandwidth
8. Socket composability: SCM & DCM
Single-chip module focus:
- 602mm2 7nm (18B devices)
- Core/thread strength
- Up to 15 SMT8 Cores (4+ GHz)
- Capacity & bandwidth / compute
- Memory: x128 @ 32 GT/s
- SMP/Cluster/Accel: x128 @ 32 GT/s
- I/O: x32 PCIe G5
- System scale (broad range)
- 1 to 16 sockets
Dual-chip module focus:
- 1204mm2 7nm (36B devices)
- Throughput / socket
- Up to 30 SMT8 Cores (3.5+ GHz)
- Compute & I/O density
- Memory: x128 @ 32 GT/s
- SMP/Cluster/Accel: x192 @ 32 GT/s
- I/O: x64 PCIe G5
- 1 to 4 sockets
[Figure: up to 16 SCM sockets in a horizontal and vertical full-connect SMP topology; up to 4 DCM sockets]
(Multi-socket configurations show processor capability only, and do not imply system product offerings)
9. Memory Inception: Distributed memory disaggregation and sharing
[Figure: workloads A, B, and C placed across a pod of directly connected systems]
Use case: share load/store memory amongst directly connected neighbors within a pod
Unlike other schemes, memory can be used:
- As low-latency local memory
- As NUMA-latency remote memory
Example: pod = 8 systems, each with 8 TB
- Workload A requirement: 4 TB low latency
- Workload B requirement: 24 TB relaxed latency
- Workload C requirement: 8 TB low latency plus 16 TB relaxed latency
All requirements are met by the configuration shown
The POWER10 2-Petabyte memory size enables much larger configurations
(Memory cluster configurations show processor capability only, and do not imply system product offerings)
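A quick arithmetic check of this example (all numbers from the slide): the pod provides 8 × 8 TB = 64 TB of total memory, while the three workloads together require 4 + 24 + 8 + 16 = 52 TB, so the placement shown fits with headroom; the 2-Petabyte addressing capability is what lets much larger pods be composed in the same way.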
10. POWER10 socket performance gains
▪ General workload speedup up to 3x
▪ SIMD/AI accelerated speedup up to 10x+
(Performance assessments based upon pre-silicon engineering analysis of POWER10 dual-socket server offering vs POWER9 dual-socket server offering)
[Chart 1: speedup over the POWER9 baseline (scale 1-4x) for POWER10 Integer, Enterprise, Floating Point, and Memory Streaming workloads, with DDR4 OMI memory and DDR5 OMI memory]
[Chart 2: speedup over the POWER9 baseline (scale 5-20x) for POWER10 Linpack and ResNet-50 inference in FP32, BFloat16, and INT8]
11. POWER10 core: key enhancements over POWER9
▪ Efficiency: 2.6x performance/watt
▪ 30% avg core perf improvement
▪ 50% avg core power reduction
▪ AI infusion with in-core capability
▪ Matrix accelerator (MMA)
▪ 2x general SIMD
▪ 2x Load/Store bandwidth
▪ 4x L2 Cache
[Diagram of the SMT8 core highlighting: 4x L2 cache, MMA, 2x general SIMD, 2x load / 2x store, 4x MMU, new ISA, prefix, fusion, enhanced control & branch]
12. POWER10 core microarchitecture
[Core block diagram (SMT4 core shown): fetch with branch predictors; 48k 6-way L1 instruction cache (EA tagged); 128-entry instruction buffer; 8-instruction decode/fuse into 8 iops with predecode + fusion/prefix; 512-entry instruction table; four 128-bit execution slices plus a 2x512b MMA accelerator; 32k 8-way L1 data cache (EA tagged); 2 load EA and 2 store EA generation; 2x32B load and 32B store (+gathered) paths; 128-entry (SMT) / 64-entry (ST) load queue; 80-entry (SMT) / 40-entry (ST) store queue; 12-entry load miss queue; 64-entry ERAT; 4k-entry TLB; 16 prefetch streams and 48-entry L3 prefetch; 1 MB L2 cache (hashed index) with a 64B dedicated interface]
▪ SMT8 and SMT4 (shown) cores
▪ 8-way superscalar
▪ SMT8-core has 2x resources of SMT4-core
▪ Double SIMD + AI/ML acceleration
▪ 2x SIMD, 4x MMA, 4x AES/SHA
▪ Larger working-sets
▪ 1.5x L1 I-cache, 4x L2, 4x TLB
▪ Deeper instruction windows/queues
▪ 512 instructions in flight
▪ Data bandwidth & latency (cycles)
▪ L2 13.5 (minus 2), L3 27.5 (minus 8)
▪ L1-D cache 4 +0 for Store forward (minus 2)
▪ 2x bandwidth
▪ Improved branch prediction
▪ Target registers with GPR in main regfile
▪ New predictors: target and direction, 2x BHT
▪ Instruction fusion
▪ Fixed, SIMD, other: merge and back-to-back
▪ Load, store: consecutive storage
(Diagram legend, capacity vs. POWER9: blocks marked >= 2x, Improved, or = 4x)
13. Energy efficient computing in the MMA
• BLAS-2 rank-1 update (GER): A ← xyᵀ + A
• One-dimensional input vectors from main register file
• Two-dimensional accumulator resides local to unit
• Reduced-precision data types perform multiple updates: A ← Σᵢ xᵢyᵢᵀ + A
MMA rank-k update can be exploited by a variety of computation kernels:
• BLAS-3 and BLAS-2
• Sparse matrix operations (SpMM, Cholesky factorization)
• Direct convolution
• Discrete Fourier Transform
• Stencil computation
(ISCA 2021, Industry Track)
[Figure: a 4×4 array of multiply-add units updates accumulator elements a00 through a33 in place, fed by operands x0-x3 and y0-y3 fetched from the vector-scalar register file]
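To make the rank-k update concrete, here is a minimal reference sketch in plain C (an illustration of the math only, not IBM's code and not the actual MMA instruction sequence): a 4×4 accumulator is updated by a sequence of rank-1 outer products, which is the basic operation the MMA GER instructions perform, one rank-1 step per 32-bit instruction and several steps folded together for the reduced-precision forms.

#include <stdio.h>
#define N 4
/* One rank-1 update: acc <- acc + x * y^T */
static void rank1_update(float acc[N][N], const float x[N], const float y[N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            acc[i][j] += x[i] * y[j];
}
int main(void) {
    enum { K = 8 };                        /* number of rank-1 updates to accumulate */
    float X[N][K], Y[N][K], C[N][N] = {{0.0f}};
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++) {
            X[i][k] = (float)(i + k);      /* illustrative operand values */
            Y[i][k] = (float)(i - k);
        }
    for (int k = 0; k < K; k++) {          /* C = X * Y^T as a sum of outer products */
        float xcol[N], ycol[N];
        for (int i = 0; i < N; i++) { xcol[i] = X[i][k]; ycol[i] = Y[i][k]; }
        rank1_update(C, xcol, ycol);
    }
    printf("C[0][0] = %g\n", C[0][0]);
    return 0;
}

With reduced-precision inputs (16-, 8-, or 4-bit), a single MMA instruction performs a rank-2, rank-4, or rank-8 update in one step, as the following slides describe.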
14. New matrix math instruction set architecture
• Accumulator-to-register mapping: ACC[0] ↔ VSR[0]…VSR[3], ACC[1] ↔ VSR[4]…VSR[7], ACC[2] ↔ VSR[8]…VSR[11], ACC[3] ↔ VSR[12]…VSR[15], ACC[4] ↔ VSR[16]…VSR[19], ACC[5] ↔ VSR[20]…VSR[23], ACC[6] ↔ VSR[24]…VSR[27], ACC[7] ↔ VSR[28]…VSR[31]
• Accumulators (ACC) are 4 × 4 arrays of 32-bit elements
(see below for 64-bit extension)
A = | a00 a01 a02 a03 |
    | a10 a11 a12 a13 |
    | a20 a21 a22 a23 |
    | a30 a31 a32 a33 |
• Vector-scalar registers (VSR) are 128-bit wide and can hold
• 4 single-precision (32-bit) floating-point values
• 8 half-precision (16-bit) floating-point values
• 16 signed/unsigned (8-bit) integer values
• …
• Instructions take 1 accumulator (input/output) and 2
vector-scalar registers (input)
operation A, X, Y
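As a hedged illustration of this accumulator/VSR model (a sketch, not IBM's code; it assumes GCC's PowerPC MMA built-ins, available when compiling with -mcpu=power10 or -mmma), the fragment below drives one 32-bit floating-point outer-product instruction from C: the accumulator is a __vector_quad, and the two inputs are 128-bit VSR operands each holding four fp32 values.

#include <altivec.h>
#include <string.h>

typedef __vector unsigned char vec_t;         /* opaque 128-bit VSR operand type */

/* out = x * y^T for 4-element fp32 vectors, computed in one MMA accumulator */
void f32_outer_product(const float x[4], const float y[4], float out[4][4]) {
    __vector_quad acc;                        /* one accumulator = 4 VSRs (4x4 fp32) */
    vec_t vx, vy, rows[4];
    memcpy(&vx, x, sizeof vx);                /* pack 4 fp32 values per VSR operand */
    memcpy(&vy, y, sizeof vy);
    __builtin_mma_xvf32ger(&acc, vx, vy);     /* acc = x * y^T (one rank-1 update) */
    /* subsequent steps would use __builtin_mma_xvf32gerpp(&acc, vx, vy) to accumulate */
    __builtin_mma_disassemble_acc(rows, &acc);/* move the accumulator back to VSRs */
    memcpy(out, rows, sizeof rows);           /* store the 4x4 fp32 result (row order per the built-in's convention) */
}

One instruction, one accumulator (input/output), two VSR inputs, exactly as listed above; a compiler new enough to know POWER10 (for example, gcc -O2 -mcpu=power10) is assumed.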
15. Rank-𝑘 update with 32-bit and 16-bit elements
• 32-bit: Each VSR holds 1 column (row) of 4×32-bit elements
• 16-bit: Each VSR holds 2 columns (rows) of 4×16-bit elements
[Figure: with 32-bit elements, each instruction performs one rank-1 update (a 4×1 column times a 1×4 row accumulated into the 4×4 array); with 16-bit elements, each instruction performs a rank-2 update (4×2 times 2×4); a sequence of such updates accumulates the result]
16. Rank-𝑘 update with 8-bit and 4-bit elements
• 8-bit: Each VSR holds 4 columns (rows) of 4×8-bit elements
• 4-bit: Each VSR holds 8 columns (rows) of 4×4-bit elements
[Figure: with 8-bit elements, each instruction performs a rank-4 update (4×4 times 4×4); with 4-bit elements, each instruction performs a rank-8 update (4×8 times 8×4)]
17. Outer-product (xv<type>ger<rank-𝑘>) instructions
• Integer data: A ← XYᵀ + A
• For 16-bit data, X and Y are 4 × 2 arrays of elements
• For 8-bit data, X and Y are 4 × 4 arrays of elements
• For 4-bit data, X and Y are 4 × 8 arrays of elements
• Floating-point data: A ← [−]XYᵀ [± A]
• For 32-bit data, X and Y are 4-element vectors
• For 16-bit data, X and Y are 4 × 2 arrays of elements
• For 64-bit data, X is a 4-element vector and Y is a 2-element vector
Instruction    | A             | X             | Yᵀ            | madds/instruction
xvf32ger       | 4 × 4 (fp32)  | 4 × 1 (fp32)  | 1 × 4 (fp32)  | 16
xv[b]f16ger2   | 4 × 4 (fp32)  | 4 × 2 (fp16)  | 2 × 4 (fp16)  | 32
xvi16ger[s]2   | 4 × 4 (int32) | 4 × 2 (int16) | 2 × 4 (int16) | 32
xvi8ger[s]4    | 4 × 4 (int32) | 4 × 4 (int8)  | 4 × 4 (uint8) | 64
xvi4ger8       | 4 × 4 (int32) | 4 × 8 (int4)  | 8 × 4 (int4)  | 128
xvf64ger       | 4 × 2 (fp64)  | 4 × 1 (fp64)  | 1 × 2 (fp64)  | 8
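One note on reading the floating-point form A ← [−]XYᵀ [± A] above: in Power ISA 3.1 the optional negation and accumulation are selected by instruction variants rather than by operands. For 32-bit data, for example, xvf32ger computes A ← XYᵀ, xvf32gerpp computes A ← XYᵀ + A, and xvf32gernp computes A ← −XYᵀ + A; the same suffix scheme applies to the 16-bit and 64-bit floating-point forms.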
25. Conclusions
• POWER10 is the latest generation of Power Systems processors for enterprise computing
• Big improvements in compute, memory and interconnect deliver 3-times more socket performance at
2.6-times the power efficiency
• More memory bandwidth and more vector processing are key for high-performance computing
• The MMA facility defines a new set of matrix math instructions, directly implementing rank-𝑘 updates
• With the MMA, POWER10 has been able to achieve new levels of performance in a broadening space of applications, starting with dense linear algebra, moving to sparse, with others on the horizon
• IBM Advance Toolchain (native and cross compiler):
git clone https://github.com/advancetoolchain/advance-toolchain.git
cd advance-toolchain
make AT_CONFIGSET=next BUILD_ARCH=ppc64le
• IBM POWER10 Functional Simulator (ISA-level simulation)
https://www14.software.ibm.com/webapp/set2/sas/f/pwrfs/pwr10/home.html
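As a usage note (the file name and install prefix below are illustrative assumptions; the Advance Toolchain installs under a version-specific /opt/atX.Y directory), MMA code such as the sketches above can be compiled with the toolchain's POWER10-enabled GCC:

/opt/at15.0/bin/gcc -O2 -mcpu=power10 -o mma_sketch mma_sketch.c

The resulting binary can then be run natively on POWER10 hardware or, at the ISA level, under the POWER10 Functional Simulator linked above.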