The document describes the TSUBAME2 supercomputer system at the Tokyo Institute of Technology. It has the following key aspects:
- It has over 17 PFlops of computing power across 1408 thin nodes, 24 medium nodes, and 10 fat nodes, using Intel Xeon CPUs and NVIDIA GPUs.
- It has a total storage capacity of 11PB including 7PB of HDD storage, 4PB of tape storage, and 200TB of SSD storage.
- It utilizes a high-performance Infiniband QDR network with 12 core switches and over 180 edge switches for fast interconnectivity between nodes and storage.
Convergence of High-Performance Computing and AI/Big Data Processing on the AI Bridging Cloud Infrastructure (ABCI) - Hitoshi Sato
AI Bridging Cloud Infrastructure (ABCI) is a large-scale open AI infrastructure in Japan operated by the National Institute of Advanced Industrial Science and Technology (AIST). It provides:
1) Over 0.55 exaflops of computing power with 1088 nodes equipped with 4352 NVIDIA GPUs and 43520 CPU cores for AI and data science research.
2) Dense rack design optimized for thermal management with ambient warm water cooling to achieve high density computing.
3) Hierarchical storage: 1.6PB of local NVMe SSDs serving as a burst buffer, 22PB of parallel file storage, and object storage for campaign storage.
4) Open access platform to accelerate joint academic-industry R&D for AI through distributed
ABCI: AI Bridging Cloud Infrastructure for Scalable AI/Big Data - Hitoshi Sato
The National Institute of Advanced Industrial Science and Technology (AIST) in Japan focuses on bridging innovative technological seeds to commercialization. Japan currently lacks cutting-edge computing infrastructure dedicated to AI and big data that is openly available. The proposed AI Bridging Cloud Infrastructure (ABCI) project aims to provide a large-scale open AI infrastructure to accelerate joint academic-industry R&D for AI in Japan. ABCI will feature 1088 compute nodes with 4352 NVIDIA Tesla V100 GPUs providing 0.550 exaflops of AI performance, connected by an InfiniBand network and utilizing liquid cooling technologies. It will provide an open platform for AI research, applications, services, and infrastructure design through industry and academic collaboration.
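The 0.55 exaflops figure can be sanity-checked from the GPU count alone, assuming the commonly quoted figure of roughly 125 TFLOPS of mixed-precision Tensor Core throughput per V100 (an assumption, not a number from this summary):

```python
# Rough check of ABCI's quoted AI performance.
gpus = 4352
tflops_per_gpu = 125  # assumed peak Tensor Core TFLOPS per NVIDIA Tesla V100
total_pflops = gpus * tflops_per_gpu / 1000
print(f"{total_pflops:.0f} PFLOPS, i.e. {total_pflops / 1000:.3f} exaflops")  # ~544 PFLOPS ~ 0.54 EF
```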
Opportunities of ML-based data analytics in ABCI - Ryousei Takano
This document discusses opportunities for using machine learning-based data analytics on the ABCI supercomputer system. It summarizes:
1) An introduction to the ABCI system and how it is being used for AI research.
2) How sensor data from the ABCI system and job logs could be analyzed using machine learning to optimize data center operation and improve resource utilization and scheduling.
3) Two potential use cases - using workload prediction to enable more efficient cooling system control, and applying machine learning to better predict job execution times to improve scheduling.
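A minimal sketch of the second use case, predicting job execution times from historical job logs, is shown below. The file and column names (user, queue, requested nodes, requested walltime, actual runtime) are hypothetical and only illustrate the pattern; this is not the analysis actually used on ABCI.

```python
# Sketch: predict job runtime from scheduler accounting logs, for better scheduling.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

logs = pd.read_csv("job_logs.csv")  # assumed export of the scheduler's accounting log
X = pd.get_dummies(logs[["user", "queue", "req_nodes", "req_walltime_s"]],
                   columns=["user", "queue"])
y = logs["actual_runtime_s"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingRegressor().fit(X_train, y_train)
print("R^2 on held-out jobs:", model.score(X_test, y_test))
```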
ABCI: An Open Innovation Platform for Advancing AI Research and Deployment - Ryousei Takano
AI Infrastructure for Everyone (Democratization of AI) aims to build an AI infrastructure platform that is accessible to everyone, from beginners to experts. The platform provides computing resources of up to 512 nodes, ready-to-use software, datasets, and pre-trained models. It also offers services such as an easy-to-use web-based IDE for beginners and an AI cloud with on-demand, reserved, and batch processing options. The goal is to accelerate AI research and promote the social implementation of AI technologies.
The document describes the TSUBAME2 supercomputing system overview. It has a total storage capacity of 11PB distributed across HDD, tape and SSD storage. It utilizes Intel Xeon and Nvidia GPU processors with Infiniband networking. The system employs a hierarchical parallel file system and provides various access protocols like NFS, CIFS and iSCSI for home directories and applications.
PG-Strom is an extension of PostgreSQL that utilizes GPUs and NVMe SSDs to enable terabyte-scale data processing and in-database analytics. It features SSD-to-GPU Direct SQL, which loads data directly from NVMe SSDs to GPUs using RDMA, bypassing CPU and RAM. This improves query performance by reducing I/O traffic over the PCIe bus. PG-Strom also uses Apache Arrow columnar storage format to further boost performance by transferring only referenced columns and enabling vector processing on GPUs. Benchmark results show PG-Strom can process over a billion rows per second on a simple 1U server configuration with an NVIDIA GPU and multiple NVMe SSDs.
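The benefit of the Arrow columnar layout, reading only the referenced columns instead of whole rows, can be illustrated outside PostgreSQL with pyarrow. This is only an illustration of column pruning in an Arrow (Feather V2) file, not PG-Strom's own foreign data wrapper, and the file and column names are made up:

```python
# Columnar files let a reader pull only the referenced columns, cutting I/O traffic.
import pyarrow.feather as feather  # Feather V2 is the Arrow IPC file format

table = feather.read_table("lineorder.arrow", columns=["lo_quantity", "lo_extendedprice"])
# Only the two referenced columns cross the I/O path, not the full row-oriented tuples.
print(table.num_rows, table.schema)
```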
PL/CUDA allows writing user-defined functions in CUDA C that can run on a GPU. This provides benefits for analytics workloads that can utilize thousands of GPU cores and wide memory bandwidth. A sample logistic regression implementation in PL/CUDA showed a 350x speedup compared to a CPU-based implementation in MADLib. Logistic regression performs binary classification by estimating weights for explanatory variables and intercept through iterative updates. This is well-suited to parallelization on a GPU.
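The iterative-update structure that makes logistic regression GPU-friendly is easy to see in a dense NumPy sketch: every step is matrix arithmetic over all rows at once. This mirrors the algorithm only, not the PL/CUDA code from the talk.

```python
# Batch-gradient logistic regression: each update is dense linear algebra over all
# rows, which is why it maps well onto thousands of GPU cores.
import numpy as np

def fit_logreg(X, y, lr=0.1, iters=200):
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend intercept column
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))            # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)            # gradient step on the weights
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) + 0.2 > 0).astype(float)
print(fit_logreg(X, y))  # intercept and three weights
```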
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
The document describes a technology called PG-Strom that uses GPU acceleration to optimize I/O performance for PostgreSQL. PG-Strom allows data to be transferred directly from NVMe SSDs to the GPU over the PCIe bus, bypassing the CPU and RAM. This reduces data movement and allows PostgreSQL queries to be partially executed directly on the GPU. Benchmark results show the approach can achieve throughput close to the theoretical hardware limits for a single server configuration processing large datasets.
1) The PG-Strom project aims to accelerate PostgreSQL queries using GPUs. It generates CUDA code from SQL queries and runs them on Nvidia GPUs for parallel processing.
2) Initial results show PG-Strom can be up to 10 times faster than PostgreSQL for queries involving large table joins and aggregations.
3) Future work includes better supporting columnar formats and integrating with PostgreSQL's native column storage to improve performance further.
Report on GPGPU at FCA (Lyon, France, 11-15 October, 2010) - PhtRaveller
This report was presented at the Frontiers in Computational Astrophysics Conference (Lyon, France, 11-15 October 2010). It gives a brief, light introduction to the CUDA architecture and its benefits for scientific HPC, along with a short description of the KIPT in-house package for N-body simulations. This talk, with minor differences, was also presented at
seminars at the Institute for Single Crystals (Kharkov) and the Kharkov Institute of Physics and Technology.
The document provides an overview of big data analysis and parallel programming tools for R. It discusses what constitutes big data, popular big data applications, and relevant hardware and software. It then covers parallel programming challenges and approaches in R, including using multicore processors with the multicore package, SMP and cluster programming with foreach and doMC/doSNOW, NoSQL databases like Redis with doRedis, and job scheduling. The goal is to help users effectively analyze big data in R by leveraging parallelism.
Presentation by Adrián Macía, head of Scientific Computing at CSUC, given at the "3a Jornada de formació sobre l'ús del servei de càlcul" (3rd training day on the use of the computing service), held on 29 October 2020 in virtual format.
This document discusses using GPUs and SSDs to accelerate PostgreSQL queries. It introduces PG-Strom, a project that generates CUDA code from SQL to execute queries massively in parallel on GPUs. The document proposes enhancing PG-Strom to directly transfer data from SSDs to GPUs without going through CPU/RAM, in order to filter and join tuples during loading for further acceleration. Challenges include improving the NVIDIA driver for NVMe devices and tracking shared buffer usage to avoid unnecessary transfers. The goal is to maximize query performance by leveraging the high bandwidth and parallelism of GPUs and SSDs.
GPGPU Accelerates PostgreSQL ~Unlock the power of multi-thousand cores~ - Kohei KaiGai
GPU processing provides significant performance gains for PostgreSQL according to benchmarks. PG-Strom is an open source project that allows PostgreSQL to leverage GPUs for processing queries. It generates CUDA code from SQL queries to accelerate operations like scans, joins, and aggregations by massive parallel processing on GPU cores. Performance tests show orders of magnitude faster response times for queries involving multiple joins and aggregations when using PG-Strom compared to the regular PostgreSQL query executor. Further development aims to support more data types and functions for GPU processing.
The document discusses PG-Strom, an open source project that uses GPU acceleration for PostgreSQL. PG-Strom allows for automatic generation of GPU code from SQL queries, enabling transparent acceleration of operations like WHERE clauses, JOINs, and GROUP BY through thousands of GPU cores. It introduces PL/CUDA, which allows users to write custom CUDA kernels and integrate them with PostgreSQL for manual optimization of complex algorithms. A case study on k-nearest neighbor similarity search for drug discovery is presented to demonstrate PG-Strom's ability to accelerate computational workloads through GPU processing.
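The core of a k-nearest-neighbour similarity search is one large matrix product followed by a top-k selection, which is exactly the kind of work a GPU parallelizes well. The sketch below uses cosine similarity over dense vectors; the actual case study uses chemical fingerprints and custom CUDA kernels, which this NumPy version only approximates.

```python
# Brute-force k-NN similarity search over a vector database.
import numpy as np

def knn_cosine(database, query, k=5):
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = db @ q                                 # cosine similarity against every row
    top = np.argpartition(-sims, k)[:k]           # unordered top-k candidates
    return top[np.argsort(-sims[top])]            # indices of the k most similar items

rng = np.random.default_rng(1)
database = rng.random((100_000, 128))             # e.g. fingerprint-like feature vectors
print(knn_cosine(database, rng.random(128)))
```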
The column-oriented data structure of PG-Strom stores data in separate column storage (CS) tables based on the column type, with indexes to enable efficient lookups. This reduces data transfer compared to row-oriented storage and improves GPU parallelism by processing columns together.
The document describes benchmark results achieved by using NVMe SSDs and GPU acceleration to improve the performance of PostgreSQL beyond typical limitations. A benchmark was run using 13 queries on a 1055GB dataset with PostgreSQL v11beta3 + PG-Strom v2.1. This achieved a maximum query execution throughput of 13.5GB/s. PG-Strom is an extension module that uses thousands of GPU cores and wide-band memory to accelerate SQL workloads. It generates GPU code from SQL and executes queries directly on the GPU, bypassing data transfers between CPU and GPU to improve performance.
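For scale, a quick worked number: at the reported throughput, a query that really has to touch every block of the 1055GB dataset completes its scan in a little over a minute.

```python
# Back-of-the-envelope scan time at the reported query execution throughput.
dataset_gb = 1055
throughput_gbs = 13.5
print(f"full-scan time approx {dataset_gb / throughput_gbs:.0f} s")  # approx 78 s
```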
This document describes using in-place computing on PostgreSQL to perform statistical analysis directly on data stored in a PostgreSQL database. Key points include:
- An F-test is used to compare the variances of accelerometer data from different phone models (Nexus 4 and S3 Mini) and activities (walking and biking).
- Performing the F-test directly in PostgreSQL via SQL queries is faster than exporting the data to an R script, as it avoids the overhead of data transfer.
- PG-Strom, an extension for PostgreSQL, is used to generate CUDA code on-the-fly to parallelize the variance calculations on a GPU, further speeding up the F-test.
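The F-test itself is just a ratio of sample variances compared against an F distribution. A plain SciPy sketch of the statistic that the in-database version computes (the accelerometer data here is synthetic, only to make the snippet runnable):

```python
# Two-sample F-test for equality of variances.
import numpy as np
from scipy import stats

def f_test(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    f = a.var(ddof=1) / b.var(ddof=1)                    # ratio of sample variances
    p = 2 * min(stats.f.cdf(f, len(a) - 1, len(b) - 1),
                stats.f.sf(f, len(a) - 1, len(b) - 1))   # two-sided p-value
    return f, p

rng = np.random.default_rng(0)
walking, biking = rng.normal(0, 1.0, 500), rng.normal(0, 1.5, 500)
print(f_test(walking, biking))
```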
The document discusses graphics processing units (GPUs) and general-purpose GPU (GPGPU) computing. It explains that GPUs were originally designed for computer graphics but can now be used for general computations through GPGPU. The document outlines CUDA and MPI frameworks for programming GPGPU applications and discusses how GPGPU provides highly parallel processing that is much faster than traditional CPUs. Example applications mentioned include molecular dynamics, bioinformatics, and high performance computing.
Is it possible to build the "Airbus of Supercomputing" in Europe? - AMETIC
Presentation by Mateo Valero, Director of the Barcelona Supercomputing Center, at the 30th edition of the Encuentros de Telecomunicaciones y Economía Digital.
The document discusses the limits of information and communication technologies (ICT) such as computing power, data storage, and network bandwidth. It proposes that future networks will need to scale in both size and functionality through approaches like federation of multiple networks. Cloud computing is presented as a potential approach to tackle these limits by providing on-demand access to shared computing resources over a network in a scalable and elastic manner. However, cloud computing is still surrounded by considerable marketing hype, and open questions remain regarding its impact and how it can integrate with existing technologies.
[RakutenTechConf2013] [A-3] TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data - Rakuten Group, Inc.
Rakuten Technology Conference 2013
"TSUBAME2.5 to 3.0 and Convergence with Extreme Big Data"
Satoshi Matsuoka
Professor
Global Scientific Information and Computing (GSIC) Center
Tokyo Institute of Technology
Fellow, Association for Computing Machinery (ACM)
Abstract: Iterative stencils represent the core computational kernel of many applications belonging to different domains, from scientific computing to finance. Given the complex dependencies and the low computation-to-memory-access ratio, these kernels represent a challenging acceleration target on every architecture. This is especially true for FPGAs, whose direct hardware execution offers the possibility of high performance and power efficiency, but where the non-fixed architecture can lead to very large solution spaces to be explored.
In this work, we build upon a previously presented FPGA-based acceleration methodology for iterative stencil algorithms, in which we provide a dataflow architectural template that implements optimal on-chip buffering and is able to scale almost linearly in performance using a technique denoted as iteration queuing. In particular, we propose a set of design improvements and develop an accurate analytical performance model that can be used to support exploration of the design space. Experimental results obtained by implementing a set of benchmarks from different application domains on a Xilinx VC707 board show an average performance and power-efficiency increase over the previous work of around 22x and 8x respectively, and a prediction error that is on average less than 1%.
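For readers unfamiliar with this kernel class: an iterative stencil repeatedly updates each grid point from a fixed neighbourhood. A minimal 2D Jacobi-style example in Python (not one of the benchmarks used in the paper) shows the low arithmetic intensity and the tight dependence between successive iterations:

```python
# A 5-point Jacobi stencil iterated over a 2D grid.
import numpy as np

def jacobi(grid, iters):
    for _ in range(iters):
        new = grid.copy()
        new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])  # average of 4 neighbours
        grid = new
    return grid

g = np.zeros((64, 64)); g[0, :] = 100.0   # hot top boundary
print(jacobi(g, 200)[32, 32])
```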
EODATASERVICE.ORG - Digital Earth Platform to enable Multi-disciplinary Geospatial applications - EUDAT
This document discusses the EarthServer-2 project, which aims to create a digital Earth platform for multi-disciplinary geospatial applications. The platform includes several data services that provide access to Earth observation data through standardized interfaces like OGC services. It allows users to access, visualize, subset, combine and process vast amounts of geospatial data simultaneously from multiple data sources. The platform is demonstrated through examples of using vegetation, precipitation and soil moisture data to study drought in Eastern Africa. It also discusses lessons learned from implementing the cube technology and providing access to data through various user interfaces tailored for different user groups.
The Barcelona Supercomputing Center (BSC) was established in 2005 and hosts MareNostrum, one of the most powerful supercomputers in Spain. It is the pioneering supercomputing centre in Spain. Its specialty is high-performance computing (HPC) and its mission is twofold: to offer supercomputing infrastructure and services to Spanish and European scientists, and to generate knowledge and technology for transfer to society. BSC is a Severo Ochoa Centre of Excellence, a first-level member of the European research infrastructure PRACE (Partnership for Advanced Computing in Europe), and manages the Spanish Supercomputing Network (RES). As a research centre, it counts more than 456 experts from 45 countries, organised into four major research areas: Computer Sciences, Life Sciences, Earth Sciences, and computational applications in science and engineering.
Architecture Aware Algorithms and Software for Peta and Exascale - inside-BigData.com
Jack Dongarra from the University of Tennessee presented these slides at Ken Kennedy Institute of Information Technology on Feb 13, 2014.
Listen to the podcast review of this talk: http://insidehpc.com/2014/02/13/week-hpc-jack-dongarra-talks-algorithms-exascale/
Reproducible Computational Pipelines with Docker and Nextflow - inside-BigData.com
This document summarizes a presentation about using Docker and Nextflow to create reproducible computational pipelines. It discusses two major challenges in computational biology being reproducibility and complexity. Containers like Docker help address these challenges by creating portable and standardized environments. Nextflow is introduced as a workflow framework that allows pipelines to run across platforms and isolates dependencies using containers, enabling fast prototyping. Examples are given of using Nextflow with Docker to run pipelines on different systems like HPC clusters in a scalable and reproducible way.
Graph500 and Green Graph500 benchmarks on SGI UV2000 @ SGI UG SC14 - Yuichiro Yasui
The document discusses Graph500 and Green Graph500 benchmarks for evaluating graph processing performance on the SGI UV2000 system. It provides an overview of the benchmarks and describes testing various graph workloads, including social networks and road networks, on different hardware from smartphones to supercomputers. The authors aim to optimize breadth-first search (BFS) graph algorithms on the NUMA-based SGI UV2000 without using MPI through NUMA-aware techniques.
Optimization of Continuous Queries in Federated Database and Stream Processing Systems - Zbigniew Jerzak
The constantly increasing number of connected devices and sensors results in increasing volume and velocity of sensor-based streaming data. Traditional approaches for processing high velocity sensor data rely on stream processing engines. However, the increasing complexity of continuous queries executed on top of high velocity data has resulted in growing demand for federated systems composed of data stream processing engines and database engines. One of major challenges for such systems is to devise the optimal query execution plan to maximize the throughput of continuous queries.
In this paper we present a general framework for federated database and stream processing systems, and introduce the design and implementation of a cost-based optimizer for optimizing relational continuous queries in such systems. Our optimizer uses characteristics of continuous queries and source data streams to devise an optimal placement for each operator of a continuous query. This fine level of optimization, combined with the estimation of the feasibility of query plans, allows our optimizer to devise query plans which result in 8 times higher throughput as compared to the baseline approach which uses only stream processing engines. Moreover, our experimental results showed that even for simple queries, a hybrid execution plan can result in 4 times and 1.6 times higher throughput than a pure stream processing engine plan and a pure database engine plan, respectively.
The document provides specifications for the N3K-C3232C 32 x 100G, 1RU switch from Cisco. It offers key details on the switch's physical dimensions and components, performance capabilities including 6.4 Tbps switching capacity and 3.3 bpps forwarding rate, and management/programmability features of the Cisco NX-OS operating system. The switch supports both forward and reverse airflow, has 32 QSFP28 ports that can each support 100G or 4 x 25G Ethernet, and weighs 22.2 lbs.
The document discusses the evolution of computer architectures from early technological achievements like the transistor and integrated circuit. It describes increasing transistor densities following Moore's Law. Future technologies will focus on increasing core counts while decreasing cycle times and voltages. Performance will come from parallelism rather than clock speed increases due to heat limitations. The document outlines challenges in scaling to exascale systems by 2018.
This document discusses DevOps tools and practices on Kubernetes and OpenShift container platforms. It covers topics like:
1. Using Jenkins as a service on OpenShift for continuous integration and delivery.
2. Deploying web applications and microservices on Kubernetes, including technologies like circuit breakers.
3. Architectures for distributed and microservices systems, including service meshes.
4. DevOps tools available on OpenShift like Istio for traffic management between microservices.
Introduction to the Oakforest-PACS Supercomputer in Japan - inside-BigData.com
In this deck from the DDN User Group at SC16, Prof. Taisuke Boku from the University of Tsukuba & JCAHPC presents: Oakforest-PACS: Overview of the New JCAHPC Computing Facility.
The University of Tokyo, the University of Tsukuba, and Fujitsu Limited recently announced that the Oakforest-PACS massively parallel cluster-type supercomputer, built by Fujitsu and operated by the Joint Center for Advanced High Performance Computing (JCAHPC), has achieved a LINPACK performance result of 13.55 petaflops, as ranked in the November Top500 list for supercomputer performance. Given this, Oakforest-PACS has surpassed the K computer to officially become the highest performance supercomputer in Japan. The system's peak performance is 25 petaflops, which is about 2.2 times that of the K computer.
"Thanks to DDN’s IME Burst Buffer, researchers using Oakforest-PACS at the Joint Center for Advanced High Performance Computing (JCAHPC) are able to improve modeling of fundamental physical systems and advance understanding of requirements for Exascale-level systems architectures. With DDN’s advanced technology, JCAHPC has achieved effective I/O performance exceeding 1TB/s in writing tens of thousands of processes to the same file."
Watch the video presentation: http://wp.me/p3RLHQ-g3D
Learn more: http://ddn.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dr. Kashif Rasul from Zalando Research in Berlin gave this presentation on "Multi-GPU for Deep Learning" at the Computer Science, Machine Learning & Statistics Meetup in the Zalando adtech lab office in Hamburg on 6 September 2017.
The document is a tutorial on dimensioning IP backbone networks. It discusses routing in IP networks, the need for traffic engineering and network planning to optimize resource usage and costs. It covers basics of network dimensioning such as transmission channels, optimization algorithms, integer linear programming formulations. Examples are provided on dimensioning telephone networks and IP networks to determine required link capacities and transmission equipment while satisfying traffic demands and constraints. The goal is to minimize total network construction and operation costs.
Raj Ojha from GigFire Microsystems presented at the Ethernet Alliance Technology Exploration Forum on the impact of exploding bandwidth demands in data centers. The presentation covered:
1) Projected growth in data center traffic and equipment requirements driven by compound annual growth rates of 70-100% through 2020.
2) The need for changes in switch/router architectures, line cards, and pluggable modules to support higher data rates like 100G, 400G and 1Tb.
3) The strong industry need to develop very low power semiconductors to reduce energy costs and increase bandwidth, aiming for 50% lower power consumption and 10x higher bandwidth.
Blue Waters and Resource Management - Now and in the Future - inside-BigData.com
In this presentation from Moabcon 2013, Bill Kramer from NCSA presents: Blue Waters and Resource Management - Now and in the Future.
Watch the video of this presentation: http://insidehpc.com/?p=36343
The National Polar-orbiting Operational Environmental Satellite System (NPOESS) is a tri-agency effort between NOAA, NASA, and the Department of Defense to develop the next generation of weather and environmental satellites. NPOESS aims to reduce costs by consolidating previous separate satellite programs and will provide critical data for weather forecasting, climate monitoring, and other applications. NPOESS will produce a variety of environmental data records from multiple sensors on each satellite to measure things like sea surface temperature, winds, ozone, and more.
This document discusses performance improvements to the Lustre parallel file system in versions 2.5 through large I/O patches, metadata improvements, and metadata scaling with distributed namespace (DNE). It summarizes evaluations showing improved throughput from 4MB RPC, reduced degradation with large numbers of threads using SSDs over NL-SAS, high random read performance from SSD pools, and significant metadata performance gains in Lustre 2.4 from DNE allowing nearly linear scaling. Key requirements for next-generation storage include extreme IOPS, tiered architectures using local flash with parallel file systems, and reducing infrastructure needs while maintaining throughput.
Japan Lustre User Group 2014
1. Extreme Big Data (EBD)
Convergence of Extreme Computing
and Big Data Technologies
Global Scientific Information and Computing Center
Tokyo Institute of Technology
Hitoshi Sato
2. Big Data Examples
Rates and Volumes are extremely immense
Social NW
• Facebook
– 1 billion users
– Average 130 friends
– 30 billion pieces of content
shared per month
• Twitter
– 500 million active users
– 340 million tweets per day
• Internet
– 300 million new websites per year
– 48 hours of video to YouTube per minute
– 30,000 YouTube videos played per second
Genomics
Social Simulation
3. Sequencing data (bp/$) becomes x4000 per 5 years (c.f., HPC: x33 in 5 years)
• Applications
– Target Area: Planet
(Open Street Map)
– 7 billion people
• Input Data
– Road Network for Planet:
300GB (XML)
– Trip data for 7 billion people
10KB (1trip) x 7 billion = 70TB
– Real-Time Streaming Data
(e.g., Social sensor, physical data)
• Simulated Output for 1 Iteration
– 700TB
Weather
A-1. Quality Control
A-2. Data Processing
30-sec Ensemble Forecast Simulations: 2 PFLOP
Ensemble Data Assimilation: 2 PFLOP
Himawari: 500MB / 2.5 min
Ensemble Forecasts
8. Big Data Examples
Rates and Volumes are extremely immense
Future "Extreme Big Data"
• NOT mining Tbytes of Silo Data
• Peta~Zettabytes of Data
• Ultra High-BW Data Streams
• Highly Unstructured, Irregular
• Complex correlations between data from multiple sources
• Extreme Capacity, Bandwidth, Compute All Required
15. Kronecker generator parameters: A: 0.57, B: 0.19, C: 0.19, D: 0.05
November 15, 2010
"Graph500 Takes Aim at a New Kind of HPC"
Richard Murphy (Sandia NL -> Micron):
"I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
The 8th Graph500 List (June 2014): K Computer #1, TSUBAME2 #12
Koji Ueno, Tokyo Institute of Technology / RIKEN AICS
Reality: Top500 supercomputers dominate the ranking; no Cloud IDCs at all
#1 K Computer: RIKEN Advanced Institute for Computational Science (AICS)'s K computer is ranked No.1 on the Graph500 Ranking of Supercomputers with 17977.1 GE/s on Scale 40, on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee.
#12 TSUBAME2: Global Scientific Information and Computing Center, Tokyo Institute of Technology's TSUBAME 2.5 is ranked No.12 on the Graph500 Ranking of Supercomputers with 1280.43 GE/s on Scale 36, on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee.
23. A Major Northern Japanese Cloud Datacenter (2013)
Core: 2x Juniper MX480 to the Internet (LACP); 2 zone switches (Virtual Chassis): 2x Juniper EX8208; 10GbE links
Zones of Juniper EX4200 edge switches, ~700 nodes per zone, 10GbE
8 zones, 5600 nodes total; injection 1 GBps/node, bisection 160 Gigabps
Supercomputer: Tokyo Tech TSUBAME 2.0, #4 on the Top500 (2010)
Advanced silicon photonics: 40G on a single CMOS die, 1490nm DFB, 100km fiber
~1500 compute and storage nodes, full-bisection multi-rail optical network
Injection 80 GBps/node, bisection 220 Terabps: x1000!
24. Towards Extreme-scale
Supercomputers and BigData Machines
• Computation
– Increase in Parallelism, Heterogeneity, Density
• Multi-core, Many-core processors
• Heterogeneous processors
• Hierarchical Memory/Storage Architecture
– NVM (Non-Volatile Memory),
SCM (Storage Class Memory)
• e.g., Flash, STT-MRAM, ReRAM
– Next-gen HDDs (SMR),
Tapes (LTFS)
Problems: Algorithm, Network, Locality, Power, FT, Productivity, Storage Hierarchy, I/O, Scalability, Heterogeneity
25. Extreme Big Data (EBD)
Next Generation Big Data
Infrastructure Technologies Towards
Yottabyte/Year
Principal Investigator
Satoshi Matsuoka
Global Scientific Information and
Computing Center
Tokyo Institute of Technology
2014/11/05 JST CREST Big Data Symposium
26. EBD Research Scheme
Future Non-Silo Extreme Big Data Apps
Co-Design of EBD System Software (incl. EBD Object System: EBD Bag, EBD KVS distributed over ~1000km, Cartesian Plane) with a Convergent Architecture (Phases 1~4): large-capacity NVM, high-bisection-NW supercomputers
Node design: DRAM and NVM/Flash stacked on a TSV interposer; 2Tbps HBM, 4~6 HBM channels, 1.5TB/s DRAM + NVM BW, 30PB/s I/O BW possible, 1 Yottabyte/Year
Contrast: batch-oriented compute supercomputers and Cloud+IDC with very low BW efficiency (PCB, high-powered main CPU plus low-power CPUs)
Co-design applications:
- Large Scale Metagenomics: a Multi-GPU Read Alignment Algorithm (Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka, TITECH); DNA consists of nucleotides with four bases such as adenine (A) and cytosine (C), and sequenced reads must be aligned to a reference
- Massive Sensors and Data Assimilation in Weather Prediction
- Ultra Large Scale Graphs and Social Infrastructures
Exascale Big Data HPC; Graph Store
28. 100,000-Times-Fold EBD "Convergent" System Overview
Task 1: Ultra-Parallel, Low-Power I/O EBD "Convergent" Supercomputer (TSUBAME 2.0/2.5 -> TSUBAME 3.0); ultra-high-BW, low-latency NVM and NW; processor-in-memory, 3D stacking; ~10TB/s -> ~100TB/s -> ~10PB/s
Task 2: EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes (EBD Bag, EBD KVS, Graph Store, Cartesian Plane); EBD "converged" Real-Time Resource Scheduling
Task 3: EBD Programming System
Task 4: EBD Performance Modeling and Evaluation
Tasks 5-1~5-3, Task 6: EBD Application Co-Design and Validation (Large Scale Genomic Correlation; Data Assimilation in Large Scale Sensors and Exascale Atmospherics; Large Scale Graphs and Social Infrastructure Apps)
29. 100,000-Times-Fold EBD "Convergent" System Overview (software stack)
Programming: SQL for EBD, Xpregel (Graph), Workflow/Scripting Languages for EBD, MapReduce for EBD, Message Passing (MPI, X10) for EBD, PGAS/Global Array for EBD
EBD Abstract Data Models (Distributed Array, Key Value, Sparse Data Model, Tree, etc.)
EBD Algorithm Kernels (Search/Sort, Matching, Graph Traversals, etc.)
Storage and runtime layer: EBD Burst I/O Buffer, EBD Network Topology and Routing, EBD File System, EBD Data Object, EBD Bag, EBD KVS, Graph Store, Cartesian Plane
Hardware and infrastructure: NVM, HPC Storage, Web Object Storage, Interconnect (InfiniBand, 100GbE), Network (SINET5), Intercloud/Grid (HPCI), Cloud Datacenter, TSUBAME 3.0, TSUBAME-GoldenBox
Applications: Large Scale Genomic Correlation; Data Assimilation in Large Scale Sensors and Exascale Atmospherics; Large Scale Graphs and Social Infrastructure Apps
34. EBD-IO Device: A Prototype of Local Storage
Configuration [Shirahata, Sato et al. GTC2014]: 16 cards of mSATA SSD devices behind RAID cards; high bandwidth and IOPS, huge capacity, low cost, power efficient
Capacity: 256GB x 16 -> 4TB; Read BW: 0.5GB/s x 16 -> 8GB/s
3.2 Burst Buffer System (toward reliable storage designs for resilient extreme-scale computing)
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for partial restart of uncoordinated checkpointing because processes involving restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster, and both restart from a failure, the processes can utilize all SSD bandwidth, unlike a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K (3.50GHz x 4 cores)
Memory: Cetus DDR3-1600 (16GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256GB CT256M4SSD3 (peak read: 500MB/s, peak write: 260MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5" SATA device converter with metal frame
RAID card: Adaptec RAID 7805Q ASR-7805Q Single
(Photos: a single mSATA SSD; 8 integrated mSATA SSDs; RAID cards; prototype/test machine)
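The headline capacity and bandwidth figures on this slide are straight aggregation across the 16 mSATA cards, using the per-device peaks from Table 1:

```python
# Aggregate capacity and peak read bandwidth of the EBD-IO prototype.
cards = 16
capacity_gb, peak_read_mbs = 256, 500            # per mSATA SSD (Crucial m4)
print(cards * capacity_gb / 1024, "TB total")    # 4.0 TB
print(cards * peak_read_mbs / 1000, "GB/s aggregate peak read")  # 8.0 GB/s
```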
36. Sorting for EBD: Plugging in GPUs for large-scale sorting [Shamoto, Sato et al. BigData 2014]
• GPU implementation of splitter-based sorting (HykSort)
• Weak scaling performance (Grand Challenge on TSUBAME2.5): 1~1024 nodes (2~2048 GPUs), 2 processes per node, 2GB of 64-bit integers per node
• Measured: up to 0.25 TB/s with HykSort (GPU + 6 threads), vs. Yahoo/Hadoop Terasort at 0.02 TB/s (including I/O); annotated speedups of x1.4, x3.61, and x389
• Performance prediction (PCIe_#: # GB/s CPU-GPU interconnect bandwidth): x2.2 speedup over the CPU-based implementation when the PCIe bandwidth between CPU and GPU increases to 50GB/s; 8.8% reduction of overall runtime when the accelerators are 4 times faster than a K20x
(Charts: Keys/second (billions) vs. # of processes (2 processes per node), comparing HykSort with 1 thread, 6 threads, and GPU + 6 threads, plus predicted curves PCIe_10/50/100/200)
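The splitter-based idea behind HykSort, sample splitters, bucket keys by splitter range, then sort each bucket independently, can be sketched in a few lines of single-node Python. The real implementation distributes the buckets over MPI ranks and offloads the local sorts to GPUs; this is only the partitioning pattern.

```python
# Splitter-based (sample-sort style) partitioning, the pattern behind HykSort.
import numpy as np

def splitter_sort(keys, num_buckets):
    sample = np.sort(np.random.default_rng(0).choice(keys, 10 * num_buckets))
    splitters = sample[::10][1:num_buckets]          # num_buckets - 1 splitters
    bucket_of = np.searchsorted(splitters, keys)     # which bucket each key falls into
    # Each bucket could be sorted on a different rank/GPU; here it is done serially.
    return np.concatenate([np.sort(keys[bucket_of == b]) for b in range(num_buckets)])

keys = np.random.default_rng(1).integers(0, 2**63 - 1, size=1_000_000, dtype=np.int64)
out = splitter_sort(keys, num_buckets=8)
assert np.all(out[:-1] <= out[1:])                   # globally sorted output
```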
37. Graph500 Benchmark (http://www.graph500.org)
! New Big Data benchmark based on large-scale graph search for ranking supercomputers
! BFS (Breadth-First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree edgefactor (=16)
! Evaluation criteria: TEPS (Traversed Edges Per Second), the problem size that can be solved on a system, and minimum execution time
Benchmark flow: input parameters (SCALE, edgefactor) -> 1. Generation -> 2. Construction -> 3. BFS and Validation, repeated for 64 iterations -> results (BFS time, traversed edges, TEPS); the reported score is the median TEPS (TEPS ratio) over the 64 runs
• Kronecker graph: 2^SCALE vertices and 2^(SCALE+4) edges; a synthetic scale-free network
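TEPS is simply the number of edges traversed by one BFS divided by its run time. At SCALE 36 and edgefactor 16 the graph has 2^36 vertices and about 2^40 edges, so TSUBAME2.5's 1280.43 GTEPS corresponds to a BFS sweep in under a second. A toy single-node version of the metric (not the reference Graph500 code, and counting scanned directed edges rather than the benchmark's undirected edge count):

```python
# Toy TEPS measurement: run a BFS, count traversed edges, divide by elapsed time.
import time
from collections import deque

def bfs_teps(adj, root):
    t0, visited, traversed = time.perf_counter(), {root}, 0
    q = deque([root])
    while q:
        u = q.popleft()
        for v in adj[u]:
            traversed += 1                  # every scanned edge counts in this toy metric
            if v not in visited:
                visited.add(v)
                q.append(v)
    return traversed / (time.perf_counter() - t0)

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}  # tiny undirected graph
print(f"{bfs_teps(adj, 0):.0f} TEPS")
```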
38. Scalable distributed memory BFS for Graph500
Koji Ueno (Tokyo Institute of Technology) et al.
! What’s the best algorithm for distributed memory
BFS?
We proposed and explored many
optimizations.
Optimizations SC11 ISC12 SC12 ISC14
2D decomposition ✓ ✓ ✓ ✓
vertex sorting ✓
direction optimization ✓
data compression ✓ ✓ ✓
sparse vector with pop counting ✓
adaptive data representation ✓
overlapped communication ✓ ✓ ✓ ✓
shared memory ✓
GPGPU ✓ ✓
✓ Utilization for each version
Graph500 score history of TSUBAME2, Performance (GTEPS): SC'11: 100, ISC'12: 317, SC'12: 462, ISC'14: 1,280
Continuous effort to improve
performance
Optimized for various machines
Machine # of nodes Performance
K computer 65536 5524 GTEPS
TSUBAME2.5 1024 1280 GTEPS
TSUBAME-KFC 32 104 GTEPS
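Of the optimizations listed above, direction optimization is the easiest to show compactly: expand the frontier top-down while it is small, and switch to a bottom-up parent search once it becomes large. A serial sketch of the idea follows; the actual TSUBAME2 implementation combines this with 2D partitioning, data compression, and GPUs, and the threshold heuristic here is only illustrative.

```python
# Direction-optimizing BFS: go bottom-up when the frontier is large.
def hybrid_bfs(adj, root, alpha=4):
    n = len(adj)
    parent = {root: root}
    frontier = {root}
    while frontier:
        nxt = set()
        if len(frontier) * alpha < n - len(parent):      # small frontier: top-down
            for u in frontier:
                for v in adj[u]:
                    if v not in parent:
                        parent[v] = u
                        nxt.add(v)
        else:                                            # large frontier: bottom-up
            for v in range(n):
                if v not in parent:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            nxt.add(v)
                            break
        frontier = nxt
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2]]                   # tiny undirected graph
print(hybrid_bfs(adj, 0))
```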
45. Expectations for Next-Gen Storage System
(Towards TSUBAME3.0)
• Achieving high IOPS
– Many apps with massive small I/O ops
• graph, etc.
• Utilizing NVM devices
– Discrete local SSDs on TSUBAME2
– How to aggregate them?
• Stability/Reliability as Archival Storage
• I/O resource reduction/consolidation
– Can we allow a large number of OSSs for
achieving ~TB/s throughput
– Many constraints
• Space, Power, Budget, etc.
46. Current Status
• New Approaches
• Tsubame 2.0 has pioneered the use of local flash storage as
a high-IOPS alternative to an external PFS
• Tiered and hybrid storage environments, combining (node)
local flash with an external PFS
• Industry Status
• High-performance, high-capacity flash (and other new
semiconductor devices) are becoming available
at reasonable cost
• New approaches/interface to use high-performance devices
(e.g. NVMexpress)