#IntelMPI
govindhtech · 2 months ago
SynxFlow Project: A Smooth Migration From CUDA To SYCL
The SynxFlow Project
SynxFlow is open-source, GPU-based hydrodynamic flood modelling software written in CUDA, C++, and Python. Data pre-processing and visualization are done in Python, while simulations run on CUDA. SynxFlow can simulate floods faster than real time, with hundreds of millions of computational cells and metre-level precision, across many GPUs. As open-source software with a simple Python interface, it can be integrated into data science workflows for disaster risk assessment. The model has been widely used in research and industry, for example to support flood early-warning systems and to generate flood maps for (re)insurance firms.
SynxFlow can simulate flooding, landslide runout, and debris flow. Simulations are crucial to emergency service planning and management. A comprehensive prediction of natural disasters can reduce their social and economic costs. In addition to risk assessment and disaster preparedness, SynxFlow flood simulation can help with urban planning, environmental protection, climate change adaptation, insurance and financial planning, infrastructure design and engineering, public awareness, and education.
Issue Statement
Several factors make probabilistic flood forecasting computationally demanding:
Storage, retrieval, and management of large datasets
Complex real-time data processing that requires high-performance computing
Ongoing model calibration and validation as real-world conditions change
Effective integration and data transfer between hydrological, hydraulic, and meteorological models, among others
To deliver results faster, a flood forecasting system must process data in parallel and offload compute-intensive operations to hardware accelerators. The SynxFlow team therefore needed larger supercomputers to increase flood simulation scale and cut simulation time. DAWN, the UK's newest supercomputer, uses Intel GPUs, which SynxFlow did not support.
These issues gave the researchers a new goal: make the SynxFlow model performance-portable and scalable on supercomputers with GPUs from multiple vendors. That meant moving the SynxFlow code from CUDA to a cross-vendor programming model in weeks, not years.
Solution Powered by oneAPI
After considering several options, the SynxFlow project team chose the Intel oneAPI Base Toolkit, Intel's implementation of the oneAPI specification backed by the Unified Acceleration (UXL) Foundation. The toolkit is built on the multiarchitecture, multi-vendor SYCL framework; it supports Intel, NVIDIA, and AMD GPUs and includes the Intel DPC++ Compatibility Tool for automated CUDA-to-SYCL code translation.
The SynxFlow code migration went smoothly. The Intel DPC++ Compatibility Tool automatically translated most CUDA kernels and API calls into SYCL. After auto-translation, some errors surfaced during compilation, but the tool's diagnostic messages and warnings made them easy to fix. Switching from NVIDIA Collective Communications Library (NCCL)-based inter-GPU communication to GPU-direct-enabled Intel MPI library calls took longer, because that step could not be automated.
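To give a flavour of what the translated code looks like, below is a hypothetical illustration (not code from SynxFlow): a simple per-cell update kernel written against the SYCL API, of the kind the Compatibility Tool generates from a CUDA kernel launch.

    // Hypothetical example: a SYCL version of a simple per-cell update kernel,
    // similar in shape to what the Intel DPC++ Compatibility Tool emits when it
    // converts a CUDA <<<grid, block>>> launch into a queue submission.
    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        const size_t n_cells = 1 << 20;          // number of computational cells (made up)
        std::vector<float> depth(n_cells, 0.0f); // water depth per cell
        const float rainfall = 0.001f;           // depth added per time step (made up)

        sycl::queue q{sycl::default_selector_v}; // selects an Intel, NVIDIA, or AMD GPU if present

        {
            sycl::buffer<float, 1> buf(depth.data(), sycl::range<1>(n_cells));
            q.submit([&](sycl::handler &cgh) {
                sycl::accessor acc{buf, cgh, sycl::read_write};
                // parallel_for replaces the CUDA kernel launch; one work-item per cell
                cgh.parallel_for(sycl::range<1>(n_cells), [=](sycl::id<1> i) {
                    acc[i] += rainfall;
                });
            });
        } // buffer destruction copies the results back into depth

        return 0;
    }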
In summary, a complex CUDA-based flood simulation code has been migrated to SYCL, achieving both scalability and performance portability. The Intel oneAPI Base Toolkit made the conversion manageable and largely seamless.
Intel hosted a oneAPI Hackfest at the DiRAC HPC Research Facility
DiRAC
DiRAC is the high-performance computing facility in the United Kingdom serving the theoretical communities of Particle Physics, Astrophysics, Cosmology, Solar System and Planetary Science, and Nuclear Physics.
DiRAC's three HPC services (Extreme Scaling, Memory-Intensive, and Data-Intensive) are each designed to support the distinct kinds of computational workflows required to carry out its science programme. DiRAC places a strong emphasis on innovation, and all of its services are co-designed with vendor partners, technical and software engineering teams, and the research community.
Training Series on oneAPI at DiRAC Hackfest
On May 21–23, 2024, the DiRAC community hosted three half-day remote training sessions on the Intel oneAPI Base Toolkit. The training series was designed for developers and/or researchers with varying degrees of experience, ranging from novices to experts.
The attendees were taught a range of concepts built on the cross-platform SYCL programming framework. They were also introduced to several Base Kit tools and libraries that support SYCL. For instance, the Intel DPC++ Compatibility Tool automates code migration from CUDA to C++ with SYCL; the Intel oneAPI Math Kernel Library (oneMKL) optimizes math operations; the Intel oneAPI Deep Neural Network Library (oneDNN) accelerates deep-learning primitives; and the Intel oneAPI DPC++ Library (oneDPL) speeds up SYCL kernels on a variety of hardware. The training sessions also covered code profiling and the use of Intel Advisor and Intel VTune Profiler, two Base Kit tools for analyzing performance bottlenecks.
DiRAC Hackfest's oneAPI Hackathon
Using oneAPI tools and libraries, the participants advanced their own cutting-edge projects through a range of tasks, including parallelizing Fortran code on Intel GPUs, accelerating math operations such as the Fast Fourier Transform (FFT) with oneMKL's SYCL API, and resolving performance bottlenecks with the help of Intel Advisor and Intel VTune Profiler.
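As a minimal sketch of the FFT offload path (illustrative only: the transform length and input data are placeholders, and oneMKL build and link details are omitted), a forward complex FFT through oneMKL's SYCL DFT interface can look like this:

    // Minimal sketch: forward complex-to-complex FFT offloaded via oneMKL's SYCL DFT API.
    #include <sycl/sycl.hpp>
    #include <oneapi/mkl/dfti.hpp>
    #include <complex>
    #include <cstdint>

    int main() {
        const std::int64_t n = 1024;              // transform length (placeholder)
        sycl::queue q{sycl::default_selector_v};

        // Unified shared memory visible to both host and device
        auto *data = sycl::malloc_shared<std::complex<float>>(n, q);
        for (std::int64_t i = 0; i < n; ++i) data[i] = {1.0f, 0.0f};

        // Single-precision, complex-domain descriptor committed to the queue's device
        oneapi::mkl::dft::descriptor<oneapi::mkl::dft::precision::SINGLE,
                                     oneapi::mkl::dft::domain::COMPLEX> desc(n);
        desc.commit(q);

        // In-place forward transform executed on the device
        oneapi::mkl::dft::compute_forward(desc, data).wait();

        sycl::free(data, q);
        return 0;
    }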
The participants reported that adjusting to the oneAPI components was easy and that the code migration process went smoothly. The teams saw a noticeable increase in workload performance with libraries such as Intel MPI. Approximately 70% of the participating teams indicated that they would be open to using oneAPI technologies to further optimize the code for their research projects. Thirty percent of the teams benchmarked their outcomes using SYCL and oneAPI, achieving a 100% success rate in converting their code to SYCL.
Start Programming Multiarchitecture Using SYCL and oneAPI
Investigate the SYCL framework and oneAPI toolkits now for accelerated multiarchitecture development! Use oneAPI to enable cross-platform parallelism in your apps and move your workloads to SYCL for high-performance heterogeneous computing.
Intel invites you to review the real-world code migration samples in the CUDA to SYCL catalog, and to explore the AI, HPC, and rendering solutions in Intel's software portfolio powered by oneAPI.
Read more on govindhtech.com
govindhtech · 8 months ago
Try Intel MPI Benchmarks: MPI Performance Analysis Guide
Ping-Pong with Intel MPI Benchmark (IMB)
IMB Ping-Pong measures the latency of sending a fixed-size message between two ranks on separate virtual machines. Using the HPC Rocky Linux 8 image instead of the default GCP Rocky Linux 8 image, they observed improvements of up to 15%.
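For context, the pattern IMB Ping-Pong times is essentially the round trip below (a simplified sketch, not the IMB source; the message size and iteration count are arbitrary):

    // Simplified ping-pong sketch in C++ using the MPI C API (illustrative only).
    // Rank 0 sends a fixed-size message to rank 1 and waits for it to come back;
    // half the average round-trip time approximates the one-way latency.
    #include <mpi.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int msg_bytes = 8;       // small message, so the test is latency-dominated
        const int iters = 1000;        // iteration count (arbitrary for this sketch)
        std::vector<char> buf(msg_bytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), msg_bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), msg_bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            std::printf("average one-way latency: %.3f us\n", (t1 - t0) / iters / 2 * 1e6);

        MPI_Finalize();
        return 0;
    }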
Benchmark setup:
Instances: two h3-standard-88 VMs
MPI library: Intel oneAPI MPI Library 2021.11.0
MPI benchmarks application: Intel MPI Benchmarks 2019 Update 6
MPI environment variables: I_MPI_PIN_PROCESSOR_LIST=0, I_MPI_FABRICS=shm:ofi, FI_PROVIDER=tcp
Command line: mpirun -n 2 -ppn 1 -bind-to core -hostfile <hostfile> IMB-MPI1 Pingpong -iter 50000 -msglog 0:16

AllReduce – Intel MPI Benchmark (IMB) – single process per node
The IMB AllReduce benchmark measures collective latency across multiple ranks on separate virtual machines. It reduces a fixed-length vector using the MPI_SUM operation.
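As a simplified sketch rather than the IMB implementation (the vector length here is arbitrary), the operation being timed amounts to:

    // Illustrative MPI_Allreduce pattern (not the IMB implementation): every rank
    // contributes a fixed-length float vector and receives the element-wise
    // MPI_SUM across all ranks.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int len = 1024;                        // fixed vector length (arbitrary)
        std::vector<float> send(len, static_cast<float>(rank));
        std::vector<float> recv(len, 0.0f);

        // The latency of this collective call is what IMB AllReduce reports
        MPI_Allreduce(send.data(), recv.data(), len, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }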
First, they show the 1 PPN (process-per-node) result, i.e. 1 MPI rank per VM, on 8 VMs in order to isolate networking performance.
When comparing the HPC Rocky Linux 8 image to the default GCP Rocky Linux 8 image, they observed improvements of up to 35%.
Benchmark setup:
Instances: eight h3-standard-88 VMs
Processes: one per node
MPI library: Intel oneAPI MPI Library 2021.11.0
MPI benchmarks application: Intel MPI Benchmarks 2019 Update 6
MPI environment variables: I_MPI_FABRICS=shm:ofi, FI_PROVIDER=tcp, I_MPI_ADJUST_ALLREDUCE=11
Command line: mpirun -n 8 -ppn 1 -bind-to core -hostfile <hostfile> IMB-MPI1 Allreduce -iter 50000 -npmin 8 -msglog 0:16

AllReduce – Intel MPI Benchmark (IMB) – one process per core (88 processes per node)
Next, they show the 88 PPN results: 88 MPI ranks per node with 1 thread per rank, for 704 ranks in total.
When comparing the HPC Rocky Linux 8 image to the default GCP Rocky Linux 8 image for this test, they observed an improvement of up to 25%.
Benchmark setup:
Instances: eight h3-standard-88 VMs
Processes: 88 per node (one per core)
MPI library: Intel oneAPI MPI Library 2021.11.0
MPI benchmarks application: Intel MPI Benchmarks 2019 Update 6
MPI environment variables: I_MPI_FABRICS=shm:ofi, FI_PROVIDER=tcp, I_MPI_ADJUST_ALLREDUCE=11
Command line: mpirun -n 704 -ppn 88 -bind-to core -hostfile <hostfile> IMB-MPI1 Allreduce -iter 50000 -npmin 704 -msglog 0:16

Google is pleased to announce the general availability of HPC Virtual Machine (VM) images for high performance computing (HPC) workloads, primarily tightly coupled workloads such as fluid dynamics, molecular modeling, and weather forecasting. These images are based on CentOS 7 and Rocky Linux 8.
They have made it simple to create an HPC-ready virtual machine instance with the HPC VM image by integrating their best practices for HPC on Google Cloud, such as:
Virtual machines prepared for HPC right out of the box: For tightly coupled HPC workloads, there is no need to manually tune performance, handle virtual machine reboots, or stay current with Google Cloud updates, especially given the frequent releases of HPC VM images. When a tuning requires it, the HPC VM image handles the reboot process automatically.
Networking optimizations for tightly coupled workloads: Applications that rely heavily on point-to-point and collective communications benefit from optimizations that lower latency for small messages.
Compute optimizations: These include adjustments that reduce system jitter, improving scalability by ensuring consistent single-node performance.
Enhanced application compatibility: Alignment with the node-level requirements of the Intel HPC platform specification enables a high level of system interoperability.
Measuring performance with HPC benchmarks
They evaluated the HPC VM images' performance using the Intel MPI Benchmarks (IMB) against the GCP-optimized Rocky Linux 8 image and the default CentOS 7 image.
The following images were used in the benchmarks:

HPC Rocky Linux 8
Image name: hpc-rocky-linux-8-v20240126
Image project: cloud-hpc-image-public

GCP Rocky Linux 8 default
Image name: rocky-linux-8-optimized-gcp-v20240111
Image project: rocky-linux-cloud

To reduce network latency, each cluster of machines was deployed using compact placement with max_distance=1, meaning all virtual machines were placed on hardware in the same physical rack.
The HPC VM image and the Cloud HPC Toolkit
The HPC VM image can be used with the Cloud HPC Toolkit, an open-source tool that simplifies deploying environments for a range of workloads, including machine learning, AI, and HPC. In fact, the Toolkit blueprints and the Slurm images based on CentOS 7 and Rocky Linux 8 use the HPC VM image by default. The Cloud HPC Toolkit can also further customize the HPC VM image, allowing additional software to be installed and configurations to be changed, which increases its utility.
By using the Cloud HPC Toolkit to customize images based on the HPC VM image, you can create and share blueprints for building optimized, specialized images, increasing reproducibility while lowering setup time and effort.
Setting up an HPC-ready VM
Overview
In tightly coupled high performance computing (HPC) workloads, processes and VM instances communicate via MPI. However, creating your own VM image optimized for MPI performance requires additional maintenance time, familiarity with Google Cloud, and system knowledge. With the HPC VM image, you can quickly create virtual machine instances for your HPC workloads. An alternative is to build virtual machines (VMs) using the H3 machine series.
The HPC VM image is designed for tightly coupled HPC workloads and is based on either CentOS 7.9 or Rocky Linux 8. It contains the pre-configured network and kernel tuning parameters needed to create VM instances with the best MPI performance on Google Cloud.
These options can be used to create an HPC-ready virtual machine:
Google Cloud CLI
Google Cloud console, where the image is available via Cloud Marketplace
Slurm, the workload manager from SchedMD, which uses the HPC VM image by default
Omnibond CloudyCluster, which uses the HPC VM image by default

Advantages
The advantages of the HPC VM image are as follows:
Virtual machines that are ready for HPC right out of the box: For tightly coupled HPC workloads, there is no need to manually tune performance, handle virtual machine reboots, or keep up with Google Cloud updates.
Networking best practices for tightly coupled workloads: Applications that rely heavily on point-to-point and collective communications benefit from optimizations that lower latency for small messages.
Compute improvements for HPC tasks: Jitter-reduction optimizations make single-node performance more predictable.
Reliable and consistent performance: Standardizing on the VM image makes application-level performance consistent and repeatable.
Enhanced application compatibility: Alignment with the node-level requirements of the Intel HPC platform specification greatly improves system interoperability.
Features
Collective tunings for Intel MPI: The HPC VM image includes Intel MPI collective tunings that were generated on c2-standard-60 and c2d-standard-112 instances using compact placement policies.
Pre-installed RPMs: The following RPM packages are pre-installed on the HPC VM image:
The "Development Tools" package group, plus Lmod, dkms, htop, hwloc, hwloc-devel, kernel-devel, ltrace, libXt, nfs-utils, numactl, numactl-devel, papi, pciutils, pdsh, perf, redhat-lsb-core, redhat-lsb-cxx, rsh, screen, trace, wget, zsh
Read more on Govindhtech.com