#InteloneAPIMathKernelLibrary
govindhtech · 1 month ago
Intel VTune Profiler: Improve AI/ML Application Performance
Improve AI/ML Application Performance with Intel VTune Profiler.
Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.
Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.
It is challenging to locate and examine performance bottlenecks in the underlying source code because of the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various phases of model development.
These approaches, however, are limited to the software level. Intel VTune Profiler, which is powered by oneAPI, addresses this gap by offering comprehensive insights into hardware-level memory and compute bottlenecks. These insights help resolve performance problems and optimize and scale AI applications across hardware systems with different computational envelopes.
As you will discover in this article, using Intel VTune Profiler to profile Data Parallel Python and OpenVINO applications can expand the scope of optimization for AI/ML workloads.
Use Intel VTune Profiler to Boost Python Application Performance
VTune Profiler may assist in profiling a Python program, as shown in a recently published recipe in the Intel VTune Profiler Cookbook.
The following are the recipe’s fundamental software requirements:
Data Parallel Extensions for Python
Intel VTune Profiler (version 2022 or later)
Intel Distribution for Python
Intel oneAPI DPC++/C++ Compiler
The NumPy implementation covered in the recipe divides the calculations into logical tasks using the Intel Instrumentation and Tracing Technology (ITT) APIs and performs distance computations using Intel oneAPI Math Kernel Library (oneMKL) routines. By using VTune Profiler to examine the execution time and memory consumption of each logical task, you can determine which areas of the code need changes to gain additional performance.
When Hotspots analysis is performed on the NumPy implementation, the output analysis report details the most CPU-intensive code segments. It also recommends other performance analysis features of the profiler to explore, such as Threading analysis for enhanced parallelism and Microarchitecture Exploration analysis for effective use of the underlying hardware.
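The idea of splitting a computation into named logical tasks can be sketched in plain Python. This is only an illustration under stated assumptions: the task() context manager below is a hypothetical stand-in that records wall-clock time per task, not the ITT API, and the pure-Python pairwise_distance() is a placeholder for the recipe's NumPy/oneMKL computation.

```python
import time
from contextlib import contextmanager
from math import dist

# Accumulated wall-clock seconds per logical task name.
timings = {}

@contextmanager
def task(name):
    """Hypothetical stand-in for an ITT task: time the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

def pairwise_distance(points):
    """Euclidean distance between every pair of points (illustrative, O(n^2))."""
    return [[dist(p, q) for q in points] for p in points]

def run(n=200):
    # Each `with` block marks one logical task, so time can be attributed per task.
    with task("generate_data"):
        points = [(float(i), float(i % 7)) for i in range(n)]
    with task("pairwise_distance"):
        distances = pairwise_distance(points)
    return distances

result = run()
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

With real ITT annotations, the same task boundaries would appear as named regions in the VTune Profiler timeline rather than in a Python dictionary.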
Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks
According to the Hotspots analysis report, NumPy operations and the underlying oneMKL functions account for a significant share of the execution time in the naive NumPy implementation of the pairwise distance computation example. With minimal code changes, NumPy can be swapped out for the Data Parallel Extension for NumPy to address these bottlenecks. Run the Hotspots analysis again to evaluate the speed gains over the naive NumPy code and to find any areas that might need further optimization.
VTune Profiler also makes recommendations such as using the Data Parallel Extension for Numba with your platform’s GPU to bring accelerator offload parallelism to the application. The Data Parallel Extension for Numba is an open-source extension of the Numba JIT compiler for NumPy operations. It offers Python kernel programming APIs that resemble SYCL. The GPU Compute/Media Hotspots analysis preview feature of VTune Profiler can then be used to examine the Numba implementation’s execution on a GPU.
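The import swap might look like the following minimal sketch. It assumes the Data Parallel Extension for NumPy is importable as dpnp and falls back to stock NumPy when it is not installed. pairwise_distance() is an illustrative function written for this example, not the recipe's code, and whether every call below is supported by your dpnp version should be checked.

```python
try:
    # Assumption: the Data Parallel Extension for NumPy is installed as `dpnp`.
    import dpnp as np
except ImportError:
    # Fallback so the sketch still runs with stock NumPy.
    import numpy as np

def pairwise_distance(x):
    """Euclidean distances between all rows of x, using
    |a - b|^2 = |a|^2 + |b|^2 - 2 a.b to stay in array operations."""
    sq = np.sum(x * x, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * np.dot(x, x.T)
    # Clamp tiny negative values caused by rounding before taking the root.
    return np.sqrt(np.maximum(d2, 0.0))

points = np.asarray([[0.0, 0.0], [3.0, 4.0]])
distances = pairwise_distance(points)
print(distances)
```

Because the rest of the code only refers to the np alias, the change is confined to the import line, which is what makes the before-and-after Hotspots comparison straightforward.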
Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance
Another new recipe in the VTune Profiler Cookbook covers using VTune Profiler to profile OpenVINO-based AI applications. It discusses how to use the profiler to analyze performance bottlenecks on the CPU, GPU, and Neural Processing Unit (NPU).
The recipe’s software requirements are:
Intel oneAPI Base Toolkit (if your OpenVINO application makes use of the Intel oneAPI Data Analytics Library (oneDAL) and/or the Intel oneAPI Deep Neural Network Library (oneDNN))
Intel Distribution for Python
Intel VTune Profiler (v2024.1 or later)
OpenVINO Toolkit, which includes the Benchmark Tool application
The recipe offers detailed steps for configuring OpenVINO with the ITT APIs for performance analysis, building the OpenVINO source, and setting it up. It profiles the AI application and analyzes performance and latency using a reference benchmark application.
Depending on the computational architecture, you can use VTune Profiler’s performance analysis features to find hotspots and see how much of the hardware specific code sections utilize.
For example:
To examine CPU bottlenecks (that is, the sections of code that take up the most CPU execution time), use the Hotspots analysis.
To profile GPU hotspots, use the GPU Compute/Media Hotspots analysis preview feature. Identify inefficient kernels, examine how frequently different types of GPU instructions are executed, and more to understand GPU utilization.
The Neural Processing Units (NPUs) in AI PCs are designed specifically to boost the performance of AI/ML applications. With the Intel Distribution of OpenVINO Toolkit, you can offload compute-intensive AI/ML tasks to Intel NPUs. VTune Profiler’s NPU Exploration analysis preview feature lets you examine NPU performance using a number of hardware metrics, including workload size, execution time, sampling interval, and more.
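From the command line, the three analyses above are selected with `vtune -collect <analysis type>`. The sketch below only builds the command and prints it. The analysis type names in the mapping are my understanding of the VTune command-line names and should be confirmed with `vtune -help collect` for your installed version; the result directory name and the model.xml path are placeholders.

```python
import shlex

# Assumed analysis type names for `vtune -collect`; verify with `vtune -help collect`.
ANALYSIS_BY_DEVICE = {
    "cpu": "hotspots",      # Hotspots analysis
    "gpu": "gpu-hotspots",  # GPU Compute/Media Hotspots (preview)
    "npu": "npu",           # NPU Exploration (preview)
}

def collect_command(device, result_dir, target):
    """Build the argument list for a VTune collection of `target` on `device`."""
    analysis = ANALYSIS_BY_DEVICE[device]
    return ["vtune", "-collect", analysis, "-result-dir", result_dir, "--", *target]

# Placeholder target: the OpenVINO Benchmark Tool with a hypothetical model file.
cmd = collect_command(
    "cpu", "r001_hotspots", ["benchmark_app", "-m", "model.xml", "-d", "CPU"]
)
print(shlex.join(cmd))
```

To run the collection rather than print it, the same list could be passed to subprocess.run(cmd, check=True) on a machine where VTune Profiler and OpenVINO are installed.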
Intel VTune Profiler Download
Use one of the following methods to install Intel VTune Profiler on your computer:
Get the standalone bundle here.
As part of the Intel oneAPI Base Toolkit, download Intel VTune Profiler.
Know the Process
The standard software performance analysis process consists of the following steps. (Image credit: Intel)
Launch Intel VTune Profiler from within Microsoft Visual Studio, from the GUI, or from the command line.
Configure parameters and choose a profiling analysis for your application.
Profile the application on the host (local collection) or on a target system (remote collection).
View the results on the host system.
Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.
FAQs
What does Intel VTune do?
Intel VTune Profiler analyzes CPU usage, OpenMP efficiency, memory access, and vectorization to assess application performance. It also measures metrics that help identify memory access issues.
Read more on govindhtech.com
govindhtech · 2 months ago
SynxFlow Project: A Smooth Migration From CUDA To SYCL
The SynxFlow Project
SynxFlow is an open-source, GPU-based hydrodynamic flood modeling software written in CUDA, C++, and Python. Data pre-processing and visualization are done in Python, while simulations are executed on CUDA. SynxFlow can simulate floods faster than real time with hundreds of millions of computational cells and metre-level precision on multiple GPUs. Because it is open source and has a simple Python interface, it can be integrated into data science workflows for disaster risk assessments. The model has been widely used in research and industry, for example to support flood early warning systems and to generate flood maps for (re)insurance firms.
SynxFlow can simulate flooding, landslide runout, and debris flow. Simulations are crucial to emergency service planning and management. A comprehensive prediction of natural disasters can reduce their social and economic costs. In addition to risk assessment and disaster preparedness, SynxFlow flood simulation can help with urban planning, environmental protection, climate change adaptation, insurance and financial planning, infrastructure design and engineering, public awareness, and education.
Problem Statement
Several factors make probabilistic flood forecasting computationally difficult:
Storage, retrieval, and management of large datasets.
Complex real-time data processing, which requires high-performance computation.
Model calibration and validation, which are needed as real-world conditions change.
Effective integration and data transfer between hydrological, hydraulic, and meteorological models, and more.
For faster results, a flood forecasting system must process data in parallel and offload compute-intensive operations to hardware accelerators. The SynxFlow team therefore needed larger supercomputers to increase flood simulation scale and cut simulation time. DAWN, the UK’s newest supercomputer, employs Intel GPUs, which SynxFlow did not support.
These issues gave the researchers a new goal: to make the SynxFlow model performance-portable and scalable on supercomputers with multi-vendor GPUs. They needed to migrate the SynxFlow code from CUDA to a cross-vendor programming model in weeks, not years.
Solution Powered by oneAPI
After considering several possibilities, the SynxFlow project team chose the Intel oneAPI Base Toolkit, an implementation of the oneAPI specification backed by the Unified Acceleration Foundation. It is built on the multiarchitecture, multi-vendor SYCL framework, supports Intel, NVIDIA, and AMD GPUs, and includes the Intel DPC++ Compatibility Tool for automated CUDA-to-SYCL code translation.
The SynxFlow code migration went smoothly. The migration tool automatically translated most CUDA kernels and API calls into SYCL. Some errors surfaced during compilation after auto-translation, but the tool’s diagnostic messages and warnings made them easy to fix. Switching from NVIDIA Collective Communications Library (NCCL)-based inter-GPU communication to GPU-direct-enabled Intel MPI Library calls took longer because this step could not be automated.
To summarize, this was a promising effort to migrate a complex CUDA-based flood simulation code to SYCL, achieving both scalability and performance portability. The Intel oneAPI Base Toolkit made the conversion manageable and seamless.
Intel hosted a oneAPI Hackfest at the DiRAC HPC Research Facility
DiRAC
DiRAC is the high-performance computing facility in the United Kingdom that serves the theoretical communities of Particle Physics, Astrophysics, Cosmology, Solar System and Planetary Science, and Nuclear Physics.
DiRAC’s three HPC services (Extreme Scaling, Memory-Intensive, and Data-Intensive) are each designed to support the distinct kinds of computational workflows required to carry out its science program. DiRAC places a strong emphasis on innovation, and all of its services are co-designed with vendor partners, technical and software engineering teams, and the research community.
Training Series on oneAPI at DiRAC Hackfest
On May 21–23, 2024, the DiRAC community hosted three half-day remote training sessions on the Intel oneAPI Base Toolkit. The training series was designed for developers and/or researchers with varying degrees of experience, ranging from novices to experts.
The cross-platform SYCL programming framework served as the foundation for the concepts taught to the attendees. The attendees were also introduced to a number of Base Kit tools and libraries that support SYCL. For instance, the Intel DPC++ Compatibility Tool facilitates automated code migration from CUDA to C++ with SYCL; the Intel oneAPI Math Kernel Library (oneMKL) optimizes math operations; the Intel oneAPI Deep Neural Network Library (oneDNN) accelerates deep learning workloads; and the Intel oneAPI DPC++ Library (oneDPL) speeds up SYCL kernels on a variety of hardware. The training sessions also covered code profiling with Intel Advisor and Intel VTune Profiler, two tools included in the Base Kit for analyzing performance bottlenecks.
DiRAC Hackfest’s oneAPI Hackathon
The participants worked on their cutting-edge projects using oneAPI tools and libraries. Their tasks included parallelizing Fortran code on Intel GPUs, accelerating math operations like the Fast Fourier Transform (FFT) using oneMKL’s SYCL API, and resolving performance bottlenecks with the aid of Intel Advisor and Intel VTune Profiler.
The participants reported that it was easy to adjust to using oneAPI components and that the code migration process went smoothly. The teams saw a noticeable increase in workload performance with libraries like the Intel MPI Library. Approximately 70% of the participating teams indicated that they would be open to using oneAPI technologies to further optimize the code for their research projects. About 30% of the teams benchmarked their results using SYCL and oneAPI, and they achieved a 100% success rate in code conversion to SYCL.
Start Programming Multiarchitecture Using SYCL and oneAPI
Explore the SYCL framework and oneAPI toolkits now for accelerated multiarchitecture development! Use oneAPI to enable cross-platform parallelism in your apps and move your workloads to SYCL for high-performance heterogeneous computing.
Intel invites you to review the real-world code migration application samples found in the CUDA to SYCL catalog. Explore the AI, HPC, and rendering solutions available in Intel’s oneAPI-powered software portfolio.
Read more on govindhtech.com