#IntelVTuneProfiler
Explore tagged Tumblr posts
govindhtech · 5 months ago
Text
Intel VTune Profiler For Data Parallel Python Applications
Tumblr media
Intel VTune Profiler tutorial
This brief tutorial will show you how to use Intel VTune Profiler to profile the performance of a Python application using the NumPy and Numba example applications.
Analysing Performance in Applications and Systems
For HPC, cloud, IoT, media, storage, and other applications, Intel VTune Profiler optimises system performance, application performance, and system configuration.
Optimise the performance of the entire application not just the accelerated part using the CPU, GPU, and FPGA.
Profile SYCL, C, C++, C#, Fortran, OpenCL code, Python, Google Go, Java,.NET, Assembly, or any combination of languages can be multilingual.
Application or System: Obtain detailed results mapped to source code or coarse-grained system data for a longer time period.
Power: Maximise efficiency without resorting to thermal or power-related throttling.
VTune platform profiler
It has following Features.
Optimisation of Algorithms
Find your code’s “hot spots,” or the sections that take the longest.
Use Flame Graph to see hot code routes and the amount of time spent in each function and with its callees.
Bottlenecks in Microarchitecture and Memory
Use microarchitecture exploration analysis to pinpoint the major hardware problems affecting your application’s performance.
Identify memory-access-related concerns, such as cache misses and difficulty with high bandwidth.
Inductors and XPUs
Improve data transfers and GPU offload schema for SYCL, OpenCL, Microsoft DirectX, or OpenMP offload code. Determine which GPU kernels take the longest to optimise further.
Examine GPU-bound programs for inefficient kernel algorithms or microarchitectural restrictions that may be causing performance problems.
Examine FPGA utilisation and the interactions between CPU and FPGA.
Technical summary: Determine the most time-consuming operations that are executing on the neural processing unit (NPU) and learn how much data is exchanged between the NPU and DDR memory.
In parallelism
Check the threading efficiency of the code. Determine which threading problems are affecting performance.
Examine compute-intensive or throughput HPC programs to determine how well they utilise memory, vectorisation, and the CPU.
Interface and Platform
Find the points in I/O-intensive applications where performance is stalled. Examine the hardware’s ability to handle I/O traffic produced by integrated accelerators or external PCIe devices.
Use System Overview to get a detailed overview of short-term workloads.
Multiple Nodes
Describe the performance characteristics of workloads involving OpenMP and large-scale message passing interfaces (MPI).
Determine any scalability problems and receive suggestions for a thorough investigation.
Intel VTune Profiler
To improve Python performance while using Intel systems, install and utilise the Intel Distribution for Python and Data Parallel Extensions for Python with your applications.
Configure your Python-using VTune Profiler setup.
To find performance issues and areas for improvement, profile three distinct Python application implementations. The pairwise distance calculation algorithm commonly used in machine learning and data analytics will be demonstrated in this article using the NumPy example.
The following packages are used by the three distinct implementations.
Numpy Optimised for Intel
NumPy’s Data Parallel Extension
Extensions for Numba on GPU with Data Parallelism
Python’s NumPy and Data Parallel Extension
By providing optimised heterogeneous computing, Intel Distribution for Python and Intel Data Parallel Extension for Python offer a fantastic and straightforward approach to develop high-performance machine learning (ML) and scientific applications.
Added to the Python Intel Distribution is:
Scalability on PCs, powerful servers, and laptops utilising every CPU core available.
Assistance with the most recent Intel CPU instruction sets.
Accelerating core numerical and machine learning packages with libraries such as the Intel oneAPI Math Kernel Library (oneMKL) and Intel oneAPI Data Analytics Library (oneDAL) allows for near-native performance.
Tools for optimising Python code into instructions with more productivity.
Important Python bindings to help your Python project integrate Intel native tools more easily.
Three core packages make up the Data Parallel Extensions for Python:
The NumPy Data Parallel Extensions (dpnp)
Data Parallel Extensions for Numba, aka numba_dpex
Tensor data structure support, device selection, data allocation on devices, and user-defined data parallel extensions for Python are all provided by the dpctl (Data Parallel Control library).
It is best to obtain insights with comprehensive source code level analysis into compute and memory bottlenecks in order to promptly identify and resolve unanticipated performance difficulties in Machine Learning (ML),  Artificial Intelligence ( AI), and other scientific workloads. This may be done with Python-based ML and AI programs as well as C/C++ code using Intel VTune Profiler. The methods for profiling these kinds of Python apps are the main topic of this paper.
Using highly optimised Intel Optimised Numpy and Data Parallel Extension for Python libraries, developers can replace the source lines causing performance loss with the help of Intel VTune Profiler, a sophisticated tool.
Setting up and Installing
1. Install Intel Distribution for Python
2. Create a Python Virtual Environment
   python -m venv pyenv
   pyenv\Scripts\activate
3. Install Python packages
   pip install numpy
   pip install dpnp
   pip install numba
   pip install numba-dpex
   pip install pyitt
Make Use of Reference Configuration
The hardware and software components used for the reference example code we use are:
Software Components:
dpnp 0.14.0+189.gfcddad2474
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mkl-umath 0.1.1
numba 0.59.0
numba-dpex 0.21.4
numpy 1.26.4
pyitt 1.1.0
Operating System:
Linux, Ubuntu 22.04.3 LTS
CPU:
Intel Xeon Platinum 8480+
GPU:
Intel Data Center GPU Max 1550
The Example Application for NumPy
Intel will demonstrate how to use Intel VTune Profiler and its Intel Instrumentation and Tracing Technology (ITT) API to optimise a NumPy application step-by-step. The pairwise distance application, a well-liked approach in fields including biology, high performance computing (HPC), machine learning, and geographic data analytics, will be used in this article.
Summary
The three stages of optimisation that we will discuss in this post are summarised as follows:
Step 1: Examining the Intel Optimised Numpy Pairwise Distance Implementation: Here, we’ll attempt to comprehend the obstacles affecting the NumPy implementation’s performance.
Step 2: Profiling Data Parallel Extension for Pairwise Distance NumPy Implementation: We intend to examine the implementation and see whether there is a performance disparity.
Step 3: Profiling Data Parallel Extension for Pairwise Distance Implementation on Numba GPU: Analysing the numba-dpex implementation’s GPU performance
Boost Your Python NumPy Application
Intel has shown how to quickly discover compute and memory bottlenecks in a Python application using Intel VTune Profiler.
Intel VTune Profiler aids in identifying bottlenecks’ root causes and strategies for enhancing application performance.
It can assist in mapping the main bottleneck jobs to the source code/assembly level and displaying the related CPU/GPU time.
Even more comprehensive, developer-friendly profiling results can be obtained by using the Instrumentation and Tracing API (ITT APIs).
Read more on govindhtech.com
2 notes · View notes
govindhtech · 2 months ago
Text
Intel Data Direct I/O Performance With Intel VTune Profiler
Tumblr media
Improve Intel Data Direct I/O (DDIO) Workload Performance with Intel VTune Profiler.
Profile uncore hardware performance events in Intel Xeon processors with oneAPI
One hardware feature included in Intel Xeon CPUs is Intel Data Direct I/O (DDIO) technology. By making the CPU cache the primary point of entry and exit for I/O data going into and out of the Intel Ethernet controllers and adapters, it contributes to advances in I/O performance.
To monitor the effectiveness of DDIO and Intel Virtualization Technology (Intel VT) for Directed I/O (Intel VT-d), which permits the independent execution of several operating systems and applications, it is essential to monitor uncore events, or events that take place outside the CPU core. By analyzing uncore hardware events, you may improve the performance of Intel Data Direct I/O (DDIO) workloads using Intel VTune Profiler, a performance analysis and debugging tool driven by the oneAPI.
We’ll talk about using VTune Profiler to evaluate and enhance directed I/O performance in this blog. Let’s take a quick look at Intel Data Direct I/O technology before we go into the profiling approach.
Overview of the Intel Data Direct I/O (DDIO) Technology
Intel Integrated I/O technology Intel DDIO was launched in 2012 for the Intel Xeon processor E5 and E7 v2 generations. It aims to increase system-level I/O performance by employing a new processor-to-I/O data flow.
I/O operations were sluggish and processor cache was a scarce resource prior to the development of Data Direct I/O technology. It was necessary for the host processor’s main memory to store and retrieve any incoming or departing data from an Ethernet controller or adapter, respectively. It used to be necessary to move the data from main memory to the cache before working with it.
This led to a lot of read and write operations in the memory. This also caused some additional, speculative read operations from the I/O hub in some of the older designs. Excessive memory accesses often lead to higher system power consumption and deterioration of I/O performance.
Intel DDIO technology was created to rearrange the flow of I/O data by making the processor cache the primary source and destination of I/O data instead of the main memory, as the processor cache is no longer a restricted resource.
Depending on the kind of workload at the workstation or on the server, the DDIO approach offers benefits like:
Higher transaction rates, reduced battery usage, reduced latency, increased bandwidth, and more.
There is no industry enablement needed for the Data Direct I/O technology.
It doesn’t rely on any hardware, and it doesn’t need any modifications to your operating system, drivers, or software.
Boost DDIO Performance Using Intel VTune Profiler
A function carried out in a CPU’s uncore section, outside of the processor core itself, that yet affects processor performance as a whole is referred to as an uncore event. For instance, these occurrences may be connected to the Intel Ultra Path Interconnect (UPI) block, memory controller, or I/O stack action.
A new recipe in the VTune Profiler Cookbook explains how to count these kinds of uncore hardware events using the tool’s input and output analysis function. You may analyze Data Direct I/O and VT-d efficiency by using the data to better understand the traffic and behavior of the Peripheral Component Interconnect Express (PCIe).
The recipe explains how to do input and output analysis, evaluate the findings, and classify the resulting I/O metrics. In essence, VTune Profiler v2023.2 or later and an Intel Xeon scalable CPU of the first or later generation are needed. Although the approach is suitable to the most recent version of Intel Xeon Processors, the I/O metrics and events covered in the recipe are based on the third generation Intel Xeon Scalable Processor.
Perform I/O Analysis with VTune Profiler
Start by analyzing your application’s input and output using VTune Profiler. With the analysis function, you may examine CPU, bus, and I/O subsystem use using a variety of platform-level metrics. You may get data indicating the Intel Data Direct I/O(DDIO) use efficiency by turning on the PCIe traffic analysis option.
Analyze the I/O Metrics
VTune Profiler Web Server or VTune Profiler GUI may be used to examine the report that is produced as a consequence of the input and output analysis. Using the VTune Profiler Web Server Interface, the recipe illustrates the examination of many I/O performance indicators, including:
Platform diagram use of the physical core, DRAM, PCIe, and Intel UPI linkages.
PCIe Traffic Summary, which includes metrics for both outgoing (caused by the CPU) and incoming (caused by I/O devices) PCIe traffic.
These measurements aid in the computation of CPU/IO conflicts, latency for incoming read/write requests, PCIe bandwidth and efficient use, and other factors.
Metrics to assess the workload’s effectiveness in re-mapping incoming I/O device memory locations to various host addresses using Intel VT-d technology.
Usage of DRAM and UPI bandwidth.
Read more on Govindhtech.com
0 notes
govindhtech · 2 months ago
Text
Intel VTune Profiler: Improve AI/ML Application Performance
Tumblr media
Improve AI/ML Application Performance with Intel VTune Profiler.
Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.
Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.
It is challenging to locate and examine performance bottlenecks in the underlying source code because to the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various phases of model construction.
These approaches, however, are only applicable to software functionality. This problem is addressed by the Intel VTune Profiler, which is driven by the oneAPI and offers comprehensive insights into hardware-level memory and compute bottlenecks. By doing this, performance problems are resolved and AI applications’ performance is optimized and scaled across hardware systems with different computational envelopes.
The scope of optimization for AI/ML workloads may be expanded by using Intel VTune Profiler to profile data in concurrent Python and OpenVINO programs, as you will discover in this article.
Use Intel VTune Profiler to Boost Python Application Performance
VTune Profiler may assist in profiling a Python program, as shown in a recently published recipe in the Intel VTune Profiler Cookbook.
The following are the recipe’s fundamental software requirements:
Data Parallel Extensions for Python
Vtune Profiler (version 2022 or later)
Intel Distribution for Python
Compiler for Intel OneAPI DPC++/C++
The NumPy implementation covered in the recipe divides the calculations into logical jobs using the Intel Instrumentation and Tracing Technology (ITT) APIs and performs distance computations using the Intel oneAPI Math Kernel Library (oneMKL) routines. You may then determine which areas of the code need attention for necessary changes to get additional performance by using the VTune Profiler tool to examine the execution time and memory consumption of each logical job.
Details on the most CPU-intensive code segments are included in the output analysis report when Hotspots analysis is performed on the NumPy implementation. Additionally, it offers recommendations for investigating the profiler tool’s other performance analysis features, such Threading analysis for enhanced parallelism and Microarchitecture Exploration analysis for effective use of the underlying hardware.
Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks
According to the Hotspots analysis report, NumPy operations and underlying oneMKL functions account for a significant amount of the execution time in the simple NumPy implementation of the pairwise distance computation example. By making little code modifications, NumPy may be swapped out for the Data Parallel Extension for NumPy, which will eliminate these bottlenecks. To evaluate the speed gains over the simple NumPy code and find any areas that might need further optimization, run the Hotspots analysis once again.
Additionally, the VTune Profiler makes recommendations such as using the Data Parallel Extension for Numba with your platform’s GPU to bring offload accelerator parallelism to the application. The Numba JIT compiler for NumPy operations has an open-source extension called Numba. It offers Python kernel programming APIs that resemble SYCL. The GPU Compute/Media Hotspots analysis preview function of VTune Profiler may then be used to examine the Numba implementation’s execution on a GPU.
Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance
Using the VTune Profiler to profile OpenVINO-based AI applications is covered in another new recipe in the VTune Profiler cookbook. It discusses how to use the profiler tool to analyze performance bottlenecks in the CPU, GPU, and Neural Processing Unit (NPU).
If your OpenVINO application makes use of the Intel oneAPI Data Analytics Library (oneDAL) and/or the Intel oneAPI Deep Neural Network (oneDNN) Intel Distribution for Python Intel VTune Profiler (v2024.1 or later), you can access the Benchmark Tool application as part of the OpenVINO Toolkit Intel oneAPI Base Toolkit.
The recipe offers detailed steps for configuring OpenVINO with the ITT APIs for performance analysis, building the OpenVINO source, and setting it up. It profiles the AI application and analyzes performance and latency using a reference benchmark application.
Depending on the computational architecture, you may use the VTune Profiler‘s numerous performance analysis features to find hotspots and look at how much hardware is being utilized by specific code sections.
For example,
To examine CPU bottlenecks that is, the sections of code that take up the most CPU execution time use the Hotspots Analysis tool.
Use the GPU Compute/Media Hotspots Analysis preview function to profile GPU hotspots. Examine inefficient kernel methods, examine the frequency of GPU instructions for various kinds of instructions, and more to get an understanding of GPU use.
The AI PCs’ Neural Processing Units (NPUs) are made especially to boost AI/ML applications’ performance. With the Intel Distribution of OpenVINO Toolkit, you may transfer compute-intensive AI/ML tasks to Intel NPUs. You may examine the NPU performance using a number of hardware measures, including workload size, execution time, sampling interval, and more, with the use of the VTune Profiler‘s NPU Exploration Analysis preview function.
Intel VTune Profiler Download
Use one of the following methods to install Intel VTune Profiler on your computer:
Get the standalone bundle here.
As part of the Intel oneAPI Base Toolkit, download Intel VTune Profiler.
Know the Process
In the standard software performance analysis process,Image Credit To Intel
Use one of the following methods to launch Intel VTune Profiler:
Using Microsoft Visual Studio’s GUI From the Command Line
Configure parameters and choose a profiling analysis for your application.
Create a profile of the target system (remote collection) or application on the host (local collection).
View the host system’s findings.
Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.
FAQs
What does Intel VTune do?
Examine CPU usage, OpenMP efficiency, memory access, and vectorization to assess application performance. Measure metrics to find memory access concerns.
Read more on govindhtech.com
1 note · View note
govindhtech · 4 months ago
Text
SynxFlow Project: A Smooth Migration From CUDA To SYCL
Tumblr media
The SynxFlow Project
SynxFlow, an open-source GPU-based hydrodynamic flood modeling software, in CUDA, C++, and Python Data pre-processing and visualization are done in Python while simulations are executed on CUDA. SynxFlow can simulate floods quicker than real-time with hundreds of millions of computational cells and metre-level precision on many GPUs. An open-source software with a simple Python interface, it may be linked into data science workflows for disaster risk assessments. The model has been widely utilized in research and industry, such as to assist flood early warning systems and generate flood maps for (re)insurance firms.
SynxFlow can simulate flooding, landslide runout, and debris flow. Simulations are crucial to emergency service planning and management. A comprehensive prediction of natural disasters can reduce their social and economic costs. In addition to risk assessment and disaster preparedness, SynxFlow flood simulation can help with urban planning, environmental protection, climate change adaptation, insurance and financial planning, infrastructure design and engineering, public awareness, and education.
- Advertisement -
Issue Statement
Several variables make probabilistic flood forecasting computationally difficult:
Large dataset storage, retrieval, and management
Complex real-time data processing requires high-performance computation.
Model calibration and validation needed as real-world conditions change.
Effective integration and data transfer between hydrological, hydraulic, and meteorological models, and more.
For speedier results, a flood forecasting system must process data in parallel and offload compute-intensive operations to hardware accelerators. Thus, the SynxFlow team must use larger supercomputers to increase flood simulation scale and cut simulation time. DAWN, the UK’s newest supercomputer, employs Intel GPUs, which SynxFlow didn’t support.
These issues offered researchers a new goal to make the SynxFlow model performance-portable and scalable on supercomputers with multi-vendor GPUs. They must transition SynxFlow code from CUDA to a cross-vendor programming language in weeks, not years.
Solution Powered by oneAPI
After considering several possibilities, the SynxFlow project team chose the Intel oneAPI Base Toolkit implementation of the Unified Acceleration Foundation-backed oneAPI protocol. All are built on multiarchitecture, multi-vendor SYCL framework. It supports Intel, NVIDIA, and AMD GPUs and includes the Intel DPC++ Compatibility Tool for automated CUDA-to-SYCL code translation.
- Advertisement -
SynxFlow code migration went smoothly. This produced code that automatically translated most CUDA kernels and API calls into SYCL. After auto-translation, some mistakes were found during compilation, but the migration tool’s error-diagnostic indications and warnings made them easy to rectify. It took longer to switch from NVIDIA Collective Communications Library (NCCL)-based inter-GPU communication to GPU-direct enabled Intel MPI library calls because this could not be automated.
To summarize, there has been a promising attempt to transfer a complicated flood simulation code that was built on CUDA to SYCL, achieving both scalability and performance-portability. The conversion has been easy to handle and seamless thanks to the Intel oneAPI Base Toolkit.
Intel hosted a oneAPI Hackfest at the DiRAC HPC Research Facility
DiRAC
The High Performance Super Computer facility in the United Kingdom serving the theoretical communities of Particle Physics, Astrophysics, Cosmology, Solar System and Planetary Science, and Nuclear Physics.
DiRAC’s three HPC services Extreme Scaling, Memory-Intensive, and Data-Intensive are each designed to support the distinct kinds of computational workflows required to carry out their science program. DiRAC places a strong emphasis on innovation, and all of its services are co-designed with vendor partners, technical and software engineering teams, and research community.
Training Series on oneAPI at DiRAC Hackfest
On May 21–23, 2024, the DiRAC community hosted three half-day remote training sessions on the Intel oneAPI Base Toolkit. The training series was designed for developers and/or researchers with varying degrees of experience, ranging from novices to experts.
The cross-platform compatible SYCL programming framework served as the foundation for a variety of concepts that were taught to the attendees. The students were also introduced to a number of Base Kit component tools and libraries that facilitate SYCL. For instance, the Intel DPC++ Compatibility Tool facilitates automated code migration from CUDA to C++ with SYCL; the Intel oneAPI Math Kernel Library (oneMKL) optimizes math operations; the Intel oneAPI Deep Neural Networks (oneDNN) accelerates hackfest and the Intel oneAPI DPC++ Library (oneDPL) expedites SYCL kernels on a variety of hardware. Additionally, the training sessions covered code profiling and the use of Intel Advisor and Intel VTune Profiler, two tools included in the Base Kit for analyzing performance bottlenecks.
DiRAC Hackfest’s oneAPI Hackath on
In order to complete a range of tasks, including parallelizing Fortran code on Intel GPUs, accelerating math operations like the Fast Fourier Transform (FFT) using oneMKL’s SYCL API, and resolving performance bottlenecks with the aid of Intel Advisor and Intel VTune Profiler, the participants improvised their cutting-edge projects using oneAPI tools and libraries.
The participants reported that it was easy to adjust to using oneAPI components and that the code migration process went smoothly. The teams saw a noticeable increase in workload performance with libraries like Intel MPI. Approximately 70% of the teams who took part indicated that they would be open to using oneAPI technologies to further optimize the code for their research projects. Thirty percent of the teams benchmarked their outcomes using SYCL and oneAPI, and they achieved a 100% success rate in code conversion to SYCL.
Start Programming Multiarchitecture Using SYCL and oneAPI
Investigate the SYCL framework and oneAPI toolkits now for multiarchitecture development that is accelerated! Use oneAPI to enable cross-platform parallelism in your apps and move your workloads to SYCL for high-performance heterogeneous computing.
Intel invite you to review the real-world code migration application samples found in the CUDA to SYCL catalog. Investigate the AI, HPC, and rendering solutions available in Intel’s software portfolio driven by oneAPI.
Read more on govindhtech.com
0 notes