govindhtech · 2 months ago

Intel Data Direct I/O Performance With Intel VTune Profiler

Improve Intel Data Direct I/O (DDIO) Workload Performance with Intel VTune Profiler.

Profile uncore hardware performance events in Intel Xeon processors with oneAPI

Intel Data Direct I/O (DDIO) technology is a hardware feature built into Intel Xeon processors. It improves I/O performance by making the processor cache, rather than main memory, the primary source and destination for data moving to and from Intel Ethernet controllers and adapters.

Monitoring uncore events, that is, events that occur outside the CPU cores, is essential for gauging the efficiency of DDIO and of Intel Virtualization Technology (Intel VT) for Directed I/O (Intel VT-d), which lets several operating systems and applications run independently. Intel VTune Profiler, a performance analysis and debugging tool powered by oneAPI, lets you analyze these uncore hardware events and improve the performance of DDIO workloads.

This blog discusses how to use VTune Profiler to evaluate and improve directed I/O performance. Before getting into the profiling approach, let’s take a quick look at Intel Data Direct I/O technology.

Overview of the Intel Data Direct I/O (DDIO) Technology

Intel DDIO, part of Intel Integrated I/O technology, was introduced in 2012 with the Intel Xeon processor E5 and E7 v2 generations. It aims to increase system-level I/O performance by employing a new processor-to-I/O data flow.

Before Data Direct I/O technology, I/O operations were slow and processor cache was a scarce resource. Any data arriving from or departing to an Ethernet controller or adapter had to be stored in, or fetched from, the host processor’s main memory. Incoming data then had to be moved from main memory into the cache before the processor could work on it.

This led to a lot of read and write operations in the memory. This also caused some additional, speculative read operations from the I/O hub in some of the older designs. Excessive memory accesses often lead to higher system power consumption and deterioration of I/O performance.

Intel DDIO technology was created to reroute the flow of I/O data, making the processor cache rather than main memory its primary source and destination, since the processor cache is no longer a scarce resource.

Depending on the kind of workload running on the workstation or server, DDIO offers benefits such as:

Higher transaction rates, lower power consumption, lower latency, higher bandwidth, and more.

Data Direct I/O technology requires no industry enablement.

It has no additional hardware dependencies, and it requires no changes to your operating system, drivers, or software.

Boost DDIO Performance Using Intel VTune Profiler

An uncore event is an event generated in the uncore portion of a CPU, outside the processor cores themselves, that nevertheless affects overall processor performance. For instance, such events may be associated with the Intel Ultra Path Interconnect (UPI) block, the memory controller, or the I/O stack.

A new recipe in the VTune Profiler Cookbook explains how to count these uncore hardware events using the tool’s Input and Output analysis. The resulting data helps you understand Peripheral Component Interconnect Express (PCIe) traffic and behavior, and so analyze DDIO and VT-d efficiency.

The recipe explains how to run the Input and Output analysis, interpret the results, and categorize the resulting I/O metrics. It requires VTune Profiler v2023.2 or later and a first-generation or later Intel Xeon Scalable processor. Although the approach applies to the latest Intel Xeon processors, the I/O metrics and events covered in the recipe are based on the third-generation Intel Xeon Scalable processor.

Perform I/O Analysis with VTune Profiler

Start by running the Input and Output analysis on your application with VTune Profiler. The analysis uses a variety of platform-level metrics to examine CPU, bus, and I/O subsystem utilization. Enabling the PCIe traffic analysis option yields data on how efficiently Intel Data Direct I/O (DDIO) is used.
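As a rough illustration of the step above, the following Python sketch assembles a command line that starts an Input and Output collection through the `vtune` command-line tool. The application path, its arguments, and the result directory name are placeholders, not values from the recipe. The PCIe traffic option is deliberately left out because its exact knob name should be checked for your VTune Profiler version (for example with `vtune -help collect io`).

```python
import shlex


def io_collect_command(app_argv, result_dir="r_io"):
    """Build a `vtune -collect io` command line for the given application.

    app_argv   -- the application and its arguments (placeholder values here)
    result_dir -- directory where VTune Profiler stores the result (assumed name)
    """
    if not app_argv:
        raise ValueError("app_argv must name the application to profile")
    # `--` separates VTune Profiler options from the profiled application's arguments.
    return ["vtune", "-collect", "io", "-result-dir", result_dir, "--", *app_argv]


# Hypothetical application name and arguments, used only for illustration.
cmd = io_collect_command(["./my_io_app", "--port", "0"])
print(shlex.join(cmd))
```

The function only builds and prints the command, so it can be reviewed before anything is launched. Passing the list to `subprocess.run` would start the collection on a machine where VTune Profiler is installed.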

Analyze the I/O Metrics

You can examine the report produced by the Input and Output analysis in either the VTune Profiler GUI or VTune Profiler Web Server. Using the VTune Profiler Web Server interface, the recipe walks through several I/O performance metrics, including:

Platform diagram showing utilization of the physical cores, DRAM, PCIe, and Intel UPI links.

PCIe Traffic Summary, which includes metrics for both outbound (CPU-initiated) and inbound (I/O device-initiated) PCIe traffic. These metrics help you calculate PCIe bandwidth and utilization efficiency, latency of inbound read/write requests, CPU/I/O conflicts, and more.

Metrics that show how efficiently the workload uses Intel VT-d technology to remap incoming I/O device memory addresses to different host addresses.

DRAM and UPI bandwidth utilization.


govindhtech · 3 months ago

Intel VTune Profiler: Improve AI/ML Application Performance

Improve AI/ML Application Performance with Intel VTune Profiler.

Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.

Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.

It is challenging to locate and analyze performance bottlenecks in the underlying source code because of the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various stages of model development.

These approaches, however, are limited to the software level. Intel VTune Profiler, powered by oneAPI, addresses this limitation by offering comprehensive insights into hardware-level memory and compute bottlenecks. This helps you resolve performance problems and optimize and scale AI applications across hardware systems with different computational envelopes.

As you will discover in this article, using Intel VTune Profiler to profile Data Parallel Python and OpenVINO programs can expand the scope of optimization for AI/ML workloads.

Use Intel VTune Profiler to Boost Python Application Performance

A recently published recipe in the Intel VTune Profiler Cookbook shows how VTune Profiler can help profile a Python program.

The following are the recipe’s fundamental software requirements:

Data Parallel Extensions for Python

VTune Profiler (version 2022 or later)

Intel Distribution for Python

Intel oneAPI DPC++/C++ Compiler

The NumPy implementation covered in the recipe performs distance computations using Intel oneAPI Math Kernel Library (oneMKL) routines and divides the calculations into logical tasks using the Intel Instrumentation and Tracing Technology (ITT) APIs. You can then use VTune Profiler to examine the execution time and memory consumption of each logical task and determine which parts of the code need changes to gain additional performance.

When Hotspots analysis is run on the NumPy implementation, the resulting report details the most CPU-intensive code segments. It also recommends other performance analysis features of the profiler to explore, such as Threading analysis for improved parallelism and Microarchitecture Exploration analysis for efficient use of the underlying hardware.
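To make the idea of logical tasks concrete, here is a minimal sketch in plain Python. It is not the recipe's code: the `task` context manager is a hypothetical stand-in that only records wall-clock time, marking the places where ITT task begin and end calls would go in an instrumented program, and `math.dist` replaces the NumPy and oneMKL routines so the example has no dependencies.

```python
import math
import time
from contextlib import contextmanager

timings = {}  # task name -> accumulated wall-clock seconds


@contextmanager
def task(name):
    """Hypothetical stand-in for an ITT task region; it only measures wall time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start


def pairwise_distance(points):
    """Return the full matrix of Euclidean distances between all points."""
    with task("compute_distances"):
        return [[math.dist(p, q) for q in points] for p in points]


# Small illustrative data set (assumed values, not taken from the recipe).
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
with task("whole_run"):
    result = pairwise_distance(points)

for name in sorted(timings):
    print(f"{name}: {timings[name]:.6f} s")
```

Naming each region lets the per-task cost be read separately from the total, which is the same reason the recipe splits the computation into logical tasks before profiling it.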

Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks

According to the Hotspots analysis report, NumPy operations and the underlying oneMKL functions account for a significant share of the execution time in the simple NumPy implementation of the pairwise distance computation example. With minor code changes, you can swap NumPy for the Data Parallel Extension for NumPy to address these bottlenecks. Run the Hotspots analysis again to evaluate the speed gains over the simple NumPy code and to find any areas that might need further optimization.

VTune Profiler also makes recommendations, such as using the Data Parallel Extension for Numba with your platform’s GPU to add accelerator offload parallelism to the application. The Data Parallel Extension for Numba is an open-source extension of the Numba JIT compiler for NumPy operations. It offers Python kernel programming APIs that resemble SYCL. You can then use the GPU Compute/Media Hotspots analysis preview feature of VTune Profiler to examine how the Numba implementation executes on a GPU.

Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance

Using the VTune Profiler to profile OpenVINO-based AI applications is covered in another new recipe in the VTune Profiler cookbook. It discusses how to use the profiler tool to analyze performance bottlenecks in the CPU, GPU, and Neural Processing Unit (NPU).

The recipe uses the following software:

The Benchmark Tool application, available as part of the OpenVINO Toolkit

Intel oneAPI Base Toolkit

Intel VTune Profiler (v2024.1 or later)

Intel Distribution for Python

Intel oneAPI Data Analytics Library (oneDAL) and/or Intel oneAPI Deep Neural Network Library (oneDNN), if your OpenVINO application uses them

The recipe offers detailed steps for building OpenVINO from source, setting it up, and configuring it with the ITT APIs for performance analysis. It then profiles the AI application and analyzes performance and latency using a reference benchmark application.

Depending on the computational architecture, you can use VTune Profiler’s various performance analysis features to find hotspots and see how much of the hardware specific code sections use.

For example,

Use the Hotspots analysis to examine CPU bottlenecks, that is, the sections of code that take up the most CPU execution time.

Use the GPU Compute/Media Hotspots analysis preview feature to profile GPU hotspots. To understand GPU utilization, look for inefficient kernel algorithms, examine how often different types of GPU instructions are executed, and more.

The Neural Processing Units (NPUs) in AI PCs are designed specifically to boost the performance of AI/ML applications. With the Intel Distribution of OpenVINO Toolkit, you can offload compute-intensive AI/ML tasks to Intel NPUs. The NPU Exploration analysis preview feature of VTune Profiler lets you examine NPU performance using a number of hardware metrics, including workload size, execution time, sampling interval, and more.

Intel VTune Profiler Download

Use one of the following methods to install Intel VTune Profiler on your computer:

Download the standalone package.

Download Intel VTune Profiler as part of the Intel oneAPI Base Toolkit.

Know the Process

The standard software performance analysis process follows these steps (image credit: Intel):

Use one of the following methods to launch Intel VTune Profiler: Microsoft Visual Studio’s GUI or the command line.

Configure parameters and choose a profiling analysis for your application.

Profile the application on the host (local collection) or on a target system (remote collection).

View the results on the host system.

Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.
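From the command line, one pass through the process above could look like the following sketch. It pairs a collection command with a summary report command for the same result directory. The application name `app.py` and the result directory `r001` are placeholders, and the script runs in dry-run mode by default so that it only prints what it would execute.

```python
import shlex
import subprocess


def workflow_commands(app_argv, analysis="hotspots", result_dir="r001"):
    """Return the collect and report command lines for one profiling pass."""
    collect = ["vtune", "-collect", analysis, "-result-dir", result_dir, "--", *app_argv]
    report = ["vtune", "-report", "summary", "-result-dir", result_dir]
    return [collect, report]


def run_workflow(app_argv, dry_run=True, **kwargs):
    """Print each command; execute it only when dry_run is False."""
    for cmd in workflow_commands(app_argv, **kwargs):
        print(shlex.join(cmd))
        if not dry_run:
            # Requires VTune Profiler on PATH; raises if a command fails.
            subprocess.run(cmd, check=True)


# Hypothetical target application, used only for illustration.
run_workflow(["python", "app.py"])
```

Using a fresh `result_dir` for each pass keeps the results of every optimization round side by side, which makes the before and after comparison in the last step easier.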

FAQs

What does Intel VTune do?

Intel VTune Profiler assesses application performance by examining CPU usage, OpenMP efficiency, memory access, and vectorization. It also measures metrics that help identify memory access issues.