Tumgik
#onemkl
govindhtech · 2 months
Text
Intel VTune Profiler For Data Parallel Python Applications
Tumblr media
Intel VTune Profiler tutorial
This brief tutorial will show you how to use Intel VTune Profiler to profile the performance of a Python application using the NumPy and Numba example applications.
Analysing Performance in Applications and Systems
For HPC, cloud, IoT, media, storage, and other applications, Intel VTune Profiler optimises system performance, application performance, and system configuration.
Optimise the performance of the entire application not just the accelerated part using the CPU, GPU, and FPGA.
Profile SYCL, C, C++, C#, Fortran, OpenCL code, Python, Google Go, Java,.NET, Assembly, or any combination of languages can be multilingual.
Application or System: Obtain detailed results mapped to source code or coarse-grained system data for a longer time period.
Power: Maximise efficiency without resorting to thermal or power-related throttling.
VTune platform profiler
It has following Features.
Optimisation of Algorithms
Find your code’s “hot spots,” or the sections that take the longest.
Use Flame Graph to see hot code routes and the amount of time spent in each function and with its callees.
Bottlenecks in Microarchitecture and Memory
Use microarchitecture exploration analysis to pinpoint the major hardware problems affecting your application’s performance.
Identify memory-access-related concerns, such as cache misses and difficulty with high bandwidth.
Inductors and XPUs
Improve data transfers and GPU offload schema for SYCL, OpenCL, Microsoft DirectX, or OpenMP offload code. Determine which GPU kernels take the longest to optimise further.
Examine GPU-bound programs for inefficient kernel algorithms or microarchitectural restrictions that may be causing performance problems.
Examine FPGA utilisation and the interactions between CPU and FPGA.
Technical summary: Determine the most time-consuming operations that are executing on the neural processing unit (NPU) and learn how much data is exchanged between the NPU and DDR memory.
In parallelism
Check the threading efficiency of the code. Determine which threading problems are affecting performance.
Examine compute-intensive or throughput HPC programs to determine how well they utilise memory, vectorisation, and the CPU.
Interface and Platform
Find the points in I/O-intensive applications where performance is stalled. Examine the hardware’s ability to handle I/O traffic produced by integrated accelerators or external PCIe devices.
Use System Overview to get a detailed overview of short-term workloads.
Multiple Nodes
Describe the performance characteristics of workloads involving OpenMP and large-scale message passing interfaces (MPI).
Determine any scalability problems and receive suggestions for a thorough investigation.
Intel VTune Profiler
To improve Python performance while using Intel systems, install and utilise the Intel Distribution for Python and Data Parallel Extensions for Python with your applications.
Configure your Python-using VTune Profiler setup.
To find performance issues and areas for improvement, profile three distinct Python application implementations. The pairwise distance calculation algorithm commonly used in machine learning and data analytics will be demonstrated in this article using the NumPy example.
The following packages are used by the three distinct implementations.
Numpy Optimised for Intel
NumPy’s Data Parallel Extension
Extensions for Numba on GPU with Data Parallelism
Python’s NumPy and Data Parallel Extension
By providing optimised heterogeneous computing, Intel Distribution for Python and Intel Data Parallel Extension for Python offer a fantastic and straightforward approach to develop high-performance machine learning (ML) and scientific applications.
Added to the Python Intel Distribution is:
Scalability on PCs, powerful servers, and laptops utilising every CPU core available.
Assistance with the most recent Intel CPU instruction sets.
Accelerating core numerical and machine learning packages with libraries such as the Intel oneAPI Math Kernel Library (oneMKL) and Intel oneAPI Data Analytics Library (oneDAL) allows for near-native performance.
Tools for optimising Python code into instructions with more productivity.
Important Python bindings to help your Python project integrate Intel native tools more easily.
Three core packages make up the Data Parallel Extensions for Python:
The NumPy Data Parallel Extensions (dpnp)
Data Parallel Extensions for Numba, aka numba_dpex
Tensor data structure support, device selection, data allocation on devices, and user-defined data parallel extensions for Python are all provided by the dpctl (Data Parallel Control library).
It is best to obtain insights with comprehensive source code level analysis into compute and memory bottlenecks in order to promptly identify and resolve unanticipated performance difficulties in Machine Learning (ML),  Artificial Intelligence ( AI), and other scientific workloads. This may be done with Python-based ML and AI programs as well as C/C++ code using Intel VTune Profiler. The methods for profiling these kinds of Python apps are the main topic of this paper.
Using highly optimised Intel Optimised Numpy and Data Parallel Extension for Python libraries, developers can replace the source lines causing performance loss with the help of Intel VTune Profiler, a sophisticated tool.
Setting up and Installing
1. Install Intel Distribution for Python
2. Create a Python Virtual Environment
   python -m venv pyenv
   pyenv\Scripts\activate
3. Install Python packages
   pip install numpy
   pip install dpnp
   pip install numba
   pip install numba-dpex
   pip install pyitt
Make Use of Reference Configuration
The hardware and software components used for the reference example code we use are:
Software Components:
dpnp 0.14.0+189.gfcddad2474
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mkl-umath 0.1.1
numba 0.59.0
numba-dpex 0.21.4
numpy 1.26.4
pyitt 1.1.0
Operating System:
Linux, Ubuntu 22.04.3 LTS
CPU:
Intel Xeon Platinum 8480+
GPU:
Intel Data Center GPU Max 1550
The Example Application for NumPy
Intel will demonstrate how to use Intel VTune Profiler and its Intel Instrumentation and Tracing Technology (ITT) API to optimise a NumPy application step-by-step. The pairwise distance application, a well-liked approach in fields including biology, high performance computing (HPC), machine learning, and geographic data analytics, will be used in this article.
Summary
The three stages of optimisation that we will discuss in this post are summarised as follows:
Step 1: Examining the Intel Optimised Numpy Pairwise Distance Implementation: Here, we’ll attempt to comprehend the obstacles affecting the NumPy implementation’s performance.
Step 2: Profiling Data Parallel Extension for Pairwise Distance NumPy Implementation: We intend to examine the implementation and see whether there is a performance disparity.
Step 3: Profiling Data Parallel Extension for Pairwise Distance Implementation on Numba GPU: Analysing the numba-dpex implementation’s GPU performance
Boost Your Python NumPy Application
Intel has shown how to quickly discover compute and memory bottlenecks in a Python application using Intel VTune Profiler.
Intel VTune Profiler aids in identifying bottlenecks’ root causes and strategies for enhancing application performance.
It can assist in mapping the main bottleneck jobs to the source code/assembly level and displaying the related CPU/GPU time.
Even more comprehensive, developer-friendly profiling results can be obtained by using the Instrumentation and Tracing API (ITT APIs).
Read more on govindhtech.com
2 notes · View notes
govindhtech · 18 days
Text
Guide To Python NumPy and SciPy In Multithreading In Python
Tumblr media
An Easy Guide to Multithreading in Python
Python is a strong language, particularly for developing AI and machine learning applications. However, CPython, the programming language’s original, reference implementation and byte-code interpreter, lacks multithreading functionality; multithreading and parallel processing need to be enabled from the kernel. Some of the desired multi-core processing is made possible by libraries Python NumPy and SciPy such as NumPy, SciPy, and PyTorch, which use C-based implementations. However, there is a problem known as the Global Interpreter Lock (GIL), which literally “locks” the CPython interpreter to only working on one thread at a time, regardless of whether the interpreter is in a single or multi-threaded environment.
Let’s take a different approach to Python.
The robust libraries and tools that support Intel Distribution of Python, a collection of high-performance packages that optimize underlying instruction sets for Intel architectures, are designed to do this.
For compute-intensive, core Python numerical and scientific packages like NumPy, SciPy, and Numba, the Intel distribution helps developers achieve performance levels that are comparable to those of a C++ program by accelerating math and threading operations using oneAPI libraries while maintaining low Python overheads. This enables fast scaling over a cluster and assists developers in providing highly efficient multithreading, vectorization, and memory management for their applications.
Let’s examine Intel’s strategy for enhancing Python parallelism and composability in more detail, as well as how it might speed up your AI/ML workflows.
Parallelism in Nests: Python NumPy and SciPy
Python libraries called Python NumPy and SciPy were created especially for scientific computing and numerical processing, respectively.
Exposing parallelism on all conceivable levels of a program for example, by parallelizing the outermost loops or by utilizing various functional or pipeline sorts of parallelism on the application level is one workaround to enable multithreading/parallelism in Python scripts. This parallelism can be accomplished with the use of libraries like Dask, Joblib, and the included multiprocessing module mproc (with its ThreadPool class).
Data-parallelism can be performed with Python modules like Python NumPy and SciPy, which can then be accelerated with an efficient math library like the Intel oneAPI Math Kernel Library (oneMKL). This is because massive data processing requires a lot of processing. Using various threading runtimes, oneMKL is multi-threaded. An environment variable called MKL_THREADING_LAYER can be used to adjust the threading layer.
As a result, a code structure known as nested parallelism is created, in which a parallel section calls a function that in turn calls another parallel region. Since serial sections that is, regions that cannot execute in parallel and synchronization latencies are typically inevitable in Python NumPy and SciPy based systems, this parallelism-within-parallelism is an effective technique to minimize or hide them.
Going One Step Further: Numba
Despite offering extensive mathematical and data-focused accelerations through C-extensions, Python NumPy and SciPy remain a fixed set of mathematical tools accelerated through C-extensions. If non-standard math is required, a developer should not expect it to operate at the same speed as C-extensions. Here’s where Numba can work really well.
OneTBB
Based on LLVM, Numba functions as a “Just-In-Time” (JIT) compiler. It aims to reduce the performance difference between Python and compiled, statically typed languages such as C and C++. Additionally, it supports a variety of threading runtimes, including workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB). To match these three runtimes, there are three integrated threading layers. The only threading layer installed by default is workqueue; however, other threading layers can be added with ease using conda commands (e.g., $ conda install tbb).
The environment variable NUMBA_THREADING_LAYER can be used to set the threading layer. It is vital to know that there are two ways to choose this threading layer: either choose a layer that is generally safe under different types of parallel processing, or specify the desired threading layer name (e.g., tbb) explicitly.
Composability of Threading
The efficiency or efficacy of co-existing multi-threaded components depends on an application’s or component’s threading composability. A component that is “perfectly composable” would operate without compromising the effectiveness of other components in the system or its own efficiency.
In order to achieve a completely composable threading system, care must be taken to prevent over-subscription, which means making sure that no parallel region of code or component can require a certain number of threads to run (this is known as “mandatory” parallelism).
An alternative would be to implement a type of “optional” parallelism in which a work scheduler determines at the user level which thread(s) the components should be mapped to while automating the coordination of tasks among components and parallel regions. Naturally, the efficiency of the scheduler’s threading model must be better than the high-performance libraries’ integrated scheme since it is sharing a single thread-pool to arrange the program’s components and libraries around. The efficiency is lost otherwise.
Intel’s Strategy for Parallelism and Composability
Threading composability is more readily attained when oneTBB is used as the work scheduler. OneTBB is an open-source, cross-platform C++ library that was created with threading composability and optional/nested parallelism in mind. It allows for multi-core parallel processing.
An experimental module that enables threading composability across several libraries unlocks the potential for multi-threaded speed benefits in Python and was included in the oneTBB version released at the time of writing. As was previously mentioned, the scheduler’s improved threads allocation is what causes the acceleration.
The ThreadPool for Python standard is replaced by the Pool class in oneTBB. Additionally, the thread pool is activated across modules without requiring any code modifications thanks to the use of monkey patching, which allows an object to be dynamically replaced or updated during runtime. Additionally, oneTBB replaces oneMKL by turning on its own threading layer, which allows it to automatically provide composable parallelism when using calls from the Python NumPy and SciPy libraries.
See the code samples from the following composability demo, which is conducted on a system with MKL-enabled NumPy, TBB, and symmetric multiprocessing (SMP) modules and their accompanying IPython kernels installed, to examine the extent to which nested parallelism can enhance performance. Python is a feature-rich command-shell interface that supports a variety of programming languages and interactive computing. To get a quantifiable performance comparison, the demonstration was executed using the Jupyter Notebook extension.
import NumPy as np from multiprocessing.pool import ThreadPool pool = ThreadPool(10)
The aforementioned cell must be executed again each time the kernel in the Jupyter menu is changed in order to build the ThreadPool and provide the runtime outcomes listed below.
The following code, which runs the identical line for each of the three trials, is used with the default Python kernel:
%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])
This approach can be used to get the eigenvalues of a matrix using the standard Python kernel. Runtime is significantly improved up to an order of magnitude when the Python-m SMP kernel is enabled. Applying the Python-m TBB kernel yields even more improvements.
OneTBB’s dynamic task scheduler, which most effectively manages code where the innermost parallel sections cannot fully utilize the system’s CPU and where there may be a variable amount of work to be done, yields the best performance for this composability example. Although the SMP technique is still quite effective, it usually performs best in situations when workloads are more evenly distributed and the loads of all workers in the outermost regions are generally identical.
In summary, utilizing multithreading can speed up AI/ML workflows
The effectiveness of Python programs with an AI and machine learning focus can be increased in a variety of ways. Using multithreading and multiprocessing effectively will remain one of the most important ways to push AI/ML software development workflows to their limits.
Read more on Govindhtech.com
0 notes
govindhtech · 2 months
Text
Utilizing llama.cpp, LLMs can be executed on Intel GPUs
Tumblr media
The open-source project known as llama.cpp is a lightweight LLM framework that is gaining greater and greater popularity. Given its performance and customisability, developers, scholars, and fans have formed a strong community around the project. Since its launch, GitHub has over 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks. More hardware, including Intel GPUs seen in server and consumer products, is now supported by llama.cpp as a result of recent code merges. Hardware support for GPUs from other vendors and CPUs (x86 and ARM) is now combined with Intel’s GPUs.
Georgi Gerganov designed the first implementation. The project is mostly instructional in nature and acts as the primary testing ground for new features being developed for the machine learning tensor library known as ggml library. Intel is making  AI more accessible to a wider range of customers by enabling inference on a greater number of devices with its latest releases. Because Llama.cpp is built in C and has a number of other appealing qualities, it is quick.
16-bit float compatibility
Support for integer quantisation (four-, five-, eight-, etc.)
Absence of reliance on outside parties
There are no runtime memory allocations.
Intel GPU SYCL Backend
GGM offers a number of backends to accommodate and adjust for various hardware. Since oneAPI supports GPUs from multiple vendors, Intel decided to construct the SYCL backend using their direct programming language, SYCL, and high-performance BLAS library, oneMKL. A programming model called SYCL is designed to increase hardware accelerator productivity. It is an embedded, single-source language with a domain focus that is built entirely on C++17.
All Intel GPUs can be used with the SYCL backend. Intel has confirmed with:
Flex Series and Data Centre GPU Max from Intel
Discrete GPU Intel Arc
Intel Arc GPU integrated with the Intel Core Ultra CPU
In Intel Core CPUs from Generations 11 through 13: iGPU
Millions of consumer devices can now conduct inference on Llama since llama.cpp now supports Intel GPUs. The SYCL backend performs noticeably better on Intel GPUs than the OpenCL (CLBlast) backend. Additionally, it supports an increasing number of devices, including CPUs and future processors with  AI accelerators. For information on using the SYCL backend, please refer to the llama.cpp tutorial.
Utilise the SYCL Backend to Run LLM on an Intel GPU
For SYCL, llama.cpp contains a comprehensive manual. Any Intel GPU that supports SYCL and oneAPI can run it. GPUs from the Flex Series and Intel Data Centre GPU Max can be used by server and cloud users. On their Intel Arc GPU or iGPU on Intel Core CPUs, client users can test it out. The 11th generation Core and later iGPUs have been tested by Intel. While it functions, the older iGPU performs poorly.
The memory is the only restriction. Shared memory on the host is used by the iGPU. Its own memory is used by the dGPU. For llama2-7b-Q4 models, Intel advise utilising an iGPU with 80+ EUs (11th Gen Core and above) and shared memory that is greater than 4.5 GB (total host memory is 16 GB and higher, and half memory could be assigned to iGPU).
Put in place the Intel GPU driver
There is support for Windows (WLS2) and Linux. Intel suggests Ubuntu 22.04 for Linux, and this version was utilised for testing and development.
Linux:sudo usermod -aG render username sudo usermod -aG video username sudo apt install clinfo sudo clinfo -l
Output (example):Platform #0: Intel(R) OpenCL Graphics -- Device #0: Intel(R) Arc(TM) A770 Graphics
orPlatform #0: Intel(R) OpenCL HD Graphics -- Device #0: Intel(R) Iris(R) Xe Graphics \[0x9a49\]
Set the oneAPI Runtime to ON
Install the Intel oneAPI Base Toolkit first in order to obtain oneMKL and the SYCL compiler. Turn on the oneAPI runtime next:
First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:
Linux: source /opt/intel/oneapi/setvars.sh
Windows: “C:\Program Files (x86)\Intel\oneAPI\setvars.bat\” intel64
Run sycl-ls to confirm that there are one or more Level Zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
Build by one-click:
Linux: ./examples/sycl/build.sh
Windows: examples\sycl\win-build-sycl.bat
Note, the scripts above include the command to enable the oneAPI runtime.
Run an Example by One-Click
Download llama-2–7b.Q4_0.gguf and save to the models folder:
Linux: ./examples/sycl/run-llama2.sh
Windows: examples\sycl\win-run-llama2.bat
Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, please change the device ID in the script. To list the device ID:
Linux: ./build/bin/ls-sycl-device or ./build/bin/main
Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
Synopsis
All Intel GPUs are available to LLM developers and users via the SYCL backend included in llama.cpp. Kindly verify whether the Intel laptop, your gaming PC, or your cloud virtual machine have an iGPU, an Intel Arc GPU, or an Intel Data Centre GPU Max and Flex Series GPU. If so, llama.cpp’s wonderful LLM features on Intel GPUs are yours to enjoy. To add new features and optimise SYCL for Intel GPUs, Intel want developers to experiment and contribute to the backend. The oneAPI programming approach is a useful project to learn for cross-platform development.
Read more on Govindhtech.com
0 notes
govindhtech · 2 months
Text
OneAPI Math Kernel Library (oneMKL): Intel MKL’s Successor
Tumblr media
The upgraded and enlarged Intel oneAPI Math Kernel Library supports numerical processing not only on CPUs but also on GPUs, FPGAs, and other accelerators that are now standard components of heterogeneous computing environments.
In order to assist you decide if upgrading from traditional  Intel MKL is the better option for you, this blog will provide you with a brief summary of the maths library.
Why just oneMKL?
The vast array of mathematical functions in oneMKL can be used for a wide range of tasks, from straightforward ones like linear algebra and equation solving to more intricate ones like data fitting and summary statistics.
Several scientific computing functions, including vector math, fast Fourier transforms (FFT), random number generation (RNG), dense and sparse Basic Linear Algebra Subprograms (BLAS), Linear Algebra Package (LAPLACK), and vector math, can all be applied using it as a common medium while adhering to uniform API conventions. Together with GPU offload and SYCL support, all of these are offered in C and Fortran interfaces.
Additionally, when used with  Intel Distribution for Python, oneAPI Math Kernel Library speeds up Python computations (NumPy and SciPy).
Intel MKL Advanced with oneMKL
A refined variant of the standard  Intel MKL is called oneMKL. What sets it apart from its predecessor is its improved support for SYCL and GPU offload. Allow me to quickly go over these two distinctions.
GPU Offload Support for oneMKL
GPU offloading for SYCL and OpenMP computations is supported by oneMKL. With its main functionalities configured natively for Intel GPU offload, it may thus take use of parallel-execution kernels of GPU architectures.
oneMKL adheres to the General Purpose GPU (GPGPU) offload concept that is included in the Intel Graphics Compute Runtime for OpenCL Driver and oneAPI Level Zero. The fundamental execution mechanism is as follows: the host CPU is coupled to one or more compute devices, each of which has several GPU Compute Engines (CE).
SYCL API for oneMKL
OneMKL’s SYCL API component is a part of oneAPI, an open, standards-based, multi-architecture, unified framework that spans industries. (Khronos Group’s SYCL integrates the SYCL specification with language extensions created through an open community approach.) Therefore, its advantages can be reaped on a variety of computing devices, including FPGAs, CPUs, GPUs, and other accelerators. The SYCL API’s functionality has been divided into a number of domains, each with a corresponding code sample available at the oneAPI GitHub repository and its own namespace.
OneMKL Assistance for the Most Recent Hardware
On cutting-edge architectures and upcoming hardware generations, you can benefit from oneMKL functionality and optimizations. Some examples of how oneMKL enables you to fully utilize the capabilities of your hardware setup are as follows:
It supports the 4th generation  Intel Xeon Scalable Processors’ float16 data type via  Intel Advanced Vector Extensions 512 (Intel AVX-512) and optimised bfloat16 and int8 data types via Intel Advanced Matrix Extensions (Intel AMX).
It offers matrix multiply optimisations on the upcoming generation of CPUs and GPUs, including Single Precision General Matrix Multiplication (SGEMM), Double Precision General Matrix Multiplication (DGEMM), RNG functions, and much more.
For a number of features and optimisations on the Intel Data Centre GPU Max Series, it supports Intel Xe Matrix Extensions (Intel XMX).
For memory-bound dense and sparse linear algebra, vector math, FFT, spline computations, and various other scientific computations, it makes use of the hardware capabilities of  Intel Xeon processors and  Intel Data Centre GPUs.
Additional Terms and Context
The brief explanation of terminology provided below could also help you understand oneMKL and how it fits into the heterogeneous-compute ecosystem.
The C++ with SYCL interfaces for performance math library functions are defined in the oneAPI Specification for oneMKL. The oneMKL specification has the potential to change more quickly and often than its implementations.
The specification is implemented in an open-source manner by the oneAPI Math Kernel Library (oneMKL) Interfaces project. With this project, we hope to show that the SYCL interfaces described in the oneMKL specification may be implemented for any target hardware and math library.
The intention is to gradually expand the implementation, even though the one offered here might not be the complete implementation of the specification. We welcome community participation in this project, as well as assistance in expanding support to more math libraries and a variety of hardware targets.
With C++ and SYCL interfaces, as well as comparable capabilities with C and Fortran interfaces, oneMKL is the  Intel product implementation of the specification. For Intel CPU and  Intel GPU hardware, it is extremely optimized.
Next up, what?
Launch oneMKL now to begin speeding up your numerical calculations like never before! Leverage oneMKL’s powerful features to expedite math processing operations and improve application performance while reducing development time for both current and future Intel platforms.
Keep in mind that oneMKL is rapidly evolving even while you utilize the present features and optimizations! In an effort to keep up with the latest Intel technology, we continuously implement new optimizations and support for sophisticated math functions.
They also invite you to explore the  AI, HPC, and Rendering capabilities available in  Intel’s software portfolio that is driven by oneAPI.
Read more on govindhtech.com
0 notes
govindhtech · 2 months
Text
PyTorch 2.4 to Speed Up AI Tasks Support for Intel GPUs
Tumblr media
PyTorch 2.4 Launches to Speed Up AI Tasks with Initial Support for Intel GPUs. In order to further speed up AI tasks, PyTorch 2.4 now offers initial support for the  Intel Data Centre GPU Max Series, which integrates  Intel GPUs and the SYCL software stack into the standard PyTorch stack.
Advantages
With Intel GPU support, customers have more options for GPUs and can use a consistent front-end and back-end GPU programming model. Workloads can now be deployed and operated on Intel GPUs with little coding required. To support streaming devices, this version generalizes the PyTorch device and runtime (device, stream, event, generator, allocator, and guard). The generalization facilitates not only PyTorch’s deployment on widely available hardware but also the integration of many hardware back ends.
Integrated PyTorch provides continuous software support, standardized software distribution, and consistent product release schedules, all of which will improve the experience for users of Intel GPUs.
An Overview of Support for Intel GPUs
Eager mode and graph mode are supported in the PyTorch built-in front end thanks to Intel GPU support that has been up streamed into the program. The SYCL programming language is now utilized to implement popular Aten operators in the eager mode. OneAPI Math Kernel Library (oneMKL) and oneAPI Deep Neural Network Library (oneDNN) are used to highly optimize the most performance-critical graphs and operators. To perform the optimization for Intel GPUs and to integrate Triton, the graph mode (torch.compile) now has an enabled Intel GPU back end.
PyTorch 2.4 now includes the necessary parts of Intel GPU support: Aten operators, oneDNN, Triton, Intel GPU source build, and integration of Intel GPU tool chains. In the meantime, PyTorch Profiler which is built on an integration between Kineto and oneMKL is being actively worked on in front of the forthcoming PyTorch 2.5 release. The front-end and back-end enhancements for Intel GPUs that are currently being implemented into PyTorch are depicted in Figure 1.Image Credit To Intel
PyTorch 2.4 Features
Apart from offering essential functionalities for training and inference on the Intel Data Centre GPU Max Series, the PyTorch 2.4 release for Linux maintains the same user interface as other supported hardware for PyTorch.
Using an Intel GPU, PyTorch 2.4 features include:
Workflows for inference and training.
The core eager functions as well as torch.compile are supported, and both eager and compile modes can fully run a Dynamo Hugging Face benchmark.
Data types like automated mixed precision (AMP), BF16, FP32, and so on.
operates on the Intel Data Centre GPU Max Series and Linux.
PyTorch 2.5
The first Intel GPU from the Intel Data Centre GPU Max Series is now available in the PyTorch ecosystem for AI workload acceleration thanks to the Intel GPU on PyTorch 2.4 first support (prototype) release.
In order to achieve beta quality in the PyTorch 2.5 release, they are constantly improving the functionality and performance of the Intel GPU support. Intel Client GPUs will be added to the list of GPUs supported for AI PC use cases as the product develops further. They’re also investigating more features for PyTorch 2.5, like:
Eager Mode: Completely execute Dynamo Torchbench and TIMM eager mode, and implement additional Aten operators.
Torch.compile: Optimise performance while running Dynamo Torchbench and TIMM benchmark compile mode in full.
To support Intel GPU, enable torch.profile under the profiler and utilities section.
Distribution of PyPI wheels.
Support for Windows and the  Intel Client GPU Series.
They invite the community to assess these latest additions to PyTorch’s  Intel GPU support.
Intel Extensions For PyTorch
The most recent performance enhancements for Intel devices are added to PyTorch using the Intel Extension. The Intel XeMatrix Extensions (XMX) AI engines on Intel discrete GPUs and Intel Advanced Vector Extensions 512 (Intel AVX-512) Vector Neural Network Instructions (VNNI) and Intel Advanced Matrix Extensions (Intel AMX) on Intel CPUs are utilized in optimizations. Additionally, the PyTorch xpu device, in conjunction with Intel Extension for PyTorch, facilitates simple GPU acceleration for Intel discrete GPUs.
Workloads and models for Generative AI (GenAI) have become increasingly common in today’s technological environment. These GenAI applications are mostly driven by large language models, or LLMs. The Intel Extension for PyTorch has added special optimizations for a few Large Language Models (LLMs) as of version 2.1.0. See Large Language Models (LLMs) section for additional details on LLM optimizations.
For Python programs, the extension can be loaded as a module, and for C++ projects, it can be linked as a library. It can be dynamically enabled in Python programs by importing intel_extension_for_pytorch.
Buildings
Eager Mode: Custom Python modules (including fusion modules), optimum optimizers, and INT8 quantization APIs are added to the PyTorch frontend in the eager mode. Using extended graph fusion passes, eager-mode models can be transformed into graph mode to further increase performance.
Graph Mode: Performance is enhanced by fusions’ reduction of operator/kernel invocation overhead in the graph mode. In PyTorch, the graph mode typically produces better results from optimization techniques like operation fusion than the eager mode does.
They are enhanced by the Intel Extension for PyTorch, which offers more thorough graph optimizations. Supported graph modes are PyTorch Torchscript and TorchDynamo. They advise you to use torch.jit.trace() instead of torch.jit.script() when using Torchscript since it typically supports a larger variety of workloads. The ipex backend can deliver strong performance with TorchDynamo.
CPU Optimization: Based on the detected instruction set architecture (ISA), Intel Extension for PyTorch automatically assigns operators to underlying kernels on the CPU. The addon makes use of the Intel hardware’s vectorization and matrix acceleration units. For enhanced performance, the runtime extension provides weight sharing and more precise thread runtime management.
Intel GPU
GPU Optimisation: The PyTorch dispatching method is used to implement and register optimized operators and kernels on the GPU. The intrinsic vectorization and matrix calculating capabilities of Intel GPU hardware enhance certain operators and kernels. The DPC++ compiler, which supports both the most recent SYCL standard and several extensions to the SYCL standard, is used by the Intel Extension for PyTorch for GPU. These extensions are located in the sycl/doc/extensions directory.
Encouragement
GitHub issues are used by the team to keep track of bugs and enhancement requests. Check to see whether your issue has previously been reported on GitHub before making a proposal or bug report.
Read more on govindhtech.com
0 notes