govindhtech · 3 months ago
oneAPI Construction Kit for RISC-V Processors
With the oneAPI Construction Kit, you can integrate the oneAPI ecosystem into your RISC-V processor.
RISC-V
Codeplay, an Intel company, recently announced that its oneAPI Construction Kit supports RISC-V. RISC-V is a rapidly growing open standard instruction set architecture (ISA), available under royalty-free open-source licenses, for processors of all kinds.
Through direct programming in C++ with SYCL, along with a set of libraries aimed at common functions like math, threading, and neural networks, and a hardware abstraction layer that allows programming in one language to target different devices, the oneAPI programming model enables a single codebase to be deployed across multiple computing architectures including CPUs, GPUs, FPGAs, and other accelerators.
In order to promote open source cooperation and the creation of a cohesive, cross-architecture programming paradigm free from proprietary software lock-in, the oneAPI standard is now overseen by the UXL Foundation.
Codeplay’s oneAPI Construction Kit is a framework for extending the oneAPI ecosystem to bespoke AI and HPC architectures. The most recent release, version 4.0, adds RISC-V native host support for the first time, covering both native on-host compilation and cross-compilation.
Because of this capability, programs can run on a CPU and benefit from the acceleration that SYCL offers via data parallelism. With the oneAPI Construction Kit, RISC-V processor designers can now easily connect SYCL and the oneAPI ecosystem to their hardware, marking a key step toward the goal of a completely open hardware and software stack. The Kit is open source and completely free to use.
oneAPI Construction Kit
The oneAPI Construction Kit gives your processor access to an open ecosystem. It is a framework that opens up SYCL and other open standards to hardware platforms, and it can be used to extend the oneAPI ecosystem to custom AI and HPC architectures.
Give Developers Access to a Dynamic, Open Ecosystem
With the oneAPI Construction Kit, new and customized accelerators can benefit from the oneAPI ecosystem and an abundance of SYCL libraries. Contributors from many sectors of the industry support and maintain this open ecosystem, so you can build knowing that features and libraries will be maintained. It also frees developers to innovate more quickly by reducing the time spent rewriting code and managing disparate codebases.
The oneAPI Construction Kit is useful for anybody who designs hardware. To get you started, the Kit includes a reference implementation for RISC-V vector processors, although it is not confined to RISC-V and can be adapted to a variety of processors.
Codeplay Enhances the oneAPI Construction Kit with RISC-V Support
RISC-V is a rapidly expanding open standard instruction set architecture (ISA) that can be used for all sorts of processors, including accelerators and CPUs. Axelera, Codasip, and others make RISC-V processors for a variety of applications. RISC-V-powered microprocessors are also being developed by the EU as part of the European Processor Initiative.
Codeplay has long been a pioneer in open ecosystems. As a member of RISC-V International, it has worked on the project for a number of years, leading working groups that have helped to shape the standard. Codeplay recognizes that building a genuinely open ecosystem starts with open, standards-based hardware. Achieving that goal, however, requires open hardware, open software, and open source from top to bottom.
This is where oneAPI and SYCL come in, offering an ecosystem of open-source, standards-based software libraries for applications of various kinds, such as oneMKL or oneDNN, combined with a mature programming model. Both SYCL and oneAPI are heterogeneous, which means that you can write code once and run it on any GPU (AMD, Intel, NVIDIA) or, as of late, on RISC-V, without being tied to one manufacturer.
With the most recent 4.0 release of the oneAPI Construction Kit, Codeplay introduced RISC-V native host support for both native on-host compilation and cross-compilation. Because of this capability, programs can run on a CPU and benefit from the acceleration that SYCL offers via data parallelism. With the oneAPI Construction Kit, RISC-V processor designers can now easily connect SYCL and the oneAPI ecosystem to their hardware, marking a major step toward the vision of a completely open hardware and software stack.
Read more on govindhtech.com
govindhtech · 5 months ago
Intel VTune Profiler For Data Parallel Python Applications
Intel VTune Profiler tutorial
This brief tutorial will show you how to use Intel VTune Profiler to profile the performance of a Python application using the NumPy and Numba example applications.
Analysing Performance in Applications and Systems
For HPC, cloud, IoT, media, storage, and other applications, Intel VTune Profiler optimises system performance, application performance, and system configuration.
Optimise the performance of the entire application, not just the accelerated part, across CPU, GPU, and FPGA.
Multilingual: Profile SYCL, C, C++, C#, Fortran, OpenCL code, Python, Google Go, Java, .NET, Assembly, or any combination of languages.
Application or System: Obtain detailed results mapped to source code or coarse-grained system data for a longer time period.
Power: Maximise performance while avoiding thermal or power-related throttling.
VTune platform profiler
It has the following features.
Optimisation of Algorithms
Find your code’s “hot spots,” or the sections that take the longest.
Use Flame Graph to see hot code paths and the amount of time spent in each function and its callees.
Bottlenecks in Microarchitecture and Memory
Use microarchitecture exploration analysis to pinpoint the major hardware problems affecting your application’s performance.
Identify memory-access-related problems, such as cache misses and high-bandwidth issues.
Accelerators and XPUs
Improve data transfers and GPU offload schema for SYCL, OpenCL, Microsoft DirectX, or OpenMP offload code. Determine which GPU kernels take the longest to optimise further.
Examine GPU-bound programs for inefficient kernel algorithms or microarchitectural restrictions that may be causing performance problems.
Examine FPGA utilisation and the interactions between CPU and FPGA.
Technical preview: Determine the most time-consuming operations that are executing on the neural processing unit (NPU) and learn how much data is exchanged between the NPU and DDR memory.
Parallelism
Check the threading efficiency of the code. Determine which threading problems are affecting performance.
Examine compute-intensive or throughput HPC programs to determine how well they utilise memory, vectorisation, and the CPU.
Platform and I/O
Find the points in I/O-intensive applications where performance is stalled. Examine the hardware’s ability to handle I/O traffic produced by integrated accelerators or external PCIe devices.
Use System Overview to get a detailed overview of short-term workloads.
Multiple Nodes
Characterise the performance of large-scale message passing interface (MPI) and OpenMP workloads.
Determine any scalability problems and receive suggestions for a thorough investigation.
Intel VTune Profiler
To improve Python performance while using Intel systems, install and utilise the Intel Distribution for Python and Data Parallel Extensions for Python with your applications.
Set up VTune Profiler for use with Python.
To find performance issues and areas for improvement, profile three distinct Python application implementations. The pairwise distance calculation algorithm commonly used in machine learning and data analytics will be demonstrated in this article using the NumPy example.
The following packages are used by the three distinct implementations.
Intel Optimised NumPy
Data Parallel Extension for NumPy
Data Parallel Extension for Numba on GPU
NumPy and Data Parallel Extension for Python
By providing optimised heterogeneous computing, Intel Distribution for Python and Intel Data Parallel Extension for Python offer a fantastic and straightforward approach to develop high-performance machine learning (ML) and scientific applications.
Intel Distribution for Python provides:
Scalability on PCs, powerful servers, and laptops utilising every CPU core available.
Assistance with the most recent Intel CPU instruction sets.
Accelerating core numerical and machine learning packages with libraries such as the Intel oneAPI Math Kernel Library (oneMKL) and Intel oneAPI Data Analytics Library (oneDAL) allows for near-native performance.
Productivity tools for compiling Python code into optimised instructions.
Important Python bindings to help your Python project integrate Intel native tools more easily.
Three core packages make up the Data Parallel Extensions for Python:
Data Parallel Extension for NumPy (dpnp)
Data Parallel Extension for Numba (numba_dpex)
Data Parallel Control library (dpctl), which provides tensor data structure support, device selection, data allocation on devices, and support for user-defined data parallel extensions for Python.
To promptly identify and resolve unanticipated performance problems in machine learning (ML), artificial intelligence (AI), and other scientific workloads, it is best to gain insight into compute and memory bottlenecks through comprehensive source-code-level analysis. Intel VTune Profiler makes this possible for Python-based ML and AI programs as well as C/C++ code. The methods for profiling these kinds of Python applications are the main topic of this article.
With the help of Intel VTune Profiler, developers can locate the source lines causing performance loss and replace them with calls into the highly optimised Intel Optimised NumPy and Data Parallel Extension for Python libraries.
Setting up and Installing
1. Install Intel Distribution for Python
2. Create a Python Virtual Environment
   python -m venv pyenv
   pyenv\Scripts\activate        (Windows)
   source pyenv/bin/activate     (Linux)
3. Install Python packages
   pip install numpy
   pip install dpnp
   pip install numba
   pip install numba-dpex
   pip install pyitt
Reference Configuration
The hardware and software components used for the reference example code we use are:
Software Components:
dpnp 0.14.0+189.gfcddad2474
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mkl-umath 0.1.1
numba 0.59.0
numba-dpex 0.21.4
numpy 1.26.4
pyitt 1.1.0
Operating System:
Linux, Ubuntu 22.04.3 LTS
CPU:
Intel Xeon Platinum 8480+
GPU:
Intel Data Center GPU Max 1550
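Before profiling, it can help to compare your own environment with the reference configuration. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the package names are the ones listed above.

```python
from importlib import metadata

# Distribution names taken from the reference configuration above.
PACKAGES = ["dpnp", "mkl-fft", "mkl-random", "mkl-service", "mkl-umath",
            "numba", "numba-dpex", "numpy", "pyitt"]


def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name:12s} {version or 'not installed'}")
```

Versions that differ from the list above do not necessarily cause problems, but they are worth noting when results differ from the article's.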
The Example Application for NumPy
This article demonstrates, step by step, how to use Intel VTune Profiler and its Intel Instrumentation and Tracing Technology (ITT) API to optimise a NumPy application. The example is the pairwise distance application, a popular algorithm in fields including biology, high performance computing (HPC), machine learning, and geographic data analytics.
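For readers unfamiliar with the algorithm, here is a minimal sketch of a pairwise Euclidean distance computation in plain NumPy. It is an illustration written for this explanation, not the code from the Intel example; the function name `pairwise_distance` and the test data are assumptions.

```python
import numpy as np


def pairwise_distance(x, y):
    """Euclidean distance between every row of x (n, d) and every row of y (m, d)."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so the heavy work is one matrix product.
    x_sq = np.sum(x * x, axis=1)[:, np.newaxis]   # shape (n, 1)
    y_sq = np.sum(y * y, axis=1)[np.newaxis, :]   # shape (1, m)
    sq = x_sq + y_sq - 2.0 * np.dot(x, y.T)
    np.maximum(sq, 0.0, out=sq)                   # clamp tiny negatives caused by rounding
    return np.sqrt(sq)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    points = rng.random((1000, 3))
    print(pairwise_distance(points, points).shape)
```

Most of the time in a computation like this is spent inside the matrix product and the element-wise operations, which is why a profiler tends to attribute the hotspots to NumPy and the math library underneath it.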
Summary
The three stages of optimisation that we will discuss in this post are summarised as follows:
Step 1: Examining the Intel Optimised NumPy pairwise distance implementation: here, we try to understand the bottlenecks affecting the NumPy implementation’s performance.
Step 2: Profiling the Data Parallel Extension for NumPy pairwise distance implementation: we examine the implementation and see whether there is a performance difference.
Step 3: Profiling the Data Parallel Extension for Numba pairwise distance implementation on GPU: analysing the numba-dpex implementation’s GPU performance.
Boost Your Python NumPy Application
This article has shown how to quickly discover compute and memory bottlenecks in a Python application using Intel VTune Profiler.
Intel VTune Profiler aids in identifying bottlenecks’ root causes and strategies for enhancing application performance.
It can assist in mapping the main bottleneck jobs to the source code/assembly level and displaying the related CPU/GPU time.
Even more comprehensive, developer-friendly profiling results can be obtained by using the Instrumentation and Tracing API (ITT APIs).
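As a rough illustration of how ITT instrumentation divides a program into named logical tasks, the sketch below wraps two phases of a small computation. It assumes that the `pyitt` package exposes `pyitt.task(name)` as a context manager; if `pyitt` is not installed, a hypothetical no-op stand-in is used so the code still runs. The function `distances_from_origin` is invented for this example.

```python
import contextlib
import math

try:
    import pyitt                 # assumption: pyitt.task(name) works as a context manager
    task = pyitt.task
except ImportError:
    @contextlib.contextmanager
    def task(name):              # hypothetical no-op stand-in when pyitt is absent
        yield


def distances_from_origin(points):
    """Distance of each point from the origin, split into two named tasks."""
    with task("compute squares"):
        squares = [sum(c * c for c in p) for p in points]
    with task("square roots"):
        return [math.sqrt(s) for s in squares]


if __name__ == "__main__":
    print(distances_from_origin([(3.0, 4.0), (0.0, 0.0)]))
```

When the program runs under the profiler, each named region should then appear as a separate task in the results, which makes it easier to attribute time to a logical step rather than to a raw call stack.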
govindhtech · 2 months ago
Intel VTune Profiler: Improve AI/ML Application Performance
Improve AI/ML Application Performance with Intel VTune Profiler.
Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.
Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.
Locating and examining performance bottlenecks in the underlying source code is challenging because of the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various phases of model development.
These approaches, however, are limited to software-level functionality. Intel VTune Profiler, which is powered by oneAPI, addresses this limitation by offering comprehensive insight into hardware-level memory and compute bottlenecks. This helps resolve performance problems and optimize and scale AI applications’ performance across hardware systems with different computational envelopes.
As you will discover in this article, using Intel VTune Profiler to profile Data Parallel Python and OpenVINO programs can expand the scope of optimization for AI/ML workloads.
Use Intel VTune Profiler to Boost Python Application Performance
VTune Profiler may assist in profiling a Python program, as shown in a recently published recipe in the Intel VTune Profiler Cookbook.
The following are the recipe’s fundamental software requirements:
Data Parallel Extensions for Python
VTune Profiler (version 2022 or later)
Intel Distribution for Python
Intel oneAPI DPC++/C++ Compiler
The NumPy implementation covered in the recipe divides the calculations into logical jobs using the Intel Instrumentation and Tracing Technology (ITT) APIs and performs distance computations using the Intel oneAPI Math Kernel Library (oneMKL) routines. You may then determine which areas of the code need attention for necessary changes to get additional performance by using the VTune Profiler tool to examine the execution time and memory consumption of each logical job.
When Hotspots analysis is performed on the NumPy implementation, the output analysis report includes details on the most CPU-intensive code segments. It also recommends other performance analysis features of the profiler to explore, such as Threading analysis for enhanced parallelism and Microarchitecture Exploration analysis for effective use of the underlying hardware.
Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks
According to the Hotspots analysis report, NumPy operations and underlying oneMKL functions account for a significant amount of the execution time in the simple NumPy implementation of the pairwise distance computation example. By making little code modifications, NumPy may be swapped out for the Data Parallel Extension for NumPy, which will eliminate these bottlenecks. To evaluate the speed gains over the simple NumPy code and find any areas that might need further optimization, run the Hotspots analysis once again.
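The "little code modifications" mentioned above mostly amount to changing the import. The sketch below assumes that `dpnp` mirrors the NumPy calls used here (`asarray`, `sum`, `sqrt`) and falls back to NumPy when `dpnp` is not installed; the function `row_norms` is a hypothetical example, not part of the recipe.

```python
try:
    import dpnp as xp        # Data Parallel Extension for NumPy, if installed
except ImportError:
    import numpy as xp       # fallback so the sketch still runs on plain NumPy


def row_norms(data):
    """Euclidean norm of each row, written only with calls shared by both modules."""
    arr = xp.asarray(data, dtype=xp.float32)
    return xp.sqrt(xp.sum(arr * arr, axis=1))


if __name__ == "__main__":
    print(xp.__name__, row_norms([[3.0, 4.0], [6.0, 8.0]]))
```

Writing the numerical code against a module alias such as `xp` keeps the two variants identical apart from the import, which makes before-and-after Hotspots comparisons easier to interpret.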
VTune Profiler also makes recommendations, such as using the Data Parallel Extension for Numba with your platform’s GPU to bring accelerator offload parallelism to the application. The Data Parallel Extension for Numba is an open-source extension to the Numba JIT compiler for NumPy operations. It offers Python kernel programming APIs that resemble SYCL. The GPU Compute/Media Hotspots analysis preview feature of VTune Profiler can then be used to examine the Numba implementation’s execution on a GPU.
Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance
Using the VTune Profiler to profile OpenVINO-based AI applications is covered in another new recipe in the VTune Profiler cookbook. It discusses how to use the profiler tool to analyze performance bottlenecks in the CPU, GPU, and Neural Processing Unit (NPU).
The recipe’s software requirements are:
Intel Distribution for Python
Intel VTune Profiler (v2024.1 or later)
The Benchmark Tool application, which is part of the OpenVINO Toolkit
Intel oneAPI Base Toolkit, if your OpenVINO application makes use of the Intel oneAPI Data Analytics Library (oneDAL) and/or the Intel oneAPI Deep Neural Network Library (oneDNN)
The recipe offers detailed steps for configuring OpenVINO with the ITT APIs for performance analysis, building the OpenVINO source, and setting it up. It profiles the AI application and analyzes performance and latency using a reference benchmark application.
Depending on the computational architecture, you can use VTune Profiler’s numerous performance analysis features to find hotspots and see how much of the hardware specific code sections use.
For example,
To examine CPU bottlenecks, that is, the sections of code that take up the most CPU execution time, use the Hotspots analysis.
Use the GPU Compute/Media Hotspots analysis preview feature to profile GPU hotspots. Examine inefficient kernel algorithms, look at the frequency of the various kinds of GPU instructions, and more to get an understanding of GPU use.
The Neural Processing Units (NPUs) in AI PCs are designed specifically to boost the performance of AI/ML applications. With the Intel Distribution of OpenVINO Toolkit, you can offload compute-intensive AI/ML tasks to Intel NPUs. With VTune Profiler’s NPU Exploration analysis preview feature, you can examine NPU performance using a number of hardware metrics, including workload size, execution time, sampling interval, and more.
Intel VTune Profiler Download
Use one of the following methods to install Intel VTune Profiler on your computer:
Download the standalone package.
As part of the Intel oneAPI Base Toolkit, download Intel VTune Profiler.
Understand the Workflow
The standard software performance analysis workflow is as follows:
Launch Intel VTune Profiler from the GUI, from Microsoft Visual Studio, or from the command line.
Configure parameters and choose a profiling analysis for your application.
Profile the application on the host (local collection) or on the target system (remote collection).
View the results on the host system.
Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.
FAQs
What does Intel VTune do?
Intel VTune Profiler examines CPU usage, OpenMP efficiency, memory access, and vectorization to assess application performance, and it measures metrics that help find memory access problems.
govindhtech · 4 months ago
Guide to Multithreading in Python with NumPy and SciPy
An Easy Guide to Multithreading in Python
Python is a strong language, particularly for developing AI and machine learning applications. However, CPython, the language’s original reference implementation and bytecode interpreter, lacks true parallel multithreading, so multi-core processing has to be enabled outside the interpreter itself. Libraries such as NumPy, SciPy, and PyTorch, which use C-based implementations, make some of the desired multi-core processing possible. The obstacle is the Global Interpreter Lock (GIL), which literally “locks” the CPython interpreter so that it works on only one thread at a time, regardless of whether the interpreter is in a single or multi-threaded environment.
Let’s take a different approach to Python.
Intel Distribution for Python, a collection of high-performance packages that optimize underlying instruction sets for Intel architectures, and the robust libraries and tools that support it are designed to address this.
For compute-intensive, core Python numerical and scientific packages like NumPy, SciPy, and Numba, the Intel distribution helps developers achieve performance levels that are comparable to those of a C++ program by accelerating math and threading operations using oneAPI libraries while maintaining low Python overheads. This enables fast scaling over a cluster and assists developers in providing highly efficient multithreading, vectorization, and memory management for their applications.
Let’s examine Intel’s strategy for enhancing Python parallelism and composability in more detail, as well as how it might speed up your AI/ML workflows.
Nested Parallelism: NumPy and SciPy
NumPy and SciPy are Python libraries created for numerical processing and scientific computing, respectively.
One workaround to enable multithreading/parallelism in Python scripts is to expose parallelism on all conceivable levels of a program, for example by parallelizing the outermost loops or by using other functional or pipeline types of parallelism at the application level. This parallelism can be accomplished with libraries like Dask and Joblib, and with the built-in multiprocessing module (including its ThreadPool class).
Data parallelism can be expressed with Python modules like NumPy and SciPy, which can in turn be accelerated with an optimized math library like the Intel oneAPI Math Kernel Library (oneMKL), because processing large amounts of data is computationally demanding. oneMKL is itself multi-threaded and can use various threading runtimes. The threading layer can be selected with the environment variable MKL_THREADING_LAYER.
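A minimal sketch of selecting the oneMKL threading layer from Python follows. The helper `request_mkl_threading_layer` is hypothetical, and the set of accepted names (`INTEL`, `GNU`, `TBB`, `SEQUENTIAL`) is an assumption that should be checked against the documentation of your installed oneMKL version.

```python
import os

# Assumed set of threading-layer names; verify against your oneMKL documentation.
VALID_LAYERS = {"INTEL", "GNU", "TBB", "SEQUENTIAL"}


def request_mkl_threading_layer(name):
    """Set MKL_THREADING_LAYER for libraries that are loaded after this call."""
    layer = name.upper()
    if layer not in VALID_LAYERS:
        raise ValueError(f"unknown threading layer: {name!r}")
    os.environ["MKL_THREADING_LAYER"] = layer
    return layer


if __name__ == "__main__":
    # Call this before importing NumPy or SciPy: the variable is read when
    # the math library is loaded, so setting it afterwards has no effect.
    print("Requested oneMKL threading layer:", request_mkl_threading_layer("tbb"))
```

The same effect can be had by exporting the variable in the shell before starting Python, which avoids any dependence on import order.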
The result is a code structure known as nested parallelism, in which a parallel region calls a function that in turn contains another parallel region. Since serial regions (that is, regions that cannot execute in parallel) and synchronization latencies are typically inevitable in NumPy and SciPy based programs, this parallelism-within-parallelism is an effective way to minimize or hide them.
Going One Step Further: Numba
Although NumPy and SciPy offer extensive mathematical and data-focused acceleration through C extensions, they remain a fixed set of mathematical tools. If non-standard math is required, a developer should not expect it to run at the same speed as the C extensions. This is where Numba can work really well.
OneTBB
Based on LLVM, Numba functions as a “Just-In-Time” (JIT) compiler. It aims to reduce the performance difference between Python and compiled, statically typed languages such as C and C++. Additionally, it supports a variety of threading runtimes, including workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB). To match these three runtimes, there are three integrated threading layers. The only threading layer installed by default is workqueue; however, other threading layers can be added with ease using conda commands (e.g., $ conda install tbb).
The environment variable NUMBA_THREADING_LAYER can be used to set the threading layer. It is vital to know that there are two ways to choose this threading layer: either choose a layer that is generally safe under different types of parallel processing, or specify the desired threading layer name (e.g., tbb) explicitly.
Composability of Threading
The efficiency or efficacy of co-existing multi-threaded components depends on an application’s or component’s threading composability. A component that is “perfectly composable” would operate without compromising the effectiveness of other components in the system or its own efficiency.
In order to achieve a completely composable threading system, care must be taken to prevent over-subscription, which means making sure that no parallel region of code or component can require a certain number of threads to run (this is known as “mandatory” parallelism).
An alternative would be to implement a type of “optional” parallelism in which a work scheduler determines at the user level which thread(s) the components should be mapped to while automating the coordination of tasks among components and parallel regions. Naturally, the efficiency of the scheduler’s threading model must be better than the high-performance libraries’ integrated scheme since it is sharing a single thread-pool to arrange the program’s components and libraries around. The efficiency is lost otherwise.
Intel’s Strategy for Parallelism and Composability
Threading composability is more readily attained when oneTBB is used as the work scheduler. OneTBB is an open-source, cross-platform C++ library that was created with threading composability and optional/nested parallelism in mind. It allows for multi-core parallel processing.
At the time of writing, the latest oneTBB release included an experimental module that enables threading composability across several libraries, unlocking the potential for multi-threaded speed-ups in Python. As mentioned previously, the acceleration comes from the scheduler’s improved thread allocation.
The Pool class in oneTBB replaces Python’s standard ThreadPool. Thanks to monkey patching, which allows an object to be dynamically replaced or updated at runtime, the thread pool is activated across modules without requiring any code modifications. oneTBB also takes over from oneMKL’s default threading by enabling its own threading layer, which lets it automatically provide composable parallelism for calls into the NumPy and SciPy libraries.
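To make the monkey-patching idea concrete, here is a generic standard-library sketch. It is not oneTBB's implementation: `CountingThreadPool`, `patched_thread_pool`, and `library_code` are hypothetical names used only to show how code that looks up `ThreadPool` at call time picks up a replacement without being modified.

```python
import multiprocessing.pool
from contextlib import contextmanager


class CountingThreadPool(multiprocessing.pool.ThreadPool):
    """Hypothetical replacement pool that counts how often map() is called."""
    calls = 0

    def map(self, func, iterable, chunksize=None):
        CountingThreadPool.calls += 1
        return super().map(func, iterable, chunksize)


@contextmanager
def patched_thread_pool(replacement):
    """Temporarily swap multiprocessing.pool.ThreadPool for `replacement`."""
    original = multiprocessing.pool.ThreadPool
    multiprocessing.pool.ThreadPool = replacement
    try:
        yield
    finally:
        multiprocessing.pool.ThreadPool = original   # always restore the original


def library_code(values):
    """Stands in for third-party code that is unaware of the patch."""
    with multiprocessing.pool.ThreadPool(2) as pool:
        return pool.map(abs, values)


if __name__ == "__main__":
    with patched_thread_pool(CountingThreadPool):
        print(library_code([-1, 2, -3]))
    print("patched map() calls:", CountingThreadPool.calls)
```

Note that code which has already bound the name with `from multiprocessing.pool import ThreadPool` keeps its reference to the original class, so a patch of this kind has to be applied before such imports run.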
To see how much nested parallelism can improve performance, consider the code samples from the following composability demo, run on a system with MKL-enabled NumPy, the TBB and symmetric multiprocessing (SMP) modules, and their accompanying IPython kernels installed. IPython is a feature-rich command shell for interactive computing that supports a variety of programming languages. To get a quantifiable performance comparison, the demonstration was run using the Jupyter Notebook extension.
import numpy as np
from multiprocessing.pool import ThreadPool
pool = ThreadPool(10)
The aforementioned cell must be executed again each time the kernel in the Jupyter menu is changed in order to build the ThreadPool and provide the runtime outcomes listed below.
The following code, which runs the identical line for each of the three trials, is used with the default Python kernel:
%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])
With the default Python kernel, this line serves as the baseline: it computes QR decompositions, the building block of the QR algorithm for finding the eigenvalues of a matrix. Runtime improves significantly, by up to an order of magnitude, when the python -m smp kernel is enabled. Applying the python -m tbb kernel yields even more improvement.
OneTBB’s dynamic task scheduler, which most effectively manages code where the innermost parallel sections cannot fully utilize the system’s CPU and where there may be a variable amount of work to be done, yields the best performance for this composability example. Although the SMP technique is still quite effective, it usually performs best in situations when workloads are more evenly distributed and the loads of all workers in the outermost regions are generally identical.
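The `%timeit` magic only works inside IPython or Jupyter. A rough script equivalent of the measurement above, using the standard `timeit` module, might look like the sketch below. The helper `qr_batch` and the repeat count are assumptions, and the idea of launching the script with `python -m tbb` or `python -m smp` to compare runtimes presumes those modules are installed.

```python
import timeit
from multiprocessing.pool import ThreadPool

import numpy as np


def qr_batch(pool, count=10, size=256):
    """Factorize `count` random size x size matrices, one pool task per matrix."""
    matrices = [np.random.random((size, size)) for _ in range(count)]
    return pool.map(np.linalg.qr, matrices)


if __name__ == "__main__":
    repeats = 5
    with ThreadPool(10) as pool:
        seconds = timeit.timeit(lambda: qr_batch(pool), number=repeats) / repeats
    print(f"mean time per batch: {seconds:.4f} s")
```

Each pool task calls into the math library, which may start its own threads, so this small script exercises exactly the nested parallelism discussed above.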
In summary, utilizing multithreading can speed up AI/ML workflows
The effectiveness of Python programs with an AI and machine learning focus can be increased in a variety of ways. Using multithreading and multiprocessing effectively will remain one of the most important ways to push AI/ML software development workflows to their limits.
govindhtech · 5 months ago
Run LLMs on Intel GPUs Using llama.cpp
The open-source project llama.cpp is a lightweight LLM framework that is steadily gaining popularity. Thanks to its performance and customisability, developers, scholars, and enthusiasts have formed a strong community around the project. Since its launch, the project has accumulated over 600 contributors, 52,000 stars, 1,500 releases, and 7,400 forks on GitHub. As a result of recent code merges, llama.cpp now supports more hardware, including the Intel GPUs found in server and consumer products. Intel GPUs now join the existing hardware support for CPUs (x86 and ARM) and GPUs from other vendors.
Georgi Gerganov wrote the original implementation. The project is mainly educational in nature and serves as the primary testing ground for new features being developed for ggml, a tensor library for machine learning. With its latest releases, Intel is making AI more accessible to a wider range of customers by enabling inference on a greater number of devices. llama.cpp is fast because it is written in C and has a number of other appealing qualities:
16-bit float support
Integer quantisation support (4-bit, 5-bit, 8-bit, etc.)
No third-party dependencies
No memory allocations at runtime
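To give a feel for what integer quantisation means, the sketch below applies a simple symmetric 4-bit scheme to a short list of weights. It is a generic illustration, not the block format that ggml or llama.cpp actually uses; the function names and sample weights are assumptions.

```python
def quantize_4bit(values):
    """Symmetric 4-bit quantisation: integer codes in [-8, 7] plus one shared scale."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floating-point values from codes and scale."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.70, -0.35, 0.00, 0.10]
    codes, scale = quantize_4bit(weights)
    print(codes, scale)
    print(dequantize(codes, scale))
```

Each weight is stored in 4 bits instead of 16 or 32, at the cost of a rounding error of at most half the scale per value, which is the trade-off that lets large models fit in limited GPU memory.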
Intel GPU SYCL Backend
ggml offers a number of backends to support and optimise for various hardware. Since oneAPI supports GPUs from multiple vendors, Intel decided to build the SYCL backend using its direct programming language, SYCL, and its high-performance BLAS library, oneMKL. SYCL is a programming model designed to increase productivity on hardware accelerators. It is a single-source, embedded, domain-specific language based entirely on C++17.
The SYCL backend can be used with all Intel GPUs. Intel has verified it with:
Intel Data Center GPU Max and Flex Series
Intel Arc discrete GPU
Built-in Intel Arc GPU in Intel Core Ultra CPUs
iGPU in 11th through 13th Gen Intel Core CPUs
Millions of consumer devices can now run inference on Llama models because llama.cpp supports Intel GPUs. The SYCL backend performs noticeably better on Intel GPUs than the OpenCL (CLBlast) backend. It also supports an increasing number of devices, including CPUs and future processors with AI accelerators. For information on using the SYCL backend, please refer to the llama.cpp tutorial.
Utilise the SYCL Backend to Run LLM on an Intel GPU
llama.cpp contains a comprehensive guide for SYCL. The backend runs on any Intel GPU that supports SYCL and oneAPI. Server and cloud users can use Intel Data Center GPU Max and Flex Series GPUs. Client users can try it on their Intel Arc GPU or on the iGPU in Intel Core CPUs. Intel has tested iGPUs from 11th Gen Core onwards. Older iGPUs work, but perform poorly.
Memory is the only restriction. The iGPU uses shared memory on the host, while the dGPU uses its own memory. For llama2-7b-Q4 models, Intel advises using an iGPU with 80+ EUs (11th Gen Core and above) and more than 4.5 GB of shared memory (that is, total host memory of 16 GB or more, of which half could be assigned to the iGPU).
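The memory guidance above is simple arithmetic, sketched below. The function name and defaults are assumptions taken from the figures in the text: up to half of host memory may be assigned to the iGPU, and a llama2-7b-Q4 model needs more than 4.5 GB.

```python
def igpu_fits_model(host_memory_gb, required_gb=4.5, shareable_fraction=0.5):
    """True if the memory the iGPU may borrow from the host exceeds the model's need."""
    return host_memory_gb * shareable_fraction > required_gb


if __name__ == "__main__":
    for host_gb in (8, 16, 32):
        verdict = "enough" if igpu_fits_model(host_gb) else "not enough"
        print(f"{host_gb} GB host memory: {verdict} shared memory for llama2-7b-Q4")
```

With 16 GB of host memory the iGPU can use up to 8 GB, which clears the 4.5 GB threshold; with 8 GB it can use only 4 GB, which does not.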
Install the Intel GPU Driver
Both Linux and Windows (WSL2) are supported. For Linux, Intel recommends Ubuntu 22.04, which was used for development and testing.
Linux:
sudo usermod -aG render username
sudo usermod -aG video username
sudo apt install clinfo
sudo clinfo -l
Output (example):
Platform #0: Intel(R) OpenCL Graphics
 -- Device #0: Intel(R) Arc(TM) A770 Graphics
or
Platform #0: Intel(R) OpenCL HD Graphics
 -- Device #0: Intel(R) Iris(R) Xe Graphics [0x9a49]
Enable the oneAPI Runtime
First, install the Intel oneAPI Base Toolkit to get the SYCL compiler and oneMKL. Next, enable the oneAPI runtime:
Linux: source /opt/intel/oneapi/setvars.sh
Windows: "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
Run sycl-ls to confirm that there are one or more Level Zero devices. Please confirm that at least one GPU is present, like [ext_oneapi_level_zero:gpu:0].
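The check described above can also be scripted. The sketch below scans text in the sycl-ls format for entries like [ext_oneapi_level_zero:gpu:0] and returns the device IDs. The sample text and the function name are hypothetical; real output lists your own device names and driver versions.

```python
import re

# Matches entries such as "[ext_oneapi_level_zero:gpu:0]" in sycl-ls output.
LEVEL_ZERO_GPU = re.compile(r"\[ext_oneapi_level_zero:gpu:(\d+)\]")


def level_zero_gpu_ids(sycl_ls_output: str) -> list[int]:
    """Return the device IDs of all Level Zero GPUs found in the text."""
    return [int(device_id) for device_id in LEVEL_ZERO_GPU.findall(sycl_ls_output)]


# Hypothetical sample output; real device names and versions will differ.
SAMPLE = """\
[opencl:cpu:0] Example CPU device
[opencl:gpu:1] Example GPU device
[ext_oneapi_level_zero:gpu:0] Example GPU device
"""

if __name__ == "__main__":
    print(level_zero_gpu_ids(SAMPLE))
```

To use it on a real machine, the output of sycl-ls can be captured with subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout and passed to the function. An empty list means no Level Zero GPU was reported.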
Build by one-click:
Linux: ./examples/sycl/build.sh
Windows: examples\sycl\win-build-sycl.bat
Note, the scripts above include the command to enable the oneAPI runtime.
Run an Example by One-Click
Download llama-2-7b.Q4_0.gguf and save it to the models folder:
Linux: ./examples/sycl/run-llama2.sh
Windows: examples\sycl\win-run-llama2.bat
Note that the scripts above include the command to enable the oneAPI runtime. If the ID of your Level Zero GPU is not 0, please change the device ID in the script. To list the device ID:
Linux: ./build/bin/ls-sycl-device or ./build/bin/main
Windows: build\bin\ls-sycl-device.exe or build\bin\main.exe
Summary
The SYCL backend in llama.cpp makes all Intel GPUs available to LLM developers and users. Check whether your Intel laptop, gaming PC, or cloud virtual machine has an iGPU, an Intel Arc GPU, or an Intel Data Centre GPU Max or Flex Series GPU. If so, you can enjoy llama.cpp's LLM features on Intel GPUs. Intel wants developers to experiment with the backend and contribute to it, adding new features and optimising SYCL for Intel GPUs. The oneAPI programming model is also worth learning for cross-platform development.
Read more on Govindhtech.com
govindhtech · 6 months ago
OneAPI Math Kernel Library (oneMKL): Intel MKL’s Successor
The upgraded and enlarged Intel oneAPI Math Kernel Library supports numerical processing not only on CPUs but also on GPUs, FPGAs, and other accelerators that are now standard components of heterogeneous computing environments.
To help you decide whether upgrading from traditional Intel MKL is right for you, this blog gives a brief overview of the maths library.
Why oneMKL?
The vast array of mathematical functions in oneMKL can be used for a wide range of tasks, from straightforward ones like linear algebra and equation solving to more intricate ones like data fitting and summary statistics.
It provides several scientific computing functions, including dense and sparse Basic Linear Algebra Subprograms (BLAS), Linear Algebra Package (LAPACK), fast Fourier transforms (FFT), random number generation (RNG), and vector math, all through uniform API conventions. These are offered in C and Fortran interfaces, along with SYCL support and GPU offload.
Additionally, when used with Intel Distribution for Python, oneAPI Math Kernel Library speeds up Python computations (NumPy and SciPy).
Intel MKL Advanced with oneMKL
oneMKL is an enhanced version of the traditional Intel MKL. What sets it apart from its predecessor is its improved support for SYCL and GPU offload. These two differences are outlined below.
GPU Offload Support for oneMKL
oneMKL supports GPU offload for SYCL and OpenMP computations. Its main functionality is enabled natively for Intel GPU offload, so it can take advantage of the parallel-execution kernels of GPU architectures.
oneMKL follows the General Purpose GPU (GPGPU) offload model that is part of the Intel Graphics Compute Runtime for oneAPI Level Zero and OpenCL Driver. The basic execution mechanism is as follows: the host CPU is connected to one or more compute devices, each of which contains multiple GPU Compute Engines (CE).
SYCL API for oneMKL
The SYCL API component of oneMKL is part of oneAPI, an open, standards-based, unified, multi-architecture, cross-industry programming framework. (oneAPI combines the Khronos Group's SYCL specification with language extensions developed through an open community process.) Its benefits are therefore available on a variety of computing devices, including CPUs, GPUs, FPGAs, and other accelerators. The SYCL API's functionality is divided into several domains, each with its own namespace and a corresponding code sample in the oneAPI GitHub repository.
oneMKL Support for the Latest Hardware
On cutting-edge architectures and upcoming hardware generations, you can benefit from oneMKL functionality and optimizations. Some examples of how oneMKL enables you to fully utilize the capabilities of your hardware setup are as follows:
It supports the float16 data type via Intel Advanced Vector Extensions 512 (Intel AVX-512) and optimised bfloat16 and int8 data types via Intel Advanced Matrix Extensions (Intel AMX) on 4th generation Intel Xeon Scalable Processors.
It offers optimisations for the upcoming generation of CPUs and GPUs, including Single Precision General Matrix Multiplication (SGEMM), Double Precision General Matrix Multiplication (DGEMM), RNG functions, and much more.
It supports Intel Xe Matrix Extensions (Intel XMX) for a number of features and optimisations on the Intel Data Centre GPU Max Series.
It makes use of the hardware capabilities of Intel Xeon processors and Intel Data Centre GPUs for memory-bound dense and sparse linear algebra, vector math, FFT, spline computations, and various other scientific computations.
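For readers unfamiliar with the GEMM routines named in the list above: SGEMM and DGEMM compute C ← αAB + βC in single and double precision respectively. The plain-Python sketch below shows only what is computed. It is not how oneMKL implements the operation, it omits the transpose options of the BLAS interface, and the function name is illustrative.

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    return [
        [
            # Dot product of row i of a with column j of b, then scale and add.
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]
```

An optimized library performs the same arithmetic but blocks the loops for cache reuse and maps them onto vector or matrix units, which is where extensions such as Intel AMX and Intel XMX come in.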
Additional Terms and Context
The brief explanation of terminology provided below could also help you understand oneMKL and how it fits into the heterogeneous-compute ecosystem.
The C++ with SYCL interfaces for performance math library functions are defined in the oneAPI Specification for oneMKL. The oneMKL specification has the potential to change more quickly and often than its implementations.
The specification is implemented in an open-source manner by the oneAPI Math Kernel Library (oneMKL) Interfaces project. With this project, we hope to show that the SYCL interfaces described in the oneMKL specification may be implemented for any target hardware and math library.
The intention is to gradually expand the implementation, even though the one offered here might not be the complete implementation of the specification. We welcome community participation in this project, as well as assistance in expanding support to more math libraries and a variety of hardware targets.
oneMKL is the Intel product implementation of the specification, with C++ and SYCL interfaces as well as comparable capabilities through C and Fortran interfaces. It is highly optimized for Intel CPU and Intel GPU hardware.
What's Next?
Get started with oneMKL today to accelerate your numerical computations. Use oneMKL's features to speed up math processing and improve application performance while reducing development time on current and future Intel platforms.
Keep in mind that oneMKL continues to evolve rapidly while you use its current features and optimizations. To keep pace with the latest Intel technology, we continuously add new optimizations and support for advanced math functions.
We also invite you to explore the AI, HPC, and rendering capabilities available in Intel's oneAPI-powered software portfolio.
Read more on govindhtech.com
govindhtech · 3 months ago
Intel Distribution For Python To Create A Genetic Algorithm
Python Genetic Algorithm
Genetic algorithms (GAs) simulate natural selection to solve constrained and unconstrained optimization problems. They can tackle NP-hard optimization problems that traditional methods need a great deal of time and resources to solve. GAs are based on an analogy between chromosome behavior and biological evolution.
This article provides a code example of how to use numba-dpex for Intel Distribution for Python to create a generic GA and offload a calculation to a GPU.
Genetic Algorithms (GA)
Operations in GAs
Selection, crossover, and mutation are three key biology-inspired operations that GAs use to produce high-quality results. Before applying a GA to a particular problem, it is important to define the chromosome representation and the GA operations.
Selection
Selection is the process of choosing parents that are recombined to produce offspring. Because good parents lead their offspring toward better and fitter solutions, parent selection is critical to the convergence rate of a GA.
An illustration of the selection procedure whereby the following generation’s chromosomes are reduced by half.
The selection procedure often requires additional algorithms that decide which chromosomes become parents.
Crossover
This operation is analogous to biological crossover. More than one parent is chosen, and the parents' genetic material is used to produce one or more offspring.
A crossover operation in action.
The crossover operation creates child genomes from the selected parent chromosomes. It can be a one-point crossover that produces a single child genome: the first and second parents each contribute half of their genes to the child.
Mutation
A mutation is a small random change to a chromosome that yields a new solution. It is usually applied with a low probability and is used to maintain and introduce diversity in the genetic population.
A mutation procedure involving a single chromosomal value change.
The mutation procedure may alter a chromosome.
Optimize Genetic Algorithms Using Intel Distribution for Python
With libraries like Intel oneAPI Data Analytics Library (oneDAL) and Intel oneAPI Math Kernel Library (oneMKL), developers may use Intel Distribution for Python to obtain near-native code performance. With improved NumPy, SciPy, and Numba, researchers and developers can expand compute-intensive Python applications from laptops to powerful servers.
To optimize the genetic algorithm with Intel Distribution for Python, use a range kernel from the Data Parallel Extension for Numba (numba-dpex). A range kernel represents the most basic form of data parallelism: it runs over a group of work items, each of which is a logical thread of execution.
In a vector-add range kernel, for example, the addition is carried out on a GPU and the result is stored in vector c. Any other function or algorithm is implemented in the same way.
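To illustrate the range-kernel idea without a GPU, the following plain-Python sketch emulates the execution model: the kernel body is written for a single work item, and a launcher invokes it once per index in the range. It does not use numba-dpex. The names launch_range and vector_add_kernel are hypothetical, and a real device would run the work items in parallel rather than in a loop.

```python
def vector_add_kernel(i, a, b, c):
    """Kernel body for one work item: handles only element i."""
    c[i] = a[i] + b[i]


def launch_range(kernel, global_size, *args):
    """Emulate a range launch by calling the kernel once per work-item index.

    A real device would execute these work items in parallel.
    """
    for i in range(global_size):
        kernel(i, *args)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0]
    b = [10.0, 20.0, 30.0]
    c = [0.0] * len(a)
    launch_range(vector_add_kernel, len(a), a, b, c)
    print(c)
```

The point of the pattern is that the kernel contains no loop of its own: each work item only needs its index, which is what makes the work easy to spread across GPU threads.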
Code Execution
Refer to the code sample for instructions on how to develop the generic GA and optimize it to run on GPUs using numba-dpex for Intel Distribution for Python. It also describes how to use the GA operations (selection, crossover, and mutation) and how to adapt them to other optimization problems.
Initialize the population with the following values:
Population size: 5,000
Chromosome size: 10
Generations: 5
Each chromosome holds ten random floats between 0 and 1.
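The initialization described above can be sketched with the standard library alone. The constant and function names below are assumptions for illustration, as is the optional seed parameter. The values themselves come from the text.

```python
import random

POP_SIZE = 5000   # population size from the text
CHROM_SIZE = 10   # genes per chromosome
GENERATIONS = 5   # number of generations to run


def init_population(pop_size=POP_SIZE, chrom_size=CHROM_SIZE, seed=None):
    """Create pop_size chromosomes, each a list of floats in [0, 1)."""
    rng = random.Random(seed)  # a fixed seed makes CPU and GPU runs comparable
    return [[rng.random() for _ in range(chrom_size)] for _ in range(pop_size)]
```

Passing the same seed before the CPU run and the GPU run gives both the same starting population, so their results can be compared directly.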
Implement the GA by creating an evaluation function: This function serves as the baseline and point of comparison for numba-dpex. An individual's fitness is calculated using any combination of algebraic operations on the chromosome.
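A minimal sketch of such a baseline follows. The name eval_genomes_plain appears in the text, but its signature here is an assumption, and the fitness formula (sum of squared genes) is an arbitrary illustration of "any combination of algebraic operations".

```python
def eval_genomes_plain(chromosomes):
    """Return one fitness value per chromosome (sequential CPU baseline).

    The formula is an arbitrary illustration: the sum of the squared genes.
    """
    return [sum(gene * gene for gene in chrom) for chrom in chromosomes]
```

Any other formula works as long as the CPU and GPU versions use the same one, since the baseline exists to check both speed and agreement of results.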
Implement the crossover operation: The inputs are two distinct chromosomes, the first and second parents. The function returns one new chromosome as its output.
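One way to write this is a one-point crossover at the midpoint, so that each parent contributes half of its genes. This is a sketch under that assumption. The function name and the choice of the midpoint as the crossover point are illustrative.

```python
def crossover(first, second):
    """One-point crossover at the midpoint.

    The child takes the first half of `first` and the second half of `second`.
    """
    if len(first) != len(second):
        raise ValueError("parents must have the same length")
    point = len(first) // 2
    return first[:point] + second[point:]
```

For example, crossing [1, 2, 3, 4] with [5, 6, 7, 8] yields [1, 2, 7, 8].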
Implement the mutation operation: In this code example, each float in the chromosome has a one percent chance of being replaced by a random value.
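A possible per-gene implementation is sketched below. The one percent rate comes from the text. The function name, the rate and rng parameters, and the choice to return a copy instead of mutating in place are assumptions.

```python
import random

MUTATION_RATE = 0.01  # one percent per gene, as stated in the text


def mutate(chromosome, rate=MUTATION_RATE, rng=random):
    """Return a copy in which each gene is replaced, with probability `rate`,
    by a fresh random float in [0, 1)."""
    return [rng.random() if rng.random() < rate else gene for gene in chromosome]
```

With ten genes per chromosome and a one percent rate, roughly one chromosome in ten changes per generation, which keeps diversity in the population without destroying good solutions.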
Implement the selection operation, which is the basis for creating a new generation. Inside this function, a new population is created after the crossover and mutation operations.
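The text does not say which selection scheme is used, so the sketch below assumes fitness-proportional (roulette wheel) selection via random.choices. The crossover and mutation operations are passed in as functions to keep the example self-contained. The name next_generation appears in the text, but this signature is an assumption.

```python
import random


def next_generation(chromosomes, fitnesses, crossover, mutate, rng=random):
    """Build a new population of the same size.

    Parents are drawn with probability proportional to fitness (an assumed
    selection scheme), combined with `crossover`, then passed to `mutate`.
    Fitness values must be non-negative and must not all be zero.
    """
    if len(chromosomes) != len(fitnesses):
        raise ValueError("need exactly one fitness value per chromosome")
    new_population = []
    for _ in range(len(chromosomes)):
        # Sampling is with replacement, so both parents may be the same chromosome.
        first, second = rng.choices(chromosomes, weights=fitnesses, k=2)
        new_population.append(mutate(crossover(first, second)))
    return new_population
```

Other schemes, such as tournament selection or keeping only the fitter half of the population, fit the same interface and may suit problems where fitness can be negative.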
Run the prepared functions on a CPU to establish a baseline. After the first population is initialized, every generation consists of the following steps:
Evaluate the current population with the eval_genomes_plain function.
Create the next generation with the next_generation function.
Wipe the fitness values, since a new generation has been created.
The computation time for these operations is measured and printed. The first chromosome is also printed to show that the computations on the CPU and GPU are the same.
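The loop described above can be sketched as a small driver that takes the evaluation and generation functions as arguments and times the whole run. The function name run_generations and its parameters are hypothetical. time.perf_counter is used for the wall-clock measurement.

```python
import time


def run_generations(population, evaluate, make_next_generation, generations=5):
    """Run the evaluate -> next generation loop and time it.

    Returns the final population and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    for _ in range(generations):
        fitnesses = evaluate(population)
        population = make_next_generation(population, fitnesses)
        fitnesses = None  # wipe fitness values: they describe the old generation
    elapsed = time.perf_counter() - start
    return population, elapsed
```

Printing the first chromosome of the returned population together with the elapsed time gives the two outputs the text mentions. The same driver can time a GPU run by passing in a GPU evaluation function.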
Run on a GPU: Start with a fresh population initialization (as in the earlier step), then create an evaluation function for the GPU. The only difference from the CPU implementation is that chromosomes are represented by a flattened data structure. Also, use a global index and kernels from numba-dpex to avoid looping over every chromosome.
As on the CPU, the time for evaluation, generation creation, and fitness wipe is measured when running on a GPU. Copy all of the chromosomes and the fitness container to the chosen device. A kernel with a specified range can then be launched.
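The flattened layout and global index can be shown in plain Python without a GPU. In this emulation, chromosome i occupies positions i * chrom_size through (i + 1) * chrom_size - 1 of one flat list, and each work item evaluates exactly one chromosome. Nothing here calls numba-dpex. All function names are hypothetical, and the fitness formula is the same illustrative sum of squares used for the CPU baseline.

```python
def flatten(chromosomes):
    """Store all chromosomes back to back in one flat list."""
    return [gene for chrom in chromosomes for gene in chrom]


def eval_work_item(index, flat_chromosomes, fitnesses, chrom_size):
    """Work-item body: `index` is the global index of one chromosome."""
    start = index * chrom_size
    genes = flat_chromosomes[start:start + chrom_size]
    fitnesses[index] = sum(gene * gene for gene in genes)  # illustrative formula


def eval_genomes_flat(chromosomes):
    """Emulate a range launch with one work item per chromosome."""
    chrom_size = len(chromosomes[0])
    flat = flatten(chromosomes)
    fitnesses = [0.0] * len(chromosomes)
    for index in range(len(chromosomes)):  # a device would run these in parallel
        eval_work_item(index, flat, fitnesses, chrom_size)
    return fitnesses
```

A flat buffer is used because a device kernel works on contiguous arrays rather than nested Python lists, and the index arithmetic replaces the outer loop over chromosomes.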
Conclusion
Use the same approach for other optimization problems: define the chromosome representation and the selection, crossover, mutation, and evaluation operations. The rest of the algorithm runs in the same way.
Run the code sample and compare how the algorithm performs sequentially on a CPU and in parallel on a GPU. The output shows that the parallel numba-dpex implementation on a GPU improves performance.
Read more on Govindhtech.com
govindhtech · 5 months ago
PyTorch 2.4 Speeds Up AI Tasks with Support for Intel GPUs
To further speed up AI tasks, PyTorch 2.4 launches with initial support for the Intel Data Centre GPU Max Series, bringing Intel GPUs and the SYCL software stack into the standard PyTorch stack.
Advantages
With Intel GPU support, users have more GPU choices and a consistent GPU programming model for both the front end and the back end. Workloads can now be deployed and run on Intel GPUs with minimal coding effort. To support streaming devices, this release generalizes the PyTorch device and runtime (device, stream, event, generator, allocator, and guard). This generalization makes it easier both to deploy PyTorch on widely available hardware and to integrate different hardware back ends.
Intel GPU support integrated into PyTorch provides continuous software support, standardized software distribution, and consistent product release schedules, all of which will improve the experience for users of Intel GPUs.
An Overview of Support for Intel GPUs
Intel GPU support upstreamed into PyTorch provides both eager mode and graph mode in the PyTorch built-in front end. In eager mode, popular ATen operators are now implemented in SYCL. The most performance-critical graphs and operators are highly optimized with oneAPI Deep Neural Network Library (oneDNN) and oneAPI Math Kernel Library (oneMKL). Graph mode (torch.compile) now has an Intel GPU back end enabled to perform optimization for Intel GPUs and to integrate Triton.
PyTorch 2.4 includes the essential components of Intel GPU support: ATen operators, oneDNN, Triton, Intel GPU source build, and Intel GPU tool chain integration. Meanwhile, PyTorch Profiler, which is built on an integration between Kineto and oneMKL, is under active development ahead of the upcoming PyTorch 2.5 release. Figure 1 shows the front-end and back-end improvements for Intel GPUs currently being implemented in PyTorch. (Image credit: Intel)
PyTorch 2.4 Features
In addition to providing key features for training and inference on the Intel Data Centre GPU Max Series, the PyTorch 2.4 release for Linux keeps the same user experience as other hardware that PyTorch supports.
PyTorch 2.4 features on an Intel GPU include:
Inference and training workflows.
Support for both torch.compile and the core eager functions; both eager and compile modes can fully run a Dynamo Hugging Face benchmark.
Data types such as FP32, BF16, and automatic mixed precision (AMP), among others.
Runs on Linux and the Intel Data Centre GPU Max Series.
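Since the release keeps the usual PyTorch device interface, a script can select the Intel GPU in the same way as any other device. The sketch below returns "xpu" when PyTorch reports an available Intel GPU and falls back to "cpu" otherwise, including when PyTorch is not installed. The helper name pick_device is hypothetical. torch.xpu.is_available() is the availability check PyTorch exposes for Intel GPUs.

```python
import importlib.util


def pick_device() -> str:
    """Return "xpu" when PyTorch reports an Intel GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed
    import torch

    xpu = getattr(torch, "xpu", None)  # absent in older PyTorch builds
    if xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"


if __name__ == "__main__":
    print("Selected device:", pick_device())
```

The returned string can be passed to model.to(device) or tensor.to(device) like any other PyTorch device name, so the rest of a training or inference script stays unchanged.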
PyTorch 2.5
With the initial (prototype) Intel GPU support released in PyTorch 2.4, the Intel Data Centre GPU Max Series becomes the first Intel GPU available in the PyTorch ecosystem for AI workload acceleration.
Intel is continuously improving the functionality and performance of Intel GPU support to reach beta quality in the PyTorch 2.5 release. As the product matures, Intel Client GPUs will be added to the list of supported GPUs for AI PC use cases. Further features under exploration for PyTorch 2.5 include:
Eager Mode: Implement additional ATen operators and fully run Dynamo Torchbench and TIMM in eager mode.
Torch.compile: Fully run Dynamo Torchbench and TIMM benchmarks in compile mode and optimise performance.
Profiler and utilities: Enable torch.profile to support Intel GPU.
Distribution of PyPI wheels.
Support for Windows and the Intel Client GPU Series.
Intel invites the community to evaluate these latest additions to PyTorch's Intel GPU support.
Intel Extension for PyTorch
Intel Extension for PyTorch adds the most recent performance optimizations for Intel hardware to PyTorch. The optimizations take advantage of Intel Advanced Vector Extensions 512 (Intel AVX-512) Vector Neural Network Instructions (VNNI) and Intel Advanced Matrix Extensions (Intel AMX) on Intel CPUs, as well as Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. In addition, Intel Extension for PyTorch provides easy GPU acceleration for Intel discrete GPUs through the PyTorch xpu device.
Generative AI (GenAI) workloads and models have become increasingly common in today's technology landscape. Large language models (LLMs) drive most of these GenAI applications. As of version 2.1.0, Intel Extension for PyTorch includes specific optimizations for a number of LLMs. See the Large Language Models (LLMs) section for additional details on LLM optimizations.
For Python programs, the extension can be loaded as a module, and for C++ projects, it can be linked as a library. It can be dynamically enabled in Python programs by importing intel_extension_for_pytorch.
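A script that should run with or without the extension can guard the import. This is a minimal sketch: ipex_available and optimize_if_possible are hypothetical helper names, while intel_extension_for_pytorch and its optimize function are the extension's own module and entry point. Default arguments are assumed here, and the extension's documentation covers options such as data type selection.

```python
import importlib.util


def ipex_available() -> bool:
    """True when the intel_extension_for_pytorch package can be found."""
    return importlib.util.find_spec("intel_extension_for_pytorch") is not None


def optimize_if_possible(model):
    """Apply the extension's optimizations when it is installed.

    Without the extension, the model is returned unchanged.
    """
    if not ipex_available():
        return model
    import intel_extension_for_pytorch as ipex  # importing enables the extension

    return ipex.optimize(model)
```

For inference, the model is typically switched to evaluation mode with model.eval() before it is passed to the helper.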
Architecture
Eager Mode: In eager mode, the PyTorch front end is extended with custom Python modules (such as fusion modules), optimal optimizers, and INT8 quantization APIs. Eager-mode models can be converted into graph mode with extended graph fusion passes to boost performance further.
Graph Mode: In graph mode, fusions reduce operator/kernel invocation overhead and thereby improve performance. Optimization techniques such as operation fusion generally yield better results in graph mode than in eager mode.
Intel Extension for PyTorch amplifies these gains with more comprehensive graph optimizations. The supported graph modes are PyTorch TorchScript and TorchDynamo. With TorchScript, torch.jit.trace() is recommended over torch.jit.script() because it generally supports a wider range of workloads. With TorchDynamo, the ipex backend can deliver strong performance.
CPU Optimization: On the CPU, Intel Extension for PyTorch automatically dispatches operators to underlying kernels based on the detected instruction set architecture (ISA). The extension makes use of the vectorization and matrix acceleration units in Intel hardware. For better performance, the runtime extension offers weight sharing and finer-grained control of thread runtime.
Intel GPU
GPU Optimisation: On the GPU, optimized operators and kernels are implemented and registered through the PyTorch dispatching mechanism. Certain operators and kernels are enhanced by the native vectorization and matrix computation capabilities of Intel GPU hardware. Intel Extension for PyTorch for GPU uses the DPC++ compiler, which supports the latest SYCL standard as well as a number of extensions to it. These extensions are located in the sycl/doc/extensions directory.
Support
The team uses GitHub issues to track bugs and enhancement requests. Before submitting a suggestion or bug report, check whether your issue has already been reported on GitHub.
Read more on govindhtech.com