#onemkl | Explore Tumblr posts and blogs

govindhtech · 7 months ago

Text

OneAPI Construction Kit For Intel RISC V Processor Interface

With the oneAPI Construction Kit, you may integrate the oneAPI Ecosystem into your Intel RISC V Processor.

Intel RISC-V

Recently, Codeplay, an Intel business, revealed that their oneAPI Construction Kit supports RISC-V. Rapidly expanding, Intel RISC V is an open standard instruction set architecture (ISA) available under royalty-free open-source licenses for processors of all kinds.

Through direct programming in C++ with SYCL, along with a set of libraries aimed at common functions like math, threading, and neural networks, and a hardware abstraction layer that allows programming in one language to target different devices, the oneAPI programming model enables a single codebase to be deployed across multiple computing architectures including CPUs, GPUs, FPGAs, and other accelerators.

In order to promote open source cooperation and the creation of a cohesive, cross-architecture programming paradigm free from proprietary software lock-in, the oneAPI standard is now overseen by the UXL Foundation.

A framework that may be used to expand the oneAPI ecosystem to bespoke AI and HPC architectures is Codeplay’s oneAPI Construction Kit. For both native on-host and cross-compilation, the most recent 4.0 version brings RISC-V native host for the first time.

Because of this capability, programs may be executed on a CPU and benefit from the acceleration that SYCL offers via data parallelism. With the oneAPI Construction Kit, Intel RISC V processor designers can now effortlessly connect SYCL and the oneAPI ecosystem with their hardware, marking a key step toward realizing the goal of a completely open hardware and software stack. It is completely free to use and open-source.

OneAPI Construction Kit

Your processor has access to an open environment with the oneAPI Construction Kit. It is a framework that opens up SYCL and other open standards to hardware platforms, and it can be used to expand the oneAPI ecosystem to include unique AI and HPC architectures.

Give Developers Access to a Dynamic, Open-Ecosystem

With the oneAPI Construction Kit, new and customized accelerators may benefit from the oneAPI ecosystem and an abundance of SYCL libraries. Contributors from many sectors of the industry support and maintain this open environment, so you may build with the knowledge that features and libraries will be preserved. Additionally, it frees up developers’ time to innovate more quickly by reducing the amount of time spent rewriting code and managing disparate codebases.

The oneAPI Construction Kit is useful for anybody who designs hardware. To get you started, the Kit includes a reference implementation for Intel RISC V vector processors, although it is not confined to RISC-V and may be modified for a variety of processors.

Codeplay Enhances the oneAPI Construction Kit with RISC-V Support

The rapidly expanding open standard instruction set architecture (ISA) known as RISC-V is compatible with all sorts of processors, including accelerators and CPUs. Axelera, Codasip, and others make Intel RISC V processors for a variety of applications. RISC-V-powered microprocessors are also being developed by the EU as part of the European Processor Initiative.

At Codeplay, has been long been pioneers in open ecosystems, and as a part of RISC-V International, its’ve worked on the project for a number of years, leading working groups that have helped to shape the standard. Nous realize that building a genuinely open environment starts with open, standards-based hardware. But in order to do that, must also need open hardware, open software, and open source from top to bottom.

This is where oneAPI and SYCL come in, offering an ecosystem of open-source, standards-based software libraries for applications of various kinds, such oneMKL or oneDNN, combined with a well-developed programming architecture. Both SYCL and oneAPI are heterogeneous, which means that you may create code once and use it on any GPU AMD, Intel, NVIDIA, or, as of late, RISC-V without being restricted by the manufacturer.

Intel initially implemented RISC-V native host for both native on-host and cross-compilation with the most recent 4.0 version of the oneAPI Construction Kit. Because of this capability, programs may be executed on a CPU and benefit from the acceleration that SYCL offers via data parallelism. With the oneAPI Construction Kit, Intel RISC V processor designers can now effortlessly connect SYCL and the oneAPI ecosystem with their hardware, marking a major step toward realizing the vision of a completely open hardware and software stack.

Read more on govindhtech.com

#OneAPIConstructionKit #IntelRISCV #SYCL #FPGA #IntelRISCVProcessorInterface #oneAPI #RISCV #oneDNN #oneMKL #RISCVSupport #OpenEcosystem #technology #technews #news #govindhtech

2 notes · View notes

govindhtech · 9 months ago

Text

Intel VTune Profiler For Data Parallel Python Applications

Intel VTune Profiler tutorial

This brief tutorial will show you how to use Intel VTune Profiler to profile the performance of a Python application using the NumPy and Numba example applications.

Analysing Performance in Applications and Systems

For HPC, cloud, IoT, media, storage, and other applications, Intel VTune Profiler optimises system performance, application performance, and system configuration.

Optimise the performance of the entire application not just the accelerated part using the CPU, GPU, and FPGA.

Profile SYCL, C, C++, C#, Fortran, OpenCL code, Python, Google Go, Java,.NET, Assembly, or any combination of languages can be multilingual.

Application or System: Obtain detailed results mapped to source code or coarse-grained system data for a longer time period.

Power: Maximise efficiency without resorting to thermal or power-related throttling.

VTune platform profiler

It has following Features.

Optimisation of Algorithms

Find your code’s “hot spots,” or the sections that take the longest.

Use Flame Graph to see hot code routes and the amount of time spent in each function and with its callees.

Bottlenecks in Microarchitecture and Memory

Use microarchitecture exploration analysis to pinpoint the major hardware problems affecting your application’s performance.

Identify memory-access-related concerns, such as cache misses and difficulty with high bandwidth.

Inductors and XPUs

Improve data transfers and GPU offload schema for SYCL, OpenCL, Microsoft DirectX, or OpenMP offload code. Determine which GPU kernels take the longest to optimise further.

Examine GPU-bound programs for inefficient kernel algorithms or microarchitectural restrictions that may be causing performance problems.

Examine FPGA utilisation and the interactions between CPU and FPGA.

Technical summary: Determine the most time-consuming operations that are executing on the neural processing unit (NPU) and learn how much data is exchanged between the NPU and DDR memory.

In parallelism

Check the threading efficiency of the code. Determine which threading problems are affecting performance.

Examine compute-intensive or throughput HPC programs to determine how well they utilise memory, vectorisation, and the CPU.

Interface and Platform

Find the points in I/O-intensive applications where performance is stalled. Examine the hardware’s ability to handle I/O traffic produced by integrated accelerators or external PCIe devices.

Use System Overview to get a detailed overview of short-term workloads.

Multiple Nodes

Describe the performance characteristics of workloads involving OpenMP and large-scale message passing interfaces (MPI).

Determine any scalability problems and receive suggestions for a thorough investigation.

Intel VTune Profiler

To improve Python performance while using Intel systems, install and utilise the Intel Distribution for Python and Data Parallel Extensions for Python with your applications.

Configure your Python-using VTune Profiler setup.

To find performance issues and areas for improvement, profile three distinct Python application implementations. The pairwise distance calculation algorithm commonly used in machine learning and data analytics will be demonstrated in this article using the NumPy example.

The following packages are used by the three distinct implementations.

Numpy Optimised for Intel

NumPy’s Data Parallel Extension

Extensions for Numba on GPU with Data Parallelism

Python’s NumPy and Data Parallel Extension

By providing optimised heterogeneous computing, Intel Distribution for Python and Intel Data Parallel Extension for Python offer a fantastic and straightforward approach to develop high-performance machine learning (ML) and scientific applications.

Added to the Python Intel Distribution is:

Scalability on PCs, powerful servers, and laptops utilising every CPU core available.

Assistance with the most recent Intel CPU instruction sets.

Accelerating core numerical and machine learning packages with libraries such as the Intel oneAPI Math Kernel Library (oneMKL) and Intel oneAPI Data Analytics Library (oneDAL) allows for near-native performance.

Tools for optimising Python code into instructions with more productivity.

Important Python bindings to help your Python project integrate Intel native tools more easily.

Three core packages make up the Data Parallel Extensions for Python:

The NumPy Data Parallel Extensions (dpnp)

Data Parallel Extensions for Numba, aka numba_dpex

Tensor data structure support, device selection, data allocation on devices, and user-defined data parallel extensions for Python are all provided by the dpctl (Data Parallel Control library).

It is best to obtain insights with comprehensive source code level analysis into compute and memory bottlenecks in order to promptly identify and resolve unanticipated performance difficulties in Machine Learning (ML), Artificial Intelligence ( AI), and other scientific workloads. This may be done with Python-based ML and AI programs as well as C/C++ code using Intel VTune Profiler. The methods for profiling these kinds of Python apps are the main topic of this paper.

Using highly optimised Intel Optimised Numpy and Data Parallel Extension for Python libraries, developers can replace the source lines causing performance loss with the help of Intel VTune Profiler, a sophisticated tool.

Setting up and Installing

1. Install Intel Distribution for Python

2. Create a Python Virtual Environment

python -m venv pyenv

pyenv\Scripts\activate

3. Install Python packages

pip install numpy

pip install dpnp

pip install numba

pip install numba-dpex

pip install pyitt

Make Use of Reference Configuration

The hardware and software components used for the reference example code we use are:

Software Components:

dpnp 0.14.0+189.gfcddad2474

mkl-fft 1.3.8

mkl-random 1.2.4

mkl-service 2.4.0

mkl-umath 0.1.1

numba 0.59.0

numba-dpex 0.21.4

numpy 1.26.4

pyitt 1.1.0

Operating System:

Linux, Ubuntu 22.04.3 LTS

CPU:

Intel Xeon Platinum 8480+

GPU:

Intel Data Center GPU Max 1550

The Example Application for NumPy

Intel will demonstrate how to use Intel VTune Profiler and its Intel Instrumentation and Tracing Technology (ITT) API to optimise a NumPy application step-by-step. The pairwise distance application, a well-liked approach in fields including biology, high performance computing (HPC), machine learning, and geographic data analytics, will be used in this article.

Summary

The three stages of optimisation that we will discuss in this post are summarised as follows:

Step 1: Examining the Intel Optimised Numpy Pairwise Distance Implementation: Here, we’ll attempt to comprehend the obstacles affecting the NumPy implementation’s performance.

Step 2: Profiling Data Parallel Extension for Pairwise Distance NumPy Implementation: We intend to examine the implementation and see whether there is a performance disparity.

Step 3: Profiling Data Parallel Extension for Pairwise Distance Implementation on Numba GPU: Analysing the numba-dpex implementation’s GPU performance

Boost Your Python NumPy Application

Intel has shown how to quickly discover compute and memory bottlenecks in a Python application using Intel VTune Profiler.

Intel VTune Profiler aids in identifying bottlenecks’ root causes and strategies for enhancing application performance.

It can assist in mapping the main bottleneck jobs to the source code/assembly level and displaying the related CPU/GPU time.

Even more comprehensive, developer-friendly profiling results can be obtained by using the Instrumentation and Tracing API (ITT APIs).

Read more on govindhtech.com

#Intel #IntelVTuneProfiler #Python #CPU #GPU #FPGA #Intelsystems #machinelearning #oneMKL #news #technews #technology #technologynews #technologytrends #govindhtech

2 notes · View notes

govindhtech · 6 months ago

Text

Intel VTune Profiler: Improve AI/ML Application Performance

Improve AI/ML Application Performance with Intel VTune Profiler.

Find out how to profile Data Parallel Python and OpenVINO workloads using the oneAPI-powered tool. The use of AI and ML is rising in healthcare and life sciences, marketing and finance, manufacturing, robotics, driverless automobiles, smart cities, and more. ML workloads in real-world fields employ deep learning frameworks like PyTorch, TensorFlow, Keras, and others.

Through the “write once, deploy anywhere” approach, other developer tools, such as the OpenVINO Toolkit, also help to expedite AI research on the newest hardware architectures in fields like computer vision and generative artificial intelligence (GenAI). The goal of the open source OpenVINO Toolkit, which was first released in 2018, has been to speed up AI inference with reduced latency and increased throughput while preserving accuracy, minimizing model footprint, and maximizing hardware utilization.

It is challenging to locate and examine performance bottlenecks in the underlying source code because to the intricate structure of deep learning models, which include numerous layers and non-linear functions. ML frameworks like PyTorch and TensorFlow provide native tools and profiling APIs for tracking and evaluating performance metrics at various phases of model construction.

These approaches, however, are only applicable to software functionality. This problem is addressed by the Intel VTune Profiler, which is driven by the oneAPI and offers comprehensive insights into hardware-level memory and compute bottlenecks. By doing this, performance problems are resolved and AI applications’ performance is optimized and scaled across hardware systems with different computational envelopes.

The scope of optimization for AI/ML workloads may be expanded by using Intel VTune Profiler to profile data in concurrent Python and OpenVINO programs, as you will discover in this article.

Use Intel VTune Profiler to Boost Python Application Performance

VTune Profiler may assist in profiling a Python program, as shown in a recently published recipe in the Intel VTune Profiler Cookbook.

The following are the recipe’s fundamental software requirements:

Data Parallel Extensions for Python

Vtune Profiler (version 2022 or later)

Intel Distribution for Python

Compiler for Intel OneAPI DPC++/C++

The NumPy implementation covered in the recipe divides the calculations into logical jobs using the Intel Instrumentation and Tracing Technology (ITT) APIs and performs distance computations using the Intel oneAPI Math Kernel Library (oneMKL) routines. You may then determine which areas of the code need attention for necessary changes to get additional performance by using the VTune Profiler tool to examine the execution time and memory consumption of each logical job.

Details on the most CPU-intensive code segments are included in the output analysis report when Hotspots analysis is performed on the NumPy implementation. Additionally, it offers recommendations for investigating the profiler tool’s other performance analysis features, such Threading analysis for enhanced parallelism and Microarchitecture Exploration analysis for effective use of the underlying hardware.

Use the Data Parallel Extension for NumPy and Numba to Address Performance Bottlenecks

According to the Hotspots analysis report, NumPy operations and underlying oneMKL functions account for a significant amount of the execution time in the simple NumPy implementation of the pairwise distance computation example. By making little code modifications, NumPy may be swapped out for the Data Parallel Extension for NumPy, which will eliminate these bottlenecks. To evaluate the speed gains over the simple NumPy code and find any areas that might need further optimization, run the Hotspots analysis once again.

Additionally, the VTune Profiler makes recommendations such as using the Data Parallel Extension for Numba with your platform’s GPU to bring offload accelerator parallelism to the application. The Numba JIT compiler for NumPy operations has an open-source extension called Numba. It offers Python kernel programming APIs that resemble SYCL. The GPU Compute/Media Hotspots analysis preview function of VTune Profiler may then be used to examine the Numba implementation’s execution on a GPU.

Use Intel VTune Profiler to Examine OpenVINO Applications’ Performance

Using the VTune Profiler to profile OpenVINO-based AI applications is covered in another new recipe in the VTune Profiler cookbook. It discusses how to use the profiler tool to analyze performance bottlenecks in the CPU, GPU, and Neural Processing Unit (NPU).

If your OpenVINO application makes use of the Intel oneAPI Data Analytics Library (oneDAL) and/or the Intel oneAPI Deep Neural Network (oneDNN) Intel Distribution for Python Intel VTune Profiler (v2024.1 or later), you can access the Benchmark Tool application as part of the OpenVINO Toolkit Intel oneAPI Base Toolkit.

The recipe offers detailed steps for configuring OpenVINO with the ITT APIs for performance analysis, building the OpenVINO source, and setting it up. It profiles the AI application and analyzes performance and latency using a reference benchmark application.

Depending on the computational architecture, you may use the VTune Profiler‘s numerous performance analysis features to find hotspots and look at how much hardware is being utilized by specific code sections.

For example,

To examine CPU bottlenecks that is, the sections of code that take up the most CPU execution time use the Hotspots Analysis tool.

Use the GPU Compute/Media Hotspots Analysis preview function to profile GPU hotspots. Examine inefficient kernel methods, examine the frequency of GPU instructions for various kinds of instructions, and more to get an understanding of GPU use.

The AI PCs’ Neural Processing Units (NPUs) are made especially to boost AI/ML applications’ performance. With the Intel Distribution of OpenVINO Toolkit, you may transfer compute-intensive AI/ML tasks to Intel NPUs. You may examine the NPU performance using a number of hardware measures, including workload size, execution time, sampling interval, and more, with the use of the VTune Profiler‘s NPU Exploration Analysis preview function.

Intel VTune Profiler Download

Use one of the following methods to install Intel VTune Profiler on your computer:

Get the standalone bundle here.

As part of the Intel oneAPI Base Toolkit, download Intel VTune Profiler.

Know the Process

In the standard software performance analysis process,Image Credit To Intel

Use one of the following methods to launch Intel VTune Profiler:

Using Microsoft Visual Studio’s GUI From the Command Line

Configure parameters and choose a profiling analysis for your application.

Create a profile of the target system (remote collection) or application on the host (local collection).

View the host system’s findings.

Identify bottlenecks and address them over a number of cycles until you reach a desirable level of performance.

FAQs

What does Intel VTune do?

Examine CPU usage, OpenMP efficiency, memory access, and vectorization to assess application performance. Measure metrics to find memory access concerns.

Read more on govindhtech.com

#IntelVTuneProfiler #ImproveAIMLApplicationPerformance #TensorFlow #OpenVINO #PyTorch #IntelOneAPIDPC #InteloneAPIMathKernelLibrary #oneMKL #DataParallelExtension #oneapi #VTuneProfiler #intel #technology #technews #news #govindhtech

1 note · View note

govindhtech · 8 months ago

Text

Guide To Python NumPy and SciPy In Multithreading In Python

An Easy Guide to Multithreading in Python

Python is a strong language, particularly for developing AI and machine learning applications. However, CPython, the programming language’s original, reference implementation and byte-code interpreter, lacks multithreading functionality; multithreading and parallel processing need to be enabled from the kernel. Some of the desired multi-core processing is made possible by libraries Python NumPy and SciPy such as NumPy, SciPy, and PyTorch, which use C-based implementations. However, there is a problem known as the Global Interpreter Lock (GIL), which literally “locks” the CPython interpreter to only working on one thread at a time, regardless of whether the interpreter is in a single or multi-threaded environment.

Let’s take a different approach to Python.

The robust libraries and tools that support Intel Distribution of Python, a collection of high-performance packages that optimize underlying instruction sets for Intel architectures, are designed to do this.

For compute-intensive, core Python numerical and scientific packages like NumPy, SciPy, and Numba, the Intel distribution helps developers achieve performance levels that are comparable to those of a C++ program by accelerating math and threading operations using oneAPI libraries while maintaining low Python overheads. This enables fast scaling over a cluster and assists developers in providing highly efficient multithreading, vectorization, and memory management for their applications.

Let’s examine Intel’s strategy for enhancing Python parallelism and composability in more detail, as well as how it might speed up your AI/ML workflows.

Parallelism in Nests: Python NumPy and SciPy

Python libraries called Python NumPy and SciPy were created especially for scientific computing and numerical processing, respectively.

Exposing parallelism on all conceivable levels of a program for example, by parallelizing the outermost loops or by utilizing various functional or pipeline sorts of parallelism on the application level is one workaround to enable multithreading/parallelism in Python scripts. This parallelism can be accomplished with the use of libraries like Dask, Joblib, and the included multiprocessing module mproc (with its ThreadPool class).

Data-parallelism can be performed with Python modules like Python NumPy and SciPy, which can then be accelerated with an efficient math library like the Intel oneAPI Math Kernel Library (oneMKL). This is because massive data processing requires a lot of processing. Using various threading runtimes, oneMKL is multi-threaded. An environment variable called MKL_THREADING_LAYER can be used to adjust the threading layer.

As a result, a code structure known as nested parallelism is created, in which a parallel section calls a function that in turn calls another parallel region. Since serial sections that is, regions that cannot execute in parallel and synchronization latencies are typically inevitable in Python NumPy and SciPy based systems, this parallelism-within-parallelism is an effective technique to minimize or hide them.

Going One Step Further: Numba

Despite offering extensive mathematical and data-focused accelerations through C-extensions, Python NumPy and SciPy remain a fixed set of mathematical tools accelerated through C-extensions. If non-standard math is required, a developer should not expect it to operate at the same speed as C-extensions. Here’s where Numba can work really well.

OneTBB

Based on LLVM, Numba functions as a “Just-In-Time” (JIT) compiler. It aims to reduce the performance difference between Python and compiled, statically typed languages such as C and C++. Additionally, it supports a variety of threading runtimes, including workqueue, OpenMP, and Intel oneAPI Threading Building Blocks (oneTBB). To match these three runtimes, there are three integrated threading layers. The only threading layer installed by default is workqueue; however, other threading layers can be added with ease using conda commands (e.g., $ conda install tbb).

The environment variable NUMBA_THREADING_LAYER can be used to set the threading layer. It is vital to know that there are two ways to choose this threading layer: either choose a layer that is generally safe under different types of parallel processing, or specify the desired threading layer name (e.g., tbb) explicitly.

Composability of Threading

The efficiency or efficacy of co-existing multi-threaded components depends on an application’s or component’s threading composability. A component that is “perfectly composable” would operate without compromising the effectiveness of other components in the system or its own efficiency.

In order to achieve a completely composable threading system, care must be taken to prevent over-subscription, which means making sure that no parallel region of code or component can require a certain number of threads to run (this is known as “mandatory” parallelism).

An alternative would be to implement a type of “optional” parallelism in which a work scheduler determines at the user level which thread(s) the components should be mapped to while automating the coordination of tasks among components and parallel regions. Naturally, the efficiency of the scheduler’s threading model must be better than the high-performance libraries’ integrated scheme since it is sharing a single thread-pool to arrange the program’s components and libraries around. The efficiency is lost otherwise.

Intel’s Strategy for Parallelism and Composability

Threading composability is more readily attained when oneTBB is used as the work scheduler. OneTBB is an open-source, cross-platform C++ library that was created with threading composability and optional/nested parallelism in mind. It allows for multi-core parallel processing.

An experimental module that enables threading composability across several libraries unlocks the potential for multi-threaded speed benefits in Python and was included in the oneTBB version released at the time of writing. As was previously mentioned, the scheduler’s improved threads allocation is what causes the acceleration.

The ThreadPool for Python standard is replaced by the Pool class in oneTBB. Additionally, the thread pool is activated across modules without requiring any code modifications thanks to the use of monkey patching, which allows an object to be dynamically replaced or updated during runtime. Additionally, oneTBB replaces oneMKL by turning on its own threading layer, which allows it to automatically provide composable parallelism when using calls from the Python NumPy and SciPy libraries.

See the code samples from the following composability demo, which is conducted on a system with MKL-enabled NumPy, TBB, and symmetric multiprocessing (SMP) modules and their accompanying IPython kernels installed, to examine the extent to which nested parallelism can enhance performance. Python is a feature-rich command-shell interface that supports a variety of programming languages and interactive computing. To get a quantifiable performance comparison, the demonstration was executed using the Jupyter Notebook extension.

import NumPy as np from multiprocessing.pool import ThreadPool pool = ThreadPool(10)

The aforementioned cell must be executed again each time the kernel in the Jupyter menu is changed in order to build the ThreadPool and provide the runtime outcomes listed below.

The following code, which runs the identical line for each of the three trials, is used with the default Python kernel:

%timeit pool.map(np.linalg.qr, [np.random.random((256, 256)) for i in range(10)])

This approach can be used to get the eigenvalues of a matrix using the standard Python kernel. Runtime is significantly improved up to an order of magnitude when the Python-m SMP kernel is enabled. Applying the Python-m TBB kernel yields even more improvements.

OneTBB’s dynamic task scheduler, which most effectively manages code where the innermost parallel sections cannot fully utilize the system’s CPU and where there may be a variable amount of work to be done, yields the best performance for this composability example. Although the SMP technique is still quite effective, it usually performs best in situations when workloads are more evenly distributed and the loads of all workers in the outermost regions are generally identical.

In summary, utilizing multithreading can speed up AI/ML workflows

The effectiveness of Python programs with an AI and machine learning focus can be increased in a variety of ways. Using multithreading and multiprocessing effectively will remain one of the most important ways to push AI/ML software development workflows to their limits.

Read more on govindhtech.com

#FPGAs #CPU #GPU #inteloneapi #onemkl #python #IntelGraphics #IntelTechnology #mathkernellibrary #API #news #technews #technology #technologynews #technologytrends #govindhtech

0 notes

govindhtech · 7 months ago

Text

Intel Distribution For Python To Create A Genetic Algorithm

Python Genetic Algorithm

Genetic algorithms (GA) simulate natural selection to solve finite and unconstrained optimization problems. Traditional methods take time and resources to address NP-hard optimization problems, but these algorithms can do it. GAs are based on a comparison between human chromosomal behavior and biological evolution.

This article provides a code example of how to use numba-dpex for Intel Distribution for Python to create a generic GA and offload a calculation to a GPU.

Genetic Algorithms (GA)

Activities inside GAs

Selection, crossover, and mutation are three crucial biology-inspired procedures that may be used to provide a high-quality output for GAs. It’s critical to specify the chromosomal representation and the GA procedures before applying GAs to a particular issue.

Selection

This is the procedure for choosing a partner and recombining them to produce children. Because excellent parents encourage their children to find better and more appropriate answers, parent selection is critical to the convergence rate of GA.

An illustration of the selection procedure whereby the following generation’s chromosomes are reduced by half.

The extra algorithms that decide which chromosomes will become parents are often required for the selection procedure.

Crossover

Biological crossover is the same procedure as this one. In this case, more than one parent is chosen, and the genetic material of the parents is used to make one or more children.

A crossover operation in action.

The crossover procedure produces kid genomes from specific parent chromosomes. There is only one kid genome produced and it may be a one-point crossing. The first and second parents each give the kid half of their DNA.

Mutation

A novel answer may be obtained by a little, haphazard modification to the chromosome. It is often administered with little probability and is used to preserve and add variation to the genetic population.

A mutation procedure involving a single chromosomal value change.

The mutation procedure may alter a chromosome.

Enhance Genetic Algorithms for Python Using Intel Distribution

With libraries like Intel oneAPI Data Analytics Library (oneDAL) and Intel oneAPI Math Kernel Library (oneMKL), developers may use Intel Distribution for Python to obtain near-native code performance. With improved NumPy, SciPy, and Numba, researchers and developers can expand compute-intensive Python applications from laptops to powerful servers.

Use the Data Parallel Extension for Numba (numba-dpex) range kernel to optimize the genetic algorithm using the Intel Distribution for Python. Each work item in this kernel represents a logical thread of execution, and it represents the most basic kind of data-parallel and parallelism across a group of work items.

The vector-add operation was carried out on a GPU in the prior code, and vector c held the result. In a similar vein, the implementation is the same for every other function or method.

Code Execution

Refer to the code sample for instructions on how to develop the generic GA and optimize the method to operate on GPUs using numba-dpex for Intel Distribution for Python. It also describes how to use the various GA operations selection, crossover, and mutation and how to modify these techniques for use in solving other optimization issues.

Set the following values to initialize the population:

5,000 people live there.

Size of a chromosome: 10

Generations: 5.

There are ten random floats between 0 and 1 on each chromosome.

Put the GA into practice by developing an assessment strategy: This function serves as numba-dpex’s benchmark and point of comparison. The calculation of an individual’s fitness involves using any combination of algebraic operations on the chromosome.

Carry out the crossover operation: The inputs are first and second parents to two distinct chromosomes. One more chromosome is returned as the function’s output.

Carry out the mutation operation: There is a one percent probability that every float in the chromosome will be replaced by a random value in this code example.

Put into practice the selection process, which is the foundation for producing a new generation. After crossover and mutation procedures, a new population is generated inside this function.

Launch the prepared functions on a CPU, beginning with a baseline. Every generation includes the following processes to establish the first population:

Utilizing the eval_genomes_plain function, the current population is evaluated

Utilizing a next_generation function, create the next generation.

Wipe fitness standards, since a new generation has already been produced.

Measured and printed is the calculation time for those operations. To demonstrate that the calculations were the same on the CPU and GPU, the first chromosome is also displayed.

Run on a GPU: Create an evaluation function for the GPU after beginning with a fresh population initialization (similar to step 2). With GPU implementation, chromosomes are represented by a flattened data structure, which is the sole difference between it and CPU implementation. Also, utilize a global index and kernels from numba-dpex to avoid looping over every chromosome.

The time for assessment, generation production, and fitness wipe is monitored when a GPU is operating, just like it is for the CPU. Deliver the fitness container and all of the chromosomes to the selected device. After that, a kernel with a specified range may be used.

Conclusion

Use the same procedures for further optimization issues. Describe the procedures of chromosomal selection, crossing, mutation, and assessment. The algorithm is executed the same way in its entirety.

Execute the above code sample and evaluate how well this method performs while executing sequentially on a CPU and parallelly on a GPU. The code result shows that using a GPU-based numba-dpex parallel implementation improves performance speed.

Read more on govindhtech.com

#PyTorch2.4 #AI #IntelGPUs #GPU #OneAPI #NeuralNetwork #AIPC #GenerativeAI #LLM #GitHub #Pytorch #news #technews #technology #technologynews #technologytrends #govindhtech

0 notes