#CUDA memory model
jcmarchi · 18 days
Text
Master CUDA: For Machine Learning Engineers
New Post has been published on https://thedigitalinsider.com/master-cuda-for-machine-learning-engineers/
CUDA for Machine Learning: Practical Applications
Structure of a CUDA C/C++ application, where the host (CPU) code manages the execution of parallel code on the device (GPU).
Now that we’ve covered the basics, let’s explore how CUDA can be applied to common machine learning tasks.
Matrix Multiplication
Matrix multiplication is a fundamental operation in many machine learning algorithms, particularly in neural networks. CUDA can significantly accelerate this operation. Here’s a simple implementation:
__global__ void matrixMulKernel(float *A, float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    if (row < N && col < N)
    {
        for (int i = 0; i < N; i++)
        {
            sum += A[row * N + i] * B[i * N + col];
        }
        C[row * N + col] = sum;
    }
}

// Host function to set up and launch the kernel
void matrixMul(float *A, float *B, float *C, int N)
{
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    matrixMulKernel<<<numBlocks, threadsPerBlock>>>(A, B, C, N);
}
This implementation divides the output matrix into blocks, with each thread computing one element of the result. While this basic version is already faster than a CPU implementation for large matrices, there’s room for optimization using shared memory and other techniques.
Convolution Operations
Convolutional Neural Networks (CNNs) rely heavily on convolution operations. CUDA can dramatically speed up these computations. Here’s a simplified 2D convolution kernel:
__global__ void convolution2DKernel(float *input, float *kernel, float *output,
                                    int inputWidth, int inputHeight,
                                    int kernelWidth, int kernelHeight)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < inputWidth && y < inputHeight)
    {
        float sum = 0.0f;
        for (int ky = 0; ky < kernelHeight; ky++)
        {
            for (int kx = 0; kx < kernelWidth; kx++)
            {
                int inputX = x + kx - kernelWidth / 2;
                int inputY = y + ky - kernelHeight / 2;
                if (inputX >= 0 && inputX < inputWidth && inputY >= 0 && inputY < inputHeight)
                {
                    sum += input[inputY * inputWidth + inputX] * kernel[ky * kernelWidth + kx];
                }
            }
        }
        output[y * inputWidth + x] = sum;
    }
}
This kernel performs a 2D convolution, with each thread computing one output pixel. In practice, more sophisticated implementations would use shared memory to reduce global memory accesses and optimize for various kernel sizes.
Stochastic Gradient Descent (SGD)
SGD is a cornerstone optimization algorithm in machine learning. CUDA can parallelize the computation of gradients across multiple data points. Here’s a simplified example for linear regression:
__global__ void sgdKernel(float *X, float *y, float *weights, float learningRate, int n, int d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float prediction = 0.0f;
        for (int j = 0; j < d; j++)
        {
            prediction += X[i * d + j] * weights[j];
        }
        float error = prediction - y[i];
        for (int j = 0; j < d; j++)
        {
            atomicAdd(&weights[j], -learningRate * error * X[i * d + j]);
        }
    }
}

void sgd(float *X, float *y, float *weights, float learningRate, int n, int d, int iterations)
{
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int iter = 0; iter < iterations; iter++)
    {
        sgdKernel<<<numBlocks, threadsPerBlock>>>(X, y, weights, learningRate, n, d);
    }
}
This implementation updates the weights in parallel for each data point. The atomicAdd function is used to handle concurrent updates to the weights safely.
Optimizing CUDA for Machine Learning
While the above examples demonstrate the basics of using CUDA for machine learning tasks, there are several optimization techniques that can further enhance performance:
Coalesced Memory Access
GPUs achieve peak performance when threads in a warp access contiguous memory locations. Ensure your data structures and access patterns promote coalesced memory access.
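As a rough sketch of what that means in practice (these are illustrative kernels, not tied to the models above), compare two ways of scaling a row-major matrix. Only the first lets consecutive threads in a warp touch consecutive addresses:

// Coalesced: consecutive threads (varying threadIdx.x) read consecutive floats of a row-major matrix.
__global__ void readCoalesced(const float *data, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[row * width + col] = data[row * width + col] * 2.0f;
}

// Strided: consecutive threads jump by 'height' elements, so a warp touches
// many cache lines per access and effective bandwidth drops sharply.
__global__ void readStrided(const float *data, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        out[col * height + row] = data[col * height + row] * 2.0f;
}

Both kernels compute the same thing; on most GPUs the coalesced version runs several times faster simply because of how the warp's memory requests map onto cache lines.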
Shared Memory Usage
Shared memory is much faster than global memory. Use it to cache frequently accessed data within a thread block.
Understanding the memory hierarchy with CUDA
This diagram illustrates the architecture of a multi-processor system with shared memory. Each processor has its own cache, allowing for fast access to frequently used data. The processors communicate via a shared bus, which connects them to a larger shared memory space.
For example, in matrix multiplication:
#define TILE_SIZE 16  // Assumed tile width; define it to match your thread-block dimensions.

__global__ void matrixMulSharedKernel(float *A, float *B, float *C, int N)
{
    __shared__ float sharedA[TILE_SIZE][TILE_SIZE];
    __shared__ float sharedB[TILE_SIZE][TILE_SIZE];

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int row = by * TILE_SIZE + ty;
    int col = bx * TILE_SIZE + tx;
    float sum = 0.0f;

    for (int tile = 0; tile < (N + TILE_SIZE - 1) / TILE_SIZE; tile++)
    {
        // Load one tile of A and one tile of B into shared memory, zero-padding at the edges.
        if (row < N && tile * TILE_SIZE + tx < N)
            sharedA[ty][tx] = A[row * N + tile * TILE_SIZE + tx];
        else
            sharedA[ty][tx] = 0.0f;

        if (col < N && tile * TILE_SIZE + ty < N)
            sharedB[ty][tx] = B[(tile * TILE_SIZE + ty) * N + col];
        else
            sharedB[ty][tx] = 0.0f;

        __syncthreads();

        // Accumulate the partial product for this tile from shared memory.
        for (int k = 0; k < TILE_SIZE; k++)
        {
            sum += sharedA[ty][k] * sharedB[k][tx];
        }

        __syncthreads();
    }

    if (row < N && col < N)
    {
        C[row * N + col] = sum;
    }
}
This optimized version uses shared memory to reduce global memory accesses, significantly improving performance for large matrices.
Asynchronous Operations
CUDA supports asynchronous operations, allowing you to overlap computation with data transfer. This is particularly useful in machine learning pipelines where you can prepare the next batch of data while the current batch is being processed.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Asynchronous memory transfers and kernel launches
cudaMemcpyAsync(d_data1, h_data1, size, cudaMemcpyHostToDevice, stream1);
myKernel<<<grid, block, 0, stream1>>>(d_data1, ...);

cudaMemcpyAsync(d_data2, h_data2, size, cudaMemcpyHostToDevice, stream2);
myKernel<<<grid, block, 0, stream2>>>(d_data2, ...);

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
Tensor Cores
For machine learning workloads, NVIDIA’s Tensor Cores (available in newer GPU architectures) can provide significant speedups for matrix multiply and convolution operations. Libraries like cuDNN and cuBLAS automatically leverage Tensor Cores when available.
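In application code, the usual way to reach Tensor Cores is through those libraries rather than hand-written kernels. As a minimal sketch (CUDA 11+ assumed; error checking omitted, and the matrices are placeholders allocated elsewhere), a cuBLAS mixed-precision GEMM that is eligible for Tensor Core execution looks roughly like this:

#include <cublas_v2.h>
#include <cuda_fp16.h>

// d_A, d_B, d_C are assumed device pointers to column-major FP16 matrices
// of sizes M x K, K x N, and M x N, allocated and filled elsewhere.
void gemmWithTensorCores(cublasHandle_t handle,
                         const __half *d_A, const __half *d_B, __half *d_C,
                         int M, int N, int K)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;

    // FP16 inputs/outputs with FP32 accumulation; cuBLAS selects Tensor Core
    // kernels automatically on hardware that supports them.
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 M, N, K,
                 &alpha,
                 d_A, CUDA_R_16F, M,
                 d_B, CUDA_R_16F, K,
                 &beta,
                 d_C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT);
}

The exact compute-type enums vary slightly between cuBLAS versions, but the pattern of reduced-precision storage with higher-precision accumulation is the same one frameworks like cuDNN use internally.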
Challenges and Considerations
While CUDA offers tremendous benefits for machine learning, it’s important to be aware of potential challenges:
Memory Management: GPU memory is limited compared to system memory. Efficient memory management is crucial, especially when working with large datasets or models.
Data Transfer Overhead: Transferring data between CPU and GPU can be a bottleneck. Minimize transfers and use asynchronous operations when possible; a pinned-memory sketch follows this list.
Precision: GPUs traditionally excel at single-precision (FP32) computations. While support for double-precision (FP64) has improved, it’s often slower. Many machine learning tasks can work well with lower precision (e.g., FP16), which modern GPUs handle very efficiently.
Code Complexity: Writing efficient CUDA code can be more complex than CPU code. Leveraging libraries like cuDNN, cuBLAS, and frameworks like TensorFlow or PyTorch can help abstract away some of this complexity.
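One practical way to shrink the data-transfer overhead called out above is to allocate host-side staging buffers as pinned (page-locked) memory, which speeds up copies and lets cudaMemcpyAsync genuinely overlap with kernel execution. A minimal sketch, with buffer size, stream, and the fill step as placeholders:

#include <cuda_runtime.h>

// Pinned host buffers make host<->device copies faster and allow true
// asynchronous overlap with kernels. Names and sizes here are illustrative.
void stageBatchWithPinnedMemory(float *d_batch, size_t numFloats, cudaStream_t stream)
{
    float *h_pinned = nullptr;
    size_t bytes = numFloats * sizeof(float);

    cudaMallocHost((void**)&h_pinned, bytes);   // page-locked allocation instead of malloc

    // ... fill h_pinned with the next batch on the CPU ...

    cudaMemcpyAsync(d_batch, h_pinned, bytes,   // can overlap with work in other streams
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);              // wait before reusing or freeing the buffer
    cudaFreeHost(h_pinned);
}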
As machine learning models grow in size and complexity, a single GPU may no longer be sufficient to handle the workload. CUDA makes it possible to scale your application across multiple GPUs, either within a single node or across a cluster.
CUDA Programming Structure
To effectively utilize CUDA, it’s essential to understand its programming structure, which involves writing kernels (functions that run on the GPU) and managing memory between the host (CPU) and device (GPU).
Host vs. Device Memory
In CUDA, memory is managed separately for the host and device. The following are the primary functions used for memory management:
cudaMalloc: Allocates memory on the device.
cudaMemcpy: Copies data between host and device.
cudaFree: Frees memory on the device.
Example: Summing Two Arrays
Let’s look at an example that sums two arrays using CUDA:
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void sumArraysOnGPU(float *A, float *B, float *C, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
    {
        C[idx] = A[idx] + B[idx];
    }
}

int main()
{
    int N = 1024;
    size_t bytes = N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Copy input data from host to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    sumArraysOnGPU<<<gridSize, blockSize>>>(d_A, d_B, d_C, N);

    // Copy the result back to the host
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Clean up
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
In this example, memory is allocated on both the host and device, data is transferred to the device, and the kernel is launched to perform the computation.
Conclusion
CUDA is a powerful tool for machine learning engineers looking to accelerate their models and handle larger datasets. By understanding the CUDA memory model, optimizing memory access, and leveraging multiple GPUs, you can significantly enhance the performance of your machine learning applications.
0 notes
pinktrashgoblin · 6 months
Text
”Flown across the ocean, leavin’ just a memory.”
/lyr
Mini comic ft. Some teasers towards Cuda’s backstory, shown through the eyes of his brother and nephew :]
Also took this as a chance to practice interior backgrounds without using blender models and I think they turned out good!
Also also this was just an excuse to draw Storme lmao
143 notes · View notes
kaiasky · 4 months
Text
in general i feel like i understand OS bullshit pretty well but it all goes out the window with graphics libraries. like X/wayland is a userspace process. And like the standard model is that my process says "hey X Window System. how big is my window? ok please blit this bitmap to this portion of my window" and then X is like ok, and then it does the compositing and updates the framebuffer through some kernel fd or something
but presumably isn't *actually* compositing windows anymore because what if one of those windows is 3d, in which case that'll be handled by the GPU? so it seems pretty silly to like, grab a game's framebuffer from vram, load it into userspace memory, write it back out to vram for display? presumably the window just says 'hey x window system i am using openGL please blit me every frame" and then...
wait ok i guess i understand how it must work i think. ok so naturally the GPU doesn't know what the fuck a process is. so presumably there's some kernelspace thing that provides GPU memory isolation (and maybe virtualization?) which definitely exists because i got crashes in CUDA code from oob memory access. but in the abstract there's nothing to say it can't ignore those restrictions in some cases?
and so ig the window compositor must run in like. some special elevated mode where it's allowed to query the kernel for "hey give me all of the other processes framebuffers"? or like OBS also has stuff for recording a window even if that window's occluded? so there must just be some state that can give a process the right to use other proc's gpu bufs?
the alternative is ig... some kind of way to pass framebuffers around (and part of being a X client is saying hi here's my framebuffer) . which ig if they are implemented as fd's with ioctl it'd be possible?
4 notes · View notes
buysellram · 2 hours
Text
Asking ChatGPT Questions on GPU Graphics Card
What is ChatGPT? Before jumping to our topic, we can look at how ChatGPT introduces itself: “ChatGPT is a large language model that was trained by OpenAI. It is designed to be able to generate human-like text in response to prompts given to it, and it is capable of doing so in a variety of different styles and formats. It is based on the GPT-3 (Generative Pretrained Transformer 3) architecture, which is a type of machine learning model that uses deep learning algorithms to generate text. ChatGPT is designed to be able to generate conversational text that is similar to what a human might say in response to a given prompt, making it well-suited for applications such as chatbots and natural language processing tasks.”
What do you think? This sounds like a natural conversation. In fact, ChatGPT is a new AI-driven chatbot that can answer questions and even write short essays. When we search on Google for an answer, the search engine tries to retrieve webpages with higher matching scores for the question, and we then need to check each webpage for a satisfactory answer. Since humans wrote the content of those webpages, a given page can be a good candidate answer. Unlike Google search, ChatGPT is trained on webpage content crawled from the internet. It learns the style of answering a question, and its answers can be thought of as a summary of the major answers available online. Hence, an answer can sometimes be wrong even though it reads as though someone is answering confidently. It is still too early to say it can answer questions with strong logic.
As we are a computer hardware company, let us start by "interviewing" ChatGPT with questions on GPU graphics cards. Judging from the answers, most of which sound reasonable, we believe that in the future a smart chatbot could handle online customer service.
Question: Is it a good idea to buy a used GPU?
ChatGPT: It can be a good idea to buy a used GPU, but it’s important to be cautious and do your research. Here are a few things to consider when buying a used GPU:
Condition: Make sure the GPU is in good working condition. Check for any visible damage, such as bent pins or cracks in the PCB.
Compatibility: Make sure the GPU is compatible with your computer’s motherboard and power supply. Check the manufacturer’s website for the minimum requirements for the GPU.
Performance: Check the specs of the used GPU and compare it to a new one. You may be able to find a used GPU that performs just as well as a new one, but at a lower price.
Warranty: Consider whether the used GPU comes with a warranty or return policy. Some sellers may offer a limited warranty or return policy, which can give you some peace of mind.
Overall, it’s important to do your research and be cautious when buying a used GPU. It can be a good way to save money, but it’s important to make sure you’re getting a quality product.
Q: What kinds of GPUs are good for 4K PC gaming, and what kinds of GPUs are good for AI computing?
ChatGPT: For 4K PC gaming, you’ll want a GPU with a high amount of memory and a fast clock speed. Some examples of GPUs that are good for 4K gaming include the NVIDIA GeForce RTX 3080, the AMD Radeon RX 6800 XT, and the NVIDIA GeForce RTX 3090. These GPUs offer a high level of performance and can handle the demands of 4K gaming.
For AI computing, you’ll want a GPU with a high amount of memory and a fast clock speed, as well as support for parallel processing. Some examples of GPUs that are good for AI computing include the NVIDIA Tesla V100, the NVIDIA Quadro GV100, and the NVIDIA GeForce RTX 8000. These GPUs are specifically designed for AI and machine learning applications, and offer a high level of performance for these types of tasks.
It’s worth noting that the specific GPU you choose will depend on your specific needs and budget. There are many other factors to consider, such as the amount of memory, the number of CUDA cores, and the memory bandwidth. It’s important to do your research and choose the GPU that best meets your needs.
Question: Please recommend some good GPUs for 4K PC gaming in the year 2022
ChatGPT: In 2022, some good GPUs for 4K PC gaming are likely to include:
NVIDIA GeForce RTX 3080: This GPU offers excellent performance and is capable of running most games at 4K resolution with high graphics settings. It features 10GB of memory and 8704 CUDA cores.
AMD Radeon RX 6800 XT: This GPU also offers excellent performance and is capable of running most games at 4K resolution with high graphics settings. It features 16GB of memory and 4608 stream processors.
NVIDIA GeForce RTX 3090: This is the most powerful GPU currently available, and is ideal for 4K gaming. It features 24GB of memory and 10496 CUDA cores, making it capable of running even the most demanding games at 4K resolution with high graphics settings.
It’s worth noting that the specific GPU you choose will depend on your budget and specific needs. There are many other factors to consider, such as the amount of memory, the number of CUDA cores, and the memory bandwidth. It’s important to do your research and choose the GPU that best meets your needs.
Question: How about the year of 2023?
ChatGPT: It’s difficult to predict exactly what the best GPUs for 4K PC gaming in 2023 will be, as new products are constantly being released and technology is constantly improving. However, some factors that are likely to influence the performance of GPUs in 2023 include the number of CUDA cores, the amount of memory, and the clock speed.
In general, you’ll want to look for a GPU with a high number of CUDA cores, a large amount of memory, and a fast clock speed in order to get the best performance for 4K gaming. It’s also a good idea to keep an eye on the latest GPU releases from companies like NVIDIA and AMD, as they often release new products that offer improved performance and features.
Ultimately, the best GPU for 4K gaming in 2023 will depend on your budget and specific needs. It’s important to do your research and choose the GPU that best meets your needs.
Questions: How many GPUs have been employed to train ChatGPT?
ChatGPT: I’m sorry, but I don’t have information about the specific hardware used to train ChatGPT. ChatGPT is a large language model developed by OpenAI, and it’s likely that it was trained using a combination of CPU and GPU hardware. Training large language models like ChatGPT typically requires a lot of computational resources, including a large number of CPUs and GPUs. However, I don’t have specific details about the hardware used to train ChatGPT.
Conclusion: ChatGPT will deeply affect many areas, such as Q&A websites, how-to websites, and even online search and programming. Some people have demonstrated that ChatGPT can generate a basic AI framework to finish a simple AI task. It will change education in the future too: students will be trained not to memorize knowledge, but to ask the right questions, which is a very important capability. ChatGPT can also help small businesses, since it can be employed as an online chatbot to answer questions from customers. For example, as an ITAD company, BuySellRam.com receives customer queries daily, so we may leverage an AI chatbot to help customers get a quote or solve common computer problems.
Originally published at https://www.buysellram.com on December 18, 2022.
0 notes
govindhtech · 10 days
Text
SynxFlow Project: A Smooth Migration From CUDA To SYCL
The SynxFlow Project
SynxFlow is an open-source, GPU-based hydrodynamic flood modeling software written in CUDA, C++, and Python. Data pre-processing and visualization are done in Python, while simulations are executed with CUDA. SynxFlow can simulate floods quicker than real-time with hundreds of millions of computational cells and metre-level precision on many GPUs. As open-source software with a simple Python interface, it can be linked into data science workflows for disaster risk assessments. The model has been widely utilized in research and industry, for example to assist flood early warning systems and to generate flood maps for (re)insurance firms.
SynxFlow can simulate flooding, landslide runout, and debris flow. Simulations are crucial to emergency service planning and management. A comprehensive prediction of natural disasters can reduce their social and economic costs. In addition to risk assessment and disaster preparedness, SynxFlow flood simulation can help with urban planning, environmental protection, climate change adaptation, insurance and financial planning, infrastructure design and engineering, public awareness, and education.
Issue Statement
Several variables make probabilistic flood forecasting computationally difficult:
Large dataset storage, retrieval, and management
Complex real-time data processing requires high-performance computation.
Model calibration and validation needed as real-world conditions change.
Effective integration and data transfer between hydrological, hydraulic, and meteorological models, and more.
For speedier results, a flood forecasting system must process data in parallel and offload compute-intensive operations to hardware accelerators. Thus, the SynxFlow team must use larger supercomputers to increase flood simulation scale and cut simulation time. DAWN, the UK’s newest supercomputer, employs Intel GPUs, which SynxFlow didn’t support.
These issues gave the researchers a new goal: make the SynxFlow model performance-portable and scalable on supercomputers with GPUs from multiple vendors. They had to transition the SynxFlow code from CUDA to a cross-vendor programming language in weeks, not years.
Solution Powered by oneAPI
After considering several possibilities, the SynxFlow project team chose the Intel oneAPI Base Toolkit implementation of the Unified Acceleration Foundation-backed oneAPI specification. It is built on the multiarchitecture, multi-vendor SYCL framework, supports Intel, NVIDIA, and AMD GPUs, and includes the Intel DPC++ Compatibility Tool for automated CUDA-to-SYCL code translation.
The SynxFlow code migration went smoothly: the tool automatically translated most CUDA kernels and API calls into SYCL. After auto-translation, some errors surfaced during compilation, but the migration tool's diagnostic messages and warnings made them easy to fix. Switching from NVIDIA Collective Communications Library (NCCL)-based inter-GPU communication to GPU-direct enabled Intel MPI library calls took longer because it could not be automated.
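To give a flavor of the kind of translation involved (this is a schematic example, not actual SynxFlow code), a simple CUDA update kernel and a hand-written SYCL 2020 equivalent might look like this; in practice the two versions would live in separate source files:

// CUDA version: one thread updates one cell.
__global__ void updateDepth(float *h, const float *dh, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) h[i] += dt * dh[i];
}

// Roughly equivalent SYCL 2020 version, along the lines of what the DPC++
// Compatibility Tool or a manual port would produce. q is a sycl::queue and
// the pointers are USM device allocations.
#include <sycl/sycl.hpp>
void updateDepthSycl(sycl::queue &q, float *h, const float *dh, float dt, int n)
{
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        h[i] += dt * dh[i];
    }).wait();
}

The structure of the computation is preserved; what changes is the launch syntax and the runtime objects, which is why most kernels translate mechanically while communication layers such as NCCL require manual rework.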
To summarize, there has been a promising attempt to transfer a complicated flood simulation code that was built on CUDA to SYCL, achieving both scalability and performance-portability. The conversion has been easy to handle and seamless thanks to the Intel oneAPI Base Toolkit.
Intel hosted a oneAPI Hackfest at the DiRAC HPC Research Facility
DiRAC
DiRAC is the high-performance computing facility in the United Kingdom serving the theoretical communities of particle physics, astrophysics, cosmology, solar system and planetary science, and nuclear physics.
DiRAC's three HPC services (Extreme Scaling, Memory-Intensive, and Data-Intensive) are each designed to support the distinct kinds of computational workflows required to carry out its science program. DiRAC places a strong emphasis on innovation, and all of its services are co-designed with vendor partners, technical and software engineering teams, and the research community.
Training Series on oneAPI at DiRAC Hackfest
On May 21–23, 2024, the DiRAC community hosted three half-day remote training sessions on the Intel oneAPI Base Toolkit. The training series was designed for developers and/or researchers with varying degrees of experience, ranging from novices to experts.
The cross-platform SYCL programming framework served as the foundation for the concepts taught to attendees, who were also introduced to a number of Base Kit tools and libraries that support SYCL. For instance, the Intel DPC++ Compatibility Tool automates code migration from CUDA to C++ with SYCL; the Intel oneAPI Math Kernel Library (oneMKL) optimizes math operations; the Intel oneAPI Deep Neural Network Library (oneDNN) accelerates deep learning primitives; and the Intel oneAPI DPC++ Library (oneDPL) speeds up SYCL kernels on a variety of hardware. Additionally, the training sessions covered code profiling and the use of Intel Advisor and Intel VTune Profiler, two tools included in the Base Kit for analyzing performance bottlenecks.
DiRAC Hackfest's oneAPI Hackathon
In order to complete a range of tasks, including parallelizing Fortran code on Intel GPUs, accelerating math operations like the Fast Fourier Transform (FFT) using oneMKL’s SYCL API, and resolving performance bottlenecks with the aid of Intel Advisor and Intel VTune Profiler, the participants improvised their cutting-edge projects using oneAPI tools and libraries.
The participants reported that it was easy to adjust to using oneAPI components and that the code migration process went smoothly. The teams saw a noticeable increase in workload performance with libraries like Intel MPI. Approximately 70% of the teams who took part indicated that they would be open to using oneAPI technologies to further optimize the code for their research projects. Thirty percent of the teams benchmarked their outcomes using SYCL and oneAPI, and they achieved a 100% success rate in code conversion to SYCL.
Start Programming Multiarchitecture Using SYCL and oneAPI
Investigate the SYCL framework and oneAPI toolkits now for multiarchitecture development that is accelerated! Use oneAPI to enable cross-platform parallelism in your apps and move your workloads to SYCL for high-performance heterogeneous computing.
Intel invites you to review the real-world code-migration samples in the CUDA to SYCL catalog, and to investigate the AI, HPC, and rendering solutions available in Intel's oneAPI-powered software portfolio.
Read more on govindhtech.com
0 notes
sanaavay · 13 days
Text
Features:
2nd Gen Ray Tracing Cores
3rd Gen Tensor Cores
Microsoft® DirectX® 12 Ultimate
GDDR6X Graphics Memory
NVIDIA DLSS
NVIDIA® GeForce Experience™
NVIDIA G-SYNC®
NVIDIA GPU Boost™
Game Ready Drivers
Vulkan RT API, OpenGL 4.6
DisplayPort 1.4a, HDMI 2.1
HDCP 2.3
VR Ready
NVIDIA® NVLink® (SLI Ready)
1-Click OC
All photos, specifications, and contents are used for reference only and are subject to change without notice.
Reaching another new height with superior performance in the RTX™ 30 series family, the latest GeForce RTX™ 3090 Ti EX Gamer takes a great leap from the previous RTX™ 3090 EX Gamer model, featuring a record-breaking 10,752 CUDA cores and a max board power of 450W. In addition to the boost in CUDA cores and board power, the brand-new GeForce RTX™ 3090 Ti EX Gamer graphics card arrives with an all-new power connector, fans, and cooling system, ensuring an even greater balance between performance and heat dissipation efficiency and fulfilling the needs of hardcore gamers and overclocking enthusiasts worldwide.
1-Click OC allows you to boost your graphics card with just one click! Download Xtreme Tuner now! Customize your RGB color with Xtreme Tuner, or synchronize it with the rest of your system by connecting the graphics card to the +12V RGB header of your motherboard or other RGB control system, using the included cable.
0 notes
fahrni · 28 days
Text
Saturday Morning Coffee
Good morning from Charlottesville, Virginia! ☕️
I’ve been a bit obsessed with the idea of creating a CAD package for the Mac recently. For the challenge of it is why, but it would only be doable in a decent amount of time with financial backing large enough to hire a few folks to pull it off.
There is a way to jumpstart the process. The Open Design Alliance has portable libraries for reading and writing DWG files as well as rendering and so much more. All in portable C++.
Imagine a beautiful CAD app created just for the Mac. And yes, I know many already exist. 😁
Oh, right, I have a Mac app I need to finish.
Well, let’s get to it! Enjoy the links.
NBC News
DNC 2024 highlights: Kamala Harris accepts historic nomination in speech capping Democratic convention
We have our nominee! Now, let’s push her across the finish line and get our first Madame President!
Marc Palmer • Shareshot
Today we launched Shareshot! We’ve been working on this app for almost exactly a year, and we’re so pleased to be able to finally ship it. Here’s a little backstory and behind-the-scenes for those of you into app development.
Congratulations, Marc! Shareshot is a beautiful example of iOS craftsmanship. Go give it a try!
Alex Gaynor
I am an unrepentant advocate for migrating away from memory-unsafe languages (C and C++) to memory safe languages in security-relevant contexts. Many people reply that migrating large code bases to new languages is expensive, and we’d be better off making C++ safer. This is a reasonable response, after all there’s an enormous amount of C++ in the wild.
There is an enormous amount of C and C++ in the world. Too much to simply replace. I like Alex’s pragmatism on the matter. He has some proposals to improve the language without taking it too far down the path to incompatibility.
Just this week my interest in Rust began to grow. I’ve been using Swift daily since 2014, maybe 2015, and I really love the language and its ability to leverage the compiler to fix many of the memory issues seen in C and C++, like dangling pointers, forgotten allocations, and object lifetimes. We also have Rust to provide us with a solid memory protection model and the ability to be used for high performance code that is cross platform.
Rewriting software is costly and can also cost you your company. So taking that on should probably be avoided like the plague.
What if you picked your battles? How about writing new code in Rust or Swift? Perhaps improve public access to APIs by fronting them with Rust? How about picking some code known to cause a lot of crashes in your app and rewriting just that bit?
We can use tried and true methods in C++ to improve memory safety but it requires developers to be extremely disciplined.
Simple things like filling new memory allocations with known patterns. I prefer to fill the memory with zeros. You can also do the same when you delete it.
Reference counted pointers — AKA smart pointers — help.
Modern C++ has introduced mechanisms to transfer pointer ownership, always a tough problem to handle and the problem that led to the creation of smart pointers.
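For readers who haven't played with this, here's a tiny sketch of what that ownership transfer looks like with std::unique_ptr and std::move (nothing fancy, just the idiom):

#include <memory>
#include <utility>

struct Widget { int value = 0; };

// The function takes ownership of the Widget; the caller's pointer becomes empty.
void consume(std::unique_ptr<Widget> w)
{
    w->value += 1;
}   // Widget is automatically freed here: no delete, no leak, no dangling pointer.

int main()
{
    auto w = std::make_unique<Widget>();
    consume(std::move(w));   // ownership is transferred explicitly
    // w is now null; using it here would be an obvious bug rather than a silent leak.
    return 0;
}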
Anywho, the piece is an easy read with good ideas. Go give it a gander.
Jess Weatherbed • The Verge
Many Procreate users can breathe a sigh of relief now that the popular iPad illustration app has taken a definitive stance against generative AI. “We’re not going to be introducing any generative AI into our products,” Procreate CEO James Cuda said in a video posted to X. “I don’t like what’s happening to the industry, and I don’t like what it’s doing to artists.”
I can really appreciate this stance. Artists often have a deep psychological attachment to their work and the creative process — hell — they go through to bring it to life. Taking that work, that style, and using it to train an AI to rip them off is just slimy.
Caleb Newton • Bipartisan Report
A dozen individuals who served as lawyers in Republican presidential administrations are bucking Republican presidential nominee Donald Trump and endorsing Democratic presidential pick Kamala Harris in a new letter that was publicized first at Fox News. The list includes prominent former judge J. Michael Luttig, who also served in the Reagan Administration.
Even with all of this at least half the country will vote for the Orange Man. It’s shocking, really.
Joe Brockmeier • lwn.net
The FreeBSD Project is, for the second time this year, engaging in a long-running discussion about the possibility of including Rust in its base system. The sequel to the first discussion included some work by Alan Somers to show what it might look like to use Rust code in the base tree. Support for Rust code does not appear much closer to being included in FreeBSD’s base system, but the conversation has been enlightening.
Speaking of Rust! Apparently Rust has found its way into the Linux Kernel and Microsoft has used it for Windows API development. It’s time for FreeBSD to get on board!
I wonder if Apple will push some Swift into Darwin or XNU at some point? Swift was written so it could be used for system level programming.
Carole Cadwalladr • The Guardian
Inciting rioters in Britain was a test run for Elon Musk. Just see what he plans for America
Musk has gone deep down the MAGA rabbit hole. His ketamine addled brain lives in its own world of conspiracies and white supremacy.
He’s unraveling in real time. Dumping his, often wacko, thoughts on X. He behaves more like a two year old than a man in his 50s.
Why do people still believe this man is some kind of genius? He’s a man child who throws hissy fits until he gets what he wants.
Money can’t buy happiness but it can buy politicians.
Matt Birchler • Birchtree
Why does Apple feel it’s worth trashing their relationship with creators and developers so that they can take 30% of the money I pay an up-and-coming creator who is trying to make rent in time each month? This isn’t a hypothetical, I genuinely want to know. Is the goal to turn into Microsoft, because this is how you turn into Microsoft.
Hate to say it Matt but Apple is today what Microsoft was in the 90’s. They are the 800lb gorilla in the room throwing their weight around.
I really love Apple products and their development tools and can’t see switching away from them. I just wish they’d be a bit kinder to the development community, that’s all.
Kelly Dobkin • Los Angeles Times
chef and co-owner Eric Park serves a black sesame misugaru drink that combines espresso, oat milk, the multigrain powder and gets topped with black sesame cream. It’s nutty, sweet and frothy, but not too rich thanks to the bitterness of the espresso.
Ok, now I really want to try misugaru. The one described above sounds incredible. 🤤
Alex Henderson • Raw Story
Reading through the Ohio Revised Code, Case Western Reserve University Law Professor Atiba Ellis couldn’t help looking for an alternative interpretation. Was there an error? Shoddy drafting? Because why on earth, he wondered, would a person clear that third bar, and submit documentation proving they broke the law by registering to vote?
This is just another GOP scheme to kick people off voter rolls. 🤬
Foone
ahh, another startup that burnt out trying to build some silly AI project on crap hardware. I wonder what they did? I check their URL: ahh. healthcare. great, great.
This Mastodon thread is an interesting read and a cautionary tale. Before you sell off old hardware, make sure you remove its storage or at the very least wipe the storage with a destructive reformat.
Zarar
Around 2AM this morning I had a realization that this was the most stressed I have ever been. On verge of a complete breakdown.
Ahhh, the life of a software developer. I’ve seen and experienced this stress on numerous occasions. I don’t recommend it.
Daryl Baxter • iMore
This MacBook app generated $100,000 in seven days, now Stripe won’t pay up
This is a wild story and I hope the developer is able to get paid and save his company.
1 note · View note
smnet · 2 months
Text
NVIDIA T600 4GB 4mDP GFX Graphics Card – 340K9AA
HP NVIDIA T600 4GB 4mDP GFX graphics card (340K9AA)
Model: T600
GPU Memory: 4 GB GDDR6
Memory Interface: 128-bit
Memory Bandwidth: Up to 160 GB/s
NVIDIA CUDA Cores: 640
Single-Precision Performance: Up to 1.7 TFLOPS
System Interface: PCI Express 3.0 x16
Max Power Consumption: 40 W
Thermal Solution: Active
Form Factor: 2.713 inches H x 6.137 inches L, single slot
Display Connectors: 4 x mDP 1.4 with…
0 notes
omarvektrapc12 · 3 months
Text
Model Name: Inno3D GeForce RTX 4080 Super Ichill x3 / 16GB GDDR6X
Model Number: C408S3-166XX-187049H
Brand: INNO3D

GPU Engine Specs:
CUDA Cores: 10240
Boost Clock (MHz): 2610
Base Clock (MHz): 2295

Thermal and Power Specs:
Minimum System Power Requirement (W): 750
Supplementary Power Connectors: 3x PCIe 8-pin cables (adapter in box) OR 320 W or greater PCIe Gen 5 cable

Memory Specs:
Memory Clock: 23 Gbps
Standard Memory Config: 16GB
Memory Interface: GDDR6X
Memory Interface Width: 256-bit
Memory Bandwidth (GB/sec): 736
0 notes
exeton · 4 months
Text
NVIDIA H100 vs. A100: Which GPU Reigns Supreme?
NVIDIA’s CEO, Jensen Huang, unveiled the NVIDIA H100 Tensor Core GPU at NVIDIA GTC 2022, marking a significant leap in GPU technology with the new Hopper architecture. But how does it compare to its predecessor, the NVIDIA A100, which has been a staple in deep learning? Let’s explore the advancements and differences between these two powerhouse GPUs.
NVIDIA H100: A Closer Look
The NVIDIA H100, based on the new Hopper architecture, is NVIDIA’s ninth-generation data center GPU, boasting 80 billion transistors. Marketed as “the world’s largest and most powerful accelerator,” it’s designed for large-scale AI and HPC models. Key features include:
Most Advanced Chip: The H100 is built with cutting-edge technology, making it highly efficient for complex tasks.
New Transformer Engine: Enhances network speeds by six times compared to previous versions.
Confidential Computing: Ensures secure processing of sensitive data.
2nd-Generation Secure Multi-Instance GPU (MIG): Extends capabilities by seven times over the A100.
4th-Generation NVIDIA NVLink: Connects up to 256 H100 GPUs with nine times the bandwidth.
New DPX Instructions: Accelerates dynamic programming by up to 40 times compared to CPUs and up to seven times compared to previous-generation GPUs.
NVIDIA asserts that the H100 and Hopper technology will drive future AI research, supporting massive AI models, deep recommender systems, genomics, and complex digital twins. Its enhanced AI inference capabilities cater to real-time applications like giant-scale AI models and chatbots.
NVIDIA A100: A Deep Dive
Introduced in 2020, the NVIDIA A100 Tensor Core GPU was heralded as the engine of the highest-performing elastic data centers for AI, data analytics, and HPC. Based on the Ampere architecture, it delivers up to 20 times higher performance than its predecessor. The A100's notable features include:
Multi-Instance GPU (MIG): Allows division into seven GPUs, adjusting dynamically to varying demands.
Third-Generation Tensor Core: Boosts throughput and supports a wide range of DL and HPC data types.
The A100’s ability to support cloud service providers (CSPs) during the digital transformation of 2020 and the pandemic was crucial, delivering up to seven times more GPU instances through its MIG virtualization and GPU partitioning capabilities.
Architecture Comparison NVIDIA Hopper
Named after the pioneering computer scientist Grace Hopper, the Hopper architecture significantly enhances the MIG capabilities by up to seven times compared to the previous generation. It introduces features that improve asynchronous execution, allowing memory copies to overlap with computation and reducing synchronization points. Designed to accelerate the training of Transformer models on H100 GPUs by six times, Hopper addresses the challenges of long training periods for large models while maintaining GPU performance.
NVIDIA Ampere
Described as the core of the world’s highest-performing elastic data centers, the Ampere architecture supports elastic computing at high acceleration levels. It’s built with 54 billion transistors, making it the largest 7nm chip ever created. Ampere offers L2 cache residency controls for data management, enhancing data center scalability. The third generation of NVLink® in Ampere doubles GPU-to-GPU bandwidth to 600 GB/s, facilitating large-scale application performance.
Detailed Specifications: H100 vs. A100
H100 Specifications:
8 GPCs, 72 TPCs (9 TPCs/GPC), 2 SMs/TPC, 144 SMs per full GPU
128 FP32 CUDA Cores per SM, 18432 FP32 CUDA Cores per full GPU
4 Fourth-Generation Tensor Cores per SM, 576 per full GPU
6 HBM3 or HBM2e stacks, 12 512-bit Memory Controllers
60MB L2 Cache
Fourth-Generation NVLink and PCIe Gen 5
Fabricated on TSMC’s 4N process, the H100 has 80 billion transistors and 395 billion parameters, providing up to nine times the speed of the A100. It’s noted as the first truly asynchronous GPU, extending A100’s asynchronous transfers across address spaces and growing the CUDA thread group hierarchy with a new level called the thread block cluster.
A100 Specifications:
8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
6 HBM2 stacks, 12 512-bit memory controllers
The A100 is built on the A100 Tensor Core GPU SM architecture and the third-generation NVIDIA high-speed NVLink interconnect. With 54 billion transistors, it delivers five petaflops of performance, a 20x improvement over its predecessor, Volta. The A100 also includes fine-grained structured sparsity to double the compute throughput for deep neural networks.
Conclusion
When comparing NVIDIA’s H100 and A100 GPUs, it’s clear that the H100 brings substantial improvements and new features that enhance performance and scalability for AI and HPC applications. While the A100 set a high standard in 2020, the H100 builds upon it with advanced capabilities that make it the preferred choice for cutting-edge research and large-scale model training. Whether you choose the H100 or A100 depends on your specific needs, but for those seeking the latest in GPU technology, the H100 is the definitive successor.
FAQs
1- What is the main difference between NVIDIA H100 and A100?
The main difference lies in the architecture and capabilities. The H100, based on the Hopper architecture, offers enhanced performance, scalability, and new features like the Transformer Engine and advanced DPX instructions.
2- Which GPU is better for deep learning: H100 or A100?
The H100 is better suited for deep learning due to its advanced features and higher performance metrics, making it ideal for large-scale AI models.
3- Can the H100 GPU be used for gaming?
While the H100 is primarily designed for AI and HPC tasks, it can theoretically be used for gaming, though it’s not optimized for such purposes.
4- What are the memory specifications of the H100 and A100?
The H100 includes six HBM3 or HBM2e stacks with 12 512-bit memory controllers, whereas the A100 has six HBM2 stacks with 12 512-bit memory controllers.
5- How does the H100’s NVLink compare to the A100's?
The H100 features fourth-generation NVLink, which connects up to 256 GPUs with nine times the bandwidth, significantly outperforming the A100’s third-generation NVLink.
Muhammad Hussnain Facebook | Instagram | Twitter | Linkedin | Youtube
0 notes
nzdepot · 5 months
Link
$581.52 – Leadtek NVIDIA T1000 4GB GDDR6 Professional Workstation Graphics Card
https://nzdepot.co.nz/product/leadtek-nvidia-t1000-4gb-gddr6-professional-workstation-graphics-card-2/?feed_id=150912&_unique_id=662f0e040b7d6
Features: NVIDIA T1000
– NVIDIA Turing GPU architecture
– 896 NVIDIA® CUDA® Cores
– 4GB GDDR6 Memory
– Up to 160GB/s Memory Bandwidth
– Max. Power Consumption: 50W
– Graphics Bus: PCI-E 3.0 x16
– Thermal Solution: Active
– Display Connectors: mDP 1.4 (4)
NVIDIA GPUs power the world’s most advanced desktop workstations, providing the visual computing power required by millions of professionals as part of their daily workflow. All phases of the professional workflow, from creating, editing, and viewing 2D and 3D models and video, to working with multiple applications across several displays, benefit from the power that only […]
0 notes
jcmarchi · 8 days
Text
TensorRT-LLM: A Comprehensive Guide to Optimizing Large Language Model Inference for Maximum Performance
New Post has been published on https://thedigitalinsider.com/tensorrt-llm-a-comprehensive-guide-to-optimizing-large-language-model-inference-for-maximum-performance/
As the demand for large language models (LLMs) continues to rise, ensuring fast, efficient, and scalable inference has become more crucial than ever. NVIDIA’s TensorRT-LLM steps in to address this challenge by providing a set of powerful tools and optimizations specifically designed for LLM inference. TensorRT-LLM offers an impressive array of performance improvements, such as quantization, kernel fusion, in-flight batching, and multi-GPU support. These advancements make it possible to achieve inference speeds up to 8x faster than traditional CPU-based methods, transforming the way we deploy LLMs in production.
This comprehensive guide will explore all aspects of TensorRT-LLM, from its architecture and key features to practical examples for deploying models. Whether you’re an AI engineer, software developer, or researcher, this guide will give you the knowledge to leverage TensorRT-LLM for optimizing LLM inference on NVIDIA GPUs.
Speeding Up LLM Inference with TensorRT-LLM
TensorRT-LLM delivers dramatic improvements in LLM inference performance. According to NVIDIA’s tests, applications based on TensorRT show up to 8x faster inference speeds compared to CPU-only platforms. This is a crucial advancement in real-time applications such as chatbots, recommendation systems, and autonomous systems that require quick responses.
How It Works
TensorRT-LLM speeds up inference by optimizing neural networks during deployment using techniques like:
Quantization: Reduces the precision of weights and activations, shrinking model size and improving inference speed.
Layer and Tensor Fusion: Merges operations like activation functions and matrix multiplications into a single operation.
Kernel Tuning: Selects optimal CUDA kernels for GPU computation, reducing execution time.
These optimizations ensure that your LLM models perform efficiently across a wide range of deployment platforms—from hyperscale data centers to embedded systems.
Optimizing Inference Performance with TensorRT
Built on NVIDIA’s CUDA parallel programming model, TensorRT provides highly specialized optimizations for inference on NVIDIA GPUs. By streamlining processes like quantization, kernel tuning, and fusion of tensor operations, TensorRT ensures that LLMs can run with minimal latency.
Some of the most effective techniques include:
Quantization: This reduces the numerical precision of model parameters while maintaining high accuracy, effectively speeding up inference.
Tensor Fusion: By fusing multiple operations into a single CUDA kernel, TensorRT minimizes memory overhead and increases throughput.
Kernel Auto-tuning: TensorRT automatically selects the best kernel for each operation, optimizing inference for a given GPU.
These techniques allow TensorRT-LLM to optimize inference performance for deep learning tasks such as natural language processing, recommendation engines, and real-time video analytics.
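To make the quantization idea concrete, here is a minimal, framework-agnostic sketch of symmetric per-tensor INT8 quantization of a weight tensor. It illustrates the general principle only, not TensorRT's internal implementation:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor INT8 quantization: pick a scale so the largest
// magnitude maps to 127, then round each weight to the nearest int8.
struct QuantizedTensor {
    std::vector<int8_t> values;
    float scale;                       // dequantize with: w ~ q * scale
};

QuantizedTensor quantizeInt8(const std::vector<float> &weights)
{
    float maxAbs = 0.0f;
    for (float w : weights) maxAbs = std::max(maxAbs, std::fabs(w));

    QuantizedTensor q;
    q.scale = (maxAbs > 0.0f) ? maxAbs / 127.0f : 1.0f;
    q.values.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / q.scale));
        v = std::min(127, std::max(-127, v));   // clamp to the int8 range
        q.values.push_back(static_cast<int8_t>(v));
    }
    return q;
}

Shrinking each weight from 4 bytes to 1 byte cuts memory traffic roughly fourfold, which is where much of the inference speedup comes from; production toolchains add calibration and per-channel scales on top of this basic scheme.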
Accelerating AI Workloads with TensorRT
TensorRT accelerates deep learning workloads by incorporating precision optimizations such as INT8 and FP16. These reduced-precision formats allow for significantly faster inference while maintaining accuracy. This is particularly valuable in real-time applications where low latency is a critical requirement.
INT8 and FP16 optimizations are particularly effective in:
Video Streaming: AI-based video processing tasks, like object detection, benefit from these optimizations by reducing the time taken to process frames.
Recommendation Systems: By accelerating inference for models that process large amounts of user data, TensorRT enables real-time personalization at scale.
Natural Language Processing (NLP): TensorRT improves the speed of NLP tasks like text generation, translation, and summarization, making them suitable for real-time applications.
Deploy, Run, and Scale with NVIDIA Triton
Once your model has been optimized with TensorRT-LLM, you can easily deploy, run, and scale it using NVIDIA Triton Inference Server. Triton is open-source inference-serving software that supports dynamic batching, model ensembles, and high throughput. It provides a flexible environment for managing AI models at scale.
Some of the key features include:
Concurrent Model Execution: Run multiple models simultaneously, maximizing GPU utilization.
Dynamic Batching: Combines multiple inference requests into one batch, reducing latency and increasing throughput.
Streaming Audio/Video Inputs: Supports input streams in real-time applications, such as live video analytics or speech-to-text services.
This makes Triton a valuable tool for deploying TensorRT-LLM optimized models in production environments, ensuring high scalability and efficiency.
Core Features of TensorRT-LLM for LLM Inference
Open Source Python API
TensorRT-LLM provides a highly modular and open-source Python API, simplifying the process of defining, optimizing, and executing LLMs. The API enables developers to create custom LLMs or modify pre-built ones to suit their needs, without requiring in-depth knowledge of CUDA or deep learning frameworks.
In-Flight Batching and Paged Attention
One of the standout features of TensorRT-LLM is In-Flight Batching, which optimizes text generation by processing multiple requests concurrently. This feature minimizes waiting time and improves GPU utilization by dynamically batching sequences.
Additionally, Paged Attention ensures that memory usage remains low even when processing long input sequences. Instead of allocating contiguous memory for all tokens, paged attention breaks memory into “pages” that can be reused dynamically, preventing memory fragmentation and improving efficiency.
Multi-GPU and Multi-Node Inference
For larger models or more complex workloads, TensorRT-LLM supports multi-GPU and multi-node inference. This capability allows for the distribution of model computations across several GPUs or nodes, improving throughput and reducing overall inference time.
FP8 Support
With the advent of FP8 (8-bit floating point), TensorRT-LLM leverages NVIDIA’s H100 GPUs to convert model weights into this format for optimized inference. FP8 enables reduced memory consumption and faster computation, especially useful in large-scale deployments.
TensorRT-LLM Architecture and Components
Understanding the architecture of TensorRT-LLM will help you better utilize its capabilities for LLM inference. Let’s break down the key components:
Model Definition
TensorRT-LLM allows you to define LLMs using a simple Python API. The API constructs a graph representation of the model, making it easier to manage the complex layers involved in LLM architectures like GPT or BERT.
Weight Bindings
Before compiling the model, the weights (or parameters) must be bound to the network. This step ensures that the weights are embedded within the TensorRT engine, allowing for fast and efficient inference. TensorRT-LLM also allows for weight updates after compilation, adding flexibility for models that need frequent updates.
Pattern Matching and Fusion
Operation Fusion is another powerful feature of TensorRT-LLM. By fusing multiple operations (e.g., matrix multiplications with activation functions) into a single CUDA kernel, TensorRT minimizes the overhead associated with multiple kernel launches. This reduces memory transfers and speeds up inference.
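As a generic illustration of why fusion helps (a toy example, not the code TensorRT actually generates), compare launching two kernels for a bias-add followed by a ReLU with a single fused kernel:

// Unfused: two kernel launches, and 'out' makes an extra round trip through global memory.
__global__ void addBias(float *out, const float *bias, int n, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += bias[i % cols];
}

__global__ void relu(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(out[i], 0.0f);
}

// Fused: one launch, and each element is read and written exactly once.
__global__ void addBiasRelu(float *out, const float *bias, int n, int cols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(out[i] + bias[i % cols], 0.0f);
}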
Plugins
To extend TensorRT’s capabilities, developers can write plugins—custom kernels that perform specific tasks like optimizing multi-head attention blocks. For instance, the Flash-Attention plugin significantly improves the performance of LLM attention layers.
Benchmarks: TensorRT-LLM Performance Gains
TensorRT-LLM demonstrates significant performance gains for LLM inference across various GPUs. Here’s a comparison of inference speed (measured in tokens per second) using TensorRT-LLM across different NVIDIA GPUs:
Model         Precision   Input/Output Length   H100 (80GB)   A100 (80GB)   L40S
GPTJ 6B       FP8         128/128               34,955        11,206        6,998
GPTJ 6B       FP8         2048/128              2,800         1,354         747
LLaMA v2 7B   FP8         128/128               16,985        10,725        6,121
LLaMA v3 8B   FP8         128/128               16,708        12,085        8,273
These benchmarks show that TensorRT-LLM delivers substantial improvements in performance, particularly for longer sequences.
Hands-On: Installing and Building TensorRT-LLM
Step 1: Create a Container Environment
For ease of use, TensorRT-LLM provides Docker images to create a controlled environment for building and running models.
docker build --pull --target devel --file docker/Dockerfile.multi --tag tensorrt_llm/devel:latest .
Step 2: Run the Container
Run the development container with access to NVIDIA GPUs:
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --volume $PWD:/code/tensorrt_llm --workdir /code/tensorrt_llm tensorrt_llm/devel:latest
Step 3: Build TensorRT-LLM from Source
Inside the container, compile TensorRT-LLM with the following command:
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
This option is particularly useful when you want to avoid compatibility issues related to Python dependencies or when focusing on C++ integration in production systems. Once the build completes, you will find the compiled libraries for the C++ runtime in the cpp/build/tensorrt_llm directory, ready for integration with your C++ applications.
Step 4: Link the TensorRT-LLM C++ Runtime
When integrating TensorRT-LLM into your C++ projects, ensure that your project’s include paths point to the cpp/include directory. This contains the stable, supported API headers. The TensorRT-LLM libraries are linked as part of your C++ compilation process.
For example, your project’s CMake configuration might include:
include_directories(${TENSORRT_LLM_PATH}/cpp/include)
link_directories(${TENSORRT_LLM_PATH}/cpp/build/tensorrt_llm)
target_link_libraries(your_project tensorrt_llm)
This integration allows you to take advantage of the TensorRT-LLM optimizations in your custom C++ projects, ensuring efficient inference even in low-level or high-performance environments.
Advanced TensorRT-LLM Features
TensorRT-LLM is more than just an optimization library; it includes several advanced features that help tackle large-scale LLM deployments. Below, we explore some of these features in detail:
1. In-Flight Batching
Traditional batching involves waiting until a batch is fully collected before processing, which can cause delays. In-Flight Batching changes this by dynamically starting inference on completed requests within a batch while still collecting other requests. This improves overall throughput by minimizing idle time and enhancing GPU utilization.
This feature is particularly valuable in real-time applications, such as chatbots or voice assistants, where response time is critical.
2. Paged Attention
Paged Attention is a memory optimization technique for handling large input sequences. Instead of requiring contiguous memory for all tokens in a sequence (which can lead to memory fragmentation), Paged Attention allows the model to split key-value cache data into “pages” of memory. These pages are dynamically allocated and freed as needed, optimizing memory usage.
Paged Attention is critical for handling large sequence lengths and reducing memory overhead, particularly in generative models like GPT and LLaMA.
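Conceptually, the bookkeeping behind paged attention is a block table that maps a sequence's logical token positions onto whatever physical pages happen to be free. The sketch below is a simplified C++ illustration of that idea, with an invented page size, and is not TensorRT-LLM's actual data structure:

#include <utility>
#include <vector>

// Each page holds a fixed number of tokens' worth of KV-cache.
constexpr int kTokensPerPage = 64;   // illustrative page size

struct BlockTable {
    std::vector<int> pages;          // physical page index for each logical page

    // Translate a logical token position into (physical page, offset within page).
    std::pair<int, int> locate(int tokenPos) const {
        int logicalPage = tokenPos / kTokensPerPage;
        int offset      = tokenPos % kTokensPerPage;
        return {pages[logicalPage], offset};
    }

    // Growing a sequence just appends whichever free page the allocator hands back;
    // pages need not be contiguous, so no large up-front reservation is required.
    void appendPage(int freePhysicalPage) { pages.push_back(freePhysicalPage); }
};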
3. Custom Plugins
TensorRT-LLM allows you to extend its functionality with custom plugins. Plugins are user-defined kernels that enable specific optimizations or operations not covered by the standard TensorRT library.
For example, the Flash-Attention plugin is a well-known custom kernel that optimizes multi-head attention layers in Transformer-based models. By using this plugin, developers can achieve substantial speed-ups in attention computation—one of the most resource-intensive components of LLMs.
To integrate a custom plugin into your TensorRT-LLM model, you can write a custom CUDA kernel and register it with TensorRT. The plugin will be invoked during model execution, providing tailored performance improvements.
4. FP8 Precision on NVIDIA H100
With FP8 precision, TensorRT-LLM takes advantage of NVIDIA’s latest hardware innovations in the H100 Hopper architecture. FP8 reduces the memory footprint of LLMs by storing weights and activations in an 8-bit floating-point format, resulting in faster computation without sacrificing much accuracy. TensorRT-LLM automatically compiles models to utilize optimized FP8 kernels, further accelerating inference times.
This makes TensorRT-LLM an ideal choice for large-scale deployments requiring top-tier performance and energy efficiency.
Example: Deploying TensorRT-LLM with Triton Inference Server
For production deployments, NVIDIA’s Triton Inference Server provides a robust platform for managing models at scale. In this example, we will demonstrate how to deploy a TensorRT-LLM-optimized model using Triton.
Step 1: Set Up the Model Repository
Create a model repository for Triton, which will store your TensorRT-LLM model files. For instance, if you have compiled a GPT2 model, your directory structure might look like this:
mkdir -p model_repository/gpt2/1 cp ./trt_engine/gpt2_fp16.engine model_repository/gpt2/1/
Step 2: Create the Triton Configuration File
In the same model_repository/gpt2/ directory, create a configuration file named config.pbtxt that tells Triton how to load and run the model. Here’s a basic configuration for TensorRT-LLM:
name: "gpt2" platform: "tensorrt_llm" max_batch_size: 8 input [ name: "input_ids" data_type: TYPE_INT32 dims: [-1] ] output [ name: "logits" data_type: TYPE_FP32 dims: [-1, -1] ]
Step 3: Launch Triton Server
Use the following Docker command to launch Triton with the model repository:
docker run --rm --gpus all -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:23.05-py3 tritonserver --model-repository=/models
Step 4: Send Inference Requests to Triton
Once the Triton server is running, you can send inference requests to it using HTTP or gRPC. For example, using curl to send a request:
curl -X POST http://localhost:8000/v2/models/gpt2/infer -d '{
  "inputs": [
    {
      "name": "input_ids",
      "shape": [1, 3],
      "datatype": "INT32",
      "data": [[101, 234, 1243]]
    }
  ]
}'
Triton will process the request using the TensorRT-LLM engine and return the logits as output.
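If you prefer a Python client over raw curl, NVIDIA's tritonclient package (installed with pip install tritonclient[http]) wraps the same HTTP API. The sketch below assumes the gpt2 model name and the input_ids/logits tensor names from the config.pbtxt above; a real TensorRT-LLM backend deployment may expect additional inputs (such as a requested output length), so treat it as a starting point rather than a drop-in client.

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton HTTP endpoint started in Step 3.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input tensor; the shape must match the data being sent.
input_ids = np.array([[101, 234, 1243]], dtype=np.int32)
infer_input = httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")
infer_input.set_data_from_numpy(input_ids)

# Run inference and read back the output tensor declared in config.pbtxt.
response = client.infer(model_name="gpt2", inputs=[infer_input])
logits = response.as_numpy("logits")
print(logits.shape)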
Best Practices for Optimizing LLM Inference with TensorRT-LLM
To fully harness the power of TensorRT-LLM, it’s important to follow best practices during both model optimization and deployment. Here are some key tips:
1. Profile Your Model Before Optimization
Before applying optimizations such as quantization or kernel fusion, use NVIDIA’s profiling tools (like Nsight Systems or TensorRT Profiler) to understand the current bottlenecks in your model’s execution. This allows you to target specific areas for improvement, leading to more effective optimizations.
2. Use Mixed Precision for Optimal Performance
When optimizing models with TensorRT-LLM, using mixed precision (a combination of FP16 and FP32) offers a significant speed-up without a major loss in accuracy. For the best balance between speed and accuracy, consider using FP8 where available, especially on the H100 GPUs.
3. Leverage Paged Attention for Large Sequences
For tasks that involve long input sequences, such as document summarization or multi-turn conversations, always enable Paged Attention to optimize memory usage. This reduces memory overhead and prevents out-of-memory errors during inference.
4. Fine-tune Parallelism for Multi-GPU Setups
When deploying LLMs across multiple GPUs or nodes, it’s essential to fine-tune the settings for tensor parallelism and pipeline parallelism to match your specific workload. Properly configuring these modes can lead to significant performance improvements by distributing the computational load evenly across GPUs.
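As a mental model of what tensor parallelism does, the toy NumPy sketch below splits a weight matrix column-wise across hypothetical devices, lets each device compute its shard of the matmul, and concatenates the partial outputs (the step that becomes an all-gather across GPUs in a real deployment). TensorRT-LLM performs this partitioning for you through its tensor-parallel build settings; the sketch only illustrates the idea.

import numpy as np

def tensor_parallel_matmul(x, w, num_devices):
    shards = np.split(w, num_devices, axis=1)          # one column block per device
    partial_outputs = [x @ shard for shard in shards]  # would run on separate GPUs
    return np.concatenate(partial_outputs, axis=1)     # all-gather of the partial results

x = np.random.randn(4, 512)
w = np.random.randn(512, 2048)
assert np.allclose(tensor_parallel_matmul(x, w, num_devices=4), x @ w)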
Conclusion
TensorRT-LLM represents a paradigm shift in optimizing and deploying large language models. With its advanced features like quantization, operation fusion, FP8 precision, and multi-GPU support, TensorRT-LLM enables LLMs to run faster and more efficiently on NVIDIA GPUs. Whether you are working on real-time chat applications, recommendation systems, or large-scale language models, TensorRT-LLM provides the tools needed to push the boundaries of performance.
This guide walked you through setting up TensorRT-LLM, optimizing models with its Python API, deploying on Triton Inference Server, and applying best practices for efficient inference. With TensorRT-LLM, you can accelerate your AI workloads, reduce latency, and deliver scalable LLM solutions to production environments.
For further information, refer to the official TensorRT-LLM documentation and Triton Inference Server documentation.
viperallc · 8 months
Manli NVIDIA RTX Turbo 4090 with Blower Cooling
In the ever-evolving world of graphics cards, a new titan has emerged: the Manli NVIDIA RTX Turbo 4090 Blower Cooling. This powerhouse is not just a component; it’s a monumental leap in gaming and creative performance. Let’s dive into its specifications and features that set it apart.
Specifications:
Product Name: Manli NVIDIA RTX Turbo 4090 Blower Cooling
Model Name: M-NRTX4090G/6RHHPPP-M3530
Chipset: GeForce RTX™ 4090
Base/Boost Clock: 2235/2520MHz
CUDA® Cores: 16384
Memory: 24GB GDDR6X, 21Gbps
Memory Interface: 384-bit
Memory Bandwidth: 1008GB/s
Width: 3.5-Slot
Cooling: Heatsink with Triple Cooler
Display Output: 3 x DisplayPort, HDMI
Dimensions: 351 x 145 x 63mm
Power: 450W
Max GPU Temperature: 90℃
Packaging Size: 439.5 x 229 x 112mm
Unparalleled Performance
At the heart of this beast (model M-NRTX4090G/6RHHPPP-M3530) lies the GeForce RTX™ 4090 chipset, renowned for its supreme capabilities. The card boasts an impressive base/boost clock of 2235/2520 MHz, ensuring that it performs exceptionally under demanding scenarios.
CUDA® Cores: With a staggering 16384 NVIDIA CUDA® Cores, the RTX 4090 is built for speed and efficiency, catering to the most intense gaming sessions and demanding creative workloads.
Next-Gen Memory
Memory Specs: Equipped with 24GB of GDDR6X memory and a memory speed of 21Gbps, this graphics card is designed for ultra-high-resolution gaming and complex 3D rendering tasks. The 384-bit memory interface and a bandwidth of 1008GB/s further underscore its capabilities in handling large data sets smoothly.
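That bandwidth figure follows directly from the other two memory specs: a 384-bit interface running at 21 Gbps per pin moves 384 × 21 = 8064 gigabits per second, which works out to 1008 GB/s.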
Cutting-Edge Cooling and Design
Cooling: The card’s innovative heatsink with Triple Cooler design ensures optimal thermal performance, crucial for maintaining stability and longevity under load.
Build: It’s a 3.5-slot card, with dimensions of 351 x 145 x 63mm, signifying a robust and sturdy build quality. The design is not just functional but also aesthetically pleasing, fitting well into any high-end gaming rig.
Connectivity and Power
Display Outputs: Connectivity is versatile with 3 x DisplayPort and HDMI options, allowing for multiple monitor setups or high-resolution displays.
Power Requirements: The card draws 450W of graphics card power and operates safely up to a maximum GPU temperature of 90℃.
Beyond Fast: The NVIDIA® GeForce RTX® 4090 Experience
The Manli NVIDIA RTX Turbo 4090 is more than just fast; it’s a revolution in graphics card technology. It is powered by the NVIDIA Ada Lovelace architecture, which brings significant improvements in performance, efficiency, and AI-powered graphics.
AI and Ray Tracing Performance: With the fourth-gen Tensor Cores and third-gen RT Cores, the card offers up to 2x AI and ray tracing performance compared to previous generations. This means more realistic lighting, shadows, and reflections in games, as well as faster rendering times for creators.
Ultimate Experience for Gamers and Creators: The combination of its powerful specs and advanced architecture makes the RTX 4090 a top choice for gamers who want to experience ultra-high-performance gaming and for creators involved in detailed virtual worlds, unprecedented productivity, and innovative content creation.
Conclusion
The Manli NVIDIA RTX Turbo 4090 Blower Cooling 24GB is a testament to what modern technology can achieve in the realm of graphics cards. It’s not just an upgrade; it’s a transformation that redefines what’s possible in gaming and creative computing. For those who demand the best, the RTX 4090 is undoubtedly the ultimate choice.
awaisdilshadme-blog · 9 months
As the hype around AI PCs grows, Nvidia brings a major refresh to RTX 40 series
Looking at the spec sheet, the somewhat confusingly named 4070 Ti Super (there are four notable 4070 SKUs) has a 300 MHz higher base clock and 768 more CUDA cores than the standard 4070 Ti.
In fact, the biggest improvement is in memory. The 4070 Ti Super has 16 GB of GDDR6X memory connected to a wider 256-bit bus. An additional 4 gigabytes of memory may not sound like a big deal, but as generative AI applications become more common on devices, the extra capacity is essential for running larger and more accurate models.
govindhtech · 18 days
Intel Data Center GPU SqueezeLLM Inference With SYCLomatic
Enable SqueezeLLM for efficient LLM inference on the Intel Data Center GPU Max Series by using SYCLomatic to convert CUDA kernels to SYCL.
In brief
Researchers at the University of California, Berkeley, have devised a revolutionary quantization technique called SqueezeLLM, which enables accurate and efficient generative LLM inference. Cross-platform compatibility, however, requires unique kernel implementations and hence more implementation work.
By using the SYCLomatic tool from the Intel oneAPI Base Toolkit for CUDA-to-SYCL migration, they immediately achieved a 2.0x speedup on Intel Data Center GPUs with 4-bit quantization, without any manual tuning. As a result, cross-platform compatibility can be provided with little extra engineering effort to adapt the kernel implementations to different hardware back ends.
SqueezeLLM: Accurate and Efficient Low-Precision Quantization for LLM Inference
LLM inference is becoming a common workload because it enables so many applications, but it is resource-intensive and needs powerful hardware to run. Furthermore, while earlier machine learning workloads were mostly compute-bound, generative LLM inference produces output tokens sequentially and therefore has very little data reuse. Low-precision quantization is one way to cut latency and memory use, but quantizing LLMs to very low precision (below 4 bits, for example) without an unacceptable loss of accuracy is difficult.
SqueezeLLM is a tool that UC Berkeley researchers created to enable accurate and efficient low-precision quantization. Two key advances allow SqueezeLLM to overcome the shortcomings of previous approaches. The first is sensitivity-weighted non-uniform quantization, which uses per-parameter sensitivity to decide where to place the quantization codebook values, thereby maintaining model accuracy.
This addresses the inefficient representation of the underlying parameter distribution that uniform quantization suffers from. The second is dense-and-sparse quantization, which handles the extreme outliers found in LLM parameters by preserving the outlier values in a compact sparse format, so that the remaining parameters can be quantized to low precision.
SqueezeLLM uses non-uniform quantization to represent the LLM weights as faithfully as possible with very few bits. When generating the non-uniform codebooks, the technique considers not only the magnitude of the values but also how sensitive each parameter is to quantization error, which yields strong accuracy at low precision.
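To illustrate the idea, here is a minimal NumPy sketch of fitting a non-uniform codebook with a sensitivity-weighted k-means (Lloyd's algorithm), so that values the model is most sensitive to pull the centroids toward themselves. The sensitivity scores are random stand-ins and the code is only a conceptual illustration, not SqueezeLLM's implementation, which derives its weights from second-order information.

import numpy as np

def weighted_kmeans_codebook(values, sensitivities, num_levels=16, iters=20):
    # Start the centroids at evenly spaced quantiles of the weight values.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, num_levels))
    for _ in range(iters):
        # Assign every weight to its nearest centroid ...
        codes = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        # ... then move each centroid to the sensitivity-weighted mean of its cluster.
        for k in range(num_levels):
            mask = codes == k
            if mask.any():
                centroids[k] = np.average(values[mask], weights=sensitivities[mask])
    return centroids, codes

w = np.random.randn(4096).astype(np.float32)   # one layer's weights, flattened
s = np.abs(np.random.randn(4096)) + 1e-3       # stand-in sensitivity scores
codebook, codes = weighted_kmeans_codebook(w, s, num_levels=16)  # 16 levels = 4 bits
dequantized = codebook[codes]
print("mean squared quantization error:", float(np.mean((w - dequantized) ** 2)))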
SqueezeLLM's dense-and-sparse quantization stores a tiny fraction of outlier values at higher precision. This shrinks the numeric range that the remaining dense component must represent, enabling accurate low-precision quantization of the dense matrix.
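The decomposition itself is simple to picture: keep the few largest-magnitude weights in a sparse, full-precision structure and quantize everything else. The NumPy sketch below uses an assumed 0.5% outlier fraction and a plain coordinate list for the sparse part; SqueezeLLM's real kernels use an efficient sparse format and their own outlier criterion.

import numpy as np

def dense_and_sparse_split(w, outlier_fraction=0.005):
    # Treat the largest-magnitude 0.5% of weights as outliers.
    threshold = np.quantile(np.abs(w), 1.0 - outlier_fraction)
    rows, cols = np.nonzero(np.abs(w) > threshold)
    outlier_values = w[rows, cols].copy()  # kept in full precision (sparse format in practice)
    dense_part = w.copy()
    dense_part[rows, cols] = 0.0           # much narrower range; quantized to low precision
    return dense_part, (rows, cols, outlier_values)

w = np.random.randn(1024, 1024).astype(np.float32)
dense, (rows, cols, outliers) = dense_and_sparse_split(w)
print(len(outliers), "outliers kept; dense max is now",
      float(np.abs(dense).max()), "vs", float(np.abs(w).max()), "originally")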
The challenge: providing cross-platform support for low-precision LLM quantization
SqueezeLLM's approach delivers a considerable latency reduction compared with baseline FP16 inference, together with efficient and accurate low-precision quantization that minimizes memory use during inference. The goal was to make these techniques available across platforms, so that LLM inference could also be accelerated on systems such as Intel Data Center GPUs.
SqueezeLLM, however, depends on handcrafted custom kernels: non-uniform quantization to represent weights accurately with extremely few bits per parameter, and dense-and-sparse quantization to tackle the outlier problem in LLM inference.
Even though these kernel implementations are fairly simple, manually converting and optimizing them for every target hardware architecture is not practical. Because the original CUDA kernels took weeks to write, profile, and optimize, the team expected a similarly large overhead to get their SqueezeLLM kernels running on Intel Data Center GPUs.
Therefore, in order to target Intel Data Center GPUs, they needed a way to rapidly and simply migrate their own CUDA kernels to SYCL. To prevent interfering with the remainder of the inference pipeline, this calls for the ability to convert the kernels with little human labor and the ability to more easily modify the Python-level code to use the custom kernels. They also wanted the ported kernels to be as efficient as possible so that Intel customers could benefit fully from SqueezeLLM‘s efficiency.
SYCLomatic
SYCLomatic offers a way to provide cross-platform compatibility without requiring extra technical work. The effective kernel techniques may be separated from the target deployment platform by using SYCLomatic’s CUDA-to-SYCL code conversion. This allows for inference on several target architectures with little extra engineering work.
Their performance investigation shows that the SYCLomatic-ported kernels achieve a 2.0x speedup on Intel Data Center GPUs running the Llama 7B model, and instantly improve efficiency without the need for human tweaking.
CUDA to SYCL
Solution: A SYCLomatic-Powered CUDA-to-SYCL Migration for Quantized LLMs on Multiple Platforms.
First Conversion
The SYCLomatic conversion was carried out in a development environment with the Intel oneAPI Base Toolkit installed. The kernel was migrated to SYCL with the SYCLomatic conversion command dpct quant_cuda_kernel.cu. The conversion script updated the kernel implementations as needed and automatically produced correct kernel definitions. The following examples show how SYCL-compatible code was added to the kernel implementation and its invocations without manual intervention.
Change Python Bindings to Allow Custom Kernel Calling
The bindings were modified to use the PyTorch XPU C++ extension (DPCPPExtension) so that the kernel can be called from Python code. This allowed the migrated kernels to be deployed with a setup script in the deployment environment.
Original setup script installing the CUDA kernel bindings:
1. setup( name="quant_cuda", 2 .ext_modules=[ 3. cpp_extension.CUDAExtension( 4. "quant_cuda", 5. ["quant_cuda.cpp", "quant_cuda_kernel.cu"] 6. ) 7. ], 8. cmdclass={"build_ext": cpp_extension.BuildExtension}, 9. )
Modified setup script installing the SYCL kernel bindings:
setup(
    name='quant_sycl',
    ext_modules=[
        DPCPPExtension(
            'quant_sycl',
            ['quant_cuda.cpp', 'quant_cuda_kernel.dp.cpp']
        )
    ],
    cmdclass={
        'build_ext': DpcppBuildExtension
    }
)
Once the kernel bindings were installed, the converted SYCL kernels could be called from PyTorch code, allowing end-to-end inference to run on the migrated kernels. Only small changes to the existing SqueezeLLM Python code were needed to call the new bindings.
Analysis of Converted Kernels’ Performance
The ported kernel implementations were tested and benchmarked by the SqueezeLLM team using Intel Data Center GPUs made accessible via the Intel Tiber Developer Cloud. As described earlier, SYCLomatic was used to convert the inference kernels, and after that, adjustments were made to enable calling the SYCL code from the SqueezeLLM Python code.
Benchmarking the 4-bit kernels on the Intel Data Center GPU Max Series made it possible to measure the performance gains from low-precision quantization and to check whether the conversion procedure produces inference kernels efficient enough to support fast inference across platforms.
Table 1 shows the speedup and average latency for matrix-vector multiplications while using the Llama 7B model to generate 128 tokens. These findings show that substantial speedups may be achieved with the ported kernels without the need for human tweaking.
To evaluate the latency advantages of low-precision quantization achievable across different hardware back ends without changes to the SYCL code, the 4-bit kernels were benchmarked on the Intel Data Center GPU. Running the Llama 7B model without any manual adjustment, SqueezeLLM achieves a 2.0x speedup on Intel Data Center GPUs compared to baseline FP16 inference, as Table 1 illustrates.

Table 1. Llama 7B matrix-vector multiplication latency on the Intel Data Center GPU Max Series
Kernel | Latency (in seconds)
Baseline: FP16 matrix-vector multiplication | 2.584
SqueezeLLM: 4-bit (0% sparsity) | 1.296
Speedup | 2.0x
For comparison, the handwritten CUDA kernels achieve a 1.7x speedup over baseline FP16 inference on the NVIDIA A100 platform, so the ported kernels deliver an even larger relative speedup on Intel hardware. These findings demonstrate that CUDA-to-SYCL migration with SYCLomatic can achieve comparable speedups on different architectures without extra engineering work or manual kernel tuning after conversion.
In summary
LLM inference is a fundamental task for many new applications, and low-precision quantization is a crucial way to make it more efficient. SqueezeLLM uses low-precision quantization to provide accurate and efficient generative LLM inference, but the bespoke kernel implementations it relies on make cross-platform deployment harder. The SYCLomatic migration tool makes it straightforward to convert those kernels to other hardware architectures.
For instance, the SYCLomatic-migrated 4-bit SqueezeLLM kernels show a 2.0x speedup on Intel Data Center GPUs without any manual tuning. SYCL conversion thus broadens efficient LLM deployment by enabling support for many hardware platforms with little additional engineering complexity.
Read more on Govindhtech.com
sanaavay · 3 months
GALAX RTX 3070 TI SG (1-CLIP BOOSTER) 8GB GRAPHICS CARD Model