AMD Radeon RX 7900 XTX GPU boosts ChatGLM2-6B inference
HEYINTELLIGENCE used an AMD Radeon RX 7900 XTX GPU to optimize ChatGLM2-6B inference. With its open-source ROCm software, a set of drivers, tools, libraries, and APIs built for GPUs that makes GPU programming straightforward, AMD is advancing AI through an open ecosystem. Through underlying hardware design and software breakthroughs, AMD Radeon GPUs and AMD ROCm software are engineered to balance accuracy and efficiency, enabling developers to quickly create high-performance, large-model applications. More partners now have the chance to collaborate on innovations within the AMD AI ecosystem.
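As a quick illustration of how the ROCm stack slots into existing frameworks, the sketch below (an assumption about a typical developer setup, not something from the original article) checks that a ROCm build of PyTorch can see the Radeon GPU; ROCm exposes HIP devices through PyTorch's familiar torch.cuda interface.

```python
import torch

# On a ROCm build of PyTorch, HIP devices are exposed through the torch.cuda API,
# so the same code path works on an AMD Radeon RX 7900 XTX as on other GPUs.
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("hip runtime:", torch.version.hip)  # a version string on ROCm builds, None on CUDA builds
else:
    print("No ROCm/HIP device is visible to PyTorch.")
```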
GCT Technology
LLMs on AMD GPU Platforms Provided by HEYINTELLIGENCE
The operations to optimize (RMSNorm, MatMul fused with Rotary-EMB, MatMul fused with SwiGLU, and Decoding Attention) were chosen based on how much of ChatGLM2-6B's total inference compute and memory bandwidth each consumes. GCT created four optimized kernels to carry out these tasks. Because of the flexibility of HIP and the ROCm components, all four kernels are designed to deliver notable performance gains, and they compile into efficient backend instructions that make full use of the AMD GPU.
The following are the essential components of the optimized kernels (a reference sketch of the underlying operations follows this list):
RMSNorm: This technique uses the root mean square (RMS) to normalize the summed inputs to a neuron in a single layer. Avoiding unnecessary synchronization among warps is essential for good performance.
MatMul fused with Rotary-EMB: Combining matrix multiplication (MatMul) with the rotary embedding significantly reduces the launch cost of multiple kernels. To increase data reuse and improve compute performance, the kernel must be designed with the granularity of the rotary operation in mind.
MatMul fused with SwiGLU: Combining matrix multiplication with SwiGLU removes the launch cost of two separate kernels. Planning the whole optimization from the output's point of view also reduces memory-to-register load time.
Decoding Attention: Three elements are crucial to performance: optimizing the synchronization between thread warps in the SoftMax, making sensible use of shared memory, and flexibly choosing the thread processing granularity according to the computational characteristics of attention.
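For orientation, here is a minimal PyTorch sketch of the reference (unfused) computations that these four kernels accelerate. The function names, tensor shapes, and the exact rotary and SwiGLU formulations are illustrative assumptions; this is not HEYINTELLIGENCE's fused HIP kernel code, which implements the same math with fused, warp-aware kernels.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, weight, eps=1e-5):
    # Normalize each token's activations by their root mean square, then scale.
    # A fused kernel keeps this per-row reduction inside one warp to avoid
    # costly cross-warp synchronization.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps) * weight

def matmul_rotary(x, w_qk, cos, sin):
    # Project to query/key space, then apply the rotary position embedding.
    # Fusing the two steps saves a kernel launch and a round trip through memory.
    qk = x @ w_qk
    qk1, qk2 = qk.chunk(2, dim=-1)
    return torch.cat((qk1 * cos - qk2 * sin, qk2 * cos + qk1 * sin), dim=-1)

def matmul_swiglu(x, w_gate, w_up):
    # SwiGLU gating: SiLU(x @ w_gate) * (x @ w_up); the fused kernel produces
    # the gated result directly instead of launching a separate elementwise op.
    return F.silu(x @ w_gate) * (x @ w_up)

def decoding_attention(q, k_cache, v_cache):
    # Decode-phase attention: a single query token per head attends over the
    # cached keys/values (q: [heads, dim]; k_cache, v_cache: [heads, seq, dim]).
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hd,hsd->hs", q, k_cache) * scale
    probs = torch.softmax(scores, dim=-1)  # warp-level softmax in the optimized kernel
    return torch.einsum("hs,hsd->hd", probs, v_cache)
```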
These kernel optimizations have no effect on accuracy and do not depend on any particular quantization strategy, so they can be used as a stand-alone plug-in alongside different quantization algorithms to deliver further performance gains beyond GCT itself. Quantization, on the other hand, may reduce model generalization in applications that depend on accuracy, which introduces uncertain risks. GCT techniques can still be used to maximize performance in that situation, but quantization approaches must be handled carefully.
ChatGLM2-6B is an open-source large language model (LLM) focused on bilingual conversations in Chinese and English; in essence, it is a program that can carry on a conversation in both languages.
ChatGLM2-6B: Some key points
Second-generation model: it is the successor to the earlier ChatGLM-6B and improves on it in performance, efficiency, and context handling.
Better performance: compared with models of a similar size, it scores higher on benchmarks that measure conversational ability.
Extended context: it can take a longer conversation history into account (up to 8K tokens), resulting in more natural interactions.
Efficient inference: it runs faster and uses less memory than the first generation, enabling smoother interactions.
Free and open source: the code and resources are publicly available on Hugging Face (THUDM/chatglm2-6b) and GitHub (THUDM/ChatGLM2-6B), allowing for further development and exploration.
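A rough sketch of how ChatGLM2-6B is typically loaded and queried, following the usage published in the THUDM repositories (the chat() helper comes from the model's own remote code; on a ROCm build of PyTorch, the .cuda() call targets the Radeon GPU):

```python
from transformers import AutoModel, AutoTokenizer

# Load the published checkpoint; trust_remote_code pulls in the model's own
# Python implementation, which provides the chat() helper used below.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# One bilingual round trip; `history` carries the conversation context forward.
response, history = model.chat(tokenizer, "Hello! 请用中英文各介绍一下你自己。", history=[])
print(response)
```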
Correctness Counts
Quantization techniques can be employed in LLM applications to lower GPU memory consumption and increase the number of concurrent users that can be served. Aggressive quantization can drastically reduce the amount of data, but in some cases, especially in real-world LLM applications, the accuracy cost is too great.
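To make that trade-off concrete, here is a minimal sketch (an illustration only, not GCT's quantization pipeline) of symmetric per-channel int8 weight quantization: storage drops to a quarter of fp32, but rounding introduces a reconstruction error that accumulates across layers and can hurt accuracy in demanding applications.

```python
import torch

def quantize_int8(w):
    # Symmetric per-output-channel quantization: one floating-point scale per row.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.float() * scale

w = torch.randn(4096, 4096)        # fp32 weight matrix: 64 MB
q, scale = quantize_int8(w)        # int8 storage: 16 MB (plus 16 KB of scales)
err = (dequantize_int8(q, scale) - w).abs().mean()
print(f"mean absolute reconstruction error: {err.item():.5f}")
```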
Additional Optimizations
HEYINTELLIGENCE has amassed a great deal of expertise in the practical deployment of AI models on hardware platforms. GCT contains a wide range of sub-techniques, including pipeline optimizations, quantization/de-quantization kernel fusion, and LLM-serving techniques, and further improvements can be made to match client requirements. The main goal is to let the various optimization strategies work together to maximize the performance gain while minimizing accuracy loss, within the time-cost and real-scene data constraints.
In summary
The optimized implementations described above contribute to the growing community of AMD AI developers by helping highly efficient AMD AI accelerators handle complex AI workloads such as LLMs. This gives data centre users access to a full range of inference solutions that can meet high-throughput, low-latency performance requirements. By building open software platforms such as ROCm, ZenDNN, Vitis AI, and Ryzen AI software for breakthroughs on GPUs, CPUs, and adaptive SoCs, AMD is enabling more ecosystem partners and AI developers.
Read more on Govindhtech.com