#int4
govindhtech · 5 months ago
Text
INT8 & INT4 Weight Only Quantization (WOQ) On Intel Extension for Transformers
Weight Only Quantization (WOQ)
A practical guide to quantizing large language models (LLMs). The capabilities, uses, and complexity of LLMs have all risen significantly in recent years. With an ever-increasing number of parameters, weights, and activations, LLMs have become larger and more intelligent.
However, LLMs usually have to be compressed without significantly sacrificing their performance in order to increase the number of possible deployment targets and lower the cost of inference. Large neural networks, including language models, can be made smaller using a variety of methods. Quantization is one such crucial method.
WOQ meaning
In machine learning, especially in deep learning, Weight Only Quantization (WOQ) is a technique that minimizes the size of neural network models without compromising their functionality. It entails quantizing just the neural network's weights (the parameters that define the model's behavior) into a lower-precision format (e.g., 8-bit instead of 32-bit).
This article provides an example of code that uses the Intel Extension for Transformers tool to conduct Weight Only Quantization (WOQ) on an LLM (Intel/neural-chat-7b model) for both INT8 and INT4.
How does quantization work?
INT8 Vs INT4
Quantization is the process of converting weights and/or activations from a high-precision representation, such as float32, to lower-precision data types such as float16, INT8, or INT4. Lower precision can greatly reduce the amount of memory needed.
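As a rough illustration of what this conversion involves, here is a minimal symmetric INT8 quantization sketch in PyTorch; it is a generic scheme for illustration only, not the algorithm used by any particular Intel tool.

import torch

# Generic symmetric INT8 quantization of one weight matrix (illustration only).
w = torch.randn(4096, 4096)                      # FP32 weights of a single layer

scale = w.abs().max() / 127.0                    # map the largest magnitude onto the INT8 range [-127, 127]
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# At runtime, the low-precision weights are rescaled (dequantized) before or during the matmul.
w_dequant = w_int8.float() * scale
print("max absolute rounding error:", (w - w_dequant).abs().max().item())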
While this may seem simple in principle, there are a lot of subtleties to consider, and the compute data type is the most important caveat. Because not all operations support or have a low-precision implementation, certain operations require scaling the representation back to high precision at runtime. This adds some overhead, but its effects can be lessened by using tools like Intel Neural Compressor, the OpenVINO toolkit, and Neural Speed.
Because these runtimes include optimized implementations of several operators for low-precision data types, upscaling values to high precision is not necessary, resulting in improved speed and reduced memory use. If lower-precision data types are supported by your hardware, the performance improvements are substantial. For instance, support for float16 and bfloat16 is built into 4th generation Intel Xeon Scalable processors.
On its own, therefore, quantization mainly lowers the model's memory footprint, and it may even introduce some overhead during inference. Optimized runtimes and the latest hardware are required to obtain both memory and performance improvements.
What Does WOQ Mean?
There are several methods for quantizing models. Often, both model weights and activations (the output values produced by each neuron in a layer) are quantized. One of these quantization methods, called Weight Only Quantization (WOQ), quantizes only the model weights while preserving the original precision of the activations. Faster inference and a reduced memory footprint are the clear advantages. In actual use, WOQ improves performance without appreciably affecting accuracy.
Code Execution
The Intel/neural-chat-7b-v3-3 language model’s quantization procedure is shown in the provided code sample. The model, which is an improved version of Mistral-7B, is quantized using Weight Only Quantization (WOQ) methods made available by the Intel Extension for Transformers.
With only one line of code, developers can easily use the power of Intel technology for their Generative AI workloads. You import AutoModelForCausalLM from Intel Extension for Transformers rather than from the Hugging Face transformers library, and everything else stays the same.
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
For INT8 quantization, just set load_in_8bit to True.
# INT8 quantization
q8_model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True)
Similarly, for INT4 quantization, set load_in_4bit to True.

# INT4 quantization
q4_model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True)
The Hugging Face transformers library may be used in the same way for implementation.
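For completeness, a model loaded this way can then be used for generation through the usual Hugging Face tokenizer API; the prompt below is just a placeholder.

from transformers import AutoTokenizer

model_name = "Intel/neural-chat-7b-v3-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("What is weight only quantization?", return_tensors="pt")
outputs = q4_model.generate(**inputs, max_new_tokens=64)   # q4_model from the snippet above
print(tokenizer.decode(outputs[0], skip_special_tokens=True))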
If you set the device to GPU, the code snippets above will use BitsAndBytes for quantization. This gives you a significant speedup without requiring any code changes, regardless of whether you are using a CPU or a GPU.
Running a GGUF model
GGUF is a binary file format created expressly to store deep learning models such as LLMs, especially for CPU inference. It has several important benefits, such as quantization support, efficiency, and single-file deployment. The example uses the model in GGUF format in order to maximize performance on Intel hardware.
Generally, one would need an extra library like llama.cpp in order to run models in GGUF format. However, you can use the Intel Extension for Transformers library to run GGUF models, since Neural Speed is built on top of llama.cpp.

model = AutoModelForCausalLM.from_pretrained(
    model_name="TheBloke/Llama-2-7B-Chat-GGUF",
    model_file="llama-2-7b-chat.Q4_0.gguf"
)
Take a look at the code example. It demonstrates how to use Intel's AI Tools and the Intel Extension for Transformers to quantize an LLM, and how to get the most out of Intel hardware for generative AI applications.
INT4 vs INT8
Quantizing LLMs for Inference in INT4/8
Better quantization approaches are becoming more and more necessary as models grow bigger. But what exactly is quantization? Quantization represents model parameters at lower precision. For example, using float16 instead of the widely used float32 to represent model weights may cut storage needs in half.
Additionally, it improves performance by lowering the computational burden at lower precision. A drawback of quantization, however, is a small reduction in model accuracy: as precision decreases, the parameters have less representational power. In essence, quantization lets us trade accuracy for better inference performance (in terms of compute and storage).
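As a back-of-the-envelope illustration of the storage side of this trade-off (plain arithmetic for the weights only, ignoring any runtime overhead):

params = 7e9                                    # e.g. a 7-billion-parameter model
bits = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

for name, b in bits.items():
    gb = params * b / 8 / 1e9                   # bits -> bytes -> gigabytes
    print(f"{name}: ~{gb:.1f} GB of weights")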
Although there are many approaches to quantization, this sample only considers Weight Only Quantization (WOQ) strategies. Often, both model weights and activations (the output values produced by each neuron in a layer) are quantized. WOQ, however, quantizes only the model weights; activations remain unaltered. In actual use, WOQ improves performance without appreciably affecting accuracy.
The transformers library from Hugging Face makes quantization easier by offering clear options: to enable it, users just set the load_in_4bit or load_in_8bit option to True. But there is a catch: the BitsAndBytes configuration that is automatically built when these arguments are enabled works only on CUDA GPU devices. For users on CPUs or other non-CUDA devices, this presents a problem.
The Intel team created Intel Extension for Transformers (ITREX), which improves quantization support and provides further optimizations for Intel CPU/GPU architectures, in order to overcome this constraint. To use ITREX, users import AutoModelForCausalLM from the ITREX library rather than from the transformers library. This allows users, irrespective of their hardware setup, to effortlessly use quantization and other improvements.
The from_pretrained function has been extended with a quantization_config argument, which accepts different settings for performing quantization on CUDA GPUs and CPUs, including RtnConfig, AwqConfig, TeqConfig, GPTQConfig, and AutoroundConfig. What happens when you set the load_in_4bit or load_in_8bit option to True depends on how your device is configured.
If your device is set to CUDA, BitsAndBytesConfig will be used; if it is set to CPU, RtnConfig, which is specifically tailored for Intel CPUs and GPUs, will be used instead. In essence, this offers a uniform interface for Intel GPUs, CPUs, and CUDA devices, guaranteeing smooth quantization across various hardware setups.
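As a sketch of what this looks like in code, using the RtnConfig class named above (the argument names here are assumptions based on this description, so check the ITREX documentation for the exact signature):

from intel_extension_for_transformers.transformers import AutoModelForCausalLM, RtnConfig

model_name = "Intel/neural-chat-7b-v3-3"

# Round-to-nearest (RTN) weight-only quantization, targeting an Intel CPU.
woq_config = RtnConfig(bits=4)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)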
Read more on govindhtech.com
0 notes
roadandruingame · 1 year ago
Text
RaR Musings #7: Meaningful Mechanics
I saw a post this week about other people in the ttrpg design space, lamenting their years of work, and being dismissed for their project seeming like "a dnd clone". A fair concern, to be sure, but it would turn out the criticism stemmed from having a fantasy themed roleplaying game, that uses a d20 and adds proficiency, has character creation that involves classes, and spellcasting with multiple levels of spells. Others suggested there might be similarities if you use the standard stats, like STR, DEX, and INT.
So what's a guy with a fantasy themed roleplaying game that uses Xd10, adding proficiency, has a character creation engine that has classes as a minor element, and spellcasting with a mana system allowing you to cast spells at a higher level, using some basic stats, to do?
Firstly: not worry about it. Creativity is iterative, and DND has been the fantasy roleplay standard for nigh on 50 years, having affected pop culture and videogame design alike. It'd be hard NOT to have anything similar to it, and for those who have no experience outside of DND, dipping a toe outside that space can seem daring and adventurous. The writer is probably upset because they don't understand how generally meaningless their reinventing of the wheel was in terms of convincing people to play their game instead; in fact, there wasn't any mention of WHY he made the effort to design his own game in the first place. Was it distaste for existing products? Because he had vision? Or just to prove that he could do it too, a kind of intellectual parroting?
Game mechanics can't be copyrighted, so while it's not illegal to copy mechanics, there needs to be certified thought put into what those mechanics are meant to achieve, and why they may fail to do so.
As an example: both d20 games and Road and Ruin involve rolling dice to generate a random value, and then adding your proficiency as a flat number.
DND falls down here because even high proficiency, like +11 or +13, barely crests over half of the value generated by random d20, much less the more regular +1 to +6. This means a specialist, someone who has lifelong expertise at their craft, can still bungle even a basic action, giving other players a chance to perform, but completely botching the class fantasy of being a specialist, and there's no coded mechanics for varying levels of success or failure to even reward being a specialist beyond increased binary success rate. Multiple overlapping proficiencies don't have cumulative value, and outside of house rules, you can't mix and match Attribute to Proficiency, such as using Strength for Intimidation. However, the system is simplistic, and easy to understand. Not having different values for different proficiencies means only having to refer to a single number as a bonus, which makes stat scaling much more predictable, and as mentioned, giving other players the limelight means the skill monkeys won't hog it.
Road and Ruin HAD a much more 'unique' skill check system; roll your attribute (1-10) as Xd10, and your proficiency (two 0-5 proficiencies combined) determined the minimum score any dice could land. Dice were adjusted, totalled, and the sum divided by 10 to find Success Rate, with scores of 1 or higher expected. This ended up being too much adjusting and adding; it produced the ideal values, but was too slow, and not very fun, especially to do repetitively. Worse, it didn't enable 'skill' to exceed 'raw talent'; you needed a high attribute for the guaranteed 'floor' that proficiency provided to matter, and I wanted those with training to potentially exceed those without training. If INT4 rolls 4d10, and Proficiency 3 meant you couldn't get below a 3 on each, for a 'floor' of 12-40, that still meant an average ~22, regardless of if you were trained or not. Specialization 'rolled' an additional 1d10, but set it aside as an automatic 10, thus improving skill checks beyond what was possible via random dice rolls, raising both floor and ceiling by 10, but not solving the issue of speed or reliability.
So now, Road and Ruin has a Roll + Proficiency system too, except you roll Xd10 (1-5), and Proficiency is two scores (0-5 each), combined, and multiplied by Specialization, with a cheat-sheet of the most common Proficiency results for your character. Adding the dice, and Proficiency, before finding successes, is still slow, but faster now, and due to the multiplication of scores and specialization, your character may even automatically succeed basic tasks, without the need for a roll at all. Such skillmonkeying requires utmost devotion though, and is far better suited to an NPC assistant; but, said NPC will still be built using the same mechanics as what goes into a character, making it easier to understand and appreciate their service.
More importantly: I like it. I understand that others might not; it doesn't have the hallmarks of DND's 'gamble' economy, getting high rolls and confetti when you hit a 20, but frankly, I'm building this game for me, not for people who are satisfied with DND. Even my nine attributes are inspired by World of Darkness, though slightly redefined to suit the needs of my setting instead, and the proficiency skill list is entirely my own, designed to offer as many cases of two overlapping skills as possible. Using any attribute in the skill check, based on what you aim to affect rather than what the proficiency is most known for (using DEX and herbalism to get plant clippings, or INT and herbalism to recall plant facts, for example) is a much more direct and diverse way to handle skill checks, rather than trying to remember whether Nature in DND is Intelligence or Wisdom, and why. Rolling multiple dice instead of 1d20 helps protect against fringe rolls, making the rare cases truly rare, as well as creating a market for spells, equipment, and abilities that affect your skill checks to have meaningful use, rather than simply adding a +1.
But I'm having fun doing all this. Road and Ruin began because I was upset with DND, and over the years, I've done a lot of work, first to intentionally distance it from DND, and only later to begin to paint it in my own colors, doing what I want, not in rebellion of what I don't. Anybody looking to design their own systems should be more preoccupied with how their mechanics feel; if people think that it's too similar to an existing product, one that you intentionally avoided? Then that's tough beans for them. They don't get to define how you have fun, and at the end of the day, that's what playing, and designing, a game is all about.
12 notes · View notes
slunch · 8 months ago
Note
Hi!! Some questions for you: what are some things you’ve learned recently? What’s your favorite smell? What’s your dream vacation?
hmm, let's see...
learned recently: how to make a UML diagram, when to water a jacaranda, the fact that plant pots need drains (but don't put in rocks because it pushes up the saturation zone), existence of int4 quantization to accelerate neural nets, and the fact that my cat may have started drinking out of the toilet this week
favorite smell: easy, burning the fuck out of a corn tortilla no pan raw on the stovetop
dream vacation: this one's harder to say, i think something crazy like a writer's or artist's retreat in some remote place would be pretty fun. just disconnect from life and work for a week or month or two, do nothing but work on the thing and go for walks and eat simple meals. like a firewatch tower or something.
4 notes · View notes
ai-news · 2 months ago
Link
Generative AI systems transform how humans interact with technology, offering groundbreaking natural language processing and content generation capabilities. However, these systems pose significant risks, particularly in generating unsafe or policy- #AI #ML #Automation
0 notes
forlinx · 7 months ago
Text
Four Advantages Detailed Analysis of Forlinx Embedded FET3576-C System on Module
In order to fully meet the growing demand in the AIoT market for high-performance, high-computing-power, and low-power main controllers, Forlinx Embedded has recently launched the FET3576-C System on Module, designed based on the Rockchip RK3576 processor. It features excellent image and video processing capabilities, a rich array of interfaces and expansion options, low power consumption, and a wide range of application scenarios. This article delves into the distinctive benefits of the Forlinx Embedded FET3576-C SoM from four key aspects.
Advantages: 6TOPS computing power NPU, enabling AI applications
The Forlinx Embedded FET3576-C SoM has a built-in NPU with 6 TOPS of computing power and excellent deep learning processing capability. It supports INT4/INT8/INT16/FP16/BF16/TF32 operations. The dual NPU cores can work together or independently, so computational resources can be allocated flexibly when dealing with complex deep learning tasks, and the NPU maintains high efficiency and stability when handling multiple deep learning tasks at once.
FET3576-C SoM also supports TensorFlow, Caffe, Tflite, Pytorch, Onnx NN, Android NN and other deep learning frameworks. Developers can easily deploy existing deep learning models to the SoM and conduct rapid development and optimization. This broad compatibility not only lowers the development threshold, but also accelerates the promotion and adoption of deep learning applications.
Advantages: Firewall achieves true hardware resource isolation
The FET3576-C SoM with RK3576 processor supports RK Firewall technology, ensuring hardware resource isolation for access management between host devices, peripherals, and memory areas.
Access Control Policy - RK Firewall allows configuring policies to control which devices or system components access hardware resources. It includes IP address filtering, port control, and specific application access permissions. Combined with the AMP system, it efficiently manages access policies for diverse systems.
Hardware Resource Mapping and Monitoring - RK Firewall maps the hardware resources in the system, including memory areas, I/O devices, and network interfaces. By monitoring access to these resources, RK Firewall can track in real-time which devices or components are attempting to access specific resources.
Access Control Decision - When a device or component attempts to access hardware resources, RK Firewall will evaluate the access against predefined access control policies. If the access request complies with the policy requirements, access will be granted; otherwise, it will be denied.
Isolation Enforcement - For hardware resources identified as requiring isolation, RK Firewall will implement isolation measures to ensure that they can only be accessed by authorized devices or components.
In summary, RK Firewall achieves effective isolation and management of hardware resources by setting access control policies, monitoring hardware resource access, performing permission checks, and implementing isolation measures. These measures not only enhance system security but also ensure system stability and reliability.
Advantages: Ultra clear display + AI intelligent repair
With its powerful multimedia processing capability, FET3576-C SoM provides users with excellent visual experience. It supports H.264/H.265 codecs for smooth HD video playback in various scenarios, while offering five display interfaces (HDMI/eDP, MIPI DSI, Parallel, EBC, DP) to ensure compatibility with diverse devices.
FET3576-C SoM notably supports triple-screen display functionality, enabling simultaneous display of different content on three screens, significantly enhancing multitasking efficiency.
In addition, its 4K @ 120Hz ultra-clear display and super-resolution function not only brings excellent picture quality enjoyment, but also intelligently repairs blurred images, improves video frame rate, and brings users a clearer and smoother visual experience.
Advantage: FlexBus new parallel bus interface
FET3576-C of Forlinx Embedded offers a wide range of connectivity and transmission options with its excellent interface design and flexible parallel bus technology. The FlexBus interface on the SoM is particularly noteworthy due to its high flexibility and scalability, allowing it to emulate irregular or standard protocols to accommodate a variety of complex communication needs.
FlexBus supports parallel transmission of 2/4/8/16bits of data, enabling a significant increase in the data transfer rate, while the clock frequency of up to 100MHz further ensures the high efficiency and stability of data transmission.
In addition to the FlexBus interface, the FET3576-C SoM integrates a variety of bus transfer interfaces, including DSMC, CAN-FD, PCIe2.1, SATA3.0, USB3.2, SAI, I2C, I3C and UART. These interfaces not only enrich the SoM's application scenarios but also enhance its compatibility with other devices and systems.
It is easy to see that, with its high-computing-power NPU, RK Firewall, powerful multimedia processing capability, and FlexBus interface, the Forlinx Embedded FET3576-C SoM will become a strong player in the field of embedded hardware. Whether you are developing edge AI applications or pursuing high-performance, high-quality hardware devices, the Forlinx Embedded FET3576-C SoM is a choice you should not miss.
Originally published at www.forlinx.net.
0 notes
jcmarchi · 11 months ago
Text
Accelerating Large Language Model Inference: Techniques for Efficient Deployment
New Post has been published on https://thedigitalinsider.com/accelerating-large-language-model-inference-techniques-for-efficient-deployment/
Large language models (LLMs) like GPT-4, LLaMA, and PaLM are pushing the boundaries of what’s possible with natural language processing. However, deploying these massive models to production environments presents significant challenges in terms of computational requirements, memory usage, latency, and cost. As LLMs continue to grow larger and more capable, optimizing their inference performance is critical for real-world applications.
In this technical deep dive, we’ll explore cutting-edge techniques for accelerating LLM inference, enabling faster response times, higher throughput, and more efficient utilization of hardware resources. We’ll cover methods ranging from numerical precision techniques and novel attention mechanisms to architectural innovations tailored explicitly for efficient text generation.
Let’s start by understanding why LLM inference is so challenging compared to traditional NLP models.
The Inference Challenge with Large Language Models
Before the advent of LLMs, natural language processing relied on smaller models focused on specific tasks like text classification, named entity recognition, and sentiment analysis. While still computationally intensive, these models could be deployed on modest hardware and followed relatively straightforward inference processes.
LLMs, on the other hand, represent a paradigm shift. These models are trained on vast datasets using billions of parameters, enabling them to perform a wide range of language tasks with remarkable proficiency. However, this power comes at a cost – dramatically increased computational demands during both training and inference.
One key challenge is the autoregressive nature of text generation with LLMs. To produce human-like text, these models predict one token (word or subword) at a time, with each new token depending on the previously generated output. This sequential dependency prevents efficient parallelization and results in computational requirements that scale polynomially with sequence length.
Additionally, LLMs often require long input sequences (prompts) to establish the necessary context for high-quality text generation. Longer input lengths demand more memory to store intermediate states and attention matrices, further straining hardware resources.
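To make this sequential dependency concrete, here is a minimal greedy decoding loop (a small model is used purely for illustration); each step feeds only the newest token plus the cached attention states back into the model, yet the steps still cannot be parallelized:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("Large language models are", return_tensors="pt").input_ids
past_key_values = None

for _ in range(20):                                    # one full forward pass per generated token
    out = model(
        input_ids if past_key_values is None else input_ids[:, -1:],
        past_key_values=past_key_values,
        use_cache=True,
    )
    past_key_values = out.past_key_values              # cached key/value states grow with the sequence
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))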
With these unique challenges, traditional optimization techniques like quantization and static computation graphs can fall short, struggling to maintain LLM performance while delivering meaningful speedups. Let’s dive into some of the key strategies tailored explicitly for accelerating LLM inference.
Numerical Precision Techniques
From 32-Bit to 16-Bit Precision
One avenue for accelerating LLM inference is to leverage reduced numerical precision for model weights and activations. Modern deep learning frameworks like PyTorch and TensorFlow typically employ 32-bit floating-point (FP32) precision by default. However, research has shown that LLMs can often maintain high accuracy even when operating at lower precisions, such as 16-bit (FP16), 8-bit integers (INT8), or even 4-bit integers (INT4).
Reducing numerical precision offers several benefits:
Reduced Memory Footprint: Lower precision representations require less memory, allowing larger models or batch sizes to fit within the same hardware constraints.
Faster Computation: Many modern CPUs and GPUs provide specialized instructions and hardware acceleration for lower precision arithmetic, enabling significant speedups.
Improved Energy Efficiency: With smaller memory requirements and faster computations, lower precision inference can translate into reduced energy consumption – a crucial advantage for edge and mobile deployments.
While powerful, numerical precision techniques do introduce some accuracy loss compared to FP32 operation. The key is carefully evaluating this trade-off between computational gains and potential performance degradation for your specific use case.
There are two main approaches to quantization with LLMs:
Post-Training Quantization (PTQ): In this method, an LLM is first trained using standard FP32 precision. After training, the model weights are quantized (converted) to a lower precision format like INT8 or INT4. PTQ is straightforward to implement but can lead to greater accuracy drops.
Quantization-Aware Training (QAT): With QAT, the quantization process is simulated during the training phase itself. This allows the model to learn to compensate for quantization errors, minimizing accuracy degradation when the final quantized model is deployed. QAT is more involved but often yields better results compared to PTQ.
For practical application, one might leverage pre-quantized models available on platforms like Hugging Face, which hosts a variety of models optimized through different quantization methods. For instance, if a model quantized using Auto-GPTQ is desired, users can easily load it using Hugging Face’s transformers library. Additionally, to quantize a model, tools like AutoGPTQ can be utilized, which integrate seamlessly with existing libraries to compress the model efficiently.
Here is an example of loading a pre-quantized Llama-2-7b model using the Hugging Face transformers library:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

And for custom quantization, one might follow these steps using the AutoGPTQ toolkit:

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "llama-2-7b-original"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = GPTQConfig(bits=4, dataset="your-dataset", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
Remember that quantization might necessitate post-quantization fine-tuning or prompt engineering to maintain model quality. For new quantization, you can contribute back to the community by pushing your quantized models to platforms like Hugging Face.
Always ensure to balance between model size, computational requirements, and performance when selecting the quantization strategy for your specific use case.
The Flash Attention Algorithm
The multi-head attention mechanism is a core component of transformer-based LLMs, enabling the model to capture long-range dependencies and contextualized representations. However, this attention operation is computationally inefficient for autoregressive text generation, as it requires recomputing many of the same values for each new token.
The Flash Attention algorithm, introduced in the FlashAttention paper, provides a more memory-efficient and parallelization-friendly approach to the attention operation. Instead of recomputing attention values for each token, Flash Attention caches and reuses intermediate key/value matrices, avoiding redundant calculations.
This optimization not only reduces computational overhead but also improves memory access patterns, leading to better utilization of GPU memory bandwidth and parallelism.
While the details of Flash Attention are quite involved, the high-level idea is to decompose the attention operation into two phases:
Prefix Sum Embedding: This phase computes and caches key/value embeddings for all input tokens, enabling efficient reuse during generation.
Causal Attention: The actual attention operation, now optimized to leverage the cached key/value embeddings from the first phase.
By separating these phases, Flash Attention can take advantage of highly parallel GPU operations, significantly accelerating the attention bottleneck in LLM inference.
Here’s a brief, conceptual illustration of implementing Flash Attention with an LLM:
from transformers import AutoModelForCausalLM
import torch
import time
from flash_attention import flash_attention

# Load an LLM like OctoCoder
model = AutoModelForCausalLM.from_pretrained("bigcode/octocoder")

# Sample system prompt that guides the model towards being a better coding assistant
system_prompt = """... (system prompt details) ..."""

# Preparing a longer input with the system prompt
long_prompt = system_prompt + "Question: Please write a function in Python that transforms bytes to Gigabytes."

# Converting the model for Flash Attention optimization
model.to_bettertransformer()

# Running the model with Flash Attention
start_time = time.time()
with torch.backends.cuda.sdp_kernel(enable_flash=True):
    result = model.generate(long_prompt, max_new_tokens=60)
print(f"Generated in {time.time() - start_time} seconds.")
While Flash Attention offers impressive performance gains, it works within the existing transformer architecture. To fully unleash the potential of accelerated LLM inference, we need to explore architectural innovations tailored specifically for this task.
Pruning LLMs
Pruning LLMs is a technique to reduce model size while maintaining functionality. It uses a data-dependent estimator for weight importance based on Hessian matrix approximations. In pruning, less important weight groups are removed, then the model is fine-tuned to recover accuracy. The LLM-Pruner package offers scripts for pruning with various strategies supported. Pruning includes discovering dependencies, estimating group contributions, and a recovery stage involving brief post-training.
Here’s a simplified Python code example demonstrating the use of LLM-Pruner for a LLaMa model:
from transformers import AutoModelForSequenceClassification
from pruning import LLMPruner

# Load pre-trained LLaMa model
model = AutoModelForSequenceClassification.from_pretrained("llama-base")

# Initialize the pruner with desired configuration
pruner = LLMPruner(
    model,
    pruning_ratio=0.25,
    block_mlp_layers=(4, 30),
    block_attention_layers=(4, 30),
    pruner_type='taylor'
)

# Execute pruning
pruned_model = pruner.prune()

# Fine-tune the pruned model
pruned_model.fine_tune(training_data)
This code sketch represents loading a pre-trained LLaMa model, setting up the pruner with specific configurations (like which layers to prune and the type of pruner), executing the pruning process, and finally, fine-tuning the pruned model.
Note that for an actual implementation, you would need to fill in details like the specific model name, paths to the data, and additional parameters for the fine-tuning process. Also, be aware that this code is a conceptual representation, and actual syntax may vary depending on the library and versions used.
Architectural Innovations for Efficient Text Generation
The transformer architecture, while highly effective for language modeling tasks, was designed as a general-purpose sequence-to-sequence model. When deploying LLMs for text generation tasks with long input contexts, researchers have found that more specialized architectures can significantly improve inference efficiency without sacrificing quality.
Here are some of the key architectural innovations enabling faster LLM inference:
Alibi: The Alibi architecture, introduced in the PAL-Instruction paper, separates the modeling of long input context from the text generation process itself. It uses a compressed representation of the input context (the “alibi”) to initialize the generation process, avoiding the need to process the full input sequence repeatedly during autoregressive generation.
Rotary Embeddings: Instead of using standard positional embeddings, the rotary embedding technique employs rotation matrices to encode positional information more efficiently. This approach has been shown to improve performance and enable processing of longer input sequences.
Multi-Query Attention (MQA): In traditional attention, each output token attends to the entire input sequence, resulting in redundant computation. MQA reformulates the attention operation to share computations across multiple output tokens, reducing overall complexity.
Multiquery attention
Grouped-Query-Attention (GQA): Building upon MQA, GQA groups output tokens into clusters and computes attention jointly for each cluster. This approach further reduces computational requirements while maintaining high-quality text generation.
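As a rough sketch of the shared-head idea behind MQA and GQA (a generic illustration, independent of any particular model's implementation): the key/value projections use fewer heads than the queries, and each K/V head is reused by a whole group of query heads, shrinking both compute and the K/V cache.

import torch

batch, seq_len, head_dim = 1, 128, 64
n_q_heads, n_kv_heads = 32, 8                  # GQA: 8 K/V heads shared by 32 query heads (MQA would use 1)

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)          # each K/V head serves a group of query heads
v = v.repeat_interleave(group, dim=1)

# Plain scaled dot-product attention (causal mask omitted for brevity).
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1) @ v
print(attn.shape)                              # (1, 32, 128, 64), with a 4x smaller K/V cache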
While still in active research and development, these architectural innovations have demonstrated impressive speedups for LLM inference tasks, especially when combined with techniques like Flash Attention and numerical precision optimization.
Real-World Deployment Considerations
Beyond the core algorithms and architectures, there are several practical considerations and trade-offs to navigate when deploying LLMs to production environments:
Hardware Acceleration: While CPUs can handle LLM inference, GPUs and other accelerators like Google’s TPUs are essential for achieving high throughput and low latency. Choosing the right hardware and optimizing memory usage is crucial.
Batching and Parallelism: To fully leverage hardware parallelism, strategies like batched inference (processing multiple inputs simultaneously) and model parallelism (distributing an LLM across multiple devices) can significantly boost throughput.
Quantization vs. Quality Trade-Off: The degree of quantization (8-bit, 4-bit, etc.) will directly impact inference speed and memory usage, but also affects output quality. This trade-off must be carefully evaluated for each use case.
Model Distillation: An alternative to quantization, model distillation techniques can compress large LLMs into smaller, more efficient student models while retaining high accuracy.
Caching and Optimized Runtimes: Optimized deep learning runtimes like NVIDIA’s TensorRT and frameworks designed for LLM serving (e.g., MosaicML’s Composable Inference Suite) can provide significant performance boosts through techniques like operator fusion, kernel optimization, and intelligent caching strategies.
The path to optimal LLM deployment often involves combining multiple techniques while carefully considering the specific requirements of your application, infrastructure constraints, and performance targets.
Conclusion
As large language models continue their rapid evolution, accelerating their inference performance is becoming increasingly crucial for enabling real-world applications and democratizing access to these powerful AI capabilities.
In this technical guide, we explored cutting-edge techniques spanning numerical precision optimization, novel attention algorithms like Flash Attention, and architectural innovations tailored for efficient text generation. While each approach offers its own advantages, the true power often lies in combining multiple strategies while navigating the intricate trade-offs between speed, memory usage, and output quality.
Looking ahead, we can expect continued research and development in this domain, fueled by the insatiable demand for more capable and accessible LLMs. From hardware acceleration and model compression to entirely new architectures, the quest for efficient LLM inference remains an exciting frontier in the world of natural language processing and artificial intelligence.
0 notes
wikimediauncommons · 1 year ago
Text
file: Mcely, kostel sv. VĂĄclava int4.jpg
0 notes
newzzwired · 2 years ago
Text
Qualcomm Snapdragon 8 Gen 2 SoC With Wi-Fi 7, INT4, Ray Tracing, and More Launched: All Details
Qualcomm unveiled Snapdragon 8 Gen 2 SoC at its annual Snapdragon Tech Summit on Wednesday. The new mobile 5G platform brings a list of upgrades over last year’s Snapdragon 8 Gen 1 SoC and is claimed to be 40 percent more power efficient than the older model. It offers real-time raytracing for gaming and supports INT4 and Wi-Fi 7. The Snapdragon 8 Gen 2 supports new image sensors like the

0 notes
scottfromappdesign · 3 years ago
Text
#4
21 y/o - he/him
1. How often do you listen to music?
Everyday. Funny enough, I don’t really know anyone who doesn’t have the same answer as me. Music is a really big part of my life and it’s been that way for years.
2. Do you use any music streaming apps? 
Just Apple Music. I’m not a big fan of the other music streaming apps.
3. Do you like to discover new music? (i.e. artists, playlists, genres) If so, what are your methods for finding new music?
No I don’t. I have a hard time discovering new music/artists so I just gave up on doing discovery on my own.
4. Some people have friends that connect over a lot of different topics and some have friends that connect on very few topics. When it comes to music, do you find it easy or hard to connect with your friends over it, and why? 
I find it easy because it’s definitely something that you can have an entire conversation about with a person when a new song or album drops, especially how it makes each other feel. Plus it helps that my friends 9 times out of 10 have the same music taste as me. 
5. Have you ever or do you currently use social media to make new friends and talk about interests you have in common? 
Yes but to an extent.
6. If you answered yes to the previous question, can you share one or more of your experiences? If you answered no to the previous question, can you please elaborate on why you don’t use social media to meet new people?
I don’t really make friends with complete strangers on social media but I will make connections if you’re a friend of a friend or met you very briefly somewhere.
7. Do you find that like counts/follower counts/leaderboards discourage you from using certain apps and/or making connections with people online or do you feel the opposite and why?
Yes and no. It does get intimidating since social media has become such a big part of daily life and “social status” but I’m trying to train myself to think that none of that matters as much as you think it is. I find my self doing social media cleanses often and it has been helping me a lot.
8. On a scale of 1-10 how likely would you be to use an app that allows you that connects to your music streaming apps, shows you randomized playlists (based on your preferences) in order to discover new music, and connect with people who have similar music tastes?
I think a 7. I think there’s a lot of potential in an app like this and if it can be fun and interesting then I would most likely give it a try.
9. With the app idea presented in the previous question, do you have any concerns about the app or features you would implemented in the app?
Having a section or a space for people to share or add  “feelings/mood” so curated playlists for people feeling any specific way could be a really fun feature to the app instead of just generic music taste categories. These sections/playlists should be highly specific like “they didn’t have my drink in Starbucks now I have no will to live” or “my boyfriend just lied to me for the third time and I need a song/playlist to motivate me to break up with him”
0 notes
govindhtech · 7 months ago
Text
Intel Extension for Transformers & PyTorch LLM Optimisation
Enhancing deep learning model performance is essential for scalability and efficiency in the rapidly changing field of  artificial intelligence. Intel has been in the forefront of creating frameworks and tools to improve  AI models’ memory efficiency and speed of execution, especially with Intel Extension for PyTorch and Intel Extension for Transformers.
Comprehending the AI Stack
There are several layers in the  AI stack, and each is essential to optimizing LLMs. The hardware layer, which consists of Intel Xeon CPUs, Intel Data Centre GPUs, Intel Arc GPUs, and Intel Gaudi AI accelerators, is fundamental.
The acceleration libraries, such as Intel oneAPI Collective Communications Library (oneCCL) and Intel oneAPI Deep Neural Network Library (oneDNN), sit above this layer and offer optimized kernels with Intel optimized instruction sets for effective processing. The highest layer is made up of resource-efficient frameworks such as PyTorch that interface with the hardware and libraries underneath to optimize model performance.
Important Optimization Methods
Optimizing operators is essential to improving LLM performance. Using advanced instruction sets such as Intel Advanced Vector Extensions (Intel AVX), Intel Advanced Matrix Extensions (Intel AMX), and Intel Xe Matrix Extensions (Intel XMX), Intel replaces the default operation kernels with highly optimized Intel oneDNN kernels. The accuracy-flexible design of this optimization supports a variety of data types, from FP32 to INT4, ensuring that applications can operate at maximum speed and precision.
Graph optimizations reduce the amount of memory accesses needed during computation, which further enhances efficiency. For example, memory access times can be reduced by combining layers (e.g., Conv+ReLU+Sum) with bandwidth-limited operations (e.g., activation functions, ReLU, or Tanh).
This method works especially well for models such as ResNet-50, where a large amount of processing time is dedicated to bandwidth-constrained tasks. Specific fusion methods, including as linear post-ops fusion and multi-head attention fusion, are used in the context of LLMs with Intel Extension for PyTorch in JIT/Torch script mode to improve performance.
Memory management is essential for maximizing LLM performance because LLMs frequently require large amounts of memory. By pre-filling key/value pairs prior to the start of autoregressive decoding and utilizing pre-allocated buffers throughout the decoding stage, the Segment KV Cache approach maximizes memory use.
This technique increases efficiency by lowering the need for on-the-fly memory changes. Similarly, the Indirect Access KV Cache manages memory efficiently by utilizing beam index history and pre-allocated buffers, which lowers the overhead related to memory access during inference.
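A generic sketch of the pre-allocation idea (not Intel's actual Indirect Access KV Cache implementation): the key/value buffers are sized once for the maximum sequence length, and each decoding step writes into the next slot instead of reallocating and concatenating tensors.

import torch

batch, n_heads, head_dim, max_len = 1, 32, 128, 2048

# Allocate the key/value buffers once, up front.
k_cache = torch.empty(batch, n_heads, max_len, head_dim)
v_cache = torch.empty(batch, n_heads, max_len, head_dim)

def decode_step(step, new_k, new_v):
    # Write this token's key/value into the pre-allocated slot; no per-step reallocation.
    k_cache[:, :, step] = new_k
    v_cache[:, :, step] = new_v
    return k_cache[:, :, : step + 1], v_cache[:, :, : step + 1]

k, v = decode_step(0, torch.randn(batch, n_heads, head_dim), torch.randn(batch, n_heads, head_dim))
print(k.shape, v.shape)                         # views into the same buffers, growing by one slot per step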
Model compression uses quantization algorithms, which successively decrease weight and activation precision from FP32 to lower-precision formats like INT8 or INT4. This reduction minimizes the size of the model, increases inference speed, and lowers the required memory bandwidth. SmoothQuant is a post-training quantization technique that shifts the quantization difficulty from activations to weights, which mitigates activation outliers and optimizes hardware utilization while preserving model accuracy.
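The core SmoothQuant idea can be sketched as a per-channel rescaling that migrates part of the activation range into the weights before quantization; this is a simplified version of the published formulation, not the Intel Neural Compressor implementation.

import torch

x = torch.randn(16, 512) * (torch.rand(512) * 10 + 0.1)   # activations with per-channel outliers
w = torch.randn(256, 512)                                  # weights of the next linear layer (out, in)
alpha = 0.5                                                # migration strength

# Per-input-channel smoothing factor: s_j = max|X_j|^alpha / max|W_:,j|^(1-alpha)
s = x.abs().amax(dim=0).pow(alpha) / w.abs().amax(dim=0).pow(1 - alpha)

x_smooth = x / s          # activations become easier to quantize
w_smooth = w * s          # weights absorb the scale, so the layer output is unchanged

print((x @ w.t() - x_smooth @ w_smooth.t()).abs().max())   # near zero: mathematically equivalent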
A big part of optimization is also played by custom operators. The goal of weight-only quantization is to increase input and output activation precision by quantizing the model’s weights alone. With minimal influence on accuracy, this technique maximizes computational performance by utilising weight-only quantization-optimized bespoke GEMM (General Matrix Multiply) kernels. Performance can be further optimized by using Explicit SIMD (ESIMD) extensions, which provide more precise control over hardware features.
Intel Extension for PyTorch
APIs for implementing these optimizations on CPU and GPU based training and inference are provided by the Intel Extension for PyTorch. You may make sure that your models are optimized to operate well on Intel hardware by making use of these APIs. To make it easier for developers to execute these optimizations, the extension comes with environment configurations and scripts that are intended to maximize hardware utilization.
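At its simplest, applying the extension to an existing PyTorch model for inference is a single call; the snippet below shows the bfloat16 path as an example, and the extension's documentation covers the options relevant to each hardware generation. For training, the same entry point can also take the optimizer alongside the model.

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")
model.eval()

# Apply Intel Extension for PyTorch optimizations (operator fusion, layout changes, bfloat16 kernels).
model = ipex.optimize(model, dtype=torch.bfloat16)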
Another essential element of Intel’s optimization approach are the Intel Gaudi  AI accelerators. Deep learning applications perform better because to the integration of PyTorch with the Intel Gaudi software suite, which effectively transfers neural network topologies onto Gaudi hardware. This integration also supports important kernel libraries and optimizations.
Intel Extension for Transformers
Several plugins for widely used pipelines, like audio processing and retrieval-augmented generation (RAG), can be integrated with Neural Chat. By integrating the required optimizations straight into the pipeline setup, it makes the deployment of optimized chatbots easier.
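A sketch of how such a pipeline is assembled, based on the NeuralChat examples shipped with Intel Extension for Transformers (exact class and argument names may differ between releases):

from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot

# Build a chatbot pipeline backed by an Intel-optimized model; plugins (audio, RAG, etc.)
# are enabled through the same configuration object.
config = PipelineConfig(model_name_or_path="Intel/neural-chat-7b-v3-3")
chatbot = build_chatbot(config)

print(chatbot.predict("Summarize weight-only quantization in one sentence."))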
Neural Speed and Distributed Inference
DeepSpeed
These optimizations are further expanded across numerous nodes or GPUs via Intel’s support for distributed inference via DeepSpeed. DeepSpeed now supports Intel GPUs thanks to the Intel Extension for DeepSpeed. It includes the following parts:
Implementation of the DeepSpeed Accelerator Interface
Implementation of DeepSpeed op builder for XPU
Code for DeepSpeed op builder kernel
With the help of oneCCL, this Intel-optimized extension distributes compute jobs well, lowering memory footprint and increasing throughput overall. Scaling  AI applications across heterogeneous computer systems requires this capacity.
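In practice, sharding a model for inference follows the standard DeepSpeed pattern; the snippet below is a sketch using the classic init_inference entry point (argument names have shifted between DeepSpeed releases, so treat them as illustrative), and the script is launched with the usual DeepSpeed or MPI launcher so each rank holds its shard.

import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Intel/neural-chat-7b-v3-3")

# Shard the model across the available devices for tensor-parallel inference.
ds_model = deepspeed.init_inference(model, mp_size=2, dtype=torch.bfloat16)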
Utilising Optimizations in Real-World Applications
It’s actually very easy to implement these optimizations using Intel’s tools, as you can use the extensions for the PyTorch and Transformers frameworks. For example, Intel Extension for Transformers improves model compression methods such as weight-only and smooth quantization right inside the well-known Transformers API. By setting the quantization parameters and using the integrated APIs, you may optimize models with ease.
In a similar vein, the Intel Extension for Transformers and PyTorch offers an adaptable framework for optimizing deep learning models other than LLMs. This update provides GPU-centric capabilities like tensor parallelism and  CPU optimizations like NUMA management and graph optimization’s to enable fine-tuning and deployment across a variety of hardware configurations.
In summary
You may significantly increase the effectiveness and performance of your AI models by utilising Intel’s extensive hardware stack, accelerated libraries, and optimized frameworks. These optimizations cut the energy and operating expenses associated with running large-scale  AI applications in addition to improving computational performance and reducing latency.
Using the getting started samples from the Intel Extension for PyTorch and Intel Extension for Transformers, you can investigate these optimizations on the Intel Tiber Developer Cloud. You can make sure your LLMs are operating at optimal performance on Intel hardware by incorporating these strategies.
Read more on govindhtech.com
0 notes
lazotoys · 4 years ago
Photo
INT-4 1981 STAR WARS KENNER MADE IN MACAO THE EMPIRE STRIKES BACK MINI-RIG. Check out my other vintage and modern figures. 22 EUROS. The toy in the photo is for sale. Free shipping in Spain (mainland); for the Canary Islands, Ceuta and Melilla, and the Balearic Islands, please ask. Check out my other items. Registered shipping. Payment by bank transfer, Bizum, or PayPal (friends and family). Any questions, just ask. #starwars #int4 #esb #starwarsvariations #vintage #vintagetoys #lazotoys #lazotoysstarwars #seconhandstarwars #starwarstoys #starwarsforsale https://www.instagram.com/p/CLjIMOvADg4/?igshid=1a1acf9kaxdi0
1 note · View note
kerink · 3 years ago
Text
wtse act 2 is gonna end up looking like homestuck act 6 with all the shit benny and i keep cramming in there
12 notes · View notes
dafukdidiwatch · 5 years ago
Text
This is probably the best thing John did, I have no idea how he can top that one. That is a beautiful bastard move.
184 notes · View notes
govindhtech · 1 year ago
Text
Marvel Edge Device: Generative AI Power Now!
Generative AI optimized for edge devices
How generative AI may be integrated into edge devices with constrained resources via pruning, quantization, and knowledge distillation
As of April 2023, one in three American adults reported using generative artificial intelligence (AI).
Do you belong to that group? A worldwide AI craze was ignited in November 2022 when OpenAI debuted ChatGPT.
Furthermore, even though the majority of generative AI applications currently run in the cloud, their workloads add hardware and operating costs there. As a result, as apps like ChatGPT and Midjourney become more widely used, the optimal way to deploy AI models is being reevaluated in light of these extra workload demands.
Since edge devices such as smartphones, laptops, and extended reality (XR) headsets have substantial on-device AI processing capabilities, moving some or all of the AI workload to these devices is one of the most promising deployment strategies. AI models must be tailored for edge devices in order to use the available AI accelerators when implementing on-device AI.
Text production, picture and video generation, enhancement, and alteration, audio creation and enhancement, and even code generation are a few examples of generative AI demands that may be implemented locally.
The text-to-image generative AI model Stable Diffusion was showcased at Mobile World Congress earlier this year running on a Snapdragon 8 Gen 2 smartphone.
It was also recently announced that large language models (LLMs) based on Meta's Llama will be brought to Snapdragon platforms in 2024. Once these neural networks are optimized, they will demand less memory and processing power, making them compatible with popular edge devices.
Although it is unlikely that the parameter growth of some generative AI applications, like ChatGPT, will outpace the performance improvements in mobile systems-on-chips (SoCs), there are currently a large number of sub-10 billion parameter generative AI models that are appropriate for on-device processing, and this number will only rise with time.
AI model optimization for on-device use
Neural network models are typically trained in a data center with excellent accuracy, while artificial intelligence (AI) models used on edge devices, or even in the cloud, trade some accuracy for computational efficiency. The aim is to find a compromise between making the model as small as feasible and maintaining a high enough accuracy level for the results to be useful in the specific use case.
The bigger the model, the more accurate the outcome is usually. Nevertheless, there are often little benefits and substantial resource costs associated with greater precision. The number of parameters in an AI model determines its size; a model with fewer parameters will often generate results faster and with less processing power.
Three methods for improving AI models
AI model optimization may be achieved using three main methods:
Quantization, pruning, and knowledge distillation.
Quantization uses lower-precision data types, such as 4-bit or 8-bit integers (INT4 or INT8), instead of the higher-precision 32-bit floating point (FP32) data type that is typically used when training the model. This reduces the bit precision that the AI model uses for the neural network's weight and activation values. Quantizing from 32 bits to 8 bits, for example, cuts the weight storage to a quarter of its original size.
The act of locating and removing unnecessary or duplicate parameters is known as pruning. Pruning can increase AI model efficiency while keeping accuracy roughly constant. Using both Bayesian compression and spatial SVD with ResNet18 as the baseline, the findings demonstrate a 3x decrease in model size with less than 1% loss in accuracy. The findings also indicate that quantization generally works better than pruning.
By using a big, trained AI model as the basis for a smaller model, knowledge distillation reduces the size of the model while retaining comparable accuracy. The smaller model is often many times smaller than the original model.
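A minimal sketch of the usual distillation objective (the generic temperature-scaled formulation, not any vendor-specific recipe): the student is trained to match the teacher's softened output distribution in addition to the ordinary task loss.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Dummy logits for an 8-example batch over 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))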
Transferring AI tasks to a gadget
It's easy to see how generative AI, or any AI application, may be transferred to an edge device like a smartphone, XR headset, or desktop PC using these three optimization strategies.
Smartphones have already shown their ability to quickly absorb functionality by using advancements in memory, computing, and sensor technologies. Mobile media players, handheld gaming consoles, point-and-shoot cameras, consumer video cameras, and prosumer digital Single Lens Reflex (dSLR) cameras have all been superseded by smartphones in less than ten years. For the last several years, 8K video has been easily captured and processed by high-end smartphones.
Well-known smartphone manufacturers already use on-device AI technologies for a number of purposes, from security and battery life to computational photography and audio improvement, and the same is true of most popular edge networking and edge device platforms. The hurdle is shrinking and refining generative AI models so that they run on these edge devices.
In addition to cutting latency, on-device processing of AI models also addresses data security and privacy, two issues that are becoming more and more important. By removing the interface with the cloud, the data and the outcomes of the generative AI can stay on the device.
Edge device optimization with generative AI is the way of the future. The burden of managing generative AI’s workloads on the cloud will increase as its usage by consumers becomes more popular. The optimum way to apply AI models is being reevaluated as a result of these increased AI workloads on the cloud.
AI models are being shrunk to make them appropriate for on-device processing using optimization approaches including quantization, pruning, and knowledge distillation. Users may enjoy lower latency, improved privacy, customization, and other on-device AI advantages by shifting AI workloads to edge devices.
Read more on Govindhtech.com
0 notes
govindhtech · 1 year ago
Text
Discover MediaTek Dimensity 9300: Redefining Efficiency
MediaTek Dimensity 9300 infographic
The New MediaTek All-Big Core Design Increases Flagship Smartphone Performance and Efficiency with the Dimensity 9300 Chipset
Supercharged SoC offers safe, smooth edge AI with on-device generative AI computation.
MediaTek Dimensity 9300
The new MediaTek Dimensity 9300 delivers key benefits with every flip of a finger. Learn how it takes generative AI from the cloud and brings it to smartphone users, while gamers will be captivated by console-grade graphics and smoother ray-traced images. And that's just the beginning! Click for the best of the rest.
MediaTek introduced its latest flagship mobile processor, the Dimensity 9300, featuring an All Big Core design. Extreme performance and MediaTek's industry-leading power efficiency deliver unparalleled gaming, video capture, and on-device generative AI processing experiences.
The Dimensity 9300 is MediaTek’s most powerful flagship chip
“The Dimensity 9300 is MediaTek’s most powerful flagship chip yet, bringing a huge boost in raw computing power to flagship smartphones with our groundbreaking All Big Core design,” said MediaTek President Joe Chen. This innovative architecture and our enhanced on-chip AI Processing Unit will usher in a new age of generative AI applications as developers push edge AI and hybrid AI computing limitations.
The Dimensity 9300 uses MediaTek’s next-generation APU 790 AI processor to boost generative AI performance and energy efficiency for faster, more secure edge computing. The APU 790 doubles integer and floating-point performance and cuts power consumption by 45%.
By adopting the Transformer model for operator acceleration, the APU 790 processes 8 times quicker than the previous generation and generates images in one second utilizing Stable Diffusion. MediaTek’s mixed-precision INT4 quantization technology and NeuroPilot memory hardware compression help optimize memory bandwidth and minimize memory needs for big AI models.
NeuroPilot Fusion can constantly execute LoRA low-rank adaptation on the APU 790, which can handle large language models with 1B, 7B, and 13B parameters with scalability up to 33B. The Dimensity 9300 will support Meta Llama 2, Baichuan 2, Baidu AI LLM, and other cutting-edge mainstream big language models in MediaTek’s vast AI ecosystem. Developers can swiftly construct multi-modal generative AI apps that supply users with text, graphics, and audio.
The MediaTek Dimensity 9300 boosts mobile gaming with Arm’s newest flagship GPU, the Arm ImmortalisG720. At the same power consumption as the Dimensity 9200, the 9300 boosts GPU performance by over 46%. The Dimensity 9300 reduces GPU power usage by 40% while maintaining performance. Users get a substantial performance boost without losing battery life.
The Dimensity 9300 chipset’s powerful octa-core CPU and MediaTek’s second-generation hardware raytracing engine produce console-level global lighting effects at 60 FPS on smartphones. Plus, its strong CPU lets users easily multi-task, streaming video or watching another movie while gaming.
The Dimensity 9300 revolutionizes mobile photography and video with a low-power AI-ISP and always-on HDR up to 4K at 60 fps. The chipset offers a 4K at 30 fps cinematic mode with real-time bokeh tracking for professional-quality bokeh improvements, plus 4K AI Noise Reduction (AI-NR) and AI processing on RAW photographs and videos. The MediaTek Dimensity 9300 will also support Android 14's Ultra HDR standard for future devices. Ultra HDR improves smartphone photography by making photographs more colorful while remaining compatible with the widely used JPEG format. The Dimensity 9300's ambient-light-adaptive HDR recovery further improves photography.
The Dimensity 9300 display system uses the chipset’s sophisticated on-device AI to distinguish principal objects and background pictures in real time. With the MiraVision Picture Quality (PQ) engine, it dynamically adjusts the contrast, sharpness, and color of the major objects, giving the picture depth and realistic video experiences like Flagship DTVs.
Since connectivity is crucial to the user experience, the MediaTek Dimensity 9300 supports Wi-Fi 7 speeds up to 6.5 Gbps and MediaTek Xtra Range Technology for longer range. The Dimensity 9300 also triples smartphone tethering speeds using MediaTek's Multi-Link Hotspot technology.
Other MediaTek Dimensity 9300 features:
Big core power: The Dimensity 9300 uses TSMC‘s third-generation 4nm technology and four Arm Cortex X4 cores up to 3.25GHz and four Cortex-A720 cores up to 2.0GHz to boost performance.
Quicker display speeds: The chipset supports WQHD at 180Hz and 4K at 120Hz for outstanding graphics and dual active display for foldable form factors.
The 5G R16 modem: supports 4CC-CA Sub-6GHz and 8CC-CA mmWave with MediaTek’s UltraSave 3.0+ technology for power efficiency.
Currently the fastest memory, MediaTek Dimensity 9300 offers LPDDR5T 9600Mbps
In addition to these user-experience advantages, the Dimensity 9300 provides excellent security for premium Android smartphones. Our privacy-focused chipset protects essential operations during boot-up and secure computing against physical assaults on data access.
The chipset supports Arm’s Memory Tagging Extension (MTE) technology, which helps developers detect memory problems before and after deployment with the Cortex-X4 and Cortex-A720 processors’ Armv9 architecture. MTE helps OEMs expedite time to market by ensuring customer safety and accelerating development.
Read more on Govindhtech.com
0 notes
dafukdidiwatch · 5 years ago
Text
I’m at page 6969, and I can’t really go “nice” because this isn’t Nice at all. I’m just waiting for Rose to hopefully revive and not be dead. This is as unnice as you can get.
49 notes · View notes