#Llama.cpp
AMD Ryzen AI 300 Series Improves LM Studio And llama.cpp

Using AMD Ryzen AI 300 Series to Speed Up Llama.cpp Performance in Consumer LLM Applications.
What is Llama.cpp?
Llama.cpp should not be confused with Meta's LLaMA language model itself. Rather, it is a tool created so that LLaMA-family models can run on local hardware. Models such as LLaMA and ChatGPT are difficult to run on local computers because of their very high computational cost: although they are among the best-performing models available, they demand a large amount of processing power and resources, which makes running them locally inefficient.
Here’s where llama.cpp is useful. It offers a lightweight, resource-efficient, and lightning-fast solution for LLaMA models using C++. It even eliminates the need for a GPU.
Features of Llama.cpp
Let's examine Llama.cpp's features in more detail and see why it is such a useful complement to Meta's LLaMA language models.
Cross-Platform Compatibility
Cross-platform compatibility is valued in any field, whether it is gaming, artificial intelligence, or other kinds of software. Giving developers the flexibility to run applications on the environments and systems of their choice is always beneficial, and llama.cpp takes this seriously: it runs on Windows, Linux, and macOS and works well on any of these operating systems.
Efficient CPU Utilization
Most models, including ChatGPT and LLaMA itself, need a lot of GPU power, which makes running them costly and power-intensive. Llama.cpp turns this idea on its head: it is CPU-optimized and delivers respectable performance even without a GPU. A GPU will still produce better results, but it is remarkable that running these LLMs locally no longer requires hundreds of dollars of hardware. The fact that LLaMA could be tuned to run so effectively on CPUs is also encouraging for the future.
Memory Efficiency
Llama.cpp excels at more than CPU efficiency. By controlling the token limit and minimizing memory use, LLaMA models can run successfully even on devices with modest resources. Successful inference depends on striking a balance between memory allocation and the token limit, and this is something llama.cpp handles well.
Getting Started with Llama.cpp
The popularity of creating beginner-friendly tools, frameworks, and models is at an all-time high, and llama.cpp is no exception. Installing it and getting started are rather simple processes.
To begin, you must first clone the llama.cpp repository.

Once the repository is cloned, it's time to build the project.

When the build is finished, you can run inference with your LLaMA model. Enter a command like the following to run inference with llama.cpp:

./main -m ./models/7B/ -p "Your prompt here"

To adjust how deterministic the output is, you can experiment with inference parameters such as the temperature. The prompt is passed with the -p option, and llama.cpp takes care of the rest.
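For reference, the end-to-end sequence looks roughly like the sketch below. The repository URL matches the upstream llama.cpp project, but the build commands, the binary name (main in older releases, llama-cli in newer ones), and the model file name are assumptions that vary between versions, so treat this as an illustration rather than an exact recipe.

bash
# Clone and build llama.cpp (newer releases build with CMake; older ones shipped a Makefile)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run inference: -m points at a quantized GGUF model file (placeholder path below),
# -p supplies the prompt, and --temp adjusts how deterministic the output is.
# Depending on the build method, the binary may live under ./build/bin/ instead.
./main -m ./models/7B/model-q4_0.gguf -p "Your prompt here" --temp 0.7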
An overview of LM Studio and llama.cpp
Language models have advanced significantly since GPT-2, and users can now deploy very capable LLMs quickly and simply with user-friendly programs like LM Studio. These technologies, together with AMD hardware, make AI accessible to everyone without requiring technical or coding skills.
LM Studio is built on the llama.cpp project, a popular framework for deploying language models quickly and simply. llama.cpp has no dependencies and can be accelerated on the CPU alone, although GPU acceleration is available. LM Studio uses AVX2 instructions to accelerate modern LLMs on x86 CPUs.
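As a quick aside, on a Linux machine you can check whether your x86 CPU reports the AVX2 instruction set with a one-liner like the sketch below; on Windows, CPU information utilities expose the same flag.

bash
# Look for the avx2 flag among the CPU features the kernel reports
grep -q avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 not reported"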
Performance comparisons: throughput and latency
AMD Ryzen AI delivers leading performance for x86 laptops in llama.cpp-based applications such as LM Studio, accelerating these cutting-edge workloads. Note that memory speed has a significant impact on LLM performance in general: in this comparison, the AMD laptop had 7500 MT/s RAM while the Intel laptop had 8533 MT/s. (Image credit: AMD)
Despite the slower memory, the AMD Ryzen AI 9 HX 375 CPU outperforms its rival by up to 27% in tokens per second. Tokens per second (tk/s) measures how fast an LLM can produce tokens, which roughly corresponds to the number of words that appear on screen each second.

The AMD Ryzen AI 9 HX 375 CPU can produce up to 50.7 tokens per second in Meta Llama 3.2 1b Instruct (4-bit quantization).

Another way to benchmark language models is "time to first token," which measures the latency between submitting a prompt and the model beginning to produce tokens. Here, the AMD "Zen 5" based Ryzen AI 9 HX 375 CPU is up to 3.5 times faster than a comparable competing processor on larger models. (Image credit: AMD)
Using Variable Graphics Memory (VGM) to speed up model throughput in Windows
Each of the three accelerators in an AMD Ryzen AI CPU has workloads and situations where it performs best. The iGPU typically handles on-demand AI tasks, the NPU based on the AMD XDNA 2 architecture delivers exceptional power efficiency for persistent AI such as Copilot+ workloads, and the CPU offers broad coverage and compatibility for tools and frameworks.

LM Studio's llama.cpp port can be accelerated with the vendor-neutral Vulkan API. Here, the speedup generally depends on a combination of Vulkan driver improvements and hardware capabilities. Enabling GPU offload in LM Studio increased Meta Llama 3.2 1b Instruct performance by an average of 31% compared with CPU-only mode. For larger models such as Mistral Nemo 2407 12b Instruct, which are bandwidth-bound during the token generation phase, the average uplift was 5.1%.

With the Vulkan-based version of llama.cpp in LM Studio and GPU offload turned on, the competing processor showed significantly worse average performance than in CPU-only mode in all but one of the evaluated models. To keep the comparison fair, AMD therefore excluded the Intel Core Ultra 7 258V's GPU-offload performance on LM Studio's llama.cpp-based Vulkan back end.
Another feature of AMD Ryzen AI 300 Series CPUs is Variable Graphics Memory (VGM). In addition to the 512 MB block of memory dedicated to the iGPU, programs usually also use a second block of memory located in the "shared" portion of system RAM. With VGM, the user can increase the 512 MB "dedicated" allotment to up to 75% of the available system RAM. When this contiguous memory is available, memory-sensitive programs perform noticeably better.

Combining iGPU acceleration with VGM set to 16 GB produced an additional 22% average performance boost in Meta Llama 3.2 1b Instruct, for a net total of 60% faster average speeds compared with CPU-only mode. Even larger models, such as Mistral Nemo 2407 12b Instruct, saw gains of up to 17% over CPU-only mode.
Side by side comparison: Mistral 7b Instruct 0.3
Even though the competing laptop did not show a speedup with the Vulkan-based version of llama.cpp in LM Studio, AMD also compared iGPU performance using Intel's first-party AI Playground application (which is based on IPEX-LLM and LangChain), in order to fairly compare the best consumer-friendly LLM experience available on each platform.

The test used the Microsoft Phi 3.1 Mini Instruct and Mistral 7b Instruct v0.3 models that ship with Intel AI Playground, with the same quantization in LM Studio. The AMD Ryzen AI 9 HX 375 was 8.7% faster in Phi 3.1 and 13% faster in Mistral 7b Instruct 0.3. (Image credit: AMD)

AMD is committed to pushing the boundaries of AI and making it accessible to everyone. That cannot happen if the latest AI advances are limited to people with a high level of technical or coding expertise, which is why applications like LM Studio matter. Besides offering a quick and easy way to deploy LLMs locally, these apps let users experience state-of-the-art models almost immediately after launch (provided the architecture is supported by the llama.cpp project).

AMD Ryzen AI accelerators deliver impressive performance, and enabling features such as Variable Graphics Memory can raise performance even further for AI use cases. All of this adds up to an excellent user experience for language models on an x86 laptop.
Read more on Govindhtech.com
Llama.cpp supports Vulkan. Why doesn't Ollama?
https://github.com/ollama/ollama/pull/5059
NVIDIA AI Software Party at a Hardware Show
New Post has been published on https://thedigitalinsider.com/nvidia-ai-software-party-at-a-hardware-show/
A tremendous number of AI software releases at CES.
Created Using Midjourney
Next Week in The Sequence:
We start a new series about RAG! For the high-performance hackers, our engineering series will dive into Llama.cpp. In research, we will dive into Deliberative Alignment, one of the techniques powering OpenAI's o3. The opinion edition will debate open-ended AI methods for long-term reasoning and how far they can go.
📝 Editorial: NVIDIA AI Software Party at a Hardware Show
The name NVIDIA is immediately associated with computing hardware and, in the world of AI, with GPUs. But that is changing rapidly. In several editions of this newsletter, we have highlighted NVIDIA's fast-growing AI software stack and aspirations. This was incredibly obvious last week at CES, which is, well, mostly a hardware show!
NVIDIA unveiled not only a very clear vision for the future of AI but an overwhelming series of new products, many of which were AI software-related. Take a look for yourself.
NVIDIA NIM Microservices
NVIDIA’s NIM (NVIDIA Inference Microservices) is a significant leap forward in the integration of AI into modern software systems. Built for the new GeForce RTX 50 Series GPUs, NIM offers pre-built containers powered by NVIDIA’s inference software, including Triton Inference Server and TensorRT-LLM. These microservices enable developers to incorporate advanced AI capabilities into their applications with unprecedented ease, reducing deployment times from weeks to just minutes. With NIM, NVIDIA is effectively turning the once-daunting process of deploying AI into a seamless, efficient task—an essential advancement for industries looking to accelerate their AI adoption.
AI Blueprints
For developers seeking a head start, NVIDIA introduced AI Blueprints, open-source templates designed to streamline the creation of AI-powered solutions. These blueprints provide customizable foundations for applications like digital human generation, podcast creation, and video production. By offering pre-designed architectures, NVIDIA empowers developers to focus on innovation and customization rather than reinventing the wheel. The result? Faster iteration cycles and a smoother path from concept to deployment in AI-driven industries.
Cosmos Platform
NVIDIA’s Cosmos Platform takes AI into the realm of robotics, autonomous vehicles, and vision AI applications. By integrating advanced models with powerful video data processing pipelines, Cosmos enables AI systems to reason, plan, and act in dynamic physical environments. This platform isn’t just about data processing; it’s about equipping AI with the tools to operate intelligently in real-world scenarios. Whether it’s guiding a robot through a warehouse or enabling an autonomous vehicle to navigate complex traffic, Cosmos represents a new frontier in applied AI.
Isaac GR00T Blueprint
Robotic training just got a major upgrade with NVIDIA’s Isaac GR00T Blueprint. This innovative tool generates massive volumes of synthetic motion data using imitation learning, leveraging the capabilities of NVIDIA’s Omniverse platform. By producing millions of lifelike motions, Isaac GR00T accelerates the training process for humanoid robots, enabling them to learn complex tasks more effectively. It’s a groundbreaking approach to solving one of robotics’ biggest challenges—efficiently generating diverse, high-quality training data at scale.
DRIVE Hyperion AV Platform
NVIDIA’s DRIVE Hyperion AV Platform saw a significant evolution with the addition of the NVIDIA AGX Thor SoC. Designed to support generative AI models, this new iteration enhances functional safety and boosts the performance of autonomous driving systems. By combining cutting-edge hardware with advanced AI capabilities, Hyperion delivers a robust platform for developing the next generation of autonomous vehicles, capable of handling increasingly complex environments with confidence and precision.
AI Enterprise Software Platform
NVIDIA’s commitment to enterprise AI is reflected in its AI Enterprise Software Platform, now available on AWS Marketplace. With NIM integration, this platform equips businesses with the tools needed to deploy generative AI models and large language models (LLMs) for applications like chatbots, document summarization, and other NLP tasks. This offering streamlines the adoption of advanced AI technologies, providing organizations with a comprehensive, reliable foundation for scaling their AI initiatives.
RTX AI PC Features
At the consumer level, NVIDIA announced RTX AI PC Features, which bring AI foundation models to desktops powered by GeForce RTX 50 Series GPUs. These features are designed to support the next generation of digital content creation, delivering up to twice the inference performance of prior GPU models. By enabling FP4 computing and boosting AI workflows, RTX AI PCs are poised to redefine productivity for developers and creators, offering unparalleled performance for AI-driven tasks.
That is insane for the first week of the year! NVIDIA is really serious about its AI software aspirations. Maybe Microsoft, Google and Amazon need to get more aggressive about their GPU initiatives. Just in case…
🔎 AI Research
rStar-Math
In the paper “rStar-Math: Guiding LLM Reasoning through Self-Evolution with Process Preference Reward,” researchers from Tsinghua University, the Chinese Academy of Sciences, and Alibaba Group propose rStar-Math, a novel method for enhancing LLM reasoning abilities by employing self-evolution with a process preference reward (PPM). rStar-Math iteratively improves the reasoning capabilities of LLMs by generating high-quality step-by-step verified reasoning trajectories using a Monte Carlo Tree Search (MCTS) process.
BoxingGym
In the paper “BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery,” researchers from Stanford University introduce a new benchmark for evaluating the ability of large language models (LLMs) to perform scientific reasoning. The benchmark, called BoxingGym, consists of 10 environments drawn from various scientific domains, and the researchers found that current LLMs struggle with both experimental design and model discovery.
Cosmos World
In the paper “Cosmos World Foundation Model Platform for Physical AI,” researchers from NVIDIA introduce Cosmos World Foundation Models (WFMs). Cosmos WFMs are pre-trained models that can generate high-quality 3D-consistent videos with accurate physics, and can be fine-tuned for a wide range of Physical AI applications.
DOLPHIN
In the paper "DOLPHIN: Closed-loop Open-ended Auto-research through Thinking, Practice, and Feedback," researchers from Fudan University and the Shanghai Artificial Intelligence Laboratory propose DOLPHIN, a closed-loop, open-ended automatic research framework. DOLPHIN can generate research ideas, perform experiments, and use the experimental results to generate new research ideas.
Meta Chain-of-Thought
In the paper "Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought," researchers from SynthLabs.ai and Stanford University propose a novel framework called Meta Chain-of-Thought (Meta-CoT), which enhances traditional Chain-of-Thought by explicitly modeling the reasoning process. The researchers present empirical evidence of state-of-the-art models showing in-context search behavior, and discuss methods for training models to produce Meta-CoTs, paving the way for more powerful and human-like reasoning in AI.
LLM Test-Time Compute and Meta-RL
In a thoughtful blog post titled "Optimizing LLM Test-Time Compute Involves Solving a Meta-RL Problem," researchers from CMU explain that optimizing test-time compute in LLMs can be viewed as a meta-reinforcement learning (meta-RL) problem in which the model learns how to solve queries. The authors outline a meta-RL framework for training LLMs to optimize test-time compute, leveraging intermediate rewards to encourage information gain and improve final answer accuracy.
🤖 AI Tech Releases
NVIDIA Nemotron Models
NVIDIA released Llama Nemotron LLM and Cosmos Nemotron vision-language models.
Phi-4
Microsoft open sourced its Phi-4 small model.
ReRank 3.5
Cohere released its ReRank 3.5 model optimized for RAG and search scenarios.
Agentic Document Workflows
LlamaIndex released Agentic Document Workflows, an architecture for applying agentic tasks to documents.
🛠 AI Reference Implementations
Beyond RAG
Salesforce discusses an enriched index technique that improved its RAG solutions.
📡AI Radar
NVIDIA released AI agentic blueprints for popular open source frameworks.
NVIDIA unveiled Project DIGITS, an AI supercomputer powered by the Blackwell chip.
NVIDIA announced a new family of world foundation models for its Cosmos platform.
Anthropic might be raising at a monster $60 billion valuation.
Hippocratic AI raised a massive $141 million round for its healthcare LLM.
Cohere announced North, its Microsoft CoPilot competitor.
OpenAI might be getting back to robotics.
Gumloop raised $17 million for its workflow automation platform.
Since the introduction of ChatGPT in 2022, more and more companies have wanted to adopt AI technologies, often with specific requirements. Retrieval-Augmented Generation (RAG) has become the preferred technology for building innovative applications on top of private data. But security risks such as unprotected vector stores, faulty data validation, and denial-of-service attacks pose a serious threat, especially given the rapid development cycle of RAG systems.

RAG needs a few ingredients to work: a database of text snippets and a way to retrieve them. Usually a vector store is used for this; it stores the text along with a series of numbers that help find the most relevant snippets. Combined with a suitable prompt, these snippets can be used to answer questions or write new texts that are based on private data sources and relevant to the need at hand. RAG is in fact so effective that the most powerful LLMs are usually not needed. To save costs and improve response times, existing in-house servers can be used to host these smaller, lighter LLM models.

The vector store is like a very helpful librarian who not only finds the relevant books but also highlights the relevant passages. The LLM is then the researcher who takes those passages and uses them to write a white paper or answer the question. Together they form a RAG application.

Vector stores, LLM hosting, vulnerabilities

Vector stores are not entirely new, but they have enjoyed a renaissance over the past two years. There are many hosted solutions such as Pinecone, as well as self-hosted ones such as ChromaDB or Weaviate. They help a developer find text snippets that resemble the input text, for example a question that needs to be answered.

Hosting your own LLM requires a fair amount of RAM and a good GPU, but that is nothing a cloud provider cannot supply. For those with a good laptop or PC, LM Studio is a popular option. For enterprise use, llama.cpp and Ollama are often the first choice.

All of these programs have gone through rapid development, so it should come as no surprise that there are still bugs to fix in RAG components. Some of these bugs are typical data-validation errors, such as CVE-2024-37032 and CVE-2024-39720. Others lead to denial of service, such as CVE-2024-39720 and CVE-2024-39721, or leak the existence of files, such as CVE-2024-39719 and CVE-2024-39722. The list goes on.

Dangerous bugs in RAG components

llama.cpp is less well known, but CVE-2024-42479 was found in it this year, and CVE-2024-34359 affects the Python library that llama.cpp uses. Perhaps the scarcity of information about llama.cpp also stems from its unusual release cycle: since its launch in March 2023 there have been more than 2,500 releases, roughly four per day. With a constantly moving target like that, it is hard to track its vulnerabilities. Ollama, in contrast, has a more leisurely release cycle of only 96 releases since July 2023, roughly one per week. For comparison, Linux gets a new release every few months, and Windows sees new "moments" every quarter.
Vector stores have vulnerabilities too

ChromaDB has been around since October 2022, and a new release appears almost every two weeks. Interestingly, no CVEs are known for this vector store. Weaviate, another vector store, does have vulnerabilities (CVE-2023-38976, and CVE-2024-45846 when used with MindsDB). Weaviate has existed since 2019, which makes it a true grandfather of this technology stack, yet it still keeps a weekly release cycle. These release cycles are not set in stone, but they do mean that discovered bugs are patched quickly, which limits how long they remain exploitable.

LLMs on their own will probably not meet every requirement and will only improve incrementally as the public data available for training runs out. The future likely belongs to agent-based AI that combines LLMs, memory, tools, and workflows into more advanced AI-based systems, according to Andrew Ng, a computer scientist known for his work on artificial intelligence and robotics. In essence this is a new software development stack, and LLMs and vector stores will continue to play an important role in it. But beware: companies can come to harm on the way there if they do not pay attention to the security of their systems.

Exposed RAG components

We were concerned that many developers, in their haste, might expose these systems to the internet unprotected, so in November 2024 we searched for publicly visible instances of some of these RAG components. The focus was on the four most important components used in RAG systems: llama.cpp and Ollama, which host LLMs, and ChromaDB and Weaviate, which serve as vector stores.
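To make the risk concrete, a quick way to see whether one of these components is reachable without authentication is to probe its default HTTP endpoint. The ports and paths below are common defaults for recent versions of each project, but they are assumptions that can differ between releases and deployments; this is an illustrative check for infrastructure you own, not an audit tool.

bash
# Probe a host for unauthenticated RAG components on their usual default ports.
HOST="${1:-localhost}"

# Ollama normally listens on 11434; /api/tags lists the installed models.
curl -sf --max-time 3 "http://${HOST}:11434/api/tags" && echo "Ollama reachable"

# The llama.cpp HTTP server commonly uses port 8080 and exposes a /health endpoint.
curl -sf --max-time 3 "http://${HOST}:8080/health" && echo "llama.cpp server reachable"

# ChromaDB's REST API defaults to port 8000; the heartbeat endpoint needs no auth by default.
curl -sf --max-time 3 "http://${HOST}:8000/api/v1/heartbeat" && echo "ChromaDB reachable"

# Weaviate's readiness probe is unauthenticated and typically served on port 8080.
curl -sf --max-time 3 "http://${HOST}:8080/v1/.well-known/ready" && echo "Weaviate reachable"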
Example of speculative decoding with Qwen-2.5-coder-7B and 0.5B
Speculative decoding is a technique (really a family of techniques) for speeding up a language model by using an "oracle" that produces a draft, trying to guess what the model is going to generate. In this case we will use Qwen-2.5-coder-0.5B as the oracle that generates the draft. For this test we will use an example that ships with Llama.cpp. But before it can be used, we have to change, in the file…
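As a rough sketch of what the invocation looks like: the binary name, flag spellings, and model file names below are assumptions based on the speculative-decoding example that ships with upstream llama.cpp, and they can differ between releases.

bash
# Hypothetical paths to quantized GGUF files for the target and draft models
TARGET=./models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
DRAFT=./models/qwen2.5-coder-0.5b-instruct-q4_k_m.gguf

# The speculative example pairs a large target model (-m) with a small draft model (-md);
# the option controlling how many draft tokens are proposed varies across versions.
./llama-speculative -m "$TARGET" -md "$DRAFT" \
    -p "Write a Python function that reverses a string." \
    --draft 8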

In the rapidly evolving field of artificial intelligence, the focus often lies on large, complex models requiring immense computational resources. However, many practical use cases call for smaller, more efficient models. Not everyone has access to #AI #ML #Automation
AMD Ryzen AI 300 Series Enhances Llama.cpp Performance in Consumer Applications
Ryzen AI 300 outperforms Intel on AI workloads: AMD
With AI becoming part of everyday use, many people are exploring the possibility of running large language models (LLMs) locally on their laptops or desktops. For this purpose, many are turning to LM Studio, a popular piece of software based on the llama.cpp project that has no dependencies and can be accelerated using only the CPU. Taking advantage of LM Studio's popularity, AMD has…
While analyzing the internals of llama.cpp, I implemented a simple matmul sample using ggml (project link in the comments). llama.cpp itself started when Georgi Gerganov ran a weekend hackday implementing the llama model in ggml, and as everyone knows, it has since grown into an enormous project that could fairly be called the Linux of the LLM world.

llama.cpp (more precisely, ggml) works much like TensorFlow: it first builds a computation graph and then executes it. As in the attached image (source credited separately), calling the graph through the ggml_graph_compute() function runs the computation. Note that the attached image was drawn for an older version; these days, when running on CUDA, the graph has to be executed with ggml_backend_graph_compute() instead. This separate graph-execution step is mandatory, and it is quite cumbersome when you are doing modeling work, which is why TensorFlow eventually lost its place to PyTorch. But since llama.cpp is inference-only to begin with, the approach is not much of a problem. On the contrary, it is easier to optimize and can support many back ends: besides the CPU, llama.cpp supports CUDA, Metal on the Mac, and AMD's ROCm. The core is also implemented concisely in C, so although my sample is C++ code, it uses only C syntax. Even the tensor variables are implemented as a struct called ggml_tensor.

PyTorch, by contrast, already puts the equivalent torch::Tensor inside a namespace, and all of its syntax is C++-specific. In llama.cpp you have to code the memory copy from CPU to GPU yourself, whereas torch hides all of that, so even in C++ it can be used almost as easily as in Python. You can see the two frameworks' philosophies here: PyTorch is the all-in-one gift set that is easy to use and supports everything in deep learning, while llama.cpp is dependency-free, portable, lightweight, concise, and lets you control every part.

Going forward, LLMs will increasingly run on-device, and if so, demand for light, concise frameworks like llama.cpp will keep growing. I expect there will also be many more cases of running and optimizing model inference directly with ggml, although the open-source community will implement most of it for us, so most people will simply use what is already there.
Run Your Own AI Cluster in your Kitchen
Welcome to the latest breakthrough in AI innovation: DIY AI Clusters with Refrigerator Essentials. Because who needs NVIDIA GPUs when you have a jar of pickles and an old potato?
Forget Expensive Hardware
Unify your existing kitchen appliances into one powerful AI cluster: your fridge, toaster, blender, and even that bread maker you used once. Pretty much any device* (*we mean it, ANY device).
Getting Started
Here's how you can transform your mundane kitchen into a cutting-edge AI lab:
Wide Model Support
Supports cutting-edge models like LLaMA, BroccoliGPT, and KitchenAid-9000.
Dynamic Model Partitioning
Optimally splits up models based on your kitchen layout and available devices. Run larger models than ever before with your microwave and coffee maker working in harmony.
Automatic Device Discovery
Your AI cluster will automatically discover other devices using the best method available, including Bluetooth, WiFi, and even Telepathy. Zero manual configuration because who has time for that?
Revolutionary Features
ChatGPT-Compatible API
Now you can chat with your refrigerator. Literally. Just a one-line change and you’re talking to your yogurt.
Device Equality
No more master-slave hierarchy. Every device in your kitchen is equal. Yes, even the humble toaster.
Ring Topology
Our default partitioning strategy is a ring memory weighted partitioning. It’s as simple as putting all your devices in a circle, turning them on, and hoping for the best.
Installation
The current recommended way to install our software is from source. Because why make it easy?
Prerequisites
Python>=3.12.0 (because earlier versions are just so last year).
From Source
Clone our incredibly complex repository
Troubleshooting
If running on Mac, you’re going to need all the luck you can get. Check out our vague and unhelpful troubleshooting guide.
Example Usage on Multiple Kitchen Devices
Device 1:
bash
python3 main.py
Device 2:
bash
python3 main.py
That’s it! No configuration required. exo will automatically discover the other device(s), or not. Who knows?
The Native Way
Access models running on exo using the exo library with peer handles. Or just wing it. See how in this example for Llama 3:
bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3-70b",
        "messages": [{"role": "user", "content": "What is the meaning of yogurt?"}],
        "temperature": 0.7
      }'
Debugging
Enable debug logs with the DEBUG environment variable (0-9). The higher the number, the more chaos.
bash
DEBUG=9 python3 main.py
Inference Engines Supported
✅ MLX
✅ tinygrad
🚧 llama.cpp
Known Issues
Just about anything you can imagine.
Obsidian and RTX AI PCs for Advanced Large Language Models

How to Utilize Obsidian‘s Generative AI Tools. Two plug-ins created by the community demonstrate how RTX AI PCs can support large language models for the next generation of app developers.
Obsidian Meaning
Obsidian is a note-taking and personal knowledge base program that works with Markdown files. Users can create internal links between notes and view the relationships as a graph. It is intended to help users structure and organize their ideas and information flexibly and non-linearly. Commercial licenses are available for purchase; personal use of the program is free.
Obsidian Features
Obsidian is built on Electron. It is a cross-platform program that runs on Windows, Linux, and macOS as well as on mobile operating systems like iOS and Android; there is no web-based version. By installing plugins and themes, users can extend Obsidian's functionality on every platform, integrating it with other tools or adding new capabilities.
Obsidian distinguishes between community plugins, which are submitted by users and made available as open-source software via GitHub, and core plugins, which are made available and maintained by the Obsidian team. A calendar widget and a task board in the Kanban style are two examples of community plugins. The software comes with more than 200 community-made themes.
Obsidian works with a folder of text documents: every new note creates a new text document, and all of the documents are searchable inside the app. Internal links can be made between notes, and Obsidian generates an interactive graph that illustrates the connections between them. Text formatting is done in Markdown, and Obsidian offers quick previews of the rendered content.
Generative AI Tools In Obsidian
As generative AI develops and accelerates across industries, a community of AI enthusiasts is exploring ways to incorporate the technology into everyday productivity workflows.

Applications that support community plug-ins give users a way to explore how large language models (LLMs) can improve a range of activities. On RTX AI PCs, users can easily incorporate local LLMs through local inference servers powered by the NVIDIA RTX-accelerated llama.cpp software library.

A previous article examined how consumers can improve their browsing experience with Leo AI in the Brave web browser. This one looks at Obsidian, a well-known writing and note-taking tool based on the Markdown markup language that is helpful for managing intricate, interconnected records across many projects. Several of the community-developed plug-ins that extend the app let users connect Obsidian to a local inferencing server such as LM Studio or Ollama.
To connect Obsidian to LM Studio, enable LM Studio's local server: select the "Developer" button on the left panel, load any downloaded model, enable the CORS toggle, and click "Start." Make a note of the chat completion URL shown in the "Developer" log console ("http://localhost:1234/v1/chat/completions" by default), because the plug-ins will need it to connect.
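Before configuring the plug-ins, it can help to confirm the server responds. The sketch below sends a request to the chat completion URL noted above; the model name must match whatever is loaded in LM Studio (the gemma-2-27b-instruct identifier is the example used later in this article).

bash
# Query LM Studio's OpenAI-compatible chat completions endpoint
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-2-27b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'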
Next, launch Obsidian and open the "Settings" tab. Select "Community plug-ins," then choose "Browse." There are a number of LLM-related community plug-ins; Text Generator and Smart Connections are two popular choices.

Text Generator is useful for creating content in an Obsidian vault, for example notes and summaries on a research topic.

Smart Connections is useful for asking questions about the contents of an Obsidian vault, such as the answer to a trivia question saved years ago.

To set up Text Generator, open its settings, choose "Custom" under "Provider profile," and enter the full URL in the "Endpoint" field. For Smart Connections, turn the plug-in on and then adjust its settings: choose "Custom Local (OpenAI Format)" as the model platform in the options panel on the right side of the screen, then enter the model name exactly as it appears in LM Studio (for example, "gemma-2-27b-instruct") and the URL into the corresponding fields.

Once these fields are filled in, the plug-ins will work. For users curious about what is happening on the local server side, the LM Studio interface also displays a log of recorded activity.
Transforming Workflows With Obsidian AI Plug-Ins
Consider a scenario where a user wants to plan a trip to the fictional Lunar City and come up with suggestions for things to do there. The user would start a new note titled "What to Do in Lunar City." Because Lunar City is not a real place, the query sent to the LLM needs a few extra instructions to guide the results. Clicking the Text Generator plug-in button makes the model generate a list of things to do while traveling.

Through the Text Generator plug-in, Obsidian asks LM Studio to provide a response, and LM Studio runs the Gemma 2 27B model. If the user's machine has RTX GPU acceleration, the model can produce the list of activities quickly.

Or say that, years later, the user's friend is visiting Lunar City and looking for a place to eat. The user may not remember the names of the restaurants they visited, but they can check the notes in their vault (Obsidian's term for a collection of notes) to see whether they wrote anything down.

Instead of combing through all of the notes by hand, a user can ask questions about their vault of notes and other material with the Smart Connections plug-in. The plug-in retrieves relevant information from the user's notes and answers the request using the same LM Studio server, a technique known as retrieval-augmented generation.

These are lighthearted examples, but after experimenting with these features for a while, users will see the real gains in daily productivity. Obsidian plug-ins are just two examples of how community developers and AI enthusiasts are using AI to enhance their PC experiences.

With NVIDIA GeForce RTX technology, thousands of open-source models are available for developers to incorporate into their Windows applications.
Read more on Govindhtech.com
Llama.cpp Now Supports Qwen2-VL (Vision Language Model)
https://github.com/ggerganov/llama.cpp/pull/10361
Create a podcast from any blog or news article using local AI (llama.cpp + Piper)
Not long ago, Google released a tool that creates podcasts in which two AIs discuss whatever text you give them. Meta recently released its own version, based on Llama, that can be run locally. Here I present my own take, with code so you can experiment with it yourselves. You can watch a video with more details on my YouTube channel. The process…
