<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Llama-Server on AI Engineer Guide</title><link>https://aiengineerguide.com/tags/llama-server/</link><description>Recent content in Llama-Server on AI Engineer Guide</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright © 2024, Nesin Technologies LLP.</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://aiengineerguide.com/tags/llama-server/feed.xml" rel="self" type="application/rss+xml"/><item><title>How to run Gemma 4 locally with llama-server and access it via API</title><link>https://aiengineerguide.com/til/google-gemma-4-locally-with-llama-server-api/</link><pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate><guid>https://aiengineerguide.com/til/google-gemma-4-locally-with-llama-server-api/</guid><description>&lt;p>We&amp;rsquo;ll be using &lt;a href="https://github.com/ggml-org/llama.cpp">llama-server&lt;/a> to serve the &lt;a href="https://deepmind.google/models/gemma/gemma-4/">Gemma 4&lt;/a>&lt;/p>
&lt;h2 id="depedency">Dependency&lt;/h2>
&lt;p>If you don&amp;rsquo;t have llama-server yet, you can install it with Homebrew (macOS/Linux); see the &lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/install.md">docs&lt;/a> for other installation methods.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>brew install llama.cpp
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then we can download and serve the model with a single command.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>llama-server -hf ggml-org/gemma-4-26b-a4b-it-GGUF:Q4_K_M
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Make sure to pick a model variant that fits the available GPU/memory on your machine.&lt;/p>
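&lt;p>Once the server is up, it exposes an OpenAI-compatible API, so we can test it with a quick curl request (this sketch assumes the default address of &lt;code>localhost:8080&lt;/code>):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>curl http://localhost:8080/v1/chat/completions \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>  -H &amp;#34;Content-Type: application/json&amp;#34; \
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>  -d &amp;#39;{&amp;#34;messages&amp;#34;: [{&amp;#34;role&amp;#34;: &amp;#34;user&amp;#34;, &amp;#34;content&amp;#34;: &amp;#34;Hello!&amp;#34;}]}&amp;#39;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The response comes back in the standard chat completions format, which means any OpenAI-compatible client library should also work if you point its base URL at the local server.&lt;/p>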
&lt;p>That&amp;rsquo;s pretty much it. By default, llama-server serves the model on port 8080.&lt;/p></description></item></channel></rss>