Using llama.cpp to self-host Large Language Models in Production
A practical guide to self-hosting LLMs in production using llama.cpp's llama-server with Docker Compose and systemd
This episode covers llama-server, the production-focused HTTP server included with llama.cpp for self-hosting large language models, and contrasts it with more user-friendly local alternatives.
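As a rough sketch of what running llama-server directly looks like (the model path and tuning flags below are illustrative, not from the episode):

```sh
# Start llama-server with an OpenAI-compatible HTTP API on port 8080.
#   -m        path to a GGUF model file (illustrative path)
#   -c        context window size in tokens
#   -ngl      number of layers to offload to the GPU (requires a GPU build)
#   --api-key protects the endpoint with a bearer token
llama-server -m /models/llama-3-8b-instruct.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -c 4096 -ngl 99 --api-key "$LLAMA_API_KEY"
```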
It details how to deploy llama-server with Docker Compose, including GPU-accelerated configurations, and natively under systemd for optimized performance; both approaches are sketched below.
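A minimal Docker Compose sketch, assuming the official llama.cpp server images on ghcr.io and an NVIDIA GPU with the NVIDIA Container Toolkit installed (image tag, model path, and flags are illustrative):

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda  # CUDA-enabled server image
    command: >
      -m /models/llama-3-8b-instruct.Q4_K_M.gguf
      --host 0.0.0.0 --port 8080 -c 4096 -ngl 99
    volumes:
      - ./models:/models          # GGUF models stored on the host
    ports:
      - "8080:8080"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # expose all GPUs to the container
              capabilities: [gpu]
    restart: unless-stopped
```

And a corresponding systemd unit for a natively built binary (paths and service user are illustrative):

```ini
# /etc/systemd/system/llama-server.service
[Unit]
Description=llama.cpp llama-server
After=network-online.target
Wants=network-online.target

[Service]
User=llama
ExecStart=/usr/local/bin/llama-server \
  -m /var/lib/llama/models/llama-3-8b-instruct.Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 -c 4096 -ngl 99
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After writing the unit file, `systemctl daemon-reload` followed by `systemctl enable --now llama-server` starts the service and keeps it running across reboots.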
The episode also introduces AI Server, an open-source managed gateway designed to streamline AI integrations by centralizing the management of multiple LLM providers, from self-hosted llama-server instances to cloud-based APIs.
It explains how to use AI Server by registering llama-server endpoints as providers, creating API keys for client applications, and using its typed APIs in various programming languages for synchronous, queued, and callback-based interactions, as in the sketch below.
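As a sketch of what the typed client usage might look like in TypeScript with ServiceStack's @servicestack/client: the `OpenAiChatCompletion` and `QueueOpenAiChatCompletion` request names follow AI Server's generated DTOs, but treat the exact field shapes, the gateway URL, and the `replyTo` callback field here as assumptions rather than a definitive API reference.

```ts
import { JsonServiceClient } from "@servicestack/client";
// DTOs generated from your own AI Server instance, e.g. with:
//   npx get-dtos typescript https://your-ai-server.example.org
import { OpenAiChatCompletion, QueueOpenAiChatCompletion } from "./dtos";

const client = new JsonServiceClient("https://your-ai-server.example.org");
client.bearerToken = process.env.AI_SERVER_API_KEY!; // API key created in AI Server

// Synchronous: wait for the gateway to return the completion.
const api = await client.api(new OpenAiChatCompletion({
    model: "llama3.1:8b", // model name as registered with the gateway (assumed)
    messages: [{ role: "user", content: "What is llama-server?" }],
}));
if (api.succeeded) {
    console.log(api.response!.choices?.[0]?.message?.content);
}

// Queued with callback: enqueue the request and have AI Server POST the
// result to a callback URL when it completes (field name assumed).
await client.api(new QueueOpenAiChatCompletion({
    model: "llama3.1:8b",
    messages: [{ role: "user", content: "Summarize this episode." }],
    replyTo: "https://myapp.example.org/callbacks/chat",
}));
```

The same pattern applies in the other languages ServiceStack generates typed DTOs for; only the client library and DTO import change, while the gateway URL and API key stay the same.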