Model capabilities and runtimes are evolving rapidly (sometimes daily), so this article will be updated as time goes on.

This article focuses on running Large Language Models (LLMs) locally as opposed to image classifiers, speech recognition models, natural language processing models, etc. It will also only focus on inference, not training.

There's something for everyone here. Whether or not you know how to code, one of these walkthroughs will fit:

  1. Ollama (Easy)
  2. LM Studio (Easy)
  3. Docker (Medium)
  4. Kubernetes (Hard)

Why on earth would you even want to run models locally? After all, can't you just use bigger and better models from OpenAI, Anthropic, Google, or any other state of the art (SOTA) large model providers?

The answer, of course, is privacy and control. You also won't be subject to any rate limits. Along the way, you'll learn a massive amount about networking, system design, security, and LLMs in general (particularly if you pick Kubernetes from my guide choices).

This will also open the door for you to use local models to back agentic coding tools like Claude Code (via Claude Code Router) or OpenCode (at time of writing, I haven't tried OSS models with Codex yet). I have agentic coding configurations you can use for both Claude Code Router and OpenCode.

Finally, here's a motivating screenshot of me running Open WebUI (an open source chat front-end) which I host locally on Kubernetes backed by models I'm also running locally on my own GPU node:

The simplest way (by far) to run a model locally on your computer is to just use Ollama. Ollama is an open source tool designed to be simple to use. It also manages context windows for you so you don't have to set them yourself.

Ollama gives you a local chat interface out of the box, plus the ability to turn on a server that other computers on your home network can hit via your machine's IP address and a specific port.

This is where I started, and you'll get a lot of mileage out of this before you'll feel the need to upgrade your local setup.

Once downloaded, the fastest way to use Ollama is the interface it ships with. Select the model dropdown, search for the model you want, download it, and then start chatting with it.

As of this writing, I'd recommend trying either qwen3:8b or gpt-oss:20b depending on how much local RAM or GPU power you have available (use this vRAM calculator to figure out what your computer is capable of).

You can also download and start models via your terminal (if you've not used a terminal before, I'd stick with the interface described above). Run these commands to download and start a model:

ollama run <model-name-here>

# Example using gpt-oss:20b
ollama run gpt-oss:20b

This will pull the model (which might take some time depending on which one you choose and your internet connection). gpt-oss:20b, for instance, is a 20-billion-parameter model, so expect a download of well over 10 GB. Once finished, it will start up.

Finally, another nice feature of Ollama is that it can expose an HTTP endpoint that other computers on your local network can hit via your machine's IP address and port (11434 by default), typically:

http://<your-ip-address>:11434/api/generate
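Once network serving is on, you can call that endpoint from any language. Here's a minimal Python sketch (the model name and localhost default are assumptions; swap in your own model and your machine's IP):

```python
import json
import urllib.request

def build_request(prompt, model="gpt-oss:20b", host="http://localhost:11434"):
    """Build an HTTP request for Ollama's /api/generate endpoint.
    "stream": False asks for one complete JSON object instead of chunks."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt, **kwargs):
    """Send the prompt and return the model's full reply text."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]

# With Ollama serving on your network, point host at the serving machine:
# print(generate("Why is the sky blue?", host="http://192.168.1.20:11434"))
```

The commented-out call at the bottom requires a running Ollama server; everything above it just constructs the request.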

Go here to turn this feature on:

Another extremely easy way to run models locally is to use LM Studio.

After downloading it from the link above, you can search for a model using the search icon in the left sidebar and select one to download:

After downloading a model, you'll be able to start chatting with it using the chat icon in the left sidebar:

Finally, you can also turn on an API endpoint so that other computers on your network can access it via IP address and port, typically:

http://<your-ip-address>:1234/v1/chat/completions
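LM Studio's local server speaks the OpenAI-compatible chat completions API, so a request body looks like this (the model name here is just an example; use whatever name LM Studio shows for your downloaded model):

```python
import json

def chat_payload(user_msg, model="qwen/qwen3-8b",
                 system_msg="You are a helpful assistant."):
    """Build an OpenAI-style chat completions body for LM Studio's server.
    The model field must match a model you've downloaded in LM Studio."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": system_msg},
            {"role": "user", "content": user_msg},
        ],
    })

# POST this body to http://<your-ip-address>:1234/v1/chat/completions
# with a Content-Type: application/json header (curl, requests, fetch, etc.).
```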

Go here to turn this on:

Docker also allows you to run models locally and moves us incrementally towards faster and more performant runtimes (more on that later). To that end, you can get set up with Docker's Model Runner really fast.

You'll need to install Docker Desktop, so go ahead and do that if you don't have it already. Ensure that whatever build you have locally is >= v4.40.

Model Runner (available since Docker Desktop v4.40) allows you to easily and quickly boot up local models. For instance, this is an example of commands you might use to bring up a model with Model Runner (although I'd advise reading the linked docs above):

docker desktop enable model-runner --tcp=12434

The --tcp flag and port number are technically optional, but I'd advise using them so you can access the model via localhost outside of a Docker socket on your host machine.

You can also turn this on in the Docker Desktop UI by going to Settings > AI:

Next, you'll use these commands to pull a model and run it:

docker model pull <model-name-here>
docker model run <model-name-here>

A great place to choose a model is HuggingFace. Make sure to choose one your device hardware can support. Unlike training, inference can run from either system RAM or GPU vRAM. In either case, the simplified equation for estimating usage is this:

RAM (bytes) = num_params × bytes_per_param

For instance:

num_params = 1,000,000,000
bytes_per_param = 4
RAM (bytes) = 1,000,000,000 × 4 = 4,000,000,000 bytes (4GB)

The number of bytes per parameter can vary depending on the model and if it's quantized or not. You should be able to read the configuration for the model you want to use on HuggingFace. Additionally, you can use this vRAM calculator to figure out what your computer is capable of.
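To make the estimate concrete, here's the formula as a tiny Python helper (the 0.5 bytes-per-parameter figure corresponds to 4-bit quantization):

```python
def estimate_memory_gb(num_params, bytes_per_param):
    """Rough memory needed just to hold the weights: params x bytes each.
    Real usage runs higher (KV cache, activations), so treat this as a floor."""
    return num_params * bytes_per_param / 1e9

# A 1B-parameter model at full 32-bit precision (4 bytes per parameter):
print(estimate_memory_gb(1_000_000_000, 4))    # -> 4.0 (GB)
# The same model quantized to 4 bits (0.5 bytes per parameter):
print(estimate_memory_gb(1_000_000_000, 0.5))  # -> 0.5 (GB)
```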

If in doubt, Model Runner should throw a hissy fit if you try to pull a model that's too large for your computer.

Once you choose a model, copy its name and use it in the commands like so:

docker model pull ai/qwen3-coder
docker model run ai/qwen3-coder

The docker model run command will start up a text-based terminal chat you can use.

If you enabled TCP and a port like I mentioned, you can also just directly curl the model once the docker model pull command is finished:

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/qwen3-coder",
    "messages": [
      {"role": "system", "content": "You are a helpful marketing assistant."},
      {"role": "user", "content": "What is the best way to market a product?"}
    ]
  }'
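The response comes back in the OpenAI chat completions shape. Here's a small sketch for pulling out the assistant's reply, using an abbreviated sample response (the content string is made up for illustration):

```python
import json

# Abbreviated sample response in the OpenAI chat completions shape;
# real responses also include ids, usage stats, and more fields.
raw = ('{"choices": [{"index": 0, "message": '
       '{"role": "assistant", "content": "Know your audience."}}]}')

def assistant_reply(response_json):
    """Return the assistant's message text from a chat completions response."""
    return json.loads(response_json)["choices"][0]["message"]["content"]

print(assistant_reply(raw))  # -> Know your audience.
```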

You can also list any models you've downloaded (and can immediately run with docker model run) using this command:

docker model list

In case you're curious, the Docker Model Runner server is using llama.cpp by default under the hood. If you want to use GPU acceleration instead of only CPU and RAM, you can use the following flag:

# Enable the runner to use CUDA GPU
docker desktop enable model-runner --gpu enable

# Run the model again
docker model run <model-name-here>

This sets you up to use GPU acceleration with CUDA, which is a whole topic in and of itself (there's a bit more on this in the Docker documentation). If you're interested in learning more about these different process runners, jump to the Addendum section.

This avenue is for true masochists. Luckily, I already went down this path myself and can help you avoid the pitfalls I encountered.

For now, I have both a gitops repository that houses my custom Kubernetes deployment files that you can fork and use as well as a SETUP.md file that can help you get started. It works great and is backing the Open WebUI screenshot I had in the Why Local? section.

I'll be building out this section more shortly.

Under the hood of the Docker and Kubernetes setups described in this article, you're using tools like llama.cpp and vLLM. These inference runtimes are worth reading up on: they're faster and higher throughput, support batching and larger context windows under concurrency, are much more GPU-memory efficient, and can hold models in GPU memory indefinitely.

If you want to dig even further, you can also read up on the following tools:

Here are also some interesting additional topics to study:

  • Distillation (shrinking a model by training a smaller student on a larger teacher)
  • KV caches and management
  • LoRA (fine-tuning techniques)
  • NVLink/NVSwitch (GPU linking/coordination)
  • Quantization (reducing model size with minimal quality loss)