Model capabilities and runtimes are evolving rapidly (sometimes daily), so this article will be updated as time goes on.

This article focuses on running Large Language Models (LLMs) locally as opposed to image classifiers, speech recognition models, natural language processing models, etc. It will also only focus on inference, not training.

There's something for everyone here. Whether or not you know how to code, I have a walkthrough for you:

  1. Ollama (Easy)
  2. LM Studio (Easy)
  3. Docker (Medium)
  4. Kubernetes (Hard)

Why on earth would you even want to run models locally? After all, can't you just use bigger and better models from OpenAI, Anthropic, Google, or any other state of the art (SOTA) large model providers?

The answer, of course, is privacy and control. You also won't be subject to any rate limits. Along the way, you'll learn a massive amount about networking, system design, security, and LLMs in general (particularly if you pick Kubernetes from my guide choices).

This will also open the door for you to use local models to back agentic coding tools like Claude Code (via Claude Code Router) or OpenCode (at the time of writing, I haven't tried OSS models with Codex yet). I have agentic coding configurations you can use for both Claude Code Router and OpenCode.

Finally, here's a motivating screenshot of me running Open WebUI (an open source chat front-end) which I host locally on Kubernetes backed by models I'm also running locally on my own GPU node:

The simplest way (by far) to run a model locally on your computer is to just use Ollama. Ollama is designed to be simple to use, and it manages context windows for you so you don't have to set them yourself.

It also gives you a local chat interface and the ability to turn on a localhost server which other computers on your home network can hit via your first computer's IP address and a specific port.

This is where I started, and you'll get a lot of mileage out of this before you'll feel the need to upgrade your local setup.

Once downloaded, the fastest way to use Ollama is to just use the interface it ships with. You can select the model dropdown, search for what you want, download it, and then start chatting with it.

At the time of writing, I might recommend trying either qwen3:8b or gpt-oss:20b, depending on how much local RAM or GPU power you have available (use this vRAM calculator to figure out what your computer is capable of).

You can also download and start Ollama via your terminal (if you've not used a terminal before, I'd use the interface I described above). Run these commands to download and start a model:

ollama run <model-name-here>

# Example using gpt-oss:20b
ollama run gpt-oss:20b

This will pull the model (which might take some time depending on which one you choose and your internet connection). Once finished, it will start up. gpt-oss:20b, for instance, is a 20 GB model:
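
If you want to see which models you've already downloaded and how much disk space they take up, Ollama also ships a list command:

# List locally downloaded models and their sizes
ollama list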

Finally, another nice feature of Ollama is that you can expose an endpoint from your computer to your local network so other computers can access it via IP address and port (defaulting to 11434), typically:

http://<your-ip-address>:11434/api/generate

Go here to turn it on:
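
Once it's exposed, a quick way to sanity-check the endpoint from another machine on your network is to curl it. Here's a minimal sketch; swap in your machine's IP address and whichever model you've actually pulled (gpt-oss:20b is just the earlier example):

# Minimal test of Ollama's /api/generate endpoint from another machine
curl http://<your-ip-address>:11434/api/generate \
  -d '{
    "model": "gpt-oss:20b",
    "prompt": "Why is the sky blue?",
    "stream": false
  }'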

Another extremely easy way to run models locally is to use LM Studio.

After downloading it from the link above, you can search for a model using the search icon in the sidebar and select one to download:

After downloading a model, you'll be able to start chatting with it (use the chat icon in the left sidebar):

Finally, you can also turn on an API endpoint so that other computers on your network can access it via IP address and port, typically:

http://<your-ip-address>:1234/v1/chat/completions

Go here to turn this on:
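
Once the server is running, you can test it from another machine much like the Docker example later in this article, since LM Studio's local server speaks the OpenAI-compatible chat completions API. A minimal sketch (the model identifier below is a placeholder; use whatever LM Studio shows for your loaded model):

# Minimal chat request against LM Studio's OpenAI-compatible server
curl http://<your-ip-address>:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b",
    "messages": [
      {"role": "user", "content": "Say hello in one sentence."}
    ]
  }'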

Docker also allows you to run models locally and moves us towards faster, more performant runtimes (more on that later). Docker's Model Runner is quick to set up and use.

You first need to install Docker Desktop so go ahead and do that if you don't have it already. If you already have it, ensure it's >= v4.40.

Model Runner (available since Docker Desktop v4.40) allows you to easily and quickly boot up local models. For instance, this is an example of commands you might use to bring up a model with Model Runner (although I'd advise reading the linked docs above):

docker desktop enable model-runner --tcp=12434

The --tcp flag and port number are technically optional, but I'd advise using them so you can access the model outside of a Docker socket on your host machine (e.g. via localhost).

You can also just do it in the Docker Desktop UI by going to Settings > AI:

Next, you'd use these commands to pull a model and run it:

docker model pull <model-name-here>
docker model run <model-name-here>

A great place to choose a model is HuggingFace. Make sure to choose one that your device's hardware can support. Unlike training, inference can run from either system RAM (on the CPU) or vRAM (on a GPU). In either case, the basic equation for estimating memory is this:

RAM (bytes) = num_params × bytes_per_param

For instance:

num_params = 1,000,000,000
bytes_per_param = 4
RAM (bytes) = 1,000,000,000 × 4 = 4,000,000,000 bytes (4GB)

The number of bytes per parameter can vary depending on the model and whether or not it's quantized. You should be able to read the configuration for the model you want to use on HuggingFace. Additionally, you can use this vRAM calculator to figure out what your computer is capable of.
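
For example, here's the same estimate for a 20B-parameter model quantized to 4 bits (0.5 bytes per parameter). Keep in mind this only counts the weights; the KV cache and runtime overhead come on top:

num_params = 20,000,000,000
bytes_per_param = 0.5 (4-bit quantization)
RAM (bytes) = 20,000,000,000 × 0.5 = 10,000,000,000 bytes (10GB)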

If in doubt, Model Runner should throw a hissy fit if you try to pull a model that's too large for your computer.

Once you choose a model, copy its name and use it in the commands like so:

docker model pull ai/qwen3-coder
docker model run ai/qwen3-coder

The docker model run command will start up a text-based terminal chat you can use.

If you enabled TCP and a port like I mentioned, you should also be able to directly curl the model now:

curl http://localhost:12434/engines/llama.cpp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai/qwen3-coder",
    "messages": [
      {"role": "system", "content": "You are a helpful marketing assistant."},
      {"role": "user", "content": "What is the best way to market a product?"}
    ]
  }'

You can also check which models you've downloaded (and can immediately run with the docker model run command) by using this command:

docker model list

The Docker Model Runner server uses llama.cpp under the hood by default. If you want to use GPU acceleration, you can use the following flag:

# Enable the runner to use CUDA GPU
docker desktop enable model-runner --gpu enable

# Run the model again
docker model run <model-name-here>

This sets you up to use GPU acceleration with CUDA, which is a whole topic in and of itself (more in the Docker documentation). If you're interested in learning more about these different process runners, jump to the Addendum section.

This avenue is if you're a true masochist. Luckily, I already went down this path myself and can help you avoid the pitfalls I encountered.

For now, I have both a gitops repository that houses my custom Kubernetes deployment files that you can fork and use as well as a SETUP.md file that can help you get started. It works great and is backing the Open WebUI screenshot I had in the Why Local? section.

I'll be building out this section more shortly.

Under the hood of the Docker and Kubernetes setups described in this article, you're using tools like llama.cpp and vLLM. These runtimes are faster, offer higher throughput, support batching and larger context windows under concurrency, and are much more efficient with GPU memory; they're worth reading up on to understand how they work.
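
If you want to poke at these runtimes directly (outside of Docker or Kubernetes), here's a minimal sketch of serving a model with each one. The model path and model name below are placeholders, and only the most common flags are shown; check each project's docs for options that match your hardware:

# llama.cpp: serve a local GGUF file with its built-in OpenAI-compatible server
llama-server -m ./models/your-model.gguf --port 8080 --ctx-size 8192

# vLLM: serve a HuggingFace model with its OpenAI-compatible server (needs a supported GPU)
vllm serve Qwen/Qwen3-8B --max-model-len 8192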

If you want to dig even further, you can also read up on the following tools:

Here are also some interesting additional topics to study:

  • Distillation (reducing model size via teacher-student training)
  • KV caches and management
  • LoRA (fine-tuning techniques)
  • NVLink/NVSwitch (GPU linking/coordination)
  • Quantization (reducing model size with minimal quality loss)