Demo 4

Local LLMs on Your Machine

LM Studio · Ollama + Open WebUI · vLLM — June 23, 2026

Live demo

Workshop stand — Open WebUI in the browser, no local setup needed:

https://webui-demo.infiano.app/

Quick pick (no theory)

LM Studio: just download a model and chat, no terminal
Ollama + Open WebUI: nice browser chat UI
vLLM: API/server for projects; connect a chat UI separately

1. LM Studio

Best for: people who don't want to deal with the terminal. Open the app → find a model → download → Chat.

1.1 Installation

Download the installer for Windows, macOS, or Linux from the LM Studio website or its Download page.
Run the installer and launch the app as usual.
On an RTX 4090, use a recent NVIDIA driver. In the LM Studio settings, keep the GPU/auto backend if the app suggests one.
Search models in Discover / Models. LM Studio supports search by name, owner/model, or full Hugging Face URL.

1.2 Downloading a model

Open Discover / Models.
Search for a repo, e.g. openai/gpt-oss-20b, lmstudio-community/gemma-4-12B-it-GGUF, or bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF.
Expand files and pick a quant. On RTX 4090 start with Q4_K_M, UD-Q4_K_XL, or similar 4-bit.
If the model is 8B–14B and VRAM is free, try Q5_K_M/Q6_K. For 24B–32B start with Q4.
Click Download, then go to Chat, select the model, and click Load Model.

Search examples: openai/gpt-oss-20b · lmstudio-community/gemma-4-12B-it-GGUF · bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF · bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF · unsloth/GLM-4.7-Flash-GGUF

1.3 RTX 4090 tips

LM Studio can also run a local OpenAI-compatible server — see the API / local server docs.
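
Once the local server is enabled, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and assumes the server listens on port 1234 (believed to be the LM Studio default; check the app's server tab). The helper names build_chat_request and send_chat are hypothetical, and the API key is a placeholder that a default local server is assumed to ignore.

```python
import json
import urllib.request

# Assumption: LM Studio's local server listens on port 1234 by default.
LMSTUDIO_BASE_URL = "http://localhost:1234/v1"


def build_chat_request(base_url, model, prompt, api_key="not-needed"):
    """Build (but do not send) a POST request for /chat/completions."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key; assumed to be ignored by a default local server.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send_chat(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With a model loaded and the server running, send_chat(build_chat_request(LMSTUDIO_BASE_URL, "openai/gpt-oss-20b", "Hello")) should return the reply text. The model identifier shown here is an assumption; use the name LM Studio displays for the loaded model.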

2. Ollama + Open WebUI

Best for: browser chat. Ollama runs models; Open WebUI is the interface.

2.1 Install Ollama

Download from the official Ollama page.
Browse models in Ollama Library. For GPT-OSS use gpt-oss:20b.
Ollama has a built-in chat, but for a proper browser UI install Open WebUI.

ollama pull gpt-oss:20b
ollama run gpt-oss:20b

2.2 Open WebUI in the browser

Install via the official Quick Start. Docker is recommended.

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Open in browser: http://localhost:3000

Without Docker:

pip install open-webui
open-webui serve
# open http://localhost:8080

2.3 Hugging Face models via Ollama

Format: hf.co/owner/repo:quant. Public GGUF usually needs no token; gated/private models need a Hugging Face token.

ollama run hf.co/lmstudio-community/gemma-4-12B-it-GGUF:Q4_K_M
ollama run hf.co/bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
ollama run hf.co/unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL
ollama run hf.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M
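
The hf.co/owner/repo:quant format is easy to mistype by hand, so here is a small sketch that assembles the reference from its parts. The function name ollama_hf_ref is hypothetical, and leaving out the quant tag is assumed to let Ollama choose a default file from the repo.

```python
def ollama_hf_ref(repo, quant=None):
    """Return an Ollama model reference for a Hugging Face GGUF repo.

    repo  -- "owner/repo" as shown on Hugging Face
    quant -- optional quant tag such as "Q4_K_M"
    """
    owner, sep, name = repo.partition("/")
    if not (owner and sep and name) or "/" in name:
        raise ValueError(f"expected 'owner/repo', got {repo!r}")
    ref = f"hf.co/{repo}"
    return f"{ref}:{quant}" if quant else ref


# Print the command line for one of the repos listed above.
print("ollama run", ollama_hf_ref("unsloth/GLM-4.7-Flash-GGUF", "UD-Q4_K_XL"))
```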

2.4 Ollama API check

Ollama supports an OpenAI-compatible API.

curl http://localhost:11434/api/tags
# OpenAI-compatible base URL: http://localhost:11434/v1

from openai import OpenAI

# The client requires an api_key value; Ollama does not check it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "List 5 benefits of local LLMs."}],
)
print(resp.choices[0].message.content)
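
Before sending a chat request it can help to confirm that the model is actually pulled. The sketch below reads the same /api/tags endpoint as the curl check and extracts the model names, assuming the response has a top-level "models" list whose entries carry a "name" field. The function names and the sample payload are illustrative, not captured from a real run.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_model_names(base_url=OLLAMA_URL, timeout=5):
    """Query a running Ollama instance; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


# Illustrative payload only; a real response carries more fields per model.
sample = {"models": [{"name": "gpt-oss:20b"}]}
print("gpt-oss:20b" in model_names(sample))
```

With Ollama running, fetch_model_names() returns the names of the locally available models.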

3. vLLM

Best for: a fast OpenAI-compatible server for projects, tests, RAG, agents, or multiple clients. vLLM is only the backend; by default it serves on localhost:8000.

3.1 Requirements

Linux or WSL2 preferred. On plain Windows, LM Studio/Ollama are simpler.
NVIDIA driver + CUDA stack. Docker needs NVIDIA Container Toolkit.
Python 3.10+; install via uv or pip in a venv.
HF_TOKEN only for gated/private models.
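
The requirements above can be checked with a few lines of standard-library Python before installing anything. This is a rough sketch: check_requirements is a hypothetical helper, finding nvidia-smi on PATH is only a proxy for a working driver, and a missing HF_TOKEN matters only for gated or private models.

```python
import os
import shutil
import sys


def check_requirements(min_python=(3, 10)):
    """Return a dict of simple yes/no environment checks."""
    return {
        # Interpreter version meets the minimum stated above.
        "python_ok": sys.version_info[:2] >= min_python,
        # nvidia-smi on PATH suggests that the NVIDIA driver is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # Only needed for gated/private Hugging Face models.
        "hf_token_set": bool(os.environ.get("HF_TOKEN")),
    }


for name, ok in check_requirements().items():
    print(f"{name}: {'yes' if ok else 'no'}")
```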

3.2 Install via pip/uv

python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install uv
uv pip install vllm openai huggingface_hub

nvidia-smi
python - <<'PY'
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
PY

3.3 Run vLLM as API

vllm serve openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key local-key \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer local-key"

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer local-key" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain vLLM in 3 sentences."}],
    "temperature": 0.7
  }'
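
The first curl call has a direct Python equivalent, which is handy in test scripts that need to know whether the server is up and which model it serves. The sketch below assumes the server was started with --api-key local-key as above and that /v1/models returns the usual OpenAI-style list with a "data" array of objects carrying an "id". The function names are hypothetical.

```python
import json
import urllib.request


def models_request(base_url="http://localhost:8000/v1", api_key="local-key"):
    """Build a GET request for /models with the bearer token attached."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def model_ids(models_payload):
    """Extract model ids from a decoded /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def served_models(base_url="http://localhost:8000/v1", api_key="local-key", timeout=5):
    """Ask a running server which models it serves."""
    request = models_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

A wrong or missing key is expected to surface as an urllib.error.HTTPError from served_models(), which makes a mismatch between client and --api-key easy to spot.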

3.4 Docker

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model openai/gpt-oss-20b \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key local-key

3.5 Connect vLLM to Open WebUI

Start vLLM at http://localhost:8000/v1
Start Open WebUI
In Settings → Connections add an OpenAI-compatible provider
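Base URL: http://localhost:8000/v1 when Open WebUI runs directly on the host. If Open WebUI runs in Docker, localhost refers to the container itself, so use http://host.docker.internal:8000/v1 instead. That name is built in on Docker Desktop (Windows/macOS); on Linux it resolves when the container is started with --add-host=host.docker.internal:host-gateway, as in the docker run command in section 2.2.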
API key: local-key (or whatever you passed to --api-key)
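
The only moving part in these settings is the host name, which depends on where Open WebUI runs. The sketch below captures that rule in one place; vllm_connection is a hypothetical helper, and the returned values are simply what you would type into the Connections form.

```python
def vllm_connection(webui_in_docker, port=8000, api_key="local-key"):
    """Return the base URL and API key to enter in Open WebUI.

    Inside a container, "localhost" is the container itself, so the
    host machine is reached through host.docker.internal instead.
    """
    host = "host.docker.internal" if webui_in_docker else "localhost"
    return {"base_url": f"http://{host}:{port}/v1", "api_key": api_key}


# Open WebUI started via Docker, vLLM running on the host:
print(vllm_connection(webui_in_docker=True)["base_url"])
```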

4. Models for RTX 4090 (24 GB VRAM)

Tokens-per-second (TPS) figures are rough guides for an RTX 4090 at batch=1, 4k–8k context, and 4-bit/FP8/MXFP4 where available; they are not guaranteed benchmarks. Long context, heavy prompts, and CPU offload can cut speed 2–5×.

Unsloth, lmstudio-community, and bartowski repos are usually GGUF repackagings of other providers' models; check the original model provider separately.

Recommended starting points

openai/gpt-oss-20b — good default for all three stacks
lmstudio-community/gemma-4-12B-it-GGUF — 12B, start Q4_K_M
bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF — 24B, Q4 recommended
bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF — 32B, Q4 only on 24 GB
unsloth/GLM-4.7-Flash-GGUF — try UD-Q4_K_XL

Quant cheat sheet (RTX 4090)

8B–14B: Q4_K_M to start; Q5_K_M / Q6_K if VRAM allows
20B–24B: Q4_K_M or UD-Q4_K_XL
32B: Q4 only; expect slower generation
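
The cheat sheet follows from simple arithmetic: weight size is roughly parameters × bits per weight ÷ 8. The sketch below makes that explicit. The bits-per-weight values are rounded approximations, the 3 GB allowance for KV cache and runtime overhead is an assumption, and real GGUF files vary by model, so treat the result as a first guess rather than a guarantee.

```python
# Approximate bits per weight for common GGUF quants (assumed, rounded values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}


def weight_gb(params_billion, quant):
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion, quant, vram_gb=24.0, overhead_gb=3.0):
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weight_gb(params_billion, quant) + overhead_gb <= vram_gb


# Compare the model sizes from the cheat sheet against 24 GB.
for size in (12, 24, 32):
    for quant in ("Q4_K_M", "Q5_K_M", "Q6_K"):
        verdict = "fits" if fits(size, quant) else "too big"
        print(f"{size}B {quant}: ~{weight_gb(size, quant):.1f} GB weights, {verdict}")
```

Under these assumptions a 32B model fits at Q4_K_M but not at Q5_K_M, which matches the "Q4 only" guidance above. Longer contexts need a larger overhead allowance.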