LLM Fit Check estimates the total memory a large language model needs — weights + KV cache + compute buffers — and compares it against your GPU VRAM, Apple unified memory or system RAM. Model data comes live from Hugging Face: real parameter counts, exact GGUF quantization file sizes, and each model's true architecture (layers, grouped-query attention heads, context window), including special cases like DeepSeek's MLA compressed KV cache and Gemma's sliding-window attention.
Typical totals at 8K context for standard GQA architectures (use the calculator above for your exact model, quant and context):
| Model size | Q4_K_M (recommended) | FP16 (full precision) | Fits on |
|---|---|---|---|
| 7–8B | ~6–7 GB | ~18 GB | RTX 3060 12GB, any 16GB Mac |
| 13–14B | ~11–12 GB | ~32 GB | RTX 4070 12GB (tight), RTX 4080 16GB |
| 32B | ~21–23 GB | ~68 GB | RTX 3090/4090 24GB, 36GB Mac |
| 70B | ~43–46 GB | ~145 GB | 2× 24GB GPUs, 64GB+ Mac |
| Mixtral 8x7B (47B MoE) | ~29 GB | ~94 GB | RTX 5090 32GB, 48GB Mac |
Model data refreshes continuously from the Hugging Face API; trending list updates every 6 hours.
At the popular Q4_K_M quantization with 8K context: a 7–8B model needs about 6–7 GB, a 13–14B model about 10–12 GB, a 32B model about 21–23 GB, and a 70B model about 43–46 GB. Higher precision (Q6_K, Q8_0, FP16) and longer context windows push these numbers up — use the calculator above for your exact setup.
GGUF quantizations shrink a model's weights to fewer bits per weight. Q4_K_M (~4.85 bits/weight) is the sweet spot for most people — roughly 70% smaller than FP16 with minimal quality loss. Q6_K and Q8_0 are near-lossless but bigger; Q2_K is a last resort with noticeable degradation. If a model almost fits, try the next smaller quant before giving up.
Weights = parameters × bits-per-weight ÷ 8 (or the exact published GGUF file size when available). KV cache = 2 × layers × KV heads × head dimension × bytes × context tokens — grouped-query-attention aware, using each model's real config. On top of that sits a compute-buffer estimate fitted to real llama.cpp runs plus a fixed runtime overhead. A configurable safety margin separates “fits” from “tight”.
No — all parameters must be in memory, not just the active experts. Mixtral 8x7B holds 46.7B parameters even though only ~13B are active per token; the active count makes it faster, not smaller. The exception is the KV cache: DeepSeek's MLA architecture compresses it ~28× compared to standard attention.
macOS lets the GPU wire about 2/3 of unified memory below 36 GB, and 3/4 from 36 GB up (Metal's recommended working-set limit, which Ollama and LM Studio respect). A 36 GB MacBook therefore offers ~27 GB to models — minus whatever macOS and your open apps already use, which you can set in the hardware panel.
Often yes — llama.cpp-style engines can keep some layers in system RAM and run them on the CPU (partial offload). It works, but expect a steep slowdown: even a small spill can halve your tokens per second. LLM Fit Check shows these models with an orange “Offloads” verdict and how many GB would land in RAM.