Local LLM Deployer v0.1.6 llama.cpp b8864
by llmbbs.ai (local AI technical community)
[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3090 × 2 (SM86, 24576 MB VRAM each, 936 GB/s)
RAM: 251 GB DDR4
OS: linux amd64
[2/6] Selecting configuration...
Model: Qwen3.6-35B-A3B (moe, 35B total / 3B active)
Quant: ud-q5-k-xl (25.0 GB)
Mode: full_gpu
Accel: Flash Attention + MTP (native) + NVLink + SWA-Full (hybrid arch)
[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda
Binary: llama-server-cuda [cached]
Model: Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf [cached]
[4/6] Preflight check...
VRAM sufficient
[5/6] Warmup benchmark...
Probe 1: ctx=256K ... OOM
Probe 2: ctx=128K ... OOM
Probe 3: ctx=64K ... OOM
Probe 4: ctx=32K ... OOM
Probe 5: ctx=16K ... OOM
Probe 6: ctx=8K ... OOM
Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters
[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
Insufficient VRAM, lowering context to 128K and retrying...
Waiting for llama-server to be ready (port 11434)...
Insufficient VRAM, lowering context to 64K and retrying...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: all 3 startup attempts failed; consider choosing a smaller model
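
The warmup probes and the startup retries in the log above both appear to halve the context size after each out-of-memory failure: six probes from 256K down to 8K, then three startup attempts ending at 64K. The following is a minimal Python sketch of that pattern, not the tool's actual implementation. The function names, the `probe` callback, and the assumption that the first startup attempt used 256K are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

K = 1024  # the log reports context sizes in units of K tokens


def halving_schedule(start: int, floor: int,
                     max_attempts: Optional[int] = None) -> List[int]:
    """Return start, start/2, start/4, ... down to floor (inclusive).

    If max_attempts is given, the list is truncated to that many entries.
    """
    sizes: List[int] = []
    ctx = start
    while ctx >= floor and (max_attempts is None or len(sizes) < max_attempts):
        sizes.append(ctx)
        ctx //= 2
    return sizes


def first_fitting_ctx(probe: Callable[[int], bool], start: int, floor: int,
                      max_attempts: Optional[int] = None
                      ) -> Tuple[Optional[int], List[int]]:
    """Try each size in the halving schedule until probe(ctx) succeeds.

    probe is a hypothetical callback that returns True when the server
    starts at that context size and False on OOM. Returns the first size
    that fits (or None) together with the list of sizes actually tried.
    """
    tried: List[int] = []
    for ctx in halving_schedule(start, floor, max_attempts):
        tried.append(ctx)
        if probe(ctx):
            return ctx, tried
    return None, tried


# Warmup-style run where every probe fails, as in the log.
ctx, tried = first_fitting_ctx(lambda c: False, 256 * K, 8 * K)
print(ctx, [t // K for t in tried])

# Startup-style run capped at three attempts.
print([t // K for t in halving_schedule(256 * K, 8 * K, max_attempts=3)])
```

With a probe that always fails, the first call tries the same six sizes the warmup step lists, and the capped schedule stops at 64K, like the three startup attempts.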
Usage:
kaiwu run <model> [flags]
Flags:
--bench Run benchmark after starting
--ctx-size int Manually set the context size (0 = auto)
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string Use a custom llama-server binary (full path)
--reset Clear the cache and rerun warmup to probe for optimal parameters
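
Since every automatic probe failed, a natural next step is to pin the context size by hand using the flags listed above. The sketch below only assembles the argument list for `kaiwu run`; it does not run anything. `MODEL_NAME` is a placeholder, because the log does not show which model identifier the CLI accepts. The helper name `build_run_command` is hypothetical, and the value passed to `--ctx-size` is assumed to be a token count.

```python
import shlex
from typing import List, Optional


def build_run_command(model: str,
                      ctx_size: int = 0,
                      llama_server: Optional[str] = None,
                      reset: bool = False,
                      fast: bool = False,
                      bench: bool = False) -> List[str]:
    """Assemble an argv list for `kaiwu run` from the flags in the usage text.

    ctx_size=0 leaves --ctx-size off, which the usage text describes as auto.
    """
    argv = ["kaiwu", "run", model]
    if ctx_size:
        argv += ["--ctx-size", str(ctx_size)]
    if llama_server:
        argv += ["--llama-server", llama_server]
    if reset:
        argv.append("--reset")
    if fast:
        argv.append("--fast")
    if bench:
        argv.append("--bench")
    return argv


# Placeholder model name; substitute whatever identifier the CLI expects.
cmd = build_run_command("MODEL_NAME", ctx_size=4096, reset=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once the placeholder is replaced. Keeping the arguments as a list avoids shell quoting problems if the `--llama-server` path contains spaces.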