PS C:\Users\cfm880\Downloads> kaiwu run .\Qwen3-Coder-30B-APEX-I-Quality.gguf
Local LLM Deployer v0.1.6 llama.cpp b8864
by
llmbbs.ai Local AI Tech Community
[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 3080 Laptop GPU (SM86, 16384 MB VRAM, 760 GB/s)
RAM: 63 GB DDR4
OS: windows amd64
CUDA 13.2 detected: known bug with low-bit quantization
If you see garbled output, downgrade driver to CUDA 13.1
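If that warning fires, it is worth cross-checking what driver the probe actually saw; nvidia-smi reports the same information directly. This is a manual check, not something kaiwu runs for you:

PS C:\Users\cfm880\Downloads> nvidia-smi
# the banner line reports the driver version and the CUDA version it supports;
# the memory column should show the same 16384 MiB of VRAM the probe detected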
[2/6] Selecting configuration...
Model: Qwen3-Coder-30B-A3B-Instruct (MoE, 30B total / 3B active)
Quant: Q6_K (18.1 GB)
Mode: moe_offload (experts on CPU)
Accel: Flash Attention
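For reference, moe_offload roughly maps to a llama-server launch that keeps the MoE expert tensors in system RAM while the rest of the model stays in VRAM. The exact flags kaiwu passes are not shown in this log, so the command below is only a sketch of an equivalent manual llama.cpp invocation; flag spellings (especially for flash attention) vary between llama.cpp builds:

PS C:\Users\cfm880\Downloads> .\llama-server-cuda.exe -m .\Qwen3-Coder-30B-APEX-I-Quality.gguf `
>>   -ngl 99 -ot ".ffn_.*_exps.=CPU" -fa on -c 32768 -ub 512 --mlock --port 11434
# -ot routes the expert FFN tensors to CPU RAM; -ngl 99 keeps all remaining layers on the GPU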
[3/6] Checking files...
Using bundled binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-Coder-30B-APEX-I-Quality.gguf [cached]
[4/6] Preflight check...
VRAM sufficient
[5/6] Warmup benchmark...
Probe 1: ctx=128K ... 13.6 tok/s (< 18, too slow)
Probe 2: ctx=64K ... 14.6 tok/s (< 18, too slow)
Probe 3: ctx=32K ... 15.7 tok/s (< 18, too slow)
Probe 4: ctx=16K ... 14.8 tok/s (< 18, too slow)
Probe 5: ctx=8K ... 13.8 tok/s (< 18, too slow)
Tune ubatch: ub=128 → 14.5 tok/s; ub=512 → 14.6 tok/s
Selected: 14.6 tok/s @ 32K ctx
Saved profile: C:\Users\cfm880\.kaiwu\profiles\qwen3-coder-30b-apex-i-quality_sm86_16384mb_ddr4.json
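The tok/s figures above are kaiwu's own probe numbers. For an independent decode-speed reading with the same weights you can use llama.cpp's llama-bench tool (not bundled with kaiwu, so this assumes you have a matching llama.cpp build on hand):

PS C:\Users\cfm880\Downloads> .\llama-bench.exe -m .\Qwen3-Coder-30B-APEX-I-Quality.gguf -ngl 99 -p 512 -n 128 -ub 512
# the pp512 row measures prompt processing, the tg128 row measures token generation
# (the generation figure is the one comparable to kaiwu's probe)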
[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
llama-server started (PID 17428, port 11434)
Kaiwu proxy started (port 11435)
2026/04/25 14:35:09 Kaiwu proxy listening on :11435 → llama-server :11434
┌─────────────────────────────────────────────────┐
│ Ready Qwen3-Coder-30B-A3B-Instruct @ 14.6 tok/s │
│ API: http://127.0.0.1:11435/v1/chat/completions │
│ Model folder: C:\Users\cfm880\.kaiwu\models     │
└─────────────────────────────────────────────────┘
Run kaiwu inject to connect your IDE    Ctrl+C to stop
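The endpoint in the box is the standard OpenAI-compatible chat completions route served by llama-server behind the Kaiwu proxy, so any OpenAI-style client should work against it. A minimal smoke test from the same PowerShell session might look like this (the model field is largely cosmetic for a single-model server):

PS C:\Users\cfm880\Downloads> $body = @{ model = "qwen3-coder"; messages = @(@{ role = "user"; content = "Write hello world in Go." }) } | ConvertTo-Json -Depth 5
PS C:\Users\cfm880\Downloads> Invoke-RestMethod -Uri http://127.0.0.1:11435/v1/chat/completions -Method Post -ContentType "application/json" -Body $body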
─ Live monitor (idle) ─────────────────── refresh every 2s ─
reuse:1024 KV:f16 32K ctx ub512 mlock
   Speed        VRAM          RAM          GPU        Temp
   tok/s      6.4/16 GB    30.4/64 GB      0%         50°C
[..........] [====......] [====......] [..........] [=====.....]
─────────────────────────────────────────────────────────
Context [....................] 0.0K / 32K    32.0K free
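If the monitor ever shows the proxy up but requests hanging, llama-server itself exposes a /health endpoint on its own port (11434 here), which is a quick way to tell whether the backend or the proxy is the problem:

PS C:\Users\cfm880\Downloads> Invoke-RestMethod -Uri http://127.0.0.1:11434/health
# returns {"status":"ok"} once the model is loaded and the server is ready to serve requests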