kevan's recent timeline updates

www.6t6t.com Geo

Shanghai

KevanTM

6t6t.com 风之轩 Easy Listening

kevan 提问技术话题好玩工作信息交易信息城市相关

kevan

6t6t.com 风之轩 Easy Listening ...

V2EX member #21701, joined on 2012-05-31 17:34:09 +08:00

出京东 plus 2 小时家政 40 Y2tja2xvaw==

二手交易 kevan 8 days ago Lastly replied by kevan

6T6T.COM 出

域名 kevan Apr 10 Lastly replied by yiihub

出京东 plus 家政 2 小时

二手交易 kevan Feb 2 Lastly replied by imsoso

刚才在微信群发个 URL： antigravity.google 微信立即被迫掉线，重登后需要确认社区规则条款，服了。

微信 kevan Dec 25, 2025 Lastly replied by luojun09211

推广 kevan Dec 12, 2025

推广 kevan Dec 13, 2025 Lastly replied by stinkytofux

出 JD 家政 2 小时

二手交易 kevan Dec 9, 2025

石头扫地机可以 APP 唤醒机器人来我当前的位置，是通过什么技术实现定位的。

问与答 kevan Dec 3, 2025 Lastly replied by kevan

求助，我笔记本的 4060 跑语言模型很困难吗？

Local LLM kevan Oct 14, 2024 Lastly replied by chronos

Hi, 欢迎光临我的音乐小站风之轩 Easy Listening

音乐 kevan May 31, 2012 Lastly replied by kevan

More topics by kevan

kevan's recent replies

5 days ago

Replied to a topic by babymonster Local LLM 我自己的电脑是 5070Ti，总感觉跑一些模型算力不够

@iovekkk 我也是这个配置,跑 27b 都费劲.要想流畅体验只能 9B

8 days ago

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

还是用不了

Apr 28

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

@KaiWuBOSS 下班回家马上试用反馈，我跟进了好久，哈哈哈，必须支持

Apr 28

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

能优化一个 50 系能用的版本吗?

Apr 25

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

@KaiWuBOSS 大佬,为什么我下载 0.1.6 版本还是不行啊??????

>kaiwu run Qwen3-30B-A3B-UD-Q3_K_XL.gguf --reset

本地大模型部署器 vv0.1.6 llama.cpp b8864
by llmbbs.ai 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 5070 Ti (SM120, 16303 MB VRAM, 896 GB/s)
RAM: 31 GB UNKNOWN
OS: windows amd64
CUDA 13.2 detected known bug with low-bit quantization
If you see garbled output, downgrade driver to CUDA 13.1
Warning: RTX 50 series with CUDA 13.2 detected
Kaiwu will use CUDA 12.4 binary for stability.

[2/6] Selecting configuration...
Model: Qwen3-30B-A3B (moe, 29B total / 2B active)
Quant: Q3_K_M (12.9 GB)
Mode: full_gpu
Accel: Flash Attention

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf [cached]

[4/6] Preflight check...
RTX 50 系首次启动需要 JIT 编译 (~30s)，请稍候...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
VRAM sufficient

[5/6] Warmup benhmark...
已清除缓存，重新探测
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce RTX 5070 Ti: 16303 MB VRAM
模型 Qwen3-30B-A3B: ~13189 MB
KV cache (4K, q4_0): ~96 MB
预估总需: ~14309 MB

建议:
1. 选择更小的量化 (Q4_K_M 或 Q2_K)
2. 选择更小的模型

Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

C:\Kevan\AI\kaiwu-windows-amd64>

Apr 25

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

@KaiWuBOSS 本来没什么欲望的，看到你的介绍感觉焕发新生一样，傻瓜式安装，能不能推荐一下具体安装方法，你说的那些太资深

Apr 25

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

kaiwu.exe run Qwen3-30B-A3B

本地大模型部署器 vv0.1.2 llama.cpp b8864
by llmbbs.ai 本地 AI 技术社区

[1/6] Probing hardware...
GPU: NVIDIA GeForce RTX 5070 Ti (SM120, 16303 MB VRAM, 0 GB/s)
RAM: 31 GB UNKNOWN
OS: windows amd64

[2/6] Selecting configuration...
Model: Qwen3-30B-A3B (moe, 30B total / 3B active)
Quant: ud-q3-k-xl (14.0 GB)
Mode: full_gpu
Accel: Flash Attention + MTP (native)

[3/6] Checking files...
Using bundled iso3 binary: llama-server-cuda.exe
Binary: llama-server-cuda.exe [cached]
Model: Qwen3-30B-A3B-UD-Q3_K_XL.gguf [cached]

[4/6] Preflight check...
llama-server 不支持 iso3 ，回退到 q8_0/q4_0
VRAM sufficient

[5/6] Warmup benchmark...
Probe 1: ctx=8K ... OOM
Probe 2: ctx=4K ... OOM
Warmup failed: all ctx probes failed (tried down to 4K)
Using default parameters

[6/6] Starting server...
Waiting for llama-server to be ready (port 11434)...
显存不足，降低上下文至 4K 重试...
Waiting for llama-server to be ready (port 11434)...
Error: failed to start llama-server: 连续 2 次启动失败，即使最小上下文(4K)也无法运行

NVIDIA GeForce RTX 5070 Ti: 16303 MB VRAM
模型 Qwen3-30B-A3B: ~14336 MB
KV cache (4K, q4_0): ~96 MB
预估总需: ~15456 MB

建议:
1. 选择更小的量化 (Q2_K)
2. 选择更小的模型
3. 使用 MoE offload 模型（ experts 放 CPU RAM ）
Usage:
kaiwu run <model> [flags]

Flags:
--bench Run benchmark after starting
--ctx-size int 手动指定上下文大小（ 0=自动）
--fast Skip warmup, use cached profile
-h, --help help for run
--llama-server string 使用自定义 llama-server 二进制（完整路径）
--reset 清除缓存，重新 warmup 探测最优参数

醉了?怎么和介绍的不一样呢?
介绍说 8GB 显存都能跑,我 16G 显存怎么不行啊?

Apr 25

Replied to a topic by KaiWuBOSS Local LLM 我做了个工具让 8GB 显卡跑 30B 模型从 3 tok/s 提到 21 tok/s，记录一下技术发现

大佬,问一下我 5070TI 16GB + 32GB 内存用哪个模型比较合适? 想用来跑小龙虾

Apr 23

Replied to a topic by intoext 业界八卦大忽悠贾跃亭回国了？

回来保证被围剿

More replies by kevan