Preface
- To make the most of the GPUs, and because conversations in the dify platform kept drifting off-topic for no obvious reason while iterating on the model, I searched around. It looked like a context-token limit problem, so I tried to fix it.
GPU specs
nvidia-smi
Thu Aug  7 14:14:00 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090 D      Off |   00000000:01:00.0 Off |                  N/A |
| 30%   47C    P8             20W /  600W |      18MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090 D      Off |   00000000:06:00.0  On |                  N/A |
| 31%   46C    P8             30W /  600W |     353MiB /  32607MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                       85MiB |
|    1   N/A  N/A            4070      G   /usr/bin/gnome-shell                     25MiB |
|    1   N/A  N/A            6676      G   /usr/bin/gnome-control-center           128MiB |
|    1   N/A  N/A            8771      G   .../6227/usr/lib/firefox/firefox         28MiB |
|    1   N/A  N/A         3964898      G   ...bin/snapd-desktop-integration         12MiB |
+-----------------------------------------------------------------------------------------+
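For a quick sanity check of how much combined VRAM the two cards offer, the memory columns above can be processed programmatically. A minimal Python sketch — the CSV text is inlined from the table above so the snippet is self-contained; in practice it would come from nvidia-smi's query mode:

```python
# Sketch: sum usable VRAM across GPUs from nvidia-smi's CSV query mode.
# In practice the CSV text comes from:
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
# Here the values from the table above are inlined for illustration.
csv_text = """0, 18, 32607
1, 353, 32607"""

total_mib = 0
free_mib = 0
for line in csv_text.strip().splitlines():
    idx, used, total = (int(x) for x in line.split(","))
    total_mib += total
    free_mib += total - used

print(f"total VRAM: {total_mib} MiB, free: {free_mib} MiB")
# Two 32 GB cards give roughly 64 GiB combined -- relevant later, when the
# 35 GB q8 model plus its KV cache has to fit entirely on GPU.
```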
Debugging ollama
1. List models
ollama list
Output:
NAME                                   ID              SIZE     MODIFIED
qwen3:1.7b                             8f68893c685c    1.4 GB   2 weeks ago
deepseek-r1:7b                         755ced02ce7b    4.7 GB   2 months ago
linux6200/bge-reranker-v2-m3:latest    abf5c6d8bc56    1.2 GB   2 months ago
bge-m3:latest                          790764642607    1.2 GB   2 months ago
qwen3:32b-q8_0                         56a39c0a7ff6    35 GB    2 months ago
cnshenyang/qwen3-nothink:32b           4c2c9ebb35c4    20 GB    2 months ago
llama2:latest                          78e26419b446    3.8 GB   2 months ago
2. Check the model's current context length (in tokens)
ollama show qwen3:32b-q8_0
Output:
  Model
    architecture        qwen3
    parameters          32.8B
    context length      40960
    embedding length    5120
    quantization        Q8_0

  Capabilities
    completion
    tools

  Parameters
    repeat_penalty    1
    stop              "<|im_start|>"
    stop              "<|im_end|>"
    temperature       0.6
    top_k             20
    top_p             0.95

  License
    Apache License
    Version 2.0, January 2004
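If you need the context length in a script rather than by eye, the text output above is easy to parse. A Python sketch — the sample string is the output captured above; in practice it would come from running `ollama show` via `subprocess`:

```python
# Sketch: pull the "context length" field out of `ollama show`'s text output.
# The sample is the output captured above; in practice you would obtain it with
# subprocess.run(["ollama", "show", "qwen3:32b-q8_0"], capture_output=True, text=True).
sample = """\
Model
  architecture        qwen3
  parameters          32.8B
  context length      40960
  embedding length    5120
  quantization        Q8_0
"""

def context_length(show_output: str) -> int:
    # Find the line whose first two words are "context length".
    for line in show_output.splitlines():
        parts = line.split()
        if parts[:2] == ["context", "length"]:
            return int(parts[2])
    raise ValueError("context length not found")

print(context_length(sample))  # 40960
```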
3. Permanently change the model's context length via a Modelfile — first export the current one:
ollama show --modelfile qwen3:32b-q8_0 > Modelfile
4. Edit the Modelfile
sudo vim Modelfile
Locate these existing parameters:
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
PARAMETER top_k 20
and add the following below them:
PARAMETER num_ctx 51200
PARAMETER num_predict -1
Then press Esc and type :wq to save the file and exit.
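The same edit can be done non-interactively, which is handy for scripting. A Python sketch — the initial Modelfile contents here are fabricated to mirror the parameters shown above; in practice the file comes from `ollama show --modelfile qwen3:32b-q8_0 > Modelfile`:

```python
# Sketch: append the two new parameters to the Modelfile without opening vim.
# The starting contents are fabricated for illustration; normally the file is
# produced by `ollama show --modelfile qwen3:32b-q8_0 > Modelfile`.
from pathlib import Path

modelfile = Path("Modelfile")
modelfile.write_text(
    "FROM qwen3:32b-q8_0\n"
    "PARAMETER temperature 0.6\n"
    "PARAMETER top_k 20\n"
)

# Append below the existing PARAMETER block.
with modelfile.open("a") as f:
    f.write("PARAMETER num_ctx 51200\n")
    f.write("PARAMETER num_predict -1\n")

print(modelfile.read_text())
```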
5. Create the new model
Run:
sudo ollama create qwen3:32b-q8_51k -f Modelfile
Output like the following means it succeeded:
gathering model components
copying file sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484 100%
parsing GGUF
using existing layer sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484
using existing layer sha256:eb4402837c7829a690fa845de4d7f3fd842c2adee476d5341da8a46ea9255175
using existing layer sha256:d18a5cc71b84bc4af394a31116bd3932b42241de70c77d2b76d69a314ec8aa12
creating new layer sha256:453c2558599bfbf972a181815e6fd5a7309716a2f933c332afd084037b27525e
writing manifest
success
Testing the call from dify
Once the workflow is running, check how the model is loaded:
ollama ps
Output:
NAME                ID              SIZE     PROCESSOR          UNTIL
qwen3:32b-q8_51k    974101428fa0    86 GB    25%/75% CPU/GPU    Less than a second from now
Then, switching back to the original model with its stock 40960 context, the output was:
NAME              ID              SIZE     PROCESSOR    UNTIL
qwen3:32b-q8_0    56a39c0a7ff6    44 GB    100% GPU     Less than a second from now
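The jump from 44 GB to 86 GB is larger than the KV cache alone would explain (Ollama may also allocate larger compute buffers and per-slot state), but even a back-of-the-envelope KV-cache estimate shows why a longer context stops fitting on two 32 GB cards. The architecture numbers below (64 layers, 8 grouped-query KV heads, head dimension 128 for Qwen3 32B) are my assumption, not taken from the output above:

```python
# Rough KV-cache size estimate for a grouped-query-attention transformer.
# ASSUMED architecture numbers for Qwen3 32B (not printed by `ollama show`):
N_LAYERS = 64      # hidden layers (assumption)
N_KV_HEADS = 8     # grouped-query KV heads (assumption)
HEAD_DIM = 128     # per-head dimension (assumption)
BYTES = 2          # fp16 K/V entries

def kv_cache_bytes(num_ctx: int) -> int:
    # 2x for K and V, per layer, per KV head, per head dimension, per token.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * num_ctx

for ctx in (40960, 51200):
    print(f"num_ctx={ctx}: ~{kv_cache_bytes(ctx) / 2**30:.1f} GiB of KV cache")
```

Under these assumptions the cache alone is about 10 GiB at 40960 tokens and 12.5 GiB at 51200 — on top of the ~35 GB of q8 weights, which is why the larger context spills onto the CPU.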
Conclusion
With two 32 GB RTX 5090D cards sharing the load, Qwen3 32B q8 tops out at its stock 40960-token context if it is to stay fully on the GPUs; raising num_ctx beyond that spills part of the model onto the CPU. Since the model side can't go higher for now, consider adjusting the context on the dify side when calling the model instead, as shown in the figure:
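Instead of baking a larger num_ctx into a new model, the context window can also be set per request through the `options` field of Ollama's `/api/generate` endpoint — which is effectively what tuning it on the dify side does. A sketch that only builds and inspects the payload (nothing is sent; sending it would require a running Ollama server at the default http://localhost:11434):

```python
import json

# Sketch: per-request context override via Ollama's generate API options.
# This only builds the payload; POSTing it to /api/generate requires a
# running Ollama server. The prompt text is a placeholder.
payload = {
    "model": "qwen3:32b-q8_0",
    "prompt": "Summarize the meeting notes...",
    "stream": False,
    "options": {
        "num_ctx": 40960,   # per-request context window
        "num_predict": -1,  # no cap on generated tokens
    },
}

print(json.dumps(payload, indent=2))
```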
Miscellaneous
1. Remove unneeded models
sudo ollama rm qwen3:32b-q8-custom