Preface

  • I wanted to make the most of the GPUs' compute, and when iterating on models in the Dify platform the conversation would inexplicably drift off-topic mid-run. A bit of searching suggested it was a context-window (token) limit issue, so I set out to fix it.

GPU Specs

nvidia-smi
Thu Aug  7 14:14:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090 D      Off |   00000000:01:00.0 Off |                  N/A |
| 30%   47C    P8             20W /  600W |      18MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090 D      Off |   00000000:06:00.0  On |                  N/A |
| 31%   46C    P8             30W /  600W |     353MiB /  32607MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                       85MiB |
|    1   N/A  N/A            4070      G   /usr/bin/gnome-shell                     25MiB |
|    1   N/A  N/A            6676      G   /usr/bin/gnome-control-center           128MiB |
|    1   N/A  N/A            8771      G   .../6227/usr/lib/firefox/firefox         28MiB |
|    1   N/A  N/A         3964898      G   ...bin/snapd-desktop-integration         12MiB |
+-----------------------------------------------------------------------------------------+

Ollama Tuning

1. List the installed models

ollama list

The output:

NAME                                   ID              SIZE      MODIFIED        
qwen3:1.7b                             8f68893c685c    1.4 GB    2 weeks ago     
deepseek-r1:7b                         755ced02ce7b    4.7 GB    2 months ago    
linux6200/bge-reranker-v2-m3:latest    abf5c6d8bc56    1.2 GB    2 months ago    
bge-m3:latest                          790764642607    1.2 GB    2 months ago    
qwen3:32b-q8_0                         56a39c0a7ff6    35 GB     2 months ago    
cnshenyang/qwen3-nothink:32b           4c2c9ebb35c4    20 GB     2 months ago    
llama2:latest                          78e26419b446    3.8 GB    2 months ago 

2. Check the model's current context length (in tokens)

ollama show qwen3:32b-q8_0

The output:

  Model
    architecture        qwen3    
    parameters          32.8B    
    context length      40960    
    embedding length    5120     
    quantization        Q8_0     

  Capabilities
    completion    
    tools         

  Parameters
    repeat_penalty    1                 
    stop              "<|im_start|>"    
    stop              "<|im_end|>"      
    temperature       0.6               
    top_k             20                
    top_p             0.95              

  License
    Apache License               
    Version 2.0, January 2004    

3. Permanently change the context length via a Modelfile

Export the model's current Modelfile:

ollama show --modelfile qwen3:32b-q8_0 > Modelfile

4. Edit the Modelfile

sudo vim Modelfile

Below these existing lines:

PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
PARAMETER top_k 20

add the following:

PARAMETER num_ctx 51200
PARAMETER num_predict -1

Then press Esc to leave insert mode and save and quit with :wq.
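For reference, the edited Modelfile should end up looking roughly like the sketch below. The FROM path, digest, and TEMPLATE body are placeholders; keep whatever `ollama show --modelfile` actually exported and only append the two new PARAMETER lines:

```text
# Sketch of the edited Modelfile -- paths/digests/template are placeholders,
# not real values; keep the exported content as-is.
FROM /usr/share/ollama/.ollama/models/blobs/sha256-<digest>
TEMPLATE """..."""
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
PARAMETER top_k 20
# Newly added lines:
PARAMETER num_ctx 51200
PARAMETER num_predict -1
```

num_ctx sets the context window the server allocates for this model, and num_predict -1 removes the cap on generated tokens.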

5. Create the new model

Run the following command to create the new model:

sudo ollama create qwen3:32b-q8_51k -f Modelfile

The following output indicates success:

gathering model components 
copying file sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484 100% 
parsing GGUF 
using existing layer sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484 
using existing layer sha256:eb4402837c7829a690fa845de4d7f3fd842c2adee476d5341da8a46ea9255175 
using existing layer sha256:d18a5cc71b84bc4af394a31116bd3932b42241de70c77d2b76d69a314ec8aa12 
creating new layer sha256:453c2558599bfbf972a181815e6fd5a7309716a2f933c332afd084037b27525e 
writing manifest 
success 

Dify Invocation Test

With the workflow running, check GPU usage:

ollama ps

The output:

NAME                ID              SIZE     PROCESSOR          UNTIL                       
qwen3:32b-q8_51k    974101428fa0    86 GB    25%/75% CPU/GPU    Less than a second from now    

Switching back to the native 40960-token context, the output becomes:

NAME              ID              SIZE     PROCESSOR    UNTIL                       
qwen3:32b-q8_0    56a39c0a7ff6    44 GB    100% GPU     Less than a second from now  
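A rough back-of-the-envelope calculation helps explain why the 51200-token build spilled onto the CPU. This sketch assumes Qwen3-32B's published attention shape (64 layers, 8 KV heads, head dim 128) and an f16 KV cache; these figures are assumptions, not something ollama reports, and the estimate ignores compute/scratch buffers, which also grow with context:

```python
# Rough KV-cache size estimate for Qwen3-32B.
# Assumed architecture (not reported by ollama): 64 layers,
# 8 KV heads, head_dim 128, f16 (2-byte) cache entries.
def kv_cache_bytes(num_ctx, layers=64, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return num_ctx * per_token

for ctx in (40960, 51200):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.1f} GiB KV cache")
# num_ctx=40960: ~10.0 GiB KV cache
# num_ctx=51200: ~12.5 GiB KV cache
```

On top of the ~35 GB of Q8 weights, even the default 40960-token cache comes close to filling two 32 GB cards. The jump `ollama ps` reports (44 GB to 86 GB) is far larger than the KV cache alone, presumably because scratch buffers also scale with num_ctx, which is what pushes 25% of the load onto the CPU.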

Conclusion

With the load balanced across two 32 GB RTX 5090 D cards, the Qwen3 32B Q8 model runs 100% on GPU at its native 40960-token context, but raising num_ctx to 51200 balloons the footprint to 86 GB and offloads 25% of it to the CPU. So 40960 is the practical ceiling on this hardware; rather than raising it on the model side, consider capping the context when invoking the model from Dify, as shown:
(example screenshot)

Misc

1. Remove unneeded models

sudo ollama rm qwen3:32b-q8-custom
Last modified: August 7, 2025