Preface

  • I wanted to make the most of the GPUs' compute, and when iterating on models in the Dify platform the conversation would inexplicably drift off-topic mid-run. A bit of searching suggested it was a context-window (token) limit issue, so I set out to fix it.

GPU Specs

nvidia-smi
Thu Aug  7 14:14:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090 D      Off |   00000000:01:00.0 Off |                  N/A |
| 30%   47C    P8             20W /  600W |      18MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5090 D      Off |   00000000:06:00.0  On |                  N/A |
| 31%   46C    P8             30W /  600W |     353MiB /  32607MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                        4MiB |
|    1   N/A  N/A            3833      G   /usr/lib/xorg/Xorg                       85MiB |
|    1   N/A  N/A            4070      G   /usr/bin/gnome-shell                     25MiB |
|    1   N/A  N/A            6676      G   /usr/bin/gnome-control-center           128MiB |
|    1   N/A  N/A            8771      G   .../6227/usr/lib/firefox/firefox         28MiB |
|    1   N/A  N/A         3964898      G   ...bin/snapd-desktop-integration         12MiB |
+-----------------------------------------------------------------------------------------+

Ollama Tuning

1. List the installed models

ollama list

The output:

NAME                                   ID              SIZE      MODIFIED        
qwen3:1.7b                             8f68893c685c    1.4 GB    2 weeks ago     
deepseek-r1:7b                         755ced02ce7b    4.7 GB    2 months ago    
linux6200/bge-reranker-v2-m3:latest    abf5c6d8bc56    1.2 GB    2 months ago    
bge-m3:latest                          790764642607    1.2 GB    2 months ago    
qwen3:32b-q8_0                         56a39c0a7ff6    35 GB     2 months ago    
cnshenyang/qwen3-nothink:32b           4c2c9ebb35c4    20 GB     2 months ago    
llama2:latest                          78e26419b446    3.8 GB    2 months ago 

2. Check the model's current context length (in tokens)

ollama show qwen3:32b-q8_0

The output:

  Model
    architecture        qwen3    
    parameters          32.8B    
    context length      40960    
    embedding length    5120     
    quantization        Q8_0     

  Capabilities
    completion    
    tools         

  Parameters
    repeat_penalty    1                 
    stop              "<|im_start|>"    
    stop              "<|im_end|>"      
    temperature       0.6               
    top_k             20                
    top_p             0.95              

  License
    Apache License               
    Version 2.0, January 2004    

3. Permanently change the context length via a Modelfile

Export the model's current Modelfile:

ollama show --modelfile qwen3:32b-q8_0 > Modelfile

4. Edit the Modelfile

sudo vim Modelfile

Below these existing lines:

PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
PARAMETER top_k 20

add the following:

PARAMETER num_ctx 51200
PARAMETER num_predict -1

Then press Esc to leave insert mode and save and quit with :wq.
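For reference, the edited Modelfile should end up looking roughly like the sketch below. The FROM path, digest, and TEMPLATE body are placeholders; keep whatever `ollama show --modelfile` actually exported and only append the two new PARAMETER lines:

```text
# Sketch of the edited Modelfile -- paths/digests/template are placeholders,
# not real values; keep the exported content as-is.
FROM /usr/share/ollama/.ollama/models/blobs/sha256-<digest>
TEMPLATE """..."""
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.6
PARAMETER top_k 20
# Newly added lines:
PARAMETER num_ctx 51200
PARAMETER num_predict -1
```

num_ctx sets the context window the server allocates for this model, and num_predict -1 removes the cap on generated tokens.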

5. Create the new model

Run the following command to create the new model:

sudo ollama create qwen3:32b-q8_51k -f Modelfile

The following output indicates success:

gathering model components 
copying file sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484 100% 
parsing GGUF 
using existing layer sha256:de447d788da3df6b4ea340408b13fc2c3a2043a2dfc19178b12d501a4bd96484 
using existing layer sha256:eb4402837c7829a690fa845de4d7f3fd842c2adee476d5341da8a46ea9255175 
using existing layer sha256:d18a5cc71b84bc4af394a31116bd3932b42241de70c77d2b76d69a314ec8aa12 
creating new layer sha256:453c2558599bfbf972a181815e6fd5a7309716a2f933c332afd084037b27525e 
writing manifest 
success 

Dify Invocation Test

With the workflow running, check GPU usage:

ollama ps

The output:

NAME                ID              SIZE     PROCESSOR          UNTIL                       
qwen3:32b-q8_51k    974101428fa0    86 GB    25%/75% CPU/GPU    Less than a second from now    

Switching back to the native 40960-token context, the output becomes:

NAME              ID              SIZE     PROCESSOR    UNTIL                       
qwen3:32b-q8_0    56a39c0a7ff6    44 GB    100% GPU     Less than a second from now  
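A rough back-of-the-envelope calculation helps explain why the 51200-token build spilled onto the CPU. This sketch assumes Qwen3-32B's published attention shape (64 layers, 8 KV heads, head dim 128) and an f16 KV cache; these figures are assumptions, not something ollama reports, and the estimate ignores compute/scratch buffers, which also grow with context:

```python
# Rough KV-cache size estimate for Qwen3-32B.
# Assumed architecture (not reported by ollama): 64 layers,
# 8 KV heads, head_dim 128, f16 (2-byte) cache entries.
def kv_cache_bytes(num_ctx, layers=64, kv_heads=8, head_dim=128, dtype_bytes=2):
    # K and V each store layers * kv_heads * head_dim values per token
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return num_ctx * per_token

for ctx in (40960, 51200):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx}: ~{gib:.1f} GiB KV cache")
# num_ctx=40960: ~10.0 GiB KV cache
# num_ctx=51200: ~12.5 GiB KV cache
```

On top of the ~35 GB of Q8 weights, even the default 40960-token cache comes close to filling two 32 GB cards. The jump `ollama ps` reports (44 GB to 86 GB) is far larger than the KV cache alone, presumably because scratch buffers also scale with num_ctx, which is what pushes 25% of the load onto the CPU.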

Conclusion

With the load balanced across two 32 GB RTX 5090 D cards, the Qwen3 32B Q8 model runs 100% on GPU at its native 40960-token context, but raising num_ctx to 51200 balloons the footprint to 86 GB and offloads 25% of it to the CPU. So 40960 is the practical ceiling on this hardware; rather than raising it on the model side, consider capping the context when invoking the model from Dify, as shown:
(example screenshot)

Misc

1. Remove unneeded models

sudo ollama rm qwen3:32b-q8-custom
Last modified: August 7, 2025