From 0772a2f0c83b9ad388d8cc378ee60d7d30dbc195 Mon Sep 17 00:00:00 2001
From: Zeng Aohan <Sengxian@users.noreply.huggingface.co>
Date: Mon, 13 Mar 2023 17:19:13 +0000
Subject: [PATCH] Update README.md

---
 README.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 3eabc1e..6ce4617 100644
--- a/README.md
+++ b/README.md
@@ -56,10 +56,14 @@ print(history)
 model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
 ```
 
-Replace it with
-
+Replace it with (8-bit quantization):
 ```python
-model = AutoModel.from_pretrained("THUDM/chatglm-6b", device_map="auto", load_in_8bit=True, trust_remote_code=True)
+model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(8).cuda()
+```
+
+Or (4-bit quantization):
+```python
+model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
 ```
 
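 After 2 to 3 rounds of dialogue, GPU memory usage is about 10GB with 8-bit quantization and only 6GB with 4-bit quantization. Memory consumption grows as the number of dialogue rounds increases.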