Upload README.md (#2)

- Upload README.md (0cb5b6e777b0b8614bfd9a96598e1c6b2a5ce6f4)
- Upload README.md (c4086b6db5703a9e470b1169c6d50034a61b800b)
- Upload README.md (508c92b25eed2742376dfef741e7defd580d0af6)

README.md CHANGED
@@ -86,7 +86,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 model_dir = "internlm/internlm3-8b-instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
-
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
 # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
 # pip install -U bitsandbytes
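For reference, the bitsandbytes hint in the comments above corresponds roughly to the sketch below. It is not part of the diff; the `BitsAndBytesConfig` settings (compute dtype, `device_map`) are illustrative assumptions.

```python
# Sketch only: 4-bit loading of InternLM3-8B-Instruct via bitsandbytes.
# The README only hints at this in comments; the exact settings here are assumed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_dir = "internlm/internlm3-8b-instruct"
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # ~8GB GPU memory for the 8B model, per the README comment
    bnb_4bit_compute_dtype=torch.float16,  # assumed compute dtype
)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
    quantization_config=quant_config,
    device_map="auto",
)
```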
@@ -108,6 +108,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=
 generated_ids = [
     output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
 ]
+prompt = tokenizer.batch_decode(tokenized_chat)[0]
+print(prompt)
 response = tokenizer.batch_decode(generated_ids)[0]
 print(response)
 ```
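The changed lines sit inside the README's Transformers generation example. A hedged reconstruction of the surrounding cell is sketched below; the message content and sampling arguments are assumptions, and only the hunk lines above are verbatim.

```python
# Hedged reconstruction of the cell this hunk edits; messages and sampling
# settings are assumptions, not taken from the diff.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please tell me five scenic spots in Shanghai"},
]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=1.0, top_p=0.8, do_sample=True)

# Keep only the newly generated tokens.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
]
# Lines added by this commit: decode and print the rendered prompt for inspection.
prompt = tokenizer.batch_decode(tokenized_chat)[0]
print(prompt)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
```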
@@ -153,6 +155,10 @@ Find more details in the [LMDeploy documentation](https://lmdeploy.readthedocs.i
 
 
 
+#### Ollama inference
+
+TODO
+
 #### vLLM inference
 
 We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
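Once a build from that PR is installed, offline inference with vLLM would look roughly like the sketch below. This is an illustration under assumptions (prompt text and sampling values), not the README's own ```python block, whose contents are not shown in this hunk.

```python
# Hedged sketch of vLLM offline inference with InternLM3, assuming the PR above
# has been installed; prompt and sampling values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="internlm/internlm3-8b-instruct", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.8, top_p=0.8, max_tokens=1024)
outputs = llm.generate(["Please tell me five scenic spots in Shanghai"], sampling_params)
print(outputs)
```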
@@ -280,6 +286,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)
 generated_ids = [
     output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
 ]
+prompt = tokenizer.batch_decode(tokenized_chat)[0]
+print(prompt)
 response = tokenizer.batch_decode(generated_ids)[0]
 print(response)
 ```
@@ -308,6 +316,10 @@ response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
 print(response)
 ```
 
+#### Ollama inference
+
+TODO
+
 #### vLLM inference
 
 We are still working on merging the PR(https://github.com/vllm-project/vllm/pull/12037) into vLLM. In the meantime, please use the following PR link to install it manually.
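For context, the `pipe(...)` call in this hunk's header comes from the README's LMDeploy pipeline example. A minimal sketch of that setup is shown below; the message content is an assumption, not taken from the diff.

```python
# Minimal LMDeploy pipeline sketch around the call shown in the hunk header;
# the message content is assumed for illustration.
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("internlm/internlm3-8b-instruct")
messages = [{"role": "user", "content": "Please tell me five scenic spots in Shanghai"}]
response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
print(response)
```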
@@ -345,7 +357,7 @@ print(outputs)
 
 ## Open Source License
 
-
+Code and model weights are licensed under Apache-2.0.
 
 ## Citation
 
@@ -369,7 +381,7 @@ The code is licensed under Apache-2.0, while model weights are fully open for ac
 InternLM3, the third generation of the 书生·浦语 (InternLM) series, open-sources InternLM3-8B-Instruct, an 8-billion-parameter instruction model for general-purpose use and advanced reasoning. The model has the following features:
 
 - **Higher performance at lower cost**:
-  State-of-the-art performance among models of the same size on reasoning and knowledge-intensive tasks, surpassing Llama3.1-8B and Qwen2.5-7B
+  State-of-the-art performance among models of the same size on reasoning and knowledge-intensive tasks, surpassing Llama3.1-8B and Qwen2.5-7B. Notably, InternLM3 is trained on only 4 trillion tokens, saving more than 75% of the training cost compared with models of the same scale.
 - **Deep thinking capability**:
   InternLM3 supports a deep-thinking mode that solves complex reasoning tasks via long chains of thought, while also offering a smoother general-response mode for everyday use.
 
@@ -423,7 +435,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 model_dir = "internlm/internlm3-8b-instruct"
 tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
 # Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
-
+model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True, torch_dtype=torch.float16)
 # (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
 # InternLM3 8B in 4bit will cost nearly 8GB GPU memory.
 # pip install -U bitsandbytes
@@ -445,6 +457,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=1024, temperature=
 generated_ids = [
     output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
 ]
+prompt = tokenizer.batch_decode(tokenized_chat)[0]
+print(prompt)
 response = tokenizer.batch_decode(generated_ids)[0]
 print(response)
 ```
@@ -491,7 +505,12 @@ curl http://localhost:23333/v1/chat/completions \
 
 
 
+##### Ollama inference
+
+TODO
+
 ##### vLLM inference
+
 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. For now, please install it manually from the PR link below.
 
 ```python
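The curl command in this hunk's header targets the OpenAI-compatible server started in the README's LMDeploy serving section. The same request from Python would look roughly like this sketch; the model name and prompt are assumptions.

```python
# Hedged sketch: querying the OpenAI-compatible endpoint shown in the hunk header
# (curl http://localhost:23333/v1/chat/completions); model name and prompt are assumed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:23333/v1", api_key="none")
completion = client.chat.completions.create(
    model="internlm/internlm3-8b-instruct",
    messages=[{"role": "user", "content": "Please tell me five scenic spots in Shanghai"}],
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```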
@@ -616,6 +635,8 @@ generated_ids = model.generate(tokenized_chat, max_new_tokens=8192)
 generated_ids = [
     output_ids[len(input_ids):] for input_ids, output_ids in zip(tokenized_chat, generated_ids)
 ]
+prompt = tokenizer.batch_decode(tokenized_chat)[0]
+print(prompt)
 response = tokenizer.batch_decode(generated_ids)[0]
 print(response)
 ```
@@ -644,6 +665,10 @@ response = pipe(messages, gen_config=GenerationConfig(max_new_tokens=2048))
 print(response)
 ```
 
+##### Ollama inference
+
+TODO
+
 ##### vLLM inference
 
 We are still working on merging the PR (https://github.com/vllm-project/vllm/pull/12037) into vLLM. For now, please install it manually from the PR link below.
@@ -687,7 +712,7 @@ print(outputs)
 
 ## Open Source License
 
-
+The code and model weights in this repository are released under the Apache-2.0 license.
 
 ## Citation
 