Update README.md
README.md
CHANGED
@@ -2,4 +2,47 @@
 license: apache-2.0
 ---
 
-Train in 30B Byte. Mode size 353M. Table 2 in [MambaByte](https://arxiv.org/abs/2401.13660)
+Trained on 30B bytes. Model size: 353M parameters. See Table 2 in [MambaByte](https://arxiv.org/abs/2401.13660).
+
+To use:
+
+```
+import torch
+import numpy as np
+
+from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
+
+model = MambaLMHeadModel.from_pretrained("JunxiongWang/MambaByte_Arxiv", device='cuda', dtype=torch.bfloat16)
+
+text = r"\documentclass[12pt]{article}"  # raw string: "\d" is not a valid escape sequence
+text_byte = np.frombuffer(text.encode('utf-8'), dtype=np.uint8)
+input_ids = torch.from_numpy(text_byte[None, :]).long().cuda()
+
+sample = model.generate(
+    input_ids=input_ids,
+    max_length=2048,
+    cg=True,
+    return_dict_in_generate=True,
+    output_scores=True,
+    enable_timing=True,
+    temperature=1,
+    top_k=256,
+    top_p=0.9,
+)
+
+print(bytes(sample.sequences[0].tolist()).decode('utf-8', errors='replace'))  # tolerate a truncated multi-byte character
+```
+
+Output:
+
+```
+\documentclass[12pt]{article}}}}^{{\mathbf{P}}\uplus{\mathbf{Q}}}}}}}{}}$ is a symmetric poset. This implies that $$\operatorname{end}({\mathscr{L}}) = \operatorname{end}({\mathscr{L}}\setminus\{\sigma_{{\mathbf{P}}}\}) = \operatorname{end}({\mathscr{L}}\setminus\{\sigma_{{\mathbf{Q}}}\}) = \operatorname{end}({\mathscr{L}}\setminus\{\sigma_{{\mathbf{P}}},\sigma_{{\mathbf{Q}}}\}),$$ i.e., ${\mathscr{L}}$ is $\{\sigma_{{\mathbf{P}}},\sigma_{{\mathbf{Q}}}\}$-bistochastic for any ${\mathbf{P}}\neq{\mathbf{Q}}$. Thus, ${\mathscr{L}}$ is reversible, and is in fact maximal among all $\{\sigma_{{\mathbf{P}}},\sigma_{{\mathbf{Q}}}\}$-bistochastic matrices.
+
+Since ${\mathscr{L}}$ is in the same class as $\sigma_{{\mathbf{P}}}^{{\mathbf{Q}}}$, we have $\operatorname{end}({\mathscr{L}})\subseteq\operatorname{end}({\mathscr{L}})$. Conversely, if $\operatorname{end}({\mathscr{L}})=\operatorname{end}({\mathscr{L}})$, then $\sigma_{{\mathbf{P}}}^{{\mathbf{Q}}}$ is maximal in $\operatorname{end}({\mathscr{L}})$. Since ${\mathbf{P}}\setminus\{\sigma_{{\mathbf{P}}}\}\subseteq\operatorname{end}({\mathscr{L}})$, this implies that ${\mathscr{L}}$ is in the same class as $\sigma_{{\mathbf{P}}}^{{\mathbf{Q}}}$, and hence ${\mathscr{L}}$ is reversible.
+
+We are now ready to show that $\{\sigma_{{\mathbf{P}}},\sigma_{{\mathbf{Q}}}\}$-bistochastic matrices form a symmetric poset of ends.
+
+\[lem:end\_symm\_class\] Let ${\mathbf{P}},{\mathbf{Q}}\in{\mathscr{M}}$. Then $\sigma_{{\mathbf{P}}}^{{\mathbf{Q}}}$ is symmetric if and only if $\operatorname{end}({\mathscr{L}})=\operatorname{end}({\mathscr{L}})$.
+
+Suppose that $\operatorname{end}({\mathscr{L}})=\operatorname{end}({\mathscr{L}})$, and we prove that $\sigma_{{\mathbf{P}}}^{{\mathbf{Q}}}$ is symmetric. Clearly, $\operatorname{end}({\mathscr{L}})$ contains exactly the ends of $\operatorname{end}({\mathscr{L}})$ by definition, and the only case that survives is when $\operatorname{end}({\mathscr{L}})=\operatorname{end}({\mathscr{L}})$. By construction, this means that $\sigma_{{\mathbf{P}}}
+```
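The usage snippet added in this change passes raw UTF-8 bytes to the model as token IDs and turns the generated IDs back into text with `bytes(...).decode(...)`. The following is a minimal, CPU-only sketch of that round trip using only the standard library. The helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical and are not part of `mamba_ssm`; the sketch also assumes every ID lies in the range 0-255.

```python
def text_to_byte_ids(text: str) -> list[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Decode byte IDs back to text, replacing invalid or truncated UTF-8."""
    return bytes(ids).decode("utf-8", errors="replace")


# Raw string, so the backslash is kept literally (hypothetical prompt reuse).
prompt = r"\documentclass[12pt]{article}"
ids = text_to_byte_ids(prompt)
print(len(ids), ids[:5])
print(byte_ids_to_text(ids) == prompt)

# A multi-byte character cut in half (as can happen when generation stops
# at max_length) decodes to U+FFFD instead of raising UnicodeDecodeError.
truncated = text_to_byte_ids("é")[:1]
print(repr(byte_ids_to_text(truncated)))
```

Decoding with `errors="replace"` is a design choice for display purposes; use the default strict mode if invalid bytes in the output should surface as an error instead.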