<!-- livebook:{"app_settings":{"access_type":"public","auto_shutdown_ms":5000,"multi_session":true,"output_type":"rich","show_source":true,"slug":"tokenizer-generator"}} -->

# Tokenizer generator

```elixir
Mix.install([
  {:kino, "~> 0.10.0"},
  {:req, "~> 0.4.3"}
])
```

## Info

```elixir
Kino.Markdown.new("""
## Background

HuggingFace repositories store tokenizers in two flavours:

  1. "slow tokenizer" - a tokenizer implemented in Python, with its
     configuration stored in `tokenizer_config.json`

  2. "fast tokenizer" - a tokenizer implemented in Rust and stored as a
     single `tokenizer.json` file

Many repositories only include the files for 1., but the `transformers`
library automatically converts a slow tokenizer into a fast one whenever
possible.

Bumblebee relies on the Rust bindings and therefore always requires the
`tokenizer.json` file. This app generates that file for any repository
that only ships a slow tokenizer.
""")
```
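The distinction above can be sketched in a few lines of Python (a simplified illustration: the file lists are abbreviated and the helper name is ours, not part of any library):

```python
# Simplified view of the two repository layouts described above.
# A slow tokenizer is spread across a config file plus vocabulary files,
# while a fast tokenizer ships as a single self-contained file.
slow_repo_files = ["tokenizer_config.json", "vocab.txt"]
fast_repo_files = ["tokenizer.json"]

def usable_by_bumblebee(files):
    """Bumblebee needs the Rust-backed format, i.e. tokenizer.json."""
    return "tokenizer.json" in files

usable_by_bumblebee(fast_repo_files)  # True
usable_by_bumblebee(slow_repo_files)  # False
```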

## Generator

```elixir
Kino.Markdown.new("## Converter")
```

```elixir
{version, 0} =
  System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])

Kino.Markdown.new("""
`transformers: #{version}`
""")
```

```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```

```elixir
repo = Kino.Input.read(repo_input)

if repo == "" do
  Kino.interrupt!(:normal, "Enter a repository.")
end
```

```elixir
response =
  Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
    json: %{paths: ["tokenizer.json"]}
  )

case response do
  %{status: 200, body: []} ->
    :ok

  %{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")

  _ ->
    Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```

```elixir
output_dir = Path.join(System.tmp_dir!(), repo)
```

````elixir
script = """
import sys
from transformers import AutoTokenizer

repo = sys.argv[1]
output_dir = sys.argv[2]


try:
  tokenizer = AutoTokenizer.from_pretrained(repo)
  assert tokenizer.is_fast
  tokenizer.save_pretrained(output_dir)
except Exception as error:
  print(error)
  sys.exit(1)
"""

case System.cmd("python", ["-c", script, repo, output_dir]) do
  {_, 0} ->
    :ok

  {output, _} ->
    Kino.Markdown.new("""
    ```
    #{output}
    ```
    """)
    |> Kino.render()

    Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````

```elixir
tokenizer_path = Path.join(output_dir, "tokenizer.json")

Kino.Download.new(
  fn -> File.read!(tokenizer_path) end,
  filename: "tokenizer.json",
  label: "tokenizer.json"
)
```
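Before uploading, the generated file can be given a rough structural sanity check (a sketch: the sample payload below is a stand-in rather than a real tokenizer, though a top-level `model` section is part of the `tokenizer.json` format):

```python
import json

def looks_like_fast_tokenizer(payload: bytes) -> bool:
    """Rough structural check: tokenizer.json is a JSON object whose
    top-level "model" section carries the vocabulary and merge rules."""
    try:
        data = json.loads(payload)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return False
    return isinstance(data, dict) and "model" in data

# Stand-in payload for illustration, not a real tokenizer:
sample = json.dumps({"version": "1.0", "model": {"type": "WordPiece"}}).encode()
looks_like_fast_tokenizer(sample)  # True
```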

`````elixir
Kino.Markdown.new("""
### Next steps

  1. Go to https://huggingface.co/#{repo}/upload/main.

  2. Upload the `tokenizer.json` file.

  3. Add the following description:

      ````markdown
      Generated with:

      ```python
      from transformers import AutoTokenizer

      tokenizer = AutoTokenizer.from_pretrained("#{repo}")
      assert tokenizer.is_fast
      tokenizer.save_pretrained("...")
      ```
      ````

  4. Submit the PR.

""")
`````