<!-- livebook:{"app_settings":{"access_type":"public","auto_shutdown_ms":5000,"multi_session":true,"output_type":"rich","show_source":true,"slug":"tokenizer-generator"}} -->
# Tokenizer generator
```elixir
Mix.install([
  {:kino, "~> 0.10.0"},
  {:req, "~> 0.4.3"}
])
```
## Info
```elixir
Kino.Markdown.new("""
## Background

HuggingFace repositories store tokenizers in two flavours:

  1. "slow tokenizer" - corresponds to a tokenizer implemented in Python
     and stored as `tokenizer_config.json`

  2. "fast tokenizer" - corresponds to a tokenizer implemented in Rust
     and stored as `tokenizer.json`

Many repositories only include the files for 1., but the `transformers` library
automatically converts a "slow tokenizer" to a "fast tokenizer" whenever possible.

Bumblebee relies on the Rust bindings and therefore always requires the
`tokenizer.json` file. This app generates that file for any repository that
only includes a "slow tokenizer".
""")
```
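As a rough illustration of the distinction described in the Background cell, the sketch below classifies a repository by the tokenizer files it contains. The function name `tokenizer_flavour`, its return labels, and the example file listings are hypothetical; only the two file names come from the text above.

```python
# Hypothetical helper: decide which tokenizer flavour a repository ships,
# based only on the two file names mentioned in the Background text.
def tokenizer_flavour(files):
    names = set(files)
    if "tokenizer.json" in names:
        # A "fast tokenizer" file is present, so no conversion is needed.
        return "fast"
    if "tokenizer_config.json" in names:
        # Only the "slow tokenizer" files are present; conversion is needed.
        return "slow"
    return "none"


# Made-up file listings, for illustration only.
print(tokenizer_flavour(["config.json", "tokenizer_config.json", "vocab.txt"]))
print(tokenizer_flavour(["config.json", "tokenizer_config.json", "tokenizer.json"]))
```

The check for `tokenizer.json` comes first because a repository can contain both files, in which case nothing has to be generated.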
## Generator
```elixir
Kino.Markdown.new("## Converter")
```
```elixir
{version, 0} =
  System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])

Kino.Markdown.new("""
`transformers: #{version}`
""")
```
```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```
```elixir
repo = Kino.Input.read(repo_input)

if repo == "" do
  Kino.interrupt!(:normal, "Enter repository.")
end
```
```elixir
response =
  Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
    json: %{paths: ["tokenizer.json"]}
  )

case response do
  %{status: 200, body: []} ->
    :ok

  %{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")

  _ ->
    Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```
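The three outcomes of the `paths-info` check above can be summarised as a small decision function. This is a sketch in Python rather than the notebook's Elixir, so it is not evaluated as a cell. The function name `classify_paths_info` and the returned labels are hypothetical, the response shapes are taken from the `case` expression above, and the example payloads are made up for illustration.

```python
def classify_paths_info(status, body):
    """Mirror the notebook's case expression.

    status is the HTTP status code; body is the decoded JSON response.
    """
    if status == 200 and body == []:
        # Repository is reachable and has no tokenizer.json yet.
        return "needs_conversion"
    if (
        status == 200
        and isinstance(body, list)
        and len(body) == 1
        and isinstance(body[0], dict)
        and body[0].get("path") == "tokenizer.json"
    ):
        # Exactly one entry describing tokenizer.json; extra keys are allowed.
        return "already_present"
    # Anything else is treated as a missing or private repository.
    return "missing_or_private"


# Made-up payloads, for illustration only.
print(classify_paths_info(200, []))
print(classify_paths_info(200, [{"path": "tokenizer.json", "type": "file"}]))
print(classify_paths_info(404, {"error": "not found"}))
```

Only the first outcome lets the notebook continue to the conversion step; the other two interrupt evaluation with an error message.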
```elixir
output_dir = Path.join(System.tmp_dir!(), repo)
```
````elixir
script = """
import sys
from transformers import AutoTokenizer

repo = sys.argv[1]
output_dir = sys.argv[2]

try:
    # Loading succeeds with a fast tokenizer only if conversion is possible
    tokenizer = AutoTokenizer.from_pretrained(repo)
    assert tokenizer.is_fast
    tokenizer.save_pretrained(output_dir)
except Exception as error:
    print(error)
    sys.exit(1)
"""

case System.cmd("python", ["-c", script, repo, output_dir]) do
  {_, 0} ->
    :ok

  {output, _} ->
    Kino.Markdown.new("""
    ```
    #{output}
    ```
    """)
    |> Kino.render()

    Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````
```elixir
tokenizer_path = Path.join(output_dir, "tokenizer.json")

Kino.Download.new(
  fn -> File.read!(tokenizer_path) end,
  filename: "tokenizer.json",
  label: "tokenizer.json"
)
```
`````elixir
Kino.Markdown.new("""
### Next steps

1. Go to https://huggingface.co/#{repo}/upload/main.

2. Upload the `tokenizer.json` file.

3. Add the following description:

   ````markdown
   Generated with:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("#{repo}")
   assert tokenizer.is_fast
   tokenizer.save_pretrained("...")
   ```
   ````

4. Submit the PR.
""")
`````