---
language:
- en
- de
- es
- fr
- hi
- it
- ja
- ko
- pl
- pt
- ru
- tr
- zh
thumbnail: >-
  https://user-images.githubusercontent.com/5068315/230698495-cbb1ced9-c911-4c9a-941d-a1a4a1286ac6.png
library: bark
license: mit
tags:
- bark
- audio
- text-to-speech
duplicated_from: ylacombe/bark-small
pipeline_tag: text-to-speech
---

# Bark

Bark is a transformer-based text-to-audio model created by [Suno](https://www.suno.ai).
Bark can generate highly realistic, multilingual speech as well as other audio, including music,
background noise and simple sound effects. The model can also produce nonverbal
communications like laughing, sighing and crying. To support the research community,
we are providing access to pretrained model checkpoints ready for inference.

The original GitHub repo and model card can be found [here](https://github.com/suno-ai/bark).

This model is meant for research purposes only.
The model output is not censored and the authors do not endorse the opinions in the generated content.
Use at your own risk.

Two checkpoints are released:
- [**small** (this checkpoint)](https://huggingface.co/suno/bark-small)
- [large](https://huggingface.co/suno/bark)

## Example

Try out Bark yourself!

* Bark Colab:

<a target="_blank" href="https://colab.research.google.com/drive/1eJfA2XUa-mXwdMy7DoYKVYHI1iTd9Vkt?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

* Hugging Face Colab:

<a target="_blank" href="https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

* Hugging Face Demo:

<a target="_blank" href="https://huggingface.co/spaces/suno/bark">
  <img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
</a>

## 🤗 Transformers Usage

You can run Bark locally with the 🤗 Transformers library from version 4.31.0 onwards.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) and scipy:

```
pip install --upgrade pip
pip install --upgrade transformers scipy
```

2. Run inference via the `text-to-speech` (TTS) pipeline. You can run the Bark model through the TTS pipeline in just a few lines of code!

```python
from transformers import pipeline
import scipy.io.wavfile

synthesiser = pipeline("text-to-speech", "suno/bark-small")

speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"do_sample": True})

scipy.io.wavfile.write("bark_out.wav", rate=speech["sampling_rate"], data=speech["audio"])
```

3. Run inference via the Transformers modelling code. You can use the processor + generate code to convert text into a mono 24 kHz speech waveform for more fine-grained control.

```python
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = AutoModel.from_pretrained("suno/bark-small")

inputs = processor(
    text=["Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."],
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)
```

4. Listen to the speech samples either in an ipynb notebook:

```python
from IPython.display import Audio

sampling_rate = model.generation_config.sample_rate
Audio(speech_values.cpu().numpy().squeeze(), rate=sampling_rate)
```

Or save them as a `.wav` file using a third-party library, e.g. `scipy`:

```python
import scipy.io.wavfile

sampling_rate = model.generation_config.sample_rate
scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cpu().numpy().squeeze())
```
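
You can also condition generation on a speaker preset via the processor's `voice_preset` argument; see the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark) for the available presets. A minimal sketch, reusing the processor and model from above (the preset name is one documented example):

```python
# Condition generation on a prebuilt speaker preset.
inputs = processor(
    text=["Hello, my name is Suno."],
    voice_preset="v2/en_speaker_6",
    return_tensors="pt",
)

speech_values = model.generate(**inputs, do_sample=True)
```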

For more details on using the Bark model for inference using the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).

### Optimization tips

Refer to this [blog post](https://huggingface.co/blog/optimizing-bark#benchmark-results) to find out more about the following methods and a benchmark of their benefits.

#### Get significant speed-ups:

**Using 🤗 Better Transformer**

Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain 20% to 30% in speed with zero performance degradation. It only requires one line of code to export the model to 🤗 Better Transformer:

```python
model = model.to_bettertransformer()
```

Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)
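
In most environments the basic install is a single pip command:

```
pip install optimum
```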

**Using Flash Attention 2**

Flash Attention 2 is an even faster, optimized version of the previous optimization.

```python
import torch
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16, use_flash_attention_2=True).to("cuda")
```

Make sure to load your model in half-precision (e.g. `torch.float16`) and to [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2.

**Note:** Flash Attention 2 is only available on newer GPUs; refer to 🤗 Better Transformer in case your GPU doesn't support it.

#### Reduce memory footprint:

**Using half-precision**

You can speed up inference and reduce memory footprint by 50% simply by loading the model in half-precision (e.g. `torch.float16`).
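
As a minimal sketch, loading the small checkpoint directly in half-precision on a CUDA device:

```python
import torch
from transformers import BarkModel

# Load the weights directly in float16 to halve the memory footprint.
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to("cuda")
```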

**Using CPU offload**

Bark is made up of 4 sub-models, which are invoked sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models are idle.

If you're using a CUDA device, a simple solution to benefit from an 80% reduction in memory footprint is to offload the idle sub-models to the CPU. This operation is called CPU offloading. You can use it with one line of code:

```python
model.enable_cpu_offload()
```

Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
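
The optimizations above can be combined, as the benchmark blog post does. A minimal sketch, assuming 🤗 Optimum and Accelerate are installed:

```python
import torch
from transformers import BarkModel

# Half-precision weights + kernel fusion + CPU offload of idle sub-models.
model = BarkModel.from_pretrained("suno/bark-small", torch_dtype=torch.float16).to("cuda")
model = model.to_bettertransformer()
model.enable_cpu_offload()
```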

## Suno Usage

You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):

1. First install the [`bark` library](https://github.com/suno-ai/bark):
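
For example, installing straight from source (the command below follows the Bark repo's README):

```
pip install git+https://github.com/suno-ai/bark.git
```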

2. Run the following Python code:

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from IPython.display import Audio

# download and load all models
preload_models()

# generate audio from text
text_prompt = """
     Hello, my name is Suno. And, uh — and I like pizza. [laughs]
     But I also have other interests such as playing tic tac toe.
"""
speech_array = generate_audio(text_prompt)

# play audio in notebook
Audio(speech_array, rate=SAMPLE_RATE)
```

[pizza.webm](https://user-images.githubusercontent.com/5068315/230490503-417e688d-5115-4eee-9550-b46a2b465ee3.webm)

To save `speech_array` as a WAV file:

```python
from scipy.io.wavfile import write as write_wav

write_wav("/path/to/audio.wav", SAMPLE_RATE, speech_array)
```

## Model Details

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### Text to semantic tokens
- Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- Output: semantic tokens that encode the audio to be generated

### Semantic to coarse tokens
- Input: semantic tokens
- Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook

### Coarse to fine tokens
- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec
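
In the 🤗 Transformers implementation, these stages live in separate sub-models. A small sketch for inspecting their sizes; the attribute names are assumptions based on the Transformers source, so treat this as illustrative:

```python
from transformers import BarkModel

model = BarkModel.from_pretrained("suno/bark-small")

# Attribute names are assumptions based on the 🤗 Transformers Bark source.
for name in ("semantic", "coarse_acoustics", "fine_acoustics"):
    sub_model = getattr(model, name)
    n_params = sum(p.numel() for p in sub_model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
```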

### Architecture

| Model                     | Parameters (small / large) | Attention  | Output vocab size |
|:-------------------------:|:--------------------------:|:----------:|:-----------------:|
| Text to semantic tokens   | 80M / 300M                 | Causal     | 10,000            |
| Semantic to coarse tokens | 80M / 300M                 | Causal     | 2x 1,024          |
| Coarse to fine tokens     | 80M / 300M                 | Non-causal | 6x 1,024          |

### Release date
April 2023

## Broader Implications

We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force
for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward
to voice-clone known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark,
we also release a simple classifier to detect Bark-generated audio with high accuracy (see the notebooks section of the main repository).

## License

Bark is licensed under the [MIT License](https://github.com/suno-ai/bark/blob/main/LICENSE), meaning it's available for commercial use.