This is a Speech Language Model trained for generating audio continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).
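Concretely, the model never sees raw waveforms: audio is first quantized into a sequence of discrete unit ids. A small sketch of the bookkeeping, assuming the 25 Hz unit rate and 500-unit vocabulary advertised on the linked tokenizer card (illustrative helper names, not part of any released API):

```python
# Bookkeeping for discrete speech units, assuming the 25 Hz unit rate
# and 500-unit vocabulary of mhubert-base-25hz.
UNIT_RATE_HZ = 25   # one discrete unit per 40 ms of audio
VOCAB_SIZE = 500    # each unit id is an integer in [0, 500)

def units_for_duration(seconds: float) -> int:
    """How many discrete units represent a clip of the given length."""
    return round(seconds * UNIT_RATE_HZ)

def is_valid_unit(unit_id: int) -> bool:
    """Check that a unit id falls inside the speech-token vocabulary."""
    return 0 <= unit_id < VOCAB_SIZE

print(units_for_duration(10.0))  # a 10-second prompt -> 250 units
```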
## Model Details
### Model Description
This is a Speech Language Model, fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). It was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training. For a stronger model trained with slightly more compute (2xA100 for 2 days), see [slam_scaled](https://huggingface.co/slprl/slam).

The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
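The DPO stage optimizes the standard preference objective over chosen/rejected continuation pairs. A minimal sketch of the per-pair loss; the function name, toy log-probabilities, and the beta value are illustrative, not the values used in training:

```python
import math

def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given total sequence
    log-probs under the trained policy and the frozen reference."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as softplus for stability
    return math.log1p(math.exp(-beta * margin))

# With no preference margin the loss is log(2); it shrinks as the policy
# favors the chosen continuation more than the reference model does.
print(round(dpo_loss(0.0, 0.0, 0.0, 0.0), 4))  # 0.6931
```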
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, the outputs should not be treated as factual in any way.
## How to Get Started with the Model
We refer users to the official repository for full usage explanations: [github](https://github.com/slp-rl/slam).
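As a hypothetical illustration of the flow only: the checkpoint id, prompt format, and `<unit_k>` token names below are assumptions for this sketch, not the repository's documented API. Follow the linked repo for actual usage.

```python
from typing import List

def units_to_prompt(units: List[int]) -> str:
    """Render discrete HuBERT unit ids as text tokens; the '<unit_k>'
    naming is an assumption made for this sketch."""
    return "".join(f"<unit_{u}>" for u in units)

def continue_speech(units: List[int], max_new_tokens: int = 64) -> str:
    # Imported lazily so the pure helper above stays dependency-free.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tok = AutoTokenizer.from_pretrained("slprl/slam")  # assumed checkpoint id
    model = AutoModelForCausalLM.from_pretrained("slprl/slam")
    ids = tok(units_to_prompt(units), return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0])
```

The generated unit sequence would then be vocoded back to a waveform, a step handled by the official codebase.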
## Training Details
We highly encourage users to read the full [paper]() for complete training details; a brief overview is provided below.
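For orientation, the pre-training objective is ordinary next-token prediction: the model maximizes the log-likelihood of each speech unit given its prefix, exactly as a text LM does over words. A toy, single-position version of the loss (plain Python, illustrative only):

```python
import math

def next_unit_nll(logits, target):
    """Negative log-likelihood of the target unit id under a softmax
    over the unit vocabulary (one position, for clarity)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# Uniform logits over a 4-unit toy vocabulary: NLL = log(4)
print(round(next_unit_nll([0.0, 0.0, 0.0, 0.0], 2), 4))  # 1.3863
```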
### Training Data
This model was trained on a subset of [LibriSpeech] train, [Libri-Light]() and the synthetic dataset [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories).
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
## Evaluation
The paper provides full results; we give some results here and also refer to the [demo page]() to listen to some samples.

**ADD Table**
#### Hardware
This model was trained using **only a single Nvidia A5000 GPU**, 16 CPU cores and 24 GB of RAM for **24 hours**.
#### Software
The model was trained using the [*Slam*](https://github.com/slp-rl/slam) codebase, which builds upon 🤗transformers, extending it to support easy and efficient training of Speech Language Models.
## Citation
**BibTeX:**
Soon!