gallilmaimon committed · Commit f932655 (verified) · 1 Parent(s): 08e7e0e

Update README.md

Files changed (1):
  1. README.md +13 -54
README.md CHANGED
@@ -12,14 +12,17 @@ base_model:
  This is a Speech Language Model trained for generating audio continuations over discrete [Hubert tokens](https://huggingface.co/slprl/mhubert-base-25hz).


-
  ## Model Details

  ### Model Description
+ This is a Speech Language Model, fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500
+ speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). It was trained as part of
+ ["*Slamming*: Training a Speech Language Model on One GPU in a Day"], focusing on efficient training. For a stronger model trained with
+ slightly more compute (2*A100 for 2 days), see [slam_scaled](https://huggingface.co/slprl/slam).

- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+ The model was trained by next-token prediction over a subset of LibriSpeech, Libri-Light and the synthetic dataset
+ [sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was then trained with DPO over
+ [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

  - **Developed by:** [SLP-RL](https://huggingface.co/slprl)
  - **Model type:** SpeechLM
@@ -45,25 +48,14 @@ This is a base SpeechLM and as such can be used to generate continuations for spe
  This model was trained on curated speech datasets which contain mainly audio-books and stories; as such, the outputs should not be treated as factual in any way.


- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

  ## How to Get Started with the Model
-
  We refer users to the official repository for full usage explanations - [github](https://github.com/slp-rl/slam).


  ## Training Details
- We highly encourage users to read the full [paper](), for full training details.
+ We highly encourage users to read the full [paper]() for full training details; a brief overview is provided below.
+

  ### Training Data
  This model was trained on a subset of [LibriSpeech] train, [Libri-Light]() and the synthetic dataset
@@ -84,42 +76,11 @@ We encourage you to explore the official repository for full details - [github](

  - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

  ## Evaluation
+ The paper provides full results; we give some results here and also refer to the [demo page]() to listen to some samples.

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
+ **ADD Table**

@@ -134,12 +95,10 @@ This model was trained as part of ["*Slamming*: Training a Speech Language Model
  This model was trained using **only a single Nvidia A5000 GPU**, 16 CPU cores and 24 GB of RAM for **24 hours**.

  #### Software
- The model was trained using the [*Slam*](https://github.com/slp-rl/slam) codebase which builds upon transformers extending it to support easy and efficent training of
- Speech Language Models.
+ The model was trained using the [*Slam*](https://github.com/slp-rl/slam) codebase, which builds upon 🤗transformers and extends it to support
+ easy and efficient training of Speech Language Models.

  ## Citation

- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
  **BibTeX:**
  Soon!
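
The updated description says the model operates over 500 discrete speech tokens taken from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz). As a rough, unofficial sketch of that front end (it assumes the checkpoint loads through the standard 🤗 transformers HuBERT classes, and the 500-unit quantizer below is only a named placeholder; the official [slam](https://github.com/slp-rl/slam) repo ships the real tokenization code):

```python
# Unofficial sketch: continuous features from layer 11 of mhubert-25hz,
# which would then be quantized into the 500-unit vocabulary.
import torch
import torchaudio
from transformers import AutoFeatureExtractor, HubertModel

hubert_id = "slprl/mhubert-base-25hz"  # assumes a transformers-compatible checkpoint
feature_extractor = AutoFeatureExtractor.from_pretrained(hubert_id)
hubert = HubertModel.from_pretrained(hubert_id).eval()

wav, sr = torchaudio.load("prompt.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)  # mono, 16 kHz

inputs = feature_extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    out = hubert(**inputs, output_hidden_states=True)
layer11 = out.hidden_states[11]  # (1, frames, hidden) features at roughly 25 Hz

# Hypothetical quantization step: map each frame to its nearest of 500 centroids.
# `load_quantizer` and "kmeans_500.bin" are placeholder names, not a real API.
# quantizer = load_quantizer("kmeans_500.bin")
# units = quantizer(layer11[0])  # sequence of ints in [0, 500)
```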
 
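The "How to Get Started with the Model" section defers to the official repository for usage. Purely as an illustration of the general causal-LM pattern for unit continuation (the checkpoint id `slprl/slam`, the availability of a tokenizer for the unit vocabulary, and the `<unit_k>` token convention are all assumptions here, not a documented interface), a minimal sketch might look like:

```python
# Assumption-laden sketch of generating a unit continuation with the standard
# 🤗 transformers causal-LM API; consult the slam repo for the supported usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/slam"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# A prompt of discrete HuBERT unit ids (25 Hz frames, vocabulary of 500 units),
# written with a made-up "<unit_k>" token convention purely for illustration.
prompt_units = "".join(f"<unit_{u}>" for u in [17, 402, 88, 88, 231])
inputs = tokenizer(prompt_units, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=100)

# The generated unit ids still have to be vocoded back to a waveform with the
# unit-based vocoder used by the slam codebase (not shown here).
print(tokenizer.decode(generated[0], skip_special_tokens=False))
```

After decoding, the unit sequence must still pass through a unit-to-speech vocoder to become audio; the slam repository documents the end-to-end pipeline.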