first
- .gitattributes +36 -0
- README.md +91 -0
- conf.yaml +16 -0
- resource/data.png +3 -0
- resource/image.png +3 -0
- v0.ckpt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,91 @@
# Mellow

[[`Paper`]()] [[`Checkpoint`]()]

Mellow is a small Audio-Language Model that takes two audio clips and a text prompt as input and produces free-form text as output. It is a 167M-parameter model trained on ~155 hours of audio (AudioCaps and Clotho), and it achieves SoTA performance on a range of tasks with 50x fewer parameters.

![alt text](resource/image.png)

## Index
* [Setup](#setup)
* [Usage](#usage)
* [Examples](#example)
* [ReasonAQA](#reasonaqa)
* [Limitation](#limitation)

## Setup

1. Install the required dependencies: `pip install -r requirements.txt`. For [conda](https://www.anaconda.com), run the following:

```shell
cd Mellow && \
conda create -n mellow python=3.10.14 && \
conda activate mellow && \
pip install -r requirements.txt
```

2. Download the Mellow weights: [checkpoint \[drive\]]()
3. Move `v0.ckpt` into the `config` folder.

## Usage

The `MellowWrapper` class allows easy interaction with the model. The wrapper requires the following inputs:

- `config`: the supported option is "conf.yaml"
- `model`: the supported option is "v0.ckpt"
- `examples`: a list of examples, where each example is a list of three entries: audiopath1, audiopath2, prompt

Supported functions:

- `generate`: produces a text response for the given audio inputs and text prompt

## Example

Mellow supports open-ended question-answering and produces responses based on the user's prompt. Below are some example questions for testing Mellow on different tasks.

```python
import os
from pathlib import Path

import torch

from mellow import MellowWrapper

# setup cuda and device
cuda = torch.cuda.is_available()
device = 0 if cuda else "cpu"

# setup mellow
mellow = MellowWrapper(
    config="conf.yaml",
    model="v0.ckpt",
    device=device,
    use_cuda=cuda,
)

# pick up audio file paths
parent_path = Path(os.path.realpath(__file__)).parent
path1 = os.path.join(parent_path, "resource", "1.wav")
path2 = os.path.join(parent_path, "resource", "2.wav")

# list of filepaths and prompts
examples = [
    [path1, path2, "what can you infer about the surroundings from the audio?"],
    [path1, path2, "is there a cat in the audio? answer yes or no"],
    [path1, path2, "caption the audio."],
    [path1, path2, "Based on the audio, what can be said about the hypothesis - \"A farmer is giving a tour of his ranch while chickens roam nearby\"? a) It is definitely true b) It is definitely false c) It is plausible d) I cannot determine"],
    [path1, path2, "explain the difference between the two audios in detail."],
    [path1, path2, "what is the primary sound event present in the clip? a) dog barking b) chirping birds c) car engine d) clapping"],
]

# generate response
response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
print(f"\noutput: {response}")
```
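The `top_p` argument to `generate` controls nucleus (top-p) sampling during decoding. As a standalone illustration of the standard technique (a sketch, not Mellow's actual decoding code), top-p filtering keeps only the most probable tokens whose cumulative probability reaches `top_p` and renormalizes:

```python
# Standard nucleus (top-p) filtering over a token probability
# distribution -- an illustrative sketch, not Mellow's internals.
def top_p_filter(probs, top_p=0.8):
    """Zero out all but the smallest set of tokens whose cumulative
    probability reaches top_p, then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.8)
print(filtered)  # approximately [0.625, 0.375, 0.0, 0.0]
```

With `top_p=0.8`, only the top two tokens survive here; lowering `top_p` makes generation more conservative, while `top_p=1.0` leaves the distribution untouched.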
## ReasonAQA

The composition of the ReasonAQA dataset is shown in the table below. The training set is restricted to AudioCaps and Clotho audio files, and testing is performed on seven tasks: Audio Entailment, Audio Difference, ClothoAQA, Clotho MCQ, Clotho Detail, AudioCaps MCQ, and AudioCaps Detail.

![alt text](resource/data.png)

- The ReasonAQA JSONs can be downloaded from Google Drive: [jsons \[drive\]](https://drive.google.com/file/d/1WPKgafYw2ZCifElEtHn_k3DkcVGjesqB/view?usp=sharing)
- The audio files can be downloaded from their respective hosting websites: [Clotho](https://zenodo.org/records/4783391) and [AudioCaps](https://github.com/cdjkim/audiocaps)

## Limitation

With Mellow, we aim to show that small audio-language models can engage in reasoning. As a research prototype, Mellow has not been trained at scale on publicly available audio datasets, resulting in a limited understanding of audio concepts. We therefore advise caution when considering its use in production settings. Ultimately, we hope this work inspires researchers to explore small audio-language models for multitask capabilities, complementing ongoing research on general-purpose audio assistants.

## Citation

```
```
|
conf.yaml
ADDED
@@ -0,0 +1,16 @@
data:
  sampling_rate: 32000
  segment_seconds: 10
  tokenizer_type: "HuggingFaceTB/SmolLM2-135M"
  text_tokenization_len: 129

model:
  encoder:
    audioenc_name: 'HTSAT'
    transformer_embed_dim: 768
    out_emb: 768
    d_proj: 576
  decoder:
    text_decoder: "HuggingFaceTB/SmolLM2-135M"
    prefix_length: 389
  model_type: Mellow
resource/data.png
ADDED
(binary image; stored with Git LFS)
resource/image.png
ADDED
(binary image; stored with Git LFS)
v0.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
size 670197445
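`v0.ckpt` is stored in the repository as a Git LFS pointer file with the three fields above; the actual 670 MB checkpoint lives in LFS storage. A pointer file can be parsed with a few lines of Python (a minimal sketch of the LFS pointer-file format, not an official Git LFS tool):

```python
# Parse a Git LFS pointer file (spec v1): each line is "key value".
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
size 670197445
"""

def parse_lfs_pointer(text):
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

ptr = parse_lfs_pointer(pointer_text)
print(ptr["oid"])   # sha256:796b...
print(ptr["size"])  # 670197445
```

When cloning this repo, `git lfs pull` replaces the pointer with the real checkpoint bytes, whose SHA-256 must match the `oid` field.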