soham97 committed on
Commit
2c6e7ae
·
1 Parent(s): ff5da81
Files changed (6)
  1. .gitattributes +36 -0
  2. README.md +91 -0
  3. conf.yaml +16 -0
  4. resource/data.png +3 -0
  5. resource/image.png +3 -0
  6. v0.ckpt +3 -0
.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
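
These patterns track binary artifacts (including `*.ckpt` and `*.png`) with Git LFS, so files like `v0.ckpt` are stored as pointers. A minimal sketch of fetching the actual files after cloning, using standard Git LFS commands (not repo-specific):

```shell
# one-time setup of the LFS hooks, then download the pointed-to files
git lfs install
git lfs pull
```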
README.md ADDED
@@ -0,0 +1,91 @@
+ # Mellow
+ [[`Paper`]()] [[`Checkpoint`]()]
+
+ Mellow is a small Audio-Language Model that takes two audios and a text prompt as input and produces free-form text as output. It is a 167M-parameter model trained on ~155 hours of audio (AudioCaps and Clotho), and it achieves SoTA performance on multiple tasks with 50x fewer parameters than comparable models.
+
+ ![alt text](resource/image.png)
+
+ ## Index
+ * [Setup](#setup)
+ * [Usage](#usage)
+ * [Examples](#example)
+ * [ReasonAQA](#reasonaqa)
+ * [Limitation](#limitation)
+
+ ## Setup
+ 1. Install the required dependencies: `pip install -r requirements.txt`. For [conda](https://www.anaconda.com), run the following:
+
+ ```shell
+ cd Mellow && \
+ conda create -n mellow python=3.10.14 && \
+ conda activate mellow && \
+ pip install -r requirements.txt
+ ```
+ 2. Download the Mellow weights: [checkpoint \[drive\]]()
+ 3. Move `v0.ckpt` into the `config` folder (see the sketch below)
+
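+ Steps 2 and 3 as shell commands (a sketch that assumes `v0.ckpt` has been downloaded to the repo root, since the download link above is not yet filled in):
+
+ ```shell
+ # create the config folder and move the downloaded checkpoint into it
+ mkdir -p config
+ mv v0.ckpt config/
+ ```
+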
+ ## Usage
+ The MellowWrapper class allows easy interaction with the model. The wrapper requires the following inputs:
+ - `config`: The supported option is "conf.yaml"
+ - `model`: The supported option is "v0.ckpt"
+ - `examples`: List of examples. Each example is a list containing three entries: audiopath1, audiopath2, prompt
+
+ Supported functions:
+ - `generate`: Produces a text response for the given audio inputs and text prompt
+
+ ## Example
+ Mellow supports open-ended question answering and can produce a response based on the user's prompt. Below, we provide some example questions for testing Mellow on different tasks.
+
+ ```python
+ import torch
+ from pathlib import Path
+ import os
+ from mellow import MellowWrapper
+
+ # setup cuda and device
+ cuda = torch.cuda.is_available()
+ device = 0 if cuda else "cpu"
+
+ # setup mellow
+ mellow = MellowWrapper(
+     config="conf.yaml",
+     model="v0.ckpt",
+     device=device,
+     use_cuda=cuda,
+ )
+
+ # pick up audio file paths
+ parent_path = Path(os.path.realpath(__file__)).parent
+ path1 = os.path.join(parent_path, "resource", "1.wav")
+ path2 = os.path.join(parent_path, "resource", "2.wav")
+
+ # list of filepaths and prompts
+ examples = [
+     [path1, path2, "what can you infer about the surroundings from the audio?"],
+     [path1, path2, "is there a cat in the audio? answer yes or no"],
+     [path1, path2, "caption the audio."],
+     [path1, path2, "Based on the audio, what can be said about the hypothesis - \"A farmer is giving a tour of his ranch while chickens roam nearby\"? a) It is definitely true b) It is definitely false c) It is plausible d) I cannot determine"],
+     [path1, path2, "explain the difference between the two audios in detail."],
+     [path1, path2, "what is the primary sound event present in the clip? a) dog barking b) chirping birds c) car engine d) clapping"],
+ ]
+
+ # generate response
+ response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
+ print(f"\noutput: {response}")
+ ```
+
+ ## ReasonAQA
+ The composition of the ReasonAQA dataset is shown in the table below. The training set is restricted to AudioCaps and Clotho audio files, and testing is performed on seven tasks: Audio Entailment, Audio Difference, ClothoAQA, Clotho MCQ, Clotho Detail, AudioCaps MCQ, and AudioCaps Detail.
+
+ ![alt text](resource/data.png)
+ - The ReasonAQA JSONs can be downloaded from Google Drive: [dataset \[drive\]](https://drive.google.com/file/d/1WPKgafYw2ZCifElEtHn_k3DkcVGjesqB/view?usp=sharing) (see the loading sketch below)
+ - The audio files can be downloaded from their respective hosting websites: [Clotho](https://zenodo.org/records/4783391) and [AudioCaps](https://github.com/cdjkim/audiocaps)
+
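+ A minimal sketch of inspecting the downloaded JSONs (the file name here is hypothetical, and the schema is not documented above, so the snippet only prints a sample entry rather than assuming field names):
+
+ ```python
+ import json
+
+ # hypothetical file name; substitute the actual JSON from the download above
+ with open("reasonaqa_train.json") as f:
+     data = json.load(f)
+
+ # print size and one entry to discover the real schema before writing a loader
+ print(type(data), len(data))
+ print(data[0] if isinstance(data, list) else next(iter(data.items())))
+ ```
+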
+ ## Limitation
+ With Mellow, we aim to showcase that small audio-language models can engage in reasoning. As a research prototype, Mellow has not been trained at scale on publicly available audio datasets, resulting in a limited understanding of audio concepts. Therefore, we advise caution when considering its use in production settings. Ultimately, we hope this work inspires researchers to explore small audio-language models for multitask capabilities, complementing ongoing research on general-purpose audio assistants.
+
+ ## Citation
+ ```
+
+ ```
conf.yaml ADDED
@@ -0,0 +1,16 @@
+ data:
+   sampling_rate: 32000
+   segment_seconds: 10
+   tokenizer_type: "HuggingFaceTB/SmolLM2-135M"
+   text_tokenization_len: 129
+
+ model:
+   encoder:
+     audioenc_name: 'HTSAT'
+     transformer_embed_dim: 768
+     out_emb: 768
+     d_proj: 576
+   decoder:
+     text_decoder: "HuggingFaceTB/SmolLM2-135M"
+     prefix_length: 389
+ model_type: Mellow
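
For reference, a minimal sketch of reading this configuration in Python (assuming PyYAML; the MellowWrapper may parse it differently internally):

```python
import yaml

# load the shipped model/data configuration
with open("conf.yaml") as f:
    conf = yaml.safe_load(f)

print(conf["data"]["sampling_rate"])             # 32000
print(conf["model"]["decoder"]["text_decoder"])  # HuggingFaceTB/SmolLM2-135M
```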
resource/data.png ADDED

Git LFS Details

  • SHA256: 0e4d4dc0b0699031235bea278f7a0dc226a767f3501718a1b6f7253c5e8f1682
  • Pointer size: 131 Bytes
  • Size of remote file: 492 kB
resource/image.png ADDED

Git LFS Details

  • SHA256: f2569519c84d2a2077c9b9728abd6eb26aef015f545e58ba6e5da8bf8c3112d7
  • Pointer size: 132 Bytes
  • Size of remote file: 4.92 MB
v0.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
+ size 670197445
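
After `git lfs pull`, the downloaded checkpoint can be checked against the pointer above (a sketch using the standard `shasum` tool):

```shell
# expected: 796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
shasum -a 256 v0.ckpt
```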