first
- .gitattributes +36 -0
- README.md +91 -0
- conf.yaml +16 -0
- resource/data.png +3 -0
- resource/image.png +3 -0
- v0.ckpt +3 -0
.gitattributes
ADDED
@@ -0,0 +1,36 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,91 @@
# Mellow

[[`Paper`]()] [[`Checkpoint`]()]

Mellow is a small Audio-Language Model that takes two audio clips and a text prompt as input and produces free-form text as output. It is a 167M-parameter model trained on ~155 hours of audio (AudioCaps and Clotho), and it achieves SoTA performance on a range of tasks with 50x fewer parameters.

![alt text](resource/image.png)

## Index
* [Setup](#setup)
* [Usage](#usage)
* [Examples](#example)
* [ReasonAQA](#reasonaqa)
* [Limitation](#limitation)

## Setup

1. Install the required dependencies: `pip install -r requirements.txt`. For [conda](https://www.anaconda.com), run the following:

```shell
cd Mellow && \
conda create -n mellow python=3.10.14 && \
conda activate mellow && \
pip install -r requirements.txt
```

2. Download the Mellow weights: [checkpoint \[drive\]]()
3. Move `v0.ckpt` into the `config` folder.

## Usage

The `MellowWrapper` class allows easy interaction with the model. The wrapper requires the following inputs:

- `config`: the supported option is "conf.yaml"
- `model`: the supported option is "v0.ckpt"
- `examples`: a list of examples, where each example is a list of three entries: audiopath1, audiopath2, prompt

Supported functions:

- `generate`: produces a text response for the given audio inputs and text prompt

## Example

Mellow supports open-ended question-answering and produces responses based on the user's prompt. Below are some example questions for testing Mellow on different tasks.

```python
import os
from pathlib import Path

import torch

from mellow import MellowWrapper

# setup cuda and device
cuda = torch.cuda.is_available()
device = 0 if cuda else "cpu"

# setup mellow
mellow = MellowWrapper(
    config="conf.yaml",
    model="v0.ckpt",
    device=device,
    use_cuda=cuda,
)

# pick up audio file paths
parent_path = Path(os.path.realpath(__file__)).parent
path1 = os.path.join(parent_path, "resource", "1.wav")
path2 = os.path.join(parent_path, "resource", "2.wav")

# list of filepaths and prompts
examples = [
    [path1, path2, "what can you infer about the surroundings from the audio?"],
    [path1, path2, "is there a cat in the audio? answer yes or no"],
    [path1, path2, "caption the audio."],
    [path1, path2, "Based on the audio, what can be said about the hypothesis - \"A farmer is giving a tour of his ranch while chickens roam nearby\"? a) It is definitely true b) It is definitely false c) It is plausible d) I cannot determine"],
    [path1, path2, "explain the difference between the two audios in detail."],
    [path1, path2, "what is the primary sound event present in the clip? a) dog barking b) chirping birds c) car engine d) clapping"],
]

# generate response
response = mellow.generate(examples=examples, max_len=300, top_p=0.8, temperature=1.0)
print(f"\noutput: {response}")
```
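The `top_p` argument to `generate` controls nucleus (top-p) sampling during decoding. As a standalone illustration of the standard technique (a sketch, not Mellow's actual decoding code), top-p filtering keeps only the most probable tokens whose cumulative probability reaches `top_p` and renormalizes:

```python
# Standard nucleus (top-p) filtering over a token probability
# distribution -- an illustrative sketch, not Mellow's internals.
def top_p_filter(probs, top_p=0.8):
    """Zero out all but the smallest set of tokens whose cumulative
    probability reaches top_p, then renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.8)
print(filtered)  # approximately [0.625, 0.375, 0.0, 0.0]
```

With `top_p=0.8`, only the top two tokens survive here; lowering `top_p` makes generation more conservative, while `top_p=1.0` leaves the distribution untouched.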
## ReasonAQA

The composition of the ReasonAQA dataset is shown in the table below. The training set is restricted to AudioCaps and Clotho audio files, and testing is performed on seven tasks: Audio Entailment, Audio Difference, ClothoAQA, Clotho MCQ, Clotho Detail, AudioCaps MCQ, and AudioCaps Detail.

![alt text](resource/data.png)

- The ReasonAQA JSONs can be downloaded from Google Drive: [jsons \[drive\]](https://drive.google.com/file/d/1WPKgafYw2ZCifElEtHn_k3DkcVGjesqB/view?usp=sharing)
- The audio files can be downloaded from their respective hosting websites: [Clotho](https://zenodo.org/records/4783391) and [AudioCaps](https://github.com/cdjkim/audiocaps)

## Limitation

With Mellow, we aim to show that small audio-language models can engage in reasoning. As a research prototype, Mellow has not been trained at scale on publicly available audio datasets, resulting in a limited understanding of audio concepts. We therefore advise caution when considering its use in production settings. Ultimately, we hope this work inspires researchers to explore small audio-language models for multitask capabilities, complementing ongoing research on general-purpose audio assistants.

## Citation

```
```
|
conf.yaml
ADDED
@@ -0,0 +1,16 @@
data:
  sampling_rate: 32000
  segment_seconds: 10
  tokenizer_type: "HuggingFaceTB/SmolLM2-135M"
  text_tokenization_len: 129

model:
  encoder:
    audioenc_name: 'HTSAT'
    transformer_embed_dim: 768
    out_emb: 768
    d_proj: 576
  decoder:
    text_decoder: "HuggingFaceTB/SmolLM2-135M"
    prefix_length: 389
  model_type: Mellow
resource/data.png
ADDED
(binary image; stored with Git LFS)
resource/image.png
ADDED
(binary image; stored with Git LFS)
v0.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
size 670197445
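`v0.ckpt` is stored in the repository as a Git LFS pointer file with the three fields above; the actual 670 MB checkpoint lives in LFS storage. A pointer file can be parsed with a few lines of Python (a minimal sketch of the LFS pointer-file format, not an official Git LFS tool):

```python
# Parse a Git LFS pointer file (spec v1): each line is "key value".
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:796b1bf48fa6f33f5294a3aab7510265d88e56941bfb2146ceb09175cef45666
size 670197445
"""

def parse_lfs_pointer(text):
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

ptr = parse_lfs_pointer(pointer_text)
print(ptr["oid"])   # sha256:796b...
print(ptr["size"])  # 670197445
```

When cloning this repo, `git lfs pull` replaces the pointer with the real checkpoint bytes, whose SHA-256 must match the `oid` field.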