Update README.md
README.md CHANGED
@@ -14,7 +14,6 @@ tags:
 - vision
 - multimodal
 ---
-```markdown
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
 **AnyModal/Image-Captioning-Llama-3.2-1B** explores the potential of combining advanced visual feature extraction and language modeling techniques to generate descriptive captions for natural images. Built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework, this model integrates a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model, fine-tuned on the Flickr30k dataset. The model demonstrates a promising approach to bridging visual and textual modalities.
@@ -119,4 +118,3 @@ Explore the full project repository for additional details and potential customi
 - **Language Model**: Utilizes Llama 3.2-1B, a pre-trained causal language model, to construct coherent and context-sensitive captions.
 
 While trained on the Flickr30k dataset, the model's design highlights the possibilities for integrating vision and language models for captioning tasks, showcasing the feasibility of this approach within the AnyModal framework.
-```
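
For readers skimming the commit, the README body shown in the diff describes a ViT encoder whose features are fed to Llama 3.2-1B to produce captions. Below is a minimal illustrative sketch of that general idea using plain `transformers`; it is not the AnyModal API, and the ViT checkpoint, the linear projector, and the prompt text are assumptions rather than details taken from this repository.

```python
# Illustrative sketch only: NOT the AnyModal implementation. It outlines the
# ViT -> projection -> Llama 3.2-1B captioning idea described in the README.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, ViTImageProcessor, ViTModel

# Assumed checkpoints for illustration; the actual model may wire things differently.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# A linear layer maps ViT patch features into the LLM's embedding space so they
# can be prepended to the prompt as "soft" image tokens. Here it is untrained;
# in the released model this projection is learned during fine-tuning on Flickr30k.
projector = nn.Linear(vit.config.hidden_size, llm.config.hidden_size)

def caption(image, prompt="Describe this image:"):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        patch_features = vit(pixel_values=pixel_values).last_hidden_state
        image_embeds = projector(patch_features)
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        prompt_embeds = llm.get_input_embeddings()(prompt_ids)
        inputs_embeds = torch.cat([image_embeds, prompt_embeds], dim=1)
        generated = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=40)
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```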