ritabratamaiti committed (verified)
Commit 1ca1d29 · Parent(s): c650833

Update README.md

Files changed (1): README.md (+0, -2)
README.md CHANGED
@@ -14,7 +14,6 @@ tags:
 - vision
 - multimodal
 ---
-```markdown
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
 **AnyModal/Image-Captioning-Llama-3.2-1B** explores the potential of combining advanced visual feature extraction and language modeling techniques to generate descriptive captions for natural images. Built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework, this model integrates a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model, fine-tuned on the Flickr30k dataset. The model demonstrates a promising approach to bridging visual and textual modalities.
@@ -119,4 +118,3 @@ Explore the full project repository for additional details and potential customi
 - **Language Model**: Utilizes Llama 3.2-1B, a pre-trained causal language model, to construct coherent and context-sensitive captions.
 
 While trained on the Flickr30k dataset, the model's design highlights the possibilities for integrating vision and language models for captioning tasks, showcasing the feasibility of this approach within the AnyModal framework.
-```
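
The change above only drops a stray ```markdown fence around the model card, but the card it edits describes a concrete pattern: ViT patch features projected into the embedding space of Llama 3.2-1B and decoded as a caption. Below is a minimal sketch of that pattern, not the AnyModal API itself; it assumes the Hugging Face transformers/torch stack and the public google/vit-base-patch16-224 and meta-llama/Llama-3.2-1B checkpoints, and the `projector` and `caption_logits` names are hypothetical, since the excerpt does not specify the exact encoder checkpoint or projection layer.

```python
# Illustrative sketch only: the generic "vision features -> LLM embedding space"
# captioning setup described in the model card, NOT the AnyModal implementation.
# Assumed checkpoints: google/vit-base-patch16-224 and meta-llama/Llama-3.2-1B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, ViTImageProcessor, ViTModel

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Hypothetical projection from the ViT hidden size to the LLM hidden size;
# in a setup like this, the projector is what gets trained during fine-tuning.
projector = torch.nn.Linear(vit.config.hidden_size, llm.config.hidden_size)

def caption_logits(image, prompt="A photo of"):
    # 1. Encode the image into patch-level features.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    patch_feats = vit(pixel_values=pixel_values).last_hidden_state  # (1, 197, vit_hidden)

    # 2. Project the visual features into the LLM's embedding space.
    visual_tokens = projector(patch_feats)                          # (1, 197, llm_hidden)

    # 3. Embed the text prompt and prepend the visual "tokens".
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(input_ids)
    inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)

    # 4. Run the causal LM on the fused sequence; the next-token logits
    #    are what a decoding loop would use to generate the caption.
    return llm(inputs_embeds=inputs_embeds).logits
```

Whether the ViT and LLM weights stay frozen during Flickr30k fine-tuning is a training-time choice the excerpt does not state; the sketch only shows how the two modalities are bridged.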
 