Merged LLaMA 3.1 Vision + KoEn NLP Model

This repository contains a merged model of qresearch/llama-3.1-8B-vision-378 (a vision-enhanced LLaMA model) and muzerai/Deep-Llama-3.1-KoEn-8B-SiSai (a Korean-English NLP model). The goal of the merge is to strengthen the vision model's natural language understanding and generation by reusing the weights of a robust bilingual NLP model.

🚀 Why This Merge?

The original LLaMA 3.1 Vision model excels at image understanding but lacks strong text generation capabilities in Korean and English.
Meanwhile, Deep-Llama-3.1-KoEn-8B-SiSai is optimized for Korean and English NLP tasks but lacks multimodal capabilities.

By merging these models:

  • We retain the powerful vision capabilities of the Vision model.
  • We enhance text generation and reasoning using the NLP model's pre-trained weights.
  • The language backbone (text_model) is now optimized for Korean-English tasks, improving multilingual support.

📌 Model Details

  • Base Vision Model: qresearch/llama-3.1-8B-vision-378
  • Base NLP Model: muzerai/Deep-Llama-3.1-KoEn-8B-SiSai
  • Merged Components:
    • Vision processing layers are retained from the original Vision model.
    • text_model weights are replaced with those from the NLP model to improve text understanding (see the sketch below).
  • File Format: .safetensors (optimized for fast and secure model loading)
  • Model Size: 8.48B parameters (FP16)
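
The merge script itself is not published in this repository. Below is a minimal sketch of the recipe described above, assuming the vision model exposes its language backbone as `text_model` and that this backbone shares the `LlamaForCausalLM` parameter layout with the NLP model (if the layouts differ, the state-dict keys would need remapping):

```python
import torch
from transformers import AutoModelForCausalLM

# Load both source checkpoints on CPU in fp16 to keep memory manageable.
vision_model = AutoModelForCausalLM.from_pretrained(
    "qresearch/llama-3.1-8B-vision-378",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
nlp_model = AutoModelForCausalLM.from_pretrained(
    "muzerai/Deep-Llama-3.1-KoEn-8B-SiSai",
    torch_dtype=torch.float16,
)

# Swap the language backbone: copy the NLP model's weights into text_model.
# strict=False tolerates keys that exist on only one side (e.g. vision layers);
# inspect the returned lists to confirm the two checkpoints actually lined up.
missing, unexpected = vision_model.text_model.load_state_dict(
    nlp_model.state_dict(), strict=False
)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# Save the merged model in .safetensors format.
vision_model.save_pretrained(
    "Deep-Llama-3.1-KoEn-8B-SiSai-Vision", safe_serialization=True
)
```

Checking the missing/unexpected key counts is the quickest sanity test that the replacement covered the whole backbone rather than silently skipping weights.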

Test (Mac M1)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# ✅ Download the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# ✅ Check MPS support and set the device
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# βœ… λͺ¨λΈ λ‘œλ“œ
model = AutoModelForCausalLM.from_pretrained(
    "muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)

# βœ… ν† ν¬λ‚˜μ΄μ € λ‘œλ“œ
tokenizer = AutoTokenizer.from_pretrained("muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision", use_fast=True)

# ✅ Ask a question in Korean ("Please describe this image in Korean.")
question = "이 이미지를 한국어로 설명해주세요."

# βœ… λͺ¨λΈ μ‹€ν–‰
output = model.answer_question(
    image, question, tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
)

print(output)
```

Example output:

```
Using device: mps
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.42it/s]
이 이미지는 일본의 만화나 애니메이션에서 자주 등장하는 여주인 캐릭터입니다. 여주인 캐릭터는 머리와 옷이 흰색인 것을 볼 수 있습니다. 여주인 캐릭터는 손에 빵을 들고 있는 것을 볼 수 있습니다.
The image is of a young woman with a kind face, dressed in a medieval-inspired outfit. She is holding a large hamburger in her hand and has a happy expression on her face. The background is a warm, cozy room with a wooden table and chairs.
```

The Korean lines translate roughly as: "This image shows a female lead character of the kind that often appears in Japanese manga and animation. You can see that her hair and clothes are white, and that she is holding bread in her hand."
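
The script was only verified on an M1 Mac. On other hardware, a broader device check such as the following untested variant should slot in for the MPS check above:

```python
import torch

# Prefer CUDA, then Apple Silicon (MPS), then plain CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"Using device: {device}")
```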

Comments

Performance varies with your prompts, sampling settings, and hardware, so your mileage may vary. ^^

Use

Research & Educational Purposes: AI research, academic use, and educational content creation.

For questions about licensing, please contact my channel.
