Merged LLaMA 3.1 Vision + KoEn NLP Model

This repository contains a merged model of qresearch/llama-3.1-8B-vision-378 (a vision-enhanced LLaMA model) and muzerai/Deep-Llama-3.1-KoEn-8B-SiSai (a Korean-English NLP model). The goal of this merge is to enhance the vision model with improved natural language understanding and generation capabilities using a robust multilingual NLP model.

🚀 Why This Merge?

The original LLaMA 3.1 Vision model excels at image understanding but lacks strong text generation capabilities in Korean and English.
Meanwhile, Deep-Llama-3.1-KoEn-8B-SiSai is optimized for Korean and English NLP tasks but lacks multimodal capabilities.

By merging these models:

- We retain the vision capabilities of the Vision model.
- We improve text generation and reasoning by using the NLP model's pre-trained weights.
- The language model component (text_model) now uses weights tuned for Korean-English tasks, improving bilingual support.

📌 Model Details

Base Vision Model: qresearch/llama-3.1-8B-vision-378
Base NLP Model: muzerai/Deep-Llama-3.1-KoEn-8B-SiSai
Merged Components:
- Vision processing layers are retained from the original Vision model.
- text_model weights are replaced with those from the NLP model to improve text understanding.
File Format: .safetensors (optimized for fast and secure model loading)
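
The weight replacement described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build this repository: the function name `merge_text_weights`, the `text_model.` key prefix, and the toy key names are assumptions, and plain strings stand in for tensors. With a real checkpoint you would load the tensors from the .safetensors shards and also check that shapes match before overwriting.

```python
def merge_text_weights(vision_state, nlp_state, text_prefix="text_model."):
    """Return a copy of vision_state whose text-model entries come from nlp_state.

    Keys in nlp_state are assumed (hypothetically) to match the vision
    checkpoint's text-model keys once text_prefix is prepended.
    """
    merged = dict(vision_state)  # shallow copy; the input dict is not modified
    replaced, skipped = [], []
    for name, tensor in nlp_state.items():
        target = text_prefix + name
        if target in merged:
            merged[target] = tensor  # take the NLP model's weight
            replaced.append(target)
        else:
            skipped.append(name)  # no matching slot in the vision checkpoint
    return merged, replaced, skipped


# Toy stand-ins for the two checkpoints (hypothetical key names).
vision_state = {
    "vision_model.patch_embed.weight": "vision-A",
    "text_model.model.embed_tokens.weight": "text-old",
    "text_model.lm_head.weight": "head-old",
}
nlp_state = {
    "model.embed_tokens.weight": "text-new",
    "lm_head.weight": "head-new",
    "model.extra.weight": "unused",
}

merged, replaced, skipped = merge_text_weights(vision_state, nlp_state)
print("replaced:", replaced)
print("skipped:", skipped)
```

Keys without the prefix, such as the vision layers, pass through unchanged, which mirrors the "retain vision, replace text_model" split listed above.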

Test (Mac M1)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# ✅ Download the image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content))

# ✅ Check for MPS support, then set the device
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# ✅ Load the model
model = AutoModelForCausalLM.from_pretrained(
    "muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)

# ✅ Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision", use_fast=True)

# ✅ Ask the question in Korean ("Please describe this image in Korean.")
# For an English answer, use e.g. "Briefly describe the image."
question = "이 이미지를 한국어로 설명해주세요."

# ✅ Run the model
output = model.answer_question(
    image, question, tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
)

print(output)

Using device: mps
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00,  3.42it/s]
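이 이미지는 일본의 만화나 애니메이션에서 자주 등장하는 여주인 캐릭터입니다. 여주인 캐릭터는 머리와 옷이 흰색인 것을 볼 수 있습니다. 여주인 캐릭터는 손에 빵을 들고 있는 것을 볼 수 있습니다.
(English translation: This image shows a female lead character of the kind that often appears in Japanese manga or anime. Her hair and clothes are white. She is holding bread in her hand.)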

With the English prompt ("Briefly describe the image"), the output was:

The image is of a young woman with a kind face, dressed in a medieval-inspired outfit. She is holding a large hamburger in her hand and has a happy expression on her face. The background is a warm, cozy room with a wooden table and chairs.

Comments

Performance depends on how you use it ^^

Use

Research & Educational Purposes: AI research, academic use, and educational content creation.

For questions about licensing, please contact me through my channel.