Merged LLaMA 3.1 Vision + KoEn NLP Model
This repository contains a merge of qresearch/llama-3.1-8B-vision-378 (a vision-enhanced LLaMA model) and muzerai/Deep-Llama-3.1-KoEn-8B-SiSai (a Korean-English NLP model). The goal of the merge is to give the vision model stronger natural language understanding and generation by using the weights of a robust Korean-English NLP model.
Why This Merge?
The original LLaMA 3.1 Vision model excels at image understanding but lacks strong text generation capabilities in Korean and English.
Meanwhile, Deep-Llama-3.1-KoEn-8B-SiSai is optimized for Korean and English NLP tasks but lacks multimodal capabilities.
By merging these models:
- We retain the powerful vision capabilities of the Vision model.
- We enhance text generation and reasoning using the NLP model's pre-trained weights.
- The text encoder (text_model) is now optimized for Korean-English tasks, improving multilingual support.
Model Details
- Base Vision Model: qresearch/llama-3.1-8B-vision-378
- Base NLP Model: muzerai/Deep-Llama-3.1-KoEn-8B-SiSai
- Merged Components:
  - Vision processing layers are retained from the original Vision model.
  - text_model weights are replaced with those from the NLP model to improve text understanding.
- File Format: .safetensors (optimized for fast and secure model loading)
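The replacement of the text_model weights described above can be pictured as a state-dict merge. The following is a minimal sketch, not the script actually used for this repository. It assumes the language-model weights of the vision checkpoint live under a `text_model.` prefix and that NLP-model keys map onto them by adding that prefix. The function name `merge_text_weights`, the prefix, and the toy keys are all hypothetical, and plain strings stand in for tensors so the snippet runs without downloading either model.

```python
def merge_text_weights(vision_sd, nlp_sd, text_prefix="text_model."):
    """Return a copy of vision_sd whose text weights come from nlp_sd.

    Assumption: a key `k` in nlp_sd corresponds to `text_prefix + k` in vision_sd.
    """
    merged = dict(vision_sd)  # shallow copy; vision entries stay as they are
    replaced, skipped = [], []
    for name, weight in nlp_sd.items():
        target = text_prefix + name
        if target in merged:
            merged[target] = weight  # overwrite the text weight
            replaced.append(target)
        else:
            skipped.append(name)  # no matching slot in the vision checkpoint
    return merged, replaced, skipped


# Toy stand-ins for real state dicts (values would normally be tensors).
vision_sd = {
    "vision_model.encoder.weight": "vision-w",
    "text_model.layers.0.weight": "old-text-w",
}
nlp_sd = {"layers.0.weight": "koen-text-w", "extra.weight": "unused"}

merged, replaced, skipped = merge_text_weights(vision_sd, nlp_sd)
print(replaced)
print(skipped)
```

With real checkpoints the same loop would run over tensors loaded from the .safetensors shards, and a shape check before each assignment would be a sensible addition.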
Test (Mac M1)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import requests
from io import BytesIO

# Download the demo image
url = "https://huggingface.co/qresearch/llama-3-vision-alpha-hf/resolve/main/assets/demo-2.jpg"
response = requests.get(url)
response.raise_for_status()
image = Image.open(BytesIO(response.content))

# Check for MPS support and select the device
device = "mps" if torch.backends.mps.is_available() else "cpu"
print(f"Using device: {device}")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to(device)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("muzerai/Deep-Llama-3.1-KoEn-8B-SiSai-Vision", use_fast=True)

# Korean prompt ("Please describe this image in Korean.")
# For English, use: question = "Briefly describe the image"
question = "이 이미지를 한국어로 설명해주세요."

# Run the model
output = model.answer_question(
    image, question, tokenizer, max_new_tokens=128, do_sample=True, temperature=0.3
)
print(output)
Using device: mps
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 3.42it/s]
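이 이미지는 일본의 만화나 애니메이션에서 자주 등장하는 여주인 캐릭터입니다. 여주인 캐릭터는 머리와 옷이 흰색인 것을 볼 수 있습니다. 여주인 캐릭터는 손에 빵을 들고 있는 것을 볼 수 있습니다.
(English translation: This image shows a female lead character of the kind that often appears in Japanese manga and anime. Her hair and clothes can be seen to be white. She can be seen holding bread in her hand.)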
The image is of a young woman with a kind face, dressed in a medieval-inspired outfit. She is holding a large hamburger in her hand and has a happy expression on her face. The background is a warm, cozy room with a wooden table and chairs.
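The two outputs above correspond to a Korean and an English prompt. A small helper like the one below could run several prompts against the same image in one go. It is only a sketch: `ask_all` and `EchoModel` are hypothetical names, and the helper assumes the model exposes `answer_question(image, question, tokenizer, ...)` as in the test script. A stub model is used here so the snippet runs without loading the 8B checkpoint.

```python
def ask_all(model, image, tokenizer, prompts, **gen_kwargs):
    """Ask each prompt about the same image and collect the answers by label."""
    answers = {}
    for label, question in prompts.items():
        answers[label] = model.answer_question(image, question, tokenizer, **gen_kwargs)
    return answers


class EchoModel:
    """Stand-in for the real model; it only echoes the question it receives."""

    def answer_question(self, image, question, tokenizer, **gen_kwargs):
        return f"[answer to: {question}]"


prompts = {
    "ko": "이 이미지를 한국어로 설명해주세요.",
    "en": "Briefly describe the image",
}
results = ask_all(EchoModel(), image=None, tokenizer=None, prompts=prompts, max_new_tokens=128)
for label, text in results.items():
    print(label, text)
```

To use it with the real model, pass the loaded `model`, `image`, and `tokenizer` from the test script in place of the stub and the `None` placeholders.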
Comments
Performance is just... it depends on you ^^
Use
Research & Educational Purposes: AI research, academic use, and educational content creation.
For questions about licensing, please contact my channel.