---
language:
  - ms
  - id
  - th
  - vi
pipeline_tag: text-generation
tags:
  - pretrained
license: apache-2.0
base_model:
  - Qwen/Qwen2-7B
---

# Marco-LLM-SEA-7B

## Introduction

Marco-LLM-SEA is a series of enhanced language models adapted to Southeast Asian languages, including Indonesian, Malay, Thai, Vietnamese, and other regional languages. This repository contains the 7B Marco-LLM-SEA base language model.

Building on Qwen2-7B, Marco-LLM-SEA has undergone extensive continued pretraining on a dataset of approximately 56 billion tokens, enhancing its capabilities in the targeted languages while remaining competitive with state-of-the-art open-source language models on general benchmarks.

For more details, please refer to our Hugging Face page.

## Model Details

The Marco-LLM-SEA series includes models of varying sizes, from 7B to 72B parameters, in both base and instruction-tuned (Instruct) variants. The models are based on the Transformer architecture with SwiGLU activation, attention QKV bias, and grouped-query attention. Additionally, the models employ an improved tokenizer adapted to multiple Southeast Asian languages and scripts.
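
As a quick illustration, the minimal sketch below loads the tokenizer and segments short Indonesian, Thai, and Vietnamese sentences, assuming the model follows the standard `transformers` loading convention; `MODEL_ID` is a placeholder rather than the actual repository id.

```python
# Minimal sketch: inspect how the tokenizer segments Southeast Asian text.
# MODEL_ID is a placeholder -- replace it with the actual Hub repo id or a local path.
from transformers import AutoTokenizer

MODEL_ID = "Marco-LLM-SEA-7B"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

samples = {
    "id": "Selamat pagi, apa kabar?",
    "th": "สวัสดีตอนเช้า สบายดีไหม",
    "vi": "Chào buổi sáng, bạn khỏe không?",
}
for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang}: {len(tokens)} tokens -> {tokens}")
```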

## Usage

We do not advise using the base language model directly for text generation. Instead, we recommend applying post-training methods such as Supervised Fine-tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), or continued pretraining to adapt the model to specific use cases.
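
For completeness, the sketch below shows one way to load the base checkpoint with Hugging Face `transformers` as a starting point for further adaptation; `MODEL_ID` is a placeholder, and the generation call is only a sanity check, since the base model continues text rather than following instructions.

```python
# Minimal sketch: load the base model for inspection or as a starting point for SFT.
# MODEL_ID is a placeholder -- replace it with the actual Hub repo id or a local path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Marco-LLM-SEA-7B"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # assumes a bf16-capable GPU
    device_map="auto",
)

# The base model only continues text; it does not follow instructions out of the box.
prompt = "Ibu kota Indonesia adalah"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```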

## Citation

If you find our work helpful, please cite it as follows:

```bibtex
@article{unique_identifier,
  title={Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement},
  journal={arXiv preprint arXiv:2412.04003},
  year={2024},
  url={https://arxiv.org/abs/2412.04003}
}
```