# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch

- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts)
- **Architecture**: Based on the Mistral architecture, with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo; a minimal loading sketch follows below
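
Since the checkpoint is a standard Hugging Face causal LM, it can be loaded with the `transformers` library. The snippet below is a minimal sketch using the model ID from the link above; the dtype, device, and generation settings are illustrative, not recommendations from the model card.

```python
# Minimal sketch: load NorMistral-7b-scratch as a causal LM with transformers.
# Dtype/device settings below are illustrative, not from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"  # ID from the Hugging Face link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# This is a base (non-instruct) model, so prompt it as plain text continuation.
inputs = tokenizer("Oslo er hovedstaden i", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```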

### 2. Viking 7B

- **Description**: A multilingual large language model covering all the Nordic languages, including Norwegian; described by its developers as the first of its kind
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages
  - See the loading sketch below
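
This document does not give a Hugging Face link for Viking, so the repository name below is an assumption (the checkpoints appear to be published under the LumiOpen organization); verify it on the Hub before use.

```python
# Hedged sketch: the model ID below is an assumption, not stated in this
# document; confirm the actual repository name on the Hugging Face Hub.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="LumiOpen/Viking-7B",  # assumed repo name for the 7B checkpoint
    device_map="auto",
)
print(generator("Norge er kjent for", max_new_tokens=30)[0]["generated_text"])
```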

### 3. NorskGPT

- **Description**: A family of Norwegian large language models developed for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8k context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with a 4k context length, based on LLaMA 2
- **License**: CC-BY-NC-SA-4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base

- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine-similarity evaluation: Pearson 0.8275, Spearman 0.8245
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it well suited for bilingual applications; see the similarity sketch below
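
A minimal sentence-similarity sketch with the `sentence-transformers` library, using the model ID from the link above; the example sentences are illustrative.

```python
# Minimal sketch: Norwegian sentence similarity with NbAiLab/nb-sbert-base.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en norsk setning.",           # "This is a Norwegian sentence."
    "Dette er en setning på norsk.",        # a close paraphrase
    "The weather in Oslo is cold today.",   # English, unrelated topic
]
embeddings = model.encode(sentences)  # shape: (3, 768)

# Cosine similarity between the first sentence and the other two;
# the paraphrase should score clearly higher than the English sentence.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)
```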

### 2. FFI/SimCSE-NB-BERT-large

- **Description**: A Norwegian sentence embedding model from FFI (the Norwegian Defence Research Establishment), trained using the SimCSE methodology on top of NB-BERT-large
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
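
The pooling details below are assumptions: SimCSE models conventionally use the `[CLS]` representation, so this sketch loads the checkpoint with plain `transformers` on that assumption; verify against the model card.

```python
# Hedged sketch: embed Norwegian sentences with FFI/SimCSE-NB-BERT-large.
# [CLS] pooling is the SimCSE convention, but verify against the model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "FFI/SimCSE-NB-BERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["En katt sover på sofaen.", "Katten ligger og sover i stua."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Take the [CLS] token embedding as the sentence vector (assumed pooling).
    embeddings = model(**batch).last_hidden_state[:, 0]

# Cosine similarity between the two sentences.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(float(sim))
```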

## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus

- **Integration**: Well-documented integration with Hugging Face for RAG pipelines; see the sketch below
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus
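
A minimal retrieval sketch with `pymilvus` in its embedded Milvus Lite mode, paired with the nb-sbert-base embedder described above. The collection name and database file path are illustrative; the linked cookbook shows a fuller pipeline.

```python
# Minimal sketch: index Norwegian passages in Milvus Lite and search them.
# Collection/file names are illustrative; see the linked cookbook for a full flow.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")  # 768-dim vectors
client = MilvusClient("./rag_demo.db")  # embedded Milvus Lite database file

client.create_collection(collection_name="docs", dimension=768)

passages = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": embedder.encode(p).tolist(), "text": p}
        for i, p in enumerate(passages)
    ],
)

hits = client.search(
    collection_name="docs",
    data=[embedder.encode("Hva er hovedstaden i Norge?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```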

### 2. MongoDB

- **Integration**: MongoDB Atlas Vector Search can serve as the retrieval layer for RAG systems built on Hugging Face models; a hedged query sketch follows
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb
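
A hedged sketch of an Atlas Vector Search query with `pymongo`. It assumes a collection whose documents carry an `embedding` field and a pre-created Atlas vector index named `vector_index`; the connection string, database, and field names are all placeholders.

```python
# Hedged sketch: query MongoDB Atlas Vector Search from pymongo.
# Assumes documents with an "embedding" field and an Atlas vector index
# named "vector_index" already exist; names and URI are placeholders.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder
collection = client["rag_db"]["documents"]

query_vector = embedder.encode("Hva er hovedstaden i Norge?").tolist()
results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",   # assumed index name
            "path": "embedding",       # field holding the stored vectors
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"text": 1, "_id": 0}},
])
for doc in results:
    print(doc["text"])
```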

### 3. MyScale

- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)

- **Integration**: A lightweight similarity-search library (not a standalone database server) that works well with Hugging Face embeddings; see the sketch below
- **Notes**: Can be used with `autofaiss` for quick experimentation
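
A minimal in-memory sketch with plain `faiss` and normalized nb-sbert-base embeddings, so that inner product equals cosine similarity. (The `autofaiss` mention above refers to a wrapper that picks index parameters automatically; this sketch configures the index by hand.)

```python
# Minimal sketch: cosine-similarity search over Norwegian passages with FAISS.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")
passages = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]

# Normalize embeddings so inner product == cosine similarity.
vectors = embedder.encode(passages, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
index.add(vectors)

query = embedder.encode(
    ["Hva er hovedstaden i Norge?"], normalize_embeddings=True
).astype("float32")
scores, ids = index.search(query, 1)
print(passages[ids[0][0]], scores[0][0])
```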

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pretrained generator models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
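
A minimal sketch tying several of these pieces together: the `datasets` library's built-in FAISS support retrieves context, which is then formatted into a prompt for a generator. The toy corpus and model choices are illustrative.

```python
# Minimal RAG-style sketch: retrieve with datasets' built-in FAISS index,
# then hand the retrieved context to a generator. Corpus is illustrative.
import numpy as np
from datasets import Dataset
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")

corpus = Dataset.from_dict({
    "text": ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
})
corpus = corpus.map(
    lambda batch: {"embeddings": embedder.encode(batch["text"]).astype(np.float32)},
    batched=True,
)
corpus.add_faiss_index(column="embeddings")

question = "Hva er hovedstaden i Norge?"
query = embedder.encode(question).astype(np.float32)
scores, retrieved = corpus.get_nearest_examples("embeddings", query, k=1)

# Build a prompt from the retrieved context; any Norwegian-capable causal LM
# from the first section of this document could serve as the generator.
prompt = f"Kontekst: {retrieved['text'][0]}\nSpørsmål: {question}\nSvar:"
print(prompt)
```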
|