# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch

- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts)
- **Architecture**: Based on the Mistral architecture, with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo; a minimal loading sketch follows below
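
Since the checkpoint is a standard Hugging Face causal LM, it can be loaded with the `transformers` library. The snippet below is a minimal sketch using the model ID from the link above; the dtype, device, and generation settings are illustrative, not recommendations from the model card.

```python
# Minimal sketch: load NorMistral-7b-scratch as a causal LM with transformers.
# Dtype/device settings below are illustrative, not from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"  # ID from the Hugging Face link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# This is a base (non-instruct) model, so prompt it as plain text continuation.
inputs = tokenizer("Oslo er hovedstaden i", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```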

### 2. Viking 7B

- **Description**: A multilingual large language model covering all the Nordic languages, including Norwegian; described by its developers as the first of its kind
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages
  - See the loading sketch below
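
This document does not give a Hugging Face link for Viking, so the repository name below is an assumption (the checkpoints appear to be published under the LumiOpen organization); verify it on the Hub before use.

```python
# Hedged sketch: the model ID below is an assumption, not stated in this
# document; confirm the actual repository name on the Hugging Face Hub.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="LumiOpen/Viking-7B",  # assumed repo name for the 7B checkpoint
    device_map="auto",
)
print(generator("Norge er kjent for", max_new_tokens=30)[0]["generated_text"])
```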

### 3. NorskGPT

- **Description**: A family of Norwegian large language models developed for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8k context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with a 4k context length, based on LLaMA 2
- **License**: CC-BY-NC-SA-4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base

- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine-similarity evaluation: Pearson 0.8275, Spearman 0.8245
- **License**: Apache-2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it well suited for bilingual applications; see the similarity sketch below
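
A minimal sentence-similarity sketch with the `sentence-transformers` library, using the model ID from the link above; the example sentences are illustrative.

```python
# Minimal sketch: Norwegian sentence similarity with NbAiLab/nb-sbert-base.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en norsk setning.",           # "This is a Norwegian sentence."
    "Dette er en setning på norsk.",        # a close paraphrase
    "The weather in Oslo is cold today.",   # English, unrelated topic
]
embeddings = model.encode(sentences)  # shape: (3, 768)

# Cosine similarity between the first sentence and the other two;
# the paraphrase should score clearly higher than the English sentence.
scores = util.cos_sim(embeddings[0], embeddings[1:])
print(scores)
```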

### 2. FFI/SimCSE-NB-BERT-large

- **Description**: A Norwegian sentence embedding model from FFI (the Norwegian Defence Research Establishment), trained using the SimCSE methodology on top of NB-BERT-large
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
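
The pooling details below are assumptions: SimCSE models conventionally use the `[CLS]` representation, so this sketch loads the checkpoint with plain `transformers` on that assumption; verify against the model card.

```python
# Hedged sketch: embed Norwegian sentences with FFI/SimCSE-NB-BERT-large.
# [CLS] pooling is the SimCSE convention, but verify against the model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "FFI/SimCSE-NB-BERT-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["En katt sover på sofaen.", "Katten ligger og sover i stua."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    # Take the [CLS] token embedding as the sentence vector (assumed pooling).
    embeddings = model(**batch).last_hidden_state[:, 0]

# Cosine similarity between the two sentences.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(float(sim))
```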

## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus

- **Integration**: Well-documented integration with Hugging Face for RAG pipelines; see the sketch below
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus
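
A minimal retrieval sketch with `pymilvus` in its embedded Milvus Lite mode, paired with the nb-sbert-base embedder described above. The collection name and database file path are illustrative; the linked cookbook shows a fuller pipeline.

```python
# Minimal sketch: index Norwegian passages in Milvus Lite and search them.
# Collection/file names are illustrative; see the linked cookbook for a full flow.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")  # 768-dim vectors
client = MilvusClient("./rag_demo.db")  # embedded Milvus Lite database file

client.create_collection(collection_name="docs", dimension=768)

passages = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
client.insert(
    collection_name="docs",
    data=[
        {"id": i, "vector": embedder.encode(p).tolist(), "text": p}
        for i, p in enumerate(passages)
    ],
)

hits = client.search(
    collection_name="docs",
    data=[embedder.encode("Hva er hovedstaden i Norge?").tolist()],
    limit=1,
    output_fields=["text"],
)
print(hits[0][0]["entity"]["text"])
```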

### 2. MongoDB

- **Integration**: MongoDB Atlas Vector Search can serve as the retrieval layer for RAG systems built on Hugging Face models; a hedged query sketch follows
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb
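
A hedged sketch of an Atlas Vector Search query with `pymongo`. It assumes a collection whose documents carry an `embedding` field and a pre-created Atlas vector index named `vector_index`; the connection string, database, and field names are all placeholders.

```python
# Hedged sketch: query MongoDB Atlas Vector Search from pymongo.
# Assumes documents with an "embedding" field and an Atlas vector index
# named "vector_index" already exist; names and URI are placeholders.
from pymongo import MongoClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder
collection = client["rag_db"]["documents"]

query_vector = embedder.encode("Hva er hovedstaden i Norge?").tolist()
results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",   # assumed index name
            "path": "embedding",       # field holding the stored vectors
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 3,
        }
    },
    {"$project": {"text": 1, "_id": 0}},
])
for doc in results:
    print(doc["text"])
```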

### 3. MyScale

- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)

- **Integration**: A lightweight similarity-search library (not a standalone database server) that works well with Hugging Face embeddings; see the sketch below
- **Notes**: Can be used with `autofaiss` for quick experimentation
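
A minimal in-memory sketch with plain `faiss` and normalized nb-sbert-base embeddings, so that inner product equals cosine similarity. (The `autofaiss` mention above refers to a wrapper that picks index parameters automatically; this sketch configures the index by hand.)

```python
# Minimal sketch: cosine-similarity search over Norwegian passages with FAISS.
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")
passages = ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]

# Normalize embeddings so inner product == cosine similarity.
vectors = embedder.encode(passages, normalize_embeddings=True).astype("float32")

index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product index
index.add(vectors)

query = embedder.encode(
    ["Hva er hovedstaden i Norge?"], normalize_embeddings=True
).astype("float32")
scores, ids = index.search(query, 1)
print(passages[ids[0][0]], scores[0][0])
```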

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pretrained generator models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
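
A minimal sketch tying several of these pieces together: the `datasets` library's built-in FAISS support retrieves context, which is then formatted into a prompt for a generator. The toy corpus and model choices are illustrative.

```python
# Minimal RAG-style sketch: retrieve with datasets' built-in FAISS index,
# then hand the retrieved context to a generator. Corpus is illustrative.
import numpy as np
from datasets import Dataset
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("NbAiLab/nb-sbert-base")

corpus = Dataset.from_dict({
    "text": ["Oslo er hovedstaden i Norge.", "Bergen ligger på Vestlandet."]
})
corpus = corpus.map(
    lambda batch: {"embeddings": embedder.encode(batch["text"]).astype(np.float32)},
    batched=True,
)
corpus.add_faiss_index(column="embeddings")

question = "Hva er hovedstaden i Norge?"
query = embedder.encode(question).astype(np.float32)
scores, retrieved = corpus.get_nearest_examples("embeddings", query, k=1)

# Build a prompt from the retrieved context; any Norwegian-capable causal LM
# from the first section of this document could serve as the generator.
prompt = f"Kontekst: {retrieved['text'][0]}\nSpørsmål: {question}\nSvar:"
print(prompt)
```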
|