--- language: - en license: apache-2.0 tags: - sentence-transformers - sparse-encoder - sparse - splade base_model: distilbert/distilbert-base-uncased pipeline_tag: feature-extraction library_name: sentence-transformers --- # SPLADE distilbert-base-uncased trained on python docstring code pairs This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model finetuned from [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences & paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval. ## Model Details ### Model Description - **Model Type:** SPLADE Sparse Encoder - **Base model:** [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) - **Maximum Sequence Length:** 512 tokens - **Output Dimensionality:** 30522 dimensions - **Similarity Function:** Dot Product - **Language:** en - **License:** apache-2.0 ### Model Sources - **Documentation:** [Sentence Transformers Documentation](https://sbert.net) - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers) - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder) ### Full Model Architecture ``` SparseEncoder( (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'DistilBertForMaskedLM'}) (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522}) ) ``` ## Usage ### Direct Usage (Sentence Transformers) First install the Sentence Transformers library: ```bash pip install -U sentence-transformers ``` Then you can load this model and run inference. ```python from sentence_transformers import SparseEncoder # Download from the 🤗 Hub model = SparseEncoder("pulkitmehtawork/sparse-distilbert-base-uncased-python-code-lightening") # Run inference sentences = [ 'The weather is lovely today.', "It's so sunny outside!", 'He drove to the stadium.', ] embeddings = model.encode(sentences) print(embeddings.shape) # [3, 30522] # Get the similarity scores for the embeddings similarities = model.similarity(embeddings, embeddings) print(similarities) # tensor([[2148.8340, 1376.2744, 850.4404], # [1376.2744, 2056.9260, 898.0439], # [ 850.4404, 898.0439, 2509.7507]]) ``` ## Training Details ### Framework Versions - Python: 3.10.10 - Sentence Transformers: 5.0.0 - Transformers: 4.53.0 - PyTorch: 2.7.0+cu128 - Accelerate: 1.8.1 - Datasets: 3.6.0 - Tokenizers: 0.21.2 ## Citation ### BibTeX