This model is part of the CATIE French sparse embedding collection: a few experiments following the release of Sentence Transformers v5.0, which can be seen as a V0 ahead of the publication of more powerful French sparse models.
This is a CSR Sparse Encoder model fine-tuned from almanach/camembert-large on the french_sts dataset using the sentence-transformers library. It maps sentences and paragraphs to a 4096-dimensional sparse vector space with at most 256 active dimensions, and can be used for semantic search and sparse retrieval.
SparseEncoder(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'CamembertModel'})
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): SparseAutoEncoder({'input_dim': 1024, 'hidden_dim': 4096, 'k': 256, 'k_aux': 512, 'normalize': False, 'dead_threshold': 30})
)
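For readers who want to rebuild this stack from scratch rather than load the published checkpoint, a minimal sketch is shown below. It assumes the Sentence Transformers v5 building blocks (models.Transformer, models.Pooling and the CSR SparseAutoEncoder module) accept the constructor arguments shown in the printout above; it is illustrative, not the exact code used for this model.

from sentence_transformers import SparseEncoder, models
from sentence_transformers.sparse_encoder.models import SparseAutoEncoder

# Backbone: CamemBERT-large with mean pooling over token embeddings (1024-dim sentence vectors)
transformer = models.Transformer("almanach/camembert-large", max_seq_length=512)
pooling = models.Pooling(transformer.get_word_embedding_dimension(), pooling_mode="mean")

# CSR head: expand the 1024-dim dense vector to 4096 dims and keep only the top k=256 activations
csr = SparseAutoEncoder(
    input_dim=pooling.get_sentence_embedding_dimension(),
    hidden_dim=4096,
    k=256,
    k_aux=512,
)

model = SparseEncoder(modules=[transformer, pooling, csr])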
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SparseEncoder
# Download from the 🤗 Hub
model = SparseEncoder("bourdoiscatie/SPLADE_camembert-large_STS")
# Run inference
sentences = [
"Oui, je peux vous dire d'après mon expérience personnelle qu'ils ont certainement sifflé.",
"Il est vrai que les bombes de la Seconde Guerre mondiale faisaient un bruit de sifflet lorsqu'elles tombaient.",
"J'envisage de dépenser les 48 dollars par mois pour le système GTD (Getting things done) annoncé par David Allen.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 4096]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3673, 0.2794],
# [0.3673, 1.0000, 0.2023],
# [0.2794, 0.2023, 1.0000]])
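To see the sparsity directly, the number of non-zero dimensions per embedding can be counted with plain PyTorch. This is a small sketch that assumes encode returned a torch tensor (dense or sparse), as suggested by the shape printed above:

import torch

# Densify if needed, then count the non-zero dimensions of each sentence embedding
dense = embeddings.to_dense() if embeddings.is_sparse else embeddings
active_dims = torch.count_nonzero(dense, dim=1)
print(active_dims)
# Each count is at most 256 (the k of the SparseAutoEncoder head); the evaluation
# below reports averages of roughly 230-239 active dimensions on the STS splits.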
The model was evaluated on the sts-dev and sts-test splits with the SparseEmbeddingSimilarityEvaluator:

Metric | sts-dev | sts-test |
---|---|---|
pearson_cosine | 0.7307 | 0.7537 |
spearman_cosine | 0.7230 | 0.7256 |
active_dims | 239.0452 | 229.7225 |
sparsity_ratio | 0.9416 | 0.9439 |
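These figures can be reproduced with the evaluator named above. The sketch below assumes SparseEmbeddingSimilarityEvaluator mirrors the constructor of the dense EmbeddingSimilarityEvaluator (parallel lists of sentences plus gold similarity scores scaled to [0, 1]); sentences1, sentences2 and scores are placeholders for the chosen STS split:

from sentence_transformers.sparse_encoder.evaluation import SparseEmbeddingSimilarityEvaluator

# sentences1, sentences2, scores: parallel lists built from the STS split to evaluate
evaluator = SparseEmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="sts-test",
)
results = evaluator(model)
print(results)  # includes pearson_cosine, spearman_cosine, active_dims, sparsity_ratio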
Training dataset: columns sentence1 (string), sentence2 (string), and score (float). Sample rows:

sentence1 | sentence2 | score |
---|---|---|
Un avion est en train de décoller. | Un avion est en train de décoller. | 1.0 |
Un homme est en train de fumer. | Un homme fait du patinage. | 0.10000000149011612 |
Une personne jette un chat au plafond. | Une personne jette un chat au plafond. | 1.0 |
Loss: SpladeLoss with these parameters:
{
    "loss": "SparseCosineSimilarityLoss(loss_fct='torch.nn.modules.loss.MSELoss')",
    "document_regularizer_weight": 0.003
}
Evaluation dataset: columns sentence1 (string), sentence2 (string), and score (float). Sample rows:

sentence1 | sentence2 | score |
---|---|---|
Un homme avec un casque de sécurité est en train de danser. | Un homme portant un casque de sécurité est en train de danser. | 1.0 |
Un jeune enfant monte à cheval. | Un enfant monte à cheval. | 0.949999988079071 |
Un homme donne une souris à un serpent. | L'homme donne une souris au serpent. | 1.0 |
Loss: SpladeLoss with these parameters:
{
    "loss": "SparseCosineSimilarityLoss(loss_fct='torch.nn.modules.loss.MSELoss')",
    "document_regularizer_weight": 0.003
}
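Putting these pieces together, fine-tuning with this loss could look roughly like the sketch below. It assumes the Sentence Transformers v5 sparse-encoder training API (SparseEncoderTrainer, SparseEncoderTrainingArguments, SpladeLoss, SparseCosineSimilarityLoss); the dataset path is a placeholder and this is not the exact script behind this model.

from datasets import load_dataset
from sentence_transformers import SparseEncoderTrainer, SparseEncoderTrainingArguments
from sentence_transformers.sparse_encoder.losses import SpladeLoss, SparseCosineSimilarityLoss

# Placeholder path: replace with the actual french_sts dataset on the Hub
dataset = load_dataset("path/to/french_sts")
train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

# `model` is a SparseEncoder, e.g. assembled as in the architecture sketch above.
# The cosine-similarity regression loss is wrapped in SpladeLoss, which adds the
# sparsity regularizer controlled by document_regularizer_weight.
loss = SpladeLoss(
    model=model,
    loss=SparseCosineSimilarityLoss(model),
    document_regularizer_weight=0.003,
)

# Mirrors the non-default hyperparameters listed below (output_dir is a placeholder)
args = SparseEncoderTrainingArguments(
    output_dir="SPLADE_camembert-large_STS",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    bf16=True,
)

trainer = SparseEncoderTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
)
trainer.train()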
Non-default hyperparameters:

eval_strategy: epoch
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
bf16: True

All hyperparameters:

overwrite_output_dir: False
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training logs:

Epoch | Step | Training Loss | Validation Loss | sts-dev_spearman_cosine | sts-test_spearman_cosine |
---|---|---|---|---|---|
-1 | -1 | - | - | 0.4890 | - |
0.1307 | 100 | 0.0458 | - | - | - |
0.2614 | 200 | 0.0447 | - | - | - |
0.3922 | 300 | 0.0468 | - | - | - |
0.5229 | 400 | 0.0416 | - | - | - |
0.6536 | 500 | 0.0398 | - | - | - |
0.7843 | 600 | 0.0397 | - | - | - |
0.9150 | 700 | 0.0398 | - | - | - |
1.0 | 765 | - | 0.0417 | 0.6801 | - |
1.0458 | 800 | 0.0368 | - | - | - |
1.1765 | 900 | 0.0296 | - | - | - |
1.3072 | 1000 | 0.0288 | - | - | - |
1.4379 | 1100 | 0.0285 | - | - | - |
1.5686 | 1200 | 0.0264 | - | - | - |
1.6993 | 1300 | 0.0251 | - | - | - |
1.8301 | 1400 | 0.0256 | - | - | - |
1.9608 | 1500 | 0.0253 | - | - | - |
2.0 | 1530 | - | 0.0368 | 0.7083 | - |
2.0915 | 1600 | 0.0197 | - | - | - |
2.2222 | 1700 | 0.0151 | - | - | - |
2.3529 | 1800 | 0.0156 | - | - | - |
2.4837 | 1900 | 0.0155 | - | - | - |
2.6144 | 2000 | 0.0141 | - | - | - |
2.7451 | 2100 | 0.0134 | - | - | - |
2.8758 | 2200 | 0.0137 | - | - | - |
3.0 | 2295 | - | 0.0352 | 0.7230 | - |
-1 | -1 | - | - | - | 0.7256 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{formal2022distillationhardnegativesampling,
title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
year={2022},
eprint={2205.04733},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2205.04733},
}
@article{paria2020minimizing,
title={Minimizing flops to learn efficient sparse representations},
author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
journal={arXiv preprint arXiv:2004.05665},
year={2020}
}
Base model: almanach/camembert-large