---
library_name: transformers
tags: []
---

# Geraldine/msmarco-distilbert-base-v4-ead

## Model Details

- Model Name: Geraldine/msmarco-distilbert-base-v4-ead
- Base Model: sentence-transformers/msmarco-distilbert-base-v4
- Intended Use: This model is optimized for creating text embeddings with specific handling of XML/EAD elements.
- Architecture: DistilBERT-based sentence-transformer model, fine-tuned for MSMARCO and adapted to recognize XML/EAD elements.

## Model Description

This model is built on top of sentence-transformers/msmarco-distilbert-base-v4 and enhanced with two key modifications:

1. Special Tokens for XML/EAD Elements: The tokenizer includes additional tokens for EAD (Encoded Archival Description) and XML elements and attributes, allowing the model to generate embeddings that capture the structural metadata commonly used in archival contexts (a short tokenizer check is sketched after this list).

2. Dimensionality Reduction with PCA: A PCA model is applied to reduce the dimensionality of embeddings from 768 to 128. This makes the embeddings more compact while preserving essential semantic information, which is useful for downstream tasks requiring lower-dimensional representations.
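
As a quick illustration of the first point, the snippet below checks how an EAD tag is tokenized. The tags used here are only examples; the actual list of added tokens is defined in the tokenizer configuration (see `tokenizer.additional_special_tokens`).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Example EAD tags -- assumed to be among the added tokens; if a tag was added,
# it is kept as a single token instead of being split into sub-word pieces.
for tag in ["<unittitle>", "<unitdate>"]:
    print(tag, "->", tokenizer.tokenize(f"{tag}Fonds Paul Valery</{tag[1:]}"))
```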

## Model Usage

### Installation and Setup

```python
from transformers import AutoModel, AutoTokenizer
import joblib
from huggingface_hub import hf_hub_download

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")
pca = joblib.load(pca_path)
```
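
Optionally, a quick sanity check that the downloaded object behaves as described (assuming it is a scikit-learn PCA instance, which the `.joblib` file suggests):

```python
# The PCA model should project 768-dimensional embeddings down to 128 dimensions
print(pca.n_components_)      # expected: 128
print(pca.components_.shape)  # expected: (128, 768)
```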
### Encoding Text and Reducing Dimensionality

To use the model for generating 128-dimensional embeddings, follow these steps:

```python
import torch

# Encode text using the model and tokenizer
text = "Your EAD/XML text goes here"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state  # (1, seq_len, 768)

# Pool the token embeddings into a single 768-dimensional vector per text
# (here the first token's embedding, matching the pipeline example below),
# then apply PCA to reduce it to 128 dimensions
embeddings = token_embeddings[:, 0, :].numpy()  # (1, 768)
reduced_embeddings = pca.transform(embeddings)  # (1, 128)
```
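
Building on the snippet above, here is a minimal sketch of batch encoding plus a cosine-similarity comparison over the reduced embeddings; the document texts and query are made up for illustration.

```python
import numpy as np
import torch

def encode(texts):
    # Tokenize a batch, take each text's first-token embedding, then reduce with PCA
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state  # (n, seq_len, 768)
    return pca.transform(token_embeddings[:, 0, :].numpy())   # (n, 128)

docs = [
    "<unittitle>Correspondence, 1890-1910</unittitle>",
    "<unitdate>1890/1910</unitdate>",
]
doc_vecs = encode(docs)
query_vec = encode(["letters from the early twentieth century"])

# Cosine similarity between the query and each document
scores = (doc_vecs @ query_vec.T).ravel() / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(scores)
```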

### Full example to use with Langchain or Llamaindex

```python
from transformers import AutoModel, AutoTokenizer, pipeline
import joblib
import numpy as np
from huggingface_hub import hf_hub_download

# Load the embeddings model
model = AutoModel.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")
tokenizer = AutoTokenizer.from_pretrained("Geraldine/msmarco-distilbert-base-v4-ead")

# Load the PCA model
pca_path = hf_hub_download("Geraldine/msmarco-distilbert-base-v4-ead", "pca_model.joblib")

feature_extraction_pipeline = pipeline("feature-extraction", model=model, tokenizer=tokenizer)

class HuggingFaceEmbeddingFunction:
    def __init__(self, pipeline, pca_model_path):
        self.pipeline = pipeline
        self.pca = joblib.load(pca_model_path)
            
    # Function for embedding documents (lists of text)
    def embed_documents(self, texts):
        # Run the feature-extraction pipeline and keep the first token's
        # embedding (768-d) for each text
        embeddings = self.pipeline(texts)
        embeddings = [embedding[0][0] for embedding in embeddings]
        embeddings = np.array(embeddings)

        # Transform embeddings using PCA
        reduced_embeddings = self.pca.transform(embeddings)
        return reduced_embeddings.tolist()

    # Function for embedding individual queries
    def embed_query(self, text):
        embedding = self.pipeline(text)
        embedding = np.array(embedding[0][0]).reshape(1, -1)

        # Transform embedding using PCA
        reduced_embedding = self.pca.transform(embedding)
        return reduced_embedding.flatten().tolist()

embeddings = HuggingFaceEmbeddingFunction(feature_extraction_pipeline, pca_model_path=pca_path)
```
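
The class above exposes `embed_documents` and `embed_query`, which is the interface LangChain expects from an embeddings object. As a hedged illustration (assuming `langchain-community` and `faiss-cpu` are installed, and using hypothetical EAD snippets), it could be plugged into a vector store like this:

```python
from langchain_community.vectorstores import FAISS

# Hypothetical EAD/XML snippets to index
texts = [
    "<unittitle>Correspondence, 1890-1910</unittitle>",
    "<scopecontent><p>Letters and manuscripts from the fonds.</p></scopecontent>",
]

# Build a FAISS index using the 128-dimensional PCA-reduced embeddings
vector_store = FAISS.from_texts(texts, embedding=embeddings)

# Query the index
results = vector_store.similarity_search("letters and manuscripts", k=1)
print(results[0].page_content)
```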
### Intended Use Cases

This model is well-suited for:

- **Archival Data Embeddings**: Generate embeddings for texts containing EAD/XML elements, making it ideal for digital archives and library sciences.
- **Semantic Search**: Improve search results for content with complex metadata or hierarchical data, like archival records or digital collections.
- **Information Retrieval**: Use embeddings to power retrieval tasks where reducing storage and maintaining relevance in the embeddings are essential.

## Training Data

The base model was fine-tuned on MSMARCO data by sentence-transformers. Additional training or fine-tuning with EAD/XML-specific tokens was not required; instead, the tokenizer was adapted to recognize XML/EAD elements and attributes as distinct tokens.
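
For context, the sketch below shows the typical way a tokenizer is extended with domain tokens in `transformers`; the tag list is illustrative, and this is not necessarily the exact procedure used to build this model. Note that adding tokens requires resizing the model's embedding matrix, and the new token embeddings start out untrained.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-v4")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-v4")

# Illustrative EAD tags -- the actual added token list may differ
ead_tokens = ["<ead>", "<unittitle>", "<unitdate>", "<scopecontent>", "<dsc>"]
tokenizer.add_special_tokens({"additional_special_tokens": ead_tokens})

# Resize the embedding matrix so the new token IDs have (initially untrained) vectors
model.resize_token_embeddings(len(tokenizer))
```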

## Limitations and Considerations

- **Domain-Specific Tokenization**: The model's tokenizer recognizes EAD/XML tokens, making it particularly useful in contexts where such elements are frequently used. However, this specialization may not be necessary for general NLP tasks.
- **Dimensionality Reduction Trade-Off**: PCA reduces the embedding dimensions from 768 to 128, which can introduce minor losses in the information encoded in embeddings. This trade-off is balanced to retain essential semantic information.

## Evaluation

The base model has been evaluated on MSMARCO, and the added tokenization aligns it for use in XML/EAD contexts. Further evaluation can be conducted on EAD-specific datasets or tasks to ensure model effectiveness in domain-specific applications.

## Citation

If you use this model, please cite it as follows:

```bibtex
@misc{geraldine2024eadxml,
  author = {Géraldine Geoffroy},
  title = {Geraldine/msmarco-distilbert-base-v4-ead: A DistilBERT Embedding Model for EAD/XML Text},
  year = {2024},
  howpublished = {\url{https://huggingface.co/Geraldine/msmarco-distilbert-base-v4-ead}},
}
```

## Model Card Authors

Géraldine Geoffroy

## Model Card Contact

[email protected]