Jordi Catafal committed · Commit 8c3e1fb · 1 Parent(s): d51407f
readme for the api
README.md CHANGED
@@ -8,3 +8,394 @@ pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

--------------------------------

# Spanish Embeddings API

An HTTP API that generates 768-dimensional embeddings from Spanish text. It serves two models for different use cases: `jina` for general-purpose and long-document work, and `robertalex` for legal and formal Spanish.

## 🚀 Quick Start

**Base URL**: `https://aurasystems-spanish-embeddings-api.hf.space`

**Interactive Documentation**: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)

## 📚 Available Models

| Model | Max Tokens | Languages | Dimensions | Best Use Case |
|-------|------------|-----------|------------|---------------|
| **jina** | 8,192 | Spanish, English | 768 | General purpose, long documents, cross-lingual tasks |
| **robertalex** | 512 | Spanish | 768 | Legal documents, formal Spanish, domain-specific text |

## 🔗 API Endpoints

### Generate Embeddings
```
POST /embed
```
Generate embeddings for up to 50 texts in a single request.

### List Models
```
GET /models
```
Get detailed information about available models.

### Health Check
```
GET /health
```
Check API status and model availability.

### API Info
```
GET /
```
Basic API information and status.

## 📖 Usage Examples

### Python

```python
import requests
import numpy as np

API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"

# Example 1: Basic usage
response = requests.post(
    f"{API_URL}/embed",
    json={
        "texts": ["Hola, ¿cómo estás?", "Me gusta programar en Python"],
        "model": "jina",
        "normalize": True
    }
)
response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body

result = response.json()
embeddings = result["embeddings"]
print(f"Generated {len(embeddings)} embeddings of {result['dimensions']} dimensions")

# Example 2: Using with numpy for similarity
# With normalize=True the vectors have unit length, so the dot product is the cosine similarity.
embeddings_array = np.array(embeddings)
similarity = np.dot(embeddings_array[0], embeddings_array[1])
print(f"Cosine similarity: {similarity:.4f}")

# Example 3: Legal text with RoBERTalex
legal_response = requests.post(
    f"{API_URL}/embed",
    json={
        "texts": [
            "El contrato será válido desde la fecha de firma",
            "La validez contractual inicia en el momento de suscripción"
        ],
        "model": "robertalex",
        "normalize": True
    }
)
legal_response.raise_for_status()
legal_embeddings = legal_response.json()["embeddings"]
```

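Example 2 works because `normalize` is `True`: for unit-length vectors the dot product and the cosine similarity coincide. If you request embeddings with `normalize` set to `False`, you would need to divide by the vector norms yourself. A minimal sketch in plain Python; `cosine_similarity` is an illustrative helper, not something the API provides:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that does not assume unit-length inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Scaling a vector changes its dot product but not its cosine similarity.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # 1.0
```
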
### cURL

```bash
# Basic embedding generation
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Texto de ejemplo", "Otro texto en español"],
    "model": "jina",
    "normalize": true
  }'

# With custom max length
curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed" \
  -H "Content-Type: application/json" \
  -d '{
    "texts": ["Documento muy largo..."],
    "model": "jina",
    "normalize": true,
    "max_length": 2048
  }'

# Get model information
curl "https://aurasystems-spanish-embeddings-api.hf.space/models"
```

### JavaScript/TypeScript

```javascript
const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';

// Basic function to get embeddings
async function getEmbeddings(texts, model = 'jina') {
  const response = await fetch(`${API_URL}/embed`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      texts: texts,
      model: model,
      normalize: true
    })
  });

  if (!response.ok) {
    throw new Error(`Error: ${response.status}`);
  }

  return await response.json();
}

// Usage example (top-level await requires an ES module; otherwise wrap this in an async function)
try {
  const result = await getEmbeddings([
    'Hola mundo',
    'Programación en JavaScript'
  ]);
  console.log('Embeddings:', result.embeddings);
  console.log('Dimensions:', result.dimensions);
} catch (error) {
  console.error('Error generating embeddings:', error);
}
```

### Using with LangChain

```python
# On recent LangChain releases the same base class is exposed as
# langchain_core.embeddings.Embeddings.
from langchain.embeddings.base import Embeddings
from typing import List
import requests

class SpanishEmbeddings(Embeddings):
    """Custom LangChain embeddings class for Spanish text"""

    def __init__(self, model: str = "jina"):
        self.api_url = "https://aurasystems-spanish-embeddings-api.hf.space/embed"
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings: List[List[float]] = []
        # The API accepts at most 50 texts per request, so send them in chunks.
        for start in range(0, len(texts), 50):
            response = requests.post(
                self.api_url,
                json={
                    "texts": texts[start:start + 50],
                    "model": self.model,
                    "normalize": True
                }
            )
            response.raise_for_status()
            embeddings.extend(response.json()["embeddings"])
        return embeddings

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]

# Usage with LangChain
embeddings = SpanishEmbeddings(model="jina")
doc_embeddings = embeddings.embed_documents([
    "Primer documento",
    "Segundo documento"
])
query_embedding = embeddings.embed_query("consulta de búsqueda")
```

## 📋 Request/Response Formats

### Request Body Schema

```json
{
  "texts": [
    "string"
  ],
  "model": "jina",
  "normalize": true,
  "max_length": null
}
```

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| texts | array[string] | Yes | - | List of texts to embed (1-50 texts) |
| model | string | No | "jina" | Model to use: "jina" or "robertalex" |
| normalize | boolean | No | true | Whether to L2-normalize embeddings |
| max_length | integer/null | No | null | Maximum tokens per text (null = model default) |

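The constraints in the table can also be checked on the client before a request is sent, which avoids a round trip for inputs the server would reject. The sketch below is one way to do that; `build_embed_payload` is a hypothetical helper, and treating whitespace-only strings as empty is an assumption about the server's behavior rather than something the table states.

```python
from typing import Any, Dict, List, Optional

MAX_TEXTS = 50                         # from the "texts" row above
KNOWN_MODELS = {"jina", "robertalex"}  # from the "model" row above


def build_embed_payload(
    texts: List[str],
    model: str = "jina",
    normalize: bool = True,
    max_length: Optional[int] = None,
) -> Dict[str, Any]:
    """Validate arguments locally and return a JSON-ready request body."""
    if not 1 <= len(texts) <= MAX_TEXTS:
        raise ValueError(f"texts must contain 1-{MAX_TEXTS} items, got {len(texts)}")
    # Assumption: whitespace-only strings count as empty.
    if any(not t.strip() for t in texts):
        raise ValueError("empty texts are not allowed")
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if max_length is not None and max_length <= 0:
        raise ValueError("max_length must be a positive integer or None")
    return {
        "texts": texts,
        "model": model,
        "normalize": normalize,
        "max_length": max_length,
    }
```

The returned dictionary can be passed directly as the `json=` argument of `requests.post`.
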
### Response Schema

```json
{
  "embeddings": [[0.123, -0.456, ...]],
  "model_used": "jina",
  "dimensions": 768,
  "num_texts": 1
}
```

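Since the response reports both `num_texts` and `dimensions`, a client can cross-check them against the vectors it actually received. A small sketch of such a check, assuming the field names shown in the schema above; `parse_embed_response` is an illustrative name:

```python
from typing import Any, Dict, List


def parse_embed_response(body: Dict[str, Any]) -> List[List[float]]:
    """Return the embeddings after checking them against the reported metadata."""
    embeddings = body["embeddings"]
    if len(embeddings) != body["num_texts"]:
        raise ValueError("num_texts does not match the number of embeddings")
    for vector in embeddings:
        if len(vector) != body["dimensions"]:
            raise ValueError("embedding length does not match dimensions")
    return embeddings


# A shortened, made-up body with 3 dimensions instead of 768.
example = {
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    "model_used": "jina",
    "dimensions": 3,
    "num_texts": 2,
}
print(len(parse_embed_response(example)))  # 2
```
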
## ⚡ Performance & Limits

- **Maximum texts per request**: 50
- **Maximum concurrent requests**: 4 (on free tier)
- **Typical response time**: 100-200 ms for 10 texts
- **Embedding dimensions**: 768 (both models)
- **API availability**: 24/7 on Hugging Face Spaces

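The two request limits above can be respected together by splitting the input into batches of 50 and capping the worker pool at four. The following is a sketch under those assumptions, not a tuned client: `embed_batch` is a hypothetical callable that you supply (for example, a function that POSTs one batch to `/embed` and returns its `embeddings` list).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

MAX_TEXTS_PER_REQUEST = 50   # from the limits above
MAX_CONCURRENT_REQUESTS = 4  # free-tier figure from the limits above

Vector = List[float]


def embed_concurrently(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Vector]],
) -> List[Vector]:
    """Split texts into batches and run at most four requests at a time."""
    batches = [
        list(texts[i:i + MAX_TEXTS_PER_REQUEST])
        for i in range(0, len(texts), MAX_TEXTS_PER_REQUEST)
    ]
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as pool:
        # Executor.map yields results in submission order, so the output
        # order matches the input order even if batches finish out of order.
        results = pool.map(embed_batch, batches)
        return [vector for batch in results for vector in batch]
```

Passing the request function in as an argument keeps the batching logic testable without network access.
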
## 🔧 Advanced Usage

### Batch Processing

For processing large datasets, implement batching:

```python
import requests

def process_large_dataset(texts, batch_size=50):
    """Process large text dataset in batches"""
    embeddings = []

    # batch_size must not exceed the API limit of 50 texts per request
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = requests.post(
            "https://aurasystems-spanish-embeddings-api.hf.space/embed",
            json={
                "texts": batch,
                "model": "jina",
                "normalize": True
            }
        )
        response.raise_for_status()  # stop on the first failed batch
        embeddings.extend(response.json()["embeddings"])

    return embeddings
```

### Semantic Search Example

```python
import numpy as np
import requests
from typing import List, Tuple

def semantic_search(
    query: str,
    documents: List[str],
    top_k: int = 5
) -> List[Tuple[int, float]]:
    """Find most similar documents to query"""

    # Get embeddings for query and documents in one request.
    # The query counts toward the 50-text limit, so pass at most 49 documents.
    response = requests.post(
        "https://aurasystems-spanish-embeddings-api.hf.space/embed",
        json={
            "texts": [query] + documents,
            "model": "jina",
            "normalize": True
        }
    )
    response.raise_for_status()

    embeddings = np.array(response.json()["embeddings"])
    query_embedding = embeddings[0]
    doc_embeddings = embeddings[1:]

    # Calculate similarities (dot product equals cosine similarity for normalized vectors)
    similarities = np.dot(doc_embeddings, query_embedding)

    # Get top-k results
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return [(int(idx), float(similarities[idx])) for idx in top_indices]

# Example usage
documents = [
    "Python es un lenguaje de programación",
    "Madrid es la capital de España",
    "El machine learning está revolucionando la tecnología",
    "La paella es un plato típico español"
]

results = semantic_search(
    "inteligencia artificial y programación",
    documents,
    top_k=2
)

for idx, score in results:
    print(f"Document: {documents[idx]}")
    print(f"Similarity: {score:.4f}\n")
```

## 🚨 Error Handling

The API returns standard HTTP status codes:

| Status Code | Description |
|-------------|-------------|
| 200 | Success |
| 400 | Bad Request (invalid parameters) |
| 422 | Validation Error (check request format) |
| 429 | Too Many Requests (rate limit exceeded) |
| 500 | Internal Server Error |

### Error Response Format

```json
{
  "detail": "Error message description"
}
```

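The status codes and the `detail` field are enough to build a simple retry wrapper. The sketch below retries with exponential backoff on 429 and 500 and fails immediately on other errors. Which codes are worth retrying, the delays, and the attempt count are assumptions, and `send` is a hypothetical callable that performs one request and returns `(status_code, parsed_json_body)`.

```python
import time
from typing import Callable, Tuple

RETRYABLE_STATUS = {429, 500}  # assumption: only these two codes are worth retrying


def post_with_retry(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, backing off exponentially on 429/500."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            detail = body.get("detail", "unknown error")
            raise RuntimeError(f"request failed with status {status}: {detail}")
        sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ... with the defaults
    raise RuntimeError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wrapper can be exercised in tests without real delays.
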
### Common Errors and Solutions

1. **Invalid max_length**
   ```json
   {
     "detail": "Value error, Max length must be positive"
   }
   ```
   **Solution**: Use a positive integer or omit max_length

2. **Too many texts**
   ```json
   {
     "detail": "Maximum 50 texts per request"
   }
   ```
   **Solution**: Batch your requests

3. **Empty texts**
   ```json
   {
     "detail": "Empty texts are not allowed"
   }
   ```
   **Solution**: Filter out empty strings before sending

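When filtering out empty strings, it helps to remember where the surviving texts came from so the returned embeddings can be matched back to the original list. A minimal sketch; `drop_empty_texts` is an illustrative helper, and treating whitespace-only strings as empty is an assumption:

```python
from typing import List, Tuple


def drop_empty_texts(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Return the non-empty texts plus each one's index in the original list."""
    kept = [(i, t) for i, t in enumerate(texts) if t.strip()]
    return [t for _, t in kept], [i for i, _ in kept]


texts = ["Hola", "", "   ", "Adiós"]
clean, positions = drop_empty_texts(texts)
print(clean)      # ['Hola', 'Adiós']
print(positions)  # [0, 3]
```

Here `positions[k]` is the index in `texts` of the input that produced the k-th embedding in the response.
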
## 🔒 Authentication

This API is currently **open and does not require authentication**. It's hosted on Hugging Face Spaces and is free to use within the rate limits.

## 📊 Monitoring

Check API status and health:

```python
import requests

# Health check
health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
print(health.json())
# Output: {'status': 'healthy', 'models_loaded': True, 'available_models': ['jina', 'robertalex']}
```

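The health payload can be reduced to a single readiness flag before sending real traffic. The sketch below assumes the response always has the three keys shown in the sample output above; `is_ready` and its `required_models` parameter are hypothetical names.

```python
from typing import Any, Dict, Iterable


def is_ready(health: Dict[str, Any], required_models: Iterable[str] = ("jina",)) -> bool:
    """True when the service reports healthy and every required model is listed."""
    return (
        health.get("status") == "healthy"
        and health.get("models_loaded") is True
        and set(required_models) <= set(health.get("available_models", []))
    )


sample = {"status": "healthy", "models_loaded": True, "available_models": ["jina", "robertalex"]}
print(is_ready(sample, required_models=["jina", "robertalex"]))  # True
```
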
## 🤝 Support

- **Issues**: Open a thread in the [Hugging Face Space discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
- **Documentation**: Visit the [interactive API docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)
- **Model Information**:
  - [Jina Embeddings v2 Spanish](https://huggingface.co/jinaai/jina-embeddings-v2-base-es)
  - [RoBERTalex](https://huggingface.co/PlanTL-GOB-ES/RoBERTalex)

## 📄 License

This API is provided as-is for research and commercial use. The underlying models have their own licenses:
- Jina models: Apache 2.0
- RoBERTalex: Apache 2.0

---

Built with ❤️ using FastAPI and Hugging Face Transformers