Spaces: AuraSystems/spanish-embeddings-api

Jordi Catafal committed on Jun 1

Commit

8c3e1fb

1 Parent(s): d51407f

readme for the api

README.md +391 -0

@@ -8,3 +8,394 @@ pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+--------------------------------
+# Spanish Embeddings API
+An API for generating 768-dimensional embeddings from Spanish text. It serves two models: `jina` for general-purpose, long, or bilingual (Spanish/English) input, and `robertalex` for legal and formal Spanish.
+## 🚀 Quick Start
+**Base URL**: `https://aurasystems-spanish-embeddings-api.hf.space`
+**Interactive Documentation**: [https://aurasystems-spanish-embeddings-api.hf.space/docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)
+## 📚 Available Models
+| Model | Max Tokens | Languages | Dimensions | Best Use Case |
+|-------|------------|-----------|------------|---------------|
+| **jina** | 8,192 | Spanish, English | 768 | General purpose, long documents, cross-lingual tasks |
+| **robertalex** | 512 | Spanish | 768 | Legal documents, formal Spanish, domain-specific text |
+## 🔗 API Endpoints
+### Generate Embeddings
+```
+POST /embed
+```
+Generate embeddings for up to 50 texts in a single request.
+### List Models
+```
+GET /models
+```
+Get detailed information about available models.
+### Health Check
+```
+GET /health
+```
+Check API status and model availability.
+### API Info
+```
+GET /
+```
+Basic API information and status.
+## 📖 Usage Examples
+### Python
+```python
+import requests
+import numpy as np
+API_URL = "https://aurasystems-spanish-embeddings-api.hf.space"
+# Example 1: Basic usage
+response = requests.post(
+    f"{API_URL}/embed",
+    json={
+        "texts": ["Hola, ¿cómo estás?", "Me gusta programar en Python"],
+        "model": "jina",
+        "normalize": True
+    }
+)
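+response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
+result = response.json()
+embeddings = result["embeddings"]
+print(f"Generated {len(embeddings)} embeddings of {result['dimensions']} dimensions")
+# Example 2: Using with numpy for similarity
+# With "normalize": True the vectors are unit-length, so the dot product is the cosine similarity
+embeddings_array = np.array(embeddings)
+similarity = np.dot(embeddings_array[0], embeddings_array[1])
+print(f"Cosine similarity: {similarity:.4f}")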
+# Example 3: Legal text with RoBERTalex
+legal_response = requests.post(
+    f"{API_URL}/embed",
+    json={
+        "texts": [
+            "El contrato será válido desde la fecha de firma",
+            "La validez contractual inicia en el momento de suscripción"
+        ],
+        "model": "robertalex",
+        "normalize": True
+    }
+)
+legal_response.raise_for_status()
+legal_embeddings = legal_response.json()["embeddings"]
+```
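The dot product in Example 2 equals cosine similarity only because the request sets `normalize` to true. If embeddings are requested with `"normalize": false`, the vectors have to be normalized on the client. A minimal sketch using only the standard library; the helper name `cosine_similarity` is illustrative and not part of the API:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity that does not assume unit-length inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical un-normalized vectors standing in for API output
parallel = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.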
+### cURL
+```bash
+# Basic embedding generation
+curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed" \
+     -H "Content-Type: application/json" \
+     -d '{
+       "texts": ["Texto de ejemplo", "Otro texto en español"],
+       "model": "jina",
+       "normalize": true
+     }'
+# With custom max length
+curl -X POST "https://aurasystems-spanish-embeddings-api.hf.space/embed" \
+     -H "Content-Type: application/json" \
+     -d '{
+       "texts": ["Documento muy largo..."],
+       "model": "jina",
+       "normalize": true,
+       "max_length": 2048
+     }'
+# Get model information
+curl "https://aurasystems-spanish-embeddings-api.hf.space/models"
+```
+### JavaScript/TypeScript
+```javascript
+const API_URL = 'https://aurasystems-spanish-embeddings-api.hf.space';
+// Basic function to get embeddings
+async function getEmbeddings(texts, model = 'jina') {
+    const response = await fetch(`${API_URL}/embed`, {
+        method: 'POST',
+        headers: {
+            'Content-Type': 'application/json',
+        },
+        body: JSON.stringify({
+            texts: texts,
+            model: model,
+            normalize: true
+        })
+    });
+    if (!response.ok) {
+        throw new Error(`Error: ${response.status}`);
+    }
+    return await response.json();
+}
+// Usage example (top-level await requires an ES module; otherwise wrap this in an async function)
+try {
+    const result = await getEmbeddings([
+        'Hola mundo',
+        'Programación en JavaScript'
+    ]);
+    console.log('Embeddings:', result.embeddings);
+    console.log('Dimensions:', result.dimensions);
+} catch (error) {
+    console.error('Error generating embeddings:', error);
+}
+```
+### Using with LangChain
+```python
+# Newer LangChain releases expose this class as langchain_core.embeddings.Embeddings
+from langchain.embeddings.base import Embeddings
+from typing import List
+import requests
+class SpanishEmbeddings(Embeddings):
+    """Custom LangChain embeddings class for Spanish text"""
+    def __init__(self, model: str = "jina"):
+        self.api_url = "https://aurasystems-spanish-embeddings-api.hf.space/embed"
+        self.model = model
+    def embed_documents(self, texts: List[str]) -> List[List[float]]:
+        # The API accepts at most 50 texts per request, so callers with more must batch
+        response = requests.post(
+            self.api_url,
+            json={
+                "texts": texts,
+                "model": self.model,
+                "normalize": True
+            }
+        )
+        response.raise_for_status()
+        return response.json()["embeddings"]
+    def embed_query(self, text: str) -> List[float]:
+        return self.embed_documents([text])[0]
+# Usage with LangChain
+embeddings = SpanishEmbeddings(model="jina")
+doc_embeddings = embeddings.embed_documents([
+    "Primer documento",
+    "Segundo documento"
+])
+query_embedding = embeddings.embed_query("consulta de búsqueda")
+```
+## 📋 Request/Response Formats
+### Request Body Schema
+```json
+{
+    "texts": [
+        "string"
+    ],
+    "model": "jina",
+    "normalize": true,
+    "max_length": null
+}
+```
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| texts | array[string] | Yes | - | List of texts to embed (1-50 texts) |
+| model | string | No | "jina" | Model to use: "jina" or "robertalex" |
+| normalize | boolean | No | true | Whether to L2-normalize embeddings |
+| max_length | integer/null | No | null | Maximum tokens per text (null = model default) |
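The constraints in the table above can be checked on the client before a request is sent. The following is a sketch only: `build_embed_payload` and the two constants are hypothetical names, and treating whitespace-only strings as empty is an assumption about the server's behavior.

```python
from typing import List, Optional

ALLOWED_MODELS = {"jina", "robertalex"}  # model names from the table above
MAX_TEXTS_PER_REQUEST = 50               # documented per-request limit


def build_embed_payload(
    texts: List[str],
    model: str = "jina",
    normalize: bool = True,
    max_length: Optional[int] = None,
) -> dict:
    """Build a JSON body for POST /embed, enforcing the documented constraints."""
    if not 1 <= len(texts) <= MAX_TEXTS_PER_REQUEST:
        raise ValueError(f"expected 1-{MAX_TEXTS_PER_REQUEST} texts, got {len(texts)}")
    # Assumption: whitespace-only strings are rejected like empty ones
    if any(not text.strip() for text in texts):
        raise ValueError("empty texts are not allowed")
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    if max_length is not None and max_length <= 0:
        raise ValueError("max_length must be positive")
    return {
        "texts": list(texts),
        "model": model,
        "normalize": normalize,
        "max_length": max_length,
    }


payload = build_embed_payload(["Hola, ¿cómo estás?"], model="robertalex", max_length=256)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`.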
+### Response Schema
+```json
+{
+    "embeddings": [[0.123, -0.456, ...]],
+    "model_used": "jina",
+    "dimensions": 768,
+    "num_texts": 1
+}
+```
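A client may want to confirm that a response is internally consistent before using it. The sketch below checks the `embeddings` array against the `num_texts` and `dimensions` fields of the schema; `parse_embed_response` is a hypothetical helper, and the sample body is invented and shortened to three dimensions for illustration.

```python
from typing import Any, Dict, List


def parse_embed_response(body: Dict[str, Any]) -> List[List[float]]:
    """Return the embeddings after checking them against the declared metadata."""
    embeddings = body["embeddings"]
    if len(embeddings) != body["num_texts"]:
        raise ValueError("num_texts does not match the number of embeddings")
    for vector in embeddings:
        if len(vector) != body["dimensions"]:
            raise ValueError("embedding length does not match dimensions")
    return embeddings


# Hypothetical response body, shortened to 3 dimensions for illustration
sample = {
    "embeddings": [[0.1, 0.2, 0.3]],
    "model_used": "jina",
    "dimensions": 3,
    "num_texts": 1,
}
vectors = parse_embed_response(sample)
```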
+## ⚡ Performance & Limits
+- **Maximum texts per request**: 50
+- **Maximum concurrent requests**: 4 (on free tier)
+- **Typical response time**: 100-200 ms for 10 texts
+- **Embedding dimensions**: 768 (both models)
+- **API availability**: 24/7 on Hugging Face Spaces
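Given the limits above, a client that sends batches in parallel should keep at most 4 requests in flight. One way to do that is a thread pool capped at that size. This is a sketch under assumptions: `embed_batch` is a hypothetical callable that would wrap a single `POST /embed` call, and a stand-in is used here so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

MAX_TEXTS_PER_REQUEST = 50   # documented per-request limit
MAX_CONCURRENT_REQUESTS = 4  # documented free-tier concurrency limit


def embed_concurrently(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed texts in batches, with at most MAX_CONCURRENT_REQUESTS batches in flight."""
    batches = [
        texts[i:i + MAX_TEXTS_PER_REQUEST]
        for i in range(0, len(texts), MAX_TEXTS_PER_REQUEST)
    ]
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_REQUESTS) as pool:
        # pool.map yields results in input order, so the output order matches texts
        results = list(pool.map(embed_batch, batches))
    return [vector for batch_result in results for vector in batch_result]


# Stand-in for a real POST /embed call: one fake 1-dimensional vector per text
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


vectors = embed_concurrently(["a" * n for n in range(1, 121)], fake_embed_batch)
```

Capping the pool size bounds concurrency without any explicit locking, and `pool.map` keeps the embeddings aligned with the input texts.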
+## 🔧 Advanced Usage
+### Batch Processing
+For processing large datasets, implement batching:
+```python
+import requests
+def process_large_dataset(texts, batch_size=50):
+    """Process a large text dataset in batches of at most 50 texts (the API limit)"""
+    embeddings = []
+    for i in range(0, len(texts), batch_size):
+        batch = texts[i:i + batch_size]
+        response = requests.post(
+            "https://aurasystems-spanish-embeddings-api.hf.space/embed",
+            json={
+                "texts": batch,
+                "model": "jina",
+                "normalize": True
+            }
+        )
+        response.raise_for_status()  # stop on the first failed batch
+        embeddings.extend(response.json()["embeddings"])
+    return embeddings
+```
+### Semantic Search Example
+```python
+import numpy as np
+import requests
+from typing import List, Tuple
+def semantic_search(
+    query: str,
+    documents: List[str],
+    top_k: int = 5
+) -> List[Tuple[int, float]]:
+    """Find the documents most similar to the query (query plus documents must total at most 50 texts)"""
+    # Get embeddings for query and documents
+    response = requests.post(
+        "https://aurasystems-spanish-embeddings-api.hf.space/embed",
+        json={
+            "texts": [query] + documents,
+            "model": "jina",
+            "normalize": True
+        }
+    )
+    response.raise_for_status()
+    embeddings = np.array(response.json()["embeddings"])
+    query_embedding = embeddings[0]
+    doc_embeddings = embeddings[1:]
+    # Calculate similarities (dot product equals cosine similarity because the vectors are normalized)
+    similarities = np.dot(doc_embeddings, query_embedding)
+    # Get top-k results
+    top_indices = np.argsort(similarities)[::-1][:top_k]
+    return [(int(idx), float(similarities[idx])) for idx in top_indices]
+# Example usage
+documents = [
+    "Python es un lenguaje de programación",
+    "Madrid es la capital de España",
+    "El machine learning está revolucionando la tecnología",
+    "La paella es un plato típico español"
+]
+results = semantic_search(
+    "inteligencia artificial y programación",
+    documents,
+    top_k=2
+)
+for idx, score in results:
+    print(f"Document: {documents[idx]}")
+    print(f"Similarity: {score:.4f}\n")
+```
+## 🚨 Error Handling
+The API returns standard HTTP status codes:
+| Status Code | Description |
+|-------------|-------------|
+| 200 | Success |
+| 400 | Bad Request (invalid parameters) |
+| 422 | Validation Error (check request format) |
+| 429 | Too Many Requests (rate limit exceeded) |
+| 500 | Internal Server Error |
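Of the codes above, 429 and 500 can be transient, while 400 and 422 indicate a request that will keep failing. A client could retry only the former with exponential backoff. The sketch below is one possible approach, not the API's prescribed behavior: `call_with_retry` is a hypothetical helper, `send` is an assumed callable returning a `(status_code, parsed_body)` pair, and the delays are arbitrary.

```python
import time
from typing import Any, Callable, Tuple

RETRYABLE_STATUS = {429, 500}  # assumption: only these are worth retrying


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() until it returns 200, retrying 429/500 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE_STATUS or attempt == max_attempts - 1:
            raise RuntimeError(f"request failed with status {status}: {body}")
        sleep(base_delay * 2 ** attempt)  # delays grow as 1x, 2x, 4x the base delay
    raise RuntimeError("max_attempts must be at least 1")


# Simulated server: one rate-limit response, then success
responses = iter([(429, {"detail": "Too Many Requests"}), (200, {"ok": True})])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```

With `requests`, `send` could be a small function that posts the payload and returns `(response.status_code, response.json())`. Passing `sleep` as a parameter keeps the helper testable without real waiting.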
+### Error Response Format
+```json
+{
+    "detail": "Error message description"
+}
+```
+### Common Errors and Solutions
+1. **Invalid max_length**
+   ```json
+   {
+     "detail": "Value error, Max length must be positive"
+   }
+   ```
+   **Solution**: Use a positive integer or omit max_length
+2. **Too many texts**
+   ```json
+   {
+     "detail": "Maximum 50 texts per request"
+   }
+   ```
+   **Solution**: Batch your requests
+3. **Empty texts**
+   ```json
+   {
+     "detail": "Empty texts are not allowed"
+   }
+   ```
+   **Solution**: Filter out empty strings before sending
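When empty strings are filtered out, the returned embeddings no longer line up with the original list. The sketch below keeps track of the original positions so results can be mapped back. Both helper names are hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
from typing import List, Optional, Tuple


def drop_empty_texts(texts: List[str]) -> Tuple[List[str], List[int]]:
    """Return the non-empty texts plus their original positions."""
    kept = [(i, text) for i, text in enumerate(texts) if text.strip()]
    return [text for _, text in kept], [i for i, _ in kept]


def restore_positions(
    embeddings: List[List[float]],
    positions: List[int],
    total: int,
) -> List[Optional[List[float]]]:
    """Place each embedding back at its original index; skipped texts map to None."""
    restored: List[Optional[List[float]]] = [None] * total
    for position, vector in zip(positions, embeddings):
        restored[position] = vector
    return restored


texts = ["Hola", "", "   ", "Adiós"]
clean, positions = drop_empty_texts(texts)
# Fake 1-dimensional embeddings standing in for the API response to `clean`
aligned = restore_positions([[1.0], [2.0]], positions, len(texts))
```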
+## 🔒 Authentication
+This API is currently **open and does not require authentication**. It's hosted on Hugging Face Spaces and is free to use within the rate limits.
+## 📊 Monitoring
+Check API status and health:
+```python
+import requests
+# Health check
+health = requests.get("https://aurasystems-spanish-embeddings-api.hf.space/health")
+print(health.json())
+# Output: {'status': 'healthy', 'models_loaded': True, 'available_models': ['jina', 'robertalex']}
+```
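A client that starts before the models have finished loading could poll the health endpoint until it reports ready. The sketch below relies on the `status` and `models_loaded` fields from the example output above. `wait_until_healthy` is a hypothetical helper, `fetch_health` is an assumed callable returning the parsed `/health` body, and the attempt count and delay are arbitrary.

```python
import time
from typing import Any, Callable, Dict


def wait_until_healthy(
    fetch_health: Callable[[], Dict[str, Any]],
    max_attempts: int = 10,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the health payload reports a healthy status with models loaded."""
    for attempt in range(max_attempts):
        try:
            health = fetch_health()
        except Exception:
            health = {}  # treat connection or parsing errors as "not ready yet"
        if health.get("status") == "healthy" and health.get("models_loaded"):
            return True
        if attempt < max_attempts - 1:
            sleep(delay)
    return False


# Simulated health responses: not ready on the first poll, ready on the second
states = iter([
    {"status": "starting", "models_loaded": False},
    {"status": "healthy", "models_loaded": True, "available_models": ["jina", "robertalex"]},
])
waits = []
ready = wait_until_healthy(lambda: next(states), sleep=waits.append)
```

With `requests`, `fetch_health` could be `lambda: requests.get(f"{API_URL}/health", timeout=10).json()`.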
+## 🤝 Support
+- **Issues**: Open a discussion in the [Hugging Face Space discussions](https://huggingface.co/spaces/AuraSystems/spanish-embeddings-api/discussions)
+- **Documentation**: Visit the [interactive API docs](https://aurasystems-spanish-embeddings-api.hf.space/docs)
+- **Model Information**:
+  - [Jina Embeddings v2 Spanish](https://huggingface.co/jinaai/jina-embeddings-v2-base-es)
+  - [RoBERTalex](https://huggingface.co/PlanTL-GOB-ES/RoBERTalex)
+## 📄 License
+This API is provided as-is for research and commercial use. The underlying models have their own licenses:
+- Jina models: Apache 2.0
+- RoBERTalex: Apache 2.0
+---
+Built with ❤️ using FastAPI and Hugging Face Transformers