metadata
title: MinerU PDF Extractor
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false
MinerU PDF Extractor API
This Hugging Face Space provides a FastAPI-based service that uses magic-pdf to extract structured content from PDFs. The service exposes REST endpoints to process PDF files and return extracted text and tables in a structured JSON format.
API Endpoints
Health Check
GET /health
Returns the service status and timestamp.
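As a minimal sketch, the health check could be called from Python using only the standard library. It assumes the endpoint answers with a JSON object; the field names for status and timestamp are not documented here, so the code returns the decoded body unchanged. The helper names `health_url` and `check_health` are illustrative and not part of the service.

```python
import json
import urllib.request

BASE_URL = "https://marcosremar2-docker-mineru.hf.space"


def health_url(base_url=BASE_URL):
    """Build the health-check URL from the Space's base URL."""
    return base_url.rstrip("/") + "/health"


def check_health(base_url=BASE_URL, timeout=10.0):
    """Call GET /health and return the decoded body.

    Assumption: the response body is JSON. The exact keys are not
    documented, so nothing is read from the result here.
    """
    with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `check_health()` performs the network request. A sleeping Space may take a while to wake up, so a longer `timeout` can be passed if needed.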
Extract PDF Content
POST /extract
Upload a PDF file to extract its text content and tables.
Request
- Content-Type: multipart/form-data
- Body: PDF file in the 'file' field
Response
JSON object containing:
- Filename
- Pages with extracted text
- Tables in Markdown format
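The sketch below shows one way a client might summarize such a response. The `result`, `filename`, and `pages` keys follow the Python usage example in this README. The `tables` key and the sample payload are assumptions made for illustration, so the real response should be checked before relying on them.

```python
def summarize_result(data):
    """Return the filename plus page and table counts from an /extract response.

    Assumption: tables are returned as a list under result["tables"].
    If that key is absent, the table count falls back to 0.
    """
    result = data["result"]
    pages = result.get("pages", [])
    tables = result.get("tables", [])  # hypothetical key, not confirmed by the README
    return {
        "filename": result["filename"],
        "page_count": len(pages),
        "table_count": len(tables),
    }


# Hypothetical payload, shaped after the fields listed above.
sample = {
    "result": {
        "filename": "your_document.pdf",
        "pages": [{"text": "First page"}, {"text": "Second page"}],
        "tables": ["| a | b |\n|---|---|\n| 1 | 2 |"],
    }
}

print(summarize_result(sample))
```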
Usage Examples
Using cURL
curl -X POST "https://marcosremar2-docker-mineru.hf.space/extract" \
  -F "file=@your_document.pdf" \
  --output result.json
Using Python
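import requests

url = "https://marcosremar2-docker-mineru.hf.space/extract"

# Open the PDF in a context manager so the file handle is closed after the upload
with open("your_document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf})

response.raise_for_status()  # Stop early on HTTP errors instead of parsing an error body
data = response.json()

# Process the extracted data
print(f"Filename: {data['result']['filename']}")
print(f"Number of pages: {len(data['result']['pages'])}")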
API Documentation
Once deployed, you can access the auto-generated Swagger documentation at:
https://marcosremar2-docker-mineru.hf.space/docs
For ReDoc documentation:
https://marcosremar2-docker-mineru.hf.space/redoc
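FastAPI also serves the underlying OpenAPI schema as JSON, by default at /openapi.json. Assuming this Space keeps that default path, the following sketch lists the documented endpoints from the schema. The function names and the sample schema are illustrative only.

```python
import json
import urllib.request

BASE_URL = "https://marcosremar2-docker-mineru.hf.space"
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_schema(base_url=BASE_URL, timeout=10.0):
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json path)."""
    url = base_url.rstrip("/") + "/openapi.json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def list_endpoints(schema):
    """Return 'METHOD /path' strings for every operation in the schema."""
    endpoints = []
    for path, operations in sorted(schema.get("paths", {}).items()):
        for method in sorted(operations):
            # Path items may also hold non-method keys such as "parameters".
            if method in HTTP_METHODS:
                endpoints.append(f"{method.upper()} {path}")
    return endpoints


# Hypothetical schema fragment covering the two endpoints described in this README.
sample_schema = {"paths": {"/health": {"get": {}}, "/extract": {"post": {}}}}
print(list_endpoints(sample_schema))
```

Passing the result of `fetch_schema()` to `list_endpoints` runs the same listing against the live service.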