docker_mineru / README.md

marcosremar2

Enhance FastAPI implementation with better documentation, error handling and examples

f30c298 6 months ago

metadata

title: MinerU PDF Extractor
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false

MinerU PDF Extractor API

This Hugging Face Space provides a FastAPI-based service that uses magic-pdf to extract structured content from PDFs. The service exposes REST endpoints that accept PDF files and return the extracted text and tables as structured JSON.

API Endpoints

Health Check

GET /health

Returns the service status and a timestamp.
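A minimal sketch of calling the health endpoint from Python with only the standard library. The helper name `fetch_health` and its `opener` parameter are illustrative and not part of the service. The README does not specify the field names of the health response, so the helper simply returns the decoded JSON as-is.

```python
import json
import urllib.request

BASE_URL = "https://marcosremar2-docker-mineru.hf.space"


def fetch_health(base_url=BASE_URL, opener=urllib.request.urlopen, timeout=10):
    """GET /health and return the decoded JSON body.

    `opener` is a hypothetical injection point so the function can be
    exercised without network access; it defaults to urllib's urlopen.
    """
    url = f"{base_url.rstrip('/')}/health"
    with opener(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_health()` with no arguments sends the request to this Space. Passing a different `base_url` points the same check at another deployment.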

Extract PDF Content

POST /extract

Upload a PDF file to extract its text content and tables.

Request

Content-Type: multipart/form-data
Body: PDF file in the 'file' field

Response

JSON object whose result field contains:

Filename (filename)
Pages with extracted text (pages)
Tables in Markdown format
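The response can be reduced to a short summary once it is parsed. The sketch below relies only on the `result`, `filename`, and `pages` keys that appear in the Python usage example of this README. The per-page layout is not documented here, so pages are treated as opaque items, and the sample payload is hypothetical.

```python
import json


def summarize_extraction(data):
    """Return the filename and page count from a parsed /extract response."""
    result = data["result"]
    # Fall back to an empty list if the service returned no pages
    pages = result.get("pages", [])
    return {"filename": result.get("filename"), "page_count": len(pages)}


# Hypothetical payload for illustration; the page contents are placeholders
sample = {"result": {"filename": "your_document.pdf", "pages": [{}, {}]}}
print(json.dumps(summarize_extraction(sample)))
# prints {"filename": "your_document.pdf", "page_count": 2}
```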

Usage Examples

Using cURL

curl -X POST "https://marcosremar2-docker-mineru.hf.space/extract" \
  -F "file=@your_document.pdf" \
  --output result.json

Using Python

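import requests

url = "https://marcosremar2-docker-mineru.hf.space/extract"

# Open the PDF in a context manager so the file handle is closed after the upload
with open("your_document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf})

# Raise on HTTP errors instead of parsing an error body as a result
response.raise_for_status()
data = response.json()

# Process the extracted data
print(f"Filename: {data['result']['filename']}")
print(f"Number of pages: {len(data['result']['pages'])}")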

API Documentation

Once deployed, you can access the auto-generated Swagger documentation at:

https://marcosremar2-docker-mineru.hf.space/docs

For ReDoc documentation:

https://marcosremar2-docker-mineru.hf.space/redoc
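The same documentation is also available in machine-readable form. FastAPI serves an OpenAPI schema at `/openapi.json` by default, and the sketch below assumes this Space keeps that default. The helper `list_endpoints` is illustrative: it takes an already parsed schema and lists the method and path pairs it declares. The sample schema is a hand-written stand-in, not the service's actual schema.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema):
    """Return sorted (METHOD, path) pairs declared in an OpenAPI schema dict."""
    endpoints = []
    for path, item in schema.get("paths", {}).items():
        for method in item:
            # Path items can also hold non-method keys such as "parameters"
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints)


# Hand-written stand-in for a schema that declares the two endpoints above
sample_schema = {"paths": {"/health": {"get": {}}, "/extract": {"post": {}}}}
print(list_endpoints(sample_schema))
# prints [('GET', '/health'), ('POST', '/extract')]
```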