metadata
title: MinerU PDF Extractor
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: docker
sdk_version: latest
app_file: app.py
pinned: false

MinerU PDF Extractor API

This Hugging Face Space provides a FastAPI-based service that uses magic-pdf to extract structured content from PDFs. The service exposes REST endpoints to process PDF files and return extracted text and tables in a structured JSON format.

API Endpoints

Health Check

GET /health

Returns the service status and timestamp.
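
A client may want to confirm the service is reachable before uploading anything, for example while the Space is still starting. The following is a minimal sketch using only the standard library. It assumes that /health responds with a JSON body; the helper names check_health and wait_until_healthy, the retry count, and the delay are illustrative choices, not part of the service.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "https://marcosremar2-docker-mineru.hf.space"


def check_health(base_url=BASE_URL, timeout=10.0):
    """Return the decoded /health payload, or None if the call fails.

    Assumption: the endpoint answers with a JSON document.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers a body that is not valid JSON.
        return None


def wait_until_healthy(base_url=BASE_URL, attempts=5, delay=2.0):
    """Poll /health up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        payload = check_health(base_url)
        if payload is not None:
            return payload
        if attempt < attempts - 1:
            time.sleep(delay)
    return None
```

Calling wait_until_healthy() returns the health payload as soon as one request succeeds, or None if every attempt fails.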

Extract PDF Content

POST /extract

Upload a PDF file to extract its text content and tables.

Request

  • Content-Type: multipart/form-data
  • Body: PDF file in the 'file' field
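
Since the endpoint expects a PDF in the 'file' field, a quick local check can catch obvious mistakes before any bytes are sent. The sketch below tests for the "%PDF-" marker that PDF files normally start with. The helper name looks_like_pdf is hypothetical, and the check is only a heuristic: it says nothing about whether the service will accept the file.

```python
from pathlib import Path

# Header bytes that PDF files normally begin with.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if `path` is a regular file whose first bytes are the PDF header."""
    candidate = Path(path)
    if not candidate.is_file():
        return False
    with candidate.open("rb") as handle:
        return handle.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A caller could skip the upload and report a local error when looks_like_pdf("your_document.pdf") is False.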

Response

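JSON object containing the following, which the Python example under Usage Examples reads from a top-level result key:

  • Filename (filename)
  • Pages with extracted text (pages)
  • Tables in Markdown format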
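
As a rough illustration of consuming this response, the sketch below reduces a decoded payload to a few counts. The result, filename, and pages keys match the Python usage example in this README; the tables key and its placement next to pages are assumptions, and the function name summarize_result is hypothetical.

```python
def summarize_result(data):
    """Summarize a decoded /extract response as a small dict.

    Looks for the fields under "result" and falls back to the top level
    if that wrapper is absent.
    """
    result = data.get("result", data)
    pages = result.get("pages") or []
    tables = result.get("tables") or []  # assumed key name, not confirmed by the service
    return {
        "filename": result.get("filename"),
        "page_count": len(pages),
        "table_count": len(tables),
    }
```

Using .get() with defaults keeps the summary from raising KeyError if a field is missing from a particular response.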

Usage Examples

Using cURL

curl -X POST "https://marcosremar2-docker-mineru.hf.space/extract" \
  -F "file=@your_document.pdf" \
  --output result.json

Using Python

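import requests

url = "https://marcosremar2-docker-mineru.hf.space/extract"

# Open the PDF in binary mode; the context manager closes it after the upload
with open("your_document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf})

# Stop here with an exception if the service returned an HTTP error status
response.raise_for_status()
data = response.json()

# Process the extracted data
print(f"Filename: {data['result']['filename']}")
print(f"Number of pages: {len(data['result']['pages'])}")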
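
For scripted use, it can help to wrap the upload in a function that also saves the response to disk, much like --output in the cURL example. The sketch below is one way to do that. The function name extract_to_file, the 300-second timeout, and the optional session parameter are illustrative assumptions; requests is imported only when no session is supplied.

```python
import json
from pathlib import Path

EXTRACT_URL = "https://marcosremar2-docker-mineru.hf.space/extract"


def extract_to_file(pdf_path, out_path, session=None, timeout=300):
    """Upload `pdf_path` to /extract and write the JSON response to `out_path`.

    `session` may be any object with a requests-style post() method.
    The timeout value is a guess, not a documented limit of the service.
    """
    if session is None:
        import requests  # imported lazily so callers can supply their own session

        session = requests.Session()

    pdf_path = Path(pdf_path)
    with pdf_path.open("rb") as pdf:
        # (filename, file object, content type) is the tuple form requests accepts
        files = {"file": (pdf_path.name, pdf, "application/pdf")}
        response = session.post(EXTRACT_URL, files=files, timeout=timeout)

    response.raise_for_status()
    data = response.json()
    Path(out_path).write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data
```

For example, extract_to_file("your_document.pdf", "result.json") uploads the file and returns the decoded response after writing it to result.json.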

API Documentation

Once deployed, you can access the auto-generated Swagger documentation at:

https://marcosremar2-docker-mineru.hf.space/docs

For ReDoc documentation:

https://marcosremar2-docker-mineru.hf.space/redoc
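
FastAPI applications also publish the machine-readable schema behind these pages, by default at /openapi.json. Assuming this Space keeps that default path (not confirmed here), the sketch below downloads the schema and lists the available operations. The function names fetch_openapi_schema and list_operations are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://marcosremar2-docker-mineru.hf.space"
HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def fetch_openapi_schema(base_url=BASE_URL, timeout=10.0):
    """Download and decode the OpenAPI schema (assumes the default FastAPI path)."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def list_operations(schema):
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    operations = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method in sorted(item):
            # Path items may also hold non-method keys such as "parameters"
            if method in HTTP_METHODS:
                operations.append((method.upper(), path))
    return operations
```

Passing the result of fetch_openapi_schema() to list_operations() gives a quick overview of the endpoints without opening the browser-based documentation.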