metadata
title: Pdf2markdown (Flask)
emoji: 👁️
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
app_port: 7860

PDF to Markdown Converter (Flask Version)

This application converts PDF files (either uploaded or from a URL) into Markdown format. It extracts text, attempts to format it, identifies tables, and extracts images.

Extracted images are uploaded to a Hugging Face Dataset repository named "pdf-images-extracted" (the name is configurable). Important: image uploading only works if you add an HF_TOKEN secret to your Hugging Face Space, and that token must have write access to datasets.
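As a rough illustration of the token requirement, the sketch below shows one way the upload settings could be resolved and used. It is not the application's actual code: the function names, the `namespace` parameter, the `images/` folder, and the `HF_IMAGE_REPO_NAME` variable are all hypothetical. Only `HF_TOKEN` and the default repository name come from this README. The upload itself relies on `huggingface_hub`, imported lazily so the configuration check works without it.

```python
import os

# Default dataset name, as stated in this README.
DEFAULT_REPO_NAME = "pdf-images-extracted"


def resolve_upload_config(namespace, env=None):
    """Return upload settings, or None when HF_TOKEN is not set.

    `namespace` is the user or organization that owns the dataset
    (hypothetical parameter for this sketch).
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # Without a token, image uploading has to be skipped.
        return None
    # HF_IMAGE_REPO_NAME is a hypothetical override, not a documented setting.
    repo_name = env.get("HF_IMAGE_REPO_NAME", DEFAULT_REPO_NAME)
    return {"token": token, "repo_id": f"{namespace}/{repo_name}"}


def upload_image(local_path, config):
    """Upload one extracted image to the dataset repository."""
    from huggingface_hub import HfApi  # imported lazily; needs network access

    api = HfApi(token=config["token"])
    # Create the dataset repository if it does not exist yet.
    api.create_repo(config["repo_id"], repo_type="dataset", exist_ok=True)
    return api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"images/{os.path.basename(local_path)}",
        repo_id=config["repo_id"],
        repo_type="dataset",
    )
```

If `resolve_upload_config` returns `None`, a caller would presumably skip the upload step and still produce the Markdown text.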

Features

  • Upload PDF files directly.
  • Process PDFs from a publicly accessible URL.
  • Extracts plain text and attempts to preserve some layout.
  • Detects and formats tables into Markdown.
  • Extracts images from the PDF.
  • Runs OCR on extracted images so that text inside images is included in the output.
  • Uploads extracted images to a Hugging Face Dataset.
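To make the table feature more concrete, here is a minimal sketch of how a detected table, represented as a list of rows, might be rendered as a Markdown table. The function name `rows_to_markdown` and the row-list input format are assumptions for illustration only. The README does not say how the application detects or formats tables internally.

```python
def rows_to_markdown(rows):
    """Render a list of rows (lists of cells) as a Markdown table.

    The first row is treated as the header. Short rows are padded,
    and None cells become empty strings.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        text = "" if cell is None else str(cell)
        # Newlines and pipes would break the Markdown table syntax.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    padded = [
        [clean(cell) for cell in row] + [""] * (width - len(row))
        for row in rows
    ]
    header, body = padded[0], padded[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(rows_to_markdown([["Name", "Pages"], ["report.pdf", 12]]))
```

Treating the first row as the header is a simplification. Tables extracted from real PDFs often have no header row or have merged cells, which a sketch like this does not handle.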

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference