---
title: Pdf2markdown (Flask)
emoji: 👁️
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
# For Docker Spaces, app_port in README.md informs Hugging Face which internal port your app listens on.
# This should match the port Gunicorn (or your app server) binds to.
app_port: 7860
---

## PDF to Markdown Converter (Flask Version)

This application converts PDF files (either uploaded or from a URL) into Markdown format.
It extracts text, attempts to format it, identifies tables, and extracts images.
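The URL input path can be sketched with the standard library alone. This is a minimal illustration, not the application's actual code: the helper names `is_http_url`, `looks_like_pdf`, and `fetch_pdf` are hypothetical, and the check for the `%PDF-` header is an assumed safeguard that the README does not describe.

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

# Every PDF file starts with this marker (usually at byte 0).
PDF_MAGIC = b"%PDF-"


def is_http_url(url: str) -> bool:
    """Accept only absolute http(s) URLs, rejecting file:// and bare paths."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def looks_like_pdf(data: bytes) -> bool:
    """Check for the PDF header within the first 1024 bytes."""
    return PDF_MAGIC in data[:1024]


def fetch_pdf(url: str, timeout: float = 30.0) -> bytes:
    """Download a PDF from a public URL and return its raw bytes.

    Hypothetical helper: the timeout and User-Agent values are assumptions.
    """
    if not is_http_url(url):
        raise ValueError(f"Not an http(s) URL: {url!r}")
    request = Request(url, headers={"User-Agent": "pdf2markdown-sketch"})
    with urlopen(request, timeout=timeout) as response:
        data = response.read()
    if not looks_like_pdf(data):
        raise ValueError("Downloaded content does not look like a PDF")
    return data
```

The same `looks_like_pdf` check could be applied to uploaded files before conversion, so both input paths reject non-PDF content early.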

Extracted images are uploaded to a Hugging Face Dataset repository named "pdf-images-extracted" (the repository name is configurable).
**Important:** For image uploading to work, you **must** add an `HF_TOKEN` secret with write access to datasets in your Hugging Face Space settings.
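As a rough sketch of how the token might be consumed, the snippet below reads `HF_TOKEN` from the environment (Space secrets are exposed to the container as environment variables) and uploads one image with the `huggingface_hub` client. The function names `get_hf_token` and `upload_image` are hypothetical, and `repo_id` is assumed to be a fully qualified name such as `<your-username>/pdf-images-extracted`; the application's real upload code may differ.

```python
import os


def get_hf_token() -> str:
    """Return the HF_TOKEN secret, or fail with a clear message if it is missing."""
    token = os.environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it as a secret in the Space settings."
        )
    return token


def upload_image(local_path: str, repo_id: str, path_in_repo: str) -> str:
    """Upload one image file to a dataset repository and return its URL.

    Hypothetical helper: repo_id and path_in_repo are supplied by the caller.
    """
    # Imported lazily so the token check above works without the dependency.
    from huggingface_hub import HfApi

    api = HfApi(token=get_hf_token())
    # Create the dataset repository on first use; no-op if it already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
    return f"https://huggingface.co/datasets/{repo_id}/resolve/main/{path_in_repo}"
```

Failing fast in `get_hf_token` makes a missing secret visible at the first upload attempt instead of surfacing later as an authentication error from the Hub.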

### Features
-   Upload PDF files directly.
-   Process PDFs from a publicly accessible URL.
-   Extracts plain text and attempts to preserve some layout.
-   Detects and formats tables into Markdown.
-   Extracts images from the PDF.
-   Runs OCR on extracted images so that text embedded in images is included in the output.
-   Uploads extracted images to a Hugging Face Dataset.
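
The table-formatting step can be illustrated with a small, dependency-free sketch. It assumes a detected table arrives as a list of rows, each a list of cell values with `None` for empty cells, and that the first row is the header; the function name `table_to_markdown` is hypothetical and the README does not specify how the application represents tables internally.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells become "", pipes are escaped, newlines are flattened.
        text = "" if cell is None else str(cell)
        return text.replace("|", "\\|").replace("\n", " ").strip()

    # Pad short rows so every row has the same number of columns.
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


print(table_to_markdown([["Name", "Qty"], ["Bolt", 4], ["Nut", None]]))
# | Name | Qty |
# | --- | --- |
# | Bolt | 4 |
# | Nut |  |
```

Escaping `|` inside cells matters because an unescaped pipe would be read as a column separator and shift the rest of the row.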

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference