---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: mlx
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- mlx
---

# zboyles/SmolDocling-256M-preview-bf16
This model was converted to **MLX format** from [`ds4sd/SmolDocling-256M-preview`](https://huggingface.co/ds4sd/SmolDocling-256M-preview) using mlx-vlm version **0.1.18**.
* Refer to the [**original model card**](https://huggingface.co/ds4sd/SmolDocling-256M-preview) for more details on the model.
* Refer to the [**mlx-vlm repo**](https://github.com/Blaizzy/mlx-vlm) for more examples using `mlx-vlm`.
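
As a quick smoke test before the full pipeline below, you can load the converted weights and inspect the chat-formatted prompt the model expects. This is a minimal sketch that reuses only calls from the full example further down (the repo name is this model; `apply_chat_template` also accepts a plain string prompt):

```python
# Minimal sketch: load the MLX weights and build the chat-formatted prompt.
from mlx_vlm import load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

repo = "zboyles/SmolDocling-256M-preview-bf16"
model, processor = load(repo, trust_remote_code=True)
config = load_config(repo)

prompt = apply_chat_template(
    processor, config, "Convert this page to docling.", add_generation_prompt=True
)
print(prompt)  # the templated prompt, including the image placeholder token
```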


## Use SmolDocling-256M-preview with Docling and MLX

> **Find Working MLX + Docling Example Code Below**


<div style="display: flex; align-items: center;">
    <img src="https://huggingface.co/ds4sd/SmolDocling-256M-preview/resolve/main/assets/SmolDocling_doctags1.png" alt="SmolDocling" style="width: 200px; height: auto; margin-right: 20px;">
    <div>
        <h3>SmolDocling-256M-preview</h3>
        <p>SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for <strong>DoclingDocuments</strong>.</p>
    </div>
</div>

This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).

### πŸš€ Features:  
- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments** (see the illustrative sketch after this list).  
- πŸ” **OCR (Optical Character Recognition)** – Extracts text accurately from images.  
- πŸ“ **Layout and Localization** – Preserves document structure and document element **bounding boxes**.  
- πŸ’» **Code Recognition** – Detects and formats code blocks including identation.  
- πŸ”’ **Formula Recognition** – Identifies and processes mathematical expressions.  
- πŸ“Š **Chart Recognition** – Extracts and interprets chart data.  
- πŸ“‘ **Table Recognition** – Supports column and row headers for structured table extraction.  
- πŸ–ΌοΈ **Figure Classification** – Differentiates figures and graphical elements.  
- πŸ“ **Caption Correspondence** – Links captions to relevant images and figures.  
- πŸ“œ **List Grouping** – Organizes and structures list elements correctly.  
- πŸ“„ **Full-Page Conversion** – Processes entire pages for comprehensive document conversion including all page elements (code, equations, tables, charts etc.) 
- πŸ”² **OCR with Bounding Boxes** – OCR regions using a bounding box.
- πŸ“‚ **General Document Processing** – Trained for both scientific and non-scientific documents.  
- πŸ”„ **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- πŸ’¨ **Fast inference using VLLM** – Avg of 0.35 secs per page on A100 GPU.

### 🚧 *Coming soon!*
- πŸ“Š **Better chart recognition πŸ› οΈ**
- πŸ“š **One shot multi-page inference ⏱️**
- πŸ§ͺ **Chemical Recognition**
- πŸ“™ **Datasets**

## ⌨️ Get started (**MLX** code examples)

You can use **mlx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (Markdown, HTML, etc.):

<details>
<summary>πŸ“„ Single page image inference using MLX via `mlx-vlm` πŸ€–</summary>

```python
# Prerequisites:
# pip install -U mlx-vlm
# pip install docling_core

import sys

from pathlib import Path
from PIL import Image

from mlx_vlm import load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config, load_image, stream_generate

from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Variables
path_or_hf_repo = "zboyles/SmolDocling-256M-preview-bf16"
output_path = Path("output")
output_path.mkdir(exist_ok=True)

# Model Params
eos = "<end_of_utterance>"
verbose = True
kwargs = {
    "max_tokens": 8000,
    "temperature": 0.0,
}

# Load images
# Note: I manually downloaded the image
# image_src = "https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg"
# image = load_image(image_src)
image_src = "images/GazettedeFrance.jpg"
image = Image.open(image_src).convert("RGB")

# Initialize processor and model
model, processor = load(
    path_or_hf_repo=path_or_hf_repo,
    trust_remote_code=True,
)
config = load_config(path_or_hf_repo)


# Create input messages - Docling Walkthrough Structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

# Alternatively, a simpler supported prompt-creation method:
# messages = [{"role": "user", "content": "Convert this page to docling."}]
# prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)


text = ""
last_response = None

for response in stream_generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    **kwargs
):
    if verbose:
        print(response.text, end="", flush=True)
    text += response.text
    last_response = response
    if eos in text:
        text = text.split(eos)[0].strip()
        break
print()

if verbose:
    print("\n" + "=" * 10)
    if len(text) == 0:
        print("No text generated for this prompt")
        sys.exit(0)
    print(
        f"Prompt: {last_response.prompt_tokens} tokens, "
        f"{last_response.prompt_tps:.3f} tokens-per-sec"
    )
    print(
        f"Generation: {last_response.generation_tokens} tokens, "
        f"{last_response.generation_tps:.3f} tokens-per-sec"
    )
    print(f"Peak memory: {last_response.peak_memory:.3f} GB")

# To convert to Docling Document, MD, HTML, etc.:
docling_output_path = output_path / Path(image_src).with_suffix(".dt").name
docling_output_path.write_text(text)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([text], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
doc.save_as_html(docling_output_path.with_suffix(".html"))
# MD
doc.save_as_markdown(docling_output_path.with_suffix(".md"))
```
</details>
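
The example also writes the raw DocTags to `output/*.dt`, so you can rebuild the document later without re-running inference. A small sketch, assuming the file and image paths produced by the example above:

```python
# Rebuild a DoclingDocument from previously saved DocTags.
from pathlib import Path

from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

doctags = Path("output/GazettedeFrance.dt").read_text()
image = Image.open("images/GazettedeFrance.jpg").convert("RGB")

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
print(doc.export_to_markdown())  # or save_as_html / save_as_markdown as above
```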

Thanks to [**@Blaizzy**](https://github.com/Blaizzy) for the [code examples](https://github.com/Blaizzy/mlx-vlm/tree/main/examples) that helped me quickly adapt the `docling` example.