zboyles/SmolDocling-256M-preview-bf16
This model was converted to MLX format from ds4sd/SmolDocling-256M-preview
using mlx-vlm version 0.1.18.
- Refer to the original model card for more details on the model.
- Refer to the mlx-vlm repo for more examples using mlx-vlm.
Use SmolDocling-256M-preview with docling and mlx
Find Working MLX + Docling Example Code Below

SmolDocling-256M-preview
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
This model was presented in the paper SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion.
Features:
- DocTags for Efficient Tokenization: Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- OCR (Optical Character Recognition): Extracts text accurately from images.
- Layout and Localization: Preserves document structure and document element bounding boxes.
- Code Recognition: Detects and formats code blocks, including indentation.
- Formula Recognition: Identifies and processes mathematical expressions.
- Chart Recognition: Extracts and interprets chart data.
- Table Recognition: Supports column and row headers for structured table extraction.
- Figure Classification: Differentiates figures and graphical elements.
- Caption Correspondence: Links captions to relevant images and figures.
- List Grouping: Organizes and structures list elements correctly.
- Full-Page Conversion: Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- OCR with Bounding Boxes: Runs OCR on a region specified by a bounding box.
- General Document Processing: Trained for both scientific and non-scientific documents.
- Seamless Docling Integration: Import into Docling and export in multiple formats.
- Fast inference using vLLM: Average of 0.35 seconds per page on an A100 GPU.
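As a rough sense of scale, the 0.35 seconds per page quoted above can be turned into a throughput estimate. The sketch below is plain arithmetic under the assumption that the average holds steadily over a whole batch. The figure is for vLLM on an A100 and says nothing about MLX speed on Apple silicon, and the helper names are illustrative only.

```python
# Average latency quoted in the feature list (vLLM on an A100 GPU).
SECONDS_PER_PAGE = 0.35


def pages_per_hour(seconds_per_page: float) -> float:
    """Steady-state throughput, assuming the average latency holds."""
    return 3600.0 / seconds_per_page


def seconds_for(pages: int, seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimated wall-clock time to convert `pages` pages one after another."""
    return pages * seconds_per_page


print(f"{pages_per_hour(SECONDS_PER_PAGE):.0f} pages/hour")
print(f"{seconds_for(1000):.0f} s for 1000 pages")
```

Model loading, image decoding, and export to Markdown or HTML are not included in this estimate.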
Coming soon!
- Better chart recognition
- One-shot multi-page inference
- Chemical Recognition
- Datasets
Get started (MLX code examples)
You can use MLX to perform inference and Docling to convert the results to a variety of output formats (md, html, etc.):
Single page image inference using MLX via `mlx-vlm`
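# Prerequisites:
# pip install -U mlx-vlm
# pip install docling_core
import sys
from pathlib import Path
from PIL import Image
# Needed for the DocTags -> DoclingDocument conversion at the end
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from mlx_vlm import load, apply_chat_template, stream_generate
from mlx_vlm.utils import load_image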
# Variables
path_or_hf_repo="zboyles/SmolDocling-256M-preview-bf16"
output_path=Path("output")
output_path.mkdir(exist_ok=True)
# Model Params
eos="<end_of_utterance>"
verbose=True
kwargs={
    "max_tokens": 8000,
    "temperature": 0.0,
}
# Load images
# Note: I manually downloaded the image
# image_src = "https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg"
# image = load_image(image_src)
image_src = "images/GazettedeFrance.jpg"
image = Image.open(image_src).convert("RGB")
# Initialize processor and model
model, processor = load(
    path_or_hf_repo=path_or_hf_repo,
    trust_remote_code=True,
)
config = model.config
# Create input messages - Docling Walkthrough Structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)
# # Alternatively, supported prompt creation method
# messages = [{"role": "user", "content": "Convert this page to docling."}]
# prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)
text = ""
last_response = None
# Stream tokens and stop as soon as the end-of-utterance marker appears
for response in stream_generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    **kwargs
):
    if verbose:
        print(response.text, end="", flush=True)
    text += response.text
    last_response = response
    if eos in text:
        text = text.split(eos)[0].strip()
        break
print()
if verbose:
    print("\n" + "=" * 10)
# Exit before touching last_response, which is None if nothing was generated
if len(text) == 0:
    print("No text generated for this prompt")
    sys.exit(0)
if verbose:
    print(
        f"Prompt: {last_response.prompt_tokens} tokens, "
        f"{last_response.prompt_tps:.3f} tokens-per-sec"
    )
    print(
        f"Generation: {last_response.generation_tokens} tokens, "
        f"{last_response.generation_tps:.3f} tokens-per-sec"
    )
    print(f"Peak memory: {last_response.peak_memory:.3f} GB")
# To convert to Docling Document, MD, HTML, etc.:
docling_output_path = output_path / Path(image_src).with_suffix(".dt").name
docling_output_path.write_text(text)
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([text], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
# export as any format
# HTML
doc.save_as_html(docling_output_path.with_suffix(".html"))
# MD
doc.save_as_markdown(docling_output_path.with_suffix(".md"))
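The stop condition in the streaming loop above can be pulled out and tested without loading a model. The following is a minimal sketch, not part of mlx-vlm: `collect_until_eos` is a hypothetical helper, and the demo chunks are made-up stand-ins for the `response.text` values that `stream_generate` yields.

```python
from typing import Iterable, Tuple

EOS = "<end_of_utterance>"


def collect_until_eos(chunks: Iterable[str], eos: str = EOS) -> Tuple[str, bool]:
    """Concatenate streamed text chunks, stopping once `eos` appears.

    Returns (text, hit_eos). Anything after the marker is discarded.
    """
    text = ""
    for chunk in chunks:
        text += chunk
        # Check the accumulated text rather than the chunk, so a marker
        # that is split across two chunks is still detected.
        if eos in text:
            return text.split(eos)[0].strip(), True
    return text.strip(), False


# Hypothetical chunks; the marker is deliberately split across two of them.
demo_chunks = ["Hello wor", "ld<end_of_", "utterance>", "ignored"]
result, hit = collect_until_eos(demo_chunks)
print(repr(result), hit)
```

Checking the accumulated string mirrors what the loop above does with `if eos in text`. The returned flag lets a caller tell a clean stop from a generation that ran into `max_tokens`.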
Thanks to @Blaizzy for the code examples that helped me quickly adapt the docling example.