---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: transformers
license: cdla-permissive-2.0
pipeline_tag: image-text-to-text
---
# SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).

### 🚀 Features:

- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📝 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 **OCR with Bounding Boxes** – OCRs regions specified by a bounding box.
- 📂 **General Document Processing** – Trained on both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 💨 **Fast inference using vLLM** – Avg of 0.35 secs per page on an A100 GPU.

### 🚧 *Coming soon!*

- 📊 **Better chart recognition 🛠️**
- 📚 **One-shot multi-page inference ⏱️**
- 🧪 **Chemical Recognition**
- 📙 **Datasets**

## ⌨️ Get started (code examples)

You can use **transformers** or **vllm** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):
📄 **Single page image inference using Transformers** 🤖

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export as any format, e.g. HTML:
# doc.save_as_html(output_file)
# or Markdown:
print(doc.export_to_markdown())
```
🚀 **Fast Batch Inference Using vLLM**

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# Place the page images you want to convert into an "img/" dir.

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192)

chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()

for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

    doctags = output.outputs[0].text
    img_fn = os.path.splitext(img_file)[0]
    output_filename = img_fn + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(doctags)

    # To convert to Docling Document, MD, HTML, etc.:
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)

    # Export as any format, e.g. HTML:
    # doc.save_as_html(output_file)
    # or Markdown:
    output_filename_md = img_fn + ".md"
    output_path_md = os.path.join(OUTPUT_DIR, output_filename_md)
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
💻 **Local inference on Apple Silicon with MLX:** [see here](https://huggingface.co/ds4sd/SmolDocling-256M-preview-mlx-bf16)

## DocTags

DocTags create a clear and structured system of tags and rules that separate text from the document's structure. This makes things easier for Image-to-Sequence models by reducing confusion. Converting directly to formats like HTML or Markdown, on the other hand, can be messy: it often loses details, doesn't clearly show the document's layout, and increases the number of tokens, making processing less efficient.

DocTags are integrated with Docling, which allows export to HTML, Markdown, and JSON. These exports can be offloaded to the CPU, reducing token generation overhead and improving efficiency.
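For instance, DocTags saved to disk (such as the `.dt` files written by the vLLM script above) can be turned into Markdown or HTML later, on the CPU, without loading the model at all. Below is a minimal sketch that reuses the same `docling_core` calls as the examples above; the `out/page_1.dt` and `img/page_1.png` paths are placeholders for files you produced earlier:

```python
# Minimal sketch: CPU-side conversion of previously generated DocTags.
# Assumes "out/page_1.dt" and "img/page_1.png" exist (placeholder names).
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Read back the DocTags sequence and its corresponding page image
with open("out/page_1.dt", encoding="utf-8") as f:
    doctags = f.read()
image = Image.open("img/page_1.png").convert("RGB")

# Pair DocTags with the page image and build a DoclingDocument
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export without any GPU or model involvement
doc.save_as_markdown("out/page_1.md")
doc.save_as_html("out/page_1.html")
```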
## Supported Instructions

| Description | Instruction | Comment |
|---|---|---|
| Full conversion | Convert this page to docling. | DocTags representation |
| Chart | Convert chart to table. | (e.g., `<chart>`) |
| Formula | Convert formula to LaTeX. | (e.g., `<formula>`) |
| Code | Convert code to text. | (e.g., `<code>`) |
| Table | Convert table to OTSL. | (e.g., `<otsl>`) OTSL: Lysak et al., 2023 |
| Actions and Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` | |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` | |
| | Find all 'text' elements on the page, retrieve all section headers. | |
| | Detect footer elements on the page. | |
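Any of these instructions can be substituted for the full-page prompt in the code examples above. As a sketch, the snippet below reuses the Transformers setup from the first example but issues the region-OCR instruction from the table; the `<loc_...>` coordinates are illustrative placeholders copied from the table, not values tuned for the sample image:

```python
# Sketch: issuing a region-OCR instruction instead of full-page conversion.
# Assumes `processor`, `model`, `image`, and `DEVICE` are already defined as in
# the Transformers example above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            # Placeholder coordinates taken from the instruction table
            {"type": "text", "text": "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

# A region-level instruction needs far fewer tokens than a full page
generated_ids = model.generate(**inputs, max_new_tokens=512)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0].lstrip()
print(doctags)
```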
#### Model Summary

- **Developed by:** Docling Team, IBM Research
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
- **Finetuned from model:** Based on [SmolVLM-256M-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct)

**Repository:** [Docling](https://github.com/docling-project/docling)

**Paper:** [arXiv](https://arxiv.org/abs/2503.11576)

**Project Page:** [Hugging Face](https://huggingface.co/ds4sd/SmolDocling-256M-preview)

**Citation:**

```
@misc{nassar2025smoldoclingultracompactvisionlanguagemodel,
      title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
      author={Ahmed Nassar and Andres Marafioti and Matteo Omenetti and Maksym Lysak and Nikolaos Livathinos and Christoph Auer and Lucas Morin and Rafael Teixeira de Lima and Yusik Kim and A. Said Gurbuz and Michele Dolfi and Miquel Farré and Peter W. J. Staar},
      year={2025},
      eprint={2503.11576},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.11576},
}
```

**Demo:** [HF Space](https://huggingface.co/spaces/ds4sd/SmolDocling-256M-Demo)