---
base_model:
- HuggingFaceTB/SmolVLM-256M-Instruct
language:
- en
library_name: mlx
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- mlx
---

# zboyles/SmolDocling-256M-preview-bf16

This model was converted to **MLX format** from [`ds4sd/SmolDocling-256M-preview`](https://huggingface.co/ds4sd/SmolDocling-256M-preview) using mlx-vlm version **0.1.18**.

* Refer to the [**original model card**](https://huggingface.co/ds4sd/SmolDocling-256M-preview) for more details on the model.
* Refer to the [**mlx-vlm repo**](https://github.com/Blaizzy/mlx-vlm) for more examples using `mlx-vlm`.

## Use SmolDocling-256M-preview with docling and mlx

> **Find working MLX + Docling example code below**
## SmolDocling-256M-preview

SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

This model was presented in the paper [SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion](https://huggingface.co/papers/2503.11576).

### 🚀 Features:

- 🏷️ **DocTags for Efficient Tokenization** – Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with **DoclingDocuments**.
- 🔍 **OCR (Optical Character Recognition)** – Extracts text accurately from images.
- 📐 **Layout and Localization** – Preserves document structure and document element **bounding boxes**.
- 💻 **Code Recognition** – Detects and formats code blocks, including indentation.
- 🔢 **Formula Recognition** – Identifies and processes mathematical expressions.
- 📊 **Chart Recognition** – Extracts and interprets chart data.
- 📑 **Table Recognition** – Supports column and row headers for structured table extraction.
- 🖼️ **Figure Classification** – Differentiates figures and graphical elements.
- 📏 **Caption Correspondence** – Links captions to relevant images and figures.
- 📜 **List Grouping** – Organizes and structures list elements correctly.
- 📄 **Full-Page Conversion** – Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- 🔲 **OCR with Bounding Boxes** – OCR of regions specified by a bounding box.
- 📂 **General Document Processing** – Trained for both scientific and non-scientific documents.
- 🔄 **Seamless Docling Integration** – Import into **Docling** and export in multiple formats.
- 💨 **Fast inference using VLLM** – Avg of 0.35 secs per page on an A100 GPU.
### 🚧 *Coming soon!*

- 📊 **Better chart recognition 🛠️**
- 📚 **One shot multi-page inference ⏱️**
- 🧪 **Chemical Recognition**
- 📙 **Datasets**

## ⌨️ Get started (**MLX** code examples)

You can use **mlx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):
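SmolDocling takes a standard chat-message layout: one user turn containing an image placeholder followed by a text instruction. A minimal sketch of that structure, factored into a helper (`build_messages` is an illustrative name, not part of any library; the dict layout mirrors the full example below):

```python
def build_messages(instruction: str) -> list[dict]:
    """Chat-style message list: one user turn with an image slot plus a text instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]

# The full-page conversion instruction used throughout this card:
messages = build_messages("Convert this page to docling.")
```

The resulting list is what gets passed to `apply_chat_template` in the walkthrough below.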
📄 Single page image inference using MLX via `mlx-vlm` 🤖

```python
# Prerequisites:
# pip install -U mlx-vlm
# pip install docling_core

import sys
from pathlib import Path

from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from mlx_vlm import load, apply_chat_template, stream_generate
from mlx_vlm.utils import load_image

# Variables
path_or_hf_repo = "zboyles/SmolDocling-256M-preview-bf16"
output_path = Path("output")
output_path.mkdir(exist_ok=True)

# Model params
eos = "<end_of_utterance>"  # SmolVLM end-of-sequence marker
verbose = True
kwargs = {
    "max_tokens": 8000,
    "temperature": 0.0,
}

# Load images
# Note: I manually downloaded the image
# image_src = "https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg"
# image = load_image(image_src)
image_src = "images/GazettedeFrance.jpg"
image = Image.open(image_src).convert("RGB")

# Initialize processor and model
model, processor = load(
    path_or_hf_repo=path_or_hf_repo,
    trust_remote_code=True,
)
config = model.config

# Create input messages - Docling walkthrough structure
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]

prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

# # Alternatively, supported prompt creation method
# messages = [{"role": "user", "content": "Convert this page to docling."}]
# prompt = apply_chat_template(processor, config, messages, add_generation_prompt=True)

text = ""
last_response = None

for response in stream_generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image=image,
    **kwargs,
):
    if verbose:
        print(response.text, end="", flush=True)
    text += response.text
    last_response = response
    if eos in text:
        text = text.split(eos)[0].strip()
        break

print()

if verbose:
    print("\n" + "=" * 10)
    if len(text) == 0:
        print("No text generated for this prompt")
        sys.exit(0)
    print(
        f"Prompt: {last_response.prompt_tokens} tokens, "
        f"{last_response.prompt_tps:.3f} tokens-per-sec"
    )
    print(
        f"Generation: {last_response.generation_tokens} tokens, "
        f"{last_response.generation_tps:.3f} tokens-per-sec"
    )
    print(f"Peak memory: {last_response.peak_memory:.3f} GB")

# To convert to Docling Document, MD, HTML, etc.:
docling_output_path = output_path / Path(image_src).with_suffix(".dt").name
docling_output_path.write_text(text)

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([text], [image])
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export as any format
# HTML
doc.save_as_html(docling_output_path.with_suffix(".html"))
# MD
doc.save_as_markdown(docling_output_path.with_suffix(".md"))
```
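The streaming loop above accumulates chunks and cuts the output at the end-of-sequence marker. That trimming step can be isolated as a small pure function (a sketch; the marker string depends on the model's tokenizer):

```python
def trim_at_eos(text: str, eos: str) -> str:
    """Return text up to the first end-of-sequence marker, stripped of whitespace."""
    if eos and eos in text:
        return text.split(eos)[0].strip()
    return text.strip()
```

For example, `trim_at_eos("<doctag>...</doctag><end_of_utterance>", "<end_of_utterance>")` drops the marker and anything after it, leaving only the DocTags payload to hand to `DocTagsDocument`.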
Thanks to [**@Blaizzy**](https://github.com/Blaizzy) for the [code examples](https://github.com/Blaizzy/mlx-vlm/tree/main/examples) that helped me quickly adapt the `docling` example.