metadata

library_name: transformers
license: apache-2.0
datasets:
  - ds4sd/DocLayNet
pipeline_tag: image-segmentation

DETR-layout-detection

We present cmarkea/detr-layout-detection, a model that extracts layout elements (Text, Picture, Caption, Footnote, etc.) from an image of a document. It is a fine-tuned version of detr-resnet-50 trained on the DocLayNet dataset. The model jointly predicts masks and bounding boxes for document objects, which makes it well suited to processing document corpora for ingestion into an open-domain question answering (ODQA) system.

The model detects 11 entity classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title.
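When preparing pages for an ODQA pipeline, it can be useful to keep only the classes that carry content. The snippet below is a minimal sketch of such a filter. It assumes each detected region has already been turned into a plain dict with a `label` key; that structure, the `keep_content_regions` helper, and the choice of which classes to drop are all illustrative assumptions, not part of the model or of transformers.

```python
# The 11 classes listed above.
LAYOUT_CLASSES = (
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
)

# Assumption: running headers and footers are treated as noise here.
# Which classes to drop depends on the downstream use case.
PAGE_FURNITURE = frozenset({"Page-footer", "Page-header"})


def keep_content_regions(regions, drop=PAGE_FURNITURE):
    """Return the regions whose label is not in `drop`.

    `regions` is a list of dicts with at least a "label" key
    (a hypothetical structure chosen for this sketch).
    """
    unknown = {r["label"] for r in regions} - set(LAYOUT_CLASSES)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")
    return [r for r in regions if r["label"] not in drop]


# Made-up regions, for illustration only.
regions = [
    {"label": "Page-header"},
    {"label": "Title"},
    {"label": "Text"},
    {"label": "Page-footer"},
]
kept = keep_content_regions(regions)
```

Here `kept` contains only the Title and Text regions; an unknown label raises a `ValueError` rather than being silently passed through.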

Performance

In this section, we evaluate the model's performance on semantic segmentation and object detection separately. No post-processing was applied to the predictions.

Semantic segmentation

Object detection

Direct Use

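import torch
from PIL import Image
from transformers import AutoImageProcessor
from transformers.models.detr import DetrForSegmentation

# Load the image processor and the fine-tuned segmentation model.
img_proc = AutoImageProcessor.from_pretrained(
    "cmarkea/detr-layout-detection"
)
model = DetrForSegmentation.from_pretrained(
    "cmarkea/detr-layout-detection"
)

# Placeholder path: replace with your own document page image.
img = Image.open("page.png").convert("RGB")

# Run the forward pass without tracking gradients.
with torch.inference_mode():
    inputs = img_proc(img, return_tensors='pt')
    output = model(**inputs)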

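# Minimum confidence score for a prediction to be kept.
threshold = 0.4

# PIL reports size as (width, height); target_sizes expects (height, width).
segmentation_mask = img_proc.post_process_segmentation(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)

# One dict per image, with "scores", "labels" and "boxes".
bbox_pred = img_proc.post_process_object_detection(
    output,
    threshold=threshold,
    target_sizes=[img.size[::-1]]
)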
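The detection output is a list with one dict per image, holding `scores`, `labels` and `boxes` (corner coordinates in pixels once `target_sizes` is given). The sketch below shows one possible way to flatten such a dict into readable records. The `detections_to_records` helper and the example values are hypothetical; in practice you would pass `bbox_pred[0]` and `model.config.id2label` instead of the stand-ins used here.

```python
from typing import Any, Dict, List


def _as_list(values: Any) -> list:
    """Return a plain Python list from a tensor-like object or a sequence."""
    return values.tolist() if hasattr(values, "tolist") else list(values)


def detections_to_records(
    detection: Dict[str, Any], id2label: Dict[int, str]
) -> List[dict]:
    """Flatten one image's detection dict into a list of labelled boxes."""
    records = []
    for score, label, box in zip(
        _as_list(detection["scores"]),
        _as_list(detection["labels"]),
        _as_list(detection["boxes"]),
    ):
        x_min, y_min, x_max, y_max = box
        records.append({
            # Fall back to a generic name if the id is missing from the mapping.
            "label": id2label.get(int(label), f"class_{int(label)}"),
            "score": round(float(score), 3),
            "box": (x_min, y_min, x_max, y_max),
        })
    # Highest-confidence detections first.
    return sorted(records, key=lambda r: r["score"], reverse=True)


# Made-up stand-ins for bbox_pred[0] and model.config.id2label.
# The class ids below are assumptions, not the model's real mapping.
example_detection = {
    "scores": [0.91, 0.97],
    "labels": [9, 8],
    "boxes": [[10.0, 20.0, 200.0, 60.0], [15.0, 80.0, 400.0, 300.0]],
}
example_id2label = {8: "Table", 9: "Text"}

records = detections_to_records(example_detection, example_id2label)
for record in records:
    print(record["label"], record["score"], record["box"])
```

Because the helper only relies on `.tolist()` when it is available, it accepts both the tensors returned by the image processor and plain Python lists.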

Citation

@online{DeDetrLay,
  AUTHOR = {Cyrile Delestre},
  URL = {https://huggingface.co/cmarkea/detr-base-layout-detection},
  YEAR = {2024},
  KEYWORDS = {Image Processing ; Transformers ; Layout},
}