---
license: mit
title: OmniParse
sdk: docker
emoji: 🐢
colorFrom: yellow
colorTo: green
---
# OmniParser API
Self-hosted version of Microsoft's OmniParser image-to-text model.

OmniParser is a general screen-parsing tool that converts UI screenshots into a structured format, improving existing LLM-based UI agents. Its training data includes: 1) an interactable icon detection dataset, curated from popular web pages and automatically annotated to highlight clickable and actionable regions, and 2) an icon description dataset, designed to associate each UI element with its corresponding function.
## Why?
There's already a great Hugging Face Gradio app for this model, and it even offers an API. However:

- Gradio is much slower than serving the model directly (as we do here)
- The HF-hosted demo is rate-limited
## How it works
If you look at the `Dockerfile`, we start from the HF demo image to retrieve all the weights and util functions. Then we add a simple FastAPI server (in `main.py`) to serve the model.
## Getting Started
### Requirements
- GPU
- 16 GB RAM (swap recommended)
### Locally

- Clone the repository
- Build the Docker image:

  ```bash
  docker build -t omni-parser-app .
  ```

- Run the Docker container:

  ```bash
  docker run -p 7860:7860 omni-parser-app
  ```
### Self-hosted API
I suggest hosting on fly.io because it's quick and simple to deploy with a CLI. This repo is ready-made for deployment there (see `fly.toml` for the configuration). Just run `fly launch` and follow the prompts.
## Docs
Visit http://localhost:7860/docs for the API documentation. There's only one route, `/process_image`, which returns:

- The image with bounding boxes drawn on it (in base64 format)
- The parsed elements as a list of text descriptions
- The bounding box coordinates of the parsed elements
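A quick client sketch for calling the route. It assumes the third-party `requests` library; the multipart field name `image_file` and the response key `image` are assumptions about the schema, so check `/docs` for the actual names.

```python
import base64

import requests  # third-party: pip install requests

API_URL = "http://localhost:7860/process_image"


def parse_screenshot(image_path: str) -> dict:
    # The multipart field name "image_file" is an assumption;
    # the interactive docs at /docs show the real request schema.
    with open(image_path, "rb") as f:
        response = requests.post(API_URL, files={"image_file": f})
    response.raise_for_status()
    return response.json()


def save_annotated_image(payload: dict, out_path: str) -> None:
    # The "image" key holding the base64-encoded annotated screenshot
    # is likewise an assumption about the response shape.
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(payload["image"]))
```

Typical usage: `save_annotated_image(parse_screenshot("screen.png"), "annotated.png")` against a locally running container.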
## Examples
## Related Projects
Check out OneQuery, an agent that browses the web and returns structured responses for any query, simple or complex. OneQuery is built using OmniParser to enhance its capabilities.