GGUF Quants of UI-TARS Not Functional
I have noticed that none of the GGUF quantized versions of the UI-TARS models work appropriately. I wonder if there's something special that needs to be done to get these quants working well since they're Qwen2-VL based.
What's the issue, quality of responses? Or is it actually not able to run?
I appreciate you taking the time to respond. The quality of the responses is the issue here, particularly when there's an image involved. I can get some pretty decent responses if there's no image data for it to chew on, so I wonder if something's funky with the CLIP situation or something. I dunno.
For example, here's what I get from Q6_K_L when there's an image as part of the context:
(partial log output from UI-TARS-Desktop)
[2025-01-28 01:19:40.302] [info] (main) [vlmParams_conversations]: [
{
from: 'human',
value: 'You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.\n' +
'\n' +
'## Output Format\n' +
'\n' +
'Thought: ...\n' +
'Action: ...\n' +
'\n' +
'\n' +
'## Action Space\n' +
"click(start_box='[x1, y1, x2, y2]')\n" +
"left_double(start_box='[x1, y1, x2, y2]')\n" +
"right_single(start_box='[x1, y1, x2, y2]')\n" +
"drag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]')\n" +
"hotkey(key='')\n" +
'type(content='') #If you want to submit your input, use "\n" at the end of content.\n' +
"scroll(start_box='[x1, y1, x2, y2]', direction='down or up or right or left')\n" +
'wait() #Sleep for 5s and take a screenshot to check for any changes.\n' +
'finished()\n' +
"call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.\n" +
'\n' +
'## Note\n' +
'- Use English in Thought part.\n' +
'- Write a small plan and finally summarize your next action (with its target element) in one sentence in Thought part.\n' +
'\n' +
'## User Instruction\n' +
'Open Firefox'
},
{ from: 'human', value: '' },
{
from: 'gpt',
value: 'multi / 『 base plastic uniform – early capo -> multi’s(100, & ‑ / Early – / –\n' +
'\t´s,\n' +
' – early ‐ » R ... multi’s, \n' +
'\n' +
'\n' +
'\n' +
'\n' +
'\n' +
'eworthy(2 L relative to 2( »\n' +
' ›..., and more´s base, & „ / multi sli 107 / – / – – / – multi plastic – snap to multi → almart启 early 2. – L W multi\n' +
' »\n' +
' multi – early – early ‑ »\n' +
' multi cap. „ » › & – ‐ early ‑,\n' +
'„启 » 2( R / ‑ » R L, multi’s base ... 2010 L 2\n' +
'\n' +
' multiow -> multi\n' +
' multi Cap.思imo,\n' +
' – _And ...,\n' +
' Early W sli\n' +
' early 『必 Base & multi / ... ‐ 2(2017 → „& ‑, „ early ‑\n' +
'\n' +
'\n' +
',\n' +
'„ »\n' +
'\n' +
'\n' +
' uniform(„,\n' +
' multi plastic »\n' +
'´s »没有必要 multiow multi售后服务 to – Cap \n' +
'\n' +
'\n' +
'\n' +
'\n' +
'\n' +
' multi » »,\n' +
'...,\n' +
'– L W sli 『 – / – / ‑ 思 » / ‑ R – 早 ›. multi.\n' +
'\t Rancap,\n' +
' Early ( „\n' +
' / ozilla, ... / ‑\n' +
'\n' +
'\n' +
' uniform 2(‐ 7 23 › 16 2010L 1 / 258,\n' +
' »\n' +
' »\n' +
' multi – _And,\n' +
'(„ »almart ="\n' +
'\n' +
'\n' +
'\n' +
' › W „ early – multi multi – – early – – ‐ – – – – / – / – / – / – & » plastic atin( – early ʼ » ‑(oby,\n' +
'\n' +
' multi,\n' +
' Early –\n' +
' multilogo → 『 ‑(„... .RE characters uniform ›oid,\n' +
' ..., » » 2(‐ / Fl 13 7 {...( – 8 › multi: / – $ – ,\\n' +
'(, they L ... / .RE启 » ´s\n' +
' – R ...„,\n' +
'神秘 ‑,\n' +
' & – & – – ();\n' +
'\n' +
'\n' +
'\n' +
' early ‐ 『 `/ – 2 › multi’s »\n' +
' » multi埃 s...,\n' +
' „启 »\n' +
'Docs characters\n' +
'多 – /,\n' +
'\t(„\n' +
'说明 plastic to multi uniform ‑ early multiow ->创业 in → multi – begin characters to 『 » l, multi ;\\n' +
'(„ early 2 / ›没有必要(,\n' +
'(,\n' +
'询 base\n' +
' multi » awner base 2: ›\n' +
'\n' +
'\n' +
' globally in,\n' +
'„ others‐(‑’s – 娅 19 »\n' +
'\n' +
' – & $...,\n' +
'\t » „ » .RE to end\n' +
'pst\n' +
'\n' +
'\n' +
' – 2024( – ´s 思, – multi 娅, & ();\n' +
'\n' +
'\n' +
'\n' +
'„\n' +
' multi ‐ / → / multi uniform – ‑ 3 – / ,\n' +
'执行 R »\n' +
' multi uniform(sne 2. – – early › others\n' +
'\n' +
'\n' +
'„\n' +
'„\n' +
'\n' +
'\n' +
'„\n' +
'质 Unicode .RE base –almart perfl 『\n' +
'´s – R – &id to 『 『 ‑ & 7 1( ‐ 20(, / → ],$(\n' +
'\n' +
'\n' +
' .RE,\n' +
'( multi ‑ »\n' +
' multi „ »\n' +
'Neo,\n' +
' .\n' +
' multi,\n' +
'‑,\n' +
'‑(„ » » »\n' +
' miejscos统一] ›,\n' +
' early 2, 2 3 – & – );\n' +
'\n' +
'\n' +
'\n' +
'... ‐\n' +
'„\n' +
'´s, – / – .RE – / – multi to – – V → – multi’s\n' +
'\n' +
'\n' +
'„\n' +
'( multi »\n' +
' multi, –\n' +
' –almart,\n' +
' Early,\n' +
'‑\n' +
'\n' +
'\n' +
'询,\n' +
'-categories.\n' +
'„ » uniform ‑,\n' +
'„ early › 2. multi 2( ‑( begin » 2 / R 1( „早 – / multi,'
},
{ from: 'human', value: '' }
]
If it's just text, though, the response is fine:
Tell me about Firefox
Firefox is a popular open-source web browser known for its privacy features and customization options. It offers a
seamless browsing experience with various add-ons and themes to enhance functionality and appearance. The browser
supports multiple languages and has a user-friendly interface, making it accessible across different operating
systems including Windows, macOS, and Linux.
Oof, yeah, that doesn't look right...
Can you confirm how you're running it? I'll try it in llama.cpp with the vision binary to double-check whether something's wrong with the implementation you're using.
I have tried llama.cpp, LM Studio, and ollama. Testing independently of UI-TARS-Desktop showed that the Q8 quant of the 7B would sometimes return coordinates, but most of the time it would say it couldn't see what I was looking for in the image, even though it was clearly visible.
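For anyone wanting to reproduce the independent test, here's a minimal sketch of how I've been poking at it: it assumes a local llama.cpp server exposing the OpenAI-compatible /v1/chat/completions endpoint, and the model name "ui-tars-7b" is just a placeholder for whatever your server has loaded. The helper only builds the multimodal request payload with the screenshot inlined as a base64 data URI; the actual POST is up to you.

```python
import base64
import json


def build_vision_payload(image_bytes: bytes, prompt: str, model: str = "ui-tars-7b") -> dict:
    """Build an OpenAI-style chat payload with an inline base64 screenshot."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # placeholder; match whatever your server actually loaded
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }
        ],
        "temperature": 0.0,  # deterministic sampling makes quant-to-quant comparison easier
    }


if __name__ == "__main__":
    # With a llama.cpp server started with the model plus its mmproj file,
    # you could POST this JSON to http://localhost:8080/v1/chat/completions
    # (endpoint/port are assumptions about your local setup).
    with open("screenshot.png", "rb") as f:
        payload = build_vision_payload(f.read(), "Where is the Firefox icon?")
    print(json.dumps(payload)[:120])
```

Comparing the response to this against the full-precision model on the same screenshot is how I noticed the Q8 quant often claims it can't see the target.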
In retrospect, I've noticed that bytedance-research initially released GGUFs, then pulled them. That leads me to believe this result isn't entirely unexpected, but it leaves me curious as to how we'll tackle quantizing these models. Where there's a will, there's a way!
I am currently running the full precision 2B model, which performs as expected.
Thanks again