Alex J. Chan commited on
Commit
c352182
·
2 Parent(s): daabc86 a2d2a9a

Merge pull request #5 from convergence-ai/alex/using

Browse files
Files changed (2) hide show
  1. .gitignore +3 -1
  2. README.md +104 -3
.gitignore CHANGED
@@ -172,4 +172,6 @@ cython_debug/
172
 
173
  logs/
174
  local_trajectories/
175
- screenshots/
 
 
 
172
 
173
  logs/
174
  local_trajectories/
175
+ screenshots/
176
+ gifs/
177
+ .DS_Store
README.md CHANGED
@@ -99,7 +99,7 @@ vllm serve --model convergence-ai/proxy-lite \
99
 
100
  The tool arguments are **very important** for parsing the tool calls from the model appropriately.
101
 
102
- > **Important:** To serve the model locally, install vLLM and transformers with `uv sync --all-extras`. Qwen-2.5-VL support is not yet available in the latest release of `transformers` so installation from source is required.
103
 
104
  You can set the `api_base` to point to your local endpoint when calling Proxy Lite:
105
 
@@ -154,6 +154,92 @@ result = asyncio.run(
154
  )
155
  ```
156
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
157
  ### Webbrowser Environment
158
 
159
  The `webbrowser` environment is a simple environment that uses the `playwright` library to navigate the web.
@@ -162,8 +248,23 @@ We launch a Chromium browser and navigate to the `homepage` provided in the `Run
162
 
163
  Actions in an environment are defined through available tool calls, which in the browser case are set as default in the `BrowserTool` class. This allows the model to click, type, etc. at relevant `mark_id` elements on the page. These elements are extracted using JavaScript injected into the page in order to make interaction easier for the models.
164
 
165
- If you want to not use this set-of-marks approach, you can set the `no_pois_in_image` flag to `True`, and the `include_poi_text` flag to `False` in the `EnvironmentConfig`. This way the model will only see the original image, and not the annotated image with these points-of-interest (POIs). In this case, you would want to update the `BrowserTool` to interact with pixel coordinates instead of the `mark_id`s.
 
 
 
 
 
 
 
166
 
167
- **Note:** We use `playwright_stealth` to lower the chance of detection by anti-bot services, but this isn't foolproof and Proxy Lite may still get blocked with captchas or other anti-bot measures, especially when using the `headless` flag. We recommend using network proxies to avoid this issue.
168
 
169
 
 
 
 
 
 
 
 
 
 
99
 
100
  The tool arguments are **very important** for parsing the tool calls from the model appropriately.
101
 
102
+ > **Important:** To serve the model locally, install vLLM and transformers with `uv sync --all-extras`. Qwen-2.5-VL support is not yet available in the latest release of `transformers` so installation from source is required (the appropriate revision is specified in the `pyproject.toml` file).
103
 
104
  You can set the `api_base` to point to your local endpoint when calling Proxy Lite:
105
 
 
154
  )
155
  ```
156
 
157
+ The `Runner` sets the solver and environment off in a loop, like in a traditional reinforcement learning setup.
158
+
159
+ <div align="center">
160
+ <img src="assets/loop.png" alt="Runner Loop" width="700" height="auto" style="margin-bottom: 20px;" />
161
+ </div>
162
+
163
+
164
+ When it comes to prompting Proxy Lite, the model expects a message history of the form:
165
+
166
+ ```python
167
+ message_history = [
168
+ {
169
+ "role": "system",
170
+ "content": "You are Proxy Lite...", # Full system prompt in src/proxy_lite/agents/proxy_lite_agent.py
171
+ }, # System prompt
172
+ {
173
+ "role": "user",
174
+ "content": "Book a table for 2 at an Italian restaurant in Kings Cross tonight at 7pm.",
175
+ }, # Set the task
176
+ {
177
+ "role": "user",
178
+ "content": [
179
+ {"type": "image_url", "image_url": {base64_encoded_screenshot} },
180
+ {"type": "text", "text": "URL: https://www.google.com/ \n- [0] <a>About</a> \n- [1] <a>Store</a>...."}
181
+ ] # This is the observation from the environment
182
+ },
183
+ ]
184
+ ```
185
+ This would then build up the message history, alternating between the assistant (action) and the user (observation), although for new calls, all the last observations other than the current one are discarded.
186
+
187
+ The chat template will format this automatically, but also expects the appropriate `Tools` to be passed in so that the model is aware of the available actions. You can do this with `transformers`:
188
+
189
+ ```python
190
+ from qwen_vl_utils import process_vision_info
191
+ from transformers import AutoProcessor
192
+
193
+ from proxy_lite.tools import ReturnValueTool, BrowserTool
194
+ from proxy_lite.serializer import OpenAICompatableSerializer
195
+
196
+ processor = AutoProcessor.from_pretrained("convergence-ai/proxy-lite")
197
+ tools = OpenAICompatableSerializer().serialize_tools([ReturnValueTool(), BrowserTool(session=None)])
198
+
199
+ templated_messages = processor.apply_chat_template(
200
+ message_history, tokenize=False, add_generation_prompt=True, tools=tools
201
+ )
202
+
203
+ image_inputs, video_inputs = process_vision_info(message_history)
204
+
205
+ batch = processor(
206
+ text=[templated_messages],
207
+ images=image_inputs,
208
+ videos=video_inputs,
209
+ padding=True,
210
+ return_tensors="pt",
211
+ )
212
+ ```
213
+
214
+ Or you can send to the endpoint directly, which will handle the formatting:
215
+
216
+ ```python
217
+ from openai import OpenAI
218
+
219
+ client = OpenAI(base_url="http://convergence-ai-demo-api.hf.space/v1")
220
+
221
+ response = client.chat.completions.create(
222
+ model="convergence-ai/proxy-lite",
223
+ messages=message_history,
224
+ tools=tools,
225
+ tool_choice="auto",
226
+ )
227
+ ```
228
+
229
+ The model's response will follow the format of:
230
+ - Observe
231
+ - Think
232
+ - Act
233
+ ```bash
234
+ <observation>The privacy consent banner has been successfully dismissed, allowing full access to the webpage. The search bar is visible, and the page is ready for interaction.</observation>
235
+ <thinking>The task of finding a vegetarian lasagna recipe has not yet been completed. I now have access to the search bar to begin searching for the recipe. I will type 'vegetarian lasagna' into the search bar and then click the search button to find relevant recipes.</thinking>
236
+ <tool_call>{"function": "click", "arguments": {"entries": [{"mark_id": 1, "content": "vegetarian lasagna"}]}}</tool_call>
237
+ ```
238
+ Where steps are separated by `<observation>`, `<thinking>`, and `<tool_call>` tags (Use the `-tool-call-parser hermes` option with the vLLM server to automatically parse the tool call when getting back the completion).
239
+
240
+
241
+
242
+
243
  ### Webbrowser Environment
244
 
245
  The `webbrowser` environment is a simple environment that uses the `playwright` library to navigate the web.
 
248
 
249
  Actions in an environment are defined through available tool calls, which in the browser case are set as default in the `BrowserTool` class. This allows the model to click, type, etc. at relevant `mark_id` elements on the page. These elements are extracted using JavaScript injected into the page in order to make interaction easier for the models.
250
 
251
+ **Note:** We use `playwright_stealth` to lower the chance of detection by anti-bot services, but this isn't foolproof and Proxy Lite may still get blocked by captchas or other anti-bot measures, especially when using the `headless` flag. We recommend using network proxies to avoid this issue.
252
+
253
+
254
+ ## Limitations
255
+
256
+ This model has not been designed to act as a full assistant able to interact with a user, instead it acts as a tool that goes out and *autonomously* completes a task.
257
+ As such, it will struggle with tasks that require credentials or user interaction such as actually purchasing items if you don't give all the required details in the prompt.
258
+
259
 
260
+ ## Citation
261
 
262
 
263
+ ```bibtex
264
+ @article{proxy-lite,
265
+ title={Proxy Lite - A Mini, Open-weights, Autonomous Assistant},
266
+ author={Convergence AI},
267
+ url={https://github.com/convergence-ai/proxy-lite},
268
+ year={2025}
269
+ }
270
+ ```