benchmark updates for new weights
README.md
CHANGED
@@ -168,26 +168,26 @@ Let's break down the code into these steps:
```python
from datetime import datetime, timedelta

-# Get
-
-tomorrow = today + timedelta(days=1)
-tomorrow_str = tomorrow.strftime('%Y-%m-%d')

# Define the time slots
-start_time =
-end_time =

# Step 1: Check availability
-is_available = check_availability(

-# Step 2: Make appointment if available
if is_available:
-
-
-
-if
-
-
```

This code will first determine if the specified time slot is available tomorrow. If it is, it will attempt to make the appointment and then add it to the reminders if successful.
@@ -203,21 +203,23 @@ We evaluate the model on the following benchmarks:

Below are the BFCL results: evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B***, ***Dria-Agent-α-7B***, and ***gpt-4o-2024-11-20***

-| Metric | Qwen/Qwen2.5-3B-Instruct | Dria-Agent-a-3B | Dria-Agent-
-
-| **Non-Live Simple AST** | 75.50%
-| **Non-Live Multiple AST** | 90.00%
-| **Non-Live Parallel AST** | 80.00%
-| **Non-Live Parallel Multiple AST** | 78.50%
-| **Non-Live Simple Exec** | 82.07%
-| **Non-Live Multiple Exec** | 86.00%
-| **Non-Live Parallel Exec** | 82.00%
-| **Non-Live Parallel Multiple Exec** | 80.00%
-| **Live Simple AST** | 68.22%
-| **Live Multiple AST** | 66.00%
-| **Live Parallel AST** | 62.50%
-| **Live Parallel Multiple AST** | 66.67%
-| **Relevance Detection** | 88.89%

and the MMLU-Pro and DPAB results:

```python
from datetime import datetime, timedelta

+# Get tomorrow's date
+tomorrow = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")

# Define the time slots
+start_time = "10:00"
+end_time = "12:00"

# Step 1: Check availability
+is_available = check_availability(tomorrow, start_time, end_time)

if is_available:
+    # Step 2: Make the appointment
+    appointment_details = make_appointment(tomorrow, start_time, end_time, "Meeting with thesis supervisor")
+
+    if appointment_details['appointment_made']:
+        # Step 3: Add to reminders
+        reminder_text = f"Appointment with thesis supervisor scheduled for {tomorrow} from {start_time} to {end_time}."
+        add_to_reminders(reminder_text)
+else:
+    appointment_details = {"day": tomorrow, "start_time": start_time, "end_time": end_time, "appointment_made": False}
```

This code will first determine if the specified time slot is available tomorrow. If it is, it will attempt to make the appointment and then add it to the reminders if successful.
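
The snippet in this hunk calls `check_availability`, `make_appointment`, and `add_to_reminders` without showing their bodies. To try the three-step flow end to end, here is a minimal sketch that assumes simple in-memory stand-ins for those tools. Only the function names, the argument order, and the `appointment_made` key come from the README; the stub bodies, the `BOOKED_SLOTS` and `REMINDERS` containers, and the `schedule` wrapper are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for the three tools the snippet calls.
# Only their names and call shapes come from the README; the bodies are assumptions.
BOOKED_SLOTS = set()   # (day, start_time, end_time) tuples already taken
REMINDERS = []         # reminder texts added so far


def check_availability(day, start_time, end_time):
    """Return True if this exact slot has not been booked yet."""
    return (day, start_time, end_time) not in BOOKED_SLOTS


def make_appointment(day, start_time, end_time, title):
    """Book the slot and report the outcome in an 'appointment_made' flag."""
    BOOKED_SLOTS.add((day, start_time, end_time))
    return {"day": day, "start_time": start_time, "end_time": end_time,
            "title": title, "appointment_made": True}


def add_to_reminders(reminder_text):
    """Record the reminder text."""
    REMINDERS.append(reminder_text)


def schedule(day, start_time, end_time, title):
    """Run the three steps (check, book, remind) and return the appointment details."""
    if not check_availability(day, start_time, end_time):
        return {"day": day, "start_time": start_time, "end_time": end_time,
                "appointment_made": False}
    details = make_appointment(day, start_time, end_time, title)
    if details["appointment_made"]:
        add_to_reminders(f"{title} scheduled for {day} from {start_time} to {end_time}.")
    return details


tomorrow = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")
first = schedule(tomorrow, "10:00", "12:00", "Meeting with thesis supervisor")
second = schedule(tomorrow, "10:00", "12:00", "Meeting with thesis supervisor")
print(first["appointment_made"], second["appointment_made"], len(REMINDERS))
# prints: True False 1
```

The second call hits the unavailable branch, so only one reminder is recorded. Real tool implementations would likely check overlapping time ranges rather than exact slot matches.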

Below are the BFCL results: evaluation results for ***Qwen2.5-Coder-3B-Instruct***, ***Dria-Agent-α-3B***, ***Dria-Agent-α-7B***, and ***gpt-4o-2024-11-20***

+| Metric | Qwen/Qwen2.5-3B-Instruct | Dria-Agent-a-3B | Dria-Agent-7B | gpt-4o-2024-11-20 (Prompt) |
+|---------------------------------------|----------------------------|-------------------|-------------------|---------------------------|
+| **Non-Live Simple AST** | 75.50% | 75.08% | 77.58% | 79.42% |
+| **Non-Live Multiple AST** | 90.00% | 93.00% | 94.00% | 95.50% |
+| **Non-Live Parallel AST** | 80.00% | 85.00% | 93.50% | 94.00% |
+| **Non-Live Parallel Multiple AST** | 78.50% | 79.00% | 89.50% | 83.50% |
+| **Non-Live Simple Exec** | 82.07% | 87.57% | 93.29% | 100.00% |
+| **Non-Live Multiple Exec** | 86.00% | 85.14% | 88.00% | 94.00% |
+| **Non-Live Parallel Exec** | 82.00% | 90.00% | 88.00% | 86.00% |
+| **Non-Live Parallel Multiple Exec** | 80.00% | 88.00% | 72.50% | 77.50% |
+| **Live Simple AST** | 68.22% | 70.16% | 81.40% | 83.72% |
+| **Live Multiple AST** | 66.00% | 67.14% | 78.73% | 79.77% |
+| **Live Parallel AST** | 62.50% | 50.00% | 75.00% | 87.50% |
+| **Live Parallel Multiple AST** | 66.67% | 70.83% | 62.50% | 70.83% |
+| **Relevance Detection** | 88.89% | 100.00% | 100.00% | 83.33% |
+
+

and the MMLU-Pro and DPAB results:
