---
license: cc-by-nc-4.0
base_model:
- Qwen/Qwen2.5-Coder-32B
---
# Arctic Text2SQL: ExCoT
Snowflake’s AI research team introduces ExCoT, the first model in the Arctic Text2SQL family. ExCoT is a novel framework that combines chain-of-thought (CoT) prompting with direct preference optimization (DPO) driven by SQL execution feedback, using execution results rather than human preferences as the feedback signal. This enables scalable, high-quality model optimization without requiring expensive human annotations.
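The execution-feedback idea can be pictured with a short sketch: candidate SQL generations are executed against the database, and candidates whose results match the gold query's results form the "chosen" side of DPO preference pairs, while mismatches or errors form the "rejected" side. The helper names below are hypothetical and only illustrate the signal; ExCoT's actual training pipeline lives in the ArcticTraining repository linked below.

```python
# Illustrative sketch of execution-guided preference-pair construction.
# Hypothetical helpers -- not ExCoT's actual implementation.
import sqlite3


def execute_sql(db_path: str, sql: str):
    """Run a candidate query and return its result rows, or None on error."""
    try:
        with sqlite3.connect(db_path) as conn:
            return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def build_dpo_pairs(db_path: str, gold_sql: str, candidates: list[str]):
    """Label candidates by comparing execution results against the gold query."""
    gold = execute_sql(db_path, gold_sql)
    if gold is None:  # cannot derive a signal without a valid gold execution
        return []
    chosen = [c for c in candidates if execute_sql(db_path, c) == gold]
    rejected = [c for c in candidates if execute_sql(db_path, c) != gold]
    # Each (chosen, rejected) pairing is one DPO training example --
    # no human preference labels are needed.
    return [(c, r) for c in chosen for r in rejected]
```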
Based on our internal testing, ExCoT delivered state-of-the-art results on the [BIRD-test benchmark](https://bird-bench.github.io/), achieving best-in-class performance in the single-model, single-inference category using only public datasets (BIRD and Spider) and no additional Text2SQL data:
* [Llama-3.1-Arctic-ExCoT-70B](https://huggingface.co/Snowflake/Llama-3.1-Arctic-ExCoT-70B) improved execution accuracy on the BIRD-dev set from the base model’s 57.37% to 68.51%. [Qwen-2.5-coder-Arctic-ExCoT-32B](https://huggingface.co/Snowflake/Qwen-2.5-coder-Arctic-ExCoT-32B) achieved similarly strong gains.
* Both models significantly outperformed well-known frontier general-purpose models, by more than 10 points of execution accuracy.
For more details about ExCoT and how to use it:
* ❄️ [Arctic Text2SQL: Introducing ExCoT for Execution-Guided Chain-of-Thought Optimization (blog)]()
* 📝 [ExCoT: Optimizing Reasoning for Text-to-SQL with Execution Feedback (arxiv)](https://arxiv.org/pdf/2503.19988)
* 🚀 [Getting started guide using ArcticTraining](https://github.com/snowflakedb/ArcticTraining/tree/main/projects/excot_dpo)
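For quick experimentation, here is a minimal inference sketch. It assumes the standard Hugging Face `transformers` generation API; the schema/question prompt format shown is illustrative only, and the exact format used in training is defined in the getting started guide above.

```python
# Minimal inference sketch for the 32B variant (prompt format is illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Snowflake/Qwen-2.5-coder-Arctic-ExCoT-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A schema-grounded question; the model reasons step by step (CoT)
# before emitting the final SQL.
prompt = (
    "Database schema:\n"
    "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, dept TEXT);\n\n"
    "Question: What is the average salary in the 'sales' department?\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```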
## Evaluation results
| Model | BIRD Ex% Dev | BIRD Ex% Test |
|---------------------------------------|--------------|---------------|
| Arctic-ExCoT-70B (LLaMA 3.1 70B) | **68.51** | 68.53 |
| Arctic-ExCoT-32B (Qwen-2.5-Coder 32B) | 68.25 | 68.19 |
| XiYanSQL-QwenCoder* | 67.01 | **69.03** |
| OpenAI GPT-4o | 54.04 | – |
| OpenAI GPT-4 | 46.35 | 54.89 |
| Anthropic Claude 3.5-Sonnet | 50.13 | – |
| Claude-2 | 42.70 | 49.02 |
| OpenAI o1-mini | 52.41 | – |
| OpenAI o3-mini | 53.72 | – |
| Mistral-large-2407 (123B) | 53.52 | 55.84 |
| DeepSeek-V2 (236B) | 56.13 | 56.68 |
Top Single-Model, Single-Inference Results on the BIRD Leaderboard (as of March 25, 2025). *XiYanSQL-QwenCoder: there are reported difficulties reproducing its numbers [[1]](https://github.com/XGenerationLab/XiYanSQL-QwenCoder/issues/4) [[2]](https://modelscope.cn/models/XGenerationLab/XiYanSQL-QwenCoder-32B-2412/feedback/issueDetail/22708).