arxiv:2503.02240

OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale

Published on Mar 4

Abstract

Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.

Community

Hi everyone,

We are thrilled to introduce SynSQL-2.5M, a high-quality synthetic text-to-SQL dataset featuring:

  • 2,544,390 diverse and complex text-to-SQL samples, each consisting of a <database, question, SQL query, chain-of-thought solution> quad.
  • Coverage of 16,583 synthetic databases from realistic scenarios.
  • Four SQL complexity levels (simple, moderate, complex, and highly complex), ranging from single-table queries to advanced multi-table joins, functions, and common table expressions.
  • A variety of linguistic styles in natural language questions: formal, colloquial, imperative, interrogative, descriptive, concise, vague, metaphorical, and conversational.
  • Chain-of-thought (CoT) solutions provided for all data samples.
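
To make the sample layout concrete, here is a minimal Python sketch of one <database, question, SQL query, chain-of-thought solution> quad, together with a basic check that the SQL runs against its database. The field names, the `TextToSQLSample` class, the `sql_executes` helper, and the use of SQLite are assumptions made for illustration. They are not the released dataset's actual schema or tooling.

```python
import sqlite3
from dataclasses import dataclass

# The four complexity levels named in the list above.
COMPLEXITY_LEVELS = ("simple", "moderate", "complex", "highly complex")


@dataclass
class TextToSQLSample:
    """Hypothetical container for one sample; field names are illustrative."""
    schema_ddl: str   # CREATE TABLE statements describing the database
    question: str     # natural language question
    sql: str          # SQL query that answers the question
    cot: str          # chain-of-thought solution leading to the SQL
    complexity: str   # one of COMPLEXITY_LEVELS


def sql_executes(sample: TextToSQLSample) -> bool:
    """Return True if the sample's SQL runs against its own (empty) schema.

    Assumes SQLite as the engine; this only checks executability,
    not whether the query actually answers the question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(sample.schema_ddl)
        conn.execute(sample.sql).fetchall()
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# A hand-written toy sample (not taken from SynSQL-2.5M).
example = TextToSQLSample(
    schema_ddl=(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, salary REAL);"
    ),
    question="Which employees earn more than 50000?",
    sql="SELECT name FROM employees WHERE salary > 50000;",
    cot="Filter the employees table on salary, then return the name column.",
    complexity="simple",
)
```

Calling `sql_executes(example)` builds the schema in an in-memory database and runs the query. An executability check of this kind is a cheap first filter before fine-tuning on samples, though it says nothing about semantic correctness.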

As of March 2025, SynSQL-2.5M is the largest and most diverse synthetic text-to-SQL dataset we are aware of, and the first at million scale. We encourage researchers, practitioners, and data enthusiasts to explore the dataset and build models with it.

Let's dive in!

