metadata

license: apache-2.0
language:
  - en
  - zh
metrics:
  - accuracy
library_name: transformers
tags:
  - text2sql

text2sql-8b-instruct-v1

1. Summary

it is a natural language-to-SQL conversion model optimized specifically for Chinese and English users. It is based on the llama-3-chinese-8b-instruct-v3 model. We used the latest optimization algorithms to improve the performance of the model, especially in handling complex queries and multi-table joins.

1.1 characteristics

Bilingual support: Ability to handle natural language queries in both Chinese and English languages.
High accuracy: After a large number of tests on actual database queries, it has been proved that the SQL statements generated have high accuracy.

1.2 training data

Training data for the model comes from multiple sources, including:

Open source databases (such as WikiSQL, Spider)
Internally generated dataset covering a variety of query types and complexities
User feedback data for continuous improvement of model performance

Training data is strictly screened and cleaned to ensure data quality and diversity.

1.3 test results

Test results on multiple benchmark datasets show the model exceeds other existing models in terms of accuracy and generation efficiency. For example:

On the WikiSQL dataset, the model achieved an execution accuracy rate of 87.5%.
On the Spider dataset, the model achieved an execution accuracy rate of 95.3%.

These results show the model has significant advantages in handling complex queries and multi-table joins.

2. Usage:

Please upgrade the transformers package to ensure it supports Llama3 models. The current version we are using is 4.41.2.

# Use a pipeline as a high-level helper
from transformers import pipeline
import torch
model_id = "xbrain/text2sql-8b-instruct-v1"


messages = [
    {"role": "system", 
     "content": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n\"\n##Instruction:\n database contains tables such as table_name_30. Table table_name_30 has columns such as nfl_team, draft_year."},
    {"role": "user", 
     "content": "###Input:\nIn 1978 what is the NFL team?\n\n###Response:"},
]
pipe_msg = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",)

outputs = pipe_msg(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])

Here are some exciting examples: Example 1:

model input： {"role": "system", "content": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n"\n##Instruction:\ndog_kennels contains tables such as Breeds, Charges, Sizes, Treatment_Types, Owners, Dogs, Professionals, Treatments. Table Breeds has columns such as breed_code, breed_name. breed_code is the primary key.\nTable Charges has columns such as charge_id, charge_type, charge_amount. charge_id is the primary key.\nTable Sizes has columns such as size_code, size_description. size_code is the primary key.\nTable Treatment_Types has columns such as treatment_type_code, treatment_type_description. treatment_type_code is the primary key.\nTable Owners has columns such as owner_id, first_name, last_name, street, city, state, zip_code, email_address, home_phone, cell_number. owner_id is the primary key.\nTable Dogs has columns such as dog_id, owner_id, abandoned_yn, breed_code, size_code, name, age, date_of_birth, gender, weight, date_arrived, date_adopted, date_departed. dog_id is the primary key.\nTable Professionals has columns such as professional_id, role_code, first_name, street, city, state, zip_code, last_name, email_address, home_phone, cell_number. professional_id is the primary key.\nTable Treatments has columns such as treatment_id, dog_id, professional_id, treatment_type_code, date_of_treatment, cost_of_treatment. treatment_id is the primary key.\nThe owner_id of Dogs is the foreign key of owner_id of Owners.\nThe owner_id of Dogs is the foreign key of owner_id of Owners.\nThe size_code of Dogs is the foreign key of size_code of Sizes.\nThe breed_code of Dogs is the foreign key of breed_code of Breeds.\nThe dog_id of Treatments is the foreign key of dog_id of Dogs.\nThe professional_id of Treatments is the foreign key of professional_id of Professionals.\nThe treatment_type_code of Treatments is the foreign key of treatment_type_code of Treatment_Types.\n\n"}, {"role": "user", "content": "###Input:\n列出每种治疗的费用和相应的治疗类型描述。\n\n###Response:"},
model output: {'role': 'assistant', 'content': 'SELECT T1.cost_of_treatment, T2.treatment_type_description FROM Treatments AS T1 JOIN treatment_types AS T2 ON T1.treatment_type_code = T2.treatment_type_code'}

Example 2:

model input： {"role": "system", "content": "I want you to act as a SQL terminal in front of an example database, you need only to return the sql command to me.Below is an instruction that describes a task, Write a response that appropriately completes the request.\n"\n##Instruction:\n database contains tables such as medication, vitalperiodic, diagnosis, cost, lab, microlab, intakeoutput, treatment, patient, allergy. Table medication has columns such as medicationid, patientunitstayid, drugname, dosage, routeadmin, drugstarttime, drugstoptime. Table vitalperiodic has columns such as vitalperiodicid, patientunitstayid, temperature, sao2, heartrate, respiration, systemicsystolic, systemicdiastolic, systemicmean, observationtime. Table diagnosis has columns such as diagnosisid, patientunitstayid, diagnosisname, diagnosistime, icd9code. Table cost has columns such as costid, uniquepid, patienthealthsystemstayid, eventtype, eventid, chargetime, cost. Table lab has columns such as labid, patientunitstayid, labname, labresult, labresulttime. Table microlab has columns such as microlabid, patientunitstayid, culturesite, organism, culturetakentime. Table intakeoutput has columns such as intakeoutputid, patientunitstayid, cellpath, celllabel, cellvaluenumeric, intakeoutputtime. Table treatment has columns such as treatmentid, patientunitstayid, treatmentname, treatmenttime. Table patient has columns such as uniquepid, patienthealthsystemstayid, patientunitstayid, gender, age, ethnicity, hospitalid, wardid, admissionheight, admissionweight, dischargeweight, hospitaladmittime, hospitaladmitsource, unitadmittime, unitdischargetime, hospitaldischargetime, hospitaldischargestatus. Table allergy has columns such as allergyid, patientunitstayid, drugname, allergyname, allergytime. \n"}, {"role": "user", "content": "###Input:\n did the first urine, catheter specimen microbiology test of patient 031-15666 in the last month show any organism?\n\n###Response:"},
model output: {'role': 'assistant', 'content': "SELECT COUNT(*) > 0 FROM microlab WHERE microlab.patientunitstayid IN (SELECT patient.patientunitstayid FROM patient WHERE patient.patienthealthsystemstayid IN (SELECT patient.patienthealthsystemstayid FROM patient WHERE patient.uniquepid = '031-15666')) AND microlab.culturesite = 'urine, catheter specimen' AND DATETIME(microlab.culturetakentime,'start of month') = DATETIME(CURRENT_TIME(),'start of month', '-1 month') ORDER BY microlab.culturetakentime LIMIT 1"}

3. Ethical Considerations

While fine-tuned for text to sql, this model inherits the ethical considerations of the base Llama 3 model. Use responsibly and implement additional safeguards as needed for your application.

4. Availability

The model is available through:

Hugging Face

For full details on responsible use, ethical considerations, and latest benchmarks, please refer to the official Llama 3 documentation.