Introduction
Bridging Language and Data in the Enterprise Sphere: In the realm of enterprise data management, SQL
databases have long stood as the backbone, housing vast quantities of invaluable information. As
enterprises increasingly lean on business intelligence (BI) tools to extract insights from these
data repositories, the quest for more intuitive and accessible querying methods gains momentum. The
emerging solution? Leveraging Large Language Models (LLMs) to interact with SQL databases using
natural language. This innovative
approach, powered by the latest advancements in NLP, promises to transform our interaction with
data. However, as groundbreaking as it sounds, integrating LLMs into SQL querying is riddled with
challenges, making it a complex and intriguing endeavor.
The Challenges of LLMs for SQL Query Generation:
While LLMs possess an impressive grasp of SQL, enabling them to construct complex queries, several
non-trivial issues arise in actual implementations:
- The Hallucination Dilemma: A significant hurdle is the tendency of LLMs to ‘hallucinate’ – inventing fictitious tables or fields, or generating SQL queries that are incompatible with the actual database structure. This misalignment between the model’s output and the database’s reality poses a major obstacle to ensuring the validity and reliability of the generated queries.
- The Context Window Constraint: LLMs operate within a context window, a limit on the amount of text they can process at a time. This becomes particularly problematic with SQL databases, which often have extensive and complex schemas. The challenge lies in effectively grounding the LLM in the database’s reality without overwhelming its context window.
- Error Handling and Unpredictability: Despite their capabilities, LLMs are not infallible. The SQL queries they generate may occasionally be erroneous or yield unexpected results. This inconsistency raises critical questions about dependability: do we accept these limitations, or devise mechanisms to counter them?
The High-Level Solution
Mimicking Human Expertise for Enhanced LLM Performance: To effectively address the
challenges faced by Large Language Models (LLMs) in generating SQL queries, we can draw inspiration
from the strategies employed by human data analysts. By emulating the steps a data analyst takes in
querying SQL databases, we can guide LLMs towards more accurate and efficient query generation.
Let’s explore these human-inspired strategies:
- Sample Queries and Schema Familiarization: Like a data analyst who first familiarizes themselves with the database by executing sample queries and reviewing table schemas, an LLM can benefit from a similar approach. This involves exposing the LLM to the structure and a snapshot of the data it will be working with. By understanding the actual makeup of the database – its tables, fields, and typical data entries – the LLM can generate queries that are more aligned with the reality of the database’s structure.
- Dealing with Information Overload and Context Window Limitations: Data analysts typically don’t analyze all data at once; they often start with a subset, examining top rows or summary statistics. This approach can be mimicked to overcome the LLM’s context window limitations. Instead of feeding the LLM the entire database, we can provide it with a distilled version highlighting key aspects of the data. This selective exposure keeps the LLM ‘grounded’ in the database’s reality without overloading its processing capabilities.
- Error Handling and Iterative Learning: Just as a data analyst doesn’t abandon their task after encountering an error, LLMs should also be equipped to learn from mistakes. When an LLM generates an incorrect SQL query, mechanisms can be implemented to analyze the error, feed it back to the model, and iteratively improve the query (a minimal sketch of such a retry loop follows this list). This process not only refines the LLM’s query-generation capabilities but also enhances its ability to adapt to different databases and query requirements.
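To make the last strategy concrete, here is a minimal sketch of such a retry loop. It is illustrative only: llm_generate_sql is a hypothetical stand-in for whatever LLM call you use (it is not part of any library), and the loop simply feeds each execution error back into the next generation attempt.
import sqlite3

def run_with_retries(llm_generate_sql, question, db_path, max_attempts=3):
    # llm_generate_sql(question, feedback) is a hypothetical callable that
    # returns a SQL string; any LLM client can play this role.
    feedback = None
    for attempt in range(max_attempts):
        sql = llm_generate_sql(question, feedback)
        try:
            with sqlite3.connect(db_path) as conn:
                return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            # Ground the next attempt in the concrete error message.
            feedback = f"Previous query:\n{sql}\nfailed with: {err}"
    raise RuntimeError(f"No valid query after {max_attempts} attempts")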

The Path Forward:
By adopting these human-inspired approaches, we can significantly improve the effectiveness of LLMs
in SQL query generation. These strategies not only address the specific challenges of hallucination,
context window constraints, and error handling but also pave the way for LLMs to operate more
autonomously and reliably in complex data environments. The journey towards integrating LLMs into
SQL querying, thus, becomes a blend of technological innovation and cognitive emulation, holding
promise for a new era of data interaction.
LangChain Agents
LangChain has a SQL Agent which provides a more flexible way of interacting with SQL Databases than a
chain. The main advantages of using the SQL Agent are:
- It can answer questions based on the database’s schema as well as on the database’s content (like describing a specific table).
- It can recover from errors by running a generated query, catching the traceback, and regenerating it correctly.
- It can query the database as many times as needed to answer the user’s question.
To initialize the agent we’ll use the create_sql_agent constructor.
This agent uses the SQLDatabaseToolkit, which contains tools to:
- list the available tables (sql_db_list_tables)
- retrieve table schemas and sample rows (sql_db_schema)
- execute SQL queries (sql_db_query)
- check queries for common mistakes before running them (sql_db_query_checker)
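You can enumerate these tools yourself. A small sketch, assuming db and gpt have already been set up as in Step 2 below:
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

toolkit = SQLDatabaseToolkit(db=db, llm=gpt)  # db and gpt are defined in Step 2
for tool in toolkit.get_tools():
    print(tool.name)
# Expected names: sql_db_query, sql_db_schema, sql_db_list_tables, sql_db_query_checker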
Coding with an Agent
Github Link – here
Step 1: Creating a Database Schema
To simulate a real-world scenario, we will create a simple e-commerce database schema
encompassing five entities: customer, seller, product, cart, and payment. Each entity will have
its own attributes, and we will establish foreign key relationships to introduce complexity.
This schema will serve as our testing ground for the LLM’s SQL query generation.
import sqlite3
db_path = '/content/sample.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
script = '''
CREATE TABLE Cart (
    Cart_id TEXT PRIMARY KEY NOT NULL
);
CREATE TABLE Customer (
    Customer_id TEXT PRIMARY KEY NOT NULL,
    c_pass TEXT NOT NULL,
    Name TEXT NOT NULL,
    Address TEXT NOT NULL,
    Pincode INTEGER NOT NULL,
    Phone_number_s INTEGER NOT NULL,
    Cart_id TEXT NOT NULL,
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Seller (
    Seller_id TEXT PRIMARY KEY NOT NULL,
    s_pass TEXT NOT NULL,
    Name TEXT NOT NULL,
    Address TEXT NOT NULL
);
CREATE TABLE Seller_Phone_num (
    Phone_num INTEGER NOT NULL,
    Seller_id TEXT NOT NULL,
    PRIMARY KEY (Phone_num, Seller_id),
    FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE CASCADE
);
CREATE TABLE Payment (
    payment_id TEXT PRIMARY KEY NOT NULL,
    payment_date DATE NOT NULL,
    Payment_type TEXT NOT NULL,
    Customer_id TEXT NOT NULL,
    Cart_id TEXT NOT NULL,
    total_amount REAL,
    FOREIGN KEY (Customer_id) REFERENCES Customer(Customer_id),
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Product (
    Product_id TEXT PRIMARY KEY NOT NULL,
    Type TEXT NOT NULL,
    Color TEXT NOT NULL,
    P_Size TEXT NOT NULL,
    Gender TEXT NOT NULL,
    Commission INTEGER NOT NULL,
    Cost INTEGER NOT NULL,
    Quantity INTEGER NOT NULL,
    Seller_id TEXT,
    FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE SET NULL
);
CREATE TABLE Cart_item (
    Quantity_wished INTEGER NOT NULL,
    Date_Added DATE NOT NULL,
    Cart_id TEXT NOT NULL,
    Product_id TEXT NOT NULL,
    purchased TEXT DEFAULT 'NO',
    PRIMARY KEY (Cart_id, Product_id),
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id),
    FOREIGN KEY (Product_id) REFERENCES Product(Product_id)
);
'''
cursor.executescript(script)
conn.commit()
conn.close()
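The script above creates empty tables, while the agent logs later in this post show sample data (customers cid100–cid102, carts crt1011–crt1013, products pid1001–pid1003, and sellers such as aman and sara). If you want to reproduce those runs, here is a minimal seeding sketch with values copied from the logs; fields the logs never reveal (seller passwords, seller-id assignments, cart contents) are illustrative guesses, chosen so that cid101’s cart total comes out to 5000 as in the log.
conn = sqlite3.connect(db_path)
conn.executescript('''
INSERT INTO Cart VALUES ('crt1011'), ('crt1012'), ('crt1013');
INSERT INTO Customer VALUES
    ('cid100', 'ABCM1235', 'rajat', 'G-453', 632014, 9893135876, 'crt1011'),
    ('cid101', 'XYZ5678', 'john', 'F-789', 560023, 9876543210, 'crt1012'),
    ('cid102', 'PQRS9876', 'jane', 'A-321', 400001, 8765432109, 'crt1013');
-- Seller passwords and the name-to-id mapping below are guesses.
INSERT INTO Seller VALUES
    ('sid100', 's_pass1', 'aman', 'delhi cmc'),
    ('sid101', 's_pass2', 'sara', 'mumbai abc'),
    ('sid102', 's_pass3', 'peter', 'kolkata xyz');
INSERT INTO Seller_Phone_num VALUES
    (9943336206, 'sid100'), (9876543210, 'sid101'), (8765432109, 'sid102');
INSERT INTO Product VALUES
    ('pid1001', 'jeans', 'red', '32', 'M', 10, 10005, 20, 'sid100'),
    ('pid1002', 'shirt', 'blue', 'L', 'F', 5, 5000, 15, 'sid101'),
    ('pid1003', 'shoes', 'black', '40', 'M', 8, 8000, 25, 'sid102');
-- One cart item so that cid101's cart (crt1012) sums to 5000, as in the log.
INSERT INTO Cart_item VALUES (1, '2024-01-01', 'crt1012', 'pid1002', 'NO');
''')
conn.commit()
conn.close()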
Step 2: Installing Dependencies and Connecting to the Database
We’ll need to install several dependencies to facilitate the interaction between our code and the
database, as well as to integrate the LLM:
- langchain: A toolkit that aids in integrating LLMs with various applications.
- openai: The OpenAI Python package, essential for interacting with GPT models.
- langchain-openai: An extension to langchain for OpenAI-specific functionality.
- langchain-experimental: Provides experimental langchain features and integrations.
- faiss-cpu: A vector similarity search library, used later as the vector store for the few-shot example selector.
We will use the SQLDatabase wrapper provided by langchain to connect to our SQL database, and ChatOpenAI to load the model.
!pip install -q langchain openai langchain-openai langchain-experimental faiss-cpu
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain_openai import ChatOpenAI
from google.colab import userdata

# Read the API key from Colab's secret storage.
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Connect to the SQLite database created in Step 1.
db = SQLDatabase.from_uri("sqlite:////content/sample.db")
gpt = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo", temperature=0)
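Before building the agent, a quick sanity check confirms the connection works; SQLDatabase exposes the dialect, the usable table names, and a run method for raw queries:
# Confirm the dialect, the tables, and a few sample rows.
print(db.dialect)                    # sqlite
print(db.get_usable_table_names())   # ['Cart', 'Cart_item', 'Customer', ...]
print(db.run("SELECT * FROM Customer LIMIT 3;"))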
Step 3: Setting Up an Agent and Querying
To utilize the LLM for SQL generation, we will set up an agent executor using langchain’s
create_sql_agent. This agent will serve as the intermediary between our natural language prompts
and the SQL database, translating our queries into executable SQL statements.
- Creating the SQL Agent: Use create_sql_agent to initialize our agent with the configuration needed to interact with our e-commerce database schema.
- Executing Queries: We will use the agent’s run method to input natural language prompts. The agent will then process these prompts, generate SQL queries, and execute them against our database. The output will be the result of these queries, demonstrating the LLM’s ability to interact with SQL databases.
With the database connection and model in place, we wrap them in a SQLDatabaseToolkit and create the agent executor.
toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
agent_executor = create_sql_agent(
    llm=gpt,
    toolkit=toolkit,
    verbose=True,
    agent_type="openai-tools",
)
question = "what is the cart amount for customer with customer id cid101"
agent_executor.run(question)
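Note that run is LangChain’s older single-string entry point; in more recent releases it is deprecated in favor of invoke. The equivalent call would be:
agent_executor.invoke({"input": question})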
Logs
> Entering new AgentExecutor chain...
Invoking: `sql_db_list_tables` with `{}`
Cart, Cart_item, Customer, Payment, Product, Seller, Seller_Phone_num
Invoking: `sql_db_schema` with `Cart, Customer`
/*
3 rows from Cart table:
Cart_id
crt1011
crt1012
crt1013
*/
/*
3 rows from Customer table:
Customer_id c_pass Name Address Pincode Phone_number_s Cart_id
cid100 ABCM1235 rajat G-453 632014 9893135876 crt1011
cid101 XYZ5678 john F-789 560023 9876543210 crt1012
cid102 PQRS9876 jane A-321 400001 8765432109 crt1013
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
Error: (sqlite3.OperationalError) no such column: p.Price
[SQL: SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, c.Pincode,
c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN Cart_item ci
ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id WHERE
c.Customer_id = 'cid101']
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Invoking: `sql_db_schema` with `Product`
/*
3 rows from Product table:
Product_id Type Color P_Size Gender Commission Cost Quantity Seller_id
pid1001 jeans red 32 M 10 10005 20 sid100
pid1002 shirt blue L F 5 5000 15 sid101
pid1003 shoes black 40 M 8 8000 25 sid102
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Cost) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
[('crt1012', 'cid101', 'john', 'F-789', 560023, 9876543210, 5000)]The cart amount
for the customer with customer id 'cid101' is 5000.
> Finished chain.
"The cart amount for the customer with customer id 'cid101' is 5000."
Coding Using a Dynamic Few-Shot Prompt
Github Link – here
Implementing a dynamic few-shot prompt involves several steps. This approach helps optimize the
performance of our agent by incorporating domain-specific knowledge and context relevant to the
user’s input. Here’s a breakdown of these steps:
Step 1: Creating Example Pairs
Firstly, we need a set of user input and corresponding SQL query examples. These examples should
cover a range of typical queries related to our e-commerce schema, providing a base for the LLM
to learn from.
examples = [
    {"input": "List all customers.", "query": "SELECT * FROM Customer;"},
    {
        "input": "List all the products from a particular seller with seller_id 'sid100'",
        "query": "SELECT Product_id, Type, Color, P_Size, Gender, Commission, Cost, Quantity FROM Product WHERE Seller_id = 'sid100';",
    },
    {
        "input": "List of all sellers with name address and their phone number",
        "query": "SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id;",
    },
    {
        # Customer links to Cart_item directly via its Cart_id column.
        "input": "List of all the items in the cart of a customer whose Name is 'jane'",
        "query": "SELECT Cart_item.Quantity_wished, Cart_item.Date_Added, Cart_item.Cart_id, Cart_item.Product_id, Cart_item.purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Customer.Name = 'jane';",
    },
    {
        "input": "List all customers whose address is Canada.",
        "query": "SELECT * FROM Customer WHERE Address = 'Canada';",
    },
    {
        "input": "How many customers have purchased the product with Product_id pid1002",
        "query": "SELECT COUNT(DISTINCT Customer.Customer_id) AS Num_Customers_Purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Cart_item.Product_id = 'pid1002' AND Cart_item.purchased = 'YES';",
    },
    {
        "input": "Find the total number of Products.",
        "query": "SELECT COUNT(*) FROM Product;",
    },
    {
        "input": "How many customers are there",
        "query": 'SELECT COUNT(*) AS Total_Customers FROM "Customer";',
    },
]
Step 2: Example Selector Creation
We then create an ExampleSelector, specifically a SemanticSimilarityExampleSelector. This
selector uses semantic search to find the most relevant examples from our set based on the
user’s input.
- Semantic Search: The selector will analyze the semantics of the user’s input and compare it with the example pairs to find the most similar ones.
- Embeddings and Vector Store: It utilizes embeddings and a vector store to perform this semantic analysis efficiently.
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
    FAISS,
    k=5,
    input_keys=["input"],
)
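Before wiring the selector into a prompt, it can be reassuring to see what it actually retrieves. Its select_examples method returns the k closest example dicts for a given input:
# Inspect the examples judged most similar to a new question.
for ex in example_selector.select_examples({"input": "How many sellers are there?"}):
    print(ex["input"], "->", ex["query"])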
Step 3: FewShotPromptTemplate Creation
Next, we create a FewShotPromptTemplate. This template will integrate the selected examples into
a structured prompt that the LLM can understand.
- Components:
  - Example Selector: Incorporates the previously created SemanticSimilarityExampleSelector.
  - Example Prompt Formatting: A format for how each example should be presented within the prompt.
  - Prefix and Suffix: Text to be added before and after the formatted examples to frame the prompt contextually.
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotPromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
system_prefix = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If the question does not seem related to the database, just return "I don't know" as the answer.

Here are some examples of user inputs and their corresponding SQL queries:"""
few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"
    ),
    input_variables=["input", "dialect", "top_k"],
    prefix=system_prefix,
    suffix="",
)
Step 4: Constructing the Full Prompt
As we are using an OpenAI tools agent, our full prompt should be structured as a chat prompt.
- Human Message Template: This is where the user’s natural language input will be placed.
- Agent Scratchpad: A MessagesPlaceholder for the agent’s intermediate responses and workings.
- Integration of the Few-Shot Prompt: The dynamically created few-shot prompt from Step 3 is incorporated as the system message, providing context and examples for the agent’s processing.
full_prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate(prompt=few_shot_prompt),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

# Example formatted prompt
prompt_val = full_prompt.invoke(
    {
        "input": "List of all sellers with name address and their phone number",
        "top_k": 5,
        "dialect": "SQLite",
        "agent_scratchpad": [],
    }
)
print(prompt_val.to_string())
agent = create_sql_agent(
    llm=gpt,
    db=db,
    prompt=full_prompt,
    verbose=True,
    agent_type="openai-tools",
)

agent.invoke({"input": "List of all sellers with name address and their phone number"})
Output
> Entering new AgentExecutor chain...
Invoking: `sql_db_query` with `SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num
FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id`
[('aman', 'delhi cmc', 9943336206), ('sara', 'mumbai abc', 9876543210), ('peter', 'kolkata xyz',
8765432109), ('lisa', 'bangalore pqr', 7654321098), ('mike', 'chennai lmn', 6543210987)]Here is a list
of all sellers with their name, address, and phone number:
1. Seller Name: aman
Address: delhi cmc
Phone Number: 9943336206
2. Seller Name: sara
Address: mumbai abc
Phone Number: 9876543210
3. Seller Name: peter
Address: kolkata xyz
Phone Number: 8765432109
4. Seller Name: lisa
Address: bangalore pqr
Phone Number: 7654321098
5. Seller Name: mike
Address: chennai lmn
Phone Number: 6543210987
> Finished chain.
Conclusion
In conclusion, the integration of Large Language Models (LLMs) with SQL databases through dynamic
few-shot prompting represents a significant advancement in the field of data management and natural
language processing. By emulating human data analysis techniques, such as using example selectors
and few-shot prompts, we can significantly enhance the accuracy and relevance of the SQL queries
generated by these models. This approach not only addresses the challenges of hallucination and
context window limitations but also leverages the evolving capabilities of LLMs to make data
querying more intuitive and accessible for a wider range of users. As we continue to refine these
methods, the potential for LLMs to revolutionize how we interact with and extract insights from vast
data repositories becomes increasingly evident, paving the way for a future where complex data
operations are seamlessly conducted through natural language interfaces.
Where AI meets SQL, queries transform into conversations, turning data exploration into a dialogue with knowledge itself.