Introduction
Bridging Language and Data in the Enterprise Sphere: In the realm of enterprise data management, SQL
databases have long stood as the backbone, housing vast quantities of invaluable information. As
enterprises increasingly lean on business intelligence (BI) tools to extract insights from these
data repositories, the quest for more intuitive and accessible querying methods gains momentum. The
emerging solution? Leveraging Large Language Models (LLMs) to interact with SQL databases using
natural language. This innovative
approach, powered by the latest advancements in NLP, promises to transform our interaction with
data. However, as groundbreaking as it sounds, integrating LLMs into SQL querying is riddled with
challenges, making it a complex and intriguing endeavor.
The Challenges of LLMs for SQL Query Generation:
While LLMs possess an impressive grasp of SQL, enabling them to construct complex queries, several
non-trivial issues arise in actual implementations:
- The Hallucination Dilemma: A significant hurdle is the tendency of LLMs to ‘hallucinate’ – inventing fictitious tables or fields, or generating SQL queries that are incompatible with the actual database structure. This misalignment between the model’s output and the database’s reality poses a major obstacle to ensuring the validity and reliability of the generated queries.
- The Context Window Constraint: LLMs operate within a context window, a limit on the amount of text they can process at a time. This becomes particularly problematic with SQL databases, which often have extensive and complex schemas. The challenge lies in effectively grounding the LLM in the database’s reality without overwhelming its context window.
- Error Handling and Unpredictability: Despite their capabilities, LLMs are not infallible. The SQL queries they generate may occasionally be erroneous or yield unexpected results. This inconsistency raises critical questions about dependability: do we accept these limitations, or devise mechanisms to counter them?
The High-Level Solution
Mimicking Human Expertise for Enhanced LLM Performance: To effectively address the
challenges faced by Large Language Models (LLMs) in generating SQL queries, we can draw inspiration
from the strategies employed by human data analysts. By emulating the steps a data analyst takes in
querying SQL databases, we can guide LLMs towards more accurate and efficient query generation.
Let’s explore these human-inspired strategies:
- Sample Queries and Schema Familiarization: Like a data analyst who first familiarizes themselves with the database by executing sample queries and reviewing table schemas, an LLM can benefit from a similar approach. This involves exposing the LLM to the structure and a snapshot of the data it will be working with. By understanding the actual makeup of the database – its tables, fields, and typical data entries – the LLM can generate queries that are more aligned with the reality of the database’s structure.
- Dealing with Information Overload and Context Window Limitations: Data analysts typically don’t analyze all data at once; they often start with a subset, examining top rows or summary statistics. This approach can be mimicked to overcome the LLM’s context window limitations. Instead of feeding the LLM the entire database, we can provide it with a distilled version highlighting key aspects of the data. This selective exposure keeps the LLM ‘grounded’ in the database’s reality without overloading its processing capabilities.
- Error Handling and Iterative Learning: Just as a data analyst doesn’t abandon their task after encountering an error, LLMs should also be equipped to learn from mistakes. When an LLM generates an incorrect SQL query, mechanisms can be implemented to analyze the error, feed it back to the model, and iteratively improve the query (a minimal sketch of such a retry loop follows this list). This process not only refines the LLM’s query-generation capabilities but also enhances its ability to adapt to different databases and query requirements.
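To make the last strategy concrete, here is a minimal sketch of such a retry loop. It is illustrative only: llm_generate_sql is a hypothetical stand-in for whatever LLM call you use (it is not part of any library), and the loop simply feeds each execution error back into the next generation attempt.
import sqlite3

def run_with_retries(llm_generate_sql, question, db_path, max_attempts=3):
    # llm_generate_sql(question, feedback) is a hypothetical callable that
    # returns a SQL string; any LLM client can play this role.
    feedback = None
    for attempt in range(max_attempts):
        sql = llm_generate_sql(question, feedback)
        try:
            with sqlite3.connect(db_path) as conn:
                return conn.execute(sql).fetchall()
        except sqlite3.Error as err:
            # Ground the next attempt in the concrete error message.
            feedback = f"Previous query:\n{sql}\nfailed with: {err}"
    raise RuntimeError(f"No valid query after {max_attempts} attempts")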

The Path Forward:
By adopting these human-inspired approaches, we can significantly improve the effectiveness of LLMs
in SQL query generation. These strategies not only address the specific challenges of hallucination,
context window constraints, and error handling but also pave the way for LLMs to operate more
autonomously and reliably in complex data environments. The journey towards integrating LLMs into
SQL querying, thus, becomes a blend of technological innovation and cognitive emulation, holding
promise for a new era of data interaction.
LangChain Agents
LangChain has a SQL Agent which provides a more flexible way of interacting with SQL Databases than a
chain. The main advantages of using the SQL Agent are:
- It can answer questions based on the database’s schema as well as on the database’s content (like describing a specific table).
- It can recover from errors by running a generated query, catching the traceback, and regenerating it correctly.
- It can query the database as many times as needed to answer the user’s question.
To initialize the agent we’ll use the create_sql_agent constructor.
This agent uses the SQLDatabaseToolkit, which contains tools to:
- list the available tables (sql_db_list_tables)
- retrieve table schemas and sample rows (sql_db_schema)
- execute SQL queries (sql_db_query)
- check queries for common mistakes before running them (sql_db_query_checker)
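You can enumerate these tools yourself. A small sketch, assuming db and gpt have already been set up as in Step 2 below:
from langchain.agents.agent_toolkits import SQLDatabaseToolkit

toolkit = SQLDatabaseToolkit(db=db, llm=gpt)  # db and gpt are defined in Step 2
for tool in toolkit.get_tools():
    print(tool.name)
# Expected names: sql_db_query, sql_db_schema, sql_db_list_tables, sql_db_query_checker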
Coding with an Agent
Github Link – here
Step 1: Creating a Database Schema
To simulate a real-world scenario, we will create a simple e-commerce database schema
encompassing five entities: customer, seller, product, cart, and payment. Each entity will have
its own attributes, and we will establish foreign key relationships to introduce complexity.
This schema will serve as our testing ground for the LLM’s SQL query generation.
import sqlite3
db_path = '/content/sample.db'
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
script = '''
CREATE TABLE Cart (
    Cart_id TEXT PRIMARY KEY NOT NULL
);
CREATE TABLE Customer (
    Customer_id TEXT PRIMARY KEY NOT NULL,
    c_pass TEXT NOT NULL,
    Name TEXT NOT NULL,
    Address TEXT NOT NULL,
    Pincode INTEGER NOT NULL,
    Phone_number_s INTEGER NOT NULL,
    Cart_id TEXT NOT NULL,
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Seller (
    Seller_id TEXT PRIMARY KEY NOT NULL,
    s_pass TEXT NOT NULL,
    Name TEXT NOT NULL,
    Address TEXT NOT NULL
);
CREATE TABLE Seller_Phone_num (
    Phone_num INTEGER NOT NULL,
    Seller_id TEXT NOT NULL,
    PRIMARY KEY (Phone_num, Seller_id),
    FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE CASCADE
);
CREATE TABLE Payment (
    payment_id TEXT PRIMARY KEY NOT NULL,
    payment_date DATE NOT NULL,
    Payment_type TEXT NOT NULL,
    Customer_id TEXT NOT NULL,
    Cart_id TEXT NOT NULL,
    total_amount REAL,
    FOREIGN KEY (Customer_id) REFERENCES Customer(Customer_id),
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id)
);
CREATE TABLE Product (
    Product_id TEXT PRIMARY KEY NOT NULL,
    Type TEXT NOT NULL,
    Color TEXT NOT NULL,
    P_Size TEXT NOT NULL,
    Gender TEXT NOT NULL,
    Commission INTEGER NOT NULL,
    Cost INTEGER NOT NULL,
    Quantity INTEGER NOT NULL,
    Seller_id TEXT,
    FOREIGN KEY (Seller_id) REFERENCES Seller(Seller_id) ON DELETE SET NULL
);
CREATE TABLE Cart_item (
    Quantity_wished INTEGER NOT NULL,
    Date_Added DATE NOT NULL,
    Cart_id TEXT NOT NULL,
    Product_id TEXT NOT NULL,
    purchased TEXT DEFAULT 'NO',
    PRIMARY KEY (Cart_id, Product_id),
    FOREIGN KEY (Cart_id) REFERENCES Cart(Cart_id),
    FOREIGN KEY (Product_id) REFERENCES Product(Product_id)
);
'''
cursor.executescript(script)
conn.commit()
conn.close()
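The script above creates empty tables, while the agent logs later in this post show sample data (customers cid100–cid102, carts crt1011–crt1013, products pid1001–pid1003, and sellers such as aman and sara). If you want to reproduce those runs, here is a minimal seeding sketch with values copied from the logs; fields the logs never reveal (seller passwords, seller-id assignments, cart contents) are illustrative guesses, chosen so that cid101’s cart total comes out to 5000 as in the log.
conn = sqlite3.connect(db_path)
conn.executescript('''
INSERT INTO Cart VALUES ('crt1011'), ('crt1012'), ('crt1013');
INSERT INTO Customer VALUES
    ('cid100', 'ABCM1235', 'rajat', 'G-453', 632014, 9893135876, 'crt1011'),
    ('cid101', 'XYZ5678', 'john', 'F-789', 560023, 9876543210, 'crt1012'),
    ('cid102', 'PQRS9876', 'jane', 'A-321', 400001, 8765432109, 'crt1013');
-- Seller passwords and the name-to-id mapping below are guesses.
INSERT INTO Seller VALUES
    ('sid100', 's_pass1', 'aman', 'delhi cmc'),
    ('sid101', 's_pass2', 'sara', 'mumbai abc'),
    ('sid102', 's_pass3', 'peter', 'kolkata xyz');
INSERT INTO Seller_Phone_num VALUES
    (9943336206, 'sid100'), (9876543210, 'sid101'), (8765432109, 'sid102');
INSERT INTO Product VALUES
    ('pid1001', 'jeans', 'red', '32', 'M', 10, 10005, 20, 'sid100'),
    ('pid1002', 'shirt', 'blue', 'L', 'F', 5, 5000, 15, 'sid101'),
    ('pid1003', 'shoes', 'black', '40', 'M', 8, 8000, 25, 'sid102');
-- One cart item so that cid101's cart (crt1012) sums to 5000, as in the log.
INSERT INTO Cart_item VALUES (1, '2024-01-01', 'crt1012', 'pid1002', 'NO');
''')
conn.commit()
conn.close()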
Step 2: Installing Dependencies and Connecting to the Database
We’ll need to install several dependencies to facilitate the interaction between our code and the
database, as well as to integrate the LLM:
- langchain: A toolkit that aids in integrating LLMs with various applications.
- openai: The OpenAI Python package, essential for interacting with GPT models.
- langchain-openai: An extension to langchain for OpenAI-specific functionality.
- langchain-experimental: Provides experimental langchain features and integrations.
- faiss-cpu: A vector similarity search library, used later as the vector store for the few-shot example selector.
We will use the SQLDatabase wrapper provided by langchain to connect to our SQL database, and ChatOpenAI to load the model.
!pip install -q langchain openai langchain-openai langchain-experimental faiss-cpu
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain_openai import ChatOpenAI
from google.colab import userdata

# Read the API key from Colab's secret storage.
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

# Connect to the SQLite database created in Step 1.
db = SQLDatabase.from_uri("sqlite:////content/sample.db")
gpt = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo", temperature=0)
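Before building the agent, a quick sanity check confirms the connection works; SQLDatabase exposes the dialect, the usable table names, and a run method for raw queries:
# Confirm the dialect, the tables, and a few sample rows.
print(db.dialect)                    # sqlite
print(db.get_usable_table_names())   # ['Cart', 'Cart_item', 'Customer', ...]
print(db.run("SELECT * FROM Customer LIMIT 3;"))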
Step 3: Setting Up an Agent and Querying
To utilize the LLM for SQL generation, we will set up an agent executor using langchain’s
create_sql_agent. This agent will serve as the intermediary between our natural language prompts
and the SQL database, translating our queries into executable SQL statements.
- Creating the SQL Agent: Use create_sql_agent to initialize our agent with the configuration needed to interact with our e-commerce database schema.
- Executing Queries: We will use the agent’s run method to input natural language prompts. The agent will then process these prompts, generate SQL queries, and execute them against our database. The output will be the result of these queries, demonstrating the LLM’s ability to interact with SQL databases.
With the database connection and model in place, we wrap them in a SQLDatabaseToolkit and create the agent executor.
toolkit = SQLDatabaseToolkit(db=db, llm=gpt)
agent_executor = create_sql_agent(
    llm=gpt,
    toolkit=toolkit,
    verbose=True,
    agent_type="openai-tools",
)
question = "what is the cart amount for customer with customer id cid101"
agent_executor.run(question)
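Note that run is LangChain’s older single-string entry point; in more recent releases it is deprecated in favor of invoke. The equivalent call would be:
agent_executor.invoke({"input": question})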
Logs
> Entering new AgentExecutor chain...
Invoking: `sql_db_list_tables` with `{}`
Cart, Cart_item, Customer, Payment, Product, Seller, Seller_Phone_num
Invoking: `sql_db_schema` with `Cart, Customer`
/*
3 rows from Cart table:
Cart_id
crt1011
crt1012
crt1013
*/
/*
3 rows from Customer table:
Customer_id c_pass Name Address Pincode Phone_number_s Cart_id
cid100 ABCM1235 rajat G-453 632014 9893135876 crt1011
cid101 XYZ5678 john F-789 560023 9876543210 crt1012
cid102 PQRS9876 jane A-321 400001 8765432109 crt1013
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
Error: (sqlite3.OperationalError) no such column: p.Price
[SQL: SELECT c.Cart_id, c.Customer_id, c.Name, c.Address, c.Pincode,
c.Phone_number_s, SUM(p.Price) AS Cart_Amount FROM Customer c JOIN Cart_item ci
ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id WHERE
c.Customer_id = 'cid101']
(Background on this error at: https://sqlalche.me/e/20/e3q8)
Invoking: `sql_db_schema` with `Product`
/*
3 rows from Product table:
Product_id Type Color P_Size Gender Commission Cost Quantity Seller_id
pid1001 jeans red 32 M 10 10005 20 sid100
pid1002 shirt blue L F 5 5000 15 sid101
pid1003 shoes black 40 M 8 8000 25 sid102
*/
Invoking: `sql_db_query` with `SELECT c.Cart_id, c.Customer_id, c.Name, c.Address,
c.Pincode, c.Phone_number_s, SUM(p.Cost) AS Cart_Amount FROM Customer c JOIN
Cart_item ci ON c.Cart_id = ci.Cart_id JOIN Product p ON ci.Product_id = p.Product_id
WHERE c.Customer_id = 'cid101'`
[('crt1012', 'cid101', 'john', 'F-789', 560023, 9876543210, 5000)]The cart amount
for the customer with customer id 'cid101' is 5000.
> Finished chain.
"The cart amount for the customer with customer id 'cid101' is 5000."
Coding Using a Dynamic Few-Shot Prompt
Github Link – here
Implementing a dynamic few-shot prompt involves several steps. This approach helps optimize the
performance of our agent by incorporating domain-specific knowledge and context relevant to the
user’s input. Here’s a breakdown of these steps:
Step 1: Creating Example Pairs
Firstly, we need a set of user input and corresponding SQL query examples. These examples should
cover a range of typical queries related to our e-commerce schema, providing a base for the LLM
to learn from.
examples = [
    {"input": "List all customers.", "query": "SELECT * FROM Customer;"},
    {
        "input": "List all the products from a particular seller with seller_id 'sid100'",
        "query": "SELECT Product_id, Type, Color, P_Size, Gender, Commission, Cost, Quantity FROM Product WHERE Seller_id = 'sid100';",
    },
    {
        "input": "List of all sellers with name address and their phone number",
        "query": "SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id;",
    },
    {
        # Customer links to Cart_item directly via its Cart_id column.
        "input": "List of all the items in the cart of a customer whose Name is 'jane'",
        "query": "SELECT Cart_item.Quantity_wished, Cart_item.Date_Added, Cart_item.Cart_id, Cart_item.Product_id, Cart_item.purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Customer.Name = 'jane';",
    },
    {
        "input": "List all customers whose address is Canada.",
        "query": "SELECT * FROM Customer WHERE Address = 'Canada';",
    },
    {
        "input": "How many customers have purchased the product with Product_id pid1002",
        "query": "SELECT COUNT(DISTINCT Customer.Customer_id) AS Num_Customers_Purchased FROM Cart_item JOIN Customer ON Cart_item.Cart_id = Customer.Cart_id WHERE Cart_item.Product_id = 'pid1002' AND Cart_item.purchased = 'YES';",
    },
    {
        "input": "Find the total number of Products.",
        "query": "SELECT COUNT(*) FROM Product;",
    },
    {
        "input": "How many customers are there",
        "query": 'SELECT COUNT(*) AS Total_Customers FROM "Customer";',
    },
]
Step 2: Example Selector Creation
We then create an ExampleSelector, specifically a SemanticSimilarityExampleSelector. This
selector uses semantic search to find the most relevant examples from our set based on the
user’s input.
- Semantic Search: The selector will analyze the semantics of the user’s input and compare it with the example pairs to find the most similar ones.
- Embeddings and Vector Store: It utilizes embeddings and a vector store to perform this semantic analysis efficiently.
from langchain_community.vectorstores import FAISS
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings
example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY),
    FAISS,
    k=5,
    input_keys=["input"],
)
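Before wiring the selector into a prompt, it can be reassuring to see what it actually retrieves. Its select_examples method returns the k closest example dicts for a given input:
# Inspect the examples judged most similar to a new question.
for ex in example_selector.select_examples({"input": "How many sellers are there?"}):
    print(ex["input"], "->", ex["query"])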
Step 3: FewShotPromptTemplate Creation
Next, we create a FewShotPromptTemplate. This template will integrate the selected examples into
a structured prompt that the LLM can understand.
- Components:
  - Example Selector: Incorporates the previously created SemanticSimilarityExampleSelector.
  - Example Prompt Formatting: A format for how each example should be presented within the prompt.
  - Prefix and Suffix: Text to be added before and after the formatted examples to frame the prompt contextually.
from langchain_core.prompts import (
    ChatPromptTemplate,
    FewShotPromptTemplate,
    MessagesPlaceholder,
    PromptTemplate,
    SystemMessagePromptTemplate,
)
system_prefix = """You are an agent designed to interact with a SQL database.
Given an input question, create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer.
Unless the user specifies a specific number of examples they wish to obtain, always limit your query to at most {top_k} results.
You can order the results by a relevant column to return the most interesting examples in the database.
Never query for all the columns from a specific table, only ask for the relevant columns given the question.
You have access to tools for interacting with the database.
Only use the given tools. Only use the information returned by the tools to construct your final answer.
You MUST double check your query before executing it. If you get an error while executing a query, rewrite the query and try again.

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

If the question does not seem related to the database, just return "I don't know" as the answer.

Here are some examples of user inputs and their corresponding SQL queries:"""
few_shot_prompt = FewShotPromptTemplate(
    example_selector=example_selector,
    example_prompt=PromptTemplate.from_template(
        "User input: {input}\nSQL query: {query}"
    ),
    input_variables=["input", "dialect", "top_k"],
    prefix=system_prefix,
    suffix="",
)
Step 4: Constructing the Full Prompt
As we are using an OpenAI tools agent, our full prompt should be structured as a chat prompt.
- Human Message Template: This is where the user’s natural language input will be placed.
- Agent Scratchpad: A MessagesPlaceholder for the agent’s intermediate responses and workings.
- Integration of the Few-Shot Prompt: The dynamically created few-shot prompt from Step 3 is incorporated as the system message, providing context and examples for the agent’s processing.
full_prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate(prompt=few_shot_prompt),
        ("human", "{input}"),
        MessagesPlaceholder("agent_scratchpad"),
    ]
)

# Example formatted prompt
prompt_val = full_prompt.invoke(
    {
        "input": "List of all sellers with name address and their phone number",
        "top_k": 5,
        "dialect": "SQLite",
        "agent_scratchpad": [],
    }
)
print(prompt_val.to_string())
agent = create_sql_agent(
    llm=gpt,
    db=db,
    prompt=full_prompt,
    verbose=True,
    agent_type="openai-tools",
)

agent.invoke({"input": "List of all sellers with name address and their phone number"})
Output
> Entering new AgentExecutor chain...
Invoking: `sql_db_query` with `SELECT Seller.Name, Seller.Address, Seller_Phone_num.Phone_num
FROM Seller JOIN Seller_Phone_num ON Seller.Seller_id = Seller_Phone_num.Seller_id`
[('aman', 'delhi cmc', 9943336206), ('sara', 'mumbai abc', 9876543210), ('peter', 'kolkata xyz',
8765432109), ('lisa', 'bangalore pqr', 7654321098), ('mike', 'chennai lmn', 6543210987)]Here is a list
of all sellers with their name, address, and phone number:
1. Seller Name: aman
Address: delhi cmc
Phone Number: 9943336206
2. Seller Name: sara
Address: mumbai abc
Phone Number: 9876543210
3. Seller Name: peter
Address: kolkata xyz
Phone Number: 8765432109
4. Seller Name: lisa
Address: bangalore pqr
Phone Number: 7654321098
5. Seller Name: mike
Address: chennai lmn
Phone Number: 6543210987
> Finished chain.
Conclusion
In conclusion, the integration of Large Language Models (LLMs) with SQL databases through dynamic
few-shot prompting represents a significant advancement in the field of data management and natural
language processing. By emulating human data analysis techniques, such as using example selectors
and few-shot prompts, we can significantly enhance the accuracy and relevance of the SQL queries
generated by these models. This approach not only addresses the challenges of hallucination and
context window limitations but also leverages the evolving capabilities of LLMs to make data
querying more intuitive and accessible for a wider range of users. As we continue to refine these
methods, the potential for LLMs to revolutionize how we interact with and extract insights from vast
data repositories becomes increasingly evident, paving the way for a future where complex data
operations are seamlessly conducted through natural language interfaces.
Where AI meets SQL, queries transform into conversations, turning data exploration into a dialogue with knowledge itself.