如何使用GPT作為SQL查詢引擎的自然語言-魔扣目錄

譯者 | 李睿

審校 | 重樓

如今，得益于ChatGPT這種生成式人工智能技術(shù)，使得用簡單的語句查詢數(shù)據(jù)集變得非常簡單。

如何使用GPT作為SQL查詢引擎的自然語言

與大多數(shù)生成式人工智能一樣，AI target=_blank class=infotextkey>OpenAI公司開發(fā)的API的結(jié)果仍然不完美，這意味著用戶不能完全信任它們。幸運(yùn)的是，用戶現(xiàn)在可以編寫代碼來詢問GPT如何計(jì)算響應(yīng)，如果采用這種方法，用戶可以自己運(yùn)行代碼。這意味著用戶可以使用自然語言詢問ChatGPT一些問題，例如，“某產(chǎn)品去年各地區(qū)的總銷售額是多少?”，并對ChatGPT的回答準(zhǔn)確性充滿信心。

以下是使用GPT為數(shù)據(jù)庫設(shè)置自然語言查詢的一種快速而簡單的技術(shù)：

將數(shù)據(jù)的結(jié)構(gòu)、幾個示例行或兩者都放入單個文本字符串中。
用這些信息加上采用自然語言提出的問題，為GPT制作一個“提示”。
將提示發(fā)送到OpenAI的GPT-3.5-turbo API ，并請求SQL查詢來回答問題。
運(yùn)行返回到數(shù)據(jù)集的SQL來計(jì)算答案。
(可選)創(chuàng)建一個交互式應(yīng)用程序，使查詢數(shù)據(jù)集變得簡單。

在處理實(shí)際數(shù)據(jù)時，這種方法有幾個優(yōu)點(diǎn)。通過只發(fā)送數(shù)據(jù)結(jié)構(gòu)和一些示例行(其中可能包括假數(shù)據(jù))，不需要向GPT發(fā)送實(shí)際的敏感數(shù)據(jù)。如果數(shù)據(jù)規(guī)模太大，超出了GPT的提示大小限制，也不用擔(dān)心。并且，通過請求SQL而不是最終答案，檢查GPT如何生成其答案的能力被嵌入到流程中。

面向企業(yè)查詢的生成式人工智能

如果用戶真的想使用生成式人工智能來開發(fā)企業(yè)級查詢，可能想要研究像LangChain這樣的工具，它是一個用于處理多種不同大型語言模型(LLM)的框架，而不僅僅是OpenAI公司的GPT。OpenAI公司最近還宣布了在API請求中包含函數(shù)調(diào)用的可能性，其目的是使查詢和類似的任務(wù)更容易、更可靠。但對于快速原型或自己使用，這里描述的過程是一種簡單的開始方法。這里的演示是用R語言完成的，但這種技術(shù)可以在任何編程語言中使用。

步驟1：將示例數(shù)據(jù)轉(zhuǎn)換為單字符字符串

這一步驟中的示例數(shù)據(jù)可以包括數(shù)據(jù)庫模式或幾行數(shù)據(jù)。將其全部轉(zhuǎn)換為單個字符串非常重要，因?yàn)樗鼘⒊蔀閷l(fā)送到GPT 3.5的更大的文本字符串查詢的一部分。

如果用戶的數(shù)據(jù)已經(jīng)在SQL數(shù)據(jù)庫中，那么這一步非常簡單。如果不是，建議將其轉(zhuǎn)換為SQL可查詢的格式。為什么?在測試了R語言和SQL代碼結(jié)果之后，用戶對GPT生成的SQL代碼比它的R語言代碼更有信心。

在R語言的代碼中，sqldf包允許用戶在R數(shù)據(jù)幀上運(yùn)行SQL查詢，這就是在本例中使用的。Python/ target=_blank class=infotextkey>Python中也有一個類似的sqldf庫。對于性能很重要的大型數(shù)據(jù)，可能需要查看duckdb項(xiàng)目。

需要注意的是，在這個演示中，將使用一個包含美國人口普查州人口數(shù)據(jù)的CSV文件，可以在states.csv中找到。

下面的代碼將數(shù)據(jù)文件導(dǎo)入R語言，使用sqldf查看數(shù)據(jù)框架是SQL數(shù)據(jù)庫表時的SQL模式，使用dplyr的filter()函數(shù)提取三個示例行，并將模式和示例行都轉(zhuǎn)換為字符串。免責(zé)聲明：ChatGPT編寫了將數(shù)據(jù)轉(zhuǎn)換為單個字符串的基本R apply()部分代碼(通常使用purrr完成這些任務(wù))。

library(rio)
library(dplyr)
library(sqldf)
library(glue)
states <- rio::import("https://raw.Githubusercontent.com/smach/SampleData/main/states.csv") |>
 filter(!is.na(Region))

states_schema <- sqldf("PRAGMA table_info(states)")
states_schema_string <- paste(apply(states_schema, 1, paste, collapse = "t"), collapse = "n")

states_sample <- dplyr::sample_n(states, 3)
states_sample_string <- paste(apply(states_sample, 1, paste, collapse = "t"), collapse = "n")

步驟2：為大型語言模型（LM）創(chuàng)建提示符

格式應(yīng)該類似于“表現(xiàn)得像一個數(shù)據(jù)科學(xué)家。有一個名為｛table_name｝的SQLite表，具有以下架構(gòu)：```｛schema｝``。第一行看起來像這樣：```{rows_sample｝``。根據(jù)這些數(shù)據(jù)，編寫一個SQL查詢來回答以下問題：｛query｝。只返回SQL，不包括解釋。”

下面的函數(shù)以這種格式創(chuàng)建查詢，并接受數(shù)據(jù)模式、示例行、用戶查詢和表名的參數(shù)。

create_prompt <- function(schema, rows_sample, query, table_name) {
 glue::glue("Act as if you're a data scientist. You have a SQLite table named {table_name} with the following schema:

 ```
 {schema}
 ```

 The first rows look like this: 

 ```{rows_sample}```

 Based on this data, write a SQL query to answer the following question: {query}. Return the SQL query ONLY. Do not include any additional explanation.")
}

步驟3：將數(shù)據(jù)發(fā)送到OpenAI的API

用戶可以先將數(shù)據(jù)剪切并粘貼到OpenAI的Web界面中，然后在ChatGPT或OpenAI API中查看結(jié)果。ChatGPT不收取使用費(fèi)用，但用戶不能調(diào)整其結(jié)果。可以讓用戶設(shè)置溫度之類的參數(shù)，這意味著其反應(yīng)應(yīng)該有多“隨機(jī)”或多有創(chuàng)意，以及服務(wù)商想使用哪種模型。對于SQL代碼，將溫度設(shè)置為0。

接下來，將一個自然語言問題保存到變量my_query中，使用create_prompt()函數(shù)創(chuàng)建一個提示符，然后觀察當(dāng)將該提示符粘貼到API playground中時會發(fā)生什么：

> my_query <- "What were the highest and lowest Population changes in 2020 by Division?"
> my_prompt <- get_query(states_schema_string, states_sample_string, my_query, "states")
> cat(my_prompt)
Act as if you're a data scientist. You have a SQLite table named states with the following schema:
 ```
0 State TEXT 0 NA 0
1 Pop_2000 INTEGER 0 NA 0
2 Pop_2010 INTEGER 0 NA 0
3 Pop_2020 INTEGER 0 NA 0
4 PctChange_2000 REAL 0 NA 0
5 PctChange_2010 REAL 0 NA 0
6 PctChange_2020 REAL 0 NA 0
7 State Code TEXT 0 NA 0
8 Region TEXT 0 NA 0
9 Division TEXT 0 NA 0
```
The first rows look like this: 
```Delaware  783600  897934  989948 17.6 14.6 10.2 DE South South Atlantic
Montana  902195  989415 1084225 12.9  9.7  9.6 MT West Mountain
Arizona 5130632 6392017 7151502 40.0 24.6 11.9 AZ West Mountain```
Based on this data, write a SQL query to answer the following question: What were the highest and lowest Population changes in 2020 by Division?. Return the SQL query ONLY. Do not include any additional explanation.

如何使用GPT作為SQL查詢引擎的自然語言提示輸入OpenAI API playground和生成的SQL代碼

以下是運(yùn)行建議的SQL時的結(jié)果：

sqldf("SELECT Division, MAX(PctChange_2020) AS Highest_PctChange_2020, MIN(PctChange_2020) AS Lowest_PctChange_2020 FROM states GROUP BY Division;")
 Division Highest_PctChange_2020 Lowest_PctChange_2020
1 East North Central 4.7 -0.1
2 East South Central 8.9 -0.2
3 Middle Atlantic 5.7 2.4
4 Mountain 18.4 2.3
5 New England 7.4 0.9
6 Pacific 14.6 3.3
7 South Atlantic 14.6 -3.2
8 West North Central 15.8 2.8
9 West South Central 15.9 2.7

ChatGPT不僅生成了準(zhǔn)確的SQL，而且也不必告訴GPT“2020人口變化”在Pop_2020列中。

步驟4：執(zhí)行GPT返回的SQL代碼的結(jié)果

以編程方式向OpenAI發(fā)送和返回?cái)?shù)據(jù)，而不是將其剪切和粘貼到Web界面中，這將會方便得多。有幾個R包可以使用OpenAI API。下面的代碼塊使用OpenAI包向API發(fā)送一個提示，存儲API響應(yīng)，提取響應(yīng)中包含帶有請求的SQL代碼的文本的部分，打印該代碼，并在數(shù)據(jù)上運(yùn)行SQL。

library(openai)
my_results <- openai::create_chat_completion(model = "gpt-3.5-turbo", temperature = 0, messages = list(
 list(role = "user", content = my_prompt)
)) 
the_answer <- my_results$choices$message.content

cat(the_answer)
SELECT Division, MAX(PctChange_2020) AS Highest_Population_Change, MIN(PctChange_2020) AS Lowest_Population_Change
FROM states
GROUP BY Division;
sqldf(the_answer)
 Division Highest_Population_Change Lowest_Population_Change
1 East North Central 4.7 -0.1
2 East South Central 8.9 -0.2
3 Middle Atlantic 5.7 2.4
4 Mountain 18.4 2.3
5 New England 7.4 0.9
6 Pacific 14.6 3.3
7 South Atlantic 14.6 -3.2
8 West North Central 15.8 2.8
9 West South Central 15.9

如果用戶想使用OpenAI API，需要一個OpenAI API密鑰。對于這個包，密鑰應(yīng)該存儲在一個系統(tǒng)環(huán)境變量中，例如OPENAI_API_KEY。需要注意的是，這個API不是免費(fèi)使用的，但在把它變成編輯器之前，一天運(yùn)行了這個項(xiàng)目十幾次，而其總賬戶使用的費(fèi)用是1美分。

步驟5(可選)：創(chuàng)建交互式應(yīng)用程序

現(xiàn)在，已經(jīng)在腳本或終端中擁有了在R工作流中運(yùn)行查詢所需的所有代碼。但是，如果想用簡單的語言制作一個交互式應(yīng)用程序來查詢數(shù)據(jù)，這里已經(jīng)包含了一個基本的Shiny應(yīng)用程序的代碼，可以使用它。

如果打算發(fā)布一個應(yīng)用程序供其他人使用，那么將需要加強(qiáng)代碼安全性以防止惡意查詢，添加更優(yōu)雅的錯誤處理和解釋性標(biāo)簽，改進(jìn)樣式，或者對其進(jìn)行擴(kuò)展以供企業(yè)使用。

與同時，這段代碼應(yīng)該開始創(chuàng)建一個交互式應(yīng)用程序，用自然語言查詢數(shù)據(jù)集：

library(shiny)
library(openai)
library(dplyr)
library(sqldf)
# Load hard-coded dataset
states <- read.csv("states.csv") |>
 dplyr::filter(!is.na(Region) & Region != "")

states_schema <- sqldf::sqldf("PRAGMA table_info(states)")
states_schema_string <- paste(apply(states_schema, 1, paste, collapse = "t"), collapse = "n")

states_sample <- dplyr::sample_n(states, 3)
states_sample_string <- paste(apply(states_sample, 1, paste, collapse = "t"), collapse = "n")

# Function to process user input
get_prompt <- function(query, schema = states_schema_string, rows_sample = states_sample_string, table_name = "states") {
 my_prompt <- glue::glue("Act as if you're a data scientist. You have a SQLite table named {table_name} with the following schema:
 ```
 {schema}
 ```
 The first rows look like this: 
 ```{rows_sample}```

 Based on this data, write a SQL query to answer the following question: {query} Return the SQL query ONLY. Do not include any additional explanation.")
 print(my_prompt)
 return(my_prompt)
}
ui <- fluidPage(
 titlePanel("Query state database"),
 sidebarLayout(
 sidebarPanel(
 textInput("query", "Enter your query", placeholder = "e.g., What is the total 2020 population by Region?"),
 actionButton("submit_btn", "Submit")
 ),
 mainPanel(
 uiOutput("the_sql"),
 br(),
 br(),
 verbatimTextOutput("results")
 )
 )
)
server <- function(input, output) {
# Create the prompt from the user query to send to GPT
 the_prompt <- eventReactive(input$submit_btn, {
 req(input$query, states_schema_string, states_sample_string)
 my_prompt <- get_prompt(query = input$query)
 }) 

# send prompt to GPT, get SQL, run SQL, print results
observeEvent(input$submit_btn, {
 req(the_prompt()) # text to send to GPT
 # Send results to GPT and get response
 # withProgress adds a Shiny progress bar. Commas now needed after each statement
 withProgress(message = 'Getting results from GPT', value = 0, { # Add Shiny progress message
 my_results <- openai::create_chat_completion(model = "gpt-3.5-turbo", temperature = 0, messages = list(
 list(role = "user", content = the_prompt())
 )) 
 the_gpt_sql <- my_results$choices$message.content 

 # print the SQL
 sql_html <- gsub("n", "<br />", the_gpt_sql) 
 sql_html <- paste0("<p>", sql_html, "</p>") 

 # Run SQL on data to get results
 gpt_answer <- sqldf(the_gpt_sql) 
 setProgress(value = 1, message = 'GPT results received') # Send msg to user that 
 })
 # Print SQL and results
 output$the_sql <- renderUI(HTML(sql_html)) 

 if (is.vector(gpt_answer) ) {
 output$results <- renderPrint(gpt_answer) 
 } else {
 output$results <- renderPrint({ print(gpt_answer) }) 
 } 
}) 
}
shinyApp(ui = ui, server = server)

原文標(biāo)題：How to use GPT as a natural language to SQL query engine，作者：Sharon Machlis