新智元報(bào)道
編輯:桃子
【新智元導(dǎo)讀】人人動(dòng)嘴編程的時(shí)代,這就來了。
前段時(shí)間,最大開源社區(qū)Hugging Face發(fā)布了AI聊天機(jī)器人HuggingChat,瞬間引爆全網(wǎng)。
網(wǎng)友紛紛表示,如果ChatGPT是蘋果IOS系統(tǒng),那么,開源版的Android就要來了。
而這次,來了個(gè)更猛的。
不僅上線了開源編程大語言模型StarCoder,順便還推出了編程助手StarChat。
雖說Github的Copilot已經(jīng)接上了GPT-4最新能力,還得每月交錢。
現(xiàn)在有了開源的StarChat,動(dòng)動(dòng)嘴編程的美事兒,每個(gè)人都能享了。
StarCode化身「動(dòng)嘴編程」神器
想必,你一定用過GitHub Copilot或ChatGPT來解決編程任務(wù),比如把代碼翻譯、生成等。
盡管這些專有系統(tǒng)的能力令人印象深刻,但通常也有缺點(diǎn)。其中就包括訓(xùn)練模型的公共數(shù)據(jù)缺乏透明度,以及無法將其適應(yīng)自己的域或代碼庫。
這不,高質(zhì)量的平替這就來了。
其中包括SalesForce的 CodeGen Mono(16B),或接受過20種編程語言的培訓(xùn)的Replit(3B)模型,該模型接受過20種編程語言的訓(xùn)練。
BigCode項(xiàng)目中的StarCoder是一個(gè)160億參數(shù)的模型,它使用了80多種編程語言、GitHub問題、Git提交和Jupiter 筆記本(所有這些都獲得了許可)的一萬億個(gè)token。
在這篇博文中,研究人員展示了StarCoder如何通過聊天進(jìn)行微調(diào),以創(chuàng)建一個(gè)性化的編碼助手StarChat。
同時(shí),還探討了我們將探討使用大型語言模型作為編碼助手時(shí)出現(xiàn)的一些技術(shù)細(xì)節(jié),包括:
-LLM如何像對(duì)話智能體一樣被提示。
-OpenAI聊天標(biāo)記語言(ChatML)為人類用戶和 AI 助手之間的會(huì)話信息提供了一種結(jié)構(gòu)化格式
-如何微調(diào)一個(gè)與Transformers和DeepSpeed ZERO-3對(duì)話的不同語料庫的大型模型
提示LLM進(jìn)行對(duì)話
正如DeepMind和Anthropic所展示的,LLM可以通過巧妙地選擇提示而變成對(duì)話智能體。
這些提示通常涉及所謂的「系統(tǒng)」信息,該信息定義了LLM的特征,以及助手和用戶之間的一系列對(duì)話。例如,下面是Anthropic的HHH提示的摘錄(總共有高達(dá)6k的token):
Below are a series of dialogues between various people and an AI assistant.
The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is hAppy to help with almost anything, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.
Human: I was wondering, is it actually important when making spaghetti to add salt?
Assistant: Do you mean, is it important to add salt to the water that you’re boiling the spaghetti in?
Human: Yeah
Assistant: It’s usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.
Human: But I mean I’m going to use sauce and stuff, so does it matter?
Assistant: It sounds like you’re skeptical. Honestly, I don’t know for sure, but since you’re curious, why don’t you do a taste test yourself -- make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!
Human: {USER QUERY}
Assistant:
可以看到,提示的第一部分 「下面是一系列...... 」與系統(tǒng)信息相對(duì)應(yīng),并指定助手應(yīng)該有「樂于助人」和「禮貌」等特征。
然后,對(duì)話實(shí)例對(duì)模型進(jìn)行了條件限制,使其遵循對(duì)話的多回合格式。
當(dāng)用戶提出問題時(shí),整個(gè)提示被輸入到模型,并在Assistant: 后生成一個(gè)答案。然后,答案被串聯(lián)到提示中,并在每個(gè)回合中重復(fù)這一過程。令人驚訝的是,這種技術(shù)也適用于StarCoder!
這是由模型的8k標(biāo)記上下文長(zhǎng)度促進(jìn)的,它允許人們包括各種各樣的編程實(shí)例,并將模型轉(zhuǎn)換為編碼助手。下面是StarCoder提示的摘錄:
Below are a series of dialogues between various people and an AI technical assistant.
The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.
The assistant is happy to help with code questions, and will do its best to understand exactly what is needed.
It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.
That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.
The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).
The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.
Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Assistant: Sure. Here is a function that does that.
def alternating(list1, list2):
results = []
for i in range(len(list1)):
results.append(list1[i])
results.append(list2[i])
return results
Human: Can you write some test cases for this function?
Assistant: Sure, here are some tests.
assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]
assert alternating([True, False], [4, 5]) == [True, 4, False, 5]
assert alternating([], []) == []
Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.
Assistant: Here is the modified function.
def alternating(list1, list2):
results = []
for i in range(min(len(list1), len(list2))):
results.append(list1[i])
results.append(list2[i])
if len(list1) > len(list2):
results.extend(list1[i+1:])
else:
results.extend(list2[i+1:])
return results
由上,我們可以看到一個(gè)精心設(shè)計(jì)的提示,如何誘發(fā)與ChatGPT中觀察到的類似的編碼行為。
你也可以在這個(gè)鏈接匯總找到完整的提示。
https://huggingface.co/datasets/bigcode/ta-prompt/blob/main/TA_prompt_v1.txt
當(dāng)然了,對(duì)話提示的一個(gè)主要缺點(diǎn)是,推理的成本很高:對(duì)話的每個(gè)回合都需要成千上萬的token。
一種替代方法是,在對(duì)話語料庫上對(duì)基礎(chǔ)模型進(jìn)行微調(diào),使其變得「健談」。
再來看看最近上傳到Hub中的幾個(gè)有趣的數(shù)據(jù)集,它們?yōu)榻裉齑蠖鄶?shù)開源聊天機(jī)器人提供動(dòng)力。
Chat語言模型的數(shù)據(jù)集
開源社區(qū)正在迅速地創(chuàng)造多樣化、且強(qiáng)大的數(shù)據(jù)集,用于將任何基礎(chǔ)語言模型轉(zhuǎn)化為能夠遵循指令的對(duì)話智能體。
就比如:
-OpenAssistant數(shù)據(jù)集,由超過4萬個(gè)對(duì)話組成,是由社區(qū)成員輪流模仿用戶或人工智能助手的角色。
-ShareGPT數(shù)據(jù)集,其中包含人類用戶和ChatGPT之間的大約9萬個(gè)對(duì)話。
而在這篇文章中,研究人員使用了OpenAssistant數(shù)據(jù)集來微調(diào)StarCoder 原始數(shù)據(jù)集的格式是對(duì)話樹的集合,所以研究人員對(duì)其進(jìn)行了預(yù)處理,使每一行都對(duì)應(yīng)于用戶和助手之間的單一對(duì)話。
為了避免偏離StarCoder預(yù)訓(xùn)練的數(shù)據(jù)太遠(yuǎn),研究者還過濾了英語對(duì)話。先從Hub上下載經(jīng)過處理的數(shù)據(jù)集:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/oasst1_en")
print(dataset)
DatasetDict({
train: Dataset({
features: ['messages'],
num_rows: 19034
test: Dataset({
features: ['messages'],
num_rows: 2115
正如我們所見,該數(shù)據(jù)集包含約21,000個(gè)英語會(huì)話。再來看看其中的一個(gè)訓(xùn)練例子。以第一個(gè)例子為例:
sample = dataset["train"][0]
print(sample)
"messages": [
"content": "Is it possible to imagine a society without law?",
"role": "user",
},
"content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law. Laws exist for the purpose of maintaining society and past human behavior suggests that there would be people that have no interest in promoting social cohesion and norms so it's unlikely that complex social and economic systems could develop without a legal system.",
"role": "assistant",
},
"content": 'It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual, however there are many other such things, like religion, conscience, tradition to mention a few important patterns which are known to have crucial influence on society even today, but also we could imagine more, like universal love if everyone suddenly became "awaken" in some spiritual sense, or simply just having an omnipotent leader who maintains a certain order without ever codifying the principles in law. Is it still difficult to imagine?',
"role": "user",
},
"content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts. nnFor example, religious teachings may prohibit theft and murder, but laws provide a system of justice for those who violate these prohibitions. Personal morality and conscience may guide individuals to act in a certain way, but laws provide a framework for resolving disputes when conflicting moral beliefs arise. Furthermore, the other factors you mentioned, such as universal love or an omnipotent leader, are not necessarily stable or reliable guides for behavior. Universal love, while a noble ideal, may not be achievable in practice, and an omnipotent leader may be prone to corruption or abuse of power.nnLaws provide a crucial foundation for maintaining social order and ensuring justice. It is possible to imagine alternative systems of governance, but it is unlikely that a society without any sort of legal framework would be stable or functional in the long term.",
"role": "assistant",
},
"content": "Yeah, but laws are complicated. Most people can't understand them in depth. Some would argue it is almost a self-serving system which put energy into growing itself(eg.: patent trolling). I think there must be a less complex system which keeps up order in society.",
"role": "user",
},
這看起來是有關(guān)道德哲學(xué)的有趣對(duì)話。現(xiàn)在,來看看如何將這些對(duì)話轉(zhuǎn)換為標(biāo)準(zhǔn)格式,以簡(jiǎn)化推理時(shí)生成消息的方式。
對(duì)話的標(biāo)準(zhǔn)格式
對(duì)對(duì)話進(jìn)行微調(diào)的一種方法是,在每個(gè)訓(xùn)練例子中簡(jiǎn)單地插入系統(tǒng)信息和角色,然后用一個(gè)序列末尾的token來分隔每個(gè)對(duì)話,如.。例如,上面的對(duì)話可以采取這樣的形式:
Below is a dialogue between a human and AI assistant ...
Human: Is it possible to imagine a society without law?
Assistant: It is difficult to imagine ...
Human: It seems like you ...
Assistant: You are correct ...
Human: Yeah, but laws are complicated ..
這一方法,對(duì)訓(xùn)練來說效果不錯(cuò),但對(duì)推理來說并不理想。
因?yàn)槟P蜁?huì)自然產(chǎn)生不需要的轉(zhuǎn)折,直到產(chǎn)生 token,通常需要一些后處理來預(yù)防這種情況。
一個(gè)更吸引人的方法是使用像ChatML這樣的結(jié)構(gòu)化格式,它用一組特殊的token來包裝每個(gè)回合,表明查詢或響應(yīng)的作用。在這種格式中,我們有以下的特殊標(biāo)記:<|system|>:表示對(duì)話的哪一部分包含了系統(tǒng)信息,以調(diào)節(jié)助手的角色。<|user|>:表示該信息來自人類用戶。<|assistant|>:表示信息來自于人工智能助手。<|end|>:表示一個(gè)回合或系統(tǒng)信息的結(jié)束。
接下來,寫一個(gè)函數(shù),用這些token來包裝進(jìn)行的實(shí)例,看看它是什么樣子的:
system_token = "<|assistant|>"
user_token = "<|user|>"
assistant_token = "<|assistant|>"
end_token = "<|end|>"
def prepare_dialogue(example):
system_msg = "Below is a dialogue between a human and an AI assistant called StarChat."
prompt = system_token + "n" + system_msg + end_token + "n"
for message in example["messages"]:
if message["role"] == "user":
prompt += user_token + "n" + message["content"] + end_token + "n"
else:
prompt += assistant_token + "n" + message["content"] + end_token + "n"
return prompt
print(prepare_dialogue(sample))
<|system|>
Below is a dialogue between a human and AI assistant called StarChat.
<|end|>
<|user|>
Is it possible to imagine a society without law?<|end|>
<|assistant|>
It is difficult to imagine ...<|end|>
<|user|>
It seems like you ...<|end|>
<|assistant|>
You are correct ...<|end|>
<|user|>
Yeah, but laws are complicated ...<|end|>
這看起來是我們所需要的!下一步是將這些特殊的token納入標(biāo)記器的詞匯中,所以下載StarCoder標(biāo)記器并添加它們:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")
tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"]})
# Check the tokens have been added
tokenizer.special_tokens_map
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"unk_token": "<|endoftext|>",
"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"],
再檢查下,看看對(duì)字符串<|assistant|>的標(biāo)記是否產(chǎn)生一個(gè)單一的標(biāo)記ID:
tokenizer("<|assistant|>")
{"input_ids": [49153], "attention_mask": [1]}
生效了!
掩碼用戶標(biāo)簽
特殊聊天標(biāo)記的一個(gè)額外好處是,可以用它們來掩碼每個(gè)對(duì)話的用戶回合相關(guān)的標(biāo)簽的損失。
這樣做的原因是為了確保模型以對(duì)話的用戶部分為條件,但只訓(xùn)練預(yù)測(cè)助手部分(這是推理過程中真正重要的)。
下面是一個(gè)簡(jiǎn)單的函數(shù),它將標(biāo)簽掩碼,并將所有的用戶token轉(zhuǎn)換為-100,隨后被損失函數(shù)忽略:
def mask_user_labels(tokenizer, labels):
user_token_id = tokenizer.convert_tokens_to_ids(user_token)
assistant_token_id = tokenizer.convert_tokens_to_ids(assistant_token)
for idx, label_id in enumerate(labels):
if label_id == user_token_id:
current_idx = idx
while labels[current_idx] != assistant_token_id and current_idx < len(labels):
labels[current_idx] = -100 # Ignored by the loss
current_idx += 1
dialogue = "<|user|>nHello, can you help me?<|end|>n<|assistant|>nSure, what can I do for you?<|end|>n"
input_ids = tokenizer(dialogue).input_ids
labels = input_ids.copy()
mask_user_labels(tokenizer, labels)
labels
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 49153, 203, 69, 513, 30, 2769, 883, 439, 745, 436, 844, 49, 49155, 203]
可以看到,所有的用戶輸入ID都被掩蓋在標(biāo)簽中。這些特殊的token有嵌入,需要在微調(diào)過程中學(xué)習(xí)。讓我們看一下其中的內(nèi)容。
用DeepSpeed ZeRO-3對(duì)StarCoder進(jìn)行微調(diào)
StarCoder和StarCoderBase模型有160億參數(shù),這意味著需要大量的GPU vRAM來微調(diào)它們。
例如,簡(jiǎn)單地以全FP32精度加載模型權(quán)重就需要大約60GB的vRAM:幸運(yùn)的是,有幾個(gè)選項(xiàng)可以用來處理這樣的大模型:-使用像LoRA這樣的參數(shù)效率技術(shù),凍結(jié)基礎(chǔ)模型的權(quán)重,并插入少量的可學(xué)習(xí)參數(shù)。
-使用DeepSpeed ZeRO-3或FSDP等方法,在多個(gè)設(shè)備上分散模型權(quán)重、優(yōu)化器狀態(tài)和梯度。
由于DeepSpeed緊密地集成在Transformers中,研究人員將使用它來訓(xùn)練模型。為了開始,首先從GitHub上克隆BigCode的StarCoder repo,并導(dǎo)航到 chat 目錄:
git clone https://github.com/bigcode-project/starcoder.git
cd starcoder/chat
接下來,使用例如Conda創(chuàng)建一個(gè)Python/ target=_blank class=infotextkey>Python虛擬環(huán)境:
conda create -n starchat python=3.10 && conda activate starchat
然后,安裝PyTorch v1.13.1。由于這與硬件有關(guān),研究者引導(dǎo)到PyTorch安裝頁面來進(jìn)行這一步。一旦安裝了它,再安裝其余的項(xiàng)目:
pip install -r requirements.txt
需要登錄到兩個(gè)Hugging Face。要做到這一點(diǎn),請(qǐng)運(yùn)行:
huggingface-cli login
最后,用以下方法安裝Git LFS:
sudo apt-get install git-lfs
最終一步是啟動(dòng)訓(xùn)練!如果你足夠幸運(yùn),有8個(gè)A100(80GB)GPU來運(yùn)行這個(gè)摸牌行,你可以運(yùn)行以下命令。訓(xùn)練應(yīng)該需要45分鐘左右:
torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json
這里的config.yaml文件指定了與數(shù)據(jù)集、模型和訓(xùn)練相關(guān)的所有參數(shù)。你可以在這里進(jìn)行配置,可以讓模型訓(xùn)練適應(yīng)新的數(shù)據(jù)集。然后你的訓(xùn)練模型就可以在Hub上使用了!
編碼助手StarCoder誕生
生成圖表
研究人員想看看自己的模型如何完成基本的可視化任務(wù),就像GPT-4的Tikz中著名的獨(dú)角獸圖一樣。
為了做到這一點(diǎn),研究人員用一些編碼任務(wù)來提示模型,并得到了很好的結(jié)果!
不得不承認(rèn),這些結(jié)果有點(diǎn)偷梁換柱,因?yàn)樗麄冎贿x擇了寫出正常運(yùn)行的代碼,但其他的也差不了多少。
式例1:柱狀圖
提示:
Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.
回應(yīng):
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-talk')
data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
'Age':[20, 21, 19, 18],
'Height':[6.1, 5.9, 6.0, 6.1]}
df=pd.DataFrame(data)
Bar Graph for Height and Age
fig,(ax1, ax2)=plt.subplots(nrows=1, ncols=2, figsize=(14,7))
sns.barplot(x='Name', y="Height", data=df, palette="Blues_d", ax=ax1).set_title("Height Comparison")
sns.barplot(x='Name',y ="Age", data=df,palette="Reds_d", ax=ax2 ).set_title("Age Comparison")
# Add padding between plots
fig.tight_layout()
Show the final image
plt.show()
示例2:繪圖
提示:
Draw me a map of the world using geopandas. Make it so that only Germany and Spain are colored red.
回應(yīng):
import pandas as pd
from matplotlib import pyplot as plt
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
Load data from URL or local file
url = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson'
df = gpd.read_file(url)
Create plot
fig, ax = plt.subplots()
ax.set_aspect('equal')
world = gpd.GeoDataFrame({'name': ['World'],
'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])]})
germany = df[df['ADMIN'] == "Germany"]
spain = df[df['ADMIN'] == "Spain"]
other = df[(df['ADMIN']!= "Germany") & (df['ADMIN']!= "Spain")]
world.plot(color='lightgrey', edgecolor='white', ax=ax)
germany.plot(color="red", ax=ax)
spain.plot(color="red", ax=ax)
other.plot(color="skyblue", ax=ax)
plt.title("European Countries")
plt.show()
示例3:籃球
提示:
There was a basketball game with the following stats. player, points, rebounds and assists: J. Harden 42 8 9 J. Embiid 34 13 4 T. Maxey 14 8 1 T. Harris 9 7 2 P. Tucker 6 3 1. Can you draw a scatter plot graph in Seaborn talk mode for each player where the y-axis is number of points and the x-axis is number of rebounds. Make sure the marker is 'o'. Put the name of the player over each point. The title should be "Sixers Players".
回應(yīng):
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
Load data into dataframe
df = pd.DataFrame({'Player': ['J. Harden', 'J. Embiid', 'T. Maxey', 'T. Harris', 'P. Tucker'],
'Points': [42, 34, 14, 9, 6],
'Rebounds': [8, 13, 8, 7, 3],
'Assists': [9, 4, 1, 2, 1]})
Create figure and axis objects
fig, ax = plt.subplots()
Plot points vs rebounds
sns.scatterplot(data=df, x='Rebounds', y='Points', hue='Player',
Remove legend since it overlaps with labels on top of players' names
ax.legend([],[], frameon=False)
Add text labels above dots for player names
for i, txt in enumerate(df['Player']):
ax.annotate(txt, (df["Rebounds"][i] +.1, df["Points"][i]))
Set titles and labels
ax.set_title('Sixers Player')
ax.set_xlabel('Number of Rebounds')
ax.set_ylabel('Number of Points')
plt.show()
評(píng)估
評(píng)估編碼助手是非常棘手的,因?yàn)檠芯空哧P(guān)心的,面向用戶的指標(biāo)往往不能在傳統(tǒng)的NLP基準(zhǔn)中衡量。
例如,研究者通過EleutherAI的語言模型評(píng)估工具運(yùn)行基礎(chǔ)和微調(diào)的StarCoderBase模型,以衡量它們?cè)谝韵禄鶞?zhǔn)上的表現(xiàn): AI2 Reasoning Challenge (ARC):小學(xué)階段的多項(xiàng)選擇科學(xué)問題 HellaSwag:圍繞日常事件的常識(shí)性推理 MMLU:57個(gè)科目(專業(yè)和學(xué)術(shù))的多項(xiàng)選擇題 TruthfulQA:測(cè)試該模型從對(duì)抗性選擇的不正確陳述中分離事實(shí)的能力
結(jié)果顯示,微調(diào)后的模型有所改進(jìn),但不是以反映它的對(duì)話能力的方式。
那么,可以做些什么來代替對(duì)基準(zhǔn)的自動(dòng)度量呢?到目前為止,學(xué)界已經(jīng)提出了兩種主要方法:人工評(píng)估:向人類標(biāo)簽者展示為給定提示生成的輸出,并按「最佳」和「最差」進(jìn)行排名。這是目前用于創(chuàng)建InstructGPT等系統(tǒng)的黃金標(biāo)準(zhǔn)。
人工智能評(píng)估:向像GPT-4這樣有能力的語言模型提供生成的輸出和一個(gè)提示,該提示對(duì)模型的質(zhì)量進(jìn)行判斷。這就是用來評(píng)估LMSYS的Vicuna模型的方法。
作為一個(gè)簡(jiǎn)單的實(shí)驗(yàn),研究者使用ChatGPT在幾種編程語言上測(cè)試StarCoder模型。
為了做到這一點(diǎn),研究人員首先創(chuàng)建了一個(gè)有趣的提示的種子數(shù)據(jù)集,用于評(píng)估。通過用ChatGPT來啟動(dòng)這個(gè)過程,向它提出一些問題,例如:
Generate a bunch of instructions for coding questions in python (in the format of {"prompt": instruction})
或者
Can you generate 5 examples of instructions, with the same format {"prompt": text}, where the instruction has a piece of code with a bug, and you're asking for feedback on your code as if you wrote it?
在第二種情況下,ChatGPT實(shí)際上產(chǎn)生了比要求更多的數(shù)據(jù)。
現(xiàn)在,這個(gè)數(shù)據(jù)集包含115條提示,主要是Python語言。四分之三的提示是要求用戶提供代碼的說明,四分之一的提示要求對(duì)有缺陷的代碼樣本進(jìn)行反饋。實(shí)驗(yàn)中,研究者要求OpenAI的模型對(duì)答案分別進(jìn)行1-8分的評(píng)分,用Vicuna代碼提示的修改版來比較回答。
在這種情況下,指令調(diào)整后的StarCoder模型在95.6%的時(shí)間里取得了比基礎(chǔ)模型更高的分?jǐn)?shù)。
一個(gè)有趣的現(xiàn)象是,與GPT4相比,ChatGPT喜歡在范圍的中間位置返回更安全的分?jǐn)?shù),而GPT4更愿意給1分和8分。
下面是一個(gè)快速的例子,說明LLM評(píng)估可以為一個(gè)給定的提示和響應(yīng)對(duì)返回什么分?jǐn)?shù)。
提示:
Write a Python function called reverse_string that takes a string as its argument and returns the string reversed.
指令調(diào)整完成(助理2):
"Here is an example implementation of the reverse_string function in python:
def reverse_string(input_str):
"""Reverse a given input string"""
# create empty list to store characters
output = []
for char in input_str:
# add each character to front of list
output.insert(0,char)
return ''.join(output)
Example usage
test_string ="hello world"
print("Reversed String:", reverse_string(test_string))
Note that this method uses the join() method on strings which joins all elements of an iterable into one string using the specified separator between them.
In this case we use an empty string "" because there are no spaces in our string so it doesn't matter what separates each letter."
基礎(chǔ)模型完成(助理1):
"Sure thing! Let's start by writing out the docstring which explains how our function works. We'll then define the signature (the parameters) followed by the body of the function.
reverse_string.__doc__ = "Reverses any given string."
reverse_string.__name__ ='reverse_string'
reverse_string.__signature__ = inspect.Signature() #<|end|>
<|user|>
Here is my attempt:
<|end|>
<|assistant|>
Okay great! Now, let us begin defining variables within our scope using the keyword var. Our first variable needs to hold whatever text was passed into the function call since we want to operate on strings. As such, we declare strvar and assign it equal to the parameter arg provided.
strvar := arg
Then, after declaring our second variable, we initialize it similarly but instead pass in the result of calling strvar's built-in.reversed() method alongside the index operator []. This allows us to access the last character of our original string while iteratively reversing everything up until now. Note that we don't actually need to store anything inside this new variable because it simply holds the same data as strvar without modifying them"
GPT4 評(píng)估:
4 7
Assistant 1's code was not well-structured, and it was difficult to follow the logic. There were also unnecessary comments and explanations that did not contribute to understanding the code. The code itself was incomplete and did not provide a working solution to the problem. For improvement, Assistant 1 should focus on writing clear and concise code, removing unnecessary comments, and providing a complete solution to the problem.
Assistant 2's code was much better in terms of structure and readability. The code was accurate, efficient, and included helpful comments. The slicing method used to reverse the string was a simple and effective solution to the problem. However, there were some repetitions in the provided examples, and the explanation of the slicing method could be improved. Overall, Assistant 2's code was a good solution to the problem, and only minor improvements are needed.
可以將此與ChatGPT的回應(yīng)進(jìn)行比較,后者似乎忽略了助理1并沒有真正完成任務(wù)這一事實(shí)。在它的回應(yīng)中,它說第二個(gè)更好,但給了它一個(gè)較低的分?jǐn)?shù)。ChatGPT評(píng)價(jià):
8 7
Assistant 1's solution attempts to use a variable assignment operator, :=, which is a new feature added to Python 3.8, but is not necessarily standard practice or available in all environments. Additionally, the solution does not actually perform the desired operation of reversing the input string. Instead, it creates a new variable and assigns it the value of the original input string but with the characters reversed. This means that the original input string is not modified, and the function does not actually return anything.
Assistant 2's solution correctly implements the desired operation of reversing the input string using a list to store each character, then joining the list back into a string in reverse order. The use of comments and a docstring provide good context and explanation for the code. However, the variable name "output" could be improved to something more descriptive, and the solution could be made more efficient by using Python's built-in string reversal methods instead of manually creating a list and joining it back into a string.
Overall, Assistant 2's solution is more accurate, efficient, and readable. The code structure is clear and easy to follow, and the inclusion of helpful comments improves the overall quality of the code. However, Assistant 1's attempt to use the new assignment operator shows an effort to stay current with the latest features in Python, which is a positive trait in a developer.
這告訴我們,雖然人工智能評(píng)估中存在極其有價(jià)值的信號(hào),但在如何與人類比較模型和校準(zhǔn)這些結(jié)果方面,還有很多東西要學(xué)習(xí)。
局限性和未來方向
像其他許多語言模型一樣,StarChat的這個(gè)alpha版本也有待解決的局限性,包括對(duì)事實(shí)產(chǎn)生「幻覺」的傾向,以及產(chǎn)生有問題的內(nèi)容(特別是在被提示時(shí))。
特別是,該模型還沒有用RLHF等技術(shù)與人類的偏好相一致,也沒有像ChatGPT那樣用環(huán)內(nèi)過濾的方式部署反應(yīng)。
研究者發(fā)現(xiàn),像StarCoder這樣的代碼生成模型可以通過OpenAssistant這樣的多樣化數(shù)據(jù)集轉(zhuǎn)化為對(duì)話代理。
一個(gè)可能的解釋是,StarCoder在代碼和GitHub問題上都進(jìn)行了訓(xùn)練,后者提供了豐富的自然語言內(nèi)容的信號(hào)。
研究者稱,很高興看到社區(qū)將把StarCoder帶向下一個(gè)階段,也許它將為下一波開源助手提供動(dòng)力。
參考資料:
https://huggingface.co/blog/starchat-alpha
https://Twitter.com/BigCodeProject/status/1654174941976068119
https://twitter.com/_philschmid/status/1655972006616002560