秒殺自動編碼Copilot！「動嘴編程」神器StarChat開源，碼農狂喜-魔扣目錄

新智元報道

編輯：桃子

【新智元導讀】人人動嘴編程的時代，這就來了。

前段時間，最大開源社區Hugging Face發布了AI聊天機器人HuggingChat，瞬間引爆全網。

網友紛紛表示，如果ChatGPT是蘋果IOS系統，那么，開源版的Android就要來了。

而這次，來了個更猛的。

不僅上線了開源編程大語言模型StarCoder，順便還推出了編程助手StarChat。

雖說Github的Copilot已經接上了GPT-4最新能力，還得每月交錢。

現在有了開源的StarChat，動動嘴編程的美事兒，每個人都能享了。

StarCode化身「動嘴編程」神器

想必，你一定用過GitHub Copilot或ChatGPT來解決編程任務，比如把代碼翻譯、生成等。

盡管這些專有系統的能力令人印象深刻，但通常也有缺點。其中就包括訓練模型的公共數據缺乏透明度，以及無法將其適應自己的域或代碼庫。

這不，高質量的平替這就來了。

其中包括SalesForce的 CodeGen Mono（16B），或接受過20種編程語言的培訓的Replit（3B）模型，該模型接受過20種編程語言的訓練。

BigCode項目中的StarCoder是一個160億參數的模型，它使用了80多種編程語言、GitHub問題、Git提交和Jupiter 筆記本（所有這些都獲得了許可）的一萬億個token。

在這篇博文中，研究人員展示了StarCoder如何通過聊天進行微調，以創建一個性化的編碼助手StarChat。

同時，還探討了我們將探討使用大型語言模型作為編碼助手時出現的一些技術細節，包括:

-LLM如何像對話智能體一樣被提示。

-OpenAI聊天標記語言（ChatML）為人類用戶和 AI 助手之間的會話信息提供了一種結構化格式

-如何微調一個與Transformers和DeepSpeed ZERO-3對話的不同語料庫的大型模型

提示LLM進行對話

正如DeepMind和Anthropic所展示的，LLM可以通過巧妙地選擇提示而變成對話智能體。

這些提示通常涉及所謂的「系統」信息，該信息定義了LLM的特征，以及助手和用戶之間的一系列對話。例如，下面是Anthropic的HHH提示的摘錄（總共有高達6k的token）：

Below are a series of dialogues between various people and an AI assistant.The AI tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.The assistant is hAppy to help with almost anything, and will do its best to understand exactly what is needed.It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

Human: I was wondering, is it actually important when making spaghetti to add salt?
Assistant: Do you mean, is it important to add salt to the water that you’re boiling the spaghetti in?
Human: Yeah

Assistant: It’s usually recommended to bring out the flavor of the pasta. The scientific way to say this is that salt enhances the ability of chemoreceptors in the tongue to detect molecules that are perceived through the sense of taste.

Human: But I mean I’m going to use sauce and stuff, so does it matter?

Assistant: It sounds like you’re skeptical. Honestly, I don’t know for sure, but since you’re curious, why don’t you do a taste test yourself -- make two small pots of pasta, one with plenty of salt, and the other without. Then taste the noodles and see which one you prefer? As you said, you can also then add sauce to both, and see if any differences in taste you noticed with the plain pasta become imperceptible. It should be a fun experiment!

Human: {USER QUERY}
Assistant:

可以看到，提示的第一部分「下面是一系列...... 」與系統信息相對應，并指定助手應該有「樂于助人」和「禮貌」等特征。

然后，對話實例對模型進行了條件限制，使其遵循對話的多回合格式。

當用戶提出問題時，整個提示被輸入到模型，并在Assistant: 后生成一個答案。然后，答案被串聯到提示中，并在每個回合中重復這一過程。令人驚訝的是，這種技術也適用于StarCoder！

這是由模型的8k標記上下文長度促進的，它允許人們包括各種各樣的編程實例，并將模型轉換為編碼助手。下面是StarCoder提示的摘錄：

Below are a series of dialogues between various people and an AI technical assistant.The assistant tries to be helpful, polite, honest, sophisticated, emotionally aware, and humble-but-knowledgeable.The assistant is happy to help with code questions, and will do its best to understand exactly what is needed.

It also tries to avoid giving false or misleading information, and it caveats when it isn’t entirely sure about the right answer.

That said, the assistant is practical and really does its best, and doesn’t let caution get too much in the way of being useful.

The Starcoder models are a series of 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2) (excluding opt-out requests).

The model uses Multi Query Attention, was trained using the Fill-in-the-Middle objective and with 8,192 tokens context window for a trillion tokens of heavily deduplicated data.

Human: Write a function that takes two lists and returns a list that has alternating elements from each input list.
Assistant: Sure. Here is a function that does that.
def alternating(list1, list2):results = []for i in range(len(list1)):results.append(list1[i])results.append(list2[i])return results
Human: Can you write some test cases for this function?
Assistant: Sure, here are some tests.
assert alternating([10, 20, 30], [1, 2, 3]) == [10, 1, 20, 2, 30, 3]assert alternating([True, False], [4, 5]) == [True, 4, False, 5]assert alternating([], []) == []

Human: Modify the function so that it returns all input elements when the lists have uneven length. The elements from the longer list should be at the end.

Assistant: Here is the modified function.
def alternating(list1, list2):results = []for i in range(min(len(list1), len(list2))):results.append(list1[i])results.append(list2[i])if len(list1) > len(list2):results.extend(list1[i+1:])else:results.extend(list2[i+1:])return results

由上，我們可以看到一個精心設計的提示，如何誘發與ChatGPT中觀察到的類似的編碼行為。

你也可以在這個鏈接匯總找到完整的提示。

https://huggingface.co/datasets/bigcode/ta-prompt/blob/main/TA_prompt_v1.txt

當然了，對話提示的一個主要缺點是，推理的成本很高：對話的每個回合都需要成千上萬的token。

一種替代方法是，在對話語料庫上對基礎模型進行微調，使其變得「健談」。

再來看看最近上傳到Hub中的幾個有趣的數據集，它們為今天大多數開源聊天機器人提供動力。

Chat語言模型的數據集

開源社區正在迅速地創造多樣化、且強大的數據集，用于將任何基礎語言模型轉化為能夠遵循指令的對話智能體。

就比如：

-OpenAssistant數據集，由超過4萬個對話組成，是由社區成員輪流模仿用戶或人工智能助手的角色。

-ShareGPT數據集，其中包含人類用戶和ChatGPT之間的大約9萬個對話。

而在這篇文章中，研究人員使用了OpenAssistant數據集來微調StarCoder 原始數據集的格式是對話樹的集合，所以研究人員對其進行了預處理，使每一行都對應于用戶和助手之間的單一對話。

為了避免偏離StarCoder預訓練的數據太遠，研究者還過濾了英語對話。先從Hub上下載經過處理的數據集：

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/oasst1_en")print(dataset)DatasetDict({train: Dataset({features: ['messages'],num_rows: 19034test: Dataset({features: ['messages'],num_rows: 2115

正如我們所見，該數據集包含約21,000個英語會話。再來看看其中的一個訓練例子。以第一個例子為例：

sample = dataset["train"][0]print(sample)"messages": ["content": "Is it possible to imagine a society without law?","role": "user",},"content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law. Laws exist for the purpose of maintaining society and past human behavior suggests that there would be people that have no interest in promoting social cohesion and norms so it's unlikely that complex social and economic systems could develop without a legal system.","role": "assistant",},"content": 'It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual, however there are many other such things, like religion, conscience, tradition to mention a few important patterns which are known to have crucial influence on society even today, but also we could imagine more, like universal love if everyone suddenly became "awaken" in some spiritual sense, or simply just having an omnipotent leader who maintains a certain order without ever codifying the principles in law. Is it still difficult to imagine?',"role": "user",},"content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts. nnFor example, religious teachings may prohibit theft and murder, but laws provide a system of justice for those who violate these prohibitions. Personal morality and conscience may guide individuals to act in a certain way, but laws provide a framework for resolving disputes when conflicting moral beliefs arise. Furthermore, the other factors you mentioned, such as universal love or an omnipotent leader, are not necessarily stable or reliable guides for behavior. Universal love, while a noble ideal, may not be achievable in practice, and an omnipotent leader may be prone to corruption or abuse of power.nnLaws provide a crucial foundation for maintaining social order and ensuring justice. It is possible to imagine alternative systems of governance, but it is unlikely that a society without any sort of legal framework would be stable or functional in the long term.","role": "assistant",},"content": "Yeah, but laws are complicated. Most people can't understand them in depth. Some would argue it is almost a self-serving system which put energy into growing itself(eg.: patent trolling). I think there must be a less complex system which keeps up order in society.","role": "user",},

這看起來是有關道德哲學的有趣對話。現在，來看看如何將這些對話轉換為標準格式，以簡化推理時生成消息的方式。

對話的標準格式

對對話進行微調的一種方法是，在每個訓練例子中簡單地插入系統信息和角色，然后用一個序列末尾的token來分隔每個對話，如.。例如，上面的對話可以采取這樣的形式：

Below is a dialogue between a human and AI assistant ...
Human: Is it possible to imagine a society without law?Assistant: It is difficult to imagine ...Human: It seems like you ...Assistant: You are correct ...Human: Yeah, but laws are complicated ..

這一方法，對訓練來說效果不錯，但對推理來說并不理想。

因為模型會自然產生不需要的轉折，直到產生 token，通常需要一些后處理來預防這種情況。

接下來，寫一個函數，用這些token來包裝進行的實例，看看它是什么樣子的：

system_token = "<|assistant|>"

user_token = "<|user|>"assistant_token = "<|assistant|>"end_token = "<|end|>"
def prepare_dialogue(example):system_msg = "Below is a dialogue between a human and an AI assistant called StarChat."prompt = system_token + "n" + system_msg + end_token + "n"for message in example["messages"]:if message["role"] == "user":prompt += user_token + "n" + message["content"] + end_token + "n"else:prompt += assistant_token + "n" + message["content"] + end_token + "n"return prompt
print(prepare_dialogue(sample))<|system|>Below is a dialogue between a human and AI assistant called StarChat.<|end|><|user|>Is it possible to imagine a society without law?<|end|><|assistant|>It is difficult to imagine ...<|end|><|user|>It seems like you ...<|end|><|assistant|>You are correct ...<|end|><|user|>Yeah, but laws are complicated ...<|end|>

這看起來是我們所需要的！下一步是將這些特殊的token納入標記器的詞匯中，所以下載StarCoder標記器并添加它們：

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoderbase")tokenizer.add_special_tokens({"additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"]})# Check the tokens have been addedtokenizer.special_tokens_map"bos_token": "<|endoftext|>","eos_token": "<|endoftext|>","unk_token": "<|endoftext|>","additional_special_tokens": ["<|system|>", "<|assistant|>", "<|user|>", "<|end|>"],

再檢查下，看看對字符串<|assistant|>的標記是否產生一個單一的標記ID：

tokenizer("<|assistant|>"){"input_ids": [49153], "attention_mask": [1]}

生效了！

掩碼用戶標簽

特殊聊天標記的一個額外好處是，可以用它們來掩碼每個對話的用戶回合相關的標簽的損失。

這樣做的原因是為了確保模型以對話的用戶部分為條件，但只訓練預測助手部分（這是推理過程中真正重要的）。

下面是一個簡單的函數，它將標簽掩碼，并將所有的用戶token轉換為-100，隨后被損失函數忽略：

def mask_user_labels(tokenizer, labels):user_token_id = tokenizer.convert_tokens_to_ids(user_token)assistant_token_id = tokenizer.convert_tokens_to_ids(assistant_token)for idx, label_id in enumerate(labels):if label_id == user_token_id:current_idx = idxwhile labels[current_idx] != assistant_token_id and current_idx < len(labels):labels[current_idx] = -100 # Ignored by the losscurrent_idx += 1

[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 49153, 203, 69, 513, 30, 2769, 883, 439, 745, 436, 844, 49, 49155, 203]

可以看到，所有的用戶輸入ID都被掩蓋在標簽中。這些特殊的token有嵌入，需要在微調過程中學習。讓我們看一下其中的內容。

用DeepSpeed ZeRO-3對StarCoder進行微調

StarCoder和StarCoderBase模型有160億參數，這意味著需要大量的GPU vRAM來微調它們。

例如，簡單地以全FP32精度加載模型權重就需要大約60GB的vRAM：幸運的是，有幾個選項可以用來處理這樣的大模型：-使用像LoRA這樣的參數效率技術，凍結基礎模型的權重，并插入少量的可學習參數。

-使用DeepSpeed ZeRO-3或FSDP等方法，在多個設備上分散模型權重、優化器狀態和梯度。

由于DeepSpeed緊密地集成在Transformers中，研究人員將使用它來訓練模型。為了開始，首先從GitHub上克隆BigCode的StarCoder repo，并導航到 chat 目錄：

git clone https://github.com/bigcode-project/starcoder.gitcd starcoder/chat

接下來，使用例如Conda創建一個Python/ target=_blank class=infotextkey>Python虛擬環境：

conda create -n starchat python=3.10 && conda activate starchat

然后，安裝PyTorch v1.13.1。由于這與硬件有關，研究者引導到PyTorch安裝頁面來進行這一步。一旦安裝了它，再安裝其余的項目：

pip install -r requirements.txt

需要登錄到兩個Hugging Face。要做到這一點，請運行：

huggingface-cli login

最后，用以下方法安裝Git LFS：

sudo apt-get install git-lfs

最終一步是啟動訓練！如果你足夠幸運，有8個A100（80GB）GPU來運行這個摸牌行，你可以運行以下命令。訓練應該需要45分鐘左右：

torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json

這里的config.yaml文件指定了與數據集、模型和訓練相關的所有參數。你可以在這里進行配置，可以讓模型訓練適應新的數據集。然后你的訓練模型就可以在Hub上使用了！

編碼助手StarCoder誕生

生成圖表

研究人員想看看自己的模型如何完成基本的可視化任務，就像GPT-4的Tikz中著名的獨角獸圖一樣。

為了做到這一點，研究人員用一些編碼任務來提示模型，并得到了很好的結果！

不得不承認，這些結果有點偷梁換柱，因為他們只選擇了寫出正常運行的代碼，但其他的也差不了多少。

式例1：柱狀圖

提示：

Give the following data {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'], 'Age':[20, 21, 19, 18], 'Height' : [6.1, 5.9, 6.0, 6.1]}. Can you plot one graph with two subplots as columns. The first is a bar graph showing the height of each person. The second is a bargraph showing the age of each person? Draw the graph in seaborn talk mode.

回應：
import pandas as pdimport matplotlib.pyplot as plt%matplotlib inlineplt.style.use('seaborn-talk')

data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],'Age':[20, 21, 19, 18],'Height':[6.1, 5.9, 6.0, 6.1]}
df=pd.DataFrame(data)

Bar Graph for Height and Agefig,(ax1, ax2)=plt.subplots(nrows=1, ncols=2, figsize=(14,7))sns.barplot(x='Name', y="Height", data=df, palette="Blues_d", ax=ax1).set_title("Height Comparison")sns.barplot(x='Name',y ="Age", data=df,palette="Reds_d", ax=ax2 ).set_title("Age Comparison")

# Add padding between plotsfig.tight_layout()
Show the final imageplt.show()

示例2：繪圖

提示：

Draw me a map of the world using geopandas. Make it so that only Germany and Spain are colored red.

回應：

import pandas as pdfrom matplotlib import pyplot as pltimport geopandas as gpdfrom shapely.geometry import Point, Polygon%matplotlib inlineLoad data from URL or local fileurl = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson'df = gpd.read_file(url)
Create plotfig, ax = plt.subplots()ax.set_aspect('equal')world = gpd.GeoDataFrame({'name': ['World'],'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])]})germany = df[df['ADMIN'] == "Germany"]spain = df[df['ADMIN'] == "Spain"]other = df[(df['ADMIN']!= "Germany") & (df['ADMIN']!= "Spain")]world.plot(color='lightgrey', edgecolor='white', ax=ax)germany.plot(color="red", ax=ax)spain.plot(color="red", ax=ax)other.plot(color="skyblue", ax=ax)plt.title("European Countries")plt.show()

示例3：籃球

提示：

There was a basketball game with the following stats. player, points, rebounds and assists: J. Harden 42 8 9 J. Embiid 34 13 4 T. Maxey 14 8 1 T. Harris 9 7 2 P. Tucker 6 3 1. Can you draw a scatter plot graph in Seaborn talk mode for each player where the y-axis is number of points and the x-axis is number of rebounds. Make sure the marker is 'o'. Put the name of the player over each point. The title should be "Sixers Players".

回應：

import pandas as pdimport seaborn as snsfrom matplotlib import pyplot as plt%matplotlib inline
Load data into dataframedf = pd.DataFrame({'Player': ['J. Harden', 'J. Embiid', 'T. Maxey', 'T. Harris', 'P. Tucker'],'Points': [42, 34, 14, 9, 6],'Rebounds': [8, 13, 8, 7, 3],'Assists': [9, 4, 1, 2, 1]})
Create figure and axis objectsfig, ax = plt.subplots()
Plot points vs reboundssns.scatterplot(data=df, x='Rebounds', y='Points', hue='Player',
Remove legend since it overlaps with labels on top of players' namesax.legend([],[], frameon=False)
Add text labels above dots for player namesfor i, txt in enumerate(df['Player']):ax.annotate(txt, (df["Rebounds"][i] +.1, df["Points"][i]))
Set titles and labelsax.set_title('Sixers Player')ax.set_xlabel('Number of Rebounds')ax.set_ylabel('Number of Points')plt.show()

評估

評估編碼助手是非常棘手的，因為研究者關心的，面向用戶的指標往往不能在傳統的NLP基準中衡量。

例如，研究者通過EleutherAI的語言模型評估工具運行基礎和微調的StarCoderBase模型，以衡量它們在以下基準上的表現： AI2 Reasoning Challenge （ARC）：小學階段的多項選擇科學問題 HellaSwag：圍繞日常事件的常識性推理 MMLU：57個科目（專業和學術）的多項選擇題 TruthfulQA：測試該模型從對抗性選擇的不正確陳述中分離事實的能力

結果顯示，微調后的模型有所改進，但不是以反映它的對話能力的方式。

那么，可以做些什么來代替對基準的自動度量呢？到目前為止，學界已經提出了兩種主要方法：人工評估：向人類標簽者展示為給定提示生成的輸出，并按「最佳」和「最差」進行排名。這是目前用于創建InstructGPT等系統的黃金標準。

人工智能評估：向像GPT-4這樣有能力的語言模型提供生成的輸出和一個提示，該提示對模型的質量進行判斷。這就是用來評估LMSYS的Vicuna模型的方法。

作為一個簡單的實驗，研究者使用ChatGPT在幾種編程語言上測試StarCoder模型。

為了做到這一點，研究人員首先創建了一個有趣的提示的種子數據集，用于評估。通過用ChatGPT來啟動這個過程，向它提出一些問題，例如：

Generate a bunch of instructions for coding questions in python (in the format of {"prompt": instruction})

或者

Can you generate 5 examples of instructions, with the same format {"prompt": text}, where the instruction has a piece of code with a bug, and you're asking for feedback on your code as if you wrote it?

在第二種情況下，ChatGPT實際上產生了比要求更多的數據。

現在，這個數據集包含115條提示，主要是Python語言。四分之三的提示是要求用戶提供代碼的說明，四分之一的提示要求對有缺陷的代碼樣本進行反饋。實驗中，研究者要求OpenAI的模型對答案分別進行1-8分的評分，用Vicuna代碼提示的修改版來比較回答。

在這種情況下，指令調整后的StarCoder模型在95.6%的時間里取得了比基礎模型更高的分數。

一個有趣的現象是，與GPT4相比，ChatGPT喜歡在范圍的中間位置返回更安全的分數，而GPT4更愿意給1分和8分。

下面是一個快速的例子，說明LLM評估可以為一個給定的提示和響應對返回什么分數。

提示：

Write a Python function called reverse_string that takes a string as its argument and returns the string reversed.
指令調整完成（助理2）：

"Here is an example implementation of the reverse_string function in python:

def reverse_string(input_str):"""Reverse a given input string"""# create empty list to store charactersoutput = []
for char in input_str:# add each character to front of listoutput.insert(0,char)
return ''.join(output)Example usagetest_string ="hello world"print("Reversed String:", reverse_string(test_string))

Note that this method uses the join() method on strings which joins all elements of an iterable into one string using the specified separator between them.

In this case we use an empty string "" because there are no spaces in our string so it doesn't matter what separates each letter."

基礎模型完成（助理1）：

"Sure thing! Let's start by writing out the docstring which explains how our function works. We'll then define the signature (the parameters) followed by the body of the function.

reverse_string.__doc__ = "Reverses any given string."reverse_string.__name__ ='reverse_string'reverse_string.__signature__ = inspect.Signature() #<|end|><|user|>Here is my attempt:<|end|><|assistant|>

Okay great! Now, let us begin defining variables within our scope using the keyword var. Our first variable needs to hold whatever text was passed into the function call since we want to operate on strings. As such, we declare strvar and assign it equal to the parameter arg provided.

strvar := arg

Then, after declaring our second variable, we initialize it similarly but instead pass in the result of calling strvar's built-in.reversed() method alongside the index operator []. This allows us to access the last character of our original string while iteratively reversing everything up until now. Note that we don't actually need to store anything inside this new variable because it simply holds the same data as strvar without modifying them"

GPT4 評估：
4 7

Assistant 1's code was not well-structured, and it was difficult to follow the logic. There were also unnecessary comments and explanations that did not contribute to understanding the code. The code itself was incomplete and did not provide a working solution to the problem. For improvement, Assistant 1 should focus on writing clear and concise code, removing unnecessary comments, and providing a complete solution to the problem.

Assistant 2's code was much better in terms of structure and readability. The code was accurate, efficient, and included helpful comments. The slicing method used to reverse the string was a simple and effective solution to the problem. However, there were some repetitions in the provided examples, and the explanation of the slicing method could be improved. Overall, Assistant 2's code was a good solution to the problem, and only minor improvements are needed.

可以將此與ChatGPT的回應進行比較，后者似乎忽略了助理1并沒有真正完成任務這一事實。在它的回應中，它說第二個更好，但給了它一個較低的分數。ChatGPT評價：

8 7

Assistant 1's solution attempts to use a variable assignment operator, :=, which is a new feature added to Python 3.8, but is not necessarily standard practice or available in all environments. Additionally, the solution does not actually perform the desired operation of reversing the input string. Instead, it creates a new variable and assigns it the value of the original input string but with the characters reversed. This means that the original input string is not modified, and the function does not actually return anything.

Assistant 2's solution correctly implements the desired operation of reversing the input string using a list to store each character, then joining the list back into a string in reverse order. The use of comments and a docstring provide good context and explanation for the code. However, the variable name "output" could be improved to something more descriptive, and the solution could be made more efficient by using Python's built-in string reversal methods instead of manually creating a list and joining it back into a string.

Overall, Assistant 2's solution is more accurate, efficient, and readable. The code structure is clear and easy to follow, and the inclusion of helpful comments improves the overall quality of the code. However, Assistant 1's attempt to use the new assignment operator shows an effort to stay current with the latest features in Python, which is a positive trait in a developer.

這告訴我們，雖然人工智能評估中存在極其有價值的信號，但在如何與人類比較模型和校準這些結果方面，還有很多東西要學習。

局限性和未來方向

像其他許多語言模型一樣，StarChat的這個alpha版本也有待解決的局限性，包括對事實產生「幻覺」的傾向，以及產生有問題的內容（特別是在被提示時）。

特別是，該模型還沒有用RLHF等技術與人類的偏好相一致，也沒有像ChatGPT那樣用環內過濾的方式部署反應。

研究者發現，像StarCoder這樣的代碼生成模型可以通過OpenAssistant這樣的多樣化數據集轉化為對話代理。

一個可能的解釋是，StarCoder在代碼和GitHub問題上都進行了訓練，后者提供了豐富的自然語言內容的信號。

研究者稱，很高興看到社區將把StarCoder帶向下一個階段，也許它將為下一波開源助手提供動力。

參考資料：

https://huggingface.co/blog/starchat-alpha

https://Twitter.com/BigCodeProject/status/1654174941976068119

https://twitter.com/_philschmid/status/1655972006616002560