
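For data scientists working in a team with a variety of roles, writing clean code is an essential skill because:

Clean code improves readability, making it easier for team members to understand and contribute to the codebase.
Clean code improves maintainability, simplifying tasks such as debugging, modifying, and extending existing code.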

To achieve maintainability, our Python functions should:

Be small
Do only one task
Contain no duplication
Have one level of abstraction
Have a descriptive name
Have fewer than four arguments

Let's start by looking at the get_data function below.

import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path
import gdown

def get_data(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
):
    # Download data from Google Drive
    zip_path = "Twitter.zip"
    gdown.download(url, zip_path, quiet=False)

    # Unzip data
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

    # Extract texts from files in the train directory
    t_train = []
    for file_path in Path(raw_train_path).glob("*.xml"):
        list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        train_doc_1 = " ".join(t for t in list_train_doc_1)
        t_train.append(train_doc_1)
    t_train_docs = " ".join(t_train)

    # Extract texts from files in the test directory
    t_test = []
    for file_path in Path(raw_test_path).glob("*.xml"):
        list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
        test_doc_1 = " ".join(t for t in list_test_doc_1)
        t_test.append(test_doc_1)
    t_test_docs = " ".join(t_test)

    # Write processed data to a train file
    with open(processed_train_path, "w") as f:
        f.write(t_train_docs)

    # Write processed data to a test file
    with open(processed_test_path, "w") as f:
        f.write(t_test_docs)


if __name__ == "__main__":
    get_data(
        url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
        zip_path="Twitter.zip",
        raw_train_path="Data/train/en",
        raw_test_path="Data/test/en",
        processed_train_path="Data/train/en.txt",
        processed_test_path="Data/test/en.txt",
    )

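Despite the many comments in this function, it is hard to understand what it does because:

The function is long.
The function tries to accomplish multiple tasks.
The code within the function is at different levels of abstraction.
The function has many arguments.
The same code is duplicated in several places.
The function lacks a descriptive name.

We will refactor this code using the six practices listed at the beginning of the article.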

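Small

A function should be kept small to improve its readability. Ideally, a function should not exceed 20 lines of code. In addition, its indentation level should not be greater than one or two.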

import zipfile
import gdown

def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

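Do only one task

A function should have a single focus and perform a single task. The get_data function tries to accomplish several: retrieving data from Google Drive, extracting text, and saving the extracted text.

Therefore, this function should be split into several smaller functions, as shown below: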

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

Each of these functions should have a single purpose:

def get_raw_data(url: str, zip_path: str) -> None:
    gdown.download(url, zip_path, quiet=False)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(".")

The function get_raw_data performs only one action: getting the raw data.
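The article does not show the other two helpers that main calls. The sketch below is one way they could look, assuming the XML layout implied by the original get_data (texts stored in the children of the root's first child). The bodies of get_train_test_docs and save_train_test_docs are assumptions, not the author's code, and a compact version of the extraction helper developed in the following sections is included only so the sketch runs on its own.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Tuple


def extract_texts_from_multiple_files(folder_path: str) -> str:
    """Join the text of every XML file in a folder (compact version)."""
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        texts = [r.text for r in ET.parse(file_path).getroot()[0]]
        all_docs.append(" ".join(texts))
    return " ".join(all_docs)


def get_train_test_docs(raw_train_path: str, raw_test_path: str) -> Tuple[str, str]:
    """Assumed body: only extracts text; it neither downloads nor saves."""
    t_train = extract_texts_from_multiple_files(raw_train_path)
    t_test = extract_texts_from_multiple_files(raw_test_path)
    return t_train, t_test


def save_train_test_docs(
    processed_train_path: str,
    processed_test_path: str,
    t_train: str,
    t_test: str,
) -> None:
    """Assumed body: only writes the already-extracted text to disk."""
    Path(processed_train_path).write_text(t_train, encoding="utf-8")
    Path(processed_test_path).write_text(t_test, encoding="utf-8")
```

Because each helper has one job, a change to how text is saved touches only save_train_test_docs, and a change to the download step touches only get_raw_data.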

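No duplication

We should avoid duplication because:

Duplicated code makes the code harder to read.
Duplicated code makes changes more complicated. A modification has to be made in several places, which increases the chance of errors.

The code below contains duplication: the code that retrieves the training data and the code that retrieves the test data are nearly identical.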

import xml.etree.ElementTree as ET
from pathlib import Path

# Extract texts from files in the train directory
t_train = []
for file_path in Path(raw_train_path).glob("*.xml"):
    list_train_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    train_doc_1 = " ".join(t for t in list_train_doc_1)
    t_train.append(train_doc_1)
t_train_docs = " ".join(t_train)

# Extract texts from files in the test directory
t_test = []
for file_path in Path(raw_test_path).glob("*.xml"):
    list_test_doc_1 = [r.text for r in ET.parse(file_path).getroot()[0]]
    test_doc_1 = " ".join(t for t in list_test_doc_1)
    t_test.append(test_doc_1)
t_test_docs = " ".join(t_test)

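We can eliminate the duplication by consolidating the repeated code into a single function named extract_texts_from_multiple_files, which extracts texts from multiple files in a specified location.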

def extract_texts_from_multiple_files(folder_path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)

Now you can use this function to extract texts from different locations without repeating code.

t_train = extract_texts_from_multiple_files(raw_train_path)
t_test  = extract_texts_from_multiple_files(raw_test_path)

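One level of abstraction

The level of abstraction is the amount of complexity at which a system is viewed. A high level refers to a more generalized view of the system, while a low level refers to its more specific aspects.

Keeping the same level of abstraction within a code segment is good practice, because it makes the code easier to understand.

The following function illustrates the problem: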

def extract_texts_from_multiple_files(folder_path) -> str:

    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
        text_in_one_file = " ".join(list_of_text_in_one_file)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)

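The function itself is at a high level, but the code inside the for loop involves lower-level operations: XML parsing, text extraction, and string manipulation.

To resolve this mix of abstraction levels, we can encapsulate the low-level operations in the extract_texts_from_each_file function: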

def extract_texts_from_multiple_files(folder_path: str) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        text_in_one_file = extract_texts_from_each_file(file_path)
        all_docs.append(text_in_one_file)

    return " ".join(all_docs)
    

def extract_texts_from_each_file(file_path: str) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)

This introduces a higher level of abstraction for the text extraction process and makes the code more readable.
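The article never shows what the XML files look like, which makes getroot()[0] hard to picture. The sketch below runs extract_texts_from_each_file on a made-up sample. The tag names (author, documents, document) are assumptions chosen for illustration; the only thing the function actually relies on is that the texts sit in the children of the root's first child.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical sample; the real dataset's tag names may differ.
SAMPLE_XML = """<author lang="en">
  <documents>
    <document>first tweet</document>
    <document>second tweet</document>
  </documents>
</author>"""


def extract_texts_from_each_file(file_path: Path) -> str:
    # getroot()[0] is the first child of the root (<documents> here);
    # iterating over it yields each <document> element in file order.
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.xml"
    sample_path.write_text(SAMPLE_XML, encoding="utf-8")
    result = extract_texts_from_each_file(sample_path)
    print(result)  # first tweet second tweet
```

All knowledge of the XML structure now lives in one small function, so the folder-level loop does not need to change if the file format does.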

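Descriptive name

A function's name should be descriptive enough that users can understand its purpose without reading the code. A longer, descriptive name is better than a vague one. For example, naming a function get_texts is less clear than naming it extract_texts_from_multiple_files.

However, if a function's name becomes too long, such as retrieve_data_extract_text_and_save_data, it is a sign that the function is probably doing too much and should be split into smaller functions.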

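Fewer than four arguments

As the number of arguments grows, it becomes harder to keep track of their order, their purpose, and how they relate to one another. This makes the function difficult for developers to understand and use.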

def main(
    url: str,
    zip_path: str,
    raw_train_path: str,
    raw_test_path: str,
    processed_train_path: str,
    processed_test_path: str,
) -> None:
    get_raw_data(url, zip_path)
    t_train, t_test = get_train_test_docs(raw_train_path, raw_test_path)
    save_train_test_docs(processed_train_path, processed_test_path, t_train, t_test)

To improve readability, you can encapsulate several related arguments in a single data structure, using a dataclass or a Pydantic model.

from pydantic import BaseModel

class RawLocation(BaseModel):
    url: str
    zip_path: str
    path_train: str
    path_test: str


class ProcessedLocation(BaseModel):
    path_train: str
    path_test: str


def main(raw_location: RawLocation, processed_location: ProcessedLocation) -> None:
    get_raw_data(raw_location)
    t_train, t_test = get_train_test_docs(raw_location)
    save_train_test_docs(processed_location, t_train, t_test)
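The text mentions dataclasses as an alternative but only shows the Pydantic version. A minimal sketch of the same grouping with the standard library's dataclasses module follows. The function summarize_locations is a hypothetical stand-in for main, used only to show the two-argument call, and frozen=True is an optional choice made here, not something the article prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class RawLocation:
    url: str
    zip_path: str
    path_train: str
    path_test: str


@dataclass(frozen=True)
class ProcessedLocation:
    path_train: str
    path_test: str


def summarize_locations(
    raw_location: RawLocation, processed_location: ProcessedLocation
) -> str:
    # Hypothetical stand-in for main(): two arguments instead of six.
    return f"{raw_location.path_train} -> {processed_location.path_train}"


raw_location = RawLocation(
    url="https://drive.google.com/uc?id=1jI1cmxqnwsmC-vbl8dNY6b4aNBtBbKy3",
    zip_path="Twitter.zip",
    path_train="Data/train/en",
    path_test="Data/test/en",
)
processed_location = ProcessedLocation(
    path_train="Data/train/en.txt",
    path_test="Data/test/en.txt",
)
print(summarize_locations(raw_location, processed_location))
```

Unlike a Pydantic model, a dataclass does not validate field types at runtime, so the annotations here serve as documentation and as hints for static type checkers.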

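How do I write functions like this?

You do not need to memorize all of these best practices when writing Python functions. A good indicator of a function's quality is its testability. If a function can be tested easily, that suggests it is modular, performs a single task, and contains no duplicated code.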

def save_data(processed_path: str, processed_data: str) -> None:
    with open(processed_path, "w") as f:
        f.write(processed_data)


def test_save_data(tmp_path):
    processed_path = tmp_path / "processed_data.txt"
    processed_data = "Sample processed data"

    save_data(processed_path, processed_data)

    assert processed_path.exists()
    assert processed_path.read_text() == processed_data
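The same idea applies to the extraction helpers from the earlier sections. The following is a sketch of a pytest-style test, assuming pytest's tmp_path fixture and a made-up XML layout in which the texts are children of the root's first child. The file names and contents are invented for the test.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_texts_from_each_file(file_path: Path) -> str:
    list_of_text_in_one_file = [r.text for r in ET.parse(file_path).getroot()[0]]
    return " ".join(list_of_text_in_one_file)


def extract_texts_from_multiple_files(folder_path: Path) -> str:
    all_docs = []
    for file_path in Path(folder_path).glob("*.xml"):
        all_docs.append(extract_texts_from_each_file(file_path))
    return " ".join(all_docs)


def test_extract_texts_from_multiple_files(tmp_path):
    # Hypothetical XML layout used only for this test.
    template = "<author><documents><document>{}</document></documents></author>"
    (tmp_path / "a.xml").write_text(template.format("alpha"), encoding="utf-8")
    (tmp_path / "b.xml").write_text(template.format("beta"), encoding="utf-8")
    # A non-XML file should be skipped by the "*.xml" pattern.
    (tmp_path / "notes.txt").write_text("ignored", encoding="utf-8")

    result = extract_texts_from_multiple_files(tmp_path)

    # glob() does not guarantee an order, so compare order-independently.
    assert sorted(result.split()) == ["alpha", "beta"]
```

The test needs no network access and no real dataset, which is only possible because downloading, extracting, and saving were separated into different functions.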

Reference: Martin, R. C. (2009). Clean Code: A Handbook of Agile Software Craftsmanship. Upper Saddle River: Prentice Hall.