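Preface

A while ago, when I was giving my teammates an introduction to ClickHouse clusters, I mentioned that a ClickHouse cluster is almost decentralized: the CK instances in the cluster are peers, with no master/slave distinction. The replicated-table and distributed-table mechanisms merely rely on an external ZooKeeper for distributed coordination. After thinking for a moment, I added one more sentence:

“In fact, peer-to-peer communication alone is enough to maintain the complete cluster state and make the cluster self-governing; Redis Cluster is an example.”

Of course, there was no time to elaborate then. I am well rested this weekend and finally have some spare time, so let me say a few words about it.

Before the official Redis Cluster appeared, clustering Redis relied on sharding plus a proxy, as in Twemproxy and Codis (I have previously written about Codis clusters). The official Redis Cluster took the decentralized route instead. Its communication is built on the Gossip protocol, which also provides eventual consistency and availability. This article starts by introducing the protocol.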

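The Gossip Protocol

Introduction

For the past few months I have been watching Friends over meals. Rachel, who believes she never gossips, sums up the essence of gossip in a single line.

Real-life rumors spread through exactly this mechanism, "I hear something and I pass that information on", and they spread very fast. The Gossip protocol borrows this property. It is widely used in P2P networks and distributed systems, and its underlying idea is very simple:

In a cluster on a bounded network, if every node exchanges a specific piece of information with randomly chosen other nodes, then after a sufficiently long time every node's view of that information converges to the same value.

The "specific piece of information" usually means the cluster state, the state of each node, and other metadata. The Gossip protocol is thus fully in the spirit of BASE, so it can be used in essentially any area that only requires eventual consistency; typical examples are blockchains and some distributed storage systems. It also makes elastic clusters (where nodes can join and leave at any time) easy to implement, supporting features such as failure detection and dynamic load balancing.

The GIF below shows one possible way a message propagates under the Gossip protocol. Blue nodes are unaware of the message; red nodes are aware of it.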

Source: https://managementfromscratch.wordpress.com/2016/04/01/introduction-to-gossip/

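To make the Gossip protocol easier to describe and analyze, it is usually expressed in terms of the SIR model from epidemiology, because the spread of a pandemic (such as the current COVID-19 outbreak) resembles the spread of rumors, and a mature mathematical model for it has already been worked out.

The SIR Model in Epidemiology

The SIR model was proposed by Kermack and McKendrick as early as 1927. It divides the population within the range of an epidemic into three classes:

S (susceptible): people who are not infected but lack immunity and are easily infected after contact with the infective;
I (infective): people who are infected and can pass the pathogen on to the susceptible;
R (removed): people who have been isolated in a non-contagious environment, or who have recovered and gained immunity and are no longer susceptible.

If population growth and decline are ignored, that is, if s(t)+i(t)+r(t) stays constant, the SIR model can be expressed by the following system of differential equations.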
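
In the usual normalized form (assumed here, with s, i and r taken as fractions of the population so that s + i + r = 1), the system reads:

```latex
\begin{aligned}
\frac{ds}{dt} &= -\beta\, s\, i \\
\frac{di}{dt} &= \beta\, s\, i - \gamma\, i \\
\frac{dr}{dt} &= \gamma\, i
\end{aligned}
```

Adding the three right-hand sides gives zero, which is the conservation condition stated above.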

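Here the coefficient β is the infection rate and γ is the recovery rate. To contain and eventually eliminate an epidemic, medicine tries to lower the infection rate and raise the recovery rate. In the context of the Gossip protocol, computer scientists want exactly the opposite: to "infect" all nodes in the cluster (make them aware of the information) as efficiently as possible. Two main Gossip propagation models are derived from the SIR model, anti-entropy and rumor mongering, introduced in turn below.

Anti-entropy

Entropy is the physical measure of how disordered a system is, and anti-entropy reaches eventual consistency through seemingly chaotic communication. Anti-entropy uses only the S and I states of the SIR model: S means the node has not yet learned the data, and I means the node has learned the data and is spreading it to other nodes. Concretely, there are three ways to implement anti-entropy gossip:

Push: a node in state I periodically picks another node at random and sends it the data it holds;
Pull: a node in state S periodically picks another node at random and asks to receive the data that node holds;
Push-pull: a combination of the two.

The figure below gives pseudocode for running the anti-entropy Gossip protocol with period Δ in a bounded cluster P.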
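
As a rough companion to that pseudocode, here is a minimal simulation sketch of the three modes. It assumes synchronous rounds, one random peer per node per round, and a single initially informed node. All names (`ae_simulate`, `ae_mode`, the xorshift generator) are made up for this illustration and do not come from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum ae_mode { AE_PUSH, AE_PULL, AE_PUSH_PULL };

/* Small deterministic PRNG (xorshift32) so that runs are reproducible. */
static uint32_t ae_rng_state = 2463534242u;
static uint32_t ae_rand(void) {
    uint32_t x = ae_rng_state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return ae_rng_state = x;
}

/* Pick a peer (almost) uniformly among the other n-1 nodes. */
static int ae_random_peer(int self, int n) {
    int p = (int)(ae_rand() % (uint32_t)(n - 1));
    return p >= self ? p + 1 : p;
}

/* Runs synchronous rounds until every node is in state I.
 * Returns the number of rounds, or -1 if max_rounds is exceeded. */
int ae_simulate(int n, enum ae_mode mode, int max_rounds) {
    assert(n >= 2);
    unsigned char *cur = calloc((size_t)n, 1);    /* 0 = S, 1 = I */
    unsigned char *next = malloc((size_t)n);
    assert(cur && next);
    cur[0] = 1;                                   /* a single initial I node */
    int infected = 1, rounds = 0;
    while (infected < n && rounds < max_rounds) {
        for (int j = 0; j < n; j++) next[j] = cur[j];
        for (int p = 0; p < n; p++) {
            int q = ae_random_peer(p, n);
            /* push: an I node sends its data to q */
            if (mode != AE_PULL && cur[p]) next[q] = 1;
            /* pull: an S node asks q for its data */
            if (mode != AE_PUSH && !cur[p] && cur[q]) next[p] = 1;
        }
        infected = 0;
        for (int j = 0; j < n; j++) { cur[j] = next[j]; infected += cur[j]; }
        rounds++;
    }
    free(cur);
    free(next);
    return infected == n ? rounds : -1;
}
```

Calling `ae_simulate(1000, AE_PUSH, 200)` and the same with `AE_PULL` or `AE_PUSH_PULL` gives the number of rounds each mode needed in one run. The exact values depend on the random sequence.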

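How can its efficiency be analyzed? To simplify the problem, assume the following:

In each round, every node picks exactly one other node at random to communicate with;
Initially only one node is in state I and all the others are in state S.

Let s(t) denote the fraction of the n nodes that are in state S at time t (note that it is a fraction). Clearly s(0) = 1 - 1/n, and the expectation of s(t) can be computed as follows:

Push mode

Pull mode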
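
Under the two assumptions listed above, the per-round recurrences commonly given in the epidemic-algorithms literature are as follows. They are stated here as the standard result, not as a transcription of the original figure.

```latex
% Push: a susceptible node stays susceptible only if none of the
% n(1 - s(t)) infective nodes happens to pick it.
E[s(t+1)] = s(t)\left(1 - \frac{1}{n}\right)^{n\,(1 - s(t))} \approx s(t)\, e^{-(1 - s(t))}

% Pull: a susceptible node stays susceptible only if the node it
% contacts is itself susceptible.
E[s(t+1)] = s(t)^{2}
```

For small s(t), push shrinks the susceptible fraction by roughly a factor of 1/e per round, while pull squares it.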

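As the figure below shows, pull mode propagates information more efficiently than push mode: once most nodes are infected, the susceptible fraction is roughly squared in every round, which is faster than exponential, whereas under push it shrinks only by a constant factor (about 1/e) per round. Push-pull mode, which combines the two, is more efficient still.

However, each exchange costs one message in push mode, two in pull mode, and three in push-pull mode. Because anti-entropy exchanges the full data set every time, the volume can be large, so the choice of mode still has to take the network overhead into account.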

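Rumor mongering

Rumor mongering differs from anti-entropy in that it uses the full SIR model. A node in state R has received the information but no longer shares it with other nodes, as in the saying "a rumor stops with the wise". The other difference is that rumor mongering only exchanges the information that has changed rather than the full data set, so it consumes far less network resources than anti-entropy.

The figure below gives pseudocode for running the rumor-mongering Gossip protocol with period Δ in a bounded cluster P.

What do blind/feedback and coin/counter in the figure mean? They are the conditions under which a node moves from state I to state R.

coin: in each round, the node moves from I to R with probability 1/k.
counter: after taking part in k rounds of propagation (that is, after sending the information k times), the node moves from I to R.
feedback: after sending, the node may enter state R only once the peer has given feedback, typically that it already knew the information.
blind: after sending, the node need not wait for feedback from the peer and may enter state R at any time.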
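
The four conditions can be combined freely, and a small simulation helps to see their effect. The sketch below makes several assumptions: nodes are visited sequentially, one random peer is chosen per send, and "feedback" means a send is counted only when the peer already knew the rumor. The names `rm_simulate` and `rm_policy` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum rm_state { RM_S, RM_I, RM_R };

struct rm_policy {
    int feedback;  /* 1: only sends to already-informed peers count; 0: blind */
    int counter;   /* 1: retire after k counted sends; 0: coin, probability 1/k */
    int k;
};

/* Deterministic xorshift32 generator, so results are reproducible. */
static uint32_t rm_rng_state = 88172645u;
static uint32_t rm_rand(void) {
    uint32_t x = rm_rng_state;
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    return rm_rng_state = x;
}

/* Spreads one rumor from node 0 and returns the residue: the fraction of
 * nodes still in state S once no node is left in state I. */
double rm_simulate(int n, struct rm_policy pol) {
    assert(n >= 2 && pol.k >= 1);
    unsigned char *state = calloc((size_t)n, 1);   /* all RM_S (value 0) */
    int *hits = calloc((size_t)n, sizeof *hits);   /* counted sends per node */
    assert(state && hits);
    state[0] = RM_I;
    int active = 1;
    while (active > 0) {
        for (int p = 0; p < n; p++) {
            if (state[p] != RM_I) continue;
            int q = (int)(rm_rand() % (uint32_t)(n - 1));
            if (q >= p) q++;                       /* never pick ourselves */
            int already_knew = (state[q] != RM_S);
            if (!already_knew) { state[q] = RM_I; active++; }
            /* feedback counts only useless sends; blind counts every send */
            if (pol.feedback && !already_knew) continue;
            int retire = pol.counter ? (++hits[p] >= pol.k)
                                     : (rm_rand() % (uint32_t)pol.k == 0);
            if (retire) { state[p] = RM_R; active--; }
        }
    }
    int susceptible = 0;
    for (int j = 0; j < n; j++) susceptible += (state[j] == RM_S);
    free(state);
    free(hits);
    return (double)susceptible / n;
}
```

For example, `rm_simulate(10000, (struct rm_policy){1, 0, 5})` runs the feedback/coin variant with k = 5. Raising k trades extra messages for a smaller residue.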

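As shown above, rumor mongering terminates once no node is still spreading the rumor (every informed node has become "immune"), but this may leave some nodes permanently unaware of the message (that is, stuck in state S). Taking the coin condition as an example, the following system of differential equations can be written, where s and i still denote the fractions of nodes in states S and I.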
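
Assuming the classic formulation in which an infective node loses interest with probability 1/k each time it contacts a node that is no longer susceptible (the feedback variant of coin), the system is:

```latex
\begin{aligned}
\frac{ds}{dt} &= -s\, i \\
\frac{di}{dt} &= s\, i - \frac{1}{k}\,(1 - s)\, i
\end{aligned}
```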

Eliminating t gives:
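
Dividing the equation for i by the equation for s in the assumed feedback/coin model, with a loss term of (1/k)(1 - s)i, the step works out to:

```latex
\frac{di}{ds} = \frac{s\,i - \frac{1}{k}(1 - s)\,i}{-s\,i} = -\frac{k+1}{k} + \frac{1}{k\,s}
```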

From the initial condition i(1 - 1/n) = 1/n, which tends to i(1) = 0 for large n, one can derive:
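
Integrating di/ds = -(k+1)/k + 1/(ks) with respect to s and fixing the constant in the large-n limit, where i(1) = 0, yields (a reconstruction under the same feedback/coin assumption):

```latex
i(s) = \frac{k+1}{k}\,(1 - s) + \frac{1}{k}\,\ln s
```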

If we require i(s*) = 0:
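
Setting the expression i(s) = ((k+1)/k)(1 - s) + (1/k) ln s to zero and solving for the logarithm gives an implicit equation for the residue s*:

```latex
s^{*} = e^{-(k+1)\,(1 - s^{*})}
```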

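As can be seen, s* decreases exponentially as k grows. When k = 1, s* is about 20%, and when k = 5, s* is only about 0.25%. In other words, if each node moves from I to R with probability 1/5 per round, the result is already fairly safe.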
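
These figures can be checked numerically. The sketch below assumes the feedback/coin model ds/dt = -si, di/dt = si - (1/k)(1 - s)i, integrates it with forward Euler from one informed node out of n, and returns the s that remains once i has died out. The function name `coin_residue` and the step size are arbitrary choices for this illustration.

```c
#include <assert.h>

/* Integrates ds/dt = -s*i, di/dt = s*i - (1/k)*(1-s)*i with forward Euler,
 * starting from s = 1 - 1/n, i = 1/n, and returns the residual s. */
double coin_residue(int k, double n) {
    assert(k >= 1 && n > 1.0);
    const double dt = 1e-3;        /* Euler step size */
    const double eps = 1e-3 / n;   /* treat i below this as "died out" */
    double s = 1.0 - 1.0 / n;
    double i = 1.0 / n;
    for (long step = 0; step < 10000000L && i > eps; step++) {
        double ds = -s * i;
        double di = s * i - (1.0 / k) * (1.0 - s) * i;
        s += dt * ds;
        i += dt * di;
    }
    return s;
}
```

With n = 1e6, `coin_residue(1, 1e6)` lands near 0.20 and `coin_residue(5, 1e6)` near 0.0025, in line with the closed-form condition for i(s*) = 0.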

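In practice, the various forms of anti-entropy and rumor mongering are often combined, so the Gossip protocol is very flexible and has no fully unified standard. Next, let us look at how Redis Cluster implements it.

Redis Cluster's Gossip Scheme

Redis Cluster was added as a feature in version 3.0, so we will use the 3.0 source code for a brief walkthrough. The figure below is a schematic of a master-slave Redis Cluster; the dashed lines represent Gossip communication between nodes.

Message types

Gossip is a loose protocol that places no particular constraints on the data exchange format, so each framework is free to define its own implementation. Redis Cluster defines the following 9 message types; see the comments for details (the annotations, translated here from Chinese, are not mine but come from the redis-3.0-annotated project, to which I pay my respects).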

/* Note that the PING, PONG and MEET messages are actually the same exact
 * kind of packet. PONG is the reply to ping, in the exact format as a PING,
 * while MEET is a special PING that forces the receiver to add the sender
 * as a node (if it is not already in the list). */
#define CLUSTERMSG_TYPE_PING 0          /* Ping */
#define CLUSTERMSG_TYPE_PONG 1          /* Pong (reply to Ping) */
// Ask the receiver to add a node to the cluster
#define CLUSTERMSG_TYPE_MEET 2          /* Meet "let's join" message */
// Mark a node as FAIL
#define CLUSTERMSG_TYPE_FAIL 3          /* Mark node xxx as failing */
// Broadcast a message through the publish/subscribe feature
#define CLUSTERMSG_TYPE_PUBLISH 4       /* Pub/Sub Publish propagation */
// Request a failover; asks the receiver to vote for the sender
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* May I failover? */
// The receiver agrees to vote for the sender
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6     /* Yes, you have my vote */
// The slot layout has changed; the sender asks the receiver to update accordingly
#define CLUSTERMSG_TYPE_UPDATE 7        /* Another node slots configuration */
// Pause clients in order to perform a manual failover
#define CLUSTERMSG_TYPE_MFSTART 8       /* Pause clients for manual failover */

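As can be seen, besides exchanging information, Redis Gossip also handles nodes going online and offline as well as failover.

Message format

A Redis Gossip message consists of a header and a body. There are four kinds of body; MEET, PING and PONG messages all use the clusterMsgDataGossip structure.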

typedef struct {
    // Node name.
    // Initially the name is random; once the MEET message has been sent
    // and answered, the cluster assigns the node its official name.
    char nodename[REDIS_CLUSTER_NAMELEN];
    // Timestamp of the last PING sent to this node
    uint32_t ping_sent;
    // Timestamp of the last PONG received from this node
    uint32_t pong_received;
    // IP address of the node
    char ip[REDIS_IP_STR_LEN];    /* IP address last time it was seen */
    // Port of the node
    uint16_t port;  /* port last time it was seen */
    // Flags of the node
    uint16_t flags;
    // Padding for alignment, unused
    uint32_t notused; /* for 64 bit alignment */
} clusterMsgDataGossip;
 
typedef struct {
    // Name of the node that has gone offline
    char nodename[REDIS_CLUSTER_NAMELEN];
} clusterMsgDataFail;
 
typedef struct {
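    // Length of the channel name
    uint32_t channel_len;
    // Length of the message
    uint32_t message_len;
    // Payload, laid out as the channel name followed by the message:
    // bulk_data[0:channel_len-1] is the channel name,
    // bulk_data[channel_len:channel_len+message_len-1] is the message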
    unsigned char bulk_data[8]; /* defined as 8 just for alignment concerns. */
} clusterMsgDataPublish;
 
typedef struct {
    // Config epoch of the node
    uint64_t configEpoch; /* Config epoch of the specified instance. */
    // Name of the node
    char nodename[REDIS_CLUSTER_NAMELEN]; /* Name of the slots owner. */
    // Slot layout of the node
    unsigned char slots[REDIS_CLUSTER_SLOTS/8]; /* Slots bitmap. */
} clusterMsgDataUpdate;
 
union clusterMsgData {
    /* PING, MEET and PONG */
    struct {
        /* Array of N clusterMsgDataGossip structures */
        clusterMsgDataGossip gossip[1];
    } ping;
    /* FAIL */
    struct {
        clusterMsgDataFail about;
    } fail;
    /* PUBLISH */
    struct {
        clusterMsgDataPublish msg;
    } publish;
    /* UPDATE */
    struct {
        clusterMsgDataUpdate nodecfg;
    } update;
};
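
The `gossip[1]` member is declared with one element but is really the start of N entries, so the size of a PING-type message depends on how many entries it carries. The following sketch shows that arithmetic with simplified stand-in types. `demoMsg` is a hypothetical, heavily shortened header (the real clusterMsg has many more fields), and the two length constants are assumed values for REDIS_CLUSTER_NAMELEN and REDIS_IP_STR_LEN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NAMELEN 40      /* assumed value of REDIS_CLUSTER_NAMELEN */
#define DEMO_IP_STR_LEN 46   /* assumed value of REDIS_IP_STR_LEN */

/* Same field layout as clusterMsgDataGossip above. */
typedef struct {
    char nodename[DEMO_NAMELEN];
    uint32_t ping_sent;
    uint32_t pong_received;
    char ip[DEMO_IP_STR_LEN];
    uint16_t port;
    uint16_t flags;
    uint32_t notused;
} demoGossip;

/* Hypothetical shortened message: a tiny header followed by the body. */
typedef struct {
    uint32_t totlen;   /* total length of the message */
    uint16_t type;     /* message type */
    uint16_t count;    /* number of gossip entries that follow */
    union {
        struct { demoGossip gossip[1]; } ping;   /* really `count` entries */
    } data;
} demoMsg;

/* Bytes needed for a message that carries `count` gossip entries:
 * everything before the body, plus one entry per reported node. */
size_t demo_gossip_msg_len(size_t count) {
    assert(count <= UINT16_MAX);   /* must fit the 16-bit count field */
    return offsetof(demoMsg, data) + count * sizeof(demoGossip);
}
```

Only the bytes actually used are counted, which is why a sender can allocate one full-size buffer and still transmit a short message.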

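Scheduling Gossip communication

In redis.c there is a function named serverCron() that schedules the periodic tasks inside the Redis server. The cluster-related part of it is as follows.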

/* Run the Redis Cluster cron. */
// If the server is running in cluster mode, run the cluster cron
run_with_period(100) {
    if (server.cluster_enabled)     clusterCron();
}

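So when cluster mode is enabled, every node runs the cluster periodic task clusterCron() every 100 milliseconds. Several places in that function relate to Gossip; some excerpts follow. The comments are very clear, so I will not embarrass myself by adding more.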
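
How a guard such as run_with_period(100) can decide whether a task is due is sketched below. It assumes the cron function runs hz times per second and keeps a loop counter; `demo_server`, `demo_period_due` and `demo_count_runs` are hypothetical names, loosely modeled on that idea, not the actual Redis macro.

```c
#include <assert.h>

/* Hypothetical stand-ins for the server's cron frequency and loop counter. */
struct demo_server {
    int hz;           /* cron invocations per second */
    long cronloops;   /* number of cron invocations so far */
};

/* Nonzero when a task with the given period (ms) is due on this cron tick. */
int demo_period_due(const struct demo_server *srv, int period_ms) {
    assert(srv->hz > 0 && srv->hz <= 1000 && period_ms > 0);
    int tick_ms = 1000 / srv->hz;
    if (period_ms <= tick_ms) return 1;   /* due on every tick */
    return srv->cronloops % (period_ms / tick_ms) == 0;
}

/* Counts how often the task fires during `ticks` consecutive cron ticks. */
int demo_count_runs(int hz, int period_ms, int ticks) {
    struct demo_server srv = { hz, 0 };
    int runs = 0;
    for (int t = 0; t < ticks; t++, srv.cronloops++)
        if (demo_period_due(&srv, period_ms)) runs++;
    return runs;
}
```

With 10 ticks per second, a 100 ms task fires on every tick; with 100 ticks per second it fires on every tenth tick.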

A node joins the cluster

// Create connections for nodes that do not have one yet
if (node->link == NULL) {
    // .....
    /* Queue a PING in the new connection ASAP: this is crucial
     * to avoid false positives in failure detection.
     *
     * If the node is flagged as MEET, we send a MEET message instead
     * of a PING one, to force the receiver to add us in its node
     * table. */
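    // Send a PING to the newly connected node right away, so that it is
    // not wrongly judged to be offline.
    // If the node is flagged MEET, send a MEET message instead of a PING.
    old_ping_sent = node->ping_sent;
    clusterSendPing(link, node->flags & REDIS_NODE_MEET ?
            CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
    // If a PING was already outstanding before the reconnect, restore its
    // timestamp, which clusterSendPing() has just overwritten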
    if (old_ping_sent) {
        /* If there was an active ping before the link was
         * disconnected, we want to restore the ping time, otherwise
         * replaced by the clusterSendPing() call. */
        node->ping_sent = old_ping_sent;
    }
    /* We can clear the flag after the first packet is sent.
     *
     * If we'll never receive a PONG, we'll never send new packets
     * to this node. Instead after the PONG is received and we
     * are no longer in meet/handshake status, we want to send
     * normal PING packets. */
    node->flags &= ~REDIS_NODE_MEET;
    redisLog(REDIS_DEBUG,"Connecting with Node %.40s at %s:%d",
            node->name, node->ip, node->port+REDIS_CLUSTER_PORT_INCR);
}

Periodically sending PING messages to random nodes

/* Ping some random node 1 time every 10 iterations, so that we usually ping
 * one random node every second. */
// Once every 10 runs of clusterCron() (about once per second), send gossip to one random node
if (!(iteration % 10)) {
    int j;
    /* Check a few random nodes and ping the one with the oldest
     * pong_received time. */
    // Sample 5 random nodes and pick one of them
    for (j = 0; j < 5; j++) {
        // Pick a random node from the cluster
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);
        /* Don't ping nodes disconnected or with a ping currently active. */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (REDIS_NODE_MYSELF|REDIS_NODE_HANDSHAKE))
            continue;
        // Among the 5 samples, keep the node whose last PONG is the oldest
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }
    // Send a PING to the node whose last PONG is the oldest
    if (min_pong_node) {
        redisLog(REDIS_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}
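
The selection rule in this excerpt (sample a few nodes, skip the ineligible ones, keep the one with the oldest PONG) can be isolated as a small pure function. In the sketch below the random sampling is left to the caller so that the rule itself is deterministic. `demo_node` and `demo_pick_ping_target` are hypothetical names, and the `skip` field stands in for the MYSELF/HANDSHAKE flag check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct demo_node {
    int connected;           /* stands in for link != NULL */
    int64_t ping_sent;       /* 0 when no PING is outstanding */
    int64_t pong_received;   /* time of the last PONG */
    int skip;                /* stands in for the MYSELF/HANDSHAKE flags */
};

/* Among the sampled indices, returns the index of the eligible node with
 * the oldest pong_received, or -1 if no sample is eligible. */
int demo_pick_ping_target(const struct demo_node *nodes,
                          const int *samples, size_t nsamples) {
    assert(nodes && samples);
    int best = -1;
    for (size_t j = 0; j < nsamples; j++) {
        const struct demo_node *n = &nodes[samples[j]];
        /* skip disconnected nodes, nodes with a PING in flight, and
         * nodes excluded by their flags */
        if (!n->connected || n->ping_sent != 0 || n->skip) continue;
        if (best < 0 || n->pong_received < nodes[best].pong_received)
            best = samples[j];
    }
    return best;
}
```

Sampling only a handful of nodes keeps each cron run cheap, while preferring the oldest PONG biases the choice toward the peers whose state is most likely to be stale.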

Guarding against false timeouts and stale state

/* If we are waiting for the PONG more than half the cluster
 * timeout, reconnect the link: maybe there is a connection
 * issue even if the node is alive. */
if (node->link && /* is connected */
    now - node->link->ctime >
    server.cluster_node_timeout && /* was not already reconnected */
    node->ping_sent && /* we already sent a ping */
    node->pong_received < node->ping_sent && /* still waiting pong */
    /* and we are waiting for the pong more than timeout/2 */
    now - node->ping_sent > server.cluster_node_timeout/2)
{
    /* Disconnect the link, it will be reconnected automatically. */
    // Free the link; the next clusterCron() run reconnects automatically
    freeClusterLink(node->link);
}
/* If we have currently no active ping in this instance, and the
 * received PONG is older than half the cluster timeout, send
 * a new ping now, to ensure all the nodes are pinged without
 * a too big delay. */
// (Random selection alone might never pick some of the nodes.)
if (node->link &&
    node->ping_sent == 0 &&
    (now - node->pong_received) > server.cluster_node_timeout/2)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}

Handling failover and marking nodes as possibly failing

/* If we are a master and one of the slaves requested a manual
 * failover, ping it continuously. */
if (server.cluster->mf_end &&
    nodeIsMaster(myself) &&
    server.cluster->mf_slave == node &&
    node->link)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}
/* Check only if we have an active ping for this instance. */
if (node->ping_sent == 0) continue;
/* Compute the delay of the PONG. Note that if we already received
 * the PONG, then node->ping_sent is zero, so can't reach this
 * code at all. */
delay = now - node->ping_sent;
// Waited longer than the limit for the PONG: flag the node as PFAIL (possibly failing)
if (delay > server.cluster_node_timeout) {
    /* Timeout reached. Set the node as possibly failing if it is
     * not already in this state. */
    if (!(node->flags & (REDIS_NODE_PFAIL|REDIS_NODE_FAIL))) {
        redisLog(REDIS_DEBUG,"*** NODE %.40s possibly failing",
            node->name);
        // Set the possibly-failing flag
        node->flags |= REDIS_NODE_PFAIL;
        update_state = 1;
    }
}

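As shown above, server.cluster_node_timeout is the yardstick for deciding whether a node's state is stale and whether the node is possibly failing, so it should be tuned to the actual network conditions and size of each cluster.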
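
To make the role of the timeout concrete, the checks from the excerpts above can be condensed into one decision function. This is a simplified sketch: it returns only the first matching action, whereas the real loop can apply several in one pass, and it leaves out the manual-failover case. `demo_peer`, `demo_action` and `demo_check_peer` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum demo_action { DEMO_NONE, DEMO_RECONNECT, DEMO_SEND_PING, DEMO_MARK_PFAIL };

struct demo_peer {
    int connected;           /* stands in for link != NULL */
    int64_t link_ctime;      /* when the link was created */
    int64_t ping_sent;       /* 0 when no PING is outstanding */
    int64_t pong_received;   /* time of the last PONG */
};

/* Decides what to do about one peer; all times are in milliseconds. */
enum demo_action demo_check_peer(const struct demo_peer *p,
                                 int64_t now, int64_t timeout) {
    assert(p && timeout > 0);
    /* PONG overdue by more than timeout/2 on an old link: reconnect */
    if (p->connected &&
        now - p->link_ctime > timeout &&
        p->ping_sent &&
        p->pong_received < p->ping_sent &&
        now - p->ping_sent > timeout / 2)
        return DEMO_RECONNECT;
    /* no PING in flight and the last PONG is older than timeout/2 */
    if (p->connected && p->ping_sent == 0 &&
        now - p->pong_received > timeout / 2)
        return DEMO_SEND_PING;
    /* PING outstanding for longer than the full timeout */
    if (p->ping_sent != 0 && now - p->ping_sent > timeout)
        return DEMO_MARK_PFAIL;
    return DEMO_NONE;
}
```

A small timeout detects failures sooner but raises the risk of flagging slow yet healthy nodes; a large one does the opposite.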

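Actually sending Gossip messages

Below is the source of clusterSendPing(), the function called several times above; it is not hard to follow.

/* Send a PING or PONG packet to the specified node, making sure to add enough
 * gossip informations. */
// Sends a MEET, PING or PONG message to the given node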
void clusterSendPing(clusterLink *link, int type) {
    unsigned char buf[sizeof(clusterMsg)];
    clusterMsg *hdr = (clusterMsg*) buf;
    int gossipcount = 0, totlen;
    /* freshnodes is the number of nodes we can still use to populate the
     * gossip section of the ping packet. Basically we start with the nodes
     * we have in memory minus two (ourself and the node we are sending the
     * message to). Every time we add a node we decrement the counter, so when
     * it will drop to <= zero we know there is no more gossip info we can
     * send. */
    // freshnodes starts at the size of the nodes table minus 2: myself (the
    // sender) and the node that receives the gossip
    int freshnodes = dictSize(server.cluster->nodes)-2;
 
    // If this is a PING, update the timestamp of the last PING sent
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();

    // Record this node's own information (name, address, port, slots
    // served, and so on) in the message header
    clusterBuildMessageHdr(hdr,type);

    /* Populate the gossip fields */
    // Randomly pick up to three of the nodes known to this node and
    // piggyback them on the message; this is what implements gossip

    while(freshnodes > 0 && gossipcount < 3) {
        // Pick a random node from the nodes dictionary (the candidate)
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);
 
        clusterMsgDataGossip *gossip;
        int j;
 
        /* In the gossip section don't include:
         * 1) Myself.
         * 2) Nodes in HANDSHAKE state.
         * 3) Nodes with the NOADDR flag set.
         * 4) Disconnected nodes if they don't have configured slots.
         */
        if (this == myself ||
            this->flags & (REDIS_NODE_HANDSHAKE|REDIS_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0))
        {
                freshnodes--; /* otherwise we may loop forever. */
                continue;
        }
 
        /* Check if we already added this node */
        // Skip the candidate if it is already in hdr->data.ping.gossip,
        // otherwise it would be reported twice
        for (j = 0; j < gossipcount; j++) {
            if (memcmp(hdr->data.ping.gossip[j].nodename,this->name,
                    REDIS_CLUSTER_NAMELEN) == 0) break;
        }
        if (j != gossipcount) continue;
 
        /* Add it */
 
        // The candidate is valid: decrement the freshnodes counter
        freshnodes--;

        // Point at the next free gossip entry
        gossip = &(hdr->data.ping.gossip[gossipcount]);

        // Copy the candidate's name, last PING sent time, last PONG
        // received time, IP, port and flags into the entry
        memcpy(gossip->nodename,this->name,REDIS_CLUSTER_NAMELEN);
        gossip->ping_sent = htonl(this->ping_sent);
        gossip->pong_received = htonl(this->pong_received);
        memcpy(gossip->ip,this->ip,sizeof(this->ip));
        gossip->port = htons(this->port);
        gossip->flags = htons(this->flags);

        // One more entry filled in
        gossipcount++;
    }
 
    // Compute the message length
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);
    // Record the number of gossip entries carried by the message
    hdr->count = htons(gossipcount);
    // Record the message length
    hdr->totlen = htonl(totlen);

    // Send the message
    clusterSendMessage(link,buf,totlen);
}

The End

Author: zthinker

Source: https://zthinker.com/archives/%E6%BC%AB%E8%B0%88gossip%E5%8D%8F%E8%AE%AE%E4%B8%8E%E5%85%B6%E5%9C%A8rediscluster%E4%B8%AD%E7%9A%84%E5%AE%9E%E7%8E%B0
