MySQL無鎖化WAL系統那些事兒-魔扣目錄

MySQL無鎖化WAL系統那些事兒

概述

數據庫系統一般采用WAL(write ahead log)技術來實現原子性和持久性，MySQL也不例外。WAL中記錄事務的更新內容，通過WAL將隨機的臟頁寫入變成順序的日志刷盤，可極大提升數據庫寫入性能，因此，WAL的寫入能力決定了數據庫整體性能的上限，尤其是在高并發時。

在MYSQL 8以前，寫日志被保護在一把大鎖之下，本來并行事務日志寫入被人為串行化處理。雖簡化了邏輯，但也極大限制了整體的性能表現。8.0很大的一部分工作便是將日志系統并行化。

日志并行化

日志并行化的思路也很簡單：將寫日志拆分為兩個過程：

從內存log buffer中為日志預留空間

2. 將日志內容拷貝至1預留的空間

而在這兩個步驟中，只需要步驟1保證在多并發并發預留空間時的正確性即可，確保并發線程預留的日志空間不會交叉。一旦預留成功，步驟2各并發線程可互不干擾地執行拷貝至自己的預留空間即可，這天然可并發。

而在步驟1中也可以使用原子變量來代替代價較高鎖實行預留，在mysql 8實現中，其實就兩行代碼：

Log_handle log_buffer_reserve(log_t &log, size_t len) {    ...    const sn_t start_sn = log.sn.fetch_add(len);    const sn_t end_sn = start_sn + len;    ...}

可以看到，只需要一個原子變量log.sn記錄當前分配的位置信息，下次分配時更新該log.sn即可，非常簡潔優雅。

8.0中引入的并行日志系統雖然很美好，但是也會帶來一些小麻煩，我們下面會詳細描述其引入的日志空洞問題并闡述其解決方案。

Log Buffer空洞問題

Mysql 8.0中使用了無鎖預分配的方式可以使MTR并行地將WAL日志寫入到Log Buffer，提升性能。但這樣勢必會帶來Redo Log Buffer的空洞問題，如下：

上圖中，3個線程分別分配了對應的redo buffer，線程1和3已經完成了wal日志內容的拷貝，而線程2則還在拷貝中，此時寫入線程最多只能將thread-1的redo log寫入日志文件。為此，MySQL 8.0中引入了 Link_buf 。

Link_buf原理

Link_buf用于輔助表示其他數據結構的使用情況，在Link_buf中，如果一個索引位置index處存儲的是非0值n，則表示Link_buf輔助標記的那個數據結構，從index開始后面n個元素已被占用。

template <typename Position = uint64_t>class Link_buf { private:  ...  size_t m_capacity;  std::atomic<Distance> *m_links;  alignas(INNOBASE_CACHE_LINE_SIZE) std::atomic<Position> m_tail;};

Link_buf是一個定長數組，且保證數組的每個元素的更新是原子操作的。以環形的方式復用已經釋放的空間。

同時Link_buf內部維護了一個變量 m_tail 表示當前最大可達的LSN。

Innodb日志系統中為Log Buffer維護了兩個Link_buf類型的變量 recent_written 和 recent_closed 。示意圖如下：

上圖中，共有兩處日志空洞，起始的LSN為lsn1與lsn3，均有4個字節。而lsn2處的redo log已經寫入，共3個字節。在 recent_written 中，lsn1開始處的4個atomic均是0，lsn3同樣如此，而lsn2處開始的存儲的則是3，0，0表示從該位置起的3個字節已經成功寫入了redo日志。

接下來當lsn1處的空洞被填充后，Link_buf中該處對應的內容就會被設置，如下：

同理，當lsn3處的空洞也被填充后，狀態變成下面這樣：

Link_buf實現

初始化

bool log_sys_init(...){    ...    log_allocate_recent_written(log);    ...}?constexpr ulong INNODB_LOG_RECENT_WRITTEN_SIZE_DEFAULT = 1024 * 1024;ulong srv_log_recent_written_size = INNODB_LOG_RECENT_WRITTEN_SIZE_DEFAULT;?static void log_allocate_recent_written(log_t &log) {  // 默認值為1MB  log.recent_written = Link_buf<lsn_t>{srv_log_recent_written_size};}?// Link_buf構造template <typename Position>Link_buf<Position>::Link_buf(size_t capacity)    : m_capacity(capacity), m_tail(0){  ...  m_links = UT_NEW_ARRAY_NOKEY(std::atomic<Distance>, capacity);  for (size_t i = 0; i < capacity; ++i) {    m_links[i].store(0);  }}

從構造函數中可以看到，LinkBuf內核心成員是一維數組，數組的成員類型是原子類型的Distance（uint64_t），數組成員個數則由創建者決定，如Innodb中為recent_written創建的LinkBuf的數組成員個數為1MB，而為recent_closed創建的LinkBuf的數組成員個數為2MB。

同時，創建完成后會將數組的每個成員初始化為0。

mtr log拷貝完成

mtr在commit時會將其運行時產生的所有redo log拷貝至Innodb全局的redo log buffer，這借助了 mtr_write_log_t 對象來完成，且每次拷貝按照block為單位進行。需要說明的是：一個mtr中可能存在多個block來存儲mtr運行時產生的redo log，每個block拷貝完成后均觸發一次Link_buf的更新。

struct mtr_write_log_t {  bool operator()(const mtr_buf_t::block_t *block) {    ...    // 拷貝完成后觸發LinkBuf更新    log_buffer_write_completed(*log_sys, m_handle, start_lsn, end_lsn);  }}?void log_buffer_write_completed(log_t &log, const Log_handle &handle,                                lsn_t start_lsn, lsn_t end_lsn) {  ...  // 更新本次寫入的內容范圍對應的LinkBuf內特定的數組項值  log.recent_written.add_link(start_lsn, end_lsn);}?template <typename Position>inline size_t Link_buf<Position>::slot_index(Position position) const {  return position & (m_capacity - 1);}?template <typename Position>inline void Link_buf<Position>::add_link(Position from, Position to) {  // 定位本次寫入的內容范圍所在數組項index  // 算法是將起始lsn(@from)對數組容量取模，即from % capacity  const auto index = slot_index(from);  auto &slot = m_links[index];  slot.store(to - from);}

在這里會找到start_lsn對應的slot，并在該slot內設置值為end_lsn - start_lsn，記錄該位置處已寫入的內容數量。

log_advance_ready_for_write_lsn

Innodb將redo log buffer內容寫入日志文件時需要保證不能存在空洞，即在寫入前需要獲得當前最大的無空洞lsn。這同樣依賴LinkBuf。在后臺寫日志線程 log_writer 的 log_advance_ready_for_write_lsn 函數中完成。

void log_writer(log_t *log_ptr) {    ...    for (uint64_t step = 0;; ++step) {      (void)log_advance_ready_for_write_lsn(log);    }}?bool log_advance_ready_for_write_lsn(log_t &log) {  const lsn_t write_lsn = log.write_lsn.load();  const auto write_max_size = srv_log_write_max_size;?  auto stop_condition = [&](lsn_t prev_lsn, lsn_t next_lsn) {    return (next_lsn - write_lsn >= write_max_size);  };  const lsn_t previous_lsn = log_buffer_ready_for_write_lsn(log);?  if (log.recent_written.advance_tail_until(stop_condition)) {    const lsn_t previous_lsn = log_buffer_ready_for_write_lsn(log);    return (true);  } else {    return (false);  }}

這里的關鍵在于函數 Link_buf::advance_tail_until ，即推進Link_buf::m_tail。

bool Link_buf<Position>::next_position(Position position, Position &next) {  const auto index = slot_index(position);  auto &slot = m_links[index];  const auto distance = slot.load();  next = position + distance;  return distance == 0;}?bool Link_buf<Position>::advance_tail_until(Stop_condition stop_condition) {  auto position = m_tail.load();  while (true) {    Position next;    bool stop = next_position(position, next);    if (stop || stop_condition(position, next)) {      break;    }    /* 回收slot */    claim_position(position);    position = next;  }  if (position > m_tail.load()) {    m_tail.store(position);    return true;  } else {    return false;  }}

這里的原理也比較簡單，可以用下面的圖來表示：