Around noon today, a node in one of our production MySQL 8.0 MGR clusters lost contact with the group for a while. I had never run into this scenario before and found it interesting, so I took notes on the state at the time.
01
Background
First, some background on this environment: it is an MGR cluster made up of four MySQL 8.0.20 nodes (an odd number of nodes is recommended; this environment is a special case), running in multi-primary mode.
A business team reported errors when connecting to the MGR cluster.
Cause of the error: when the application wrote data into MGR, the write was rejected because the node had --super-read-only enabled.
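The original client-side error isn't reproduced here, but a write rejected by super_read_only generally looks like the following (t1 is a placeholder table name; the error text is MySQL's standard message for this case), and you can confirm the flags on the node with the SELECT:

insert into t1 values (1);
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

select @@global.read_only, @@global.super_read_only;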
A normally running multi-primary MGR does not set its members to --super-read-only, so I checked the group state with SQL and, sure enough, one node had dropped out.
On a healthy node:
11:55:29> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
As you can see, the healthy nodes could no longer communicate with the failed one; from their perspective the group was down to three members.
On the failed node:
select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21 | 5562        | ERROR        |             | 8.0.20         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
The failed node, in turn, could not see the three healthy members, and its own state was ERROR.
02
Troubleshooting
I started with the MySQL error log on the failed node. The relevant entries are below; for readability I have added a few annotations:
2022-03-30T08:35:12.689967+08:00 26 [Warning] [MY-010957] [Server] The replication timestamps have returned to normal values.
---------- These lines show the node could no longer reach the other members ----------
2022-03-30T11:23:59.451419+08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.236:5562 has become unreachable.'
2022-03-30T11:23:59.466959+08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.17.37:5562 has become unreachable.'
2022-03-30T11:23:59.467008+08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.129:5562 has become unreachable.'
2022-03-30T11:23:59.467020+08:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.xx.xx.184:5562 has become unreachable.'
---------- It cannot reach a majority, so all updates are blocked; the log notes that group_replication_force_members can be used to force a new group membership ----------
2022-03-30T11:23:59.467218+08:00 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
---------- These lines show the network recovered ----------
2022-03-30T11:24:02.141493+08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.236:5562 is reachable again.'
2022-03-30T11:24:03.219209+08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.17.37:5562 is reachable again.'
2022-03-30T11:24:03.245901+08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xxx.1.129:5562 is reachable again.'
2022-03-30T11:24:03.245941+08:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.xx.xx.184:5562 is reachable again.'
2022-03-30T11:24:03.245956+08:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
---------- But the group had already expelled it over the network failure, so its state changed to ERROR ----------
2022-03-30T11:24:06.907099+08:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
---------- Some transactions will be rolled back ----------
2022-03-30T11:24:06.922962+08:00 0 [Warning] [MY-011630] [Repl] Plugin group_replication reported: 'Due to a plugin error, some transactions were unable to be certified and will now rollback.'
---------- MySQL is automatically set to read-only mode ----------
2022-03-30T11:24:06.923035+08:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
---------- Sessions waiting on conflict detection fail, and the before_commit hook fails ----------
2022-03-30T11:24:06.923078+08:00 312718417 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 312718417'
2022-03-30T11:24:06.923081+08:00 397259854 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 397259854'
2022-03-30T11:24:06.923165+08:00 312718417 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2022-03-30T11:24:06.923163+08:00 397259854 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
---------- In the ERROR state the node cannot run transactions; the log suggests fixing the error or restarting the plugin ----------
2022-03-30T11:24:08.587817+08:00 397259854 [ERROR] [MY-011601] [Repl] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is on ERROR state. Check for errors and restart the plugin'
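From the timestamps, the blip lasted only a few seconds (11:23:59 to 11:24:03), yet that was enough for the rest of the group to expel the member. If short network blips like this are common in your environment, one knob worth knowing is group_replication_member_expel_timeout, which grants a suspect member extra time before expulsion; on 8.0.20 it defaults to 0, meaning the member is expelled as soon as the 5-second suspicion period ends. A hedged sketch, with an illustrative value rather than a recommendation:

-- allow a suspect member 30 extra seconds to reconnect before expelling it
set global group_replication_member_expel_timeout = 30;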
Fine. Following the hint in the log, let's restart Group Replication on that node:
12:00:38 >stop group_replication;
Query OK, 0 rows affected (1.01 sec)

12:00:46 >start group_replication;
Query OK, 0 rows affected (3.36 sec)

12:00:52 >select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21  | 5562        | RECOVERING   | PRIMARY     | 8.0.20         |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
4 rows in set (0.00 sec)

select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 18142f04-95fe-11ec-aff9-246e96be2d00 | 10.xx.xx.184 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 32a10c3f-ce69-11eb-bd13-fa163e311dfa | 10.xxx.1.236 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | 63a8b9e3-95f9-11ec-bfc3-fa163e06c313 | 10.xxx.1.129 | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
| group_replication_applier | ca12c7a7-95fd-11ec-9979-6c92bf982df2 | 10.xx.xx.21  | 5562        | ONLINE       | PRIMARY     | 8.0.20         |
+---------------------------+--------------------------------------+--------------+-------------+--------------+-------------+----------------+
4 rows in set (0.00 sec)
After the restart, the node's state went from ERROR to RECOVERING, meaning it was catching up with the group. A moment later, every member was ONLINE again.
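Note that on 8.0.20 an expelled member simply stays in ERROR with super_read_only on until someone restarts the plugin by hand, which is exactly what we just did. If you would rather have it recover on its own, group_replication_autorejoin_tries (0 by default on this version; newer releases default to 3) makes the member attempt to rejoin automatically after expulsion. A sketch with an illustrative value:

-- after being expelled, try to rejoin the group up to 3 times automatically
set global group_replication_autorejoin_tries = 3;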
03
Summary
MGR's self-healing is quite robust, and its majority-based consensus protocol keeps transactions strongly consistent across the group.
In day-to-day operations, the thorniest MGR problem is usually certification conflicts, which show up as errors like:
ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.
For certification conflicts, the practical workaround today is to retry the transaction.
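A minimal illustration of the retry, assuming a hypothetical table t being updated concurrently on two primaries; the transaction that loses certification is rolled back as a whole, so it is safe to simply run it again:

start transaction;
update t set v = v + 1 where id = 1;
commit;
ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.
-- nothing from the failed transaction was persisted; re-issue it as a whole
start transaction;
update t set v = v + 1 where id = 1;
commit;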
If you are a heavy MGR user, a few suggestions:
1. Use the MySQL 8.0 Community edition of MGR, preferably 8.0.22 or later.
2. Prefer single-primary mode, which produces fewer conflicts (see the sketch after this list). If you want the multi-primary experience but hit problems you cannot solve, a reasonable compromise is to stay in multi-primary mode while concentrating all writes on one node, still benefiting from the automatic failover.
3. Consider GreatSQL as a replacement for the community edition of MGR; it fixes many bugs in the official MGR. Project link: https://gitee.com/GreatSQL/GreatSQL-Doc
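On suggestion 2: since 8.0.13 a running group can be switched to single-primary mode online with a built-in function. A sketch; the UUID is the 10.xx.xx.184 member from the tables above, used here purely as an example of nominating the new primary:

-- switch the group to single-primary mode, electing the given member as primary
select group_replication_switch_to_single_primary_mode('18142f04-95fe-11ec-aff9-246e96be2d00');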
That's all for today.