WeChat Official Account: DBA随笔. Things have been busy the last couple of days and I haven't posted; today I'm taking some time to write up a production case.
In Redis, the commonly encountered buffers fall into two families:
1. The input/output buffers of the client-server architecture
2. The buffers of the master-slave replication architecture
Specifically:
the client-server buffers are the client input buffer and the client output buffer;
the replication buffers are the replication buffer and the replication backlog buffer.
For more background, see the earlier article:
Redis Memory Buffers
In today's post, we analyze a production case.
Symptom:
During master-slave replication, the master-slave relationship could not be established for a long time, and the master kept running bgsave over and over.
Log messages:
Master error log (note the "scheduled to be closed" and "Connection with slave ... lost" lines):
61:C 28 Sep 13:14:56.494 * DB saved on disk
61:C 28 Sep 13:14:57.309 * RDB: 7319 MB of memory used by copy-on-write
24:M 28 Sep 13:15:00.028 * Background saving terminated with success
24:M 28 Sep 13:15:00.028 * Starting BGSAVE for SYNC with target: disk
24:M 28 Sep 13:15:00.513 * Background saving started by pid 62
24:M 28 Sep 13:18:09.322 # Client id=891379 addr=10.xx.xx.150:51523 fd=70 name= age=330 idle=330 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=15003 oll=70225 omem=1073742241 events=r cmd=psync # scheduled to be closed ASAP for overcoming of output buffer limits.
24:M 28 Sep 13:18:09.329 # Connection with slave 10.xx.xx.150:33048 lost.
24:M 28 Sep 13:19:10.693 * Slave 10.13.16.150:33048 asks for synchronization
24:M 28 Sep 13:19:10.698 * Full resync requested by slave 10.13.16.150:33048
24:M 28 Sep 13:19:10.698 * Can't attach the slave to the current BGSAVE. Waiting for next BGSAVE for SYNC
As the log shows, the bgsave was interrupted and the connection to the slave was dropped. The cause is spelled out in the log itself: the slave connection (flags=S, omem=1073742241, i.e. just over 1 GB) was "scheduled to be closed ASAP for overcoming of output buffer limits" — it had exceeded the client output buffer limit.
Slave error log:
Note the three key lines in the sequence below: full resync --- connection lost --- full resync again.
26:S 28 Sep 11:44:09.485 * MASTER <-> SLAVE sync started
26:S 28 Sep 11:44:09.486 * Non blocking connect for SYNC fired the event.
26:S 28 Sep 11:44:09.488 * Master replied to PING, replication can continue...
26:S 28 Sep 11:44:09.490 * Partial resynchronization not possible (no cached master)
26:S 28 Sep 11:44:09.492 * Full resync from master: dce8c7c4b46d4ee1c581cbcc157fdd38c6f9e199:48879999224307
26:S 28 Sep 11:50:43.097 * MASTER <-> SLAVE sync: receiving 9830853612 bytes from master
26:S 28 Sep 11:52:40.550 * MASTER <-> SLAVE sync: Flushing old data
26:S 28 Sep 11:52:40.550 * MASTER <-> SLAVE sync: Loading DB in memory
26:S 28 Sep 11:54:30.700 * MASTER <-> SLAVE sync: Finished with success
26:S 28 Sep 11:54:31.084 * Background append only file rewriting started by pid 30
26:S 28 Sep 11:54:31.263 # Connection with master lost.
26:S 28 Sep 11:54:31.263 * Caching the disconnected master state.
26:S 28 Sep 11:54:31.659 * Connecting to MASTER 10.xx.xx.65:33048
26:S 28 Sep 11:54:31.659 * MASTER <-> SLAVE sync started
26:S 28 Sep 11:54:31.660 * Non blocking connect for SYNC fired the event.
26:S 28 Sep 11:54:31.661 * Master replied to PING, replication can continue...
26:S 28 Sep 11:54:31.663 * Trying a partial resynchronization (request dce8c7c4b46d4ee1c581cbcc157fdd38c6f9e199:48880001572670).
26:S 28 Sep 11:54:32.182 * Full resync from master: dce8c7c4b46d4ee1c581cbcc157fdd38c6f9e199:48882189455771
26:S 28 Sep 11:54:32.182 * Discarding previously cached master state.
26:S 28 Sep 11:55:43.241 * AOF rewrite child asks to stop sending diffs.
Analysis:
The buffers involved during a full resync work as follows.
If, during the full resync, the slave is slow to receive and load the RDB file while the master simultaneously takes in a large volume of write commands, those commands pile up in the replication buffer until it eventually overflows. The replication buffer on the master is, in essence, just the output buffer of the client connection used by the slave. Once it overflows, the master immediately closes the replication connection to that slave, and the full resync fails.
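To make the failure mode concrete, here is a rough back-of-the-envelope sketch in Python. The write rate, average command size, and sync duration below are hypothetical assumptions for illustration, not numbers taken from this incident:

```python
# Rough model of how the master's replication (output) buffer grows while
# the slave is still busy receiving and loading the RDB file.
# All inputs here are hypothetical, for illustration only.

def buffer_after_sync(write_qps: int, avg_cmd_bytes: int, sync_seconds: int) -> int:
    """Bytes accumulated in the replication buffer over the sync window."""
    return write_qps * avg_cmd_bytes * sync_seconds

# Assume a 10-minute full sync while the master absorbs 5,000 writes/s
# of ~1 KB commands:
accumulated = buffer_after_sync(write_qps=5000, avg_cmd_bytes=1024, sync_seconds=600)
hard_limit = 256 * 1024 * 1024  # default slave hard limit: 256 MB

print(accumulated)               # 3072000000 -> about 3 GB buffered
print(accumulated > hard_limit)  # True -> the master would close the connection
```

Under these assumed numbers the buffer blows through the default 256 MB hard limit long before the sync finishes, which matches the disconnect seen in the master log above.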
How to fix it?
In Redis, this problem is addressed through the client-output-buffer-limit configuration. As its name suggests, it sets limits on a client's output buffer.
This parameter configures output buffer limits for three classes of clients — normal clients, slave (replica) clients, and pub/sub clients. The corresponding default settings are:
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 8mb 2mb 60
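To sanity-check a setting, the triple can be decoded into bytes with a small helper. This is an illustrative sketch, not Redis's actual config parser; note that CONFIG SET takes plain byte counts, while redis.conf accepts unit suffixes such as mb:

```python
# Minimal parser for client-output-buffer-limit triples (illustrative only).
# Interprets kb/mb/gb as powers of 1024, as redis.conf does.

UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}

def parse_size(value: str) -> int:
    """Convert '256mb' / '64mb' / '0' style strings to a byte count."""
    value = value.lower()
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return int(value[: -len(suffix)]) * factor
    return int(value)

def parse_limit(triple: str):
    """Return (hard_limit_bytes, soft_limit_bytes, soft_seconds)."""
    hard, soft, seconds = triple.split()
    return parse_size(hard), parse_size(soft), int(seconds)

print(parse_limit("256mb 64mb 60"))  # (268435456, 67108864, 60)
```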
Where:
normal refers to ordinary clients. The first 0 is the hard limit on the buffer size; the second 0 and third 0 are the soft limit on the buffer size and the soft-limit duration in seconds. For normal clients, all three are usually left at 0, meaning no limit, because unless a client is reading an exceptionally large bigkey, the server-side output buffer for a normal client rarely backs up.
slave means the entry applies to the replication buffer. 256mb caps the buffer at 256 MB (the hard limit); 64mb and 60 mean that if the buffer stays above 64 MB for 60 consecutive seconds, the buffer is treated as overflowed, the replication connection is closed, and the full resync fails.
pubsub refers to subscriber clients. Whenever a subscribed Redis channel has messages, the server pushes them to the client through the output buffer, so high-traffic channels can consume a lot of output buffer space. The three values carry the same meaning as in the slave entry, so we won't repeat them.
With these semantics in hand, we can use the current write rate and the size of the written keys to roughly estimate a sensible value for this parameter. In general, raising the slave limit into the GB range resolves this class of problem; if your Redis holds a lot of data and takes frequent writes, scale it up further.
A simple example: suppose each command writes 1 KB of data and client-output-buffer-limit is set to slave 128m 64m 60. The hard cap means at most 128 MB of buffer can be used, so 128 MB / 1 KB / 60 s ≈ 2133 commands per second — the sustainable write QPS during a long full resync tops out at roughly 2000.
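The arithmetic above can be captured in a small helper (a sketch; the decimal units mirror the rough math in the text):

```python
# Back-of-the-envelope QPS ceiling implied by a given hard limit, using the
# same 128 MB / 1 KB / 60 s estimate as in the text (decimal units).

def max_write_qps(hard_limit_bytes: int, avg_cmd_bytes: int, window_seconds: int) -> int:
    """Highest sustained write rate before the buffer fills within the window."""
    return hard_limit_bytes // avg_cmd_bytes // window_seconds

print(max_write_qps(128_000_000, 1_000, 60))  # 2133
```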
Note also that the master pays the replication buffer cost once per slave. If multiple slaves start a full sync at the same time, the combined replication buffer footprint on the master can become very large and easily trigger an OOM, so the number of slaves needs to be kept in check to stop the replication buffers from consuming too much memory.
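Because the buffer is allocated per replica, the worst case scales linearly with the number of slaves. A trivial sketch (the replica count here is an illustrative assumption):

```python
# Each connected replica gets its own output buffer on the master, so the
# worst-case buffer memory scales linearly with the replica count.

def worst_case_buffer_bytes(replicas: int, hard_limit_bytes: int) -> int:
    """Upper bound on total replication buffer memory across all replicas."""
    return replicas * hard_limit_bytes

# With the ~5 GB hard limit applied later in this case, four replicas doing
# a full sync at once could pin up to ~20 GB of buffer memory:
print(worst_case_buffer_bytes(4, 5_073_741_824))  # 20294967296
```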
The actual fix:
Adjust the parameter:
redis> config set client-output-buffer-limit "normal 0 0 0 slave 5073741824 2073435456 120 pubsub 33554432 8388608 60"
OK
redis> config rewrite
OK
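For reference, `config rewrite` persists the running values back into redis.conf; with the numbers above, the persisted lines should look roughly like this (a sketch — the exact formatting Redis writes may differ):

```
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 5073741824 2073435456 120
client-output-buffer-limit pubsub 33554432 8388608 60
```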
After the change, the full resync succeeded. The logs:
Master log after the fix:
67:C 28 Sep 14:18:33.901 * DB saved on disk
67:C 28 Sep 14:18:34.422 * RDB: 6929 MB of memory used by copy-on-write
24:M 28 Sep 14:18:36.793 * Background saving terminated with success
24:M 28 Sep 14:20:02.757 * Synchronization with slave 10.xx.16.150:33048 succeeded
24:M 28 Sep 14:25:41.242 * Slave 10.xxx.5.201:33048 asks for synchronization
24:M 28 Sep 14:25:41.242 * Partial resynchronization not accepted: Replication ID mismatch (Slave asked for 'ff7e489603b8e643f01bda73aceb7d0fa818b955', my replication IDs are 'dce8c7c4b46d4ee1c581cbcc157fdd38c6f9e199' and '0000000000000000000000000000000000000000')
24:M 28 Sep 14:25:41.242 * Starting BGSAVE for SYNC with target: disk
24:M 28 Sep 14:25:41.732 * Background saving started by pid 68
68:C 28 Sep 14:32:30.411 * DB saved on disk
68:C 28 Sep 14:32:31.107 * RDB: 78 MB of memory used by copy-on-write
24:M 28 Sep 14:32:31.781 * Background saving terminated with success
24:M 28 Sep 14:33:55.890 * Synchronization with slave 10.xxx.5.201:33048 succeeded
Slave log after the fix:
26:S 28 Sep 14:10:31.826 * MASTER <-> SLAVE sync started
26:S 28 Sep 14:10:31.827 * Non blocking connect for SYNC fired the event.
26:S 28 Sep 14:10:31.828 * Master replied to PING, replication can continue...
26:S 28 Sep 14:10:31.831 * Trying a partial resynchronization (request c09b870733f77164a0ee756abb9be060bb1783bc:1).
26:S 28 Sep 14:10:32.149 # CONFIG REWRITE executed with success.
26:S 28 Sep 14:10:32.244 * Full resync from master: dce8c7c4b46d4ee1c581cbcc157fdd38c6f9e199:48922100764682
26:S 28 Sep 14:10:32.244 * Discarding previously cached master state.
26:S 28 Sep 14:18:36.810 * MASTER <-> SLAVE sync: receiving 9804962515 bytes from master
26:S 28 Sep 14:20:02.862 * MASTER <-> SLAVE sync: Flushing old data
26:S 28 Sep 14:20:02.862 * MASTER <-> SLAVE sync: Loading DB in memory
26:S 28 Sep 14:21:52.452 * MASTER <-> SLAVE sync: Finished with success