今天在线上发现一个问题,挺有意思的,这里简单记录下。 问题的背景是查看一个Grafana监控的时候,发现监控没有数据了:
从图中可以看到4月1日之后,监控就没有输出数据了。 于是决定排查一下这个问题。排查思路如下:
1、监控数据源配置是否准确;2、监控数据是否采集完整;3、监控数据所在数据库是否可以访问;经过查看,监控数据从4月开始就缺失了,由于监控数据采集程序的日志不够全面,所以花了很长时间才定位到根本原因:监控数据写入数据库的时候,报错了。
其实报错信息也比较简单,如下:
(1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'")
这个明眼人一看,就是主键冲突了,冲突的id值是4294967295。那就开始排查主键冲突的问题吧。
首先查看表结构,看着似乎没什么异样,如下:
show create table mongo_statistics.mongodb_cluster_changelogG *************************** 1. row *************************** Table: mongodb_cluster_changelog Create Table: CREATE TABLE `mongodb_cluster_changelog` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `port` int(10) unsigned NOT NULL DEFAULT '0', `ts` datetime NOT NULL DEFAULT '1970-01-01 00:00:00' , `what` varchar(40) NOT NULL DEFAULT '' , `namespace` varchar(100) NOT NULL DEFAULT '' , `database` varchar(100) NOT NULL DEFAULT '' , `collection` varchar(100) NOT NULL DEFAULT '' , `clonedBytes` int(11) NOT NULL DEFAULT '0' , `number` int(11) NOT NULL DEFAULT '0' , `ctime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, PRIMARY KEY (`id`), UNIQUE KEY `ux_port_ts_what_namespace` (`port`,`ts`,`what`,`namespace`) ) ENGINE=InnoDB AUTO_INCREMENT=4294967295 DEFAULT CHARSET=utf8 [19:24:17]1 row in set (0.00 sec)
这里找到上面的报出冲突的自增id值,发现是4294967295 ,然后使用select语法来查看这个id对应的数据是否存在:
+------------+-------+---------------------+-------------+-----------------+----------+------------+-------------+--------+---------------------+ | id | port | ts | what | namespace | database | collection | clonedBytes | number | ctime | +------------+-------+---------------------+-------------+-----------------+----------+------------+-------------+--------+---------------------+ | 4294967295 | 30016 | 2021-07-08 04:03:00 | multi-split | coredata.newimg | coredata | newimg | 0 | 3 | 2021-07-08 19:08:42 | +------------+-------+---------------------+-------------+-----------------+----------+------------+-------------+--------+---------------------+
发现这条数据也在表里面。 于此同时,发现这个表结构里面的auto_increment值也是4294967295
这就比较奇怪了,自增id一般都会是当前表中id最大值的下一个id值,例如表中最大id是10,自增id的值应该是11. 为什么这个自增id和表里面的最大id值一样?(可能有的小伙伴已经知道答案了,但是排查的时候,我确实没反应过来)
(1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'") (1062, u"Duplicate entry '4294967295' for key 'PRIMARY'")
似乎没解决问题。
既然id=4294967295的数据冲突,那我们把这个id加1,看看手工插入一条数据能不能行:mysql> insert into mongo_statistics.mongodb_cluster_changelog values (4294967296,30016,"2021-07-08 04:03:00","multi--plit","xxx","xxx","newimg",0,3,"2021-07-08 19:08:42"); ERROR 1062 (23000): Duplicate entry '4294967295' for key 'PRIMARY'
不得不说,这个输出的结果有很大的误导意义
注意,我们插入的数据是id=4294967296,但是报错还是显示id=4294967295,死活就跟这条数据过不去了!!!
(这里我以为我发现了什么惊天大bug,喊来两个同事,一起看看这个问题。)
再次删除id=4294967295的数据,插入id=4294967296的数据:mysql >insert into mongo_statistics.mongodb_cluster_changelog values (4294967296,31111,"2021-07-08 04:03:00","multi-split","xxx","xxx","newimg",0,3,"2021-07-08 19:08:42"); Query OK, 1 row affected, 1 warning (0.00 sec)
我去,没有了那个坏id的数据,后面的id终于也能插入了???等等,有个warning(同事提醒),看下:
mysql >show warnings; +---------+------+---------------------------------------------+ | Level | Code | Message | +---------+------+---------------------------------------------+ | Warning | 1264 | Out of range value for column 'id' at row 1 | +---------+------+---------------------------------------------+ 1 row in set (0.00 sec)
好吧,MySQL已经给出了答案了:id是int 类型的,这个自增值溢出了。
找个测试表试试
mysql> select * from t; +------------+------+ | id | val | +------------+------+ | 1 | 1 | | 4294967295 | 1 | +------------+------+ 2 rows in set (0.00 sec) mysql> show create table tG *************************** 1. row *************************** Table: t Create Table: CREATE TABLE `t` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `val` int(11) DEFAULT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB AUTO_INCREMENT=4294967295 DEFAULT CHARSET=utf8 1 row in set (0.00 sec) ### 插入报错 mysql> insert into t values (4294967296,1) ERROR 1062 (23000): Duplicate entry '4294967295' for key 'PRIMARY' ### 删除报错的id记录 mysql> delete from t where id=4294967295; Query OK, 1 row affected (0.01 sec) ### 插入成功,有警告 mysql> insert into t values (4294967296,1); Query OK, 1 row affected, 1 warning (0.00 sec) ### 报警提示超出范围 mysql> show warnings; +---------+------+---------------------------------------------+ | Level | Code | Message | +---------+------+---------------------------------------------+ | Warning | 1264 | Out of range value for column 'id' at row 1 | +---------+------+---------------------------------------------+ 1 row in set (0.00 sec) ### 插入的数据不是4294967296,而是截断为4294967295 mysql> select * from t; +------------+------+ | id | val | +------------+------+ | 1 | 1 | | 4294967295 | 1 | +------------+------+ 2 rows in set (0.01 sec)
到这里,这个问题的原因就算找到了,也能稳定复现这个问题。
使用alter table语句,将这个线上表的主键id列改为bigint类型的即可。 测试环境看一发:
mysql> select * from t; +------------+------+ | id | val | +------------+------+ | 1 | 1 | | 4294967295 | 1 | +------------+------+ 2 rows in set (0.00 sec) mysql> alter table t modify id bigint unsigned not null auto_increment; Query OK, 2 rows affected (0.01 sec) Records: 2 Duplicates: 0 Warnings: 0
从结果中不难看出来,这个操作不能online,会拷贝表里面所有的数据。 线上重要表的操作最好使用pt-osc工具或者gh-ost工具执行。
这里再给一个不同类型的id范围的表:
说实话,这些范围边缘数据,平时不注意的话,看到了可能也反应不过来,如果你够细心,对它们很敏感,可能这个问题一分钟就解决了,根本不需要排查。
最终,监控数据也回来了。
修复前:
修复后:
上述情况是在MySQL 5.5 版本上操作的,MySQL8.0中会不会有所改善。
mysql8.0> CREATE TABLE `t` ( -> `id` int(10) unsigned NOT NULL AUTO_INCREMENT, -> `val` int(11) DEFAULT NULL, -> PRIMARY KEY (`id`) -> ) ENGINE=InnoDB AUTO_INCREMENT=4294967295 DEFAULT CHARSET=utf8 -> ; Query OK, 0 rows affected, 3 warnings (0.34 sec) mysql8.0> show warnings; +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Level | Code | Message | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Warning | 1681 | Integer display width is deprecated and will be removed in a future release. | | Warning | 1681 | Integer display width is deprecated and will be removed in a future release. | | Warning | 3719 | 'utf8' is currently an alias for the character set UTF8MB3, but will be an alias for UTF8MB4 in a future release. Please consider using UTF8MB4 in order to be unambiguous. | +---------+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ 3 rows in set (0.01 sec) mysql8.0 > select *from t; +------------+------+ | id | val | +------------+------+ | 1 | 1 | | 4294967295 | 1 | +------------+------+ 2 rows in set (0.00 sec) ### 先插入一次id=4294967296的数据, ### 看看会不会像MySQL5.5一样报id=4294967295的主键冲突结果 ### 结果:直接报错 mysql8.0> insert into t values (4294967296,1); ERROR 1264 (22003): Out of range value for column 'id' at row 1 ### 再删掉这条id=4294967295数据 mysql8.0> delete from t where id=4294967295; Query OK, 1 row affected (0.03 sec) ### 再次插入id=4294967296的数据直接报错,不给插入机会 mysql8.0> insert into t values (4294967296,1); ERROR 1264 (22003): Out of range value for column 'id' at row 1 mysql8.0> select *from t; +----+------+ | id | val | +----+------+ | 1 | 1 | +----+------+ 1 row in set (0.00 sec)
看来MySQL8.0对于这种情况,支持的友好一点。