今天在线上遇到了一个MySQL字符比较的问题,感觉很有意思,专门研究了下,估计大家都没有遇到过,这里跟大家分享一下。
01
背景
背景介绍:
MySQL里面有一张表,根据where条件匹配查询某一条记录的时候,手误输入了一个空格,发现这一条数据仍然能查出来,我建了一个测试表,还原如下:
22:57:02> create table t00 (id int primary key,name varchar(10)); Query OK, 0 rows affected (0.01 sec) 22:57:11> insert into t00 values (1,'aaa'),(2,'bbb'); Query OK, 2 rows affected (0.00 sec) Records: 2 Duplicates: 0 Warnings: 0 22:57:22> select * from t00 where name='aaa'; +----+------+ | id | name | +----+------+ | 1 | aaa | +----+------+ 1 row in set (0.00 sec) 22:57:32> select * from t00 where name='aaa '; +----+------+ | id | name | +----+------+ | 1 | aaa | +----+------+ 1 row in set (0.00 sec)
插入(1,'aaa')这条记录,使用where='aaa'和'aaa '这两个条件去匹配,居然都能够查到这条记录。
一开始我怀疑是这个8.0.19版本MySQL实例配置有问题,换了一个5.5低版本的MySQL实例,再次测试,还是复现这个问题。看来不是版本上的问题,一定是某种配置的问题。
晚上回到家,又用了自己搭建的一个8.0.22版本的MySQL实例重新执行上面的命令,竟然惊奇的发现,不复现了。。。晕死。8.0.22版本测试的结果是:
23:35:30>>select * from t0; +------+------+ | id | name | +------+------+ | 1 | aaa | | 2 | bbb | +------+------+ 2 rows in set (0.01 sec) 23:35:34>>select * from t0 where name='aaa'; +------+------+ | id | name | +------+------+ | 1 | aaa | +------+------+ 1 row in set (0.00 sec) 23:35:46>>select * from t0 where name='aaa '; Empty set (0.00 sec)
02
分析思路
1、为什么'aaa'和'aaa '一样?
首先我用命令在MySQL上检测了一下这两个字符串在MySQL中是否一样:
### MySQL实例一 23:39:09> select 'aaa' = 'aaa '; +------------------+ | 'aaa' = 'aaa ' | +------------------+ | 1 | +------------------+ 1 row in set (0.00 sec) ### MySQL实例二 23:35:54>>select 'aaa' = 'aaa '; +------------------+ | 'aaa' = 'aaa ' | +------------------+ | 0 | +------------------+ 1 row in set (0.00 sec)
从上面的结果可以看出来,这两个实例上,关于字符的比较规则不一样。
到这里,可能部分同学就已经知道答案了。不过还是往下再看看。
2、比较规则哪里不一样?
我们可以用下面的命令,先看一下utf8相关的字符集下的比较规则,如下:
23:45:18> show collation like 'utf8%'; +----------------------------+---------+-----+---------+----------+---------+---------------+ | Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute | +----------------------------+---------+-----+---------+----------+---------+---------------+ | utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD | | utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD | | utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD | | utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD | | utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE | | utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE | | utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD | | utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | | Yes | 0 | NO PAD | | utf8_unicode_ci | utf8 | 192 | | Yes | 8 | PAD SPACE | ........ | utf8_vietnamese_ci | utf8 | 215 | | Yes | 8 | PAD SPACE | +----------------------------+---------+-----+---------+----------+---------+---------------+ 103 rows in set (0.00 sec)
在最后一列,我们可以看到一个pad属性,这个属性里面包含2个值,分别是no pad 和pad space。
3、尝试去官方文档中查找这俩属性的意思
果然,不出意外,找到了一些蛛丝马迹:
https://dev.mysql.com/doc/refman/8.0/en/char.html
上面这段话描述的意思大概是:
这里我们就可以根据实际使用的比较规则来查看对应的pad属性了:
先看实例一:
### MySQL实例一 00:01:31>show variables like '%colla%'; +-------------------------------+--------------------+ | Variable_name | Value | +-------------------------------+--------------------+ | collation_connection | utf8_general_ci | | collation_database | utf8mb4_0900_ai_ci | | collation_server | utf8mb4_0900_ai_ci | | default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci | +-------------------------------+--------------------+ 4 rows in set (0.01 sec) 00:01:45>select collation_name,character_set_name,pad_attribute from information_schema.collations where collation_name like 'utf8_gen eral_ci'; +-----------------+--------------------+---------------+ | collation_name | character_set_name | pad_attribute | +-----------------+--------------------+---------------+ | utf8_general_ci | utf8 | PAD SPACE | +-----------------+--------------------+---------------+ 1 row in set (0.00 sec)
再来看实例二:
### 实例二 mysql--root@localhost:(none) 23:53:52>>show variables like '%colla%'; +-------------------------------+--------------------+ | Variable_name | Value | +-------------------------------+--------------------+ | collation_connection | utf8mb4_0900_ai_ci | | collation_database | utf8mb4_0900_ai_ci | | collation_server | utf8mb4_0900_ai_ci | | default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci | +-------------------------------+--------------------+ 4 rows in set (0.00 sec) 00:03:47>>select collation_name,character_set_name,pad_attribute from information_schema.collations where collation_name like 'utf8mb4_0900_ai_ci'; +--------------------+--------------------+---------------+ | collation_name | character_set_name | pad_attribute | +--------------------+--------------------+---------------+ | utf8mb4_0900_ai_ci | utf8mb4 | NO PAD | +--------------------+--------------------+---------------+ 1 row in set (0.00 sec)
到这里,真相大白。
实例一的连接比较规则是utf8_general_ci,对应的填充规则是pad space属性,代表字符比较过程中,末尾空格不重要,所以加不加空格结果都是一样的;
实例二的连接比较规则是utf8mb4_0900_ai_ci,对应的填充规则是no pad属性,代表字符比较过程中,末尾空格重要,所以加不加空格结果不一样。
03
如何让字符匹配更严格?
1、修改连接的比较规则为utf8mb4_0900_ai_ci,当然,这个修改需要搭配默认字符集
这个方案比较容易理解,不赘述。
2、使用like模糊匹配进行比较
3、where条件之前,添加binary关键字
上述2、3两种方法可见下面的测试:
00:19:13>select * from t00; +----+------+ | id | name | +----+------+ | 1 | aaa | | 2 | bbb | +----+------+ 2 rows in set (0.00 sec) 00:19:18>select * from t00 where name='aaa'; +----+------+ | id | name | +----+------+ | 1 | aaa | +----+------+ 1 row in set (0.00 sec) 00:19:28>select * from t00 where name='aaa '; +----+------+ | id | name | +----+------+ | 1 | aaa | +----+------+ 1 row in set (0.00 sec) ### 下面两种方案,可以防止'aaa '匹配到'aaa' 00:19:31>select * from t00 where name like 'aaa '; Empty set (0.00 sec) 00:19:57>select * from t00 where binary name = 'aaa '; Empty set (0.00 sec)
今天文章就到这里吧。