How to Build MySQL-MHA

Summary

  • The process of building MySQL-MHA
  • This article is based on mysql-8.0.30 and mha4mysql-0.58-0.el7.centos

1. Node planning

mha-manager:    10.250.0.91      MHA manager, monitors and manages the cluster
mysql-master: 10.250.0.118 MySQL master server
mysql-slave1: 10.250.0.186 MySQL slave server
mysql-slave2: 10.250.0.102 MySQL slave server
  • You can map the IPs to hostnames in /etc/hosts on all four servers to simplify the configuration that follows
    vim /etc/hosts
    10.250.0.91 node1
    10.250.0.118 node2
    10.250.0.186 node3
    10.250.0.102 node4

2. Passwordless SSH login

  • First, create a key pair on the mha-manager host

    # Create a key pair (private and public key files); you can skip the passphrase by pressing Enter at each prompt
    [root@ip-10-250-0-91 .ssh]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:C8JdDpbsQP+NeLSPFvHz3BkLigak3fQusApbrPNL9tw root@ip-10-250-0-91.cn-northwest-1.compute.internal
    The key's randomart image is:
    +---[RSA 2048]----+
    | . |
    | . o . |
    | . B = |
    | . O X B |
    | + O S = . . |
    | . . * B = o + |
    | . = . B + o + |
    | .B + + . |
    | oo+.o E |
    +----[SHA256]-----+
  • This generates the key pair files id_rsa and id_rsa.pub in the .ssh directory

    [root@ip-10-250-0-91 .ssh]# ls
    authorized_keys id_rsa id_rsa.pub
  • Copy the public key to every server; you will be prompted for each server's root password

    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node1
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node2
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node3
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
  • A problem you may run into while copying the public key

    Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

    This happens when password authentication is disabled on the target server. Fix it as follows:

    # Set the root password on each of the 4 servers
    passwd root

    # Enable password login on each of the 4 servers
    vim /etc/ssh/sshd_config
    # Set PasswordAuthentication to yes
    PasswordAuthentication yes
    # Restart the sshd service
    systemctl restart sshd.service
  • Public key installed successfully

    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
    /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "id_rsa.pub"
    The authenticity of host 'node4 (10.250.0.102)' can't be established.
    ECDSA key fingerprint is SHA256:aXIII+S5nOVy6pqP1fuaW6fYFsVIN9TBFVP/Xaf8Pds.
    ECDSA key fingerprint is MD5:1f:07:2f:04:75:77:68:8d:f0:20:96:f1:0b:90:ac:61.
    Are you sure you want to continue connecting (yes/no)? yes
    /bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    root@node4's password:

    Number of key(s) added: 1

    Now try logging into the machine, with: "ssh 'root@node4'"
    and check to make sure that only the key(s) you wanted were added.
  • Repeat the key-pair creation on each of the other 3 servers in the same way; the process is identical and is not repeated here.

  • Test passwordless login: from any server you should be able to log in to any other server without entering a password

    ssh node3
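The manual check above can be scripted so that all nodes are verified in one pass (a sketch, assuming the node1..node4 names from the /etc/hosts mapping earlier):

```shell
# Probe passwordless SSH to each node; BatchMode=yes makes ssh fail immediately
# instead of falling back to an interactive password prompt.
check_ssh() {
    local host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true >/dev/null 2>&1; then
        echo "$host: OK"
    else
        echo "$host: FAILED"
    fi
}

for host in node1 node2 node3 node4; do
    check_ssh "$host"
done
```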

3. Download and install the MHA packages

# The node package must be installed on every node
wget https://github.com/yoshinorim/mha4mysql-node/releases/download/v0.58/mha4mysql-node-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-node-0.58-0.el7.centos.noarch.rpm

# Only the manager node needs the manager package
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
  • Installing the manager may fail with missing epel-provided Perl dependencies
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch::File)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Parallel::ForkManager)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch::Screen)

    The fix is to install epel-release first

    yum install epel-release -y

    On an AWS EC2 instance, install epel with the following command instead

    sudo amazon-linux-extras install epel -y

4. Setting up MySQL master-slave replication

  • Follow the instructions in "How to build MySQL single-node, master-slave, and dual-master setups" to set up replication; the steps are not repeated here.
  • A few points to note:

    1. server-id must be unique on every MySQL instance
    2. Every instance must enable the binlog and the relay log, with relay_log_purge=0
    3. Slaves should be read-only: read_only=1
    4. For cloud instances created from the same image, make sure the MySQL server UUIDs are not identical.
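These requirements map to my.cnf settings along the following lines (a sketch, not the author's exact files; the server-id value and log base names are examples):

```ini
# my.cnf fragment for a slave; give every instance a distinct server-id
[mysqld]
server-id                = 2
log_bin                  = mysql-bin
relay_log                = slave-relay-bin
relay_log_purge          = 0    # keep relay logs so MHA can apply them during recovery
read_only                = 1    # slaves only; omit (or set to 0) on the master
gtid_mode                = ON
enforce_gtid_consistency = ON
```

For instances cloned from a single image, the duplicated server UUID lives in auto.cnf in the data directory; deleting that file and restarting mysqld makes the server generate a new one.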

5. Slaves also need the replication account

  • Because a slave may be promoted to master, every slave also needs the replication account. One way is to run the following on each slave:

    Note:
    This approach is WRONG. It will prevent the slaves from syncing with the new master after a failover,
    because by then the two slaves' GTID transactions and binlog positions no longer match.
    The consequences of this mistake are shown below in "11. Testing MHA failover".

    The correct approach is to import the master's databases into both slaves; once replication is established, never write anything to the slaves directly.

    # Stop replication
    mysql> stop slave;
    # '10.250.%.%' restricts access to the local subnet; use '%' if you don't want to restrict by IP
    mysql> CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED BY 'vagrant';
    mysql> ALTER USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'vagrant';
    mysql> GRANT REPLICATION SLAVE ON *.* TO 'vagrant'@'10.250.%.%';
    mysql> FLUSH PRIVILEGES;
    # Resume replication
    mysql> start slave;

    The correct approach: import the master's databases into both slaves, then start replication.
    Note: once replication is established, never write anything to the slaves directly.

    # On each of the two slaves, dump the master's databases
    mysqldump -uroot --all-databases --triggers --routines --events -p -h node2 > all_databases.sql
    # Import the dump on the slave
    mysql -uroot -p < all_databases.sql

    Because GTID is enabled, the import will fail with the following error:

    @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED

    The fix is to run the following on the slave before importing:

    mysql> stop slave;
    mysql> reset slave all;
    mysql> reset master;

    Configure replication on the slaves

    # Check the master's status and record the File and Position values
    mysql> show master status;
    +------------------+----------+--------------+------------------+------------------------------------------+
    | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +------------------+----------+--------------+------------------+------------------------------------------+
    | mysql-bin.000012 | 197 | | | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
    +------------------+----------+--------------+------------------+------------------------------------------+
    1 row in set (0.00 sec)

    # On the slave, configure replication using the File and Position recorded from the master
    mysql> CHANGE MASTER TO MASTER_HOST='10.250.0.118',
    MASTER_PORT=3306,
    MASTER_USER='vagrant',
    MASTER_PASSWORD='vagrant',
    MASTER_LOG_FILE='mysql-bin.000012',
    MASTER_LOG_POS=197;

    # Start the slave's replication threads
    mysql> START SLAVE;
    # Check the slave's replication status
    mysql> SHOW SLAVE STATUS\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.118
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000012
    Read_Master_Log_Pos: 197
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 326
    Relay_Master_Log_File: mysql-bin.000012
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes

6. MHA needs a MySQL user with administrative privileges

  • If remote root login is enabled, you can use the root user
  • Alternatively, create a user with administrative privileges on the master; since replication is running, the slaves pick it up automatically
    mysql> CREATE USER 'mhaadmin'@'10.250.%.%' IDENTIFIED BY 'mhapass';
    mysql> ALTER USER 'mhaadmin'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'mhapass';
    mysql> GRANT ALL ON *.* TO 'mhaadmin'@'10.250.%.%';
    mysql> FLUSH PRIVILEGES;

7. Write the MHA configuration file on the mha-manager node

mkdir /etc/mha_master
vim /etc/mha_master/mha.cnf
[server default]
# account and password MHA uses to access MySQL
user=mhaadmin
password=mhapass
port=3306
# MHA working directory on the manager
manager_workdir=/etc/mha_master
# path of the manager log
manager_log=/etc/mha_master/manager.log
# MHA working directory on the remote nodes; every MySQL node must create this directory
remote_workdir=/mydata/mha_master
# SSH login user
ssh_user=root
# MySQL replication user and password
repl_user=vagrant
repl_password=vagrant
# how often to check the master's status, in seconds
ping_interval=1

# By default, MHA checks whether the master is healthy by connecting to it directly.
# You can also specify a script for a secondary check of the master's status; here
# we would use MHA's bundled masterha_secondary_check command.
# Do not list the master's IP here; list slave IPs instead. The check path is
# manager -(A)-> the IPs listed here -(B)-> master. A failover is triggered only when
# A succeeds and B fails; in every other case the master is considered healthy.
# The servers listed here must therefore stay up.
# secondary_check_script=/bin/masterha_secondary_check -s 10.250.0.186 -s 10.250.0.102

# script that moves the virtual IP to the new master during a failover
#master_ip_failover_script=/etc/mha_master/master_ip_failover_script

# script that sends an alert when a failover happens
#report_script=/etc/mha_master/report_script

# script that shuts down the failed host after a failure, mainly to prevent split-brain
#shutdown_script=/etc/mha_master/shutdown_script

# MySQL master/slave servers
[server1]
hostname=10.250.0.118
ssh_port=22
# whether this node may be a candidate for promotion to master
candidate_master=1
# exclude this node from master election
#no_master=1
[server2]
hostname=10.250.0.186
ssh_port=22
candidate_master=1
[server3]
hostname=10.250.0.102
ssh_port=22
candidate_master=1

Note: every MySQL node must create the directory /mydata/mha_master
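The secondary_check_script decision described in the config comments can be sketched as a small function (an illustrative model of the rule, not MHA's actual implementation): failover is triggered only when the manager can reach the check host (leg A) but the check host cannot reach the master (leg B).

```shell
# Model of the two-leg check: a = manager -> check host, b = check host -> master.
# Each argument is "ok" or "fail".
decide() {
    local a=$1 b=$2
    if [ "$a" = "ok" ] && [ "$b" = "fail" ]; then
        echo "failover"          # only this combination triggers a failover
    else
        echo "master-alive"      # any other combination: master is assumed healthy
    fi
}

decide ok fail    # network path to the master is down -> failover
decide fail ok    # the check host itself is unreachable -> no failover
```

This is why the hosts passed with -s must stay up: if leg A fails, MHA assumes the problem is on its own side and never declares the master dead.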

8. Check that passwordless SSH between the nodes is configured correctly

[root@ip-10-250-0-91 mha_master]# masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:13:08 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:13:08 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Starting SSH connection tests..
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [debug]
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [info] All SSH connection tests passed successfully.

Seeing "All SSH connection tests passed successfully." means SSH connectivity is working

9. Check that the managed MySQL replication cluster's connection settings are OK

[root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:16:35 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:16:35 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] MHA::MasterMonitor version 0.58.
Thu Sep 15 10:16:36 2022 - [info] GTID failover mode = 1
Thu Sep 15 10:16:36 2022 - [info] Dead Servers:
Thu Sep 15 10:16:36 2022 - [info] Alive Servers:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306)
Thu Sep 15 10:16:36 2022 - [info] Alive Slaves:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] Current Alive Master: 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Checking slave configurations..
Thu Sep 15 10:16:36 2022 - [info] Checking replication filtering settings..
Thu Sep 15 10:16:36 2022 - [info] binlog_do_db= , binlog_ignore_db=
Thu Sep 15 10:16:36 2022 - [info] Replication filtering check ok.
Thu Sep 15 10:16:36 2022 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Thu Sep 15 10:16:36 2022 - [info] Checking SSH publickey authentication settings on the current master..
Thu Sep 15 10:16:36 2022 - [info] HealthCheck: SSH to 10.250.0.118 is reachable.
Thu Sep 15 10:16:36 2022 - [info]
10.250.0.118(10.250.0.118:3306) (current master)
+--10.250.0.186(10.250.0.186:3306)
+--10.250.0.102(10.250.0.102:3306)

Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

Seeing "MySQL Replication Health is OK." means the check passed

10. Managing MHA

# First check whether mha.failover.complete exists in the MHA working directory; delete it if it does
[root@ip-10-250-0-91 mha_master]# rm -rf /etc/mha_master/mha.failover.complete
# Start the manager
[root@ip-10-250-0-91 mha_master]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf &> /etc/mha_master/manager.log &
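The two steps can be wrapped in a small start function so the stale failover marker is always cleaned up first (a sketch assuming the /etc/mha_master layout used above):

```shell
# Start masterha_manager for the given working directory, removing any leftover
# mha.failover.complete marker first (it would otherwise block the next failover
# within 8 hours unless --ignore_last_failover is used).
start_mha() {
    local workdir=$1
    rm -f "$workdir/mha.failover.complete"
    nohup masterha_manager --conf="$workdir/mha.cnf" \
        &> "$workdir/manager.log" &
    echo $!    # PID of the background manager process
}
```

Usage: `start_mha /etc/mha_master`.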

Check the master's status

# Check the master's status
[root@ip-10-250-0-91 mha_master]# masterha_check_status --conf=/etc/mha_master/mha.cnf
mha (pid:9696) is running(0:PING_OK), master:10.250.0.118

In the output above, "mha (pid:9696) is running(0:PING_OK)" means the MHA service is running normally; otherwise it shows something like "mha is stopped(1:NOT_RUNNING)."

Stopping MHA

# Stop the manager
[root@ip-10-250-0-91 mha_master]# masterha_stop --conf=/etc/mha_master/mha.cnf
Stopped mha successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/mha_master/mha.cnf &>/etc/mha_master/manager.log

11. Testing MHA failover

First, let's test with the WRONG approach from "5. Slaves also need the replication account"

  • Start MHA
  • Shut down the mysql-master database service
  • Watch the mha-manager log
    [root@ip-10-250-0-91 mha_master]# tail -f manager.log
    ………………………………
    Started automated(non-interactive) failover.
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.102(10.250.0.102:3306): ERROR: Failed on waiting gtid exec set on master.
    Master failover to 10.250.0.186(10.250.0.186:3306) done, but recovery on slave partially failed.

    MHA detected the master going down and elected a new master successfully, but failed to attach the remaining slave to it.

Let's analyze why this failed

  • Check the new master's status
    mysql> show master status\G
    *************************** 1. row ***************************
    File: mysql-bin.000013
    Position: 2466
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set: 6e9a571e-330e-11ed-a3f8-0a53e7cced42:5-9,
    6e9a571e-330e-11ed-a3f8-0a53e7cced43:1-4
  • The new master's read_only is now off, but only for the running process; a restart re-reads the configuration file, so after a failover be sure to update read_only in the new master's config file
    mysql> show variables like "%read_only%";
    +-----------------------+-------+
    | Variable_name | Value |
    +-----------------------+-------+
    | innodb_read_only | OFF |
    | read_only | OFF |
    | super_read_only | OFF |
    | transaction_read_only | OFF |
    +-----------------------+-------+
  • Check the slave's status to get the error message
    mysql> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.186
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000013
    Read_Master_Log_Pos: 2466
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 420
    Relay_Master_Log_File: mysql-bin.000013
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 1396
    Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  • As the message suggests, query the performance_schema.replication_applier_status_by_worker table on the slave
    mysql> select * from performance_schema.replication_applier_status_by_worker\G
    *************************** 1. row ***************************
    CHANNEL_NAME:
    WORKER_ID: 1
    THREAD_ID: NULL
    SERVICE_STATE: OFF
    LAST_ERROR_NUMBER: 1396
    LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465; Error 'Operation CREATE USER failed for 'vagrant'@'10.250.%.%'' on query. Default database: 'mysql'. Query: 'CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*04E6E1273D1783DF7D57DC5479FE01CFFDFD0058''
    …………

    The message is fairly clear: while syncing from the new master, the slave failed on the CREATE USER statement that creates the replication account.
    Recall how we built this cluster: each slave created the replication account locally in its own database, so the two slaves' GTID transactions and binlogs were never actually identical. After the failover, the surviving slave pulls the new master's binlog, which includes that CREATE USER statement; since the account already exists locally, the statement fails, and replication stops.
    The fix at this point is to do a full import of the master's databases into the slave and rebuild replication, as described below.
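The divergence is visible directly in Executed_Gtid_Set: transactions created locally on a slave appear under an extra source UUID (the ...cced43 set above) that the other nodes never executed. A quick way to flag this is to count the distinct UUIDs in the set (a diagnostic sketch; the sample values mirror the log output above):

```shell
# Count distinct source UUIDs in an Executed_Gtid_Set value. A slave that has only
# ever replayed its master's transactions shows one UUID; local writes add more.
count_gtid_uuids() {
    echo "$1" | tr ',' '\n' | cut -d: -f1 | tr -d ' ' | sort -u | grep -c .
}

clean="6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9"
diverged="6e9a571e-330e-11ed-a3f8-0a53e7cced42:5-9,6e9a571e-330e-11ed-a3f8-0a53e7cced43:1-4"

count_gtid_uuids "$clean"      # -> 1
count_gtid_uuids "$diverged"   # -> 2
```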

Now let's rebuild replication the CORRECT way per "5. Slaves also need the replication account" and test again

  • For the setup steps, see the correct approach in "5. Slaves also need the replication account"
  • Start MHA
  • Shut down the mysql-master database service
  • Watch the mha-manager log
    …………
    Started automated(non-interactive) failover.
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
    10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
    Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.

    The new master was promoted and replication was re-established successfully.

  • Check that the master and slave status are healthy
    # On the master
    mysql> show master status;
    +------------------+----------+--------------+------------------+------------------------------------------+
    | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +------------------+----------+--------------+------------------+------------------------------------------+
    | mysql-bin.000001 | 157 | | | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
    +------------------+----------+--------------+------------------+------------------------------------------+

    # On the slave
    mysql> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.186
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000001
    Read_Master_Log_Pos: 157
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 373
    Relay_Master_Log_File: mysql-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes

    MHA is now set up and tested successfully.

Key points about MHA:

1. Once MHA completes a master failover, the manager process exits; it must be restarted manually after the MySQL cluster has been repaired
2. A repaired (formerly failed) MySQL instance can rejoin the cluster as a slave; update its MySQL configuration file, and always re-import a full dump of the master's databases before it rejoins
3. When a node rejoins, check whether its IP has changed; if it has, update the MHA configuration file promptly
4. After a failover, read_only in the new master's configuration file is not changed by MHA; update it manually, or the server will reject writes after a restart
5. MHA only monitors the master; a failed slave triggers no MHA action. As long as the slave's IP is unchanged, simply repair and restart it.
6. A failover creates an mha.failover.complete file in the MHA working directory. To restart MHA within 8 hours of a failover, check the working directory for this file and delete it first, otherwise the next failover cannot complete; alternatively, start MHA with --ignore_last_failover

A remaining problem:

1. MHA monitors the master; when it detects that the master is unavailable, it elects a new master from the slaves, completing the failover.
2. However, after the switch the master's IP changes, so clients have to connect to a new address.
3. Is there a way to complete the switch without clients changing the address they connect to? Yes: a VIP (virtual IP).

MHA script extensions

  • When checking the MySQL cluster's connection settings with MHA, you will see output like this
    [root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
    ………………
    Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
    Thu Sep 15 10:16:36 2022 - [info] ok.
    Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
    Thu Sep 15 10:16:36 2022 - [info] ok.
    Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
    Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
    Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

    MySQL Replication Health is OK.

    There are two warnings: "master_ip_failover_script is not defined." and "shutdown_script is not defined."
    shutdown_script is the hook for shutting down the failed host after a failure, while master_ip_failover_script specifies what to do during a failover.

MHA actually supports several script extension points

secondary_check_script: checks the master's availability from multiple network routes
master_ip_failover_script: updates the master IP address used by applications
shutdown_script: forcibly shuts down the master
report_script: sends reports
init_conf_load_script: loads initial configuration parameters
master_ip_online_change_script: updates the master IP address; used for an online master switch, not for master failover
  • The MHA configuration file can specify master_ip_failover_script, whose value is a script file. As the name suggests, it runs when a failover happens; it is commonly used to re-bind the virtual IP.
  • MHA does not ship an implementation of this script; you write it yourself. Here is an example adapted from one found online: vim /etc/mha_master/master_ip_failover_script
    #!/bin/env perl
    use strict;
    use warnings FATAL => 'all';

    use Getopt::Long;

    my (
    $command, $orig_master_host, $orig_master_ip,$ssh_user,
    $orig_master_port, $new_master_host, $new_master_ip,$new_master_port,
    $orig_master_ssh_port,$new_master_ssh_port,$new_master_user,$new_master_password
    );

    # Note: this virtual IP must be in the same subnet as your cluster, or it will not work
    my $vip = '10.250.0.199';
    my $key = '1';
    # Change the interface name "eth0" to match your machine's NIC
    my $ssh_start_vip = "sudo /sbin/ifconfig eth0:$key $vip up";
    my $ssh_stop_vip = "sudo /sbin/ifconfig eth0:$key down";
    my $ssh_Bcast_arp= "sudo /sbin/arping -I eth0 -c 3 -A $vip";

    GetOptions(
    'command=s' => \$command,
    'ssh_user=s' => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s' => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'orig_master_ssh_port=i' => \$orig_master_ssh_port,
    'new_master_host=s' => \$new_master_host,
    'new_master_ip=s' => \$new_master_ip,
    'new_master_port=i' => \$new_master_port,
    'new_master_ssh_port=i' => \$new_master_ssh_port,
    'new_master_user=s' => \$new_master_user,
    'new_master_password=s' => \$new_master_password

    );

    exit &main();

    sub main {
    $ssh_user = defined $ssh_user ? $ssh_user : 'root';
    print "\n\n SCRIPT START \[$ssh_user|$ssh_start_vip\]";
    print "\n SCRIPT STOP \[$ssh_user|$ssh_stop_vip\]\n\n";


    if ( $command eq "stop" || $command eq "stopssh" ) {

    my $exit_code = 1;
    eval {
    print "Disabling the VIP on old master: $orig_master_host \n";
    &stop_vip();
    $exit_code = 0;
    };
    if ($@) {
    warn "Got Error: $@\n";
    exit $exit_code;
    }
    exit $exit_code;
    }
    elsif ( $command eq "start" ) {

    my $exit_code = 10;
    eval {
    print "Enabling the VIP - $vip on the new master - $new_master_host \n";
    &start_vip();
    &start_arp();
    $exit_code = 0;
    };
    if ($@) {
    warn $@;
    exit $exit_code;
    }
    exit $exit_code;
    }
    elsif ( $command eq "status" ) {
    print "Checking the Status of the script.. OK \n";
    exit 0;
    }
    else {
    &usage();
    exit 1;
    }
    }

    sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
    }
    sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
    }

    sub start_arp() {
    `ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
    }
    sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --ssh_user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
    }
  • chmod a+x /etc/mha_master/master_ip_failover_script
  • vim /etc/mha_master/mha.cnf and add the master_ip_failover_script setting
    …………
    # check interval
    ping_interval=1
    # script that moves the virtual IP to the new master after a master-slave switch
    master_ip_failover_script=/etc/mha_master/master_ip_failover_script
    # MySQL master/slave servers
    …………
  • The virtual IP is 10.250.0.199; bind it to mysql-master
    [root@ip-10-250-0-118 mha_master]# ifconfig eth0:1 10.250.0.199/24
    [root@ip-10-250-0-118 mha_master]# ifconfig
    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.118 netmask 255.255.255.0 broadcast 10.250.0.255
    inet6 fe80::803:41ff:fe19:356a prefixlen 64 scopeid 0x20<link>
    ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)
    RX packets 12260 bytes 2511078 (2.3 MiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 10202 bytes 5216046 (4.9 MiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.199 netmask 255.255.255.0 broadcast 10.250.0.255
    ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)

    lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
    inet 127.0.0.1 netmask 255.0.0.0
    inet6 ::1 prefixlen 128 scopeid 0x10<host>
    loop txqueuelen 1000 (Local Loopback)
    RX packets 43 bytes 9386 (9.1 KiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 43 bytes 9386 (9.1 KiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
  • Re-run the configuration check
    [root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
    ………………
    Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.186..
    Fri Sep 16 07:06:02 2022 - [info] ok.
    Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.102..
    Fri Sep 16 07:06:02 2022 - [info] ok.
    Fri Sep 16 07:06:02 2022 - [info] Checking master_ip_failover_script status:
    Fri Sep 16 07:06:02 2022 - [info] /etc/mha_master/master_ip_failover_script --command=status --ssh_user=root --orig_master_host=10.250.0.118 --orig_master_ip=10.250.0.118 --orig_master_port=3306


    SCRIPT START [root|sudo /sbin/ifconfig eth0:1 10.250.0.199 up]
    SCRIPT STOP [root|sudo /sbin/ifconfig eth0:1 down]

    Checking the Status of the script.. OK
    Fri Sep 16 07:06:02 2022 - [info] OK.
    Fri Sep 16 07:06:02 2022 - [warning] shutdown_script is not defined.
    Fri Sep 16 07:06:02 2022 - [info] Got exit code 0 (Not master dead).

    MySQL Replication Health is OK.

  • Restart MHA
  • Stop the mysql-master service
  • Check the MHA log
    ----- Failover Report -----

    mha: MySQL Master failover 10.250.0.118(10.250.0.118:3306) to 10.250.0.186(10.250.0.186:3306) succeeded

    Master 10.250.0.118(10.250.0.118:3306) is down!

    Check MHA Manager logs at ip-10-250-0-91.cn-northwest-1.compute.internal:/etc/mha_master/manager.log for details.

    Started automated(non-interactive) failover.
    Invalidated master IP address on 10.250.0.118(10.250.0.118:3306)
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.186(10.250.0.186:3306): OK: Activated master IP address.
    10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
    10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
    Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.
  • Now check the IPs on the new master
    [root@ip-10-250-0-186 mysql8]# ifconfig
    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.186 netmask 255.255.255.0 broadcast 10.250.0.255
    inet6 fe80::895:fcff:fe7f:ee0c prefixlen 64 scopeid 0x20<link>
    ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)
    RX packets 12284 bytes 3862460 (3.6 MiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 10558 bytes 2559446 (2.4 MiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.199 netmask 255.0.0.0 broadcast 10.255.255.255
    ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)

    lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
    inet 127.0.0.1 netmask 255.0.0.0
    inet6 ::1 prefixlen 128 scopeid 0x10<host>
    loop txqueuelen 1000 (Local Loopback)
    RX packets 41 bytes 9270 (9.0 KiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 41 bytes 9270 (9.0 KiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    Checking the old master shows it has also released the virtual IP binding; the VIP migrated successfully and the failover worked!

AWS does not support VIPs on EC2 instances; a different solution is needed there, which is beyond the scope of this article.

You can configure report_script in the MHA configuration file; it points to a script used to send alerts when a failover occurs.
It can send email, or call the WeCom or DingTalk APIs; the details are not covered here.
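As a starting point, a report_script can be as small as formatting the failover details and piping them to a mailer (a sketch; MHA passes failover details to the script as command-line options, and the exact option names should be checked against the MHA documentation):

```shell
# Format a one-line alert from a subject and a body string.
format_alert() {
    printf '[MHA alert] %s: %s\n' "$1" "$2"
}

format_alert "Master failover" "10.250.0.118 -> 10.250.0.186"
# In a real report_script you would pipe this to a mailer, e.g.:
# format_alert "$subject" "$body" | mail -s "MHA failover" ops@example.com
```

The ops@example.com address and the subject/body wiring are placeholders.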