Building a MySQL MHA Cluster

Abstract

  • How to build a MySQL MHA (Master High Availability) cluster
  • Based on mysql-8.0.30 and mha4mysql-0.58-0.el7.centos

1. Node Planning

mha-manager:  10.250.0.91   MHA manager, used for monitoring and management
mysql-master: 10.250.0.118  MySQL master server
mysql-slave1: 10.250.0.186  MySQL slave server
mysql-slave2: 10.250.0.102  MySQL slave server
  • You can map the IPs to hostnames in /etc/hosts on all four servers to simplify later configuration

vim /etc/hosts
10.250.0.91 node1
10.250.0.118 node2
10.250.0.186 node3
10.250.0.102 node4

2. Passwordless SSH Login

  • First, create a key pair on the mha-manager host

# Create a key pair (private and public key files); you can leave the passphrase empty by pressing Enter at every prompt
[root@ip-10-250-0-91 .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:C8JdDpbsQP+NeLSPFvHz3BkLigak3fQusApbrPNL9tw root@ip-10-250-0-91.cn-northwest-1.compute.internal
The key's randomart image is:
+---[RSA 2048]----+
| . |
| . o . |
| . B = |
| . O X B |
| + O S = . . |
| . . * B = o + |
| . = . B + o + |
| .B + + . |
| oo+.o E |
+----[SHA256]-----+
  • This generates the key pair files id_rsa and id_rsa.pub in the .ssh directory

[root@ip-10-250-0-91 .ssh]# ls
authorized_keys id_rsa id_rsa.pub
  • Copy the public key to every server; you will be prompted for each server's root password

[root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node1
[root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node2
[root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node3
[root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
  • A problem you may run into while copying the public key

Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

This happens because password authentication is disabled on the target server. Fix it as follows:

# Set the root password on all four servers
passwd root

# Enable password authentication on all four servers
vim /etc/ssh/sshd_config
# Set PasswordAuthentication to yes
PasswordAuthentication yes
# Restart the sshd service
systemctl restart sshd.service
  • Key copied successfully

[root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "id_rsa.pub"
The authenticity of host 'node4 (10.250.0.102)' can't be established.
ECDSA key fingerprint is SHA256:aXIII+S5nOVy6pqP1fuaW6fYFsVIN9TBFVP/Xaf8Pds.
ECDSA key fingerprint is MD5:1f:07:2f:04:75:77:68:8d:f0:20:96:f1:0b:90:ac:61.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@node4's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'root@node4'"
and check to make sure that only the key(s) you wanted were added.
  • Repeat the key-pair creation and distribution on the other three servers in the same way; the steps are identical and are not repeated here.

  • Test passwordless login: from any server, logging in to any other server should require no password

ssh node3

3. Downloading and Installing the MHA Packages

# The node package must be installed on every node
wget https://github.com/yoshinorim/mha4mysql-node/releases/download/v0.58/mha4mysql-node-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-node-0.58-0.el7.centos.noarch.rpm

# Only the manager node needs the manager package
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
  • Installing the manager package may fail with missing Perl dependencies that are provided by epel

Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
       Requires: perl(Log::Dispatch::File)
Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
       Requires: perl(Parallel::ForkManager)
Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
       Requires: perl(Log::Dispatch)
Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
       Requires: perl(Log::Dispatch::Screen)

The fix is to install epel-release first:

yum install epel-release -y

On an AWS EC2 instance, install epel with the following command instead:

sudo amazon-linux-extras install epel -y

4. Setting Up MySQL Master-Slave Replication

1. Every MySQL instance must have a unique server-id.
2. Every MySQL instance must enable the binlog and the relay log, with relay_log_purge=0.
3. Slave nodes must be read-only: read_only=1.
4. For cloud instances cloned from an image, make sure every MySQL instance has a unique server uuid.
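
As a sketch, the requirements above might translate into a my.cnf fragment like the following. The server-id values, log base names, and GTID settings here are illustrative and follow this article's cluster; adjust them to your own layout:

```ini
[mysqld]
# Must be unique on every node, e.g. 118 on the master, 186 and 102 on the slaves
server-id = 118

# Binlog and relay log; keep relay logs (relay_log_purge=0) so MHA can apply them during failover
log-bin = mysql-bin
relay-log = slave-relay-bin
relay_log_purge = 0

# The cluster in this article runs with GTID enabled
gtid_mode = ON
enforce_gtid_consistency = ON

# Slaves only: make the node read-only for ordinary users
# read_only = 1
```

For instances cloned from an image, the server uuid is stored in auto.cnf in the data directory; deleting that file and restarting MySQL makes the server generate a new, unique uuid.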

5. Create the Replication Account on the Slaves Too

  • Because a slave may be promoted to master, every slave also needs the replication account. Run the following on each slave:

Note:
This approach is wrong: it leaves the surviving slave unable to replicate from the new master after a failover,
because by then the two slaves' GTID sets and binlog positions have diverged.
The consequences are shown below in "11. Testing MHA Failover"; the correct approach follows the example.

# Stop replication
mysql> stop slave;
# '10.250.%.%' allows access only from the local subnet; use '%' to allow any host
mysql> CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED BY 'vagrant';
mysql> ALTER USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'vagrant';
mysql> GRANT REPLICATION SLAVE ON *.* TO 'vagrant'@'10.250.%.%';
mysql> FLUSH PRIVILEGES;
# Restart replication
mysql> start slave;

The correct approach is to import the master's databases into both slaves and only then start replication.
Note: once replication is established, never write anything directly to a slave.

# On each of the two slaves, dump the master's databases (node2 is the master)
mysqldump -uroot --all-databases --triggers --routines --events -p -h node2 > all_databases.sql
# Import the dump on the slave
mysql -uroot -p < all_databases.sql

Because GTID is enabled, the import fails with the following error:

@@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED

The fix is to run the following on the slave before importing:

mysql> stop slave;
mysql> reset slave all;
mysql> reset master;

Configure replication on the slaves

# On the master, check its status and record the File and Position values
mysql> show master status;
+------------------+----------+--------------+------------------+------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                        |
+------------------+----------+--------------+------------------+------------------------------------------+
| mysql-bin.000012 |      197 |              |                  | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
+------------------+----------+--------------+------------------+------------------------------------------+
1 row in set (0.00 sec)

# On each slave, configure replication using the File and Position recorded on the master
mysql> CHANGE MASTER TO MASTER_HOST='10.250.0.118',
MASTER_PORT=3306,
MASTER_USER='vagrant',
MASTER_PASSWORD='vagrant',
MASTER_LOG_FILE='mysql-bin.000012',
MASTER_LOG_POS=197;

# Start the slave's replication threads
mysql> START SLAVE;
# Check the slave's replication status
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 10.250.0.118
                  Master_User: vagrant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000012
          Read_Master_Log_Pos: 197
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 326
        Relay_Master_Log_File: mysql-bin.000012
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

6. MHA Needs a MySQL User with Administrative Privileges

  • If remote login for root is enabled, the root user can be used

  • Alternatively, create a user with administrative privileges on the master; since replication is in place, the slaves will pick it up automatically

mysql> CREATE USER 'mhaadmin'@'10.250.%.%' IDENTIFIED BY 'mhapass';
mysql> ALTER USER 'mhaadmin'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'mhapass';
mysql> GRANT ALL ON *.* TO 'mhaadmin'@'10.250.%.%';
mysql> FLUSH PRIVILEGES;

7. Writing the MHA Configuration File on the mha-manager Node

mkdir /etc/mha_master
vim /etc/mha_master/mha.cnf
[server default]
# Account and password MHA uses to access the databases
user=mhaadmin
password=mhapass
port=3306
# MHA working directory on the manager
manager_workdir=/etc/mha_master
# Manager log path
manager_log=/etc/mha_master/manager.log
# MHA working directory on the remote nodes; it must be created on every MySQL node
remote_workdir=/mydata/mha_master
# SSH login user
ssh_user=root
# MySQL replication user and password
repl_user=vagrant
repl_password=vagrant
# How often to check the master's status, in seconds
ping_interval=1

# By default, MHA checks the master's health by connecting to the master directly.
# You can also specify a script for a secondary check; here the bundled masterha_secondary_check command is used.
# Do not list the master's IP here; list slave IPs instead. The check path is
# manager -(A)-> the IP listed here -(B)-> master. Failover is triggered only when
# A succeeds and B fails; in all other cases the master is considered healthy.
# The servers listed here must therefore be running.
# secondary_check_script=/bin/masterha_secondary_check -s 10.250.0.186 -s 10.250.0.102

# Script that migrates the virtual IP to the new master during a failover
#master_ip_failover_script=/etc/mha_master/master_ip_failover_script

# Script that sends an alert during a failover
#report_script=/etc/mha_master/report_script

# Script that shuts down the failed host after a failure, mainly to prevent split-brain
#shutdown_script=/etc/mha_master/shutdown_script

# MySQL master/slave server information
[server1]
hostname=10.250.0.118
ssh_port=22
# Whether this node may become a master candidate / the new master
candidate_master=1
# Exclude this node from master candidacy
#no_master=1
[server2]
hostname=10.250.0.186
ssh_port=22
candidate_master=1
[server3]
hostname=10.250.0.102
ssh_port=22
candidate_master=1

Note: the directory /mydata/mha_master must be created on every MySQL node.

8. Verifying SSH Connectivity Between the Nodes

[root@ip-10-250-0-91 mha_master]# masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:13:08 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:13:08 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Starting SSH connection tests..
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [debug]
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [info] All SSH connection tests passed successfully.

Seeing "All SSH connection tests passed successfully." means SSH connectivity is working.

9. Checking the Replication Cluster's Connection Configuration

[root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:16:35 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:16:35 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] MHA::MasterMonitor version 0.58.
Thu Sep 15 10:16:36 2022 - [info] GTID failover mode = 1
Thu Sep 15 10:16:36 2022 - [info] Dead Servers:
Thu Sep 15 10:16:36 2022 - [info] Alive Servers:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306)
Thu Sep 15 10:16:36 2022 - [info] Alive Slaves:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] Current Alive Master: 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Checking slave configurations..
Thu Sep 15 10:16:36 2022 - [info] Checking replication filtering settings..
Thu Sep 15 10:16:36 2022 - [info] binlog_do_db= , binlog_ignore_db=
Thu Sep 15 10:16:36 2022 - [info] Replication filtering check ok.
Thu Sep 15 10:16:36 2022 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Thu Sep 15 10:16:36 2022 - [info] Checking SSH publickey authentication settings on the current master..
Thu Sep 15 10:16:36 2022 - [info] HealthCheck: SSH to 10.250.0.118 is reachable.
Thu Sep 15 10:16:36 2022 - [info]
10.250.0.118(10.250.0.118:3306) (current master)
+--10.250.0.186(10.250.0.186:3306)
+--10.250.0.102(10.250.0.102:3306)

Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

Seeing "MySQL Replication Health is OK." means the check passed.

10. Managing MHA

# First check whether mha.failover.complete exists in the MHA working directory; delete it if it does
[root@ip-10-250-0-91 mha_master]# rm -rf /etc/mha_master/mha.failover.complete
# Start MHA
[root@ip-10-250-0-91 mha_master]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf &> /etc/mha_master/manager.log &

Check the master's status

# Check the master's status
[root@ip-10-250-0-91 mha_master]# masterha_check_status --conf=/etc/mha_master/mha.cnf
mha (pid:9696) is running(0:PING_OK), master:10.250.0.118

In the output above, "mha (pid:9696) is running(0:PING_OK)" means the MHA service is running normally; otherwise it would show something like "mha is stopped(1:NOT_RUNNING)."

Stop MHA

# Stop MHA
[root@ip-10-250-0-91 mha_master]# masterha_stop --conf=/etc/mha_master/mha.cnf
Stopped mha successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/mha_master/mha.cnf &>/etc/mha_master/manager.log

11. Testing MHA Failover

First, let's test with the wrong approach from "5. Create the Replication Account on the Slaves Too"

  • Start MHA

  • Shut down the MySQL service on mysql-master

  • Watch the mha-manager log

[root@ip-10-250-0-91 mha_master]# tail -f manager.log
………………………………
Started automated(non-interactive) failover.
Selected 10.250.0.186(10.250.0.186:3306) as a new master.
10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
10.250.0.102(10.250.0.102:3306): ERROR: Failed on waiting gtid exec set on master.
Master failover to 10.250.0.186(10.250.0.186:3306) done, but recovery on slave partially failed.

MHA detected the master failure, re-elected a master, and the election succeeded, but re-attaching the remaining slave to the new master failed.

Let's analyze the cause of this failure.

  • Check the new master's status

mysql> show master status\G
*************************** 1. row ***************************
             File: mysql-bin.000013
         Position: 2466
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: 6e9a571e-330e-11ed-a3f8-0a53e7cced42:5-9,
6e9a571e-330e-11ed-a3f8-0a53e7cced43:1-4
  • The new master's read_only is now OFF, but this change only applies to the running process; after a restart, MySQL reads the configuration file again, so be sure to update the configuration file whenever the master changes

mysql> show variables like "%read_only%";
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_read_only      | OFF   |
| read_only             | OFF   |
| super_read_only       | OFF   |
| transaction_read_only | OFF   |
+-----------------------+-------+
  • Check the slave's status to see the error message

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 10.250.0.186
                  Master_User: vagrant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000013
          Read_Master_Log_Pos: 2466
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 420
        Relay_Master_Log_File: mysql-bin.000013
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  • As the message suggests, query the performance_schema.replication_applier_status_by_worker table on the slave

mysql> select * from performance_schema.replication_applier_status_by_worker\G;
*************************** 1. row ***************************
CHANNEL_NAME:
WORKER_ID: 1
THREAD_ID: NULL
SERVICE_STATE: OFF
LAST_ERROR_NUMBER: 1396
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465; Error 'Operation CREATE USER failed for 'vagrant'@'10.250.%.%'' on query. Default database: 'mysql'. Query: 'CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*04E6E1273D1783DF7D57DC5479FE01CFFDFD0058''
…………

The message is quite clear: while applying events from the new master, the slave failed on the CREATE USER statement that creates the replication account.
Recall our build process: each slave created the replication account locally in its own database, so the two slaves' GTID sets and binlogs actually diverged. After the master switch, the surviving slave receives the CREATE USER statement from the new master, but the account already exists locally, so the statement fails and replication breaks.
The fix at this point is a full import of the master's databases into the slave, then rebuilding replication, as described below.

Now rebuild the cluster with the correct approach from "5. Create the Replication Account on the Slaves Too" and test again

  • For the setup steps, see the correct approach in "5. Create the Replication Account on the Slaves Too"

  • Start MHA

  • Shut down the MySQL service on mysql-master

  • Watch the mha-manager log

…………
Started automated(non-interactive) failover.
Selected 10.250.0.186(10.250.0.186:3306) as a new master.
10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.

The switch to the new master succeeded, and replication was re-established.

  • Verify that the master and slave are healthy

# Master
mysql> show master status;
+------------------+----------+--------------+------------------+------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                        |
+------------------+----------+--------------+------------------+------------------------------------------+
| mysql-bin.000001 |      157 |              |                  | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
+------------------+----------+--------------+------------------+------------------------------------------+

# Slave
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for source to send event
                  Master_Host: 10.250.0.186
                  Master_User: vagrant
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 157
               Relay_Log_File: slave-relay-bin.000002
                Relay_Log_Pos: 373
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

MHA is now built and tested successfully.

Key points: a few things worth knowing about MHA:

1. Once MHA completes a master switch, the MHA manager process stops; restart it manually after the MySQL cluster has been repaired.
2. A failed MySQL instance, once repaired, can rejoin the cluster as a slave. Update its MySQL configuration file accordingly, and do a fresh full import of the master's databases before rejoining.
3. When a node rejoins, check whether its IP has changed; if it has, update the MHA configuration file promptly.
4. After MHA switches masters, read_only in the new master's configuration file is not changed; update it manually, or the server will refuse writes after a restart.
5. MHA only monitors the master. A slave failure does not trigger any MHA action; as long as the slave's IP is unchanged, simply repair and restart it.
6. After a failover, MHA creates an mha.failover.complete file in its working directory. To restart MHA within 8 hours of a failover, first check the working directory for mha.failover.complete and delete it if present (otherwise the next master switch cannot run), or start MHA with --ignore_last_failover.
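
Point 6 above can be wrapped in a small helper. This is only a sketch: the clean_failover_flag function is a name invented here, and the commented-out masterha_manager invocation mirrors the one used earlier in this article; adapt paths to your environment.

```shell
#!/bin/sh
# Hypothetical helper: clear the failover marker before restarting MHA.

clean_failover_flag() {
    workdir="$1"
    flag="$workdir/mha.failover.complete"
    # mha.failover.complete blocks a new failover within 8 hours; remove it
    if [ -f "$flag" ]; then
        rm -f "$flag"
    fi
}

# On the manager node you would then run (shown commented out here):
# clean_failover_flag /etc/mha_master
# nohup masterha_manager --conf=/etc/mha_master/mha.cnf \
#     &> /etc/mha_master/manager.log &
```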

A remaining problem:

1. MHA monitors the master for us; when it detects that the master is unavailable, it elects a new master from the slaves, achieving automatic master-slave switchover.
2. There is a catch, though: after a switchover, the master's IP changes, so clients must reconnect to the new address.
3. Is there a way to complete the switchover without changing the address clients connect to? The answer is a VIP (virtual IP).

MHA Script Extensions

  • When checking the cluster's connection configuration with MHA, you will see output like the following

[root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
………………
Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

There are two warnings: "master_ip_failover_script is not defined." and "shutdown_script is not defined."
The shutdown_script shuts down the failed host after a failure, while the master_ip_failover_script specifies the actions to perform during a failover.

In fact, MHA supports several script hooks:

secondary_check_script:         check the master's availability over multiple network routes
master_ip_failover_script:      update the master IP address used by applications
shutdown_script:                force-shutdown the master host
report_script:                  send reports
init_conf_load_script:          load initial configuration parameters
master_ip_online_change_script: update the master IP address during a planned online master switch (not used for failover)
  • The MHA configuration file can specify master_ip_failover_script, whose value is a script file. As the name suggests, it is executed during a failover; it is typically used to rebind the virtual IP.

  • MHA does not ship an implementation of this script; you must write it yourself. Here is an example adapted from one circulating online: vim /etc/mha_master/master_ip_failover_script

#!/bin/env perl
use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command, $orig_master_host, $orig_master_ip, $ssh_user,
    $orig_master_port, $new_master_host, $new_master_ip, $new_master_port,
    $orig_master_ssh_port, $new_master_ssh_port, $new_master_user, $new_master_password
);

# Note: the virtual IP defined here must be in the same subnet as your cluster, or it will not work
my $vip = '10.250.0.199';
my $key = '1';
# Change the interface name "eth0" below to match your machine's network interface
my $ssh_start_vip = "sudo /sbin/ifconfig eth0:$key $vip up";
my $ssh_stop_vip  = "sudo /sbin/ifconfig eth0:$key down";
my $ssh_Bcast_arp = "sudo /sbin/arping -I eth0 -c 3 -A $vip";

GetOptions(
    'command=s'              => \$command,
    'ssh_user=s'             => \$ssh_user,
    'orig_master_host=s'     => \$orig_master_host,
    'orig_master_ip=s'       => \$orig_master_ip,
    'orig_master_port=i'     => \$orig_master_port,
    'orig_master_ssh_port=i' => \$orig_master_ssh_port,
    'new_master_host=s'      => \$new_master_host,
    'new_master_ip=s'        => \$new_master_ip,
    'new_master_port=i'      => \$new_master_port,
    'new_master_ssh_port=i'  => \$new_master_ssh_port,
    'new_master_user=s'      => \$new_master_user,
    'new_master_password=s'  => \$new_master_password
);

exit &main();

sub main {
    $ssh_user = defined $ssh_user ? $ssh_user : 'root';
    print "\n\n SCRIPT START \[$ssh_user|$ssh_start_vip\]";
    print "\n SCRIPT STOP \[$ssh_user|$ssh_stop_vip\]\n\n";

    if ( $command eq "stop" || $command eq "stopssh" ) {
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            &start_arp();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub start_arp() {
    `ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
}
sub usage {
    print
"Usage: master_ip_failover --command=start|stop|stopssh|status --ssh_user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}
  • chmod a+x /etc/mha_master/master_ip_failover_script

  • vim /etc/mha_master/mha.cnf and add the master_ip_failover_script setting

…………
# Master check interval
ping_interval=1
# Script that migrates the virtual IP to the new master after a master switch
master_ip_failover_script=/etc/mha_master/master_ip_failover_script
# MySQL master/slave server information
…………
  • The virtual IP is 10.250.0.199; bind it to mysql-master

[root@ip-10-250-0-118 mha_master]# ifconfig eth0:1 10.250.0.199/24
[root@ip-10-250-0-118 mha_master]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.250.0.118 netmask 255.255.255.0 broadcast 10.250.0.255
inet6 fe80::803:41ff:fe19:356a prefixlen 64 scopeid 0x20<link>
ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)
RX packets 12260 bytes 2511078 (2.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10202 bytes 5216046 (4.9 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.250.0.199 netmask 255.255.255.0 broadcast 10.250.0.255
ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 43 bytes 9386 (9.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 43 bytes 9386 (9.1 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
  • Re-run the configuration check

[root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
………………
Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.186..
Fri Sep 16 07:06:02 2022 - [info] ok.
Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.102..
Fri Sep 16 07:06:02 2022 - [info] ok.
Fri Sep 16 07:06:02 2022 - [info] Checking master_ip_failover_script status:
Fri Sep 16 07:06:02 2022 - [info] /etc/mha_master/master_ip_failover_script --command=status --ssh_user=root --orig_master_host=10.250.0.118 --orig_master_ip=10.250.0.118 --orig_master_port=3306


SCRIPT START [root|sudo /sbin/ifconfig eth0:1 10.250.0.199 up]
SCRIPT STOP [root|sudo /sbin/ifconfig eth0:1 down]

Checking the Status of the script.. OK
Fri Sep 16 07:06:02 2022 - [info] OK.
Fri Sep 16 07:06:02 2022 - [warning] shutdown_script is not defined.
Fri Sep 16 07:06:02 2022 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

  • Restart MHA

  • Stop the MySQL service on mysql-master

  • Check the MHA log

----- Failover Report -----

mha: MySQL Master failover 10.250.0.118(10.250.0.118:3306) to 10.250.0.186(10.250.0.186:3306) succeeded

Master 10.250.0.118(10.250.0.118:3306) is down!

Check MHA Manager logs at ip-10-250-0-91.cn-northwest-1.compute.internal:/etc/mha_master/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 10.250.0.118(10.250.0.118:3306)
Selected 10.250.0.186(10.250.0.186:3306) as a new master.
10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
10.250.0.186(10.250.0.186:3306): OK: Activated master IP address.
10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.
  • Now check the new master's IP addresses

[root@ip-10-250-0-186 mysql8]# ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.250.0.186 netmask 255.255.255.0 broadcast 10.250.0.255
inet6 fe80::895:fcff:fe7f:ee0c prefixlen 64 scopeid 0x20<link>
ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)
RX packets 12284 bytes 3862460 (3.6 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10558 bytes 2559446 (2.4 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
inet 10.250.0.199 netmask 255.0.0.0 broadcast 10.255.255.255
ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 41 bytes 9270 (9.0 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 41 bytes 9270 (9.0 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Checking the original master shows that it has released the virtual IP as well. The VIP migrated successfully, and the failover worked!

Note that AWS does not support bringing up a VIP directly on an EC2 instance; other solutions are needed there, which are beyond the scope of this article.

You can also set report_script in the MHA configuration file, pointing at a script that sends an alert during a failover.
It can send email or call the WeCom or DingTalk APIs; the details are not covered here.