How to Build MySQL-MHA

Summary

  • The process of building MySQL-MHA
  • This article is based on mysql-8.0.30 and mha4mysql-0.58-0.el7.centos

1. Node planning

mha-manager:    10.250.0.91      MHA manager, monitors and manages the cluster
mysql-master: 10.250.0.118 MySQL master server
mysql-slave1: 10.250.0.186 MySQL slave server
mysql-slave2: 10.250.0.102 MySQL slave server
  • You can map the IPs to hostnames in /etc/hosts on all four servers to simplify the configuration that follows
    vim /etc/hosts
    10.250.0.91 node1
    10.250.0.118 node2
    10.250.0.186 node3
    10.250.0.102 node4

2. Passwordless SSH login

  • First, create a key pair on the mha-manager host

    # Create a key pair (private and public key files); you can skip the passphrase by pressing Enter at each prompt
    [root@ip-10-250-0-91 .ssh]# ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/root/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    SHA256:C8JdDpbsQP+NeLSPFvHz3BkLigak3fQusApbrPNL9tw root@ip-10-250-0-91.cn-northwest-1.compute.internal
    The key's randomart image is:
    +---[RSA 2048]----+
    | . |
    | . o . |
    | . B = |
    | . O X B |
    | + O S = . . |
    | . . * B = o + |
    | . = . B + o + |
    | .B + + . |
    | oo+.o E |
    +----[SHA256]-----+
  • This generates the key pair files id_rsa and id_rsa.pub in the .ssh directory

    [root@ip-10-250-0-91 .ssh]# ls
    authorized_keys id_rsa id_rsa.pub
  • Copy the public key to every server; you will be prompted for each server's root password

    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node1
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node2
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node3
    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
  • A problem you may run into while copying the public key

    Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

    This happens when password authentication is disabled on the target server. Fix it as follows:

    # Set the root password on each of the 4 servers
    passwd root

    # Enable password login on each of the 4 servers
    vim /etc/ssh/sshd_config
    # Set PasswordAuthentication to yes
    PasswordAuthentication yes
    # Restart the sshd service
    systemctl restart sshd.service
  • Public key installed successfully

    [root@ip-10-250-0-91 .ssh]# ssh-copy-id -i id_rsa.pub root@node4
    /bin/ssh-copy-id: INFO: Source of key(s) to be installed: "id_rsa.pub"
    The authenticity of host 'node4 (10.250.0.102)' can't be established.
    ECDSA key fingerprint is SHA256:aXIII+S5nOVy6pqP1fuaW6fYFsVIN9TBFVP/Xaf8Pds.
    ECDSA key fingerprint is MD5:1f:07:2f:04:75:77:68:8d:f0:20:96:f1:0b:90:ac:61.
    Are you sure you want to continue connecting (yes/no)? yes
    /bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
    /bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
    root@node4's password:

    Number of key(s) added: 1

    Now try logging into the machine, with: "ssh 'root@node4'"
    and check to make sure that only the key(s) you wanted were added.
  • Repeat the key-pair creation on each of the other 3 servers in the same way; the process is identical and is not repeated here.

  • Test passwordless login: from any server you should be able to log in to any other server without entering a password

    ssh node3
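The manual check above can be scripted so that all nodes are verified in one pass (a sketch, assuming the node1..node4 names from the /etc/hosts mapping earlier):

```shell
# Probe passwordless SSH to each node; BatchMode=yes makes ssh fail immediately
# instead of falling back to an interactive password prompt.
check_ssh() {
    local host=$1
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true >/dev/null 2>&1; then
        echo "$host: OK"
    else
        echo "$host: FAILED"
    fi
}

for host in node1 node2 node3 node4; do
    check_ssh "$host"
done
```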

3. Download and install the MHA packages

# The node package must be installed on every node
wget https://github.com/yoshinorim/mha4mysql-node/releases/download/v0.58/mha4mysql-node-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-node-0.58-0.el7.centos.noarch.rpm

# Only the manager node needs the manager package
wget https://github.com/yoshinorim/mha4mysql-manager/releases/download/v0.58/mha4mysql-manager-0.58-0.el7.centos.noarch.rpm

yum install -y mha4mysql-manager-0.58-0.el7.centos.noarch.rpm
  • Installing the manager may fail with missing epel-provided Perl dependencies
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch::File)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Parallel::ForkManager)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch)
    Error: Package: mha4mysql-manager-0.58-0.el7.centos.noarch (/mha4mysql-manager-0.58-0.el7.centos.noarch)
    Requires: perl(Log::Dispatch::Screen)

    The fix is to install epel-release first

    yum install epel-release -y

    On an AWS EC2 instance, install epel with the following command instead

    sudo amazon-linux-extras install epel -y

4. Setting up MySQL master-slave replication

  • Follow the instructions in "How to build MySQL single-node, master-slave, and dual-master setups" to set up replication; the steps are not repeated here.
  • A few points to note:

    1. server-id must be unique on every MySQL instance
    2. Every instance must enable the binlog and the relay log, with relay_log_purge=0
    3. Slaves should be read-only: read_only=1
    4. For cloud instances created from the same image, make sure the MySQL server UUIDs are not identical.
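These requirements map to my.cnf settings along the following lines (a sketch, not the author's exact files; the server-id value and log base names are examples):

```ini
# my.cnf fragment for a slave; give every instance a distinct server-id
[mysqld]
server-id                = 2
log_bin                  = mysql-bin
relay_log                = slave-relay-bin
relay_log_purge          = 0    # keep relay logs so MHA can apply them during recovery
read_only                = 1    # slaves only; omit (or set to 0) on the master
gtid_mode                = ON
enforce_gtid_consistency = ON
```

For instances cloned from a single image, the duplicated server UUID lives in auto.cnf in the data directory; deleting that file and restarting mysqld makes the server generate a new one.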

5. Slaves also need the replication account

  • Because a slave may be promoted to master, every slave also needs the replication account. One way is to run the following on each slave:

    Note:
    This approach is WRONG. It will prevent the slaves from syncing with the new master after a failover,
    because by then the two slaves' GTID transactions and binlog positions no longer match.
    The consequences of this mistake are shown below in "11. Testing MHA failover".

    The correct approach is to import the master's databases into both slaves; once replication is established, never write anything to the slaves directly.

    # Stop replication
    mysql> stop slave;
    # '10.250.%.%' restricts access to the local subnet; use '%' if you don't want to restrict by IP
    mysql> CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED BY 'vagrant';
    mysql> ALTER USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'vagrant';
    mysql> GRANT REPLICATION SLAVE ON *.* TO 'vagrant'@'10.250.%.%';
    mysql> FLUSH PRIVILEGES;
    # Resume replication
    mysql> start slave;

    The correct approach: import the master's databases into both slaves, then start replication.
    Note: once replication is established, never write anything to the slaves directly.

    # On each of the two slaves, dump the master's databases
    mysqldump -uroot --all-databases --triggers --routines --events -p -h node2 > all_databases.sql
    # Import the dump on the slave
    mysql -uroot -p < all_databases.sql

    Because GTID is enabled, the import will fail with the following error:

    @@GLOBAL.GTID_PURGED cannot be changed: the added gtid set must not overlap with @@GLOBAL.GTID_EXECUTED

    The fix is to run the following on the slave before importing:

    mysql> stop slave;
    mysql> reset slave all;
    mysql> reset master;

    Configure replication on the slaves

    # Check the master's status and record the File and Position values
    mysql> show master status;
    +------------------+----------+--------------+------------------+------------------------------------------+
    | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +------------------+----------+--------------+------------------+------------------------------------------+
    | mysql-bin.000012 | 197 | | | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
    +------------------+----------+--------------+------------------+------------------------------------------+
    1 row in set (0.00 sec)

    # On the slave, configure replication using the File and Position recorded from the master
    mysql> CHANGE MASTER TO MASTER_HOST='10.250.0.118',
    MASTER_PORT=3306,
    MASTER_USER='vagrant',
    MASTER_PASSWORD='vagrant',
    MASTER_LOG_FILE='mysql-bin.000012',
    MASTER_LOG_POS=197;

    # Start the slave's replication threads
    mysql> START SLAVE;
    # Check the slave's replication status
    mysql> SHOW SLAVE STATUS\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.118
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000012
    Read_Master_Log_Pos: 197
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 326
    Relay_Master_Log_File: mysql-bin.000012
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes

6. MHA needs a MySQL user with administrative privileges

  • If remote root login is enabled, you can use the root user
  • Alternatively, create a user with administrative privileges on the master; since replication is running, the slaves pick it up automatically
    mysql> CREATE USER 'mhaadmin'@'10.250.%.%' IDENTIFIED BY 'mhapass';
    mysql> ALTER USER 'mhaadmin'@'10.250.%.%' IDENTIFIED WITH mysql_native_password BY 'mhapass';
    mysql> GRANT ALL ON *.* TO 'mhaadmin'@'10.250.%.%';
    mysql> FLUSH PRIVILEGES;

7. Write the MHA configuration file on the mha-manager node

mkdir /etc/mha_master
vim /etc/mha_master/mha.cnf
[server default]
# account and password MHA uses to access MySQL
user=mhaadmin
password=mhapass
port=3306
# MHA working directory on the manager
manager_workdir=/etc/mha_master
# path of the manager log
manager_log=/etc/mha_master/manager.log
# MHA working directory on the remote nodes; every MySQL node must create this directory
remote_workdir=/mydata/mha_master
# SSH login user
ssh_user=root
# MySQL replication user and password
repl_user=vagrant
repl_password=vagrant
# how often to check the master's status, in seconds
ping_interval=1

# By default, MHA checks whether the master is healthy by connecting to it directly.
# You can also specify a script for a secondary check of the master's status; here
# we would use MHA's bundled masterha_secondary_check command.
# Do not list the master's IP here; list slave IPs instead. The check path is
# manager -(A)-> the IPs listed here -(B)-> master. A failover is triggered only when
# A succeeds and B fails; in every other case the master is considered healthy.
# The servers listed here must therefore stay up.
# secondary_check_script=/bin/masterha_secondary_check -s 10.250.0.186 -s 10.250.0.102

# script that moves the virtual IP to the new master during a failover
#master_ip_failover_script=/etc/mha_master/master_ip_failover_script

# script that sends an alert when a failover happens
#report_script=/etc/mha_master/report_script

# script that shuts down the failed host after a failure, mainly to prevent split-brain
#shutdown_script=/etc/mha_master/shutdown_script

# MySQL master/slave servers
[server1]
hostname=10.250.0.118
ssh_port=22
# whether this node may be a candidate for promotion to master
candidate_master=1
# exclude this node from master election
#no_master=1
[server2]
hostname=10.250.0.186
ssh_port=22
candidate_master=1
[server3]
hostname=10.250.0.102
ssh_port=22
candidate_master=1

Note: every MySQL node must create the directory /mydata/mha_master
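The secondary_check_script decision described in the config comments can be sketched as a small function (an illustrative model of the rule, not MHA's actual implementation): failover is triggered only when the manager can reach the check host (leg A) but the check host cannot reach the master (leg B).

```shell
# Model of the two-leg check: a = manager -> check host, b = check host -> master.
# Each argument is "ok" or "fail".
decide() {
    local a=$1 b=$2
    if [ "$a" = "ok" ] && [ "$b" = "fail" ]; then
        echo "failover"          # only this combination triggers a failover
    else
        echo "master-alive"      # any other combination: master is assumed healthy
    fi
}

decide ok fail    # network path to the master is down -> failover
decide fail ok    # the check host itself is unreachable -> no failover
```

This is why the hosts passed with -s must stay up: if leg A fails, MHA assumes the problem is on its own side and never declares the master dead.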

8. Check that passwordless SSH between the nodes is configured correctly

[root@ip-10-250-0-91 mha_master]# masterha_check_ssh --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:13:08 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:13:08 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:13:08 2022 - [info] Starting SSH connection tests..
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.118(10.250.0.118:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug]
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:08 2022 - [debug] ok.
Thu Sep 15 10:13:08 2022 - [debug] Connecting via SSH from root@10.250.0.186(10.250.0.186:22) to root@10.250.0.102(10.250.0.102:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [debug]
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.118(10.250.0.118:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:09 2022 - [debug] Connecting via SSH from root@10.250.0.102(10.250.0.102:22) to root@10.250.0.186(10.250.0.186:22)..
Thu Sep 15 10:13:09 2022 - [debug] ok.
Thu Sep 15 10:13:10 2022 - [info] All SSH connection tests passed successfully.

Seeing "All SSH connection tests passed successfully." means SSH connectivity is working

9. Check that the managed MySQL replication cluster's connection settings are OK

[root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
Thu Sep 15 10:16:35 2022 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Thu Sep 15 10:16:35 2022 - [info] Reading application default configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] Reading server configuration from /etc/mha_master/mha.cnf..
Thu Sep 15 10:16:35 2022 - [info] MHA::MasterMonitor version 0.58.
Thu Sep 15 10:16:36 2022 - [info] GTID failover mode = 1
Thu Sep 15 10:16:36 2022 - [info] Dead Servers:
Thu Sep 15 10:16:36 2022 - [info] Alive Servers:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306)
Thu Sep 15 10:16:36 2022 - [info] Alive Slaves:
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.186(10.250.0.186:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] 10.250.0.102(10.250.0.102:3306) Version=8.0.30 (oldest major version between slaves) log-bin:enabled
Thu Sep 15 10:16:36 2022 - [info] GTID ON
Thu Sep 15 10:16:36 2022 - [info] Replicating from 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Primary candidate for the new Master (candidate_master is set)
Thu Sep 15 10:16:36 2022 - [info] Current Alive Master: 10.250.0.118(10.250.0.118:3306)
Thu Sep 15 10:16:36 2022 - [info] Checking slave configurations..
Thu Sep 15 10:16:36 2022 - [info] Checking replication filtering settings..
Thu Sep 15 10:16:36 2022 - [info] binlog_do_db= , binlog_ignore_db=
Thu Sep 15 10:16:36 2022 - [info] Replication filtering check ok.
Thu Sep 15 10:16:36 2022 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Thu Sep 15 10:16:36 2022 - [info] Checking SSH publickey authentication settings on the current master..
Thu Sep 15 10:16:36 2022 - [info] HealthCheck: SSH to 10.250.0.118 is reachable.
Thu Sep 15 10:16:36 2022 - [info]
10.250.0.118(10.250.0.118:3306) (current master)
+--10.250.0.186(10.250.0.186:3306)
+--10.250.0.102(10.250.0.102:3306)

Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
Thu Sep 15 10:16:36 2022 - [info] ok.
Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

Seeing "MySQL Replication Health is OK." means the check passed

10. Managing MHA

# First check whether mha.failover.complete exists in the MHA working directory; delete it if it does
[root@ip-10-250-0-91 mha_master]# rm -rf /etc/mha_master/mha.failover.complete
# Start the manager
[root@ip-10-250-0-91 mha_master]# nohup masterha_manager --conf=/etc/mha_master/mha.cnf &> /etc/mha_master/manager.log &
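The two steps can be wrapped in a small start function so the stale failover marker is always cleaned up first (a sketch assuming the /etc/mha_master layout used above):

```shell
# Start masterha_manager for the given working directory, removing any leftover
# mha.failover.complete marker first (it would otherwise block the next failover
# within 8 hours unless --ignore_last_failover is used).
start_mha() {
    local workdir=$1
    rm -f "$workdir/mha.failover.complete"
    nohup masterha_manager --conf="$workdir/mha.cnf" \
        &> "$workdir/manager.log" &
    echo $!    # PID of the background manager process
}
```

Usage: `start_mha /etc/mha_master`.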

Check the master's status

# Check the master's status
[root@ip-10-250-0-91 mha_master]# masterha_check_status --conf=/etc/mha_master/mha.cnf
mha (pid:9696) is running(0:PING_OK), master:10.250.0.118

In the output above, "mha (pid:9696) is running(0:PING_OK)" means the MHA service is running normally; otherwise it shows something like "mha is stopped(1:NOT_RUNNING)."

Stopping MHA

# Stop the manager
[root@ip-10-250-0-91 mha_master]# masterha_stop --conf=/etc/mha_master/mha.cnf
Stopped mha successfully.
[1]+ Exit 1 nohup masterha_manager --conf=/etc/mha_master/mha.cnf &>/etc/mha_master/manager.log

11. Testing MHA failover

First, let's test with the WRONG approach from "5. Slaves also need the replication account"

  • Start MHA
  • Shut down the mysql-master database service
  • Watch the mha-manager log
    [root@ip-10-250-0-91 mha_master]# tail -f manager.log
    ………………………………
    Started automated(non-interactive) failover.
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.102(10.250.0.102:3306): ERROR: Failed on waiting gtid exec set on master.
    Master failover to 10.250.0.186(10.250.0.186:3306) done, but recovery on slave partially failed.

    MHA detected the master going down and elected a new master successfully, but failed to attach the remaining slave to it.

Let's analyze why this failed

  • Check the new master's status
    mysql> show master status\G
    *************************** 1. row ***************************
    File: mysql-bin.000013
    Position: 2466
    Binlog_Do_DB:
    Binlog_Ignore_DB:
    Executed_Gtid_Set: 6e9a571e-330e-11ed-a3f8-0a53e7cced42:5-9,
    6e9a571e-330e-11ed-a3f8-0a53e7cced43:1-4
  • The new master's read_only is now off, but only for the running process; a restart re-reads the configuration file, so after a failover be sure to update read_only in the new master's config file
    mysql> show variables like "%read_only%";
    +-----------------------+-------+
    | Variable_name | Value |
    +-----------------------+-------+
    | innodb_read_only | OFF |
    | read_only | OFF |
    | super_read_only | OFF |
    | transaction_read_only | OFF |
    +-----------------------+-------+
  • Check the slave's status to get the error message
    mysql> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.186
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000013
    Read_Master_Log_Pos: 2466
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 420
    Relay_Master_Log_File: mysql-bin.000013
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 1396
    Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  • As the message suggests, query the performance_schema.replication_applier_status_by_worker table on the slave
    mysql> select * from performance_schema.replication_applier_status_by_worker\G
    *************************** 1. row ***************************
    CHANNEL_NAME:
    WORKER_ID: 1
    THREAD_ID: NULL
    SERVICE_STATE: OFF
    LAST_ERROR_NUMBER: 1396
    LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '6e9a571e-330e-11ed-a3f8-0a53e7cced43:1' at master log mysql-bin.000013, end_log_pos 1465; Error 'Operation CREATE USER failed for 'vagrant'@'10.250.%.%'' on query. Default database: 'mysql'. Query: 'CREATE USER 'vagrant'@'10.250.%.%' IDENTIFIED WITH 'mysql_native_password' AS '*04E6E1273D1783DF7D57DC5479FE01CFFDFD0058''
    …………

    The message is fairly clear: while syncing from the new master, the slave failed on the CREATE USER statement that creates the replication account.
    Recall how we built this cluster: each slave created the replication account locally in its own database, so the two slaves' GTID transactions and binlogs were never actually identical. After the failover, the surviving slave pulls the new master's binlog, which includes that CREATE USER statement; since the account already exists locally, the statement fails, and replication stops.
    The fix at this point is to do a full import of the master's databases into the slave and rebuild replication, as described below.
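The divergence is visible directly in Executed_Gtid_Set: transactions created locally on a slave appear under an extra source UUID (the ...cced43 set above) that the other nodes never executed. A quick way to flag this is to count the distinct UUIDs in the set (a diagnostic sketch; the sample values mirror the log output above):

```shell
# Count distinct source UUIDs in an Executed_Gtid_Set value. A slave that has only
# ever replayed its master's transactions shows one UUID; local writes add more.
count_gtid_uuids() {
    echo "$1" | tr ',' '\n' | cut -d: -f1 | tr -d ' ' | sort -u | grep -c .
}

clean="6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9"
diverged="6e9a571e-330e-11ed-a3f8-0a53e7cced42:5-9,6e9a571e-330e-11ed-a3f8-0a53e7cced43:1-4"

count_gtid_uuids "$clean"      # -> 1
count_gtid_uuids "$diverged"   # -> 2
```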

Now let's rebuild replication the CORRECT way per "5. Slaves also need the replication account" and test again

  • For the setup steps, see the correct approach in "5. Slaves also need the replication account"
  • Start MHA
  • Shut down the mysql-master database service
  • Watch the mha-manager log
    …………
    Started automated(non-interactive) failover.
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
    10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
    Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.

    The new master was promoted and replication was re-established successfully.

  • Check that the master and slave status are healthy
    # On the master
    mysql> show master status;
    +------------------+----------+--------------+------------------+------------------------------------------+
    | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
    +------------------+----------+--------------+------------------+------------------------------------------+
    | mysql-bin.000001 | 157 | | | 6e9a571e-330e-11ed-a3f8-0a53e7cced42:1-9 |
    +------------------+----------+--------------+------------------+------------------------------------------+

    # On the slave
    mysql> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for source to send event
    Master_Host: 10.250.0.186
    Master_User: vagrant
    Master_Port: 3306
    Connect_Retry: 60
    Master_Log_File: mysql-bin.000001
    Read_Master_Log_Pos: 157
    Relay_Log_File: slave-relay-bin.000002
    Relay_Log_Pos: 373
    Relay_Master_Log_File: mysql-bin.000001
    Slave_IO_Running: Yes
    Slave_SQL_Running: Yes

    MHA is now set up and tested successfully.

Key points about MHA:

1. Once MHA completes a master failover, the manager process exits; it must be restarted manually after the MySQL cluster has been repaired
2. A repaired (formerly failed) MySQL instance can rejoin the cluster as a slave; update its MySQL configuration file, and always re-import a full dump of the master's databases before it rejoins
3. When a node rejoins, check whether its IP has changed; if it has, update the MHA configuration file promptly
4. After a failover, read_only in the new master's configuration file is not changed by MHA; update it manually, or the server will reject writes after a restart
5. MHA only monitors the master; a failed slave triggers no MHA action. As long as the slave's IP is unchanged, simply repair and restart it.
6. A failover creates an mha.failover.complete file in the MHA working directory. To restart MHA within 8 hours of a failover, check the working directory for this file and delete it first, otherwise the next failover cannot complete; alternatively, start MHA with --ignore_last_failover

A remaining problem:

1. MHA monitors the master; when it detects that the master is unavailable, it elects a new master from the slaves, completing the failover.
2. However, after the switch the master's IP changes, so clients have to connect to a new address.
3. Is there a way to complete the switch without clients changing the address they connect to? Yes: a VIP (virtual IP).

MHA script extensions

  • When checking the MySQL cluster's connection settings with MHA, you will see output like this
    [root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
    ………………
    Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.186..
    Thu Sep 15 10:16:36 2022 - [info] ok.
    Thu Sep 15 10:16:36 2022 - [info] Checking replication health on 10.250.0.102..
    Thu Sep 15 10:16:36 2022 - [info] ok.
    Thu Sep 15 10:16:36 2022 - [warning] master_ip_failover_script is not defined.
    Thu Sep 15 10:16:36 2022 - [warning] shutdown_script is not defined.
    Thu Sep 15 10:16:36 2022 - [info] Got exit code 0 (Not master dead).

    MySQL Replication Health is OK.

    There are two warnings: "master_ip_failover_script is not defined." and "shutdown_script is not defined."
    shutdown_script is the hook for shutting down the failed host after a failure, while master_ip_failover_script specifies what to do during a failover.

MHA actually supports several script extension points

secondary_check_script: checks the master's availability from multiple network routes
master_ip_failover_script: updates the master IP address used by applications
shutdown_script: forcibly shuts down the master
report_script: sends reports
init_conf_load_script: loads initial configuration parameters
master_ip_online_change_script: updates the master IP address; used for an online master switch, not for master failover
  • The MHA configuration file can specify master_ip_failover_script, whose value is a script file. As the name suggests, it runs when a failover happens; it is commonly used to re-bind the virtual IP.
  • MHA does not ship an implementation of this script; you write it yourself. Here is an example adapted from one found online: vim /etc/mha_master/master_ip_failover_script
    #!/bin/env perl
    use strict;
    use warnings FATAL => 'all';

    use Getopt::Long;

    my (
    $command, $orig_master_host, $orig_master_ip,$ssh_user,
    $orig_master_port, $new_master_host, $new_master_ip,$new_master_port,
    $orig_master_ssh_port,$new_master_ssh_port,$new_master_user,$new_master_password
    );

    # Note: this virtual IP must be in the same subnet as your cluster, or it will not work
    my $vip = '10.250.0.199';
    my $key = '1';
    # Change the interface name "eth0" to match your machine's NIC
    my $ssh_start_vip = "sudo /sbin/ifconfig eth0:$key $vip up";
    my $ssh_stop_vip = "sudo /sbin/ifconfig eth0:$key down";
    my $ssh_Bcast_arp= "sudo /sbin/arping -I eth0 -c 3 -A $vip";

    GetOptions(
    'command=s' => \$command,
    'ssh_user=s' => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s' => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'orig_master_ssh_port=i' => \$orig_master_ssh_port,
    'new_master_host=s' => \$new_master_host,
    'new_master_ip=s' => \$new_master_ip,
    'new_master_port=i' => \$new_master_port,
    'new_master_ssh_port=i' => \$new_master_ssh_port,
    'new_master_user=s' => \$new_master_user,
    'new_master_password=s' => \$new_master_password

    );

    exit &main();

    sub main {
    $ssh_user = defined $ssh_user ? $ssh_user : 'root';
    print "\n\n SCRIPT START \[$ssh_user|$ssh_start_vip\]";
    print "\n SCRIPT STOP \[$ssh_user|$ssh_stop_vip\]\n\n";


    if ( $command eq "stop" || $command eq "stopssh" ) {

    my $exit_code = 1;
    eval {
    print "Disabling the VIP on old master: $orig_master_host \n";
    &stop_vip();
    $exit_code = 0;
    };
    if ($@) {
    warn "Got Error: $@\n";
    exit $exit_code;
    }
    exit $exit_code;
    }
    elsif ( $command eq "start" ) {

    my $exit_code = 10;
    eval {
    print "Enabling the VIP - $vip on the new master - $new_master_host \n";
    &start_vip();
    &start_arp();
    $exit_code = 0;
    };
    if ($@) {
    warn $@;
    exit $exit_code;
    }
    exit $exit_code;
    }
    elsif ( $command eq "status" ) {
    print "Checking the Status of the script.. OK \n";
    exit 0;
    }
    else {
    &usage();
    exit 1;
    }
    }

    sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
    }
    sub stop_vip() {
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
    }

    sub start_arp() {
    `ssh $ssh_user\@$new_master_host \" $ssh_Bcast_arp \"`;
    }
    sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --ssh_user=user --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
    }
  • chmod a+x /etc/mha_master/master_ip_failover_script
  • vim /etc/mha_master/mha.cnf and add the master_ip_failover_script setting
    …………
    # check interval
    ping_interval=1
    # script that moves the virtual IP to the new master after a master-slave switch
    master_ip_failover_script=/etc/mha_master/master_ip_failover_script
    # MySQL master/slave servers
    …………
  • The virtual IP is 10.250.0.199; bind it to mysql-master
    [root@ip-10-250-0-118 mha_master]# ifconfig eth0:1 10.250.0.199/24
    [root@ip-10-250-0-118 mha_master]# ifconfig
    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.118 netmask 255.255.255.0 broadcast 10.250.0.255
    inet6 fe80::803:41ff:fe19:356a prefixlen 64 scopeid 0x20<link>
    ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)
    RX packets 12260 bytes 2511078 (2.3 MiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 10202 bytes 5216046 (4.9 MiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.199 netmask 255.255.255.0 broadcast 10.250.0.255
    ether 0a:03:41:19:35:6a txqueuelen 1000 (Ethernet)

    lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
    inet 127.0.0.1 netmask 255.0.0.0
    inet6 ::1 prefixlen 128 scopeid 0x10<host>
    loop txqueuelen 1000 (Local Loopback)
    RX packets 43 bytes 9386 (9.1 KiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 43 bytes 9386 (9.1 KiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
  • Re-run the configuration check
    [root@ip-10-250-0-91 mha_master]# masterha_check_repl --conf=/etc/mha_master/mha.cnf
    ………………
    Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.186..
    Fri Sep 16 07:06:02 2022 - [info] ok.
    Fri Sep 16 07:06:02 2022 - [info] Checking replication health on 10.250.0.102..
    Fri Sep 16 07:06:02 2022 - [info] ok.
    Fri Sep 16 07:06:02 2022 - [info] Checking master_ip_failover_script status:
    Fri Sep 16 07:06:02 2022 - [info] /etc/mha_master/master_ip_failover_script --command=status --ssh_user=root --orig_master_host=10.250.0.118 --orig_master_ip=10.250.0.118 --orig_master_port=3306


    SCRIPT START [root|sudo /sbin/ifconfig eth0:1 10.250.0.199 up]
    SCRIPT STOP [root|sudo /sbin/ifconfig eth0:1 down]

    Checking the Status of the script.. OK
    Fri Sep 16 07:06:02 2022 - [info] OK.
    Fri Sep 16 07:06:02 2022 - [warning] shutdown_script is not defined.
    Fri Sep 16 07:06:02 2022 - [info] Got exit code 0 (Not master dead).

    MySQL Replication Health is OK.

  • Restart MHA
  • Stop the mysql-master service
  • Check the MHA log
    ----- Failover Report -----

    mha: MySQL Master failover 10.250.0.118(10.250.0.118:3306) to 10.250.0.186(10.250.0.186:3306) succeeded

    Master 10.250.0.118(10.250.0.118:3306) is down!

    Check MHA Manager logs at ip-10-250-0-91.cn-northwest-1.compute.internal:/etc/mha_master/manager.log for details.

    Started automated(non-interactive) failover.
    Invalidated master IP address on 10.250.0.118(10.250.0.118:3306)
    Selected 10.250.0.186(10.250.0.186:3306) as a new master.
    10.250.0.186(10.250.0.186:3306): OK: Applying all logs succeeded.
    10.250.0.186(10.250.0.186:3306): OK: Activated master IP address.
    10.250.0.102(10.250.0.102:3306): OK: Slave started, replicating from 10.250.0.186(10.250.0.186:3306)
    10.250.0.186(10.250.0.186:3306): Resetting slave info succeeded.
    Master failover to 10.250.0.186(10.250.0.186:3306) completed successfully.
  • Now check the IPs on the new master
    [root@ip-10-250-0-186 mysql8]# ifconfig
    eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.186 netmask 255.255.255.0 broadcast 10.250.0.255
    inet6 fe80::895:fcff:fe7f:ee0c prefixlen 64 scopeid 0x20<link>
    ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)
    RX packets 12284 bytes 3862460 (3.6 MiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 10558 bytes 2559446 (2.4 MiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    eth0:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001
    inet 10.250.0.199 netmask 255.0.0.0 broadcast 10.255.255.255
    ether 0a:95:fc:7f:ee:0c txqueuelen 1000 (Ethernet)

    lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
    inet 127.0.0.1 netmask 255.0.0.0
    inet6 ::1 prefixlen 128 scopeid 0x10<host>
    loop txqueuelen 1000 (Local Loopback)
    RX packets 41 bytes 9270 (9.0 KiB)
    RX errors 0 dropped 0 overruns 0 frame 0
    TX packets 41 bytes 9270 (9.0 KiB)
    TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

    Checking the old master shows it has also released the virtual IP binding; the VIP migrated successfully and the failover worked!

AWS does not support VIPs on EC2 instances; a different solution is needed there, which is beyond the scope of this article.

You can configure report_script in the MHA configuration file; it points to a script used to send alerts when a failover occurs.
It can send email, or call the WeCom or DingTalk APIs; the details are not covered here.
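As a starting point, a report_script can be as small as formatting the failover details and piping them to a mailer (a sketch; MHA passes failover details to the script as command-line options, and the exact option names should be checked against the MHA documentation):

```shell
# Format a one-line alert from a subject and a body string.
format_alert() {
    printf '[MHA alert] %s: %s\n' "$1" "$2"
}

format_alert "Master failover" "10.250.0.118 -> 10.250.0.186"
# In a real report_script you would pipe this to a mailer, e.g.:
# format_alert "$subject" "$body" | mail -s "MHA failover" ops@example.com
```

The ops@example.com address and the subject/body wiring are placeholders.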