#AutoFailover
newtrendcn · 15 days ago
RUTX50
INDUSTRIAL 5G ROUTER
RUTX50 is a dual-SIM multi-network router offering 5G mobile communication for high-speed and data-heavy applications. Together with 5x Gigabit Ethernet ports and dual-band Wi-Fi, it provides data connection redundancy with auto-failover.
For more information,
Visit: https://www.newtrend.ae/
Live chat: +971 507542792
globalmediacampaign · 4 years ago
What you can do with Auto-failover and Percona Server Distribution (8.0.x)
Where x is >= 22 ;)

The Problem

There are few things your data does not like. One is water and another is fire. Well, guess what: if you think that everything will be fine after all, take a look. Since my ISP had part of its management infrastructure on OVH, they were impacted by the incident. As you can see from the highlight, the ticket number changed very little in three years (about 2k cases) and the date jumps from 2018 to 2021. On top of that, I have to mention I had opened several tickets the month before, and those disappeared. So either my ISP was very lucky, had very few cases in three years, and sent all my tickets to /dev/null... or they have lost THREE YEARS of data.

Let us cut to the chase: they lost their data, period. After the fire at OVH, these guys did not have a good backup to use for restoring the data, and did not even have a decent Disaster Recovery solution. Their platform remained INACCESSIBLE for more than five days, during which they also lost visibility of their own network/access points/clients and so on. Restoring data brought them back online, but it took them more than a month to review and fix the internal management system and bring the service back to acceptable standards. Needless to say, complaints and more costly legal actions were raised against them. All this because they missed two basic Best Practices when designing a system: a good backup/restore procedure, and always having a Disaster Recovery solution in place. Yeah, I know... I should change ISP.

Anyhow, a Disaster Recovery (DR) solution is a crucial element in any production system. It is weird that we still have to cover this in 2021, but apparently it is still underestimated and requires our attention. This is why in this (long) article I will illustrate how to implement another, improved DR solution utilizing Percona Server for MySQL and standard MySQL features such as Group Replication and asynchronous replication automatic failover (AAF).

Asynchronous Replication Automatic Failover

I have already covered the new MySQL feature here (http://www.tusacentral.net/joomla/index.php/mysql-blogs/227-mysql-asynchronous-source-auto-failover), but let us recap. From MySQL 8.0.22 and Percona Server for MySQL 8.0.22 you can take advantage of AAF when designing distributed solutions. What does this mean? With simple async replication, a Highly Available (HA) solution in DC2 pulls data from another HA solution in DC1 with a 1:1 relation, meaning the connection is one node against one node. If that source node goes away, data replication is interrupted, the two DCs diverge, and you need to recover the interrupted link manually (or by script). With AAF you can count on a significant improvement: the link is no longer 1:1, and the replicating node in DC2 can rely on AAF to re-establish the link against one of the remaining nodes in DC1 (a minimal example of how this is enabled is sketched right after this section). This solves quite a large chunk of the problem, but it does not fix everything: if the node in DC2 (the replica side) fails, the link is broken again and requires manual intervention, as I mentioned in the article above.

GR Failover

I was hoping to have this fixed in MySQL 8.0.23, but unfortunately, it is not. So I decided to develop a Proof of Concept and see if it would fix the problem and, more importantly, what needs to be done to do it safely.
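As referenced above, here is a minimal sketch of what enabling AAF on the replica side looks like; host names, credentials, and the channel name are illustrative, and the syntax mirrors the statements used later in this post:

CHANGE MASTER TO
    MASTER_USER='replica_user',
    MASTER_PASSWORD='replica_password',
    MASTER_HOST='dc1-node1',
    MASTER_AUTO_POSITION=1,
    GET_MASTER_PUBLIC_KEY=1,
    SOURCE_CONNECTION_AUTO_FAILOVER=1,
    MASTER_RETRY_COUNT=3,
    MASTER_CONNECT_RETRY=10
    FOR CHANNEL 'dc1_to_dc2';

-- Register the other Group Replication members of DC1 as alternative sources for the channel:
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', 'dc1-node2', 3306, '', 50);
SELECT asynchronous_connection_failover_add_source('dc1_to_dc2', 'dc1-node3', 3306, '', 50);

With this in place, if dc1-node1 becomes unreachable, the IO thread reconnects to one of the registered sources instead of staying broken.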
The result is a very basic Stored Procedure (the code still needs refining) called grfailover, which manages the shift between primaries inside a Group Replication cluster. I borrowed the concept from Yves' Replication Manager for Percona XtraDB Cluster (https://github.com/y-trudeau/Mysql-tools/tree/master/PXC), but as we will see, for GR and this use case we need much less.

Why Can This Be a Simplified Version?

Because in GR we already have a lot of information, and we also have the auto-failover for async replication. Given that, all we need to do is manage the start/stop of the Replica. Auto-failover will take care of the shift from one source to the other, while GR will take care of which node should be the preferred Replica (the Primary on the replica site). In short, the check just needs to see if the node is a Primary and, if so, start the replication if it is not already active, while stopping it if the node is NOT a Primary. We can also maintain a table of what is going on, to be sure we do not have two nodes replicating at the same time. The definition will be something like this:

+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| server_uuid  | char(36)      | NO   | PRI | NULL    |       |
| HOST         | varchar(255)  | NO   |     | NULL    |       |
| PORT         | int           | NO   |     | 3306    |       |
| channel_name | varchar(100)  | NO   |     | NULL    |       |
| gr_role      | varchar(30)   | NO   |     | NULL    |       |
| STATUS       | varchar(50)   | YES  |     | NULL    |       |
| started      | timestamp(6)  | YES  |     | NULL    |       |
| lastupdate   | timestamp(6)  | YES  |     | NULL    |       |
| active       | tinyint       | YES  |     | 0       |       |
| COMMENT      | varchar(2000) | YES  |     | NULL    |       |
+--------------+---------------+------+-----+---------+-------+

The full code can be found on GitHub here: https://github.com/Tusamarco/blogs/tree/master/asyncAutoFailOver.

How-To

The first thing you need to do is deploy Percona Distribution for MySQL (8.0.22 or greater) using Group Replication as the HA solution. To do so, refer to the extensive guide here: Percona Distribution for MySQL: High Availability with Group Replication Solution. Once you have it running on both DCs, you can configure AAF on the Primary node of each DC following either MySQL 8.0.22: Asynchronous Replication Automatic Connection (IO Thread) Failover or MySQL Asynchronous SOURCE auto failover.

Once you have the AAF replication up and running, it is time to create the procedure and the management table on your DC-Source Primary. First of all, be sure you have a `percona` schema, and if not, create it:

CREATE SCHEMA percona;

Then create the table:

CREATE TABLE `group_replication_failover_manager` (
  `server_uuid` char(36) NOT NULL,
  `HOST` varchar(255) NOT NULL,
  `PORT` int NOT NULL DEFAULT '3306',
  `channel_name` varchar(100) NOT NULL,
  `gr_role` varchar(30) NOT NULL,
  `STATUS` varchar(50) DEFAULT NULL,
  `started` timestamp(6) NULL DEFAULT NULL,
  `lastupdate` timestamp(6) NULL DEFAULT NULL,
  `active` tinyint DEFAULT '0',
  `COMMENT` varchar(2000) DEFAULT NULL,
  PRIMARY KEY (`server_uuid`)
) ENGINE=InnoDB;

Last, create the procedure. Keep in mind you may need to change the DEFINER or simply remove it. The code will be replicated to all nodes.
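To give an idea of the core check the procedure performs, here is an illustrative sketch only; the actual code is in the GitHub repository linked above and also handles the wait/countdown and the bookkeeping in the management table:

-- Which role does this node have inside the Group Replication cluster?
SET @gr_role = (SELECT MEMBER_ROLE
                  FROM performance_schema.replication_group_members
                 WHERE MEMBER_ID = @@server_uuid);

-- Is the managed channel currently running on this node?
SET @io_state = (SELECT SERVICE_STATE
                   FROM performance_schema.replication_connection_status
                  WHERE CHANNEL_NAME = 'dc2_to_dc1');

-- If @gr_role = 'PRIMARY'  and @io_state <> 'ON' -> START REPLICA FOR CHANNEL 'dc2_to_dc1';
-- If @gr_role <> 'PRIMARY' and @io_state =  'ON' -> STOP REPLICA FOR CHANNEL 'dc2_to_dc1';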
To be sure, run the command below on all nodes:

select ROUTINE_SCHEMA,ROUTINE_NAME,ROUTINE_TYPE from information_schema.ROUTINES where ROUTINE_SCHEMA ='percona';
+----------------+--------------+--------------+
| ROUTINE_SCHEMA | ROUTINE_NAME | ROUTINE_TYPE |
+----------------+--------------+--------------+
| percona        | grfailover   | PROCEDURE    |
+----------------+--------------+--------------+

You should get something like the above. If not, check your replication; something probably needs to be fixed. If instead it all works out, you are ready to go.

To run the procedure you can use any kind of approach you like; the only important thing is that you MUST run it FIRST on the current PRIMARY node of each DC. This is because the PRIMARY node must be the first one to register in the management table. Personally, I like to run it from cron when in “production” and manually when testing, e.g.:

/opt/mysql_templates/PS-8P/bin/mysql -h 127.0.0.1 -P 3306 -D percona -e 'call grfailover(5,"dc2_to_dc1");'

Where:
grfailover is the name of the procedure.
5 is the timeout in minutes after which the procedure will activate the replication on the node.
dc2_to_dc1 is the name of the channel, on the current node, that the procedure needs to manage.

Given two clusters as:

DC1-1(root@localhost) [(none)]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | e891d1b4-9793-11eb-92ac-08002734ed50 | gr3 | 3306 | ONLINE | SECONDARY | 8.0.23 |
| group_replication_applier | ebff1ab8-9793-11eb-ba5f-08002734ed50 | gr1 | 3306 | ONLINE | SECONDARY | 8.0.23 |
| group_replication_applier | f47df54e-9793-11eb-a60b-08002734ed50 | gr2 | 3306 | ONLINE | PRIMARY | 8.0.23 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

DC2-2(root@localhost) [percona]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 79ede65d-9797-11eb-9963-08002734ed50 | gr4 | 3306 | ONLINE | SECONDARY | 8.0.23 |
| group_replication_applier | 7e214802-9797-11eb-a0cf-08002734ed50 | gr6 | 3306 | ONLINE | PRIMARY | 8.0.23 |
| group_replication_applier | 7fddf04f-9797-11eb-a193-08002734ed50 | gr5 | 3306 | ONLINE | SECONDARY | 8.0.23 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

If you query the management table after you have run the procedure ONLY on the two Primaries:

>select * from percona.group_replication_failover_manager order by host\G
*************************** 1. row ***************************
server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
HOST: gr2
PORT: 3306
channel_name: dc2_to_dc1
gr_role: PRIMARY
STATUS: ONLINE
started: 2021-04-08 10:22:40.000000
lastupdate: 2021-04-08 10:22:53.000000
active: 1
COMMENT: Just inserted
*************************** 2. row ***************************
server_uuid: 7e214802-9797-11eb-a0cf-08002734ed50
HOST: gr6
PORT: 3306
channel_name: dc1_to_dc2
gr_role: PRIMARY
STATUS: ONLINE
started: 2021-04-08 09:17:50.000000
lastupdate: 2021-04-08 09:17:50.000000
active: 1
COMMENT: Just inserted

Given the replication link was already active, the nodes report only “Just inserted” in the comment. If instead one of the two channels was down and the node was NOT deactivated (the active flag in the management table set to 0), the comment changes to “COMMENT: REPLICA restarted for the channel ”.

At this point, you can run the procedure on the other nodes as well and, after that, query the table by channel:

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
HOST: gr1
PORT: 3306
channel_name: dc2_to_dc1
gr_role: SECONDARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Just inserted
*************************** 2. row ***************************
server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
HOST: gr2
PORT: 3306
channel_name: dc2_to_dc1
gr_role: PRIMARY
STATUS: ONLINE
started: 2021-04-08 10:22:40.000000
lastupdate: 2021-04-08 10:22:53.000000
active: 1
COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 3. row ***************************
server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
HOST: gr3
PORT: 3306
channel_name: dc2_to_dc1
gr_role: SECONDARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Just inserted
3 rows in set (0.00 sec)

What happens if I now change my Primary, or if the Primary goes down? Well, let's say we “just” shift our PRIMARY:

stop slave for channel 'dc2_to_dc1';
SELECT group_replication_set_as_primary('ebff1ab8-9793-11eb-ba5f-08002734ed50');
Query OK, 0 rows affected, 1 warning (0.01 sec)
+--------------------------------------------------------------------------+
| group_replication_set_as_primary('ebff1ab8-9793-11eb-ba5f-08002734ed50') |
+--------------------------------------------------------------------------+
| Primary server switched to: ebff1ab8-9793-11eb-ba5f-08002734ed50 |
+--------------------------------------------------------------------------+

Please note that, given I have an ACTIVE replication channel, to successfully shift the primary I MUST stop the replication channel first.
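As a side note, a quick way to confirm at any moment which node is actually running the channel is a plain status query against performance_schema (nothing specific to grfailover; the channel name follows the example above):

select CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
from performance_schema.replication_connection_status
where CHANNEL_NAME = 'dc2_to_dc1';

Run on each node of the replica-side cluster, only the node currently acting as Replica for that channel should report SERVICE_STATE = ON.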
DC1-2(root@localhost) [percona]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | e891d1b4-9793-11eb-92ac-08002734ed50 | gr3 | 3306 | ONLINE | SECONDARY | 8.0.23 |
| group_replication_applier | ebff1ab8-9793-11eb-ba5f-08002734ed50 | gr1 | 3306 | ONLINE | PRIMARY | 8.0.23 |
| group_replication_applier | f47df54e-9793-11eb-a60b-08002734ed50 | gr2 | 3306 | ONLINE | SECONDARY | 8.0.23 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

Reading the management table, we can see that grFailOver has started the shift:

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
HOST: gr1
PORT: 3306
channel_name: dc2_to_dc1
gr_role: PRIMARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Need to wait 5 minutes, passed: 0
*************************** 2. row ***************************
server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
HOST: gr2
PORT: 3306
channel_name: dc2_to_dc1
gr_role: PRIMARY
STATUS: ONLINE
started: 2021-04-08 10:22:40.000000
lastupdate: 2021-04-08 10:22:53.000000
active: 1
COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 3. row ***************************
server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
HOST: gr3
PORT: 3306
channel_name: dc2_to_dc1
gr_role: SECONDARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Just inserted

Checking the new PRIMARY node gr1, we can see that:
gr_role is PRIMARY
the COMMENT reports the countdown (in minutes) the node waits

After the 5 minutes:

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
HOST: gr1
PORT: 3306
channel_name: dc2_to_dc1
gr_role: PRIMARY
STATUS: ONLINE
started: 2021-04-08 10:27:54.000000
lastupdate: 2021-04-08 10:30:12.000000
active: 1
COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 2. row ***************************
server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
HOST: gr2
PORT: 3306
channel_name: dc2_to_dc1
gr_role: SECONDARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Resetted by primary node ebff1ab8-9793-11eb-ba5f-08002734ed50 at 2021-04-08 10:27:53
*************************** 3. row ***************************
server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
HOST: gr3
PORT: 3306
channel_name: dc2_to_dc1
gr_role: SECONDARY
STATUS: ONLINE
started: NULL
lastupdate: NULL
active: 1
COMMENT: Just inserted

Now, what we can see is:
Node gr1 has become active in replicating.
It reports the time it started the replication.
It reports the last time it checked that the replication was active.
Node gr2 is marked SECONDARY.
The comment also reports when the replication was restarted on the new REPLICA node.

If for any reason the replication on the original node gr2 was restarted (for example by moving the PRIMARY back) while the countdown was still in place, grFailOver will stop any action and reset the gr1 status. In short, my two DCs can now rely on AAF for failing over to a different SOURCE, and on grFailOver for shifting the replica to follow the GR Primary, or for failing over to another node when my Primary crashes.

Conclusion

I am sure Oracle is baking something for this and I am sure we will see it come out soon, but in the meantime, I have to say that this simple solution works. It has improved the resiliency of my testing architecture A LOT. And while I am still testing it, and I am totally confident that the procedure can be written in a more efficient way, I am also sure bugs and errors are around the corner. BUT, this was a POC and I am happy with the outcome. It proves it is not so difficult to improve what we have, and also proves that sometimes a small thing can have a HUGE impact. It also proves we should not always wait for others to do what is required, and that ANYONE can help. Finally, as mentioned above, this is a POC solution, but no one prevents you from starting from it and making it a production solution, as my colleague Yves did for his Percona XtraDB Cluster Replication Manager. It is just up to you! Great MySQL to all.

References
https://www.datacenterdynamics.com/en/news/fire-destroys-ovhclouds-sbg2-data-center-strasbourg/
http://www.tusacentral.net/joomla/index.php/mysql-blogs/227-mysql-asynchronous-source-auto-failover
https://github.com/y-trudeau/Mysql-tools/tree/master/PXC
https://www.percona.com/blog/2020/10/26/mysql-8-0-22-asynchronous-replication-automatic-connection-io-thread-failover/
http://www.tusacentral.com/joomla/index.php/mysql-blogs/233-what-you-can-do-with-auto-failover-and-percona-server-distribution-8-0-x
loagarvysi1986-blog · 7 years ago
old teamviewer version
———————————————————
>>> Get the file <<<
——————————————————— All OK! Recommended by the administration ———————————————————
When you upload software to  you get rewarded by points. For every field that is filled out correctly, points will be rewarded; some fields are optional, but the more you provide the more you will get rewarded!

## Old Versions of Software

To reinstall using the method described in the article, you need to be able to boot into the system. Please take your problem to the forum.

### BELOFF [minstall vs wpi] download as a torrent file

Thanks for your magic, Russian!!! I will back you to the death!!! Genius!!! I LOVE YOU SERGEI!! GREETINGS FROM ARGENTINA!!! xD

#### WinPE 10-8 Sergei Strelec (x86/x64/Native x86)

Reinstalling (upgrading) Windows is performed on top of the installed operating system without formatting the system partition. Your files and settings are preserved, as well as installed programs and their parameters.

Problem Event Name: startuprepairoffline 6: 7: 8: unknown 9: 76755897 5: AutoFailover 6: 7 7: norootcause OS Version: Locale ID: 6599

Subscribe to free notifications about new posts and receive, as a gift, my book on speeding up Windows boot!

Hello, and greetings. Please think about adding NTFS Permission Tools (http:///get/Internet/Servers/Server-Tools/NTFS-Permissions-To ). There are separate executables for 87 & 69 bit.

Nothing is in the way; it is just a pity about the wasted time. And I am not sure that, if changing icons is blocked, it will be possible to delete that shortcut.

As a rule, this method is recommended when all other options for solving the problem have been exhausted, although in modern Windows it is quite an acceptable solution from a technical point of view. Often it gets you to the goal much faster than dancing with a tambourine.

Vadim, the folder should have been there. Perhaps you cleaned it out with CCleaner. As for AppData, decide for yourself; based on the information you have provided, there is nothing more to say.
globalmediacampaign · 4 years ago
MySQL 8.0.22: Asynchronous Replication Automatic Connection (IO Thread) Failover
MySQL 8.0.22 was released on Oct 19, 2020, and came with nice features and a lot of bug fixes. Now, you can configure your async replica to choose a new source in case the existing source connection (IO thread) fails. In this blog, I am going to explain the entire process involved in this configuration with a use case.

Overview

This feature is very helpful to keep your replica server in sync in case the current source fails. To activate asynchronous connection failover, we need to set “SOURCE_CONNECTION_AUTO_FAILOVER=1” on the “CHANGE MASTER” statement. Once the IO connection fails, the replica will first try to reconnect to the existing source based on “MASTER_RETRY_COUNT” and “MASTER_CONNECT_RETRY”; only then will it do the failover. The feature only works when the IO connection has failed, for example because the source crashed or stopped, or because of a network failure. It will not kick in if the replica is manually stopped using “STOP REPLICA”.

We have two new functions, which help to add and delete the server entries from the source list:

asynchronous_connection_failover_add_source → arguments (‘channel’, ’host’, port, ’network_namespace’, weight)
asynchronous_connection_failover_delete_source → arguments (‘channel’, ’host’, port, ’network_namespace’)

The source servers need to be configured in the table “mysql.replication_asynchronous_connection_failover”. We can also use the table “performance_schema.replication_asynchronous_connection_failover” to view the available servers in the source list.

Requirements

GTID should be enabled on all the servers.
For auto-positioning, MASTER_AUTO_POSITION should be enabled on the replica (CHANGE MASTER).
The replication user and password should be the same on all the source servers.
The replication user and password must be set for the channel using the CHANGE MASTER .. FOR CHANNEL statement.

Use Case

I have two data centers and three servers (dc1, dc2, report).
“dc1” and “report” are in the same data center; “dc2” is in a different data center.
“dc1” and “dc2” are in an active-passive async replication setup (dc1 – active, dc2 – passive).
“report” is configured as an async replica under “dc1” for reporting purposes.

Here, my requirement is: if the active node “dc1” fails, the “report” server should be reconfigured under “dc2” and keep receiving live data, without manual work after the failure happens.

Configuration for Automatic Connection Failover

I have installed MySQL 8.0.22 on all three servers and configured the active-passive replication between “dc1” and “dc2”.

[root@dc1 ~]# mysql -e "select @@version, @@version_comment\G"
*************************** 1. row ***************************
@@version: 8.0.22
@@version_comment: MySQL Community Server - GPL

At dc1,

mysql> show replica status\G
Source_Host: dc2
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
1 row in set (0.00 sec)

At dc2,

mysql> show replica status\G
Source_Host: dc1
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
1 row in set (0.00 sec)

Now, I need to configure the “report” server as an async replica under “dc1” with automatic failover options.
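Before the CHANGE MASTER below, it can be worth double-checking the prerequisites listed above on all three servers; a minimal sketch using plain MySQL system variables, nothing specific to this setup:

mysql> select @@server_uuid, @@gtid_mode, @@enforce_gtid_consistency;

On dc1, dc2, and report you would expect gtid_mode = ON, enforce_gtid_consistency = ON, and a distinct server_uuid on each server.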
At report,

mysql> change master to
    -> master_user='Autofailover',
    -> master_password='Autofailover@321',
    -> master_host='dc1',
    -> master_auto_position=1,
    -> get_master_public_key=1,
    -> source_connection_auto_failover=1,
    -> master_retry_count=3,
    -> master_connect_retry=10
    -> for channel "herc7";
Query OK, 0 rows affected, 2 warnings (0.03 sec)

source_connection_auto_failover: activates the automatic failover feature.
master_retry_count, master_connect_retry: the default settings are huge (master_retry_count = 86400, master_connect_retry = 60), which would mean waiting 60 days (86400 * 60 / 60 / 60 / 24) for the failover. So, I reduced the settings to 30 seconds (10 * 3).

mysql> start replica for channel "herc7";
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> show replica status\G
Source_Host: dc1
Connect_Retry: 10
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Source_Retry_Count: 3
Last_IO_Error_Timestamp:
Auto_Position: 1
Channel_Name: herc7
1 row in set (0.00 sec)

You can see the replication has started and the failover settings are applied. The current primary source is “dc1”. Now, I am going to use the function to add the server details to the source list for the failover to dc2.

At “report”,

mysql> select asynchronous_connection_failover_add_source('herc7', 'dc2', 3306, '', 50);
+------------------------------------------------------------------------------+
| asynchronous_connection_failover_add_source('herc7', 'dc2', 3306, '', 50)    |
+------------------------------------------------------------------------------+
| The UDF asynchronous_connection_failover_add_source() executed successfully. |
+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> select * from mysql.replication_asynchronous_connection_failover\G
*************************** 1. row ***************************
Channel_name: herc7
Host: dc2
Port: 3306
Network_namespace:
Weight: 50
1 row in set (0.00 sec)

This shows the source list has been updated with the dc2 details. We are good to perform the failover now. I am going to shut down the MySQL service on dc1.

At dc1,

[root@dc1 ~]# service mysqld stop
Redirecting to /bin/systemctl stop mysqld.service
[root@dc1 ~]#

At the report server,

mysql> show replica status\G
Source_Host: dc1
Connect_Retry: 10
Replica_IO_Running: Connecting
Replica_SQL_Running: Yes
Seconds_Behind_Source: NULL
Last_IO_Error: error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 2 message: Can't connect to MySQL server on 'dc1' (111)
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Source_Retry_Count: 3
Last_IO_Error_Timestamp: 201019 21:32:26
Auto_Position: 1
Channel_Name: herc7
1 row in set (0.00 sec)

The IO thread is in the “Connecting” state. This means it is trying to re-establish the connection to the existing source (dc1) based on the “master_retry_count” and “master_connect_retry” settings.
After 30 seconds,

mysql> show replica status\G
Source_Host: dc2
Connect_Retry: 10
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Seconds_Behind_Source: 0
Last_IO_Error:
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Source_Retry_Count: 3
Last_IO_Error_Timestamp:
Auto_Position: 1
Channel_Name: herc7
1 row in set (0.00 sec)

You can see the Source_Host has changed to “dc2”. So, the “report” server performed the auto failover and connected to “dc2”. From the error log:

2020-10-19T21:32:16.247460Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 1 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:26.249887Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 2 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:36.251989Z 53 [ERROR] [MY-010584] [Repl] Slave I/O for channel 'herc7': error reconnecting to master 'Autofailover@dc1:3306' - retry-time: 10 retries: 3 message: Can't connect to MySQL server on 'dc1' (111), Error_code: MY-002003
2020-10-19T21:32:36.254585Z 56 [Warning] [MY-010897] [Repl] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2020-10-19T21:32:36.256170Z 56 [System] [MY-010562] [Repl] Slave I/O thread for channel 'herc7': connected to master 'Autofailover@dc2:3306',replication started in log 'FIRST' at position 196
2020-10-19T21:32:36.258628Z 56 [Warning] [MY-010549] [Repl] The master's UUID has changed, although this should not happen unless you have changed it manually. The old UUID was f68b8693-1246-11eb-a6c0-5254004d77d3.

The first three lines show it tried to reconnect to the existing primary source “dc1” at a 10-second interval. There was no response from “dc1”, so it failed over to “dc2” (connected to master 'Autofailover@dc2:3306'). It works perfectly!

Is Failback Possible?

Let's experiment with the two scenarios below:
What happens if the primary node comes back online?
Does it perform a failback in case the server with the higher weight comes back online?

What happens if the primary node comes back online? I am going to start “dc1”, which was shut down earlier to test the failover.

At “dc1”,

[root@dc1 ~]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@dc1 ~]# mysql -e "show status like 'uptime'\G"
*************************** 1. row ***************************
Variable_name: Uptime
Value: 4

Let's see the replication on the “report” server.

mysql> show replica status\G
Source_Host: dc2
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
Channel_Name: herc7
1 row in set (0.00 sec)

No changes. It is still connected to “dc2”. Failback has not happened. Does it perform a failback in case the server with the higher weight comes back online?
To test this, I again shut down MySQL on “dc1” and updated the source list on the “report” server (dc1 weight > dc2 weight).

select asynchronous_connection_failover_add_source('herc7', 'dc1', 3306, '', 70)\G

mysql> select * from replication_asynchronous_connection_failover\G
*************************** 1. row ***************************
Channel_name: herc7
Host: dc1
Port: 3306
Network_namespace:
Weight: 70
*************************** 2. row ***************************
Channel_name: herc7
Host: dc2
Port: 3306
Network_namespace:
Weight: 50
2 rows in set (0.00 sec)

You can see the server “dc1” is configured with a higher weight (70). Now I am going to start the MySQL service on “dc1”.

At “dc1”,

[root@dc1 ~]# service mysqld start
Redirecting to /bin/systemctl start mysqld.service
[root@dc1 ~]# mysql -e "show status like 'uptime'\G"
*************************** 1. row ***************************
Variable_name: Uptime
Value: 37

At the “report” server,

mysql> show replica status\G
Source_Host: dc2
Replica_IO_Running: Yes
Replica_SQL_Running: Yes
Replica_SQL_Running_State: Slave has read all relay log; waiting for more updates
1 row in set (0.00 sec)

No changes. So once the failover to the new source is done, an automatic failback will not happen until the new source goes down. From the MySQL doc: “Once the replica has succeeded in making a connection, it does not change the connection unless the new source stops or there is a network failure. This is the case even if the source that became unavailable and triggered the connection change becomes available again and has a higher priority setting.”

This solution is also very helpful in (cluster + async replica) environments: you can automatically switch the connection to another cluster node in case the existing source cluster node fails. If your network is not stable, consider setting the retry thresholds carefully, because with low thresholds you may face frequent failovers.

https://www.percona.com/blog/2020/10/26/mysql-8-0-22-asynchronous-replication-automatic-connection-io-thread-failover/
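One operation mentioned in the overview but not demonstrated above is removing an entry from the source list; a hedged example, reusing the channel and host names from this post, would be:

mysql> select asynchronous_connection_failover_delete_source('herc7', 'dc2', 3306, '');

This would remove “dc2” from mysql.replication_asynchronous_connection_failover for channel “herc7”, so the replica no longer considers it a failover candidate.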