An Oracle Active Data Guard (ADG) configuration on 11.2.0.4 went into a hung state, with many messages like the following in the alert log on both the primary and ADG instances:
WARN: ARC5: Terminating pid 31981640 hung on an I/O operation
Thu Feb 20 05:00:38 2014
Killing 1 processes with pids 66912506 (Process by index) in order to remove hung processes. Requested by OS process 19267810
Archive log generation and database load on the primary site were both normal.
The problem we observed was that archived logs were not being transferred from the primary to the ADG standby.
Even shutdown immediate made no progress, and we could not collect a hang analysis for the database.
When we analysed the alert log, the processes named in the kill messages were the "ARC" processes.
This kind of problem usually occurs after OS or network errors, or after a restart of the primary or standby instance or a reboot of the primary or standby node that abruptly breaks log shipping between the primary and standby.
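To see which archiver processes are affected and how often, the warnings can be tallied per ARC process. The following is only a minimal sketch: the function name count_hung_arc is hypothetical, the alert log path is a placeholder, and it assumes the warning lines have exactly the format shown in the excerpt above.

```shell
#!/bin/sh
# Tally "hung on an I/O operation" warnings per ARCn process in an alert log.
# Assumes lines of the form:
#   WARN: ARC5: Terminating pid 31981640 hung on an I/O operation
count_hung_arc() {
    # $1 = path to the alert log (placeholder, supply your own)
    grep 'hung on an I/O operation' "$1" |
        awk '{ name = $2; sub(/:$/, "", name); count[name]++ }
             END { for (n in count) print n, count[n] }' |
        sort
}

# Example call (hypothetical path):
#   count_hung_arc /path/to/alert_SID.log
```

Each output line is a process name followed by its warning count, so a single ARC process that is killed repeatedly stands out from a case where every archiver is hanging.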
Cause of the problem:
ARCx processes on the primary that are stuck on the network indefinitely, or that are responsible for updating the APPLIED column, hang and cannot recover by themselves.
These same processes may also be used for local and remote archiving, for heartbeats, and for FAL requests that fetch logs from the primary.
So when they are all stuck and their number has reached the limit set by log_archive_max_processes, they can cause ambiguous errors like those shown above.
In the worst case, all ARCx processes on the primary are stuck and cannot perform local archiving, so all online redo log files fill up and the primary database hangs.
This may be due to a standby database crash, network errors, or an abrupt outage on the standby or primary.
The other common cause is a firewall.
Solution:
The ARCx processes on the primary need to be restarted. The steps below assume that log transport from the primary is configured through log_archive_dest_2.
Please perform the following:
1) If the Data Guard Broker is running, disable Data Guard Broker on both primary and standby:
SQL> alter system set dg_broker_start=FALSE;
2) On the Primary Database:
- Set log transport state to DEFER status:
SQL> alter system set log_archive_dest_state_2='defer';
SQL> alter system switch logfile;
- Reset log_archive_dest_2
SQL> show parameter log_archive_dest_2
SQL> alter system set log_archive_dest_2 = '........';
- Switch logfiles on the Primary
SQL> alter system switch logfile;
3) On the Standby Database:
- Cancel Managed Recovery
SQL> alter database recover managed standby database cancel;
- Shutdown the Standby Database
SQL> shutdown immediate
4) On the Primary: kill the ARCx processes. The database respawns them automatically and immediately, and killing them does not harm it.
ps -ef | grep -i arc
kill -9 <ospid of ARC process> <another ospid of ARC process> ...
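Because grep -i arc also matches the grep command itself and any other process with "arc" in its name, it can help to extract only the archiver pids for one instance before killing anything. This is a minimal sketch, assuming the usual Unix naming of background processes as ora_arcN_SID and the standard ps -ef column layout. The function name arc_pids and the SID ORCL are hypothetical.

```shell
#!/bin/sh
# Print the OS pids of the ARCn background processes of one instance.
# Reads `ps -ef` output on stdin; $1 is the instance SID.
# Assumes the pid is column 2 and the process name is the last column.
arc_pids() {
    # Match only commands named ora_arc<one char>_<SID>; this skips the
    # grep/awk commands themselves and other instances on the same host.
    awk -v sid="$1" '$NF ~ ("^ora_arc[0-9a-z]_" sid "$") { print $2 }'
}

# Review the list first, then pass the pids to kill -9 by hand:
#   ps -ef | arc_pids ORCL
```

The sketch deliberately only prints the pids and leaves the kill itself as a manual step, so the list can be checked against the alert log first.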
5) On the standby server, start up the Standby Database and resume Managed Recovery:
SQL> startup mount;
SQL> alter database recover managed standby database [using current logfile] disconnect;
6) Re-enable Log Transport Services on the Primary:
SQL> alter system set log_archive_dest_state_2='enable';
At this point all the ARCx processes should be up and running on the Primary.
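One way to confirm this from the OS side is to count the archiver processes and compare the result with the number you expect, for example the value of log_archive_max_processes. The following is a rough sketch under the same assumptions as before (background processes named ora_arcN_SID, process name in the last ps -ef column). The function name arc_check and the SID ORCL are hypothetical.

```shell
#!/bin/sh
# Count the ARCn processes of one instance and compare with an expected minimum.
# Reads `ps -ef` output on stdin.
#   $1 = instance SID
#   $2 = minimum number of ARCn processes expected to be running
arc_check() {
    # c + 0 forces a numeric 0 when no line matched.
    n=$(awk -v sid="$1" '$NF ~ ("^ora_arc[0-9a-z]_" sid "$") { c++ }
                         END { print c + 0 }')
    echo "$n ARCn processes running for $1"
    # Exit status is 0 only if at least $2 processes were found.
    [ "$n" -ge "$2" ]
}

# Example call (hypothetical SID and count):
#   ps -ef | arc_check ORCL 4 || echo "fewer archivers than expected"
```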
7) Re-enable the Data Guard Broker on both Primary and Standby, if applicable:
SQL> alter system set dg_broker_start=true;
8) Please work with your Network Administrator to make sure the following firewall features are disabled:
- SQLNet fixup protocol
- Deep Packet Inspection (DPI)
- SQLNet packet inspection
- SQL Fixup
- SQL ALG (Juniper firewall)