Troubleshooting Node Reboot
September 22, 2015
Sometimes the simplest-looking issues turn out to be the hardest to debug, caused by something you would never have thought of. It had been a quiet week so far, and then I got paged for the one thing no DBA wants to hear about. Yes, you guessed it: node eviction. The first thing to do is make sure the database came back up cleanly before going off to find the cause. For context, this is a 2-node RAC cluster running an 11.2.0.3 database, and it was node 1 that went down. This happened in the early afternoon, and while I was looking for the cause, after making sure everything was up and running, the second node suddenly went down too. That is really not a good sign; something was clearly going on. So I started looking at the logs, and this is what the system messages file told me:
Jul 14 12:10:01 dbservernode1 kernel: Call Trace:
Jul 14 12:10:02 dbservernode1 kernel: [] thread_return+0x62/0xfe
Jul 14 12:10:02 dbservernode1 kernel: [] find_get_page+0x21/0x51
Jul 14 12:10:02 dbservernode1 kernel: [] __find_get_block_slow+0xec/0xf8
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:start_this_handle+0x2ed/0x3b7
Jul 14 12:10:02 dbservernode1 kernel: [] autoremove_wake_function+0x0/0x2e
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_start+0xa3/0xda
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:ext4_da_write_begin+0x151/0x25b
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_dirty_metadata+0xef/0x11e
Jul 14 12:10:02 dbservernode1 kernel: [] generic_file_buffered_write+0x14b/0x6a9
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_stop+0x1c0/0x1cc
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:__ext4_journal_stop+0x66/0x73
Jul 14 12:10:02 dbservernode1 kernel: [] __generic_file_aio_write_nolock+0x369/0x3b6
Jul 14 12:10:02 dbservernode1 kernel: [] generic_file_aio_write+0x67/0xc3
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:ext4_file_write+0xbf/0x207
Jul 14 12:10:02 dbservernode1 kernel: [] do_sync_write+0xc7/0x104
Jul 14 12:10:02 dbservernode1 kernel: [] do_page_fault+0x4cc/0x842
Jul 14 12:10:02 dbservernode1 kernel: [] autoremove_wake_function+0x0/0x2e
Jul 14 12:10:02 dbservernode1 kernel: [] audit_filter_syscall+0x53/0x90
Jul 14 12:10:02 dbservernode1 kernel: [] audit_filter_syscall+0x53/0x90
Jul 14 12:10:02 dbservernode1 kernel: [] vfs_write+0xce/0x174
Jul 14 12:10:02 dbservernode1 kernel: [] sys_write+0x45/0x6e
Jul 14 12:10:02 dbservernode1 kernel: [] tracesys+0xd5/0xdf
Jul 14 12:10:02 dbservernode1 kernel:
Jul 14 15:13:03 dbservernode1 syslogd 1.4.1: restart.
Jul 14 15:13:03 dbservernode1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
So this didn't reveal much, apart from some stuck I/O writes, and it was too early to say whether the disks had issues. I also looked at the Oracle alert log, and it didn't reveal much either:
opidcl aborting process unknown ospid (17918) as a result of ORA-604
Tue Jul 14 12:09:46 2015
opidcl aborting process unknown ospid (30198) as a result of ORA-604
Tue Jul 14 12:09:51 2015
IPC Send timeout detected. Receiver ospid 22954 [oracle@dbservernode1 (LMD0)]
Tue Jul 14 12:09:54 2015
Timed out trying to start process m000.
Tue Jul 14 12:10:05 2015
Process RSM0, PID = 24496, will be killed
Tue Jul 14 12:10:08 2015
DIAG (ospid: 22942) waits for event 'ksdxexeotherwait' for 10 secs.
Tue Jul 14 12:10:08 2015
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Tue Jul 14 12:10:08 2015
Errors in file /u01/app/oracle/diag/rdbms/dbnode/dbnode1/trace/dbnode1_lmd0_22954.trc:
LMS0 (ospid: 22956): terminating the instance due to error 481
Tue Jul 14 12:10:18 2015
System state dump requested by (instance=1, osid=22956 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dbnode/dbnode1/trace/dbnode1_diag_22942.trc
Tue Jul 14 12:10:30 2015
But at least it pointed to something. As you can see from the output above, processes were timing out, which suggests CPU starvation, I/O problems, or memory pressure to the point where the OS cannot fork new processes. The next step was the OSWatcher logs, and lucky us, we had OSWatcher set up. Here is the output from vmstat and top:
zzz ***Tue Jul 14 12:01:56 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b     swpd   free  buff  cache    si     so    bi     bo    in    cs us sy id wa st
12 18 64840124 307500  6344 273596     0      0  2989    103     0     0  2  1 95  2  0
10 18 64982676 262024  6824 275580  3460 143688  7887 145896  4802 11780  7 22 34 37  0
 4 16 65188832 260768  7416 275392  6040 208012  7539 209547  5906 12403  6 26 39 28  0
zzz ***Tue Jul 14 12:02:19 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b     swpd   free  buff  cache    si     so    bi     bo    in    cs us sy id wa st
 1 24 67111528 256496 10364 285692     0      0  2989    103     0     0  2  1 95  2  0
 5 30 67111528 256492 10384 284004  1136   3160  1558   4472  3862 20090  2  7 36 55  0
 0 47 67111528 258796 10284 284656   480   1448   723   1786  1946 11222  1  7 26 66  0
zzz ***Tue Jul 14 12:02:47 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b     swpd   free  buff  cache    si     so    bi     bo    in    cs us sy id wa st
20 43 67111440 257700  8696 278252     0      0  2989    103     0     0  2  1 95  2  0
23 21 67111488 257232  8704 274376 10264   2956 13639   9158  9056 31253  3 68  6 23  0
 6 38 67111016 256388  7020 274244  3100    332  6391   6213  3265  9385  3 46 17 35  0
zzz ***Tue Jul 14 12:09:30 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b     swpd   free  buff  cache    si     so    bi     bo    in    cs us sy id wa st
80 24 67101188 256860  2728 228984     0      0  2989    103     0     0  2  1 95  2  0
88 10 67111528 256676  2748 236556  5748  11448 13675  12197  2080  9755  0 100  0  0  0
78  5 67111528 256436  2528 240276   932    140  4791    750  1300  9501  0 100  0  0  0
zzz ***Tue Jul 14 12:09:59 CDT 2015
Bingo! That explains why we were not able to fork any processes: we were swapping like crazy. So what was going on? We also had historic top data, so let's look at that as well.
top - 12:00:16 up 72 days, 7:52, 2 users, load average: 23.11, 15.23, 7.98
Tasks: 1658 total, 4 running, 1654 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.3%us, 8.1%sy, 0.0%ni, 41.2%id, 49.3%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 111408060k total, 111145512k used, 262548k free, 9796k buffers
Swap: 67111528k total, 55064268k used, 12047260k free, 280796k cached

  PID USER    PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
 1506 root    20  -5     0    0    0 R  99.9  0.0   4:01.35 [kswapd1]
 8517 oracle  18   0  114g  27g  32m D  50.3 26.1   8:39.87 oracledbnode1 (LOCA
19395 oracle  18   0 63.6g  22g  33m D  47.4 21.6   3:37.75 oracledbnode1 (LOCA
 1505 root    10  -5     0    0    0 D  45.8  0.0   3:59.63 [kswapd0]
 3976 oracle  17   0  233m  24m  20m D   8.6  0.0   0:00.27 oracledbnode1 (LOCA
 2468 oracle  16   0 40.2g  29m  27m D   8.3  0.0   0:23.11 oracledbnode1 (LOCA
25539 oracle  16   0 40.2g  31m  30m S   6.4  0.0  26:34.46 ora_cjq0_dbnode1
 1447 root    18   0 26952 2208  536 S   6.0  0.0 297:29.11 cmaperfd -p 30 -s O
 3896 oracle  17   0 40.2g  22m  19m S   6.0  0.0   0:00.43 oracledbnode1 (LOCA
 3898 oracle  17   0 40.2g  22m  19m D   5.4  0.0   0:00.46 oracledbnode1 (LOCA
 3780 oracle  17   0 40.2g  22m  19m D   5.1  0.0   0:00.69 oracledbnode1 (LOCA
 3423 oracle  15   0 1536m  42m  14m S   4.5  0.0 224:33.35 /u01/app/11.2.0.3/g
25716 root    10  -5     0    0    0 S   4.5  0.0   5:43.75 [jbd2/dm-7-8]
26232 root    RT   0  164m  16m 2860 S   4.1  0.0   2965:32 /sbin/multipathd

zzz ***Tue Jul 14 12:02:24 CDT 2015
top - 12:02:40 up 72 days, 7:54, 2 users, load average: 36.57, 22.90, 11.94
Tasks: 1662 total, 13 running, 1649 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.5%us, 40.3%sy, 0.0%ni, 17.5%id, 40.6%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 111408060k total, 111146944k used, 261116k free, 10056k buffers
Swap: 67111528k total, 67111328k used, 200k free, 277708k cached

  PID USER    PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
30993 oracle  15   0  372m  14m  11m S  67.2  0.0 100:24.55 /u01/app/11.2.0.3/g
31029 oracle  15   0  307m  16m  10m S  58.2  0.0  86:05.45 /u01/app/11.2.0.3/g
25872 oracle  17   0 40.2g  30m  28m S  50.4  0.0 391:19.16 oracledbnode1 (LOCA
26153 oracle  19   0 40.2g  28m  28m D  40.9  0.0   4:52.36 oracledbnode1 (LOCA
25876 oracle  15   0 40.2g  31m  28m S  40.8  0.0 396:48.27 oracledbnode1 (LOCA
 1506 root    10  -5     0    0    0 R  38.1  0.0   5:16.66 [kswapd1]
26101 oracle  18   0 40.2g  30m  28m D  34.3  0.0  86:07.11 oracledbnode1 (LOCA
19395 oracle  18   0 69.6g  24g  33m R  32.2 22.8   4:57.43 oracledbnode1 (LOCA
31015 oracle  15   0  236m  11m  10m S  30.9  0.0  22:13.52 /u01/app/11.2.0.3/g
 8517 oracle  18   0  120g  26g  32m D  30.8 24.9   9:59.00 oracledbnode1 (LOCA
OK, this confirms that we were swapping heavily, with almost no swap space left, and that explains everything. It also surfaced something needing attention: look at PID 8517, an Oracle process consuming 120G of virtual memory. So we had identified the cause; the next question was how to find out what that session was doing at the time. For a process that is still alive, the mapping is straightforward, as the sketch below shows.
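This is a minimal sketch of tying an OS PID back to its database session using the standard V$PROCESS and V$SESSION views; the PID 8517 is just this incident's example, and of course this only works while the process still exists:

-- Map an OS process ID to its database session (live processes only;
-- once the process is gone, these V$ views hold nothing).
SELECT s.sid,
       s.serial#,
       s.username,
       s.sql_id,
       p.spid,
       ROUND(p.pga_alloc_mem / 1024 / 1024) AS pga_alloc_mb
FROM   v$process p
       JOIN v$session s ON s.paddr = p.addr
WHERE  p.spid = '8517';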
Well, AWR/ASH to the rescue, and believe me, I didn't find anything pointing to heavy memory usage by any SQL, and I know of no good way to tie a PID to a SID through the historic views once the process is gone. Meanwhile, the client was also working on something, and one of the users sent me an email with two things: first, he was running a SELECT query when the issue happened, and second, a small PL/SQL block of maybe ten lines of code. The SELECT statement was throwing:

ORA-04030: out of process memory when trying to allocate 55600 bytes (kxs-heap-w,qmemNextBuf:Large Alloc)
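For completeness, 11.2 ASH samples do carry a PGA_ALLOCATED column, so a query along these lines is the closest thing to a historic memory view that I know of; the time window below is this incident's, and as noted above, nothing conclusive showed up here:

-- Look for sessions holding unusually large PGA around the incident.
-- DBA_HIST_ACTIVE_SESS_HISTORY.PGA_ALLOCATED exists from 11.2 onward;
-- AWR persists one ASH sample per 10 seconds, so short spikes can be missed.
SELECT sample_time,
       session_id,
       sql_id,
       ROUND(pga_allocated / 1024 / 1024 / 1024, 1) AS pga_gb
FROM   dba_hist_active_sess_history
WHERE  sample_time BETWEEN TIMESTAMP '2015-07-14 11:30:00'
                       AND TIMESTAMP '2015-07-14 12:15:00'
ORDER  BY pga_allocated DESC;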
Either way, the email was a wow moment: we had found the culprit, and it was reproducible. I ran that SELECT while closely monitoring memory; it consumed barely 52 GB of virtual memory and died, never coming close to the 120 GB we had been observing, where I would have expected it to grow consistently. Then I ran the PL/SQL block to see if that might be causing the issue, and yes, that was the real culprit. Here is the code:
declare
    v_count number := 0;
begin
    for C_detail in (select * from test10)
    loop
        v_count := v_count + 1;
        --dbms_output.put_line(C_detail.p_id || '-' || C_detail.s_id || '-' || C_detail.dv);
        INSERT INTO TEST (dv, s_id, p_id, COL3, COL4)
        SELECT C_detail.dv,
               C_detail.s_id,
               C_detail.p_id,
               REGEXP_SUBSTR(C_detail.dv, '[^|]+', 1, 3),
               REGEXP_SUBSTR(C_detail.dv, '[^|]+', 1, 4)
        FROM   DUAL;
        if v_count = 1000 then
            commit;
            v_count := 0;
        end if;
    end loop;
    commit;
end;
/
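As an aside, the row-by-row loop does more work than it needs to; assuming the tables are exactly as shown, the same load can be written as a single set-based statement. This would not have dodged the REGEXP_SUBSTR memory bug, it just makes the intent clearer:

-- Hypothetical set-based rewrite of the block above (same table and
-- column names). One INSERT ... SELECT replaces the cursor loop and
-- the interim commits every 1000 rows.
INSERT INTO test (dv, s_id, p_id, col3, col4)
SELECT dv,
       s_id,
       p_id,
       REGEXP_SUBSTR(dv, '[^|]+', 1, 3),
       REGEXP_SUBSTR(dv, '[^|]+', 1, 4)
FROM   test10;

COMMIT;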
This blog post is already running far longer than expected, so let's get to what the real issue was.
1. The original table:

TARAL.dbnode1>select count(*) from test10;

  COUNT(*)
----------
   1008024

2. Among these roughly 1M records, some contain a partial multibyte character:

ERROR:
ORA-29275: partial multibyte character

6000 rows selected.

TARAL.dbnode1>l
  1* select * from test10

3. So I identified those records and put them into another table, test12:

TARAL.dbnode1>select count(*) from test12;

  COUNT(*)
----------
         3

TARAL.dbnode1>select * from test12;
ERROR:
ORA-29275: partial multibyte character

no rows selected

4. To see what those records actually are:

TARAL.dbnode1>select convert(dv,'US7ASCII','WE8ISO8859P1') dv from test12;

DV
---------
zO
zO
zO

5. Then I ran this simple SELECT statement (I had to kill the session from the OS, as it was consuming a lot of memory):

TARAL.dbnode1>--My processId is 1059
TARAL.dbnode1>select REGEXP_SUBSTR(dv, '[^|]+', 1, 3) from test12;
select REGEXP_SUBSTR(dv, '[^|]+', 1, 3) from test12
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 1059
Session ID: 993 Serial number: 1289

So a simple SELECT on a 3-record table consumed 87 GB of virtual memory. Here is the top output:

top - 23:14:58 up 4 days, 13:37, 6 users, load average: 5.16, 4.59, 4.17
Tasks: 841 total, 6 running, 835 sleeping, 0 stopped, 0 zombie
Cpu(s): 20.2%us, 0.8%sy, 0.0%ni, 78.8%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 111408060k total, 110066080k used, 1341980k free, 338664k buffers
Swap: 67111528k total, 1806208k used, 65305320k free, 3203500k cached

  PID USER    PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
24833 oracle  25   0 40.5g 264m  40m R 100.0  0.2 51:39.04 oracle
24838 oracle  25   0 40.5g 264m  40m R 100.0  0.2 50:23.10 oracle
24839 oracle  25   0 40.5g 264m  40m R 100.0  0.2 51:09.33 oracle
24840 oracle  25   0 40.5g 264m  40m R 100.0  0.2 50:40.74 oracle
 1059 oracle  25   0 87.2g  46g  23m R  99.7 43.9  1:52.66 oracle
11083 oracle  15   0 13296 1736  844 R   0.7  0.0  0:00.61 top
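In case anyone wonders how the bad rows were isolated in step 3 when even a plain SELECT errors out: one way is to walk the table by ROWID and trap ORA-29275 per row. This is just a sketch, under the assumption that a character function such as UPPER forces the server to interpret the broken bytes; test10 is the table from this post, and the exception name is mine:

-- Sketch: isolate rows with broken multibyte sequences by trapping
-- the documented ORA-29275 error one row at a time.
declare
    partial_multibyte exception;
    pragma exception_init(partial_multibyte, -29275);
    l_dummy varchar2(4000);
begin
    for r in (select rowid as rid from test10) loop
        begin
            select upper(dv) into l_dummy
            from   test10
            where  rowid = r.rid;
        exception
            when partial_multibyte then
                dbms_output.put_line('bad row: ' || r.rid);
        end;
    end loop;
end;
/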
So we had a node eviction caused by three bad records. And indeed, this turned out to be a known bug with a one-off patch available.