Troubleshooting Node Reboot

Sometimes the simplest-looking issues turn out to be the hardest to debug, caused by something you would never have thought of. It had been a quiet week so far when you get paged for the one thing no DBA wants to hear about, and yes, you guessed it: node eviction. The first thing to do is make sure the database came back up cleanly before chasing the cause. For context, this is a two-node RAC cluster running an 11.2.0.3 database, and the first node went down. This happened in the early afternoon, and while I was looking for the cause, after confirming everything was back up and running, the second node suddenly went down as well. That is really not a good sign; something was clearly going on. So I started with the logs, and this is what the system messages showed.

Jul 14 12:10:01 dbservernode1 kernel: Call Trace:
Jul 14 12:10:02 dbservernode1 kernel: [] thread_return+0x62/0xfe
Jul 14 12:10:02 dbservernode1 kernel: [] find_get_page+0x21/0x51
Jul 14 12:10:02 dbservernode1 kernel: [] __find_get_block_slow+0xec/0xf8
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:start_this_handle+0x2ed/0x3b7
Jul 14 12:10:02 dbservernode1 kernel: [] autoremove_wake_function+0x0/0x2e
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_start+0xa3/0xda
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:ext4_da_write_begin+0x151/0x25b
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_dirty_metadata+0xef/0x11e
Jul 14 12:10:02 dbservernode1 kernel: [] generic_file_buffered_write+0x14b/0x6a9
Jul 14 12:10:02 dbservernode1 kernel: [] :jbd2:jbd2_journal_stop+0x1c0/0x1cc
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:__ext4_journal_stop+0x66/0x73
Jul 14 12:10:02 dbservernode1 kernel: [] __generic_file_aio_write_nolock+0x369/0x3b6
Jul 14 12:10:02 dbservernode1 kernel: [] generic_file_aio_write+0x67/0xc3
Jul 14 12:10:02 dbservernode1 kernel: [] :ext4:ext4_file_write+0xbf/0x207
Jul 14 12:10:02 dbservernode1 kernel: [] do_sync_write+0xc7/0x104
Jul 14 12:10:02 dbservernode1 kernel: [] do_page_fault+0x4cc/0x842
Jul 14 12:10:02 dbservernode1 kernel: [] autoremove_wake_function+0x0/0x2e
Jul 14 12:10:02 dbservernode1 kernel: [] audit_filter_syscall+0x53/0x90
Jul 14 12:10:02 dbservernode1 kernel: [] audit_filter_syscall+0x53/0x90
Jul 14 12:10:02 dbservernode1 kernel: [] vfs_write+0xce/0x174
Jul 14 12:10:02 dbservernode1 kernel: [] sys_write+0x45/0x6e
Jul 14 12:10:02 dbservernode1 kernel: [] tracesys+0xd5/0xdf
Jul 14 12:10:02 dbservernode1 kernel:

Jul 14 15:13:03 dbservernode1 syslogd 1.4.1: restart.
Jul 14 15:13:03 dbservernode1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

So, this doesn't reveal much beyond some I/O writes, and it is too early to say the disks have issues. The Oracle alert log didn't reveal much either.

opidcl aborting process unknown ospid (17918) as a result of ORA-604
Tue Jul 14 12:09:46 2015
opidcl aborting process unknown ospid (30198) as a result of ORA-604
Tue Jul 14 12:09:51 2015
IPC Send timeout detected. Receiver ospid 22954 [oracle@dbservernode1 (LMD0)]
Tue Jul 14 12:09:54 2015
Timed out trying to start process m000.
Tue Jul 14 12:10:05 2015
Process RSM0, PID = 24496, will be killed
Tue Jul 14 12:10:08 2015
DIAG (ospid: 22942) waits for event 'ksdxexeotherwait' for 10 secs.
Tue Jul 14 12:10:08 2015
Received an instance abort message from instance 2
Please check instance 2 alert and LMON trace files for detail.
Tue Jul 14 12:10:08 2015
Errors in file /u01/app/oracle/diag/rdbms/dbnode/dbnode1/trace/dbnode1_lmd0_22954.trc:
LMS0 (ospid: 22956): terminating the instance due to error 481
Tue Jul 14 12:10:18 2015
System state dump requested by (instance=1, osid=22956 (LMS0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/dbnode/dbnode1/trace/dbnode1_diag_22942.trc
Tue Jul 14 12:10:30 2015

But at least it points to something. As you can see from the output above, processes are timing out, which suggests CPU starvation, I/O problems, or memory pressure where the OS cannot fork new processes. The next step was to look at the OSWatcher logs, and luckily we had OSWatcher set up. Here is the output from vmstat and top.

zzz ***Tue Jul 14 12:01:56 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
12 18 64840124 307500   6344 273596    0    0  2989   103    0    0  2  1 95  2  0
10 18 64982676 262024   6824 275580 3460 143688  7887 145896 4802 11780  7 22 34 37  0
 4 16 65188832 260768   7416 275392 6040 208012  7539 209547 5906 12403  6 26 39 28  0
zzz ***Tue Jul 14 12:02:19 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1 24 67111528 256496  10364 285692    0    0  2989   103    0    0  2  1 95  2  0
 5 30 67111528 256492  10384 284004 1136 3160  1558  4472 3862 20090  2  7 36 55  0
 0 47 67111528 258796  10284 284656  480 1448   723  1786 1946 11222  1  7 26 66  0
zzz ***Tue Jul 14 12:02:47 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
20 43 67111440 257700   8696 278252    0    0  2989   103    0    0  2  1 95  2  0
23 21 67111488 257232   8704 274376 10264 2956 13639  9158 9056 31253  3 68  6 23  0
 6 38 67111016 256388   7020 274244 3100  332  6391  6213 3265 9385  3 46 17 35  0
zzz ***Tue Jul 14 12:09:30 CDT 2015
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
80 24 67101188 256860   2728 228984    0    0  2989   103    0    0  2  1 95  2  0
88 10 67111528 256676   2748 236556 5748 11448 13675 12197 2080 9755  0 100  0  0  0
78  5 67111528 256436   2528 240276  932  140  4791   750 1300 9501  0 100  0  0  0
zzz ***Tue Jul 14 12:09:59 CDT 2015

Bingo! That explains why we were not able to fork any processes, and we were also swapping heavily. So what was going on? We also have historical top data, so let's look at that too.

top - 12:00:16 up 72 days,  7:52,  2 users,  load average: 23.11, 15.23, 7.98
Tasks: 1658 total,   4 running, 1654 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.3%us,  8.1%sy,  0.0%ni, 41.2%id, 49.3%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  111408060k total, 111145512k used,   262548k free,     9796k buffers
Swap: 67111528k total, 55064268k used, 12047260k free,   280796k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1506 root      20  -5     0    0    0 R 99.9  0.0   4:01.35 [kswapd1]
 8517 oracle    18   0  114g  27g  32m D 50.3 26.1   8:39.87 oracledbnode1 (LOCA
19395 oracle    18   0 63.6g  22g  33m D 47.4 21.6   3:37.75 oracledbnode1 (LOCA
 1505 root      10  -5     0    0    0 D 45.8  0.0   3:59.63 [kswapd0]
 3976 oracle    17   0  233m  24m  20m D  8.6  0.0   0:00.27 oracledbnode1 (LOCA
 2468 oracle    16   0 40.2g  29m  27m D  8.3  0.0   0:23.11 oracledbnode1 (LOCA
25539 oracle    16   0 40.2g  31m  30m S  6.4  0.0  26:34.46 ora_cjq0_dbnode1
 1447 root      18   0 26952 2208  536 S  6.0  0.0 297:29.11 cmaperfd -p 30 -s O
 3896 oracle    17   0 40.2g  22m  19m S  6.0  0.0   0:00.43 oracledbnode1 (LOCA
 3898 oracle    17   0 40.2g  22m  19m D  5.4  0.0   0:00.46 oracledbnode1 (LOCA
 3780 oracle    17   0 40.2g  22m  19m D  5.1  0.0   0:00.69 oracledbnode1 (LOCA
 3423 oracle    15   0 1536m  42m  14m S  4.5  0.0 224:33.35 /u01/app/11.2.0.3/g
25716 root      10  -5     0    0    0 S  4.5  0.0   5:43.75 [jbd2/dm-7-8]
26232 root      RT   0  164m  16m 2860 S  4.1  0.0   2965:32 /sbin/multipathd

zzz ***Tue Jul 14 12:02:24 CDT 2015

top - 12:02:40 up 72 days,  7:54,  2 users,  load average: 36.57, 22.90, 11.94
Tasks: 1662 total,  13 running, 1649 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.5%us, 40.3%sy,  0.0%ni, 17.5%id, 40.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  111408060k total, 111146944k used,   261116k free,    10056k buffers
Swap: 67111528k total, 67111328k used,      200k free,   277708k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
30993 oracle    15   0  372m  14m  11m S 67.2  0.0 100:24.55 /u01/app/11.2.0.3/g
31029 oracle    15   0  307m  16m  10m S 58.2  0.0  86:05.45 /u01/app/11.2.0.3/g
25872 oracle    17   0 40.2g  30m  28m S 50.4  0.0 391:19.16 oracledbnode1 (LOCA
26153 oracle    19   0 40.2g  28m  28m D 40.9  0.0   4:52.36 oracledbnode1 (LOCA
25876 oracle    15   0 40.2g  31m  28m S 40.8  0.0 396:48.27 oracledbnode1 (LOCA
 1506 root      10  -5     0    0    0 R 38.1  0.0   5:16.66 [kswapd1]
26101 oracle    18   0 40.2g  30m  28m D 34.3  0.0  86:07.11 oracledbnode1 (LOCA
19395 oracle    18   0 69.6g  24g  33m R 32.2 22.8   4:57.43 oracledbnode1 (LOCA
31015 oracle    15   0  236m  11m  10m S 30.9  0.0  22:13.52 /u01/app/11.2.0.3/g
 8517 oracle    18   0  120g  26g  32m D 30.8 24.9   9:59.00 oracledbnode1 (LOCA

OK, this confirms we were swapping heavily with very little swap left, which explains everything. It also points to something that needs attention: process 8517, an Oracle process, consuming 120 GB of virtual memory. So we have identified the cause; now how do we find out what that session was doing at the time?

Well, AWR/ASH to the rescue, but believe me, I didn't find anything related to heavy memory usage by any SQL, and I don't know how to tie an OS PID to a SID using the historical views. Meanwhile the client was also working on something, and one of the users sent me an email with two things: first, he was running a SELECT query when the issue happened, and second, a small PL/SQL block of maybe ten lines. The SELECT statement was throwing ORA-04030: out of process memory when trying to allocate 55600 bytes (kxs-heap-w,qmemNextBuf:Large Alloc).
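One cross-check that is possible from AWR is instance-wide PGA history, since neither V$ACTIVE_SESSION_HISTORY nor its DBA_HIST counterpart records the OS PID. A minimal sketch against DBA_HIST_PGASTAT (the time window and rounding are just illustrative):

-- Instance-wide PGA history around the incident window; this will not name the
-- offending session, but it shows whether the growth was inside what Oracle
-- accounts for as PGA or outside it.
select s.begin_interval_time,
       p.name,
       round(p.value/1024/1024/1024, 1) as gb
  from dba_hist_pgastat  p
  join dba_hist_snapshot s
    on s.snap_id = p.snap_id
   and s.dbid = p.dbid
   and s.instance_number = p.instance_number
 where p.name in ('total PGA allocated', 'total PGA inuse')
   and s.begin_interval_time > sysdate - 1
 order by s.begin_interval_time, p.name;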

This was a wow moment: we had found the culprit, and it was reproducible. So I ran that SQL while closely monitoring memory, but it consumed barely 52 GB of virtual memory and died, never approaching what we had observed (120 GB); at the very least I would have expected it to keep growing. Then I ran the PL/SQL block to see whether that was causing the issue, and yes, that was the real culprit. Here is the code.


declare
  v_count number := 0;
begin
  -- Walk TEST10 row by row, split the pipe-delimited DV column with
  -- REGEXP_SUBSTR, insert into TEST, and commit every 1,000 rows.
  for C_detail in (select * from test10) loop
    v_count := v_count + 1;
    --dbms_output.put_line(C_detail.p_id || '-' || C_detail.s_id || '-' || C_detail.dv);
    INSERT INTO TEST
      (dv, s_id, p_id, COL3, COL4)
      SELECT C_detail.dv,
             C_detail.s_id,
             C_detail.p_id,
             REGEXP_SUBSTR(C_detail.dv, '[^|]+', 1, 3),
             REGEXP_SUBSTR(C_detail.dv, '[^|]+', 1, 4)
        FROM DUAL;
    if v_count = 1000 then
      commit;
      v_count := 0;
    end if;
  end loop;
  commit;
end;
/
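Bug aside, the block itself does single-row inserts with a commit every 1,000 rows where one set-based statement would do. A sketch of the equivalent using the same columns (it will not dodge the REGEXP_SUBSTR bug, it just removes the row-by-row overhead):

INSERT INTO TEST (dv, s_id, p_id, COL3, COL4)
SELECT dv,
       s_id,
       p_id,
       REGEXP_SUBSTR(dv, '[^|]+', 1, 3),
       REGEXP_SUBSTR(dv, '[^|]+', 1, 4)
  FROM test10;
COMMIT;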

This blog post is already running longer than expected, so let's get to the real issue.


1.	Original table

TARAL.dbnode1>select count(*) from test10;

  COUNT(*)
----------
   1008024

2.	Among these roughly 1M records, some contain a partial multibyte character:

ERROR:
ORA-29275: partial multibyte character



6000 rows selected.

TARAL.dbnode1>l
  1* select * from test10

3.	So I identified those records and put them into another table, test12:

TARAL.dbnode1>select count(*) from test12;

  COUNT(*)
----------
         3

TARAL.dbnode1>select * from test12;
ERROR:
ORA-29275: partial multibyte character



no rows selected

4.	To see what those records are:

TARAL.dbnode1>select convert(dv,'US7ASCII','WE8ISO8859P1') dv from test12;

DV	     
---------
zO       
zO       
zO       

5.	When I ran this simple SELECT statement (I had to kill the session from the OS as it was consuming a lot of memory):


TARAL.dbnode1>--My processId is 1059
TARAL.dbnode1>select REGEXP_SUBSTR(dv, '[^|]+', 1, 3) from test12;
select REGEXP_SUBSTR(dv, '[^|]+', 1, 3) from test12
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 1059
Session ID: 993 Serial number: 1289

So a simple SELECT on a 3-row table can consume 87 GB of virtual memory.

And here is the top output

top - 23:14:58 up 4 days, 13:37,  6 users,  load average: 5.16, 4.59, 4.17
Tasks: 841 total,   6 running, 835 sleeping,   0 stopped,   0 zombie
Cpu(s): 20.2%us,  0.8%sy,  0.0%ni, 78.8%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:  111408060k total, 110066080k used,  1341980k free,   338664k buffers
Swap: 67111528k total,  1806208k used, 65305320k free,  3203500k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                           
24833 oracle    25   0 40.5g 264m  40m R 100.0  0.2  51:39.04 oracle                                                                           
24838 oracle    25   0 40.5g 264m  40m R 100.0  0.2  50:23.10 oracle                                                                           
24839 oracle    25   0 40.5g 264m  40m R 100.0  0.2  51:09.33 oracle                                                                           
24840 oracle    25   0 40.5g 264m  40m R 100.0  0.2  50:40.74 oracle                                                                           
 1059 oracle    25   0 87.2g  46g  23m R 99.7 43.9   1:52.66 oracle                                                                            
11083 oracle    15   0 13296 1736  844 R  0.7  0.0   0:00.61 top                                                                               
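For completeness, the same process can also be watched from inside the database while it is still alive; a minimal sketch joining v$process to v$session (1059 is the OS PID from the run above):

-- Map the OS PID back to a session and see how much PGA Oracle thinks it is using;
-- compare with VIRT/RES in top to see how much memory sits outside the accounted PGA.
select s.sid, s.serial#, s.username, s.sql_id,
       round(p.pga_used_mem /1024/1024) as pga_used_mb,
       round(p.pga_alloc_mem/1024/1024) as pga_alloc_mb,
       round(p.pga_max_mem  /1024/1024) as pga_max_mb
  from v$process p
  join v$session s
    on s.paddr = p.addr
 where p.spid = '1059';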


So we had a node eviction caused by three bad records, and indeed this was a bug with a one-off patch available.

Bug 14079850 – ORA-7445 [lxregmatpush] / spin when using REGEXP_SUBSTR (Doc ID 14079850.8)

Library cache lock caused by wrong password login

Mostly, when DBAs get called, the conversation starts with "everything is running slow." Some time back I got the same kind of call from a client: they were not able to log in to the system and calls were not being processed. I started investigating and found that lots of sessions were waiting on library cache lock, and here is the ASH report, which pretty much gives the reason for these locks.


          Analysis Begin Time:   08-Jun-15 09:00:00
            Analysis End Time:   08-Jun-15 09:22:00
                 Elapsed Time:        22.0 (mins)
            Begin Data Source:   V$ACTIVE_SESSION_HISTORY
              End Data Source:   V$ACTIVE_SESSION_HISTORY
                 Sample Count:     206,434
      Average Active Sessions:      156.39
  Avg. Active Session per CPU:        6.52
                Report Target:   None specified

                                                               Avg Active
Event                               Event Class        % Event   Sessions
----------------------------------- --------------- ---------- ----------
library cache lock                  Concurrency          99.36     155.39

Top Event P1/P2/P3 Values     DB/Inst: PCHGCH/PCHGCH1  (Jun 08 09:00 to 09:22)

Event                          % Event  P1 Value, P2 Value, P3 Value % Activity
------------------------------ ------- ----------------------------- ----------
Parameter 1                Parameter 2                Parameter 3
-------------------------- -------------------------- --------------------------
library cache lock               99.36 "33699225536","3006886000","5       0.43
handle address             lock address               100*mode+namespace

Top Phases of Execution       DB/Inst: PCHGCH/PCHGCH1  (Jun 08 09:00 to 09:22)

                                          Avg Active
Phase of Execution             % Activity   Sessions
------------------------------ ---------- ----------
Connection Management               99.57     155.71

Top Call Types                DB/Inst: PCHGCH/PCHGCH1  (Jun 08 09:00 to 09:22)

Call Type                                     Count % Activity Avg Active
---------------------------------------- ---------- ---------- ----------
OAUTH                                       205,543      99.57     155.71

                         Slot                                   Event
Slot Time (Duration)    Count Event                             Count % Event
-------------------- -------- ------------------------------ -------- -------
09:00:00   (2.0 min)   18,508 library cache lock               18,290    8.86
                              CPU + Wait for CPU                  123    0.06
                              enq: JS - queue lock                 65    0.03
09:02:00   (2.0 min)   18,934 library cache lock               18,768    9.09
                              CPU + Wait for CPU                  107    0.05
                              row cache lock                       21    0.01
09:04:00   (2.0 min)   18,105 library cache lock               18,001    8.72
                              CPU + Wait for CPU                   96    0.05
                              library cache pin                     3    0.00
09:06:00   (2.0 min)   16,978 library cache lock               16,853    8.16
                              CPU + Wait for CPU                  118    0.06
                              Disk file Mirror Read                 3    0.00
09:08:00   (2.0 min)   19,685 library cache lock               19,591    9.49
                              CPU + Wait for CPU                   88    0.04
                              ADR block file read                   1    0.00
09:10:00   (2.0 min)   20,215 library cache lock               20,088    9.73
                              CPU + Wait for CPU                  111    0.05
                              Streams AQ: qmn coordinator wa        6    0.00
09:12:00   (2.0 min)   19,816 library cache lock               19,696    9.54
                              CPU + Wait for CPU                  120    0.06
09:14:00   (2.0 min)   17,866 library cache lock               17,812    8.63
                              CPU + Wait for CPU                   49    0.02
                              reliable message                      4    0.00
09:16:00   (2.0 min)   17,419 library cache lock               17,318    8.39
                              CPU + Wait for CPU                   85    0.04
                              control file sequential read          6    0.00
09:18:00   (2.0 min)   19,683 library cache lock               19,582    9.49
                              CPU + Wait for CPU                  100    0.05
                              ADR block file read                   1    0.00
09:20:00   (2.0 min)   19,225 library cache lock               19,117    9.26
                              CPU + Wait for CPU                   92    0.04
                              direct path read                      6    0.00

So, as you can see from the output above, about 155 average active sessions were waiting on this lock, which explains why calls were not completing, and the Top Phases of Execution (Connection Management) and Top Call Types (OAUTH) sections tell us that something is going wrong with authentication. We can see the same thing in AWR as well.


                                                           Avg
                                                          wait   % DB
Event                                 Waits     Time(s)   (ms)   time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
library cache lock                    3,158     751,422 2.E+05   99.6 Concurrenc
DB CPU                                            1,735            .2
inactive session                         60          57    951     .0 Other
control file sequential read         73,205          34      0     .0 System I/O
enq: PS - contention                 21,631          30      1     .0 Other

Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
connection management call elapsed time             752,541.7         99.7
sql execute elapsed time                              1,814.3           .2
DB CPU                                                1,734.6           .2
PL/SQL execution elapsed time                           266.5           .0
parse time elapsed                                        8.7           .0

So connection management was consuming the majority of DB time. Unfortunately, auditing had not been enabled due to some other issues. But here is the output from Tanel's awesome dashtop tool:

(dashtop output screenshot)

Further investigation found that this was most likely caused by:

High “Library Cache Locks” Wait Time Due to Invalid Login Attempts (Doc ID 1309738.1)

Bug 19637855 : LIBRARY CACHE LOCK CAUSED BY WRONG PASSWORD LOGIN
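
Since auditing was off, a couple of quick checks can help confirm the failed-login theory without it; a minimal sketch (querying sys.user$ needs SYSDBA or an explicit grant, and as I understand it LCOUNT counts consecutive failed logins and resets on a successful one):

-- Accounts currently accumulating failed logins (no audit trail needed)
select name, lcount
  from sys.user$
 where lcount > 0
 order by lcount desc;

-- Who is waiting on library cache lock right now, grouped by username
select username, count(*) as sessions
  from gv$session
 where event = 'library cache lock'
 group by username
 order by sessions desc;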

Oracle timezone mystery

The post title says mystery because it took a bit of time to identify the real issue. To start with, we were supposed to change the OS and DB time zone to CST per the customer's request. It is a pretty straightforward process: change OS settings -> DB settings -> if needed, register with CRS. But after all of that it was still not converted to CST.
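When chasing this kind of mismatch, it helps to see every layer in one query; a minimal sketch (SYSDATE and SYSTIMESTAMP come from the TZ environment of the server process, SESSIONTIMEZONE from the client session, and DBTIMEZONE only matters for TIMESTAMP WITH LOCAL TIME ZONE storage; the column aliases are just illustrative):

select to_char(sysdate, 'DD/MM/YYYY HH24:MI:SS') as server_clock,
       to_char(systimestamp, 'TZR')              as server_tz_offset,
       sessiontimezone                           as session_tz,
       dbtimezone                                as db_tz
  from dual;

Running it over a local (BEQ) connection and then a TNS connection makes the mismatch obvious, as the outputs below show.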


Local Connection

SQL> select to_char(sysdate, 'DD/MM/YYYY HH24:MI:SS') , 
 to_char(systimestamp, 'DD/MM/YYYY HH24:MI:SS TZR') from dual;

TO_CHAR(SYSDATE,'DD TO_CHAR(SYSTIMESTAMP,'DD/MM/YYYYHH24:MI:SSTZR')
------------------- ----------------------------------------------------
10/07/2015 11:41:22 10/07/2015 11:41:22 -05:00

Remote Connection

SQL> select to_char(sysdate, 'DD/MM/YYYY HH24:MI:SS'), 
 to_char(systimestamp, 'DD/MM/YYYY HH24:MI:SS TZR') from dual;

TO_CHAR(SYSDATE,'DD TO_CHAR(SYSTIMESTAMP,'DD/MM/YYYYHH24:MI:SSTZR')
------------------- ----------------------------------------------------
10/07/2015 12:41:54 10/07/2015 12:41:54 -04:00

 

So a local (BEQ) connection was giving the correct time, but a connection through TNS was still giving Eastern time. So obviously we needed to check a few configuration files.

cat /u01/app/11.2.0.3/grid/crs/install/s_crsconfig_<host>_env.txt

TZ=America/Chicago
NLS_LANG=AMERICAN_AMERICA.AL32UTF8
TNS_ADMIN=
ORACLE_BASE=

cat /etc/sysconfig/clock
ZONE="America/Chicago"
UTC=true 
ARC=false

So far everything looked good, so something was wrong somewhere else. Checking further, I found this:

export TZ=UTC
date
Fri Jul 10 20:31:38 UTC 2015

export TZ=America/New_York
date
Fri Jul 10 16:36:08 EDT 2015

export TZ=America/Chicago
date 
Fri Jul 10 16:33:32 EDT 2015

export TZ=America/Matamoros
date
Fri Jul 10 15:34:08 CDT 2015


export TZ=America/Chihuahua
date
Fri Jul 10 14:40:52 MDT 2015

Now look at the zoneinfo files:

cd /usr/share/zoneinfo/America
ls -ltra
total 584
-rw-r--r-- 2 root root 3519 Feb 1 02:45 Chicago
-rw-r--r-- 1 root root 1966 Mar 21 08:56 Yellowknife
-rw-r--r-- 1 root root 2288 Mar 21 08:56 Yakutat
-rw-r--r-- 2 root root 2865 Mar 21 08:56 Winnipeg
-rw-r--r-- 2 root root 2067 Mar 21 08:56 Whitehorse
-rw-r--r-- 15 root root 156 Mar 21 08:56 Virgin


So even after setting TZ to America/Chicago, the time was still showing EDT. Looking at the zoneinfo files, only Chicago had an older timestamp than the rest, and on other servers the size of that file was different as well, so it seemed that this one file was corrupted. Once it was replaced with a good copy, the issue was fixed; no reboot or restart was needed.