ORA-15075

We hit an ASM issue a couple of weeks back, and here is a little background. The environment is Oracle 11.1.0.7.0 using ASMLIB. Disk names and a few other details have been changed to avoid a policy violation.

We wanted some more space on the LeftHand SCSI storage, so the SAs added one 50GB disk. While doing so they accidentally overwrote DISK1, which was already in use; thankfully ASM invalidated DISK1 and rebalanced its data onto the other disks. The incident caused no outage, and we were happy until the next maintenance window for this database.

Now, this time we wanted to move this storage to the SAN, and since the diskgroup uses external redundancy the cleanest way is to add the new disks and then drop the old ones, with no downtime needed. Below is the state of the disks before starting any work. Notice that DISK001 shows a HEADER_STATUS of CANDIDATE even though it is still listed in the diskgroup: that is the disk which effectively no longer exists because of the problem mentioned above.
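
The listing was pulled from V$ASM_DISK; a query along these lines (the column list is inferred from the headings below, it is not the exact script we used) produces the same output:

-- Rough sketch: list every disk ASM has discovered, with its header and
-- mount status; the columns match the listing that follows.
SELECT group_number, disk_number, mount_status, header_status,
       mode_status, state, name, label, path
  FROM v$asm_disk
 ORDER BY group_number, disk_number;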


GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE    NAME                           LABEL                           PATH
------------ ----------- ------- ------------ ------- -------- ------------------------------ ------------------------------- ----------------------
           1           0 CACHED  CANDIDATE    ONLINE  NORMAL   DISK001                        DISK001                         ORCL:DISK001
           1           1 CACHED  MEMBER       ONLINE  NORMAL   DISK002                        DISK002                         ORCL:DISK002
           1           2 CACHED  MEMBER       ONLINE  NORMAL   DISK003                        DISK003                         ORCL:DISK003
           1           3 CACHED  MEMBER       ONLINE  NORMAL   DISK004                        DISK004                         ORCL:DISK004
           1           4 CACHED  MEMBER       ONLINE  NORMAL   DISK005                        DISK005                         ORCL:DISK005
           1           5 CACHED  MEMBER       ONLINE  NORMAL   DISK006                        DISK006                         ORCL:DISK006
           2           0 CACHED  MEMBER       ONLINE  NORMAL   DISK007                        DISK007                         ORCL:DISK007
           2           1 CACHED  MEMBER       ONLINE  NORMAL   DISK008                        DISK008                         ORCL:DISK008
           0           0 CACHED  PROVISIONED  ONLINE  NORMAL                                  DISK009                         ORCL:DISK009
           0           1 CACHED  PROVISIONED  ONLINE  NORMAL                                  DISK010                         ORCL:DISK010

There were about 25-30 disks provisioned, so I have trimmed the rest of the output. As you can see, these two disks (DISK009 and DISK010) are ready to be added. While adding them we got this error:


SQL> ALTER DISKGROUP XXXXXXDATA ADD  DISK 'ORCL:DISK009' SIZE 51200M ,
'ORCL:DISK010' SIZE 51200M
NOTE: Assigning number (1,6) to disk (ORCL:DISK009)
NOTE: Assigning number (1,7) to disk (ORCL:DISK010)
NOTE: requesting all-instance membership refresh for group=1
NOTE: initializing header on grp 1 disk DISK009
NOTE: initializing header on grp 1 disk DISK010
NOTE: cache opening disk 6 of grp 1: DISK009 label:DISK009
NOTE: cache opening disk 7 of grp 1: DISK010 label:DISK010
NOTE: requesting all-instance disk validation for group=1
Wed May 25 17:36:50 2011
NOTE: disk validation pending for group 1/0xb4587064 (XXXXXXDATA)
SUCCESS: validated disks for 1/0xb4587064 (XXXXXXDATA)
ERROR: ORA-15075 signalled during reconfiguration of diskgroup XXXXXXDATA
NOTE: membership refresh pending for group 1/0xb4587064 (XXXXXXDATA)
kfdp_query(XXXXXXDATA): 7
Wed May 25 17:36:56 2011
kfdp_queryBg(): 7
kfdp_query(XXXXXXDATA): 8
kfdp_queryBg(): 8
NOTE: cache closing disk 6 of grp 1: DISK009 label:DISK009
NOTE: cache closing disk 6 of grp 1: DISK009 label:DISK009
NOTE: De-assigning number (1,6) from disk (ORCL:DISK009)
NOTE: cache closing disk 7 of grp 1: DISK010 label:DISK010
NOTE: cache closing disk 7 of grp 1: DISK010 label:DISK010
NOTE: De-assigning number (1,7) from disk (ORCL:DISK010)
kfdp_query(XXXXXXDATA): 9
kfdp_queryBg(): 9
SUCCESS: refreshed membership for 1/0xb4587064 (XXXXXXDATA)
Wed May 25 17:36:59 2011
ORA-15032: not all alterations performed
ORA-15075: disk(s) are not visible cluster-wide
ERROR: ALTER DISKGROUP XXXXXXDATA ADD  DISK 'ORCL:DISK009' SIZE 51200M ,
'ORCL:DISK010' SIZE 51200M
Wed May 25 17:37:07 2011
SQL> ALTER DISKGROUP XXXXXXDATA ADD  DISK 'ORCL:DISK009' SIZE 51200M ,
'ORCL:DISK010' SIZE 51200M
NOTE: Assigning number (1,8) to disk (ORCL:DISK009)
NOTE: Assigning number (1,9) to disk (ORCL:DISK010)
NOTE: requesting all-instance membership refresh for group=1
NOTE: De-assigning number (1,8) from disk (ORCL:DISK009)
NOTE: De-assigning number (1,9) from disk (ORCL:DISK010)
ERROR: ORA-15033 signalled during reconfiguration of diskgroup XXXXXXDATA

There can be many reasons for this: oracleasm scandisks not having been run on all nodes, permission problems, and so on. In our case the multipath configuration was probably not set up correctly. Either way, we needed to remove these two disks, let the SAs correct the issue, and then try to add them back.
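
In a RAC environment, ORA-15075 means at least one ASM instance in the cluster could not see the new disks. Before retrying the ADD DISK, a quick check like the one below (a sketch, not the exact commands we ran) shows whether every instance discovers them:

-- Sketch: confirm both new disks are visible from every ASM instance.
-- A disk missing from one INST_ID here is exactly what ORA-15075
-- ("disk(s) are not visible cluster-wide") is complaining about.
SELECT inst_id, path, header_status, mount_status
  FROM gv$asm_disk
 WHERE path IN ('ORCL:DISK009', 'ORCL:DISK010')
 ORDER BY inst_id, path;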


SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK009'
ORA-15032: not all alterations performed
ORA-15054: disk "ORCL:DISK009" does not exist in diskgroup "XXXXXXDATA"
ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK009'
Wed May 25 18:32:48 2011
SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010'
ORA-15032: not all alterations performed
ORA-15054: disk "ORCL:DISK010" does not exist in diskgroup "XXXXXXDATA"
ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010'
Wed May 25 18:33:04 2011
SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' force
ORA-15032: not all alterations performed
ORA-15054: disk "ORCL:DISK010" does not exist in diskgroup "XXXXXXDATA"
ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' force

So you cannot drop these disks through ASM. To get them back to a clean state you have to wipe the disk header, using something like this:

# dd should target the underlying block device for the ASMLIB disk
# (for example /dev/oracleasm/disks/DISK009), not the ORCL: discovery path
dd if=/dev/zero of=/dev/oracleasm/disks/DISK009 bs=4096 count=5000

But since this was a QA database and the new disks held no data, we planned to take some downtime and let the SAs remove the SAN storage and reconfigure it as needed. When they rebooted the box, though, the issue got worse: the DATA diskgroup was gone and ASM could not mount it either.


SQL> ALTER DISKGROUP ALL MOUNT
NOTE: cache registered group XXXXXXDATA number=1 incarn=0xb7f88ac9
NOTE: cache began mount (first) of group XXXXXXDATA number=1 incarn=0xb7f88ac9
NOTE: cache registered group XXXXXXFLASH number=2 incarn=0xb7f88aca
NOTE: cache began mount (first) of group XXXXXXFLASH number=2 incarn=0xb7f88aca
NOTE:Loaded lib: /opt/oracle/extapi/64/asm/orcl/1/libasm.so
NOTE: Assigning number (1,1) to disk (ORCL:DISK002)
NOTE: Assigning number (1,2) to disk (ORCL:DISK003)
NOTE: Assigning number (1,3) to disk (ORCL:DISK004)
NOTE: Assigning number (1,4) to disk (ORCL:DISK005)
NOTE: Assigning number (1,5) to disk (ORCL:DISK006)
NOTE: Assigning number (1,8) to disk (ORCL:DISK009)
NOTE: Assigning number (1,9) to disk (ORCL:DISK010)
ERROR: no PST quorum in group 1: required 1, found 0
NOTE: cache dismounting group 1/0xB7F88AC9 (XXXXXXDATA)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 1/0xB7F88AC9 (XXXXXXDATA)
NOTE: cache ending mount (fail) of group XXXXXXDATA number=1 incarn=0xb7f88ac9
kfdp_dismount(): 2
kfdp_dismountBg(): 2
NOTE: De-assigning number (1,1) from disk (ORCL:DISK002)
NOTE: De-assigning number (1,2) from disk (ORCL:DISK003)
NOTE: De-assigning number (1,3) from disk (ORCL:DISK004)
NOTE: De-assigning number (1,4) from disk (ORCL:DISK005)
NOTE: De-assigning number (1,5) from disk (ORCL:DISK006)
NOTE: De-assigning number (1,8) from disk (ORCL:DISK009)
NOTE: De-assigning number (1,9) from disk (ORCL:DISK010)
ERROR: diskgroup XXXXXXDATA was not mounted
NOTE: Assigning number (2,0) to disk (ORCL:DISK007)
NOTE: Assigning number (2,1) to disk (ORCL:DISK008)
NOTE: start heartbeating (grp 2)
kfdp_query(XXXXXXFLASH): 5
kfdp_queryBg(): 5
NOTE: cache opening disk 0 of grp 2: DISK007 label:DISK007
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DISK008 label:DISK008
NOTE: cache mounting (first) group 2/0xB7F88ACA (XXXXXXFLASH)
* allocate domain 2, invalid = TRUE
kjbdomatt send to node 1
NOTE: attached to recovery domain 2
NOTE: cache recovered group 2 to fcn 0.144346
NOTE: LGWR attempting to mount thread 1 for diskgroup 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.144298 ABA
NOTE: seq=13 blk=8268
NOTE: cache mounting group 2/0xB7F88ACA (XXXXXXFLASH) succeeded
NOTE: cache ending mount (success) of group XXXXXXFLASH number=2 incarn=0xb7f88aca
kfdp_query(XXXXXXFLASH): 6
kfdp_queryBg(): 6
NOTE: Instance updated compatible.asm to 10.1.0.0.0 for grp 2
SUCCESS: diskgroup XXXXXXFLASH was mounted
ORA-15032: not all alterations performed
ORA-15063: ASM discovered an insufficient number of disks for diskgroup "XXXXXXDATA"
ERROR: ALTER DISKGROUP ALL MOUNT

So now the problem comes back to DISK1, the disk that was wrongly overwritten the first time around. Oracle is still trying to find it and cannot: that disk had been given the DISK035 label when it was overwritten, but it is really the original 01 disk, and its ASM header is no longer valid, as the disk report below shows.
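
For reference, a quick look at header statuses in V$ASM_DISK (a sketch along the lines below, not the tool that produced the report that follows) also shows which disks ASM no longer treats as members:

-- Sketch: a disk whose header was overwritten will typically have dropped
-- out of MEMBER status here (showing CANDIDATE/PROVISIONED, or missing
-- from the diskgroup entirely).
SELECT path, name, label, header_status, mount_status
  FROM v$asm_disk
 WHERE header_status <> 'MEMBER';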


----------------------------- DISK REPORT N0008 ------------------------------
                Disk Path: ORCL:DISK035
           Unique Disk ID:
               Disk Label: DISK035
     Physical Sector Size: 512 bytes
                Disk Size: 51200 megabytes
** NOT A VALID ASM DISK HEADER. BAD VALUE IN FIELD blksize_kfdhdb **

kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj:                       0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000

But now comes the cool part. We opened an SR and learned something that may be well known to some people, but it was new to me. We knew the disk header had been wiped out, and it turns out that is recoverable.

Oracle keeps a backup copy of the ASM disk header structure at allocation unit (AU) 1, block 254 of the same disk. From that copy you can retrieve the lost header information, and kfed was the tool that came to the rescue to merge it back:

# read the backup copy of the disk header (AU 1, block 254) into a text file
kfed read /dev/d1 aun=1 blkn=254 text=/tmp/d1.log
# merge that text back into the live disk header (block 0)
kfed merge /dev/d1 text=/tmp/d1.log

Then run oracleasm scandisks on every node, and that was it: everything came back up fine.

Strange ASM Hang Issue

Some time back we hit a strange ASM issue. Users were complaining about connectivity problems; at first glance everything looked fine, but digging deeper showed that the ASM instance was hanging. This was again on Oracle 11.1.0.7.0.

Oracle Support filed this as a bug, but the immediate fix was to bounce the database and ASM instances, which cost us about 10 minutes of downtime. The strange part is that the DR site was hit by the same issue as well.

Here are some of the details:


Current Wait Stack:
 0: waiting for 'enq: CF - contention'
    name|mode=43460004, 0=0, operation=0
    wait_id=13621 seq_num=13737 snap_id=1
    wait times: snap=13 min 44 sec, exc=13 min 44 sec, total=13 min 44 sec
    wait times: max=15 min 0 sec
    wait counts: calls=1646 os=1646
    in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
Dumping one blocker:
  inst: 1, sid: 1358, ser: 1
Wait State:
  auto_close=0 flags=0x22 boundary=(nil)/-1

So the controlfile (CF) enqueue was being held for much longer than expected.
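
A quick way to see who is holding or queuing on the CF enqueue is GV$LOCK; something along these lines (a sketch, our actual diagnosis came from the hang and systemstate dumps) points at the holder and the waiters:

-- Sketch: sessions holding (LMODE > 0) or waiting for (REQUEST > 0)
-- the controlfile enqueue, cluster-wide.
SELECT l.inst_id, l.sid, s.program, l.lmode, l.request, l.ctime
  FROM gv$lock l
  JOIN gv$session s
    ON s.inst_id = l.inst_id
   AND s.sid = l.sid
 WHERE l.type = 'CF'
 ORDER BY l.ctime DESC;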

Current Wait Stack:
 1: waiting for 'KSV master wait'
    =0, =0, =0
    wait_id=6434 seq_num=7633 snap_id=2
    wait times: snap=2 min 0 sec, exc=12 min 3 sec, total=12 min 4 sec
    wait times: max=infinite
    wait counts: calls=242 os=242
    in_wait=1 iflags=0x15a0
 0: waiting for 'ASM file metadata operation'
    msgop=11, locn=0, =0
    wait_id=6432 seq_num=7630 snap_id=2
    wait times: snap=0.000000 sec, exc=0.000012 sec, total=12 min 7 sec
    wait times: max=infinite
    wait counts: calls=0 os=0
    in_wait=1 iflags=0x1520
Wait State:
  auto_close=0 flags=0x23 boundary=(nil)/-1

And it was held by the archiver process. Now let's look at the systemstate dumps.
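
These dumps are typically taken with oradebug from SQL*Plus; the exact level used is not shown here, so treat the following as a sketch:

-- Sketch: connected as SYSDBA to the hung instance.
-- Level 266 adds short stacks to the standard level 10 systemstate dump.
ORADEBUG SETMYPID
ORADEBUG UNLIMIT
ORADEBUG DUMP SYSTEMSTATE 266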

System State 2
~~~~~~~~~~~~~~~~
1:
2:  0: waiting for 'pmon timer'
3:  0: waiting for 'VKTM Logical Idle Wait'
4:
5:  0: waiting for 'PING'
6:  0: waiting for 'rdbms ipc message'
7:  0: waiting for 'DIAG idle wait'
8:  0: waiting for 'rdbms ipc message'
9:  0: waiting for 'ges remote message'
10: 0: waiting for 'gcs remote message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'GCS lock cvt S'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'ASM file metadata operation'
20: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
21: 0: waiting for 'ASM file metadata operation'
22: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
23: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
24: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
Enqueue FA-00000002-00000168    19: 0: waiting for 'ASM file metadata operation'
Enqueue FA-00000002-00000168    21: 0: waiting for 'ASM file metadata operation'
Enqueue FA-00000002-00000168    24: 24: is waiting for 19: 21: 24:

System State 9
~~~~~~~~~~~~~~~~
1:
2:
3:  0: waiting for 'VKTM Logical Idle Wait'
4:
5:  0: waiting for 'PING'
6:  0: waiting for 'rdbms ipc message'
7:
8:  0: waiting for 'rdbms ipc message'
9:  0: waiting for 'ges remote message'
10: 0: waiting for 'gcs remote message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'ASM network foreground exits'
20: 0: waiting for 'log write(odd)'
24: 0: waiting for 'ASM file metadata operation'
25: 0: waiting for 'ASM file metadata operation'
26: 0: waiting for 'ASM file metadata operation'[Latch for]
27: 0: waiting for 'ASM file metadata operation'
28: 0: waiting for 'buffer busy'       [Latch for]
     Cmd: Select
29: 0: waiting for 'ASM file metadata operation'
30: 0: waiting for 'ASM file metadata operation'
31: 0: waiting for 'ASM file metadata operation'
33: 0: waiting for 'ASM file metadata operation'
34: 0: waiting for 'ASM file metadata operation'[Latch for]
35: 0: waiting for 'ASM file metadata operation'[Latch for]
36: 0: waiting for 'ASM file metadata operation'[Latch for]
37: 0: waiting for 'ASM file metadata operation'[Latch for]
Blockers
~~~~~~~~

        Above is a list of all the processes. If they are waiting for a resource
        then it will be given in square brackets. Below is a summary of the
        waited upon resources, together with the holder of that resource.
        Notes:
        ~~~~~
         o A process id of '???' implies that the holder was not found in the
           systemstate.

                    Resource Holder State
                   Latch for    ??? Blocker

Object Names
~~~~~~~~~~~~
   Latch for                    write state=waiting for log force

It turns out that if a process hangs while holding the FAL queue latch, no other process can acquire that latch, so every process that tries to get it hangs as well. And if the hung process is also holding the controlfile transaction, the hang spreads beyond the ARCH processes. That is what happened here: the archiver process hung while holding the FAL queue latch and eventually reported ORA-00494: enqueue [CF] held for too long (more than 900 seconds). This is Bug 7432601.

kksfbc child completion

It has been a long time since I blogged, but here is a small issue we ran into a few weeks back.

One particular job was taking a long time and was waiting on the following wait event:


                                                       Wait Wait
  SID Username     Event                              hsecs  is
----- ------------ -------------------------------- ------- ----
 1637 XXXXX        kksfbc child completion                5 prev

Here are some more details. As you can see, the session was mostly on CPU and not moving forward.
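
The @sw output that follows comes from a small session-wait helper script; a minimal equivalent against V$SESSION looks roughly like this (a sketch, not the actual script):

-- Sketch: what an "sw"-style script reports for one session, straight
-- from V$SESSION; 1637 is the SID in the output below.
SELECT sid, username, state, event, seq#, seconds_in_wait,
       sql_id, sql_child_number, p1, p2, p3
  FROM v$session
 WHERE sid = 1637;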

server1:XYZ>@sw 1637

    SID USERN STATE   EVENT                                          SEQ# SEC_IN_STATE SQL_ID         CHN P1                        P2                        P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
   1637 XXXXX WORKING On CPU / runqueue                             29784       429481 31z2drgv36qpw      0                         0                         0

    SID USERN STATE   EVENT                                          SEQ# SEC_IN_STATE SQL_ID         CHN P1                        P2                        P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
   1637 XXXXX WORKING On CPU / runqueue                             29784       429481 31z2drgv36qpw      0                         0                         0

    SID USERN STATE   EVENT                                          SEQ# SEC_IN_STATE SQL_ID         CHN P1                        P2                        P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
   1637 XXXXX WORKING On CPU / runqueue                             29784       429482 31z2drgv36qpw      0                         0                         0

    SID USERN STATE   EVENT                                          SEQ# SEC_IN_STATE SQL_ID         CHN P1                        P2                        P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
   1637 XXXXX WORKING On CPU / runqueue                             29784       429483 31z2drgv36qpw      0                         0                         0
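
The listing below ties the mutex sleeps back to the SQL statement. For cursor mutexes the MUTEX_IDENTIFIER is the hash value of the parent cursor, so a join like this (a sketch, not our exact query) produces it:

-- Sketch: map 'Cursor Pin' sleeps back to the statement they pile up on.
SELECT h.sleeps, h.gets, h.mutex_type, h.mutex_identifier AS id,
       h.location, h.requesting_session AS req,
       h.blocking_session AS blk, a.sql_text
  FROM v$mutex_sleep_history h
  JOIN v$sqlarea a
    ON a.hash_value = h.mutex_identifier
 ORDER BY h.sleeps DESC;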

SLEEPS       GETS MUTEX_TYPE              ID GET_LOCATION                             REQ        BLK OBJECT_NAME
---------- ---------- --------------- ---------- --------------------------------- ---------- ---------- -------------------------------------
     53352 1682367421 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1649       1302 SELECT * from abc where id=:B1
     53351 1682317807 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1668       1676 SELECT * from abc where id=:B1
     53348 1682026882 Cursor Pin      2213265387 kksLockDelete [KKSCHLPIN6]              2018       1334 SELECT * from abc where id=:B1
     53347 1682026002 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1323       1393 SELECT * from abc where id=:B1
     53346 1681937885 Cursor Pin      2213265387 kksLockDelete [KKSCHLPIN6]              1283       1408 SELECT * from abc where id=:B1
     53345 1681850426 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1283       1323 SELECT * from abc where id=:B1
     53344 1681743220 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     2048       1679 SELECT * from abc where id=:B1
     53343 1681468292 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1495       1686 SELECT * from abc where id=:B1
     53342 1681409080 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     2018       1669 SELECT * from abc where id=:B1
     53339 1681267876 Cursor Pin      2213265387 kksfbc [KKSCHLFSP2]                     1714       1668 SELECT * from abc where id=:B1
server1:XYZ>select * from V$mutex_sleep;

MUTEX_TYPE                       LOCATION                                     SLEEPS  WAIT_TIME
-------------------------------- ---------------------------------------- ---------- ----------
Cursor Parent                    kkspsc0 [KKSPRTLOC27]                            20         37
Cursor Parent                    kkspsc0 [KKSPRTLOC26]                          3440       4288
Cursor Parent                    kksfbc [KKSPRTLOC2]                              97       1311
Cursor Parent                    kksfbc [KKSPRTLOC1]                              36         79
Cursor Pin                       kksLockDelete [KKSCHLPIN6]                    13544      55385
Cursor Pin                       kksxsccmp [KKSCHLPIN5]                           15      48900
Cursor Pin                       kkslce [KKSCHLPIN2]                           22641  209586425
Cursor Pin                       kksfbc [KKSCHLPIN1]                              46     166792
Cursor Pin                       kksfbc [KKSCHLFSP2]                          106768  632164142

So it seems that this particular SQL was burning CPU in the parse phase; only after parsing does it release the runtime locks, and in between it was simply spinning on CPU.
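
Since kksSearchChildList walks the child-cursor list of the parent cursor, it is worth checking how many child cursors the statement had piled up; a sketch of that check:

-- Sketch: a very high VERSION_COUNT for this SQL_ID would make the child
-- search expensive; V$SQL_SHARED_CURSOR shows why children were not shared.
SELECT sql_id, version_count, executions
  FROM v$sqlarea
 WHERE sql_id = '31z2drgv36qpw';

SELECT *
  FROM v$sql_shared_cursor
 WHERE sql_id = '31z2drgv36qpw';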

This looks like Bug 6795880 "Session spins / OERI after 'kksfbc child completion' wait" (since superseded):

“A session may go into an infinite spin just after a wait for ‘kksfbc child completion’. The spin occurs with a stack including kksSearchChildList -> kkshgnc where kksSearchChildList loops forever.”