Archive
ORA-15075
We faced issue with ASM 2 Weeks Back and here is small background behind this. This is Oracle 11.1.0.7.0 and using ASMLIB. Changed name of original disk and other stuff to avoid policy violation.
We wanted to have some more space on lefthand SCSI so SA’s Added One 50GB Disk. While adding so he accidently overwrite DISK1 which was already in use and thanks to ASM it invalidated DISK1 and rebalanced to other disks. Issue didn’t cause any outage and we were happy till we have maintenence for this database.
Now, this time we wanted to move this Disk to SAN and we are using external reduncency for same. So, best way is to add disk and remove disk without any downtime needed. Below is the output before starting any work. If you can see that DISK1 is marked as CANDIDATE disk and which now won’t exist as of the problem mention above.
GROUP_NUMBER DISK_NUMBER MOUNT_S HEADER_STATU MODE_ST STATE NAME LABEL PATH
------------ ----------- ------- ------------ ------- -------- ------------------------------ ------------------------------- ----------------------
1 0 CACHED CANDIDATE ONLINE NORMAL DISK001 DISK001 ORCL:DISK001
1 1 CACHED MEMBER ONLINE NORMAL DISK002 DISK002 ORCL:DISK002
1 2 CACHED MEMBER ONLINE NORMAL DISK003 DISK003 ORCL:DISK003
1 3 CACHED MEMBER ONLINE NORMAL DISK004 DISK004 ORCL:DISK004
1 4 CACHED MEMBER ONLINE NORMAL DISK005 DISK005 ORCL:DISK005
1 5 CACHED MEMBER ONLINE NORMAL DISK006 DISK006 ORCL:DISK006
2 0 CACHED MEMBER ONLINE NORMAL DISK007 DISK007 ORCL:DISK007
2 1 CACHED MEMBER ONLINE NORMAL DISK008 DISK008 ORCL:DISK008
0 0 CACHED PROVISIONED ONLINE NORMAL DISK009 ORCL:DISK009
0 1 CACHED PROVISIONED ONLINE NORMAL DISK010 ORCL:DISK010
There we about 25-30 Disk provisioned so cut the other output. But as you can see that this two disk are ready to add. While Adding this disk we got this error.
SQL> ALTER DISKGROUP XXXXXXDATA ADD DISK 'ORCL:DISK009' SIZE 51200M , 'ORCL:DISK010' SIZE 51200M NOTE: Assigning number (1,6) to disk (ORCL:DISK009) NOTE: Assigning number (1,7) to disk (ORCL:DISK010) NOTE: requesting all-instance membership refresh for group=1 NOTE: initializing header on grp 1 disk DISK009 NOTE: initializing header on grp 1 disk DISK010 NOTE: cache opening disk 6 of grp 1: DISK009 label:DISK009 NOTE: cache opening disk 7 of grp 1: DISK010 label:DISK010 NOTE: requesting all-instance disk validation for group=1 Wed May 25 17:36:50 2011 NOTE: disk validation pending for group 1/0xb4587064 (XXXXXXDATA) SUCCESS: validated disks for 1/0xb4587064 (XXXXXXDATA) ERROR: ORA-15075 signalled during reconfiguration of diskgroup XXXXXXDATA NOTE: membership refresh pending for group 1/0xb4587064 (XXXXXXDATA) kfdp_query(XXXXXXDATA): 7 Wed May 25 17:36:56 2011 kfdp_queryBg(): 7 kfdp_query(XXXXXXDATA): 8 kfdp_queryBg(): 8 NOTE: cache closing disk 6 of grp 1: DISK009 label:DISK009 NOTE: cache closing disk 6 of grp 1: DISK009 label:DISK009 NOTE: De-assigning number (1,6) from disk (ORCL:DISK009) NOTE: cache closing disk 7 of grp 1: DISK010 label:DISK010 NOTE: cache closing disk 7 of grp 1: DISK010 label:DISK010 NOTE: De-assigning number (1,7) from disk (ORCL:DISK010) kfdp_query(XXXXXXDATA): 9 kfdp_queryBg(): 9 SUCCESS: refreshed membership for 1/0xb4587064 (XXXXXXDATA) Wed May 25 17:36:59 2011 ORA-15032: not all alterations performed ORA-15075: disk(s) are not visible cluster-wide ERROR: ALTER DISKGROUP XXXXXXDATA ADD DISK 'ORCL:DISK009' SIZE 51200M , 'ORCL:DISK010' SIZE 51200M Wed May 25 17:37:07 2011 SQL> ALTER DISKGROUP XXXXXXDATA ADD DISK 'ORCL:DISK009' SIZE 51200M , 'ORCL:DISK010' SIZE 51200M NOTE: Assigning number (1,8) to disk (ORCL:DISK009) NOTE: Assigning number (1,9) to disk (ORCL:DISK010) NOTE: requesting all-instance membership refresh for group=1 NOTE: De-assigning number (1,8) from disk (ORCL:DISK009) NOTE: De-assigning number (1,9) from disk (ORCL:DISK010) ERROR: ORA-15033 signalled during reconfiguration of diskgroup XXXXXXDATA
There could be many reasons for this like scandisk not ran on all nodes, permission etc. For our case multipath was not set maybe. But for us we need to remove this two disk and let SA’s correct the issue and then try to add them back.
SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK009' ORA-15032: not all alterations performed ORA-15054: disk "ORCL:DISK009" does not exist in diskgroup "XXXXXXDATA" ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK009' Wed May 25 18:32:48 2011 SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' ORA-15032: not all alterations performed ORA-15054: disk "ORCL:DISK010" does not exist in diskgroup "XXXXXXDATA" ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' Wed May 25 18:33:04 2011 SQL> alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' force ORA-15032: not all alterations performed ORA-15054: disk "ORCL:DISK010" does not exist in diskgroup "XXXXXXDATA" ERROR: alter diskgroup XXXXXXDATA drop disk 'ORCL:DISK010' force
So, you can’t remove this disk. As to remove disk you have to format it’s header using something like this
dd if=/dev/zero of=ORCL:DISK009 bs=4096 count=5000
But, as we didn’t have any data we planned to take some downtime as this was QA database and let SA’s remove SAN and re-configure as it needed. But, When they rebooted box issue got more worse we were not able to see DATA diskgroup and ASM was not able to mount that also.
SQL> ALTER DISKGROUP ALL MOUNT NOTE: cache registered group XXXXXXDATA number=1 incarn=0xb7f88ac9 NOTE: cache began mount (first) of group XXXXXXDATA number=1 incarn=0xb7f88ac9 NOTE: cache registered group XXXXXXFLASH number=2 incarn=0xb7f88aca NOTE: cache began mount (first) of group XXXXXXFLASH number=2 incarn=0xb7f88aca NOTE:Loaded lib: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Assigning number (1,1) to disk (ORCL:DISK002) NOTE: Assigning number (1,2) to disk (ORCL:DISK003) NOTE: Assigning number (1,3) to disk (ORCL:DISK004) NOTE: Assigning number (1,4) to disk (ORCL:DISK005) NOTE: Assigning number (1,5) to disk (ORCL:DISK006) NOTE: Assigning number (1,8) to disk (ORCL:DISK009) NOTE: Assigning number (1,9) to disk (ORCL:DISK010) ERROR: no PST quorum in group 1: required 1, found 0 NOTE: cache dismounting group 1/0xB7F88AC9 (XXXXXXDATA) NOTE: dbwr not being msg'd to dismount NOTE: lgwr not being msg'd to dismount NOTE: cache dismounted group 1/0xB7F88AC9 (XXXXXXDATA) NOTE: cache ending mount (fail) of group XXXXXXDATA number=1 incarn=0xb7f88ac9 kfdp_dismount(): 2 kfdp_dismountBg(): 2 NOTE: De-assigning number (1,1) from disk (ORCL:DISK002) NOTE: De-assigning number (1,2) from disk (ORCL:DISK003) NOTE: De-assigning number (1,3) from disk (ORCL:DISK004) NOTE: De-assigning number (1,4) from disk (ORCL:DISK005) NOTE: De-assigning number (1,5) from disk (ORCL:DISK006) NOTE: De-assigning number (1,8) from disk (ORCL:DISK009) NOTE: De-assigning number (1,9) from disk (ORCL:DISK010) ERROR: diskgroup XXXXXXDATA was not mounted NOTE: Assigning number (2,0) to disk (ORCL:DISK007) NOTE: Assigning number (2,1) to disk (ORCL:DISK008) NOTE: start heartbeating (grp 2) kfdp_query(XXXXXXFLASH): 5 kfdp_queryBg(): 5 NOTE: cache opening disk 0 of grp 2: DISK007 label:DISK007 NOTE: F1X0 found on disk 0 fcn 0.0 NOTE: cache opening disk 1 of grp 2: DISK008 label:DISK008 NOTE: cache mounting (first) group 2/0xB7F88ACA (XXXXXXFLASH) * allocate domain 2, invalid = TRUE kjbdomatt send to node 1 NOTE: attached to recovery domain 2 NOTE: cache recovered group 2 to fcn 0.144346 NOTE: LGWR attempting to mount thread 1 for diskgroup 2 NOTE: LGWR mounted thread 1 for disk group 2 NOTE: opening chunk 1 at fcn 0.144298 ABA NOTE: seq=13 blk=8268 NOTE: cache mounting group 2/0xB7F88ACA (XXXXXXFLASH) succeeded NOTE: cache ending mount (success) of group XXXXXXFLASH number=2 incarn=0xb7f88aca kfdp_query(XXXXXXFLASH): 6 kfdp_queryBg(): 6 NOTE: Instance updated compatible.asm to 10.1.0.0.0 for grp 2 SUCCESS: diskgroup XXXXXXFLASH was mounted ORA-15032: not all alterations performed ORA-15063: ASM discovered an insufficient number of disks for diskgroup "XXXXXXDATA" ERROR: ALTER DISKGROUP ALL MOUNT
So, Now the issue is the DISK1 Which was wrongly Labeled last time oracle is trying to find that and now it’s not able to find so and it had assigned that DISK35 header value to that which was wrongly overwritten but it’s actually 01 Disk.
----------------------------- DISK REPORT N0008 ------------------------------
Disk Path: ORCL:DISK035
Unique Disk ID:
Disk Label: DISK035
Physical Sector Size: 512 bytes
Disk Size: 51200 megabytes
** NOT A VALID ASM DISK HEADER. BAD VALUE IN FIELD blksize_kfdhdb **
kfbh.endian: 0 ; 0x000: 0x00
kfbh.hard: 0 ; 0x001: 0x00
kfbh.type: 0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt: 0 ; 0x003: 0x00
kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0
kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0
kfbh.check: 0 ; 0x00c: 0x00000000
kfbh.fcn.base: 0 ; 0x010: 0x00000000
kfbh.fcn.wrap: 0 ; 0x014: 0x00000000
kfbh.spare1: 0 ; 0x018: 0x00000000
kfbh.spare2: 0 ; 0x01c: 0x00000000
But Now comes the Cool Part. Opened SR1 and come to know somebody might know this but for me it was New. As, we know here that disk header was wiped out
ORACLE Keep backup of Disk Header Information structure of AUNUM 1 and Blocknum 254. From that you can retrieve this information. So, now KFED was the one to rescue for merging this information.
kfed read /dev/d1 aun=1 blkn=254 text=/tmp/d1.log
kfed merge /dev/d1 text=/tmp/d1.log
And then Run Scandisk. That’s it everthing Came up Good.
Strange ASM Hang Issue
Sometime Back there was strange ASM issue. Actually Users were complaining about connectivity Issue. At first Look everything seems to be Okay but after looking more found that ASM instance was hanging and this is for Oracle 11.1.0.7.0.
Well, Oracle Support filed this as a BUG but solution was to bounce DB and ASM instance and we faced 10 Min Downtime. And Strange part is it Affected DR Site Also with same issue.
Here are some details
Current Wait Stack:
0: waiting for 'enq: CF - contention'
name|mode=43460004, 0=0, operation=0
wait_id=13621 seq_num=13737 snap_id=1
wait times: snap=13 min 44 sec, exc=13 min 44 sec, total=13 min 44 sec
wait times: max=15 min 0 sec
wait counts: calls=1646 os=1646
in_wait=1 iflags=0x15a0
There is at least one session blocking this session.
Dumping one blocker:
inst: 1, sid: 1358, ser: 1
Wait State:
auto_close=0 flags=0x22 boundary=(nil)/-1
So, Controlfile Enqueue Was held more then Expected.
Current Wait Stack:
1: waiting for 'KSV master wait'
=0, =0, =0
wait_id=6434 seq_num=7633 snap_id=2
wait times: snap=2 min 0 sec, exc=12 min 3 sec, total=12 min 4 sec
wait times: max=infinite
wait counts: calls=242 os=242
in_wait=1 iflags=0x15a0
0: waiting for 'ASM file metadata operation'
msgop=11, locn=0, =0
wait_id=6432 seq_num=7630 snap_id=2
wait times: snap=0.000000 sec, exc=0.000012 sec, total=12 min 7 sec
wait times: max=infinite
wait counts: calls=0 os=0
in_wait=1 iflags=0x1520
Wait State:
auto_close=0 flags=0x23 boundary=(nil)/-1
And was held my archiver process. And If we see systemstate Dump
System State 2
~~~~~~~~~~~~~~~~
1:
2: 0: waiting for 'pmon timer'
3: 0: waiting for 'VKTM Logical Idle Wait'
4:
5: 0: waiting for 'PING'
6: 0: waiting for 'rdbms ipc message'
7: 0: waiting for 'DIAG idle wait'
8: 0: waiting for 'rdbms ipc message'
9: 0: waiting for 'ges remote message'
10: 0: waiting for 'gcs remote message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'GCS lock cvt S'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'ASM file metadata operation'
20: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
21: 0: waiting for 'ASM file metadata operation'
22: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
23: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
24: 0: waiting for 'ASM file metadata operation'[Enqueue FA-00000002-00000168]
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Enqueue FA-00000002-00000168 19: 0: waiting for 'ASM file metadata operation'
Enqueue FA-00000002-00000168 21: 0: waiting for 'ASM file metadata operation'
Enqueue FA-00000002-00000168 24: 24: is waiting for 19: 21: 24:
System State 9
~~~~~~~~~~~~~~~~
1:
2:
3: 0: waiting for 'VKTM Logical Idle Wait'
4:
5: 0: waiting for 'PING'
6: 0: waiting for 'rdbms ipc message'
7:
8: 0: waiting for 'rdbms ipc message'
9: 0: waiting for 'ges remote message'
10: 0: waiting for 'gcs remote message'
11: 0: waiting for 'rdbms ipc message'
12: 0: waiting for 'rdbms ipc message'
13: 0: waiting for 'rdbms ipc message'
14: 0: waiting for 'rdbms ipc message'
15: 0: waiting for 'rdbms ipc message'
16: 0: waiting for 'rdbms ipc message'
17: 0: waiting for 'rdbms ipc message'
18: 0: waiting for 'rdbms ipc message'
19: 0: waiting for 'ASM network foreground exits'
20: 0: waiting for 'log write(odd)'
24: 0: waiting for 'ASM file metadata operation'
25: 0: waiting for 'ASM file metadata operation'
26: 0: waiting for 'ASM file metadata operation'[Latch for]
27: 0: waiting for 'ASM file metadata operation'
28: 0: waiting for 'buffer busy' [Latch for]
Cmd: Select
29: 0: waiting for 'ASM file metadata operation'
30: 0: waiting for 'ASM file metadata operation'
31: 0: waiting for 'ASM file metadata operation'
33: 0: waiting for 'ASM file metadata operation'
34: 0: waiting for 'ASM file metadata operation'[Latch for]
35: 0: waiting for 'ASM file metadata operation'[Latch for]
36: 0: waiting for 'ASM file metadata operation'[Latch for]
37: 0: waiting for 'ASM file metadata operation'[Latch for]
Blockers
~~~~~~~~
Above is a list of all the processes. If they are waiting for a resource
then it will be given in square brackets. Below is a summary of the
waited upon resources, together with the holder of that resource.
Notes:
~~~~~
o A process id of '???' implies that the holder was not found in the
systemstate.
Resource Holder State
Latch for ??? Blocker
Object Names
~~~~~~~~~~~~
Latch for write state=waiting for log force
Well Came to know that If a process hangs holding the FAL queue latch, no other process will be able to acquire the FAL queue latch. As a result, all processes that attempt to acquire the FAL queue latch will hang. If a process is holding the controlfile transaction, the hang will broaden beyond ARCH processes. The Archiver Process hangs while holding the FAL queue latch and then in turn reports the error ORA-00494: enqueue [CF] held for too long (more than 900 seconds). Bug 7432601
kksfbc child completion
It’s been long time that i have blogged. But here is the small issue which was encounter few weeks back
One particular job was taking long time and was waiting on following Wait.
Wait Wait
SID Username Event hsecs is
----- ------------ -------------------------------- ------- ----
1637 XXXXX kksfbc child completion 5 prev
Here are some more details. As you can see this session was mostly on CPU and not moving forward.
server1:XYZ>@sw 1637
SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
1637 XXXXX WORKING On CPU / runqueue 29784 429481 31z2drgv36qpw 0 0 0
SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
1637 XXXXX WORKING On CPU / runqueue 29784 429481 31z2drgv36qpw 0 0 0
SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
1637 XXXXX WORKING On CPU / runqueue 29784 429482 31z2drgv36qpw 0 0 0
SID USERN STATE EVENT SEQ# SEC_IN_STATE SQL_ID CHN P1 P2 P3
------- ----- ------- ---------------------------------------- ---------- ------------ ------------- ---- ------------------------- ------------------------- -------------------------
1637 XXXXX WORKING On CPU / runqueue 29784 429483 31z2drgv36qpw 0 0 0
SLEEPS GETS MUTEX_TYPE ID GET_LOCATION REQ BLK OBJECT_NAME
---------- ---------- --------------- ---------- --------------------------------- ---------- ---------- -------------------------------------
53352 1682367421 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1649 1302 SELECT * from abc where id=:B1
53351 1682317807 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1668 1676 SELECT * from abc where id=:B1
53348 1682026882 Cursor Pin 2213265387 kksLockDelete [KKSCHLPIN6] 2018 1334 SELECT * from abc where id=:B1
53347 1682026002 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1323 1393 SELECT * from abc where id=:B1
53346 1681937885 Cursor Pin 2213265387 kksLockDelete [KKSCHLPIN6] 1283 1408 SELECT * from abc where id=:B1
53345 1681850426 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1283 1323 SELECT * from abc where id=:B1
53344 1681743220 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 2048 1679 SELECT * from abc where id=:B1
53343 1681468292 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1495 1686 SELECT * from abc where id=:B1
53342 1681409080 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 2018 1669 SELECT * from abc where id=:B1
53339 1681267876 Cursor Pin 2213265387 kksfbc [KKSCHLFSP2] 1714 1668 SELECT * from abc where id=:B1
server1:XYZ>select * from V$mutex_sleep;
MUTEX_TYPE LOCATION SLEEPS WAIT_TIME
-------------------------------- ---------------------------------------- ---------- ----------
Cursor Parent kkspsc0 [KKSPRTLOC27] 20 37
Cursor Parent kkspsc0 [KKSPRTLOC26] 3440 4288
Cursor Parent kksfbc [KKSPRTLOC2] 97 1311
Cursor Parent kksfbc [KKSPRTLOC1] 36 79
Cursor Pin kksLockDelete [KKSCHLPIN6] 13544 55385
Cursor Pin kksxsccmp [KKSCHLPIN5] 15 48900
Cursor Pin kkslce [KKSCHLPIN2] 22641 209586425
Cursor Pin kksfbc [KKSCHLPIN1] 46 166792
Cursor Pin kksfbc [KKSCHLFSP2] 106768 632164142
So, it seems that that particular SQL was burning CPU while in parsing phase and after that it releases runtime locks. In between that it was just burning CPU.
This Seems Like Bug 6795880 Session spins / OERI after ‘kksfbc child completion’ wait – superceded.
“A session may go into an infinite spin just after a wait for ‘kksfbc child completion’. The spin occurs with a stack including kksSearchChildList -> kkshgnc where kksSearchChildList loops forever.”