Wednesday 3 October 2012

Host Power-On issue in Oracle T4-1 server from ILOM prompt : Solved

This is the first time I have faced such an issue with Oracle T-series servers. Starting from the T1000, I have worked on different T-series SPARC machines such as the T2000, T3-1, T3-2 and T4-1, and they are all like my pets. I feel something more than ordinary pleasure when I work on T-series and M-series SPARC servers. Unfortunately, M-series servers are no longer on the market; only support remains, and even that has almost reached EOSL.

While working on these servers, I came to know that if there is even a small hardware issue, ILOM doesn't allow the host to run POST at all, and you stay at the ILOM prompt.

There is an easy way to solve this issue, which I have already used with a positive result. It's just like "cheating the server": you clear the recorded faults so that ILOM lets the host power on. Everything is described in detail below, with the commands and their outputs.

When you try to start the server, the issue appears as shown below:


-> start /SYS
Are you sure you want to start /SYS (y/n)? y
start: System faults or hardware configuration prevents power on.

What you need to do is:

1) Go to the fault management shell to use the fmadm utility, so that we can find out whether there are any hardware issues

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

*********************************************************************************

2) You can see the hardware issues by entering the command fmadm faulty; it is the same output that you can see from run level 3

faultmgmtsp> fmadm faulty
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-09-26/07:24:59 dc873a7b-f661-ca2f-ae23-f59753bff70c SPT-8000-DH    Critical

Fault class : fault.chassis.voltage.fail

FRU         : /SYS/MB
              (Part Number: 7015924)
              (Serial Number: 465769T+1220BW09L3)

Description : A chassis voltage supply is operating outside of the
              allowable range.

Response    : The system will be powered off.  The chassis-wide service
              required LED will be illuminated.

Impact      : The system is not usable until repaired.  ILOM will not allow
              the system to be powered on until repaired.

Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-10-03/04:47:56 47f0af46-19ce-c28d-ae6e-a01e19522e79 SPT-8000-5X    Major

Fault class : fault.chassis.env.power.loss

FRU         : /SYS/PS0
              (Part Number: 300-2235)
              (Serial Number: B70386)

Description : A power supply AC input voltage failure has occurred.

Response    : The service-required LED on the affected power supply and
              chassis will be illuminated.

Impact      : Server will be powered down when there are insufficient
              operational power supplies.

Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-07-17/10:03:18 01877670-70dc-667e-928b-c13be3cac7da SPT-8000-MJ    Critical

Fault class : fault.chassis.power.fail

FRU         : /SYS/PS1
              (Part Number: 300-2235)
              (Serial Number: B70387)

Description : A Power Supply has failed and is not providing power to the
              server.

Response    : The service required LED on the chassis and on the affected
              Power Supply may be illuminated.

Impact      : Server will be powered down when there are insufficient
              operational power supplies

Action      : The administrator should review the ILOM event log for
              additional information pertaining to this diagnosis.  Please
              refer to the Details section of the Knowledge Article for
              additional information.

faultmgmtsp>
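
If you save the output above to a text file, the list of faulty FRUs for the next step can be pulled out with a small script. This is only a rough sketch: the function names parse_fmadm_faulty and faulty_frus and the SAMPLE text are my own for illustration, and the parsing assumes the output layout is exactly like the one shown above (a record line with time, UUID, msgid and severity, followed by "Fault class :" and "FRU :" lines). It does not talk to ILOM at all.

```python
import re

# First line of a fault record, e.g.
# 2012-09-26/07:24:59 dc873a7b-... SPT-8000-DH    Critical
RECORD_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+(\S+)\s+(\S+)\s+(\S+)$"
)


def parse_fmadm_faulty(text):
    """Return one dict per fault record found in saved `fmadm faulty` output."""
    faults = []
    current = None
    for line in text.splitlines():
        match = RECORD_RE.match(line.strip())
        if match:
            time, uuid, msgid, severity = match.groups()
            current = {"time": time, "uuid": uuid, "msgid": msgid,
                       "severity": severity, "fault_class": None, "fru": None}
            faults.append(current)
            continue
        if current is None:
            continue  # header or separator lines before the first record
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key == "Fault class":
            current["fault_class"] = value.strip()
        elif sep and key == "FRU":
            current["fru"] = value.strip()
    return faults


def faulty_frus(faults):
    """Unique FRU paths, in the order they first appear."""
    seen = []
    for fault in faults:
        if fault["fru"] and fault["fru"] not in seen:
            seen.append(fault["fru"])
    return seen


# Shortened copy of the output shown above (assumed layout).
SAMPLE = """\
------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2012-09-26/07:24:59 dc873a7b-f661-ca2f-ae23-f59753bff70c SPT-8000-DH    Critical

Fault class : fault.chassis.voltage.fail

FRU         : /SYS/MB
              (Part Number: 7015924)

2012-10-03/04:47:56 47f0af46-19ce-c28d-ae6e-a01e19522e79 SPT-8000-5X    Major

Fault class : fault.chassis.env.power.loss

FRU         : /SYS/PS0
              (Part Number: 300-2235)
"""

if __name__ == "__main__":
    for fru in faulty_frus(parse_fmadm_faulty(SAMPLE)):
        print(fru)
```

The FRU paths it prints are the ones you need to note down for step 3.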

*********************************************************************************

3) After getting the output, note the faulty FRUs, go back to the ILOM prompt and set the property "clear_fault_action=true" for each of them

faultmgmtsp> exit
-> set /SYS/MB clear_fault_action=true
Are you sure you want to clear /SYS/MB (y/n)? y
Set 'clear_fault_action' to 'true'

-> set /SYS/PS0 clear_fault_action=true
Are you sure you want to clear /SYS/PS0 (y/n)? y
Set 'clear_fault_action' to 'true'

-> set /SYS/PS1 clear_fault_action=true
Are you sure you want to clear /SYS/PS1 (y/n)? y
Set 'clear_fault_action' to 'true'


*********************************************************************************

4) Once this property is set to true, go back to the fault management shell and repair the FRUs using the commands below

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n)? y

faultmgmtsp> fmadm repair /SYS/MB
faultmgmtsp> fmadm repair /SYS/PS0
faultmgmtsp> fmadm repair /SYS/PS1
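
When there are many faulty FRUs, typing the same pair of commands for each one gets tedious, so the command list for steps 3 to 5 can be generated instead. The sketch below only builds the text of the commands shown in this post; the function name build_clear_plan is made up for illustration, it does not connect to the service processor, and you still have to answer the (y/n) confirmation prompts yourself when you paste the lines.

```python
def build_clear_plan(frus):
    """Return the command lines from steps 3 to 5 for the given FRU paths."""
    plan = []
    # Step 3: at the ILOM prompt, mark each faulty FRU to be cleared.
    for fru in frus:
        plan.append(f"set {fru} clear_fault_action=true")
    # Step 4: enter the fault management shell and repair each FRU.
    plan.append("start /SP/faultmgmt/shell")
    for fru in frus:
        plan.append(f"fmadm repair {fru}")
    # Step 5: confirm nothing is left, then leave the shell.
    plan.append("fmadm faulty")
    plan.append("exit")
    return plan


if __name__ == "__main__":
    for command in build_clear_plan(["/SYS/MB", "/SYS/PS0", "/SYS/PS1"]):
        print(command)
```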


*********************************************************************************

5) Check whether there are any more faults in the server

faultmgmtsp> fmadm faulty
faultmgmtsp>
faultmgmtsp> exit


*********************************************************************************

6) Try starting the system; you will find that the error is resolved and the server is up and running

-> start /SYS
Are you sure you want to start /SYS (y/n)? y
Starting /SYS

-> start /HOST/console
Are you sure you want to start /HOST/console (y/n)? y

Serial console started.  To stop, type #.
[CPU 0:0:0] NOTICE:  Initializing TOD: 2012/10/03 05:44:10
[CPU 0:0:0] NOTICE:  Loaded ASR status DB data. Ver. 3.
[CPU 0:0:0] NOTICE:  Initializing TPM with:
                        tpm_enable = false
                        tpm_activate = false
                        tpm_forceclear = false
[CPU 0:0:0] NOTICE:  TPM found: Ver 1.2, Rev 1.2, SpecLevel 2, errataRev 0, VendorId 'IFX'
[CPU 0:0:0] NOTICE:  TPM initialized successfully. Current state is: disabled
[CPU 0:0:0] NOTICE:  Serial#:     000000000000002a.015948c07cda22a6
[CPU 0:0:0] NOTICE:  Version:     003e003012030607
[CPU 0:0:0] NOTICE:  T4 Revision: 1.2
..............................................................

bash-3.2# prtdiag -v |more
System Configuration:  Oracle Corporation  sun4v SPARC T4-1

*********************************************************************************

That's it.. Very simple....huh.... :P

11 comments:

  1. Hey Joseph,

    I had the exact issue with T4-1, since its brand new didnt think it could be hardware failure. Your solution was bang on.

    Thanks
    Sri

  2. Worked like a charm on my T4-1

    Thx,

    Pete

  3. Excellent. Worked for me. Good job on your writing.

  4. Good job on writing this. It helped me very quickly.

  5. good job.
    it is working for T5-2 as well
    michal.

  6. it is working fine for T5-2 as well.

  7. This is exacly that I got with my T4-1, thanks)

  8. Thank you,
    It worked fine for T5-2. But why this has happened? Is there any fault on MB? Will this occur again in future?

    thanks in advance.
