Getting overrun with FM errors on Sunfire X4170

Added by Matthew Goheen over 3 years ago

I am trying to test out the Community Edition on a new Sunfire X4170 system. I have the software installed on the internal SAS hardware mirror. The system also has two QLogic QLE2560 Fibre Channel adapters. Only one is hooked up, and currently no Fibre Channel disks are zoned for the system; there is no storage allocated other than the system disk.

I'm getting about 1000 lines per second logged to /var/fm/fmd/errlog. A small sample of "fmdump -e":

May 18 16:31:42.3871 ereport.io.pciex.pl.re
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.3880 ereport.io.pciex.pl.re
May 18 16:31:42.3957 ereport.io.pci.fabric
May 18 16:31:42.3957 ereport.io.pci.fabric
May 18 16:31:42.3957 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.3957 ereport.io.pciex.pl.re
May 18 16:31:42.4027 ereport.io.pci.fabric
May 18 16:31:42.4027 ereport.io.pci.fabric
May 18 16:31:42.4027 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4027 ereport.io.pciex.pl.re
May 18 16:31:42.4121 ereport.io.pci.fabric
May 18 16:31:42.4122 ereport.io.pci.fabric
May 18 16:31:42.4121 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4122 ereport.io.pciex.pl.re
May 18 16:31:42.4228 ereport.io.pci.fabric
May 18 16:31:42.4229 ereport.io.pci.fabric
May 18 16:31:42.4228 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4229 ereport.io.pciex.pl.re
May 18 16:31:42.4386 ereport.io.pci.fabric
May 18 16:31:42.4386 ereport.io.pci.fabric
May 18 16:31:42.4386 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4386 ereport.io.pciex.pl.re
May 18 16:31:42.4423 ereport.io.pci.fabric
May 18 16:31:42.4423 ereport.io.pci.fabric
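
With this many events, a per-class tally is easier to read than the raw log. The following is only a rough sketch in Python, not an FMA tool: it assumes the "fmdump -e" output has been saved as text, and the helper name count_ereport_classes is made up for illustration. It looks for ereport class names anywhere in the text, so it does not depend on how the lines are wrapped.

```python
import re
from collections import Counter

# Matches an ereport class name such as "ereport.io.pciex.rc.ce-msg".
EREPORT_RE = re.compile(r"ereport\.[A-Za-z0-9_.\-]+")


def count_ereport_classes(text):
    """Return a Counter mapping each ereport class found in text to its count.

    Hypothetical helper for illustration; it only scans text and does not
    talk to fmd or read /var/fm/fmd/errlog directly.
    """
    return Counter(EREPORT_RE.findall(text))


# First five events from the sample above.
sample = (
    "May 18 16:31:42.3871 ereport.io.pciex.pl.re "
    "May 18 16:31:42.3880 ereport.io.pci.fabric "
    "May 18 16:31:42.3880 ereport.io.pci.fabric "
    "May 18 16:31:42.3880 ereport.io.pciex.rc.ce-msg "
    "May 18 16:31:42.3880 ereport.io.pciex.pl.re"
)

for cls, n in count_ereport_classes(sample).most_common():
    print(f"{n:4d}  {cls}")
```

Run against a longer capture, the same tally would show whether the flood is the same four-event pattern repeating or whether other classes are mixed in.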

Here's a sample "fmdump -eV":

TIME CLASS
May 18 2010 16:31:42.371497612 ereport.io.pci.fabric
nvlist version: 0
    class = ereport.io.pci.fabric
    ena = 0x79550cfed1802001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = dev
        device-path = /pci@0,0/pci8086,3410@9
    (end detector)

    bdf = 0x48
    device_id = 0x3410
    vendor_id = 0x8086
    rev_id = 0x13
    dev_type = 0x40
    pcie_off = 0x90
    pcix_off = 0x0
    aer_off = 0x100
    ecc_ver = 0x0
    pci_status = 0x10
    pci_command = 0x47
    pci_bdg_sec_status = 0x0
    pci_bdg_ctrl = 0x3
    pcie_status = 0x0
    pcie_command = 0x27
    pcie_dev_cap = 0x8021
    pcie_adv_ctl = 0x0
    pcie_ue_status = 0x0
    pcie_ue_mask = 0x100000
    pcie_ue_sev = 0x62030
    pcie_ue_hdr0 = 0x0
    pcie_ue_hdr1 = 0x0
    pcie_ue_hdr2 = 0x0
    pcie_ue_hdr3 = 0x0
    pcie_ce_status = 0x0
    pcie_ce_mask = 0x0
    pcie_rp_status = 0x0
    pcie_rp_control = 0x0
    pcie_adv_rp_status = 0x1
    pcie_adv_rp_command = 0x7
    pcie_adv_rp_ce_src_id = 0x0
    pcie_adv_rp_ue_src_id = 0x0
    remainder = 0x1
    severity = 0x1
    __ttl = 0x1
    __tod = 0x4bf2f92e 0x16249a8c

May 18 2010 16:31:42.371514172 ereport.io.pci.fabric
nvlist version: 0
    class = ereport.io.pci.fabric
    ena = 0x79550d02ef102001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = dev
        device-path = /pci@0,0/pci8086,3410@9/pci1077,170@0
    (end detector)

    bdf = 0x1900
    device_id = 0x2532
    vendor_id = 0x1077
    rev_id = 0x2
    dev_type = 0x0
    pcie_off = 0x4c
    pcix_off = 0x0
    aer_off = 0x100
    ecc_ver = 0x0
    pci_status = 0x10
    pci_command = 0x147
    pcie_status = 0x1
    pcie_command = 0x4037
    pcie_dev_cap = 0x10648183
    pcie_adv_ctl = 0xa0
    pcie_ue_status = 0x0
    pcie_ue_mask = 0x180000
    pcie_ue_sev = 0x62030
    pcie_ue_hdr0 = 0x0
    pcie_ue_hdr1 = 0x0
    pcie_ue_hdr2 = 0x0
    pcie_ue_hdr3 = 0x0
    pcie_ce_status = 0x1
    pcie_ce_mask = 0x0
    remainder = 0x0
    severity = 0x3
    __ttl = 0x1
    __tod = 0x4bf2f92e 0x1624db3c

May 18 2010 16:31:42.371497612 ereport.io.pciex.rc.ce-msg
nvlist version: 0
    ena = 0x79550cfed1802001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = dev
        device-path = /pci@0,0/pci8086,3410@9
    (end detector)

    class = ereport.io.pciex.rc.ce-msg
    rc-status = 0x1
    __ttl = 0x1
    __tod = 0x4bf2f92e 0x16249a8c

May 18 2010 16:31:42.371514172 ereport.io.pciex.pl.re
nvlist version: 0
    ena = 0x79550d02ef102001
    detector = (embedded nvlist)
    nvlist version: 0
        version = 0x0
        scheme = dev
        device-path = /pci@0,0/pci8086,3410@9/pci1077,170@0
    (end detector)

    class = ereport.io.pciex.pl.re
    dev-status = 0x1
    ce-status = 0x1
    __ttl = 0x1
    __tod = 0x4bf2f92e 0x1624db3c
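
A few of the raw hex fields above can be decoded by hand. The sketch below makes some assumptions: that __tod is a (seconds since the Unix epoch, nanoseconds) pair, that bdf uses the usual PCI layout of 8 bits of bus, 5 bits of device and 3 bits of function, and that pcie_ce_status / ce-status mirror the PCIe AER Correctable Error Status register. The function names are made up for illustration and only a subset of the status bits is listed.

```python
from datetime import datetime, timezone

# Subset of the PCIe AER Correctable Error Status bits (bit number -> name).
AER_CE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_tod(sec, nsec):
    """Assumes __tod is (seconds since the Unix epoch, nanoseconds)."""
    return datetime.fromtimestamp(sec, tz=timezone.utc), nsec


def decode_bdf(bdf):
    """Split a 16-bit PCI bus/device/function value into (bus, dev, func)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x07


def decode_ce_status(value):
    """Return the names of the correctable-error bits set in value."""
    return [name for bit, name in sorted(AER_CE_BITS.items())
            if value & (1 << bit)]


# Values copied from the ereports above.
when, nsec = decode_tod(0x4bf2f92e, 0x1624db3c)
print(when.strftime("%Y-%m-%d %H:%M:%S"), "UTC,", nsec, "ns")
print("root port bdf 0x48   ->", decode_bdf(0x48))
print("QLogic bdf 0x1900    ->", decode_bdf(0x1900))
print("ce-status 0x1        ->", decode_ce_status(0x1))
```

Under those assumptions, the nanosecond part of __tod matches the fractional seconds printed in the event header, bdf 0x1900 decodes to bus 25, which lines up with pciexbus=25 in the "PCIe 2" FRU path, and ce-status 0x1 is the Receiver Error bit, consistent with the ereport.io.pciex.pl.re class.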

Here is "fmdump":

TIME                 UUID                                 SUNW-MSG-ID
May 11 19:32:22.6437 f241c9a1-5510-ccdb-88ef-9d2d4de14000 PCIEX-8000-KP
May 11 19:32:28.5353 83fb009f-d8b5-483b-a0a3-f51baf5d4860 PCIEX-8000-KP
May 11 20:12:47.7208 83fb009f-d8b5-483b-a0a3-f51baf5d4860 FMD-8000-58 Updated
May 11 20:12:56.9098 4890c273-fdc2-c5f5-e32b-a4b6bf86c188 PCIEX-8000-KP
May 11 20:22:34.7249 f241c9a1-5510-ccdb-88ef-9d2d4de14000 FMD-8000-58 Updated
May 11 20:22:40.3303 b65ba66a-649e-421f-9e95-f90cd9085290 PCIEX-8000-KP

Here is "fmadm faulty":


TIME            EVENT-ID                             MSG-ID        SEVERITY
May 11 20:22:40 b65ba66a-649e-421f-9e95-f90cd9085290 PCIEX-8000-KP Major

Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :

Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
              faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0)
              faulty

Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with this fault

Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.


TIME            EVENT-ID                             MSG-ID        SEVERITY
May 11 20:12:56 4890c273-fdc2-c5f5-e32b-a4b6bf86c188 PCIEX-8000-KP Major

Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :

Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
              faulted but still in service
FRU : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0)
              faulty

Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with this fault

Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.


TIME            EVENT-ID                             MSG-ID        SEVERITY
May 11 19:32:28 83fb009f-d8b5-483b-a0a3-f51baf5d4860 PCIEX-8000-KP Major

Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :

Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
              ok and in service
          dev:////pci@0,0/pci8086,3410@9
              faulted but still in service
FRU : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0) max 50%
              acquitted
      "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0) 25%
              faulty

Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with this fault

Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.


TIME            EVENT-ID                             MSG-ID        SEVERITY
May 11 19:32:22 f241c9a1-5510-ccdb-88ef-9d2d4de14000 PCIEX-8000-KP Major

Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :

Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
              ok and in service
          dev:////pci@0,0/pci8086,340e@7
              faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0) max 50%
              acquitted
      "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0) 25%
              faulty

Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.

Response : One or more device instances may be disabled

Impact : Loss of services provided by the device instances associated with this fault

Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.

Given the above, we tried reseating both of the QLogic cards, but there has been no change in the behavior. Shall I call Sun Support for this, or is there something else I need to do?

This is NexentaStor 3.0.2.

Thanks, Matt Goheen


Replies

RE: Getting overrun with FM errors on Sunfire X4170 - Added by Matthew Goheen over 3 years ago

I hate things that reformat what I type...sigh...

Let me know if you want me to try posting that all again (if it will help).

Thanks, Matt