Getting overrun with FM errors on Sunfire X4170
Added by Matthew Goheen over 3 years ago
I'm trying to test out the Community Edition on a new Sunfire X4170 system. I have the software installed onto the internal SAS hardware mirror. The system also has two QLogic QLE2560 Fibre Channel adapters. Only one is hooked up, and currently no Fibre Channel disks are zoned for the system -- there is no storage allocated other than the system disk.
I'm getting about 1000 lines per second logged to /var/fm/fmd/errlog. A small sample of "fmdump -e":
May 18 16:31:42.3871 ereport.io.pciex.pl.re
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.3880 ereport.io.pciex.pl.re
May 18 16:31:42.3957 ereport.io.pci.fabric
May 18 16:31:42.3957 ereport.io.pci.fabric
May 18 16:31:42.3957 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.3957 ereport.io.pciex.pl.re
May 18 16:31:42.4027 ereport.io.pci.fabric
May 18 16:31:42.4027 ereport.io.pci.fabric
May 18 16:31:42.4027 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4027 ereport.io.pciex.pl.re
May 18 16:31:42.4121 ereport.io.pci.fabric
May 18 16:31:42.4122 ereport.io.pci.fabric
May 18 16:31:42.4121 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4122 ereport.io.pciex.pl.re
May 18 16:31:42.4228 ereport.io.pci.fabric
May 18 16:31:42.4229 ereport.io.pci.fabric
May 18 16:31:42.4228 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4229 ereport.io.pciex.pl.re
May 18 16:31:42.4386 ereport.io.pci.fabric
May 18 16:31:42.4386 ereport.io.pci.fabric
May 18 16:31:42.4386 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.4386 ereport.io.pciex.pl.re
May 18 16:31:42.4423 ereport.io.pci.fabric
May 18 16:31:42.4423 ereport.io.pci.fabric
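With this much volume it can help to tally the log by event class rather than read it line by line. The following is a minimal Python sketch, assuming the two-column "timestamp class" layout of "fmdump -e" shown in this post; the function name count_classes and the embedded SAMPLE text are illustrative only and are not part of any FMA tool.

```python
import re
from collections import Counter

# One "fmdump -e" event per line: "<Mon> <day> <hh:mm:ss.frac> <ereport class>".
# This layout is assumed from the sample in the post.
EVENT_RE = re.compile(r"^(\w{3} +\d+ [\d:.]+)\s+(ereport\.\S+)$")

def count_classes(text):
    """Return a Counter mapping each ereport class to its number of events."""
    counts = Counter()
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:  # silently skip headers and anything that is not an event
            counts[match.group(2)] += 1
    return counts

# First five events copied from the post, used here as test input.
SAMPLE = """\
May 18 16:31:42.3871 ereport.io.pciex.pl.re
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pci.fabric
May 18 16:31:42.3880 ereport.io.pciex.rc.ce-msg
May 18 16:31:42.3880 ereport.io.pciex.pl.re
"""

if __name__ == "__main__":
    for cls, n in count_classes(SAMPLE).most_common():
        print(n, cls)
```

In practice the input would be the saved output of "fmdump -e" rather than the embedded sample.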
Here's a sample "fmdump -eV":
TIME                           CLASS
May 18 2010 16:31:42.371497612 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x79550cfed1802001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3410@9
(end detector)
bdf = 0x48
device_id = 0x3410
vendor_id = 0x8086
rev_id = 0x13
dev_type = 0x40
pcie_off = 0x90
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x47
pci_bdg_sec_status = 0x0
pci_bdg_ctrl = 0x3
pcie_status = 0x0
pcie_command = 0x27
pcie_dev_cap = 0x8021
pcie_adv_ctl = 0x0
pcie_ue_status = 0x0
pcie_ue_mask = 0x100000
pcie_ue_sev = 0x62030
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x0
pcie_ce_mask = 0x0
pcie_rp_status = 0x0
pcie_rp_control = 0x0
pcie_adv_rp_status = 0x1
pcie_adv_rp_command = 0x7
pcie_adv_rp_ce_src_id = 0x0
pcie_adv_rp_ue_src_id = 0x0
remainder = 0x1
severity = 0x1
__ttl = 0x1
__tod = 0x4bf2f92e 0x16249a8c
May 18 2010 16:31:42.371514172 ereport.io.pci.fabric
nvlist version: 0
class = ereport.io.pci.fabric
ena = 0x79550d02ef102001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3410@9/pci1077,170@0
(end detector)
bdf = 0x1900
device_id = 0x2532
vendor_id = 0x1077
rev_id = 0x2
dev_type = 0x0
pcie_off = 0x4c
pcix_off = 0x0
aer_off = 0x100
ecc_ver = 0x0
pci_status = 0x10
pci_command = 0x147
pcie_status = 0x1
pcie_command = 0x4037
pcie_dev_cap = 0x10648183
pcie_adv_ctl = 0xa0
pcie_ue_status = 0x0
pcie_ue_mask = 0x180000
pcie_ue_sev = 0x62030
pcie_ue_hdr0 = 0x0
pcie_ue_hdr1 = 0x0
pcie_ue_hdr2 = 0x0
pcie_ue_hdr3 = 0x0
pcie_ce_status = 0x1
pcie_ce_mask = 0x0
remainder = 0x0
severity = 0x3
__ttl = 0x1
__tod = 0x4bf2f92e 0x1624db3c
May 18 2010 16:31:42.371497612 ereport.io.pciex.rc.ce-msg
nvlist version: 0
ena = 0x79550cfed1802001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3410@9
(end detector)
class = ereport.io.pciex.rc.ce-msg
rc-status = 0x1
__ttl = 0x1
__tod = 0x4bf2f92e 0x16249a8c
May 18 2010 16:31:42.371514172 ereport.io.pciex.pl.re
nvlist version: 0
ena = 0x79550d02ef102001
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3410@9/pci1077,170@0
(end detector)
class = ereport.io.pciex.pl.re
dev-status = 0x1
ce-status = 0x1
__ttl = 0x1
__tod = 0x4bf2f92e 0x1624db3c
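Two fields in the dump above are easier to read once decoded. The Python sketch below assumes that "bdf" uses the standard PCI bus/device/function packing (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and that "pcie_ce_status" / "ce-status" mirror the PCIe AER Correctable Error Status register; neither assumption is confirmed by the log itself, and the helper names are made up for illustration. Under those assumptions, bdf 0x1900 is bus 25, device 0, function 0, and ce-status 0x1 is the Receiver Error bit, which would fit the pl.re (physical layer, receiver error) class.

```python
# Assumption: standard PCI BDF packing, bus[15:8] dev[7:3] func[2:0].
def decode_bdf(bdf):
    """Split a 16-bit BDF value into (bus, device, function)."""
    return (bdf >> 8) & 0xFF, (bdf >> 3) & 0x1F, bdf & 0x7

# Bit positions of the PCIe AER Correctable Error Status register.
# Assumption: the ereport's ce-status field is a raw copy of that register.
CE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode_ce_status(value):
    """Return the names of the correctable-error bits set in value."""
    return [name for bit, name in sorted(CE_BITS.items()) if value & (1 << bit)]

if __name__ == "__main__":
    for bdf in (0x48, 0x1900):  # the two bdf values from the dump
        print(hex(bdf), decode_bdf(bdf))
    print(decode_ce_status(0x1))  # ce-status reported by the QLogic card
```

The decoded values line up with the rest of the post: 0x48 gives device 9, matching pci8086,3410@9, and 0x1900 gives bus 25, matching pciexbus=25 in the "fmadm faulty" FRU path.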
Here is "fmdump":
TIME                 UUID                                 SUNW-MSG-ID
May 11 19:32:22.6437 f241c9a1-5510-ccdb-88ef-9d2d4de14000 PCIEX-8000-KP
May 11 19:32:28.5353 83fb009f-d8b5-483b-a0a3-f51baf5d4860 PCIEX-8000-KP
May 11 20:12:47.7208 83fb009f-d8b5-483b-a0a3-f51baf5d4860 FMD-8000-58 Updated
May 11 20:12:56.9098 4890c273-fdc2-c5f5-e32b-a4b6bf86c188 PCIEX-8000-KP
May 11 20:22:34.7249 f241c9a1-5510-ccdb-88ef-9d2d4de14000 FMD-8000-58 Updated
May 11 20:22:40.3303 b65ba66a-649e-421f-9e95-f90cd9085290 PCIEX-8000-KP
Here is "fmadm faulty":
TIME EVENT-ID MSG-ID SEVERITY
May 11 20:22:40 b65ba66a-649e-421f-9e95-f90cd9085290 PCIEX-8000-KP Major
Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
          faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0)
      faulty
Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with this fault
Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.
TIME EVENT-ID MSG-ID SEVERITY
May 11 20:12:56 4890c273-fdc2-c5f5-e32b-a4b6bf86c188 PCIEX-8000-KP Major
Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr 67%
              fault.io.pciex.bus-linkerr-corr 33%
Affects : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
          faulted but still in service
FRU : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0)
      faulty
Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with this fault
Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.
TIME EVENT-ID MSG-ID SEVERITY
May 11 19:32:28 83fb009f-d8b5-483b-a0a3-f51baf5d4860 PCIEX-8000-KP Major
Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,3410@9/pci1077,170@0
          ok and in service
          dev:////pci@0,0/pci8086,3410@9
          faulted but still in service
FRU : "PCIe 2" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=4/pciexrc=4/pciexbus=25/pciexdev=0) max 50%
      acquitted
      "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0) 25%
      faulty
Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with this fault
Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.
TIME EVENT-ID MSG-ID SEVERITY
May 11 19:32:22 f241c9a1-5510-ccdb-88ef-9d2d4de14000 PCIEX-8000-KP Major
Host : myhost
Platform : SUN-FIRE-X4170-SERVER
Chassis_id : 1015XF50CF
Product_sn :
Fault class : fault.io.pciex.device-interr-corr max 50%
              fault.io.pciex.bus-linkerr-corr 25%
Affects : dev:////pci@0,0/pci8086,340e@7/pci1077,170@0
          ok and in service
          dev:////pci@0,0/pci8086,340e@7
          faulted but still in service
FRU : "PCIe 1" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0/hostbridge=3/pciexrc=3/pciexbus=19/pciexdev=0) max 50%
      acquitted
      "MB" (hc://:product-id=SUN-FIRE-X4170-SERVER:server-id=myhost:chassis-id=1015XF50CF/motherboard=0) 25%
      faulty
Description : Too many recovered bus errors have been detected, which indicates a problem with the specified bus or with the specified transmitting device. This may degrade into an unrecoverable fault. Refer to http://sun.com/msg/PCIEX-8000-KP for more information.
Response : One or more device instances may be disabled
Impact : Loss of services provided by the device instances associated with this fault
Action : If a plug-in card is involved check for badly-seated cards or bent pins. Otherwise schedule a repair procedure to replace the affected device. Use fmadm faulty to identify the device or contact Sun for support.
Given the above, we tried reseating both of the QLogic cards, but there has been no change in the behavior. Should I call Sun Support about this, or is there something else I need to do?
This is NexentaStor 3.0.2.
Thanks, Matt Goheen
Replies
RE: Getting overrun with FM errors on Sunfire X4170 - Added by Matthew Goheen over 3 years ago
I hate things that reformat what I type...sigh...
Let me know if you want me to try posting that all again (if it will help).
Thanks, Matt