Significant Performance Degradation after Drive Failures
Added by Sean L. about 1 year ago
Everything was chugging along just great; I was seeing read/write rates in the 200 MB/sec range. At some point, I had a couple of drive failures. I replaced the drives, the volume was resilvered, and it now shows as 'OK'.
Here is the problem. When I write to a shared volume (shared via NFS or CIFS), I get write rates in the 50-150 KB/sec range. When I read, I see maybe 1-5 MB/sec over the network (iostat occasionally flashes 7-9 MB/sec). Also, anything related to drive commands is extremely slow, whereas other commands, like 'show version', respond instantly.
nmc@xen01:/VOL01/Video$ show version
NMS version: 3.1.2-8147 (r9697)
NMC version: 3.1.2-8147 (r9697)
NMV version: 3.1.2-8147 (r9697)
Release Date: Feb 28 2012
Operating System: Nexenta/OpenSolaris (version 3.1.2)
Copyright (c) 2005-2011 Nexenta Systems, Inc. All rights reserved.
For instance, the command below took about 15 seconds to execute:
nmc@xen01:/VOL01/Video$ zpool status
  pool: VOL01
 state: ONLINE
  scan: resilvered 8.02G in 1h50m with 0 errors on Fri Dec 9 10:33:54 2011
config:

        NAME         STATE     READ WRITE CKSUM
        VOL01        ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c0t10d0  ONLINE       0     0     0
            c0t11d0  ONLINE       0     0     0
            c0t12d0  ONLINE       0     0     0
            c0t13d0  ONLINE       0     0     0
            c0t14d0  ONLINE       0     0     0
            c0t15d0  ONLINE       0     0     0
            c0t31d0  ONLINE       0     0     0
            c0t30d0  ONLINE       0     0     0
          raidz2-1   ONLINE       0     0     0
            c0t16d0  ONLINE       0     0     0
            c0t17d0  ONLINE       0     0     0
            c0t18d0  ONLINE       0     0     0
            c0t19d0  ONLINE       0     0     0
            c0t20d0  ONLINE       0     0     0
            c0t21d0  ONLINE       0     0     0
            c0t22d0  ONLINE       0     0     0
            c0t23d0  ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
  scan: scrub repaired 0 in 0h9m with 0 errors on Sun Feb 26 03:09:17 2012
config:

        NAME          STATE     READ WRITE CKSUM
        syspool       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c0t8d0s0  ONLINE       0     0     0
            c0t9d0s0  ONLINE       0     0     0

errors: No known data errors
raidz2-0 is made up of 640 GB drives; raidz2-1 is made up of 1 TB drives.
This is while reading a 30 GB file from the array (via NFS), initially:
nmc@xen01:/VOL01/Video$ show VOL01/ iostat
               capacity     operations    bandwidth
name        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
VOL01       8.45T  3.46T      3      4   285K  27.1K
VOL01       8.45T  3.46T      0     30      0  77.5K
VOL01       8.45T  3.46T     76      4  9.50M  18.0K
VOL01       8.45T  3.46T      0      0      0      0
VOL01       8.45T  3.46T      0      0      0      0
VOL01       8.45T  3.46T      0      0      0      0
VOL01       8.45T  3.46T     62      0  7.87M      0
VOL01       8.45T  3.46T     43      0  5.38M      0
VOL01       8.45T  3.46T     15      0  2.00M      0
VOL01       8.45T  3.46T     61      0  7.75M      0
After about 5 minutes, it starts ramping up some (this never seems to happen with writes):
nmc@xen01:/VOL01/Video$ show VOL01/ iostat
               capacity     operations    bandwidth
name        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
VOL01       8.45T  3.46T      8      3   963K  26.3K
VOL01       8.45T  3.46T    346      0  43.0M      0
VOL01       8.45T  3.46T    416      0  51.7M      0
VOL01       8.45T  3.46T    257      0  32.0M      0
VOL01       8.45T  3.46T    325      0  40.4M      0
VOL01       8.45T  3.46T    444      0  55.2M      0
VOL01       8.45T  3.46T    257      0  32.0M      0
VOL01       8.45T  3.46T    257      0  32.0M      0
VOL01       8.45T  3.46T    500      0  62.1M      0
VOL01       8.45T  3.46T    257      0  32.0M      0
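To put a number on the difference between the two samples above, here is a minimal Python sketch that averages the read bandwidth column. The values are copied from the output; the helper names are made up for illustration. It assumes the NMC iostat view behaves like `zpool iostat` with an interval, where the first row is an average since boot, so that row is skipped.

```python
# Sketch only: average the "bandwidth read" column of the two iostat samples.
# Assumption: sizes use binary multiples (K = 1024 bytes), as zpool output does.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value):
    """Convert a zpool-style size such as '9.50M' or '0' to bytes."""
    if value[-1] in UNITS:
        return float(value[:-1]) * UNITS[value[-1]]
    return float(value)


def mean_read_mib(samples):
    """Mean read bandwidth in MiB/s, ignoring the first (since-boot) row."""
    rows = samples[1:]
    return sum(to_bytes(v) for v in rows) / len(rows) / 1024**2


# Read bandwidth column from the first sample (start of the NFS read).
initial = ["285K", "0", "9.50M", "0", "0", "0", "7.87M", "5.38M", "2.00M", "7.75M"]
# Read bandwidth column from the second sample (about 5 minutes in).
later = ["963K", "43.0M", "51.7M", "32.0M", "40.4M",
         "55.2M", "32.0M", "32.0M", "62.1M", "32.0M"]

print(f"initial: {mean_read_mib(initial):.1f} MiB/s")
print(f"later:   {mean_read_mib(later):.1f} MiB/s")
```

Even the ramped-up figure stays well below the roughly 200 MB/sec the pool delivered before the drive failures.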
Anyone have any thoughts on what I can do to troubleshoot this problem?
Replies
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
I ran an iostat -xn 10 while streaming some video off VOL01:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 18.0 0.0 73.0 0.0 0.0 0.0 0.9 0 0 c0t8d0
0.0 18.1 0.0 73.0 0.0 0.0 0.0 1.0 0 0 c0t9d0
9.4 0.4 608.1 0.0 0.0 0.1 0.1 13.5 0 3 c0t10d0
9.9 0.4 601.7 0.0 0.0 0.3 0.1 27.6 0 6 c0t11d0
9.6 0.4 608.2 0.0 0.0 0.1 0.1 12.4 0 3 c0t12d0
8.7 0.0 600.3 0.0 0.0 0.1 0.1 13.3 0 2 c0t13d0
8.1 0.4 790.5 0.0 0.0 7.3 0.1 856.1 0 100 c0t14d0
9.6 0.2 602.2 0.0 0.0 0.2 0.1 15.5 0 2 c0t15d0
8.6 0.0 602.5 0.0 0.0 0.1 0.1 15.8 0 2 c0t30d0
9.6 0.2 597.7 0.0 0.0 0.1 0.1 11.2 0 2 c0t31d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t16d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t17d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t18d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t19d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t20d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t21d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t22d0
0.5 0.2 1.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t23d0
Does the fact that c0t14d0 is showing a large asvc_t and %b indicate a problem with that drive?
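Spotting one bad row among many by eye gets tedious, so here is a minimal Python sketch that scans `iostat -xn` data rows and flags devices with a very high asvc_t or %b. The thresholds (100 ms and 90% busy) are arbitrary assumptions for illustration, not documented limits, and the function names are hypothetical.

```python
# Sketch only: flag devices that stand out in `iostat -xn` output.
# Thresholds are assumptions chosen for this example.


def parse_iostat_xn(text):
    """Return {device: (asvc_t, pct_busy)} from `iostat -xn` data rows."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 11 columns:
        # r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
        if len(fields) != 11:
            continue
        try:
            asvc_t, pct_busy = float(fields[7]), float(fields[9])
        except ValueError:
            continue  # the column header row is not numeric
        stats[fields[10]] = (asvc_t, pct_busy)
    return stats


def suspects(stats, max_asvc_ms=100.0, max_busy=90.0):
    """Devices whose service time or busy percentage crosses a threshold."""
    return sorted(dev for dev, (asvc, busy) in stats.items()
                  if asvc > max_asvc_ms or busy >= max_busy)


# Three rows copied from the output above.
SAMPLE = """\
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
8.7 0.0 600.3 0.0 0.0 0.1 0.1 13.3 0 2 c0t13d0
8.1 0.4 790.5 0.0 0.0 7.3 0.1 856.1 0 100 c0t14d0
9.6 0.2 602.2 0.0 0.0 0.2 0.1 15.5 0 2 c0t15d0
"""

print(suspects(parse_iostat_xn(SAMPLE)))
```

On the sample rows this singles out c0t14d0, the one disk whose service time is far above its neighbours even though they all handle about the same amount of I/O.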
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
For those following along at home, seems like that c0t14d0 drive is an issue...will replace it, and report back if that has solved things.
root@xen01:/var/adm# smartctl -H /dev/rdsk/c0t14d0
smartctl 5.41 2011-03-16 r3296 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
Failed Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 136 136 140 Pre-fail Always FAILING_NOW 512
For those of you wondering how to do the above, you need to:
- get to a root prompt:
  - ssh into your server and log on as root
  - option expert_mode = 1
  - !bash
- apt-get install smartmontools
- smartctl -a /dev/rdsk/c0t14d0 -s on
Go through all of your drives this way, and it will let you know if there are any SMART issues.
Also, I found iostat -nex 10 to be a good help, as it shows the errors reported by the drives in your system.
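Going through every drive by hand can be scripted. The following is a minimal Python sketch, assuming Python is available on the box and that the drives report an ATA-style "overall-health self-assessment" line like the one shown above. The function names are hypothetical; `smartctl -H` is the same command used earlier in this post.

```python
# Sketch only: run `smartctl -H` against a list of devices and report failures.
# Assumption: only the ATA-style health line is recognised; anything else
# is treated as "unknown" rather than failed.
import subprocess


def smart_failed(report):
    """True if `smartctl -H` output reports a failed health assessment."""
    for line in report.splitlines():
        if "overall-health self-assessment test result" in line:
            return "PASSED" not in line
    return False  # no health line found


def check_drive(device):
    """Run smartctl -H on one raw device path and return its text output."""
    # smartctl uses non-zero exit codes as status bits, so the exit code
    # is deliberately not treated as an error here.
    result = subprocess.run(["smartctl", "-H", device],
                            capture_output=True, text=True)
    return result.stdout


def scan(devices, run=check_drive):
    """Return the devices whose health assessment did not pass."""
    return [dev for dev in devices if smart_failed(run(dev))]
```

As root on the appliance, something like `scan([f"/dev/rdsk/c0t{n}d0" for n in range(10, 24)])` would cover targets 10 through 23 from the zpool status output. The `run` parameter exists so the logic can be tried with canned text on a machine without smartctl.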
RE: Significant Performance Degradation after Drive Failures - Added by Reuben Bryant about 1 year ago
Awesome fault finding. Thank you for the detailed account.
I am building a document with fault findings and resolutions from the front line for here at work :)
I have an issue at the moment where a process has a memory leak, so I am rebooting the SAN every 14 days. The joy :)
Again thank you.
Cheers Reuben
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
High %b (percent busy) alone would have been indication enough to replace the disk.
FWIW there IS a GUI counterpart for SMART Status.
http://www.nexenta.org/corp/images/stories/pdfs/autosmart-userguide
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Yikes...replaced the c0t14d0 drive, and it's currently resilvering...
root@xen01:/volumes# zpool status VOL01
pool: VOL01
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Mar 7 13:17:50 2012
8.54G scanned out of 8.43T at 363K/s, 6911h32m to go
1.03G resilvered, 0.10% done
6911 hours? How can I speed that up? I don't want to wait till Christmas for my array to be fast again :)
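For anyone wondering where a number like that comes from, it looks like plain arithmetic: data left to scan divided by the reported scan rate. Here is a minimal Python sketch of that calculation, assuming binary units (K = 1024 bytes) and that the estimate is simply remaining bytes over rate. The function names are made up for illustration.

```python
# Sketch only: reproduce the resilver time estimate from the status line.
# Assumption: estimate = (total - scanned) / rate, with binary size units.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_bytes(value):
    """Convert a size such as '8.43T' or '363K' to bytes."""
    return float(value[:-1]) * UNITS[value[-1]]


def hours_remaining(scanned, total, rate_per_sec):
    """Hours left at the given scan rate."""
    remaining = to_bytes(total) - to_bytes(scanned)
    return remaining / to_bytes(rate_per_sec) / 3600


hours = hours_remaining("8.54G", "8.43T", "363K")
print(f"about {hours:.0f} hours, or {hours / 24:.0f} days")
```

This lands within a fraction of a percent of the 6911h32m that zpool reports; the small gap is presumably down to rounding in the displayed sizes and rate.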
RE: Significant Performance Degradation after Drive Failures - Added by Linda Kateley about 1 year ago
This could be symptomatic of a hardware problem closer in than the disk.
Check whether there are any errors in the I/O channel by running
iostat -zxCnTd 5
and look for high busy, high wait, or high asvc_t on a particular device.
Throttling of the resilver can be tuned using
setup appliance nms property
Try setting:
vdev_max_pending to 4
zfs_resilver_delay to 0
zfs_resilver_min_time_ms to 4000
The original values are 10, 2, and 3000, respectively.
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Here's what I'm seeing on iostat -zxCnTd 5 (before making those zfs_resilver changes):
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
138.8 41.2 6008.5 503.8 0.0 6.7 0.0 37.3 0 143 c0
15.2 1.6 752.5 1.1 0.0 0.3 0.0 18.3 0 5 c0t10d0
17.2 1.6 752.5 1.1 0.0 0.2 0.0 12.0 0 5 c0t11d0
18.6 1.6 752.8 1.2 0.0 0.2 0.0 10.5 0 4 c0t12d0
16.4 1.4 748.5 1.1 0.0 0.3 0.0 19.1 0 6 c0t13d0
16.2 1.4 748.4 1.1 0.0 0.4 0.0 20.5 0 10 c0t14d0
17.4 1.4 748.3 1.0 0.0 0.3 0.0 13.3 0 5 c0t15d0
17.0 1.4 752.5 1.0 0.0 0.2 0.0 12.0 0 4 c0t30d0
16.0 1.4 748.2 1.0 0.0 0.3 0.0 15.9 0 5 c0t31d0
0.6 2.0 0.6 1.3 0.0 0.0 0.0 5.7 0 1 c0t16d0
0.6 2.0 0.6 1.4 0.0 0.0 0.0 4.8 0 1 c0t17d0
0.6 2.2 0.6 1.6 0.0 0.0 0.0 6.5 0 1 c0t18d0
0.6 2.2 0.6 1.5 0.0 0.0 0.0 16.0 0 3 c0t19d0
0.6 2.0 0.6 1.4 0.0 1.6 0.0 625.0 0 54 c0t20d0
0.6 1.8 0.6 1.2 0.0 0.0 0.0 6.1 0 1 c0t21d0
0.6 2.0 0.6 1.2 0.0 0.1 0.0 49.3 0 5 c0t22d0
0.6 1.8 0.6 1.2 0.0 0.0 0.0 6.8 0 1 c0t23d0
0.0 13.4 0.0 484.4 0.0 2.7 0.0 198.0 0 31 c0t24d0
Then, I made the zfs changes, and I'm seeing:
nmc@xen01:/$ zpool status
pool: VOL01
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Mar 7 13:17:50 2012
258G scanned out of 8.43T at 2.21M/s, 1078h33m to go
31.0G resilvered, 2.99% done
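To compare the progress line before and after the tuning, here is a minimal Python sketch that pulls the numbers out of the two "scanned out of" lines with a regular expression. The regex and function name are my own and only cover the line format seen in this thread.

```python
# Sketch only: parse resilver progress lines and compare scan rates.
# Assumption: the line format matches the zpool status output in this thread.
import re

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
SCAN_RE = re.compile(
    r"([\d.]+)([KMGT]) scanned out of ([\d.]+)([KMGT]) at ([\d.]+)([KMGT])/s"
)


def parse_scan(line):
    """Return (scanned_bytes, total_bytes, bytes_per_sec) from a progress line."""
    match = SCAN_RE.search(line)
    if match is None:
        raise ValueError("not a resilver progress line")
    parts = match.groups()
    return tuple(float(parts[i]) * UNITS[parts[i + 1]] for i in (0, 2, 4))


before = parse_scan("8.54G scanned out of 8.43T at 363K/s, 6911h32m to go")
after = parse_scan("258G scanned out of 8.43T at 2.21M/s, 1078h33m to go")

percent_done = 100 * after[0] / after[1]
speedup = after[2] / before[2]
print(f"{percent_done:.2f}% scanned, reported scan rate up {speedup:.1f}x")
```

The computed percentage agrees with the "2.99% done" in the status output. As far as I know, the reported rate is an average over the whole resilver so far, so the rate since the tuning change is likely higher than this ratio suggests.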
and then on iostat:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
139.8 19.6 4227.6 676.0 0.0 6.2 0.0 39.1 0 193 c0
0.2 0.0 0.1 0.0 0.0 0.0 0.0 0.2 0 0 c0t8d0
0.2 0.0 0.1 0.0 0.0 0.0 0.0 0.1 0 0 c0t9d0
18.0 0.4 504.5 0.0 0.0 0.1 0.0 4.9 0 3 c0t10d0
18.4 0.4 504.6 0.0 0.0 0.1 0.0 5.2 0 3 c0t11d0
18.8 0.4 504.6 0.0 0.0 0.1 0.0 5.1 0 3 c0t12d0
17.6 0.0 518.2 0.0 0.0 0.1 0.0 4.8 0 3 c0t13d0
7.4 0.4 675.8 0.0 0.0 4.1 0.0 521.1 0 98 c0t14d0
17.0 0.0 501.1 0.0 0.0 0.1 0.0 4.7 0 3 c0t15d0
16.2 0.0 501.2 0.0 0.0 0.1 0.0 6.0 0 4 c0t30d0
17.2 0.0 501.2 0.0 0.0 0.1 0.0 5.3 0 3 c0t31d0
0.6 0.0 0.3 0.0 0.0 0.0 0.0 17.4 0 1 c0t16d0
0.6 0.0 0.3 0.0 0.0 0.0 0.0 17.6 0 1 c0t17d0
0.6 0.0 0.3 0.0 0.0 0.0 0.0 0.1 0 0 c0t18d0
0.6 0.0 0.3 0.0 0.0 0.0 0.0 0.2 0 0 c0t19d0
1.4 0.4 3.7 0.0 0.0 0.8 0.0 442.6 0 50 c0t20d0
1.6 0.4 3.7 0.0 0.0 0.0 0.0 7.5 0 1 c0t21d0
1.6 0.4 3.7 0.0 0.0 0.0 0.0 0.1 0 0 c0t22d0
1.6 0.4 3.7 0.0 0.0 0.0 0.0 0.0 0 0 c0t23d0
0.2 16.4 0.1 676.0 0.0 0.7 0.0 41.8 0 22 c0t24d0
And note, Disk 14 is being replaced with disk 24.
Not sure what is up with disk 20? Should I read anything into the high asvc_t there?
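One way to judge whether disk 20 really stands out is to compare each disk's asvc_t with the median of the other disks in the same raidz2 group, since they should be doing similar work. A minimal Python sketch follows, using the asvc_t values from the second iostat output above; the factor of 10 and the 50 ms floor are arbitrary assumptions.

```python
# Sketch only: flag disks whose asvc_t is far above the median of their vdev.
# The factor and floor are assumptions picked for this example.
from statistics import median


def outliers(asvc_by_dev, factor=10.0, floor_ms=50.0):
    """Devices whose asvc_t exceeds floor_ms and factor times the group median."""
    mid = median(asvc_by_dev.values())
    return sorted(dev for dev, asvc in asvc_by_dev.items()
                  if asvc > floor_ms and asvc > factor * mid)


# asvc_t per device, copied from the iostat output taken after the tuning.
raidz2_0 = {"c0t10d0": 4.9, "c0t11d0": 5.2, "c0t12d0": 5.1, "c0t13d0": 4.8,
            "c0t14d0": 521.1, "c0t15d0": 4.7, "c0t30d0": 6.0, "c0t31d0": 5.3}
raidz2_1 = {"c0t16d0": 17.4, "c0t17d0": 17.6, "c0t18d0": 0.1, "c0t19d0": 0.2,
            "c0t20d0": 442.6, "c0t21d0": 7.5, "c0t22d0": 0.1, "c0t23d0": 0.0}

print(outliers(raidz2_0), outliers(raidz2_1))
```

By this measure c0t14d0 is the outlier in raidz2-0 and c0t20d0 in raidz2-1, even though c0t20d0 is handling almost no I/O at the time.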
RE: Significant Performance Degradation after Drive Failures - Added by Linda Kateley about 1 year ago
Yes. These all look like they are on the same controller? t20 and t14 are beyond saturated.
Can you show me zpool status -V?
It looks like you are down to 1000 hours on the resilver though :)
RE: Significant Performance Degradation after Drive Failures - Added by Dan Swartzendruber about 1 year ago
Woo hoo :)
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Linda, yes, only one controller; it's one of the Supermicro LSI 1068E HBA cards. It's in a SuperMicro 216E1 chassis, almost full with 2.5" SATA drives. Do those stats indicate that the controller is saturated? It doesn't seem like a lot of traffic on it.
Down to 600 hours :)
usage:
status [-vx] [-T d|u] [pool] ... [interval [count]]
root@xen01:/volumes# zpool status -v
  pool: VOL01
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Mar 7 13:17:50 2012
        559G scanned out of 8.43T at 3.78M/s, 606h52m to go
        40.5G resilvered, 6.48% done
config:

        NAME             STATE     READ WRITE CKSUM
        VOL01            ONLINE       0     0     0
          raidz2-0       ONLINE       0     0     0
            c0t10d0      ONLINE       0     0     0
            c0t11d0      ONLINE       0     0     0
            c0t12d0      ONLINE       0     0     0
            c0t13d0      ONLINE       0     0     0
            replacing-4  ONLINE       0     0     0
              c0t14d0    ONLINE       0     0     0  (resilvering)
              c0t24d0    ONLINE       0     0     0  (resilvering)
            c0t15d0      ONLINE       0     0     0
            c0t31d0      ONLINE       0     0     0
            c0t30d0      ONLINE       0     0     0
          raidz2-1       ONLINE       0     0     0
            c0t16d0      ONLINE       0     0     0
            c0t17d0      ONLINE       0     0     0
            c0t18d0      ONLINE       0     0     0
            c0t19d0      ONLINE       0     0     0
            c0t20d0      ONLINE       0     0     0
            c0t21d0      ONLINE       0     0     0
            c0t22d0      ONLINE       0     0     0
            c0t23d0      ONLINE       0     0     0

errors: No known data errors

  pool: syspool
 state: ONLINE
  scan: scrub repaired 0 in 0h10m with 0 errors on Sun Mar 4 03:10:21 2012
config:

        NAME          STATE     READ WRITE CKSUM
        syspool       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c0t8d0s0  ONLINE       0     0     0
            c0t9d0s0  ONLINE       0     0     0

errors: No known data errors
RE: Significant Performance Degradation after Drive Failures - Added by Linda Kateley about 1 year ago
Sorry, no, the controller isn't saturated; for that we look at wait in iostat. The disks, however, are.
It makes sense for disk c0t14d0 to have such a high load, but not for disk 20. That could be a sign of an impending failure.
One other command you can run to check is
fmadm faulty
or
fmdump -V
These can show impending failures
Something is really not right here..
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
I have that same chassis and was running LSI 3442E-R cards (1068 chipsets, if I remember right; all done by a Nexenta Partner), and back in December I had a rather large failure that started with a single syspool disk failing and ended up corrupting my main pool.
What Nexenta support eventually deduced is that the 3G SAS hardware in the SC216E1 and my cards should never have been on the HCL. I'm now using their similar 6Gb/s counterparts and not throwing nearly as many errors through the PHYs as I was before (tens, versus millions).
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Ryan, was there a swap out backplane for the SC216E1? Or just the controller card? What card are you using now?
It has been pretty odd, as I've had a number of drives 'fail' over the past few months. It started with 1, that was reported as an actual FAILURE in zpool status. Then, in the process of replacing that one, two other drives failed. Now, when I'm replacing this #14 drive, it looks like #20 may be failing as well?
These are drives from different manufacturers, different batches, all about 2 years old. This array really just serves my storage needs at home (lots of videos/music/pictures) that are streamed around the house. So, it's not like I'm using this in a production environment where the drives are constantly being hammered. Fans in the chassis are all operational, there do not seem to be any cooling issues either.
I will say, through all of this, I haven't lost any data (that I'm aware of..)
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
Swapped out all my 3G components.. backplane and two SAS cards.
Now running the backplane out of the SM SC216E16 chassis and a pair of 9211-4i's.
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
No issues putting that E16 backplane into the E1?
Thanks for the insight Ryan!
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
Nope. Pull all disks part way out to detach from backplane, spin out some screws on the backside, unhook some molex connectors/SAS connectors.. reverse process. :)
Might want to export your pool before you do it though.
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
Keep in mind the 9211-4i cards are 6Gb/s parts too...
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Cool, thanks Ryan.
Though, not sure I like having to spend another $750 on my home storage :)
RE: Significant Performance Degradation after Drive Failures - Added by Ryan W about 1 year ago
Yeah, I hear ya. I was lucky enough that my Nexenta Partner gave me the parts (it was in warranty). Otherwise I'd have had to explain to my boss why the SAN worked and now doesn't (without any real explanation) and that we need to spend $1k in hopes of salvaging the system.
FWIW, when mine failed, my pool was no longer importable. I could mount it read-only, so I moved the data over my network (at a snail's pace), dropped in the new hardware, made new pools from scratch, and moved the data back.
Worst two weeks of my life in the past year.
RE: Significant Performance Degradation after Drive Failures - Added by Sean L. about 1 year ago
Hmmm....now I can no longer ping the box. Checked in on the resilvering process, still around 600 hours. Fingers crossed :)