iSCSI Queue Depth Issue

Added by Jeff Gibson over 2 years ago

Good afternoon fellow nexenta-ers,

I'm hoping someone can point me in the right direction to find the next bottleneck in my system. The system is a Supermicro X8DTN+F with four Intel NICs dedicated to iSCSI (two onboard and two of the four ports on a PCIe card) and 72GB of RAM. I have the following disk setup:

zpool status -v Store
  pool: Store
 state: ONLINE
 scan: none requested
config:

        NAME                        STATE     READ WRITE CKSUM
        Store                       ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            c10t50000393B802D9E6d0  ONLINE       0     0     0
            c13t50000393B8035F6Ad0  ONLINE       0     0     0
            c16t50000393B802DA52d0  ONLINE       0     0     0
            c19t50000393B8035ABAd0  ONLINE       0     0     0
            c7t50000393B802DA92d0   ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            c12t50000393B802E0DAd0  ONLINE       0     0     0
            c15t50000393B802DA32d0  ONLINE       0     0     0
            c4t50000393B8037336d0   ONLINE       0     0     0
            c6t50000393B802DA2Ed0   ONLINE       0     0     0
            c9t50000393B802DA06d0   ONLINE       0     0     0
          raidz1-2                  ONLINE       0     0     0
            c11t50000393B8037356d0  ONLINE       0     0     0
            c14t50000393B801B65Ad0  ONLINE       0     0     0
            c17t50000393B802DA96d0  ONLINE       0     0     0
            c1t50000393B803ACD6d0   ONLINE       0     0     0
            c5t50000393B802E092d0   ONLINE       0     0     0
        logs
          c0t50015179595CCEE9d0     ONLINE       0     0     0
          c0t50015179595CD1B9d0     ONLINE       0     0     0
        cache
          c0t5001517A6BE6B48Bd0     ONLINE       0     0     0
          c0t5001517A6BE7289Ed0     ONLINE       0     0     0
        spares
          c18t50000393B802D9F6d0    AVAIL
          c2t50000393B8037386d0     AVAIL
          c8t50000393B802E0A6d0     AVAIL

errors: No known data errors

The problem I'm seeing is with how the system handles increasing queue depths on writes. From a single ESXi host running a test VM I'm able to achieve the following:

8k 100% sequential read QD=32 Align=8k -> 29,523 IOPs, 241.85 MB/s (Maxing the 2x iSCSI Ports), 1.08/17.63 Avg/Max IO time(ms)
8k 100% random read QD=32 Align=8k -> 28,894 IOPs, 236.7 MB/s, 1.11/35.37 Avg/Max IO time(ms)

So from here I can tell the iSCSI paths and the ARC (I doubt the L2ARC is getting anything at this point) are working as expected.
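
A quick way to sanity-check that claim is to watch the ARC hit/miss counters while the read test runs. A minimal sketch, assuming the standard illumos kstat names (which should also apply to NexentaStor):

# Print the cumulative ARC hit/miss counters every 5 seconds during the
# read benchmark; if the deltas between samples are almost all hits, the
# reads are being served from RAM rather than the pool.
kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses 5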

My next test is Writing:

8k 100% random Write QD=4 Align=8k -> 3,519 IOPs, 28.83 MB/s, 1.13/292.56 Avg/Max IO time(ms)
8k 100% random Write QD=16 Align=8k -> 7,838 IOPs, 64.22 MB/s, 2.04/335.77 Avg/Max IO time(ms)

OK, so far so good with the scaling. Now let's say I would like to read AND write from my disks at the same time (see the plot in the attached file). As long as I don't want more than about 5 or 6 outstanding IOs, I see a nice 6k+ IOPs (which scaled almost linearly from 1 outstanding IO). At 7 I've fallen 1k from the peak, and by 9 I'm just above where I started with 1 writer. At 16 I might as well have bought an external NAS, because I'm only getting about 575 IOPs and <4.5 MB/s. I saw the same scaling when I had these disks in a RAID0 stripe, so I don't believe the disk layout is to blame. While this abomination to speed is happening I see no more than 5%b on the disks and 0%b on the write cache (which I would have thought would be the bottleneck). For comparison, at QD=5 I see the SSDs at 10-15%b (which looks like the 3-4k max write IOPs) and the disks at 10-20%b.
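
For a per-vdev view of the same behaviour, plain zpool iostat breaks the pool out into the raidz groups, the log SSDs and the cache SSDs, which makes it easier to see which tier (if any) is actually backing up while the mixed test runs:

# Sample per-vdev read/write ops and bandwidth every 5 seconds; the
# "logs" and "cache" devices get their own rows alongside the raidz vdevs.
zpool iostat -v Store 5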

My question, if you're still reading at this point, is why? Why can a system pump out 6k IOPs at a depth of 5 but less than 600 at a depth of 32?

I am at a loss as to where to look for the next hurdle in the great bottleneck game.

IOmeter_Test.png - IOMeter test plot for increasing QD with 60% read / 65% random 8k blocks, 8k aligned (24.8 KB)


Replies

RE: iSCSI Queue Depth Issue - Added by Linda Kateley over 2 years ago

I have some people looking at this.. standby

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

Here is the output of iostat -zxCnTd 5 from the disks while running the test with 16 outstanding IOs:

Fri Mar  9 08:18:12 CST 2012
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   17.2  291.0   55.2 3215.5  0.0  0.1    0.0    0.3   0   2 c0
    4.2    2.8   13.8    7.2  0.0  0.0    0.0    0.0   0   0 c0t5001517A6BE7289Ed0
    4.2   11.4   13.8 1011.2  0.0  0.1    0.0    4.0   0   1 c0t5001517A6BE6B48Bd0
    0.2   19.2    0.1   49.3  0.0  0.0    0.0    0.2   0   0 c0t5001517BB278B21Fd0
    0.2   19.2    0.1   49.3  0.0  0.0    0.0    0.2   0   0 c0t5001517BB2770047d0
    4.2  119.0   13.8 1028.0  0.0  0.0    0.0    0.1   0   1 c0t50015179595CD1B9d0
    4.2  119.4   13.8 1070.4  0.0  0.0    0.0    0.1   0   1 c0t50015179595CCEE9d0
    4.2   18.2   13.8  187.6  0.0  0.1    0.0    5.4   0   3 c1
    4.2   18.2   13.8  187.6  0.0  0.1    0.0    5.4   0   3 c1t50000393B803ACD6d0
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c2
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c2t50000393B8037386d0
    4.2   13.2   13.8  114.9  0.0  0.1    0.0    6.5   0   2 c4
    4.2   13.2   13.8  114.9  0.0  0.1    0.0    6.5   0   2 c4t50000393B8037336d0
    4.2   18.2   13.8  188.1  0.0  0.1    0.0    4.2   0   2 c5
    4.2   18.2   13.8  188.1  0.0  0.1    0.0    4.2   0   2 c5t50000393B802E092d0
    4.2   14.6   13.8  115.2  0.0  0.1    0.0    3.7   0   2 c6
    4.2   14.6   13.8  115.2  0.0  0.1    0.0    3.7   0   2 c6t50000393B802DA2Ed0
    4.2   30.0   13.8  177.0  0.0  0.1    0.0    1.8   0   2 c7
    4.2   30.0   13.8  177.0  0.0  0.1    0.0    1.8   0   2 c7t50000393B802DA92d0
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c8
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c8t50000393B802E0A6d0
    4.2   17.0   13.8  114.9  0.0  0.1    0.0    5.1   0   2 c9
    4.2   17.0   13.8  114.9  0.0  0.1    0.0    5.1   0   2 c9t50000393B802DA06d0
    4.2   25.8   13.8  176.9  0.0  0.1    0.0    2.6   0   2 c10
    4.2   25.8   13.8  176.9  0.0  0.1    0.0    2.6   0   2 c10t50000393B802D9E6d0
    4.2   18.6   13.8  188.2  0.0  0.1    0.0    3.1   0   2 c11
    4.2   18.6   13.8  188.2  0.0  0.1    0.0    3.1   0   2 c11t50000393B8037356d0
    4.2   19.6   13.8  115.1  0.0  0.1    0.0    4.6   0   3 c12
    4.2   19.6   13.8  115.1  0.0  0.1    0.0    4.6   0   3 c12t50000393B802E0DAd0
    4.2   23.4   13.8  177.0  0.0  0.1    0.0    4.2   0   3 c13
    4.2   23.4   13.8  177.0  0.0  0.1    0.0    4.2   0   3 c13t50000393B8035F6Ad0
    4.2   22.4   13.8  187.8  0.0  0.1    0.0    5.2   0   2 c14
    4.2   22.4   13.8  187.8  0.0  0.1    0.0    5.2   0   2 c14t50000393B801B65Ad0
    4.2   19.2   13.8  115.1  0.0  0.1    0.0    3.7   0   3 c15
    4.2   19.2   13.8  115.1  0.0  0.1    0.0    3.7   0   3 c15t50000393B802DA32d0
    4.2   22.2   13.8  177.3  0.0  0.1    0.0    3.8   0   2 c16
    4.2   22.2   13.8  177.3  0.0  0.1    0.0    3.8   0   2 c16t50000393B802DA52d0
    4.2   18.8   13.8  188.1  0.0  0.1    0.0    5.1   0   2 c17
    4.2   18.8   13.8  188.1  0.0  0.1    0.0    5.1   0   2 c17t50000393B802DA96d0
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c18
    4.2    1.6   13.8    0.0  0.0  0.0    0.0    0.0   0   0 c18t50000393B802D9F6d0
    4.2   23.8   13.8  177.0  0.0  0.1    0.0    3.0   0   2 c19
    4.2   23.8   13.8  177.0  0.0  0.1    0.0    3.0   0   2 c19t50000393B8035ABAd0

IOMeter shows the service times as 33.93/1067 Avg/Max ms. CPU on the host doing the test is 2.43% and the nexenta machine shows 4-11% (single E5606 Xeon). This is the same system Andrew helped me with last week.
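
One thing iostat can't show directly is how hard the ZIL is being driven; if the zvol is presented with write caching disabled, the write side of this test should be hitting the log SSDs as synchronous ZIL commits. A rough DTrace sketch, assuming DTrace and the fbt provider are usable on this NexentaStor build and that zil_commit is the relevant entry point:

# Count zil_commit() calls in 5-second buckets while the benchmark runs.
# A rate that tracks the write IOPs means essentially every write is a
# synchronous ZIL commit; a much lower rate means writes are being batched.
dtrace -qn 'fbt::zil_commit:entry { @c = count(); }
            tick-5s { printa("zil_commit calls / 5s: %@d\n", @c); trunc(@c); }'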

RE: iSCSI Queue Depth Issue - Added by Ryan W over 2 years ago

If you're seeing low %b on the disks, have you considered that the iSCSI connection is the bottleneck? From what I have read and been told, there's not much point in aggregating iSCSI links, because of the way iSCSI was written it generally treats the connection as a single 1Gbps link rather than the sum of the aggregated links.

It's been a while since I read up on the subject, but aside from using something like round robin for multipathing (MPxIO) purposes, there weren't any throughput gains from aggregation.

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

I have MPIO set up as round robin inside VMware (no NIC teaming; each NIC has a separate IP on one of two segmented networks) and was able to achieve 230+ MB/s on reads, so I do believe the multipathing is working properly. I also saw this effect when I only had a single NIC configured in Nexenta and ESXi, but that was very early in testing and I would like to avoid going backwards on that front. Any other directions to look, or do you think that is the only place to check before going forward?
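
If it helps to double-check the path policy from the ESXi side, the NMP configuration is visible from the shell. A sketch using the ESXi 5.x esxcli syntax, with the naa ID taken from the vmkernel.log excerpt later in the thread:

# List each device's Path Selection Policy; round robin shows up as
# VMW_PSP_RR, and the working paths are listed underneath.
esxcli storage nmp device list

# Show the round-robin settings for the Nexenta LUN; by default ESXi only
# switches paths after 1000 IOs on the current path.
esxcli storage nmp psp roundrobin deviceconfig get --device=naa.600144f0d06fc10000004f5906070001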

I'm also interested in the mention in other threads of an ESXi 5.0.2 patch that makes VAAI work better. I've not yet seen a 5.0.1 patch, and haven't been able to find anything other than an EMC 5.0.2 patch that was supposed to make their VAAI primitives work better.

RE: iSCSI Queue Depth Issue - Added by Linda Kateley over 2 years ago

So when the queue gets too deep we usually see a little spike in CPU from trying to manage the queue. Are you seeing that?

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

In system CPU? No, it's not using more than 11% according to the general page.
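
One caveat: the general page only shows aggregate CPU, so it can hide a hot interrupt or a busy kernel thread. Per-CPU utilisation, cross-calls and mutex spins are visible with mpstat, which might be worth a look while the queue is deep:

# Per-CPU statistics every 5 seconds; watch for one CPU with usr+sys near
# 100%, or unusually high smtx (mutex spins) / xcal (cross-calls) columns
# while the deep-queue test is running.
mpstat 5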

RE: iSCSI Queue Depth Issue - Added by Linda Kateley over 2 years ago

Sorry, I realized you are adjusting the queue depth inside the VM; you will probably see it in the client.

I would like to start by saying that performance tuning is an art, not a science. You are always trying to find the knee, the place where you oversaturate one point in the system, and you tune different things in order to get optimal performance from your hardware. It seems you have found the knee for queue depth; now we might want to start looking at other tunables to adjust.

One of the things we would like to see is what happens if you leave the queue depth at its optimal value and then run that load from 10 VMs.

There is also another common question in tuning: does the test you are running match the workload you plan to run? We can always find the knee, but is it the right knee for your workload?

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

Yes, this is from IOMeter inside a guest VM. My problem is that the knee is inconsistent. Do you disagree that if I can read and write separately at a queue depth of up to 64 (I just retested my reads and writes and get 28k+ and 9k+ IOPs respectively), I should be able to combine the two at least up to a quarter of that depth without problems?

If I assume linear scaling, I should expect about 18.2k reads and 3.15k writes (65% of the 28k read result and 35% of the 9k write result) for a total of 21.35k IOPs at a depth of 64, instead of the 2.1k total I'm seeing. I'm not expecting to keep 64 IOs active, but I would expect 8-16 to be more common; in that case I would expect at least 17.55k reads + 2.38k writes = 19.93k total IOPs based on my separate tests.

Am I crazy to think that if I can achieve decent reads and writes at a depth of 64, I should be able to run a combined workload at a depth of 16?

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

Linda, do you have any other suggestions?

I'm not editing any queue depth settings on the VMware host or the Nexenta server. The only value I'm changing is inside IOMeter, to increase the number of outstanding IOs that are part of the benchmark/test.
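
Worth keeping in mind even with nothing changed: ESXi itself imposes per-device limits that cap how many of IOMeter's outstanding IOs actually reach the target, namely the Disk.SchedNumReqOutstanding advanced setting and the software iSCSI LUN queue depth. Both can at least be read from the ESXi shell; a sketch for 5.0, with the caveat that the exact module parameter name should be checked against VMware's documentation:

# Per-device cap on outstanding IOs when more than one VM shares the LUN
# (default 32 on ESXi 5.0).
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding

# Options currently set on the software iSCSI module; a changed LUN queue
# depth (the iscsivmk_LunQDepth parameter) would show up here.
esxcfg-module -g iscsi_vmk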

RE: iSCSI Queue Depth Issue - Added by Linda Kateley over 2 years ago

If you are trying to test scale, I would spin up more VMs and have them all run the same test at the peak queue depth.

RE: iSCSI Queue Depth Issue - Added by Jeff Gibson over 2 years ago

OK, to add more confusion to the mix: this morning I ran the read/write test again at QD=64 and performance was over 10x better than last week (no changes to the environment except time). I did notice the following in the /var/log/vmkernel.log file on the ESXi host at that high QD:

~ # tail /var/log/vmkernel.log
2012-03-12T12:38:02.974Z cpu3:2051)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 107987 microseconds to 20952 microseconds.
2012-03-12T12:38:03.837Z cpu2:2050)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 20952 microseconds to 10502 microseconds.
2012-03-12T12:38:11.205Z cpu4:2052)WARNING: ScsiDeviceIO: 1218: Device naa.600144f0d06fc10000004f5906070001 performance has deteriorated. I/O latency increased from average value of 5331 microseconds to 108040 microseconds.
2012-03-12T12:38:12.558Z cpu0:2056)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 108040 microseconds to 21493 microseconds.
2012-03-12T12:38:19.465Z cpu3:2051)WARNING: ScsiDeviceIO: 1218: Device naa.600144f0d06fc10000004f5906070001 performance has deteriorated. I/O latency increased from average value of 5358 microseconds to 116741 microseconds.
2012-03-12T12:38:20.017Z cpu3:2051)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 116741 microseconds to 22760 microseconds.
2012-03-12T12:38:22.383Z cpu5:2053)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 22760 microseconds to 10718 microseconds.
2012-03-12T12:39:04.566Z cpu6:2054)WARNING: ScsiDeviceIO: 1218: Device naa.600144f0d06fc10000004f5906070001 performance has deteriorated. I/O latency increased from average value of 5499 microseconds to 112502 microseconds.
2012-03-12T12:39:06.570Z cpu0:2056)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 112502 microseconds to 22074 microseconds.
2012-03-12T12:39:21.699Z cpu1:2049)ScsiDeviceIO: 1198: Device naa.600144f0d06fc10000004f5906070001 performance has improved. I/O latency reduced from 22074 microseconds to 10869 microseconds.

Those logs show the storage path is overloaded (to be expected with 64 outstanding IOs). However, after dropping back down to the QD of 16 and 32 that I had previously tested at (and where I expect the system to run, i.e. 16), I am now getting ~10k IOPs and 80+ MB/s. The above warnings do not appear during the last two tests. Weird, but it's been stable like that for the last 15 minutes.
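
Those vmkernel messages can be correlated live with esxtop on the host: the disk-device view splits latency into time spent at the target versus time spent queued in the VMkernel, which makes it clearer whether the slowdown is on the Nexenta side or in the ESXi queue:

# Run on the ESXi host while the benchmark is active:
#   esxtop          start the interactive monitor
#   u               switch to the disk-device view
# Watch DAVG/cmd (latency at the target), KAVG/cmd (time queued in the
# VMkernel) and QUED (commands currently waiting in the queue).
esxtop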
