iSCSI / ESXi - strange read behaviour

Added by Krzysztof Starczewski over 2 years ago

Hi all!

This is my first post on this forum but I've been a reader for a long time. The time has come to write a few words. Of course we always complain - rarely does anyone say 'I'm happy!' - and I'll be no exception to this rule, unfortunately. But let's get to the point...

My config is quite simple: one Nexenta box (running CE 3.1.1, fully updated) and two ESXi 5.0 boxes, also fully updated (build 504890), managed via vCenter. The Nexenta box has 6 GbE NICs and 4 of them are dedicated to iSCSI traffic only. Every ESXi box has two dedicated GbE NICs for iSCSI. Everything is connected through two different switches (HP V1910 series) in a way that provides complete failover and load balancing. The ESXi iSCSI initiators are hardware type (Broadcom NICs with hardware iSCSI and TCP offload). The MPIO policy is set to Round Robin, modified to switch paths on every single IO operation. On the Nexenta box there is one dataset - two striped RAIDZ1 vdevs (4 disks each), so something similar to RAID50 using 8 disks. The disks are Seagate Constellation.ES 2TB NearSAS, connected to a Dell PERC H700 controller (no JBOD mode, so I configured them as 8 separate RAID0 devices with the read-ahead buffering policy). Every iSCSI path (there are 4 of them - 2 per ESXi box) is configured in a separate network segment. So far so good - everything is working fine: esxtop shows that traffic is divided equally between the two NICs. IOmeter shows throughput reaching ~200MB/s and ~45k (sic!) IOPS (of course for different load types). Generally speaking, benchmarks give excellent results.

But I found some scratches on the shiny surface... When copying large files (large to exclude the buffering effect of both Windows - all tests are made on a VM running Windows Server 2008 R2 - and Nexenta) between two virtual disks (each on a separate VMFS datastore), the transfer speed drops to 35-50 MB/s, when I expected Windows to show about 100MB/s (with RR load balancing it should reach 100MB/s read and 100MB/s write at the same time). When I started to dig deeper using the 'esxtop' and 'zpool iostat' commands, I noticed that if the file is completely cached in ARC then I can reach 200MB/s. If a read operation (from the zvol) is required then 'zpool iostat' shows something between 30-50MB/s read speed from the dataset. It is easy to see when the file is cached (reads from disk are near 0 and network transfer speed is high) and when it is not (the opposite). Write operations in 'zpool iostat' are rare and big (hundreds of megabytes for a single write operation) while reads are constant and slow.

At first I suspected low dataset speed, but tests with bonnie or a simple 'dd bs=1M count=32768 if=/dev/zero of=32G conv=fdatasync' show a write speed of ~750MB/s, and the same command for read gives me ~820MB/s. Moreover, when I copy a 32G file within the same dataset locally I reach ~220 MB/s average speed. So I excluded this reason. I also excluded networking issues, because if the transferred data is in the cache then I approach wire speed: 200MB/s (on 2x1GbE).
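
For reference, the local tests looked roughly like this - a minimal sketch, where the pool name 'tank' and the path under /volumes are placeholders for my actual dataset:

    # local sequential write test, flushing to disk at the end
    dd if=/dev/zero of=/volumes/tank/32G bs=1M count=32768 conv=fdatasync
    # read the same 32G file back
    dd if=/volumes/tank/32G of=/dev/null bs=1M
    # in a second terminal, watch how much of it actually hits the disks
    zpool iostat -v tank 5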

I suspect that there is something between COMSTAR and the networking layer on the Nexenta side. But I don't know how to check this, and I'm not sure this assumption is even right. Can anyone help solve the described problem? Any help will be appreciated because I really want to use Nexenta in production for ZFS power ;)


Replies

RE: iSCSI / ESXi - strange read behaviour - Added by Linda Kateley over 2 years ago

This really sounds like caching. I am guessing you have around 32GB of RAM? ZFS will cache everything in RAM that it can. You may want to consider adding some SSDs as read and write caches. SSDs give a nice big cache, so that read/write behaviour can be closer to in-RAM speeds.

Also, the other thing I have seen is the UNMAP bug in VMware, which drops performance to these numbers.

RE: iSCSI / ESXi - strange read behaviour - Added by Krzysztof Starczewski over 2 years ago

Hi Linda,

Thanks for the reply. The Nexenta box has 16G RAM, which will soon be expanded to 32G. I had already disabled the UNMAP feature in ESXi before starting any tests. I'm not sure an SSD cache will resolve this performance issue, because it is only present in read operations over iSCSI. The strangest thing is that locally (in the terminal) I can obtain really good results (~750 MB/s write, ~820MB/s read, ~315MB/s copy - the sample file was 32G, which means it is 2 times larger than the physical RAM in the Nexenta box). This is what I mean: why is read slow over iSCSI and fast locally? Will a dedicated SSD cache device help? I doubt it...

RE: iSCSI / ESXi - strange read behaviour - Added by Linda Kateley over 2 years ago

Yes, I see what you are saying... let me check around. Stay tuned.

RE: iSCSI / ESXi - strange read behaviour - Added by Steven Rodenburg over 2 years ago

"This is what I mean: why read is slow over iSCSI and fast locally? Does SSD dedicated cache device will help? I doubt..."

In iSCSI implementations with Nexenta, reads (as measured on the ESX side) are always a bit slower than writes (despite L2ARC devices). I see the same with, for example, EMC iSCSI arrays, by the way, so I don't think it is Nexenta/COMSTAR specific.

I always suspected it has something to do with the network.

  • Writes in ESX are done over multiple NICs and the Round Robin PSP. All involved NICs participate in a nice, equal manner. I've done extensive tuning there.
  • Reads, however, are spread over multiple links by LACP between Nexenta and the switch, and I always suspected some sub-optimal condition there which I was unable to optimize enough (by using different load-balancing algorithms) to even out the read and write performance - see the sketch after this list.
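
A rough sketch of what I mean by trying different algorithms, on the Nexenta/illumos side (the aggregation name aggr1 is just an example; the physical switch has its own, equivalent hash setting):

    # show the current link aggregation and its load-balancing policy
    dladm show-aggr
    # try a different hash policy (L2, L3, L4 or combinations thereof)
    dladm modify-aggr -P L3,L4 aggr1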

The same Nexenta and ESX systems have been connected by 4Gbps FibreChannel for a month or so (found two switches and a bunch of HBAs on eBay), and since then the read and write performance is the same and very constant (full wire speed in both directions).

Besides, I've had massive problems with iSCSI in vSphere 5 (SCSI I/O errors and corresponding path failures). I went back to 4.1 U2 and the problems completely disappeared. I have no explanation for this other than my suspicion that the ESXi 5 I/O stack was rewritten to an extent that it introduces problems (we saw the same happen in the ESX 3.5 -> 4.0 upgrade era, which later stabilized thanks to patches).

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Hi all,

I wanted to chime in that I am seeing similar constraints on Nexenta 3.1.2 community edition, fully patched. I will spare you the details of my setup (they are available in other posts), but in my case ESXi 4.1 U2 with 6 vNICs in round robin has the same issue Steven mentions with 5.0...

To restate it slightly: I can create my zvol or NFS share with a 64K block size, compression off and dedup off (64K to match VMware's sub-block allocation size and, in theory, align the two for best performance).

From the Nexenta host itself I can execute dd if=/dev/zero of=/volumes/tank/file bs=64k and get fantastic sequential write speed (300-600MB/s depending on my vdev configuration), sustained over a very long time while writing over 500 GB of data, outstripping any ARC/ZIL benefits.

If I create an Ubuntu VM in ESXi on the mapped iSCSI or NFS volume and execute the same dd command above, the performance is MUCH slower - like 110 MB/s max.

Similarly, I am having trouble using IOmeter in a Win 2k3 Server VM on ESXi to push sequential reads/writes anywhere close to what dd on the native Nexenta shows.

In none of these simple test cases does the Nexenta box's RAM or CPU get maxed out.

So what gives? What is the best method to find the bottleneck in this setup?

RE: iSCSI / ESXi - strange read behaviour - Added by Dan Swartzendruber over 2 years ago

Maybe I am confused, but I've been told that NFS at least (dunno about iSCSI) will not take advantage of multiple NICs per I/O stream, so if you have a GbE NIC, you're right at about that speed.

RE: iSCSI / ESXi - strange read behaviour - Added by Steven Rodenburg over 2 years ago

Dan Swartzendruber wrote:

Maybe I am confused, but I've been told that NFS at least (dunno about iSCSI) will not take advantage of multiple NICs per I/O stream, so if you have a GbE NIC, you're right at about that speed.

That is correct. A single NFS session is bound to its physical medium. That means an NFS datastore cannot be accessed faster than 1Gbps in such a network. NIC-bonding techniques like LACP or "EtherChannel" will not change that, because ESXi does not divide TCP sessions, chopping them up at the L2 level and then sending them over multiple links like some switch interlinks can (in expensive switches). To be able to do this, the driver would have to split traffic before sending it to the PHY of a NIC. ESXi simply does not do this, so TCP sessions are bound to a single L2 session, meaning bound to a single physical medium.

Example with 4 IP-storage NICs in ESXi: to utilize multiple NICs in an ESXi environment, one should have as many datastores as ESXi NICs (dedicated to IP storage) and mount each datastore over a different target IP.

This is done by using LACP on the storage side (Nexenta, EMC, NetApp, whatever) with its IP. Then add 3 aliases onto that virtual NIC. This makes 4 "target" IP addresses. They must be in different subnets, by the way (reason: the load-balancing algorithm in switches).

Create 4 datastores on the storage, coupling them to the 4 IPs (every storage array has its own way of doing this, so rtfm :-). Now, on the ESXi side, mount all 4 datastores via their respective target IPs. This forces 4 unique source-destination IP combinations, causing LACP to divide them over all links. It might need tweaking of the load-balancing algorithm to reach maximum speed (L2, L3, L4, combos, depending on switch capabilities).

This way, even though a single NFS datastore will never be able to surpass 1Gbps, all links - and with that, all datastores - can be used in parallel. A typical database server scenario would be having databases on different datastores and the temp-db and logs on others, all working in tandem. This way, performance is not bad at all. Just don't expect 4Gbps FC (or faster) performance.
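
For the record, a minimal sketch of what this looks like on both ends - the IP addresses, aggregation name and share paths are just examples, and on ESX 4.x the mount is done with esxcfg-nas instead:

    # Nexenta/illumos side: add alias IPs onto the LACP aggregation (one per extra datastore)
    ifconfig aggr1 addif 10.10.11.10 netmask 255.255.255.0 up
    ifconfig aggr1 addif 10.10.12.10 netmask 255.255.255.0 up
    ifconfig aggr1 addif 10.10.13.10 netmask 255.255.255.0 up

    # ESXi 5 side: mount each datastore against a different target IP
    esxcli storage nfs add --host=10.10.10.10 --share=/volumes/tank/ds1 --volume-name=ds1
    esxcli storage nfs add --host=10.10.11.10 --share=/volumes/tank/ds2 --volume-name=ds2
    esxcli storage nfs add --host=10.10.12.10 --share=/volumes/tank/ds3 --volume-name=ds3
    esxcli storage nfs add --host=10.10.13.10 --share=/volumes/tank/ds4 --volume-name=ds4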

iSCSI is totally different because of the VMkernel's round-robin capability to spread outbound traffic (going towards the storage) over all involved NICs. Inbound traffic (coming back from the storage) is spread by LACP (with the problems I described earlier).

The simplest way to avoid having to do all this is to use 10G Ethernet. Fat pipe -> simple config. Money-wise, however, 10G is a totally different ball game...

RE: iSCSI / ESXi - strange read behaviour - Added by Dan Swartzendruber over 2 years ago

Infiniband seems interesting too...

RE: iSCSI / ESXi - strange read behaviour - Added by Steven Rodenburg over 2 years ago

Yep. Look @ the Xsigo solutions ;-)

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Steven's reply describes the way I was able to test my configuration in an attempt to achieve maximum speeds. As he said at the end, the most common vendor answer is "buy 10 Gig NICs for everything, plus switches, and the NFS vs iSCSI debate becomes moot". That is just not realistic for smaller shops that already have plenty of 1 Gbit capacity and just want to use it... especially in this specific case, because it is for my home rig (but our work Nexenta box has the same limitations).

I have been performing extensive benchmarks and have found that using VMware's out-of-the-box round robin multipath configuration with Nexenta helps IOPS and throughput somewhat, but it is still NOWHERE near what the Nexenta box is capable of doing natively. Hence my problem described above... It is crazy that an 8-disk RAID 10 ZFS volume accessed over multipath iSCSI can achieve only 80 MB/s sequential writes when the local Nexenta can achieve 220MB/s writes. That tells me there is a major bottleneck somewhere in the networking or the iSCSI daemon.

In the case of iSCSI, my next step to test tonight is to decrease the number of IOs or KBs VMware sends down a path before switching from one path to the next. The VMware default is to send 1000 IOs down a single path before moving on to the next path. The guys at EMC did some testing on changing that metric and saw significant performance gains in some scenarios, and I am hopeful that I will see improvement. Read this link if you are interested: http://www.emc.com/collateral/hardware/white-papers/h8119-tuning-vmware-symmetrix-wp.pdf
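
In case anyone wants to try the same tweak, it looks roughly like this - the naa identifier is a placeholder for your LUN, this is ESXi 5 syntax, and on 4.1 the equivalent lives under "esxcli nmp roundrobin setconfig":

    # see which PSP and round robin settings each device currently uses
    esxcli storage nmp device list
    # switch the path-change trigger from the default 1000 IOs down to 1
    esxcli storage nmp psp roundrobin deviceconfig set --device naa.600144f0xxxxxxxx --type iops --iops 1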

Of course, in my case, if Nexenta supported the VMware VMXNET3 driver then my vNIC wouldn't be limited to 1 Gbit (it would be closer to 30 Gbit) and none of this multipathing setup would be necessary. I / we have asked Nexenta for some comment/commitment regarding VMXNET3 in future releases, but so far no luck. All I have seen on this topic is radio silence.

RE: iSCSI / ESXi - strange read behaviour - Added by Linda Kateley over 2 years ago

vmxnet3 is being actively worked on by the illumos community. We don't have dates and don't have it in the pipeline because it is a community effort. When it gets done there, we will get it put back into NexentaStor as quickly as possible.

RE: iSCSI / ESXi - strange read behaviour - Added by Steven Rodenburg over 2 years ago

Matt Van Mater wrote: "Of course, in my case, if Nexenta supported the VMware VMXNET3 driver then my vNIC wouldn't be limited to 1 Gbit (it would be closer to 30 Gbit) and none of this multipathing setup would be necessary."

There is no such thing as "being limited to 1Gbit". The whole 10Mbit, 100Mbit, 1Gbit, 10Gbit thing that the various adapters display is only there to satisfy the Windows driver model. In reality, it all happens in RAM and is therefore not bound to "classical" physical speeds. Packets flow as fast as the hardware and hypervisor can transport them (in other words: it's variable). Don't be fooled by such "reported speeds", because they do not exist in the virtual world.

Remember that kid in the "The Matrix" movie ?

"There is no spoon" :-)

RE: iSCSI / ESXi - strange read behaviour - Added by Dan Swartzendruber over 2 years ago

Correct, I routinely get 2-3 Gbit/s with a virtualized e1000.

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Linda -- thanks very much for your reply, that is good news. Does your development team believe this will be released in the pending 4.x release of Nexenta?

Regarding Steven's response, I think we generally agree but are saying the same thing in different ways...

In a virtual environment, it is true that the traditional IEEE network speed boundaries of 10/100/1000/10000/etc need not apply. But there is no disputing that VMware's specialized VMXNET3 driver will result in significantly higher performance at lower system overhead than the default e1000 driver.

I recognize that it is implementation specific, but I think you oversimplified it just a bit by saying the limits in a hypervisor "do not exist". Limits most certainly do exist, primarily 1) limits imposed by the quality of each OS's default e1000 network driver and 2) each OS's ability to support a specialized driver like VMXNET3 that brings performance closer to the theoretical maximums you allude to. In a practical sense, there ARE performance limitations due to how each OS implements its network driver (by default e1000). In some cases we get just 1 Gbit, in others we can eke out a little more (i.e. Dan's example), but you most definitely aren't getting 30 Gbit+.

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

...I just realized how off topic my last post was, my apologies.

More importantly, I still can't seem to find the bottleneck. Why can my Nexenta VM write 240MB/s and read 400 MB/s to a volume (real writes to hard disk, not cache), while ESXi 4.1 U2 - with 6 VMkernel ports in separate port groups/subnets accessing that same volume via iSCSI through 6 vNICs (in corresponding port groups/subnets) in round robin multipath, tuned and untuned (http://blog.dave.vc/2011/07/esx-iscsi-round-robin-mpio-multipath-io.html) - can only achieve about 1/3 of that throughput?

FYI, my IOPS are fairly good but not stellar.

RE: iSCSI / ESXi - strange read behaviour - Added by Karl Rossing over 2 years ago

Did you find the bottleneck?

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

No I did not.

I found that I got the best overall performance from having 2 vNICs and 2 VMKs in round robin with the policy set to 1 IO. Generally speaking, here is my tuning advice for the community:

  • Setting the VMware round robin policy to 1 IO was better than the default setting (1000 IOs) in every test/use case.
  • Configuring more vNIC/VMK pairs gave me better sequential read/write throughput but worse IOPS.
  • The best-case scenario for throughput was about 170MB/s read, 140 MB/s write, but IOPS suffered somewhat with that many vNIC/VMK pairs.
  • ESXi 5.0 had significantly better IOPS than ESXi 4.1U2 due to ESXi 5.0 using an 8k sub-block allocation size (ESXi 4.1 uses a 64k sub-block). It is critical that you set your Nexenta iSCSI LUNs and/or NFS shares to an 8k block size to get maximum IOPS performance (a rough example is below this list). If Nexenta was configured with a 64k block/stripe size then there was no significant difference in performance between 5.0 and 4.1u2.
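
The rough example I mean for the 8k block size, on the Nexenta side (pool and dataset names are just examples - a zvol's block size can only be set at creation time):

    # zvol-backed iSCSI LUN with an 8k block size
    zfs create -V 500G -o volblocksize=8k tank/esx-lun01
    # or, for an NFS-backed datastore, match the dataset record size instead
    zfs set recordsize=8k tank/nfs-datastore01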

Since this is a VMware setup and I expect some non-trivial random IO, the best combination was the one with the better IOPS measurement (2 vNIC/VMK). 7000+ IOPS in the IOmeter 'default' test with heavy random IO is very good, so I have resigned myself to getting less-than-stellar sequential throughput.

I think the cause of the bottleneck is latency introduced by Nexenta using the non-optimal e1000 driver and also the VMware VMK. I strongly believe that the use of VMXNET3 would improve that latency and overhead, and therefore make the performance numbers improve noticeably. It has been a while since I captured esxtop data while running my benchmark (to validate where the latency lies) but I will probably do that soon for posterity's sake.

edit: why does this forum remove the letter R when using the markup (e.g. when I added bullets to the items above)?

RE: iSCSI / ESXi - strange read behaviour - Added by Karl Rossing over 2 years ago

I'm running into a very similar problem, except with OpenSolaris b134 on a much smaller box, but with slog and L2ARC.

I currently have: 1 NIC used for COMSTAR; 3 VMware servers; each VM server has link aggregation for the VMs, and the VMKs use this as well.

I'm thinking of setting up the following: 2 NICs used for COMSTAR (I need to dig out some hardware that supports 6 NICs); 3 VMware servers, each with 2 NICs/VMKs for iSCSI configured for round robin; each VM server keeps link aggregation for the VMs, but the VMKs would not use the aggregate.

Have you tried disabling delayed ACK? http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1002598

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Interesting, I have not seen or tried that hint about disabling delayed ACK. I don't think it will apply in my situation because all of the iSCSI traffic between vNIC and VMK is on a dedicated vSwitch on the same physical host, so there should be zero congestion (which is the source of the problem in that KB article). However, it is still worth a shot since I've mostly run out of ideas.

Note: I am using an SSD for log and cache; that is how I am able to get such a consistently high IOPS measurement. I need to clean up my benchmark sheet a bit and post it to the forum for review.

RE: iSCSI / ESXi - strange read behaviour - Added by Karl Rossing over 2 years ago

Have you enabled jumbo frames?

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Not really...

During my testing I think I found a bug in Nexenta. I was able to configure jumbo frames (MTU 9000) on the first interface I use for iSCSI, but Nexenta would not permit me to change the other interfaces to MTU 9000. Linda: are you aware of this bug?

My understanding is that jumbo frames will generally result in lower processing overhead but not necessarily/significantly better throughput (which is my issue). So I chose not to run one interface at MTU 9000 and the other(s) at 1500; it just seemed like a bad idea to have them configured differently, but it might be worth a shot.

RE: iSCSI / ESXi - strange read behaviour - Added by Linda Kateley over 2 years ago

Is this a Broadcom NIC? If not, what NIC is it? We do have a known problem with the Broadcom. All of our software is built on open source and the Broadcom driver is closed.

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

No. The NICs in question are VMware 5.0 vNICs using the Intel e1000 driver. I had the same problem with VMware 4.1u2.

RE: iSCSI / ESXi - strange read behaviour - Added by Dan Swartzendruber over 2 years ago

The VMware guys claim that jumbo frames are a total waste of time unless you are using 10GbE or something much faster than 1GbE.

RE: iSCSI / ESXi - strange read behaviour - Added by Linda Kateley over 2 years ago

So I talked to the guys on the backline; we haven't seen these problems in a while, but we did see them for a while :) So if you haven't already, try running an upgrade.

As far as jumbo frames go, it's important to align with the I/O size. A lot of the Windows VMs read and write at 4k, so most of a 9k packet is wasted.

RE: iSCSI / ESXi - strange read behaviour - Added by Matt Van Mater over 2 years ago

Linda: I am running the most recent software available and seeing that MTU problem: NMS 3.1.2-8147 (r9697), NMC 3.1.2-8147 (r9697), NMV 3.1.2-8147 (r9697), OS 3.1.2.

Regarding jumbo frames, VMware experts generally say to use them only if you already use them elsewhere in your environment. Some usage patterns benefit from them, some don't... there is no universal answer. Which is why they say to spend your time on them only if you already have jumbo frames in use elsewhere and are willing to tune the system to your usage pattern.

In the specific case of jumbo frames, multipath iSCSI configured as round robin with iops=1, VMware ESXi 5.0 and a datastore formatted as VMFS 5.x will use 8k sub-blocks. This means a single 8k sub-block read/write request should fit within a single 9000-byte jumbo frame, whereas a 1500-byte MTU will require 6 frames (and, with iops=1, 6 path switches) to carry that same request. This applies to ANY guest OS running on your hypervisor.

So, in theory, enabling jumbo frames on an ESXi host with iSCSI round robin at iops=1 should result in less thrashing back and forth between the vNICs. I'm not sure how much overhead/latency is incurred by going back and forth between the iSCSI paths, but it might result in better throughput due to the difference in latency. I will try it out and let everyone know.
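
For anyone following along, the ESXi 5 side of enabling jumbo frames looks roughly like this - vSwitch, vmk name and target IP are placeholders, and the physical NICs and switch ports must allow 9000-byte frames as well:

    # raise the MTU on the iSCSI vSwitch and on each iSCSI VMkernel port
    esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000
    esxcli network ip interface set --interface-name=vmk1 --mtu=9000
    # verify end to end with a large, non-fragmenting ping to the target
    vmkping -d -s 8972 10.10.10.10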

RE: iSCSI / ESXi - strange read behaviour - Added by pos _ei_don over 2 years ago

Try something like esxcli storage nmp psp roundrobin deviceconfig set -d DEVICE --type "bytes" --bytes 8960 instead of iops=1. I gain about 10% in performance by filling the jumbo frames up before changing the path to the storage. Play with the byte size! When set smaller, I get more IOPS!

RE: iSCSI / ESXi - strange read behaviour - Added by pos _ei_don over 2 years ago

Did you try this? On Linux you get vmxnet3 support by installing the tools: http://www.tumfatig.net/20120315/install-vmware-tools-for-nexenta-on-esxi/

RE: iSCSI / ESXi - strange read behaviour - Added by Jeff Gibson over 2 years ago

I'm not sure if I'm having the exact same problem or not, but here is my info:

Standalone Nexenta box with an E5606, 72GB RAM (68GB min ARC size), 3x5 450GB 10k SAS in RAIDZ, 2x 120GB Intel 320 for L2ARC, 2x 20GB Intel 311 for LOG. On Nexenta the current stats are:

General ZFS ARC Information (updates every 10 seconds)
Property    Value
Min / Current / Max ARC Size    67.43 GB / 54.50 GB / 70.98 GB
Cache Hits / Misses 89.10% / 10.90%
Demand Data Cache Hits / Misses 60.56% / 30.01%
Demand Metadata Cache Hits / Misses 15.25% / 4.97%
Prefetch Data Cache Hits / Misses   21.22% / 52.86%
Prefetch Metadata Cache Hits / Misses   2.97% / 12.17%

Compression is on, dedup is off, sync is default, COMSTAR write cache is disabled.

I'm not sure why my cache hits aren't higher, since this system should be warm after being on for almost 2 weeks now and has 13GB of ARC free. I'm 90% certain our working set should fit entirely inside the ARC and about 99% certain it should fit in the L2ARC (total space used is ~300GB).

Store   raidz1 groups: 3, caches: 2, logs: 2, spares: 3, devices: 22    6.09 TB     299.00 GB   5.80 TB     4%  1.00x   ONLINE

The issue I see is that when one of the VMs (the biggest of our lot) starts one of its 15-minute processes, read latency goes through the roof while write latency stays at a normal level. The system is doing between 100-200 read IOPS at 100+ms latency while doing 30-60 write IOPS at 0-30ms (the writes are continuous and the latency spikes happen when the reads start). While this is going on there seems to be almost no activity on the physical disks; here is the iostat for the pool:

                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    5.0    1.6   76.1  115.2  0.0  0.0    0.0    1.4   0   0 c0t5001517A6BE7289Ed0
    7.0   13.7   19.8   58.2  0.0  0.1    0.0    4.9   0   3 c4t50000393B8037336d0
    7.0   13.7   19.7   58.2  0.0  0.1    0.0    4.9   0   3 c12t50000393B802E0DAd0
    4.9    1.6   74.9  114.0  0.0  0.0    0.0    1.5   0   0 c0t5001517A6BE6B48Bd0
    7.1   13.8   19.6   58.2  0.0  0.1    0.0    4.8   0   3 c5t50000393B802E092d0
    7.0   13.7   19.9   58.2  0.0  0.1    0.0    4.8   0   3 c13t50000393B8035F6Ad0
    7.1   13.7   19.7   58.2  0.0  0.1    0.0    4.8   0   3 c1t50000393B803ACD6d0
    7.0   13.7   19.7   58.2  0.0  0.1    0.0    4.9   0   3 c6t50000393B802DA2Ed0
    0.7    0.2    2.0    0.0  0.0  0.0    0.0    0.1   0   0 c2t50000393B8037386d0
    7.0   13.7   19.7   58.2  0.0  0.1    0.0    4.8   0   3 c7t50000393B802DA92d0
    7.0   13.7   19.8   58.2  0.0  0.1    0.0    4.9   0   3 c15t50000393B802DA32d0
    1.1   14.3    7.2   81.7  0.0  0.0    0.0    0.4   0   0 c0t5001517BB278B21Fd0
    0.7    0.2    2.0    0.0  0.0  0.0    0.0    0.1   0   0 c8t50000393B802E0A6d0
    6.9   13.7   19.7   58.2  0.0  0.1    0.0    4.8   0   3 c16t50000393B802DA52d0
    1.2   14.3    5.8   81.7  0.0  0.0    0.0    0.4   0   0 c0t5001517BB2770047d0
    7.0   13.7   19.7   58.2  0.0  0.1    0.0    4.9   0   3 c9t50000393B802DA06d0
    7.0   13.8   19.6   58.2  0.0  0.1    0.0    4.8   0   3 c17t50000393B802DA96d0
    7.0   13.7   19.8   58.2  0.0  0.1    0.0    4.8   0   3 c10t50000393B802D9E6d0
    7.1   13.7   19.6   58.2  0.0  0.1    0.0    4.8   0   3 c18t50000393B802D9F6d0
    7.1   13.7   19.7   58.2  0.0  0.1    0.0    4.8   0   3 c11t50000393B8037356d0
    7.0   13.6   19.8   58.2  0.0  0.1    0.0    5.0   0   3 c19t50000393B8035ABAd0
    0.7   65.4    2.0  831.4  0.0  0.0    0.0    0.1   0   0 c0t50015179595CD1B9d0
    0.7   65.4    2.0  831.2  0.0  0.0    0.0    0.1   0   0 c0t50015179595CCEE9d0
    0.6    0.2    1.9    0.0  0.0  0.0    0.0    0.2   0   0 c14t50000393B8233A3Ad0

I have 2 other ESXi 5.0p1 hosts connected to this that aren't showing latency spikes (they may just be idle and not trying to do anything) while this is going on (I vMotioned this guest to other machines and the 15-minute latency spike follows it to that host). The virtual machines still seem mostly responsive. I have bytes set to 8800 for round robin, 9k jumbo frames clean and tested with vmkping, and an 8k cluster size in Windows and block size for ZFS/COMSTAR. Is there a way to tell whether it is the disks/pool, the COMSTAR layer, the networking layer, or ESXi that is causing this latency?

I had also tried the esxcli iscsi adapter param set --adapter=vmhba38 --key=DelayedAck --value=false command that was suggested somewhere else, but it has had no effect on latency.

bck-esxi01-latency.png - ESXi Datastore Page (86.1 KB)

RE: iSCSI / ESXi - strange read behaviour - Added by Jeff Gibson over 2 years ago

I just realized that I didn't properly configure the block size on the pool dataset.

root@bck-nexenta:/export/home/admin# zfs list -o all Store
NAME   TYPE        CREATION                USED  AVAIL  REFER  RATIO  MOUNTED  ORIGIN  QUOTA  RESERV  VOLSIZE  VOLBLOCK  RECSIZE  MOUNTPOINT      SHARENFS   CHECKSUM  COMPRESS  ATIME  DEVICES  EXEC  SETUID  RDONLY  ZONED  SNAPDIR      ACLMODE     ACLINHERIT  CANMOUNT  XATTR  COPIES  VERSION  UTF8ONLY  NORMALIZATION         CASE  VSCAN  NBMAND  SHARESMB  REFQUOTA  REFRESERV  PRIMARYCACHE  SECONDARYCACHE  USEDSNAP  USEDDS  USEDCHILD  USEDREFRESERV  DEFER_DESTROY  USERREFS     LOGBIAS          DEDUP  MLSLABEL      SYNC  NMS:DESCRIPTION                      NMS:DEDUP-DIRTY
Store  filesystem  Wed Mar  7 10:07 2012  3.81T  1008G  73.5K  2.19x      yes  -        none    none        -         -     128K  /volumes/Store  off              on        on     on       on    on      on     off    off   hidden      discard     restricted        on     on       1        5       off           none    sensitive    off     off  off           none       none           all             all         0   73.5K      3.81T              0              -         -     latency            off  none      standard  Main pool for storage at BCK         off

Could this block size mismatch be causing the issue, where I'm having to read 32x just to get my 8k block, or does ZFS know better somehow?
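
For reference, the two properties being compared can be read directly - the zvol name below is a placeholder for whichever zvol actually backs the LUN:

    # record size of the pool dataset vs. block size of the zvol behind the LUN
    zfs get recordsize Store
    zfs get volblocksize Store/<zvol-name>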

RE: iSCSI / ESXi - strange read behaviour - Added by Rick van der Linde over 2 years ago

I've seen this behaviour too, on an OpenFiler setup. I'm now migrating this setup to NexentaStor.

In the OpenFiler setup I was able to set up MPIO on 3 x 1 Gb links. I managed to get approx 240 MB/s writes, which is approx 2 Gb/s. It also showed the 3 GbE networks equally loaded, as expected from round robin. Reads, however, maxed out at approx 115 MB/s. I've seen some higher numbers but found they were related to local caching. Investigation showed that, again, all 3 networks are equally loaded (at 1/3rd each). So it looks like round robin is working, but somewhere it is waiting for an acknowledgement or something equivalent. It seems the reads are serialized somewhere. Thinking it over and over again, it could be OpenFiler or the VMware MPIO engine that is causing this.

I tried testing this with multiple VMs benchmarking the setup. Writes could probably go faster, but due to my setup the writes are becoming hardware bound. Reads still max out at 110 MB/s. To solve this, 10 Gb could be used, but it is too expensive for me (private setup). Nevertheless, MPIO is partly working as expected: for writes it is satisfying, for reads it is not.

Testing this on Nexenta will probably show the same. Next week I will be able to test this setup on a NexentaStor CE based box. Reading this thread, I suspect I will find the same results as for the OpenFiler setup.

RE: iSCSI / ESXi - strange read behaviour - Added by Jeff Gibson over 2 years ago

Rick, how are you testing your reads? Are you seeing them use the paths equally?

If you run esxtop, press "u" for the LUNs, then press "P" and copy/paste the identifier, you can see the paths for that LUN to your datastore. You should see roughly the same CMDS/s going to each path if MPIO is working right (you can verify that reads and writes behave the same). When you run your pure read or pure write tests, what is your DAVG/cmd for each path?

RE: iSCSI / ESXi - strange read behaviour - Added by Rick van der Linde over 2 years ago

@Jeff

My OpenFiler setup has now been brought down to set up the NexentaStor box. The box is currently running; I still need to get the iSCSI LUNs attached and set up my ESX environment again. Once I have completed that I will reinvestigate the performance and also use your suggestions.

In my OpenFiler setup I saw the NIC performance on the OpenFiler box balance at 30-38 MB/s each for reads. Writes did approx 75-80 MB/s each. The VMs were CentOS 6.2 based, on ESXi 5.

RE: iSCSI / ESXi - strange read behaviour - Added by Rick van der Linde over 2 years ago

Well, I've tested now and found some things that might be interesting. In this case I used two physically separated GbE networks for MPIO. I installed ESX 5.0 U1 and configured iSCSI LUNs on ESXi. Furthermore, I configured round robin. A simple setup with two paths. Then I created a VM and assigned it storage located on the iSCSI LUN (VMFS).

Running benchmarks, I found that writes are perfectly load balanced and (for sequential writes) able to saturate two GbE connections - I've seen transfers up to 215 MB/s. Reads, however, are again perfectly balanced but never pass (for sequential reads) 115 MB/s. It seems reads are waiting for packets to come in before continuing with the next packets.

Then I installed another VM (on the same VMFS LUN) and tried to benchmark again with two simultaneous transfers. In this case I again found writes maxing out at 220 MB/s (consolidated) and reads again maxing out at 110 MB/s. So two IO streams do not give the increase I wanted to see.

And now it comes. I created another iSCSI LUN, created another VMFS filesystem on it and created a VM on that LUN. I ran benchmarks with 1 VM on the first LUN and a second VM on the second LUN. Again, writes max out at 220 MB/s (saturating 2 Gb). But now I get a consolidated read throughput of approx 190 MB/s. This is what I wanted to see (yeah!).

Will add another Gb connection and find out what happens.

Probably good advice (based on my findings) would be to use separate LUNs per VM (or group of VMs) to maximize the read capabilities of MPIO on ESX.
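
For completeness, carving out the extra zvol-backed LUNs on the storage side looks roughly like this with the underlying illumos COMSTAR tools (NexentaStor normally drives this through NMV/NMC; names, sizes and the GUID are placeholders):

    # one zvol per VM (or per group of VMs)
    zfs create -V 500G tank/vmgroup1-lun
    # register it as a COMSTAR logical unit and expose it to the initiators
    stmfadm create-lu /dev/zvol/rdsk/tank/vmgroup1-lun
    stmfadm add-view 600144F0XXXXXXXXXXXXXXXXXXXXXXXX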
