Questions about NexentaStor virtualised under ESXi

Added by Matt K about 1 year ago

As per subject:

1) Any update on where VMXNET3 support for NexentaStor is up to, and/or an ETA?

2) There were some known problems trying to use multiple vCPUs with NexentaStor v3.0.x - IIRC excessive guest CPU usage, and NMS lockups. Has anyone tried multiple vCPUs under v3.1.1? Did it work fine and were the issues resolved?

Thanks, Matt


Replies

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

I am also very interested in this!

Some testing showed the excessive CPU usage is still at least occasionally an issue, but it is NOT an issue under NexentaCore 3.x

RE: Questions about NexentaStor virtualised under ESXi - Added by Jason Litka about 1 year ago

  1. Still doesn't work and there's been no official word that I've seen. I believe it does work in OpenIndiana though so once NS migrates to Illumos for 4.0 it should start working. In the interim, do what I do. Add a dual-port 10Gbe adapter to the box with a cable connected between the ports. Assign one port to a vSwitch in vSphere and give the other one to the NS VM with VT-D.

  2. My VM has two vCPUs and it's fine as long as I'm NOT using the web admin. One thing I'm not thrilled with is the constant background CPU usage from all the automated tasks that prevent the CPU from entering a low power state. I'm considering migrating to OI + napp-it.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Matt Van Mater wrote:

Some testing showed the excessive CPU usage is still at least occasionally an issue, but it is NOT an issue under NexentaCore 3.x

Thanks for the feedback - interesting that it's not an issue under NexentaCore.

Did you notice how severe the excessive CPU usage was in NexentaStor? Interested as to whether it's worth upgrading a CPU-constrained single-vCPU NexentaStor VM to dual vCPU, or whether any potential benefit is still wiped out by the excessive vCPU usage problem? Also, did you notice whether the excessive CPU usage was linked to particular activities (IIRC under 3.0.x it occurred when using the web GUI)?

Jason Litka wrote:

  1. Still doesn't work and there's been no official word that I've seen. I believe it does work in OpenIndiana though so once NS migrates to Illumos for 4.0 it should start working.

Sigh, pretty pathetic after however many months (years?) of asking and waiting. Some sort of acknowledgement would be nice, even if it was just a "coming, but we don't have an ETA yet".

In the interim, do what I do. Add a dual-port 10Gbe adapter to the box with a cable connected between the ports. Assign one port to a vSwitch in vSphere and give the other one to the NS VM with VT-D.

Not a bad idea... I'd prefer not to incur the cost of purchasing (and extra CPU usage of running) a pNIC just for this though.

What type of 10Gbe adapter did you use? How much did it cost?

  2. My VM has two vCPUs and it's fine as long as I'm NOT using the web admin.

Ahh, so the problem is linked to using the web GUI? And there are real CPU capacity benefits of running two vCPUs during times when I'm not using the web GUI (as per my question to Matt Van Mater, above)?

Finally, thanks for the responses guys.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

In my case, the excessive CPU usage occurred at all times, even if I did not have the webgui open. I think I only ran using a single CPU for my testing and it consumed 100% of that CPU, so I'm not sure if an SMP configuration would have helped.

Regarding the other part, I agree that it is too bad they haven't responded regarding these performance issues. I understand Nexenta is a software company that sells hardware, and so the virtualization route isn't as important to them, but I know they use VMware in the lab and am surprised that they haven't taken the time to resolve a bug that must surely impact their development efforts.

Also regarding Illumos/4.0/Illumian software... any signs of it being released? I saw a message on December 7 saying the new Illumos/Illumian variant would be released in a month and here we are 6 weeks later with no news...

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

I have the server os done. I am just still working on the repo. Just having trouble finding a location to host.

If you send me an email, i can get you access to illumian. The repo will only be necessary for updates, which we don't have any yet :)

linda.kateley@nexenta.com

RE: Questions about NexentaStor virtualised under ESXi - Added by Jason Litka about 1 year ago

Matt K wrote:

Jason Litka wrote:

In the interim, do what I do. Add a dual-port 10Gbe adapter to the box with a cable connected between the ports. Assign one port to a vSwitch in vSphere and give the other one to the NS VM with VT-D.

Not a bad idea... I'd prefer not to incur the cost of purchasing (and extra CPU usage of running) a pNIC just for this though.

What type of 10Gbe adapter did you use? How much did it cost?

I used the Intel X520-DA2 with a 0.5m SFP+ direct-attach cable from CablesOnDemand.com. The card was about $600, the cable about $50. I had the card as a spare though (I use these in my vSphere boxes for HA & FT) so I didn't really need to pay for it.

  2. My VM has two vCPUs and it's fine as long as I'm NOT using the web admin.

Ahh, so the problem is linked to using the web GUI? And there are real CPU capacity benefits of running two vCPUs during times when I'm not using the web GUI (as per my question to Matt Van Mater, above)?

For me it is. With 2 vCPUs there is a constant 5-15% from all the background tasks that are running (10-30% with a single vCPU). When using the web admin for ANYTHING the usage spikes closer to 50% and certain screens will hit 100%. With SSL enabled it doesn't even reliably work. If I stay out of the admin though I just get that 5-15% background noise.

RE: Questions about NexentaStor virtualised under ESXi - Added by Jason Litka about 1 year ago

... but you can get those same cards for $350-400 on eBay.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

Thanks for the update Linda. I am happy to wait until it is ready for the community at large and would prefer to wait until you and your team think it is ready... Do you have an updated estimate on general availability?

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

I am hoping by end of week. The iso is ready, but the infrastructure isn't ready.

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

illumian is now available at illumian.org,

it is a minimal server install

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Jason Litka wrote:

  2. My VM has two vCPUs and it's fine as long as I'm NOT using the web admin.

Ahh, so the problem is linked to using the web GUI? And there are real CPU capacity benefits of running two vCPUs during times when I'm not using the web GUI (as per my question to Matt Van Mater, above)?

For me it is. With 2 vCPUs there is a constant 5-15% from all the background tasks that are running (10-30% with a single vCPU). When using the web admin for ANYTHING the usage spikes closer to 50% and certain screens will hit 100%. With SSL enabled it doesn't even reliably work. If I stay out of the admin though I just get that 5-15% background noise.

Just an update on my original question:

I tested out dual vCPUs again under NexentaStor v3.1.1 and the problems are still there. For me it has nothing to do with the web GUI, just excessive "idle" CPU usage. With a single vCPU my NexentaStor VM hovers around ~15% CPU usage (according to VMware Client performance stats) when "idle" - with dual vCPUs this jumps up to ~30% CPU usage across both vCPUs (so it's actually a 300% increase in vCPU usage in total).

Sigh.
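
As a rough check of the arithmetic behind that 300% figure, here is a minimal shell sketch. It assumes, as the post does, that the usage percentage shown by the VMware client is an average across all of the VM's vCPUs. The function name total_vcpu_load is made up for illustration.

```shell
#!/bin/sh
# total_vcpu_load PCT VCPUS
# Convert a VM-wide usage percentage (assumed to be averaged over VCPUS
# vCPUs) into "percent of a single vCPU" units so that configurations with
# different vCPU counts can be compared directly.
total_vcpu_load() {
    echo $(( $1 * $2 ))
}

single=$(total_vcpu_load 15 1)   # ~15% reported with one vCPU
dual=$(total_vcpu_load 30 2)     # ~30% reported across two vCPUs

# Relative increase in total vCPU time consumed while "idle".
increase=$(( (dual - single) * 100 / single ))
echo "single=${single} dual=${dual} increase=${increase}%"
# prints: single=15 dual=60 increase=300%
```

In other words, going from 15 to 60 single-vCPU percent is four times the original load, which is the same thing as a 300% increase.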

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

So I have been looking a lot at this problem.

The major assumption of this problem is that if there is this activity when idle, whatever it is doing will be additive under load..

e.g. if I see 15% CPU utilisation while idle and then add a workload that would typically cause 30% load, the total would be additive: 45% including this idle load.

The thing is that this problem doesn't do that.. I am going to add this as a bug, but as a problem, i don't think it is critical.

We suspect that what is probably happening is cpu reservations that dtrace gives to vmware. vmware might not know the difference between a reservation for cpu and a runnable thread, so it counts it as non-idle. When i looked at some details i saw the nmdtrace running a little hot. This primarily catches data for analytics.

RE: Questions about NexentaStor virtualised under ESXi - Added by Ashley Watson about 1 year ago

Just in terms of VMXNET3 versus E1000 - on all our Windows and Linux VMs we follow the VMware recommendations to use VMXNET3 where possible.
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1001805
http://www.vmware.com/files/pdf/techpaper/VMW-Tuning-Latency-Sensitive-Workloads.pdf

We have seen m0n0wall VMs that we sometimes use in development under vSphere5 for routing between vlans have high idle CPU specifically caused by the e1000 driver.

According to the VMware note, it says that VMXNET3 is supported on 32bit and 64bit versions of Sun Solaris 10 U4 and later - is that significantly different from what NexentaStor 3.1.2 is based on?

A working VMXNET3 (with VMware tool support) for a vNexenta instance is really important for VMware shops to get the maximum performance. We'd love to be able to implement things like jumbo frames to our vNexentaStor box to maximise performance.

When can we expect to see a full NexentaStor v4 community edition distro with full vmxnet3 support?

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Linda Kateley wrote:

So I have been looking a lot at this problem.

Glad to see it's getting some attention finally - thanks!

The major assumption of this problem is that if there is this activity when idle, whatever it is doing will be additive under load..

e.g. if I see 15% CPU utilisation while idle and then add a workload that would typically cause 30% load, the total would be additive: 45% including this idle load.

The thing is that this problem doesn't do that.. I am going to add this as a bug, but as a problem, i don't think it is critical.

We suspect that what is probably happening is cpu reservations that dtrace gives to vmware. vmware might not know the difference between a reservation for cpu and a runnable thread, so it counts it as non-idle. When i looked at some details i saw the nmdtrace running a little hot. This primarily catches data for analytics.

I understand what you mean, but in this case wouldn't you expect the vCPU "idle" load as reported by VMware to be exactly double with 2 x vCPU vs one? e.g. the "usage" parameter still 15%, but it's 15% across two vCPUs rather than one? What I'm seeing is a "usage" figure of 30% across two vCPUs compared to 15% for one - so it's actually quadruple the reported vCPU usage (double per vCPU, and double the number of vCPUs).

Ultimately I'll be very happy if you're correct, as I assume that means that there is no real "problem" running a NexentaStor VM with multiple vCPUs, but rather it's just misleading stats... but I'd like to know for sure :-)

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Ashley Watson wrote:

We have seen m0n0wall VMs that we sometimes use in development under vSphere5 for routing between vlans have high idle CPU specifically caused by the e1000 driver.

Just as a comparison point, we use Vyatta for our virtualised routers and never saw high CPU usage at idle with E1000 vNICs.

Switching from E1000 to VMXNET3 (when support for it was introduced) did make a massive difference to CPU usage when under high traffic loads though.

According to the VMware note, it says that VMXNET3 is supported on 32bit and 64bit versions of Sun Solaris 10 U4 and later - is that significantly different from what NexentaStor 3.1.2 is based on?

It's been a long time since I looked into this (VMXNET3 support in NexentaStor) in detail, but IIRC:

  • the parts of the current version of NexentaStor that are relevant to hardware / drivers are based on the last version of OpenSolaris that was released before Oracle bought Sun and screwed up OpenSolaris
  • additionally, Nexenta made quite a lot of their own changes (extra features, and backporting fixes from bleeding-edge development builds of OpenSolaris)
  • the VMXNET3 drivers worked fine in OpenSolaris
  • but do not in NexentaStor... no idea why

Take that with a grain of salt though - as I said, it's been a long time.

A working VMXNET3 (with VMware tool support) for a vNexenta instance is really important for VMware shops to get the maximum performance. We'd love to be able to implement things like jumbo frames to our vNexentaStor box to maximise performance.

+1000

When can we expect to see a full NexentaStor v4 community edition distro with full vmxnet3 support?

IIRC Linda has said that NexentaStor CE v4 is due out in March 2012, but I haven't read anything official about whether it will properly support VMXNET3 or not.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

Based on Linda's feedback I did some more testing, and I agree with her assessment, but only partially.

My testing was done on an idle VM (see my post for specs)... VMware typically reported 17-21% of a 4 vCPU box utilized (much more if the WebUI is open) and Nexenta's built-in prstat showed under 2% utilized.

I artificially generated load two times:

  1. basic dd load ---- dd if=/dev/zero of=/volumes/tank/bigfileX bs=64k
  2. heavy dd load --- by having 10 concurrent open ssh sessions and running 5 instances of dd if=/dev/zero of=/volumes/tank/bigfileX bs=64k and also 5 instances of dd if=/dev/urandom of=/dev/null
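
The heavy load above could be scripted rather than typed into 10 separate ssh sessions. This is only a sketch under stated assumptions: TARGET_DIR, COUNT and start_heavy_load are illustrative names that do not appear in the post, the count= limit is added so the run terminates (the original dd commands ran unbounded), and bs=64k on the /dev/urandom reader is an assumption since the post gave no block size for it.

```shell
#!/bin/sh
# Illustrative knobs (not from the original post).
TARGET_DIR=${TARGET_DIR:-/tmp}   # the post wrote to /volumes/tank
COUNT=${COUNT:-16}               # 64k blocks per dd, so the run ends

# start_heavy_load N
# Start N disk writers and N CPU-bound /dev/urandom readers in the
# background, wait for all of them, then report how many dd processes ran.
start_heavy_load() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        # Writer: same shape as the "basic dd load", one file per instance.
        dd if=/dev/zero of="$TARGET_DIR/bigfile$i" bs=64k count="$COUNT" 2>/dev/null &
        # Reader: generates random data and throws it away.
        dd if=/dev/urandom of=/dev/null bs=64k count="$COUNT" 2>/dev/null &
        i=$((i + 1))
    done
    wait
    echo "finished $((n * 2)) dd processes"
}
```

Something like `TARGET_DIR=/volumes/tank COUNT=100000 start_heavy_load 5` would approximate the 5 + 5 mix described above, while the VMware client and prstat are watched from another session.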

When I ran the basic dd load, vmware's cpu utilization shot up to about 85%, and nexenta prstat showed about 20%. This tells me that the cpu usage stats/growth reported by nexenta and by vmware are not correlated or linear. Trying to mimic Linda's terms i might have expected vmware to report base idle load (20%) + new vm reported basic dd load (20%) = vmware should have been about 40% utilized and not the 85% that i saw.

When i ran the heavy dd load, VMware's utilization quickly reached 100% while nexenta's prstat still only reported about 50%. Adding more concurrent load generation somewhat validated Linda's statement because even though vmware's performance metrics showed 100%, nexenta still had room for more CPU cycles and was able to use them (according to prstat)

Here is where I disagree with Linda: While the nexenta vm does not seem to have an artificial performance cap (like we might have feared), the fact is that VMWare's resources scheduler THINKS those cpu cycles are in use. This has a few ramifications:

  1. CPUs are in fact working hard, effectively executing NOOPS and are consuming more power and generating more heat. I assure you my system is working hard and is generating that heat and increasing fan speed to compensate to cool off the CPUs
  2. If the VMWare host becomes more heavily taxed, VMWare's resource sharing model will cause more context switches between VMs to try and share resources fairly (in this case due to a faulty metric gathering system). This WILL cause thrashing and resource contention eventually.

Regarding Ashley's comment, I agree that a VMXNET3 driver would be very beneficial to increase performance and lower CPU. I have also wondered how/why it was broken?

My hope is that these two features / bugs can be resolved prior to the upcoming v4 release. We all hate it, but the reality is that perception is king and few people would have the patience to go to such lengths to discover these caveats. Without these fixes in place, customers evaluating Nexenta will take the high system utilization at face value and will quickly move on to other competing products that don't have these metric reporting problems.

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

I asked engineering to comment on this thread

lk

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

Thanks Linda!

One more observation to share:

I ran another test variant involving the command dd if=/dev/urandom of=/dev/null, which is an easy way to highlight the CPU's performance with minimal bottlenecks.

My goal was to validate that even though VMWare claimed the VM was at 100% CPU utilization, I did in fact have more performance/capacity available in the VM that claims less CPU utilization. I started with 1 command running in one shell/session for several minutes then slowly added new sessions with the identical dd command until I reached a total of 4 concurrent dd processes. This method enabled me to have one dd process per vCPU, and allowed me to easily push the system to 100% CPU utilization.

I was able to confirm that the MB/s write average on the first process (which ran for about 20 minutes) was almost exactly the same as the MB/s write average on the last vCPU/process. To me, that confirmed that the nexenta VM's utilization statistics are accurate and vmware's are "wrong".

.... however my earlier concerns about power/cooling and resource contention still hold true.

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

I have filed the cpu run up as a bug.

thanks

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

Our bug numbers are not exposed externally but for contract customers the bugid is 8267.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt Van Mater about 1 year ago

Hi Linda,

Am I correct that the bugid above is for the CPU utilization issue?

Do you have a similar bugid or comment that you can share with us regarding the VMXNET3 concern Matt K originally mentioned?

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Matt Van Mater wrote:

When i ran the heavy dd load, VMware's utilization quickly reached 100% while nexenta's prstat still only reported about 50%. Adding more concurrent load generation somewhat validated Linda's statement because even though vmware's performance metrics showed 100%, nexenta still had room for more CPU cycles and was able to use them (according to prstat)

Thank-you for taking the time to run that testing Matt - very interesting, and a relief to know that the picture VMware's performance graphs paint is not accurate.

Here is where I disagree with Linda: While the nexenta vm does not seem to have an artificial performance cap (like we might have feared), the fact is that VMWare's resources scheduler THINKS those cpu cycles are in use. This has a few ramifications:

  1. CPUs are in fact working hard, effectively executing NOOPS and are consuming more power and generating more heat. I assure you my system is working hard and is generating that heat and increasing fan speed to compensate to cool off the CPUs

To the best of my understanding, if all the CPUs are doing is executing extra NOOPs then I don't think this should cause any additional heat - or at least not much? Although maybe these might prevent the CPU from going into a low power state?

  2. If the VMWare host becomes more heavily taxed, VMWare's resource sharing model will cause more context switches between VMs to try and share resources fairly (in this case due to a faulty metric gathering system). This WILL cause thrashing and resource contention eventually.

I would also add:

  3. Related to #2, even if the VMware host is not that heavily taxed, this will cause some performance degradation for other VMs on the same host due to extra context switches / other VMs sometimes having to wait longer for an execution "slot" on a CPU (CPU Ready Time in VMware-speak). AFAIK this problem would begin with any additional, unnecessary host CPU load, but would progressively (exponentially past a point?) get worse the more heavily the host's CPUs are taxed?

My hope is that these two features / bugs can be resolved prior to the upcoming v4 release. We all hate it, but the reality is that perception is king and few people would have the patience to go to such lengths to discover these caveats. Without these fixes in place, customers evaluating Nexenta will take the high system utilization at face value and will quickly move on to other competing products that don't have these metric reporting problems.

+1000!

These issues should be fixed simply for the sake of users, but if that's not enough, Nexenta should fix them for their own sake.

As you said, it's a perception thing: I had been considering deploying NexentaStor at work for second-tier storage, but issues like these have so far prevented me from doing so. It's not just the actual issues themselves, but also the fact that I find such issues hanging around for so long without even a real acknowledgment (until now - thanks Linda) quite worrying.

Now Nexenta probably wouldn't care very much about my specific case as we'd represent a small, single order, but if these issues have affected my purchasing decision then I'm sure they have affected many, many others too!

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Matt Van Mater wrote:

My testing was done on an idle VM (see my post for specs)... VMware typically reported 17-21% of a 4 vCPU box utilized (much more if the WebUI is open) and Nexenta's built-in prstat showed under 2% utilized.

Forgot to ask in my previous post: what is the prstat command to show overall CPU usage? I've been able to get it to show usage for individual processes, but can't figure out how to show overall.

As an aside, I've got another example to help demonstrate your and Linda's conclusions going on right now. My NexentaStor VM is currently running a scrub on my main zpool - VMware is reporting 100% CPU usage whereas prstat (just manually adding up the CPU usage of individual processes) is somewhere around 55%.
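
The manual adding-up described above can be approximated with a small filter. This is a sketch, not a built-in prstat feature: sum_prstat_cpu is a made-up helper name, it assumes prstat's default column layout where the ninth field is CPU (for example "1.5%"), and the sample lines below are invented for illustration rather than real measurements.

```shell
#!/bin/sh
# sum_prstat_cpu
# Read prstat-style output on stdin and add up the CPU column.
# Only lines that start with a numeric PID and whose 9th field ends in "%"
# are counted, so the header and the "Total:" summary line are skipped.
sum_prstat_cpu() {
    awk '$1 ~ /^[0-9]+$/ && $9 ~ /%$/ { sub(/%/, "", $9); total += $9 }
         END { printf "%.1f\n", total }'
}

# Invented sample in the default prstat layout (not real data).
sum_prstat_cpu <<'EOF'
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
   812 root       45M   30M sleep   59    0   0:10:02 1.5% nms/1
   644 root       12M 8192K sleep   59    0   0:03:11 0.5% nmdtrace/1
Total: 2 processes, 2 lwps, load averages: 0.10, 0.12, 0.11
EOF
# prints: 2.0
```

On the NexentaStor box this might be fed from a single prstat sample, for instance `prstat -n 1000 1 1 | sum_prstat_cpu`, assuming the Solaris-style `-n` (number of lines) option and interval/count arguments behave there as documented for prstat.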

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley about 1 year ago

I have filed a bug on this..It seems pretty repeatable. I also got this test included in our test suite

thanks

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K about 1 year ago

Linda Kateley wrote:

I have filed a bug on this..It seems pretty repeatable. I also got this test included in our test suite

Thanks Linda - it's nice to see this issue getting some attention after so long!

RE: Questions about NexentaStor virtualised under ESXi - Added by Joseph Kotran about 1 year ago

bump

I'm a new user and I'm experiencing this too. My NexentaStor VM seems to be using 3-7% of CPU yet VMware vSphere client shows 20-25%. I'm able to work around this by assigning only one CPU core to the VM and disabling the web GUI.

RE: Questions about NexentaStor virtualised under ESXi - Added by Jan Schotsmans about 1 year ago

Just killing the webgui is all that is needed. If you want to do some performance tests with compression and/or dedup, you will need more CPU power than 1 core.

RE: Questions about NexentaStor virtualised under ESXi - Added by James H 12 months ago

Apologies for the novice question, but could you explain how to disable/kill the webgui. I work exclusively via the command line so would like to use this workaround to correct the VMWare CPU reporting if possible.

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley 12 months ago

The webgui is a service. You can run svcadm disable nmv as root.
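
A minimal sketch of wrapping that command so the web GUI can be switched off and back on when needed. The service name nmv comes from the post above; the run and webgui helper names and the DRY_RUN switch are illustrative additions, and DRY_RUN defaults to 1 here so the script only prints what it would do.

```shell
#!/bin/sh
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# webgui on|off: enable or disable the NMV web GUI service (needs root).
webgui() {
    case "$1" in
        off) run svcadm disable nmv ;;
        on)  run svcadm enable nmv ;;
        *)   echo "usage: webgui on|off" >&2; return 1 ;;
    esac
}
```

With `DRY_RUN=0`, `webgui off` would run the same svcadm disable nmv command given above, and `webgui on` would bring the GUI back when it is needed.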

RE: Questions about NexentaStor virtualised under ESXi - Added by James H 12 months ago

Thanks for the instructions. Just to confirm that running with 2 or 4 vCPUs, ESX reports around 50% CPU utilisation at idle even with NMV disabled. Changing to 1 vCPU dropped reported idle utilisation to 13% (didn't seem to drop further with NMV disabled). Unfortunately 1 vCPU won't be sufficient so I'll just have to wait for that bug fix.

RE: Questions about NexentaStor virtualised under ESXi - Added by Linda Kateley 12 months ago

That is currently listed as a bug. If you look at the CPU utilization inside NexentaStor, you will see that it is usually about 1-5%.

RE: Questions about NexentaStor virtualised under ESXi - Added by Jan Schotsmans 12 months ago

It's nmdtrace you have to kill (which also kills the web interface), not NMV.

As for the bug, it doesn't matter what Nexentastor reports, in a VM, the CPU utilization is high, because nmdtrace is causing crazy amounts of interrupts in the kernel, which does eat resources and slows nexentastor down.

If you disable the nmdtrace server, esxi will report normal CPU usage, as does nexentastor and the performance is very similar to running on a dedicated system.

RE: Questions about NexentaStor virtualised under ESXi - Added by Matt K 12 months ago

Does it cause any issues for NexentaStor if you disable nmdtrace (and the web interface), and simply re-enable both services if the web interface is ever required? What functions does nmdtrace perform?

RE: Questions about NexentaStor virtualised under ESXi - Added by Raul Rangel 11 months ago

Woot!! I'm glad this is finally getting looked at! I've been running for over two years with the CPU pegged at 50%. I ran without nmdtrace for a while but decided that without the webgui it was just like running an OpenIndiana server, so I bit the bullet and just let the CPU burn. I'm excited all these long-standing issues are finally getting solved. Thank you Nexenta Team, it's really appreciated!

Raul

RE: Questions about NexentaStor virtualised under ESXi - Added by Björn Ott 9 months ago

Hi,

take a close look at the posts here: http://nexentastor.org/boards/1/topics/4818 That fixed my problem with high CPU load. I don't need the I/O gauges etc., but the rest of the web GUI is mandatory for easy configuration. With the steps described there you can use the GUI without the problem of high "ghost" CPU load in a VMware environment.

Regards Björn