stripe a zil
Added by Kenneth Foster about 1 year ago
I have about 10 SAS2 SSDs. They get around 500MB/sec read/write. I was thinking of striping 2 of these together for my ZIL. Is that possible? I hear a lot about mirroring the ZIL but not about striping the ZIL.
The reason I'm asking is that at 500MB/s the ZIL becomes my bottleneck. I have 2 10GbE interfaces going into the box, and they are both running at about 5Gb/sec right now. I currently have the ZIL disabled, but I see moments of high latency, so I'm thinking a ZIL might even that out for me.
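A quick Python check of that bottleneck. It assumes decimal units and that every incoming write is synchronous and has to pass through the log device; neither assumption is stated in the post, and the function names are only illustrative.

```python
# Compare aggregate network ingest with the write rate of one log SSD.
# Assumes decimal units (1 Gb = 1e9 bits, 1 MB = 1e6 bytes) and that every
# incoming write is synchronous and goes through the log device.

def gbit_to_mbyte_per_s(gbit_per_s: float) -> float:
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbit_per_s * 1e9 / 8 / 1e6


def log_bottleneck_ratio(link_rates_gbit: list[float], log_mbyte_per_s: float) -> float:
    """Ingest rate divided by log-device bandwidth; above 1 means the log limits throughput."""
    ingest = sum(gbit_to_mbyte_per_s(rate) for rate in link_rates_gbit)
    return ingest / log_mbyte_per_s


# Two 10GbE links each carrying about 5 Gb/s, one SSD at about 500 MB/s.
ratio = log_bottleneck_ratio([5.0, 5.0], 500.0)
print(f"ingest / log bandwidth = {ratio:.2f}")
```

Under those assumptions the links deliver about 1250 MB/s against a 500 MB/s log device, so the log would be the limit by a factor of roughly 2.5.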
Same thing for the L2ARC: if I just add 4 SAS2 SSDs to the cache, does it in effect just stripe them? It looks like it does, but I want to make sure. I'm not that worried about a failure, so I don't care to hear about that. I'm going strictly for the most I/O possible.
Thanks.
Replies
RE: stripe a zil - Added by Jeff Gibson about 1 year ago
The current method of splitting the ZIL out to a separate device is effectively single-threaded: it has to wait for the first device to reply (sync) before it can write to the next drive, for consistency. The way around this is to use a battery- or capacitor-backed, flash-based device (I don't know of a spinning device that can be individually backed) that can safely ignore the sync request, such as the ZeusIOPs, DDRDrive x1, or Talos R (I'm going on second-hand info for all 3 of those for now).
I also don't believe you will ever see a speed faster than with the ZIL disabled, so if that's the bottleneck you're trying to get around, I'm guessing going faster would require different hardware. If you're seeing latency spikes with the ZIL disabled, I'm guessing that's the system pausing while it decides what it wants to do with that block. It would have to make the same decision (and incur the same latency) if you had separate log device(s).
For the L2ARC, it doesn't stripe reads and writes of individual blocks, but I believe it does some type of round robin when data is evicted from the ARC to the drives. This doesn't guarantee the data will be spread precisely among all your disks, but it does a pretty good job of spreading cached data across multiple devices. Effectively, you won't see a 1:1 speed increase for each disk you add to the L2ARC, but adding more will help with a large, highly random working set.
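A toy Python model of the round-robin placement described above may help show why cache devices don't scale 1:1. It is a sketch of the idea only, not the real L2ARC feed code; the function names and device names are hypothetical.

```python
# Toy model: blocks evicted from the ARC are handed to cache devices in
# round-robin order, so each cached block lives on exactly one device.
# Illustrates the forum description only; not the actual ZFS algorithm.
from collections import Counter
from itertools import cycle


def assign_round_robin(block_ids: list[int], devices: list[str]) -> dict[int, str]:
    """Map each evicted block to a cache device, cycling through the devices."""
    device_iter = cycle(devices)
    return {block: next(device_iter) for block in block_ids}


def read_load(placement: dict[int, str], reads: list[int]) -> Counter:
    """Count reads per device; a cached block is served by its one device only."""
    return Counter(placement[block] for block in reads if block in placement)


devices = ["ssd0", "ssd1", "ssd2", "ssd3"]  # hypothetical device names
placement = assign_round_robin(list(range(8)), devices)

# Skewed working set: block 0 is hot, so ssd0 serves four of the seven reads.
print(read_load(placement, [0, 0, 0, 0, 1, 2, 3]))
```

Blocks end up spread evenly across the devices, but any single hot block is still served by one SSD, which is why adding devices helps most with a large, random working set.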
RE: stripe a zil - Added by Linda Kateley about 1 year ago
Each write will be single-threaded. If you have multiple write threads you should be able to increase overall bandwidth, but the speed (latency) of a single-threaded write will not improve.
RE: stripe a zil - Added by Jeff Gibson about 1 year ago
Multiple write threads will still hit the bottleneck of writing to a single log device at a time (quite the gotcha). Here is a link to more info on the problem that Andrew helped me track down (and a great writeup on it too): http://www.nex7.com/node/12
RE: stripe a zil - Added by Linda Kateley about 1 year ago
Yes, I meant with adding striped devices to a ZIL. If I add additional devices I won't see any performance improvement on a single-threaded workload. I will see improvement if I have multiple workloads.
RE: stripe a zil - Added by Jeff Gibson about 1 year ago
That's the gotcha: you won't see an improvement (unless the code has been changed), because ZFS will only write to one log device (per pool) at a time until it has gotten a response saying the write is done (at which point it can effectively write again).
The only ways to improve performance would be to have extra pools, each with its own log, or to disable cache flushing on the log device (dangerous if it's not battery-backed) so that it doesn't wait for the result of the sync command.
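To make the serialization argument concrete, here is a toy throughput model in Python. It assumes, following the description in this thread rather than anything checked against the ZFS source, that a pool keeps one log write in flight at a time; the function names and the 50 µs write / 150 µs flush latencies are made up for illustration.

```python
# Toy model: a pool keeps one synchronous log write in flight and waits for
# it (write + cache flush) to be acknowledged before issuing the next.
# A reading of this thread, not measured ZFS behaviour; latencies are
# hypothetical placeholders.
import math


def log_time_us(n_writes: int, latency_us: float, log_devices: int,
                serialized: bool = True) -> float:
    """Time for one pool to commit n_writes log writes.

    serialized=True: devices take turns, so their count does not matter.
    serialized=False: the hoped-for striping, where every device works on
    its own share of the writes concurrently.
    """
    if serialized:
        return n_writes * latency_us
    return math.ceil(n_writes / log_devices) * latency_us


def mb_per_s(n_writes: int, block_bytes: int, time_us: float) -> float:
    """Throughput in MB/s (decimal); bytes per microsecond equals MB/s."""
    return n_writes * block_bytes / time_us


N, BLOCK = 1000, 8192
WRITE_US, FLUSH_US = 50.0, 150.0  # hypothetical per-write latencies

for devices in (1, 2, 4):
    t = log_time_us(N, WRITE_US + FLUSH_US, devices)
    # Same figure for 1, 2 and 4 devices under the serialized model.
    print(f"{devices} striped log device(s): {mb_per_s(N, BLOCK, t):.1f} MB/s")

# What striping would give if writes could overlap (the hoped-for case).
overlap = mb_per_s(N, BLOCK, log_time_us(N, WRITE_US + FLUSH_US, 4, serialized=False))
print(f"4 devices, overlapping writes: {overlap:.1f} MB/s")

# The two workarounds: a second pool with its own log, or no flush wait.
one_pool = mb_per_s(N, BLOCK, log_time_us(N, WRITE_US + FLUSH_US, 1))
print(f"two pools: {2 * one_pool:.1f} MB/s")
no_flush = mb_per_s(N, BLOCK, log_time_us(N, WRITE_US, 1))
print(f"no flush wait: {no_flush:.1f} MB/s")
```

Under these assumptions, striping log devices within one pool leaves throughput unchanged, while a second pool or skipping the flush wait raises it, which lines up with the two options in the post.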
RE: stripe a zil - Added by Kenneth Foster about 1 year ago
Well,
I can't have 2 pools. I need that I/O for the writes: 250 spindles in a RAID 10 (equivalent) config.
I guess I'll just have to get something like a FusionIO or similar for the ZIL. With that in mind, any recommendations for something that is low profile and works well? My machine has 128GB of RAM, so I'm not sure what size I should get. I've read conflicting numbers: some people say 8GB is all you need, others say get half of RAM. What's the consensus?
RE: stripe a zil - Added by Jeff Gibson about 1 year ago
I don't have any experience yet with "designed for" ZIL devices, but I expect to have some benchmarks using the DDRDrive x1 in a system in the next few weeks. I just reread your first post, which says you have SAS SSDs. Which ones? I've been told the TALOS R drives will behave decently for a lower-end ZIL (even using several of them), but I don't have first- or second-hand knowledge of how well they scale.
RE: stripe a zil - Added by Linda Kateley about 1 year ago
We have seen good latencies with ZeusRAM-type drives.
RE: stripe a zil - Added by Linda Kateley about 1 year ago
I think the FusionIO is still in the cert process.
RE: stripe a zil - Added by Kenneth Foster about 1 year ago
I have about a dozen engineering "samples" of this:
http://www.smartm.com/files/salesLiterature/storage/Optimus.pdf
I don't think they will be available to the public until May. They have another version that supports 25 complete overwrites/day coming out in June or July; this version only handles about 10 complete overwrites/day.
I have the 200GB versions.
I'm very happy with it, but I don't think I can say anything about it officially until it's released.
RE: stripe a zil - Added by Kenneth Foster about 1 year ago
I was looking at the Fusion dual I/O cards, but they are full profile and I need a low-profile card for the controller unit. I found an 80GB low-profile FusionIO card, but spec-wise it looks like my SSD has about the same performance.
Oh well. Guess it's time to redesign the controller chassis so I can use full-height cards.
The ZeusRAM looks like it has the same spec as the SSDs I'm using, and it's only 8GB, at least the one I saw. And it's a lot more expensive: the SAS SSD I'm using will run about $700 for 200GB (at least that's what they told me).
RE: stripe a zil - Added by Jeff Gibson about 1 year ago
That drive you linked indicates that it has a backup of some kind to make sure everything is written. Going with 80GB instead of 8GB for a log device makes very little difference: the most you will be writing to it (in optimum conditions) is 10 seconds' worth of data or half of the system RAM. If you're using 10Gb links and a system with 64GB (or more) of RAM, you would max out at 25GB of log device (2 × 10Gb/s ÷ 8 b/B × 10 s = 25GB), and only if you're able to push 19.2Gb/s from your hosts.
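The same sizing arithmetic as a small Python function. It takes the post's rule of thumb (about 10 seconds of writes, capped at half of RAM, read here as whichever is smaller) as an assumption rather than a verified ZFS limit; the function name is illustrative.

```python
# Upper bound on how much data sits in the log device, per the rule of
# thumb above: about 10 s of incoming writes, but never more than half of
# RAM (assumed: whichever is smaller). Decimal units throughout.

def max_log_bytes(link_gbit_per_s: float, seconds: float, ram_bytes: float) -> float:
    """Largest amount of data (bytes) the log device should ever need to hold."""
    ingest_bytes = link_gbit_per_s * 1e9 / 8 * seconds
    return min(ingest_bytes, ram_bytes / 2)


GB = 1e9
# Two 10 Gb links, a 10 s window, and the RAM sizes mentioned in the thread.
for ram_gb in (64, 128):
    size = max_log_bytes(2 * 10, 10, ram_gb * GB)
    print(f"{ram_gb} GB RAM -> at most {size / GB:.0f} GB of log")
```

For the 128GB machine in the question the cap still comes out at 25GB, because the 10-second window is the smaller of the two limits.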
These are all "ideal" cases, but I would love to see if someone built such a beast.
The specs of those drives make them very similar to the Talos 2R drives. From talking to Andrew when I was looking into this issue before: if you can find out the average response time for a single IO (at, I believe, 8k) over, say, 100 to 1000 attempts, you can plug it into this formula to see the number of IOPS you'll be able to achieve while writing: 1/(response time in seconds). Since the log device handles one sync write at a time, that reciprocal is the ceiling. To give a point of reference, maxing out dual 10Gb ports would take about 330k 8k IOPS (fewer if you've got something that uses larger blocks), which means a response time of about 3µs.
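Andrew's formula and its inverse as a Python sketch. The sample latencies are invented, and the target uses binary units (2 × 10 × 2^30 bits/s) because that is what reproduces the ~330k figure; that is an inference, not something stated in the post.

```python
# IOPS from measured single-IO response times (one write outstanding at a
# time), and the latency needed to keep up with a target link rate.
# Sample latencies are hypothetical; 8 KiB blocks as in the post.
from statistics import mean


def iops_from_latencies(latencies_s: list[float]) -> float:
    """Serial-write IOPS implied by single-IO response times in seconds."""
    return 1.0 / mean(latencies_s)


def required_latency_s(link_bits_per_s: float, block_bytes: int) -> tuple[float, float]:
    """Return (IOPS needed, maximum per-IO latency in seconds) for the link rate."""
    iops = link_bits_per_s / 8 / block_bytes
    return iops, 1.0 / iops


samples = [95e-6, 105e-6, 100e-6]  # hypothetical measurements, ~100 us each
print(f"{iops_from_latencies(samples):,.0f} IOPS")

# Two 10 Gb ports counted in binary units, 8 KiB IOs.
iops, latency = required_latency_s(2 * 10 * 2**30, 8192)
print(f"{iops:,.0f} IOPS needed, latency at most {latency * 1e6:.2f} us")
```

With decimal units (2 × 10^10 bits/s) the same calculation gives about 305k IOPS and roughly 3.3µs, so the reference figure depends on which convention you use.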
P.S. Someone correct me if my theoretical case or numbers are wrong.